Commit graph

77845 commits

Author SHA1 Message Date
Theodore Ts'o
de394a8665 ext4: update s_overhead_clusters in the superblock during an on-line resize
When doing an online resize, the on-disk superblock on-disk wasn't
updated.  This means that when the file system is unmounted and
remounted, and the on-disk overhead value is non-zero, this would
result in the results of statfs(2) to be incorrect.

This was partially fixed by Commits 10b01ee92d ("ext4: fix overhead
calculation to account for the reserved gdt blocks"), 85d825dbf4
("ext4: force overhead calculation if the s_overhead_cluster makes no
sense"), and eb7054212e ("ext4: update the cached overhead value in
the superblock").

However, since it was too expensive to forcibly recalculate the
overhead for bigalloc file systems at every mount, this didn't fix the
problem for bigalloc file systems.  This commit should address the
problem when resizing file systems with the bigalloc feature enabled.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20220629040026.112371-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:52:19 -04:00
Zhang Yi
5a57bca905 ext4: fix reading leftover inlined symlinks
Since commit 6493792d32 ("ext4: convert symlink external data block
mapping to bdev"), create new symlink with inline_data is not supported,
but it missing to handle the leftover inlined symlinks, which could
cause below error message and fail to read symlink.

 ls: cannot read symbolic link 'foo': Structure needs cleaning

 EXT4-fs error (device sda): ext4_map_blocks:605: inode #12: block
 2021161080: comm ls: lblock 0 mapped to illegal pblock 2021161080
 (length 1)

Fix this regression by adding ext4_read_inline_link(), which read the
inline data directly and convert it through a kmalloced buffer.

Fixes: 6493792d32 ("ext4: convert symlink external data block mapping to bdev")
Cc: stable@kernel.org
Reported-by: Torge Matthies <openglfreak@googlemail.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Torge Matthies <openglfreak@googlemail.com>
Link: https://lore.kernel.org/r/20220630090100.2769490-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-08-02 23:37:50 -04:00
Linus Torvalds
c2a24a7a03 This update includes the following changes:
API:
 
 - Make proc files report fips module name and version.
 
 Algorithms:
 
 - Move generic SHA1 code into lib/crypto.
 - Implement Chinese Remainder Theorem for RSA.
 - Remove blake2s.
 - Add XCTR with x86/arm64 acceleration.
 - Add POLYVAL with x86/arm64 acceleration.
 - Add HCTR2.
 - Add ARIA.
 
 Drivers:
 
 - Add support for new CCP/PSP device ID in ccp.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmLosAAACgkQxycdCkmx
 i6dvgxAAzcw0cKMuq3dbQamzeVu1bDW8rPb7yHnpXal3ao5ewa15+hFjsKhdh/s3
 cjM5Lu7Qx4lnqtsh2JVSU5o2SgEpptxXNfxAngcn46ld5EgV/G4DYNKuXsatMZ2A
 erCzXqG9dDxJmREat+5XgVfD1RFVsglmEA/Nv4Rvn+9O4O6PfwRa8GyUzeKC+byG
 qs/1JyiPqpyApgzCvlQFAdTF4PM7ruDtg3mnMy2EKAzqj4JUseXRi1i81vLVlfBL
 T40WESG/CnOwIF5MROhziAtkJMS4Y4v2VQ2++1p0gwG6pDCnq4w7u9cKPXYfNgZK
 fMVCxrNlxIH3W99VfVXbXwqDSN6qEZtQvhnliwj9aEbEltIoH+B02wNfS/BDsTec
 im+5NCnNQ6olMPyL0yHrMKisKd+DwTrEfYT5H2kFhcdcYZncQ9C6el57kimnJRzp
 4ymPRudCKm/8weWGTtmjFMi+PFP4LgvCoR+VMUd+gVe91F9ZMAO0K7b5z5FVDyDf
 wmsNBvsEnTdm/r7YceVzGwdKQaP9sE5wq8iD/yySD1PjlmzZos1CtCrqAIT/v2RK
 pQdZCIkT8qCB+Jm03eEd4pwjEDnbZdQmpKt4cTy0HWIeLJVG1sXPNpgwPCaBEV4U
 g0nctILtypChlSDmuGhTCyuElfMg6CXt4cgSZJTBikT+QcyWOm4=
 =rfWK
 -----END PGP SIGNATURE-----

Merge tag 'v5.20-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto updates from Herbert Xu:
"API:

   - Make proc files report fips module name and version

  Algorithms:

   - Move generic SHA1 code into lib/crypto

   - Implement Chinese Remainder Theorem for RSA

   - Remove blake2s

   - Add XCTR with x86/arm64 acceleration

   - Add POLYVAL with x86/arm64 acceleration

   - Add HCTR2

   - Add ARIA

  Drivers:

   - Add support for new CCP/PSP device ID in ccp"

* tag 'v5.20-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (89 commits)
  crypto: tcrypt - Remove the static variable initialisations to NULL
  crypto: arm64/poly1305 - fix a read out-of-bound
  crypto: hisilicon/zip - Use the bitmap API to allocate bitmaps
  crypto: hisilicon/sec - fix auth key size error
  crypto: ccree - Remove a useless dma_supported() call
  crypto: ccp - Add support for new CCP/PSP device ID
  crypto: inside-secure - Add missing MODULE_DEVICE_TABLE for of
  crypto: hisilicon/hpre - don't use GFP_KERNEL to alloc mem during softirq
  crypto: testmgr - some more fixes to RSA test vectors
  cyrpto: powerpc/aes - delete the rebundant word "block" in comments
  hwrng: via - Fix comment typo
  crypto: twofish - Fix comment typo
  crypto: rmd160 - fix Kconfig "its" grammar
  crypto: keembay-ocs-ecc - Drop if with an always false condition
  Documentation: qat: rewrite description
  Documentation: qat: Use code block for qat sysfs example
  crypto: lib - add module license to libsha1
  crypto: lib - make the sha1 library optional
  crypto: lib - move lib/sha1.c into lib/crypto/
  crypto: fips - make proc files report fips module name and version
  ...
2022-08-02 17:45:14 -07:00
Xiubo Li
c460f4e4bb ceph: remove useless check for the folio
The netfs_write_begin() won't set the folio if the return value
is non-zero.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:13 +02:00
Hu Weiwen
7cb9994754 ceph: don't truncate file in atomic_open
Clear O_TRUNC from the flags sent in the MDS create request.

`atomic_open' is called before permission check. We should not do any
modification to the file here. The caller will do the truncation
afterward.

Fixes: 124e68e740 ("ceph: file operations")
Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:13 +02:00
Xiubo Li
0c04a117d7 ceph: make f_bsize always equal to f_frsize
The f_frsize maybe changed in the quota size is less than the defualt
4MB.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:13 +02:00
Xiubo Li
e027ddb6d3 ceph: flush the dirty caps immediatelly when quota is approaching
When the quota is approaching we need to notify it to the MDS as
soon as possible, or the client could write to the directory more
than expected.

This will flush the dirty caps without delaying after each write,
though this couldn't prevent the real size of a directory exceed
the quota but could prevent it as soon as possible.

Link: https://tracker.ceph.com/issues/56180
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:13 +02:00
Xiubo Li
4849077604 ceph: don't get the inline data for new creating files
If the 'i_inline_version' is 1, that means the file is just new
created and there shouldn't have any inline data in it, we should
skip retrieving the inline data from MDS.

This also could help reduce possiblity of dead lock issue introduce
by the inline data and Fcr caps.

Gradually we will remove the inline feature from kclient after ceph's
scrub too have support to unline the inline data, currently this
could help reduce the teuthology test failures.

This is possiblly could also fix a bug that for some old clients if
they couldn't explictly uninline the inline data when writing, the
inline version will keep as 1 always. We may always reading non-exist
data from inline data.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Xiubo Li
0006164589 ceph: update the auth cap when the async create req is forwarded
For async create we will always try to choose the auth MDS of frag
the dentry belonged to of the parent directory to send the request
and ususally this works fine, but if the MDS migrated the directory
to another MDS before it could be handled the request will be
forwarded. And then the auth cap will be changed.

We need to update the auth cap in this case before the request is
forwarded.

Link: https://tracker.ceph.com/issues/55857
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Xiubo Li
e19feff963 ceph: make change_auth_cap_ses a global symbol
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Jeff Layton
020bc44a9f ceph: switch back to testing for NULL folio->private in ceph_dirty_folio
Willy requested that we change this back to warning on folio->private
being non-NULl. He's trying to kill off the PG_private flag, and so we'd
like to catch where it's non-NULL.

Add a VM_WARN_ON_FOLIO (since it doesn't exist yet) and change over to
using that instead of VM_BUG_ON_FOLIO along with testing the ->private
pointer.

[ xiubli: define VM_WARN_ON_FOLIO macro in case DEBUG_VM is disabled
  reported by kernel test robot <lkp@intel.com> ]

Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Jeff Layton
7467b04418 ceph: call netfs_subreq_terminated with was_async == false
"was_async" is a bit misleadingly named. It's supposed to indicate
whether it's safe to call blocking operations from the context you're
calling it from, but it sounds like it's asking whether this was done
via async operation. For ceph, this it's always called from kernel
thread context so it should be safe to set this to false.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Jeff Layton
e821450335 ceph: convert to generic_file_llseek
There's no reason we need to lock the inode for write in order to handle
an llseek. I suspect this should have been dropped in 2013 when we
stopped doing vmtruncate in llseek.

With that gone, ceph_llseek is functionally equivalent to
generic_file_llseek, so just call that after getting the size.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Jeff Layton
58dd438557 ceph: don't leak snap_rwsem in handle_cap_grant
When handle_cap_grant is called on an IMPORT op, then the snap_rwsem is
held and the function is expected to release it before returning. It
currently fails to do that in all cases which could lead to a deadlock.

Fixes: 6f05b30ea0 ("ceph: reset i_requested_max_size if file write is not wanted")
Link: https://tracker.ceph.com/issues/55857
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Luís Henriques
d93231a6bc ceph: prevent a client from exceeding the MDS maximum xattr size
The MDS tries to enforce a limit on the total key/values in extended
attributes.  However, this limit is enforced only if doing a synchronous
operation (MDS_OP_SETXATTR) -- if we're buffering the xattrs, the MDS
doesn't have a chance to enforce these limits.

This patch adds support for decoding the xattrs maximum size setting that is
distributed in the mdsmap.  Then, when setting an xattr, the kernel client
will revert to do a synchronous operation if that maximum size is exceeded.

While there, fix a dout() that would trigger a printk warning:

[   98.718078] ------------[ cut here ]------------
[   98.719012] precision 65536 too large
[   98.719039] WARNING: CPU: 1 PID: 3755 at lib/vsprintf.c:2703 vsnprintf+0x5e3/0x600
...

Link: https://tracker.ceph.com/issues/55725
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Xiubo Li
8266c4d7a7 ceph: choose auth MDS for getxattr with the Xs caps
And for the 'Xs' caps for getxattr we will also choose the auth MDS,
because the MDS side code is buggy due to setxattr won't notify the
replica MDSes when the values changed and the replica MDS will return
the old values. Though we will fix it in MDS code, but this still
makes sense for old ceph.

Link: https://tracker.ceph.com/issues/55331
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Xiubo Li
300e42a2e7 ceph: add session already open notify support
If the connection was accidently closed due to the socket issue or
something else the clients will try to open the opened sessions, the
MDSes will send the session open reply one more time if the clients
support the notify feature.

When the clients retry to open the sessions the s_seq will be 0 as
default, we need to update it anyway.

Link: https://tracker.ceph.com/issues/53911
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Xiubo Li
4868e537fa ceph: wait for the first reply of inflight async unlink
In async unlink case the kclient won't wait for the first reply
from MDS and just drop all the links and unhash the dentry and then
succeeds immediately.

For any new create/link/rename,etc requests followed by using the
same file names we must wait for the first reply of the inflight
unlink request, or the MDS possibly will fail these following
requests with -EEXIST if the inflight async unlink request was
delayed for some reasons.

And the worst case is that for the none async openc request it will
successfully open the file if the CDentry hasn't been unlinked yet,
but later the previous delayed async unlink request will remove the
CDenty. That means the just created file is possiblly deleted later
by accident.

We need to wait for the inflight async unlink requests to finish
when creating new files/directories by using the same file names.

Link: https://tracker.ceph.com/issues/55332
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Xiubo Li
4f48d5da81 fs/dcache: export d_same_name() helper
Compare dentry name with case-exact name, return true if names
are same, or false.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Xiubo Li
7c2e3d9194 ceph: remove useless CEPHFS_FEATURES_CLIENT_REQUIRED
This macro was added but never be used. And check the ceph code
there has another CEPHFS_FEATURES_MDS_REQUIRED but always be empty.

We should clean up all this related code, which make no sense but
introducing confusion.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Luís Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Luís Henriques
fea013e020 ceph: use correct index when encoding client supported features
Feature bits have to be encoded into the correct locations.  This hasn't
been an issue so far because the only hole in the feature bits was in bit
10 (CEPHFS_FEATURE_RECLAIM_CLIENT), which is located in the 2nd byte.  When
adding more bits that go beyond the this 2nd byte, the bug will show up.

[xiubli: remove incorrect comment for CEPHFS_FEATURES_CLIENT_SUPPORTED]

Fixes: 9ba1e22453 ("ceph: allocate the correct amount of extra bytes for the session features")
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:12 +02:00
Jeff Layton
637fa738b5 fscrypt: add fscrypt_context_for_new_inode
Most filesystems just call fscrypt_set_context on new inodes, which
usually causes a setxattr. That's a bit late for ceph, which can send
along a full set of attributes with the create request.

Doing so allows it to avoid race windows that where the new inode could
be seen by other clients without the crypto context attached. It also
avoids the separate round trip to the server.

Refactor the fscrypt code a bit to allow us to create a new crypto
context, attach it to the inode, and write it to the buffer, but without
calling set_context on it. ceph can later use this to marshal the
context into the attributes we send along with the create request.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:11 +02:00
Jeff Layton
d3e94fdc4e fscrypt: export fscrypt_fname_encrypt and fscrypt_fname_encrypted_size
For ceph, we want to use our own scheme for handling filenames that are
are longer than NAME_MAX after encryption and Base64 encoding. This
allows us to have a consistent view of the encrypted filenames for
clients that don't support fscrypt and clients that do but that don't
have the key.

Currently, fs/crypto only supports encrypting filenames using
fscrypt_setup_filename, but that also handles encoding nokey names. Ceph
can't use that because it handles nokey names in a different way.

Export fscrypt_fname_encrypt. Rename fscrypt_fname_encrypted_size to
__fscrypt_fname_encrypted_size and add a new wrapper called
fscrypt_fname_encrypted_size that takes an inode argument rather than a
pointer to a fscrypt_policy union.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:11 +02:00
Jeff Layton
18cc912b8a fs: change test in inode_insert5 for adding to the sb list
inode_insert5 currently looks at I_CREATING to decide whether to insert
the inode into the sb list. This test is a bit ambiguous, as I_CREATING
state is not directly related to that list.

This test is also problematic for some upcoming ceph changes to add
fscrypt support. We need to be able to allocate an inode using new_inode
and insert it into the hash later iff we end up using it, and doing that
now means that we double add it and corrupt the list.

What we really want to know in this test is whether the inode is already
in its superblock list, and then add it if it isn't. Have it test for
list_empty instead and ensure that we always initialize the list by
doing it in inode_init_once. It's only ever removed from the list with
list_del_init, so that should be sufficient.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03 00:54:11 +02:00
Linus Torvalds
569bede0cf fsverity update for 5.20
Just a small documentation update to mention the btrfs support.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYumAiBQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK/pjAQDJbkG6S1eEdhC3m6oHlSToiy2p0FDH
 +qr4fQndCO0l+QEAgo3ULXvbCKlLPOQHM2gVjnUR+UUHnjJ3p2F5aODsfQ4=
 =UMFK
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fsverity update from Eric Biggers:
 "Just a small documentation update to mention the btrfs support"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fs-verity: mention btrfs support
2022-08-02 15:24:36 -07:00
Linus Torvalds
d7b767b508 execve updates for v5.20-rc1
- Allow unsharing time namespace on vfork+exec (Andrei Vagin)
 
 - Replace usage of deprecated kmap APIs (Fabio M. De Francesco)
 
 - Fix spelling mistake (Zhang Jiaming)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmLoDyAWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJh0mEACL07hj3eT3rWg6ohZx9sCTcAjY
 /tG+zxLQ7xu717nM1a4j7CI5kdNNpYbsCqG71ikDDRrOCeEutu7M8zE1emctjtHv
 oh853D6BKhV2Hvsiuk1oM2ZHR1bmgiW1eFNAJcCLz6rE6wYu564R0wYJV0h418fH
 Rjk+Y989A7Srs9t/9GQSktjX3Q039/PG28avhA5q144/ZNycr5FnLFOf4RlmzEUz
 7E8TfGsftX8eRAfxW/dPiWuIKMuYPLqspca9pT3aFj3ze2qKnldjNV3c9M5ajL5Q
 q7KKWeWzunKyYHMaRzIxkHyhs396ZGKFN2PbcNYyml+NBItyc3fCHishMF7bW0Vb
 nyZbmYJslBloYmrSJYgqCfxyjUuhe0cMMk9iMzDVp6ROwtLgFFLwfwunM6RwRmnr
 dAmM8QGwSE3qYLhVnLEcRqpgdXzVd+S0TGhB5k5AyI3628/mLxhE66/eWq0X8QF5
 los5zku1GagMkylt6SOGb3TME4JZe6ZdZpU4fe/ilM22qw852xgbF3+6Zap6IBbD
 AdzXVCHyU/obORfIxx5KTF213m4KpkWBBi3N1/vVlxIAFAUy1WdXDM1o2RPMD7hw
 DeHe8sgfTZxLmSqfWLuX+3qC94IvrbDPFaRCIMj1QNK0ltM8I9oHRPcUFyZMaV0O
 xHN/5QtmgVDfKA3mTw==
 =82SS
 -----END PGP SIGNATURE-----

Merge tag 'execve-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve updates from Kees Cook:

 - Allow unsharing time namespace on vfork+exec (Andrei Vagin)

 - Replace usage of deprecated kmap APIs (Fabio M. De Francesco)

 - Fix spelling mistake (Zhang Jiaming)

* tag 'execve-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Call kmap_local_page() in copy_string_kernel()
  exec: Fix a spelling mistake
  selftests/timens: add a test for vfork+exit
  fs/exec: allow to unshare a time namespace on vfork+exec
2022-08-02 14:36:19 -07:00
Linus Torvalds
ddd1949f58 pstore updates for v5.20-rc1
- Migrate to modern acomp crypto interface (Ard Biesheuvel)
 
 - Use better return type for "rcnt" (Dan Carpenter)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmLoDWAWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJmSWD/9BKLN+f7kpYfGKk0q9nnceo1CU
 q9uFHIW6eqFcInHTGIzc68iig1SbtQePjhrXafLez8ZKeX/L3yy/G4R4+vKj2XDA
 Lkjd9Xrj1d1nhh/mu/SSkDlsIUXNkcxbpfQX7qUXPfNc7Wln7r+aWqHFT5cziDYf
 G58TOlmn4WsdGWyhadsBxcPE9UA7eGsoqya+PD/bKgDCqp6ArhzZd3bFK/OLkSxF
 xR2n26sS2Qfg7ApbTrBSyG4lJkuIxZILFF8a8CsmYj78xCGSrwPlOFRHW5oKHCdI
 JEQ3CR/I3HofP1ajQCti07XT/dDbwzQaHkF/sjZmhAh9OnPXClGxYlczT+Cw9ffS
 R6o/XWldz8KgG9m3K6j1NmkBdwF0zukznNRrEBaeAONAbquG36b8fcH9GTRWBRAS
 x57bTHOCws9RLvjCGoxdEOYzs2Sm/23lUuF7nv0wRaDmlH+6eG6LU26ZBJpC1ple
 xdvnJxUI6tU5ojSUbaqQ4jyel/H5NZTb0l32gAmh5nZy/2U7jlLyeGFg814Hz9Jy
 ga8IWfumYezYMZfY5zTnmzhxF5mcLB34c/B5Qkb5SgER/WwnrTkRuJAXfbPE1j5y
 h8360yT5HPg5WSUe+N7S78LFqMWscu4iEPMdXbnKbLLp0Ytzz6xTlKi++teOfVle
 PI2xO97HN6oLA02IBw==
 =jJ4i
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:

 - Migrate to modern acomp crypto interface (Ard Biesheuvel)

 - Use better return type for "rcnt" (Dan Carpenter)

* tag 'pstore-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/zone: cleanup "rcnt" type
  pstore: migrate to crypto acomp interface
2022-08-02 14:31:46 -07:00
Linus Torvalds
c013d0af81 for-5.20/block-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2
 ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU
 MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn
 b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc
 YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU
 gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6
 M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH
 5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK
 ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH
 ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H
 WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR
 4N1HEozvIw==
 =celk
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Improve the type checking of request flags (Bart)

 - Ensure queue mapping for a single queues always picks the right queue
   (Bart)

 - Sanitize the io priority handling (Jan)

 - rq-qos race fix (Jinke)

 - Reserved tags handling improvements (John)

 - Separate memory alignment from file/disk offset aligment for O_DIRECT
   (Keith)

 - Add new ublk driver, userspace block driver using io_uring for
   communication with the userspace backend (Ming)

 - Use try_cmpxchg() to cleanup the code in various spots (Uros)

 - Finally remove bdevname() (Christoph)

 - Clean up the zoned device handling (Christoph)

 - Clean up independent access range support (Christoph)

 - Clean up and improve block sysfs handling (Christoph)

 - Clean up and improve teardown of block devices.

   This turns the usual two step process into something that is simpler
   to implement and handle in block drivers (Christoph)

 - Clean up chunk size handling (Christoph)

 - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
   Ming, Sebastian, Yang, Ying)

* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
  ublk_drv: fix double shift bug
  ublk_drv: make sure that correct flags(features) returned to userspace
  ublk_drv: fix error handling of ublk_add_dev
  ublk_drv: fix lockdep warning
  block: remove __blk_get_queue
  block: call blk_mq_exit_queue from disk_release for never added disks
  blk-mq: fix error handling in __blk_mq_alloc_disk
  ublk: defer disk allocation
  ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
  ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
  ublk: cleanup ublk_ctrl_uring_cmd
  ublk: simplify ublk_ch_open and ublk_ch_release
  ublk: remove the empty open and release block device operations
  ublk: remove UBLK_IO_F_PREFLUSH
  ublk: add a MAINTAINERS entry
  block: don't allow the same type rq_qos add more than once
  mmc: fix disk/queue leak in case of adding disk failure
  ublk_drv: fix an IS_ERR() vs NULL check
  ublk: remove UBLK_IO_F_INTEGRITY
  ublk_drv: remove unneeded semicolon
  ...
2022-08-02 13:46:35 -07:00
Linus Torvalds
98e2474640 for-5.20/io_uring-buffered-writes-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLkm7UQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpldTEADTg/96R+eq78UZBNZmdifY9/qwQD+kzNiK
 ACDoYZFSbWUjMeOWqRxbYr6mXBKHnGHyTGlraTTpLDzhpB1xwoWfgOK9uOYXW/Ik
 eWfgTujPW/8v/l/z86khE+GH9b/maGCRqNZgS6uLVLzhxG6oCkoYTyOh1iHaF1VM
 Rma4nbJ8GSEDtiXNDl0Bznnyks/pzwoz/9slwzZ7PxtFwZsBxKuxgMUR5HIXdRp7
 5iUoFJhZrGWyi/dbQZUsK/9VYVVnKkcBCz2pb4GEmC+3dS/vlPEoeWUpPHInNyd1
 9NB9v8c+KFmQaWnCxuxcdHvCfmRRQrX8Pr8/OBNZKO6McYrKWKA+lurp4EGClE3m
 cZdK+P/9FS/Eeua8hum9UnbPAqsJPqLTbpbrySeBdd4iFA6u7rRqDX2+nz3PNe9U
 1b7V1bWBIEY/Rsw/PKo59oIeV0auD8v9OCHJ0lF2pv6dRln2/W0y1Qfd1DI18xFG
 +9bBnQzhF7R0O8UP5ApVayQCYrd906YsSVUOqAiLmUs/BoOgRq6g/0BqSOVVKE2u
 5iq8zTsVMkxY0ZpExwZST/700JwkPIV4SVPEYRC6QssFTcylvlisIek6XYSS9HX4
 Z6gzMwJW1H47bEfG4JolTI8uBjp0hQLCPX0O0XFLVnbHQwN0kjIBmv3axAwJO2NV
 qrrHXjf09w==
 =hV7G
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/io_uring-buffered-writes-2022-07-29' of git://git.kernel.dk/linux-block

Pull io_uring buffered writes support from Jens Axboe:
 "This contains support for buffered writes, specifically for XFS. btrfs
  is in progress, will be coming in the next release.

  io_uring does support buffered writes on any file type, but since the
  buffered write path just always -EAGAIN (or -EOPNOTSUPP) any attempt
  to do so if IOCB_NOWAIT is set, any buffered write will effectively be
  handled by io-wq offload. This isn't very efficient, and we even have
  specific code in io-wq to serialize buffered writes to the same inode
  to avoid further inefficiencies with thread offload.

  This is particularly sad since most buffered writes don't block, they
  simply copy data to a page and dirty it. With this pull request, we
  can handle buffered writes a lot more effiently.

  If balance_dirty_pages() needs to block, we back off on writes as
  indicated.

  This improves buffered write support by 2-3x.

  Jan Kara helped with the mm bits for this, and Stefan handled the
  fs/iomap/xfs/io_uring parts of it"

* tag 'for-5.20/io_uring-buffered-writes-2022-07-29' of git://git.kernel.dk/linux-block:
  mm: honor FGP_NOWAIT for page cache page allocation
  xfs: Add async buffered write support
  xfs: Specify lockmode when calling xfs_ilock_for_iomap()
  io_uring: Add tracepoint for short writes
  io_uring: fix issue with io_write() not always undoing sb_start_write()
  io_uring: Add support for async buffered writes
  fs: Add async write file modification handling.
  fs: Split off inode_needs_update_time and __file_update_time
  fs: add __remove_file_privs() with flags parameter
  fs: add a FMODE_BUF_WASYNC flags for f_mode
  iomap: Return -EAGAIN from iomap_write_iter()
  iomap: Add async buffered write support
  iomap: Add flags parameter to iomap_page_create()
  mm: Add balance_dirty_pages_ratelimited_flags() function
  mm: Move updates of dirty_exceeded into one place
  mm: Move starting of background writeback into the main balancing loop
2022-08-02 13:27:23 -07:00
Linus Torvalds
b349b1181d for-5.20/io_uring-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLkm5gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmKMD/4l3QIrLbjYIxlfrzQcHbmYuUkbQtj3SbZg
 6ejbnGVhCs1P9DdXH8MgE2BxgpiXQE0CqOK7vbSoo5ep2n2UTLI2DIxAl74SMIo7
 0wmJXtUJySuViKr3NYVHqlN180MkQYddBz0nGElhkQBPBCMhW8CrtPCeURr/YyHp
 2RxSYBXiUx2gRyig+klnp6oPEqelcBZJUyNHdA9yVrgl/RhB/t2rKj7D++8ukQM3
 Zuyh8WIkTeTfUz9hdGG7fuCEdZN4DlO2CCEc7uy0cKi6VRCKH4hYUCqClJ+/cfd2
 43dUI2O7B6D1t/ObFh8AGIDXBDqVA6ePQohQU6gooRkfQiBPKkc9d0ts4yIhRqca
 AjkzNM+0Eve3A01loJ8J84w8oZnvNpYEv5n8/sZVLWcyU3UIs0I88nC2OBiFtoRq
 d77CtFLwOTo+r3STtAhnZOqez90rhS6BqKtqlUP346PCuFItl6/MbGtwdTbLYEFj
 CVNIb2pERWSr2NxGv4lFyXaX/cRwruxojWH7yc3rRYjr4Ykevd1pe/fMGNiMAnKw
 5em/3QU3qq0ZVcXLMihksKeHHFIQwGDRMuyuv/fktV10+yYXQ0t16WzkJT3aR8Xo
 cqs0r8+6Jnj3uYcOMzj/FoLcpEPr21hnwAtzLto1mG1Wh4JRn/D7Nx5zqxPLxcW+
 NiU6VihPOw==
 =gxeV
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - As per (valid) complaint in the last merge window, fs/io_uring.c has
   grown quite large these days. io_uring isn't really tied to fs
   either, as it supports a wide variety of functionality outside of
   that.

   Move the code to io_uring/ and split it into files that either
   implement a specific request type, and split some code into helpers
   as well. The code is organized a lot better like this, and io_uring.c
   is now < 4K LOC (me).

 - Deprecate the epoll_ctl opcode. It'll still work, just trigger a
   warning once if used. If we don't get any complaints on this, and I
   don't expect any, then we can fully remove it in a future release
   (me).

 - Improve the cancel hash locking (Hao)

 - kbuf cleanups (Hao)

 - Efficiency improvements to the task_work handling (Dylan, Pavel)

 - Provided buffer improvements (Dylan)

 - Add support for recv/recvmsg multishot support. This is similar to
   the accept (or poll) support for have for multishot, where a single
   SQE can trigger everytime data is received. For applications that
   expect to do more than a few receives on an instantiated socket, this
   greatly improves efficiency (Dylan).

 - Efficiency improvements for poll handling (Pavel)

 - Poll cancelation improvements (Pavel)

 - Allow specifiying a range for direct descriptor allocations (Pavel)

 - Cleanup the cqe32 handling (Pavel)

 - Move io_uring types to greatly cleanup the tracing (Pavel)

 - Tons of great code cleanups and improvements (Pavel)

 - Add a way to do sync cancelations rather than through the sqe -> cqe
   interface, as that's a lot easier to use for some use cases (me).

 - Add support to IORING_OP_MSG_RING for sending direct descriptors to a
   different ring. This avoids the usually problematic SCM case, as we
   disallow those. (me)

 - Make the per-command alloc cache we use for apoll generic, place
   limits on it, and use it for netmsg as well (me).

 - Various cleanups (me, Michal, Gustavo, Uros)

* tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block: (172 commits)
  io_uring: ensure REQ_F_ISREG is set async offload
  net: fix compat pointer in get_compat_msghdr()
  io_uring: Don't require reinitable percpu_ref
  io_uring: fix types in io_recvmsg_multishot_overflow
  io_uring: Use atomic_long_try_cmpxchg in __io_account_mem
  io_uring: support multishot in recvmsg
  net: copy from user before calling __get_compat_msghdr
  net: copy from user before calling __copy_msghdr
  io_uring: support 0 length iov in buffer select in compat
  io_uring: fix multishot ending when not polled
  io_uring: add netmsg cache
  io_uring: impose max limit on apoll cache
  io_uring: add abstraction around apoll cache
  io_uring: move apoll cache to poll.c
  io_uring: consolidate hash_locked io-wq handling
  io_uring: clear REQ_F_HASH_LOCKED on hash removal
  io_uring: don't race double poll setting REQ_F_ASYNC_DATA
  io_uring: don't miss setting REQ_F_DOUBLE_POLL
  io_uring: disable multishot recvmsg
  io_uring: only trace one of complete or overflow
  ...
2022-08-02 13:20:44 -07:00
Trond Myklebust
2135e5d562 NFSv4/pnfs: Fix a use-after-free bug in open
If someone cancels the open RPC call, then we must not try to free
either the open slot or the layoutget operation arguments, since they
are likely still in use by the hung RPC call.

Fixes: 6949493884 ("NFSv4: Don't hold the layoutget locks across multiple RPC calls")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-02 16:04:29 -04:00
Trond Myklebust
b1a28f2eb9 NFS: nfs_async_write_reschedule_io must not recurse into the writeback code
It is not safe to call filemap_fdatawrite_range() from
nfs_async_write_reschedule_io(), since we're often calling from a page
reclaim context. Just let fsync() redrive the writeback for us.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-02 16:03:12 -04:00
David Howells
2757a4dc18 afs: Fix access after dec in put functions
Reference-putting functions should not access the object being put after
decrementing the refcount unless they reduce the refcount to zero.

Fix a couple of instances of this in afs by copying the information to be
logged by tracepoint to local variables before doing the decrement.

[Fixed a bit in afs_put_server() that I'd missed but Marc caught]

Fixes: 341f741f04 ("afs: Refcount the afs_call struct")
Fixes: 4521819369 ("afs: Trace afs_server usage")
Fixes: 977e5f8ed0 ("afs: Split the usage count on struct afs_server")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/165911278430.3745403.16526310736054780645.stgit@warthog.procyon.org.uk/ # v1
2022-08-02 18:21:29 +01:00
David Howells
c56f9ec8b2 afs: Use refcount_t rather than atomic_t
Use refcount_t rather than atomic_t in afs to make use of the count
checking facilities provided.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/165911277768.3745403.423349776836296452.stgit@warthog.procyon.org.uk/ # v1
2022-08-02 18:10:11 +01:00
Christoph Hellwig
cf5e7a6521 fs: remove the NULL get_block case in mpage_writepages
No one calls mpage_writepages with a NULL get_block paramter, so remove
support for that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:04 -04:00
Christoph Hellwig
f2d3e573bf fs: don't call ->writepage from __mpage_writepage
All callers of mpage_writepage use block_write_full_page as their
->writepage implementation when called from mpage_writepages
(although for ntfs3 this is obsfucated a bit).

Just call block_write_full_page directly instead of going through
the ->writepage indirection.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:04 -04:00
Christoph Hellwig
cc9cf350d1 fs: remove the nobh helpers
All callers are gone, so remove the now dead code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:04 -04:00
Christoph Hellwig
002cbb1356 jfs: stop using the nobh helper
The nobh mode is an obscure feature to save lowlevel for large memory
32-bit configurations while trading for much slower performance and
has been long obsolete.  Switch to the regular buffer head based helpers
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:04 -04:00
Christoph Hellwig
0cc5b4ce7a ext2: remove nobh support
The nobh mode is an obscure feature to save lowlevel for large memory
32-bit configurations while trading for much slower performance and
has been long obsolete.  Remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:04 -04:00
Christoph Hellwig
9139710148 ntfs3: refactor ntfs_writepages
Handle the resident case with an explicit generic_writepages call instead
of using the obscure overload that makes mpage_writepages with a NULL
get_block do the same thing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
b890ec2a2c hugetlb: Convert to migrate_folio
This involves converting migrate_huge_page_move_mapping().  We also need a
folio variant of hugetlb_set_page_subpool(), but that's for a later patch.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
3648951ceb aio: Convert to migrate_folio
Use a folio throughout this function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
1d5b9bd656 f2fs: Convert to filemap_migrate_folio()
filemap_migrate_folio() fits f2fs's needs perfectly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Chao Yu <chao@kernel.org>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
e7b15bae55 ubifs: Convert to filemap_migrate_folio()
filemap_migrate_folio() is a little more general than ubifs really needs,
but it's better to share the code.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
e7a60a1787 btrfs: Convert btrfs_migratepage to migrate_folio
Use filemap_migrate_folio() to do the bulk of the work, and then copy
the ordered flag across if needed.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
2ec810d596 mm/migrate: Add filemap_migrate_folio()
There is nothing iomap-specific about iomap_migratepage(), and it fits
a pattern used by several other filesystems, so move it to mm/migrate.c,
convert it to be filemap_migrate_folio() and convert the iomap filesystems
to use it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
541846502f mm/migrate: Convert migrate_page() to migrate_folio()
Convert all callers to pass a folio.  Most have the folio
already available.  Switch all users from aops->migratepage to
aops->migrate_folio.  Also turn the documentation into kerneldoc.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle)
4ae84a8047 nfs: Convert to migrate_folio
Use a folio throughout this function.  migrate_page() will be converted
later.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
8958b55142 btrfs: Convert btree_migratepage to migrate_folio
Use a folio throughout this function.  migrate_page() will be converted
later.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
67235182a4 mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
Use a folio throughout __buffer_migrate_folio(), add kernel-doc for
buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their
declarations to buffer.h and switch all filesystems that have wired
them up.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
37ce0b319b ext2: Use a folio in ext2_get_page()
Remove a call to read_mapping_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
240159077d gfs2: Convert gfs2_jhead_process_page() to use a folio
Use folio_put_refs() to perform only one atomic operation instead of two.
The other changes are straightforward conversions from page APIs to
their folio equivalents.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
9bb88987bc ocfs2: Convert ocfs2_read_folio() to use a folio
Use the folio API throughout.  There are a few places where we convert
back to a page to call into the rest of the filesystem, so folio usage
needs to be pushed down to those functions later.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
36a43502e1 freevxfs: Convert vxfs_immed_read_folio() to use a folio
Reorganise the file to remove the forward declaration.
Use folios throughout vxfs_immed_read_folio().
Use memcpy_to_page() instead of an open-coded kmap()/kunmap().
Remove flush_dcache_page() as this is embedded in memcpy_to_page().
Use folio_pos() instead of opencoding it.
Handle multi-page folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
9a0a953323 coda: Convert coda_symlink_filler() to use a folio
This is a straightforward conversion from the page APIs to the folio
APIs.  Symlinks are not allowed to be larger than PAGE_SIZE, so there
is little work to do here.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
ac09d88b9f befs: Convert befs_symlink_read_folio() to use a folio
This is a straightforward conversion from the page APIs to the folio
APIs.  Symlinks are not allowed to be larger than PAGE_SIZE, so there
is little work to do here.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle)
cf948cbc35 cramfs: read_mapping_page() is synchronous
Since commit 67f9fd91f9, the code to wait for the read to complete has
been dead.  That commit wrongly stated that the read was synchronous
already; this seems to have been a confusion about which ->readpage
operation was being called.  Instead of reintroducing an asynchronous
version of read_mapping_page(), call the readahead code directly to
submit all reads first before waiting for them in read_mapping_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:02 -04:00
Matthew Wilcox (Oracle)
97a3a383c4 ocfs2: Use filemap_write_and_wait_range() in ocfs2_cow_sync_writeback()
Remove the open-coding of filemap_fdatawait_range().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:02 -04:00
Matthew Wilcox (Oracle)
e775dfb33d hostfs: Handle page write errors correctly
If a page can't be written back, we need to call mapping_set_error(),
not clear the page's Uptodate flag.  Also remove the clearing of PageError
on success; that flag is used for read errors, not write errors.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:02 -04:00
Matthew Wilcox (Oracle)
31e748e4b1 squashfs: Return the actual error from squashfs_read_folio()
Since we actually know what error happened, we can report it instead
of having the generic code return -EIO for pages that were unlocked
without being marked uptodate.  Also remove a test of PageError since
we have the return value at this point.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:02 -04:00
Matthew Wilcox (Oracle)
b7a6eb22ba buffer: Don't test folio error in block_read_full_folio()
We can cache this information in a local variable instead of communicating
from one part of the function to another via folio flags.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:02 -04:00
William Dean
4f1196288d ovl: fix spelling mistakes
fix follow spelling misktakes:
	decendant  ==> descendant
	indentify  ==> identify
	underlaying ==> underlying

Reported-by: Hacash Robot <hacashRobot@santino.com>
Signed-off-by: William Dean <williamsukatube@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-08-02 15:41:10 +02:00
David Sterba
5abbb7b928 affs: use memcpy_to_page and remove replace kmap_atomic()
The use of kmap() is being deprecated in favor of kmap_local_page()
where it is feasible. For kmap around a memcpy there's a convenience
helper memcpy_to_page that also makes the flush_dcache_page() redundant.

CC: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-01 19:53:31 +02:00
Linus Torvalds
0fac198def fs.idmapped.overlay.acl.v5.20
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYufiMwAKCRCRxhvAZXjc
 os2iAQDr3tK9e2EUZDZ3Vgu3tvmTLKiU7W7f4U/ZAjJE5snBOwD+OqK8r1RdvXf8
 TatkVFFNZYlINDN6JrS5yGSKBm1+RwE=
 =8eZE
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.overlay.acl.v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull acl updates from Christian Brauner:
 "Last cycle we introduced support for mounting overlayfs on top of
  idmapped mounts. While looking into additional testing we realized
  that posix acls don't really work correctly with stacking filesystems
  on top of idmapped layers.

  We already knew what the fix were but it would require work that is
  more suitable for the merge window so we turned off posix acls for
  v5.19 for overlayfs on top of idmapped layers with Miklos routing my
  patch upstream in 72a8e05d4f ("Merge tag 'ovl-fixes-5.19-rc7' [..]").

  This contains the work to support posix acls for overlayfs on top of
  idmapped layers. Since the posix acl fixes should use the new
  vfs{g,u}id_t work the associated branch has been merged in. (We sent a
  pull request for this earlier.)

  We've also pulled in Miklos pull request containing my patch to turn
  of posix acls on top of idmapped layers. This allowed us to avoid
  rebasing the branch which we didn't like because we were already at
  rc7 by then. Merging it in allows this branch to first fix posix acls
  and then to cleanly revert the temporary fix it brought in by commit
  4a47c6385b ("ovl: turn of SB_POSIXACL with idmapped layers
  temporarily").

  The last patch in this series adds Seth Forshee as a co-maintainer for
  idmapped mounts. Seth has been integral to all of this work and is
  also the main architect behind the filesystem idmapping work which
  ultimately made filesystems such as FUSE and overlayfs available in
  containers. He continues to be active in both development and review.
  I'm very happy he decided to help and he has my full trust. This
  increases the bus factor which is always great for work like this. I'm
  honestly very excited about this because I think in general we don't
  do great in the bringing on new maintainers department"

For more explanations of the ACL issues, see

  https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org/

* tag 'fs.idmapped.overlay.acl.v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  Add Seth Forshee as co-maintainer for idmapped mounts
  Revert "ovl: turn of SB_POSIXACL with idmapped layers temporarily"
  ovl: handle idmappings in ovl_get_acl()
  acl: make posix_acl_clone() available to overlayfs
  acl: port to vfs{g,u}id_t
  acl: move idmapped mount fixup into vfs_{g,s}etxattr()
  mnt_idmapping: add vfs[g,u]id_into_k[g,u]id()
2022-08-01 09:10:07 -07:00
Linus Torvalds
bdfae5ce38 fs.idmapped.vfsuid.v5.20
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYufP6AAKCRCRxhvAZXjc
 omzRAQCGJ11r7T0C7t1kTdQiFSs5XN9ksFa86Hfj3dHEBIj+LQEA+bZ2/LLpElDz
 zPekgXkFQqdMr+FUL8sk94dzHT0GAgk=
 =BcK/
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.vfsuid.v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fs idmapping updates from Christian Brauner:
 "This introduces the new vfs{g,u}id_t types we agreed on. Similar to
  k{g,u}id_t the new types are just simple wrapper structs around
  regular {g,u}id_t types.

  They allow to establish a type safety boundary in the VFS for idmapped
  mounts preventing confusion betwen {g,u}ids mapped into an idmapped
  mount and {g,u}ids mapped into the caller's or the filesystem's
  idmapping.

  An initial set of helpers is introduced that allows to operate on
  vfs{g,u}id_t types. We will remove all references to non-type safe
  idmapped mounts helpers in the very near future. The patches do
  already exist.

  This converts the core attribute changing codepaths which become
  significantly easier to reason about because of this change.

  Just a few highlights here as the patches give detailed overviews of
  what is happening in the commit messages:

   - The kernel internal struct iattr contains type safe vfs{g,u}id_t
     values clearly communicating that these values have to take a given
     mount's idmapping into account.

   - The ownership values placed in struct iattr to change ownership are
     identical for idmapped and non-idmapped mounts going forward. This
     also allows to simplify stacking filesystems such as overlayfs that
     change attributes In other words, they always represent the values.

   - Instead of open coding checks for whether ownership changes have
     been requested and an actual update of the inode is required we now
     have small static inline wrappers that abstract this logic away
     removing a lot of code duplication from individual filesystems that
     all open-coded the same checks"

* tag 'fs.idmapped.vfsuid.v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  mnt_idmapping: align kernel doc and parameter order
  mnt_idmapping: use new helpers in mapped_fs{g,u}id()
  fs: port HAS_UNMAPPED_ID() to vfs{g,u}id_t
  mnt_idmapping: return false when comparing two invalid ids
  attr: fix kernel doc
  attr: port attribute changes to new types
  security: pass down mount idmapping to setattr hook
  quota: port quota helpers mount ids
  fs: port to iattr ownership update helpers
  fs: introduce tiny iattr ownership update helpers
  fs: use mount types in iattr
  fs: add two type safe mapping helpers
  mnt_idmapping: add vfs{g,u}id_t
2022-08-01 08:56:55 -07:00
Linus Torvalds
e6a7cf70a3 File locking changes for v6.0
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmLnshQTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFa1ZEACWzjP9gDRO+b5HuovRofO5gCfi0LNK
 jQAnUQmFBbV28MuRBr8lzjZFsn52C3nEz/unHpl2NXrg1dErdXmTZIUYkZIoESQl
 0hyA2lhdm/pvqfj5t9xwt9lK9xts7G+Q1Q2JsT53QlpGd7q9VOq0CFrFTuIe+HmZ
 qw9Sy/3rfP/rPALv1OzIlGDdBuslfPuijJJZq0wYx4WupA6vlGGSZXn+LxF2dHW9
 Ex/Z+n6o5mzEuPedopBBsCvdMTO2/sVmz33puqM0KBb/gmL47i15o1XXdg1O0cbL
 7LxIDOfaIm6gFsznUwrJV54WrL8zISQd/BhXbQOrbE8kmnNii1kfIyJHYx55Sa4X
 y6TmqVbYERXIwCFquO78Uywt8UgjRjuxG8SRe0AmqsxvIn/IxTjqMn5yaMURCTxA
 uyOmXHxLss3Jf2LNfd6nnrK5qKpOnPOBAn8I/4UY+eNdJGqesLyKoVPZ9O6K1dr3
 +jZJ8Ju4TVs7L3fljq6pHvbhAWivM3JEZmYrv+y8QKSRZBV0XqHagwDGHUaaLe9H
 6eHgU5yxCb+fj8EXbwxzKnJkhHXJikd4bbPOaJC+QZEKPCJJMo/pyXmDkCVwhJ73
 pjO4W0w6TGmCHinlVX6dkyYrCvWYjWglHyO5BnTY2F0Ub87/59KmepZz4dh81hi+
 ZdOIvHoF6uca3A==
 =wDtt
 -----END PGP SIGNATURE-----

Merge tag 'filelock-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking updates from Jeff Layton:
 "Just a couple of flock() patches from Kuniyuki Iwashima.

  The main change is that this moves a file_lock allocation from the
  slab to the stack"

* tag 'filelock-v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs/lock: Rearrange ops in flock syscall.
  fs/lock: Don't allocate file_lock in flock_make_lock().
2022-08-01 08:54:59 -07:00
Linus Torvalds
e88745dcfd Changes since last update:
- Add Yue Hu and Jeffle Xu as reviewers;
 
  - Add the missing wake_up when updating lzma streams;
 
  - Avoid consecutive detection for Highmem memory;
 
  - Prepare for multi-reference pclusters and get rid of PG_error;
 
  - Fix ctx->pos update for NFS export;
 
  - minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYub3chEceGlhbmdAa2Vy
 bmVsLm9yZwAKCRA5NzHcH7XmBMMSAP9P7kMPLuc0RP9AjoiQXKNAfWqIbGnbkI5C
 ACUUu5tZEgD/T7HhkDYIs/wAZzYB7qTkpepkY/XzuwnlodhaSTwnPQ8=
 =/vU1
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "First of all, we'd like to add Yue Hu and Jeffle Xu as two new
  reviewers. Thank them for spending time working on EROFS!

  There is no major feature outstanding in this cycle, mainly a patchset
  I worked on to prepare for rolling hash deduplication and folios for
  compressed data as the next big features. It kills the unneeded
  PG_error flag dependency as well.

  Apart from that, there are bugfixes and cleanups as always. Details
  are listed below:

   - Add Yue Hu and Jeffle Xu as reviewers

   - Add the missing wake_up when updating lzma streams

   - Avoid consecutive detection for Highmem memory

   - Prepare for multi-reference pclusters and get rid of PG_error

   - Fix ctx->pos update for NFS export

   - minor cleanups"

* tag 'erofs-for-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: (23 commits)
  erofs: update ctx->pos for every emitted dirent
  erofs: get rid of the leftover PAGE_SIZE in dir.c
  erofs: get rid of erofs_prepare_dio() helper
  erofs: introduce multi-reference pclusters (fully-referenced)
  erofs: record the longest decompressed size in this round
  erofs: introduce z_erofs_do_decompressed_bvec()
  erofs: try to leave (de)compressed_pages on stack if possible
  erofs: introduce struct z_erofs_decompress_backend
  erofs: get rid of `z_pagemap_global'
  erofs: clean up `enum z_erofs_collectmode'
  erofs: get rid of `enum z_erofs_page_type'
  erofs: rework online page handling
  erofs: switch compressed_pages[] to bufvec
  erofs: introduce `z_erofs_parse_in_bvecs'
  erofs: drop the old pagevec approach
  erofs: introduce bufvec to store decompressed buffers
  erofs: introduce `z_erofs_parse_out_bvecs()'
  erofs: clean up z_erofs_collector_begin()
  erofs: get rid of unneeded `inode', `map' and `sb'
  erofs: avoid consecutive detection for Highmem memory
  ...
2022-08-01 08:52:37 -07:00
Linus Torvalds
bec14d79f7 Merge tag 'fsnotify_for_v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:

 - support for FAN_MARK_IGNORE which untangles some of the not well
   defined corner cases with fanotify ignore masks

 - small cleanups

* tag 'fsnotify_for_v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: Fix comment typo
  fanotify: introduce FAN_MARK_IGNORE
  fanotify: cleanups for fanotify_mark() input validations
  fanotify: prepare for setting event flags in ignore mask
  fs: inotify: Fix typo in inotify comment
2022-08-01 08:50:39 -07:00
Linus Torvalds
af07685b9c Merge tag 'fs_for_v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 and reiserfs updates from Jan Kara:
 "A fix for ext2 handling of a corrupted fs image and cleanups in ext2
  and reiserfs"

* tag 'fs_for_v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2: Add more validity checks for inode counts
  fs/reiserfs/inode: remove dead code in _get_block_create_0()
  fs/ext2: replace ternary operator with min_t()
2022-08-01 08:48:37 -07:00
Linus Torvalds
eb43bbac4c dlm for 6.0
Changes in this set of commits:
 
 . Delay the cleanup of interrupted posix lock requests until the
   user space result arrives. Previously, the immediate cleanup
   would lead to extraneous warnings when the result arrived.
 
 . Tracepoint improvements, e.g. adding the lock resource name.
 
 . Delay the completion of lockspace creation until one full recovery
   cycle has completed. This allows more error cases to be returned to
   the caller.
 
 . Remove warnings from the locking layer about delayed network replies.
   The recently added midcomms warnings are much more useful.
 
 . Begin the process of deprecating two unused lock-timeout-related
   features. These features now require enabling via a Kconfig option,
   and enabling them triggers deprecation warnings. We expect to
   remove the code in v6.2.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJi5+RGAAoJEDgbc8f8gGmq1RwP/2xZaVKiTPYcQ0GfcmefCnnG
 8WxpMNv4ZPkjKVv7csBA8mcQyNuQqA4yLb3P+jEgkWDOKesJQNeTvfrittXCyfhG
 C7uvbUe3OCg9m+dIrzNKBu+2WtFu6tKa3aSlpPUF3Bhhe8IhwRmAkyd/Ky0VCGr9
 5jQWvy8D1p2pNoFsGKqhkfolqovmeTxgYtGxd/eHtiApo6tNwzbgcQAZw4vquCjk
 FSPO7s5HyINik0nQQ9b8MCjywmF6HG6UZjcd/qYHTUmcBZkgpegCKZRnYwQklnBD
 6BYj6X+w7WxgVsHgYBAtgd8oLRN5CtCmPljvnPTCjvgx6N9FTl8RJV8rwMqZ9C8U
 9+w7WosLxQFSyRm7KxHmKaatkOa3Baqg7cPXSwaZnsA3vBpitHWKs9cyDKwA0j3/
 sUWZFw+3VSuf7AJkSA848tC8Xs8G6YXvZgzvxzNEvtTJgO3X7sXB2lavZDyI0S26
 nwgXgs/Dt6QcOoQKGv8WgRSOMrFxtq/gX+f3gwPCHvM3panttPevXwKKQW2UtVOn
 u/BF3Oe9bGhf+J0o58Zp3gjtfDIz+c3yPkxeQqAc3pC/o1Lw7AMV2WxlxULoBLsv
 aErKwT0UemrQYRZnBmlGPaV4H1KyXzwC/fA1N8YAObJ/Ohe6x7oCKioWWMA4ggiD
 A4mOIY95o24rm++lNUkD
 =Dnn4
 -----END PGP SIGNATURE-----

Merge tag 'dlm-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:

 - Delay the cleanup of interrupted posix lock requests until the user
   space result arrives. Previously, the immediate cleanup would lead to
   extraneous warnings when the result arrived.

 - Tracepoint improvements, e.g. adding the lock resource name.

 - Delay the completion of lockspace creation until one full recovery
   cycle has completed. This allows more error cases to be returned to
   the caller.

 - Remove warnings from the locking layer about delayed network replies.
   The recently added midcomms warnings are much more useful.

 - Begin the process of deprecating two unused lock-timeout-related
   features. These features now require enabling via a Kconfig option,
   and enabling them triggers deprecation warnings. We expect to remove
   the code in v6.2.

* tag 'dlm-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  fs: dlm: move kref_put assert for lkb structs
  fs: dlm: don't use deprecated timeout features by default
  fs: dlm: add deprecation Kconfig and warnings for timeouts
  fs: dlm: remove timeout from dlm_user_adopt_orphan
  fs: dlm: remove waiter warnings
  fs: dlm: fix grammar in lowcomms output
  fs: dlm: add comment about lkb IFL flags
  fs: dlm: handle recovery result outside of ls_recover
  fs: dlm: make new_lockspace() wait until recovery completes
  fs: dlm: call dlm_lsop_recover_prep once
  fs: dlm: update comments about recovery and membership handling
  fs: dlm: add resource name to tracepoints
  fs: dlm: remove additional dereference of lksb
  fs: dlm: change ast and bast trace order
  fs: dlm: change posix lock sigint handling
  fs: dlm: use dlm_plock_info for do_unlock_close
  fs: dlm: change plock interrupted message to debug again
  fs: dlm: add pid to debug log
  fs: dlm: plock use list_first_entry
2022-08-01 08:46:53 -07:00
Alexander Aring
9585898922 fs: dlm: move kref_put assert for lkb structs
The unhold_lkb() function decrements the lock's kref, and
asserts that the ref count was not the final one.  Use the
kref_put release function (which should not be called) to
call the assert, rather than doing the assert based on the
kref_put return value.  Using kill_lkb() as the release
function doesn't make sense if we only want to assert.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-01 09:31:46 -05:00
Alexander Aring
6b0afc0cc3 fs: dlm: don't use deprecated timeout features by default
This patch will disable use of deprecated timeout features if
CONFIG_DLM_DEPRECATED_API is not set.  The deprecated features
will be removed in upcoming kernel release v6.2.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-01 09:31:38 -05:00
Alexander Aring
81eeb82fc2 fs: dlm: add deprecation Kconfig and warnings for timeouts
This patch adds a CONFIG_DLM_DEPRECATED_API Kconfig option
that must be enabled to use two timeout-related features
that we intend to remove in kernel v6.2.  Warnings are
printed if either is enabled and used.  Neither has ever
been used as far as we know.

. The DLM_LSFL_TIMEWARN lockspace creation flag will be
  removed, along with the associated configfs entry for
  setting the timeout.  Setting the flag and configfs file
  would cause dlm to track how long locks were waiting
  for reply messages.  After a timeout, a kernel message
  would be logged, and a netlink message would be sent
  to userspace.  Recently, midcomms messages have been
  added that produce much better logging about actual
  problems with messages.  No use has ever been found
  for the netlink messages.

. The userspace libdlm API has allowed the DLM_LKF_TIMEOUT
  flag with a timeout value to be set in lock requests.
  The lock request would be cancelled after the timeout.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2022-08-01 09:31:32 -05:00
Steve French
97b82c07c4 cifs: trivial style fixup
missing blank line after declaration

Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:37:38 -05:00
Yang Yingliang
aea02fc40a cifs: fix wrong unlock before return from cifs_tree_connect()
It should unlock 'tcon->tc_lock' before return from cifs_tree_connect().

Fixes: fe67bd563ec2 ("cifs: avoid use of global locks for high contention data")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:45 -05:00
Shyam Prasad N
d7d7a66aac cifs: avoid use of global locks for high contention data
During analysis of multichannel perf, it was seen that
the global locks cifs_tcp_ses_lock and GlobalMid_Lock, which
were shared between various data structures were causing a
lot of contention points.

With this change, we're breaking down the use of these locks
by introducing new locks at more granular levels. i.e.
server->srv_lock, ses->ses_lock and tcon->tc_lock to protect
the unprotected fields of server, session and tcon structs;
and server->mid_lock to protect mid related lists and entries
at server level.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:45 -05:00
Steve French
1bfa25ee30 cifs: remove remaining build warnings
Removed remaining warnings related to externs.  These warnings
although harmless could be distracting e.g.

 fs/cifs/cifsfs.c: note: in included file:
 fs/cifs/cifsglob.h:1968:24: warning: symbol 'sesInfoAllocCount' was not declared. Should it be static?

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:44 -05:00
Enzo Matsumiya
9543c8ab30 cifs: list_for_each() -> list_for_each_entry()
Replace list_for_each() by list_for_each_entr() where appropriate.
Remove no longer used list_head stack variables.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:44 -05:00
Enzo Matsumiya
da3847894f smb2: small refactor in smb2_check_message()
If the command is SMB2_IOCTL, OutputLength and OutputContext are
optional and can be zero, so return early and skip calculated length
check.

Move the mismatched length message to the end of the check, to avoid
unnecessary logs when the check was not a real miscalculation.

Also change the pr_warn_once() to a pr_warn() so we're sure to get a
log for the real mismatches.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:44 -05:00
Matthew Wilcox (Oracle)
c6f62f81b4 cifs: Fix memory leak when using fscache
If we hit the 'index == next_cached' case, we leak a refcount on the
struct page.  Fix this by using readahead_folio() which takes care of
the refcount for you.

Fixes: 0174ee9947 ("cifs: Implement cache I/O by accessing the cache directly")
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:44 -05:00
Steve French
89e42f49ef cifs: remove minor build warning
The build warning:
  warning: symbol 'cifs_tcp_ses_lock' was not declared. Should it be static?
can be distracting. Fix two of these.

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:44 -05:00
Steve French
c2c17ddbf3 cifs: remove some camelCase and also some static build warnings
Remove warnings for five global variables. For example:
  fs/cifs/cifsglob.h:1984:24: warning: symbol 'midCount' was not declared. Should it be static?

Also change them from camelCase (e.g. "midCount" to "mid_count")

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:44 -05:00
Yu Zhe
0827f71b88 cifs: remove unnecessary (void*) conversions.
One more.

remove unnecessary void* type castings.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:44 -05:00
Yu Zhe
0f46608ae7 cifs: remove unnecessary type castings
remove unnecessary void* type castings.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:44 -05:00
Colin Ian King
4da2cd0517 cifs: remove redundant initialization to variable mnt_sign_enabled
Variable mnt_sign_enabled is being initialized with a value that
is never read, it is being reassigned later on with a different
value. The initialization is redundant and can be removed.

Cleans up clang scan-build warning:
fs/cifs/cifssmb.c:465:7: warning: Value stored to 'mnt_sign_enabled
 during its initialization is never read

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:44 -05:00
Steve French
5fa2cffba0 smb3: check xattr value length earlier
Coverity complains about assigning a pointer based on
value length before checking that value length goes
beyond the end of the SMB.  Although this is even more
unlikely as value length is a single byte, and the
pointer is not dereferenced until laterm, it is clearer
to check the lengths first.

Addresses-Coverity: 1467704 ("Speculative execution data leak")
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-01 01:34:44 -05:00
Hyunchul Lee
824d4f64c2 ksmbd: prevent out of bound read for SMB2_TREE_CONNNECT
if Status is not 0 and PathLength is long,
smb_strndup_from_utf16 could make out of bound
read in smb2_tree_connnect.

This bug can lead an oops looking something like:

[ 1553.882047] BUG: KASAN: slab-out-of-bounds in smb_strndup_from_utf16+0x469/0x4c0 [ksmbd]
[ 1553.882064] Read of size 2 at addr ffff88802c4eda04 by task kworker/0:2/42805
...
[ 1553.882095] Call Trace:
[ 1553.882098]  <TASK>
[ 1553.882101]  dump_stack_lvl+0x49/0x5f
[ 1553.882107]  print_report.cold+0x5e/0x5cf
[ 1553.882112]  ? smb_strndup_from_utf16+0x469/0x4c0 [ksmbd]
[ 1553.882122]  kasan_report+0xaa/0x120
[ 1553.882128]  ? smb_strndup_from_utf16+0x469/0x4c0 [ksmbd]
[ 1553.882139]  __asan_report_load_n_noabort+0xf/0x20
[ 1553.882143]  smb_strndup_from_utf16+0x469/0x4c0 [ksmbd]
[ 1553.882155]  ? smb_strtoUTF16+0x3b0/0x3b0 [ksmbd]
[ 1553.882166]  ? __kmalloc_node+0x185/0x430
[ 1553.882171]  smb2_tree_connect+0x140/0xab0 [ksmbd]
[ 1553.882185]  handle_ksmbd_work+0x30e/0x1020 [ksmbd]
[ 1553.882197]  process_one_work+0x778/0x11c0
[ 1553.882201]  ? _raw_spin_lock_irq+0x8e/0xe0
[ 1553.882206]  worker_thread+0x544/0x1180
[ 1553.882209]  ? __cpuidle_text_end+0x4/0x4
[ 1553.882214]  kthread+0x282/0x320
[ 1553.882218]  ? process_one_work+0x11c0/0x11c0
[ 1553.882221]  ? kthread_complete_and_exit+0x30/0x30
[ 1553.882225]  ret_from_fork+0x1f/0x30
[ 1553.882231]  </TASK>

There is no need to check error request validation in server.
This check allow invalid requests not to validate message.

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17818
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-31 23:14:32 -05:00
Hyunchul Lee
ac60778b87 ksmbd: prevent out of bound read for SMB2_WRITE
OOB read memory can be written to a file,
if DataOffset is 0 and Length is too large
in SMB2_WRITE request of compound request.

To prevent this, when checking the length of
the data area of SMB2_WRITE in smb2_get_data_area_len(),
let the minimum of DataOffset be the size of
SMB2 header + the size of SMB2_WRITE header.

This bug can lead an oops looking something like:

[  798.008715] BUG: KASAN: slab-out-of-bounds in copy_page_from_iter_atomic+0xd3d/0x14b0
[  798.008724] Read of size 252 at addr ffff88800f863e90 by task kworker/0:2/2859
...
[  798.008754] Call Trace:
[  798.008756]  <TASK>
[  798.008759]  dump_stack_lvl+0x49/0x5f
[  798.008764]  print_report.cold+0x5e/0x5cf
[  798.008768]  ? __filemap_get_folio+0x285/0x6d0
[  798.008774]  ? copy_page_from_iter_atomic+0xd3d/0x14b0
[  798.008777]  kasan_report+0xaa/0x120
[  798.008781]  ? copy_page_from_iter_atomic+0xd3d/0x14b0
[  798.008784]  kasan_check_range+0x100/0x1e0
[  798.008788]  memcpy+0x24/0x60
[  798.008792]  copy_page_from_iter_atomic+0xd3d/0x14b0
[  798.008795]  ? pagecache_get_page+0x53/0x160
[  798.008799]  ? iov_iter_get_pages_alloc+0x1590/0x1590
[  798.008803]  ? ext4_write_begin+0xfc0/0xfc0
[  798.008807]  ? current_time+0x72/0x210
[  798.008811]  generic_perform_write+0x2c8/0x530
[  798.008816]  ? filemap_fdatawrite_wbc+0x180/0x180
[  798.008820]  ? down_write+0xb4/0x120
[  798.008824]  ? down_write_killable+0x130/0x130
[  798.008829]  ext4_buffered_write_iter+0x137/0x2c0
[  798.008833]  ext4_file_write_iter+0x40b/0x1490
[  798.008837]  ? __fsnotify_parent+0x275/0xb20
[  798.008842]  ? __fsnotify_update_child_dentry_flags+0x2c0/0x2c0
[  798.008846]  ? ext4_buffered_write_iter+0x2c0/0x2c0
[  798.008851]  __kernel_write+0x3a1/0xa70
[  798.008855]  ? __x64_sys_preadv2+0x160/0x160
[  798.008860]  ? security_file_permission+0x4a/0xa0
[  798.008865]  kernel_write+0xbb/0x360
[  798.008869]  ksmbd_vfs_write+0x27e/0xb90 [ksmbd]
[  798.008881]  ? ksmbd_vfs_read+0x830/0x830 [ksmbd]
[  798.008892]  ? _raw_read_unlock+0x2a/0x50
[  798.008896]  smb2_write+0xb45/0x14e0 [ksmbd]
[  798.008909]  ? __kasan_check_write+0x14/0x20
[  798.008912]  ? _raw_spin_lock_bh+0xd0/0xe0
[  798.008916]  ? smb2_read+0x15e0/0x15e0 [ksmbd]
[  798.008927]  ? memcpy+0x4e/0x60
[  798.008931]  ? _raw_spin_unlock+0x19/0x30
[  798.008934]  ? ksmbd_smb2_check_message+0x16af/0x2350 [ksmbd]
[  798.008946]  ? _raw_spin_lock_bh+0xe0/0xe0
[  798.008950]  handle_ksmbd_work+0x30e/0x1020 [ksmbd]
[  798.008962]  process_one_work+0x778/0x11c0
[  798.008966]  ? _raw_spin_lock_irq+0x8e/0xe0
[  798.008970]  worker_thread+0x544/0x1180
[  798.008973]  ? __cpuidle_text_end+0x4/0x4
[  798.008977]  kthread+0x282/0x320
[  798.008982]  ? process_one_work+0x11c0/0x11c0
[  798.008985]  ? kthread_complete_and_exit+0x30/0x30
[  798.008989]  ret_from_fork+0x1f/0x30
[  798.008995]  </TASK>

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17817
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-31 23:14:32 -05:00
Namjae Jeon
cf6531d981 ksmbd: fix use-after-free bug in smb2_tree_disconect
smb2_tree_disconnect() freed the struct ksmbd_tree_connect,
but it left the dangling pointer. It can be accessed
again under compound requests.

This bug can lead an oops looking something link:

[ 1685.468014 ] BUG: KASAN: use-after-free in ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd]
[ 1685.468068 ] Read of size 4 at addr ffff888102172180 by task kworker/1:2/4807
...
[ 1685.468130 ] Call Trace:
[ 1685.468132 ]  <TASK>
[ 1685.468135 ]  dump_stack_lvl+0x49/0x5f
[ 1685.468141 ]  print_report.cold+0x5e/0x5cf
[ 1685.468145 ]  ? ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd]
[ 1685.468157 ]  kasan_report+0xaa/0x120
[ 1685.468194 ]  ? ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd]
[ 1685.468206 ]  __asan_report_load4_noabort+0x14/0x20
[ 1685.468210 ]  ksmbd_tree_conn_disconnect+0x131/0x160 [ksmbd]
[ 1685.468222 ]  smb2_tree_disconnect+0x175/0x250 [ksmbd]
[ 1685.468235 ]  handle_ksmbd_work+0x30e/0x1020 [ksmbd]
[ 1685.468247 ]  process_one_work+0x778/0x11c0
[ 1685.468251 ]  ? _raw_spin_lock_irq+0x8e/0xe0
[ 1685.468289 ]  worker_thread+0x544/0x1180
[ 1685.468293 ]  ? __cpuidle_text_end+0x4/0x4
[ 1685.468297 ]  kthread+0x282/0x320
[ 1685.468301 ]  ? process_one_work+0x11c0/0x11c0
[ 1685.468305 ]  ? kthread_complete_and_exit+0x30/0x30
[ 1685.468309 ]  ret_from_fork+0x1f/0x30

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17816
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-31 23:14:32 -05:00
Namjae Jeon
aa7253c239 ksmbd: fix memory leak in smb2_handle_negotiate
The allocated memory didn't free under an error
path in smb2_handle_negotiate().

Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-17815
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-31 23:14:32 -05:00
Namjae Jeon
af7c39d971 ksmbd: fix racy issue while destroying session on multichannel
After multi-channel connection with windows, Several channels of
session are connected. Among them, if there is a problem in one channel,
Windows connects again after disconnecting the channel. In this process,
the session is released and a kernel oop can occurs while processing
requests to other channels. When the channel is disconnected, if other
channels still exist in the session after deleting the channel from
the channel list in the session, the session should not be released.
Finally, the session will be released after all channels are disconnected.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-31 23:14:32 -05:00
Namjae Jeon
a14c573870 ksmbd: use wait_event instead of schedule_timeout()
ksmbd threads eating masses of cputime when connection is disconnected.
If connection is disconnected, ksmbd thread waits for pending requests
to be processed using schedule_timeout. schedule_timeout() incorrectly
is used, and it is more efficient to use wait_event/wake_up than to check
r_count every time with timeout.

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-31 23:14:32 -05:00
Takashi Iwai
512b74d17a exfat: Drop superfluous new line for error messages
exfat_err() adds the new line at the end of the message by itself,
hence the passed string shouldn't contain a new line.  Drop the
superfluous newline letters in the error messages in a few places that
have been put mistakenly.

Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-08-01 10:14:07 +09:00
Takashi Iwai
64fca6e621 exfat: Downgrade ENAMETOOLONG error message to debug messages
The ENAMETOOLONG error message is printed at each time when user tries
to operate with a too long name, and this can flood the kernel logs
easily, as every user can trigger this.  Let's downgrade this error
message level to a debug message for suppressing the superfluous
logs.

BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1201725
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-08-01 10:14:07 +09:00
Takashi Iwai
6425baabda exfat: Expand exfat_err() and co directly to pr_*() macro
Currently the error and info messages handled by exfat_err() and co
are tossed to exfat_msg() function that does nothing but passes the
strings with printk() invocation.  Not only that this is more overhead
by the indirect calls, but also this makes harder to extend for the
debug print usage; because of the direct printk() call, you cannot
make it for dynamic debug or without debug like the standard helpers
such as pr_debug() or dev_dbg().

For addressing the problem, this patch replaces exfat_*() macro to
expand to pr_*() directly.  Along with it, add the new exfat_debug()
macro that is expanded to pr_debug() (which output can be gracefully
suppressed via dyndbg).

Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-08-01 10:14:07 +09:00
Takashi Iwai
1b1a9195ae exfat: Define NLS_NAME_* as bit flags explicitly
NLS_NAME_* are bit flags although they are currently defined as enum;
it's casually working so far (from 0 to 2), but it's error-prone and
may bring a problem when we want to add more flag.

This patch changes the definitions of NLS_NAME_* explicitly being bit
flags.

Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-08-01 10:14:06 +09:00
Takashi Iwai
86da53e8ff exfat: Return ENAMETOOLONG consistently for oversized paths
LTP has a test for oversized file path renames and it expects the
return value to be ENAMETOOLONG.  However, exfat returns EINVAL
unexpectedly in some cases, hence LTP test fails.  The further
investigation indicated that the problem happens only when iocharset
isn't set to utf8.

The difference comes from that, in the case of utf8,
exfat_utf8_to_utf16() returns the error -ENAMETOOLONG directly and
it's treated as the final error code.  Meanwhile, on other iocharsets,
exfat_nls_to_ucs2() returns the max path size but it sets
NLS_NAME_OVERLEN to lossy flag instead; the caller side checks only
whether lossy flag is set or not, resulting in always -EINVAL
unconditionally.

This patch aligns the return code for both cases by checking the lossy
flag bit and returning ENAMETOOLONG when NLS_NAME_OVERLEN bit is set.

BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1201725
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-08-01 10:14:06 +09:00
Yuezhang Mo
be17b1ccd4 exfat: remove duplicate write inode for extending dir/file
Since the timestamps need to be updated, the directory entries
will be updated by mark_inode_dirty() whether or not a new
cluster is allocated for the file or directory, so there is no
need to use __exfat_write_inode() to update the directory entries
when allocating a new cluster for a file or directory.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-08-01 10:14:06 +09:00
Yuezhang Mo
4493895b2b exfat: remove duplicate write inode for truncating file
This commit moves updating file attributes and timestamps before
calling __exfat_write_inode(), so that all updates of the inode
had been written by __exfat_write_inode(), mark_inode_dirty() is
unneeded.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-08-01 10:14:06 +09:00
Yuezhang Mo
23e6e1c9b3 exfat: reuse __exfat_write_inode() to update directory entry
__exfat_write_inode() is used to update file and stream directory
entries, except for file->start_clu and stream->flags.

This commit moves update file->start_clu and stream->flags to
__exfat_write_inode() and reuse __exfat_write_inode() to update
directory entries.

Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reviewed-by: Andy Wu <Andy.Wu@sony.com>
Reviewed-by: Aoyama Wataru <wataru.aoyama@sony.com>
Reviewed-by: Daniel Palmer <daniel.palmer@sony.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-08-01 10:14:05 +09:00
Xie Shaowen
5e9466a5d0 xfs: delete extra space and tab in blank line
delete extra space and tab in blank line, there is no functional change.

Reported-by: Hacash Robot <hacashRobot@santino.com>
Signed-off-by: Xie Shaowen <studentxswpy@163.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-31 09:21:27 -07:00
ChenXiaoSong
001c179c4e xfs: fix NULL pointer dereference in xfs_getbmap()
Reproducer:
 1. fallocate -l 100M image
 2. mkfs.xfs -f image
 3. mount image /mnt
 4. setxattr("/mnt", "trusted.overlay.upper", NULL, 0, XATTR_CREATE)
 5. char arg[32] = "\x01\xff\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00"
                   "\x00\x00\x00\x00\x00\x08\x00\x00\x00\xc6\x2a\xf7";
    fd = open("/mnt", O_RDONLY|O_DIRECTORY);
    ioctl(fd, _IOC(_IOC_READ|_IOC_WRITE, 0x58, 0x2c, 0x20), arg);

NULL pointer dereference will occur when race happens between xfs_getbmap()
and xfs_bmap_set_attrforkoff():

         ioctl               |       setxattr
 ----------------------------|---------------------------
 xfs_getbmap                 |
   xfs_ifork_ptr             |
     xfs_inode_has_attr_fork |
       ip->i_forkoff == 0    |
     return NULL             |
   ifp == NULL               |
                             | xfs_bmap_set_attrforkoff
                             |   ip->i_forkoff > 0
   xfs_inode_has_attr_fork   |
     ip->i_forkoff > 0       |
   ifp == NULL               |
   ifp->if_format            |

Fix this by locking i_lock before xfs_ifork_ptr().

Fixes: abbf9e8a45 ("xfs: rewrite getbmap using the xfs_iext_* helpers")
Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: added fixes tag]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-31 09:21:27 -07:00
Hongnan Li
ecce9212d0 erofs: update ctx->pos for every emitted dirent
erofs_readdir update ctx->pos after filling a batch of dentries
and it may cause dir/files duplication for NFS readdirplus which
depends on ctx->pos to fill dir correctly. So update ctx->pos for
every emitted dirent in erofs_fill_dentries to fix it.

Also fix the update of ctx->pos when the initial file position has
exceeded nameoff.

Fixes: 3e917cc305 ("erofs: make filesystem exportable")
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220722082732.30935-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-07-31 22:26:29 +08:00
Chao Yu
09beadf289 f2fs: fix to do sanity check on segment type in build_sit_entries()
As Wenqing Liu <wenqingliu0120@gmail.com> reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=216285

RIP: 0010:memcpy_erms+0x6/0x10
 f2fs_update_meta_page+0x84/0x570 [f2fs]
 change_curseg.constprop.0+0x159/0xbd0 [f2fs]
 f2fs_do_replace_block+0x5c7/0x18a0 [f2fs]
 f2fs_replace_block+0xeb/0x180 [f2fs]
 recover_data+0x1abd/0x6f50 [f2fs]
 f2fs_recover_fsync_data+0x12ce/0x3250 [f2fs]
 f2fs_fill_super+0x4459/0x6190 [f2fs]
 mount_bdev+0x2cf/0x3b0
 legacy_get_tree+0xed/0x1d0
 vfs_get_tree+0x81/0x2b0
 path_mount+0x47e/0x19d0
 do_mount+0xce/0xf0
 __x64_sys_mount+0x12c/0x1a0
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

The root cause is segment type is invalid, so in f2fs_do_replace_block(),
f2fs accesses f2fs_sm_info::curseg_array with out-of-range segment type,
result in accessing invalid curseg->sum_blk during memcpy in
f2fs_update_meta_page(). Fix this by adding sanity check on segment type
in build_sit_entries().

Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:19:00 -07:00
Chao Yu
7b01ad7f33 f2fs: obsolete unused MAX_DISCARD_BLOCKS
After commit a7eeb82385 ("f2fs: use bitmap in discard_entry"),
MAX_DISCARD_BLOCKS became obsolete, remove it.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:18:09 -07:00
Chao Yu
141170b759 f2fs: fix to avoid use f2fs_bug_on() in f2fs_new_node_page()
As Dipanjan Das <mail.dipanjan.das@gmail.com> reported, syzkaller
found a f2fs bug as below:

RIP: 0010:f2fs_new_node_page+0x19ac/0x1fc0 fs/f2fs/node.c:1295
Call Trace:
 write_all_xattrs fs/f2fs/xattr.c:487 [inline]
 __f2fs_setxattr+0xe76/0x2e10 fs/f2fs/xattr.c:743
 f2fs_setxattr+0x233/0xab0 fs/f2fs/xattr.c:790
 f2fs_xattr_generic_set+0x133/0x170 fs/f2fs/xattr.c:86
 __vfs_setxattr+0x115/0x180 fs/xattr.c:182
 __vfs_setxattr_noperm+0x125/0x5f0 fs/xattr.c:216
 __vfs_setxattr_locked+0x1cf/0x260 fs/xattr.c:277
 vfs_setxattr+0x13f/0x330 fs/xattr.c:303
 setxattr+0x146/0x160 fs/xattr.c:611
 path_setxattr+0x1a7/0x1d0 fs/xattr.c:630
 __do_sys_lsetxattr fs/xattr.c:653 [inline]
 __se_sys_lsetxattr fs/xattr.c:649 [inline]
 __x64_sys_lsetxattr+0xbd/0x150 fs/xattr.c:649
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

NAT entry and nat bitmap can be inconsistent, e.g. one nid is free
in nat bitmap, and blkaddr in its NAT entry is not NULL_ADDR, it
may trigger BUG_ON() in f2fs_new_node_page(), fix it.

Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:57 -07:00
Chao Liu
8ee236dcaa f2fs: fix to remove F2FS_COMPR_FL and tag F2FS_NOCOMP_FL at the same time
If the inode has the compress flag, it will fail to use
'chattr -c +m' to remove its compress flag and tag no compress flag.
However, the same command will be successful when executed again,
as shown below:

  $ touch foo.txt
  $ chattr +c foo.txt
  $ chattr -c +m foo.txt
  chattr: Invalid argument while setting flags on foo.txt
  $ chattr -c +m foo.txt
  $ f2fs_io getflags foo.txt
  get a flag on foo.txt ret=0, flags=nocompression,inline_data

Fix this by removing some checks in f2fs_setflags_common()
that do not affect the original logic. I go through all the
possible scenarios, and the results are as follows. Bold is
the only thing that has changed.

+---------------+-----------+-----------+----------+
|               |            file flags            |
+ command       +-----------+-----------+----------+
|               | no flag   | compr     | nocompr  |
+---------------+-----------+-----------+----------+
| chattr +c     | compr     | compr     | -EINVAL  |
| chattr -c     | no flag   | no flag   | nocompr  |
| chattr +m     | nocompr   | -EINVAL   | nocompr  |
| chattr -m     | no flag   | compr     | no flag  |
| chattr +c +m  | -EINVAL   | -EINVAL   | -EINVAL  |
| chattr +c -m  | compr     | compr     | compr    |
| chattr -c +m  | nocompr   | *nocompr* | nocompr  |
| chattr -c -m  | no flag   | no flag   | no flag  |
+---------------+-----------+-----------+----------+

Link: https://lore.kernel.org/linux-f2fs-devel/20220621064833.1079383-1-chaoliu719@gmail.com/
Fixes: 4c8ff7095b ("f2fs: support data compression")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Chao Liu <liuchao@coolpad.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:07 -07:00
Daeho Jeong
f8e2f32bcd f2fs: introduce sysfs atomic write statistics
introduce the below 4 new sysfs node for atomic write statistics.
- current_atomic_write: the total current atomic write block count,
                        which is not committed yet.
- peak_atomic_write: the peak value of total current atomic write block
                     count after boot.
- committed_atomic_block: the accumulated total committed atomic write
                          block count after boot.
- revoked_atomic_block: the accumulated total revoked atomic write block
                        count after boot.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:07 -07:00
qixiaoyu1
1adaa71ea9 f2fs: don't bother wait_ms by foreground gc
f2fs_gc returns -EINVAL via f2fs_balance_fs when there is enough free
secs after write checkpoint, but with gc_merge enabled, it will cause
the sleep time of gc thread to be set to no_gc_sleep_time even if there
are many dirty segments can be selected.

Signed-off-by: qixiaoyu1 <qixiaoyu1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:07 -07:00
Chao Yu
0d5b9d8156 f2fs: invalidate meta pages only for post_read required inode
After commit e3b49ea368 ("f2fs: invalidate META_MAPPING before
IPU/DIO write"), invalidate_mapping_pages() will be called to
avoid race condition in between IPU/DIO and readahead for GC.

However, readahead flow is only used for post_read required inode,
so this patch adds check condition to avoids unnecessary page cache
invalidating for non-post_read inode.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:06 -07:00
Chao Liu
a8634ccf5d f2fs: allow compression of files without blocks
Files created by truncate(1) have a size but no blocks, so
they can be allowed to enable compression.

Signed-off-by: Chao Liu <liuchao@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:06 -07:00
Chao Yu
7165841d57 f2fs: fix to check inline_data during compressed inode conversion
When converting inode to compressed one via ioctl, it needs to check
inline_data, since inline_data flag and compressed flag are incompatible.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:06 -07:00
Fabio M. De Francesco
1dd55358ef f2fs: Delete f2fs_copy_page() and replace with memcpy_page()
f2fs_copy_page() is a wrapper around two kmap() + one memcpy() from/to
the mapped pages. It unnecessarily duplicates a kernel API and it makes
use of kmap(), which is being deprecated in favor of kmap_local_page().

Two main problems with kmap(): (1) It comes with an overhead as mapping
space is restricted and protected by a global lock for synchronization and
(2) it also requires global TLB invalidation when the kmap’s pool wraps
and it might block when the mapping space is fully utilized until a slot
becomes available.

With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Therefore, its
use in __clone_blkaddrs() is safe and should be preferred.

Delete f2fs_copy_page() and use a plain memcpy_page() in the only one
site calling the removed function. memcpy_page() avoids open coding two
kmap_local_page() + one memcpy() between the two kernel virtual addresses.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:57 -07:00
Chao Yu
67ca06872e f2fs: fix to invalidate META_MAPPING before DIO write
Quoted from commit e3b49ea368 ("f2fs: invalidate META_MAPPING before
IPU/DIO write")

"
Encrypted pages during GC are read and cached in META_MAPPING.
However, due to cached pages in META_MAPPING, there is an issue where
newly written pages are lost by IPU or DIO writes.

Thread A - f2fs_gc()            Thread B
/* phase 3 */
down_write(i_gc_rwsem)
ra_data_block()       ---- (a)
up_write(i_gc_rwsem)
                                f2fs_direct_IO() :
                                 - down_read(i_gc_rwsem)
                                 - __blockdev_direct_io()
                                 - get_data_block_dio_write()
                                 - f2fs_dio_submit_bio()  ---- (b)
                                 - up_read(i_gc_rwsem)
/* phase 4 */
down_write(i_gc_rwsem)
move_data_block()     ---- (c)
up_write(i_gc_rwsem)

(a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and
    cached in META_MAPPING.
(b) In thread B, writing new data by IPU or DIO write on same blkaddr as
    read in (a). cached page in META_MAPPING become out-dated.
(c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to
    new blkaddr. In conclusion, the newly written data in (b) is lost.

To address this issue, invalidating pages in META_MAPPING before IPU or
DIO write.
"

In previous commit, we missed to cover extent cache hit case, and passed
wrong value for parameter @end of invalidate_mapping_pages(), fix both
issues.

Fixes: 6aa58d8ad2 ("f2fs: readahead encrypted block during GC")
Fixes: e3b49ea368 ("f2fs: invalidate META_MAPPING before IPU/DIO write")
Cc: Hyeong-Jun Kim <hj514.kim@samsung.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Jaegeuk Kim
8e0f54a70e f2fs: add a sysfs entry to show zone capacity
This patch adds a sysfs entry showing the unusable space in a section
made by zone capacity.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Jaegeuk Kim
074b5ea290 f2fs: adjust zone capacity when considering valid block count
This patch fixes counting unusable blocks set by zone capacity when
checking the valid block count in a section.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Jaegeuk Kim
b771aadc6e f2fs: enforce single zone capacity
In order to simplify the complicated per-zone capacity, let's support
only one capacity for entire zoned device.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
duguowei
14de5fc3dd f2fs: remove redundant code for gc condition
Remove the redundant code and use local variant as the
argument directly. Make it more human-readable.

Signed-off-by: duguowei <duguowei@xiaomi.com>
[Jaegeuk Kim: make code neat]
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Daeho Jeong
7a8fc58618 f2fs: introduce memory mode
Introduce memory mode to supports "normal" and "low" memory modes.
"low" mode is to support low memory devices. Because of the nature of
low memory devices, in this mode, f2fs will try to save memory sometimes
by sacrificing performance. "normal" mode is the default mode and same
as before.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:12 -07:00
Sebastian Andrzej Siewior
50417d22d0 fs/dcache: Move wakeup out of i_seq_dir write held region.
__d_add() and __d_move() wake up waiters on dentry::d_wait from within
the i_seq_dir write held region.  This violates the PREEMPT_RT
constraints as the wake up acquires wait_queue_head::lock which is a
"sleeping" spinlock on RT.

There is no requirement to do so. __d_lookup_unhash() has cleared
DCACHE_PAR_LOOKUP and dentry::d_wait and returned the now unreachable wait
queue head pointer to the caller, so the actual wake up can be postponed
until the i_dir_seq write side critical section is left. The only
requirement is that dentry::lock is held across the whole sequence
including the wake up. The previous commit includes an analysis why this
is considered safe.

Move the wake up past end_dir_add() which leaves the i_dir_seq write side
critical section and enables preemption.

For non RT kernels there is no difference because preemption is still
disabled due to dentry::lock being held, but it shortens the time between
wake up and unlocking dentry::lock, which reduces the contention for the
woken up waiter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-30 00:38:16 -04:00
Sebastian Andrzej Siewior
45f78b0a27 fs/dcache: Move the wakeup from __d_lookup_done() to the caller.
__d_lookup_done() wakes waiters on dentry->d_wait.  On PREEMPT_RT we are
not allowed to do that with preemption disabled, since the wakeup
acquired wait_queue_head::lock, which is a "sleeping" spinlock on RT.

Calling it under dentry->d_lock is not a problem, since that is also a
"sleeping" spinlock on the same configs.  Unfortunately, two of its
callers (__d_add() and __d_move()) are holding more than just ->d_lock
and that needs to be dealt with.

The key observation is that wakeup can be moved to any point before
dropping ->d_lock.

As a first step to solve this, move the wake up outside of the
hlist_bl_lock() held section.

This is safe because:

Waiters get inserted into ->d_wait only after they'd taken ->d_lock
and observed DCACHE_PAR_LOOKUP in flags.  As long as they are
woken up (and evicted from the queue) between the moment __d_lookup_done()
has removed DCACHE_PAR_LOOKUP and dropping ->d_lock, we are safe,
since the waitqueue ->d_wait points to won't get destroyed without
having __d_lookup_done(dentry) called (under ->d_lock).

->d_wait is set only by d_alloc_parallel() and only in case when
it returns a freshly allocated in-lookup dentry.  Whenever that happens,
we are guaranteed that __d_lookup_done() will be called for resulting
dentry (under ->d_lock) before the wq in question gets destroyed.

With two exceptions wq lives in call frame of the caller of
d_alloc_parallel() and we have an explicit d_lookup_done() on the
resulting in-lookup dentry before we leave that frame.

One of those exceptions is nfs_call_unlink(), where wq is embedded into
(dynamically allocated) struct nfs_unlinkdata.  It is destroyed in
nfs_async_unlink_release() after an explicit d_lookup_done() on the
dentry wq went into.

Remaining exception is d_add_ci(). There wq is what we'd found in
->d_wait of d_add_ci() argument. Callers of d_add_ci() are two
instances of ->d_lookup() and they must have been given an in-lookup
dentry.  Which means that they'd been called by __lookup_slow() or
lookup_open(), with wq in the call frame of one of those.

Result of d_alloc_parallel() in d_add_ci() is fed to
d_splice_alias(), which either returns non-NULL (and d_add_ci() does
d_lookup_done()) or feeds dentry to __d_add() that will do
__d_lookup_done() under ->d_lock.  That concludes the analysis.

Let __d_lookup_unhash():

  1) Lock the lookup hash and clear DCACHE_PAR_LOOKUP
  2) Unhash the dentry
  3) Retrieve and clear dentry::d_wait
  4) Unlock the hash and return the retrieved waitqueue head pointer
  5) Let the caller handle the wake up.
  6) Rename __d_lookup_done() to __d_lookup_unhash_wake() to enforce
     build failures for OOT code that used __d_lookup_done() and is not
     aware of the new return value.

This does not yet solve the PREEMPT_RT problem completely because
preemption is still disabled due to i_dir_seq being held for write. This
will be addressed in subsequent steps.

An alternative solution would be to switch the waitqueue to a simple
waitqueue, but aside of Linus not being a fan of them, moving the wake up
closer to the place where dentry::lock is unlocked reduces lock contention
time for the woken up waiter.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20220613140712.77932-3-bigeasy@linutronix.de
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-30 00:36:10 -04:00
Sebastian Andrzej Siewior
cf634d540a fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT
i_dir_seq is a sequence counter with a lock which is represented by the
lowest bit. The writer atomically updates the counter which ensures that it
can be modified by only one writer at a time. This requires preemption to
be disabled across the write side critical section.

On !PREEMPT_RT kernels this is implicit by the caller acquiring
dentry::lock. On PREEMPT_RT kernels spin_lock() does not disable preemption
which means that a preempting writer or reader would live lock. It's
therefore required to disable preemption explicitly.

An alternative solution would be to replace i_dir_seq with a seqlock_t for
PREEMPT_RT, but that comes with its own set of problems due to arbitrary
lock nesting. A pure sequence count with an associated spinlock is not
possible because the locks held by the caller are not necessarily related.

As the critical section is small, disabling preemption is a sensible
solution.

Reported-by: Oleg.Karfich@wago.com
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lkml.kernel.org/r/20220613140712.77932-2-bigeasy@linutronix.de
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-30 00:35:51 -04:00
Al Viro
40a3cb0d23 d_add_ci(): make sure we don't miss d_lookup_done()
All callers of d_alloc_parallel() must make sure that resulting
in-lookup dentry (if any) will encounter __d_lookup_done() before
the final dput().  d_add_ci() might end up creating in-lookup
dentries; they are fed to d_splice_alias(), which will normally
make sure they meet __d_lookup_done().  However, it is possible
to end up with d_splice_alias() failing with ERR_PTR(-ELOOP)
without having done so.  It takes a corrupted ntfs or case-insensitive
xfs image, but neither should end up with memory corruption...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-30 00:29:05 -04:00
Christophe JAILLET
45ee6d1e93 ocfs2: fix a typo in a comment
s/heartbaet/heartbeat

Link: https://lkml.kernel.org/r/4d4a6786e8ad522bfad6d2401b7f6634f8af0e5d.1658436259.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:36 -07:00
Christophe JAILLET
702f3cf374 ocfs2: use the bitmap API to simplify code
Use bitmap_zero() instead of hand-writing it.  It is less verbose.

While at it, add an explicit #include <linux/bitmap.h>.

Link: https://lkml.kernel.org/r/86d2a027c319db12055c98f00c65f7d01e703722.1658436259.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:36 -07:00
Christophe JAILLET
97d3b2676f ocfs2: remove some useless functions
Patch series "ocfs2: A few clean_ups", v2.

__ocfs2_node_map_set_bit() and __ocfs2_node_map_clear_bit() are just
wrapper around set_bit() and clear_bit().

The leading __ also makes think that these functions are non-atomic just
like __set_bit() and __clear_bit().

So, just remove these wrappers and call set_bit() and clear_bit()
directly.

Link: https://lkml.kernel.org/r/cover.1658436259.git.christophe.jaillet@wanadoo.fr
Link: https://lkml.kernel.org/r/bd1429c84ec7d174c96dbb67a2b42b1b456d9394.1658436259.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:35 -07:00
Alexey Dobriyan
ed8fb78d7e proc: add some (hopefully) insightful comments
* /proc/${pid}/net status
* removing PDE vs last close stuff (again!)
* random small stuff

Link: https://lkml.kernel.org/r/YtwrM6sDC0OQ53YB@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:35 -07:00
Phillip Lougher
b09a7a036d squashfs: support reading fragments in readahead call
Add a function which can be used to read fragments in the readahead call.

This function is necessary because filesystems built with the -tailends
(or -always-use-fragments) option may have fragments present which cannot
be currently handled.

Link: https://lkml.kernel.org/r/20220617083810.337573-5-hsinyi@chromium.org
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:34 -07:00
Hsin-Yi Wang
8fc78b6fe2 squashfs: implement readahead
Implement readahead callback for squashfs.  It will read datablocks which
cover pages in readahead request.  For a few cases it will not mark page
as uptodate, including:

- file end is 0.
- zero filled blocks.
- current batch of pages isn't in the same datablock.
- decompressor error.

Otherwise pages will be marked as uptodate.  The unhandled pages will be
updated by readpage later.

Link: https://lkml.kernel.org/r/20220617083810.337573-4-hsinyi@chromium.org
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: Xiongwei Song <Xiongwei.Song@windriver.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:34 -07:00
Phillip Lougher
db98b43086 squashfs: always build "file direct" version of page actor
Squashfs_readahead uses the "file direct" version of the page actor, and
so build it unconditionally.

Link: https://lkml.kernel.org/r/20220617083810.337573-3-hsinyi@chromium.org
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:34 -07:00
Hsin-Yi Wang
0c12185728 Revert "squashfs: provide backing_dev_info in order to disable read-ahead"
Patch series "Implement readahead for squashfs", v7.

Commit 9eec1d897139("squashfs: provide backing_dev_info in order to
disable read-ahead") mitigates the performance drop issue for squashfs by
closing readahead for it.

This series implements readahead callback for squashfs.


This patch (of 4):

This reverts 9eec1d8971 ("squashfs: provide backing_dev_info in order
to disable read-ahead").

Revert closing the readahead to squashfs since the readahead callback for
squashfs is implemented.

Link: https://lkml.kernel.org/r/20220617083810.337573-1-hsinyi@chromium.org
Link: https://lkml.kernel.org/r/20220617083810.337573-2-hsinyi@chromium.org
Signed-off-by: Hsin-Yi Wang <hsinyi@chromium.org>
Suggested-by: Xiongwei Song <Xiongwei.Song@windriver.com>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Zheng Liang <zhengliang6@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Hou Tao <houtao1@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>

Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:12:34 -07:00
Miaohe Lin
1168076345 hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
In some cases, e.g.  when size option is not specified, f_blocks, f_bavail
and f_bfree will be set to -1 instead of 0.  Likewise, when nr_inodes
isn't specified, f_files and f_ffree will be set to -1 too.  Update the
comment to make this clear.

Link: https://lkml.kernel.org/r/20220726142918.51693-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:19 -07:00
Miaohe Lin
445c809829 hugetlbfs: cleanup some comments in inode.c
The function generic_file_buffered_read has been renamed to filemap_read
since commit 87fa0f3eb2 ("mm/filemap: rename generic_file_buffered_read
to filemap_read").  Update the corresponding comment.  And duplicated
taken in hugetlbfs_fill_super is removed.

Link: https://lkml.kernel.org/r/20220726142918.51693-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:19 -07:00
Miaohe Lin
990e52b17d hugetlbfs: remove unneeded header file
The header file signal.h is unneeded now. Remove it.

Link: https://lkml.kernel.org/r/20220726142918.51693-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:19 -07:00
Miaohe Lin
7ec3c362cf hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
The forward declaration for hugetlbfs_ops is unnecessary.  Remove it.

Link: https://lkml.kernel.org/r/20220726142918.51693-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:19 -07:00
Miaohe Lin
d00365175e hugetlbfs: use helper macro SZ_1{K,M}
Patch series "A few cleanup and fixup patches for hugetlbfs", v2.

This series contains a few cleaup patches to remove unneeded forward
declaration, use helper macro and so on.  More details can be found in the
respective changelogs.


This patch (of 5):

Use helper macro SZ_1K and SZ_1M to do the size conversion.  Minor
readability improvement.

Link: https://lkml.kernel.org/r/20220726142918.51693-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220726142918.51693-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:19 -07:00
Shiyang Ruan
35fcd75af3 xfs: fail dax mount if reflink is enabled on a partition
Failure notification is not supported on partitions.  So, when we mount a
reflink enabled xfs on a partition with dax option, let it fail with
-EINVAL code.

Link: https://lkml.kernel.org/r/20220609143435.393724-1-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:17 -07:00
Axel Rasmussen
914eedcb9b userfaultfd: don't fail on unrecognized features
The basic interaction for setting up a userfaultfd is, userspace issues
a UFFDIO_API ioctl, and passes in a set of zero or more feature flags,
indicating the features they would prefer to use.

Of course, different kernels may support different sets of features
(depending on kernel version, kconfig options, architecture, etc).
Userspace's expectations may also not match: perhaps it was built
against newer kernel headers, which defined some features the kernel
it's running on doesn't support.

Currently, if userspace passes in a flag we don't recognize, the
initialization fails and we return -EINVAL. This isn't great, though.
Userspace doesn't have an obvious way to react to this; sure, one of the
features I asked for was unavailable, but which one? The only option it
has is to turn off things "at random" and hope something works.

Instead, modify UFFDIO_API to just ignore any unrecognized feature
flags. The interaction is now that the initialization will succeed, and
as always we return the *subset* of feature flags that can actually be
used back to userspace.

Now userspace has an obvious way to react: it checks if any flags it
asked for are missing. If so, it can conclude this kernel doesn't
support those, and it can either resign itself to not using them, or
fail with an error on its own, or whatever else.

Link: https://lkml.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:17 -07:00
NeilBrown
d6a97d3f58 NFSD: add security label to struct nfsd_attrs
nfsd_setattr() now sets a security label if provided, and nfsv4 provides
it in the 'open' and 'create' paths and the 'setattr' path.
If setting the label failed (including because the kernel doesn't
support labels), an error field in 'struct nfsd_attrs' is set, and the
caller can respond.  The open/create callers clear
FATTR4_WORD2_SECURITY_LABEL in the returned attr set in this case.
The setattr caller returns the error.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:17:00 -04:00
NeilBrown
93adc1e391 NFSD: set attributes when creating symlinks
The NFS protocol includes attributes when creating symlinks.
Linux does store attributes for symlinks and allows them to be set,
though they are not used for permission checking.

NFSD currently doesn't set standard (struct iattr) attributes when
creating symlinks, but for NFSv4 it does set ACLs and security labels.
This is inconsistent.

To improve consistency, pass the provided attributes into nfsd_symlink()
and call nfsd_create_setattr() to set them.

NOTE: this results in a behaviour change for all NFS versions when the
client sends non-default attributes with a SYMLINK request. With the
Linux client, the only attributes are:
	attr.ia_mode = S_IFLNK | S_IRWXUGO;
	attr.ia_valid = ATTR_MODE;
so the final outcome will be unchanged. Other clients might sent
different attributes, and if they did they probably expect them to be
honoured.

We ignore any error from nfsd_create_setattr().  It isn't really clear
what should be done if a file is successfully created, but the
attributes cannot be set.  NFS doesn't allow partial success to be
reported.  Reporting failure is probably more misleading than reporting
success, so the status is ignored.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:17:00 -04:00
NeilBrown
7fe2a71dda NFSD: introduce struct nfsd_attrs
The attributes that nfsd might want to set on a file include 'struct
iattr' as well as an ACL and security label.
The latter two are passed around quite separately from the first, in
part because they are only needed for NFSv4.  This leads to some
clumsiness in the code, such as the attributes NOT being set in
nfsd_create_setattr().

We need to keep the directory locked until all attributes are set to
ensure the file is never visibile without all its attributes.  This need
combined with the inconsistent handling of attributes leads to more
clumsiness.

As a first step towards tidying this up, introduce 'struct nfsd_attrs'.
This is passed (by reference) to vfs.c functions that work with
attributes, and is assembled by the various nfs*proc functions which
call them.  As yet only iattr is included, but future patches will
expand this.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:17:00 -04:00
Jeff Layton
876c553cb4 NFSD: verify the opened dentry after setting a delegation
Between opening a file and setting a delegation on it, someone could
rename or unlink the dentry. If this happens, we do not want to grant a
delegation on the open.

On a CLAIM_NULL open, we're opening by filename, and we may (in the
non-create case) or may not (in the create case) be holding i_rwsem
when attempting to set a delegation.  The latter case allows a
race.

After getting a lease, redo the lookup of the file being opened and
validate that the resulting dentry matches the one in the open file
description.

To properly redo the lookup we need an rqst pointer to pass to
nfsd_lookup_dentry(), so make sure that is available.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:17:00 -04:00
Jeff Layton
bbf936edd5 NFSD: drop fh argument from alloc_init_deleg
Currently, we pass the fh of the opened file down through several
functions so that alloc_init_deleg can pass it to delegation_blocked.
The filehandle of the open file is available in the nfs4_file however,
so there's no need to pass it in a separate argument.

Drop the argument from alloc_init_deleg, nfs4_open_delegation and
nfs4_set_delegation.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:17:00 -04:00
Chuck Lever
a11ada99ce NFSD: Move copy offload callback arguments into a separate structure
Refactor so that CB_OFFLOAD arguments can be passed without
allocating a whole struct nfsd4_copy object. On my system (x86_64)
this removes another 96 bytes from struct nfsd4_copy.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:17:00 -04:00
Chuck Lever
e72f9bc006 NFSD: Add nfsd4_send_cb_offload()
Refactor for legibility.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:59 -04:00
Chuck Lever
ad1e46c9b0 NFSD: Remove kmalloc from nfsd4_do_async_copy()
Instead of manufacturing a phony struct nfsd_file, pass the
struct file returned by nfs42_ssc_open() directly to
nfsd4_do_copy().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:59 -04:00
Chuck Lever
3b7bf5933c NFSD: Refactor nfsd4_do_copy()
Refactor: Now that nfsd4_do_copy() no longer calls the cleanup
helpers, plumb the use of struct file pointers all the way down to
_nfsd_copy_file_range().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:59 -04:00
Chuck Lever
478ed7b10d NFSD: Refactor nfsd4_cleanup_inter_ssc() (2/2)
Move the nfsd4_cleanup_*() call sites out of nfsd4_do_copy(). A
subsequent patch will modify one of the new call sites to avoid
the need to manufacture the phony struct nfsd_file.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:59 -04:00
Chuck Lever
24d796ea38 NFSD: Refactor nfsd4_cleanup_inter_ssc() (1/2)
The @src parameter is sometimes a pointer to a struct nfsd_file and
sometimes a pointer to struct file hiding in a phony struct
nfsd_file. Refactor nfsd4_cleanup_inter_ssc() so the @src parameter
is always an explicit struct file.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:59 -04:00
Chuck Lever
1913cdf56c NFSD: Replace boolean fields in struct nfsd4_copy
Clean up: saves 8 bytes, and we can replace check_and_set_stop_copy()
with an atomic bitop.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:59 -04:00
Chuck Lever
8ea6e2c90b NFSD: Make nfs4_put_copy() static
Clean up: All call sites are in fs/nfsd/nfs4proc.c.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:59 -04:00
Chuck Lever
d314309425 NFSD: Reorder the fields in struct nfsd4_op
Pack the fields to reduce the size of struct nfsd4_op, which is used
an array in struct nfsd4_compoundargs.

sizeof(struct nfsd4_op):
Before: /* size: 672, cachelines: 11, members: 5 */
After:  /* size: 640, cachelines: 10, members: 5 */

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:58 -04:00
Chuck Lever
87689df694 NFSD: Shrink size of struct nfsd4_copy
struct nfsd4_copy is part of struct nfsd4_op, which resides in an
8-element array.

sizeof(struct nfsd4_op):
Before: /* size: 1696, cachelines: 27, members: 5 */
After:  /* size: 672, cachelines: 11, members: 5 */

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:58 -04:00
Chuck Lever
09426ef2a6 NFSD: Shrink size of struct nfsd4_copy_notify
struct nfsd4_copy_notify is part of struct nfsd4_op, which resides
in an 8-element array.

sizeof(struct nfsd4_op):
Before: /* size: 2208, cachelines: 35, members: 5 */
After:  /* size: 1696, cachelines: 27, members: 5 */

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:58 -04:00
Chuck Lever
bb4d842722 NFSD: nfserrno(-ENOMEM) is nfserr_jukebox
Suggested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:58 -04:00
Chuck Lever
5304877936 NFSD: Fix strncpy() fortify warning
In function ‘strncpy’,
    inlined from ‘nfsd4_ssc_setup_dul’ at /home/cel/src/linux/manet/fs/nfsd/nfs4proc.c:1392:3,
    inlined from ‘nfsd4_interssc_connect’ at /home/cel/src/linux/manet/fs/nfsd/nfs4proc.c:1489:11:
/home/cel/src/linux/manet/include/linux/fortify-string.h:52:33: warning: ‘__builtin_strncpy’ specified bound 63 equals destination size [-Wstringop-truncation]
   52 | #define __underlying_strncpy    __builtin_strncpy
      |                                 ^
/home/cel/src/linux/manet/include/linux/fortify-string.h:89:16: note: in expansion of macro ‘__underlying_strncpy’
   89 |         return __underlying_strncpy(p, q, size);
      |                ^~~~~~~~~~~~~~~~~~~~

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:58 -04:00
Chuck Lever
99b002a1fa NFSD: Clean up nfsd4_encode_readlink()
Similar changes to nfsd4_encode_readv(), all bundled into a single
patch.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:58 -04:00
Chuck Lever
5e64d85c7d NFSD: Use xdr_pad_size()
Clean up: Use a helper instead of open-coding the calculation of
the XDR pad size.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:57 -04:00
Chuck Lever
071ae99fea NFSD: Simplify starting_len
Clean-up: Now that nfsd4_encode_readv() does not have to encode the
EOF or rd_length values, it no longer needs to subtract 8 from
@starting_len.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:57 -04:00
Chuck Lever
28d5bc468e NFSD: Optimize nfsd4_encode_readv()
write_bytes_to_xdr_buf() is pretty expensive to use for inserting
an XDR data item that is always 1 XDR_UNIT at an address that is
always XDR word-aligned.

Since both the readv and splice read paths encode EOF and maxcount
values, move both to a common code path.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:57 -04:00
Chuck Lever
24c7fb8549 NFSD: Add an nfsd4_read::rd_eof field
Refactor: Make the EOF result available in the entire NFSv4 READ
path.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:57 -04:00
Chuck Lever
c738b218a2 NFSD: Clean up SPLICE_OK in nfsd4_encode_read()
Do the test_bit() once -- this reduces the number of locked-bus
operations and makes the function a little easier to read.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:57 -04:00
Chuck Lever
ab04de60ae NFSD: Optimize nfsd4_encode_fattr()
write_bytes_to_xdr_buf() is a generic way to place a variable-length
data item in an already-reserved spot in the encoding buffer.

However, it is costly. In nfsd4_encode_fattr(), it is unnecessary
because the data item is fixed in size and the buffer destination
address is always word-aligned.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:57 -04:00
Chuck Lever
095a764b7a NFSD: Optimize nfsd4_encode_operation()
write_bytes_to_xdr_buf() is a generic way to place a variable-length
data item in an already-reserved spot in the encoding buffer.
However, it is costly, and here, it is unnecessary because the
data item is fixed in size, the buffer destination address is
always word-aligned, and the destination location is already in
@p.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:57 -04:00
Jeff Layton
3a5940bfa1 nfsd: silence extraneous printk on nfsd.ko insertion
This printk pops every time nfsd.ko gets plugged in. Most kmods don't do
that and this one is not very informative. Olaf's email address seems to
be defunct at this point anyway. Just drop it.

Cc: Olaf Kirch <okir@suse.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:56 -04:00
Dai Ngo
4271c2c088 NFSD: limit the number of v4 clients to 1024 per 1GB of system memory
Currently there is no limit on how many v4 clients are supported
by the system. This can be a problem in systems with small memory
configuration to function properly when a very large number of
clients exist that creates memory shortage conditions.

This patch enforces a limit of 1024 NFSv4 clients, including courtesy
clients, per 1GB of system memory.  When the number of the clients
reaches the limit, requests that create new clients are returned
with NFS4ERR_DELAY and the laundromat is kicked start to trim old
clients. Due to the overhead of the upcall to remove the client
record, the maximun number of clients the laundromat removes on
each run is limited to 128. This is done to ensure the laundromat
can still process the other tasks in a timely manner.

Since there is now a limit of the number of clients, the 24-hr
idle time limit of courtesy client is no longer needed and was
removed.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:56 -04:00
Dai Ngo
0926c39515 NFSD: keep track of the number of v4 clients in the system
Add counter nfs4_client_count to keep track of the total number
of v4 clients, including courtesy clients, in the system.

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:56 -04:00
Dai Ngo
6867137ebc NFSD: refactoring v4 specific code to a helper in nfs4state.c
This patch moves the v4 specific code from nfsd_init_net() to
nfsd4_init_leases_net() helper in nfs4state.c

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:56 -04:00
Chuck Lever
427f5f83a3 NFSD: Ensure nf_inode is never dereferenced
The documenting comment for struct nf_file states:

/*
 * A representation of a file that has been opened by knfsd. These are hashed
 * in the hashtable by inode pointer value. Note that this object doesn't
 * hold a reference to the inode by itself, so the nf_inode pointer should
 * never be dereferenced, only used for comparison.
 */

Replace the two existing dereferences to make the comment always
true.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:56 -04:00
Chuck Lever
5e138c4a75 NFSD: NFSv4 CLOSE should release an nfsd_file immediately
The last close of a file should enable other accessors to open and
use that file immediately. Leaving the file open in the filecache
prevents other users from accessing that file until the filecache
garbage-collects the file -- sometimes that takes several seconds.

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://bugzilla.linux-nfs.org/show_bug.cgi?387
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:42 -04:00
Chuck Lever
b40a283947 NFSD: Move nfsd_file_trace_alloc() tracepoint
Avoid recording the allocation of an nfsd_file item that is
immediately released because a matching item was already
inserted in the hash.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:16:07 -04:00
Chuck Lever
be0230069f NFSD: Separate tracepoints for acquire and create
These tracepoints collect different information: the create case does
not open a file, so there's no nf_file available.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:15:54 -04:00
Chuck Lever
0ec8e9d153 NFSD: Clean up unused code after rhashtable conversion
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:15:38 -04:00
Chuck Lever
ce502f81ba NFSD: Convert the filecache to use rhashtable
Enable the filecache hash table to start small, then grow with the
workload. Smaller server deployments benefit because there should
be lower memory utilization. Larger server deployments should see
improved scaling with the number of open files.

Suggested-by: Jeff Layton <jlayton@kernel.org>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:14:25 -04:00
Chuck Lever
fc22945ecc NFSD: Set up an rhashtable for the filecache
Add code to initialize and tear down an rhashtable. The rhashtable
is not used yet.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:12:02 -04:00
Chuck Lever
c7b824c3d0 NFSD: Replace the "init once" mechanism
In a moment, the nfsd_file_hashtbl global will be replaced with an
rhashtable. Replace the one or two spots that need to check if the
hash table is available. We can easily reuse the SHUTDOWN flag for
this purpose.

Document that this mechanism relies on callers to hold the
nfsd_mutex to prevent init, shutdown, and purging to run
concurrently.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:12:01 -04:00
Chuck Lever
f0743c2b25 NFSD: Remove nfsd_file::nf_hashval
The value in this field can always be computed from nf_inode, thus
it is no longer used.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:12:01 -04:00
Chuck Lever
cb7ec76e73 NFSD: nfsd_file_hash_remove can compute hashval
Remove an unnecessary use of nf_hashval.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:12:01 -04:00
Chuck Lever
a845511007 NFSD: Refactor __nfsd_file_close_inode()
The code that computes the hashval is the same in both callers.

To prevent them from going stale, reframe the documenting comments
to remove descriptions of the underlying hash table structure, which
is about to be replaced.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:11:50 -04:00
Chuck Lever
8755326399 NFSD: nfsd_file_unhash can compute hashval from nf->nf_inode
Remove an unnecessary usage of nf_hashval.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:11:42 -04:00
Chuck Lever
f53cef15dd NFSD: Remove lockdep assertion from unhash_and_release_locked()
IIUC, holding the hash bucket lock is needed only in
nfsd_file_unhash, and there is already a lockdep assertion there.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:11:41 -04:00
Chuck Lever
54f7df7094 NFSD: No longer record nf_hashval in the trace log
I'm about to replace nfsd_file_hashtbl with an rhashtable. The
individual hash values will no longer be visible or relevant, so
remove them from the tracepoints.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:11:29 -04:00
Chuck Lever
6df1941136 NFSD: Never call nfsd_file_gc() in foreground paths
The checks in nfsd_file_acquire() and nfsd_file_put() that directly
invoke filecache garbage collection are intended to keep cache
occupancy between a low- and high-watermark. The reason to limit the
capacity of the filecache is to keep filecache lookups reasonably
fast.

However, invoking garbage collection at those points has some
undesirable negative impacts. Files that are held open by NFSv4
clients often push the occupancy of the filecache over these
watermarks. At that point:

- Every call to nfsd_file_acquire() and nfsd_file_put() results in
  an LRU walk. This has the same effect on lookup latency as long
  chains in the hash table.
- Garbage collection will then run on every nfsd thread, causing a
  lot of unnecessary lock contention.
- Limiting cache capacity pushes out files used only by NFSv3
  clients, which are the type of files the filecache is supposed to
  help.

To address those negative impacts, remove the direct calls to the
garbage collector. Subsequent patches will address maintaining
lookup efficiency as cache capacity increases.

Suggested-by: Wang Yugui <wangyugui@e16-tech.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:11:19 -04:00
Chuck Lever
edead3a558 NFSD: Fix the filecache LRU shrinker
Without LRU item rotation, the shrinker visits only a few items on
the end of the LRU list, and those would always be long-term OPEN
files for NFSv4 workloads. That makes the filecache shrinker
completely ineffective.

Adopt the same strategy as the inode LRU by using LRU_ROTATE.

Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:11:19 -04:00
Chuck Lever
4a0e73e635 NFSD: Leave open files out of the filecache LRU
There have been reports of problems when running fstests generic/531
against Linux NFS servers with NFSv4. The NFS server that hosts the
test's SCRATCH_DEV suffers from CPU soft lock-ups during the test.
Analysis shows that:

fs/nfsd/filecache.c
 482                 ret = list_lru_walk(&nfsd_file_lru,
 483                                 nfsd_file_lru_cb,
 484                                 &head, LONG_MAX);

causes nfsd_file_gc() to walk the entire length of the filecache LRU
list every time it is called (which is quite frequently). The walk
holds a spinlock the entire time that prevents other nfsd threads
from accessing the filecache.

What's more, for NFSv4 workloads, none of the items that are visited
during this walk may be evicted, since they are all files that are
held OPEN by NFS clients.

Address this by ensuring that open files are not kept on the LRU
list.

Reported-by: Frank van der Linden <fllinden@amazon.com>
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=386
Suggested-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:10:08 -04:00
Chuck Lever
c46203acdd NFSD: Trace filecache LRU activity
Observe the operation of garbage collection and the lifetime of
filecache items.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:10:07 -04:00
Chuck Lever
668ed92e65 NFSD: WARN when freeing an item still linked via nf_lru
Add a guardrail to prevent freeing memory that is still on a list.
This includes either a dispose list or the LRU list.

This is the sign of a bug, but this class of bugs can be detected
so that they don't endanger system stability, especially while
debugging.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:10:07 -04:00
Chuck Lever
2e6c6e4c43 NFSD: Hook up the filecache stat file
There has always been the capability of exporting filecache metrics
via /proc, but it was never hooked up. Let's surface these metrics
to enable better observability of the filecache.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:10:07 -04:00
Chuck Lever
8b330f7804 NFSD: Zero counters when the filecache is re-initialized
If nfsd_file_cache_init() is called after a shutdown, be sure the
stat counters are reset.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:10:07 -04:00
Chuck Lever
df2aff524f NFSD: Record number of flush calls
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:10:07 -04:00
Chuck Lever
94660cc19c NFSD: Report the number of items evicted by the LRU walk
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:10:07 -04:00
Chuck Lever
39f1d1ff81 NFSD: Refactor nfsd_file_lru_scan()
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:09:55 -04:00
Chuck Lever
3bc6d3470f NFSD: Refactor nfsd_file_gc()
Refactor nfsd_file_gc() to use the new list_lru helper.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:09:48 -04:00
Chuck Lever
0bac5a264d NFSD: Add nfsd_file_lru_dispose_list() helper
Refactor the invariant part of nfsd_file_lru_walk_list() into a
separate helper function.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:09:39 -04:00
Chuck Lever
904940e94a NFSD: Report average age of filecache items
This is a measure of how long items stay in the filecache, to help
assess how efficient the cache is.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:09:22 -04:00
Chuck Lever
d63293272a NFSD: Report count of freed filecache items
Surface the count of freed nfsd_file items.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:09:22 -04:00
Chuck Lever
29d4bdbbb9 NFSD: Report count of calls to nfsd_file_acquire()
Count the number of successful acquisitions that did not create a
file (ie, acquisitions that do not result in a compulsory cache
miss). This count can be compared directly with the reported hit
count to compute a hit ratio.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:09:22 -04:00
Chuck Lever
0fd244c115 NFSD: Report filecache LRU size
Surface the NFSD filecache's LRU list length to help field
troubleshooters monitor filecache issues.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:09:21 -04:00
Chuck Lever
ca3f9acb6d NFSD: Demote a WARN to a pr_warn()
The call trace doesn't add much value, but it sure is noisy.

Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:09:21 -04:00
Colin Ian King
842e00ac3a nfsd: remove redundant assignment to variable len
Variable len is being assigned a value zero and this is never
read, it is being re-assigned later. The assignment is redundant
and can be removed.

Cleans up clang scan-build warning:
fs/nfsd/nfsctl.c:636:2: warning: Value stored to 'len' is never read

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-29 20:08:56 -04:00