Commit graph

83819 commits

Author SHA1 Message Date
Linus Torvalds
5edc6bb321 Tracing fixes for 6.6-rc2:
- Fix the "bytes" output of the per_cpu stat file
   The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the
   accounting was not accurate. It is suppose to show how many used bytes are
   still in the ring buffer, but even when the ring buffer was empty it would
   still show there were bytes used.
 
 - Fix a bug in eventfs where reading a dynamic event directory (open) and then
   creating a dynamic event that goes into that diretory screws up the accounting.
   On close, the newly created event dentry will get a "dput" without ever having
   a "dget" done for it. The fix is to allocate an array on dir open to save what
   dentries were actually "dget" on, and what ones to "dput" on close.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZQ9wihQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6quz4AP4vSFohvmAcTzC+sKP7gMLUvEmqL76+
 1pixXrQOIP5BrQEApUW3VnjqYgjZJR2ne0N4MvvmYElm/ylBhDd4JRrD3g8=
 =X9wd
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix the "bytes" output of the per_cpu stat file

   The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the
   accounting was not accurate. It is suppose to show how many used
   bytes are still in the ring buffer, but even when the ring buffer was
   empty it would still show there were bytes used.

 - Fix a bug in eventfs where reading a dynamic event directory (open)
   and then creating a dynamic event that goes into that diretory screws
   up the accounting.

   On close, the newly created event dentry will get a "dput" without
   ever having a "dget" done for it. The fix is to allocate an array on
   dir open to save what dentries were actually "dget" on, and what ones
   to "dput" on close.

* tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Remember what dentries were created on dir open
  ring-buffer: Fix bytes info in per_cpu buffer stats
2023-09-24 13:55:34 -07:00
Linus Torvalds
85eba5f175 13 hotfixes, 10 of which pertain to post-6.5 issues. The other 3 are
cc:stable.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZQ8hRwAKCRDdBJ7gKXxA
 jlK9AQDzT/FUQV3kIshsV1IwAKFcg7gtcFSN0vs+pV+e1+4tbQD/Z2OgfGFFsCSP
 X6uc2cYHc9DG5/o44iFgadW8byMssQs=
 =w+St
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "13 hotfixes, 10 of which pertain to post-6.5 issues. The other three
  are cc:stable"

* tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  proc: nommu: fix empty /proc/<pid>/maps
  filemap: add filemap_map_order0_folio() to handle order0 folio
  proc: nommu: /proc/<pid>/maps: release mmap read lock
  mm: memcontrol: fix GFP_NOFS recursion in memory.high enforcement
  pidfd: prevent a kernel-doc warning
  argv_split: fix kernel-doc warnings
  scatterlist: add missing function params to kernel-doc
  selftests/proc: fixup proc-empty-vm test after KSM changes
  revert "scripts/gdb/symbols: add specific ko module load command"
  selftests: link libasan statically for tests with -fsanitize=address
  task_work: add kerneldoc annotation for 'data' argument
  mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list
  sh: mm: re-add lost __ref to ioremap_prot() to fix modpost warning
2023-09-23 11:51:16 -07:00
Linus Torvalds
8565bdf8cd Six smb3 client fixes, including three for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUOSHkACgkQiiy9cAdy
 T1HAswwAmCPPUWgIiR4XqWiXOmWh60+4Q7xsmSVvRAixvTf79Tkif/1AmVYhnYls
 QvnbT5TIFPtFbnWHOIYSPxkjBlVRNTgSviSdOebZAv4aP3DJ+HhWqnv39ti7DFMt
 qbNn9j53czq5eLChkIiCfqbeg4I8ra+vqZeeDs5TaRFqMLZ5KlJGlVJjnf88/o8m
 cCjifqq53Bq1pZeaR1Q9aiwuzsSKazs4odEeFvwFSw0tsdjEj4Rk5vJqDcMnt460
 yIXzoUv3iGJ2oz7qi5uysGPOpLAbDqwbJ5+127qja3cbgHg3MJqwotheRuULYX8G
 Dz9hhgu5oGlvSO+fZXnsT2bQL+oqYUyql9EU5NBTF2gPf5M3E1vQaMNPj1cueNAg
 dfa3x3Ez/M1XuH2aptK3ePuPIQIZgMjpMC7BPBKaIvxtGcNxEIuL0s+TEQbjJ8R1
 /ybJv2i+LYfCrnjTDvH0Y4lTS40AHIcOwywQd6rbMuStK71+B7/bNYJkdKKK9PEF
 L39ZwXDj
 =Q4Rt
 -----END PGP SIGNATURE-----

Merge tag '6.6-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Six smb3 client fixes, including three for stable, from the SMB
  plugfest (testing event) this week:

   - Reparse point handling fix (found when investigating dir
     enumeration when fifo in dir)

   - Fix excessive thread creation for dir lease cleanup

   - UAF fix in negotiate path

   - remove duplicate error message mapping and fix confusing warning
     message

   - add dynamic trace point to improve debugging RDMA connection
     attempts"

* tag '6.6-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix confusing debug message
  smb: client: handle STATUS_IO_REPARSE_TAG_NOT_HANDLED
  smb3: remove duplicate error mapping
  cifs: Fix UAF in cifs_demultiplex_thread()
  smb3: do not start laundromat thread when dir leases  disabled
  smb3: Add dynamic trace points for RDMA (smbdirect) reconnect
2023-09-23 11:34:48 -07:00
Linus Torvalds
59c376d636 Fixes for 6.6-rc3:
* Return EIO on bad inputs to iomap_to_bh instead of BUGging, to deal
   less poorly with block device io racing with block device resizing.
 * Fix a stale page data exposure bug introduced in 6.6-rc1 when
   unsharing a file range that is not in the page cache.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZQsCFAAKCRBKO3ySh0YR
 pkhZAP9VHpjBn95MEai0dxAVjAi8IDcfwdzBuifBWlkQwnt6MAEAiHfHEDfN23o9
 4Xg9EDqa8IOSYwxphJYnYG73Luvi5QQ=
 =Xtnv
 -----END PGP SIGNATURE-----

Merge tag 'iomap-6.6-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap fixes from Darrick Wong:

 - Return EIO on bad inputs to iomap_to_bh instead of BUGging, to deal
   less poorly with block device io racing with block device resizing

 - Fix a stale page data exposure bug introduced in 6.6-rc1 when
   unsharing a file range that is not in the page cache

* tag 'iomap-6.6-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: convert iomap_unshare_iter to use large folios
  iomap: don't skip reading in !uptodate folios when unsharing a range
  iomap: handle error conditions more gracefully in iomap_to_bh
2023-09-23 09:56:40 -07:00
Linus Torvalds
3abc79dce6 Bug fixes for 6.6-rc3:
* Fix an integer overflow bug when processing an fsmap call.
 
  * Fix crash due to CPU hot remove event racing with filesystem mount
    operation.
 
  * During read-only mount, XFS does not allow the contents of the log to be
    recovered when there are one or more unrecognized rcompat features in the
    primary superblock, since the log might have intent items which the kernel
    does not know how to process.
 
  * During recovery of log intent items, XFS now reserves log space sufficient
    for one cycle of a permanent transaction to execute. Otherwise, this could
    lead to livelocks due to non-availability of log space.
 
  * On an fs which has an ondisk unlinked inode list, trying to delete a file
    or allocating an O_TMPFILE file can cause the fs to the shutdown if the
    first inode in the ondisk inode list is not present in the inode cache.
    The bug is solved by explicitly loading the first inode in the ondisk
    unlinked inode list into the inode cache if it is not already cached.
 
    A similar problem arises when the uncached inode is present in the middle
    of the ondisk unlinked inode list. This second bug is triggered when
    executing operations like quotacheck and bulkstat. In this case, XFS now
    reads in the entire ondisk unlinked inode list.
 
  * Enable LARP mode only on recent v5 filesystems.
 
  * Fix a out of bounds memory access in scrub.
 
  * Fix a performance bug when locating the tail of the log during mounting a
    filesystem.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZQkx4QAKCRAH7y4RirJu
 9HrTAQD6QhvHkS43vueGOb4WISZPG/jMKJ/FjvwLZrIZ0erbJwEAtRWhClwFv3NZ
 exJFtsmxrKC6Vifuo0pvfoCiK5mUvQ8=
 =SrJR
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Chandan Babu:

 - Fix an integer overflow bug when processing an fsmap call

 - Fix crash due to CPU hot remove event racing with filesystem mount
   operation

 - During read-only mount, XFS does not allow the contents of the log to
   be recovered when there are one or more unrecognized rcompat features
   in the primary superblock, since the log might have intent items
   which the kernel does not know how to process

 - During recovery of log intent items, XFS now reserves log space
   sufficient for one cycle of a permanent transaction to execute.
   Otherwise, this could lead to livelocks due to non-availability of
   log space

 - On an fs which has an ondisk unlinked inode list, trying to delete a
   file or allocating an O_TMPFILE file can cause the fs to the shutdown
   if the first inode in the ondisk inode list is not present in the
   inode cache. The bug is solved by explicitly loading the first inode
   in the ondisk unlinked inode list into the inode cache if it is not
   already cached

   A similar problem arises when the uncached inode is present in the
   middle of the ondisk unlinked inode list. This second bug is
   triggered when executing operations like quotacheck and bulkstat. In
   this case, XFS now reads in the entire ondisk unlinked inode list

 - Enable LARP mode only on recent v5 filesystems

 - Fix a out of bounds memory access in scrub

 - Fix a performance bug when locating the tail of the log during
   mounting a filesystem

* tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail
  xfs: only call xchk_stats_merge after validating scrub inputs
  xfs: require a relatively recent V5 filesystem for LARP mode
  xfs: make inode unlinked bucket recovery work with quotacheck
  xfs: load uncached unlinked inodes into memory on demand
  xfs: reserve less log space when recovering log intent items
  xfs: fix log recovery when unknown rocompat bits are set
  xfs: reload entire unlinked bucket lists
  xfs: allow inode inactivation during a ro mount log recovery
  xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
  xfs: remove CPU hotplug infrastructure
  xfs: remove the all-mounts list
  xfs: use per-mount cpumask to track nonempty percpu inodegc lists
  xfs: fix an agbno overflow in __xfs_getfsmap_datadev
  xfs: fix per-cpu CIL structure aggregation racing with dying cpus
  xfs: fix select in config XFS_ONLINE_SCRUB_STATS
2023-09-22 16:32:19 -07:00
Steven Rostedt (Google)
ef36b4f928 eventfs: Remember what dentries were created on dir open
Using the following code with libtracefs:

	int dfd;

	// create the directory events/kprobes/kp1
	tracefs_kprobe_raw(NULL, "kp1", "schedule_timeout", "time=$arg1");

	// Open the kprobes directory
	dfd = tracefs_instance_file_open(NULL, "events/kprobes", O_RDONLY);

	// Do a lookup of the kprobes/kp1 directory (by looking at enable)
	tracefs_file_exists(NULL, "events/kprobes/kp1/enable");

	// Now create a new entry in the kprobes directory
	tracefs_kprobe_raw(NULL, "kp2", "schedule_hrtimeout", "expires=$arg1");

	// Do another lookup to create the dentries
	tracefs_file_exists(NULL, "events/kprobes/kp2/enable"))

	// Close the directory
	close(dfd);

What happened above, the first open (dfd) will call
dcache_dir_open_wrapper() that will create the dentries and up their ref
counts.

Now the creation of "kp2" will add another dentry within the kprobes
directory.

Upon the close of dfd, eventfs_release() will now do a dput for all the
entries in kprobes. But this is where the problem lies. The open only
upped the dentry of kp1 and not kp2. Now the close is decrementing both
kp1 and kp2, which causes kp2 to get a negative count.

Doing a "trace-cmd reset" which deletes all the kprobes cause the kernel
to crash! (due to the messed up accounting of the ref counts).

To solve this, save all the dentries that are opened in the
dcache_dir_open_wrapper() into an array, and use this array to know what
dentries to do a dput on in eventfs_release().

Since the dcache_dir_open_wrapper() calls dcache_dir_open() which uses the
file->private_data, we need to also add a wrapper around dcache_readdir()
that uses the cursor assigned to the file->private_data. This is because
the dentries need to also be saved in the file->private_data. To do this
create the structure:

  struct dentry_list {
	void		*cursor;
	struct dentry	**dentries;
  };

Which will hold both the cursor and the dentries. Some shuffling around is
needed to make sure that dcache_dir_open() and dcache_readdir() only see
the cursor.

Link: https://lore.kernel.org/linux-trace-kernel/20230919211804.230edf1e@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d4fa@gandalf.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ajay Kaher <akaher@vmware.com>
Fixes: 6394044955 ("eventfs: Implement eventfs lookup, read, open functions")
Reported-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-22 16:58:11 -04:00
Linus Torvalds
b5cbe7c00a v6.6-rc3.vfs.ctime.revert
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZQsZLQAKCRCRxhvAZXjc
 op0vAP96hkSUnmXmxTr8GHId3yfElN8ZZ3aSfePeBdljjKEZVAEA2+cbHLy4GqRi
 TpjP1HNIdmtbVSC2ZnrgqkbwGageQgg=
 =s92y
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-rc3.vfs.ctime.revert' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull finegrained timestamp reverts from Christian Brauner:
 "Earlier this week we sent a few minor fixes for the multi-grained
  timestamp work in [1]. While we were polishing those up after Linus
  realized that there might be a nicer way to fix them we received a
  regression report in [2] that fine grained timestamps break gnulib
  tests and thus possibly other tools.

  The kernel will elide fine-grain timestamp updates when no one is
  actively querying for them to avoid performance impacts. So a sequence
  like write(f1) stat(f2) write(f2) stat(f2) write(f1) stat(f1) may
  result in timestamp f1 to be older than the final f2 timestamp even
  though f1 was last written too but the second write didn't update the
  timestamp.

  Such plotholes can lead to subtle bugs when programs compare
  timestamps. For example, the nap() function in [2] will estimate that
  it needs to wait one ns on a fine-grain timestamp enabled filesytem
  between subsequent calls to observe a timestamp change. But in general
  we don't update timestamps with more than one jiffie if we think that
  no one is actively querying for fine-grain timestamps to avoid
  performance impacts.

  While discussing various fixes the decision was to go back to the
  drawing board and ultimately to explore a solution that involves only
  exposing such fine-grained timestamps to nfs internally and never to
  userspace.

  As there are multiple solutions discussed the honest thing to do here
  is not to fix this up or disable it but to cleanly revert. The general
  infrastructure will probably come back but there is no reason to keep
  this code in mainline.

  The general changes to timestamp handling are valid and a good cleanup
  that will stay. The revert is fully bisectable"

Link: https://lore.kernel.org/all/20230918-hirte-neuzugang-4c2324e7bae3@brauner [1]
Link: https://lore.kernel.org/all/bf0524debb976627693e12ad23690094e4514303.camel@linuxfromscratch.org [2]

* tag 'v6.6-rc3.vfs.ctime.revert' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Revert "fs: add infrastructure for multigrain timestamps"
  Revert "btrfs: convert to multigrain timestamps"
  Revert "ext4: switch to multigrain timestamps"
  Revert "xfs: switch to multigrain timestamps"
  Revert "tmpfs: add support for multigrain timestamps"
2023-09-21 10:15:26 -07:00
Steve French
c8ebf077fb smb3: fix confusing debug message
The message said it was an invalid mode, when it was intentionally
not set.  Fix confusing message logged to dmesg.

Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-20 19:50:05 -05:00
Paulo Alcantara
7fb77d9c87 smb: client: handle STATUS_IO_REPARSE_TAG_NOT_HANDLED
Fix missing set of cifs_open_info_data::reparse_point when SMB2_CREATE
request fails with STATUS_IO_REPARSE_TAG_NOT_HANDLED.

Fixes: 5f71ebc412 ("smb: client: parse reparse point flag in create response")
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-20 16:12:09 -05:00
Steve French
6e2e27e47c smb3: remove duplicate error mapping
In status_to_posix_error STATUS_IO_REPARSE_TAG_NOT_HANDLED was mapped
to both -EOPNOTSUPP and also to -EIO but the later one (-EIO) is
ignored. Remove the duplicate.

Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-20 16:04:51 -05:00
Linus Torvalds
a229cf67ab for-6.6-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmULIZUACgkQxWXV+ddt
 WDv77Q//ZiKpmevPQmQfUtmV8WwMfD2a9zRlKBpGggwtrD4mf3CYRLnOpTm81MPO
 vFIuYacBn+9UXqp2j/IbvNWfQAPQNVDxSPXx66uba93RJc+bB1J3TydxcEyJ7fr4
 dwhLLk01jttfk0+rnjF34fmXiHSTtI6D2WeaLCzUbaPLw4SZ+ul+GAdeF3P174iO
 OMNBUln7hK00Q7j8kFf4j6SW1yIIKMTl6MfOFJYanIqzx51PYFFVtKwoCr0Vt53v
 ZHbgrK582ZJO6pKF9kJF/1tqrY9/Df8jzgSypK8pew/SukMOrf7iVwrmhietuhKA
 92j5sxKhCRyq6Qg6ZwC0jyk+oMqrT8r+q3r38a5qDJx/9Q279vkXBqQnACfLjmnH
 6+sNdkY5/uBWnDMh/+d6yBtfbdW5DtuET4McYpJt1Nk2St/f3UzPaL4LcNkDXNPk
 t1Q4W4v0KS1V8TbsLfdD629CMghxQNKVs1XqyCAbUq9ub4LE2CtL3lDm730qZoZt
 +LM7+sAxEOJC6yqYfdEbcIc8l27Hl5nZEzamcvMrRz61N85/8Jx4Sq2b6VSE9TCE
 hNEWAL5sOjhuhmUPhatYC+KO1P6NDP+Yg99yZCZIT9s/P1oK5H+aETshWX+lvJ+Q
 Ai+qzKvp2ERHFcE+R5qIXs/uX7azpzjqsRZxY2/zdp70ugQDSXE=
 =0eEg
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more followup fixes to the directory listing.

  People have noticed different behaviour compared to other filesystems
  after changes in 6.5. This is now unified to more "logical" and
  expected behaviour while still within POSIX. And a few more fixes for
  stable.

   - change behaviour of readdir()/rewinddir() when new directory
     entries are created after opendir(), properly tracking the last
     entry

   - fix race in readdir when multiple threads can set the last entry
     index for a directory

  Additionally:

   - use exclusive lock when direct io might need to drop privs and call
     notify_change()

   - don't clear uptodate bit on page after an error, this may lead to a
     deadlock in subpage mode

   - fix waiting pattern when multiple readers block on Merkle tree
     data, switch to folios"

* tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix race between reading a directory and adding entries to it
  btrfs: refresh dir last index during a rewinddir(3) call
  btrfs: set last dir index to the current last index when opening dir
  btrfs: don't clear uptodate on write errors
  btrfs: file_remove_privs needs an exclusive lock in direct io write
  btrfs: convert btrfs_read_merkle_tree_page() to use a folio
2023-09-20 11:03:45 -07:00
Christian Brauner
647aa76828
Revert "fs: add infrastructure for multigrain timestamps"
This reverts commit ffb6cf19e0.

Users reported regressions due to enabling multi-grained timestamps
unconditionally. As no clear consensus on a solution has come up and the
discussion has gone back to the drawing board revert the infrastructure
changes for. If it isn't code that's here to stay, make it go away.

Message-ID: <20230920-keine-eile-c9755b5825db@brauner>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20 18:05:31 +02:00
Christian Brauner
efd34f0316
Revert "btrfs: convert to multigrain timestamps"
This reverts commit 50e9ceef1d.

Users reported regressions due to enabling multi-grained timestamps
unconditionally. As no clear consensus on a solution has come up and the
discussion has gone back to the drawing board revert the infrastructure
changes for. If it isn't code that's here to stay, make it go away.

Message-ID: <20230920-keine-eile-c9755b5825db@brauner>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20 18:05:31 +02:00
Christian Brauner
50ec1d721e
Revert "ext4: switch to multigrain timestamps"
This reverts commit 0269b58586.

Users reported regressions due to enabling multi-grained timestamps
unconditionally. As no clear consensus on a solution has come up and the
discussion has gone back to the drawing board revert the infrastructure
changes for. If it isn't code that's here to stay, make it go away.

Message-ID: <20230920-keine-eile-c9755b5825db@brauner>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20 18:05:31 +02:00
Christian Brauner
f798accd59
Revert "xfs: switch to multigrain timestamps"
This reverts commit e44df26647.

Users reported regressions due to enabling multi-grained timestamps
unconditionally. As no clear consensus on a solution has come up and the
discussion has gone back to the drawing board revert the infrastructure
changes for. If it isn't code that's here to stay, make it go away.

Message-ID: <20230920-keine-eile-c9755b5825db@brauner>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20 18:05:31 +02:00
Zhang Xiaoxu
d527f51331 cifs: Fix UAF in cifs_demultiplex_thread()
There is a UAF when xfstests on cifs:

  BUG: KASAN: use-after-free in smb2_is_network_name_deleted+0x27/0x160
  Read of size 4 at addr ffff88810103fc08 by task cifsd/923

  CPU: 1 PID: 923 Comm: cifsd Not tainted 6.1.0-rc4+ #45
  ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   print_report+0x171/0x472
   kasan_report+0xad/0x130
   kasan_check_range+0x145/0x1a0
   smb2_is_network_name_deleted+0x27/0x160
   cifs_demultiplex_thread.cold+0x172/0x5a4
   kthread+0x165/0x1a0
   ret_from_fork+0x1f/0x30
   </TASK>

  Allocated by task 923:
   kasan_save_stack+0x1e/0x40
   kasan_set_track+0x21/0x30
   __kasan_slab_alloc+0x54/0x60
   kmem_cache_alloc+0x147/0x320
   mempool_alloc+0xe1/0x260
   cifs_small_buf_get+0x24/0x60
   allocate_buffers+0xa1/0x1c0
   cifs_demultiplex_thread+0x199/0x10d0
   kthread+0x165/0x1a0
   ret_from_fork+0x1f/0x30

  Freed by task 921:
   kasan_save_stack+0x1e/0x40
   kasan_set_track+0x21/0x30
   kasan_save_free_info+0x2a/0x40
   ____kasan_slab_free+0x143/0x1b0
   kmem_cache_free+0xe3/0x4d0
   cifs_small_buf_release+0x29/0x90
   SMB2_negotiate+0x8b7/0x1c60
   smb2_negotiate+0x51/0x70
   cifs_negotiate_protocol+0xf0/0x160
   cifs_get_smb_ses+0x5fa/0x13c0
   mount_get_conns+0x7a/0x750
   cifs_mount+0x103/0xd00
   cifs_smb3_do_mount+0x1dd/0xcb0
   smb3_get_tree+0x1d5/0x300
   vfs_get_tree+0x41/0xf0
   path_mount+0x9b3/0xdd0
   __x64_sys_mount+0x190/0x1d0
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

The UAF is because:

 mount(pid: 921)               | cifsd(pid: 923)
-------------------------------|-------------------------------
                               | cifs_demultiplex_thread
SMB2_negotiate                 |
 cifs_send_recv                |
  compound_send_recv           |
   smb_send_rqst               |
    wait_for_response          |
     wait_event_state      [1] |
                               |  standard_receive3
                               |   cifs_handle_standard
                               |    handle_mid
                               |     mid->resp_buf = buf;  [2]
                               |     dequeue_mid           [3]
     KILL the process      [4] |
    resp_iov[i].iov_base = buf |
 free_rsp_buf              [5] |
                               |   is_network_name_deleted [6]
                               |   callback

1. After send request to server, wait the response until
    mid->mid_state != SUBMITTED;
2. Receive response from server, and set it to mid;
3. Set the mid state to RECEIVED;
4. Kill the process, the mid state already RECEIVED, get 0;
5. Handle and release the negotiate response;
6. UAF.

It can be easily reproduce with add some delay in [3] - [6].

Only sync call has the problem since async call's callback is
executed in cifsd process.

Add an extra state to mark the mid state to READY before wakeup the
waitter, then it can get the resp safely.

Fixes: ec637e3ffb ("[CIFS] Avoid extra large buffer allocation (and memcpy) in cifs_readpages")
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-19 22:02:54 -05:00
Ben Wolsieffer
fe44198016 proc: nommu: fix empty /proc/<pid>/maps
On no-MMU, /proc/<pid>/maps reads as an empty file.  This happens because
find_vma(mm, 0) always returns NULL (assuming no vma actually contains the
zero address, which is normally the case).

To fix this bug and improve the maintainability in the future, this patch
makes the no-MMU implementation as similar as possible to the MMU
implementation.

The only remaining differences are the lack of hold/release_task_mempolicy
and the extra code to shoehorn the gate vma into the iterator.

This has been tested on top of 6.5.3 on an STM32F746.

Link: https://lkml.kernel.org/r/20230915160055.971059-2-ben.wolsieffer@hefring.com
Fixes: 0c563f1480 ("proc: remove VMA rbtree use from nommu")
Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Giulio Benetti <giulio.benetti@benettiengineering.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-19 13:21:34 -07:00
Ben Wolsieffer
578d7699e5 proc: nommu: /proc/<pid>/maps: release mmap read lock
The no-MMU implementation of /proc/<pid>/map doesn't normally release
the mmap read lock, because it uses !IS_ERR_OR_NULL(_vml) to determine
whether to release the lock.  Since _vml is NULL when the end of the
mappings is reached, the lock is not released.

Reading /proc/1/maps twice doesn't cause a hang because it only
takes the read lock, which can be taken multiple times and therefore
doesn't show any problem if the lock isn't released. Instead, you need
to perform some operation that attempts to take the write lock after
reading /proc/<pid>/maps. To actually reproduce the bug, compile the
following code as 'proc_maps_bug':

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char *argv[]) {
        void *buf;
        sleep(1);
        buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        puts("mmap returned");
        return 0;
}

Then, run:

  ./proc_maps_bug &; cat /proc/$!/maps; fg

Without this patch, mmap() will hang and the command will never
complete.
	
This code was incorrectly adapted from the MMU implementation, which at
the time released the lock in m_next() before returning the last entry.

The MMU implementation has diverged further from the no-MMU version since
then, so this patch brings their locking and error handling into sync,
fixing the bug and hopefully avoiding similar issues in the future.

Link: https://lkml.kernel.org/r/20230914163019.4050530-2-ben.wolsieffer@hefring.com
Fixes: 47fecca15c ("fs/proc/task_nommu.c: don't use priv->task->mm")
Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Giulio Benetti <giulio.benetti@benettiengineering.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-19 13:21:34 -07:00
Steve French
2da338ff75 smb3: do not start laundromat thread when dir leases
disabled

When no directory lease support, or for IPC shares where directories
can not be opened, do not start an unneeded laundromat thread for
that mount (it wastes resources).

Fixes: d14de8067e ("cifs: Add a laundromat thread for cached directories")
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-19 13:32:02 -05:00
Darrick J. Wong
a5f31a5028 iomap: convert iomap_unshare_iter to use large folios
Convert iomap_unshare_iter to create large folios if possible, since the
write and zeroing paths already do that.  I think this got missed in the
conversion of the write paths that landed in 6.6-rc1.

Cc: ritesh.list@gmail.com, willy@infradead.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
2023-09-19 09:05:35 -07:00
Steve French
e3603ccf4a smb3: Add dynamic trace points for RDMA (smbdirect) reconnect
smb3_smbd_connect_done and smb3_smbd_connect_err

To improve debugging of RDMA issues add those two. We already
had dynamic tracepoints for non-RDMA connect done and error cases.

Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-19 11:03:24 -05:00
Darrick J. Wong
35d30c9cf1 iomap: don't skip reading in !uptodate folios when unsharing a range
Prior to commit a01b8f2252, we would always read in the contents of a
!uptodate folio prior to writing userspace data into the folio,
allocated a folio state object, etc.  Ritesh introduced an optimization
that skips all of that if the write would cover the entire folio.

Unfortunately, the optimization misses the unshare case, where we always
have to read in the folio contents since there isn't a data buffer
supplied by userspace.  This can result in stale kernel memory exposure
if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared
file that isn't already cached.

This was caught by observing fstests regressions in the "unshare around"
mechanism that is used for unaligned writes to a reflinked realtime
volume when the realtime extent size is larger than 1FSB, though I think
it applies to any shared file.

Cc: ritesh.list@gmail.com, willy@infradead.org
Fixes: a01b8f2252 ("iomap: Allocate ifs in ->write_begin() early")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
2023-09-18 15:57:39 -07:00
Linus Torvalds
2cf0f71562 NFS Client Bugfixes for Linux 6.6
Bugfixes:
   * Various O_DIRECT related fixes from Trond
     * Error handling
     * Locking issues
     * Use the correct commit infor for joining page groups
     * Fixes for rescheduling IO
   * Sunrpc bad verifier fixes
     * Report EINVAL errors from connect()
     * Revalidate creds that the server has rejected
     * Revert "SUNRPC: Fail faster on bad verifier"
   * Fix pNFS session trunking when MDS=DS
   * Fix zero-value filehandles for post-open getattr operations
   * Fix compiler warning about tautological comparisons
     * Revert "SUNRPC: clean up integer overflow check" before Trond's fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmUIjesACgkQ18tUv7Cl
 QOv/MBAAxxZrW45qlTsEMvhnFXu1cvt08IdqivWURuSxK6oV31tARMqw00MxzhsF
 ubNX4hynDNmsWYJLryHTTGwUJzaNQOhAGjU46AhwFgS3Lj2wefeuR3IYuEDgQcHF
 8uXrrT/GWRIG4a/Uu1IxodfgMvgUPV+HGFc5BQFGkXGqPQdgbdxBDtOtyU1QlmaV
 rqKnxPgUibEZwVoz8+PNsN+pwNjKSaSR3pH53KKS6HUBAM69sT4jJiJDO/UGSlzb
 F9U1DHPeyltHXVAaQXUjHlsjh49NbIjzD5/T8xKWscvS7rbQFCcJkUl3MBHyvSEr
 eixx7muqOMCEm4lNFJqHIY7XrEN+50wi92tKj05Bz9wrnp4L84cxSV/KNZrKNA3T
 LCeIm4DGynPcLPopEUm9lqZMrabwvdV4wZSiybB6p/F/5ksVyaMnwTLzZZ1+NhkB
 OglNhF4L6m18/Ai/bY6XOixzgzWfcmYfXRhq7YoeO/JSvixG9Sk2crC6ryjWadgo
 xbitjjCYXHl3MUkH2TcaQoMGFhmvSShXg//5YXpxm3/0C3EVPa44T7/aFJqelL2p
 kkVsLcjcevAOwBHxehYLofL6c3GhLqRnwmTV7yqgJqfZf8uGxeQhXCMqOmv57/c0
 3iJJM4Llb14B+t3xl3T8MotT4WejzlHDnJKrscipfXJwRT7UDIM=
 =M6aI
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "Various O_DIRECT related fixes from Trond:
   - Error handling
   - Locking issues
   - Use the correct commit info for joining page groups
   - Fixes for rescheduling IO

  Sunrpc bad verifier fixes:
   - Report EINVAL errors from connect()
   - Revalidate creds that the server has rejected
   - Revert "SUNRPC: Fail faster on bad verifier"

  Misc:
   - Fix pNFS session trunking when MDS=DS
   - Fix zero-value filehandles for post-open getattr operations
   - Fix compiler warning about tautological comparisons
   - Revert 'SUNRPC: clean up integer overflow check' before Trond's fix"

* tag 'nfs-for-6.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Silence compiler complaints about tautological comparisons
  Revert "SUNRPC: clean up integer overflow check"
  NFSv4.1: fix zero value filehandle in post open getattr
  NFSv4.1: fix pnfs MDS=DS session trunking
  Revert "SUNRPC: Fail faster on bad verifier"
  SUNRPC: Mark the cred for revalidation if the server rejects it
  NFS/pNFS: Report EINVAL errors from connect() to the server
  NFS: More fixes for nfs_direct_write_reschedule_io()
  NFS: Use the correct commit info in nfs_join_page_group()
  NFS: More O_DIRECT accounting fixes for error paths
  NFS: Fix O_DIRECT locking issues
  NFS: Fix error handling for O_DIRECT write scheduling
2023-09-18 12:16:34 -07:00
Dave Wysochanski
df1c357f25 netfs: Only call folio_start_fscache() one time for each folio
If a network filesystem using netfs implements a clamp_length()
function, it can set subrequest lengths smaller than a page size.

When we loop through the folios in netfs_rreq_unlock_folios() to
set any folios to be written back, we need to make sure we only
call folio_start_fscache() once for each folio.

Otherwise, this simple testcase:

  mount -o fsc,rsize=1024,wsize=1024 127.0.0.1:/export /mnt/nfs
  dd if=/dev/zero of=/mnt/nfs/file.bin bs=4096 count=1
  1+0 records in
  1+0 records out
  4096 bytes (4.1 kB, 4.0 KiB) copied, 0.0126359 s, 324 kB/s
  echo 3 > /proc/sys/vm/drop_caches
  cat /mnt/nfs/file.bin > /dev/null

will trigger an oops similar to the following:

  page dumped because: VM_BUG_ON_FOLIO(folio_test_private_2(folio))
  ------------[ cut here ]------------
  kernel BUG at include/linux/netfs.h:44!
  ...
  CPU: 5 PID: 134 Comm: kworker/u16:5 Kdump: loaded Not tainted 6.4.0-rc5
  ...
  RIP: 0010:netfs_rreq_unlock_folios+0x68e/0x730 [netfs]
  ...
  Call Trace:
    netfs_rreq_assess+0x497/0x660 [netfs]
    netfs_subreq_terminated+0x32b/0x610 [netfs]
    nfs_netfs_read_completion+0x14e/0x1a0 [nfs]
    nfs_read_completion+0x2f9/0x330 [nfs]
    rpc_free_task+0x72/0xa0 [sunrpc]
    rpc_async_release+0x46/0x70 [sunrpc]
    process_one_work+0x3bd/0x710
    worker_thread+0x89/0x610
    kthread+0x181/0x1c0
    ret_from_fork+0x29/0x50

Fixes: 3d3c950467 ("netfs: Provide readahead and readpage netfs helpers"
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2210612
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230608214137.856006-1-dwysocha@redhat.com/ # v1
Link: https://lore.kernel.org/r/20230915185704.1082982-1-dwysocha@redhat.com/ # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-09-18 12:03:46 -07:00
Linus Torvalds
a49d273e57 gfs2 fixes
- Fix another freeze/thaw hang
 
 - Fix glock cache shrinking
 
 - Fix the quota=quiet mount option.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmUIXsEUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZToCqw/+NR9VEK014CCpa0sMn5VWUOr2GfdS
 hM/F2K8+0qqkGzqlfG6vZ1e24UCojQHyYcFT8D0rIep1ZMI0K+yh+5aI74YQEzOx
 qwjNbA7di/WLGMHUsGHGimJUsWoJGJYoFKcJL72EgHvVasskldC0qrg0T6EtaAyz
 A2kL11B85ZOOZqE8oqwn9DZeL+WEESGHkir396fexhPeM25KTGQwkEt3xjIsdMrb
 PQ5AfCzvqkbKxoNxO7ua0Dvq0IK5KjZV47oOCnMqtLPs9q7behOSohpOpbhO2xB5
 daHq9ZdShGbJ91Md1I1R9Jiu74uYD3+EMd9mt71cKai3dTYqq39mUuTy4pSl0i3o
 yWOofoqtcCWZolPJnOE/bLvUAOE/92myAwgYWo4ceg+MuKu5Amk5z0H6j2yph4P1
 DVkAe+1qKzon1Pf1ps/UvYkTBQPSFBFmds//lP0v75p19ke+gIzAl2VE07PxWPW3
 RKZxIlzBIqwGtlIW4oA1/9rkyiDLAmu9+kFDJMG7XL6a4sxrezM1zZncyoO0N/3o
 66oKs4ZirmD2kMQkhIsar/6SJim0w9nHmr0uYttwwXPWk8WtAOq0KUIA2kTIKC2u
 ywRh7Ge4SS7Wsv0VlT6bEuieRdm0eh4AHW1ODfMmglFN62Ob6lI+gy76wFvr9+Wj
 XDB0+uC3xpYlrQI=
 =jocw
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.6-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:

 - Fix another freeze/thaw hang

 - Fix glock cache shrinking

 - Fix the quota=quiet mount option

* tag 'gfs2-v6.6-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix quota=quiet oversight
  gfs2: fix glock shrinker ref issues
  gfs2: Fix another freeze/thaw hang
2023-09-18 11:59:38 -07:00
Bob Peterson
fb95d53608 gfs2: Fix quota=quiet oversight
Patch eef46ab713 introduced a new gfs2 quota=quiet mount option.
Checks for the new option were added to quota.c, but a check in
gfs2_quota_lock_check() was overlooked.  This patch adds the missing
check.

Fixes: eef46ab713 ("gfs2: Introduce new quota=quiet mount option")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-09-18 16:26:24 +02:00
Bob Peterson
62862485a4 gfs2: fix glock shrinker ref issues
Before this patch, function gfs2_scan_glock_lru would only try to free
glocks that had a reference count of 0. But if the reference count ever
got to 0, the glock should have already been freed.

Shrinker function gfs2_dispose_glock_lru checks whether glocks on the
LRU are demote_ok, and if so, tries to demote them. But that's only
possible if the reference count is at least 1.

This patch changes gfs2_scan_glock_lru so it will try to demote and/or
dispose of glocks that have a reference count of 1 and which are either
demotable, or are already unlocked.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-09-18 16:00:50 +02:00
Andreas Gruenbacher
52954b7509 gfs2: Fix another freeze/thaw hang
On a thawed filesystem, the freeze glock is held in shared mode.  In
order to initiate a cluster-wide freeze, the node initiating the freeze
drops the freeze glock and grabs it in exclusive mode.  The other nodes
recognize this as contention on the freeze glock; function
freeze_go_callback is invoked.  This indicates to them that they must
freeze the filesystem locally, drop the freeze glock, and then
re-acquire it in shared mode before being able to unfreeze the
filesystem locally.

While a node is trying to re-acquire the freeze glock in shared mode,
additional contention can occur.  In that case, the node must behave in
the same way as above.

Unfortunately, freeze_go_callback() contains a check that causes it to
bail out when the freeze glock isn't held in shared mode.  Fix that to
allow the glock to be unlocked or held in shared mode.

In addition, update a reference to trylock_super() which has been
renamed to super_trylock_shared() in the meantime.

Fixes: b77b4a4815 ("gfs2: Rework freeze / thaw logic")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-09-18 16:00:49 +02:00
Linus Torvalds
42aadec8c7 stat: remove no-longer-used helper macros
The choose_32_64() macros were added to deal with an odd inconsistency
between the 32-bit and 64-bit layout of 'struct stat' way back when in
commit a52dd971f9 ("vfs: de-crapify "cp_new_stat()" function").

Then a decade later Mikulas noticed that said inconsistency had been a
mistake in the early x86-64 port, and shouldn't have existed in the
first place.  So commit 932aba1e16 ("stat: fix inconsistency between
struct stat and struct compat_stat") removed the uses of the helpers.

But the helpers remained around, unused.

Get rid of them.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-09-17 10:46:12 -07:00
Linus Torvalds
45c3c62722 three small SMB3 client fixes, one to improve a null check and two minor cleanup
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUGFSgACgkQiiy9cAdy
 T1H8ZQwAuJhiLsTJK0lnWWxZC+KsIvTXlNKqx3VUqhJeYdxAc1tNCVjHTgdm63QA
 gRA0Htt8UhUoVVIMiipW2/PHA4rrNU7i0ULXSasAL6d8pPuZfeCzoehSfFo4u2ra
 bVDjfQUDtRakSU//Aj+Bv2sO77UWz0pQ5y0v2LCpPQ9Ks5TmLgxT+40uXCXf/LAe
 3aBbvrgLOlt0JMXaIEaQoecMitUqajmuuq/5SVQ7lz0xvn7cCLKgk22LehtwHR0W
 Ae8GdCkfFipdq+gp76CZPHO9evmRCsjmF95z56/++HdLrftYln5W/TDfjTlOZM9V
 tP99hK/2EjsWL7TMCOG59w21sKuaOdBA7AV7blgWxZAbKsrBgtMEXgQxSZMiK+Vm
 lKR5lGLWoujQLcnzWRh+WL7XP0ZxzitTlrlLeFxciPSGP843GRx+0oINLKL8CInr
 9mTwkzlzODNKA+83yRs5+Q3i0mq161IugsRrk1NHRUsr7oXiWWIxhcqCy5N5+R2S
 SfB16ql5
 =WtnH
 -----END PGP SIGNATURE-----

Merge tag '6.6-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Three small SMB3 client fixes, one to improve a null check and two
  minor cleanups"

* tag '6.6-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix some minor typos and repeated words
  smb3: correct places where ENOTSUPP is used instead of preferred EOPNOTSUPP
  smb3: move server check earlier when setting channel sequence number
2023-09-17 10:41:42 -07:00
Linus Torvalds
39e0c8afdc two ksmbd server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmUGEycACgkQiiy9cAdy
 T1FJQwv/XxgRr8cCkGvoxJW+P969Aw/PMy4BhhL1s26zm85wEudfUldV19VeJad0
 4Jcc7w608pCAhR9Xsl3MLB437ttOID9jOK9G10RvqwAJSaCqO6whpXRaJTGe4g4n
 N71/18R2uPW3HswBe5KSk0pC6WtYaGDTDpJ7hV1zAwiQWQywMA9FgzOspQCRaBEf
 JqH9F8tnzpfT/lDTe64Q3mRXC1ppO/XdJHhxzgRv8l41bc/0PWCPk2GuxpMQNhCo
 BediiKyIa/kUPEDID9k02VVxoW+aitbcw5kYUfMO55V6IkstuDbjq5k7k+r0BKfK
 AM8YE/LyRM5izwaV73tS2mSVZlEQLSlfwAuAY5uvcnanUIegFypCHEclnNmkS3Qx
 dXAonMWGD4+8N/aywNg5Zm5ql3HzLzS4uCIVJbyeOLqd1GljaYjvWsGkXvY9NnyT
 ED5ya4jTFZeEbONEdnPcgmOEZifs93VnklCsaGMFRJbv1gnKsBOt75EooeB7+44j
 TyRaeNNe
 =MOeW
 -----END PGP SIGNATURE-----

Merge tag '6.6-rc1-ksmbd' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Two ksmbd server fixes"

* tag '6.6-rc1-ksmbd' of git://git.samba.org/ksmbd:
  ksmbd: fix passing freed memory 'aux_payload_buf'
  ksmbd: remove unneeded mark_inode_dirty in set_info_sec()
2023-09-17 10:38:01 -07:00
Linus Torvalds
3fde3003ca Regression and bug fixes for ext4.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmUGh1YACgkQ8vlZVpUN
 gaN9lQgAqmMWu3xLwOERgVbK3CYT8WMcv0m9/by+vSwghCoPVDWWENgEgAzo4YpK
 Lsp4q62wHaWs6AzvJEaJ8ryedo7e4FUHxcvp2f6dCuOPadOEZZZTa4G5fAr0kYXS
 TIoaFtv6F2QVnGU6Y5lhtfYzmgLRdLL0B6MfSTYGO2MSREqxapvfxyGBQdkOuXfO
 UEtrUUEqQ2GdDcKp+FRRnaUvNaTPEESY8d5eVwrMmyUhQWUQL/N2BPbFkk1TP6RG
 MLDNsUZpdhZvLs6qLuR7dvO5wa2fshvRJIXlPINM0R0as5LmHqVL/ifCNkCn4W+k
 ZNvdSPhqew68KHHq3sYFtm9rbZ3YOA==
 =DopS
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Regression and bug fixes for ext4"

* tag 'ext4_for_linus-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix rec_len verify error
  ext4: do not let fstrim block system suspend
  ext4: move setting of trimmed bit into ext4_try_to_trim_range()
  jbd2: Fix memory leak in journal_init_common()
  jbd2: Remove page size assumptions
  buffer: Make bh_offset() work for compound pages
2023-09-17 10:33:53 -07:00
Linus Torvalds
d8d7cd6563 nfsd-6.6 fixes:
- Use correct order when encoding NFSv4 RENAME change_info
 - Fix a potential oops during NFSD shutdown
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmUEf90ACgkQM2qzM29m
 f5dcrhAAjz3SLoz4nAlo441JMATp/auwqTBKlWvNUWbK/ZgYjUcTXH8RfPbBQ/Yl
 5NfsThubmPTSr/bNvV+C963C4c/25d45SYwabAdLoZWKIpWf1mCDNhE1JkGQoM83
 KyZcrLnEknh/fN6jAxueFhWPCaavGDK8eI8H6Ls33GvX0cG4S7IYseGZ1x2dMWpE
 XFnV8oAWSD3cjKrTGwPYja2W0ShISGn4UMnj+lPWEpfCDydnQzKh87dhWzvlAxCf
 4Ckikmv3vcCfm15Ja0JZX3K1cvh0A3oISbgN/QNqGpzDzQ1qyugsx9uNltDa76N8
 NzGENnvO2/WURkw4DHJYIsiNt82uB04NYSAL6mLat3GU4DeAKOl3r+9W0jZPFS17
 7mOLh9dqg3ubSxOr6BpeY4I6+Uq2enUTwh1+VkvC7hQ906ZY8kzaj5MtLddvkjCw
 p4BTDzH280AwtnwiJ7q0WvBR5TxdGc3fqqsLJ+yMX2aGdhM1iMOjjENoSc3+v+Cy
 pEK1tju/lDzcMZrl5A08Za30L5boHp11n5SLia7tctKiFs4biL3WwNu527Y/Klq6
 04DtkcXdF/tW702388sKL1UnnXhw4KA8k7Us4HrQL+zXDZ+/UJ+AzY6s8ls2ytbN
 0NESe/ntsKjoijrRphAp+RgC8fhC7iT0GMJP8OHyHUqEOU2Wrbw=
 =ZTRy
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Use correct order when encoding NFSv4 RENAME change_info

 - Fix a potential oops during NFSD shutdown

* tag 'nfsd-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: fix possible oops when nfsd/pool_stats is closed.
  nfsd: fix change_info in NFSv4 RENAME replies
2023-09-15 16:48:44 -07:00
Linus Torvalds
e42bebf6db First set of EFI fixes for v6.6:
- Missing x86 patch for the runtime cleanup that was merged in -rc1
 - Kconfig tweak for kexec on x86 so EFI support does not get disabled
   inadvertently
 - Use the right EFI memory type for the unaccepted memory table so
   kexec/kdump exposes it to the crash kernel as well
 - Work around EFI implementations which do not implement
   QueryVariableInfo, which is now called by statfs() on efivarfs
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZQGDxwAKCRAwbglWLn0t
 XApcAP4+Fv6orQy4h/nmkDhWJa8vg36gWu1CDmy/abo0v3ODZQD/duk7Ejqw5vAm
 kvFmxLheNVs/RmgZtmB7CugAqibTkAE=
 =EdJu
 -----END PGP SIGNATURE-----

Merge tag 'efi-fixes-for-v6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fixes from Ard Biesheuvel:

 - Missing x86 patch for the runtime cleanup that was merged in -rc1

 - Kconfig tweak for kexec on x86 so EFI support does not get disabled
   inadvertently

 - Use the right EFI memory type for the unaccepted memory table so
   kexec/kdump exposes it to the crash kernel as well

 - Work around EFI implementations which do not implement
   QueryVariableInfo, which is now called by statfs() on efivarfs

* tag 'efi-fixes-for-v6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efivarfs: fix statfs() on efivarfs
  efi/unaccepted: Use ACPI reclaim memory for unaccepted memory table
  efi/x86: Ensure that EFI_RUNTIME_MAP is enabled for kexec
  efi/x86: Move EFI runtime call setup/teardown helpers out of line
2023-09-15 12:42:48 -07:00
Olga Kornievskaia
4506f23e11 NFSv4.1: fix zero value filehandle in post open getattr
Currently, if the OPEN compound experiencing an error and needs to
get the file attributes separately, it will send a stand alone
GETATTR but it would use the filehandle from the results of
the OPEN compound. In case of the CLAIM_FH OPEN, nfs_openres's fh
is zero value. That generate a GETATTR that's sent with a zero
value filehandle, and results in the server returning an error.

Instead, for the CLAIM_FH OPEN, take the filehandle that was used
in the PUTFH of the OPEN compound.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-15 10:50:47 -04:00
Steve French
2c75426c1f smb3: fix some minor typos and repeated words
Minor cleanup pointed out by checkpatch (repeated words, missing blank
lines) in smb2pdu.c and old header location referred to in transport.c

Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-15 01:37:33 -05:00
Steve French
ebc3d4e44a smb3: correct places where ENOTSUPP is used instead of preferred EOPNOTSUPP
checkpatch flagged a few places with:
     WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP
Also fixed minor typo

Signed-off-by: Steve French <stfrench@microsoft.com>
2023-09-15 01:10:40 -05:00
Filipe Manana
8e7f82deb0 btrfs: fix race between reading a directory and adding entries to it
When opening a directory (opendir(3)) or rewinding it (rewinddir(3)), we
are not holding the directory's inode locked, and this can result in later
attempting to add two entries to the directory with the same index number,
resulting in a transaction abort, with -EEXIST (-17), when inserting the
second delayed dir index. This results in a trace like the following:

  Sep 11 22:34:59 myhostname kernel: BTRFS error (device dm-3): err add delayed dir index item(name: cockroach-stderr.log) into the insertion tree of the delayed node(root id: 5, inode id: 4539217, errno: -17)
  Sep 11 22:34:59 myhostname kernel: ------------[ cut here ]------------
  Sep 11 22:34:59 myhostname kernel: kernel BUG at fs/btrfs/delayed-inode.c:1504!
  Sep 11 22:34:59 myhostname kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  Sep 11 22:34:59 myhostname kernel: CPU: 0 PID: 7159 Comm: cockroach Not tainted 6.4.15-200.fc38.x86_64 #1
  Sep 11 22:34:59 myhostname kernel: Hardware name: ASUS ESC500 G3/P9D WS, BIOS 2402 06/27/2018
  Sep 11 22:34:59 myhostname kernel: RIP: 0010:btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel: Code: eb dd 48 (...)
  Sep 11 22:34:59 myhostname kernel: RSP: 0000:ffffa9980e0fbb28 EFLAGS: 00010282
  Sep 11 22:34:59 myhostname kernel: RAX: 0000000000000000 RBX: ffff8b10b8f4a3c0 RCX: 0000000000000000
  Sep 11 22:34:59 myhostname kernel: RDX: 0000000000000000 RSI: ffff8b177ec21540 RDI: ffff8b177ec21540
  Sep 11 22:34:59 myhostname kernel: RBP: ffff8b110cf80888 R08: 0000000000000000 R09: ffffa9980e0fb938
  Sep 11 22:34:59 myhostname kernel: R10: 0000000000000003 R11: ffffffff86146508 R12: 0000000000000014
  Sep 11 22:34:59 myhostname kernel: R13: ffff8b1131ae5b40 R14: ffff8b10b8f4a418 R15: 00000000ffffffef
  Sep 11 22:34:59 myhostname kernel: FS:  00007fb14a7fe6c0(0000) GS:ffff8b177ec00000(0000) knlGS:0000000000000000
  Sep 11 22:34:59 myhostname kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Sep 11 22:34:59 myhostname kernel: CR2: 000000c00143d000 CR3: 00000001b3b4e002 CR4: 00000000001706f0
  Sep 11 22:34:59 myhostname kernel: Call Trace:
  Sep 11 22:34:59 myhostname kernel:  <TASK>
  Sep 11 22:34:59 myhostname kernel:  ? die+0x36/0x90
  Sep 11 22:34:59 myhostname kernel:  ? do_trap+0xda/0x100
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? do_error_trap+0x6a/0x90
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? exc_invalid_op+0x50/0x70
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? asm_exc_invalid_op+0x1a/0x20
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  ? btrfs_insert_delayed_dir_index+0x1da/0x260
  Sep 11 22:34:59 myhostname kernel:  btrfs_insert_dir_item+0x200/0x280
  Sep 11 22:34:59 myhostname kernel:  btrfs_add_link+0xab/0x4f0
  Sep 11 22:34:59 myhostname kernel:  ? ktime_get_real_ts64+0x47/0xe0
  Sep 11 22:34:59 myhostname kernel:  btrfs_create_new_inode+0x7cd/0xa80
  Sep 11 22:34:59 myhostname kernel:  btrfs_symlink+0x190/0x4d0
  Sep 11 22:34:59 myhostname kernel:  ? schedule+0x5e/0xd0
  Sep 11 22:34:59 myhostname kernel:  ? __d_lookup+0x7e/0xc0
  Sep 11 22:34:59 myhostname kernel:  vfs_symlink+0x148/0x1e0
  Sep 11 22:34:59 myhostname kernel:  do_symlinkat+0x130/0x140
  Sep 11 22:34:59 myhostname kernel:  __x64_sys_symlinkat+0x3d/0x50
  Sep 11 22:34:59 myhostname kernel:  do_syscall_64+0x5d/0x90
  Sep 11 22:34:59 myhostname kernel:  ? syscall_exit_to_user_mode+0x2b/0x40
  Sep 11 22:34:59 myhostname kernel:  ? do_syscall_64+0x6c/0x90
  Sep 11 22:34:59 myhostname kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc

The race leading to the problem happens like this:

1) Directory inode X is loaded into memory, its ->index_cnt field is
   initialized to (u64)-1 (at btrfs_alloc_inode());

2) Task A is adding a new file to directory X, holding its vfs inode lock,
   and calls btrfs_set_inode_index() to get an index number for the entry.

   Because the inode's index_cnt field is set to (u64)-1 it calls
   btrfs_inode_delayed_dir_index_count() which fails because no dir index
   entries were added yet to the delayed inode and then it calls
   btrfs_set_inode_index_count(). This functions finds the last dir index
   key and then sets index_cnt to that index value + 1. It found that the
   last index key has an offset of 100. However before it assigns a value
   of 101 to index_cnt...

3) Task B calls opendir(3), ending up at btrfs_opendir(), where the VFS
   lock for inode X is not taken, so it calls btrfs_get_dir_last_index()
   and sees index_cnt still with a value of (u64)-1. Because of that it
   calls btrfs_inode_delayed_dir_index_count() which fails since no dir
   index entries were added to the delayed inode yet, and then it also
   calls btrfs_set_inode_index_count(). This also finds that the last
   index key has an offset of 100, and before it assigns the value 101
   to the index_cnt field of inode X...

4) Task A assigns a value of 101 to index_cnt. And then the code flow
   goes to btrfs_set_inode_index() where it increments index_cnt from
   101 to 102. Task A then creates a delayed dir index entry with a
   sequence number of 101 and adds it to the delayed inode;

5) Task B assigns 101 to the index_cnt field of inode X;

6) At some later point when someone tries to add a new entry to the
   directory, btrfs_set_inode_index() will return 101 again and shortly
   after an attempt to add another delayed dir index key with index
   number 101 will fail with -EEXIST resulting in a transaction abort.

Fix this by locking the inode at btrfs_get_dir_last_index(), which is only
only used when opening a directory or attempting to lseek on it.

Reported-by: ken <ken@bllue.org>
Link: https://lore.kernel.org/linux-btrfs/CAE6xmH+Lp=Q=E61bU+v9eWX8gYfLvu6jLYxjxjFpo3zHVPR0EQ@mail.gmail.com/
Reported-by: syzbot+d13490c82ad5353c779d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/
Fixes: 9b378f6ad4 ("btrfs: fix infinite directory reads")
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14 23:24:42 +02:00
Filipe Manana
e60aa5da14 btrfs: refresh dir last index during a rewinddir(3) call
When opening a directory we find what's the index of its last entry and
then store it in the directory's file handle private data (struct
btrfs_file_private::last_index), so that in the case new directory entries
are added to a directory after an opendir(3) call we don't end up in an
infinite loop (see commit 9b378f6ad4 ("btrfs: fix infinite directory
reads")) when calling readdir(3).

However once rewinddir(3) is called, POSIX states [1] that any new
directory entries added after the previous opendir(3) call, must be
returned by subsequent calls to readdir(3):

  "The rewinddir() function shall reset the position of the directory
   stream to which dirp refers to the beginning of the directory.
   It shall also cause the directory stream to refer to the current
   state of the corresponding directory, as a call to opendir() would
   have done."

We currently don't refresh the last_index field of the struct
btrfs_file_private associated to the directory, so after a rewinddir(3)
we are not returning any new entries added after the opendir(3) call.

Fix this by finding the current last index of the directory when llseek
is called against the directory.

This can be reproduced by the following C program provided by Ian Johnson:

   #include <dirent.h>
   #include <stdio.h>

   int main(void) {
     DIR *dir = opendir("test");

     FILE *file;
     file = fopen("test/1", "w");
     fwrite("1", 1, 1, file);
     fclose(file);

     file = fopen("test/2", "w");
     fwrite("2", 1, 1, file);
     fclose(file);

     rewinddir(dir);

     struct dirent *entry;
     while ((entry = readdir(dir))) {
        printf("%s\n", entry->d_name);
     }
     closedir(dir);
     return 0;
   }

Reported-by: Ian Johnson <ian@ianjohnson.dev>
Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
Fixes: 9b378f6ad4 ("btrfs: fix infinite directory reads")
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14 23:24:42 +02:00
Filipe Manana
357950361c btrfs: set last dir index to the current last index when opening dir
When opening a directory for reading it, we set the last index where we
stop iteration to the value in struct btrfs_inode::index_cnt. That value
does not match the index of the most recently added directory entry but
it's instead the index number that will be assigned the next directory
entry.

This means that if after the call to opendir(3) new directory entries are
added, a readdir(3) call will return the first new directory entry. This
is fine because POSIX says the following [1]:

  "If a file is removed from or added to the directory after the most
   recent call to opendir() or rewinddir(), whether a subsequent call to
   readdir() returns an entry for that file is unspecified."

For example for the test script from commit 9b378f6ad4 ("btrfs: fix
infinite directory reads"), where we have 2000 files in a directory, ext4
doesn't return any new directory entry after opendir(3), while xfs returns
the first 13 new directory entries added after the opendir(3) call.

If we move to a shorter example with an empty directory when opendir(3) is
called, and 2 files added to the directory after the opendir(3) call, then
readdir(3) on btrfs will return the first file, ext4 and xfs return the 2
files (but in a different order). A test program for this, reported by
Ian Johnson, is the following:

   #include <dirent.h>
   #include <stdio.h>

   int main(void) {
     DIR *dir = opendir("test");

     FILE *file;
     file = fopen("test/1", "w");
     fwrite("1", 1, 1, file);
     fclose(file);

     file = fopen("test/2", "w");
     fwrite("2", 1, 1, file);
     fclose(file);

     struct dirent *entry;
     while ((entry = readdir(dir))) {
        printf("%s\n", entry->d_name);
     }
     closedir(dir);
     return 0;
   }

To make this less odd, change the behaviour to never return new entries
that were added after the opendir(3) call. This is done by setting the
last_index field of the struct btrfs_file_private attached to the
directory's file handle with a value matching btrfs_inode::index_cnt
minus 1, since that value always matches the index of the next new
directory entry and not the index of the most recently added entry.

[1] https://pubs.opengroup.org/onlinepubs/007904875/functions/readdir_r.html

Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/
CC: stable@vger.kernel.org # 6.5+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14 23:24:42 +02:00
Shida Zhang
7fda67e8c3 ext4: fix rec_len verify error
With the configuration PAGE_SIZE 64k and filesystem blocksize 64k,
a problem occurred when more than 13 million files were directly created
under a directory:

EXT4-fs error (device xx): ext4_dx_csum_set:492: inode #xxxx: comm xxxxx: dir seems corrupt?  Run e2fsck -D.
EXT4-fs error (device xx): ext4_dx_csum_verify:463: inode #xxxx: comm xxxxx: dir seems corrupt?  Run e2fsck -D.
EXT4-fs error (device xx): dx_probe:856: inode #xxxx: block 8188: comm xxxxx: Directory index failed checksum

When enough files are created, the fake_dirent->reclen will be 0xffff.
it doesn't equal to the blocksize 65536, i.e. 0x10000.

But it is not the same condition when blocksize equals to 4k.
when enough files are created, the fake_dirent->reclen will be 0x1000.
it equals to the blocksize 4k, i.e. 0x1000.

The problem seems to be related to the limitation of the 16-bit field
when the blocksize is set to 64k.
To address this, helpers like ext4_rec_len_{from,to}_disk has already
been introduced to complete the conversion between the encoded and the
plain form of rec_len.

So fix this one by using the helper, and all the other in this file too.

Cc: stable@kernel.org
Fixes: dbe8944404 ("ext4: Calculate and verify checksums for htree nodes")
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20230803060938.1929759-1-zhangshida@kylinos.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-09-14 12:07:07 -04:00
Jan Kara
5229a658f6 ext4: do not let fstrim block system suspend
Len Brown has reported that system suspend sometimes fail due to
inability to freeze a task working in ext4_trim_fs() for one minute.
Trimming a large filesystem on a disk that slowly processes discard
requests can indeed take a long time. Since discard is just an advisory
call, it is perfectly fine to interrupt it at any time and the return
number of discarded blocks until that moment. Do that when we detect the
task is being frozen.

Cc: stable@kernel.org
Reported-by: Len Brown <lenb@kernel.org>
Suggested-by: Dave Chinner <david@fromorbit.com>
References: https://bugzilla.kernel.org/show_bug.cgi?id=216322
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230913150504.9054-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-09-14 12:06:59 -04:00
Jan Kara
45e4ab320c ext4: move setting of trimmed bit into ext4_try_to_trim_range()
Currently we set the group's trimmed bit in ext4_trim_all_free() based
on return value of ext4_try_to_trim_range(). However when we will want
to abort trimming because of suspend attempt, we want to return success
from ext4_try_to_trim_range() but not set the trimmed bit. Instead
implementing awkward propagation of this information, just move setting
of trimmed bit into ext4_try_to_trim_range() when the whole group is
trimmed.

Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230913150504.9054-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-09-14 12:06:47 -04:00
Li Zetao
1bb0763f1e jbd2: Fix memory leak in journal_init_common()
There is a memory leak reported by kmemleak:

  unreferenced object 0xff11000105903b80 (size 64):
    comm "mount", pid 3382, jiffies 4295032021 (age 27.826s)
    hex dump (first 32 bytes):
      04 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
      ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      [<ffffffffae86ac40>] __kmalloc_node+0x50/0x160
      [<ffffffffaf2486d8>] crypto_alloc_tfmmem.isra.0+0x38/0x110
      [<ffffffffaf2498e5>] crypto_create_tfm_node+0x85/0x2f0
      [<ffffffffaf24a92c>] crypto_alloc_tfm_node+0xfc/0x210
      [<ffffffffaedde777>] journal_init_common+0x727/0x1ad0
      [<ffffffffaede1715>] jbd2_journal_init_inode+0x2b5/0x500
      [<ffffffffaed786b5>] ext4_load_and_init_journal+0x255/0x2440
      [<ffffffffaed8b423>] ext4_fill_super+0x8823/0xa330
      ...

The root cause was traced to an error handing path in journal_init_common()
when malloc memory failed in register_shrinker(). The checksum driver is
used to reference to checksum algorithm via cryptoapi and the user should
release the memory when the driver is no longer needed or the journal
initialization failed.

Fix it by calling crypto_free_shash() on the "err_cleanup" error handing
path in journal_init_common().

Fixes: c30713084b ("jbd2: move load_superblock() into journal_init_common()")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230911025138.983101-1-lizetao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-09-14 12:06:22 -04:00
Linus Torvalds
99214f6778 Tracing fixes for 6.6:
- Add missing LOCKDOWN checks for eventfs callers
   When LOCKDOWN is active for tracing, it causes inconsistent state
   when some functions succeed and others fail.
 
 - Use dput() to free the top level eventfs descriptor
   There was a race between accesses and freeing it.
 
 - Fix a long standing bug that eventfs exposed due to changing timings
   by dynamically creating files. That is, If a event file is opened
   for an instance, there's nothing preventing the instance from being
   removed which will make accessing the files cause use-after-free bugs.
 
 - Fix a ring buffer race that happens when iterating over the ring
   buffer while writers are active. Check to make sure not to read
   the event meta data if it's beyond the end of the ring buffer sub buffer.
 
 - Fix the print trigger that disappeared because the test to create it
   was looking for the event dir field being filled, but now it has the
   "ef" field filled for the eventfs structure.
 
 - Remove the unused "dir" field from the event structure.
 
 - Fix the order of the trace_dynamic_info as it had it backwards for the
   offset and len fields for which one was for which endianess.
 
 - Fix NULL pointer dereference with eventfs_remove_rec()
   If an allocation fails in one of the eventfs_add_*() functions,
   the caller of it in event_subsystem_dir() or event_create_dir()
   assigns the result to the structure. But it's assigning the ERR_PTR
   and not NULL. This was passed to eventfs_remove_rec() which expects
   either a good pointer or a NULL, not ERR_PTR. The fix is to not
   assign the ERR_PTR to the structure, but to keep it NULL on error.
 
 - Fix list_for_each_rcu() to use list_for_each_srcu() in
   dcache_dir_open_wrapper(). One iteration of the code used RCU
   but because it had to call sleepable code, it had to be changed
   to use SRCU, but one of the iterations was missed.
 
 - Fix synthetic event print function to use "as_u64" instead of
   passing in a pointer to the union. To fix big/little endian issues,
   the u64 that represented several types was turned into a union to
   define the types properly.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZQCvoBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtgrAP9MiYiCMU+90oJ+61DFchbs3y7BNidP
 s3lLRDUMJ935NQD/SSAm54PqWb+YXMpD7m9+3781l6xqwfabBMXNaEl+FwA=
 =tlZu
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Add missing LOCKDOWN checks for eventfs callers

   When LOCKDOWN is active for tracing, it causes inconsistent state
   when some functions succeed and others fail.

 - Use dput() to free the top level eventfs descriptor

   There was a race between accesses and freeing it.

 - Fix a long standing bug that eventfs exposed due to changing timings
   by dynamically creating files. That is, If a event file is opened for
   an instance, there's nothing preventing the instance from being
   removed which will make accessing the files cause use-after-free
   bugs.

 - Fix a ring buffer race that happens when iterating over the ring
   buffer while writers are active. Check to make sure not to read the
   event meta data if it's beyond the end of the ring buffer sub buffer.

 - Fix the print trigger that disappeared because the test to create it
   was looking for the event dir field being filled, but now it has the
   "ef" field filled for the eventfs structure.

 - Remove the unused "dir" field from the event structure.

 - Fix the order of the trace_dynamic_info as it had it backwards for
   the offset and len fields for which one was for which endianess.

 - Fix NULL pointer dereference with eventfs_remove_rec()

   If an allocation fails in one of the eventfs_add_*() functions, the
   caller of it in event_subsystem_dir() or event_create_dir() assigns
   the result to the structure. But it's assigning the ERR_PTR and not
   NULL. This was passed to eventfs_remove_rec() which expects either a
   good pointer or a NULL, not ERR_PTR. The fix is to not assign the
   ERR_PTR to the structure, but to keep it NULL on error.

 - Fix list_for_each_rcu() to use list_for_each_srcu() in
   dcache_dir_open_wrapper(). One iteration of the code used RCU but
   because it had to call sleepable code, it had to be changed to use
   SRCU, but one of the iterations was missed.

 - Fix synthetic event print function to use "as_u64" instead of passing
   in a pointer to the union. To fix big/little endian issues, the u64
   that represented several types was turned into a union to define the
   types properly.

* tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec()
  tracefs/eventfs: Use list_for_each_srcu() in dcache_dir_open_wrapper()
  tracing/synthetic: Print out u64 values properly
  tracing/synthetic: Fix order of struct trace_dynamic_info
  selftests/ftrace: Fix dependencies for some of the synthetic event tests
  tracing: Remove unused trace_event_file dir field
  tracing: Use the new eventfs descriptor for print trigger
  ring-buffer: Do not attempt to read past "commit"
  tracefs/eventfs: Free top level files on removal
  ring-buffer: Avoid softlockup in ring_buffer_resize()
  tracing: Have event inject files inc the trace array ref count
  tracing: Have option files inc the trace array ref count
  tracing: Have current_trace inc the trace array ref count
  tracing: Have tracing_max_latency inc the trace array ref count
  tracing: Increase trace array ref count on enable and filter files
  tracefs/eventfs: Use dput to free the toplevel events directory
  tracefs/eventfs: Add missing lockdown checks
  tracefs: Add missing lockdown check to tracefs_create_dir()
2023-09-13 11:30:11 -07:00
Josef Bacik
b595d25996 btrfs: don't clear uptodate on write errors
We have been consistently seeing hangs with generic/648 in our subpage
GitHub CI setup.  This is a classic deadlock, we are calling
btrfs_read_folio() on a folio, which requires holding the folio lock on
the folio, and then finding a ordered extent that overlaps that range
and calling btrfs_start_ordered_extent(), which then tries to write out
the dirty page, which requires taking the folio lock and then we
deadlock.

The hang happens because we're writing to range [1271750656, 1271767040),
page index [77621, 77622], and page 77621 is !Uptodate.  It is also Dirty,
so we call btrfs_read_folio() for 77621 and which does
btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered
extent which is [1271644160, 1271746560), page index [77615, 77621].
The page indexes overlap, but the actual bytes don't overlap.  We're
holding the page lock for 77621, then call
btrfs_lock_and_flush_ordered_range() which tries to flush the dirty
page, and tries to lock 77621 again and then we deadlock.

The byte ranges do not overlap, but with subpage support if we clear
uptodate on any portion of the page we mark the entire thing as not
uptodate.

We have been clearing page uptodate on write errors, but no other file
system does this, and is in fact incorrect.  This doesn't hurt us in the
!subpage case because we can't end up with overlapped ranges that don't
also overlap on the page.

Fix this by not clearing uptodate when we have a write error.  The only
thing we should be doing in this case is setting the mapping error and
carrying on.  This makes it so we would no longer call
btrfs_read_folio() on the page as it's uptodate and eliminates the
deadlock.

With this patch we're now able to make it through a full fstests run on
our subpage blocksize VMs.

Note for stable backports: this probably goes beyond 6.1 but the code
has been cleaned up and clearing the uptodate bit must be verified on
each version independently.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-13 18:41:07 +02:00
Bernd Schubert
9af86694fd btrfs: file_remove_privs needs an exclusive lock in direct io write
This was noticed by Miklos that file_remove_privs might call into
notify_change(), which requires to hold an exclusive lock. The problem
exists in FUSE and btrfs. We can fix it without any additional helpers
from VFS, in case the privileges would need to be dropped, change the
lock type to be exclusive and redo the loop.

Fixes: e9adabb971 ("btrfs: use shared lock for direct writes within EOF")
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-13 18:41:03 +02:00
Matthew Wilcox (Oracle)
06ed09351b btrfs: convert btrfs_read_merkle_tree_page() to use a folio
Remove a number of hidden calls to compound_head() by using a folio
throughout.  Also follow core kernel coding style by adding the folio to
the page cache immediately after allocation instead of doing the read
first, then adding it to the page cache.  This ordering makes subsequent
readers block waiting for the first reader instead of duplicating the
work only to throw it away when they find out they lost the race.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-13 18:40:54 +02:00
Olga Kornievskaia
806a3bc421 NFSv4.1: fix pnfs MDS=DS session trunking
Currently, when GETDEVICEINFO returns multiple locations where each
is a different IP but the server's identity is same as MDS, then
nfs4_set_ds_client() finds the existing nfs_client structure which
has the MDS's max_connect value (and if it's 1), then the 1st IP
on the DS's list will get dropped due to MDS trunking rules. Other
IPs would be added as they fall under the pnfs trunking rules.

For the list of IPs the 1st goes thru calling nfs4_set_ds_client()
which will eventually call nfs4_add_trunk() and call into
rpc_clnt_test_and_add_xprt() which has the check for MDS trunking.
The other IPs (after the 1st one), would call rpc_clnt_add_xprt()
which doesn't go thru that check.

nfs4_add_trunk() is called when MDS trunking is happening and it
needs to enforce the usage of max_connect mount option of the
1st mount. However, this shouldn't be applied to pnfs flow.

Instead, this patch proposed to treat MDS=DS as DS trunking and
make sure that MDS's max_connect limit does not apply to the
1st IP returned in the GETDEVICEINFO list. It does so by
marking the newly created client with a new flag NFS_CS_PNFS
which then used to pass max_connect value to use into the
rpc_clnt_test_and_add_xprt() instead of the existing rpc
client's max_connect value set by the MDS connection.

For example, mount was done without max_connect value set
so MDS's rpc client has cl_max_connect=1. Upon calling into
rpc_clnt_test_and_add_xprt() and using rpc client's value,
the caller passes in max_connect value which is previously
been set in the pnfs path (as a part of handling
GETDEVICEINFO list of IPs) in nfs4_set_ds_client().

However, when NFS_CS_PNFS flag is not set and we know we
are doing MDS trunking, comparing a new IP of the same
server, we then set the max_connect value to the
existing MDS's value and pass that into
rpc_clnt_test_and_add_xprt().

Fixes: dc48e0abee ("SUNRPC enforce creation of no more than max_connect xprts")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:11 -04:00
Trond Myklebust
dd7d7ee3ba NFS/pNFS: Report EINVAL errors from connect() to the server
With IPv6, connect() can occasionally return EINVAL if a route is
unavailable. If this happens during I/O to a data server, we want to
report it using LAYOUTERROR as an inability to connect.

Fixes: dd52128afd ("NFSv4.1/pnfs Ensure flexfiles reports all connection related errors")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-09-13 11:51:11 -04:00