Commit graph

3874 commits

Author SHA1 Message Date
Zhang Qilong
049ea86cb5 f2fs: add static init_idisk_time function to reduce the code
We can use a inner function to init the disk time
of f2fs_inode_info for cleaning redundant code.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Yonggil Song
d382e36970 f2fs: fix typo
Fix typo in f2fs.h
Detected by Jaeyoon Choi

Signed-off-by: Yonggil Song <yonggil.song@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Shuqi Zhang
9b7eadd9bd f2fs: fix wrong dirty page count when race between mmap and fallocate.
This is a BUG_ON issue as follows when running xfstest-generic-503:
WARNING: CPU: 21 PID: 1385 at fs/f2fs/inode.c:762 f2fs_evict_inode+0x847/0xaa0
Modules linked in:
CPU: 21 PID: 1385 Comm: umount Not tainted 5.19.0-rc5+ #73
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014

Call Trace:
evict+0x129/0x2d0
dispose_list+0x4f/0xb0
evict_inodes+0x204/0x230
generic_shutdown_super+0x5b/0x1e0
kill_block_super+0x29/0x80
kill_f2fs_super+0xe6/0x140
deactivate_locked_super+0x44/0xc0
deactivate_super+0x79/0x90
cleanup_mnt+0x114/0x1a0
__cleanup_mnt+0x16/0x20
task_work_run+0x98/0x100
exit_to_user_mode_prepare+0x3d0/0x3e0
syscall_exit_to_user_mode+0x12/0x30
do_syscall_64+0x42/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0

Function flow analysis when BUG occurs:
f2fs_fallocate                    mmap
                                  do_page_fault
                                    pte_spinlock  // ---lock_pte
                                    do_wp_page
                                      wp_page_shared
                                        pte_unmap_unlock   // unlock_pte
                                          do_page_mkwrite
                                          f2fs_vm_page_mkwrite
                                            down_read(invalidate_lock)
                                            lock_page
                                            if (PageMappedToDisk(page))
                                              goto out;
                                            // set_page_dirty  --NOT RUN
                                            out: up_read(invalidate_lock);
                                        finish_mkwrite_fault // unlock_pte
f2fs_collapse_range
  down_write(i_mmap_sem)
  truncate_pagecache
    unmap_mapping_pages
      i_mmap_lock_write // down_write(i_mmap_rwsem)
        ......
        zap_pte_range
          pte_offset_map_lock // ---lock_pte
           set_page_dirty
            f2fs_dirty_data_folio
              if (!folio_test_dirty(folio)) {
                                        fault_dirty_shared_page
                                          set_page_dirty
                                            f2fs_dirty_data_folio
                                              if (!folio_test_dirty(folio)) {
                                                filemap_dirty_folio
                                                f2fs_update_dirty_folio // ++
                                              }
                                            unlock_page
                filemap_dirty_folio
                f2fs_update_dirty_folio // page count++
              }
          pte_unmap_unlock  // --unlock_pte
      i_mmap_unlock_write  // up_write(i_mmap_rwsem)
  truncate_inode_pages
  up_write(i_mmap_sem)

When race happens between mmap-do_page_fault-wp_page_shared and
fallocate-truncate_pagecache-zap_pte_range, the zap_pte_range calls
function set_page_dirty without page lock. Besides, though
truncate_pagecache has immap and pte lock, wp_page_shared calls
fault_dirty_shared_page without any. In this case, two threads race
in f2fs_dirty_data_folio function. Page is set to dirty only ONCE,
but the count is added TWICE by calling filemap_dirty_folio.
Thus the count of dirty page cannot accord with the real dirty pages.

Following is the solution to in case of race happens without any lock.
Since folio_test_set_dirty in filemap_dirty_folio is atomic, judge return
value will not be at risk of race.

Signed-off-by: Shuqi Zhang <zhangshuqi3@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Zhang Qilong
173cdf2c32 f2fs: use COMPRESS_MAPPING to get compress cache mapping
Just use the defined COMPRESS_MAPPING to get compress cache
mapping instaed of direct accessing name.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Zhang Qilong
280dfeae56 f2fs: return the tmp_ptr directly in __bitmap_ptr
Just return tmp_ptr here, it's no need to dereference
checkpoint pointer again.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-10-04 13:31:42 -07:00
Linus Torvalds
725737e7c2 STATX_DIOALIGN for 6.1
Make statx() support reporting direct I/O (DIO) alignment information.
 This provides a generic interface for userspace programs to determine
 whether a file supports DIO, and if so with what alignment restrictions.
 Specifically, STATX_DIOALIGN works on block devices, and on regular
 files when their containing filesystem has implemented support.
 
 An interface like this has been requested for years, since the
 conditions for when DIO is supported in Linux have gotten increasingly
 complex over time.  Today, DIO support and alignment requirements can be
 affected by various filesystem features such as multi-device support,
 data journalling, inline data, encryption, verity, compression,
 checkpoint disabling, log-structured mode, etc.  Further complicating
 things, Linux v6.0 relaxed the traditional rule of DIO needing to be
 aligned to the block device's logical block size; now user buffers (but
 not file offsets) only need to be aligned to the DMA alignment.
 
 The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was
 discarded in favor of creating a clean new interface with statx().
 
 For more information, see the individual commits and the man page update
 https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYzpV2xQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKwF1AQDetPX5hyuq0/mwikOywLTTJsoHgGY5
 euO+dISqjH/InwD9HAQqfPRkdM1j4ml82BjjkAfrhzZXOOWPKJm0zOhMIQg=
 =0Oav
 -----END PGP SIGNATURE-----

Merge tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull STATX_DIOALIGN support from Eric Biggers:
 "Make statx() support reporting direct I/O (DIO) alignment information.

  This provides a generic interface for userspace programs to determine
  whether a file supports DIO, and if so with what alignment
  restrictions. Specifically, STATX_DIOALIGN works on block devices, and
  on regular files when their containing filesystem has implemented
  support.

  An interface like this has been requested for years, since the
  conditions for when DIO is supported in Linux have gotten increasingly
  complex over time. Today, DIO support and alignment requirements can
  be affected by various filesystem features such as multi-device
  support, data journalling, inline data, encryption, verity,
  compression, checkpoint disabling, log-structured mode, etc.

  Further complicating things, Linux v6.0 relaxed the traditional rule
  of DIO needing to be aligned to the block device's logical block size;
  now user buffers (but not file offsets) only need to be aligned to the
  DMA alignment.

  The approach of uplifting the XFS specific ioctl XFS_IOC_DIOINFO was
  discarded in favor of creating a clean new interface with statx().

  For more information, see the individual commits and the man page
  update[1]"

Link: https://lore.kernel.org/r/20220722074229.148925-1-ebiggers@kernel.org [1]

* tag 'statx-dioalign-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  xfs: support STATX_DIOALIGN
  f2fs: support STATX_DIOALIGN
  f2fs: simplify f2fs_force_buffered_io()
  f2fs: move f2fs_force_buffered_io() into file.c
  ext4: support STATX_DIOALIGN
  fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN
  vfs: support STATX_DIOALIGN on block devices
  statx: add direct I/O alignment information
2022-10-03 20:33:41 -07:00
Miklos Szeredi
863f144f12 vfs: open inside ->tmpfile()
This is in preparation for adding tmpfile support to fuse, which requires
that the tmpfile creation and opening are done as a single operation.

Replace the 'struct dentry *' argument of i_op->tmpfile with
'struct file *'.

Call finish_open_simple() as the last thing in ->tmpfile() instances (may
be omitted in the error case).

Change d_tmpfile() argument to 'struct file *' as well to make callers more
readable.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-09-24 07:00:00 +02:00
Christoph Hellwig
0e91fc1e0f fscrypt: work on block_devices instead of request_queues
request_queues are a block layer implementation detail that should not
leak into file systems.  Change the fscrypt inline crypto code to
retrieve block devices instead of request_queues from the file system.
As part of that, clean up the interaction with multi-device file systems
by returning both the number of devices and the actual device array in a
single method call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[ebiggers: bug fixes and minor tweaks]
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220901193208.138056-4-ebiggers@kernel.org
2022-09-21 20:33:06 -07:00
Zhang Qilong
8140654e78 f2fs: simplify code in f2fs_prepare_decomp_mem
It could return directly after init_decompress_ctx.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-09-12 23:08:25 -07:00
Zhang Qilong
ddd3b16c8c f2fs: replace logical value "true" with a int number
The "true" is not match the parametera type "int", and
we modify it.

Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-09-12 23:08:25 -07:00
Jaegeuk Kim
da35fe96d1 f2fs: increase the limit for reserve_root
This patch increases the threshold that limits the reserved root space from 0.2%
to 12.5% by using simple shift operation.

Typically Android sets 128MB, but if the storage capacity is 32GB, 0.2% which is
around 64MB becomes too small. Let's relax it.

Cc: stable@vger.kernel.org
Reported-by: Aran Dalton <arda@allwinnertech.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-09-12 23:08:21 -07:00
Jaegeuk Kim
4f99484d27 f2fs: complete checkpoints during remount
Otherwise, pending checkpoints can contribute a race condition to give a
quota warning.

- Thread                      - checkpoint thread
                              add checkpoints to the list
do_remount()
 down_write(&sb->s_umount);
 f2fs_remount()
                              block_operations()
                               down_read_trylock(&sb->s_umount) = 0
 up_write(&sb->s_umount);
                               f2fs_quota_sync()
                                dquot_writeback_dquots()
                                 WARN_ON_ONCE(!rwsem_is_locked(&sb->s_umount));

Or,

do_remount()
 down_write(&sb->s_umount);
 f2fs_remount()
                              create a ckpt thread
                              f2fs_enable_checkpoint() adds checkpoints
			      wait for f2fs_sync_fs()
                              trigger another pending checkpoint
                               block_operations()
                                down_read_trylock(&sb->s_umount) = 0
 up_write(&sb->s_umount);
                                f2fs_quota_sync()
                                 dquot_writeback_dquots()
                                  WARN_ON_ONCE(!rwsem_is_locked(&sb->s_umount));

Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-09-12 23:07:20 -07:00
Jaegeuk Kim
c7b5857637 f2fs: flush pending checkpoints when freezing super
This avoids -EINVAL when trying to freeze f2fs.

Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-09-12 23:07:12 -07:00
Eric Biggers
c8c02272a9 f2fs: support STATX_DIOALIGN
Add support for STATX_DIOALIGN to f2fs, so that direct I/O alignment
restrictions are exposed to userspace in a generic way.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20220827065851.135710-8-ebiggers@kernel.org
2022-09-11 19:47:12 -05:00
Eric Biggers
bd36732931 f2fs: simplify f2fs_force_buffered_io()
f2fs only allows direct I/O that is aligned to the filesystem block
size.  Given that fact, simplify f2fs_force_buffered_io() by removing
the redundant call to block_unaligned_IO().

This makes it easier to reuse this code for STATX_DIOALIGN.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20220827065851.135710-7-ebiggers@kernel.org
2022-09-11 19:47:12 -05:00
Eric Biggers
2db0487faa f2fs: move f2fs_force_buffered_io() into file.c
f2fs_force_buffered_io() is only used in file.c, so move it into there.
No behavior change.  This makes it easier to review later patches.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20220827065851.135710-6-ebiggers@kernel.org
2022-09-11 19:47:12 -05:00
Eric Biggers
53dd3f802a fscrypt: change fscrypt_dio_supported() to prepare for STATX_DIOALIGN
To prepare for STATX_DIOALIGN support, make two changes to
fscrypt_dio_supported().

First, remove the filesystem-block-alignment check and make the
filesystems handle it instead.  It previously made sense to have it in
fs/crypto/; however, to support STATX_DIOALIGN the alignment restriction
would have to be returned to filesystems.  It ends up being simpler if
filesystems handle this part themselves, especially for f2fs which only
allows fs-block-aligned DIO in the first place.

Second, make fscrypt_dio_supported() work on inodes whose encryption key
hasn't been set up yet, by making it set up the key if needed.  This is
required for statx(), since statx() doesn't require a file descriptor.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220827065851.135710-4-ebiggers@kernel.org
2022-09-11 19:47:12 -05:00
Eric Biggers
14db0b3c7b fscrypt: stop using PG_error to track error status
As a step towards freeing the PG_error flag for other uses, change ext4
and f2fs to stop using PG_error to track decryption errors.  Instead, if
a decryption error occurs, just mark the whole bio as failed.  The
coarser granularity isn't really a problem since it isn't any worse than
what the block layer provides, and errors from a multi-page readahead
aren't reported to applications unless a single-page read fails too.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org> # for f2fs part
Link: https://lore.kernel.org/r/20220815235052.86545-2-ebiggers@kernel.org
2022-09-06 15:15:56 -07:00
Chao Yu
265576181b f2fs: remove gc_urgent_high_limited for cleanup
Remove redundant sbi->gc_urgent_high_limited.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-29 21:15:51 -07:00
Chao Yu
34a2352560 f2fs: iostat: support accounting compressed IO
Previously, we supported to account FS_CDATA_READ_IO type IO only,
in this patch, it adds to account more type IO for compressed file:
- APP_BUFFERED_CDATA_IO
- APP_MAPPED_CDATA_IO
- FS_CDATA_IO
- APP_BUFFERED_CDATA_READ_IO
- APP_MAPPED_CDATA_READ_IO

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-29 21:15:51 -07:00
Eric Biggers
b87846bd61 f2fs: use memcpy_{to,from}_page() where possible
This is simpler, and as a side effect it replaces several uses of
kmap_atomic() with its recommended replacement kmap_local_page().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-29 21:15:51 -07:00
Jaegeuk Kim
605b0a778a f2fs: fix wrong continue condition in GC
We should decrease the frozen counter.

Cc: stable@vger.kernel.org
Fixes: 325163e989 ("f2fs: add gc_urgent_high_remaining sysfs node")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-29 21:15:51 -07:00
Jaegeuk Kim
80dc113aaa f2fs: LFS mode does not support ATGC
ATGC is using SSR which violates LFS mode used by zoned device.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-29 21:15:50 -07:00
Linus Torvalds
1daf117f1d f2fs-for-6.0
In this cycle, we mainly fixed some corner cases that manipulate a per-file
 compression flag inappropriately. And, we found f2fs counted valid blocks in a
 section incorrectly when zone capacity is set, and thus, fixed it with
 additional sysfs entry to check it easily. Lastly, this series includes
 several patches with respect to the new atomic write support such as a
 couple of bug fixes and re-adding atomic_write_abort support that we removed
 by mistake in the previous release.
 
 Enhancement:
  - add sysfs entries to understand atomic write operations and zone
    capacity
  - introduce memory mode to get a hint for low-memory devices
  - adjust the waiting time of foreground GC
  - decompress clusters under softirq to avoid non-deterministic latency
  - do not skip updating inode when retrying to flush node page
  - enforce single zone capacity
 
 Bug fix:
  - set the compression/no-compression flags correctly
  - revive F2FS_IOC_ABORT_VOLATILE_WRITE
  - check inline_data during compressed inode conversion
  - understand zone capacity when calculating valid block count
 
 As usual, the series includes several minor clean-ups and sanity checks.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmLxOccACgkQQBSofoJI
 UNId/g/+Nx3FK874cyobE1PnPpUtfxLGqO9fjhrbje3bniTpgE9NtJUFg5hRQkxE
 XHuufMrW++aBhn2ESMjbfdQ3v6vy5XUy7bi4FR71KxW4qp15mAqjTPfAZBFKZfMv
 lCv54NKlura91GhI9Dl6JgGe1+MwNXIxVROyGvjXYogF0DWl+iJh4vYuCFUguiNU
 mP6FmnZvbtK89jYxODoqwQaC+b6DV7ceaQ+c0dtS5TRvsUNv5mjWDeTvPMgk3At/
 mAuWYXfIrf5xfDY93JPbrJhBLvu7Ey3EfXBnaFGRYbYxYYub9JZ4+/5di/rB9jRc
 9AZ6LcLX3aKaT71EWa9vdCIffz8/PcSRjsmpEuVs7KNySwcnolnb1tAzlJPKy2AV
 IJliY1Ef0+jrpg2lHYZoMb5qvo80c3xlyxlgZt0LSZKf1Wo41sjJVt6ZS7WLhHXu
 OlzeI7lZBS9RKPUtU5cGNWkmZqamvmq09mMvqF4IUIaY40MizKZoV0yh9BjuUoxM
 xniBIlC/q0HvwmbQ2OtNKDgv7+FdxrRlaDyhhkppa3UA8ZK3Edch26N9pBoh/r33
 zJIR2BwCGmHz7yaX4HGzSt1phex2ABIGuZ4vBaGI7XDuYUD1tCZpC8wMCs2X3pKo
 ldQz3uu0GA0BSsNKpRks2dwRF0JJVGTk8UwcSXPwTdTTdqyhmvI=
 =dJ41
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this cycle, we mainly fixed some corner cases that manipulate a
  per-file compression flag inappropriately. And, we found f2fs counted
  valid blocks in a section incorrectly when zone capacity is set, and
  thus, fixed it with additional sysfs entry to check it easily.

  Lastly, this series includes several patches with respect to the new
  atomic write support such as a couple of bug fixes and re-adding
  atomic_write_abort support that we removed by mistake in the previous
  release.

  Enhancements:
   - add sysfs entries to understand atomic write operations and zone
     capacity
   - introduce memory mode to get a hint for low-memory devices
   - adjust the waiting time of foreground GC
   - decompress clusters under softirq to avoid non-deterministic
     latency
   - do not skip updating inode when retrying to flush node page
   - enforce single zone capacity

  Bug fixes:
   - set the compression/no-compression flags correctly
   - revive F2FS_IOC_ABORT_VOLATILE_WRITE
   - check inline_data during compressed inode conversion
   - understand zone capacity when calculating valid block count

  As usual, the series includes several minor clean-ups and sanity
  checks"

* tag 'f2fs-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits)
  f2fs: use onstack pages instead of pvec
  f2fs: intorduce f2fs_all_cluster_page_ready
  f2fs: clean up f2fs_abort_atomic_write()
  f2fs: handle decompress only post processing in softirq
  f2fs: do not allow to decompress files have FI_COMPRESS_RELEASED
  f2fs: do not set compression bit if kernel doesn't support
  f2fs: remove device type check for direct IO
  f2fs: fix null-ptr-deref in f2fs_get_dnode_of_data
  f2fs: revive F2FS_IOC_ABORT_VOLATILE_WRITE
  f2fs: fix to do sanity check on segment type in build_sit_entries()
  f2fs: obsolete unused MAX_DISCARD_BLOCKS
  f2fs: fix to avoid use f2fs_bug_on() in f2fs_new_node_page()
  f2fs: fix to remove F2FS_COMPR_FL and tag F2FS_NOCOMP_FL at the same time
  f2fs: introduce sysfs atomic write statistics
  f2fs: don't bother wait_ms by foreground gc
  f2fs: invalidate meta pages only for post_read required inode
  f2fs: allow compression of files without blocks
  f2fs: fix to check inline_data during compressed inode conversion
  f2fs: Delete f2fs_copy_page() and replace with memcpy_page()
  f2fs: fix to invalidate META_MAPPING before DIO write
  ...
2022-08-08 11:18:31 -07:00
Linus Torvalds
6614a3c316 - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
 
 - Some kmemleak fixes from Patrick Wang and Waiman Long
 
 - DAMON updates from SeongJae Park
 
 - memcg debug/visibility work from Roman Gushchin
 
 - vmalloc speedup from Uladzislau Rezki
 
 - more folio conversion work from Matthew Wilcox
 
 - enhancements for coherent device memory mapping from Alex Sierra
 
 - addition of shared pages tracking and CoW support for fsdax, from
   Shiyang Ruan
 
 - hugetlb optimizations from Mike Kravetz
 
 - Mel Gorman has contributed some pagealloc changes to improve latency
   and realtime behaviour.
 
 - mprotect soft-dirty checking has been improved by Peter Xu
 
 - Many other singleton patches all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
 jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
 SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
 =w/UH
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Most of the MM queue. A few things are still pending.

  Liam's maple tree rework didn't make it. This has resulted in a few
  other minor patch series being held over for next time.

  Multi-gen LRU still isn't merged as we were waiting for mapletree to
  stabilize. The current plan is to merge MGLRU into -mm soon and to
  later reintroduce mapletree, with a view to hopefully getting both
  into 6.1-rc1.

  Summary:

   - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
     Lin, Yang Shi, Anshuman Khandual and Mike Rapoport

   - Some kmemleak fixes from Patrick Wang and Waiman Long

   - DAMON updates from SeongJae Park

   - memcg debug/visibility work from Roman Gushchin

   - vmalloc speedup from Uladzislau Rezki

   - more folio conversion work from Matthew Wilcox

   - enhancements for coherent device memory mapping from Alex Sierra

   - addition of shared pages tracking and CoW support for fsdax, from
     Shiyang Ruan

   - hugetlb optimizations from Mike Kravetz

   - Mel Gorman has contributed some pagealloc changes to improve
     latency and realtime behaviour.

   - mprotect soft-dirty checking has been improved by Peter Xu

   - Many other singleton patches all over the place"

 [ XFS merge from hell as per Darrick Wong in

   https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]

* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
  tools/testing/selftests/vm/hmm-tests.c: fix build
  mm: Kconfig: fix typo
  mm: memory-failure: convert to pr_fmt()
  mm: use is_zone_movable_page() helper
  hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
  hugetlbfs: cleanup some comments in inode.c
  hugetlbfs: remove unneeded header file
  hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
  hugetlbfs: use helper macro SZ_1{K,M}
  mm: cleanup is_highmem()
  mm/hmm: add a test for cross device private faults
  selftests: add soft-dirty into run_vmtests.sh
  selftests: soft-dirty: add test for mprotect
  mm/mprotect: fix soft-dirty check in can_change_pte_writable()
  mm: memcontrol: fix potential oom_lock recursion deadlock
  mm/gup.c: fix formatting in check_and_migrate_movable_page()
  xfs: fail dax mount if reflink is enabled on a partition
  mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
  userfaultfd: don't fail on unrecognized features
  hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
  ...
2022-08-05 16:32:45 -07:00
Fengnan Chang
01fc4b9a6e f2fs: use onstack pages instead of pvec
Since pvec have 15 pages, it not a multiple of 4, when write compressed
pages, write in 64K as a unit, it will call pagevec_lookup_range_tag
agagin, sometimes this will take a lot of time.
Use onstack pages instead of pvec to mitigate this problem.

Signed-off-by: Fengnan Chang <fengnanchang@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-05 04:20:55 -07:00
Fengnan Chang
4f8219f8aa f2fs: intorduce f2fs_all_cluster_page_ready
When write total cluster, all pages is uptodate, there is not need to call
f2fs_prepare_compress_overwrite, intorduce f2fs_all_cluster_page_ready
to avoid this.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-05 04:20:28 -07:00
Chao Yu
e53f864347 f2fs: clean up f2fs_abort_atomic_write()
f2fs_abort_atomic_write() has checked whether current inode is
atomic_write one or not, it's redundant to check in its caller,
remove it for cleanup.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-05 04:20:02 -07:00
Daeho Jeong
bff139b49d f2fs: handle decompress only post processing in softirq
Now decompression is being handled in workqueue and it makes read I/O
latency non-deterministic, because of the non-deterministic scheduling
nature of workqueues. So, I made it handled in softirq context only if
possible, not in low memory devices, since this modification will
maintain decompresion related memory a little longer.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-05 04:18:31 -07:00
Jaewook Kim
90be48bd9d f2fs: do not allow to decompress files have FI_COMPRESS_RELEASED
If a file has FI_COMPRESS_RELEASED, all writes for it should not be
allowed. However, as of now, in case of compress_mode=user, writes
triggered by IOCTLs like F2FS_IOC_DE/COMPRESS_FILE are allowed unexpectly,
which could crash that file.
To fix it, let's do not allow F2FS_IOC_DE/COMPRESS_IOCTL if a file already
has FI_COMPRESS_RELEASED flag.

This is the reproduction process:
1.  $ touch ./file
2.  $ chattr +c ./file
3.  $ dd if=/dev/random of=./file bs=4096 count=30 conv=notrunc
4.  $ dd if=/dev/zero of=./file bs=4096 count=34 seek=30 conv=notrunc
5.  $ sync
6.  $ do_compress ./file      ; call F2FS_IOC_COMPRESS_FILE
7.  $ get_compr_blocks ./file ; call F2FS_IOC_GET_COMPRESS_BLOCKS
8.  $ release ./file          ; call F2FS_IOC_RELEASE_COMPRESS_BLOCKS
9.  $ do_compress ./file      ; call F2FS_IOC_COMPRESS_FILE again
10. $ get_compr_blocks ./file ; call F2FS_IOC_GET_COMPRESS_BLOCKS again

This reproduction process is tested in 128kb cluster size.
You can find compr_blocks has a negative value.

Fixes: 5fdb322ff2 ("f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE")

Signed-off-by: Junbeom Yeom <junbeom.yeom@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Jaewook Kim <jw5454.kim@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-05 04:18:08 -07:00
Jaegeuk Kim
912f0d6580 f2fs: do not set compression bit if kernel doesn't support
If kernel doesn't have CONFIG_F2FS_FS_COMPRESSION, a file having FS_COMPR_FL via
ioctl(FS_IOC_SETFLAGS) is unaccessible due to f2fs_is_compress_backend_ready().
Let's avoid it.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-05 04:18:08 -07:00
Eunhee Rho
dbf8e63f48 f2fs: remove device type check for direct IO
To ensure serialized IOs, f2fs allows only LFS mode for zoned
device. Remove redundant check for direct IO.

Signed-off-by: Eunhee Rho <eunhee83.rho@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-05 04:18:08 -07:00
Ye Bin
4a2c5b7994 f2fs: fix null-ptr-deref in f2fs_get_dnode_of_data
There is issue as follows when test f2fs atomic write:
F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th superblock
F2FS-fs (loop0): invalid crc_offset: 0
F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=1, run fsck to fix.
F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=2, run fsck to fix.
==================================================================
BUG: KASAN: null-ptr-deref in f2fs_get_dnode_of_data+0xac/0x16d0
Read of size 8 at addr 0000000000000028 by task rep/1990

CPU: 4 PID: 1990 Comm: rep Not tainted 5.19.0-rc6-next-20220715 #266
Call Trace:
 <TASK>
 dump_stack_lvl+0x6e/0x91
 print_report.cold+0x49a/0x6bb
 kasan_report+0xa8/0x130
 f2fs_get_dnode_of_data+0xac/0x16d0
 f2fs_do_write_data_page+0x2a5/0x1030
 move_data_page+0x3c5/0xdf0
 do_garbage_collect+0x2015/0x36c0
 f2fs_gc+0x554/0x1d30
 f2fs_balance_fs+0x7f5/0xda0
 f2fs_write_single_data_page+0xb66/0xdc0
 f2fs_write_cache_pages+0x716/0x1420
 f2fs_write_data_pages+0x84f/0x9a0
 do_writepages+0x130/0x3a0
 filemap_fdatawrite_wbc+0x87/0xa0
 file_write_and_wait_range+0x157/0x1c0
 f2fs_do_sync_file+0x206/0x12d0
 f2fs_sync_file+0x99/0xc0
 vfs_fsync_range+0x75/0x140
 f2fs_file_write_iter+0xd7b/0x1850
 vfs_write+0x645/0x780
 ksys_write+0xf1/0x1e0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

As 3db1de0e58 commit changed atomic write way which new a cow_inode for
atomic write file, and also mark cow_inode as FI_ATOMIC_FILE.
When f2fs_do_write_data_page write cow_inode will use cow_inode's cow_inode
which is NULL. Then will trigger null-ptr-deref.
To solve above issue, introduce FI_COW_FILE flag for COW inode.

Fiexes: 3db1de0e582c("f2fs: change the current atomic write way")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-05 04:18:08 -07:00
Daeho Jeong
23339e5752 f2fs: revive F2FS_IOC_ABORT_VOLATILE_WRITE
F2FS_IOC_ABORT_VOLATILE_WRITE was used to abort a atomic write before.
However it was removed accidentally. So revive it by changing the name,
since volatile write had gone.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Fiexes: 7bc155fec5b3("f2fs: kill volatile write support")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-08-05 04:17:57 -07:00
Linus Torvalds
f00654007f Folio changes for 6.0
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
    when running xfstests
 
  - Convert more of mpage to use folios
 
  - Remove add_to_page_cache() and add_to_page_cache_locked()
 
  - Convert find_get_pages_range() to filemap_get_folios()
 
  - Improvements to the read_cache_page() family of functions
 
  - Remove a few unnecessary checks of PageError
 
  - Some straightforward filesystem conversions to use folios
 
  - Split PageMovable users out from address_space_operations into their
    own movable_operations
 
  - Convert aops->migratepage to aops->migrate_folio
 
  - Remove nobh support (Christoph Hellwig)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp
 gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha
 mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS
 7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT
 /58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z
 C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M
 Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q==
 =DgUi
 -----END PGP SIGNATURE-----

Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache

Pull folio updates from Matthew Wilcox:

 - Fix an accounting bug that made NR_FILE_DIRTY grow without limit
   when running xfstests

 - Convert more of mpage to use folios

 - Remove add_to_page_cache() and add_to_page_cache_locked()

 - Convert find_get_pages_range() to filemap_get_folios()

 - Improvements to the read_cache_page() family of functions

 - Remove a few unnecessary checks of PageError

 - Some straightforward filesystem conversions to use folios

 - Split PageMovable users out from address_space_operations into
   their own movable_operations

 - Convert aops->migratepage to aops->migrate_folio

 - Remove nobh support (Christoph Hellwig)

* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
  fs: remove the NULL get_block case in mpage_writepages
  fs: don't call ->writepage from __mpage_writepage
  fs: remove the nobh helpers
  jfs: stop using the nobh helper
  ext2: remove nobh support
  ntfs3: refactor ntfs_writepages
  mm/folio-compat: Remove migration compatibility functions
  fs: Remove aops->migratepage()
  secretmem: Convert to migrate_folio
  hugetlb: Convert to migrate_folio
  aio: Convert to migrate_folio
  f2fs: Convert to filemap_migrate_folio()
  ubifs: Convert to filemap_migrate_folio()
  btrfs: Convert btrfs_migratepage to migrate_folio
  mm/migrate: Add filemap_migrate_folio()
  mm/migrate: Convert migrate_page() to migrate_folio()
  nfs: Convert to migrate_folio
  btrfs: Convert btree_migratepage to migrate_folio
  mm/migrate: Convert expected_page_refs() to folio_expected_refs()
  mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
  ...
2022-08-03 10:35:43 -07:00
Linus Torvalds
c013d0af81 for-5.20/block-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLko3gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmQaD/90NKFj4v8I456TUQyg1jimXEsL+e84E6o2
 ALWVb6JzQvlPVQXNLnK5YKIunMWOTtTMz0nyB8sVRwVJVJO0P5d7QopAkZM8fkyU
 MK5OCzoryENw4DTc2wJS4in6cSbGylIuN74wMzlf7+M67JTImfoZQhbTMcjwzZfn
 b3OlL6sID7zMXwGcuOJPZyUJICCpDhzdSF9JXqKma5PQuG2SBmQyvFxJAcsoFBPc
 YetnoRIOIN6yBvsIZaPaYq7XI9MIvF0e67EQtyCEHj4tHpyVnyDWkeObVFULsISU
 gGEKbkYPvNUzRAU5Q1NBBHh1tTfkf/MaUxTuZwoEwZ/s04IGBGMmrZGyfvdfzYo6
 M7NwSEg/TrUSNfTwn65mQi7uOXu1pGkJrqz84Flm8u9Qid9Vd7LExLG5p/ggnWdH
 5th93MDEmtEg29e9DXpEAuS5d0t3TtSvosflaKpyfNNfr+P0rWCN6GM/uW62VUTK
 ls69SQh/AQJRbg64jU4xper6WhaYtSXK7TKEnxJycoEn9gYNyCcdot2uekth0xRH
 ChHGmRlteiqe/y4uFWn/2dcxWjoleiHbFjTaiRL75WVl8wIDEjw02LGuoZ61Ss9H
 WOV+MT7KqNjBGe6lreUY+O/PO02dzmoR6heJXN19p8zr/pBuLCTGX7UpO7rzgaBR
 4N1HEozvIw==
 =celk
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Improve the type checking of request flags (Bart)

 - Ensure queue mapping for a single queues always picks the right queue
   (Bart)

 - Sanitize the io priority handling (Jan)

 - rq-qos race fix (Jinke)

 - Reserved tags handling improvements (John)

 - Separate memory alignment from file/disk offset aligment for O_DIRECT
   (Keith)

 - Add new ublk driver, userspace block driver using io_uring for
   communication with the userspace backend (Ming)

 - Use try_cmpxchg() to cleanup the code in various spots (Uros)

 - Finally remove bdevname() (Christoph)

 - Clean up the zoned device handling (Christoph)

 - Clean up independent access range support (Christoph)

 - Clean up and improve block sysfs handling (Christoph)

 - Clean up and improve teardown of block devices.

   This turns the usual two step process into something that is simpler
   to implement and handle in block drivers (Christoph)

 - Clean up chunk size handling (Christoph)

 - Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
   Ming, Sebastian, Yang, Ying)

* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
  ublk_drv: fix double shift bug
  ublk_drv: make sure that correct flags(features) returned to userspace
  ublk_drv: fix error handling of ublk_add_dev
  ublk_drv: fix lockdep warning
  block: remove __blk_get_queue
  block: call blk_mq_exit_queue from disk_release for never added disks
  blk-mq: fix error handling in __blk_mq_alloc_disk
  ublk: defer disk allocation
  ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
  ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
  ublk: cleanup ublk_ctrl_uring_cmd
  ublk: simplify ublk_ch_open and ublk_ch_release
  ublk: remove the empty open and release block device operations
  ublk: remove UBLK_IO_F_PREFLUSH
  ublk: add a MAINTAINERS entry
  block: don't allow the same type rq_qos add more than once
  mmc: fix disk/queue leak in case of adding disk failure
  ublk_drv: fix an IS_ERR() vs NULL check
  ublk: remove UBLK_IO_F_INTEGRITY
  ublk_drv: remove unneeded semicolon
  ...
2022-08-02 13:46:35 -07:00
Matthew Wilcox (Oracle)
1d5b9bd656 f2fs: Convert to filemap_migrate_folio()
filemap_migrate_folio() fits f2fs's needs perfectly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Chao Yu <chao@kernel.org>
2022-08-02 12:34:04 -04:00
Linus Torvalds
bdfae5ce38 fs.idmapped.vfsuid.v5.20
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYufP6AAKCRCRxhvAZXjc
 omzRAQCGJ11r7T0C7t1kTdQiFSs5XN9ksFa86Hfj3dHEBIj+LQEA+bZ2/LLpElDz
 zPekgXkFQqdMr+FUL8sk94dzHT0GAgk=
 =BcK/
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.vfsuid.v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fs idmapping updates from Christian Brauner:
 "This introduces the new vfs{g,u}id_t types we agreed on. Similar to
  k{g,u}id_t the new types are just simple wrapper structs around
  regular {g,u}id_t types.

  They allow to establish a type safety boundary in the VFS for idmapped
  mounts preventing confusion betwen {g,u}ids mapped into an idmapped
  mount and {g,u}ids mapped into the caller's or the filesystem's
  idmapping.

  An initial set of helpers is introduced that allows to operate on
  vfs{g,u}id_t types. We will remove all references to non-type safe
  idmapped mounts helpers in the very near future. The patches do
  already exist.

  This converts the core attribute changing codepaths which become
  significantly easier to reason about because of this change.

  Just a few highlights here as the patches give detailed overviews of
  what is happening in the commit messages:

   - The kernel internal struct iattr contains type safe vfs{g,u}id_t
     values clearly communicating that these values have to take a given
     mount's idmapping into account.

   - The ownership values placed in struct iattr to change ownership are
     identical for idmapped and non-idmapped mounts going forward. This
     also allows to simplify stacking filesystems such as overlayfs that
     change attributes In other words, they always represent the values.

   - Instead of open coding checks for whether ownership changes have
     been requested and an actual update of the inode is required we now
     have small static inline wrappers that abstract this logic away
     removing a lot of code duplication from individual filesystems that
     all open-coded the same checks"

* tag 'fs.idmapped.vfsuid.v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  mnt_idmapping: align kernel doc and parameter order
  mnt_idmapping: use new helpers in mapped_fs{g,u}id()
  fs: port HAS_UNMAPPED_ID() to vfs{g,u}id_t
  mnt_idmapping: return false when comparing two invalid ids
  attr: fix kernel doc
  attr: port attribute changes to new types
  security: pass down mount idmapping to setattr hook
  quota: port quota helpers mount ids
  fs: port to iattr ownership update helpers
  fs: introduce tiny iattr ownership update helpers
  fs: use mount types in iattr
  fs: add two type safe mapping helpers
  mnt_idmapping: add vfs{g,u}id_t
2022-08-01 08:56:55 -07:00
Chao Yu
09beadf289 f2fs: fix to do sanity check on segment type in build_sit_entries()
As Wenqing Liu <wenqingliu0120@gmail.com> reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=216285

RIP: 0010:memcpy_erms+0x6/0x10
 f2fs_update_meta_page+0x84/0x570 [f2fs]
 change_curseg.constprop.0+0x159/0xbd0 [f2fs]
 f2fs_do_replace_block+0x5c7/0x18a0 [f2fs]
 f2fs_replace_block+0xeb/0x180 [f2fs]
 recover_data+0x1abd/0x6f50 [f2fs]
 f2fs_recover_fsync_data+0x12ce/0x3250 [f2fs]
 f2fs_fill_super+0x4459/0x6190 [f2fs]
 mount_bdev+0x2cf/0x3b0
 legacy_get_tree+0xed/0x1d0
 vfs_get_tree+0x81/0x2b0
 path_mount+0x47e/0x19d0
 do_mount+0xce/0xf0
 __x64_sys_mount+0x12c/0x1a0
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

The root cause is segment type is invalid, so in f2fs_do_replace_block(),
f2fs accesses f2fs_sm_info::curseg_array with out-of-range segment type,
result in accessing invalid curseg->sum_blk during memcpy in
f2fs_update_meta_page(). Fix this by adding sanity check on segment type
in build_sit_entries().

Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:19:00 -07:00
Chao Yu
7b01ad7f33 f2fs: obsolete unused MAX_DISCARD_BLOCKS
After commit a7eeb82385 ("f2fs: use bitmap in discard_entry"),
MAX_DISCARD_BLOCKS became obsolete, remove it.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:18:09 -07:00
Chao Yu
141170b759 f2fs: fix to avoid use f2fs_bug_on() in f2fs_new_node_page()
As Dipanjan Das <mail.dipanjan.das@gmail.com> reported, syzkaller
found a f2fs bug as below:

RIP: 0010:f2fs_new_node_page+0x19ac/0x1fc0 fs/f2fs/node.c:1295
Call Trace:
 write_all_xattrs fs/f2fs/xattr.c:487 [inline]
 __f2fs_setxattr+0xe76/0x2e10 fs/f2fs/xattr.c:743
 f2fs_setxattr+0x233/0xab0 fs/f2fs/xattr.c:790
 f2fs_xattr_generic_set+0x133/0x170 fs/f2fs/xattr.c:86
 __vfs_setxattr+0x115/0x180 fs/xattr.c:182
 __vfs_setxattr_noperm+0x125/0x5f0 fs/xattr.c:216
 __vfs_setxattr_locked+0x1cf/0x260 fs/xattr.c:277
 vfs_setxattr+0x13f/0x330 fs/xattr.c:303
 setxattr+0x146/0x160 fs/xattr.c:611
 path_setxattr+0x1a7/0x1d0 fs/xattr.c:630
 __do_sys_lsetxattr fs/xattr.c:653 [inline]
 __se_sys_lsetxattr fs/xattr.c:649 [inline]
 __x64_sys_lsetxattr+0xbd/0x150 fs/xattr.c:649
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

NAT entry and nat bitmap can be inconsistent, e.g. one nid is free
in nat bitmap, and blkaddr in its NAT entry is not NULL_ADDR, it
may trigger BUG_ON() in f2fs_new_node_page(), fix it.

Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:57 -07:00
Chao Liu
8ee236dcaa f2fs: fix to remove F2FS_COMPR_FL and tag F2FS_NOCOMP_FL at the same time
If the inode has the compress flag, it will fail to use
'chattr -c +m' to remove its compress flag and tag no compress flag.
However, the same command will be successful when executed again,
as shown below:

  $ touch foo.txt
  $ chattr +c foo.txt
  $ chattr -c +m foo.txt
  chattr: Invalid argument while setting flags on foo.txt
  $ chattr -c +m foo.txt
  $ f2fs_io getflags foo.txt
  get a flag on foo.txt ret=0, flags=nocompression,inline_data

Fix this by removing some checks in f2fs_setflags_common()
that do not affect the original logic. I go through all the
possible scenarios, and the results are as follows. Bold is
the only thing that has changed.

+---------------+-----------+-----------+----------+
|               |            file flags            |
+ command       +-----------+-----------+----------+
|               | no flag   | compr     | nocompr  |
+---------------+-----------+-----------+----------+
| chattr +c     | compr     | compr     | -EINVAL  |
| chattr -c     | no flag   | no flag   | nocompr  |
| chattr +m     | nocompr   | -EINVAL   | nocompr  |
| chattr -m     | no flag   | compr     | no flag  |
| chattr +c +m  | -EINVAL   | -EINVAL   | -EINVAL  |
| chattr +c -m  | compr     | compr     | compr    |
| chattr -c +m  | nocompr   | *nocompr* | nocompr  |
| chattr -c -m  | no flag   | no flag   | no flag  |
+---------------+-----------+-----------+----------+

Link: https://lore.kernel.org/linux-f2fs-devel/20220621064833.1079383-1-chaoliu719@gmail.com/
Fixes: 4c8ff7095b ("f2fs: support data compression")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Chao Liu <liuchao@coolpad.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:07 -07:00
Daeho Jeong
f8e2f32bcd f2fs: introduce sysfs atomic write statistics
introduce the below 4 new sysfs node for atomic write statistics.
- current_atomic_write: the total current atomic write block count,
                        which is not committed yet.
- peak_atomic_write: the peak value of total current atomic write block
                     count after boot.
- committed_atomic_block: the accumulated total committed atomic write
                          block count after boot.
- revoked_atomic_block: the accumulated total revoked atomic write block
                        count after boot.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:07 -07:00
qixiaoyu1
1adaa71ea9 f2fs: don't bother wait_ms by foreground gc
f2fs_gc returns -EINVAL via f2fs_balance_fs when there is enough free
secs after write checkpoint, but with gc_merge enabled, it will cause
the sleep time of gc thread to be set to no_gc_sleep_time even if there
are many dirty segments can be selected.

Signed-off-by: qixiaoyu1 <qixiaoyu1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:07 -07:00
Chao Yu
0d5b9d8156 f2fs: invalidate meta pages only for post_read required inode
After commit e3b49ea368 ("f2fs: invalidate META_MAPPING before
IPU/DIO write"), invalidate_mapping_pages() will be called to
avoid race condition in between IPU/DIO and readahead for GC.

However, readahead flow is only used for post_read required inode,
so this patch adds check condition to avoids unnecessary page cache
invalidating for non-post_read inode.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:06 -07:00
Chao Liu
a8634ccf5d f2fs: allow compression of files without blocks
Files created by truncate(1) have a size but no blocks, so
they can be allowed to enable compression.

Signed-off-by: Chao Liu <liuchao@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:06 -07:00
Chao Yu
7165841d57 f2fs: fix to check inline_data during compressed inode conversion
When converting inode to compressed one via ioctl, it needs to check
inline_data, since inline_data flag and compressed flag are incompatible.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:17:06 -07:00
Fabio M. De Francesco
1dd55358ef f2fs: Delete f2fs_copy_page() and replace with memcpy_page()
f2fs_copy_page() is a wrapper around two kmap() + one memcpy() from/to
the mapped pages. It unnecessarily duplicates a kernel API and it makes
use of kmap(), which is being deprecated in favor of kmap_local_page().

Two main problems with kmap(): (1) It comes with an overhead as mapping
space is restricted and protected by a global lock for synchronization and
(2) it also requires global TLB invalidation when the kmap’s pool wraps
and it might block when the mapping space is fully utilized until a slot
becomes available.

With kmap_local_page() the mappings are per thread, CPU local, can take
page faults, and can be called from any context (including interrupts).
It is faster than kmap() in kernels with HIGHMEM enabled. Therefore, its
use in __clone_blkaddrs() is safe and should be preferred.

Delete f2fs_copy_page() and use a plain memcpy_page() in the only one
site calling the removed function. memcpy_page() avoids open coding two
kmap_local_page() + one memcpy() between the two kernel virtual addresses.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:57 -07:00
Chao Yu
67ca06872e f2fs: fix to invalidate META_MAPPING before DIO write
Quoted from commit e3b49ea368 ("f2fs: invalidate META_MAPPING before
IPU/DIO write")

"
Encrypted pages during GC are read and cached in META_MAPPING.
However, due to cached pages in META_MAPPING, there is an issue where
newly written pages are lost by IPU or DIO writes.

Thread A - f2fs_gc()            Thread B
/* phase 3 */
down_write(i_gc_rwsem)
ra_data_block()       ---- (a)
up_write(i_gc_rwsem)
                                f2fs_direct_IO() :
                                 - down_read(i_gc_rwsem)
                                 - __blockdev_direct_io()
                                 - get_data_block_dio_write()
                                 - f2fs_dio_submit_bio()  ---- (b)
                                 - up_read(i_gc_rwsem)
/* phase 4 */
down_write(i_gc_rwsem)
move_data_block()     ---- (c)
up_write(i_gc_rwsem)

(a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and
    cached in META_MAPPING.
(b) In thread B, writing new data by IPU or DIO write on same blkaddr as
    read in (a). cached page in META_MAPPING become out-dated.
(c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to
    new blkaddr. In conclusion, the newly written data in (b) is lost.

To address this issue, invalidating pages in META_MAPPING before IPU or
DIO write.
"

In previous commit, we missed to cover extent cache hit case, and passed
wrong value for parameter @end of invalidate_mapping_pages(), fix both
issues.

Fixes: 6aa58d8ad2 ("f2fs: readahead encrypted block during GC")
Fixes: e3b49ea368 ("f2fs: invalidate META_MAPPING before IPU/DIO write")
Cc: Hyeong-Jun Kim <hj514.kim@samsung.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Jaegeuk Kim
8e0f54a70e f2fs: add a sysfs entry to show zone capacity
This patch adds a sysfs entry showing the unusable space in a section
made by zone capacity.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Jaegeuk Kim
074b5ea290 f2fs: adjust zone capacity when considering valid block count
This patch fixes counting unusable blocks set by zone capacity when
checking the valid block count in a section.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Jaegeuk Kim
b771aadc6e f2fs: enforce single zone capacity
In order to simplify the complicated per-zone capacity, let's support
only one capacity for entire zoned device.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
duguowei
14de5fc3dd f2fs: remove redundant code for gc condition
Remove the redundant code and use local variant as the
argument directly. Make it more human-readable.

Signed-off-by: duguowei <duguowei@xiaomi.com>
[Jaegeuk Kim: make code neat]
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:20 -07:00
Daeho Jeong
7a8fc58618 f2fs: introduce memory mode
Introduce memory mode to supports "normal" and "low" memory modes.
"low" mode is to support low memory devices. Because of the nature of
low memory devices, in this mode, f2fs will try to save memory sometimes
by sacrificing performance. "normal" mode is the default mode and same
as before.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-07-30 20:16:12 -07:00
Bart Van Assche
7649c873c1 fs/f2fs: Use the enum req_op and blk_opf_t types
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
variables that represent request flags.

Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-53-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14 12:14:32 -06:00
Roman Gushchin
e33c267ab7 mm: shrinkers: provide shrinkers with names
Currently shrinkers are anonymous objects.  For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g.  for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.

This commit adds names to shrinkers.  register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.

In some cases it's not possible to determine a good name at the time when
a shrinker is allocated.  For such cases shrinker_debugfs_rename() is
provided.

The expected format is:
    <subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.

After this change the shrinker debugfs directory looks like:
  $ cd /sys/kernel/debug/shrinker/
  $ ls
    dquota-cache-16     sb-devpts-28     sb-proc-47       sb-tmpfs-42
    mm-shadow-18        sb-devtmpfs-5    sb-proc-48       sb-tmpfs-43
    mm-zspool:zram0-34  sb-hugetlbfs-17  sb-pstore-31     sb-tmpfs-44
    rcu-kfree-0         sb-hugetlbfs-33  sb-rootfs-2      sb-tmpfs-49
    sb-aio-20           sb-iomem-12      sb-securityfs-6  sb-tracefs-13
    sb-anon_inodefs-15  sb-mqueue-21     sb-selinuxfs-22  sb-xfs:vda1-36
    sb-bdev-3           sb-nsfs-4        sb-sockfs-8      sb-zsmalloc-19
    sb-bpf-32           sb-pipefs-14     sb-sysfs-26      thp-deferred_split-10
    sb-btrfs:vda2-24    sb-proc-25       sb-tmpfs-1       thp-zero-9
    sb-cgroup2-30       sb-proc-39       sb-tmpfs-27      xfs-buf:vda1-37
    sb-configfs-23      sb-proc-41       sb-tmpfs-29      xfs-inodegc:vda1-38
    sb-dax-11           sb-proc-45       sb-tmpfs-35
    sb-debugfs-7        sb-proc-46       sb-tmpfs-40

[roman.gushchin@linux.dev: fix build warnings]
  Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
  Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03 18:08:40 -07:00
Matthew Wilcox (Oracle)
bbfe4f6600 f2fs: Convert f2fs_invalidate_compress_pages() to use filemap_get_folios()
Convert this function to use folios throughout.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Acked-by: Chao Yu <chao@kernel.org>
2022-06-29 08:51:06 -04:00
Chao Yu
29be7ec3df f2fs: initialize page_array_entry slab only if compression feature is on
Otherwise, in image which doesn't support compression feature,
page_array_entry will be initialized w/o use.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-28 09:27:10 -07:00
Jack Qiu
a4a0e16dbf f2fs: optimize error handling in redirty_blocks
Current error handling is at risk of page leaks. However, we dot't seek
any failure scenarios, just use f2fs_bug_on.

Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-28 09:27:09 -07:00
Jaegeuk Kim
7859e97f62 f2fs: do not skip updating inode when retrying to flush node page
Let's try to flush dirty inode again to improve subtle i_blocks mismatch.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-28 09:27:09 -07:00
Christian Brauner
b27c82e129
attr: port attribute changes to new types
Now that we introduced new infrastructure to increase the type safety
for filesystems supporting idmapped mounts port the first part of the
vfs over to them.

This ports the attribute changes codepaths to rely on the new better
helpers using a dedicated type.

Before this change we used to take a shortcut and place the actual
values that would be written to inode->i_{g,u}id into struct iattr. This
had the advantage that we moved idmappings mostly out of the picture
early on but it made reasoning about changes more difficult than it
should be.

The filesystem was never explicitly told that it dealt with an idmapped
mount. The transition to the value that needed to be stored in
inode->i_{g,u}id appeared way too early and increased the probability of
bugs in various codepaths.

We know place the same value in struct iattr no matter if this is an
idmapped mount or not. The vfs will only deal with type safe
vfs{g,u}id_t. This makes it massively safer to perform permission checks
as the type will tell us what checks we need to perform and what helpers
we need to use.

Fileystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
inode->i_{g,u}id since they are different types. Instead they need to
use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
vfs{g,u}id into the filesystem.

The other nice effect is that filesystems like overlayfs don't need to
care about idmappings explicitly anymore and can simply set up struct
iattr accordingly directly.

Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-26 18:18:56 +02:00
Christian Brauner
71e7b535b8
quota: port quota helpers mount ids
Port the is_quota_modification() and dqout_transfer() helper to type
safe vfs{g,u}id_t. Since these helpers are only called by a few
filesystems don't introduce a new helper but simply extend the existing
helpers to pass down the mount's idmapping.

Note, that this is a non-functional change, i.e. nothing will have
happened here or at the end of this series to how quota are done! This
a change necessary because we will at the end of this series make
ownership changes easier to reason about by keeping the original value
in struct iattr for both non-idmapped and idmapped mounts.

For now we always pass the initial idmapping which makes the idmapping
functions these helpers call nops.

This is done because we currently always pass the actual value to be
written to i_{g,u}id via struct iattr. While this allowed us to treat
the {g,u}id values in struct iattr as values that can be directly
written to inode->i_{g,u}id it also increases the potential for
confusion for filesystems.

Now that we are have dedicated types to prevent this confusion we will
ultimately only map the value from the idmapped mount into a filesystem
value that can be written to inode->i_{g,u}id when the filesystem
actually updates the inode. So pass down the initial idmapping until we
finished that conversion at which point we pass down the mount's
idmapping.

Since struct iattr uses an anonymous union with overlapping types as
supported by the C standard, filesystems that haven't converted to
ia_vfs{g,u}id won't see any difference and things will continue to work
as before. In other words, no functional changes intended with this
change.

Link: https://lore.kernel.org/r/20220621141454.2914719-7-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-26 18:18:55 +02:00
Christian Brauner
35faf3109a
fs: port to iattr ownership update helpers
Earlier we introduced new helpers to abstract ownership update and
remove code duplication. This converts all filesystems supporting
idmapped mounts to make use of these new helpers.

For now we always pass the initial idmapping which makes the idmapping
functions these helpers call nops.

This is done because we currently always pass the actual value to be
written to i_{g,u}id via struct iattr. While this allowed us to treat
the {g,u}id values in struct iattr as values that can be directly
written to inode->i_{g,u}id it also increases the potential for
confusion for filesystems.

Now that we are have dedicated types to prevent this confusion we will
ultimately only map the value from the idmapped mount into a filesystem
value that can be written to inode->i_{g,u}id when the filesystem
actually updates the inode. So pass down the initial idmapping until we
finished that conversion at which point we pass down the mount's
idmapping.

No functional changes intended.

Link: https://lore.kernel.org/r/20220621141454.2914719-6-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-26 18:18:55 +02:00
Eric Biggers
c5bca38d2e f2fs: use the updated test_dummy_encryption helper functions
Switch f2fs over to the functions that are replacing
fscrypt_set_test_dummy_encryption().  Since f2fs hasn't been converted
to the new mount API yet, this doesn't really provide a benefit for
f2fs.  But it allows fscrypt_set_test_dummy_encryption() to be removed.

Also take the opportunity to eliminate an #ifdef.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-25 12:11:56 -07:00
Jaegeuk Kim
82c7863ed9 f2fs: do not count ENOENT for error case
Otherwise, we can get a wrong cp_error mark.

Cc: <stable@vger.kernel.org>
Fixes: a7b8618aa2 ("f2fs: avoid infinite loop to flush node pages")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-21 08:29:56 -07:00
Daeho Jeong
61803e9843 f2fs: fix iostat related lock protection
Made iostat related locks safe to be called from irq context again.

Cc: <stable@vger.kernel.org>
Fixes: a1e09b03e6 ("f2fs: use iomap for direct I/O")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Stanley Chu <stanley.chu@mediatek.com>
Tested-by: Eddie Huang <eddie.huang@mediatek.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-19 15:16:12 -07:00
Jaegeuk Kim
4cde00d507 f2fs: attach inline_data after setting compression
This fixes the below corruption.

[345393.335389] F2FS-fs (vdb): sanity_check_inode: inode (ino=6d0, mode=33206) should not have inline_data, run fsck to fix

Cc: <stable@vger.kernel.org>
Fixes: 677a82b44e ("f2fs: fix to do sanity check for inline inode")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-19 15:16:10 -07:00
Linus Torvalds
1501f707d2 f2fs-for-5.19
In this round, we've refactored the existing atomic write support implemented
 by in-memory operations to have storing data in disk temporarily, which can give
 us a benefit to accept more atomic writes. At the same time, we removed the
 existing volatile write support. We've also revisited the file pinning and GC
 flows and found some corner cases which contributeed abnormal system behaviours.
 As usual, there're several minor code refactoring for readability, sanity check,
 and clean ups.
 
 Enhancement
  - allow compression for mmap files in compress_mode=user
  - kill volatile write support
  - change the current atomic write way
  - give priority to select unpinned section for foreground GC
  - introduce data read/write showing path info
  - remove unnecessary f2fs_lock_op in f2fs_new_inode
 
 Bug fix
  - fix the file pinning flow during checkpoint=disable and GCs
  - fix foreground and background GCs to select the right victims and get free
    sections on time
  - fix GC flags on defragmenting pages
  - avoid an infinite loop to flush node pages
  - fix fallocate to use file_modified to update permissions consistently
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmKWfyEACgkQQBSofoJI
 UNJaAQ/9Hs3aGIyriGV8CMbarklRuQ24o3khQKdia5gHseFVsydMfba8tyvl7vYV
 fZnHKp9rnEV1emxWn7hHLaGOvPV8leajZqMLhqG384BIb0yoTnRipnK5t0JkoiJX
 53XC5yfxQd01dwS+J4uOSu2jW0Gs6iBLD6H9ahOs86OE6jF1TeQ/fqjsrhm9I8Zr
 GsNON6zxafPn248sYyVBB3Y5GjPBPf+USif3ZEidAWimW/TIGbXLUT1hA0B79YoX
 DRAmN3tYS75yXauQvFPerMbOmP2gwCPcvdCI/PZ4U/ApsEPP7k1SbOZYAjjGUB30
 Qn8cSMxzPZ1cHvzIC96vwJk8XPdcDhICfzROb7jJdeznD8cWTDv0E+Vd33HUf/mG
 pi5Lkpc4STvYD+KUaKpdnHVg6ARWw4HOnUtW43MF3OsfuyGEEPlROs6lBVYnk/Hz
 smlrgnnLMTOpH9y2JyuyExeHEJ3EAgWbJ8aRpq7Ua7FvKF45Yj1lIytWlvWXSnRf
 rp+A5QJhVtYvT+y2Rk2h5oTRj/9l3+pR0X7CTOfSivJuf6aH5XVgI0EmxT2iBTCp
 4SDBjLC+nXXP3EK1HamLiz1mU23Qg1Qwvx3Wc4xgdwQf3s+jyYxki9tIjzdwJCCZ
 adjd3fc/GrD9UPDmJDXlD5QSoOJ94K/NOwYpu1L1/Q+dVwkl+IE=
 =ta8Y
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've refactored the existing atomic write support
  implemented by in-memory operations to have storing data in disk
  temporarily, which can give us a benefit to accept more atomic writes.

  At the same time, we removed the existing volatile write support.

  We've also revisited the file pinning and GC flows and found some
  corner cases which contributeed abnormal system behaviours.

  As usual, there're several minor code refactoring for readability,
  sanity check, and clean ups.

  Enhancements:

   - allow compression for mmap files in compress_mode=user

   - kill volatile write support

   - change the current atomic write way

   - give priority to select unpinned section for foreground GC

   - introduce data read/write showing path info

   - remove unnecessary f2fs_lock_op in f2fs_new_inode

  Bug fixes:

   - fix the file pinning flow during checkpoint=disable and GCs

   - fix foreground and background GCs to select the right victims and
     get free sections on time

   - fix GC flags on defragmenting pages

   - avoid an infinite loop to flush node pages

   - fix fallocate to use file_modified to update permissions
     consistently"

* tag 'f2fs-for-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (40 commits)
  f2fs: fix to tag gcing flag on page during file defragment
  f2fs: replace F2FS_I(inode) and sbi by the local variable
  f2fs: add f2fs_init_write_merge_io function
  f2fs: avoid unneeded error handling for revoke_entry_slab allocation
  f2fs: allow compression for mmap files in compress_mode=user
  f2fs: fix typo in comment
  f2fs: make f2fs_read_inline_data() more readable
  f2fs: fix to do sanity check for inline inode
  f2fs: fix fallocate to use file_modified to update permissions consistently
  f2fs: don't use casefolded comparison for "." and ".."
  f2fs: do not stop GC when requiring a free section
  f2fs: keep wait_ms if EAGAIN happens
  f2fs: introduce f2fs_gc_control to consolidate f2fs_gc parameters
  f2fs: reject test_dummy_encryption when !CONFIG_FS_ENCRYPTION
  f2fs: kill volatile write support
  f2fs: change the current atomic write way
  f2fs: don't need inode lock for system hidden quota
  f2fs: stop allocating pinned sections if EAGAIN happens
  f2fs: skip GC if possible when checkpoint disabling
  f2fs: give priority to select unpinned section for foreground GC
  ...
2022-05-31 16:52:59 -07:00
Chao Yu
2d1fe8a86b f2fs: fix to tag gcing flag on page during file defragment
In order to garantee migrated data be persisted during checkpoint,
otherwise out-of-order persistency between data and node may cause
data corruption after SPOR.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-27 13:58:42 -07:00
Yufen Yu
054cb2891b f2fs: replace F2FS_I(inode) and sbi by the local variable
We have define 'fi' at the begin of the functions, just use it,
rather than use F2FS_I(inode) again.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: replace sbi]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-27 13:58:23 -07:00
Yufen Yu
908ea65416 f2fs: add f2fs_init_write_merge_io function
Almost all other initialization of variables in f2fs_fill_super are
extraced to a single function. Also do it for write_io[], which can
make code more clean.

This patch just refactors the code, theres no functional change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-25 10:48:56 -07:00
Chao Yu
78901cfa44 f2fs: avoid unneeded error handling for revoke_entry_slab allocation
In __f2fs_commit_atomic_write(), we will guarantee success of
revoke_entry_slab allocation, so let's avoid unneeded error handling.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-24 23:06:51 -07:00
Sungjong Seo
66d34fcbbe f2fs: allow compression for mmap files in compress_mode=user
Since commit e3c548323d ("f2fs: let's allow compression for mmap files"),
it has been allowed to compress mmap files. However, in compress_mode=user,
it is not allowed yet. To keep the same concept in both compress_modes,
f2fs_ioc_(de)compress_file() should also allow it.

Let's remove checking mmap files in f2fs_ioc_(de)compress_file() so that
the compression for mmap files is also allowed in compress_mode=user.

Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-24 23:06:38 -07:00
Linus Torvalds
fdaf9a5840 Page cache changes for 5.19
- Appoint myself page cache maintainer
 
  - Fix how scsicam uses the page cache
 
  - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
 
  - Remove the AOP flags entirely
 
  - Remove pagecache_write_begin() and pagecache_write_end()
 
  - Documentation updates
 
  - Convert several address_space operations to use folios:
    - is_dirty_writeback
    - readpage becomes read_folio
    - releasepage becomes release_folio
    - freepage becomes free_folio
 
  - Change filler_t to require a struct file pointer be the first argument
    like ->read_folio
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp
 gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O
 7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj
 xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d
 nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26
 bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX
 WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A==
 =djLv
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache

Pull page cache updates from Matthew Wilcox:

 - Appoint myself page cache maintainer

 - Fix how scsicam uses the page cache

 - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS

 - Remove the AOP flags entirely

 - Remove pagecache_write_begin() and pagecache_write_end()

 - Documentation updates

 - Convert several address_space operations to use folios:
     - is_dirty_writeback
     - readpage becomes read_folio
     - releasepage becomes release_folio
     - freepage becomes free_folio

 - Change filler_t to require a struct file pointer be the first
   argument like ->read_folio

* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
  nilfs2: Fix some kernel-doc comments
  Appoint myself page cache maintainer
  fs: Remove aops->freepage
  secretmem: Convert to free_folio
  nfs: Convert to free_folio
  orangefs: Convert to free_folio
  fs: Add free_folio address space operation
  fs: Convert drop_buffers() to use a folio
  fs: Change try_to_free_buffers() to take a folio
  jbd2: Convert release_buffer_page() to use a folio
  jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
  reiserfs: Convert release_buffer_page() to use a folio
  fs: Remove last vestiges of releasepage
  ubifs: Convert to release_folio
  reiserfs: Convert to release_folio
  orangefs: Convert to release_folio
  ocfs2: Convert to release_folio
  nilfs2: Remove comment about releasepage
  nfs: Convert to release_folio
  jfs: Convert to release_folio
  ...
2022-05-24 19:55:07 -07:00
Linus Torvalds
bd1b7c1384 for-5.19-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmKLxJAACgkQxWXV+ddt
 WDvC4BAAnSNwZ15FJKe5Y423f6PS6EXjyMuc5t/fW6UumTTbI+tsS+Glkis+JNBf
 BiDZSlVQmiK9WoQSJe04epZgHaK8MaCARyZaRaxjDC4Nvfq4DlD9mbAU9D6e7tZY
 Mo8M99D8wDW+SB+P8RBpNjwB/oGCMmE3nKC83g+1ObmA0FVRCyQ1Kazf8RzNT1rZ
 DiaJoKTvU1/wDN3/1rw5yG+EfW2m9A14gRCihslhFYaDV7jhpuabl8wLT7MftZtE
 MtJ6EOOQbgIDjnp5BEIrPmowW/N0tKDT/gorF7cWgLG2R1cbSlKgqSH1Sq7CjFUE
 AKj/DwfqZArPLpqMThWklCwy2B9qDEezrQSy7renP/vkeFLbOp8hQuIY5KRzohdG
 oDI8ThlQGtCVjbny6NX/BbCnWRAfTz0TquCgag3Xl8NbkRFgFJtkf/cSxzb+3LW1
 tFeiUyTVLXVDS1cZLwgcb29Rrtp4bjd5/v3uECQlVD+or5pcAqSMkQgOBlyQJGbE
 Xb0nmPRihzQ8D4vINa63WwRyq0+QczVjvBxKj1daas0VEKGd32PIBS/0Qha+EpGl
 uFMiHBMSfqyl8QcShFk0cCbcgPMcNc7I6IAbXCE/WhhFG0ytqm9vpmlLqsTrXmHH
 z7/Eye/waqgACNEXoA8C4pyYzduQ4i1CeLDOdcsvBU6XQSuicSM=
 =lv6P
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Features:

   - subpage:
      - support for PAGE_SIZE > 4K (previously only 64K)
      - make it work with raid56

   - repair super block num_devices automatically if it does not match
     the number of device items

   - defrag can convert inline extents to regular extents, up to now
     inline files were skipped but the setting of mount option
     max_inline could affect the decision logic

   - zoned:
      - minimal accepted zone size is explicitly set to 4MiB
      - make zone reclaim less aggressive and don't reclaim if there are
        enough free zones
      - add per-profile sysfs tunable of the reclaim threshold

   - allow automatic block group reclaim for non-zoned filesystems, with
     sysfs tunables

   - tree-checker: new check, compare extent buffer owner against owner
     rootid

  Performance:

   - avoid blocking on space reservation when doing nowait direct io
     writes (+7% throughput for reads and writes)

   - NOCOW write throughput improvement due to refined locking (+3%)

   - send: reduce pressure to page cache by dropping extent pages right
     after they're processed

  Core:

   - convert all radix trees to xarray

   - add iterators for b-tree node items

   - support printk message index

   - user bulk page allocation for extent buffers

   - switch to bio_alloc API, use on-stack bios where convenient, other
     bio cleanups

   - use rw lock for block groups to favor concurrent reads

   - simplify workques, don't allocate high priority threads for all
     normal queues as we need only one

   - refactor scrub, process chunks based on their constraints and
     similarity

   - allocate direct io structures on stack and pass around only
     pointers, avoids allocation and reduces potential error handling

  Fixes:

   - fix count of reserved transaction items for various inode
     operations

   - fix deadlock between concurrent dio writes when low on free data
     space

   - fix a few cases when zones need to be finished

  VFS, iomap:

   - add helper to check if sb write has started (usable for assertions)

   - new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io"

* tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits)
  btrfs: zoned: introduce a minimal zone size 4M and reject mount
  btrfs: allow defrag to convert inline extents to regular extents
  btrfs: add "0x" prefix for unsupported optional features
  btrfs: do not account twice for inode ref when reserving metadata units
  btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer
  btrfs: send: avoid trashing the page cache
  btrfs: send: keep the current inode open while processing it
  btrfs: allocate the btrfs_dio_private as part of the iomap dio bio
  btrfs: move struct btrfs_dio_private to inode.c
  btrfs: remove the disk_bytenr in struct btrfs_dio_private
  btrfs: allocate dio_data on stack
  iomap: add per-iomap_iter private data
  iomap: allow the file system to provide a bio_set for direct I/O
  btrfs: add a btrfs_dio_rw wrapper
  btrfs: zoned: zone finish unused block group
  btrfs: zoned: properly finish block group on metadata write
  btrfs: zoned: finish block group when there are no more allocatable bytes left
  btrfs: zoned: consolidate zone finish functions
  btrfs: zoned: introduce btrfs_zoned_bg_is_full
  btrfs: improve error reporting in lookup_inline_extent_backref
  ...
2022-05-24 18:52:35 -07:00
Linus Torvalds
115cd47132 for-5.19/block-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrUsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgDjD/44hY9h0JsOLoRH1IvFtuaH6n718JXuqG17
 hHCfmnAUVqj2jT00IUbVlUTd905bCGpfrodBL3PAmPev1zZHOUd/MnJKrSynJ+/s
 NJEMZQaHxLmocNDpJ1sZo7UbAFErsZXB0gVYUO8cH2bFYNu84H1mhRCOReYyqmvQ
 aIAASX5qRB/ciBQCivzAJl2jTdn4WOn5hWi9RLidQB7kSbaXGPmgKAuN88WI4H7A
 zQgAkEl2EEquyMI5tV1uquS7engJaC/4PsenF0S9iTyrhJLjneczJBJZKMLeMR8d
 sOm6sKJdpkrfYDyaA4PIkgmLoEGTtwGpqGHl4iXTyinUAxJoca5tmPvBb3wp66GE
 2Mr7pumxc1yJID2VHbsERXlOAX3aZNCowx2gum2MTRIO8g11Eu3aaVn2kv37MBJ2
 4R2a/cJFl5zj9M8536cG+Yqpy0DDVCCQKUIqEupgEu1dyfpznyWH5BTAHXi1E8td
 nxUin7uXdD0AJkaR0m04McjS/Bcmc1dc6I8xvkdUFYBqYCZWpKOTiEpIBlHg0XJA
 sxdngyz5lSYTGVA4o4QCrdR0Tx1n36A1IYFuQj0wzxBJYZ02jEZuII/A3dd+8hiv
 EY+VeUQeVIXFFuOcY+e0ScPpn7Nr17hAd1en/j2Hcoe4ZE8plqG2QTcnwgflcbis
 iomvJ4yk0Q==
 =0Rw1
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Here are the core block changes for 5.19. This contains:

   - blk-throttle accounting fix (Laibin)

   - Series removing redundant assignments (Michal)

   - Expose bio cache via the bio_set, so that DM can use it (Mike)

   - Finish off the bio allocation interface cleanups by dealing with
     the weirdest member of the family. bio_kmalloc combines a kmalloc
     for the bio and bio_vecs with a hidden bio_init call and magic
     cleanup semantics (Christoph)

   - Clean up the block layer API so that APIs consumed by file systems
     are (almost) only struct block_device based, so that file systems
     don't have to poke into block layer internals like the
     request_queue (Christoph)

   - Clean up the blk_execute_rq* API (Christoph)

   - Clean up various lose end in the blk-cgroup code to make it easier
     to follow in preparation of reworking the blkcg assignment for bios
     (Christoph)

   - Fix use-after-free issues in BFQ when processes with merged queues
     get moved to different cgroups (Jan)

   - BFQ fixes (Jan)

   - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming,
     Wolfgang, me)"

* tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits)
  blk-mq: fix typo in comment
  bfq: Remove bfq_requeue_request_body()
  bfq: Remove superfluous conversion from RQ_BIC()
  bfq: Allow current waker to defend against a tentative one
  bfq: Relax waker detection for shared queues
  blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()
  blk-throttle: Set BIO_THROTTLED when bio has been throttled
  blk-cgroup: Remove unnecessary rcu_read_lock/unlock()
  blk-cgroup: always terminate io.stat lines
  block, bfq: make bfq_has_work() more accurate
  block, bfq: protect 'bfqd->queued' by 'bfqd->lock'
  block: cleanup the VM accounting in submit_bio
  block: Fix the bio.bi_opf comment
  block: reorder the REQ_ flags
  blk-iocost: combine local_stat and desc_stat to stat
  block: improve the error message from bio_check_eod
  block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone
  block: remove superfluous calls to blkcg_bio_issue_init
  kthread: unexport kthread_blkcg
  blk-cgroup: cleanup blkcg_maybe_throttle_current
  ...
2022-05-23 13:56:39 -07:00
Julia Lawall
759820c92a f2fs: fix typo in comment
Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-23 10:29:41 -07:00
Chao Liu
d9c454ab22 f2fs: make f2fs_read_inline_data() more readable
In f2fs_read_inline_data(), it is confused with checking of
inline_data flag, as we checked it before calling. So this
patch add some comments for f2fs_has_inline_data().

Signed-off-by: Chao Liu <liuchao@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-19 17:23:22 -07:00
Chao Yu
677a82b44e f2fs: fix to do sanity check for inline inode
Yanming reported a kernel bug in Bugzilla kernel [1], which can be
reproduced. The bug message is:

The kernel message is shown below:

kernel BUG at fs/inode.c:611!
Call Trace:
 evict+0x282/0x4e0
 __dentry_kill+0x2b2/0x4d0
 dput+0x2dd/0x720
 do_renameat2+0x596/0x970
 __x64_sys_rename+0x78/0x90
 do_syscall_64+0x3b/0x90

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215895

The bug is due to fuzzed inode has both inline_data and encrypted flags.
During f2fs_evict_inode(), as the inode was deleted by rename(), it
will cause inline data conversion due to conflicting flags. The page
cache will be polluted and the panic will be triggered in clear_inode().

Try fixing the bug by doing more sanity checks for inline data inode in
sanity_check_inode().

Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-18 15:36:11 -07:00
Chao Yu
958ed92922 f2fs: fix fallocate to use file_modified to update permissions consistently
This patch tries to fix permission consistency issue as all other
mainline filesystems.

Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files.  This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero, collapse, insert range).  Because the call can be used to change
file contents, we should treat it like we do any other modification to a
file -- update the mtime, and drop set[ug]id privileges/capabilities.

The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.

Cc: stable@kernel.org
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-18 15:36:09 -07:00
Eric Biggers
b5639bb431 f2fs: don't use casefolded comparison for "." and ".."
Tryng to rename a directory that has all following properties fails with
EINVAL and triggers the 'WARN_ON_ONCE(!fscrypt_has_encryption_key(dir))'
in f2fs_match_ci_name():

    - The directory is casefolded
    - The directory is encrypted
    - The directory's encryption key is not yet set up
    - The parent directory is *not* encrypted

The problem is incorrect handling of the lookup of ".." to get the
parent reference to update.  fscrypt_setup_filename() treats ".." (and
".") specially, as it's never encrypted.  It's passed through as-is, and
setting up the directory's key is not attempted.  As the name isn't a
no-key name, f2fs treats it as a "normal" name and attempts a casefolded
comparison.  That breaks the assumption of the WARN_ON_ONCE() in
f2fs_match_ci_name() which assumes that for encrypted directories,
casefolded comparisons only happen when the directory's key is set up.

We could just remove this WARN_ON_ONCE().  However, since casefolding is
always a no-op on "." and ".." anyway, let's instead just not casefold
these names.  This results in the standard bytewise comparison.

Fixes: 7ad08a58bf ("f2fs: Handle casefolding with Encryption")
Cc: <stable@vger.kernel.org> # v5.11+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-17 11:19:23 -07:00
Jaegeuk Kim
c81d5bae40 f2fs: do not stop GC when requiring a free section
The f2fs_gc uses a bitmap to indicate pinned sections, but when disabling
chckpoint, we call f2fs_gc() with NULL_SEGNO which selects the same dirty
segment as a victim all the time, resulting in checkpoint=disable failure,
for example. Let's pick another one, if we fail to collect it.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-17 11:19:19 -07:00
Christoph Hellwig
786f847f43 iomap: add per-iomap_iter private data
Allow the file system to keep state for all iterations.  For now only
wire it up for direct I/O as there is an immediate need for it there.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Jaegeuk Kim
c58d7c55de f2fs: keep wait_ms if EAGAIN happens
In f2fs_gc thread, let's keep wait_ms when sec_freed was zero.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-12 13:30:19 -07:00
Jaegeuk Kim
d147ea4adb f2fs: introduce f2fs_gc_control to consolidate f2fs_gc parameters
No functional change.

- remove checkpoint=disable check for f2fs_write_checkpoint
- get sec_freed all the time

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-12 13:29:14 -07:00
Eric Biggers
64e3ed0b8e f2fs: reject test_dummy_encryption when !CONFIG_FS_ENCRYPTION
There is no good reason to allow this mount option when the kernel isn't
configured with encryption support.  Since this option is only for
testing, we can just fix this; we don't really need to worry about
breaking anyone who might be counting on this option being ignored.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-12 10:14:03 -07:00
Jaegeuk Kim
7bc155fec5 f2fs: kill volatile write support
There's no user, since all can use atomic writes simply.
Let's kill it.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-12 10:14:03 -07:00
Daeho Jeong
3db1de0e58 f2fs: change the current atomic write way
Current atomic write has three major issues like below.
 - keeps the updates in non-reclaimable memory space and they are even
   hard to be migrated, which is not good for contiguous memory
   allocation.
 - disk spaces used for atomic files cannot be garbage collected, so
   this makes it difficult for the filesystem to be defragmented.
 - If atomic write operations hit the threshold of either memory usage
   or garbage collection failure count, All the atomic write operations
   will fail immediately.

To resolve the issues, I will keep a COW inode internally for all the
updates to be flushed from memory, when we need to flush them out in a
situation like high memory pressure. These COW inodes will be tagged
as orphan inodes to be reclaimed in case of sudden power-cut or system
failure during atomic writes.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-12 10:14:03 -07:00
Jaegeuk Kim
6213f5d4d2 f2fs: don't need inode lock for system hidden quota
Let's avoid false-alarmed lockdep warning.

[   58.914674] [T1501146] -> #2 (&sb->s_type->i_mutex_key#20){+.+.}-{3:3}:
[   58.915975] [T1501146] system_server:        down_write+0x7c/0xe0
[   58.916738] [T1501146] system_server:        f2fs_quota_sync+0x60/0x1a8
[   58.917563] [T1501146] system_server:        block_operations+0x16c/0x43c
[   58.918410] [T1501146] system_server:        f2fs_write_checkpoint+0x114/0x318
[   58.919312] [T1501146] system_server:        f2fs_issue_checkpoint+0x178/0x21c
[   58.920214] [T1501146] system_server:        f2fs_sync_fs+0x48/0x6c
[   58.920999] [T1501146] system_server:        f2fs_do_sync_file+0x334/0x738
[   58.921862] [T1501146] system_server:        f2fs_sync_file+0x30/0x48
[   58.922667] [T1501146] system_server:        __arm64_sys_fsync+0x84/0xf8
[   58.923506] [T1501146] system_server:        el0_svc_common.llvm.12821150825140585682+0xd8/0x20c
[   58.924604] [T1501146] system_server:        do_el0_svc+0x28/0xa0
[   58.925366] [T1501146] system_server:        el0_svc+0x24/0x38
[   58.926094] [T1501146] system_server:        el0_sync_handler+0x88/0xec
[   58.926920] [T1501146] system_server:        el0_sync+0x1b4/0x1c0

[   58.927681] [T1501146] -> #1 (&sbi->cp_global_sem){+.+.}-{3:3}:
[   58.928889] [T1501146] system_server:        down_write+0x7c/0xe0
[   58.929650] [T1501146] system_server:        f2fs_write_checkpoint+0xbc/0x318
[   58.930541] [T1501146] system_server:        f2fs_issue_checkpoint+0x178/0x21c
[   58.931443] [T1501146] system_server:        f2fs_sync_fs+0x48/0x6c
[   58.932226] [T1501146] system_server:        sync_filesystem+0xac/0x130
[   58.933053] [T1501146] system_server:        generic_shutdown_super+0x38/0x150
[   58.933958] [T1501146] system_server:        kill_block_super+0x24/0x58
[   58.934791] [T1501146] system_server:        kill_f2fs_super+0xcc/0x124
[   58.935618] [T1501146] system_server:        deactivate_locked_super+0x90/0x120
[   58.936529] [T1501146] system_server:        deactivate_super+0x74/0xac
[   58.937356] [T1501146] system_server:        cleanup_mnt+0x128/0x168
[   58.938150] [T1501146] system_server:        __cleanup_mnt+0x18/0x28
[   58.938944] [T1501146] system_server:        task_work_run+0xb8/0x14c
[   58.939749] [T1501146] system_server:        do_notify_resume+0x114/0x1e8
[   58.940595] [T1501146] system_server:        work_pending+0xc/0x5f0

[   58.941375] [T1501146] -> #0 (&sbi->gc_lock){+.+.}-{3:3}:
[   58.942519] [T1501146] system_server:        __lock_acquire+0x1270/0x2868
[   58.943366] [T1501146] system_server:        lock_acquire+0x114/0x294
[   58.944169] [T1501146] system_server:        down_write+0x7c/0xe0
[   58.944930] [T1501146] system_server:        f2fs_issue_checkpoint+0x13c/0x21c
[   58.945831] [T1501146] system_server:        f2fs_sync_fs+0x48/0x6c
[   58.946614] [T1501146] system_server:        f2fs_do_sync_file+0x334/0x738
[   58.947472] [T1501146] system_server:        f2fs_ioc_commit_atomic_write+0xc8/0x14c
[   58.948439] [T1501146] system_server:        __f2fs_ioctl+0x674/0x154c
[   58.949253] [T1501146] system_server:        f2fs_ioctl+0x54/0x88
[   58.950018] [T1501146] system_server:        __arm64_sys_ioctl+0xa8/0x110
[   58.950865] [T1501146] system_server:        el0_svc_common.llvm.12821150825140585682+0xd8/0x20c
[   58.951965] [T1501146] system_server:        do_el0_svc+0x28/0xa0
[   58.952727] [T1501146] system_server:        el0_svc+0x24/0x38
[   58.953454] [T1501146] system_server:        el0_sync_handler+0x88/0xec
[   58.954279] [T1501146] system_server:        el0_sync+0x1b4/0x1c0

Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-12 10:14:02 -07:00
Matthew Wilcox (Oracle)
c26cd04586 f2fs: Convert to release_folio
While converting f2fs_release_page() to f2fs_release_folio(), cache the
sb_info so we don't need to retrieve it twice, and remove the redundant
call to set_page_private().  The use of folios should be pushed further
into f2fs from here.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-09 23:12:33 -04:00
Matthew Wilcox (Oracle)
be05584f06 f2fs: Convert f2fs to read_folio
This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-05-09 16:21:45 -04:00
Jaegeuk Kim
2e42b7f817 f2fs: stop allocating pinned sections if EAGAIN happens
EAGAIN doesn't guarantee to have a free section. Let's report it.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-09 09:31:27 -07:00
Weichao Guo
2880f47b94 f2fs: skip GC if possible when checkpoint disabling
If the number of unusable blocks is not larger than
unusable capacity, we can skip GC when checkpoint
disabling.

Signed-off-by: Weichao Guo <guoweichao@oppo.com>
Signed-off-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: Fix missing gc_mode assignment]
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-09 09:30:38 -07:00
Matthew Wilcox (Oracle)
6e0ee04465 f2fs: Call aops write_begin() and write_end() directly
pagecache_write_begin() and pagecache_write_end() are now trivial
wrappers, so call the aops directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:45:56 -04:00
Matthew Wilcox (Oracle)
9d6b0cd757 fs: Remove flags parameter from aops->write_begin
There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:19 -04:00
Matthew Wilcox (Oracle)
b7446e7cf1 fs: Remove aop flags parameter from grab_cache_page_write_begin()
There are no more aop flags left, so remove the parameter.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:19 -04:00
Matthew Wilcox (Oracle)
df8081801d f2fs: Convert f2fs_grab_cache_page() to use scoped memory APIs
Prevent GFP_FS allocations by using memalloc_nofs_save() instead
of AOP_FLAG_NOFS.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-08 14:28:19 -04:00
Chao Yu
7141912962 f2fs: give priority to select unpinned section for foreground GC
Previously, during foreground GC, if victims contain data of pinned file,
it will fail migration of the data, and meanwhile i_gc_failures of that
pinned file may increase, and when it exceeds threshold, GC will unpin
the file, result in breaking pinfile's semantics.

In order to mitigate such condition, let's record and skip section which
has pinned file's data and give priority to select unpinned one.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 14:26:33 -07:00
Chao Yu
6b8beca0ed f2fs: fix to do sanity check on total_data_blocks
As Yanming reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215916

The kernel message is shown below:

kernel BUG at fs/f2fs/segment.c:2560!
Call Trace:
 allocate_segment_by_default+0x228/0x440
 f2fs_allocate_data_block+0x13d1/0x31f0
 do_write_page+0x18d/0x710
 f2fs_outplace_write_data+0x151/0x250
 f2fs_do_write_data_page+0xef9/0x1980
 move_data_page+0x6af/0xbc0
 do_garbage_collect+0x312f/0x46f0
 f2fs_gc+0x6b0/0x3bc0
 f2fs_balance_fs+0x921/0x2260
 f2fs_write_single_data_page+0x16be/0x2370
 f2fs_write_cache_pages+0x428/0xd00
 f2fs_write_data_pages+0x96e/0xd50
 do_writepages+0x168/0x550
 __writeback_single_inode+0x9f/0x870
 writeback_sb_inodes+0x47d/0xb20
 __writeback_inodes_wb+0xb2/0x200
 wb_writeback+0x4bd/0x660
 wb_workfn+0x5f3/0xab0
 process_one_work+0x79f/0x13e0
 worker_thread+0x89/0xf60
 kthread+0x26a/0x300
 ret_from_fork+0x22/0x30
RIP: 0010:new_curseg+0xe8d/0x15f0

The root cause is: ckpt.valid_block_count is inconsistent with SIT table,
stat info indicates filesystem has free blocks, but SIT table indicates
filesystem has no free segment.

So that during garbage colloection, it triggers panic when LFS allocator
fails to find free segment.

This patch tries to fix this issue by checking consistency in between
ckpt.valid_block_count and block accounted from SIT.

Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 14:26:32 -07:00
Chao Yu
cfd66bb715 f2fs: fix deadloop in foreground GC
As Yanming reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215914

The root cause is: in a very small sized image, it's very easy to
exceed threshold of foreground GC, if we calculate free space and
dirty data based on section granularity, in corner case,
has_not_enough_free_secs() will always return true, result in
deadloop in f2fs_gc().

So this patch refactors has_not_enough_free_secs() as below to fix
this issue:
1. calculate needed space based on block granularity, and separate
all blocks to two parts, section part, and block part, comparing
section part to free section, and comparing block part to free space
in openned log.
2. account F2FS_DIRTY_NODES, F2FS_DIRTY_IMETA and F2FS_DIRTY_DENTS
as node block consumer;
3. account F2FS_DIRTY_DENTS as data block consumer;

Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:20:05 -07:00
Chao Yu
25f8236213 f2fs: fix to do sanity check on block address in f2fs_do_zero_range()
As Yanming reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215894

I have encountered a bug in F2FS file system in kernel v5.17.

I have uploaded the system call sequence as case.c, and a fuzzed image can
be found in google net disk

The kernel should enable CONFIG_KASAN=y and CONFIG_KASAN_INLINE=y. You can
reproduce the bug by running the following commands:

kernel BUG at fs/f2fs/segment.c:2291!
Call Trace:
 f2fs_invalidate_blocks+0x193/0x2d0
 f2fs_fallocate+0x2593/0x4a70
 vfs_fallocate+0x2a5/0xac0
 ksys_fallocate+0x35/0x70
 __x64_sys_fallocate+0x8e/0xf0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The root cause is, after image was fuzzed, block mapping info in inode
will be inconsistent with SIT table, so in f2fs_fallocate(), it will cause
panic when updating SIT with invalid blkaddr.

Let's fix the issue by adding sanity check on block address before updating
SIT table with it.

Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:18:12 -07:00
Chao Yu
4d17e6fe92 f2fs: fix to avoid f2fs_bug_on() in dec_valid_node_count()
As Yanming reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215897

I have encountered a bug in F2FS file system in kernel v5.17.

The kernel should enable CONFIG_KASAN=y and CONFIG_KASAN_INLINE=y. You can
reproduce the bug by running the following commands:

The kernel message is shown below:

kernel BUG at fs/f2fs/f2fs.h:2511!
Call Trace:
 f2fs_remove_inode_page+0x2a2/0x830
 f2fs_evict_inode+0x9b7/0x1510
 evict+0x282/0x4e0
 do_unlinkat+0x33a/0x540
 __x64_sys_unlinkat+0x8e/0xd0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The root cause is: .total_valid_block_count or .total_valid_node_count
could fuzzed to zero, then once dec_valid_node_count() was called, it
will cause BUG_ON(), this patch fixes to print warning info and set
SBI_NEED_FSCK into CP instead of panic.

Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:18:12 -07:00
Byungki Lee
a9163b947a f2fs: write checkpoint during FG_GC
If there's not enough free sections each of which consistis of large segments,
we can hit no free section for upcoming section allocation. Let's reclaim some
prefree segments by writing checkpoints.

Signed-off-by: Byungki Lee <dominicus79@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:18:12 -07:00
Chao Yu
f2db71053d f2fs: fix to clear dirty inode in f2fs_evict_inode()
As Yanming reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215904

The kernel message is shown below:

kernel BUG at fs/f2fs/inode.c:825!
Call Trace:
 evict+0x282/0x4e0
 __dentry_kill+0x2b2/0x4d0
 shrink_dentry_list+0x17c/0x4f0
 shrink_dcache_parent+0x143/0x1e0
 do_one_tree+0x9/0x30
 shrink_dcache_for_umount+0x51/0x120
 generic_shutdown_super+0x5c/0x3a0
 kill_block_super+0x90/0xd0
 kill_f2fs_super+0x225/0x310
 deactivate_locked_super+0x78/0xc0
 cleanup_mnt+0x2b7/0x480
 task_work_run+0xc8/0x150
 exit_to_user_mode_prepare+0x14a/0x150
 syscall_exit_to_user_mode+0x1d/0x40
 do_syscall_64+0x48/0x90

The root cause is: inode node and dnode node share the same nid,
so during f2fs_evict_inode(), dnode node truncation will invalidate
its NAT entry, so when truncating inode node, it fails due to
invalid NAT entry, result in inode is still marked as dirty, fix
this issue by clearing dirty for inode and setting SBI_NEED_FSCK
flag in filesystem.

output from dump.f2fs:
[print_node_info: 354] Node ID [0xf:15] is inode
i_nid[0]                      		[0x       f : 15]

Cc: stable@vger.kernel.org
Reported-by: Ming Yan <yanming@tju.edu.cn>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:18:12 -07:00
Luis Chamberlain
7f262f7375 f2fs: ensure only power of 2 zone sizes are allowed
F2FS zoned support has power of 2 zone size assumption in many places
such as in __f2fs_issue_discard_zone, init_blkz_info. As the power of 2
requirement has been removed from the block layer, explicitly add a
condition in f2fs to allow only power of 2 zone size devices.

This condition will be relaxed once those calculation based on power of
2 is made generic.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:18:12 -07:00
Luis Chamberlain
d46db4595b f2fs: call bdev_zone_sectors() only once on init_blkz_info()
Instead of calling bdev_zone_sectors() multiple times, call
it once and cache the value locally. This will make the
subsequent change easier to read.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:18:11 -07:00
Niels Dossche
4de851459e f2fs: extend stat_lock to avoid potential race in statfs
There are multiple calculations and reads of fields of sbi that should
be protected by stat_lock. As stat_lock is not used to read these
values in statfs, this can lead to inconsistent results.
Extend the locking to prevent this issue.
Commit c9c8ed50d9 ("f2fs: fix to avoid potential race on
sbi->unusable_block_count access/update")
already added the use of sbi->stat_lock in statfs in
order to make the calculation of multiple, different fields atomic so
that results are consistent. This is similar to that patch regarding the
change in statfs.

Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:18:11 -07:00
Jaegeuk Kim
a7b8618aa2 f2fs: avoid infinite loop to flush node pages
xfstests/generic/475 can give EIO all the time which give an infinite loop
to flush node page like below. Let's avoid it.

[16418.518551] Call Trace:
[16418.518553]  ? dm_submit_bio+0x48/0x400
[16418.518574]  ? submit_bio_checks+0x1ac/0x5a0
[16418.525207]  __submit_bio+0x1a9/0x230
[16418.525210]  ? kmem_cache_alloc+0x29e/0x3c0
[16418.525223]  submit_bio_noacct+0xa8/0x2b0
[16418.525226]  submit_bio+0x4d/0x130
[16418.525238]  __submit_bio+0x49/0x310 [f2fs]
[16418.525339]  ? bio_add_page+0x6a/0x90
[16418.525344]  f2fs_submit_page_bio+0x134/0x1f0 [f2fs]
[16418.525365]  read_node_page+0x125/0x1b0 [f2fs]
[16418.525388]  __get_node_page.part.0+0x58/0x3f0 [f2fs]
[16418.525409]  __get_node_page+0x2f/0x60 [f2fs]
[16418.525431]  f2fs_get_dnode_of_data+0x423/0x860 [f2fs]
[16418.525452]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[16418.525458]  ? __mod_memcg_state.part.0+0x2a/0x30
[16418.525465]  ? __mod_memcg_lruvec_state+0x27/0x40
[16418.525467]  ? __xa_set_mark+0x57/0x70
[16418.525472]  f2fs_do_write_data_page+0x10e/0x7b0 [f2fs]
[16418.525493]  f2fs_write_single_data_page+0x555/0x830 [f2fs]
[16418.525514]  ? sysvec_apic_timer_interrupt+0x4e/0x90
[16418.525518]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[16418.525523]  f2fs_write_cache_pages+0x303/0x880 [f2fs]
[16418.525545]  ? blk_flush_plug_list+0x47/0x100
[16418.525548]  f2fs_write_data_pages+0xfd/0x320 [f2fs]
[16418.525569]  do_writepages+0xd5/0x210
[16418.525648]  filemap_fdatawrite_wbc+0x7d/0xc0
[16418.525655]  filemap_fdatawrite+0x50/0x70
[16418.525658]  f2fs_sync_dirty_inodes+0xa4/0x230 [f2fs]
[16418.525679]  f2fs_write_checkpoint+0x16d/0x1720 [f2fs]
[16418.525699]  ? ttwu_do_wakeup+0x1c/0x160
[16418.525709]  ? ttwu_do_activate+0x6d/0xd0
[16418.525711]  ? __wait_for_common+0x11d/0x150
[16418.525715]  kill_f2fs_super+0xca/0x100 [f2fs]
[16418.525733]  deactivate_locked_super+0x3b/0xb0
[16418.525739]  deactivate_super+0x40/0x50
[16418.525741]  cleanup_mnt+0x139/0x190
[16418.525747]  __cleanup_mnt+0x12/0x20
[16418.525749]  task_work_run+0x6d/0xa0
[16418.525765]  exit_to_user_mode_prepare+0x1ad/0x1b0
[16418.525771]  syscall_exit_to_user_mode+0x27/0x50
[16418.525774]  do_syscall_64+0x48/0xc0
[16418.525776]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:18:11 -07:00
Jaegeuk Kim
c550e25bca f2fs: use flush command instead of FUA for zoned device
The block layer for zoned disk can reorder the FUA'ed IOs. Let's use flush
command to keep the write order.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:18:11 -07:00
Dongliang Mu
dc2f78e2d4 f2fs: remove WARN_ON in f2fs_is_valid_blkaddr
Syzbot triggers two WARNs in f2fs_is_valid_blkaddr and
__is_bitmap_valid. For example, in f2fs_is_valid_blkaddr,
if type is DATA_GENERIC_ENHANCE or DATA_GENERIC_ENHANCE_READ,
it invokes WARN_ON if blkaddr is not in the right range.
The call trace is as follows:

 f2fs_get_node_info+0x45f/0x1070
 read_node_page+0x577/0x1190
 __get_node_page.part.0+0x9e/0x10e0
 __get_node_page
 f2fs_get_node_page+0x109/0x180
 do_read_inode
 f2fs_iget+0x2a5/0x58b0
 f2fs_fill_super+0x3b39/0x7ca0

Fix these two WARNs by replacing WARN_ON with dump_stack.

Reported-by: syzbot+763ae12a2ede1d99d4dc@syzkaller.appspotmail.com
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-05-06 10:18:11 -07:00
Jakob Koschel
9e3a845df9 f2fs: replace usage of found with dedicated list iterator variable
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.

To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean [1].

This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25 15:13:04 -07:00
Jakob Koschel
df35435d41 f2fs: Remove usage of list iterator pas the loop for list_move_tail()
In preparation to limit the scope of a list iterator to the list
traversal loop, the usage of the list iterator variable 'next' should
be avoided past the loop body [1].

Instead of calling list_move_tail() on 'next' after the loop, it is
called within the loop if the correct location was found.
After the loop it covers the case if no location was found and it
should be inserted based on the 'head' of the list.

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25 15:13:03 -07:00
Jakob Koschel
2aaf51dd39 f2fs: fix dereference of stale list iterator after loop body
The list iterator variable will be a bogus pointer if no break was hit.
Dereferencing it (cur->page in this case) could load an out-of-bounds/undefined
value making it unsafe to use that in the comparision to determine if the
specific element was found.

Since 'cur->page' *can* be out-ouf-bounds it cannot be guaranteed that
by chance (or intention of an attacker) it matches the value of 'page'
even though the correct element was not found.

This is fixed by using a separate list iterator variable for the loop
and only setting the original variable if a suitable element was found.
Then determing if the element was found is simply checking if the
variable is set.

Fixes: 8c242db9b8 ("f2fs: fix stale ATOMIC_WRITTEN_PAGE private pointer")
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25 15:13:03 -07:00
Chao Yu
12662d1946 f2fs: fix to do sanity check on inline_dots inode
As Wenqing reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215765

It will cause a kernel panic with steps:
- mkdir mnt
- mount tmp40.img mnt
- ls mnt

folio_mark_dirty+0x33/0x50
f2fs_add_regular_entry+0x541/0xad0 [f2fs]
f2fs_add_dentry+0x6c/0xb0 [f2fs]
f2fs_do_add_link+0x182/0x230 [f2fs]
__recover_dot_dentries+0x2d6/0x470 [f2fs]
f2fs_lookup+0x5af/0x6a0 [f2fs]
__lookup_slow+0xac/0x200
lookup_slow+0x45/0x70
walk_component+0x16c/0x250
path_lookupat+0x8b/0x1f0
filename_lookup+0xef/0x250
user_path_at_empty+0x46/0x70
vfs_statx+0x98/0x190
__do_sys_newlstat+0x41/0x90
__x64_sys_newlstat+0x1a/0x30
do_syscall_64+0x37/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xae

The root cause is for special file: e.g. character, block, fifo or
socket file, f2fs doesn't assign address space operations pointer array
for mapping->a_ops field, so, in a fuzzed image, if inline_dots flag was
tagged in special file, during lookup(), when f2fs runs into
__recover_dot_dentries(), it will cause NULL pointer access once
f2fs_add_regular_entry() calls a_ops->set_dirty_page().

Fixes: 510022a858 ("f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries")
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25 15:13:03 -07:00
Jaegeuk Kim
c277f1411d f2fs: introduce data read/write showing path info
This was used in Android for a long time. Let's upstream it.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25 15:13:03 -07:00
Jaegeuk Kim
c2ca36e82f f2fs: remove unnecessary f2fs_lock_op in f2fs_new_inode
This can be removed, since f2fs_alloc_nid() actually doesn't require to block
checkpoint and __f2fs_build_free_nids() is covered by nm_i->nat_tree_lock.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25 15:13:02 -07:00
Chao Yu
642c096991 f2fs: don't set GC_FAILURE_PIN for background GC
So that it can reduce the possibility that file be unpinned forcely by
foreground GC due to .i_gc_failures[GC_FAILURE_PIN] exceeds threshold.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25 15:13:02 -07:00
Chao Yu
a22bb5526d f2fs: check pinfile in gc_data_segment() in advance
In order to skip migrating section which contains data of pinned
file in advance.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-25 15:13:02 -07:00
Jaegeuk Kim
4d8ec91208 f2fs: should not truncate blocks during roll-forward recovery
If the file preallocated blocks and fsync'ed, we should not truncate them during
roll-forward recovery which will recover i_size correctly back.

Fixes: d4dd19ec1e ("f2fs: do not expose unwritten blocks to user by DIO")
Cc: <stable@vger.kernel.org> # 5.17+
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-21 18:57:09 -07:00
Jaegeuk Kim
27275f181c f2fs: fix wrong condition check when failing metapage read
This patch fixes wrong initialization.

Fixes: 50c63009f6 ("f2fs: avoid an infinite loop in f2fs_sync_dirty_inodes")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-20 11:16:43 -07:00
Jaegeuk Kim
0adc2ab0e8 f2fs: keep io_flags to avoid IO split due to different op_flags in two fio holders
Let's attach io_flags to bio only, so that we can merge IOs given original
io_flags only.

Fixes: 64bf0eef01 ("f2fs: pass the bio operation to bio_alloc_bioset")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-20 11:16:43 -07:00
Jaegeuk Kim
930e260763 f2fs: remove obsolete whint_mode
This patch removes obsolete whint_mode.

Fixes: 41d36a9f3e ("fs: remove kiocb.ki_hint")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-04-20 11:16:43 -07:00
Christoph Hellwig
44abff2c0b block: decouple REQ_OP_SECURE_ERASE from REQ_OP_DISCARD
Secure erase is a very different operation from discard in that it is
a data integrity operation vs hint.  Fully split the limits and helper
infrastructure to make the separation more clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nifs2]
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org> [f2fs]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Chao Yu <chao@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-27-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Christoph Hellwig
7b47ef52d0 block: add a bdev_discard_granularity helper
Abstract away implementation details from file systems by providing a
block_device based helper to retrieve the discard granularity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Link: https://lore.kernel.org/r/20220415045258.199825-26-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Christoph Hellwig
70200574cc block: remove QUEUE_FLAG_DISCARD
Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.

The only places where needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Christoph Hellwig
cf0fbf894b block: add a bdev_max_discard_sectors helper
Add a helper to query the number of sectors support per each discard bio
based on the block device and use this helper to stop various places from
poking into the request_queue to see if discard is supported and if so how
much.  This mirrors what is done e.g. for write zeroes as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-24-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17 19:49:59 -06:00
Matthew Wilcox (Oracle)
29c87793eb f2fs: Get the superblock from the mapping instead of the page
It's slightly more efficient to go directly from the mapping to the
superblock than to go from the page.  Now that these routines have
the mapping passed to them, there's no reason not to use it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2022-04-01 14:40:44 -04:00
Matthew Wilcox (Oracle)
0fb5b2ebc0 f2fs: Correct f2fs_dirty_data_folio() conversion
I got the return value wrong.  Very little checks the return value
from set_page_dirty(), so nobody noticed during testing.

Fixes: 4f5e34f713 ("f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2022-04-01 14:40:44 -04:00
Matthew Wilcox (Oracle)
800ba29547 fs: Pass an iocb to generic_perform_write()
We can extract both the file pointer and the pos from the iocb.
This simplifies each caller as well as allowing generic_perform_write()
to see more of the iocb contents in the future.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2022-04-01 14:40:44 -04:00
Matthew Wilcox (Oracle)
704528d895 fs: Remove ->readpages address space operation
All filesystems have now been converted to use ->readahead, so
remove the ->readpages operation and fix all the comments that
used to refer to it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2022-04-01 13:45:33 -04:00
Linus Torvalds
561593a048 for-5.18/write-streams-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI1AHwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplPjEACVJzKg5NkxpdkDThvq5tejws9KxB/4mHJg
 NoDMcv1TF+Orsd/HNW6XrgYnbU0ObHom3568/xb8BNegRVFe7V4ME/4IYNRyGOmV
 qbfciu04L1UkJhI52CIidkOioFABL3r1zgLCIz5vk0Cv9X7Le9x0UabHxJf7u9S+
 Z3lNdyxezN0SGx8VT86l/7lSoHtG3VHO9IsQCuNGF02SB+6uGpXBlptbEoQ4nTxd
 T7/H9FNOe2Wf7eKvcOOds8UlvZYAfYcY0GcRrIOXdHIy25mKFWwn5cDgFTMOH5ID
 xXpm+JFkDkrfSW1o4FFPxbN9Z6RbVXbGCsrXlIragLO2MJQdXiIUxS1OPT5oAado
 H9MlX6QtkwziLW9zUWa/N/jmRjc2vzHAxD6JFg/wXxNdtY0kd8TQpaxwTB8mVDPN
 VCGutt7lJS1CQInQ+ppzbdqzzuLHC1RHAyWSmfUE9rb8cbjxtJBnSIorYRLUesMT
 GRwqVTXW0osxSgCb1iDiBCJANrX1yPZcemv4Wh1gzbT6IE9sWxWXsE5sy9KvswNc
 M+E4nu/TYYTfkynItJjLgmDLOoi+V0FBY6ba0mRPBjkriSP4AVlwsZLGVsAHQzuA
 o5paW1GjRCCwhIQ6+AzZIoOz6wqvprBlUgUkUneyYAQ2ZKC3pZi8zPnpoVdFucVa
 VaTzP71C1Q==
 =efaq
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block

Pull NVMe write streams removal from Jens Axboe:
 "This removes the write streams support in NVMe. No vendor ever really
  shipped working support for this, and they are not interested in
  supporting it.

  With the NVMe support gone, we have nothing in the tree that supports
  this. Remove passing around of the hints.

  The only discussion point in this patchset imho is the fact that the
  file specific write hint setting/getting fcntl helpers will now return
  -1/EINVAL like they did before we supported write hints. No known
  applications use these functions, I only know of one prototype that I
  help do for RocksDB, and that's not used. That said, with a change
  like this, it's always a bit controversial. Alternatively, we could
  just make them return 0 and pretend it worked. It's placement based
  hints after all"

* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
  fs: remove fs.f_write_hint
  fs: remove kiocb.ki_hint
  block: remove the per-bio/request write hint
  nvme: remove support or stream based temperature hint
2022-03-26 11:51:46 -07:00
Linus Torvalds
6b1f86f8e9 Filesystem folio changes for 5.18
Primarily this series converts some of the address_space operations
 to take a folio instead of a page.
 
 ->is_partially_uptodate() takes a folio instead of a page and changes the
 type of the 'from' and 'count' arguments to make it obvious they're bytes.
 ->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
 ->launder_page() becomes ->launder_folio()
 ->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
 an argument.
 
 There are a couple of other misc changes up front that weren't worth
 separating into their own pull request.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
 gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
 BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
 SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
 EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
 omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
 Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
 =cOiq
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache

Pull filesystem folio updates from Matthew Wilcox:
 "Primarily this series converts some of the address_space operations to
  take a folio instead of a page.

  Notably:

   - a_ops->is_partially_uptodate() takes a folio instead of a page and
     changes the type of the 'from' and 'count' arguments to make it
     obvious they're bytes.

   - a_ops->invalidatepage() becomes ->invalidate_folio() and has a
     similar type change.

   - a_ops->launder_page() becomes ->launder_folio()

   - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
     address_space as an argument.

  There are a couple of other misc changes up front that weren't worth
  separating into their own pull request"

* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
  fs: Remove aops ->set_page_dirty
  fb_defio: Use noop_dirty_folio()
  fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
  fs: Convert __set_page_dirty_buffers to block_dirty_folio
  nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
  mm: Convert swap_set_page_dirty() to swap_dirty_folio()
  ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
  f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
  f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
  f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
  afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
  btrfs: Convert extent_range_redirty_for_io() to use folios
  fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
  btrfs: Convert from set_page_dirty to dirty_folio
  fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
  fs: Add aops->dirty_folio
  fs: Remove aops->launder_page
  orangefs: Convert launder_page to launder_folio
  nfs: Convert from launder_page to launder_folio
  fuse: Convert from launder_page to launder_folio
  ...
2022-03-22 18:26:56 -07:00
Linus Torvalds
3bf03b9a08 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs

 - Most the MM patches which precede the patches in Willy's tree: kasan,
   pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
   sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb,
   userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp,
   cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap,
   zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits)
  mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release()
  Docs/ABI/testing: add DAMON sysfs interface ABI document
  Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface
  selftests/damon: add a test for DAMON sysfs interface
  mm/damon/sysfs: support DAMOS stats
  mm/damon/sysfs: support DAMOS watermarks
  mm/damon/sysfs: support schemes prioritization
  mm/damon/sysfs: support DAMOS quotas
  mm/damon/sysfs: support DAMON-based Operation Schemes
  mm/damon/sysfs: support the physical address space monitoring
  mm/damon/sysfs: link DAMON for virtual address spaces monitoring
  mm/damon: implement a minimal stub for sysfs-based DAMON interface
  mm/damon/core: add number of each enum type values
  mm/damon/core: allow non-exclusive DAMON start/stop
  Docs/damon: update outdated term 'regions update interval'
  Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling
  Docs/vm/damon: call low level monitoring primitives the operations
  mm/damon: remove unnecessary CONFIG_DAMON option
  mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}()
  mm/damon/dbgfs-test: fix is_target_id() change
  ...
2022-03-22 16:11:53 -07:00
Muchun Song
65d3af647b f2fs: allocate inode by using alloc_inode_sb()
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() to alloc_inode_sb().

Link: https://lkml.kernel.org/r/20220228122126.37293-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:03 -07:00
NeilBrown
a64239d0ef f2fs: replace congestion_wait() calls with io_schedule_timeout()
As congestion is no longer tracked, congestion_wait() is effectively
equivalent to io_schedule_timeout().

So introduce f2fs_io_schedule_timeout() which sets TASK_UNINTERRUPTIBLE
and call that instead.

Link: https://lkml.kernel.org/r/164549983744.9187.6425865370954230902.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:01 -07:00
Linus Torvalds
ef510682af f2fs-for-5.18
In this cycle, f2fs has some performance improvements for Android workloads such
 as using read-unfair rwsems and adding some sysfs entries to control GCs and
 discard commands in more details. In addtiion, it has some tunings to improve
 the recovery speed after sudden power-cut.
 
 Enhancement:
  - add reader-unfair rwsems with F2FS_UNFAIR_RWSEM
   : will replace with generic API support
  - adjust to make the readahead/recovery flow more efficiently
  - sysfs entries to control issue speeds of GCs and Discard commands
  - enable idmapped mounts
 
 Bug fix:
  - correct wrong error handling routines
  - fix missing conditions in quota
  - fix a potential deadlock between writeback and block plug routines
  - fix a deadlock btween freezefs and evict_inode
 
 We've added some boundary checks to avoid kernel panics on corrupted images,
 and several minor code clean-ups.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmI44c4ACgkQQBSofoJI
 UNIqdBAAgBjV/76Gphbpg2lR5+13pWBV0jp66yYaaPiqmM6IsSPYKTlGMJpEBy41
 x6M+MRc+NjtwSEHAWOptOIPbP9zwXYJn/KSDMCAP3+454YhBFDLqDAkAxBt1frYT
 0EkwCIYw/LqmVnuIttQ01gnT8v5zH4d/x4+gsdM+b7flmpCP/AoZDvI19Zd66F0y
 RdOdQQWyhvmmetZbaeaPoxbjS8LJ9b0ZMcxidTv9a+5GylCAXNicBdM9x1iVVmJ1
 dT1n2w7USKVdL4ydpwPUiec6RwACRk49CL3FgyyGNRlcpMmU9ArcY2l/Qr+At7ky
 tgPODXme/EvH12DsfoixjkNSLc4a7RHPfiJ3qy8XC6dshWYMKIegjateG8lVhf0P
 kdifMRCdOa+/l+RoyD1IjKTXPmVl9ihh6RBYDr6YrFclxg3uI4CvJCXht4dSXOCE
 5vLIVZEf5yk+6Ee2ozcNTG2hZ8gd+aNy1WqBN3/5lFxhBYVNlTnUYd0URzenwIdW
 i2QP99mFrntCL25lhF7f7AeTHxSg/UVXnRA1oQZ+6qIPPLhNdApfd1lov/6+Hhe4
 0zDbCbmIfVko/vZJeYOppaj+6jSZ3FafMfH5dDYyis4S4RbX2sjR9wGSd8PEdOTw
 /4dZXXfB2XslPb3KQsJSyGz75af3PxZ8PHLxj0HBSQXOA140htY=
 =t75l
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this cycle, f2fs has some performance improvements for Android
  workloads such as using read-unfair rwsems and adding some sysfs
  entries to control GCs and discard commands in more details. In
  addtiion, it has some tunings to improve the recovery speed after
  sudden power-cut.

  Enhancement:
   - add reader-unfair rwsems with F2FS_UNFAIR_RWSEM: will replace with
     generic API support
   - adjust to make the readahead/recovery flow more efficiently
   - sysfs entries to control issue speeds of GCs and Discard commands
   - enable idmapped mounts

  Bug fix:
   - correct wrong error handling routines
   - fix missing conditions in quota
   - fix a potential deadlock between writeback and block plug routines
   - fix a deadlock btween freezefs and evict_inode

  We've added some boundary checks to avoid kernel panics on corrupted
  images, and several minor code clean-ups"

* tag 'f2fs-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (27 commits)
  f2fs: fix to do sanity check on .cp_pack_total_block_count
  f2fs: make gc_urgent and gc_segment_mode sysfs node readable
  f2fs: use aggressive GC policy during f2fs_disable_checkpoint()
  f2fs: fix compressed file start atomic write may cause data corruption
  f2fs: initialize sbi->gc_mode explicitly
  f2fs: introduce gc_urgent_mid mode
  f2fs: compress: fix to print raw data size in error path of lz4 decompression
  f2fs: remove redundant parameter judgment
  f2fs: use spin_lock to avoid hang
  f2fs: don't get FREEZE lock in f2fs_evict_inode in frozen fs
  f2fs: remove unnecessary read for F2FS_FITS_IN_INODE
  f2fs: introduce F2FS_UNFAIR_RWSEM to support unfair rwsem
  f2fs: avoid an infinite loop in f2fs_sync_dirty_inodes
  f2fs: fix to do sanity check on curseg->alloc_type
  f2fs: fix to avoid potential deadlock
  f2fs: quota: fix loop condition at f2fs_quota_sync()
  f2fs: Restore rwsem lockdep support
  f2fs: fix missing free nid in f2fs_handle_failed_inode
  f2fs: support idmapped mounts
  f2fs: add a way to limit roll forward recovery time
  ...
2022-03-22 10:00:31 -07:00
Linus Torvalds
881b568756 fscrypt updates for 5.18
Add support for direct I/O on encrypted files when blk-crypto (inline
 encryption) is being used for file contents encryption.
 
 There will be a merge conflict with the block pull request in
 fs/iomap/direct-io.c, due to some bio interface cleanups.  The merge
 resolution is straightforward and can be found in linux-next.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYjgCfhQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK6vjAQDp8D8OKIyj67KjYwvyHpNy0fZhxeur
 RexC0nDfd9BE/AD/fV6zpCglmsuGxqxL0jmqeczKXn2y7nRFmPciCBTi/wY=
 =kwNd
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:
 "Add support for direct I/O on encrypted files when blk-crypto (inline
  encryption) is being used for file contents encryption"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: update documentation for direct I/O support
  f2fs: support direct I/O with fscrypt using blk-crypto
  ext4: support direct I/O with fscrypt using blk-crypto
  iomap: support direct I/O with fscrypt using blk-crypto
  fscrypt: add functions for direct I/O support
2022-03-22 09:50:16 -07:00
Linus Torvalds
d347ee54a7 for-5.18/alloc-cleanups-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI0/7IQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpm+FD/9wmpVQE30Y4Atgw4VygOXwlS6mhNHIVGUV
 Khd34mwVSh21hjBnyenWJX7KlcVT5YsfEmEz/CNvhxG7QwqjjFuu085n5eImJkHZ
 yP9N2jLzR5mECs9nZ9FGZO7h6hl8gEX5agR/QndoVBMyA3ho4SsXyHUoDoobgQ+R
 SuwAB0gvxA72wMGgIPtJ/y3EJxQyja+6kpOQiO+avQgccGXloi44YRLhqUJ5RkCE
 iJxeiTB4xJt46f2RnU6cS8yy4WqsTY4PLCnM2mpgkQKDfesqA+4wZh1K7TcNKVN1
 MntawBwHEtp0JqZxQQqr6PFqqOLoIs22+JJagFxzk00jHy+s10dTL1O8Yj/If9Y2
 l7ivguJog9yXPTdJ7THR2x1RTTxYN7p4DfU6L6cw7VBmG51mSpKwePlK15dqUIC9
 mpYUgqrR4v1LYXLSpDHqPFRtARKvyzNmmk61qmA3SyWLOXbOuyMYWHYW5ta62S9B
 MnqaFjrKHrRzI2co8/Kqyv9t5ffb7eUvu9OUAmM5GnEpl/yg8iEUN+Rk55f0AnNa
 1ABRUpiB6AOxRZZ/pqUBCB7wtkoyEk/O/Jqax0FrZWTYtdTQebMC8fbfPXCIdc3A
 HH7e7Kc8mr4QLhTS9oJ/pvNrm4kdkunTZe8CGQmocGiJjU5sWrMFRNbk2N7/fuAi
 kAYOByEjDA==
 =PzC6
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/alloc-cleanups-2022-03-18' of git://git.kernel.dk/linux-block

Pull bio_alloc() cleanups from Jens Axboe:
 "Filesystem cleanups to pass the bio op to bio_alloc() instead of
  setting it just before bio submission".

* tag 'for-5.18/alloc-cleanups-2022-03-18' of git://git.kernel.dk/linux-block:
  f2fs: pass the bio operation to bio_alloc_bioset
  f2fs: don't pass a bio to f2fs_target_device
  nilfs2: pass the operation to bio_alloc
  ext4: pass the operation to bio_alloc
  mpage: pass the operation to bio_alloc
2022-03-21 17:21:05 -07:00
Linus Torvalds
616355cc81 for-5.18/block-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI0+GcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprUpD/9aTJEnj7VCw7UouSsg098sdjtoy9ilslU3
 ew47K8CIXHbCB4CDqLnFyvCwAdG1XGgS+fUmFAxvTr29R9SZeS5d+bXL6sZzEo0C
 bwxsJy9MM2QRtMvB+giAt1myXbwB8cG+ketMBWXqwXXRHRzPbbQfMZia7FqWMnfY
 KQanH9IwYHp1oa5U/W6Qcjm4oCnLgBMRwqByzUCtiF3y9qgaLkK+3IgkNwjJQjLA
 DTeUJ/9CgxGQQbzA+LPktbw2xfTqiUfcKq0mWx6Zt4wwNXn1ClqUDUXX6QSM8/5u
 3OimbscSkEPPTIYZbVBPkhFnAlQb4JaJEgOrbXvYKVV2Dh+eZY81XwNeE/E8gdBY
 TnHOTOCjkN/4sR3hIrWazlJzPLdpPA0eOYrhguCraQsX9mcsYNxlJ9otRv/Ve99g
 uqL0RZg3+NoK84fm79FCGy/ZmPQJvJttlBT9CKVwylv/Lky42xWe7AdM3OipKluY
 2nh+zN5Ai7WxZdTKXQFRhCSWfWQ+1qW51tB3dcGW+BooZr/oox47qKQVcHsEWbq1
 RNR45F5a4AuPwYUHF/P36WviLnEuq9AvX7OTTyYOplyVQohKIoDXp9chVzLNzBiZ
 KBR00W6MLKKKN+8foalQWgNyb2i2PH7Ib4xRXvXj/22Vwxg5UmUoBmSDSas9SZUS
 +dMo7CtNgA==
 =DpgP
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo)

 - blk-rq-qos completion fix (Tejun)

 - blk-cgroup merge fix (Tejun)

 - Add offline error return value to distinguish it from an IO error on
   the device (Song)

 - IO stats fixes (Zhang, Christoph)

 - blkcg refcount fixes (Ming, Yu)

 - Fix for indefinite dispatch loop softlockup (Shin'ichiro)

 - blk-mq hardware queue management improvements (Ming)

 - sbitmap dead code removal (Ming, John)

 - Plugging merge improvements (me)

 - Show blk-crypto capabilities in sysfs (Eric)

 - Multiple delayed queue run improvement (David)

 - Block throttling fixes (Ming)

 - Start deprecating auto module loading based on dev_t (Christoph)

 - bio allocation improvements (Christoph, Chaitanya)

 - Get rid of bio_devname (Christoph)

 - bio clone improvements (Christoph)

 - Block plugging improvements (Christoph)

 - Get rid of genhd.h header (Christoph)

 - Ensure drivers use appropriate flush helpers (Christoph)

 - Refcounting improvements (Christoph)

 - Queue initialization and teardown improvements (Ming, Christoph)

 - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng,
   Lukas, Nian, Yang, Eric, Chengming)

* tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits)
  block: cancel all throttled bios in del_gendisk()
  block: let blkcg_gq grab request queue's refcnt
  block: avoid use-after-free on throttle data
  block: limit request dispatch loop duration
  block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative"
  sr: simplify the local variable initialization in sr_block_open()
  block: don't merge across cgroup boundaries if blkcg is enabled
  block: fix rq-qos breakage from skipping rq_qos_done_bio()
  block: flush plug based on hardware and software queue order
  block: ensure plug merging checks the correct queue at least once
  block: move rq_qos_exit() into disk_release()
  block: do more work in elevator_exit
  block: move blk_exit_queue into disk_release
  block: move q_usage_counter release into blk_queue_release
  block: don't remove hctx debugfs dir from blk_mq_exit_queue
  block: move blkcg initialization/destroy into disk allocation/release handler
  sr: implement ->free_disk to simplify refcounting
  sd: implement ->free_disk to simplify refcounting
  sd: delay calling free_opal_dev
  sd: call sd_zbc_release_disk before releasing the scsi_device reference
  ...
2022-03-21 16:48:55 -07:00
Chao Yu
5b5b4f85b0 f2fs: fix to do sanity check on .cp_pack_total_block_count
As bughunter reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215709

f2fs may hang when mounting a fuzzed image, the dmesg shows as below:

__filemap_get_folio+0x3a9/0x590
pagecache_get_page+0x18/0x60
__get_meta_page+0x95/0x460 [f2fs]
get_checkpoint_version+0x2a/0x1e0 [f2fs]
validate_checkpoint+0x8e/0x2a0 [f2fs]
f2fs_get_valid_checkpoint+0xd0/0x620 [f2fs]
f2fs_fill_super+0xc01/0x1d40 [f2fs]
mount_bdev+0x18a/0x1c0
f2fs_mount+0x15/0x20 [f2fs]
legacy_get_tree+0x28/0x50
vfs_get_tree+0x27/0xc0
path_mount+0x480/0xaa0
do_mount+0x7c/0xa0
__x64_sys_mount+0x8b/0xe0
do_syscall_64+0x38/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae

The root cause is cp_pack_total_block_count field in checkpoint was fuzzed
to one, as calcuated, two cp pack block locates in the same block address,
so then read latter cp pack block, it will block on the page lock due to
the lock has already held when reading previous cp pack block, fix it by
adding sanity check for cp_pack_total_block_count.

Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-21 09:10:21 -07:00
Daeho Jeong
e60aeb2dee f2fs: make gc_urgent and gc_segment_mode sysfs node readable
Changed a way of showing values of them to use strings.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-21 09:09:54 -07:00
Chao Yu
98e92867b9 f2fs: use aggressive GC policy during f2fs_disable_checkpoint()
Let's enable GC_URGENT_HIGH mode during f2fs_disable_checkpoint(),
so that we can use SSR allocator for GCed data/node persistence,
it can improve the performance due to it avoiding migration of
data/node locates in selected target segment of SSR allocator.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-18 09:13:02 -07:00
Fengnan Chang
9b56adcf52 f2fs: fix compressed file start atomic write may cause data corruption
When compressed file has blocks, f2fs_ioc_start_atomic_write will succeed,
but compressed flag will be remained in inode. If write partial compreseed
cluster and commit atomic write will cause data corruption.

This is the reproduction process:
Step 1:
create a compressed file ,write 64K data , call fsync(), then the blocks
are write as compressed cluster.
Step2:
iotcl(F2FS_IOC_START_ATOMIC_WRITE)  --- this should be fail, but not.
write page 0 and page 3.
iotcl(F2FS_IOC_COMMIT_ATOMIC_WRITE)  -- page 0 and 3 write as normal file,
Step3:
drop cache.
read page 0-4   -- Since page 0 has a valid block address, read as
non-compressed cluster, page 1 and 2 will be filled with compressed data
or zero.

The root cause is, after commit 7eab7a6968 ("f2fs: compress: remove
unneeded read when rewrite whole cluster"), in step 2, f2fs_write_begin()
only set target page dirty, and in f2fs_commit_inmem_pages(), we will write
partial raw pages into compressed cluster, result in corrupting compressed
cluster layout.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Fixes: 7eab7a6968 ("f2fs: compress: remove unneeded read when rewrite whole cluster")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-18 09:12:48 -07:00
Chao Yu
c86868bbc2 f2fs: initialize sbi->gc_mode explicitly
It needs to initialized sbi->gc_mode to GC_NORMAL explicitly.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-17 23:48:03 -07:00
Daeho Jeong
d98af5f455 f2fs: introduce gc_urgent_mid mode
We need a mid level of gc urgent mode to do GC forcibly in a period
of given gc_urgent_sleep_time, but not like using greedy GC approach
and switching to SSR mode such as gc urgent high mode. This can be
used for more aggressive periodic storage clean up.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-17 09:16:22 -07:00
Chao Yu
d284af43f7 f2fs: compress: fix to print raw data size in error path of lz4 decompression
In lz4_decompress_pages(), if size of decompressed data is not equal to
expected one, we should print the size rather than size of target buffer
for decompressed data, fix it.

Signed-off-by: Chao Yu <chao.yu@oppo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-17 09:16:22 -07:00
Wang Xiaojun
646f64b576 f2fs: remove redundant parameter judgment
iput() has already judged the incoming parameter, so there is
no need to repeat the judgment here.

Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-17 09:16:22 -07:00
Jaegeuk Kim
98237fcda4 f2fs: use spin_lock to avoid hang
[14696.634553] task:cat             state:D stack:    0 pid:1613738 ppid:1613735 flags:0x00000004
[14696.638285] Call Trace:
[14696.639038]  <TASK>
[14696.640032]  __schedule+0x302/0x930
[14696.640969]  schedule+0x58/0xd0
[14696.641799]  schedule_preempt_disabled+0x18/0x30
[14696.642890]  __mutex_lock.constprop.0+0x2fb/0x4f0
[14696.644035]  ? mod_objcg_state+0x10c/0x310
[14696.645040]  ? obj_cgroup_charge+0xe1/0x170
[14696.646067]  __mutex_lock_slowpath+0x13/0x20
[14696.647126]  mutex_lock+0x34/0x40
[14696.648070]  stat_show+0x25/0x17c0 [f2fs]
[14696.649218]  seq_read_iter+0x120/0x4b0
[14696.650289]  ? aa_file_perm+0x12a/0x500
[14696.651357]  ? lru_cache_add+0x1c/0x20
[14696.652470]  seq_read+0xfd/0x140
[14696.653445]  full_proxy_read+0x5c/0x80
[14696.654535]  vfs_read+0xa0/0x1a0
[14696.655497]  ksys_read+0x67/0xe0
[14696.656502]  __x64_sys_read+0x1a/0x20
[14696.657580]  do_syscall_64+0x3b/0xc0
[14696.658671]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[14696.660068] RIP: 0033:0x7efe39df1cb2
[14696.661133] RSP: 002b:00007ffc8badd948 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[14696.662958] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007efe39df1cb2
[14696.664757] RDX: 0000000000020000 RSI: 00007efe399df000 RDI: 0000000000000003
[14696.666542] RBP: 00007efe399df000 R08: 00007efe399de010 R09: 00007efe399de010
[14696.668363] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000
[14696.670155] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[14696.671965]  </TASK>
[14696.672826] task:umount          state:D stack:    0 pid:1614985 ppid:1614984 flags:0x00004000
[14696.674930] Call Trace:
[14696.675903]  <TASK>
[14696.676780]  __schedule+0x302/0x930
[14696.677927]  schedule+0x58/0xd0
[14696.679019]  schedule_preempt_disabled+0x18/0x30
[14696.680412]  __mutex_lock.constprop.0+0x2fb/0x4f0
[14696.681783]  ? destroy_inode+0x65/0x80
[14696.683006]  __mutex_lock_slowpath+0x13/0x20
[14696.684305]  mutex_lock+0x34/0x40
[14696.685442]  f2fs_destroy_stats+0x1e/0x60 [f2fs]
[14696.686803]  f2fs_put_super+0x158/0x390 [f2fs]
[14696.688238]  generic_shutdown_super+0x7a/0x120
[14696.689621]  kill_block_super+0x27/0x50
[14696.690894]  kill_f2fs_super+0x7f/0x100 [f2fs]
[14696.692311]  deactivate_locked_super+0x35/0xa0
[14696.693698]  deactivate_super+0x40/0x50
[14696.694985]  cleanup_mnt+0x139/0x190
[14696.696209]  __cleanup_mnt+0x12/0x20
[14696.697390]  task_work_run+0x64/0xa0
[14696.698587]  exit_to_user_mode_prepare+0x1b7/0x1c0
[14696.700053]  syscall_exit_to_user_mode+0x27/0x50
[14696.701418]  do_syscall_64+0x48/0xc0
[14696.702630]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-17 09:16:22 -07:00
Matthew Wilcox (Oracle)
cbc975b182 f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
Removes a call to __set_page_dirty_nobuffers().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:34:38 -04:00
Matthew Wilcox (Oracle)
4f5e34f713 f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
Removes several calls to __set_page_dirty_nobuffers().  Also turn the
PageSwapCache() case into a BUG() as there's no way for a swapcache page
to make it to a filesystem that doesn't use SWP_FS_OPS.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:34:38 -04:00
Matthew Wilcox (Oracle)
1d9ac659ff f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
Removes a call to __set_page_dirty_nobuffers().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:34:38 -04:00
Matthew Wilcox (Oracle)
9150399673 f2fs: Convert invalidatepage to invalidate_folio
This is a minimal change which just accepts the new arguments and passes
the single struct page to the functions which do the work.  There is
very little progress here toards making f2fs support large folios.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:23:30 -04:00
Jaegeuk Kim
ba900534f8 f2fs: don't get FREEZE lock in f2fs_evict_inode in frozen fs
Let's purge inode cache in order to avoid the below deadlock.

[freeze test]                         shrinkder
freeze_super
 - pwercpu_down_write(SB_FREEZE_FS)
                                       - super_cache_scan
                                         - down_read(&sb->s_umount)
                                           - prune_icache_sb
                                            - dispose_list
                                             - evict
                                              - f2fs_evict_inode
thaw_super
 - down_write(&sb->s_umount);
                                              - __percpu_down_read(SB_FREEZE_FS)

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-11 07:36:17 -08:00
Jia Yang
d13732cc0c f2fs: remove unnecessary read for F2FS_FITS_IN_INODE
F2FS_FITS_IN_INODE only cares the type of f2fs inode, so there
is no need to read node page of f2fs inode.

Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-09 18:48:25 -08:00
Christoph Hellwig
64bf0eef01 f2fs: pass the bio operation to bio_alloc_bioset
Refactor block I/O code so that the bio operation and known flags are set
at bio allocation time.  Only the later updated flags are updated on the
fly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220228124123.856027-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08 17:59:03 -07:00
Christoph Hellwig
5189810a66 f2fs: don't pass a bio to f2fs_target_device
Set the bdev at bio allocation time by changing the f2fs_target_device
calling conventions, so that no bio needs to be passed in.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220228124123.856027-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08 17:59:03 -07:00
Christoph Hellwig
41d36a9f3e fs: remove kiocb.ki_hint
This field is entirely unused now except for a tracepoint in f2fs, so
remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220308060529.736277-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08 17:55:03 -07:00
Christoph Hellwig
c75e707fe1 block: remove the per-bio/request write hint
With the NVMe support for this gone, there are no consumers of these hints
left, so remove them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07 12:45:57 -07:00
Jens Axboe
13400b1454 Merge branch 'for-5.18/block' into for-5.18/write-streams
* for-5.18/block: (96 commits)
  block: remove bio_devname
  ext4: stop using bio_devname
  raid5-ppl: stop using bio_devname
  raid1: stop using bio_devname
  md-multipath: stop using bio_devname
  dm-integrity: stop using bio_devname
  dm-crypt: stop using bio_devname
  pktcdvd: remove a pointless debug check in pkt_submit_bio
  block: remove handle_bad_sector
  block: fix and cleanup bio_check_ro
  bfq: fix use-after-free in bfq_dispatch_request
  blk-crypto: show crypto capabilities in sysfs
  block: don't delete queue kobject before its children
  block: simplify calling convention of elv_unregister_queue()
  block: remove redundant semicolon
  block: default BLOCK_LEGACY_AUTOLOAD to y
  block: update io_ticks when io hang
  block, bfq: don't move oom_bfqq
  block, bfq: avoid moving bfqq to it's parent bfqg
  block, bfq: cleanup bfq_bfqq_to_bfqg()
  ...
2022-03-07 12:44:37 -07:00
Jaegeuk Kim
7f8e249dcc f2fs: introduce F2FS_UNFAIR_RWSEM to support unfair rwsem
Unfair rwsem should be used when blk-cg is on. Otherwise, there is regression.

FYI, we noticed a -26.7% regression of aim7.jobs-per-min due to commit:

commit: e4544b63a7 ("f2fs: move f2fs to use reader-unfair rwsems")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

in testcase: aim7
on test machine: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory
with following parameters:

	disk: 4BRD_12G
	md: RAID0
	fs: f2fs
	test: sync_disk_rw
	load: 100
	cpufreq_governor: performance
	ucode: 0x500320a

test-description: AIM7 is a traditional UNIX system level benchmark suite which is used to test and measure the performance of multiuser system.
test-url: https://sourceforge.net/projects/aimbench/files/aim-suite7/

Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-04 09:15:53 -08:00
Jaegeuk Kim
50c63009f6 f2fs: avoid an infinite loop in f2fs_sync_dirty_inodes
If one read IO is always failing, we can fall into an infinite loop in
f2fs_sync_dirty_inodes. This happens during xfstests/generic/475.

[  142.803335] Buffer I/O error on dev dm-1, logical block 8388592, async page read
...
[  382.887210]  submit_bio_noacct+0xdd/0x2a0
[  382.887213]  submit_bio+0x80/0x110
[  382.887223]  __submit_bio+0x4d/0x300 [f2fs]
[  382.887282]  f2fs_submit_page_bio+0x125/0x200 [f2fs]
[  382.887299]  __get_meta_page+0xc9/0x280 [f2fs]
[  382.887315]  f2fs_get_meta_page+0x13/0x20 [f2fs]
[  382.887331]  f2fs_get_node_info+0x317/0x3c0 [f2fs]
[  382.887350]  f2fs_do_write_data_page+0x327/0x6f0 [f2fs]
[  382.887367]  f2fs_write_single_data_page+0x5b7/0x960 [f2fs]
[  382.887386]  f2fs_write_cache_pages+0x302/0x890 [f2fs]
[  382.887405]  ? preempt_count_add+0x7a/0xc0
[  382.887408]  f2fs_write_data_pages+0xfd/0x320 [f2fs]
[  382.887425]  ? _raw_spin_unlock+0x1a/0x30
[  382.887428]  do_writepages+0xd3/0x1d0
[  382.887432]  filemap_fdatawrite_wbc+0x69/0x90
[  382.887434]  filemap_fdatawrite+0x50/0x70
[  382.887437]  f2fs_sync_dirty_inodes+0xa4/0x270 [f2fs]
[  382.887453]  f2fs_write_checkpoint+0x189/0x1640 [f2fs]
[  382.887469]  ? schedule_timeout+0x114/0x150
[  382.887471]  ? ttwu_do_activate+0x6d/0xb0
[  382.887473]  ? preempt_count_add+0x7a/0xc0
[  382.887476]  kill_f2fs_super+0xca/0x100 [f2fs]
[  382.887491]  deactivate_locked_super+0x35/0xa0
[  382.887494]  deactivate_super+0x40/0x50
[  382.887497]  cleanup_mnt+0x139/0x190
[  382.887499]  __cleanup_mnt+0x12/0x20
[  382.887501]  task_work_run+0x64/0xa0
[  382.887505]  exit_to_user_mode_prepare+0x1b7/0x1c0
[  382.887508]  syscall_exit_to_user_mode+0x27/0x50
[  382.887510]  do_syscall_64+0x48/0xc0
[  382.887513]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-04 09:15:46 -08:00
Chao Yu
f41ee8b91c f2fs: fix to do sanity check on curseg->alloc_type
As Wenqing Liu reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215657

- Overview
UBSAN: array-index-out-of-bounds in fs/f2fs/segment.c:3460:2 when mount and operate a corrupted image

- Reproduce
tested on kernel 5.17-rc4, 5.17-rc6

1. mkdir test_crash
2. cd test_crash
3. unzip tmp2.zip
4. mkdir mnt
5. ./single_test.sh f2fs 2

- Kernel dump
[   46.434454] loop0: detected capacity change from 0 to 131072
[   46.529839] F2FS-fs (loop0): Mounted with checkpoint version = 7548c2d9
[   46.738319] ================================================================================
[   46.738412] UBSAN: array-index-out-of-bounds in fs/f2fs/segment.c:3460:2
[   46.738475] index 231 is out of range for type 'unsigned int [2]'
[   46.738539] CPU: 2 PID: 939 Comm: umount Not tainted 5.17.0-rc6 #1
[   46.738547] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[   46.738551] Call Trace:
[   46.738556]  <TASK>
[   46.738563]  dump_stack_lvl+0x47/0x5c
[   46.738581]  ubsan_epilogue+0x5/0x50
[   46.738592]  __ubsan_handle_out_of_bounds+0x68/0x80
[   46.738604]  f2fs_allocate_data_block+0xdff/0xe60 [f2fs]
[   46.738819]  do_write_page+0xef/0x210 [f2fs]
[   46.738934]  f2fs_do_write_node_page+0x3f/0x80 [f2fs]
[   46.739038]  __write_node_page+0x2b7/0x920 [f2fs]
[   46.739162]  f2fs_sync_node_pages+0x943/0xb00 [f2fs]
[   46.739293]  f2fs_write_checkpoint+0x7bb/0x1030 [f2fs]
[   46.739405]  kill_f2fs_super+0x125/0x150 [f2fs]
[   46.739507]  deactivate_locked_super+0x60/0xc0
[   46.739517]  deactivate_super+0x70/0xb0
[   46.739524]  cleanup_mnt+0x11a/0x200
[   46.739532]  __cleanup_mnt+0x16/0x20
[   46.739538]  task_work_run+0x67/0xa0
[   46.739547]  exit_to_user_mode_prepare+0x18c/0x1a0
[   46.739559]  syscall_exit_to_user_mode+0x26/0x40
[   46.739568]  do_syscall_64+0x46/0xb0
[   46.739584]  entry_SYSCALL_64_after_hwframe+0x44/0xae

The root cause is we missed to do sanity check on curseg->alloc_type,
result in out-of-bound accessing on sbi->block_count[] array, fix it.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-03 18:19:41 -08:00
Chao Yu
344150999b f2fs: fix to avoid potential deadlock
Quoted from Jing Xia's report, there is a potential deadlock may happen
between kworker and checkpoint as below:

[T:writeback]				[T:checkpoint]
- wb_writeback
 - blk_start_plug
bio contains NodeA was plugged in writeback threads
					- do_writepages  -- sync write inodeB, inc wb_sync_req[DATA]
					 - f2fs_write_data_pages
					  - f2fs_write_single_data_page -- write last dirty page
					   - f2fs_do_write_data_page
					    - set_page_writeback  -- clear page dirty flag and
					    PAGECACHE_TAG_DIRTY tag in radix tree
					    - f2fs_outplace_write_data
					     - f2fs_update_data_blkaddr
					      - f2fs_wait_on_page_writeback -- wait NodeA to writeback here
					   - inode_dec_dirty_pages
 - writeback_sb_inodes
  - writeback_single_inode
   - do_writepages
    - f2fs_write_data_pages -- skip writepages due to wb_sync_req[DATA]
     - wbc->pages_skipped += get_dirty_pages() -- PAGECACHE_TAG_DIRTY is not set but get_dirty_pages() returns one
  - requeue_inode -- requeue inode to wb->b_dirty queue due to non-zero.pages_skipped
 - blk_finish_plug

Let's try to avoid deadlock condition by forcing unplugging previous bio via
blk_finish_plug(current->plug) once we'v skipped writeback in writepages()
due to valid sbi->wb_sync_req[DATA/NODE].

Fixes: 687de7f101 ("f2fs: avoid IO split due to mixed WB_SYNC_ALL and WB_SYNC_NONE")
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Jing Xia <jing.xia@unisoc.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-03-03 13:30:48 -08:00
Juhyung Park
680af5b824 f2fs: quota: fix loop condition at f2fs_quota_sync()
cnt should be passed to sb_has_quota_active() instead of type to check
active quota properly.

Moreover, when the type is -1, the compiler with enough inline knowledge
can discard sb_has_quota_active() check altogether, causing a NULL pointer
dereference at the following inode_lock(dqopt->files[cnt]):

[    2.796010] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000a0
[    2.796024] Mem abort info:
[    2.796025]   ESR = 0x96000005
[    2.796028]   EC = 0x25: DABT (current EL), IL = 32 bits
[    2.796029]   SET = 0, FnV = 0
[    2.796031]   EA = 0, S1PTW = 0
[    2.796032] Data abort info:
[    2.796034]   ISV = 0, ISS = 0x00000005
[    2.796035]   CM = 0, WnR = 0
[    2.796046] user pgtable: 4k pages, 39-bit VAs, pgdp=00000003370d1000
[    2.796048] [00000000000000a0] pgd=0000000000000000, pud=0000000000000000
[    2.796051] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[    2.796056] CPU: 7 PID: 640 Comm: f2fs_ckpt-259:7 Tainted: G S                5.4.179-arter97-r8-64666-g2f16e087f9d8 #1
[    2.796057] Hardware name: Qualcomm Technologies, Inc. Lahaina MTP lemonadep (DT)
[    2.796059] pstate: 80c00005 (Nzcv daif +PAN +UAO)
[    2.796065] pc : down_write+0x28/0x70
[    2.796070] lr : f2fs_quota_sync+0x100/0x294
[    2.796071] sp : ffffffa3f48ffc30
[    2.796073] x29: ffffffa3f48ffc30 x28: 0000000000000000
[    2.796075] x27: ffffffa3f6d718b8 x26: ffffffa415fe9d80
[    2.796077] x25: ffffffa3f7290048 x24: 0000000000000001
[    2.796078] x23: 0000000000000000 x22: ffffffa3f7290000
[    2.796080] x21: ffffffa3f72904a0 x20: ffffffa3f7290110
[    2.796081] x19: ffffffa3f77a9800 x18: ffffffc020aae038
[    2.796083] x17: ffffffa40e38e040 x16: ffffffa40e38e6d0
[    2.796085] x15: ffffffa40e38e6cc x14: ffffffa40e38e6d0
[    2.796086] x13: 00000000000004f6 x12: 00162c44ff493000
[    2.796088] x11: 0000000000000400 x10: ffffffa40e38c948
[    2.796090] x9 : 0000000000000000 x8 : 00000000000000a0
[    2.796091] x7 : 0000000000000000 x6 : 0000d1060f00002a
[    2.796093] x5 : ffffffa3f48ff718 x4 : 000000000000000d
[    2.796094] x3 : 00000000060c0000 x2 : 0000000000000001
[    2.796096] x1 : 0000000000000000 x0 : 00000000000000a0
[    2.796098] Call trace:
[    2.796100]  down_write+0x28/0x70
[    2.796102]  f2fs_quota_sync+0x100/0x294
[    2.796104]  block_operations+0x120/0x204
[    2.796106]  f2fs_write_checkpoint+0x11c/0x520
[    2.796107]  __checkpoint_and_complete_reqs+0x7c/0xd34
[    2.796109]  issue_checkpoint_thread+0x6c/0xb8
[    2.796112]  kthread+0x138/0x414
[    2.796114]  ret_from_fork+0x10/0x18
[    2.796117] Code: aa0803e0 aa1f03e1 52800022 aa0103e9 (c8e97d02)
[    2.796120] ---[ end trace 96e942e8eb6a0b53 ]---
[    2.800116] Kernel panic - not syncing: Fatal exception
[    2.800120] SMP: stopping secondary CPUs

Fixes: 9de71ede81 ("f2fs: quota: fix potential deadlock")
Cc: <stable@vger.kernel.org> # v5.15+
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-25 11:11:31 -08:00
Bart Van Assche
c7f91bd410 f2fs: Restore rwsem lockdep support
Lockdep uses lock class keys in its analysis. init_rwsem() instantiates
one lock class key with each init_rwsem() user as follows:

 #define init_rwsem(sem)                                        \
 do {                                                           \
         static struct lock_class_key __key;                    \
                                                                \
         __init_rwsem((sem), #sem, &__key);                     \
 } while (0)

Commit e4544b63a7 ("f2fs: move f2fs to use reader-unfair rwsems") reduced
the number of lock class keys from one per init_rwsem() user to one per
file in which init_f2fs_rwsem() is used. This causes the same lock class key
to be associated with multiple f2fs rwsems and also triggers a number of
false positive lockdep deadlock reports. Fix this by again instantiating one
lock class key with each init_f2fs_rwsem() caller.

Cc: Tim Murray <timmurray@google.com>
Reported-by: syzbot+0b9cadf5fc45a98a5083@syzkaller.appspotmail.com
Fixes: e4544b63a7 ("f2fs: move f2fs to use reader-unfair rwsems")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-25 11:11:31 -08:00
Jaegeuk Kim
2fef99b837 f2fs: fix missing free nid in f2fs_handle_failed_inode
This patch fixes xfstests/generic/475 failure.

[  293.680694] F2FS-fs (dm-1): May loss orphan inode, run fsck to fix.
[  293.685358] Buffer I/O error on dev dm-1, logical block 8388592, async page read
[  293.691527] Buffer I/O error on dev dm-1, logical block 8388592, async page read
[  293.691764] sh (7615): drop_caches: 3
[  293.691819] sh (7616): drop_caches: 3
[  293.694017] Buffer I/O error on dev dm-1, logical block 1, async page read
[  293.695659] sh (7618): drop_caches: 3
[  293.696979] sh (7617): drop_caches: 3
[  293.700290] sh (7623): drop_caches: 3
[  293.708621] sh (7626): drop_caches: 3
[  293.711386] sh (7628): drop_caches: 3
[  293.711825] sh (7627): drop_caches: 3
[  293.716738] sh (7630): drop_caches: 3
[  293.719613] sh (7632): drop_caches: 3
[  293.720971] sh (7633): drop_caches: 3
[  293.727741] sh (7634): drop_caches: 3
[  293.730783] sh (7636): drop_caches: 3
[  293.732681] sh (7635): drop_caches: 3
[  293.732988] sh (7637): drop_caches: 3
[  293.738836] sh (7639): drop_caches: 3
[  293.740568] sh (7641): drop_caches: 3
[  293.743053] sh (7640): drop_caches: 3
[  293.821889] ------------[ cut here ]------------
[  293.824654] kernel BUG at fs/f2fs/node.c:3334!
[  293.826226] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  293.828713] CPU: 0 PID: 7653 Comm: umount Tainted: G           OE     5.17.0-rc1-custom #1
[  293.830946] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[  293.832526] RIP: 0010:f2fs_destroy_node_manager+0x33f/0x350 [f2fs]
[  293.833905] Code: e8 d6 3d f9 f9 48 8b 45 d0 65 48 2b 04 25 28 00 00 00 75 1a 48 81 c4 28 03 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b
[  293.837783] RSP: 0018:ffffb04ec31e7a20 EFLAGS: 00010202
[  293.839062] RAX: 0000000000000001 RBX: ffff9df947db2eb8 RCX: 0000000080aa0072
[  293.840666] RDX: 0000000000000000 RSI: ffffe86c0432a140 RDI: ffffffffc0b72a21
[  293.842261] RBP: ffffb04ec31e7d70 R08: ffff9df94ca85780 R09: 0000000080aa0072
[  293.843909] R10: ffff9df94ca85700 R11: ffff9df94e1ccf58 R12: ffff9df947db2e00
[  293.845594] R13: ffff9df947db2ed0 R14: ffff9df947db2eb8 R15: ffff9df947db2eb8
[  293.847855] FS:  00007f5a97379800(0000) GS:ffff9dfa77c00000(0000) knlGS:0000000000000000
[  293.850647] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  293.852940] CR2: 00007f5a97528730 CR3: 000000010bc76005 CR4: 0000000000370ef0
[  293.854680] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  293.856423] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  293.858380] Call Trace:
[  293.859302]  <TASK>
[  293.860311]  ? ttwu_do_wakeup+0x1c/0x170
[  293.861800]  ? ttwu_do_activate+0x6d/0xb0
[  293.863057]  ? _raw_spin_unlock_irqrestore+0x29/0x40
[  293.864411]  ? try_to_wake_up+0x9d/0x5e0
[  293.865618]  ? debug_smp_processor_id+0x17/0x20
[  293.866934]  ? debug_smp_processor_id+0x17/0x20
[  293.868223]  ? free_unref_page+0xbf/0x120
[  293.869470]  ? __free_slab+0xcb/0x1c0
[  293.870614]  ? preempt_count_add+0x7a/0xc0
[  293.871811]  ? __slab_free+0xa0/0x2d0
[  293.872918]  ? __wake_up_common_lock+0x8a/0xc0
[  293.874186]  ? __slab_free+0xa0/0x2d0
[  293.875305]  ? free_inode_nonrcu+0x20/0x20
[  293.876466]  ? free_inode_nonrcu+0x20/0x20
[  293.877650]  ? debug_smp_processor_id+0x17/0x20
[  293.878949]  ? call_rcu+0x11a/0x240
[  293.880060]  ? f2fs_destroy_stats+0x59/0x60 [f2fs]
[  293.881437]  ? kfree+0x1fe/0x230
[  293.882674]  f2fs_put_super+0x160/0x390 [f2fs]
[  293.883978]  generic_shutdown_super+0x7a/0x120
[  293.885274]  kill_block_super+0x27/0x50
[  293.886496]  kill_f2fs_super+0x7f/0x100 [f2fs]
[  293.887806]  deactivate_locked_super+0x35/0xa0
[  293.889271]  deactivate_super+0x40/0x50
[  293.890513]  cleanup_mnt+0x139/0x190
[  293.891689]  __cleanup_mnt+0x12/0x20
[  293.892850]  task_work_run+0x64/0xa0
[  293.894035]  exit_to_user_mode_prepare+0x1b7/0x1c0
[  293.895409]  syscall_exit_to_user_mode+0x27/0x50
[  293.896872]  do_syscall_64+0x48/0xc0
[  293.898090]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  293.899517] RIP: 0033:0x7f5a975cd25b

Fixes: 7735730d39 ("f2fs: fix to propagate error from __get_meta_page()")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-25 11:11:20 -08:00
Chao Yu
984fc4e76d f2fs: support idmapped mounts
This patch enables idmapped mounts for f2fs, since all dedicated helpers
for this functionality existsm, so, in this patch we just pass down the
user_namespace argument from the VFS methods to the relevant helpers.

Simple idmap example on f2fs image:

1. truncate -s 128M f2fs.img
2. mkfs.f2fs f2fs.img
3. mount f2fs.img /mnt/f2fs/
4. touch /mnt/f2fs/file

5. ls -ln /mnt/f2fs/
total 0
-rw-r--r-- 1 0 0 0 2月   4 13:17 file

6. ./mount-idmapped --map-mount b:0:1001:1 /mnt/f2fs/ /mnt/scratch_f2fs/

7. ls -ln /mnt/scratch_f2fs/
total 0
-rw-r--r-- 1 1001 1001 0 2月   4 13:17 file

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-12 06:20:46 -08:00
Jaegeuk Kim
47c8ebcce8 f2fs: add a way to limit roll forward recovery time
This adds a sysfs entry to call checkpoint during fsync() in order to avoid
long elapsed time to run roll-forward recovery when booting the device.
Default value doesn't enforce the limitation which is same as before.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-12 05:58:18 -08:00
Eric Biggers
8a2c77bc2a f2fs: support direct I/O with fscrypt using blk-crypto
Encrypted files traditionally haven't supported DIO, due to the need to
encrypt/decrypt the data.  However, when the encryption is implemented
using inline encryption (blk-crypto) instead of the traditional
filesystem-layer encryption, it is straightforward to support DIO.

Therefore, make f2fs support DIO on files that are using inline
encryption.  Since f2fs uses iomap for DIO, and fscrypt support was
already added to iomap DIO, this just requires two small changes:

- Let DIO proceed when supported, by checking fscrypt_dio_supported()
  instead of assuming that encrypted files never support DIO.

- In f2fs_iomap_begin(), use fscrypt_limit_io_blocks() to limit the
  length of the mapping in the rare case where a DUN discontiguity
  occurs in the middle of an extent.  The iomap DIO implementation
  requires this, since it assumes that it can submit a bio covering (up
  to) the whole mapping, without checking fscrypt constraints itself.

Co-developed-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Satya Tangirala <satyat@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20220128233940.79464-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2022-02-08 11:02:16 -08:00
Chao Yu
1018a5463a f2fs: introduce F2FS_IPU_HONOR_OPU_WRITE ipu policy
Once F2FS_IPU_FORCE policy is enabled in some cases:
a) f2fs forces to use F2FS_IPU_FORCE in a small-sized volume
b) user sets F2FS_IPU_FORCE policy via sysfs

Then we may fail to defragment file due to IPU policy check, it doesn't
make sense, let's introduce a new IPU policy to allow OPU during file
defragmentation.

In small-sized volume, let's enable F2FS_IPU_HONOR_OPU_WRITE policy
by default.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-07 11:28:35 -08:00
Chao Yu
430f163b01 f2fs: adjust readahead block number during recovery
In a fragmented image, entries in dnode block list may locate in
incontiguous physical block address space, however, in recovery flow,
we will always readahead BIO_MAX_VECS size blocks, so in such case,
current readahead policy is low efficient, let's adjust readahead
window size dynamically based on consecutiveness of dnode blocks.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-03 22:21:28 -08:00
Chao Yu
6d18762ed5 f2fs: fix to unlock page correctly in error path of is_alive()
As Pavel Machek reported in below link [1]:

After commit 77900c45ee ("f2fs: fix to do sanity check in is_alive()"),
node page should be unlock via calling f2fs_put_page() in the error path
of is_alive(), otherwise, f2fs may hang when it tries to lock the node
page, fix it.

[1] https://lore.kernel.org/stable/20220124203637.GA19321@duo.ucw.cz/

Fixes: 77900c45ee ("f2fs: fix to do sanity check in is_alive()")
Cc: <stable@vger.kernel.org>
Reported-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-03 22:21:28 -08:00
Konstantin Vyshetsky
b2e4a2b300 f2fs: expose discard related parameters in sysfs
This patch exposes max_discard_request, min_discard_issue_time,
mid_discard_issue_time, and max_discard_issue_time in sysfs. This will
allow the user to fine tune discard operations.

Signed-off-by: Konstantin Vyshetsky <vkon@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-02 16:34:55 -08:00
Konstantin Vyshetsky
d2d8e89648 f2fs: move discard parameters into discard_cmd_control
This patch unifies parameters related to how often discard is issued and
how many requests go out at the same time by placing them in
discard_cmd_control. The move will allow the parameters to be modified
in the future without relying on hard-coded values.

Signed-off-by: Konstantin Vyshetsky <vkon@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-02-02 16:34:46 -08:00
Christoph Hellwig
609be10667 block: pass a block_device and opf to bio_alloc_bioset
Pass the block_device and operation that we plan to use this bio for to
bio_alloc_bioset to optimize the assigment.  NULL/0 can be passed, both
for the passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:49:59 -07:00
Linus Torvalds
630c12862c Fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the
previous CONFIG_UNICODE.  It is -rc material since we don't want to
 expose the former symbol on 5.17.
 
 This has been living on linux-next for the past week.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8jAUPq50yNjPBCi4QEuZqsMcppQFAmH4lC0ACgkQQEuZqsMc
 ppRl1Q/+Lyba+DORs26C4p1GDS5ezHOCdbBUE8RFwWjIl+h5ckQ/8kndaXPRLorZ
 1S9E6h5RfqhekGKOhMTXyfzqcW8qMzUy4i3J2lmJpDwATqLt+4Wu/M2BBH2CaIIL
 EhhW8D+WduAEM/TFYihH9LJ0RopvIsqcy8qdu+oSBGfPAdxJ0f2+Yx0pNTRfqVmi
 8+Dry0nRhP12o9wXElpZ0/BYEZTlY+Zo6L/heT6/GKDLpz/YmZp18GAc/0TWb3LL
 ASujr+anU2LxSFskkyuMu+rbFE8eDshvHEuBZLxlD2o+tG6lAi4mNWZYc0/+jPMw
 8TdJ5MEX3IlljXLRKuYctoCdsFQKLxH5IN5wLkiLvM5fBpeb/sWqNolx8f2s/f9R
 TaUdjwiqFnML4VnlEH3hd3/hUUVbnE+xJo6g1iRGgJY3eecimvwl8P5H7k9Sn3OS
 4zh0bHT9pfg+vUR0BVnfdWi4OpPxSrdqCgFhHsmKaGMvTApm0qMKK1Cg4OPNtYwr
 d1RMqsqEBSJTHzr0nHoiWLhkIo8npRPy+LMK51D8j6wg0kOj4GGYerWm1MD9ZlbI
 rhPy7nDgdcH48Gk1m6o7dROZKCvkZK+/QDPelBgZHGcGB94lUugYVJQrlBjI+2+7
 Wx5oQLgQgeabeMtDZ/YNy5Dsre20vas2oLj5cs6uuoWNOcBO6Ew=
 =YVNN
 -----END PGP SIGNATURE-----

Merge tag 'unicode-for-next-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode

Pull unicode cleanup from Gabriel Krisman Bertazi:
 "A fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into
  the previous CONFIG_UNICODE. It is -rc material since we don't want to
  expose the former symbol on 5.17.

  This has been living on linux-next for the past week"

* tag 'unicode-for-next-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
  unicode: clean up the Kconfig symbol confusion
2022-02-01 11:13:24 -08:00
Chao Yu
7d19e3dab0 f2fs: fix to enable ATGC correctly via gc_idle sysfs interface
It needs to assign sbi->gc_mode with GC_IDLE_AT rather than GC_AT when
user tries to enable ATGC via gc_idle sysfs interface, fix it.

Fixes: 093749e296 ("f2fs: support age threshold based garbage collection")
Cc: Zhipeng Tan <tanzhipeng@hust.edu.cn>
Signed-off-by: Jicheng Shao <shaojicheng@hust.edu.cn>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-24 17:40:04 -08:00
Tim Murray
e4544b63a7 f2fs: move f2fs to use reader-unfair rwsems
f2fs rw_semaphores work better if writers can starve readers,
especially for the checkpoint thread, because writers are strictly
more important than reader threads. This prevents significant priority
inversion between low-priority readers that blocked while trying to
acquire the read lock and a second acquisition of the write lock that
might be blocking high priority work.

Signed-off-by: Tim Murray <timmurray@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-24 17:40:04 -08:00
Linus Torvalds
3689f9f8b0 bitmap patches for 5.17-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQHJBAABCgAzFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmHi+xgVHHl1cnkubm9y
 b3ZAZ21haWwuY29tAAoJELFEgP06H77IxdoMAMf3E+L51Ys/4iAiyJQNVoT3aIBC
 A8ZVOB9he1OA3o3wBNIRKmICHk+ovnfCWcXTr9fG/Ade2wJz88NAsGPQ1Phywb+s
 iGlpySllFN72RT9ZqtJhLEzgoHHOL0CzTW07TN9GJy4gQA2h2G9CTP+OmsQdnVqE
 m9Fn3PSlJ5lhzePlKfnln8rGZFgrriJakfEFPC79n/7an4+2Hvkb5rWigo7KQc4Z
 9YNqYUcHWZFUgq80adxEb9LlbMXdD+Z/8fCjOrAatuwVkD4RDt6iKD0mFGjHXGL7
 MZ9KRS8AfZXawmetk3jjtsV+/QkeS+Deuu7k0FoO0Th2QV7BGSDhsLXAS5By/MOC
 nfSyHhnXHzCsBMyVNrJHmNhEZoN29+tRwI84JX9lWcf/OLANcCofnP6f2UIX7tZY
 CAZAgVELp+0YQXdybrfzTQ8BT3TinjS/aZtCrYijRendI1GwUXcyl69vdOKqAHuk
 5jy8k/xHyp+ZWu6v+PyAAAEGowY++qhL0fmszA==
 =RKW4
 -----END PGP SIGNATURE-----

Merge tag 'bitmap-5.17-rc1' of git://github.com/norov/linux

Pull bitmap updates from Yury Norov:

 - introduce for_each_set_bitrange()

 - use find_first_*_bit() instead of find_next_*_bit() where possible

 - unify for_each_bit() macros

* tag 'bitmap-5.17-rc1' of git://github.com/norov/linux:
  vsprintf: rework bitmap_list_string
  lib: bitmap: add performance test for bitmap_print_to_pagebuf
  bitmap: unify find_bit operations
  mm/percpu: micro-optimize pcpu_is_populated()
  Replace for_each_*_bit_from() with for_each_*_bit() where appropriate
  find: micro-optimize for_each_{set,clear}_bit()
  include/linux: move for_each_bit() macros from bitops.h to find.h
  cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
  tools: sync tools/bitmap with mother linux
  all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate
  cpumask: use find_first_and_bit()
  lib: add find_first_and_bit()
  arch: remove GENERIC_FIND_FIRST_BIT entirely
  include: move find.h from asm_generic to linux
  bitops: move find_bit_*_le functions from le.h to find.h
  bitops: protect find_first_{,zero}_bit properly
2022-01-23 06:20:44 +02:00
Christoph Hellwig
0a4ee51818 mm: remove cleancache
Patch series "remove Xen tmem leftovers".

Since the removal of the Xen tmem driver in 2019, the cleancache hooks
are entirely unused, as are large parts of frontswap.  This series
against linux-next (with the folio changes included) removes
cleancaches, and cuts down frontswap to the bits actually used by zswap.

This patch (of 13):

The cleancache subsystem is unused since the removal of Xen tmem driver
in commit 814bbf49dc ("xen: remove tmem driver").

[akpm@linux-foundation.org: remove now-unreachable code]

Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de
Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-22 08:33:38 +02:00
Christoph Hellwig
5298d4bfe8 unicode: clean up the Kconfig symbol confusion
Turn the CONFIG_UNICODE symbol into a tristate that generates some always
built in code and remove the confusing CONFIG_UNICODE_UTF8_DATA symbol.

Note that a lot of the IS_ENABLED() checks could be turned from cpp
statements into normal ifs, but this change is intended to be fairly
mechanic, so that should be cleaned up later.

Fixes: 2b3d047870 ("unicode: Add utf8-data module")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2022-01-20 19:57:24 -05:00
Linus Torvalds
1d1df41c5a f2fs-for-5.17-rc1
In this round, we've tried to address some performance issues in f2fs_checkpoint
 and direct IO flows. Also, there was a work to enhance the page cache management
 used for compression. Other than them, we've done typical work including sysfs,
 code clean-ups, tracepoint, sanity check, in addition to bug fixes on corner
 cases.
 
 Enhancement:
  - use iomap for direct IO
  - try to avoid lock contention to improve f2fs_ckpt speed
  - avoid unnecessary memory allocation in compression flow
  - POSIX_FADV_DONTNEED drops the page cache containing compression pages
  - add some sysfs entries (gc_urgent_high_remaining, pending_discard)
 
 Bug fix:
  - try not to expose unwritten blocks to user by DIO
    : this was added to avoid merge conflict; another patch is coming to address
      other missing case.
  - relax minor error condition for file pinning feature used in Android OTA
  - fix potential deadlock case in compression flow
  - should not truncate any block on pinned file
 
 In addition, we've done some code clean-ups and tracepoint/sanity check
 improvement.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmHnY0sACgkQQBSofoJI
 UNIOkg//UmjCSSG63/YZM/lQQQe4kK/tT6QTT8W/VQtzWL9vXcL7bcaxzwX3LQbR
 Gb47Zmsw9bzVJt6GQ2VRbODE1py/KPNMl5SDXJXHo6fOZ/dOnHve32gLwcLEzhPd
 casB0TbwQJ6bpEsJiZ5ho741mURxUrSCHAAX6QIQVXh8ofm9qAqlWu74OLI6UHiV
 MM84XmXcHtGUZG5SCTWfSCJhJM6Az/3A83ws9KVeu86dlE7IrigphU2nI2vdCKiO
 trR3CiLC/364fiM+9ssLS3X2wKFPD/unEU7ljBv5UaG36jsVfW+tisjTKldzpiKK
 44cNgDv1FEDxC0g3FKUhEGezAhxT8AJZB0in0zn8+5scarKGJtFCy9XhCGMVaeP+
 usxvHVy8Ga1I7sMV6oHEBcGiPJWkmurzq1XXobtj6oL/JxN4gqUJeHTcod89hQHA
 lx9kZs7MLKm2au+T3gZf5xyx35YCie8sY/N1qoPy8tU9Q7FJ54NdqqAc9JEZ6mSk
 k9ybMaa/srHG/EI/XYPw0DrobHg6P5+bYtmsRvw2vP/nsNsD3ZI/EwBBEll2ITxC
 V5Dn7MljYWI/5kB41Hl5xz6X65WeIN7koRyTXw5mp9tkNrLugqII5hzhwhSlcqJ1
 3k9TAN3RbVpWHBcyryDyLbm/+dcbwIJ4v/eJEMIDk8F2SrBGOZs=
 =LCJH
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've tried to address some performance issues in
  f2fs_checkpoint and direct IO flows. Also, there was a work to enhance
  the page cache management used for compression. Other than them, we've
  done typical work including sysfs, code clean-ups, tracepoint, sanity
  check, in addition to bug fixes on corner cases.

  Enhancements:
   - use iomap for direct IO
   - try to avoid lock contention to improve f2fs_ckpt speed
   - avoid unnecessary memory allocation in compression flow
   - POSIX_FADV_DONTNEED drops the page cache containing compression
     pages
   - add some sysfs entries (gc_urgent_high_remaining, pending_discard)

  Bug fixes:
   - try not to expose unwritten blocks to user by DIO (this was added
     to avoid merge conflict; another patch is coming to address other
     missing case)
   - relax minor error condition for file pinning feature used in
     Android OTA
   - fix potential deadlock case in compression flow
   - should not truncate any block on pinned file

  In addition, we've done some code clean-ups and tracepoint/sanity
  check improvement"

* tag 'f2fs-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits)
  f2fs: do not allow partial truncation on pinned file
  f2fs: remove redunant invalidate compress pages
  f2fs: Simplify bool conversion
  f2fs: don't drop compressed page cache in .{invalidate,release}page
  f2fs: fix to reserve space for IO align feature
  f2fs: fix to check available space of CP area correctly in update_ckpt_flags()
  f2fs: support fault injection to f2fs_trylock_op()
  f2fs: clean up __find_inline_xattr() with __find_xattr()
  f2fs: fix to do sanity check on last xattr entry in __f2fs_setxattr()
  f2fs: do not bother checkpoint by f2fs_get_node_info
  f2fs: avoid down_write on nat_tree_lock during checkpoint
  f2fs: compress: fix potential deadlock of compress file
  f2fs: avoid EINVAL by SBI_NEED_FSCK when pinning a file
  f2fs: add gc_urgent_high_remaining sysfs node
  f2fs: fix to do sanity check in is_alive()
  f2fs: fix to avoid panic in is_alive() if metadata is inconsistent
  f2fs: fix to do sanity check on inode type during garbage collection
  f2fs: avoid duplicate call of mark_inode_dirty
  f2fs: show number of pending discard commands
  f2fs: support POSIX_FADV_DONTNEED drop compressed page cache
  ...
2022-01-19 11:50:20 +02:00
Linus Torvalds
6661224e66 unicode patches for 5.17
This includes patches from Christoph Hellwig to split the large data
 tables of the unicode subsystem into a loadable module, which allow
 users to not have them around if case-insensitive filesystems are not to
 be used.  It also includes minor code fixes to unicode and its users,
 from the same author.
 
 There is a trivial conflict in the function encoding_show in
 fs/f2fs/sysfs.c reported by linux-next between commit
 
 84eab2a899 ("f2fs: replace snprintf in show functions with sysfs_emit")
 
 and commit a440943e68 ("unicode: remove the charset field from struct
 unicode_map").  from my tree.
 
 All the patches here have been on linux-next releases for the past
 months.
 
 Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8jAUPq50yNjPBCi4QEuZqsMcppQFAmHeLp0ACgkQQEuZqsMc
 ppRWdhAAstuibIlhUj1Vae070P92oaxM/Azz3IgyVFWensJyQV1PvbtFQDhyKM4w
 M3tQ45eK49vVHn+JpLHbiAdZV66rD/sMSsruCVIf/8KNVDisOBQtFar5yxVr0Ion
 AOMoG6/Xrk8BZlZH62fhtJGtu/EFmeFoGVdC81NdTSroe9G+26we3IULwHSE1lNH
 XMJFCgU6otuLDOna16U7kL77Tu7GXRJcQe1+2nRJ+u6Agxy2xTo/s4FHuxzRK0/e
 GsgO1scY6unWM23O6z+qJYazng2Zt3EOZtSGqU4TsvZwjUi2UtAYW1/vAQGc/q3Y
 hGxPYGgKC1VrXLfIcuyng7j0vFPtADbdHMbsJPoyy+Nz4znDJ81IAKAHMO1in3C8
 CHKjW+6InmXNye/uwdRt8Tx49jxUHmWUbQRT5FwMDpzC7MAL+DVdPpVVQgpLVM/H
 gW3YpBEk5qQvVdh8DWZVW3rT3SnMX/v0+u+76FsMHKYNJMNrCnP6vXpCPQl/Gyut
 ycgK7qVF3o/bgNBf072H3ZBZajTv7ePvacP4Wth7m9I2ykk+p4IjQLpTC5rJK0By
 VC1xS4im2VqiIWE9eE5y9cXU1oa/AfOcOF+7FZcxT13IL6hKTtd4+H4yKgdcNsyk
 7RjpGgjp+SU51/EilhEqMFgEe07CURxwGwhApizBSiTIOgZS96U=
 =4q9x
 -----END PGP SIGNATURE-----

Merge tag 'unicode-for-next-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode

Pull unicode updates from Gabriel Krisman Bertazi:
 "This includes patches from Christoph Hellwig to split the large data
  tables of the unicode subsystem into a loadable module, which allow
  users to not have them around if case-insensitive filesystems are not
  to be used. It also includes minor code fixes to unicode and its
  users, from the same author.

  All the patches here have been on linux-next releases for the past
  months"

* tag 'unicode-for-next-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
  unicode: only export internal symbols for the selftests
  unicode: Add utf8-data module
  unicode: cache the normalization tables in struct unicode_map
  unicode: move utf8cursor to utf8-selftest.c
  unicode: simplify utf8len
  unicode: remove the unused utf8{,n}age{min,max} functions
  unicode: pass a UNICODE_AGE() tripple to utf8_load
  unicode: mark the version field in struct unicode_map unsigned
  unicode: remove the charset field from struct unicode_map
  f2fs: simplify f2fs_sb_read_encoding
  ext4: simplify ext4_sb_read_encoding
2022-01-17 05:40:02 +02:00
Linus Torvalds
f56caedaf9 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "146 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
  dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
  memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
  userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
  ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
  damon)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
  mm/damon: hide kernel pointer from tracepoint event
  mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
  mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
  mm/damon/dbgfs: remove an unnecessary variable
  mm/damon: move the implementation of damon_insert_region to damon.h
  mm/damon: add access checking for hugetlb pages
  Docs/admin-guide/mm/damon/usage: update for schemes statistics
  mm/damon/dbgfs: support all DAMOS stats
  Docs/admin-guide/mm/damon/reclaim: document statistics parameters
  mm/damon/reclaim: provide reclamation statistics
  mm/damon/schemes: account how many times quota limit has exceeded
  mm/damon/schemes: account scheme actions that successfully applied
  mm/damon: remove a mistakenly added comment for a future feature
  Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
  Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
  Docs/admin-guide/mm/damon/usage: remove redundant information
  Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
  mm/damon: convert macro functions to static inline functions
  mm/damon: modify damon_rand() macro to static inline function
  mm/damon: move damon_rand() definition into damon.h
  ...
2022-01-15 20:37:06 +02:00
Yury Norov
b5c7e7ec7d all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate
find_first{,_zero}_bit is a more effective analogue of 'next' version if
start == 0. This patch replaces 'next' with 'first' where things look
trivial.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2022-01-15 08:47:31 -08:00
NeilBrown
4034247a0d mm: introduce memalloc_retry_wait()
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying.  Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:

 - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
 - a need to check for the process being signalled between failures
 - the possibility that other recovery actions could be performed
 - the allocation is quite deep in support code, and passing down an
   extra flag to say if __GFP_NOFAIL is wanted would be clumsy.

Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.

It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.

This patch introduces memalloc_retry_wait() with takes on that
responsibility.  Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used.  It will wait
however is appropriate.

For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests.  If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing.  So there is no need for much
further waiting.  memalloc_retry_wait() will wait until the current
jiffie ends.  If this condition is not met, then alloc_page() won't have
waited much if at all.  In that case memalloc_retry_wait() waits about
200ms.  This is the delay that most current loops uses.

linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.

Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:29 +02:00
Jaegeuk Kim
5fed0be858 f2fs: do not allow partial truncation on pinned file
If the pinned file has a hole by partial truncation, application that has
the block map will be broken.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-10 09:05:09 -08:00
Matthew Wilcox (Oracle)
51dcbdac28 mm: Convert find_lock_entries() to use a folio_batch
find_lock_entries() already only returned the head page of folios, so
convert it to return a folio_batch instead of a pagevec.  That cascades
through converting truncate_inode_pages_range() to
delete_from_page_cache_batch() and page_cache_delete_batch().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-08 00:28:41 -05:00
Fengnan Chang
2b642898e5 f2fs: remove redunant invalidate compress pages
Compress page will invalidate in truncate block process too, so remove
redunant invalidate compress pages in f2fs_evict_inode.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04 13:21:23 -08:00
Yang Li
d361b690b6 f2fs: Simplify bool conversion
Fix the following coccicheck warning:
./fs/f2fs/sysfs.c:491:41-46: WARNING: conversion to bool not needed here

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04 13:20:57 -08:00
Chao Yu
2a64e303e3 f2fs: don't drop compressed page cache in .{invalidate,release}page
For compressed inode, in .{invalidate,release}page, we will call
f2fs_invalidate_compress_pages() to drop all compressed page cache of
current inode.

But we don't need to drop compressed page cache synchronously in
.invalidatepage, because, all trancation paths of compressed physical
block has been covered with f2fs_invalidate_compress_page().

And also we don't need to drop compressed page cache synchronously
in .releasepage, because, if there is out-of-memory, we can count
on page cache reclaim on sbi->compress_inode.

BTW, this patch may fix the issue reported below:

https://lore.kernel.org/linux-f2fs-devel/20211202092812.197647-1-changfengnan@vivo.com/T/#u

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04 13:20:57 -08:00
Chao Yu
300a842937 f2fs: fix to reserve space for IO align feature
https://bugzilla.kernel.org/show_bug.cgi?id=204137

With below script, we will hit panic during new segment allocation:

DISK=bingo.img
MOUNT_DIR=/mnt/f2fs

dd if=/dev/zero of=$DISK bs=1M count=105
mkfs.f2fe -a 1 -o 19 -t 1 -z 1 -f -q $DISK

mount -t f2fs $DISK $MOUNT_DIR -o "noinline_dentry,flush_merge,noextent_cache,mode=lfs,io_bits=7,fsync_mode=strict"

for (( i = 0; i < 4096; i++ )); do
	name=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 10`
	mkdir $MOUNT_DIR/$name
done

umount $MOUNT_DIR
rm $DISK

--- Core dump ---
Call Trace:
 allocate_segment_by_default+0x9d/0x100 [f2fs]
 f2fs_allocate_data_block+0x3c0/0x5c0 [f2fs]
 do_write_page+0x62/0x110 [f2fs]
 f2fs_outplace_write_data+0x43/0xc0 [f2fs]
 f2fs_do_write_data_page+0x386/0x560 [f2fs]
 __write_data_page+0x706/0x850 [f2fs]
 f2fs_write_cache_pages+0x267/0x6a0 [f2fs]
 f2fs_write_data_pages+0x19c/0x2e0 [f2fs]
 do_writepages+0x1c/0x70
 __filemap_fdatawrite_range+0xaa/0xe0
 filemap_fdatawrite+0x1f/0x30
 f2fs_sync_dirty_inodes+0x74/0x1f0 [f2fs]
 block_operations+0xdc/0x350 [f2fs]
 f2fs_write_checkpoint+0x104/0x1150 [f2fs]
 f2fs_sync_fs+0xa2/0x120 [f2fs]
 f2fs_balance_fs_bg+0x33c/0x390 [f2fs]
 f2fs_write_node_pages+0x4c/0x1f0 [f2fs]
 do_writepages+0x1c/0x70
 __writeback_single_inode+0x45/0x320
 writeback_sb_inodes+0x273/0x5c0
 wb_writeback+0xff/0x2e0
 wb_workfn+0xa1/0x370
 process_one_work+0x138/0x350
 worker_thread+0x4d/0x3d0
 kthread+0x109/0x140
 ret_from_fork+0x25/0x30

The root cause here is, with IO alignment feature enables, in worst
case, we need F2FS_IO_SIZE() free blocks space for single one 4k write
due to IO alignment feature will fill dummy pages to make IO being
aligned.

So we will easily run out of free segments during non-inline directory's
data writeback, even in process of foreground GC.

In order to fix this issue, I just propose to reserve additional free
space for IO alignment feature to handle worst case of free space usage
ratio during FGGC.

Fixes: 0a595ebaaa ("f2fs: support IO alignment for DATA and NODE writes")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04 13:20:56 -08:00
Chao Yu
b702c83e2e f2fs: fix to check available space of CP area correctly in update_ckpt_flags()
Otherwise, nat_bit area may be persisted across boundary of CP area during
nat_bit rebuilding.

Fixes: 94c821fb28 ("f2fs: rebuild nat_bits during umount")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04 13:20:56 -08:00
Chao Yu
3e0203893e f2fs: support fault injection to f2fs_trylock_op()
f2fs: support fault injection for f2fs_trylock_op()

This patch supports to inject fault into f2fs_trylock_op().

Usage:
a) echo 65536 > /sys/fs/f2fs/<dev>/inject_type or
b) mount -o fault_type=65536 <dev> <mountpoint>

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04 13:20:56 -08:00
Chao Yu
dd9d4a3a30 f2fs: clean up __find_inline_xattr() with __find_xattr()
Just cleanup, no logic change.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04 13:20:56 -08:00
Chao Yu
645a3c40ca f2fs: fix to do sanity check on last xattr entry in __f2fs_setxattr()
As Wenqing Liu reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215235

- Overview
page fault in f2fs_setxattr() when mount and operate on corrupted image

- Reproduce
tested on kernel 5.16-rc3, 5.15.X under root

1. unzip tmp7.zip
2. ./single.sh f2fs 7

Sometimes need to run the script several times

- Kernel dump
loop0: detected capacity change from 0 to 131072
F2FS-fs (loop0): Found nat_bits in checkpoint
F2FS-fs (loop0): Mounted with checkpoint version = 7548c2ee
BUG: unable to handle page fault for address: ffffe47bc7123f48
RIP: 0010:kfree+0x66/0x320
Call Trace:
 __f2fs_setxattr+0x2aa/0xc00 [f2fs]
 f2fs_setxattr+0xfa/0x480 [f2fs]
 __f2fs_set_acl+0x19b/0x330 [f2fs]
 __vfs_removexattr+0x52/0x70
 __vfs_removexattr_locked+0xb1/0x140
 vfs_removexattr+0x56/0x100
 removexattr+0x57/0x80
 path_removexattr+0xa3/0xc0
 __x64_sys_removexattr+0x17/0x20
 do_syscall_64+0x37/0xb0
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The root cause is in __f2fs_setxattr(), we missed to do sanity check on
last xattr entry, result in out-of-bound memory access during updating
inconsistent xattr data of target inode.

After the fix, it can detect such xattr inconsistency as below:

F2FS-fs (loop11): inode (7) has invalid last xattr entry, entry_size: 60676
F2FS-fs (loop11): inode (8) has corrupted xattr
F2FS-fs (loop11): inode (8) has corrupted xattr
F2FS-fs (loop11): inode (8) has invalid last xattr entry, entry_size: 47736

Cc: stable@vger.kernel.org
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04 13:20:56 -08:00
Jaegeuk Kim
a9419b63bf f2fs: do not bother checkpoint by f2fs_get_node_info
This patch tries to mitigate lock contention between f2fs_write_checkpoint and
f2fs_get_node_info along with nat_tree_lock.

The idea is, if checkpoint is currently running, other threads that try to grab
nat_tree_lock would be better to wait for checkpoint.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04 13:20:49 -08:00
Jaegeuk Kim
0df035c720 f2fs: avoid down_write on nat_tree_lock during checkpoint
Let's cache nat entry if there's no lock contention only.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-01-04 13:20:19 -08:00
Hyeong-Jun Kim
7377e85396 f2fs: compress: fix potential deadlock of compress file
There is a potential deadlock between writeback process and a process
performing write_begin() or write_cache_pages() while trying to write
same compress file, but not compressable, as below:

[Process A] - doing checkpoint
[Process B]                     [Process C]
f2fs_write_cache_pages()
- lock_page() [all pages in cluster, 0-31]
- f2fs_write_multi_pages()
 - f2fs_write_raw_pages()
  - f2fs_write_single_data_page()
   - f2fs_do_write_data_page()
     - return -EAGAIN [f2fs_trylock_op() failed]
   - unlock_page(page) [e.g., page 0]
                                - generic_perform_write()
                                 - f2fs_write_begin()
                                  - f2fs_prepare_compress_overwrite()
                                   - prepare_compress_overwrite()
                                    - lock_page() [e.g., page 0]
                                    - lock_page() [e.g., page 1]
   - lock_page(page) [e.g., page 0]

Since there is no compress process, it is no longer necessary to hold
locks on every pages in cluster within f2fs_write_raw_pages().

This patch changes f2fs_write_raw_pages() to release all locks first
and then perform write same as the non-compress file in
f2fs_write_cache_pages().

Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Hyeong-Jun Kim <hj514.kim@samsung.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-14 11:17:55 -08:00
Jaegeuk Kim
19bdba5265 f2fs: avoid EINVAL by SBI_NEED_FSCK when pinning a file
Android OTA failed due to SBI_NEED_FSCK flag when pinning the file. Let's avoid
it since we can do in-place-updates.

Cc: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-14 11:17:48 -08:00
Daeho Jeong
325163e989 f2fs: add gc_urgent_high_remaining sysfs node
Added a new sysfs node called gc_urgent_high_remaining. The user can
set the trial count limit for GC urgent high mode with this value. If
GC thread gets to the limit, the mode will turn back to GC normal mode.
By default, the value is zero, which means there is no limit like before.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:33 -08:00
Chao Yu
77900c45ee f2fs: fix to do sanity check in is_alive()
In fuzzed image, SSA table may indicate that a data block belongs to
invalid node, which node ID is out-of-range (0, 1, 2 or max_nid), in
order to avoid migrating inconsistent data in such corrupted image,
let's do sanity check anyway before data block migration.

Cc: stable@vger.kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:33 -08:00
Chao Yu
f6db43076d f2fs: fix to avoid panic in is_alive() if metadata is inconsistent
As report by Wenqing Liu in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215231

If we enable CONFIG_F2FS_CHECK_FS config, and with fuzzed image attached
in above link, we will encounter panic when executing below script:

1. mkdir mnt
2. mount -t f2fs tmp1.img mnt
3. touch tmp

F2FS-fs (loop11): mismatched blkaddr 5765 (source_blkaddr 1) in seg 3
kernel BUG at fs/f2fs/gc.c:1042!
 do_garbage_collect+0x90f/0xa80 [f2fs]
 f2fs_gc+0x294/0x12a0 [f2fs]
 f2fs_balance_fs+0x2c5/0x7d0 [f2fs]
 f2fs_create+0x239/0xd90 [f2fs]
 lookup_open+0x45e/0xa90
 open_last_lookups+0x203/0x670
 path_openat+0xae/0x490
 do_filp_open+0xbc/0x160
 do_sys_openat2+0x2f1/0x500
 do_sys_open+0x5e/0xa0
 __x64_sys_openat+0x28/0x40

Previously, f2fs tries to catch data inconcistency exception in between
SSA and SIT table during GC, however once the exception is caught, it will
call f2fs_bug_on to hang kernel, it's not needed, instead, let's set
SBI_NEED_FSCK flag and skip migrating current block.

Fixes: bbf9f7d90f ("f2fs: Fix indefinite loop in f2fs_gc()")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:32 -08:00
Chao Yu
9056d6489f f2fs: fix to do sanity check on inode type during garbage collection
As report by Wenqing Liu in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=215231

- Overview
kernel NULL pointer dereference triggered  in folio_mark_dirty() when mount and operate on a crafted f2fs image

- Reproduce
tested on kernel 5.16-rc3, 5.15.X under root

1. mkdir mnt
2. mount -t f2fs tmp1.img mnt
3. touch tmp
4. cp tmp mnt

F2FS-fs (loop0): sanity_check_inode: inode (ino=49) extent info [5942, 4294180864, 4] is incorrect, run fsck to fix
F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=31340049, run fsck to fix.
BUG: kernel NULL pointer dereference, address: 0000000000000000
 folio_mark_dirty+0x33/0x50
 move_data_page+0x2dd/0x460 [f2fs]
 do_garbage_collect+0xc18/0x16a0 [f2fs]
 f2fs_gc+0x1d3/0xd90 [f2fs]
 f2fs_balance_fs+0x13a/0x570 [f2fs]
 f2fs_create+0x285/0x840 [f2fs]
 path_openat+0xe6d/0x1040
 do_filp_open+0xc5/0x140
 do_sys_openat2+0x23a/0x310
 do_sys_open+0x57/0x80

The root cause is for special file: e.g. character, block, fifo or socket file,
f2fs doesn't assign address space operations pointer array for mapping->a_ops field,
so, in a fuzzed image, SSA table indicates a data block belong to special file, when
f2fs tries to migrate that block, it causes NULL pointer access once move_data_page()
calls a_ops->set_dirty_page().

Cc: stable@vger.kernel.org
Reported-by: Wenqing Liu <wenqingliu0120@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:32 -08:00
Jaegeuk Kim
766c663933 f2fs: avoid duplicate call of mark_inode_dirty
Let's check the condition first before set|clear bit.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:32 -08:00
Jaegeuk Kim
ae2e2804ca f2fs: show number of pending discard commands
This information can be used to check how much time we need to give to issue
all the discard commands.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:32 -08:00
Fengnan Chang
e64347ae13 f2fs: support POSIX_FADV_DONTNEED drop compressed page cache
Previously, compressed page cache drop when clean page cache, but
POSIX_FADV_DONTNEED can't clean compressed page cache because raw page
don't have private data, and won't call f2fs_invalidate_compress_pages.
This commit call f2fs_invalidate_compress_pages() directly in
f2fs_file_fadvise() for POSIX_FADV_DONTNEED case.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:32 -08:00
Fengnan Chang
d1917865a7 f2fs: fix remove page failed in invalidate compress pages
Since compress inode not a regular file, generic_error_remove_page in
f2fs_invalidate_compress_pages will always be failed, set compress
inode as a regular file to fix it.

Fixes: 6ce19aff0b ("f2fs: compress: add compress_inode to cache compressed blocks")
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:32 -08:00
Jaegeuk Kim
bd984c0309 f2fs: show more DIO information in tracepoint
This prints more information of DIO in tracepoint.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:31 -08:00
Eric Biggers
a1e09b03e6 f2fs: use iomap for direct I/O
Make f2fs_file_read_iter() and f2fs_file_write_iter() use the iomap
direct I/O implementation instead of the fs/direct-io.c one.

The iomap implementation is more efficient, and it also avoids the need
to add new features and optimizations to the old implementation.

This new implementation also eliminates the need for f2fs to hook bio
submission and completion and to allocate memory per-bio.  This is
because it's possible to correctly update f2fs's in-flight DIO counters
using __iomap_dio_rw() in combination with an implementation of
iomap_dio_ops::end_io() (as suggested by Christoph Hellwig).

When possible, this new implementation preserves existing f2fs behavior
such as the conditions for falling back to buffered I/O.

This patch has been tested with xfstests by running 'gce-xfstests -c
f2fs -g auto -X generic/017' with and without this patch; no regressions
were seen.  (Some tests fail both before and after.  generic/017 hangs
both before and after, so it had to be excluded.)

Signed-off-by: Eric Biggers <ebiggers@google.com>
[Jaegeuk Kim: use spin_lock_bh for f2fs_update_iostat in softirq]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-10 15:48:30 -08:00
Eric Biggers
1517c1a7a4 f2fs: implement iomap operations
Implement 'struct iomap_ops' for f2fs, in preparation for making f2fs
use iomap for direct I/O.

Note that this may be used for other things besides direct I/O in the
future; however, for now I've only tested it for direct I/O.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-04 10:53:35 -08:00
Eric Biggers
ccf7cf9237 f2fs: fix the f2fs_file_write_iter tracepoint
Pass in the original position and count rather than the position and
count that were updated by the write.  Also use the correct types for
all arguments, in particular the file offset which was being truncated
to 32 bits on 32-bit platforms.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-04 10:53:35 -08:00
Jaegeuk Kim
d4dd19ec1e f2fs: do not expose unwritten blocks to user by DIO
DIO preallocates physical blocks before writing data, but if an error occurrs
or power-cut happens, we can see block contents from the disk. This patch tries
to fix it by 1) turning to buffered writes for DIO into holes, 2) truncating
unwritten blocks from error or power-cut.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-04 10:53:33 -08:00
Eric Biggers
b31bf0f96e f2fs: reduce indentation in f2fs_file_write_iter()
Replace 'if (ret > 0)' with 'if (ret <= 0) goto out_unlock;'.
No change in behavior.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-12-04 10:48:53 -08:00
Eric Biggers
3d697a4a6b f2fs: rework write preallocations
f2fs_write_begin() assumes that all blocks were preallocated by
default unless FI_NO_PREALLOC is explicitly set.  This invites data
corruption, as there are cases in which not all blocks are preallocated.
Commit 47501f87c6 ("f2fs: preallocate DIO blocks when forcing
buffered_io") fixed one case, but there are others remaining.

Fix up this logic by replacing this flag with FI_PREALLOCATED_ALL, which
only gets set if all blocks for the current write were preallocated.

Also clean up f2fs_preallocate_blocks(), move it to file.c, and make it
handle some of the logic that was previously in write_iter() directly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-11-17 11:28:22 -08:00
Fengnan Chang
3271d7eb00 f2fs: compress: reduce one page array alloc and free when write compressed page
Don't alloc new page pointers array to replace old, just use old, introduce
valid_nr_cpages to indicate valid number of page pointers in array, try to
reduce one page array alloc and free when write compress page.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-11-17 11:28:22 -08:00
Linus Torvalds
c8c109546a Update to zstd-1.4.10
This PR includes 5 commits that update the zstd library version:
 
 1. Adds a new kernel-style wrapper around zstd. This wrapper API
    is functionally equivalent to the subset of the current zstd API that is
    currently used. The wrapper API changes to be kernel style so that the symbols
    don't collide with zstd's symbols. The update to zstd-1.4.10 maintains the same
    API and preserves the semantics, so that none of the callers need to be
    updated. All callers are updated in the commit, because there are zero
    functional changes.
 2. Adds an indirection for `lib/decompress_unzstd.c` so it
    doesn't depend on the layout of `lib/zstd/` to include every source file.
    This allows the next patch to be automatically generated.
 3. Imports the zstd-1.4.10 source code. This commit is automatically generated
    from upstream zstd (https://github.com/facebook/zstd).
 4. Adds me (terrelln@fb.com) as the maintainer of `lib/zstd`.
 5. Fixes a newly added build warning for clang.
 
 The discussion around this patchset has been pretty long, so I've included a
 FAQ-style summary of the history of the patchset, and why we are taking this
 approach.
 
 Why do we need to update?
 -------------------------
 
 The zstd version in the kernel is based off of zstd-1.3.1, which is was released
 August 20, 2017. Since then zstd has seen many bug fixes and performance
 improvements. And, importantly, upstream zstd is continuously fuzzed by OSS-Fuzz,
 and bug fixes aren't backported to older versions. So the only way to sanely get
 these fixes is to keep up to date with upstream zstd. There are no known security
 issues that affect the kernel, but we need to be able to update in case there
 are. And while there are no known security issues, there are relevant bug fixes.
 For example the problem with large kernel decompression has been fixed upstream
 for over 2 years https://lkml.org/lkml/2020/9/29/27.
 
 Additionally the performance improvements for kernel use cases are significant.
 Measured for x86_64 on my Intel i9-9900k @ 3.6 GHz:
 
 - BtrFS zstd compression at levels 1 and 3 is 5% faster
 - BtrFS zstd decompression+read is 15% faster
 - SquashFS zstd decompression+read is 15% faster
 - F2FS zstd compression+write at level 3 is 8% faster
 - F2FS zstd decompression+read is 20% faster
 - ZRAM decompression+read is 30% faster
 - Kernel zstd decompression is 35% faster
 - Initramfs zstd decompression+build is 5% faster
 
 On top of this, there are significant performance improvements coming down the
 line in the next zstd release, and the new automated update patch generation
 will allow us to pull them easily.
 
 How is the update patch generated?
 ----------------------------------
 
 The first two patches are preparation for updating the zstd version. Then the
 3rd patch in the series imports upstream zstd into the kernel. This patch is
 automatically generated from upstream. A script makes the necessary changes and
 imports it into the kernel. The changes are:
 
 - Replace all libc dependencies with kernel replacements and rewrite includes.
 - Remove unncessary portability macros like: #if defined(_MSC_VER).
 - Use the kernel xxhash instead of bundling it.
 
 This automation gets tested every commit by upstream's continuous integration.
 When we cut a new zstd release, we will submit a patch to the kernel to update
 the zstd version in the kernel.
 
 The automated process makes it easy to keep the kernel version of zstd up to
 date. The current zstd in the kernel shares the guts of the code, but has a lot
 of API and minor changes to work in the kernel. This is because at the time
 upstream zstd was not ready to be used in the kernel envrionment as-is. But,
 since then upstream zstd has evolved to support being used in the kernel as-is.
 
 Why are we updating in one big patch?
 -------------------------------------
 
 The 3rd patch in the series is very large. This is because it is restructuring
 the code, so it both deletes the existing zstd, and re-adds the new structure.
 Future updates will be directly proportional to the changes in upstream zstd
 since the last import. They will admittidly be large, as zstd is an actively
 developed project, and has hundreds of commits between every release. However,
 there is no other great alternative.
 
 One option ruled out is to replay every upstream zstd commit. This is not feasible
 for several reasons:
 - There are over 3500 upstream commits since the zstd version in the kernel.
 - The automation to automatically generate the kernel update was only added recently,
   so older commits cannot easily be imported.
 - Not every upstream zstd commit builds.
 - Only zstd releases are "supported", and individual commits may have bugs that were
   fixed before a release.
 
 Another option to reduce the patch size would be to first reorganize to the new
 file structure, and then apply the patch. However, the current kernel zstd is formatted
 with clang-format to be more "kernel-like". But, the new method imports zstd as-is,
 without additional formatting, to allow for closer correlation with upstream, and
 easier debugging. So the patch wouldn't be any smaller.
 
 It also doesn't make sense to import upstream zstd commit by commit going
 forward. Upstream zstd doesn't support production use cases running of the
 development branch. We have a lot of post-commit fuzzing that catches many bugs,
 so indiviudal commits may be buggy, but fixed before a release. So going forward,
 I intend to import every (important) zstd release into the Kernel.
 
 So, while it isn't ideal, updating in one big patch is the only patch I see forward.
 
 Who is responsible for this code?
 ---------------------------------
 
 I am. This patchset adds me as the maintainer for zstd. Previously, there was no tree
 for zstd patches. Because of that, there were several patches that either got ignored,
 or took a long time to merge, since it wasn't clear which tree should pick them up.
 I'm officially stepping up as maintainer, and setting up my tree as the path through
 which zstd patches get merged. I'll make sure that patches to the kernel zstd get
 ported upstream, so they aren't erased when the next version update happens.
 
 How is this code tested?
 ------------------------
 
 I tested every caller of zstd on x86_64 (BtrFS, ZRAM, SquashFS, F2FS, Kernel,
 InitRAMFS). I also tested Kernel & InitRAMFS on i386 and aarch64. I checked both
 performance and correctness.
 
 Also, thanks to many people in the community who have tested these patches locally.
 If you have tested the patches, please reply with a Tested-By so I can collect them
 for the PR I will send to Linus.
 
 Lastly, this code will bake in linux-next before being merged into v5.16.
 
 Why update to zstd-1.4.10 when zstd-1.5.0 has been released?
 ------------------------------------------------------------
 
 This patchset has been outstanding since 2020, and zstd-1.4.10 was the latest
 release when it was created. Since the update patch is automatically generated
 from upstream, I could generate it from zstd-1.5.0. However, there were some
 large stack usage regressions in zstd-1.5.0, and are only fixed in the latest
 development branch. And the latest development branch contains some new code that
 needs to bake in the fuzzer before I would feel comfortable releasing to the
 kernel.
 
 Once this patchset has been merged, and we've released zstd-1.5.1, we can update
 the kernel to zstd-1.5.1, and exercise the update process.
 
 You may notice that zstd-1.4.10 doesn't exist upstream. This release is an
 artifical release based off of zstd-1.4.9, with some fixes for the kernel
 backported from the development branch. I will tag the zstd-1.4.10 release after
 this patchset is merged, so the Linux Kernel is running a known version of zstd
 that can be debugged upstream.
 
 Why was a wrapper API added?
 ----------------------------
 
 The first versions of this patchset migrated the kernel to the upstream zstd
 API. It first added a shim API that supported the new upstream API with the old
 code, then updated callers to use the new shim API, then transitioned to the
 new code and deleted the shim API. However, Cristoph Hellwig suggested that we
 transition to a kernel style API, and hide zstd's upstream API behind that.
 This is because zstd's upstream API is supports many other use cases, and does
 not follow the kernel style guide, while the kernel API is focused on the
 kernel's use cases, and follows the kernel style guide.
 
 Where is the previous discussion?
 ---------------------------------
 
 Links for the discussions of the previous versions of the patch set.
 The largest changes in the design of the patchset are driven by the discussions
 in V11, V5, and V1. Sorry for the mix of links, I couldn't find most of the the
 threads on lkml.org.
 
 V12: https://www.spinics.net/lists/linux-crypto/msg58189.html
 V11: https://lore.kernel.org/linux-btrfs/20210430013157.747152-1-nickrterrell@gmail.com/
 V10: https://lore.kernel.org/lkml/20210426234621.870684-2-nickrterrell@gmail.com/
 V9: https://lore.kernel.org/linux-btrfs/20210330225112.496213-1-nickrterrell@gmail.com/
 V8: https://lore.kernel.org/linux-f2fs-devel/20210326191859.1542272-1-nickrterrell@gmail.com/
 V7: https://lkml.org/lkml/2020/12/3/1195
 V6: https://lkml.org/lkml/2020/12/2/1245
 V5: https://lore.kernel.org/linux-btrfs/20200916034307.2092020-1-nickrterrell@gmail.com/
 V4: https://www.spinics.net/lists/linux-btrfs/msg105783.html
 V3: https://lkml.org/lkml/2020/9/23/1074
 V2: https://www.spinics.net/lists/linux-btrfs/msg105505.html
 V1: https://lore.kernel.org/linux-btrfs/20200916034307.2092020-1-nickrterrell@gmail.com/
 
 Signed-off-by: Nick Terrell <terrelln@fb.com>
 Tested By: Paul Jones <paul@pauljones.id.au>
 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
 Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang v13.0.0 on x86-64
 Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEmIwAqlFIzbQodPwyuzRpqaNEqPUFAmGJyKIACgkQuzRpqaNE
 qPXnmw/+PKyCn6LvRQqNfdpF5f59j/B1Fab15tkpVyz3UWnCw+EKaPZOoTfIsjRf
 7TMUVm4iGsm+6xBO/YrGdRl4IxocNgXzsgnJ1lTGDbvfRC1tG+YNwuv+EEXwKYq5
 Yz3DRwDotgsrV0Kg05b+VIgkmAuY3ukmu2n09LnAdKkxoIgmHw3MIDCdVZW2Br4c
 sjJmYI+fiJd7nAlbDa42VOrdTiLzkl/2BsjWBqTv6zbiQ5uuJGsKb7P3kpcybWzD
 5C118pyE3qlVyvFz+UFu8WbN0NSf47DP22KV/3IrhNX7CVQxYBe+9/oVuPWTgRx0
 4Vl0G6u7rzh4wDZuGqTC3LYWwH9GfycI0fnVC0URP2XMOcGfPlGd3L0PEmmAeTmR
 fEbaGAN4dr0jNO3lmbyAGe/G8tvtXQx/4ZjS9Pa3TlQP24GARU/f78/blbKR87Vz
 BGMndmSi92AscgXb9buO3bCwAY1YtH5WiFaZT1XVk42cj4MiOLvPTvP4UMzDDxcZ
 56ahmAP/84kd6H+cv9LmgEMqcIBmxdUcO1nuAItJ4wdrMUgw3+lrbxwFkH9xPV7I
 okC1K0TIVEobADbxbdMylxClAylbuW+37Pko97NmAlnzNCPNE38f3s3gtXRrUTaR
 IP8jv5UQ7q3dFiWnNLLodx5KM6s32GVBKRLRnn/6SJB7QzlyHXU=
 =Xb18
 -----END PGP SIGNATURE-----

Merge tag 'zstd-for-linus-v5.16' of git://github.com/terrelln/linux

Pull zstd update from Nick Terrell:
 "Update to zstd-1.4.10.

  Add myself as the maintainer of zstd and update the zstd version in
  the kernel, which is now 4 years out of date, to a much more recent
  zstd release. This includes bug fixes, much more extensive fuzzing,
  and performance improvements. And generates the kernel zstd
  automatically from upstream zstd, so it is easier to keep the zstd
  verison up to date, and we don't fall so far out of date again.

  This includes 5 commits that update the zstd library version:

   - Adds a new kernel-style wrapper around zstd.

     This wrapper API is functionally equivalent to the subset of the
     current zstd API that is currently used. The wrapper API changes to
     be kernel style so that the symbols don't collide with zstd's
     symbols. The update to zstd-1.4.10 maintains the same API and
     preserves the semantics, so that none of the callers need to be
     updated. All callers are updated in the commit, because there are
     zero functional changes.

   - Adds an indirection for `lib/decompress_unzstd.c` so it doesn't
     depend on the layout of `lib/zstd/` to include every source file.
     This allows the next patch to be automatically generated.

   - Imports the zstd-1.4.10 source code. This commit is automatically
     generated from upstream zstd (https://github.com/facebook/zstd).

   - Adds me (terrelln@fb.com) as the maintainer of `lib/zstd`.

   - Fixes a newly added build warning for clang.

  The discussion around this patchset has been pretty long, so I've
  included a FAQ-style summary of the history of the patchset, and why
  we are taking this approach.

  Why do we need to update?
  -------------------------

  The zstd version in the kernel is based off of zstd-1.3.1, which is
  was released August 20, 2017. Since then zstd has seen many bug fixes
  and performance improvements. And, importantly, upstream zstd is
  continuously fuzzed by OSS-Fuzz, and bug fixes aren't backported to
  older versions. So the only way to sanely get these fixes is to keep
  up to date with upstream zstd.

  There are no known security issues that affect the kernel, but we need
  to be able to update in case there are. And while there are no known
  security issues, there are relevant bug fixes. For example the problem
  with large kernel decompression has been fixed upstream for over 2
  years [1]

  Additionally the performance improvements for kernel use cases are
  significant. Measured for x86_64 on my Intel i9-9900k @ 3.6 GHz:

   - BtrFS zstd compression at levels 1 and 3 is 5% faster

   - BtrFS zstd decompression+read is 15% faster

   - SquashFS zstd decompression+read is 15% faster

   - F2FS zstd compression+write at level 3 is 8% faster

   - F2FS zstd decompression+read is 20% faster

   - ZRAM decompression+read is 30% faster

   - Kernel zstd decompression is 35% faster

   - Initramfs zstd decompression+build is 5% faster

  On top of this, there are significant performance improvements coming
  down the line in the next zstd release, and the new automated update
  patch generation will allow us to pull them easily.

  How is the update patch generated?
  ----------------------------------

  The first two patches are preparation for updating the zstd version.
  Then the 3rd patch in the series imports upstream zstd into the
  kernel. This patch is automatically generated from upstream. A script
  makes the necessary changes and imports it into the kernel. The
  changes are:

   - Replace all libc dependencies with kernel replacements and rewrite
     includes.

   - Remove unncessary portability macros like: #if defined(_MSC_VER).

   - Use the kernel xxhash instead of bundling it.

  This automation gets tested every commit by upstream's continuous
  integration. When we cut a new zstd release, we will submit a patch to
  the kernel to update the zstd version in the kernel.

  The automated process makes it easy to keep the kernel version of zstd
  up to date. The current zstd in the kernel shares the guts of the
  code, but has a lot of API and minor changes to work in the kernel.
  This is because at the time upstream zstd was not ready to be used in
  the kernel envrionment as-is. But, since then upstream zstd has
  evolved to support being used in the kernel as-is.

  Why are we updating in one big patch?
  -------------------------------------

  The 3rd patch in the series is very large. This is because it is
  restructuring the code, so it both deletes the existing zstd, and
  re-adds the new structure. Future updates will be directly
  proportional to the changes in upstream zstd since the last import.
  They will admittidly be large, as zstd is an actively developed
  project, and has hundreds of commits between every release. However,
  there is no other great alternative.

  One option ruled out is to replay every upstream zstd commit. This is
  not feasible for several reasons:

   - There are over 3500 upstream commits since the zstd version in the
     kernel.

   - The automation to automatically generate the kernel update was only
     added recently, so older commits cannot easily be imported.

   - Not every upstream zstd commit builds.

   - Only zstd releases are "supported", and individual commits may have
     bugs that were fixed before a release.

  Another option to reduce the patch size would be to first reorganize
  to the new file structure, and then apply the patch. However, the
  current kernel zstd is formatted with clang-format to be more
  "kernel-like". But, the new method imports zstd as-is, without
  additional formatting, to allow for closer correlation with upstream,
  and easier debugging. So the patch wouldn't be any smaller.

  It also doesn't make sense to import upstream zstd commit by commit
  going forward. Upstream zstd doesn't support production use cases
  running of the development branch. We have a lot of post-commit
  fuzzing that catches many bugs, so indiviudal commits may be buggy,
  but fixed before a release. So going forward, I intend to import every
  (important) zstd release into the Kernel.

  So, while it isn't ideal, updating in one big patch is the only patch
  I see forward.

  Who is responsible for this code?
  ---------------------------------

  I am. This patchset adds me as the maintainer for zstd. Previously,
  there was no tree for zstd patches. Because of that, there were
  several patches that either got ignored, or took a long time to merge,
  since it wasn't clear which tree should pick them up. I'm officially
  stepping up as maintainer, and setting up my tree as the path through
  which zstd patches get merged. I'll make sure that patches to the
  kernel zstd get ported upstream, so they aren't erased when the next
  version update happens.

  How is this code tested?
  ------------------------

  I tested every caller of zstd on x86_64 (BtrFS, ZRAM, SquashFS, F2FS,
  Kernel, InitRAMFS). I also tested Kernel & InitRAMFS on i386 and
  aarch64. I checked both performance and correctness.

  Also, thanks to many people in the community who have tested these
  patches locally.

  Lastly, this code will bake in linux-next before being merged into
  v5.16.

  Why update to zstd-1.4.10 when zstd-1.5.0 has been released?
  ------------------------------------------------------------

  This patchset has been outstanding since 2020, and zstd-1.4.10 was the
  latest release when it was created. Since the update patch is
  automatically generated from upstream, I could generate it from
  zstd-1.5.0.

  However, there were some large stack usage regressions in zstd-1.5.0,
  and are only fixed in the latest development branch. And the latest
  development branch contains some new code that needs to bake in the
  fuzzer before I would feel comfortable releasing to the kernel.

  Once this patchset has been merged, and we've released zstd-1.5.1, we
  can update the kernel to zstd-1.5.1, and exercise the update process.

  You may notice that zstd-1.4.10 doesn't exist upstream. This release
  is an artifical release based off of zstd-1.4.9, with some fixes for
  the kernel backported from the development branch. I will tag the
  zstd-1.4.10 release after this patchset is merged, so the Linux Kernel
  is running a known version of zstd that can be debugged upstream.

  Why was a wrapper API added?
  ----------------------------

  The first versions of this patchset migrated the kernel to the
  upstream zstd API. It first added a shim API that supported the new
  upstream API with the old code, then updated callers to use the new
  shim API, then transitioned to the new code and deleted the shim API.
  However, Cristoph Hellwig suggested that we transition to a kernel
  style API, and hide zstd's upstream API behind that. This is because
  zstd's upstream API is supports many other use cases, and does not
  follow the kernel style guide, while the kernel API is focused on the
  kernel's use cases, and follows the kernel style guide.

  Where is the previous discussion?
  ---------------------------------

  Links for the discussions of the previous versions of the patch set
  below. The largest changes in the design of the patchset are driven by
  the discussions in v11, v5, and v1. Sorry for the mix of links, I
  couldn't find most of the the threads on lkml.org"

Link: https://lkml.org/lkml/2020/9/29/27 [1]
Link: https://www.spinics.net/lists/linux-crypto/msg58189.html [v12]
Link: https://lore.kernel.org/linux-btrfs/20210430013157.747152-1-nickrterrell@gmail.com/ [v11]
Link: https://lore.kernel.org/lkml/20210426234621.870684-2-nickrterrell@gmail.com/ [v10]
Link: https://lore.kernel.org/linux-btrfs/20210330225112.496213-1-nickrterrell@gmail.com/ [v9]
Link: https://lore.kernel.org/linux-f2fs-devel/20210326191859.1542272-1-nickrterrell@gmail.com/ [v8]
Link: https://lkml.org/lkml/2020/12/3/1195 [v7]
Link: https://lkml.org/lkml/2020/12/2/1245 [v6]
Link: https://lore.kernel.org/linux-btrfs/20200916034307.2092020-1-nickrterrell@gmail.com/ [v5]
Link: https://www.spinics.net/lists/linux-btrfs/msg105783.html [v4]
Link: https://lkml.org/lkml/2020/9/23/1074 [v3]
Link: https://www.spinics.net/lists/linux-btrfs/msg105505.html [v2]
Link: https://lore.kernel.org/linux-btrfs/20200916034307.2092020-1-nickrterrell@gmail.com/ [v1]
Signed-off-by: Nick Terrell <terrelln@fb.com>
Tested By: Paul Jones <paul@pauljones.id.au>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang v13.0.0 on x86-64
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>

* tag 'zstd-for-linus-v5.16' of git://github.com/terrelln/linux:
  lib: zstd: Add cast to silence clang's -Wbitwise-instead-of-logical
  MAINTAINERS: Add maintainer entry for zstd
  lib: zstd: Upgrade to latest upstream zstd version 1.4.10
  lib: zstd: Add decompress_sources.h for decompress_unzstd
  lib: zstd: Add kernel-specific API
2021-11-13 15:32:30 -08:00
Linus Torvalds
5664896ba2 f2fs-for-5.16-rc1
In this cycle, we've applied relatively small number of patches which fix subtle
 corner cases mainly, while introducing a new mount option to be able to fragment
 the disk intentionally for performance tests.
 
 Enhancement:
  - add a mount option to fragmente on-disk layout to understand the performance
  - support direct IO for multi-partitions
  - add a fault injection of dquot_initialize
 
 Bug fix:
  - address some lockdep complaints
  - fix a deadlock issue with quota
  - fix a memory tuning condition
  - fix compression condition to improve the ratio
  - fix disabling compression on the non-empty compressed file
  - invalidate cached pages before IPU/DIO writes
 
 And, we've added some minor clean-ups as usual.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmGMILwACgkQQBSofoJI
 UNJDRA/+KPyCXdY0OqL26BuGKj+z7hW6bz7tlh6h3wdnPdsR/W3ehbqQEr3GBb+q
 yokmD75/in7vZwGsDHGowFWMAfWOHYEqHz5UAq91sHjhfZzLDNUgLFWJedBX2XJb
 UoEAa7KzRt9M9K2p/5vSTs07RN3okUiRkFhVBBQJIaL7xi6MpadN/XAqpyoBqsiP
 pAV6J3GF6WNF19P/hkN1CJI8rV+PFrvY6C23lMkP7mnsWh03jMSgDDuhLHMQpAba
 EJYq7QbSatsLDRdR+jUQwIfMucvvzN7M6ja9+NTGlbeACvND8vXKYXOwngCq9+je
 2PIU4J8zNqnEkLsPn8STm4zwZHCA7VFdeCobCZcaVZCZFBzVqCkVYE9wqFVaQmr1
 bCrRFvEb+D1pkHYFujVXwCAfPlO6twiAInFNMa3WQ3FduJq2nhc8OLCJJ46D1KT2
 ZzzLv2EIIlncxPvgLIhiEE9DgPOyV56PQAO3OTsBZcvycU32aHo4hyexju1ubKiD
 CZFEHLnPbxX8Ulh3NX4uUxqPAEVhM/aw4l4e8xhmVRY3uj75geY7M6rt1vD+Y5Et
 EwbUE8XbLy+GhqbbO/SX9G38pftOiIquH1J0RuhuVNNmkIDkQvnSNp8WqHTdjEJE
 NiHZ5bkRkii34Wfrax9UccqGDswh/gjHAXEfGD8nFfcQZwLP1n8=
 =KGQ3
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this cycle, we've applied relatively small number of patches which
  fix subtle corner cases mainly, while introducing a new mount option
  to be able to fragment the disk intentionally for performance tests.

  Enhancements:

   - add a mount option to fragmente on-disk layout to understand the
     performance

   - support direct IO for multi-partitions

   - add a fault injection of dquot_initialize

  Bug fixes:

   - address some lockdep complaints

   - fix a deadlock issue with quota

   - fix a memory tuning condition

   - fix compression condition to improve the ratio

   - fix disabling compression on the non-empty compressed file

   - invalidate cached pages before IPU/DIO writes

  And, we've added some minor clean-ups as usual"

* tag 'f2fs-for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: fix UAF in f2fs_available_free_memory
  f2fs: invalidate META_MAPPING before IPU/DIO write
  f2fs: support fault injection for dquot_initialize()
  f2fs: fix incorrect return value in f2fs_sanity_check_ckpt()
  f2fs: compress: disallow disabling compress on non-empty compressed file
  f2fs: compress: fix overwrite may reduce compress ratio unproperly
  f2fs: multidevice: support direct IO
  f2fs: introduce fragment allocation mode mount option
  f2fs: replace snprintf in show functions with sysfs_emit
  f2fs: include non-compressed blocks in compr_written_block
  f2fs: fix wrong condition to trigger background checkpoint correctly
  f2fs: fix to use WHINT_MODE
  f2fs: fix up f2fs_lookup tracepoints
  f2fs: set SBI_NEED_FSCK flag when inconsistent node block found
  f2fs: introduce excess_dirty_threshold()
  f2fs: avoid attaching SB_ACTIVE flag during mount
  f2fs: quota: fix potential deadlock
  f2fs: should use GFP_NOFS for directory inodes
2021-11-13 11:20:22 -08:00
Dongliang Mu
5429c9dbc9 f2fs: fix UAF in f2fs_available_free_memory
if2fs_fill_super
-> f2fs_build_segment_manager
   -> create_discard_cmd_control
      -> f2fs_start_discard_thread

It invokes kthread_run to create a thread and run issue_discard_thread.

However, if f2fs_build_node_manager fails, the control flow goes to
free_nm and calls f2fs_destroy_node_manager. This function will free
sbi->nm_info. However, if issue_discard_thread accesses sbi->nm_info
after the deallocation, but before the f2fs_stop_discard_thread, it will
cause UAF(Use-after-free).

-> f2fs_destroy_segment_manager
   -> destroy_discard_cmd_control
      -> f2fs_stop_discard_thread

Fix this by stopping discard thread before f2fs_destroy_node_manager.

Note that, the commit d6d2b491a8 introduces the call of
f2fs_available_free_memory into issue_discard_thread.

Cc: stable@vger.kernel.org
Fixes: d6d2b491a8 ("f2fs: allow to change discard policy based on cached discard cmds")
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-11-09 08:23:17 -08:00
Hyeong-Jun Kim
e3b49ea368 f2fs: invalidate META_MAPPING before IPU/DIO write
Encrypted pages during GC are read and cached in META_MAPPING.
However, due to cached pages in META_MAPPING, there is an issue where
newly written pages are lost by IPU or DIO writes.

Thread A - f2fs_gc()            Thread B
/* phase 3 */
down_write(i_gc_rwsem)
ra_data_block()       ---- (a)
up_write(i_gc_rwsem)
                                f2fs_direct_IO() :
                                 - down_read(i_gc_rwsem)
                                 - __blockdev_direct_io()
                                 - get_data_block_dio_write()
                                 - f2fs_dio_submit_bio()  ---- (b)
                                 - up_read(i_gc_rwsem)
/* phase 4 */
down_write(i_gc_rwsem)
move_data_block()     ---- (c)
up_write(i_gc_rwsem)

(a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and
    cached in META_MAPPING.
(b) In thread B, writing new data by IPU or DIO write on same blkaddr as
    read in (a). cached page in META_MAPPING become out-dated.
(c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to
    new blkaddr. In conclusion, the newly written data in (b) is lost.

To address this issue, invalidating pages in META_MAPPING before IPU or
DIO write.

Fixes: 6aa58d8ad2 ("f2fs: readahead encrypted block during GC")
Signed-off-by: Hyeong-Jun Kim <hj514.kim@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-11-09 08:16:34 -08:00
Nick Terrell
cf30f6a5f0 lib: zstd: Add kernel-specific API
This patch:
- Moves `include/linux/zstd.h` -> `include/linux/zstd_lib.h`
- Updates modified zstd headers to yearless copyright
- Adds a new API in `include/linux/zstd.h` that is functionally
  equivalent to the in-use subset of the current API. Functions are
  renamed to avoid symbol collisions with zstd, to make it clear it is
  not the upstream zstd API, and to follow the kernel style guide.
- Updates all callers to use the new API.

There are no functional changes in this patch. Since there are no
functional change, I felt it was okay to update all the callers in a
single patch. Once the API is approved, the callers are mechanically
changed.

This patch is preparing for the 3rd patch in this series, which updates
zstd to version 1.4.10. Since the upstream zstd API is no longer exposed
to callers, the update can happen transparently.

Signed-off-by: Nick Terrell <terrelln@fb.com>
Tested By: Paul Jones <paul@pauljones.id.au>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # LLVM/Clang v13.0.0 on x86-64
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
2021-11-08 16:55:21 -08:00
Linus Torvalds
c03098d4b9 gfs2: Fix mmap + page fault deadlocks
Functions gfs2_file_read_iter and gfs2_file_write_iter are both
 accessing the user buffer to write to or read from while holding the
 inode glock.  In the most basic scenario, that buffer will not be
 resident and it will be mapped to the same file.  Accessing the buffer
 will trigger a page fault, and gfs2 will deadlock trying to take the
 same inode glock again while trying to handle that fault.
 
 Fix that and similar, more complex scenarios by disabling page faults
 while accessing user buffers.  To make this work, introduce a small
 amount of new infrastructure and fix some bugs that didn't trigger so
 far, with page faults enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmGBPisUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpE6A/7BezUnGuNJxJrR8pC+vcLYA7xAgUU
 6STQ6IN7w5UHRlSkNzZxZ2XPxW4uVQ4SxSEeaLqBsHZihepjcLNFZ/8MhQ6UPSD0
 8noHOi7CoIcp6IuWQtCpxRM/xjjm2SlMt2XbVJZaiJcdzCV9gB6TU9EkBRq7Zm/X
 9WFBbv1xZF0skn9ISCJvNtiiI+VyWKgMDUKxJUiTQjmJcklyyqHcVGmQi9BjqPz4
 4s3F+WH6CoGbDKlmNk/6Y9wZ/2+sbvGswVscUxPwJVPoZWsR1xBBUdAeAmEMD1P4
 BgE/Y1J8JXyVPYtyvZKq70XUhKdQkxB7RfX87YasOk9mY4Kjd5rIIGEykh+o2vC9
 kDhCHvf2Mnw5I6Rum3B7UXyB1vemY+fECIHsXhgBnS+ztabRtcAdpCuWoqb43ymw
 yEX1KwXyU4FpRYbrRvdZT42Fmh6ty8TW+N4swg8S2TrffirvgAi5yrcHZ4mPupYv
 lyzvsCW7Wv8hPXn/twNObX+okRgJnsxcCdBXARdCnRXfA8tH23xmu88u8RA1Vdxh
 nzTvv6Dx2EowwojuDWMx29Mw3fA2IqIfbOV+4FaRU7NZ2ZKtknL8yGl27qQUsMoJ
 vYsHTmagasjQr+NDJ3vQRLCw+JQ6B1hENpdkmixFD9moo7X1ZFW3HBi/UL973Bv6
 5CmgeXto8FRUFjI=
 =WeNd
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher:
 "Functions gfs2_file_read_iter and gfs2_file_write_iter are both
  accessing the user buffer to write to or read from while holding the
  inode glock.

  In the most basic deadlock scenario, that buffer will not be resident
  and it will be mapped to the same file. Accessing the buffer will
  trigger a page fault, and gfs2 will deadlock trying to take the same
  inode glock again while trying to handle that fault.

  Fix that and similar, more complex scenarios by disabling page faults
  while accessing user buffers. To make this work, introduce a small
  amount of new infrastructure and fix some bugs that didn't trigger so
  far, with page faults enabled"

* tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix mmap + page fault deadlocks for direct I/O
  iov_iter: Introduce nofault flag to disable page faults
  gup: Introduce FOLL_NOFAULT flag to disable page faults
  iomap: Add done_before argument to iomap_dio_rw
  iomap: Support partial direct I/O on user copy failures
  iomap: Fix iomap_dio_rw return value for user copies
  gfs2: Fix mmap + page fault deadlocks for buffered I/O
  gfs2: Eliminate ip->i_gh
  gfs2: Move the inode glock locking to gfs2_file_buffered_write
  gfs2: Introduce flag for glock holder auto-demotion
  gfs2: Clean up function may_grant
  gfs2: Add wrapper for iomap_file_buffered_write
  iov_iter: Introduce fault_in_iov_iter_writeable
  iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
  gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
  powerpc/kvm: Fix kvm_use_magic_page
  iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
2021-11-02 12:25:03 -07:00
Linus Torvalds
cd3e8ea847 fscrypt updates for 5.16
Some cleanups for fs/crypto/:
 
 - Allow 256-bit master keys with AES-256-XTS
 
 - Improve documentation and comments
 
 - Remove unneeded field fscrypt_operations::max_namelen
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYX8U4hQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyXYAP0d7BNuKsMyw6qlzLMxbaO5wdTg2HaD
 04ApVeHM6qp7IQEA/Ve2Mr+BcPOZ7E6io8haZtXs0MrRMYeessKWcWMCdQ0=
 =2WNZ
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:
 "Some cleanups for fs/crypto/:

   - Allow 256-bit master keys with AES-256-XTS

   - Improve documentation and comments

   - Remove unneeded field fscrypt_operations::max_namelen"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: improve a few comments
  fscrypt: allow 256-bit master keys with AES-256-XTS
  fscrypt: improve documentation for inline encryption
  fscrypt: clean up comments in bio.c
  fscrypt: remove fscrypt_operations::max_namelen
2021-11-01 11:36:35 -07:00
Chao Yu
10a2687856 f2fs: support fault injection for dquot_initialize()
This patch adds a new function f2fs_dquot_initialize() to wrap
dquot_initialize(), and it supports to inject fault into
f2fs_dquot_initialize() to simulate inner failure occurs in
dquot_initialize().

Usage:
a) echo 65536 > /sys/fs/f2fs/<dev>/inject_type or
b) mount -o fault_type=65536 <dev> <mountpoint>

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-29 10:38:53 -07:00
Chao Yu
ca98d72141 f2fs: fix incorrect return value in f2fs_sanity_check_ckpt()
As Pavel Machek reported in [1]

This code looks quite confused: part of function returns 1 on
corruption, part returns -errno. The problem is not stable-specific.

[1] https://lkml.org/lkml/2021/9/19/207

Let's fix to make 'insane cp_payload case' to return 1 rater than
EFSCORRUPTED, so that return value can be kept consistent for all
error cases, it can avoid confusion of code logic.

Fixes: 65ddf65648 ("f2fs: fix to do sanity check for sb/cp fields correctly")
Reported-by: Pavel Machek <pavel@denx.de>
Reviewed-by: Pavel Machek <pavel@denx.de>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-29 10:37:35 -07:00
Hyeong-Jun Kim
02d58cd253 f2fs: compress: disallow disabling compress on non-empty compressed file
Compresse file and normal file has differ in i_addr addressing,
specifically addrs per inode/block. So, we will face data loss, if we
disable the compression flag on non-empty files. Therefore we should
disallow not only enabling but disabling the compression flag on
non-empty files.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Hyeong-Jun Kim <hj514.kim@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-27 15:32:29 -07:00
Fengnan Chang
b368cc5e26 f2fs: compress: fix overwrite may reduce compress ratio unproperly
when overwrite only first block of cluster, since cluster is not full, it
will call f2fs_write_raw_pages when f2fs_write_multi_pages, and cause the
whole cluster become uncompressed eventhough data can be compressed.
this may will make random write bench score reduce a lot.

root# dd if=/dev/zero of=./fio-test bs=1M count=1

root# sync

root# echo 3 > /proc/sys/vm/drop_caches

root# f2fs_io get_cblocks ./fio-test

root# dd if=/dev/zero of=./fio-test bs=4K count=1 oflag=direct conv=notrunc

w/o patch:
root# f2fs_io get_cblocks ./fio-test
189

w/ patch:
root# f2fs_io get_cblocks ./fio-test
192

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-26 14:04:31 -07:00
Chao Yu
71f2c82062 f2fs: multidevice: support direct IO
Commit 3c62be17d4 ("f2fs: support multiple devices") missed
to support direct IO for multiple device feature, this patch
adds to support the missing part of multidevice feature.

In addition, for multiple device image, we should be aware of
any issued direct write IO rather than just buffered write IO,
so that fsync and syncfs can issue a preflush command to the
device where direct write IO goes, to persist user data for
posix compliant.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-26 14:04:30 -07:00
Daeho Jeong
6691d940b0 f2fs: introduce fragment allocation mode mount option
Added two options into "mode=" mount option to make it possible for
developers to simulate filesystem fragmentation/after-GC situation
itself. The developers use these modes to understand filesystem
fragmentation/after-GC condition well, and eventually get some
insights to handle them better.

"fragment:segment": f2fs allocates a new segment in ramdom position.
		With this, we can simulate the after-GC condition.
"fragment:block" : We can scatter block allocation with
		"max_fragment_chunk" and "max_fragment_hole" sysfs
		nodes. f2fs will allocate 1..<max_fragment_chunk>
		blocks in a chunk and make a hole in the length of
		1..<max_fragment_hole> by turns	in a newly allocated
		free segment. Plus, this mode implicitly enables
		"fragment:segment" option for more randomness.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-26 14:04:30 -07:00
Qing Wang
84eab2a899 f2fs: replace snprintf in show functions with sysfs_emit
coccicheck complains about the use of snprintf() in sysfs show functions.

Fix the following coccicheck warning:
fs/f2fs/sysfs.c:198:12-20: WARNING: use scnprintf or sprintf.
fs/f2fs/sysfs.c:247:8-16: WARNING: use scnprintf or sprintf.

Use sysfs_emit instead of scnprintf or sprintf makes more sense.

Signed-off-by: Qing Wang <wangqing@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-26 14:04:30 -07:00
Daeho Jeong
09631cf323 f2fs: include non-compressed blocks in compr_written_block
Need to include non-compressed blocks in compr_written_block to
estimate average compression ratio more accurately.

Fixes: 5ac443e26a ("f2fs: add sysfs nodes to get runtime compression stat")
Cc: stable@vger.kernel.org
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-26 14:04:30 -07:00
Andreas Gruenbacher
a6294593e8 iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in, similar to copy_to_user, instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted in
as well as new users that are happy if any pages can be faulted in.

Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure this change doesn't silently break things.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-18 16:35:06 +02:00
Christoph Hellwig
e41d12f539 mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
break the include chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Chao Yu
cd6d697a6e f2fs: fix wrong condition to trigger background checkpoint correctly
In f2fs_balance_fs_bg(), it needs to check both NAT_ENTRIES and INO_ENTRIES
memory usage to decide whether we should skip background checkpoint, otherwise
we may always skip checking INO_ENTRIES memory usage, so that INO_ENTRIES may
potentially cause high memory footprint.

Fixes: 493720a485 ("f2fs: fix to avoid REQ_TIME and CP_TIME collision")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-11 16:15:23 -07:00
Keoseong Park
011e0868e0 f2fs: fix to use WHINT_MODE
Since active_logs can be set to 2 or 4 or NR_CURSEG_PERSIST_TYPE(6),
it cannot be set to NR_CURSEG_TYPE(8).
That is, whint_mode is always off.

Therefore, the condition is changed from NR_CURSEG_TYPE to NR_CURSEG_PERSIST_TYPE.

Cc: Chao Yu <chao@kernel.org>
Fixes: d0b9e42ab6 (f2fs: introduce inmem curseg)
Reported-by: tanghuan <tanghuan@vivo.com>
Signed-off-by: Keoseong Park <keosung.park@samsung.com>
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-10-11 16:15:06 -07:00
Christoph Hellwig
49bd03cc7e unicode: pass a UNICODE_AGE() tripple to utf8_load
Don't bother with pointless string parsing when the caller can just pass
the version in the format that the core expects.  Also remove the
fallback to the latest version that none of the callers actually uses.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-10-11 17:01:46 -03:00
Christoph Hellwig
a440943e68 unicode: remove the charset field from struct unicode_map
It is hardcoded and only used for a f2fs sysfs file where it can be
hardcoded just as easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-10-11 17:01:35 -03:00
Christoph Hellwig
86e8057579 f2fs: simplify f2fs_sb_read_encoding
Return the encoding table as the return value instead of as an argument,
and don't bother with the encoding flags as the caller can handle that
trivially.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
2021-10-11 17:01:29 -03:00
Eric Biggers
4373b3dc92 fscrypt: remove fscrypt_operations::max_namelen
The max_namelen field is unnecessary, as it is set to 255 (NAME_MAX) on
all filesystems that support fscrypt (or plan to support fscrypt).  For
simplicity, just use NAME_MAX directly instead.

Link: https://lore.kernel.org/r/20210909184513.139281-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-09-20 19:32:33 -07:00
Weichao Guo
6663b138de f2fs: set SBI_NEED_FSCK flag when inconsistent node block found
Inconsistent node block will cause a file fail to open or read,
which could make the user process crashes or stucks. Let's mark
SBI_NEED_FSCK flag to trigger a fix at next fsck time. After
unlinking the corrupted file, the user process could regenerate
a new one and work correctly.

Signed-off-by: Weichao Guo <guoweichao@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-09-20 16:29:00 -07:00
Chao Yu
287b1406dd f2fs: introduce excess_dirty_threshold()
This patch enables f2fs_balance_fs_bg() to check all metadatas' dirty
threshold rather than just checking node block's, so that checkpoint()
from background can be triggered more frequently to avoid heaping up
too much dirty metadatas.

Threshold value by default:
race with foreground ops	single type	global
No				16MB		24MB
Yes				24MB		36MB

In addtion, let f2fs_balance_fs_bg() be aware of roll-forward sapce
as well as fsync().

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-09-20 16:12:51 -07:00
Chao Yu
c02599f210 f2fs: avoid attaching SB_ACTIVE flag during mount
Quoted from [1]

"I do remember that I've added this code back then because otherwise
orphan cleanup was losing updates to quota files. But you're right
that now I don't see how that could be happening and it would be nice
if we could get rid of this hack"

[1] https://lore.kernel.org/linux-ext4/99cce8ca-e4a0-7301-840f-2ace67c551f3@huawei.com/T/#m04990cfbc4f44592421736b504afcc346b2a7c00

Related fix in ext4 by
commit 72ffb49a7b ("ext4: do not set SB_ACTIVE in ext4_orphan_cleanup()").

f2fs has the same hack implementation in
- f2fs_recover_orphan_inodes()
- f2fs_recover_fsync_data()

Let's get rid of this hack as well in f2fs.

Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-09-16 08:38:32 -07:00
Chao Yu
a5c0042200 f2fs: quota: fix potential deadlock
As Yi Zhuang reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=214299

There is potential deadlock during quota data flush as below:

Thread A:			Thread B:
f2fs_dquot_acquire
down_read(&sbi->quota_sem)
				f2fs_write_checkpoint
				block_operations
				f2fs_look_all
				down_write(&sbi->cp_rwsem)
f2fs_quota_write
f2fs_write_begin
__do_map_lock
f2fs_lock_op
down_read(&sbi->cp_rwsem)
				__need_flush_qutoa
				down_write(&sbi->quota_sem)

This patch changes block_operations() to use trylock, if it fails,
it means there is potential quota data updater, in this condition,
let's flush quota data first and then trylock again to check dirty
status of quota data.

The side effect is: in heavy race condition (e.g. multi quota data
upaters vs quota data flusher), it may decrease the probability of
synchronizing quota data successfully in checkpoint() due to limited
retry time of quota flush.

Reported-by: Yi Zhuang <zhuangyi1@huawei.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-09-16 08:38:32 -07:00
Jaegeuk Kim
92d602bc71 f2fs: should use GFP_NOFS for directory inodes
We use inline_dentry which requires to allocate dentry page when adding a link.
If we allow to reclaim memory from filesystem, we do down_read(&sbi->cp_rwsem)
twice by f2fs_lock_op(). I think this should be okay, but how about stopping
the lockdep complaint [1]?

f2fs_create()
 - f2fs_lock_op()
 - f2fs_do_add_link()
  - __f2fs_find_entry
   - f2fs_get_read_data_page()
   -> kswapd
    - shrink_node
     - f2fs_evict_inode
      - f2fs_lock_op()

[1]

fs_reclaim
){+.+.}-{0:0}
:
kswapd0:        lock_acquire+0x114/0x394
kswapd0:        __fs_reclaim_acquire+0x40/0x50
kswapd0:        prepare_alloc_pages+0x94/0x1ec
kswapd0:        __alloc_pages_nodemask+0x78/0x1b0
kswapd0:        pagecache_get_page+0x2e0/0x57c
kswapd0:        f2fs_get_read_data_page+0xc0/0x394
kswapd0:        f2fs_find_data_page+0xa4/0x23c
kswapd0:        find_in_level+0x1a8/0x36c
kswapd0:        __f2fs_find_entry+0x70/0x100
kswapd0:        f2fs_do_add_link+0x84/0x1ec
kswapd0:        f2fs_mkdir+0xe4/0x1e4
kswapd0:        vfs_mkdir+0x110/0x1c0
kswapd0:        do_mkdirat+0xa4/0x160
kswapd0:        __arm64_sys_mkdirat+0x24/0x34
kswapd0:        el0_svc_common.llvm.17258447499513131576+0xc4/0x1e8
kswapd0:        do_el0_svc+0x28/0xa0
kswapd0:        el0_svc+0x24/0x38
kswapd0:        el0_sync_handler+0x88/0xec
kswapd0:        el0_sync+0x1c0/0x200
kswapd0:
-> #1
(
&sbi->cp_rwsem
){++++}-{3:3}
:
kswapd0:        lock_acquire+0x114/0x394
kswapd0:        down_read+0x7c/0x98
kswapd0:        f2fs_do_truncate_blocks+0x78/0x3dc
kswapd0:        f2fs_truncate+0xc8/0x128
kswapd0:        f2fs_evict_inode+0x2b8/0x8b8
kswapd0:        evict+0xd4/0x2f8
kswapd0:        iput+0x1c0/0x258
kswapd0:        do_unlinkat+0x170/0x2a0
kswapd0:        __arm64_sys_unlinkat+0x4c/0x68
kswapd0:        el0_svc_common.llvm.17258447499513131576+0xc4/0x1e8
kswapd0:        do_el0_svc+0x28/0xa0
kswapd0:        el0_svc+0x24/0x38
kswapd0:        el0_sync_handler+0x88/0xec
kswapd0:        el0_sync+0x1c0/0x200

Cc: stable@vger.kernel.org
Fixes: bdbc90fa55 ("f2fs: don't put dentry page in pagecache into highmem")
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Stanley Chu <stanley.chu@mediatek.com>
Reviewed-by: Light Hsieh <light.hsieh@mediatek.com>
Tested-by: Light Hsieh <light.hsieh@mediatek.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-09-16 08:38:10 -07:00
Linus Torvalds
6abaa83c73 f2fs-for-5.15-rc1
In this cycle, we've addressed some performance issues such as lock contention,
 misbehaving compress_cache, allowing extent_cache for compressed files, and new
 sysfs to adjust ra_size for fadvise. In order to diagnose the performance issues
 quickly, we also added an iostat which shows the IO latencies periodically. On
 the stability side, we've found two memory leakage cases in the error path in
 compression flow. And, we've also fixed various corner cases in fiemap, quota,
 checkpoint=disable, zstd, and so on.
 
 Enhancement:
  - avoid long checkpoint latency by releasing nat_tree_lock
  - collect and show iostats periodically
  - support extent_cache for compressed files
  - add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
  - report f2fs GC status via sysfs
  - add discard_unit=%s in mount option to handle zoned device
 
 Bug fix:
  - fix two memory leakages when an error happens in the compressed IO flow
  - fix commpress_cache to get the right LBA
  - fix fiemap to deal with compressed case correctly
  - fix wrong EIO returns due to SBI_NEED_FSCK
  - fix missing writes when enabling checkpoint back
  - fix quota deadlock
  - fix zstd level mount option
 
 In addition to the above major updates, we've cleaned up several code paths such
 as dio, unnecessary operations, debugfs/f2fs/status, sanity check, and typos.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmEyw1sACgkQQBSofoJI
 UNLJmA/+NHUgwUjLMcHvmLyp6QYpQDZtKj93/sRDo+YHOYNdYFjWWUb329PYTKWS
 kEdzApCP+KHfVxeSkiL/x3qWP+RlTkIf96P0kR3/BKi0tjg25G2riFWztusDDFpt
 xi+AW5sUFDvIx1tFumvQHAQedSwBgcZ96ovT5EwxEuONkljhZC9phEC6vSXz9nOR
 e2EQIyezbC5O21np1KSeqSgqRMpVkJkVcEHy4VmpMBCLMOOYPepWwKw+yPaV/jR/
 zUXdo2/53vma50M5LCDPCtjCtWQgLoeNeGLxyjfzQuTJU6TmtPY65JObLPt6pUSj
 fRW6qIziTZbVYXzOWBD0EYilv2N4c3BNJdhQCpx2Vyjw9/LLxzqKPOUyzBoa1kjY
 eZVvmaLXVCKsoJdHDSi7OH/4BqS6SuSZE8eO/nGkgswqiErHZ0Vwl3bFCWC7r/Bk
 r2U5spJx/83XO6c9H1bzeWEies1DRtwnCDIRRuw35RtJ4uHZaqCfkuJ7rOBwC90X
 4SpaAKdUxP2RWc3GKELBIhaqPn7vyMy9ile6VU14PjM8UcY5hyE87T2azqR8gGut
 nVjRL4cbMGTPj6m1Qj8KqBRSaLuShe6AncUy7bNGiM+JlcLcdB6OJ1ZYLl9hjx2r
 TbIouXThgcZ4SIK0DEaBLKz2b9/0TfaO9gw1XzpRma+bWA1pApM=
 =W67o
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this cycle, we've addressed some performance issues such as lock
  contention, misbehaving compress_cache, allowing extent_cache for
  compressed files, and new sysfs to adjust ra_size for fadvise.

  In order to diagnose the performance issues quickly, we also added an
  iostat which shows the IO latencies periodically.

  On the stability side, we've found two memory leakage cases in the
  error path in compression flow. And, we've also fixed various corner
  cases in fiemap, quota, checkpoint=disable, zstd, and so on.

  Enhancements:
   - avoid long checkpoint latency by releasing nat_tree_lock
   - collect and show iostats periodically
   - support extent_cache for compressed files
   - add a sysfs entry to manage ra_size given fadvise(POSIX_FADV_SEQUENTIAL)
   - report f2fs GC status via sysfs
   - add discard_unit=%s in mount option to handle zoned device

  Bug fixes:
   - fix two memory leakages when an error happens in the compressed IO flow
   - fix commpress_cache to get the right LBA
   - fix fiemap to deal with compressed case correctly
   - fix wrong EIO returns due to SBI_NEED_FSCK
   - fix missing writes when enabling checkpoint back
   - fix quota deadlock
   - fix zstd level mount option

  In addition to the above major updates, we've cleaned up several code
  paths such as dio, unnecessary operations, debugfs/f2fs/status, sanity
  check, and typos"

* tag 'f2fs-for-5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (46 commits)
  f2fs: should put a page beyond EOF when preparing a write
  f2fs: deallocate compressed pages when error happens
  f2fs: enable realtime discard iff device supports discard
  f2fs: guarantee to write dirty data when enabling checkpoint back
  f2fs: fix to unmap pages from userspace process in punch_hole()
  f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
  f2fs: fix to account missing .skipped_gc_rwsem
  f2fs: adjust unlock order for cleanup
  f2fs: Don't create discard thread when device doesn't support realtime discard
  f2fs: rebuild nat_bits during umount
  f2fs: introduce periodic iostat io latency traces
  f2fs: separate out iostat feature
  f2fs: compress: do sanity check on cluster
  f2fs: fix description about main_blkaddr node
  f2fs: convert S_IRUGO to 0444
  f2fs: fix to keep compatibility of fault injection interface
  f2fs: support fault injection for f2fs_kmem_cache_alloc()
  f2fs: compress: allow write compress released file after truncate to zero
  f2fs: correct comment in segment.h
  f2fs: improve sbi status info in debugfs/f2fs/status
  ...
2021-09-04 10:48:47 -07:00
Linus Torvalds
815409a12c overlayfs update for 5.15
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYTDKKAAKCRDh3BK/laaZ
 PG9PAQCUF0fdBlCKudwSEt5PV5xemycL9OCAlYCd7d4XbBIe9wEA6sVJL9J+OwV2
 aF0NomiXtJccE+S9+byjVCyqSzQJGQQ=
 =6L2Y
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:

 - Copy up immutable/append/sync/noatime attributes (Amir Goldstein)

 - Improve performance by enabling RCU lookup.

 - Misc fixes and improvements

The reason this touches so many files is that the ->get_acl() method now
gets a "bool rcu" argument.  The ->get_acl() API was updated based on
comments from Al and Linus:

Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/

* tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: enable RCU'd ->get_acl()
  vfs: add rcu argument to ->get_acl() callback
  ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
  ovl: use kvalloc in xattr copy-up
  ovl: update ctime when changing fileattr
  ovl: skip checking lower file's i_writecount on truncate
  ovl: relax lookup error on mismatch origin ftype
  ovl: do not set overlay.opaque for new directories
  ovl: add ovl_allow_offline_changes() helper
  ovl: disable decoding null uuid with redirect_dir
  ovl: consistent behavior for immutable/append-only inodes
  ovl: copy up sync/noatime fileattr flags
  ovl: pass ovl_fs to ovl_check_setxattr()
  fs: add generic helper for filling statx attribute flags
2021-09-02 09:21:27 -07:00
Jaegeuk Kim
9605f75cf3 f2fs: should put a page beyond EOF when preparing a write
The prepare_compress_overwrite() gets/locks a page to prepare a read, and calls
f2fs_read_multi_pages() which checks EOF first. If there's any page beyond EOF,
we unlock the page and set cc->rpages[i] = NULL, which we can't put the page
anymore. This makes page leak, so let's fix by putting that page.

Fixes: a949dc5f2c ("f2fs: compress: fix race condition of overwrite vs truncate")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-31 14:39:39 -07:00
Jaegeuk Kim
827f02842e f2fs: deallocate compressed pages when error happens
In f2fs_write_multi_pages(), f2fs_compress_pages() allocates pages for
compression work in cc->cpages[]. Then, f2fs_write_compressed_pages() initiates
bio submission. But, if there's any error before submitting the IOs like early
f2fs_cp_error(), previously it didn't free cpages by f2fs_compress_free_page().
Let's fix memory leak by putting that just before deallocating cc->cpages.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-31 14:39:24 -07:00
Linus Torvalds
cd358208d7 fscrypt updates for 5.15
Some small fixes and cleanups for fs/crypto/:
 
 - Fix ->getattr() for ext4, f2fs, and ubifs to report the correct
   st_size for encrypted symlinks.
 
 - Use base64url instead of a custom Base64 variant.
 
 - Document struct fscrypt_operations.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYS0HzhQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+XZAQDfvDE9gK4Ii2uE4Jb5XYv4M/BnVhoR
 WIhNEoHROIGv+AEAtyfmeCMdpPobkWHFfAE1iBysl3iS56fibQhi2wqyuQI=
 =s6Wi
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:
 "Some small fixes and cleanups for fs/crypto/:

   - Fix ->getattr() for ext4, f2fs, and ubifs to report the correct
     st_size for encrypted symlinks

   - Use base64url instead of a custom Base64 variant

   - Document struct fscrypt_operations"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: document struct fscrypt_operations
  fscrypt: align Base64 encoding with RFC 4648 base64url
  fscrypt: remove mention of symlink st_size quirk from documentation
  ubifs: report correct st_size for encrypted symlinks
  f2fs: report correct st_size for encrypted symlinks
  ext4: report correct st_size for encrypted symlinks
  fscrypt: add fscrypt_symlink_getattr() for computing st_size
2021-08-31 10:01:14 -07:00
Linus Torvalds
679369114e for-5.15/block-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs6H0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpukbD/9Qk9fQte+WJVmpbdvhV40gcKBVnGOVH0ke
 k+36x6AB/gWKnFHwtprsSyVqPxmzqwTv9VIq5l/s3Vydt3L61znvTneBeN03Wlkn
 UTxD0lY8HzyVWnZb82LBBjjy7cs6EzrFG4kBH/ZiTAyTcBsCAvzo5J7mywb4gFjj
 L/HeBq58EJ3WCUlxlVW1ijctvi7wnGoaH5bZY1TE00GGT6TysN2bEPfzjkuYHrDz
 RqhoQdWPLDz6h3x9lAncPw2MWlcmlGvJ96ABseAKFPKvXxE2PzgolSoQfVUUJtko
 bqGyy2ns+pxN11SrcGYjogEKVKhONoms/5UN1RtwRBVsgvecxlHER/SgyZ8luBDo
 lFhVXulkSjpswbWutRy3USge98GwMu2Z4ppP2CDmO7hkQd0DF8sL0kPKyaREkcHi
 NmsD/0zF2uUhUVN+PRC/MuzngAmL4Mmxjk70L+MohlK7e+H3pnEo1ec3OMcXe+wB
 dG6t/BFD9bYmj0UjsHeXEoR/iRuvSba1L8zBz5dhRaHH6DvdycYhpynXWWlU3C8K
 3nzEVVpcDINMsiRl1Vqb6g6HsMwHIH84FRl7Mc51UmhW9C4gLfWMCt1guQuzOj72
 yEbmCLydE/FR2IUPY7eqX8hRG8GTUlMtSvGdgnvBOcWj+K3buT/c5yVTHgTrN8ox
 LCOXHSvV6w==
 =S8fs
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Nothing major in here - lots of good cleanups and tech debt handling,
  which is also evident in the diffstats. In particular:

   - Add disk sequence numbers (Matteo)

   - Discard merge fix (Ming)

   - Relax disk zoned reporting restrictions (Niklas)

   - Bio error handling zoned leak fix (Pavel)

   - Start of proper add_disk() error handling (Luis, Christoph)

   - blk crypto fix (Eric)

   - Non-standard GPT location support (Dmitry)

   - IO priority improvements and cleanups (Damien)o

   - blk-throtl improvements (Chunguang)

   - diskstats_show() stack reduction (Abd-Alrhman)

   - Loop scheduler selection (Bart)

   - Switch block layer to use kmap_local_page() (Christoph)

   - Remove obsolete disk_name helper (Christoph)

   - block_device refcounting improvements (Christoph)

   - Ensure gendisk always has a request queue reference (Christoph)

   - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"

* tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
  sg: pass the device name to blk_trace_setup
  block, bfq: cleanup the repeated declaration
  blk-crypto: fix check for too-large dun_bytes
  blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
  blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
  block: mark blkdev_fsync static
  block: refine the disk_live check in del_gendisk
  mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
  mmc: block: Support alternative_gpt_sector() operation
  partitions/efi: Support non-standard GPT location
  block: Add alternative_gpt_sector() operation
  bio: fix page leak bio_add_hw_page failure
  block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
  block: remove a pointless call to MINOR() in device_add_disk
  null_blk: add error handling support for add_disk()
  virtio_blk: add error handling support for add_disk()
  block: add error handling for device_add_disk / add_disk
  block: return errors from disk_alloc_events
  block: return errors from blk_integrity_add
  block: call blk_register_queue earlier in device_add_disk
  ...
2021-08-30 18:52:11 -07:00
Chao Yu
f7db8dd698 f2fs: enable realtime discard iff device supports discard
Let's only enable realtime discard if and only if device supports
discard functionality.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-30 10:12:50 -07:00
Jaegeuk Kim
dddd3d6529 f2fs: guarantee to write dirty data when enabling checkpoint back
We must flush all the dirty data when enabling checkpoint back. Let's guarantee
that first by adding a retry logic on sync_inodes_sb(). In addition to that,
this patch adds to flush data in fsync when checkpoint is disabled, which can
mitigate the sync_inodes_sb() failures in advance.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-30 10:12:47 -07:00
Chao Yu
c8dc3047c4 f2fs: fix to unmap pages from userspace process in punch_hole()
We need to unmap pages from userspace process before removing pagecache
in punch_hole() like we did in f2fs_setattr().

Similar change:
commit 5e44f8c374 ("ext4: hole-punch use truncate_pagecache_range")

Fixes: fbfa2cc58d ("f2fs: add file operations")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-30 10:12:47 -07:00
Chao Yu
adf9ea89c7 f2fs: fix unexpected ENOENT comes from f2fs_map_blocks()
In below path, it will return ENOENT if filesystem is shutdown:

- f2fs_map_blocks
 - f2fs_get_dnode_of_data
  - f2fs_get_node_page
   - __get_node_page
    - read_node_page
     - is_sbi_flag_set(sbi, SBI_IS_SHUTDOWN)
       return -ENOENT
 - force return value from ENOENT to 0

It should be fine for read case, since it indicates a hole condition,
and caller could use .m_next_pgofs to skip the hole and continue the
lookup.

However it may cause confusing for write case, since leaving a hole
there, and said nothing was wrong doesn't help.

There is at least one case from dax_iomap_actor() will complain that,
so fix this in prior to supporting dax in f2fs.

xfstest generic/388 reports below warning:

ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 485833 at fs/dax.c:1127 dax_iomap_actor+0x339/0x370
Call Trace:
 iomap_apply+0x1c4/0x7b0
 ? dax_iomap_rw+0x1c0/0x1c0
 dax_iomap_rw+0xad/0x1c0
 ? dax_iomap_rw+0x1c0/0x1c0
 f2fs_file_write_iter+0x5ab/0x970 [f2fs]
 do_iter_readv_writev+0x273/0x2e0
 do_iter_write+0xab/0x1f0
 vfs_iter_write+0x21/0x40
 iter_file_splice_write+0x287/0x540
 do_splice+0x37c/0xa60
 __x64_sys_splice+0x15f/0x3a0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

ubuntu godown: xfstests-induced forced shutdown of /mnt/scratch_f2fs:
------------[ cut here ]------------
RIP: 0010:dax_iomap_pte_fault.isra.0+0x72e/0x14a0
Call Trace:
 dax_iomap_fault+0x44/0x70
 f2fs_dax_huge_fault+0x155/0x400 [f2fs]
 f2fs_dax_fault+0x18/0x30 [f2fs]
 __do_fault+0x4e/0x120
 do_fault+0x3cf/0x7a0
 __handle_mm_fault+0xa8c/0xf20
 ? find_held_lock+0x39/0xd0
 handle_mm_fault+0x1b6/0x480
 do_user_addr_fault+0x320/0xcd0
 ? rcu_read_lock_sched_held+0x67/0xc0
 exc_page_fault+0x77/0x3f0
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30

Fixes: 83a3bfdb5a ("f2fs: indicate shutdown f2fs to allow unmount successfully")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-30 10:12:47 -07:00
Chao Yu
ad126ebdde f2fs: fix to account missing .skipped_gc_rwsem
There is a missing place we forgot to account .skipped_gc_rwsem, fix it.

Fixes: 6f8d445506 ("f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-30 10:12:47 -07:00
Chao Yu
d75da8c8a4 f2fs: adjust unlock order for cleanup
This patch adjusts unlock order of .i_mmap_sem and .i_gc_rwsem for
cleanup.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-30 10:12:47 -07:00
Fengnan Chang
4d67490498 f2fs: Don't create discard thread when device doesn't support realtime discard
Don't create discard thread when device doesn't support realtime discard
or user specifies nodiscard mount option.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-30 10:12:46 -07:00
Chao Yu
94c821fb28 f2fs: rebuild nat_bits during umount
If all free_nat_bitmap are available, we can rebuild nat_bits from
free_nat_bitmap entirely during umount, let's make another chance
to reenable nat_bits for image.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-23 10:25:52 -07:00
Daeho Jeong
a4b6817625 f2fs: introduce periodic iostat io latency traces
Whenever we notice some sluggish issues on our machines, we are always
curious about how well all types of I/O in the f2fs filesystem are
handled. But, it's hard to get this kind of real data. First of all,
we need to reproduce the issue while turning on the profiling tool like
blktrace, but the issue doesn't happen again easily. Second, with the
intervention of any tools, the overall timing of the issue will be
slightly changed and it sometimes makes us hard to figure it out.

So, I added the feature printing out IO latency statistics tracepoint
events, which are minimal things to understand filesystem's I/O related
behaviors, into F2FS_IOSTAT kernel config. With "iostat_enable" sysfs
node on, we can get this statistics info in a periodic way and it
would cause the least overhead.

[samples]
 f2fs_ckpt-254:1-507     [003] ....  2842.439683: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [136/1/801], rd_node [136/1/1704], rd_meta [4/2/4],
wr_sync_data [164/16/3331], wr_sync_node [152/3/648],
wr_sync_meta [160/2/4243], wr_async_data [24/13/15],
wr_async_node [0/0/0], wr_async_meta [0/0/0]

 f2fs_ckpt-254:1-507     [002] ....  2845.450514: f2fs_iostat_latency:
dev = (254,11), iotype [peak lat.(ms)/avg lat.(ms)/count],
rd_data [60/3/456], rd_node [60/3/1258], rd_meta [0/0/1],
wr_sync_data [120/12/2285], wr_sync_node [88/5/428],
wr_sync_meta [52/6/2990], wr_async_data [4/1/3],
wr_async_node [0/0/0], wr_async_meta [0/0/0]

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-23 10:25:51 -07:00
Daeho Jeong
521187439a f2fs: separate out iostat feature
Added F2FS_IOSTAT config option to support getting IO statistics through
sysfs and printing out periodic IO statistics tracepoint events and
moved I/O statistics related codes into separate files for better
maintenance.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
[Jaegeuk Kim: set default=y]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-23 10:25:51 -07:00
Miklos Szeredi
0cad624662 vfs: add rcu argument to ->get_acl() callback
Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-18 22:08:24 +02:00
Damien Le Moal
202bc942c5 block: Introduce IOPRIO_NR_LEVELS
The BFQ scheduler and ioprio_check_cap() both assume that the RT
priority class (IOPRIO_CLASS_RT) can have up to 8 different priority
levels, similarly to the BE class (IOPRIO_CLASS_iBE). This is
controlled using the IOPRIO_BE_NR macro , which is badly named as the
number of levels also applies to the RT class.

Introduce the class independent IOPRIO_NR_LEVELS macro, defined to 8,
to make things clear. Keep the old IOPRIO_BE_NR macro definition as an
alias for IOPRIO_NR_LEVELS.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-6-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-18 07:21:12 -06:00
Chao Yu
bbe1da7e34 f2fs: compress: do sanity check on cluster
This patch adds f2fs_sanity_check_cluster() to support doing
sanity check on cluster of compressed file, it will be triggered
from below two paths:

- __f2fs_cluster_blocks()
- f2fs_map_blocks(F2FS_GET_BLOCK_FIEMAP)

And it can detect below three kind of cluster insanity status.

C: COMPRESS_ADDR
N: NULL_ADDR or NEW_ADDR
V: valid blkaddr
*: any value

1. [*|C|*|*]
2. [C|*|C|*]
3. [C|N|N|V]

Signed-off-by: Chao Yu <chao@kernel.org>
[Nathan Chancellor: fix missing inline warning]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-17 11:59:06 -07:00
Yangtao Li
491f7f71e1 f2fs: convert S_IRUGO to 0444
To fix:
WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-17 11:59:05 -07:00
Chao Yu
b96d9b3b09 f2fs: fix to keep compatibility of fault injection interface
The value of FAULT_* macros and its description in f2fs.rst became
inconsistent, fix this to keep compatibility of fault injection
interface.

Fixes: 67883ade7a ("f2fs: remove FAULT_ALLOC_BIO")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-17 11:59:05 -07:00
Chao Yu
324105775c f2fs: support fault injection for f2fs_kmem_cache_alloc()
This patch supports to inject fault into f2fs_kmem_cache_alloc().

Usage:
a) echo 32768 > /sys/fs/f2fs/<dev>/inject_type or
b) mount -o fault_type=32768 <dev> <mountpoint>

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-17 11:59:05 -07:00
Fengnan Chang
4a4fc043f5 f2fs: compress: allow write compress released file after truncate to zero
For compressed file, after release compress blocks, don't allow write
direct, but we should allow write direct after truncate to zero.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-17 11:59:04 -07:00
Yangtao Li
1927ccdb79 f2fs: correct comment in segment.h
s/two/three

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-13 11:57:08 -07:00
Yangtao Li
b6d9246d03 f2fs: improve sbi status info in debugfs/f2fs/status
Do not use numbers but strings to improve readability when flag is set.

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-13 11:57:02 -07:00
Fengnan Chang
a2649315bc f2fs: compress: avoid duplicate counting of valid blocks when read compressed file
Since cluster is basic unit of compression, one cluster is compressed or
not, so we can calculate valid blocks only for first page in cluster,
the other pages just skip.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-12 13:46:05 -07:00
Chao Yu
65ddf65648 f2fs: fix to do sanity check for sb/cp fields correctly
This patch fixes below problems of sb/cp sanity check:
- in sanity_check_raw_superi(), it missed to consider log header
blocks while cp_payload check.
- in f2fs_sanity_check_ckpt(), it missed to check nat_bits_blocks.

Cc: <stable@kernel.org>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-05 17:17:07 -07:00
Chao Yu
4b10651864 f2fs: avoid unneeded memory allocation in __add_ino_entry()
__add_ino_entry() will allocate slab cache even if we have already
cached ino entry in radix tree, e.g. for case of multiple devices.

Let's check radix tree first under protection of rcu lock to see
whether we need to do slab allocation, it will mitigate memory
pressure from "f2fs_ino_entry" slab cache.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-05 17:16:36 -07:00
Chao Yu
94afd6d6e5 f2fs: extent cache: support unaligned extent
Compressed inode may suffer read performance issue due to it can not
use extent cache, so I propose to add this unaligned extent support
to improve it.

Currently, it only works in readonly format f2fs image.

Unaligned extent: in one compressed cluster, physical block number
will be less than logical block number, so we add an extra physical
block length in extent info in order to indicate such extent status.

The idea is if one whole cluster blocks are contiguous physically,
once its mapping info was readed at first time, we will cache an
unaligned (or aligned) extent info entry in extent cache, it expects
that the mapping info will be hitted when rereading cluster.

Merge policy:
- Aligned extents can be merged.
- Aligned extent and unaligned extent can not be merged.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-05 11:26:11 -07:00
Tiezhu Yang
6b3ba1e77d f2fs: Kconfig: clean up config options about compression
In fs/f2fs/Kconfig, F2FS_FS_LZ4HC depends on F2FS_FS_LZ4 and F2FS_FS_LZ4
depends on F2FS_FS_COMPRESSION, so no need to make F2FS_FS_LZ4HC depends
on F2FS_FS_COMPRESSION explicitly, remove the redudant "depends on", do
the similar thing for F2FS_FS_LZORLE.

At the same time, it is better to move F2FS_FS_LZORLE next to F2FS_FS_LZO,
it looks like a little more clear when make menuconfig, the location of
"LZO-RLE compression support" is under "LZO compression support" instead
of "F2FS compression feature".

Without this patch:

F2FS compression feature
  LZO compression support
  LZ4 compression support
    LZ4HC compression support
  ZSTD compression support
  LZO-RLE compression support

With this patch:

F2FS compression feature
  LZO compression support
    LZO-RLE compression support
  LZ4 compression support
    LZ4HC compression support
  ZSTD compression support

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-05 10:46:58 -07:00
Yangtao Li
d4bf15a7ce f2fs: reduce the scope of setting fsck tag when de->name_len is zero
I recently found a case where de->name_len is 0 in f2fs_fill_dentries()
easily reproduced, and finally set the fsck flag.

Thread A			Thread B
- f2fs_readdir
 - f2fs_read_inline_dir
  - ctx->pos = d.max
				- f2fs_add_dentry
				 - f2fs_add_inline_entry
				  - do_convert_inline_dir
				 - f2fs_add_regular_entry
- f2fs_readdir
 - f2fs_fill_dentries
  - set_sbi_flag(sbi, SBI_NEED_FSCK)

Process A opens the folder, and has been reading without closing it.
During this period, Process B created a file under the folder (occupying
multiple f2fs_dir_entry, exceeding the d.max of the inline dir). After
creation, process A uses the d.max of inline dir to read it again, and
it will read that de->name_len is 0.

And Chao pointed out that w/o inline conversion, the race condition still
can happen as below:

dir_entry1: A
dir_entry2: B
dir_entry3: C
free slot: _
ctx->pos: ^

Thread A is traversing directory,
ctx-pos moves to below position after readdir() by thread A:
AAAABBBB___
        ^

Then thread B delete dir_entry2, and create dir_entry3.

Thread A calls readdir() to lookup dirents starting from middle
of new dirent slots as below:
AAAACCCCCC_
        ^
In these scenarios, the file system is not damaged, and it's hard to
avoid it. But we can bypass tagging FSCK flag if:
a) bit_pos (:= ctx->pos % d->max) is non-zero and
b) before bit_pos moves to first valid dir_entry.

Fixes: ddf06b753a ("f2fs: fix to trigger fsck if dirent.name_len is zero")
Signed-off-by: Yangtao Li <frank.li@vivo.com>
[Chao: clean up description]
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-04 15:51:21 -07:00
Chao Yu
91803392c7 f2fs: fix to stop filesystem update once CP failed
During f2fs_write_checkpoint(), once we failed in
f2fs_flush_nat_entries() or do_checkpoint(), metadata of filesystem
such as prefree bitmap, nat/sit version bitmap won't be recovered,
it may cause f2fs image to be inconsistent, let's just set CP error
flag to avoid further updates until we figure out a scheme to rollback
all metadatas in such condition.

Reported-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-03 18:08:29 -07:00
Daeho Jeong
0f6b56ec95 f2fs: add sysfs node to control ra_pages for fadvise seq file
fadvise() allows the user to expand the readahead window to double with
POSIX_FADV_SEQUENTIAL, now. But, in some use cases, it is not that
sufficient and we need to meet the need in a restricted way. We can
control the multiplier value of bdi device readahead between 2 (default)
and 256 for POSIX_FADV_SEQUENTIAL advise option.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-03 11:16:22 -07:00
Chao Yu
4f993264fe f2fs: introduce discard_unit mount option
As James Z reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=213877

[1.] One-line summary of the problem:
Mount multiple SMR block devices exceed certain number cause system non-response

[2.] Full description of the problem/report:
Created some F2FS on SMR devices (mkfs.f2fs -m), then mounted in sequence. Each device is the same Model: HGST HSH721414AL (Size 14TB).
Empirically, found that when the amount of SMR device * 1.5Gb > System RAM, the system ran out of memory and hung. No dmesg output. For example, 24 SMR Disk need 24*1.5GB = 36GB. A system with 32G RAM can only mount 21 devices, the 22nd device will be a reproducible cause of system hang.
The number of SMR devices with other FS mounted on this system does not interfere with the result above.

[3.] Keywords (i.e., modules, networking, kernel):
F2FS, SMR, Memory

[4.] Kernel information
[4.1.] Kernel version (uname -a):
Linux 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul 20 20:27:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

[4.2.] Kernel .config file:
Default Fedora 34 with f2fs-tools-1.14.0-2.fc34.x86_64

[5.] Most recent kernel version which did not have the bug:
None

[6.] Output of Oops.. message (if applicable) with symbolic information
     resolved (see Documentation/admin-guide/oops-tracing.rst)
None

[7.] A small shell script or example program which triggers the
     problem (if possible)
mount /dev/sdX /mnt/0X

[8.] Memory consumption

With 24 * 14T SMR Block device with F2FS
free -g
              total        used        free      shared  buff/cache   available
Mem:             46          36           0           0          10          10
Swap:             0           0           0

With 3 * 14T SMR Block device with F2FS
free -g
               total        used        free      shared  buff/cache   available
Mem:               7           5           0           0           1           1
Swap:              7           0           7

The root cause is, there are three bitmaps:
- cur_valid_map
- ckpt_valid_map
- discard_map
and each of them will cost ~500MB memory, {cur, ckpt}_valid_map are
necessary, but discard_map is optional, since this bitmap will only be
useful in mountpoint that small discard is enabled.

For a blkzoned device such as SMR or ZNS devices, f2fs will only issue
discard for a section(zone) when all blocks of that section are invalid,
so, for such device, we don't need small discard functionality at all.

This patch introduces a new mountoption "discard_unit=block|segment|
section" to support issuing discard with different basic unit which is
aligned to block, segment or section, so that user can specify
"discard_unit=segment" or "discard_unit=section" to disable small
discard functionality.

Note that this mount option can not be changed by remount() due to
related metadata need to be initialized during mount().

In order to save memory, let's use "discard_unit=section" for blkzoned
device by default.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-03 11:16:17 -07:00
Laibin Qiu
dc675a9712 f2fs: fix min_seq_blocks can not make sense in some scenes.
F2FS have dirty page count control for batched sequential
write in writepages, and get the value of min_seq_blocks by
blocks_per_seg * segs_per_sec(segs_per_sec defaults to 1).
But in some scenes we set a lager section size, Min_seq_blocks
will become too large to achieve the expected effect(eg. 4thread
sequential write, the number of merge requests will be reduced).

Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-02 11:24:26 -07:00
Chao Yu
2787991516 f2fs: fix to force keeping write barrier for strict fsync mode
[1] https://www.mail-archive.com/linux-f2fs-devel@lists.sourceforge.net/msg15126.html

As [1] reported, if lower device doesn't support write barrier, in below
case:

- write page #0; persist
- overwrite page #0
- fsync
 - write data page #0 OPU into device's cache
 - write inode page into device's cache
 - issue flush

If SPO is triggered during flush command, inode page can be persisted
before data page #0, so that after recovery, inode page can be recovered
with new physical block address of data page #0, however there may
contains dummy data in new physical block address.

Then what user will see is: after overwrite & fsync + SPO, old data in
file was corrupted, if any user do care about such case, we can suggest
user to use STRICT fsync mode, in this mode, we will force to use atomic
write sematics to keep write order in between data/node and last node,
so that it avoids potential data corruption during fsync().

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-02 11:24:26 -07:00
Chao Yu
277afbde6c f2fs: fix wrong checkpoint_changed value in f2fs_remount()
In f2fs_remount(), return value of test_opt() is an unsigned int type
variable, however when we compare it to a bool type variable, it cause
wrong result, fix it.

Fixes: 4354994f09 ("f2fs: checkpoint disabling")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-02 11:24:26 -07:00
Jaegeuk Kim
2e650912c0 f2fs: show sbi status in debugfs/f2fs/status
We need to get sbi->s_flag to understand the current f2fs status as well.
One example is SBI_NEED_FSCK.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-02 11:24:26 -07:00
Daeho Jeong
4931e0c93e f2fs: turn back remapped address in compressed page endio
Turned back the remmaped sector address to the address in the partition,
when ending io, for compress cache to work properly.

Fixes: 6ce19aff0b ("f2fs: compress: add compress_inode to cache
compressed blocks")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Hyeong Jun Kim <hj514.kim@samsung.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-02 11:24:25 -07:00
Daeho Jeong
093f0bac32 f2fs: change fiemap way in printing compression chunk
When we print out a discontinuous compression chunk, it shows like a
continuous chunk now. To show it more correctly, I've changed the way of
printing fiemap info like below. Plus, eliminated NEW_ADDR(-1) in fiemap
info, since it is not in fiemap user api manual.

Let's assume 16KB compression cluster.

<before>
   Logical          Physical         Length           Flags
0:  0000000000000000 00000002c091f000 0000000000004000 1008
1:  0000000000004000 00000002c0920000 0000000000004000 1008
  ...
9:  0000000000034000 0000000f8c623000 0000000000004000 1008
10: 0000000000038000 000000101a6eb000 0000000000004000 1008

<after>
0:  0000000000000000 00000002c091f000 0000000000004000 1008
1:  0000000000004000 00000002c0920000 0000000000004000 1008
  ...
9:  0000000000034000 0000000f8c623000 0000000000001000 1008
10: 0000000000035000 000000101a6ea000 0000000000003000 1008
11: 0000000000038000 000000101a6eb000 0000000000002000 1008
12: 000000000003a000 00000002c3544000 0000000000002000 1008

Flags
0x1000 => FIEMAP_EXTENT_MERGED
0x0008 => FIEMAP_EXTENT_ENCODED

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Tested-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-02 11:24:25 -07:00
Jaegeuk Kim
b7ec206173 f2fs: do not submit NEW_ADDR to read node block
After the below patch, give cp is errored, we drop dirty node pages. This
can give NEW_ADDR to read node pages. Don't do WARN_ON() which gives
generic/475 failure.

Fixes: 28607bf3aa ("f2fs: drop dirty node pages when cp is in error status")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-02 11:24:25 -07:00
Fengnan Chang
7eab7a6968 f2fs: compress: remove unneeded read when rewrite whole cluster
when we overwrite the whole page in cluster, we don't need read original
data before write, because after write_end(), writepages() can help to
load left data in that cluster.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-08-02 11:24:25 -07:00
Eric Biggers
461b43a8f9 f2fs: report correct st_size for encrypted symlinks
The stat() family of syscalls report the wrong size for encrypted
symlinks, which has caused breakage in several userspace programs.

Fix this by calling fscrypt_symlink_getattr() after f2fs_getattr() for
encrypted symlinks.  This function computes the correct size by reading
and decrypting the symlink target (if it's not already cached).

For more details, see the commit which added fscrypt_symlink_getattr().

Fixes: cbaf042a3c ("f2fs crypto: add symlink encryption")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210702065350.209646-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-07-25 20:01:06 -07:00
Jaegeuk Kim
2eeb0dce72 f2fs: don't sleep while grabing nat_tree_lock
This tries to fix priority inversion in the below condition resulting in
long checkpoint delay.

f2fs_get_node_info()
 - nat_tree_lock
  -> sleep to grab journal_rwsem by contention

                                     checkpoint
                                     - waiting for nat_tree_lock

In order to let checkpoint go, let's release nat_tree_lock, if there's a
journal_rwsem contention.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-25 08:42:38 -07:00
Eric Biggers
6de8687ccd f2fs: remove allow_outplace_dio()
We can just check f2fs_lfs_mode() directly.  The block_unaligned_IO()
check is redundant because in LFS mode, f2fs doesn't do direct I/O
writes that aren't block-aligned (due to f2fs_force_buffered_io()
returning true in this case, triggering the fallback to buffered I/O).

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-25 08:42:37 -07:00
Eric Biggers
3e679dc78c f2fs: make f2fs_write_failed() take struct inode
Make f2fs_write_failed() take a 'struct inode' directly rather than a
'struct address_space', as this simplifies it slightly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-25 08:42:31 -07:00
Chao Yu
9de71ede81 f2fs: quota: fix potential deadlock
xfstest generic/587 reports a deadlock issue as below:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc1 #69 Not tainted
------------------------------------------------------
repquota/8606 is trying to acquire lock:
ffff888022ac9320 (&sb->s_type->i_mutex_key#18){+.+.}-{3:3}, at: f2fs_quota_sync+0x207/0x300 [f2fs]

but task is already holding lock:
ffff8880084bcde8 (&sbi->quota_sem){.+.+}-{3:3}, at: f2fs_quota_sync+0x59/0x300 [f2fs]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&sbi->quota_sem){.+.+}-{3:3}:
       __lock_acquire+0x648/0x10b0
       lock_acquire+0x128/0x470
       down_read+0x3b/0x2a0
       f2fs_quota_sync+0x59/0x300 [f2fs]
       f2fs_quota_on+0x48/0x100 [f2fs]
       do_quotactl+0x5e3/0xb30
       __x64_sys_quotactl+0x23a/0x4e0
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #1 (&sbi->cp_rwsem){++++}-{3:3}:
       __lock_acquire+0x648/0x10b0
       lock_acquire+0x128/0x470
       down_read+0x3b/0x2a0
       f2fs_unlink+0x353/0x670 [f2fs]
       vfs_unlink+0x1c7/0x380
       do_unlinkat+0x413/0x4b0
       __x64_sys_unlinkat+0x50/0xb0
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #0 (&sb->s_type->i_mutex_key#18){+.+.}-{3:3}:
       check_prev_add+0xdc/0xb30
       validate_chain+0xa67/0xb20
       __lock_acquire+0x648/0x10b0
       lock_acquire+0x128/0x470
       down_write+0x39/0xc0
       f2fs_quota_sync+0x207/0x300 [f2fs]
       do_quotactl+0xaff/0xb30
       __x64_sys_quotactl+0x23a/0x4e0
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
  &sb->s_type->i_mutex_key#18 --> &sbi->cp_rwsem --> &sbi->quota_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&sbi->quota_sem);
                               lock(&sbi->cp_rwsem);
                               lock(&sbi->quota_sem);
  lock(&sb->s_type->i_mutex_key#18);

 *** DEADLOCK ***

3 locks held by repquota/8606:
 #0: ffff88801efac0e0 (&type->s_umount_key#53){++++}-{3:3}, at: user_get_super+0xd9/0x190
 #1: ffff8880084bc380 (&sbi->cp_rwsem){++++}-{3:3}, at: f2fs_quota_sync+0x3e/0x300 [f2fs]
 #2: ffff8880084bcde8 (&sbi->quota_sem){.+.+}-{3:3}, at: f2fs_quota_sync+0x59/0x300 [f2fs]

stack backtrace:
CPU: 6 PID: 8606 Comm: repquota Not tainted 5.14.0-rc1 #69
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
 dump_stack_lvl+0xce/0x134
 dump_stack+0x17/0x20
 print_circular_bug.isra.0.cold+0x239/0x253
 check_noncircular+0x1be/0x1f0
 check_prev_add+0xdc/0xb30
 validate_chain+0xa67/0xb20
 __lock_acquire+0x648/0x10b0
 lock_acquire+0x128/0x470
 down_write+0x39/0xc0
 f2fs_quota_sync+0x207/0x300 [f2fs]
 do_quotactl+0xaff/0xb30
 __x64_sys_quotactl+0x23a/0x4e0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f883b0b4efe

The root cause is ABBA deadlock of inode lock and cp_rwsem,
reorder locks in f2fs_quota_sync() as below to fix this issue:
- lock inode
- lock cp_rwsem
- lock quota_sem

Fixes: db6ec53b7e ("f2fs: add a rw_sem to cover quota flag changes")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-19 18:16:43 -07:00
Jaegeuk Kim
1ffc8f5f77 f2fs: let's keep writing IOs on SBI_NEED_FSCK
SBI_NEED_FSCK is an indicator that fsck.f2fs needs to be triggered, so it
is not fully critical to stop any IO writes. So, let's allow to write data
instead of reporting EIO forever given SBI_NEED_FSCK, but do keep OPU.

Fixes: 9557727876 ("f2fs: drop inplace IO if fs status is abnormal")
Cc: <stable@kernel.org> # v5.13+
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-19 18:16:40 -07:00
Jia Yang
10d0786b39 f2fs: Revert "f2fs: Fix indefinite loop in f2fs_gc() v1"
This reverts commit 957fa47823.

The patch "f2fs: Fix indefinite loop in f2fs_gc()" v1 and v4 are all
merged. Patch v4 is test info for patch v1. Patch v1 doesn't work and
may cause that sbi->cur_victim_sec can't be resetted to NULL_SEGNO,
which makes SSR unable to get segment of sbi->cur_victim_sec.
So it should be reverted.

The mails record:
[1] https://lore.kernel.org/linux-f2fs-devel/7288dcd4-b168-7656-d1af-7e2cafa4f720@huawei.com/T/
[2] https://lore.kernel.org/linux-f2fs-devel/20190809153653.GD93481@jaegeuk-macbookpro.roam.corp.google.com/T/

Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-19 11:54:48 -07:00
Wang Xiaojun
5417c98c12 f2fs: avoid to create an empty string as the extension_list
When creating a file, we need to set the temperature based on
extension_list. If the empty string is a valid extension_list,
the is_extension_exist will always returns true,
which affects the separation of hot and cold.

Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-13 16:11:08 -07:00
Chao Yu
01f6afd0f3 f2fs: compress: fix to set zstd compress level correctly
As 5kft reported in [1]:

set_compress_context() should set compress level into .i_compress_flag
for zstd as well as lz4hc, otherwise, zstd compressor will still use
default zstd compress level during compression, fix it.

[1] https://lore.kernel.org/linux-f2fs-devel/8e29f52b-6b0d-45ec-9520-e63eb254287a@www.fastmail.com/T/#u

Fixes: 3fde13f817 ("f2fs: compress: support compress level")
Reported-by: 5kft <5kft@5kft.org>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-13 16:10:35 -07:00
Daeho Jeong
07c6b5933e f2fs: add sysfs nodes to get GC info for each GC mode
Added gc_reclaimed_segments and gc_segment_mode sysfs nodes.
1) "gc_reclaimed_segments" shows how many segments have been
reclaimed by GC during a specific GC mode.
2) "gc_segment_mode" is used to control for which gc mode
the "gc_reclaimed_segments" node shows.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-13 16:09:55 -07:00
Jan Kara
edc6d01bad f2fs: Convert to using invalidate_lock
Use invalidate_lock instead of f2fs' private i_mmap_sem. The intended
purpose is exactly the same. By this conversion we fix a long standing
race between hole punching and read(2) / readahead(2) paths that can
lead to stale page cache contents.

CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: Chao Yu <yuchao0@huawei.com>
CC: linux-f2fs-devel@lists.sourceforge.net
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-07-13 14:29:01 +02:00
Jaegeuk Kim
28607bf3aa f2fs: drop dirty node pages when cp is in error status
Otherwise, writeback is going to fall in a loop to flush dirty inode forever
before getting SBI_CLOSING.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-06 22:05:06 -07:00
Jaegeuk Kim
c9ebd3df43 f2fs: initialize page->private when using for our internal use
We need to guarantee it's initially zero. Otherwise, it'll hurt entire flag
operations.

Fixes: b763f3bedc ("f2fs: restructure f2fs page.private layout")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-05 17:17:30 -07:00
Fengnan Chang
151b1982be f2fs: compress: add nocompress extensions support
When we create a directory with enable compression, all file write into
directory will try to compress.But sometimes we may know, new file
cannot meet compression ratio requirements.
We need a nocompress extension to skip those files to avoid unnecessary
compress page test.

After add nocompress_extension, the priority should be:
dir_flag < comp_extention,nocompress_extension < comp_file_flag,
no_comp_file_flag.

Priority in between FS_COMPR_FL, FS_NOCOMP_FS, extensions:
   * compress_extension=so; nocompress_extension=zip; chattr +c dir;
     touch dir/foo.so; touch dir/bar.zip; touch dir/baz.txt; then foo.so
     and baz.txt should be compresse, bar.zip should be non-compressed.
     chattr +c dir/bar.zip can enable compress on bar.zip.
   * compress_extension=so; nocompress_extension=zip; chattr -c dir;
     touch dir/foo.so; touch dir/bar.zip; touch dir/baz.txt; then foo.so
     should be compresse, bar.zip and baz.txt should be non-compressed.
     chattr+c dir/bar.zip; chattr+c dir/baz.txt; can enable compress on
     bar.zip and baz.txt.

Signed-off-by: Fengnan Chang <changfengnan@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-07-01 18:34:31 -07:00
Jaegeuk Kim
132e320978 f2fs: remove false alarm on iget failure during GC
This patch removes setting SBI_NEED_FSCK when GC gets an error on f2fs_iget,
since f2fs_iget can give ENOMEM and others by race condition.
If we set this critical fsck flag, we'll get EIO during fsync via the below
code path.

In f2fs_inplace_write_data(),

	if (is_sbi_flag_set(sbi, SBI_NEED_FSCK) || f2fs_cp_error(sbi)) {
		err = -EIO;
		goto drop_bio;
	}

Fixes: 9557727876 ("f2fs: drop inplace IO if fs status is abnormal")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-28 08:47:52 -07:00
Daeho Jeong
4215d054ae f2fs: enable extent cache for compression files in read-only
Let's allow extent cache for RO partition.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-28 08:43:51 -07:00
Chao Yu
4d9a2bb1a6 f2fs: introduce f2fs_casefolded_name slab cache
Add a slab cache: "f2fs_casefolded_name" for memory allocation
of casefold name.

Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:35 -07:00
Chao Yu
859fca6b70 f2fs: swap: support migrating swapfile in aligned write mode
This patch supports to migrate swapfile in aligned write mode during
swapon in order to keep swapfile being aligned to section as much as
possible, then pinned swapfile will locates fully filled section which
may not affected by GC.

However, for the case that swapfile's size is not aligned to section
size, it will still leave last extent in file's tail as unaligned due
to its size is smaller than section size, like case #2.

case #1
xfs_io -f /mnt/f2fs/file -c "pwrite 0 4M" -c "fsync"

Before swapon:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..3047]:       1123352..1126399  3048 0x1000
   1: [3048..7143]:    237568..241663    4096 0x1000
   2: [7144..8191]:    245760..246807    1048 0x1001
After swapon:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..8191]:       249856..258047    8192 0x1001
Kmsg:
F2FS-fs (zram0): Swapfile (2) is not align to section:
1) creat(), 2) ioctl(F2FS_IOC_SET_PIN_FILE), 3) fallocate(2097152 * n)

case #2
xfs_io -f /mnt/f2fs/file -c "pwrite 0 3M" -c "fsync"

Before swapon:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..3047]:       246808..249855    3048 0x1000
   1: [3048..6143]:    237568..240663    3096 0x1001
After swapon:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..4095]:       258048..262143    4096 0x1000
   1: [4096..6143]:    238616..240663    2048 0x1001
Kmsg:
F2FS-fs (zram0): Swapfile: last extent is not aligned to section
F2FS-fs (zram0): Swapfile (2) is not align to section:
1) creat(), 2) ioctl(F2FS_IOC_SET_PIN_FILE), 3) fallocate(2097152 * n)

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:35 -07:00
Chao Yu
0b8fc00601 f2fs: swap: remove dead codes
After commit af4b6b8edf ("f2fs: introduce check_swap_activate_fast()"),
we will never run into original logic of check_swap_activate() before
f2fs supports non 4k-sized page, so let's delete those dead codes.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:35 -07:00
Chao Yu
6ce19aff0b f2fs: compress: add compress_inode to cache compressed blocks
Support to use address space of inner inode to cache compressed block,
in order to improve cache hit ratio of random read.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:35 -07:00
Jaegeuk Kim
4c89b53d05 f2fs: clean up /sys/fs/f2fs/<disk>/features
Let's create /sys/fs/f2fs/<disk>/feature_list/ to meet sysfs rule.

Note that there are three feature list entries:
1) /sys/fs/f2fs/features
  : shows runtime features supported by in-kernel f2fs along with Kconfig.
    - ref. F2FS_FEATURE_RO_ATTR()

2) /sys/fs/f2fs/$s_id/features <deprecated>
  : shows on-disk features enabled by mkfs.f2fs, used for old kernels. This
    won't add new feature anymore, and thus, users should check entries in 3)
    instead of this 2).

3) /sys/fs/f2fs/$s_id/feature_list
  : shows on-disk features enabled by mkfs.f2fs per instance, which follows
    sysfs entry rule where each entry should expose single value.
    This list covers old feature list provided by 2) and beyond. Therefore,
    please add new on-disk feature in this list only.
    - ref. F2FS_SB_FEATURE_RO_ATTR()

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:35 -07:00
Jaegeuk Kim
4a196df4cf f2fs: add pin_file in feature list
This patch adds missing pin_file feature supported by kernel.

Fixes: f5a53edcf0 ("f2fs: support aligned pinned file")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:35 -07:00
Daniel Rosenberg
4c039d5452 f2fs: Advertise encrypted casefolding in sysfs
Older kernels don't support encryption with casefolding. This adds
the sysfs entry encrypted_casefold to show support for those combined
features. Support for this feature was originally added by
commit 7ad08a58bf ("f2fs: Handle casefolding with Encryption")

Fixes: 7ad08a58bf ("f2fs: Handle casefolding with Encryption")
Cc: stable@vger.kernel.org # v5.11+
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:34 -07:00
Daniel Rosenberg
39307f8ee3 f2fs: Show casefolding support only when supported
The casefolding feature is only supported when CONFIG_UNICODE is set.
This modifies the feature list f2fs presents under sysfs accordingly.

Fixes: 5aba54302a ("f2fs: include charset encoding information in the superblock")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:34 -07:00
Jaegeuk Kim
a7d9fe3c33 f2fs: support RO feature
Given RO feature in superblock, we don't need to check provisioning/reserve
spaces and SSA area.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:34 -07:00
Joe Perches
833dcd3545 f2fs: logging neatening
Update the logging uses that have unnecessary newlines as the f2fs_printk
function and so its f2fs_<level> macro callers already adds one.

This allows searching single line logging entries with an easier grep and
also avoids unnecessary blank lines in the logging.

Miscellanea:

o Coalesce formats
o Align to open parenthesis

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:34 -07:00
Jaegeuk Kim
c61404153e f2fs: introduce FI_COMPRESS_RELEASED instead of using IMMUTABLE bit
Once we release compressed blocks, we used to set IMMUTABLE bit. But it turned
out it disallows every fs operations which we don't need for compression.

Let's just prevent writing data only.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:34 -07:00
Chao Yu
8f1d498326 f2fs: compress: remove unneeded preallocation
We will reserve iblocks for compression saved, so during compressed
cluster overwrite, we don't need to preallocate blocks for later
write.

In addition, it adds a bug_on to detect wrong reserved iblock number
in __f2fs_cluster_blocks().

Bug fix in the original patch by Jaegeuk:
If we released compressed blocks having an immutable bit, we can see less
number of compressed block addresses. Let's fix wrong BUG_ON.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-06-23 01:09:33 -07:00
Chao Yu
8939a8489c f2fs: atgc: export entries for better tunability via sysfs
This patch export below sysfs entries for better ATGC tunability.

/sys/fs/f2fs/<disk>/atgc_candidate_ratio
/sys/fs/f2fs/<disk>/atgc_candidate_count
/sys/fs/f2fs/<disk>/atgc_age_weight
/sys/fs/f2fs/<disk>/atgc_age_threshold

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-26 07:00:36 -07:00
Chao Yu
4a67d9b07a f2fs: compress: fix to disallow temp extension
This patch restricts to configure compress extension as format of:

 [filename + '.' + extension]

rather than:

 [filename + '.' + extension + (optional: '.' + temp extension)]

in order to avoid to enable compression incorrectly:

1. compress_extension=so
2. touch file.soa
3. touch file.so.tmp

Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-26 07:00:36 -07:00
Jaegeuk Kim
e3c548323d f2fs: let's allow compression for mmap files
This patch allows to compress mmap files. E.g., for so files.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-26 07:00:36 -07:00
Chao Yu
0dd571785d f2fs: add MODULE_SOFTDEP to ensure crc32 is included in the initramfs
As marcosfrm reported in bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=213089

Initramfs generators rely on "pre" softdeps (and "depends") to include
additional required modules.

F2FS does not declare "pre: crc32" softdep. Then every generator (dracut,
mkinitcpio...) has to maintain a hardcoded list for this purpose.

Hence let's use MODULE_SOFTDEP("pre: crc32") in f2fs code.

Fixes: 43b6573bac ("f2fs: use cryptoapi crc32 functions")
Reported-by: marcosfrm <marcosfrm@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-26 07:00:36 -07:00
Tom Rix
4f55dc2a98 f2fs: return success if there is no work to do
Static analysis reports this problem
file.c:3206:2: warning: Undefined or garbage value returned to caller
        return err;
        ^~~~~~~~~~

err is only set if there is some work to do.  Because the loop returns
immediately on an error, if all the work was done, a 0 would be returned.
Instead of checking the unlikely case that there was no work to do,
change the return of err to 0.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-26 07:00:35 -07:00
Chao Yu
91f0fb6903 f2fs: compress: clean up parameter of __f2fs_cluster_blocks()
Previously, in order to reuse __f2fs_cluster_blocks(),
f2fs_is_compressed_cluster() assigned a compress_ctx type variable,
which is used to pass few parameters (cc.inode, cc.cluster_size,
cc.cluster_idx), it's wasteful to allocate such large space in stack.

Let's clean up parameters of __f2fs_cluster_blocks() to avoid that.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-14 11:22:09 -07:00
Chao Yu
fbec3b963a f2fs: compress: remove unneeded f2fs_put_dnode()
If we don't initialize dn.inode_page for f2fs_get_block(),
f2fs_get_block() will call f2fs_put_dnode() itself, so let's
remove unneeded f2fs_put_dnode() in f2fs_vm_page_mkwrite().

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-14 11:22:08 -07:00
Chao Yu
89e53ff165 f2fs: atgc: fix to set default age threshold
Default age threshold value is missed to set, fix it.

Fixes: 093749e296 ("f2fs: support age threshold based garbage collection")
Reported-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-14 11:22:08 -07:00
Shin'ichiro Kawasaki
d927ccfccb f2fs: Prevent swap file in LFS mode
The kernel writes to swap files on f2fs directly without the assistance
of the filesystem. This direct write by kernel can be non-sequential
even when the f2fs is in LFS mode. Such non-sequential write conflicts
with the LFS semantics. Especially when f2fs is set up on zoned block
devices, the non-sequential write causes unaligned write command errors.

To avoid the non-sequential writes to swap files, prevent swap file
activation when the filesystem is in LFS mode.

Fixes: 4969c06a0d ("f2fs: support swap file w/ DIO")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.10+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-14 11:22:08 -07:00
Chao Yu
cad83c968c f2fs: fix to avoid racing on fsync_entry_slab by multi filesystem instances
As syzbot reported, there is an use-after-free issue during f2fs recovery:

Use-after-free write at 0xffff88823bc16040 (in kfence-#10):
 kmem_cache_destroy+0x1f/0x120 mm/slab_common.c:486
 f2fs_recover_fsync_data+0x75b0/0x8380 fs/f2fs/recovery.c:869
 f2fs_fill_super+0x9393/0xa420 fs/f2fs/super.c:3945
 mount_bdev+0x26c/0x3a0 fs/super.c:1367
 legacy_get_tree+0xea/0x180 fs/fs_context.c:592
 vfs_get_tree+0x86/0x270 fs/super.c:1497
 do_new_mount fs/namespace.c:2905 [inline]
 path_mount+0x196f/0x2be0 fs/namespace.c:3235
 do_mount fs/namespace.c:3248 [inline]
 __do_sys_mount fs/namespace.c:3456 [inline]
 __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
 do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The root cause is multi f2fs filesystem instances can race on accessing
global fsync_entry_slab pointer, result in use-after-free issue of slab
cache, fixes to init/destroy this slab cache only once during module
init/destroy procedure to avoid this issue.

Reported-by: syzbot+9d90dad32dd9727ed084@syzkaller.appspotmail.com
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-14 11:22:08 -07:00
Chao Yu
b763f3bedc f2fs: restructure f2fs page.private layout
Restruct f2fs page private layout for below reasons:

There are some cases that f2fs wants to set a flag in a page to
indicate a specified status of page:
a) page is in transaction list for atomic write
b) page contains dummy data for aligned write
c) page is migrating for GC
d) page contains inline data for inline inode flush
e) page belongs to merkle tree, and is verified for fsverity
f) page is dirty and has filesystem/inode reference count for writeback
g) page is temporary and has decompress io context reference for compression

There are existed places in page structure we can use to store
f2fs private status/data:
- page.flags: PG_checked, PG_private
- page.private

However it was a mess when we using them, which may cause potential
confliction:
		page.private	PG_private	PG_checked	page._refcount (+1 at most)
a)		-1		set				+1
b)		-2		set
c), d), e)					set
f)		0		set				+1
g)		pointer		set

The other problem is page.flags has no free slot, if we can avoid set
zero to page.private and set PG_private flag, then we use non-zero value
to indicate PG_private status, so that we may have chance to reclaim
PG_private slot for other usage. [1]

The other concern is f2fs has bad scalability in aspect of indicating
more page status.

So in this patch, let's restructure f2fs' page.private as below to
solve above issues:

Layout A: lowest bit should be 1
| bit0 = 1 | bit1 | bit2 | ... | bit MAX | private data .... |
 bit 0	PAGE_PRIVATE_NOT_POINTER
 bit 1	PAGE_PRIVATE_ATOMIC_WRITE
 bit 2	PAGE_PRIVATE_DUMMY_WRITE
 bit 3	PAGE_PRIVATE_ONGOING_MIGRATION
 bit 4	PAGE_PRIVATE_INLINE_INODE
 bit 5	PAGE_PRIVATE_REF_RESOURCE
 bit 6-	f2fs private data

Layout B: lowest bit should be 0
 page.private is a wrapped pointer.

After the change:
		page.private	PG_private	PG_checked	page._refcount (+1 at most)
a)		11		set				+1
b)		101		set				+1
c)		1001		set				+1
d)		10001		set				+1
e)						set
f)		100001		set				+1
g)		pointer		set				+1

[1] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org/T/#u

Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-14 11:22:08 -07:00
Chao Yu
ee68d27181 f2fs: add cp_error check in f2fs_write_compressed_pages
This patch adds cp_error check in f2fs_write_compressed_pages() like we did
in f2fs_write_single_data_page()

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-14 11:22:07 -07:00
Chao Yu
5db479f049 f2fs: compress: rename __cluster_may_compress
This patch renames __cluster_may_compress() to cluster_has_invalid_data() for
better readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-14 11:22:07 -07:00
Jaegeuk Kim
f395183f95 f2fs: return EINVAL for hole cases in swap file
This tries to fix xfstests/generic/495.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-12 07:38:00 -07:00
Jaegeuk Kim
ca298241bc f2fs: avoid swapon failure by giving a warning first
The final solution can be migrating blocks to form a section-aligned file
internally. Meanwhile, let's ask users to do that when preparing the swap
file initially like:
1) create()
2) ioctl(F2FS_IOC_SET_PIN_FILE)
3) fallocate()

Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: 36e4d95891 ("f2fs: check if swapfile is section-alligned")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-11 20:51:53 -07:00
Chao Yu
8bfbfb0ddd f2fs: compress: fix to assign cc.cluster_idx correctly
In f2fs_destroy_compress_ctx(), after f2fs_destroy_compress_ctx(),
cc.cluster_idx will be cleared w/ NULL_CLUSTER, f2fs_cluster_blocks()
may check wrong cluster metadata, fix it.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-11 14:48:12 -07:00
Chao Yu
a949dc5f2c f2fs: compress: fix race condition of overwrite vs truncate
pos_fsstress testcase complains a panic as belew:

------------[ cut here ]------------
kernel BUG at fs/f2fs/compress.c:1082!
invalid opcode: 0000 [#1] SMP PTI
CPU: 4 PID: 2753477 Comm: kworker/u16:2 Tainted: G           OE     5.12.0-rc1-custom #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Workqueue: writeback wb_workfn (flush-252:16)
RIP: 0010:prepare_compress_overwrite+0x4c0/0x760 [f2fs]
Call Trace:
 f2fs_prepare_compress_overwrite+0x5f/0x80 [f2fs]
 f2fs_write_cache_pages+0x468/0x8a0 [f2fs]
 f2fs_write_data_pages+0x2a4/0x2f0 [f2fs]
 do_writepages+0x38/0xc0
 __writeback_single_inode+0x44/0x2a0
 writeback_sb_inodes+0x223/0x4d0
 __writeback_inodes_wb+0x56/0xf0
 wb_writeback+0x1dd/0x290
 wb_workfn+0x309/0x500
 process_one_work+0x220/0x3c0
 worker_thread+0x53/0x420
 kthread+0x12f/0x150
 ret_from_fork+0x22/0x30

The root cause is truncate() may race with overwrite as below,
so that one reference count left in page can not guarantee the
page attaching in mapping tree all the time, after truncation,
later find_lock_page() may return NULL pointer.

- prepare_compress_overwrite
 - f2fs_pagecache_get_page
 - unlock_page
					- f2fs_setattr
					 - truncate_setsize
					  - truncate_inode_page
					   - delete_from_page_cache
 - find_lock_page

Fix this by avoiding referencing updated page.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-11 14:48:12 -07:00
Chao Yu
a12cc5b423 f2fs: compress: fix to free compress page correctly
In error path of f2fs_write_compressed_pages(), it needs to call
f2fs_compress_free_page() to release temporary page.

Fixes: 5e6bbde959 ("f2fs: introduce mempool for {,de}compress intermediate page allocation")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-11 14:48:12 -07:00
Jaegeuk Kim
a753103909 f2fs: support iflag change given the mask
In f2fs_fileattr_set(),

	if (!fa->flags_valid)
		mask &= FS_COMMON_FL;

In this case, we can set supported flags by mask only instead of BUG_ON.

/* Flags shared betwen flags/xflags */
	(FS_SYNC_FL | FS_IMMUTABLE_FL | FS_APPEND_FL | \
	 FS_NODUMP_FL |	FS_NOATIME_FL | FS_DAX_FL | \
	 FS_PROJINHERIT_FL)

Fixes: 9b1bb01c8a ("f2fs: convert to fileattr")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-11 14:48:11 -07:00
Jaegeuk Kim
349c4d6c75 f2fs: avoid null pointer access when handling IPU error
Unable to handle kernel NULL pointer dereference at virtual address 000000000000001a
 pc : f2fs_inplace_write_data+0x144/0x208
 lr : f2fs_inplace_write_data+0x134/0x208
 Call trace:
  f2fs_inplace_write_data+0x144/0x208
  f2fs_do_write_data_page+0x270/0x770
  f2fs_write_single_data_page+0x47c/0x830
  __f2fs_write_data_pages+0x444/0x98c
  f2fs_write_data_pages.llvm.16514453770497736882+0x2c/0x38
  do_writepages+0x58/0x118
  __writeback_single_inode+0x44/0x300
  writeback_sb_inodes+0x4b8/0x9c8
  wb_writeback+0x148/0x42c
  wb_do_writeback+0xc8/0x390
  wb_workfn+0xb0/0x2f4
  process_one_work+0x1fc/0x444
  worker_thread+0x268/0x4b4
  kthread+0x13c/0x158
  ret_from_fork+0x10/0x18

Fixes: 9557727876 ("f2fs: drop inplace IO if fs status is abnormal")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-05-11 14:48:07 -07:00
Linus Torvalds
d0195c7d7a f2fs-for-5.13-rc1
In this round, we added a new mount option, "checkpoint_merge", which introduces
 a kernel thread dealing with the f2fs checkpoints. Once we start to manage the
 IO priority along with blk-cgroup, the checkpoint operation can be processed in
 a lower priority under the process context. Since the checkpoint holds all the
 filesystem operations, we give a higher priority to the checkpoint thread all
 the time.
 
 Enhancement:
 - introduce gc_merge mount option to introduce a checkpoint thread
 - improve to run discard thread efficiently
 - allow modular compression algorithms
 - expose # of overprivision segments to sysfs
 - expose runtime compression stat to sysfs
 
 Bug fix:
 - fix OOB memory access by the node id lookup
 - avoid touching checkpointed data in the checkpoint-disabled mode
 - fix the resizing flow to avoid kernel panic and race conditions
 - fix block allocation issues on pinned files
 - address some swapfile issues
 - fix hugtask problem and kernel panic during atomic write operations
 - don't start checkpoint thread in RO
 
 And, we've cleaned up some kernel coding style and build warnings. In addition,
 we fixed some minor race conditions and error handling routines.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmCQhVIACgkQQBSofoJI
 UNIggA/8DZINzFLMCj6+6P5wNAWj3nYtx/FnwZ7C31f8qkiZjgA4LfONUnDvV7sU
 GS8MuLQz4eTYfqU2rVgGiSm+aCkEOovnk7C7Huo7pezgqYb+5J6ACXsqdU3dcD5M
 kShJMqLKcTKqtOMbnJrdGvmw1/ysuAi7UhSSgVV+9NQxlhxADnagOGbQ7lXNSV3R
 spGMWazGY2uA5DFCCa4lMX79lyFATCzEKB3SKAW5r+8QSmxJY8ViK2Er7AnvwRJz
 XJ/QJ8ALNb/GGyHzBWFv3P6Yxo/G3FkUvTIc5Rhi9P2lUgjI2NALokj7AOnfNh4a
 uSXHVlNrrfH+gpx9xr5z8MUmroCYCCOJ6EhnVweqViRmekY8jSb2HxmUtDTIf19U
 LWl3gtD2GDQx6CY0a0K58Oa2Lp0Bp9MWUdPA/4P21EymZwXum7aCkhV+DnigcoCj
 yCmKlI8nIpCS97dIO/7MsnG6Tu/7c+Prytd2ezUo+6hlkXPZk8shs+elnjlWu/6V
 3ZpSWKzQsPJL8U7eB9H04AxEokrrXm3fRhR86C7JdkEe0gyGFf3dB/G+jqzcNi5m
 ZpxiCeZ8RbNmPpnH8NWeYHk9uDKMOXuUPFYDoaOwImNWfqj0jfhiTxHo4MyBLAuk
 MT632ICcuJLvwgnSbMAI3U7v6+dZXKH4y6U7IHFjxhI7beMzxmI=
 =P5uk
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we added a new mount option, "checkpoint_merge", which
  introduces a kernel thread dealing with the f2fs checkpoints. Once we
  start to manage the IO priority along with blk-cgroup, the checkpoint
  operation can be processed in a lower priority under the process
  context. Since the checkpoint holds all the filesystem operations, we
  give a higher priority to the checkpoint thread all the time.

  Enhancements:
   - introduce gc_merge mount option to introduce a checkpoint thread
   - improve to run discard thread efficiently
   - allow modular compression algorithms
   - expose # of overprivision segments to sysfs
   - expose runtime compression stat to sysfs

  Bug fixes:
   - fix OOB memory access by the node id lookup
   - avoid touching checkpointed data in the checkpoint-disabled mode
   - fix the resizing flow to avoid kernel panic and race conditions
   - fix block allocation issues on pinned files
   - address some swapfile issues
   - fix hugtask problem and kernel panic during atomic write operations
   - don't start checkpoint thread in RO

  And, we've cleaned up some kernel coding style and build warnings. In
  addition, we fixed some minor race conditions and error handling
  routines"

* tag 'f2fs-for-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (48 commits)
  f2fs: drop inplace IO if fs status is abnormal
  f2fs: compress: remove unneed check condition
  f2fs: clean up left deprecated IO trace codes
  f2fs: avoid using native allocate_segment_by_default()
  f2fs: remove unnecessary struct declaration
  f2fs: fix to avoid NULL pointer dereference
  f2fs: avoid duplicated codes for cleanup
  f2fs: document: add description about compressed space handling
  f2fs: clean up build warnings
  f2fs: fix the periodic wakeups of discard thread
  f2fs: fix to avoid accessing invalid fio in f2fs_allocate_data_block()
  f2fs: fix to avoid GC/mmap race with f2fs_truncate()
  f2fs: set checkpoint_merge by default
  f2fs: Fix a hungtask problem in atomic write
  f2fs: fix to restrict mount condition on readonly block device
  f2fs: introduce gc_merge mount option
  f2fs: fix to cover __allocate_new_section() with curseg_lock
  f2fs: fix wrong alloc_type in f2fs_do_replace_block
  f2fs: delete empty compress.h
  f2fs: fix a typo in inode.c
  ...
2021-05-04 18:03:38 -07:00
Linus Torvalds
27787ba3fa Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff all over the place"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  useful constants: struct qstr for ".."
  hostfs_open(): don't open-code file_dentry()
  whack-a-mole: kill strlen_user() (again)
  autofs: should_expire() argument is guaranteed to be positive
  apparmor:match_mn() - constify devpath argument
  buffer: a small optimization in grow_buffers
  get rid of autofs_getpath()
  constify dentry argument of dentry_path()/dentry_path_raw()
2021-05-02 09:14:01 -07:00
Linus Torvalds
820c4bae40 Network filesystem helper library
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCHPZwACgkQ+7dXa6fL
 C2uJxw/9FVNssHxtA8iFDvZskE4YHiL6vMgOgKOeVmBfUvxqJcxWQXcF8ycbon5y
 jGcDRV1DWTv395ckALHqmD6SlH/5q+OBt4cCOXCebOlzbC63JmjJ6xOjHntZKw3i
 9c3GITNca5AsPXHXHGIcoRY4/4FntpLoVpyfYJ4ZZJCY7a7QUbgnEIIy9/Ps8Clw
 BahhiKChl2JCgV3KZBk/ypkf0IBduxKgT+IUxA9o7H5UsLzvUgnfd5uMIALLPMI1
 NXzUHBJoUtnWcB52nWPufJx9YwkMfSx70mutT0T74CFxbJakwRgAl2tWr5g989qM
 /fQrsOhMlU3NaXYaRPelbxkuzvy3hU1xSe3GLiZcxmh4Cb/YAX0TrHRecO62NWff
 pu/UWQS8Du5Gy8DrHScuo8baI1KFfyiV2lWQPfBO8kPaEB2ERw+PN6fWSh993Cn9
 4UHaR3Oyn4qyVXeirNZg+frado+BEZAbNMZwn0lyi6jnLeyir6qABOdpQk34SB35
 D4jfdPOBxeh3OVFkc+EBJ98i3/nal2+yXrNOqkP4OwmF0HqGt0YKKSaLNigXaDdO
 3CKmQlBqBZsUdRYHJyJsofrifkKjP78zx2WyUJPms8MGX9z+9kYR3f1erifLesCT
 Kb2TrAFx4ZgqS5tFh6UHnX4x0qy2RckgNrKTMpv38K8lNqplvLo=
 =tZgy
 -----END PGP SIGNATURE-----

Merge tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull network filesystem helper library updates from David Howells:
 "Here's a set of patches for 5.13 to begin the process of overhauling
  the local caching API for network filesystems. This set consists of
  two parts:

  (1) Add a helper library to handle the new VM readahead interface.

      This is intended to be used unconditionally by the filesystem
      (whether or not caching is enabled) and provides a common
      framework for doing caching, transparent huge pages and, in the
      future, possibly fscrypt and read bandwidth maximisation. It also
      allows the netfs and the cache to align, expand and slice up a
      read request from the VM in various ways; the netfs need only
      provide a function to read a stretch of data to the pagecache and
      the helper takes care of the rest.

  (2) Add an alternative fscache/cachfiles I/O API that uses the kiocb
      facility to do async DIO to transfer data to/from the netfs's
      pages, rather than using readpage with wait queue snooping on one
      side and vfs_write() on the other. It also uses less memory, since
      it doesn't do buffered I/O on the backing file.

      Note that this uses SEEK_HOLE/SEEK_DATA to locate the data
      available to be read from the cache. Whilst this is an improvement
      from the bmap interface, it still has a problem with regard to a
      modern extent-based filesystem inserting or removing bridging
      blocks of zeros. Fixing that requires a much greater overhaul.

  This is a step towards overhauling the fscache API. The change is
  opt-in on the part of the network filesystem. A netfs should not try
  to mix the old and the new API because of conflicting ways of handling
  pages and the PG_fscache page flag and because it would be mixing DIO
  with buffered I/O. Further, the helper library can't be used with the
  old API.

  This does not change any of the fscache cookie handling APIs or the
  way invalidation is done at this time.

  In the near term, I intend to deprecate and remove the old I/O API
  (fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(),
  fscache_write_page() and fscache_uncache_page()) and eventually
  replace most of fscache/cachefiles with something simpler and easier
  to follow.

  This patchset contains the following parts:

   - Some helper patches, including provision of an ITER_XARRAY iov
     iterator and a function to do readahead expansion.

   - Patches to add the netfs helper library.

   - A patch to add the fscache/cachefiles kiocb API.

   - A pair of patches to fix some review issues in the ITER_XARRAY and
     read helpers as spotted by Al and Willy.

  Jeff Layton has patches to add support in Ceph for this that he
  intends for this merge window. I have a set of patches to support AFS
  that I will post a separate pull request for.

  With this, AFS without a cache passes all expected xfstests; with a
  cache, there's an extra failure, but that's also there before these
  patches. Fixing that probably requires a greater overhaul. Ceph also
  passes the expected tests.

  I also have patches in a separate branch to tidy up the handling of
  PG_fscache/PG_private_2 and their contribution to page refcounting in
  the core kernel here, but I haven't included them in this set and will
  route them separately"

Link: https://lore.kernel.org/lkml/3779937.1619478404@warthog.procyon.org.uk/

* tag 'netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  netfs: Miscellaneous fixes
  iov_iter: Four fixes for ITER_XARRAY
  fscache, cachefiles: Add alternate API to use kiocb for read/write to cache
  netfs: Add a tracepoint to log failures that would be otherwise unseen
  netfs: Define an interface to talk to a cache
  netfs: Add write_begin helper
  netfs: Gather stats
  netfs: Add tracepoints
  netfs: Provide readahead and readpage netfs helpers
  netfs, mm: Add set/end/wait_on_page_fscache() aliases
  netfs, mm: Move PG_fscache helper funcs to linux/netfs.h
  netfs: Documentation for helper library
  netfs: Make a netfs helper module
  mm: Implement readahead_control pageset expansion
  mm/readahead: Handle ractl nr_pages being modified
  fs: Document file_ra_state
  mm/filemap: Pass the file_ra_state in the ractl
  mm: Add set/end/wait functions for PG_private_2
  iov_iter: Add ITER_XARRAY
2021-04-27 13:08:12 -07:00
Chao Yu
9557727876 f2fs: drop inplace IO if fs status is abnormal
If filesystem has cp_error or need_fsck status, let's drop inplace IO
to avoid further corruption of fs data.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-26 09:50:39 -07:00
Chao Yu
8af85f712f f2fs: compress: remove unneed check condition
In only call path of __cluster_may_compress(), __f2fs_write_data_pages()
has checked SBI_POR_DOING condition, and also cluster_may_compress()
has checked CP_ERROR_FLAG condition, so remove redundant check condition
in __cluster_may_compress() for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-26 09:50:36 -07:00
Chao Yu
2e22d48dca f2fs: clean up left deprecated IO trace codes
Commit d5f7bc0064 ("f2fs: deprecate f2fs_trace_io") left some
dead codes, delete them.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-24 17:41:50 -07:00
Matthew Wilcox (Oracle)
fcd9ae4f7f mm/filemap: Pass the file_ra_state in the ractl
For readahead_expand(), we need to modify the file ra_state, so pass it
down by adding it to the ractl.  We have to do this because it's not always
the same as f_ra in the struct file that is already being passed.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
Link: https://lore.kernel.org/r/20210407201857.3582797-2-willy@infradead.org/
Link: https://lore.kernel.org/r/161789067431.6155.8063840447229665720.stgit@warthog.procyon.org.uk/ # v6
2021-04-23 09:25:00 +01:00
Chao Yu
509f1010e4 f2fs: avoid using native allocate_segment_by_default()
As we did for other cases, in fix_curseg_write_pointer(), let's
use wrapped f2fs_allocate_new_section() instead of native
allocate_segment_by_default(), by this way, it fixes to cover
segment allocation with curseg_lock and sentry_lock.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-21 21:00:59 -07:00
Wan Jiabing
a7b4e506dc f2fs: remove unnecessary struct declaration
struct dnode_of_data is defined at 897th line.
The declaration here is unnecessary. Remove it.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-19 09:36:14 -07:00
Al Viro
80e5d1ff5d useful constants: struct qstr for ".."
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-04-15 22:36:45 -04:00
Yi Chen
594b6d0428 f2fs: fix to avoid NULL pointer dereference
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : f2fs_put_page+0x1c/0x26c
lr : __revoke_inmem_pages+0x544/0x75c
f2fs_put_page+0x1c/0x26c
__revoke_inmem_pages+0x544/0x75c
__f2fs_commit_inmem_pages+0x364/0x3c0
f2fs_commit_inmem_pages+0xc8/0x1a0
f2fs_ioc_commit_atomic_write+0xa4/0x15c
f2fs_ioctl+0x5b0/0x1574
file_ioctl+0x154/0x320
do_vfs_ioctl+0x164/0x740
__arm64_sys_ioctl+0x78/0xa4
el0_svc_common+0xbc/0x1d0
el0_svc_handler+0x74/0x98
el0_svc+0x8/0xc

In f2fs_put_page, we access page->mapping is NULL.
The root cause is:
In some cases, the page refcount and ATOMIC_WRITTEN_PAGE
flag miss set for page-priavte flag has been set.
We add f2fs_bug_on like this:

f2fs_register_inmem_page()
{
	...
	f2fs_set_page_private(page, ATOMIC_WRITTEN_PAGE);

	f2fs_bug_on(F2FS_I_SB(inode), !IS_ATOMIC_WRITTEN_PAGE(page));
	...
}

The bug on stack follow link this:
PC is at f2fs_register_inmem_page+0x238/0x2b4
LR is at f2fs_register_inmem_page+0x2a8/0x2b4
f2fs_register_inmem_page+0x238/0x2b4
f2fs_set_data_page_dirty+0x104/0x164
set_page_dirty+0x78/0xc8
f2fs_write_end+0x1b4/0x444
generic_perform_write+0x144/0x1cc
__generic_file_write_iter+0xc4/0x174
f2fs_file_write_iter+0x2c0/0x350
__vfs_write+0x104/0x134
vfs_write+0xe8/0x19c
SyS_pwrite64+0x78/0xb8

To fix this issue, let's add page refcount add page-priavte flag.
The page-private flag is not cleared and needs further analysis.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Ge Qiu <qiuge@huawei.com>
Signed-off-by: Dehe Gu <gudehe@huawei.com>
Signed-off-by: Yi Chen <chenyi77@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-13 10:46:34 -07:00
Chao Yu
453e2ff8e4 f2fs: avoid duplicated codes for cleanup
f2fs_segment_has_free_slot() was copied and modified from
__next_free_blkoff(), they are almost the same, clean up to
reuse common code as much as possible.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-13 10:43:15 -07:00
Miklos Szeredi
9b1bb01c8a f2fs: convert to fileattr
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-12 15:04:29 +02:00
Yi Zhuang
5f029c045c f2fs: clean up build warnings
This patch combined the below three clean-up patches.

- modify open brace '{' following function definitions
- ERROR: spaces required around that ':'
- ERROR: spaces required before the open parenthesis '('
- ERROR: spaces prohibited before that ','
- Made suggested modifications from checkpatch in reference to WARNING:
 Missing a blank line after declarations

Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-10 10:36:39 -07:00
Sahitya Tummala
c35b8d5e75 f2fs: fix the periodic wakeups of discard thread
Fix the unnecessary periodic wakeups of discard thread that happens under
below two conditions -

1. When f2fs is heavily utilized over 80%, the current discard policy
sets the max sleep timeout of discard thread as 50ms
(DEF_MIN_DISCARD_ISSUE_TIME). But this is set even when there are
no pending discard commands to be issued.

2. In the issue_discard_thread() path when there are no pending discard
commands, it fails to reset the wait_ms to max timeout value.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-06 07:18:31 -07:00
Chao Yu
25ae837e61 f2fs: fix to avoid accessing invalid fio in f2fs_allocate_data_block()
Callers may pass fio parameter with NULL value to f2fs_allocate_data_block(),
so we should make sure accessing fio's field after fio's validation check.

Fixes: f608c38c59 ("f2fs: clean up parameter of f2fs_allocate_data_block()")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-06 07:18:31 -07:00
Chao Yu
a303b0ac92 f2fs: fix to avoid GC/mmap race with f2fs_truncate()
It missed to hold i_gc_rwsem and i_map_sem around f2fs_truncate()
in f2fs_file_write_iter() to avoid racing with background GC and
mmap, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-06 07:16:22 -07:00
Jaegeuk Kim
b5d15199a2 f2fs: set checkpoint_merge by default
Once we introduced checkpoint_merge, we've seen some contention w/o the option.
In order to avoid it, let's set it by default.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-04-01 18:36:35 -07:00
Yi Zhuang
be1ee45d51 f2fs: Fix a hungtask problem in atomic write
In the cache writing process, if it is an atomic file, increase the page
count of F2FS_WB_CP_DATA, otherwise increase the page count of
F2FS_WB_DATA.

When you step into the hook branch due to insufficient memory in
f2fs_write_begin, f2fs_drop_inmem_pages_all will be called to traverse
all atomic inodes and clear the FI_ATOMIC_FILE mark of all atomic files.

In f2fs_drop_inmem_pages,first acquire the inmem_lock , revoke all the
inmem_pages, and then clear the FI_ATOMIC_FILE mark. Before this mark is
cleared, other threads may hold inmem_lock to add inmem_pages to the inode
that has just been emptied inmem_pages, and increase the page count of
F2FS_WB_CP_DATA.

When the IO returns, it is found that the FI_ATOMIC_FILE flag is cleared
by f2fs_drop_inmem_pages_all, and f2fs_is_atomic_file returns false,which
causes the page count of F2FS_WB_DATA to be decremented. The page count of
F2FS_WB_CP_DATA cannot be cleared. Finally, hungtask is triggered in
f2fs_wait_on_all_pages because get_pages will never return zero.

process A:				process B:
f2fs_drop_inmem_pages_all
->f2fs_drop_inmem_pages of inode#1
    ->mutex_lock(&fi->inmem_lock)
    ->__revoke_inmem_pages of inode#1	f2fs_ioc_commit_atomic_write
    ->mutex_unlock(&fi->inmem_lock)	->f2fs_commit_inmem_pages of inode#1
					->mutex_lock(&fi->inmem_lock)
					->__f2fs_commit_inmem_pages
					    ->f2fs_do_write_data_page
					        ->f2fs_outplace_write_data
					            ->do_write_page
					                ->f2fs_submit_page_write
					                    ->inc_page_count(sbi, F2FS_WB_CP_DATA )
					->mutex_unlock(&fi->inmem_lock)
    ->spin_lock(&sbi->inode_lock[ATOMIC_FILE]);
    ->clear_inode_flag(inode, FI_ATOMIC_FILE)
    ->spin_unlock(&sbi->inode_lock[ATOMIC_FILE])
					f2fs_write_end_io
					->dec_page_count(sbi, F2FS_WB_DATA );

We can fix the problem by putting the action of clearing the FI_ATOMIC_FILE
mark into the inmem_lock lock. This operation can ensure that no one will
submit the inmem pages before the FI_ATOMIC_FILE mark is cleared, so that
there will be no atomic writes waiting for writeback.

Fixes: 57864ae5ce ("f2fs: limit # of inmemory pages")
Signed-off-by: Yi Zhuang <zhuangyi1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-31 08:51:08 -07:00
Chao Yu
23738e7447 f2fs: fix to restrict mount condition on readonly block device
When we mount an unclean f2fs image in a readonly block device, let's
make mount() succeed only when there is no recoverable data in that
image, otherwise after mount(), file fsyned won't be recovered as user
expected.

Fixes: 938a184265 ("f2fs: give a warning only for readonly partition")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-30 20:58:15 -07:00
Chao Yu
5911d2d1d1 f2fs: introduce gc_merge mount option
In this patch, we will add two new mount options: "gc_merge" and
"nogc_merge", when background_gc is on, "gc_merge" option can be
set to let background GC thread to handle foreground GC requests,
it can eliminate the sluggish issue caused by slow foreground GC
operation when GC is triggered from a process with limited I/O
and CPU resources.

Original idea is from Xiang.

Signed-off-by: Gao Xiang <xiang@kernel.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-30 18:48:56 -07:00
Chao Yu
823d13e12b f2fs: fix to cover __allocate_new_section() with curseg_lock
In order to avoid race with f2fs_do_replace_block().

Fixes: f5a53edcf0 ("f2fs: support aligned pinned file")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-26 10:27:46 -07:00
Wang Xiaojun
753a8ed0ae f2fs: fix wrong alloc_type in f2fs_do_replace_block
If the alloc_type of the original curseg is LFS,
when we change_curseg and then do recover curseg,
the alloc_type becomes SSR.

Signed-off-by: Wang Xiaojun <wangxiaojun11@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-26 10:27:45 -07:00
Chao Yu
e8bf1f522a f2fs: delete empty compress.h
Commit 75e91c8889 ("f2fs: compress: fix compression chksum")
wrongly introduced empty compress.h, delete it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-26 10:27:45 -07:00
Ruiqi Gong
2c718feead f2fs: fix a typo in inode.c
Do a trivial typo fix.
s/runing/running

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Ruiqi Gong <gongruiqi1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-26 10:27:44 -07:00
Sahitya Tummala
d6d2b491a8 f2fs: allow to change discard policy based on cached discard cmds
With the default DPOLICY_BG discard thread is ioaware, which prevents
the discard thread from issuing the discard commands. On low RAM setups,
it is observed that these discard commands in the cache are consuming
high memory. This patch aims to relax the memory pressure on the system
due to f2fs pending discard cmds by changing the policy to DPOLICY_FORCE
based on the nm_i->ram_thresh configured.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-26 10:27:44 -07:00
Chao Yu
61461fc921 f2fs: fix to avoid touching checkpointed data in get_victim()
In CP disabling mode, there are two issues when using LFS or SSR | AT_SSR
mode to select victim:

1. LFS is set to find source section during GC, the victim should have
no checkpointed data, since after GC, section could not be set free for
reuse.

Previously, we only check valid chpt blocks in current segment rather
than section, fix it.

2. SSR | AT_SSR are set to find target segment for writes which can be
fully filled by checkpointed and newly written blocks, we should never
select such segment, otherwise it can cause panic or data corruption
during allocation, potential case is described as below:

 a) target segment has 'n' (n < 512) ckpt valid blocks
 b) GC migrates 'n' valid blocks to other segment (segment is still
    in dirty list)
 c) GC migrates '512 - n' blocks to target segment (segment has 'n'
    cp_vblocks and '512 - n' vblocks)
 d) If GC selects target segment via {AT,}SSR allocator, however there
    is no free space in targe segment.

Fixes: 4354994f09 ("f2fs: checkpoint disabling")
Fixes: 093749e296 ("f2fs: support age threshold based garbage collection")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-26 10:27:01 -07:00
Chao Yu
88f2cfc5fa f2fs: fix to update last i_size if fallocate partially succeeds
In the case of expanding pinned file, map.m_lblk and map.m_len
will update in each round of section allocation, so in error
path, last i_size will be calculated with wrong m_lblk and m_len,
fix it.

Fixes: f5a53edcf0 ("f2fs: support aligned pinned file")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-25 18:20:51 -07:00
Chao Yu
3fd9735908 f2fs: fix error path of f2fs_remount()
In error path of f2fs_remount(), it missed to restart/stop kernel thread
or enable/disable checkpoint, then mount option status may not be
consistent with real condition of filesystem, so let's reorder remount
flow a bit as below and do recovery correctly in error path:

1) handle gc thread
2) handle ckpt thread
3) handle flush thread
4) handle checkpoint disabling

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-25 18:20:51 -07:00
qiulaibin
f3e367d4fe f2fs: fix wrong comment of nat_tree_lock
Do trivial comment fix of nat_tree_lock.

Signed-off-by: qiulaibin <qiulaibin@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-25 18:20:51 -07:00
Chao Yu
b862676e37 f2fs: fix to avoid out-of-bounds memory access
butt3rflyh4ck <butterflyhuangxx@gmail.com> reported a bug found by
syzkaller fuzzer with custom modifications in 5.12.0-rc3+ [1]:

 dump_stack+0xfa/0x151 lib/dump_stack.c:120
 print_address_description.constprop.0.cold+0x82/0x32c mm/kasan/report.c:232
 __kasan_report mm/kasan/report.c:399 [inline]
 kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
 f2fs_test_bit fs/f2fs/f2fs.h:2572 [inline]
 current_nat_addr fs/f2fs/node.h:213 [inline]
 get_next_nat_page fs/f2fs/node.c:123 [inline]
 __flush_nat_entry_set fs/f2fs/node.c:2888 [inline]
 f2fs_flush_nat_entries+0x258e/0x2960 fs/f2fs/node.c:2991
 f2fs_write_checkpoint+0x1372/0x6a70 fs/f2fs/checkpoint.c:1640
 f2fs_issue_checkpoint+0x149/0x410 fs/f2fs/checkpoint.c:1807
 f2fs_sync_fs+0x20f/0x420 fs/f2fs/super.c:1454
 __sync_filesystem fs/sync.c:39 [inline]
 sync_filesystem fs/sync.c:67 [inline]
 sync_filesystem+0x1b5/0x260 fs/sync.c:48
 generic_shutdown_super+0x70/0x370 fs/super.c:448
 kill_block_super+0x97/0xf0 fs/super.c:1394

The root cause is, if nat entry in checkpoint journal area is corrupted,
e.g. nid of journalled nat entry exceeds max nid value, during checkpoint,
once it tries to flush nat journal to NAT area, get_next_nat_page() may
access out-of-bounds memory on nat_bitmap due to it uses wrong nid value
as bitmap offset.

[1] https://lore.kernel.org/lkml/CAFcO6XOMWdr8pObek6eN6-fs58KG9doRFadgJj-FnF-1x43s2g@mail.gmail.com/T/#u

Reported-and-tested-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-25 18:20:51 -07:00
Chao Yu
3f7070b050 f2fs: don't start checkpoint thread in readonly mountpoint
In readonly mountpoint, there should be no write IOs include checkpoint
IO, so that it's not needed to create kernel checkpoint thread.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-25 18:20:51 -07:00
Weichao Guo
ac2d750b20 f2fs: do not use AT_SSR mode in FG_GC & high urgent BG_GC
AT_SSR mode is introduced by age threshold based GC for better
hot/cold data seperation and avoiding free segment cost. However,
LFS write mode is preferred in the scenario of foreground or high
urgent GC, which should be finished ASAP. Let's only use AT_SSR
in background GC and not high urgent GC modes.

Signed-off-by: Weichao Guo <guoweichao@oppo.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-25 18:20:50 -07:00
Daeho Jeong
5ac443e26a f2fs: add sysfs nodes to get runtime compression stat
I've added new sysfs nodes to show runtime compression stat since mount.
compr_written_block - show the block count written after compression
compr_saved_block - show the saved block count with compression
compr_new_inode - show the count of inode newly enabled for compression

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-25 18:20:50 -07:00
Chengguang Xu
0bb2045ce5 f2fs: fix to use per-inode maxbytes in f2fs_fiemap
F2FS inode may have different max size,
so change to use per-inode maxbytes.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-25 18:20:50 -07:00
Chao Yu
e1175f0229 f2fs: fix to align to section for fallocate() on pinned file
Now, fallocate() on a pinned file only allocates blocks which aligns
to segment rather than section, so GC may try to migrate pinned file's
block, and after several times of failure, pinned file's block could
be migrated to other place, however user won't be aware of such
condition, and then old obsolete block address may be readed/written
incorrectly.

To avoid such condition, let's try to allocate pinned file's blocks
with section alignment.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-23 10:41:12 -07:00
Jaegeuk Kim
0823427989 f2fs: expose # of overprivision segments
This is useful when checking conditions during checkpoint=disable in Android.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:44 -08:00
Eric Biggers
3c0315424f f2fs: fix error handling in f2fs_end_enable_verity()
f2fs didn't properly clean up if verity failed to be enabled on a file:

- It left verity metadata (pages past EOF) in the page cache, which
  would be exposed to userspace if the file was later extended.

- It didn't truncate the verity metadata at all (either from cache or
  from disk) if an error occurred while setting the verity bit.

Fix these bugs by adding a call to truncate_inode_pages() and ensuring
that we truncate the verity metadata (both from cache and from disk) in
all error paths.  Also rework the code to cleanly separate the success
path from the error paths, which makes it much easier to understand.

Finally, log a message if f2fs_truncate() fails, since it might
otherwise fail silently.

Reported-by: Yunlei He <heyunlei@hihonor.com>
Fixes: 95ae251fe8 ("f2fs: add fs-verity support")
Cc: <stable@vger.kernel.org> # v5.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:44 -08:00
Colin Ian King
28e18ee636 f2fs: fix a redundant call to f2fs_balance_fs if an error occurs
The  uninitialized variable dn.node_changed does not get set when a
call to f2fs_get_node_page fails.  This uninitialized value gets used
in the call to f2fs_balance_fs() that may or not may not balances
dirty node and dentry pages depending on the uninitialized state of
the variable. Fix this by only calling f2fs_balance_fs if err is
not set.

Thanks to Jaegeuk Kim for suggesting an appropriate fix.

Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: 2a34076070 ("f2fs: call f2fs_balance_fs only when node was changed")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:44 -08:00
Chao Yu
1153db095f f2fs: remove unused file_clear_encrypt()
- file_clear_encrypt() was never be used, remove it.
- In addition, relocating macros for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:43 -08:00
huangjianan@oppo.com
36e4d95891 f2fs: check if swapfile is section-alligned
If the swapfile isn't created by pin and fallocate, it can't be
guaranteed section-aligned, so it may be selected by f2fs gc. When
gc_pin_file_threshold is reached, the address of swapfile may change,
but won't be synchronized to swap_extent, so swap will write to wrong
address, which will cause data corruption.

Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:43 -08:00
huangjianan@oppo.com
1da6610383 f2fs: fix last_lblock check in check_swap_activate_fast
Because page_no < sis->max guarantees that the while loop break out
normally, the wrong check contidion here doesn't cause a problem.

Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:43 -08:00
huangjianan@oppo.com
ebc29b62a1 f2fs: remove unnecessary IS_SWAPFILE check
Now swapfile in f2fs directly submit IO to blockdev according to
swapfile extents reported by f2fs when swapon, therefore there is
no need to check IS_SWAPFILE when exec filesystem operation.

Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Signed-off-by: Guo Weichao <guoweichao@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:43 -08:00
Gustavo A. R. Silva
4260c4067f f2fs: Replace one-element array with flexible-array member
There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].

Refactor the code according to the use of a flexible-array member in
struct f2fs_checkpoint, instead of a one-element arrays.

Notice that a temporary pointer to void '*tmp_ptr' was used in order to
fix the following errors when using a flexible array instead of a one
element array in struct f2fs_checkpoint:

  CC [M]  fs/f2fs/dir.o
In file included from fs/f2fs/dir.c:13:
fs/f2fs/f2fs.h: In function ‘__bitmap_ptr’:
fs/f2fs/f2fs.h:2227:40: error: invalid use of flexible array member
 2227 |   return &ckpt->sit_nat_version_bitmap + offset + sizeof(__le32);
      |                                        ^
fs/f2fs/f2fs.h:2227:49: error: invalid use of flexible array member
 2227 |   return &ckpt->sit_nat_version_bitmap + offset + sizeof(__le32);
      |                                                 ^
fs/f2fs/f2fs.h:2238:40: error: invalid use of flexible array member
 2238 |   return &ckpt->sit_nat_version_bitmap + offset;
      |                                        ^
make[2]: *** [scripts/Makefile.build:287: fs/f2fs/dir.o] Error 1
make[1]: *** [scripts/Makefile.build:530: fs/f2fs] Error 2
make: *** [Makefile:1819: fs] Error 2

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/79
Build-tested-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/603647e4.DeEFbl4eqljuwAUe%25lkp@intel.com/
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:42 -08:00
Geert Uytterhoeven
72f8588124 f2fs: compress: Allow modular (de)compression algorithms
If F2FS_FS is modular, enabling the compressions options
F2FS_FS_{LZ4,LZ4HZ,LZO,LZORLE,ZSTD} will make the (de)compression
algorithms {LZ4,LZ4HC,LZO,ZSTD}_{,DE}COMPRESS builtin instead of
modular, as the former depend on an intermediate boolean
F2FS_FS_COMPRESSION, which in-turn depends on tristate F2FS_FS.

Indeed, if a boolean symbol A depends directly on a tristate symbol B
and selects another tristate symbol C:

    tristate B

    tristate C

    bool A
      depends on B
      select C

and B is modular, then C will also be modular.

However, if there is an intermediate boolean D in the dependency chain
between A and B:

    tristate B

    tristate C

    bool D
      depends on B

    bool A
      depends on D
      select C

then the modular state won't propagate from B to C, and C will be
builtin instead of modular.

As modular dependency propagation through intermediate symbols is
obscure, fix this in a robust way by moving the selection of tristate
(de)compression algorithms from the boolean compression options to the
tristate main F2FS_FS option.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:42 -08:00
Chao Yu
43f8c47ea7 f2fs: check discard command number before traversing discard pending list
In trim thread, let's add a condition to check discard command number
before traversing discard pending list, it can avoid unneeded traversing
if there is no discard command.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:42 -08:00
Chao Yu
3b42c741b1 f2fs: update comments for explicit memory barrier
Add more detailed comments for explicit memory barrier used by
f2fs, in order to enhance code readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:42 -08:00
Chao Yu
4831675c6b f2fs: remove unused FORCE_FG_GC macro
FORCE_FG_GC was introduced by commit 6aefd93b01 ("f2fs: introduce
background_gc=sync mount option"), but never be used, remove it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:42 -08:00
Chao Yu
cd6ee739b8 f2fs: avoid unused f2fs_show_compress_options()
LKP reports:

   fs/f2fs/super.c:1516:20: warning: unused function 'f2fs_show_compress_options' [-Wunused-function]
   static inline void f2fs_show_compress_options(struct seq_file *seq,

Fix this issue by covering f2fs_show_compress_options() with
CONFIG_F2FS_FS_COMPRESSION macro.

Fixes: 4c8ff7095b ("f2fs: support data compression")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:41 -08:00
Chao Yu
3ab0598e6d f2fs: fix panic during f2fs_resize_fs()
f2fs_resize_fs() hangs in below callstack with testcase:
- mkfs 16GB image & mount image
- dd 8GB fileA
- dd 8GB fileB
- sync
- rm fileA
- sync
- resize filesystem to 8GB

kernel BUG at segment.c:2484!
Call Trace:
 allocate_segment_by_default+0x92/0xf0 [f2fs]
 f2fs_allocate_data_block+0x44b/0x7e0 [f2fs]
 do_write_page+0x5a/0x110 [f2fs]
 f2fs_outplace_write_data+0x55/0x100 [f2fs]
 f2fs_do_write_data_page+0x392/0x850 [f2fs]
 move_data_page+0x233/0x320 [f2fs]
 do_garbage_collect+0x14d9/0x1660 [f2fs]
 free_segment_range+0x1f7/0x310 [f2fs]
 f2fs_resize_fs+0x118/0x330 [f2fs]
 __f2fs_ioctl+0x487/0x3680 [f2fs]
 __x64_sys_ioctl+0x8e/0xd0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The root cause is we forgot to check that whether we have enough space
in resized filesystem to store all valid blocks in before-resizing
filesystem, then allocator will run out-of-space during block migration
in free_segment_range().

Fixes: b4b10061ef ("f2fs: refactor resize_fs to avoid meta updates in progress")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:41 -08:00
Chao Yu
7dede88659 f2fs: fix to allow migrating fully valid segment
F2FS_IOC_FLUSH_DEVICE/F2FS_IOC_RESIZE_FS needs to migrate all blocks of
target segment to other place, no matter the segment has partially or fully
valid blocks.

However, after commit 803e74be04 ("f2fs: stop GC when the victim becomes
fully valid"), we may skip migration due to target segment is fully valid,
result in failing the ioctl interface, fix this.

Fixes: 803e74be04 ("f2fs: stop GC when the victim becomes fully valid")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:41 -08:00
jiahao
2a4bd0c37c f2fs: fix a spacing coding style
Add a space before the plus.

Signed-off-by: jiahao <jiahao@xiaomi.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-03-12 13:16:41 -08:00
Christoph Hellwig
a8affc03a9 block: rename BIO_MAX_PAGES to BIO_MAX_VECS
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been
horribly confusingly misnamed.  Rename it to BIO_MAX_VECS to stop
confusing users of the bio API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-11 07:47:48 -07:00
Linus Torvalds
3ab6608e66 block-5.12-2021-02-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmA6njIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprolD/9zWti9LsZvA7yE+PhVwrwF3CsNzLfQlClw
 99HaA7HxtAc/VLJrnD/SubhCAPdBC5B2xPv6faajdwF2iUR3Rr1Uc93CQ3uP2KKq
 kvm6ALTpzPTMI6YSABhY74sg9BkkoDbMo54JQYVQPleiE+5eDLbuFZck6ObfUHyY
 a4aaImlndWp/t14GzrClL4hucF+5KJy846P+QCVclkh0yl8xSsqZ5LIFU7tu3iQb
 HpZ5HKLT/2ma/EOr3wknnsIe97AUZQU0q5aMparhYlm+qR511eop3QXx850FL/oC
 tEGceKLij6qazmkiocKVzML8Fs+Y9/a4vCMjLCScWJmzDlmKdlH2uudeahN6b9Hm
 15qRQHOjl1Hc2bdr5ZVn87nq9RWhSm18C+SRMwOKHCOnEhwxqM3RjRfAgj4BJ6QB
 PFbFqdY+8Y1YLPFmn9hph72ePaEcN4L2IXW6TI/WX8mot8ODAnkq9Hr38dKwzO+i
 0mon6DVyJKKho6XwvVu5IYurkR2beQprjeVUxwZjjT6DxUgsc+J6itK5LDHFSkeZ
 qZlXn5Di8MkiXg0DFJYDQiFXnO0Z5GlRWOGPVfBaOr3x+1dqzDdHGw4oz1oGqvnr
 GNNYCsYIpDGm7eauX5lqL5MUFpjqRCceXy5JSHPhnWWw617nYkr4H9jdsV9HiTX1
 tQFx05QW3w==
 =ccMs
 -----END PGP SIGNATURE-----

Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "A few stragglers (and one due to me missing it originally), and fixes
  for changes in this merge window mostly. In particular:

   - blktrace cleanups (Chaitanya, Greg)

   - Kill dead blk_pm_* functions (Bart)

   - Fixes for the bio alloc changes (Christoph)

   - Fix for the partition changes (Christoph, Ming)

   - Fix for turning off iopoll with polled IO inflight (Jeffle)

   - nbd disconnect fix (Josef)

   - loop fsync error fix (Mauricio)

   - kyber update depth fix (Yang)

   - max_sectors alignment fix (Mikulas)

   - Add bio_max_segs helper (Matthew)"

* tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block: (21 commits)
  block: Add bio_max_segs
  blktrace: fix documentation for blk_fill_rw()
  block: memory allocations in bounce_clone_bio must not fail
  block: remove the gfp_mask argument to bounce_clone_bio
  block: fix bounce_clone_bio for passthrough bios
  block-crypto-fallback: use a bio_set for splitting bios
  block: fix logging on capacity change
  blk-settings: align max_sectors on "logical_block_size" boundary
  block: reopen the device in blkdev_reread_part
  block: don't skip empty device in in disk_uevent
  blktrace: remove debugfs file dentries from struct blk_trace
  nbd: handle device refs for DESTROY_ON_DISCONNECT properly
  kyber: introduce kyber_depth_updated()
  loop: fix I/O error on fsync() in detached loop devices
  block: fix potential IO hang when turning off io_poll
  block: get rid of the trace rq insert wrapper
  blktrace: fix blk_rq_merge documentation
  blktrace: fix blk_rq_issue documentation
  blktrace: add blk_fill_rwbs documentation comment
  block: remove superfluous param in blk_fill_rwbs()
  ...
2021-02-28 11:23:38 -08:00
Matthew Wilcox (Oracle)
5f7136db82 block: Add bio_max_segs
It's often inconvenient to use BIO_MAX_PAGES due to min() requiring the
sign to be the same.  Introduce bio_max_segs() and change BIO_MAX_PAGES to
be unsigned to make it easier for the users.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-26 15:49:51 -07:00
Linus Torvalds
7d6beb71da idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
 ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
 4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
 =yPaw
 -----END PGP SIGNATURE-----

Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull idmapped mounts from Christian Brauner:
 "This introduces idmapped mounts which has been in the making for some
  time. Simply put, different mounts can expose the same file or
  directory with different ownership. This initial implementation comes
  with ports for fat, ext4 and with Christoph's port for xfs with more
  filesystems being actively worked on by independent people and
  maintainers.

  Idmapping mounts handle a wide range of long standing use-cases. Here
  are just a few:

   - Idmapped mounts make it possible to easily share files between
     multiple users or multiple machines especially in complex
     scenarios. For example, idmapped mounts will be used in the
     implementation of portable home directories in
     systemd-homed.service(8) where they allow users to move their home
     directory to an external storage device and use it on multiple
     computers where they are assigned different uids and gids. This
     effectively makes it possible to assign random uids and gids at
     login time.

   - It is possible to share files from the host with unprivileged
     containers without having to change ownership permanently through
     chown(2).

   - It is possible to idmap a container's rootfs and without having to
     mangle every file. For example, Chromebooks use it to share the
     user's Download folder with their unprivileged containers in their
     Linux subsystem.

   - It is possible to share files between containers with
     non-overlapping idmappings.

   - Filesystem that lack a proper concept of ownership such as fat can
     use idmapped mounts to implement discretionary access (DAC)
     permission checking.

   - They allow users to efficiently changing ownership on a per-mount
     basis without having to (recursively) chown(2) all files. In
     contrast to chown (2) changing ownership of large sets of files is
     instantenous with idmapped mounts. This is especially useful when
     ownership of a whole root filesystem of a virtual machine or
     container is changed. With idmapped mounts a single syscall
     mount_setattr syscall will be sufficient to change the ownership of
     all files.

   - Idmapped mounts always take the current ownership into account as
     idmappings specify what a given uid or gid is supposed to be mapped
     to. This contrasts with the chown(2) syscall which cannot by itself
     take the current ownership of the files it changes into account. It
     simply changes the ownership to the specified uid and gid. This is
     especially problematic when recursively chown(2)ing a large set of
     files which is commong with the aforementioned portable home
     directory and container and vm scenario.

   - Idmapped mounts allow to change ownership locally, restricting it
     to specific mounts, and temporarily as the ownership changes only
     apply as long as the mount exists.

  Several userspace projects have either already put up patches and
  pull-requests for this feature or will do so should you decide to pull
  this:

   - systemd: In a wide variety of scenarios but especially right away
     in their implementation of portable home directories.

         https://systemd.io/HOME_DIRECTORY/

   - container runtimes: containerd, runC, LXD:To share data between
     host and unprivileged containers, unprivileged and privileged
     containers, etc. The pull request for idmapped mounts support in
     containerd, the default Kubernetes runtime is already up for quite
     a while now: https://github.com/containerd/containerd/pull/4734

   - The virtio-fs developers and several users have expressed interest
     in using this feature with virtual machines once virtio-fs is
     ported.

   - ChromeOS: Sharing host-directories with unprivileged containers.

  I've tightly synced with all those projects and all of those listed
  here have also expressed their need/desire for this feature on the
  mailing list. For more info on how people use this there's a bunch of
  talks about this too. Here's just two recent ones:

      https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
      https://fosdem.org/2021/schedule/event/containers_idmap/

  This comes with an extensive xfstests suite covering both ext4 and
  xfs:

      https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

  It covers truncation, creation, opening, xattrs, vfscaps, setid
  execution, setgid inheritance and more both with idmapped and
  non-idmapped mounts. It already helped to discover an unrelated xfs
  setgid inheritance bug which has since been fixed in mainline. It will
  be sent for inclusion with the xfstests project should you decide to
  merge this.

  In order to support per-mount idmappings vfsmounts are marked with
  user namespaces. The idmapping of the user namespace will be used to
  map the ids of vfs objects when they are accessed through that mount.
  By default all vfsmounts are marked with the initial user namespace.
  The initial user namespace is used to indicate that a mount is not
  idmapped. All operations behave as before and this is verified in the
  testsuite.

  Based on prior discussions we want to attach the whole user namespace
  and not just a dedicated idmapping struct. This allows us to reuse all
  the helpers that already exist for dealing with idmappings instead of
  introducing a whole new range of helpers. In addition, if we decide in
  the future that we are confident enough to enable unprivileged users
  to setup idmapped mounts the permission checking can take into account
  whether the caller is privileged in the user namespace the mount is
  currently marked with.

  The user namespace the mount will be marked with can be specified by
  passing a file descriptor refering to the user namespace as an
  argument to the new mount_setattr() syscall together with the new
  MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
  of extensibility.

  The following conditions must be met in order to create an idmapped
  mount:

   - The caller must currently have the CAP_SYS_ADMIN capability in the
     user namespace the underlying filesystem has been mounted in.

   - The underlying filesystem must support idmapped mounts.

   - The mount must not already be idmapped. This also implies that the
     idmapping of a mount cannot be altered once it has been idmapped.

   - The mount must be a detached/anonymous mount, i.e. it must have
     been created by calling open_tree() with the OPEN_TREE_CLONE flag
     and it must not already have been visible in the filesystem.

  The last two points guarantee easier semantics for userspace and the
  kernel and make the implementation significantly simpler.

  By default vfsmounts are marked with the initial user namespace and no
  behavioral or performance changes are observed.

  The manpage with a detailed description can be found here:

      1d7b902e28

  In order to support idmapped mounts, filesystems need to be changed
  and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
  patches to convert individual filesystem are not very large or
  complicated overall as can be seen from the included fat, ext4, and
  xfs ports. Patches for other filesystems are actively worked on and
  will be sent out separately. The xfstestsuite can be used to verify
  that port has been done correctly.

  The mount_setattr() syscall is motivated independent of the idmapped
  mounts patches and it's been around since July 2019. One of the most
  valuable features of the new mount api is the ability to perform
  mounts based on file descriptors only.

  Together with the lookup restrictions available in the openat2()
  RESOLVE_* flag namespace which we added in v5.6 this is the first time
  we are close to hardened and race-free (e.g. symlinks) mounting and
  path resolution.

  While userspace has started porting to the new mount api to mount
  proper filesystems and create new bind-mounts it is currently not
  possible to change mount options of an already existing bind mount in
  the new mount api since the mount_setattr() syscall is missing.

  With the addition of the mount_setattr() syscall we remove this last
  restriction and userspace can now fully port to the new mount api,
  covering every use-case the old mount api could. We also add the
  crucial ability to recursively change mount options for a whole mount
  tree, both removing and adding mount options at the same time. This
  syscall has been requested multiple times by various people and
  projects.

  There is a simple tool available at

      https://github.com/brauner/mount-idmapped

  that allows to create idmapped mounts so people can play with this
  patch series. I'll add support for the regular mount binary should you
  decide to pull this in the following weeks:

  Here's an example to a simple idmapped mount of another user's home
  directory:

	u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

	u1001@f2-vm:/$ ls -al /home/ubuntu/
	total 28
	drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
	drwxr-xr-x 4 root   root   4096 Oct 28 04:00 ..
	-rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
	-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
	-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
	-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
	-rw-r--r-- 1 ubuntu ubuntu    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ ls -al /mnt/
	total 28
	drwxr-xr-x  2 u1001 u1001 4096 Oct 28 22:07 .
	drwxr-xr-x 29 root  root  4096 Oct 28 22:01 ..
	-rw-------  1 u1001 u1001 3154 Oct 28 22:12 .bash_history
	-rw-r--r--  1 u1001 u1001  220 Feb 25  2020 .bash_logout
	-rw-r--r--  1 u1001 u1001 3771 Feb 25  2020 .bashrc
	-rw-r--r--  1 u1001 u1001  807 Feb 25  2020 .profile
	-rw-r--r--  1 u1001 u1001    0 Oct 16 16:11 .sudo_as_admin_successful
	-rw-------  1 u1001 u1001 1144 Oct 28 00:43 .viminfo

	u1001@f2-vm:/$ touch /mnt/my-file

	u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

	u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

	u1001@f2-vm:/$ ls -al /mnt/my-file
	-rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

	u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
	-rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

	u1001@f2-vm:/$ getfacl /mnt/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: mnt/my-file
	# owner: u1001
	# group: u1001
	user::rw-
	user:u1001:rwx
	group::rw-
	mask::rwx
	other::r--

	u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
	getfacl: Removing leading '/' from absolute path names
	# file: home/ubuntu/my-file
	# owner: ubuntu
	# group: ubuntu
	user::rw-
	user:ubuntu:rwx
	group::rw-
	mask::rwx
	other::r--"

* tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
  xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
  xfs: support idmapped mounts
  ext4: support idmapped mounts
  fat: handle idmapped mounts
  tests: add mount_setattr() selftests
  fs: introduce MOUNT_ATTR_IDMAP
  fs: add mount_setattr()
  fs: add attr_flags_to_mnt_flags helper
  fs: split out functions to hold writers
  namespace: only take read lock in do_reconfigure_mnt()
  mount: make {lock,unlock}_mount_hash() static
  namespace: take lock_mount_hash() directly when changing flags
  nfs: do not export idmapped mounts
  overlayfs: do not mount on top of idmapped mounts
  ecryptfs: do not mount on top of idmapped mounts
  ima: handle idmapped mounts
  apparmor: handle idmapped mounts
  fs: make helpers idmap mount aware
  exec: handle idmapped mounts
  would_dump: handle idmapped mounts
  ...
2021-02-23 13:39:45 -08:00
Linus Torvalds
d61c6a58ae \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmAzoWUACgkQnJ2qBz9k
 QNnFgQgAlng0JOzeCQvLpwweqFl0FCxYbOsZXC1xDyvfX3TiA6A6oiOR4tx3uhQN
 cOQmJXaiMn4oCXjD1j6WZwGfy23yx0XchaoFK9jy2IqodaB/zUjkiWYYqt0G3XIX
 ud35mxjLAGS12BCD0c+vHy2RMsUFl5ep+5aBHRHZJJhCcYbl7e5ctXZ3xB1Q0mgI
 r639gD8JhH3ICdu9W0NaMvqOrVhJFNmhSGATKL/N96+oKub2x2ycYE4L2OXegxy3
 mnFf26LjA8jt7K+KfHloTvkC6D4HVnnvKFvKiIbGKafiWhAE7q57ZO6BPCMajGue
 3UHIhWGmwKXRU72+nW6N+089GbcO/g==
 =1e+z
 -----END PGP SIGNATURE-----

Merge tag 'lazytime_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull lazytime updates from Jan Kara:
 "Cleanups of the lazytime handling in the writeback code making rules
  for calling ->dirty_inode() filesystem handlers saner"

* tag 'lazytime_for_v5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext4: simplify i_state checks in __ext4_update_other_inode_time()
  gfs2: don't worry about I_DIRTY_TIME in gfs2_fsync()
  fs: improve comments for writeback_single_inode()
  fs: drop redundant check from __writeback_single_inode()
  fs: clean up __mark_inode_dirty() a bit
  fs: pass only I_DIRTY_INODE flags to ->dirty_inode
  fs: don't call ->dirty_inode for lazytime timestamp updates
  fat: only specify I_DIRTY_TIME when needed in fat_update_time()
  fs: only specify I_DIRTY_TIME when needed in generic_update_time()
  fs: correctly document the inode dirty flags
2021-02-22 13:17:39 -08:00
Linus Torvalds
582cd91f69 for-5.12/block-2021-02-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
 EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
 RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
 Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
 dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
 ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
 rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
 ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
 Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
 wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
 VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
 WC22RGC2MA==
 =os1E
 -----END PGP SIGNATURE-----

Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Another nice round of removing more code than what is added, mostly
  due to Christoph's relentless pursuit of tech debt removal/cleanups.
  This pull request contains:

   - Two series of BFQ improvements (Paolo, Jan, Jia)

   - Block iov_iter improvements (Pavel)

   - bsg error path fix (Pan)

   - blk-mq scheduler improvements (Jan)

   - -EBUSY discard fix (Jan)

   - bvec allocation improvements (Ming, Christoph)

   - bio allocation and init improvements (Christoph)

   - Store bdev pointer in bio instead of gendisk + partno (Christoph)

   - Block trace point cleanups (Christoph)

   - hard read-only vs read-only split (Christoph)

   - Block based swap cleanups (Christoph)

   - Zoned write granularity support (Damien)

   - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"

* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
  mm: simplify swapdev_block
  sd_zbc: clear zone resources for non-zoned case
  block: introduce blk_queue_clear_zone_settings()
  zonefs: use zone write granularity as block size
  block: introduce zone_write_granularity limit
  block: use blk_queue_set_zoned in add_partition()
  nullb: use blk_queue_set_zoned() to setup zoned devices
  nvme: cleanup zone information initialization
  block: document zone_append_max_bytes attribute
  block: use bi_max_vecs to find the bvec pool
  md/raid10: remove dead code in reshape_request
  block: mark the bio as cloned in bio_iov_bvec_set
  block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
  block: remove a layer of indentation in bio_iov_iter_get_pages
  block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
  block: remove the 1 and 4 vec bvec_slabs entries
  block: streamline bvec_alloc
  block: factor out a bvec_alloc_gfp helper
  block: move struct biovec_slab to bio.c
  block: reuse BIO_INLINE_VECS for integrity bvecs
  ...
2021-02-21 11:02:48 -08:00
Linus Torvalds
f7b36dc5cb fsverity updates for 5.12
Add an ioctl which allows reading fs-verity metadata from a file.
 
 This is useful when a file with fs-verity enabled needs to be served
 somewhere, and the other end wants to do its own fs-verity compatible
 verification of the file.  See the commit messages for details.
 
 This new ioctl has been tested using new xfstests I've written for it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYCv/2hQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK6/7AQDRmmnV+G34yGPCWfu8tyjdYvWPyak2
 IA/I+eM6S/F+4QEAkbX6rOwYVhLHN9KSOYyNhJiBchm6xq83J+R8BYh/Kw0=
 =FPNK
 -----END PGP SIGNATURE-----

Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fsverity updates from Eric Biggers:
 "Add an ioctl which allows reading fs-verity metadata from a file.

  This is useful when a file with fs-verity enabled needs to be served
  somewhere, and the other end wants to do its own fs-verity compatible
  verification of the file. See the commit messages for details.

  This new ioctl has been tested using new xfstests I've written for it"

* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fs-verity: support reading signature with ioctl
  fs-verity: support reading descriptor with ioctl
  fs-verity: support reading Merkle tree with ioctl
  fs-verity: add FS_IOC_READ_VERITY_METADATA ioctl
  fs-verity: don't pass whole descriptor to fsverity_verify_signature()
  fs-verity: factor out fsverity_get_descriptor()
2021-02-21 10:25:24 -08:00
Jaegeuk Kim
938a184265 f2fs: give a warning only for readonly partition
Let's allow mounting readonly partition. We're able to recovery later once we
have it as read-write back.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-12 14:09:54 -08:00
Jaegeuk Kim
d50dfc0c7d f2fs: don't grab superblock freeze for flush/ckpt thread
There are controlled by f2fs_freeze().

This fixes xfstests/generic/068 which is stuck at

 task:f2fs_ckpt-252:3 state:D stack:    0 pid: 5761 ppid:     2 flags:0x00004000
 Call Trace:
  __schedule+0x44c/0x8a0
  schedule+0x4f/0xc0
  percpu_rwsem_wait+0xd8/0x140
  ? percpu_down_write+0xf0/0xf0
  __percpu_down_read+0x56/0x70
  issue_checkpoint_thread+0x12c/0x160 [f2fs]
  ? wait_woken+0x80/0x80
  kthread+0x114/0x150
  ? __checkpoint_and_complete_reqs+0x110/0x110 [f2fs]
  ? kthread_park+0x90/0x90
  ret_from_fork+0x22/0x30

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-08 13:42:21 -08:00
Eric Biggers
e17fe6579d fs-verity: add FS_IOC_READ_VERITY_METADATA ioctl
Add an ioctl FS_IOC_READ_VERITY_METADATA which will allow reading verity
metadata from a file that has fs-verity enabled, including:

- The Merkle tree
- The fsverity_descriptor (not including the signature if present)
- The built-in signature, if present

This ioctl has similar semantics to pread().  It is passed the type of
metadata to read (one of the above three), and a buffer, offset, and
size.  It returns the number of bytes read or an error.

Separate patches will add support for each of the above metadata types.
This patch just adds the ioctl itself.

This ioctl doesn't make any assumption about where the metadata is
stored on-disk.  It does assume the metadata is in a stable format, but
that's basically already the case:

- The Merkle tree and fsverity_descriptor are defined by how fs-verity
  file digests are computed; see the "File digest computation" section
  of Documentation/filesystems/fsverity.rst.  Technically, the way in
  which the levels of the tree are ordered relative to each other wasn't
  previously specified, but it's logical to put the root level first.

- The built-in signature is the value passed to FS_IOC_ENABLE_VERITY.

This ioctl is useful because it allows writing a server program that
takes a verity file and serves it to a client program, such that the
client can do its own fs-verity compatible verification of the file.
This only makes sense if the client doesn't trust the server and if the
server needs to provide the storage for the client.

More concretely, there is interest in using this ability in Android to
export APK files (which are protected by fs-verity) to "protected VMs".
This would use Protected KVM (https://lwn.net/Articles/836693), which
provides an isolated execution environment without having to trust the
traditional "host".  A "guest" VM can boot from a signed image and
perform specific tasks in a minimum trusted environment using files that
have fs-verity enabled on the host, without trusting the host or
requiring that the guest has its own trusted storage.

Technically, it would be possible to duplicate the metadata and store it
in separate files for serving.  However, that would be less efficient
and would require extra care in userspace to maintain file consistency.

In addition to the above, the ability to read the built-in signatures is
useful because it allows a system that is using the in-kernel signature
verification to migrate to userspace signature verification.

Link: https://lore.kernel.org/r/20210115181819.34732-4-ebiggers@kernel.org
Reviewed-by: Victor Hsieh <victorhsieh@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-02-07 14:51:11 -08:00
Daeho Jeong
e659206617 f2fs: add ckpt_thread_ioprio sysfs node
Added "ckpt_thread_ioprio" sysfs node to give a way to change checkpoint
merge daemon's io priority. Its default value is "be,3", which means
"BE" I/O class and I/O priority "3". We can select the class between "rt"
and "be", and set the I/O priority within valid range of it.
"," delimiter is necessary in between I/O class and priority number.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-03 13:03:58 -08:00
Daeho Jeong
261eeb9c15 f2fs: introduce checkpoint_merge mount option
We've added a new mount options, "checkpoint_merge" and "nocheckpoint_merge",
which creates a kernel daemon and makes it to merge concurrent checkpoint
requests as much as possible to eliminate redundant checkpoint issues. Plus,
we can eliminate the sluggish issue caused by slow checkpoint operation
when the checkpoint is done in a process context in a cgroup having
low i/o budget and cpu shares. To make this do better, we set the
default i/o priority of the kernel daemon to "3", to give one higher
priority than other kernel threads. The below verification result
explains this.
The basic idea has come from https://opensource.samsung.com.

[Verification]
Android Pixel Device(ARM64, 7GB RAM, 256GB UFS)
Create two I/O cgroups (fg w/ weight 100, bg w/ wight 20)
Set "strict_guarantees" to "1" in BFQ tunables

In "fg" cgroup,
- thread A => trigger 1000 checkpoint operations
  "for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1;
   done"
- thread B => gererating async. I/O
  "fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1
       --filename=test_img --name=test"

In "bg" cgroup,
- thread C => trigger repeated checkpoint operations
  "echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file;
   fsync test_dir2; done"

We've measured thread A's execution time.

[ w/o patch ]
Elapsed Time: Avg. 68 seconds
[ w/  patch ]
Elapsed Time: Avg. 48 seconds

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan]
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-03 13:03:06 -08:00
Chao Yu
c8e43d55b1 f2fs: relocate inline conversion from mmap() to mkwrite()
If there is page fault only for read case on inline inode, we don't need
to convert inline inode, instead, let's do conversion for write case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-02 09:14:27 -08:00
Dehe Gu
39f71b7e40 f2fs: fix a wrong condition in __submit_bio
We should use !F2FS_IO_ALIGNED() to check and submit_io directly.

Fixes: 8223ecc456 ("f2fs: fix to add missing F2FS_IO_ALIGNED() condition")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Dehe Gu <gudehe@huawei.com>
Signed-off-by: Ge Qiu <qiuge@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2021-02-02 09:14:27 -08:00