Commit Graph

437 Commits

Author SHA1 Message Date
Kemeng Shi 12f7900c57 writeback: move wb_wakeup_delayed defination to fs-writeback.c
The wb_wakeup_delayed is only used in fs-writeback.c. Move it to
fs-writeback.c after defination of wb_wakeup and make it static.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240118203339.764093-1-shikemeng@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-01-22 15:33:38 +01:00
David Howells c9c4ff12df netfs: Move pinning-for-writeback from fscache to netfs
Move the resource pinning-for-writeback from fscache code to netfslib code.
This is used to keep a cache backing object pinned whilst we have dirty
pages on the netfs inode in the pagecache such that VM writeback will be
able to reach it.

Whilst we're at it, switch the parameters of netfs_unpin_writeback() to
match ->write_inode() so that it can be used for that directly.

Note that this mechanism could be more generically useful than that for
network filesystems.  Quite often they have to keep around other resources
(e.g. authentication tokens or network connections) until the writeback is
complete.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
2023-12-24 15:08:49 +00:00
Linus Torvalds 3b3f874cc1 vfs-6.7.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTpoQAAKCRCRxhvAZXjc
 ovFNAQDgIRjXfZ1Ku+USxsRRdqp8geJVaNc3PuMmYhOYhUenqgEAmC1m+p0y31dS
 P6+HlL16Mqgu0tpLCcJK9BibpDZ0Ew4=
 =7yD1
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "This contains the usual miscellaneous features, cleanups, and fixes
  for vfs and individual fses.

  Features:

   - Rename and export helpers that get write access to a mount. They
     are used in overlayfs to get write access to the upper mount.

   - Print the pretty name of the root device on boot failure. This
     helps in scenarios where we would usually only print
     "unknown-block(1,2)".

   - Add an internal SB_I_NOUMASK flag. This is another part in the
     endless POSIX ACL saga in a way.

     When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
     the umask because if the relevant inode has POSIX ACLs set it might
     take the umask from there. But if the inode doesn't have any POSIX
     ACLs set then we apply the umask in the filesytem itself. So we end
     up with:

      (1) no SB_POSIXACL -> strip umask in vfs
      (2) SB_POSIXACL    -> strip umask in filesystem

     The umask semantics associated with SB_POSIXACL allowed filesystems
     that don't even support POSIX ACLs at all to raise SB_POSIXACL
     purely to avoid umask stripping. That specifically means NFS v4 and
     Overlayfs. NFS v4 does it because it delegates this to the server
     and Overlayfs because it needs to delegate umask stripping to the
     upper filesystem, i.e., the filesystem used as the writable layer.

     This went so far that SB_POSIXACL is raised eve on kernels that
     don't even have POSIX ACL support at all.

     Stop this blatant abuse and add SB_I_NOUMASK which is an internal
     superblock flag that filesystems can raise to opt out of umask
     handling. That should really only be the two mentioned above. It's
     not that we want any filesystems to do this. Ideally we have all
     umask handling always in the vfs.

   - Make overlayfs use SB_I_NOUMASK too.

   - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
     IS_POSIXACL() if the kernel doesn't have support for it. This is a
     very old patch but it's only possible to do this now with the wider
     cleanup that was done.

   - Follow-up work on fake path handling from last cycle. Citing mostly
     from Amir:

     When overlayfs was first merged, overlayfs files of regular files
     and directories, the ones that are installed in file table, had a
     "fake" path, namely, f_path is the overlayfs path and f_inode is
     the "real" inode on the underlying filesystem.

     In v6.5, we took another small step by introducing of the
     backing_file container and the file_real_path() helper. This change
     allowed vfs and filesystem code to get the "real" path of an
     overlayfs backing file. With this change, we were able to make
     fsnotify work correctly and report events on the "real" filesystem
     objects that were accessed via overlayfs.

     This method works fine, but it still leaves the vfs vulnerable to
     new code that is not aware of files with fake path. A recent
     example is commit db1d1e8b98 ("IMA: use vfs_getattr_nosec to get
     the i_version"). This commit uses direct referencing to f_path in
     IMA code that otherwise uses file_inode() and file_dentry() to
     reference the filesystem objects that it is measuring.

     This contains work to switch things around: instead of having
     filesystem code opt-in to get the "real" path, have generic code
     opt-in for the "fake" path in the few places that it is needed.

     Is it far more likely that new filesystems code that does not use
     the file_dentry() and file_real_path() helpers will end up causing
     crashes or averting LSM/audit rules if we keep the "fake" path
     exposed by default.

     This change already makes file_dentry() moot, but for now we did
     not change this helper just added a WARN_ON() in ovl_d_real() to
     catch if we have made any wrong assumptions.

     After the dust settles on this change, we can make file_dentry() a
     plain accessor and we can drop the inode argument to ->d_real().

   - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small
     change but it really isn't and I would like to see everyone on
     their tippie toes for any possible bugs from this work.

     Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for
     files since a very long time because of the nasty interactions
     between the SCM_RIGHTS file descriptor garbage collection. So
     extending it makes a lot of sense but it is a subtle change. There
     are almost no places that fiddle with file rcu semantics directly
     and the ones that did mess around with struct file internal under
     rcu have been made to stop doing that because it really was always
     dodgy.

     I forgot to put in the link tag for this change and the discussion
     in the commit so adding it into the merge message:

       https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com

  Cleanups:

   - Various smaller pipe cleanups including the removal of a spin lock
     that was only used to protect against writes without pipe_lock()
     from O_NOTIFICATION_PIPE aka watch queues. As that was never
     implemented remove the additional locking from pipe_write().

   - Annotate struct watch_filter with the new __counted_by attribute.

   - Clarify do_unlinkat() cleanup so that it doesn't look like an extra
     iput() is done that would cause issues.

   - Simplify file cleanup when the file has never been opened.

   - Use module helper instead of open-coding it.

   - Predict error unlikely for stale retry.

   - Use WRITE_ONCE() for mount expiry field instead of just commenting
     that one hopes the compiler doesn't get smart.

  Fixes:

   - Fix readahead on block devices.

   - Fix writeback when layztime is enabled and inodes whose timestamp
     is the only thing that changed reside on wb->b_dirty_time. This
     caused excessively large zombie memory cgroup when lazytime was
     enabled as such inodes weren't handled fast enough.

   - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()"

* tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits)
  file, i915: fix file reference for mmap_singleton()
  vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups
  writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
  chardev: Simplify usage of try_module_get()
  ovl: rely on SB_I_NOUMASK
  fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n
  fs: store real path instead of fake path in backing file f_path
  fs: create helper file_user_path() for user displayed mapped file path
  fs: get mnt_writers count for an open backing file's real path
  vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked
  vfs: predict the error in retry_estale as unlikely
  backing file: free directly
  vfs: fix readahead(2) on block devices
  io_uring: use files_lookup_fd_locked()
  file: convert to SLAB_TYPESAFE_BY_RCU
  vfs: shave work on failed file open
  fs: simplify misleading code to remove ambiguity regarding ihold()/iput()
  watch_queue: Annotate struct watch_filter with __counted_by
  fs/pipe: use spinlock in pipe_read() only if there is a watch_queue
  fs/pipe: remove unnecessary spinlock from pipe_write()
  ...
2023-10-30 09:14:19 -10:00
Jingbo Xu 6654408a33
writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
The cgwb cleanup routine will try to release the dying cgwb by switching
the attached inodes.  It fetches the attached inodes from wb->b_attached
list, omitting the fact that inodes only with dirty timestamps reside in
wb->b_dirty_time list, which is the case when lazytime is enabled.  This
causes enormous zombie memory cgroup when lazytime is enabled, as inodes
with dirty timestamps can not be switched to a live cgwb for a long time.

It is reasonable not to switch cgwb for inodes with dirty data, as
otherwise it may break the bandwidth restrictions.  However since the
writeback of inode metadata is not accounted for, let's also switch
inodes with dirty timestamps to avoid zombie memory and block cgroups
when laztytime is enabled.

Fixes: c22d70a162 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20231014125511.102978-1-jefflexu@linux.alibaba.com
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19 11:03:16 +02:00
Chunhai Guo be049c3a08
fs-writeback: do not requeue a clean inode having skipped pages
When writing back an inode and performing an fsync on it concurrently, a
deadlock issue may arise as shown below. In each writeback iteration, a
clean inode is requeued to the wb->b_dirty queue due to non-zero
pages_skipped, without anything actually being written. This causes an
infinite loop and prevents the plug from being flushed, resulting in a
deadlock. We now avoid requeuing the clean inode to prevent this issue.

    wb_writeback        fsync (inode-Y)
blk_start_plug(&plug)
for (;;) {
  iter i-1: some reqs with page-X added into plug->mq_list // f2fs node page-X with PG_writeback
                        filemap_fdatawrite
                          __filemap_fdatawrite_range // write inode-Y with sync_mode WB_SYNC_ALL
                           do_writepages
                            f2fs_write_data_pages
                             __f2fs_write_data_pages // wb_sync_req[DATA]++ for WB_SYNC_ALL
                              f2fs_write_cache_pages
                               f2fs_write_single_data_page
                                f2fs_do_write_data_page
                                 f2fs_outplace_write_data
                                  f2fs_update_data_blkaddr
                                   f2fs_wait_on_page_writeback
                                     wait_on_page_writeback // wait for f2fs node page-X
  iter i:
    progress = __writeback_inodes_wb(wb, work)
    . writeback_sb_inodes
    .   __writeback_single_inode // write inode-Y with sync_mode WB_SYNC_NONE
    .   . do_writepages
    .   .   f2fs_write_data_pages
    .   .   .  __f2fs_write_data_pages // skip writepages due to (wb_sync_req[DATA]>0)
    .   .   .   wbc->pages_skipped += get_dirty_pages(inode) // wbc->pages_skipped = 1
    .   if (!(inode->i_state & I_DIRTY_ALL)) // i_state = I_SYNC | I_SYNC_QUEUED
    .    total_wrote++;  // total_wrote = 1
    .   requeue_inode // requeue inode-Y to wb->b_dirty queue due to non-zero pages_skipped
    if (progress) // progress = 1
      continue;
  iter i+1:
      queue_io
      // similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug)   // flush plug won't be called

Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230916045131.957929-1-guochunhai@vivo.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-20 14:22:01 +02:00
Christian Brauner d8ce82efde super: make locking naming consistent
Make the naming consistent with the earlier introduced
super_lock_{read,write}() helpers.

Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230818-vfs-super-fixes-v3-v3-2-9f0b1876e46b@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 14:36:57 +02:00
Yosry Ahmed 2816ea2abf writeback: move wb_over_bg_thresh() call outside lock section
Patch series "cgroup: eliminate atomic rstat flushing", v5.

A previous patch series [1] changed most atomic rstat flushing contexts to
become non-atomic.  This was done to avoid an expensive operation that
scales with # cgroups and # cpus to happen with irqs disabled and
scheduling not permitted.  There were two remaining atomic flushing
contexts after that series.  This series tries to eliminate them as well,
eliminating atomic rstat flushing completely.

The two remaining atomic flushing contexts are:
(a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
(b) mem_cgroup_threshold()->mem_cgroup_usage()

For (a), flushing needs to be atomic as wb_writeback() calls
wb_over_bg_thresh() with a spinlock held.  However, it seems like the call
to wb_over_bg_thresh() doesn't need to be protected by that spinlock, so
this series proposes a refactoring that moves the call outside the lock
criticial section and makes the stats flushing in mem_cgroup_wb_stats()
non-atomic.

For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
with irqs disabled.  We only flush the stats when calculating the root
usage, as it is approximated as the sum of some memcg stats (file, anon,
and optionally swap) instead of the conventional page counter.  This
series proposes changing this calculation to use the global stats instead,
eliminating the need for a memcg stat flush.

After these 2 contexts are eliminated, we no longer need
mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic().  We can
remove them and simplify the code.

[1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/


This patch (of 5):

wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
flush, which can be expensive on large systems. Currently,
wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
have to do the rstat flush atomically. On systems with a lot of
cpus and/or cgroups, this can cause us to disable irqs for a long time,
potentially causing problems.

Move the call to wb_over_bg_thresh() outside the lock section in
preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
The list_empty(&wb->work_list) check should be okay outside the lock
section of wb->list_lock as it is protected by a separate lock
(wb->work_lock), and wb_over_bg_thresh() doesn't seem like it is
modifying any of wb->b_* lists the wb->list_lock is protecting.
Also, the loop seems to be already releasing and reacquring the
lock, so this refactoring looks safe.

Link: https://lkml.kernel.org/r/20230421174020.2994750-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230421174020.2994750-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:14 -07:00
Linus Torvalds a3b111b046 for-6.4/block-2023-05-06
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmRWLQYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnwhD/4xRYfAY4O0oGZCITiKEEFxiSfDHCPMEBji
 zTEtideDRjcrpFmjmZ411C9prW32MxYQ3wQTf4O7w4t906xTYVr9FQy8g3Et4izI
 zglPcsa2jPzYCadQ0Ye4n9dRuYOH9FDJzDDLC+smu8zQKKmAqEAN/1ftpnADdVrY
 qX1sHyfz4RQRAgTHg2WqgKOi2O9VwGSMwOfBocmzAnruv9oLUlypcGnFPRCSZH/a
 OKpUZlvQhCBZTKScvVxQeJMg2Tl5yokQ0TH+gkQsdav9XcPJktqXiuD+c4h6q5ux
 oTlysEqcrwcaAafOcV0w9u80SFNlFACUYsNmEnFJaPXFTqAdNHvo1DJNsmxiHJDU
 bGo5ktlo5b/VZ51niOoWvGxursavq16G4yIYlGGHc7f4wGs12oc5ZP/yM3GRUY+C
 PdezwEvvQufxP7sFokfpgAS4SuH+tBlrhFXMsYaI4NukZQW4TK1zzbMrzOkdxFhW
 BOx17VFUKWtUnRmxinFGIA8Vj+FXN+E+ND+FoDsbrMyJD4maKDdJapPchG0J0Vbs
 pDcsB4c0pBC6H2xrobKiA1CuSq2t2qvyvwe1Zl2Xd+RVW9vBB5SI6HXYrC+UtxwY
 7LfX8F13cFD1E6iJ9Nta6x8fOunGnOVBdW5O0k4hDWEuZduvHItEDn2c3Ehqp4Jw
 P8dFBbk8SQ==
 =gAYf
 -----END PGP SIGNATURE-----

Merge tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - MD pull request via Song:
      - Improve raid5 sequential IO performance on spinning disks, which
        fixes a regression since v6.0 (Jan Kara)
      - Fix bitmap offset types, which fixes an issue introduced in this
        merge window (Jonathan Derrick)

 - Cleanup of hweight type used for cgroup writeback (Maxim)

 - Fix a regression with the "has_submit_bio" changes across partitions
   (Ming)

 - Cleanup of QUEUE_FLAG_ADD_RANDOM clearing.

   We used to set this flag on queues non blk-mq queues, and hence some
   drivers clear it unconditionally. Since all of these have since been
   converted to true blk-mq drivers, drop the useless clear as the bit
   is not set (Chaitanya)

 - Fix the flags being set in a bio for a flush for drbd (Christoph)

 - Cleanup and deduplication of the code handling setting block device
   capacity (Damien)

 - Fix for ublk handling IO timeouts (Ming)

 - Fix for a regression in blk-cgroup teardown (Tao)

 - NBD documentation and code fixes (Eric)

 - Convert blk-integrity to using device_attributes rather than a second
   kobject to manage lifetimes (Thomas)

* tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux:
  ublk: add timeout handler
  drbd: correctly submit flush bio on barrier
  mailmap: add mailmap entries for Jens Axboe
  block: Skip destroyed blkg when restart in blkg_destroy_all()
  writeback: fix call of incorrect macro
  md: Fix bitmap offset type in sb writer
  md/raid5: Improve performance for sequential IO
  docs nbd: userspace NBD now favors github over sourceforge
  block nbd: use req.cookie instead of req.handle
  uapi nbd: add cookie alias to handle
  uapi nbd: improve doc links to userspace spec
  blk-integrity: register sysfs attributes on struct device
  blk-integrity: convert to struct device_attribute
  blk-integrity: use sysfs_emit
  block/drivers: remove dead clear of random flag
  block: sync part's ->bd_has_submit_bio with disk's
  block: Cleanup set_capacity()/bdev_set_nr_sectors()
2023-05-06 08:28:58 -07:00
Maxim Korotkov 3e46c89c74 writeback: fix call of incorrect macro
the variable 'history' is of type u16, it may be an error
 that the hweight32 macro was used for it
 I guess macro hweight16 should be used

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 2a81490811 ("writeback: implement foreign cgroup inode detection")
Signed-off-by: Maxim Korotkov <korotkov.maxim.s@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230119104443.3002-1-korotkov.maxim.s@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-28 10:41:32 -06:00
Baokun Li 1ba1199ec5 writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
KASAN report null-ptr-deref:
==================================================================
BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
Write of size 8 at addr 0000000000000000 by task sync/943
CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
Call Trace:
 <TASK>
 dump_stack_lvl+0x7f/0xc0
 print_report+0x2ba/0x340
 kasan_report+0xc4/0x120
 kasan_check_range+0x1b7/0x2e0
 __kasan_check_write+0x24/0x40
 bdi_split_work_to_wbs+0x5c5/0x7b0
 sync_inodes_sb+0x195/0x630
 sync_inodes_one_sb+0x3a/0x50
 iterate_supers+0x106/0x1b0
 ksys_sync+0x98/0x160
[...]
==================================================================

The race that causes the above issue is as follows:

           cpu1                     cpu2
-------------------------|-------------------------
inode_switch_wbs
 INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
 queue_rcu_work(isw_wq, &isw->work)
 // queue_work async
  inode_switch_wbs_work_fn
   wb_put_many(old_wb, nr_switched)
    percpu_ref_put_many
     ref->data->release(ref)
     cgwb_release
      queue_work(cgwb_release_wq, &wb->release_work)
      // queue_work async
       &wb->release_work
       cgwb_release_workfn
                            ksys_sync
                             iterate_supers
                              sync_inodes_one_sb
                               sync_inodes_sb
                                bdi_split_work_to_wbs
                                 kmalloc(sizeof(*work), GFP_ATOMIC)
                                 // alloc memory failed
        percpu_ref_exit
         ref->data = NULL
         kfree(data)
                                 wb_get(wb)
                                  percpu_ref_get(&wb->refcnt)
                                   percpu_ref_get_many(ref, 1)
                                    atomic_long_add(nr, &ref->data->count)
                                     atomic64_add(i, v)
                                     // trigger null-ptr-deref

bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
wbs.  If the allocation of new work fails, the on-stack fallback will be
used and the reference count of the current wb is increased afterwards. 
If cgroup writeback membership switches occur before getting the reference
count and the current wb is released as old_wd, then calling wb_get() or
wb_put() will trigger the null pointer dereference above.

This issue was introduced in v4.3-rc7 (see fix tag1).  Both
sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
bdi_split_work_to_wbs() can trigger this issue.  For scenarios called via
sync_inodes_sb(), originally commit 7fc5854f8c ("writeback: synchronize
sync(2) against cgroup writeback membership switches") reduced the
possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see
fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
and the issue becomes easily reproducible again.

To solve this problem, percpu_ref_exit() is called under RCU protection to
avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs(). 
Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
and skip the current wb if wb_tryget() fails because the wb has already
been shutdown.

Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
Fixes: b817525a4a ("writeback: bdi_writeback iteration must not skip dying ones")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-16 10:41:26 -07:00
Matthew Wilcox (Oracle) 75376c6fb9 mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio()
Only one caller doesn't have a folio, so move the page_folio() call to
that one caller from mem_cgroup_css_from_folio().

Link: https://lkml.kernel.org/r/20230116192507.2146150-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:19 -08:00
Matthew Wilcox (Oracle) 9cfb816b1c mm/fs: convert inode_attach_wb() to take a folio
Patch series "Writeback folio conversions".

Remove more calls to compound_head() by passing folios around instead of
pages.


This patch (of 2):

The only caller of inode_attach_wb() which doesn't pass NULL already has a
folio, so convert the whole call-chain to take folios.

Link: https://lkml.kernel.org/r/20230116192507.2146150-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230116192507.2146150-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:19 -08:00
Linus Torvalds 23dc9c755a for-6.2/writeback-2022-12-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmOX8JgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpi7zD/4qlRXcD3ImctGqZ4cD4FUmM5ktF4uX7oA4
 ss6ytg8jM8JBiL4dY6XVQAAVZAYX/3h5WXAJa9Ih+S6M/+hORHQgHLt0oCulDRvS
 d+QNwRBo/+YxDc0rrcZ6JiHqv3pzNyeKo0AW91uNaBX6OKN6q8yrE2icfvNYgnQv
 +9YFJhonE9Qp1AJzgg8ohWJRJXdfp1nGa45Wkqkcf3H4LYFG8vA00M2OKxHgmzuY
 xJr4a40aGJiSgziP5CYsyS7jRCgNxkAqnyDSi1c+JbYy3J8nZOmMzq9tGDBZvgiW
 qD/tamgbJ+yibqgQENm8BOplwoKkOX0G4rl80gnodIJDHpLdMC1pAKZiabgJXVU3
 9khYsZFysjeETfGc095AW736XtyEPVpFuS3pBV9cGGPtHqlWpGzZeRutPkHioupw
 hJNfwbVcU0SrUzYRGUjiLx9e2M77tSX9d8iX/AmpF2WtdIwfAkzlOa4ObUEud0AW
 2Pr3tdT40Gy/E/7UgyxXzdInUv/GPVt/9JQ9rPRx0nH5SgQUGStt+shimupwtSJ1
 6jjxb1TfmINl88mqNbxbI6upYgh5uPaZQ+JUw/p9ztQTUGFymEiFAstyv5pDwThT
 6xXyUv7keUnuvQayzPgzcqsk6bQcJmwT0+bQ+nDuGANhtUhsyo1ooPYO+rttvcRq
 hnzsB6a2uA==
 =9UA2
 -----END PGP SIGNATURE-----

Merge tag 'for-6.2/writeback-2022-12-12' of git://git.kernel.dk/linux

Pull writeback updates from Jens Axboe:

 - Sanity check adding freed inodes to lists (Jan)

 - Removal of an old unused define (Miaohe)

* tag 'for-6.2/writeback-2022-12-12' of git://git.kernel.dk/linux:
  writeback: remove obsolete macro EXPIRE_DIRTY_ATIME
  writeback: Add asserts for adding freed inode to lists
2022-12-15 18:09:48 -08:00
Miaohe Lin 23e188a164 writeback: remove obsolete macro EXPIRE_DIRTY_ATIME
EXPIRE_DIRTY_ATIME is not used anymore. Remove it.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221210101042.2012931-1-linmiaohe@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-12 13:08:42 -07:00
Jan Kara a9438b44bc writeback: Add asserts for adding freed inode to lists
In the past we had several use-after-free issues with inodes getting
added to writeback lists after evict() removed them. These are painful
to debug so add some asserts to catch the problem earlier. The only
non-obvious change in the commit is that we need to tweak
redirty_tail_locked() to avoid triggering assertion in
inode_io_list_move_locked().

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221212113633.29181-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-12 13:08:34 -07:00
Svyatoslav Feldsherov 4e3c51f4e8 fs: do not update freeing inode i_io_list
After commit cbfecb927f ("fs: record I_DIRTY_TIME even if inode
already has I_DIRTY_INODE") writeback_single_inode can push inode with
I_DIRTY_TIME set to b_dirty_time list. In case of freeing inode with
I_DIRTY_TIME set this can happen after deletion of inode from i_io_list
at evict. Stack trace is following.

evict
fat_evict_inode
fat_truncate_blocks
fat_flush_inodes
writeback_inode
sync_inode_metadata(inode, sync=0)
writeback_single_inode(inode, wbc) <- wbc->sync_mode == WB_SYNC_NONE

This will lead to use after free in flusher thread.

Similar issue can be triggered if writeback_single_inode in the
stack trace update inode->i_io_list. Add explicit check to avoid it.

Fixes: cbfecb927f ("fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE")
Reported-by: syzbot+6ba92bd00d5093f7e371@syzkaller.appspotmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Svyatoslav Feldsherov <feldsherov@google.com>
Link: https://lore.kernel.org/r/20221115202001.324188-1-feldsherov@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-11-22 17:00:00 -05:00
Lukas Czerner cbfecb927f fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
Currently the I_DIRTY_TIME will never get set if the inode already has
I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME.  That's
true, however ext4 will only update the on-disk inode in
->dirty_inode(), not on actual writeback. As a result if the inode
already has I_DIRTY_INODE state by the time we get to
__mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
into on-disk inode and will not get updated until the next I_DIRTY_INODE
update, which might never come if we crash or get a power failure.

The problem can be reproduced on ext4 by running xfstest generic/622
with -o iversion mount option.

Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
I_DIRTY_INODE. Also make sure that the case is properly handled in
writeback_single_inode() as well. Additionally changes in
xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.

Thanks Jan Kara for suggestions on how to make this work properly.

Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220825100657.44217-1-lczerner@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-09-29 23:02:00 -04:00
Khazhismel Kumykov f87904c075 writeback: avoid use-after-free after removing device
When a disk is removed, bdi_unregister gets called to stop further
writeback and wait for associated delayed work to complete.  However,
wb_inode_writeback_end() may schedule bandwidth estimation dwork after
this has completed, which can result in the timer attempting to access the
just freed bdi_writeback.

Fix this by checking if the bdi_writeback is alive, similar to when
scheduling writeback work.

Since this requires wb->work_lock, and wb_inode_writeback_end() may get
called from interrupt, switch wb->work_lock to an irqsafe lock.

Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com
Fixes: 45a2966fd6 ("writeback: fix bandwidth estimate for spiky workload")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-28 14:02:43 -07:00
Jchao Sun 10e1407310 writeback: Fix inode->i_io_list not be protected by inode->i_lock error
Commit b35250c081 ("writeback: Protect inode->i_io_list with
inode->i_lock") made inode->i_io_list not only protected by
wb->list_lock but also inode->i_lock, but inode_io_list_move_locked()
was missed. Add lock there and also update comment describing
things protected by inode->i_lock. This also fixes a race where
__mark_inode_dirty() could move inode under flush worker's hands
and thus sync(2) could miss writing some inodes.

Fixes: b35250c081 ("writeback: Protect inode->i_io_list with inode->i_lock")
Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com
CC: stable@vger.kernel.org
Signed-off-by: Jchao Sun <sunjunchao2870@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-06-06 09:54:30 +02:00
Linus Torvalds 8b728edc5b \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmKOH70ACgkQnJ2qBz9k
 QNkRBwgAxhAQmdNGzsMmE9wlF/Dj9VyGioug9thdUBMfEL6xXr1/Gf0e+fMLFFSr
 h2niti2jcuAuYSMqJNuss1Qdo8rWuGbKvjEyqzDfgJ1VDy6oGD14i1qPbKUv1/LN
 AkH4kClQpsltPYYOq9gkagOwRtHaMPCRc1HreqMGk/rDqCe26vZ4xkp2zr328Dt7
 JVdiHNnLj5d03xt3LWgchtqptdaRHJm8ymvICNcIxV3Ievxs+p21BhA/2m/v22/i
 sG/Kg9g7+bQ5sTI+DQHCjodylm9PciFviTRPc3xOkjXDLQ/Md63RlNoHja6uCrPD
 OxvH/L9rZvJ/yfSrg7ztDxcB1vGT/w==
 =UWYT
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull writeback and ext2 cleanups from Jan Kara:
 "One small ext2 cleanup and one writeback spelling fix"

* tag 'fs_for_v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  writeback: fix typo in comment
  fs: ext2: Fix duplicate included linux/dax.h
2022-05-25 19:24:06 -07:00
Julia Lawall 2999e1e387 writeback: fix typo in comment
Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.

Link: https://lore.kernel.org/r/20220521111145.81697-32-Julia.Lawall@inria.fr
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>

Signed-off-by: Jan Kara <jack@suse.cz>
2022-05-23 17:39:20 +02:00
Zhihao Cheng 68f4c6eba7 fs-writeback: writeback_sb_inodes:Recalculate 'wrote' according skipped pages
Commit 505a666ee3 ("writeback: plug writeback in wb_writeback() and
writeback_inodes_wb()") has us holding a plug during wb_writeback, which
may cause a potential ABBA dead lock:

    wb_writeback		fat_file_fsync
blk_start_plug(&plug)
for (;;) {
  iter i-1: some reqs have been added into plug->mq_list  // LOCK A
  iter i:
    progress = __writeback_inodes_wb(wb, work)
    . writeback_sb_inodes // fat's bdev
    .   __writeback_single_inode
    .   . generic_writepages
    .   .   __block_write_full_page
    .   .   . . 	    __generic_file_fsync
    .   .   . . 	      sync_inode_metadata
    .   .   . . 	        writeback_single_inode
    .   .   . . 		  __writeback_single_inode
    .   .   . . 		    fat_write_inode
    .   .   . . 		      __fat_write_inode
    .   .   . . 		        sync_dirty_buffer	// fat's bdev
    .   .   . . 			  lock_buffer(bh)	// LOCK B
    .   .   . . 			    submit_bh
    .   .   . . 			      blk_mq_get_tag	// LOCK A
    .   .   . trylock_buffer(bh)  // LOCK B
    .   .   .   redirty_page_for_writepage
    .   .   .     wbc->pages_skipped++
    .   .   --wbc->nr_to_write
    .   wrote += write_chunk - wbc.nr_to_write  // wrote > 0
    .   requeue_inode
    .     redirty_tail_locked
    if (progress)    // progress > 0
      continue;
  iter i+1:
      queue_io
      // similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug)   // flush plug won't be called

Above process triggers a hungtask like:
[  399.044861] INFO: task bb:2607 blocked for more than 30 seconds.
[  399.046824]       Not tainted 5.18.0-rc1-00005-gefae4d9eb6a2-dirty
[  399.051539] task:bb              state:D stack:    0 pid: 2607 ppid:
2426 flags:0x00004000
[  399.051556] Call Trace:
[  399.051570]  __schedule+0x480/0x1050
[  399.051592]  schedule+0x92/0x1a0
[  399.051602]  io_schedule+0x22/0x50
[  399.051613]  blk_mq_get_tag+0x1d3/0x3c0
[  399.051640]  __blk_mq_alloc_requests+0x21d/0x3f0
[  399.051657]  blk_mq_submit_bio+0x68d/0xca0
[  399.051674]  __submit_bio+0x1b5/0x2d0
[  399.051708]  submit_bio_noacct+0x34e/0x720
[  399.051718]  submit_bio+0x3b/0x150
[  399.051725]  submit_bh_wbc+0x161/0x230
[  399.051734]  __sync_dirty_buffer+0xd1/0x420
[  399.051744]  sync_dirty_buffer+0x17/0x20
[  399.051750]  __fat_write_inode+0x289/0x310
[  399.051766]  fat_write_inode+0x2a/0xa0
[  399.051783]  __writeback_single_inode+0x53c/0x6f0
[  399.051795]  writeback_single_inode+0x145/0x200
[  399.051803]  sync_inode_metadata+0x45/0x70
[  399.051856]  __generic_file_fsync+0xa3/0x150
[  399.051880]  fat_file_fsync+0x1d/0x80
[  399.051895]  vfs_fsync_range+0x40/0xb0
[  399.051929]  __x64_sys_fsync+0x18/0x30

In my test, 'need_resched()' (which is imported by 590dca3a71 "fs-writeback:
unplug before cond_resched in writeback_sb_inodes") in function
'writeback_sb_inodes()' seldom comes true, unless cond_resched() is deleted
from write_cache_pages().

Fix it by correcting wrote number according number of skipped pages
in writeback_sb_inodes().

Goto Link to find a reproducer.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215837
Cc: stable@vger.kernel.org # v4.3
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220510133805.1988292-1-chengzhihao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-19 06:29:41 -06:00
Jing Xia 846a3351dd writeback: Avoid skipping inode writeback
We have run into an issue that a task gets stuck in
balance_dirty_pages_ratelimited() when perform I/O stress testing.
The reason we observed is that an I_DIRTY_PAGES inode with lots
of dirty pages is in b_dirty_time list and standard background
writeback cannot writeback the inode.
After studing the relevant code, the following scenario may lead
to the issue:

task1                                   task2
-----                                   -----
fuse_flush
 write_inode_now //in b_dirty_time
  writeback_single_inode
   __writeback_single_inode
                                 fuse_write_end
                                  filemap_dirty_folio
                                   __xa_set_mark:PAGECACHE_TAG_DIRTY
    lock inode->i_lock
    if mapping tagged PAGECACHE_TAG_DIRTY
    inode->i_state |= I_DIRTY_PAGES
    unlock inode->i_lock
                                   __mark_inode_dirty:I_DIRTY_PAGES
                                      lock inode->i_lock
                                      -was dirty,inode stays in
                                      -b_dirty_time
                                      unlock inode->i_lock

   if(!(inode->i_state & I_DIRTY_All))
      -not true,so nothing done

This patch moves the dirty inode to b_dirty list when the inode
currently is not queued in b_io or b_more_io list at the end of
writeback_single_inode.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: stable@vger.kernel.org
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
Signed-off-by: Jing Xia <jing.xia@unisoc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220510023514.27399-1-jing.xia@unisoc.com
2022-05-10 12:04:21 +02:00
Linus Torvalds 3bf03b9a08 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs

 - Most the MM patches which precede the patches in Willy's tree: kasan,
   pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
   sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb,
   userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp,
   cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap,
   zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits)
  mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release()
  Docs/ABI/testing: add DAMON sysfs interface ABI document
  Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface
  selftests/damon: add a test for DAMON sysfs interface
  mm/damon/sysfs: support DAMOS stats
  mm/damon/sysfs: support DAMOS watermarks
  mm/damon/sysfs: support schemes prioritization
  mm/damon/sysfs: support DAMOS quotas
  mm/damon/sysfs: support DAMON-based Operation Schemes
  mm/damon/sysfs: support the physical address space monitoring
  mm/damon/sysfs: link DAMON for virtual address spaces monitoring
  mm/damon: implement a minimal stub for sysfs-based DAMON interface
  mm/damon/core: add number of each enum type values
  mm/damon/core: allow non-exclusive DAMON start/stop
  Docs/damon: update outdated term 'regions update interval'
  Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling
  Docs/vm/damon: call low level monitoring primitives the operations
  mm/damon: remove unnecessary CONFIG_DAMON option
  mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}()
  mm/damon/dbgfs-test: fix is_target_id() change
  ...
2022-03-22 16:11:53 -07:00
Hugh Dickins b698f0a177 mm/fs: delete PF_SWAPWRITE
PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8e ("mm:
vmscan: do not writeback filesystem pages in direct reclaim").

Coincidentally, NeilBrown's current patch "remove inode_congested()"
deletes may_write_to_inode(), which appeared to be the one function which
took notice of PF_SWAPWRITE.  But if you study the old logic, and the
conditions under which may_write_to_inode() was called, you discover that
flag and function have been pointless for a decade.

Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Jan Kara <jack@suse.de>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:08 -07:00
NeilBrown fe55d563d4 remove inode_congested()
inode_congested() reports if the backing-device for the inode is
congested.  No bdi reports congestion any more, so this always returns
'false'.

So remove inode_congested() and related functions, and remove the call
sites, assuming that inode_congested() always returns 'false'.

Link: https://lkml.kernel.org/r/164549983741.9187.2174285592262191311.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:01 -07:00
Christoph Hellwig aa8dcccaf3 block: check that there is a plug in blk_flush_plug
Rename blk_flush_plug to __blk_flush_plug and add a wrapper that includes
the NULL check instead of open coding that check everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220127070549.1377856-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:50:00 -07:00
Christoph Hellwig b1f866b013 block: remove blk_needs_flush_plug
blk_needs_flush_plug fails to account for the cb_list, which needs
flushing as well.  Remove it and just check if there is a plug instead
of poking into the internals of the plug structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220127070549.1377856-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:50:00 -07:00
Linus Torvalds 8834147f95 fscache rewrite
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmHeBGsACgkQ+7dXa6fL
 C2tyLw/8C2Gs/XvOZvRO7KPetKI9BbQSFoCe7uvGbiPq5CEmgcjWzQxvQGklBiZD
 qYa6pMNye1iGpsHOY3Yu210b7vMQiRLnnxvVle0UrjpZR7CcxYS0gGV+6yRdbDGy
 W1X6GFiX06qiNsgBH4msYp0SmbhhfkTyAx1BeBZAEtX8iFgaPfOldPY2nLMcTDD6
 6FT1nTzRcMHx9IUQZJtpeatzc70Qg8+fOr2UAY2nOIypXh6+vAMBO80xtUjGVU+1
 pWD1E+8cXSLfwEEzquFWoWTsTX7hNfsesEN10FmBf1bVCH9ZDFE01MOl6B8+CkFl
 +xfkvDNFC3yyUwAMVAV4+A4Be+cVLSqN2R91QIKJnAj9w1OjxASrwZJ1YeZp6KP4
 h0XKuPs3sRwwbNPVL/nP0UPNexoJnOUAaHesl4uKkRrExmxz9xGOIqIri2+tUIO+
 HkGyNns1huymj1K1ja4AQbDiZZX39GgYVleyg9g3uuy1FS4k+/myJcXo/CqWn3ON
 4oeNwxwLvlcqIQnPrESvwev50lFZYB4pfwvez6T2C5dL/Wk/xdeJK9iG81RWgx7y
 5XcDeoGDE08gMCGWVPjuhOCXypeiRGHhRNlcxTtq5kLwBZGkcYg/wFFnWn+6hzc4
 kyXw2kS5WZq4Q/FPh7BdY0eHp6xv0EpAOZwceneLB9lhNINdxcQ=
 =ISJ6
 -----END PGP SIGNATURE-----

Merge tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache rewrite from David Howells:
 "This is a set of patches that rewrites the fscache driver and the
  cachefiles driver, significantly simplifying the code compared to
  what's upstream, removing the complex operation scheduling and object
  state machine in favour of something much smaller and simpler.

  The series is structured such that the first few patches disable
  fscache use by the network filesystems using it, remove the cachefiles
  driver entirely and as much of the fscache driver as can be got away
  with without causing build failures in the network filesystems.

  The patches after that recreate fscache and then cachefiles,
  attempting to add the pieces in a logical order. Finally, the
  filesystems are reenabled and then the very last patch changes the
  documentation.

  [!] Note: I have dropped the cifs patch for the moment, leaving local
      caching in cifs disabled. I've been having trouble getting that
      working. I think I have it done, but it needs more testing (there
      seem to be some test failures occurring with v5.16 also from
      xfstests), so I propose deferring that patch to the end of the
      merge window.

  WHY REWRITE?
  ============

  Fscache's operation scheduling API was intended to handle sequencing
  of cache operations, which were all required (where possible) to run
  asynchronously in parallel with the operations being done by the
  network filesystem, whilst allowing the cache to be brought online and
  offline and to interrupt service for invalidation.

  With the advent of the tmpfile capacity in the VFS, however, an
  opportunity arises to do invalidation much more simply, without having
  to wait for I/O that's actually in progress: Cachefiles can simply
  create a tmpfile, cut over the file pointer for the backing object
  attached to a cookie and abandon the in-progress I/O, dismissing it
  upon completion.

  Future work here would involve using Omar Sandoval's vfs_link() with
  AT_LINK_REPLACE[1] to allow an extant file to be displaced by a new
  hard link from a tmpfile as currently I have to unlink the old file
  first.

  These patches can also simplify the object state handling as I/O
  operations to the cache don't all have to be brought to a stop in
  order to invalidate a file. To that end, and with an eye on to writing
  a new backing cache model in the future, I've taken the opportunity to
  simplify the indexing structure.

  I've separated the index cookie concept from the file cookie concept
  by C type now. The former is now called a "volume cookie" (struct
  fscache_volume) and there is a container of file cookies. There are
  then just the two levels. All the index cookie levels are collapsed
  into a single volume cookie, and this has a single printable string as
  a key. For instance, an AFS volume would have a key of something like
  "afs,example.com,1000555", combining the filesystem name, cell name
  and volume ID. This is freeform, but must not have '/' chars in it.

  I've also eliminated all pointers back from fscache into the network
  filesystem. This required the duplication of a little bit of data in
  the cookie (cookie key, coherency data and file size), but it's not
  actually that much. This gets rid of problems with making sure we keep
  netfs data structures around so that the cache can access them.

  These patches mean that most of the code that was in the drivers
  before is simply gone and those drivers are now almost entirely new
  code. That being the case, there doesn't seem any particular reason to
  try and maintain bisectability across it. Further, there has to be a
  point in the middle where things are cut over as there's a single
  point everything has to go through (ie. /dev/cachefiles) and it can't
  be in use by two drivers at once.

  ISSUES YET OUTSTANDING
  ======================

  There are some issues still outstanding, unaddressed by this patchset,
  that will need fixing in future patchsets, but that don't stop this
  series from being usable:

  (1) The cachefiles driver needs to stop using the backing filesystem's
      metadata to store information about what parts of the cache are
      populated. This is not reliable with modern extent-based
      filesystems.

      Fixing this is deferred to a separate patchset as it involves
      negotiation with the network filesystem and the VM as to how much
      data to download to fulfil a read - which brings me on to (2)...

  (2) NFS (and CIFS with the dropped patch) do not take account of how
      the cache would like I/O to be structured to meet its granularity
      requirements. Previously, the cache used page granularity, which
      was fine as the network filesystems also dealt in page
      granularity, and the backing filesystem (ext4, xfs or whatever)
      did whatever it did out of sight. However, we now have folios to
      deal with and the cache will now have to store its own metadata to
      track its contents.

      The change I'm looking at making for cachefiles is to store
      content bitmaps in one or more xattrs and making a bit in the map
      correspond to something like a 256KiB block. However, the size of
      an xattr and the fact that they have to be read/updated in one go
      means that I'm looking at covering 1GiB of data per 512-byte map
      and storing each map in an xattr. Cachefiles has the potential to
      grow into a fully fledged filesystem of its very own if I'm not
      careful.

      However, I'm also looking at changing things even more radically
      and going to a different model of how the cache is arranged and
      managed - one that's more akin to the way, say, openafs does
      things - which brings me on to (3)...

  (3) The way cachefilesd does culling is very inefficient for large
      caches and it would be better to move it into the kernel if I can
      as cachefilesd has to keep asking the kernel if it can cull a
      file. Changing the way the backend works would allow this to be
      addressed.

  BITS THAT MAY BE CONTROVERSIAL
  ==============================

  There are some bits I've added that may be controversial:

  (1) I've provided a flag, S_KERNEL_FILE, that cachefiles uses to check
      if a files is already being used by some other kernel service
      (e.g. a duplicate cachefiles cache in the same directory) and
      reject it if it is. This isn't entirely necessary, but it helps
      prevent accidental data corruption.

      I don't want to use S_SWAPFILE as that has other effects, but
      quite possibly swapon() should set S_KERNEL_FILE too.

      Note that it doesn't prevent userspace from interfering, though
      perhaps it should. (I have made it prevent a marked directory from
      being rmdir-able).

  (2) Cachefiles wants to keep the backing file for a cookie open whilst
      we might need to write to it from network filesystem writeback.
      The problem is that the network filesystem unuses its cookie when
      its file is closed, and so we have nothing pinning the cachefiles
      file open and it will get closed automatically after a short time
      to avoid EMFILE/ENFILE problems.

      Reopening the cache file, however, is a problem if this is being
      done due to writeback triggered by exit(). Some filesystems will
      oops if we try to open a file in that context because they want to
      access current->fs or suchlike.

      To get around this, I added the following:

      (A) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network
          filesystem inode to indicate that we have a usage count on the
          cookie caching that inode.

      (B) A flag in struct writeback_control, unpinned_fscache_wb, that
          is set when __writeback_single_inode() clears the last dirty
          page from i_pages - at which point it clears
          I_PINNING_FSCACHE_WB and sets this flag.

          This has to be done here so that clearing I_PINNING_FSCACHE_WB
          can be done atomically with the check of PAGECACHE_TAG_DIRTY
          that clears I_DIRTY_PAGES.

      (C) A function, fscache_set_page_dirty(), which if it is not set,
          sets I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to
          pin the cache resources.

      (D) A function, fscache_unpin_writeback(), to be called by
          ->write_inode() to unuse the cookie.

      (E) A function, fscache_clear_inode_writeback(), to be called when
          the inode is evicted, before clear_inode() is called. This
          cleans up any lingering I_PINNING_FSCACHE_WB.

      The network filesystem can then use these tools to make sure that
      fscache_write_to_cache() can write locally modified data to the
      cache as well as to the server.

      For the future, I'm working on write helpers for netfs lib that
      should allow this facility to be removed by keeping track of the
      dirty regions separately - but that's incomplete at the moment and
      is also going to be affected by folios, one way or another, since
      it deals with pages"

Link: https://lore.kernel.org/all/510611.1641942444@warthog.procyon.org.uk/
Tested-by: Dominique Martinet <asmadeus@codewreck.org> # 9p
Tested-by: kafs-testing@auristor.com # afs
Tested-by: Jeff Layton <jlayton@kernel.org> # ceph
Tested-by: Dave Wysochanski <dwysocha@redhat.com> # nfs
Tested-by: Daire Byrne <daire@dneg.com> # nfs

* tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (67 commits)
  9p, afs, ceph, nfs: Use current_is_kswapd() rather than gfpflags_allow_blocking()
  fscache: Add a tracepoint for cookie use/unuse
  fscache: Rewrite documentation
  ceph: add fscache writeback support
  ceph: conversion to new fscache API
  nfs: Implement cache I/O by accessing the cache directly
  nfs: Convert to new fscache volume/cookie API
  9p: Copy local writes to the cache when writing to the server
  9p: Use fscache indexing rewrite and reenable caching
  afs: Skip truncation on the server of data we haven't written yet
  afs: Copy local writes to the cache when writing to the server
  afs: Convert afs to use the new fscache API
  fscache, cachefiles: Display stat of culling events
  fscache, cachefiles: Display stats of no-space events
  cachefiles: Allow cachefiles to actually function
  fscache, cachefiles: Store the volume coherency data
  cachefiles: Implement the I/O routines
  cachefiles: Implement cookie resize for truncate
  cachefiles: Implement begin and end I/O operation
  cachefiles: Implement backing file wrangling
  ...
2022-01-12 13:45:12 -08:00
David Howells 08276bdae6 vfs, fscache: Implement pinning of cache usage for writeback
Cachefiles has a problem in that it needs to keep the backing file for a
cookie open whilst there are local modifications pending that need to be
written to it.  However, we don't want to keep the file open indefinitely,
as that causes EMFILE/ENFILE/ENOMEM problems.

Reopening the cache file, however, is a problem if this is being done due
to writeback triggered by exit().  Some filesystems will oops if we try to
open a file in that context because they want to access current->fs or
other resources that have already been dismantled.

To get around this, I added the following:

 (1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem
     inode to indicate that we have a usage count on the cookie caching
     that inode.

 (2) A flag in struct writeback_control, unpinned_fscache_wb, that is set
     when __writeback_single_inode() clears the last dirty page from
     i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this
     flag.

     This has to be done here so that clearing I_PINNING_FSCACHE_WB can be
     done atomically with the check of PAGECACHE_TAG_DIRTY that clears
     I_DIRTY_PAGES.

 (3) A function, fscache_set_page_dirty(), which if it is not set, sets
     I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache
     resources.

 (4) A function, fscache_unpin_writeback(), to be called by ->write_inode()
     to unuse the cookie.

 (5) A function, fscache_clear_inode_writeback(), to be called when the
     inode is evicted, before clear_inode() is called.  This cleans up any
     lingering I_PINNING_FSCACHE_WB.

The network filesystem can then use these tools to make sure that
fscache_write_to_cache() can write locally modified data to the cache as
well as to the server.

For the future, I'm working on write helpers for netfs lib that should
allow this facility to be removed by keeping track of the dirty regions
separately - but that's incomplete at the moment and is also going to be
affected by folios, one way or another, since it deals with pages

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819615157.215744.17623791756928043114.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906917856.143852.8224898306177154573.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967124567.1823006.14188359004568060298.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021524705.640689.17824932021727663017.stgit@warthog.procyon.org.uk/ # v4
2022-01-07 09:22:19 +00:00
Matthew Wilcox (Oracle) 22b3c8d661 fs/writeback: Convert inode_switch_wbs_work_fn to folios
This gets the statistics correct for large folios by modifying the
counters by the number of pages in the folio instead of by 1.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2022-01-02 20:28:49 -05:00
Linus Torvalds bf953917be Various hardening fixes and cleanups for 5.16-rc1
Hi Linus,
 
 Please, pull the following hardening fixes and cleanups that I've
 been collecting during the last development cycle. All of them have
 been baking in linux-next.
 
 Fix -Wcast-function-type error:
 
 - firewire: Remove function callback casts (Oscar Carter)
 
 Fix application of sizeof operator:
 
 - firmware/psci: fix application of sizeof to pointer (jing yangyang)
 
 Replace open coded instances with size_t saturating arithmetic helpers:
 
 - assoc_array: Avoid open coded arithmetic in allocator arguments (Len Baker)
 - writeback: prefer struct_size over open coded arithmetic (Len Baker)
 - aio: Prefer struct_size over open coded arithmetic (Len Baker)
 - dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic (Len Baker)
 
 Flexible array transformation:
 
 - KVM: PPC: Replace zero-length array with flexible array member (Len Baker)
 
 Use 2-factor argument multiplication form:
 
 - nouveau/svm: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva)
 - xfs: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva)
 
 Thanks
 --
 Gustavo
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAmGAWzsACgkQRwW0y0cG
 2zF55A/+PTBZKg0XLQkPZ7HFipobeZpfvM0dU4JutwN6Kts1RmMRftPn6ootY18v
 4tWR4jXcnblvEr7UTgYAl6QQdytFXZKOK+JKMWV8LXLqyNGF6sS2PmA6zk/iQoa5
 1q0IKUaaqLIXwmm3xoz+/uNHsb+kfjYOHZpHA6HhYZQFDyShW7+hhIeS1NauJo2X
 op3IWasMumrawPkCJZ0ZJJQLELtZNGt4gHnOjB1MAYhOTAokowgeeDNtyfoJ9j1L
 iL8kimphVLI35H/GERBozmqdqRGIIZLlQF4P66VfNNEXSDoKOemAKDSFrfmYoVwE
 kdh6fqeKPV/aRImrCtNthfpiEjqEpm8afQGMC5H5uPnZontUX9tcU1Qagg0vwYx0
 fLZ8mMuNQK5AZfugK+1+2ShfBYUlhvWRhQdtjC9nIAoO80NqouWB7QD0zIHC2WV7
 durdlhzxik70ISnXqKmTR6bQNcXB6kFLPR30RpcA3E6+AgwlkP0FmaD3e+sDttJ0
 vtxDMHqMMNNzOWlLW2eqEdKMEfoU0gLyRt5iM7EN6R8HUXwup5f9bu7V4LuCnR6y
 FAX4tEa8b5wg01zNfyWClCccU6tetSeXjdrhdIk7szQVsOsYXc4zxDrp6xvqsAh2
 B7GbGk5qeUzM/O7QWNIl+5s/NhUjEzQ3QiQebRDdjVyINU2OKsI=
 =Jk0U
 -----END PGP SIGNATURE-----

Merge tag 'kspp-misc-fixes-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux

Pull hardening fixes and cleanups from Gustavo A. R. Silva:
 "Various hardening fixes and cleanups that I've been collecting during
  the last development cycle:

  Fix -Wcast-function-type error:

   - firewire: Remove function callback casts (Oscar Carter)

  Fix application of sizeof operator:

   - firmware/psci: fix application of sizeof to pointer (jing yangyang)

  Replace open coded instances with size_t saturating arithmetic
  helpers:

   - assoc_array: Avoid open coded arithmetic in allocator arguments
     (Len Baker)

   - writeback: prefer struct_size over open coded arithmetic (Len
     Baker)

   - aio: Prefer struct_size over open coded arithmetic (Len Baker)

   - dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic
     (Len Baker)

  Flexible array transformation:

   - KVM: PPC: Replace zero-length array with flexible array member (Len
     Baker)

  Use 2-factor argument multiplication form:

   - nouveau/svm: Use kvcalloc() instead of kvzalloc() (Gustavo A. R.
     Silva)

   - xfs: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva)"

* tag 'kspp-misc-fixes-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  firewire: Remove function callback casts
  nouveau/svm: Use kvcalloc() instead of kvzalloc()
  firmware/psci: fix application of sizeof to pointer
  dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic
  KVM: PPC: Replace zero-length array with flexible array member
  aio: Prefer struct_size over open coded arithmetic
  writeback: prefer struct_size over open coded arithmetic
  xfs: Use kvcalloc() instead of kvzalloc()
  assoc_array: Avoid open coded arithmetic in allocator arguments
2021-11-01 17:29:10 -07:00
Len Baker 98b160c828 writeback: prefer struct_size over open coded arithmetic
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting. Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

In this case these are not actually dynamic sizes: all the operands
involved in the calculation are constant values. However it is better to
refactor them anyway, just to keep the open-coded math idiom out of
code.

So, use the struct_size() helper to do the arithmetic instead of the
argument "size + count * size" in the kzalloc() functions.

This code was detected with the help of Coccinelle and audited and fixed
manually.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments

Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-20 18:20:28 -05:00
Christoph Hellwig 008f75a20e block: cleanup the flush plug helpers
Consolidate the various helpers into a single blk_flush_plug helper that
takes a plk_plug and the from_scheduler bool and switch all callsites to
call it directly.  Checks that the plug is non-NULL must be performed by
the caller, something that most already do anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 09:56:11 -06:00
Linus Torvalds 14726903c8 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "173 patches.

  Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
  pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
  bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
  hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
  oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...
2021-09-03 10:08:28 -07:00
Shakeel Butt 7490a2d248 writeback: memcg: simplify cgroup_writeback_by_id
Currently cgroup_writeback_by_id calls mem_cgroup_wb_stats() to get dirty
pages for a memcg.  However mem_cgroup_wb_stats() does a lot more than
just get the number of dirty pages.  Just directly get the number of dirty
pages instead of calling mem_cgroup_wb_stats().  Also
cgroup_writeback_by_id() is only called for best-effort dirty flushing, so
remove the unused 'nr' parameter and no need to explicitly flush memcg
stats.

Link: https://lkml.kernel.org/r/20210722182627.2267368-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:10 -07:00
Jan Kara fee468fdf4 writeback: reliably update bandwidth estimation
Currently we trigger writeback bandwidth estimation from
balance_dirty_pages() and from wb_writeback().  However neither of these
need to trigger when the system is relatively idle and writeback is
triggered e.g.  from fsync(2).  Make sure writeback estimates happen
reliably by triggering them from do_writepages().

Link: https://lkml.kernel.org/r/20210713104716.22868-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:10 -07:00
Jan Kara 633a2abb9e writeback: track number of inodes under writeback
Patch series "writeback: Fix bandwidth estimates", v4.

Fix estimate of writeback throughput when device is not fully busy doing
writeback.  Michael Stapelberg has reported that such workload (e.g.
generated by linking) tends to push estimated throughput down to 0 and as
a result writeback on the device is practically stalled.

The first three patches fix the reported issue, the remaining two patches
are unrelated cleanups of problems I've noticed when reading the code.

This patch (of 4):

Track number of inodes under writeback for each bdi_writeback structure.
We will use this to decide whether wb does any IO and so we can estimate
its writeback throughput.  In principle we could use number of pages under
writeback (WB_WRITEBACK counter) for this however normal percpu counter
reads are too inaccurate for our purposes and summing the counter is too
expensive.

Link: https://lkml.kernel.org/r/20210713104519.16394-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20210713104716.22868-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Michael Stapelberg <stapelberg+linux@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:10 -07:00
Josef Bacik 5662c967c6 fs: kill sync_inode
Now that all users of sync_inode() have been deleted, remove
sync_inode().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:07 +02:00
Roman Gushchin 593311e85b writeback, cgroup: do not reparent dax inodes
The inode switching code is not suited for dax inodes.  An attempt to
switch a dax inode to a parent writeback structure (as a part of a
writeback cleanup procedure) results in a panic like this:

  run fstests generic/270 at 2021-07-15 05:54:02
  XFS (pmem0p2): EXPERIMENTAL big timestamp feature in use.  Use at your own risk!
  XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
  XFS (pmem0p2): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
  XFS (pmem0p2): Mounting V5 Filesystem
  XFS (pmem0p2): Ending clean mount
  XFS (pmem0p2): Quotacheck needed: Please wait.
  XFS (pmem0p2): Quotacheck: Done.
  XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
  XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
  XFS (pmem0p2): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
  BUG: unable to handle page fault for address: 0000000005b0f669
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 13 PID: 10479 Comm: kworker/13:16 Not tainted 5.14.0-rc1-master-8096acd7442e+ #8
  Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 09/13/2016
  Workqueue: inode_switch_wbs inode_switch_wbs_work_fn
  RIP: 0010:inode_do_switch_wbs+0xaf/0x470
  Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
  RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
  RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
  RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
  RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
  R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
  R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
  FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
  Call Trace:
   inode_switch_wbs_work_fn+0xb6/0x2a0
   process_one_work+0x1e6/0x380
   worker_thread+0x53/0x3d0
   kthread+0x10f/0x130
   ret_from_fork+0x22/0x30
  Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm mgag200 i2c_algo_bit iTCO_wdt irqbypass drm_kms_helper iTCO_vendor_support acpi_ipmi rapl syscopyarea sysfillrect intel_cstate ipmi_si sysimgblt ioatdma dax_pmem_compat fb_sys_fops ipmi_devintf device_dax i2c_i801 pcspkr intel_uncore hpilo nd_pmem cec dax_pmem_core dca i2c_smbus acpi_tad lpc_ich ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sd_mod t10_pi crct10dif_pclmul crc32_pclmul crc32c_intel tg3 ghash_clmulni_intel serio_raw hpsa hpwdt scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod
  CR2: 0000000005b0f669
  ---[ end trace ed2105faff8384f3 ]---
  RIP: 0010:inode_do_switch_wbs+0xaf/0x470
  Code: 00 30 0f 85 c1 03 00 00 0f 1f 44 00 00 31 d2 48 c7 c6 ff ff ff ff 48 8d 7c 24 08 e8 eb 49 1a 00 48 85 c0 74 4a bb ff ff ff ff <48> 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 48 8b 00 a8 08 0f 85
  RSP: 0018:ffff9c66691abdc8 EFLAGS: 00010002
  RAX: 0000000005b0f661 RBX: 00000000ffffffff RCX: ffff89e6a21382b0
  RDX: 0000000000000001 RSI: ffff89e350230248 RDI: ffffffffffffffff
  RBP: ffff89e681d19400 R08: 0000000000000000 R09: 0000000000000228
  R10: ffffffffffffffff R11: ffffffffffffffc0 R12: ffff89e6a2138130
  R13: ffff89e316af7400 R14: ffff89e316af6e78 R15: ffff89e6a21382b0
  FS:  0000000000000000(0000) GS:ffff89ee5fb40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000005b0f669 CR3: 0000000cb2410004 CR4: 00000000001706e0
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: 0x15200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
  ---[ end Kernel panic - not syncing: Fatal exception ]---

The crash happens on an attempt to iterate over attached pagecache pages
and check the dirty flag: a dax inode's xarray contains pfn's instead of
generic struct page pointers.

This happens for DAX and not for other kinds of non-page entries in the
inodes because it's a tagged iteration, and shadow/swap entries are
never tagged; only DAX entries get tagged.

Fix the problem by bailing out (with the false return value) of
inode_prepare_sbs_switch() if a dax inode is passed.

[willy@infradead.org: changelog addition]

Link: https://lkml.kernel.org/r/20210719171350.3876830-1-guro@fb.com
Fixes: c22d70a162 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Reported-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Murphy Zhou <jencce.kernel@gmail.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-23 17:43:28 -07:00
Linus Torvalds 911a2997a5 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmDcl7AACgkQnJ2qBz9k
 QNnsBQf+LBAPsfykQ/f8EdHErO1lfbVTmwf2g/JzTkjrIVZTZ6Ic47aCIiFxgHU2
 Js9ufaPxpsbbopzpn2PAoCUzxNsZDqgXtnC03MOUAqoSFbAvgLHz2sQwjqeYJUGQ
 P6n7VipEA/qBVpQI5zeCUhHYcahoNrRjSLzaFnE2Z8CrQYQ6Ry9gVEhduvu2OTru
 62cWlAWlTJfx/FcR1Y0F/ZznnNSKMiAHcEe3F6Beztplg2ooq+z6FclJYrkmnxMq
 SXSOsqTCdi1/oFx36NpvLkykrIS9I7N/iqCnKwbm6X+nyZZKyAwYZhWVqkbozPPu
 +u1Ppq8o0IuWwEA6/UAmxgAO3m/Gkw==
 =tn0h
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull misc fs updates from Jan Kara:
 "The new quotactl_fd() syscall (remake of quotactl_path() syscall that
  got introduced & disabled in 5.13 cycle), and couple of udf, reiserfs,
  isofs, and writeback fixes and cleanups"

* tag 'fs_for_v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  writeback: fix obtain a reference to a freeing memcg css
  quota: remove unnecessary oom message
  isofs: remove redundant continue statement
  quota: Wire up quotactl_fd syscall
  quota: Change quotactl_path() systcall to an fd-based one
  reiserfs: Remove unneed check in reiserfs_write_full_page()
  udf: Fix NULL pointer dereference in udf_symlink function
  reiserfs: add check for invalid 1st journal block
2021-07-01 12:06:39 -07:00
Linus Torvalds df668a5fe4 for-5.14/block-2021-06-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbXAwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr0HEADDJaSgjpnWQwH1RVLNagJa9KnktxZYsEs+
 as3QmDdpKRG3rEC9bdE7FLe/xq3WBaO5j1hTQ9P6IguqLyS1Df72DtTlKyaCrZoe
 zv9eIlY4lZUfksE2nzWmlN9uG0FBVXeEQpHCLSNbUZeK1zvV6+NNhQqw2kc0sEqu
 hReUFeMUbsMcu/w5T3XMVJNsTMCql9wta2H0q5hONQyJQSrIwa1D+sUdE5I8fO4j
 bnoYX9yxHX26EztX1UJiGRgoq5Trz7LY7hAfljKSkewpFwiHE2vBdq2L0C2RKsIV
 tTs2DjMCMQyPNeA7WAG8HlR4aPG+7+/fuBP1KJHkykjWXglWN7OqISuBv6rrBgQs
 gNRnZ4qmb1CzD6aLEBk59nHt6po6eMxXIW856YktKy8rKcrgK29qP44Z+oomkPKo
 ZjQ0wqN5CvpObM/dIKxl9bAJ4zQDHBt49d5nTTQLfWl/mgevu6ZNWD/hONyCQmFy
 zKKqQ/wkxWHutOsjC5/MKNb3ZRNH9tt9X+HfULO2DU6IqqifYw/ex4z4MVsBopJC
 7pPfd81kgC73TgXe1AaCwHqNWsrqYCuTK0ew1CtGudlS3lucMwtap4GBiCgg5gbu
 M8pEgwO4OcCLHyRUc8zdfqI7HumbprbFmojPkwGSEe0ofVD74lMhzbUj5jvTYY2B
 t8D2XcgyOA==
 =lhon
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

 - disk events cleanup (Christoph)

 - gendisk and request queue allocation simplifications (Christoph)

 - bdev_disk_changed cleanups (Christoph)

 - IO priority improvements (Bart)

 - Chained bio completion trace fix (Edward)

 - blk-wbt fixes (Jan)

 - blk-wbt enable/disable fix (Zhang)

 - Scheduler dispatch improvements (Jan, Ming)

 - Shared tagset scheduler improvements (John)

 - BFQ updates (Paolo, Luca, Pietro)

 - BFQ lock inversion fix (Jan)

 - Documentation improvements (Kir)

 - CLONE_IO block cgroup fix (Tejun)

 - Remove of ancient and deprecated block dump feature (zhangyi)

 - Discard merge fix (Ming)

 - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
   Yang)

* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
  block: fix discard request merge
  block/mq-deadline: Remove a WARN_ON_ONCE() call
  blk-mq: update hctx->dispatch_busy in case of real scheduler
  blk: Fix lock inversion between ioc lock and bfqd lock
  bfq: Remove merged request already in bfq_requests_merged()
  block: pass a gendisk to bdev_disk_changed
  block: move bdev_disk_changed
  block: add the events* attributes to disk_attrs
  block: move the disk events code to a separate file
  block: fix trace completion for chained bio
  block/partitions/msdos: Fix typo inidicator -> indicator
  block, bfq: reset waker pointer with shared queues
  block, bfq: check waker only for queues with no in-flight I/O
  block, bfq: avoid delayed merge of async queues
  block, bfq: boost throughput by extending queue-merging times
  block, bfq: consider also creation time in delayed stable merge
  block, bfq: fix delayed stable merge check
  block, bfq: let also stably merged queues enjoy weight raising
  blk-wbt: make sure throttle is enabled properly
  blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
  ...
2021-06-30 12:12:56 -07:00
Roman Gushchin c22d70a162 writeback, cgroup: release dying cgwbs by switching attached inodes
Asynchronously try to release dying cgwbs by switching attached inodes to
the nearest living ancestor wb.  It helps to get rid of per-cgroup
writeback structures themselves and of pinned memory and block cgroups,
which are significantly larger structures (mostly due to large per-cpu
statistics data).  This prevents memory waste and helps to avoid different
scalability problems caused by large piles of dying cgroups.

Reuse the existing mechanism of inode switching used for foreign inode
detection.  To speed things up batch up to 115 inode switching in a single
operation (the maximum number is selected so that the resulting struct
inode_switch_wbs_context can fit into 1024 bytes).  Because every
switching consists of two steps divided by an RCU grace period, it would
be too slow without batching.  Please note that the whole batch counts as
a single operation (when increasing/decreasing isw_nr_in_flight).  This
allows to keep umounting working (flush the switching queue), however
prevents cleanups from consuming the whole switching quota and effectively
blocking the frn switching.

A cgwb cleanup operation can fail due to different reasons (e.g.  not
enough memory, the cgwb has an in-flight/pending io, an attached inode in
a wrong state, etc).  In this case the next scheduled cleanup will make a
new attempt.  An attempt is made each time a new cgwb is offlined (in
other words a memcg and/or a blkcg is deleted by a user).  In the future
an additional attempt scheduled by a timer can be implemented.

[guro@fb.com: replace open-coded "115" with arithmetic]
  Link: https://lkml.kernel.org/r/YMEcSBcq/VXMiPPO@carbon.dhcp.thefacebook.com
[guro@fb.com: add smp_mb() to inode_prepare_wbs_switch()]
  Link: https://lkml.kernel.org/r/YMFa+guFw7OFjf3X@carbon.dhcp.thefacebook.com
[willy@infradead.org: fix documentation]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-2-willy@infradead.org

Link: https://lkml.kernel.org/r/20210608230225.2078447-9-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin f5fbe6b7ad writeback, cgroup: support switching multiple inodes at once
Currently only a single inode can be switched to another writeback
structure at once.  That means to switch an inode a separate
inode_switch_wbs_context structure must be allocated, and a separate rcu
callback and work must be scheduled.

It's fine for the existing ad-hoc switching, which is not happening that
often, but sub-optimal for massive switching required in order to release
a writeback structure.  To prepare for it, let's add a support for
switching multiple inodes at once.

Instead of containing a single inode pointer, inode_switch_wbs_context
will contain a NULL-terminated array of inode pointers.
inode_do_switch_wbs() will be called for each inode.

To optimize the locking bdi->wb_switch_rwsem, old_wb's and new_wb's
list_locks will be acquired and released only once altogether for all
inodes.  wb_wakeup() will be also be called only once.  Instead of calling
wb_put(old_wb) after each successful switch, wb_put_many() is introduced
and used.

Link: https://lkml.kernel.org/r/20210608230225.2078447-8-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin 72d4512e9c writeback, cgroup: split out the functional part of inode_switch_wbs_work_fn()
Split out the functional part of the inode_switch_wbs_work_fn() function
as inode_do switch_wbs() to reuse it later for switching inodes attached
to dying cgwbs.

This commit doesn't bring any functional changes.

Link: https://lkml.kernel.org/r/20210608230225.2078447-7-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin f3b6a6df38 writeback, cgroup: keep list of inodes attached to bdi_writeback
Currently there is no way to iterate over inodes attached to a specific
cgwb structure.  It limits the ability to efficiently reclaim the
writeback structure itself and associated memory and block cgroup
structures without scanning all inodes belonging to a sb, which can be
prohibitively expensive.

While dirty/in-active-writeback an inode belongs to one of the
bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time.  Once
cleaned up, it's removed from all io lists.  So the inode->i_io_list can
be reused to maintain the list of inodes, attached to a bdi_writeback
structure.

This patch introduces a new wb->b_attached list, which contains all inodes
which were dirty at least once and are attached to the given cgwb.  Inodes
attached to the root bdi_writeback structures are never placed on such
list.  The following patch will use this list to try to release cgwbs
structures more efficiently.

Link: https://lkml.kernel.org/r/20210608230225.2078447-6-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin 29264d92a0 writeback, cgroup: switch to rcu_work API in inode_switch_wbs()
Inode's wb switching requires two steps divided by an RCU grace period.
It's currently implemented as an RCU callback inode_switch_wbs_rcu_fn(),
which schedules inode_switch_wbs_work_fn() as a work.

Switching to the rcu_work API allows to do the same in a cleaner and
slightly shorter form.

Link: https://lkml.kernel.org/r/20210608230225.2078447-5-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:48 -07:00
Roman Gushchin 8826ee4fe7 writeback, cgroup: increment isw_nr_in_flight before grabbing an inode
isw_nr_in_flight is used to determine whether the inode switch queue
should be flushed from the umount path.  Currently it's increased after
grabbing an inode and even scheduling the switch work.  It means the
umount path can walk past cleanup_offline_cgwb() with active inode
references, which can result in a "Busy inodes after unmount." message and
use-after-free issues (with inode->i_sb which gets freed).

Fix it by incrementing isw_nr_in_flight before doing anything with the
inode and decrementing in the case when switching wasn't scheduled.

The problem hasn't yet been seen in the real life and was discovered by
Jan Kara by looking into the code.

Link: https://lkml.kernel.org/r/20210608230225.2078447-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:47 -07:00
Roman Gushchin 592fa00218 writeback, cgroup: add smp_mb() to cgroup_writeback_umount()
A full memory barrier is required between clearing SB_ACTIVE flag in
generic_shutdown_super() and checking isw_nr_in_flight in
cgroup_writeback_umount(), otherwise a new switch operation might be
scheduled after atomic_read(&isw_nr_in_flight) returned 0.  This would
result in a non-flushed isw_wq, and a potential crash.

The problem hasn't yet been seen in the real life and was discovered by
Jan Kara by looking into the code.

Link: https://lkml.kernel.org/r/20210608230225.2078447-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:47 -07:00
Roman Gushchin 4ade5867b4 writeback, cgroup: do not switch inodes with I_WILL_FREE flag
Patch series "cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups", v9.

When an inode is getting dirty for the first time it's associated with a
wb structure (see __inode_attach_wb()).  It can later be switched to
another wb (if e.g.  some other cgroup is writing a lot of data to the
same inode), but otherwise stays attached to the original wb until being
reclaimed.

The problem is that the wb structure holds a reference to the original
memory and blkcg cgroups.  So if an inode has been dirty once and later is
actively used in read-only mode, it has a good chance to pin down the
original memory and blkcg cgroups forever.  This is often the case with
services bringing data for other services, e.g.  updating some rpm
packages.

In the real life it becomes a problem due to a large size of the memcg
structure, which can easily be 1000x larger than an inode.  Also a really
large number of dying cgroups can raise different scalability issues, e.g.
making the memory reclaim costly and less effective.

To solve the problem inodes should be eventually detached from the
corresponding writeback structure.  It's inefficient to do it after every
writeback completion.  Instead it can be done whenever the original memory
cgroup is offlined and writeback structure is getting killed.  Scanning
over a (potentially long) list of inodes and detach them from the
writeback structure can take quite some time.  To avoid scanning all
inodes, attached inodes are kept on a new list (b_attached).  To make it
less noticeable to a user, the scanning and switching is performed from a
work context.

Big thanks to Jan Kara, Dennis Zhou, Hillf Danton and Tejun Heo for their
ideas and contribution to this patchset.

This patch (of 8):

If an inode's state has I_WILL_FREE flag set, the inode will be freed
soon, so there is no point in trying to switch the inode to a different
cgwb.

I_WILL_FREE was ignored since the introduction of the inode switching, so
it looks like it doesn't lead to any noticeable issues for a user.  This
is why the patch is not intended for a stable backport.

Link: https://lkml.kernel.org/r/20210608230225.2078447-1-guro@fb.com
Link: https://lkml.kernel.org/r/20210608230225.2078447-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:47 -07:00