linux-stable/fs/xfs
Nhat Pham 0a97c01cd2 list_lru: allow explicit memcg and NUMA node selection
Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
   perform worload-specific shrinking - an memcg under memory pressure
   cannot determine which pages in the pool it owns, and often ends up
   writing pages from other memcgs. This issue has been previously
   observed in practice and mitigated by simply disabling
   memcg-initiated shrinking:

   https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

   But this solution leaves a lot to be desired, as we still do not
   have an avenue for an memcg to free up its own memory locked up in
   the zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
   This means that if we set the limit too high, cold data that are
   unlikely to be used again will reside in the pool, wasting precious
   memory. It is hard to predict how much zswap space will be needed
   ahead of time, as this depends on the workload (specifically, on
   factors such as memory access patterns and compressibility of the
   memory pages).

This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e
memcg- and NUMA-aware) zswap writeback under memory pressure.  The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.

As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance.  Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.


This patch (of 6):

The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same allocated on the correct
node/memcg.  While this assumption is valid for existing slab objects LRU
such as dentries and inodes, it is undocumented, and rather inflexible for
certain potential list_lru users (such as the upcoming zswap shrinker and
the THP shrinker).  It has caused us a lot of issues during our
development.

This patch changes list_lru interface so that the caller must explicitly
specify numa node and memcg when adding and removing objects.  The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.

It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call.  Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count).  list_lru_putback also allows for explicit
memcg and NUMA node selection.

Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12 10:57:01 -08:00
..
libxfs xfs: inode recovery does not validate the recovered inode 2023-11-13 09:11:41 +05:30
scrub xfs: simplify rt bitmap/summary block accessor functions 2023-10-19 08:33:42 -07:00
Kconfig xfs: fix again select in kconfig XFS_ONLINE_SCRUB_STATS 2023-11-13 09:11:41 +05:30
kmem.c
kmem.h
Makefile xfs: move the realtime summary file scrubber to a separate source file 2023-08-10 07:48:09 -07:00
mrlock.h
xfs.h
xfs_acl.c xfs: convert to ctime accessor functions 2023-07-24 10:30:06 +02:00
xfs_acl.h fs: port ->set_acl() to pass mnt_idmap 2023-01-19 09:24:27 +01:00
xfs_aops.c fs: convert error_remove_page to error_remove_folio 2023-12-10 16:51:42 -08:00
xfs_aops.h
xfs_attr_inactive.c xfs: make inode unlinked bucket recovery work with quotacheck 2023-09-12 10:31:07 -07:00
xfs_attr_item.c xfs: reserve less log space when recovering log intent items 2023-09-12 10:31:07 -07:00
xfs_attr_item.h xfs: share xattr name and value buffers when logging xattr updates 2022-05-23 08:43:46 +10:00
xfs_attr_list.c xfs: use XFS_IFORK_Q to determine the presence of an xattr fork 2022-07-09 15:17:21 -07:00
xfs_bio_io.c fs/xfs: Use the enum req_op and blk_opf_t types 2022-07-14 12:14:33 -06:00
xfs_bmap_item.c xfs: reserve less log space when recovering log intent items 2023-09-12 10:31:07 -07:00
xfs_bmap_item.h
xfs_bmap_util.c New code for 6.7: 2023-11-08 13:22:16 -08:00
xfs_bmap_util.h xfs: xfs_bmap_punch_delalloc_range() should take a byte range 2022-11-29 09:09:17 +11:00
xfs_buf.c list_lru: allow explicit memcg and NUMA node selection 2023-12-12 10:57:01 -08:00
xfs_buf.h Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
xfs_buf_item.c xfs: buffer pins need to hold a buffer reference 2023-06-05 04:05:27 +10:00
xfs_buf_item.h xfs: convert buffer log item flags to unsigned. 2022-04-21 10:46:40 +10:00
xfs_buf_item_recover.c xfs: verify buffer contents when we skip log replay 2023-04-12 15:49:23 +10:00
xfs_dahash_test.c xfs: test the ascii case-insensitive hash 2023-04-11 19:05:05 -07:00
xfs_dahash_test.h xfs: test dir/attr hash when loading module 2023-03-19 09:55:49 -07:00
xfs_dir2_readdir.c xfs: rearrange the logic and remove the broken comment for xfs_dir2_isxx 2022-10-04 16:39:58 +11:00
xfs_discard.c xfs: abort fstrim if kernel is suspending 2023-10-04 09:25:04 +11:00
xfs_discard.h xfs: move log discard work to xfs_discard.c 2023-10-04 09:24:02 +11:00
xfs_dquot.c list_lru: allow explicit memcg and NUMA node selection 2023-12-12 10:57:01 -08:00
xfs_dquot.h xfs: remove warning counters from struct xfs_dquot_res 2022-05-11 17:12:09 +10:00
xfs_dquot_item.c
xfs_dquot_item.h
xfs_dquot_item_recover.c xfs: dquot recovery does not validate the recovered dquot 2023-11-22 23:39:36 +05:30
xfs_drain.c xfs: minimize overhead of drain wakeups by using jump labels 2023-04-11 18:59:59 -07:00
xfs_drain.h xfs: minimize overhead of drain wakeups by using jump labels 2023-04-11 18:59:59 -07:00
xfs_error.c xfs: make kobj_type structures constant 2023-02-10 08:59:48 -08:00
xfs_error.h xfs: allow setting full range of panic tags 2023-02-09 18:36:17 -08:00
xfs_export.c xfs: fix reloading entire unlinked bucket lists 2023-09-24 18:12:13 -07:00
xfs_export.h
xfs_extent_busy.c xfs: process free extents to busy list in FIFO order 2023-10-11 12:35:21 -07:00
xfs_extent_busy.h xfs: reduce AGF hold times during fstrim operations 2023-10-04 09:24:52 +11:00
xfs_extfree_item.c xfs: reserve less log space when recovering log intent items 2023-09-12 10:31:07 -07:00
xfs_extfree_item.h xfs: refactor all the EFI/EFD log item sizeof logic 2022-10-31 08:58:20 -07:00
xfs_file.c xfs: allow read IO and FICLONE to run concurrently 2023-10-23 12:02:26 +05:30
xfs_filestream.c xfs: fix double xfs_perag_rele() in xfs_filestream_pick_ag() 2023-06-05 14:48:15 +10:00
xfs_filestream.h xfs: pass perag to filestreams tracing 2023-02-13 09:14:56 +11:00
xfs_fsmap.c xfs: convert do_div calls to xfs_rtb_to_rtx helper calls 2023-10-17 16:25:55 -07:00
xfs_fsmap.h
xfs_fsops.c Minor cleanups for 6.5: 2023-07-09 09:50:42 -07:00
xfs_fsops.h
xfs_globals.c xfs: allow setting full range of panic tags 2023-02-09 18:36:17 -08:00
xfs_health.c
xfs_icache.c xfs: dynamically allocate the xfs-inodegc shrinker 2023-10-04 10:32:26 -07:00
xfs_icache.h xfs: use per-mount cpumask to track nonempty percpu inodegc lists 2023-09-11 08:39:03 -07:00
xfs_icreate_item.c xfs: fix potential log item leak 2022-05-04 11:45:11 +10:00
xfs_icreate_item.h
xfs_inode.c New code for 6.7: 2023-11-08 13:22:16 -08:00
xfs_inode.h xfs: respect the stable writes flag on the RT device 2023-11-20 15:05:19 +01:00
xfs_inode_item.c New code for 6.7: 2023-11-08 13:22:16 -08:00
xfs_inode_item.h xfs: fix AGF vs inode cluster buffer deadlock 2023-06-05 04:08:27 +10:00
xfs_inode_item_recover.c xfs: recovery should not clear di_flushiter unconditionally 2023-11-13 09:11:41 +05:30
xfs_ioctl.c xfs: respect the stable writes flag on the RT device 2023-11-20 15:05:19 +01:00
xfs_ioctl.h fs: port ->fileattr_set() to pass mnt_idmap 2023-01-19 09:24:27 +01:00
xfs_ioctl32.c fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap 2023-01-19 09:24:29 +01:00
xfs_ioctl32.h arch: Remove Itanium (IA-64) architecture 2023-09-11 08:13:17 +00:00
xfs_iomap.c xfs: don't allocate into the data fork for an unshare request 2023-05-02 09:14:51 +10:00
xfs_iomap.h xfs: use iomap_valid method to detect stale cached iomaps 2022-11-29 09:09:17 +11:00
xfs_iops.c xfs: respect the stable writes flag on the RT device 2023-11-20 15:05:19 +01:00
xfs_iops.h fs: port ->setattr() to pass mnt_idmap 2023-01-19 09:24:02 +01:00
xfs_itable.c xfs: convert to new timestamp accessors 2023-10-18 14:08:29 +02:00
xfs_itable.h fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap 2023-01-19 09:24:29 +01:00
xfs_iunlink_item.c xfs: create traced helper to get extra perag references 2023-04-11 18:59:55 -07:00
xfs_iunlink_item.h xfs: add in-memory iunlink log item 2022-07-14 11:47:42 +10:00
xfs_iwalk.c xfs: create traced helper to get extra perag references 2023-04-11 18:59:55 -07:00
xfs_iwalk.h xfs: Decouple XFS_IBULK flags from XFS_IWALK flags 2022-04-13 07:02:44 +00:00
xfs_linux.h xfs: use shifting and masking when converting rt extents, if possible 2023-10-17 16:26:25 -07:00
xfs_log.c xfs: up(ic_sema) if flushing data device fails 2023-11-13 09:11:40 +05:30
xfs_log.h xfs: move CIL ordering to the logvec chain 2022-07-07 18:56:08 +10:00
xfs_log_cil.c xfs: move log discard work to xfs_discard.c 2023-10-04 09:24:02 +11:00
xfs_log_priv.h xfs: move log discard work to xfs_discard.c 2023-10-04 09:24:02 +11:00
xfs_log_recover.c xfs: abort intent items when recovery intents fail 2023-11-13 09:08:34 +05:30
xfs_message.c Merge branch 'guilt/xfs-unsigned-flags-5.18' into xfs-5.19-for-next 2022-04-21 16:45:03 +10:00
xfs_message.h xfs: implement per-mount warnings for scrub and shrink usage 2022-05-27 10:31:34 +10:00
xfs_mount.c xfs: dynamically allocate the xfs-inodegc shrinker 2023-10-04 10:32:26 -07:00
xfs_mount.h New code for 6.7: 2023-11-08 13:22:16 -08:00
xfs_mru_cache.c
xfs_mru_cache.h
xfs_notify_failure.c xfs: correct calculation for agend and blockcount 2023-10-12 10:11:56 +05:30
xfs_ondisk.h xfs: use accessor functions for summary info words 2023-10-18 16:53:00 -07:00
xfs_pnfs.c fs: port ->setattr() to pass mnt_idmap 2023-01-19 09:24:02 +01:00
xfs_pnfs.h
xfs_pwork.c
xfs_pwork.h
xfs_qm.c list_lru: allow explicit memcg and NUMA node selection 2023-12-12 10:57:01 -08:00
xfs_qm.h xfs: dynamically allocate the xfs-qm shrinker 2023-10-04 10:32:26 -07:00
xfs_qm_bhv.c
xfs_qm_syscalls.c xfs: introduce xfs_inodegc_push() 2022-06-23 13:34:38 -07:00
xfs_quota.h
xfs_quotaops.c xfs: don't set quota warning values 2022-05-11 17:12:09 +10:00
xfs_refcount_item.c xfs: reserve less log space when recovering log intent items 2023-09-12 10:31:07 -07:00
xfs_refcount_item.h
xfs_reflink.c xfs: only remap the written blocks in xfs_reflink_end_cow_extent 2023-11-13 09:09:19 +05:30
xfs_reflink.h xfs: pass perag to xfs_alloc_read_agf() 2022-07-07 19:07:40 +10:00
xfs_rmap_item.c xfs: reserve less log space when recovering log intent items 2023-09-12 10:31:07 -07:00
xfs_rmap_item.h
xfs_rtalloc.c New code for 6.7: 2023-11-08 13:22:16 -08:00
xfs_rtalloc.h xfs: convert rt extent numbers to xfs_rtxnum_t 2023-10-17 16:24:22 -07:00
xfs_stats.c xfs: replace unnecessary seq_printf with seq_puts 2022-09-19 06:48:14 +10:00
xfs_stats.h
xfs_super.c New code for 6.7: 2023-11-08 13:22:16 -08:00
xfs_super.h xfs: create scaffolding for creating debugfs entries 2023-08-10 07:48:07 -07:00
xfs_symlink.c fs: port fs{g,u}id helpers to mnt_idmap 2023-01-19 09:24:30 +01:00
xfs_symlink.h fs: port inode_init_owner() to mnt_idmap 2023-01-19 09:24:28 +01:00
xfs_sysctl.c xfs: simplify two-level sysctl registration for xfs_table 2023-04-13 11:49:35 -07:00
xfs_sysctl.h xfs: Add larp debug option 2022-05-11 17:01:22 +10:00
xfs_sysfs.c xfs: make kobj_type structures constant 2023-02-10 08:59:48 -08:00
xfs_sysfs.h xfs: make kobj_type structures constant 2023-02-10 08:59:48 -08:00
xfs_trace.c xfs: add debug knob to slow down writeback for fun 2022-11-28 17:24:35 -08:00
xfs_trace.h xfs: load uncached unlinked inodes into memory on demand 2023-09-12 10:31:07 -07:00
xfs_trans.c xfs: use shifting and masking when converting rt extents, if possible 2023-10-17 16:26:25 -07:00
xfs_trans.h xfs: t_firstblock is tracking AGs not blocks 2023-02-11 04:11:06 +11:00
xfs_trans_ail.c xfs: don't reverse order of items in bulk AIL insertion 2023-06-29 09:28:23 -07:00
xfs_trans_buf.c
xfs_trans_dquot.c xfs: remove quota warning limit from struct xfs_quota_limits 2022-05-11 17:12:09 +10:00
xfs_trans_priv.h xfs: convert log vector chain to use list heads 2022-07-07 18:55:59 +10:00
xfs_xattr.c vfs-6.7.xattr 2023-10-30 09:29:44 -10:00
xfs_xattr.h xfs: move xfs_xattr_handlers to .rodata 2023-10-10 13:49:20 +02:00