linux-stable/fs/nfs
Nhat Pham 0a97c01cd2 list_lru: allow explicit memcg and NUMA node selection
Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
   perform worload-specific shrinking - an memcg under memory pressure
   cannot determine which pages in the pool it owns, and often ends up
   writing pages from other memcgs. This issue has been previously
   observed in practice and mitigated by simply disabling
   memcg-initiated shrinking:

   https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u

   But this solution leaves a lot to be desired, as we still do not
   have an avenue for an memcg to free up its own memory locked up in
   the zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
   This means that if we set the limit too high, cold data that are
   unlikely to be used again will reside in the pool, wasting precious
   memory. It is hard to predict how much zswap space will be needed
   ahead of time, as this depends on the workload (specifically, on
   factors such as memory access patterns and compressibility of the
   memory pages).

This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e
memcg- and NUMA-aware) zswap writeback under memory pressure.  The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.

As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance.  Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.


This patch (of 6):

The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same allocated on the correct
node/memcg.  While this assumption is valid for existing slab objects LRU
such as dentries and inodes, it is undocumented, and rather inflexible for
certain potential list_lru users (such as the upcoming zswap shrinker and
the THP shrinker).  It has caused us a lot of issues during our
development.

This patch changes list_lru interface so that the caller must explicitly
specify numa node and memcg when adding and removing objects.  The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.

It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call.  Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count).  list_lru_putback also allows for explicit
memcg and NUMA node selection.

Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-12 10:57:01 -08:00
..
blocklayout nfs/blocklayout: Convert to use bdev_open_by_dev/path() 2023-10-28 13:29:21 +02:00
filelayout nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by 2023-10-02 09:48:53 -07:00
flexfilelayout hardening updates for v6.7-rc1 2023-10-30 19:09:55 -10:00
cache_lib.c
cache_lib.h
callback.c SUNRPC: change how svc threads are asked to exit. 2023-10-16 12:44:04 -04:00
callback.h
callback_proc.c nfs: convert to new timestamp accessors 2023-10-18 14:08:23 +02:00
callback_xdr.c SUNRPC: Use per-CPU counters to tally server RPC counts 2023-02-20 09:20:32 -05:00
client.c NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver 2023-08-24 13:24:15 -04:00
delegation.c NFSv4: fairly test all delegations on a SEQ4_ revocation 2023-11-01 15:15:52 -04:00
delegation.h NFSv4: fairly test all delegations on a SEQ4_ revocation 2023-11-01 15:15:52 -04:00
dir.c nfs: Convert nfs_symlink() to use a folio 2023-11-01 15:40:44 -04:00
direct.c NFS: More fixes for nfs_direct_write_reschedule_io() 2023-09-13 11:51:11 -04:00
dns_resolve.c NFS: Move common includes outside ifdef 2023-08-24 13:24:15 -04:00
dns_resolve.h NFS: Avoid memcpy() run-time warning for struct sockaddr overflows 2022-10-27 15:52:10 -04:00
export.c nfsd: allow reaping files still under writeback 2023-04-26 09:04:59 -04:00
file.c fs: convert error_remove_page to error_remove_folio 2023-12-10 16:51:42 -08:00
fs_context.c NFS: Add an "xprtsec=" NFS mount option 2023-06-19 12:30:17 -04:00
fscache.c mm, netfs, fscache: stop read optimisation when folio removed from pagecache 2023-08-18 10:12:13 -07:00
fscache.h nfs: convert to new timestamp accessors 2023-10-18 14:08:23 +02:00
getroot.c
inode.c nfs: convert to new timestamp accessors 2023-10-18 14:08:23 +02:00
internal.h NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver 2023-08-24 13:24:15 -04:00
io.c
iostat.h NFS: Remove all NFSIOS_FSCACHE counters due to conversion to netfs API 2023-04-11 13:08:26 -04:00
Kconfig nfs41: drop dependency between flexfiles layout driver and NFSv3 modules 2023-11-01 15:39:46 -04:00
Makefile nfs: Convert to new fscache volume/cookie API 2022-01-10 11:53:25 +00:00
mount_clnt.c NFS: Avoid memcpy() run-time warning for struct sockaddr overflows 2022-10-27 15:52:10 -04:00
namespace.c fs: pass the request_mask to generic_fillattr 2023-08-09 08:56:36 +02:00
netns.h
nfs.h nfs: move nfs4_xattr_handlers to .rodata 2023-10-09 16:24:20 +02:00
nfs2super.c
nfs2xdr.c NFS: Guard against READDIR loop when entry names exceed MAXNAMELEN 2023-08-30 11:08:27 -04:00
nfs3_fs.h fs: drop unused posix acl handlers 2023-03-06 09:57:12 +01:00
nfs3acl.c Mainly singleton patches all over the place. Series of note are: 2023-04-27 19:57:00 -07:00
nfs3client.c NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver 2023-08-24 13:24:15 -04:00
nfs3proc.c nfs: Convert nfs_symlink() to use a folio 2023-11-01 15:40:44 -04:00
nfs3super.c fs: drop unused posix acl handlers 2023-03-06 09:57:12 +01:00
nfs3xdr.c NFS: Guard against READDIR loop when entry names exceed MAXNAMELEN 2023-08-30 11:08:27 -04:00
nfs4_fs.h NFS client updates for Linux 6.7 2023-11-08 13:39:16 -08:00
nfs4client.c NFSv4.1: fix pnfs MDS=DS session trunking 2023-09-13 11:51:11 -04:00
nfs4file.c fs: Pass argument to fcntl_setlease as int 2023-07-10 14:36:11 +02:00
nfs4getroot.c
nfs4idmap.c cred: Do not default to init_cred in prepare_kernel_cred() 2022-11-01 10:04:52 -07:00
nfs4idmap.h
nfs4namespace.c NFS: Avoid memcpy() run-time warning for struct sockaddr overflows 2022-10-27 15:52:10 -04:00
nfs4proc.c NFS client updates for Linux 6.7 2023-11-08 13:39:16 -08:00
nfs4renewd.c
nfs4session.c
nfs4session.h
nfs4state.c NFSv4: Fix a state manager thread deadlock regression 2023-09-27 15:16:40 -04:00
nfs4super.c
nfs4sysctl.c nfs: simplify two-level sysctl registration for nfs4_cb_sysctls 2023-04-13 11:49:35 -07:00
nfs4trace.c
nfs4trace.h nfs4trace: fix state manager flag printing 2023-02-14 15:43:57 -05:00
nfs4xdr.c NFSv4.2: Fix a memory stomp in decode_attr_security_label 2022-11-27 22:09:59 -05:00
nfs42.h NFSv4.2: Rework scratch handling for READ_PLUS (again) 2023-08-23 15:58:47 -04:00
nfs42proc.c nfs42: client needs to strip file mode's suid/sgid bit after ALLOCATE op 2023-10-11 09:37:48 -04:00
nfs42xattr.c list_lru: allow explicit memcg and NUMA node selection 2023-12-12 10:57:01 -08:00
nfs42xdr.c NFSv4.2: Rework scratch handling for READ_PLUS (again) 2023-08-23 15:58:47 -04:00
nfsroot.c NFS: Prefer strscpy over strlcpy calls 2023-05-22 12:34:41 -07:00
nfstrace.c
nfstrace.h NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit 2023-04-11 13:08:27 -04:00
pagelist.c NFS: Convert buffered read paths to use netfs when fscache is enabled 2023-04-11 13:08:26 -04:00
pnfs.c NFSv4/pnfs: Allow layoutget to return EAGAIN for softerr mounts 2023-10-22 19:47:56 -04:00
pnfs.h NFSv4/pnfs: Allow layoutget to return EAGAIN for softerr mounts 2023-10-22 19:47:56 -04:00
pnfs_dev.c NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info 2023-08-24 13:24:15 -04:00
pnfs_nfs.c pNFS: Fix assignment of xprtdata.cred 2023-08-30 14:31:31 -04:00
proc.c nfs: Convert nfs_symlink() to use a folio 2023-11-01 15:40:44 -04:00
read.c NFSv4.2: Rework scratch handling for READ_PLUS (again) 2023-08-23 15:58:47 -04:00
super.c NFS client updates for Linux 6.7 2023-11-08 13:39:16 -08:00
symlink.c fs: Change the type of filler_t 2022-05-09 16:36:48 -04:00
sysctl.c nfs: simplify two-level sysctl registration for nfs_cb_sysctls 2023-04-13 11:49:35 -07:00
sysfs.c NFS: Fix sysfs server name memory leak 2023-08-19 10:26:29 -04:00
sysfs.h NFS: Add sysfs links to sunrpc clients for nfs_clients 2023-06-19 15:04:13 -04:00
unlink.c NFS: Fix a race in nfs_call_unlink() 2022-11-27 22:10:00 -05:00
write.c NFSv4/pnfs: Allow layoutget to return EAGAIN for softerr mounts 2023-10-22 19:47:56 -04:00