Go to file
Johannes Weiner 51b8c1fe25 vfs: keep inodes with page cache off the inode shrinker LRU
Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache.  This caused
problems on highmem hosts: struct inode could put fill lowmem zones
before the cache was getting reclaimed in the highmem zones.

To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem.  However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a large
git tree.  It further doesn't respect cgroup memory protection settings
and can cause priority inversions between containers.

Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults.  We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior.  We also use it to quantify and report workload health through
psi.  The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.

The consequences of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.

To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible.  We've tried (commit a76cf1a474
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6a8 ("Revert "mm: don't reclaim inodes with many
attached pages"")).

The issue is mostly that shrinker pools attract pressure based on their
size, and when objects get skipped the shrinkers remember this as
deferred reclaim work.  This accumulates excessive pressure on the
remaining inodes, and we can quickly eat into heavily used ones, or
dirty ones that require IO to reclaim, when there potentially is plenty
of cold, clean cache around still.

Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would.  An otherwise
clean and unused inode then gets queued when the last cache entry
disappears.  This solves the problem without reintroducing the reclaim
issues, and generally is a bit more scalable than having to wade through
potentially hundreds of thousands of busy inodes.

Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably.  Additions may race with
deletions, but we'll check again in the shrinker.  If additions race
with the shrinker itself, we're protected by the i_lock: if find_inode()
or iput() win, the shrinker will bail on the elevated i_count or
I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
will set I_FREEING and inhibit further igets(), which will cause the
other side to create a new instance of the inode instead.

Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09 10:02:48 -08:00
Documentation Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions 2021-11-06 13:30:46 -07:00
LICENSES LICENSES/dual/CC-BY-4.0: Git rid of "smart quotes" 2021-07-15 06:31:24 -06:00
arch mm: remove HARDENED_USERCOPY_FALLBACK 2021-11-06 13:30:43 -07:00
block block-5.15-2021-10-29 2021-10-29 11:10:29 -07:00
certs certs: Add support for using elliptic curve keys for signing modules 2021-08-23 19:55:42 +03:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2021-08-30 12:57:10 -07:00
drivers include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h 2021-11-06 13:30:43 -07:00
fs vfs: keep inodes with page cache off the inode shrinker LRU 2021-11-09 10:02:48 -08:00
include vfs: keep inodes with page cache off the inode shrinker LRU 2021-11-09 10:02:48 -08:00
init memblock: use memblock_free for freeing virtual pointers 2021-11-06 13:30:41 -07:00
ipc ipc: remove memcg accounting for sops objects in do_semtimedop() 2021-09-14 10:22:11 -07:00
kernel stacktrace: move filter_irq_stacks() to kernel/stacktrace.c 2021-11-06 13:30:43 -07:00
lib kfence: default to dynamic branch instead of static keys mode 2021-11-06 13:30:43 -07:00
mm vfs: keep inodes with page cache off the inode shrinker LRU 2021-11-09 10:02:48 -08:00
net include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h 2021-11-06 13:30:43 -07:00
samples samples/bpf: Relicense bpf_insn.h as GPL-2.0-only OR BSD-2-Clause 2021-09-29 16:03:55 +02:00
scripts Compiler Attributes: add __alloc_size() for better bounds checking 2021-11-06 13:30:33 -07:00
security mm: remove HARDENED_USERCOPY_FALLBACK 2021-11-06 13:30:43 -07:00
sound ALSA: usb-audio: Fix microphone sound on Jieli webcam. 2021-10-19 08:07:01 +02:00
tools selftests/damon: support watermarks 2021-11-06 13:30:45 -07:00
usr .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
virt KVM: Remove tlbs_dirty 2021-09-23 11:01:12 -04:00
.clang-format clang-format: Update with the latest for_each macro list 2021-05-12 23:32:39 +02:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: ignore only top-level modules.builtin 2021-05-02 00:43:35 +09:00
.mailmap mailmap: add Andrej Shadura 2021-10-18 20:22:03 -10:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: Move Daniel Drake to credits 2021-09-21 08:34:58 +03:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS MAINTAINERS: update SeongJae's email address 2021-11-06 13:30:44 -07:00
Makefile Compiler Attributes: add __alloc_size() for better bounds checking 2021-11-06 13:30:33 -07:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.