linux-stable/Documentation
Nhat Pham 8cba9576df hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not accounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory.  This has been observed in our production system.

For instance, here is one of our use cases: suppose there are two 32G
containers.  The machine is booted with hugetlb_cma=6G, and each container
may use anywhere from zero to 3 gigantic pages, depending on its workload.
The rest is anon, cache, slab, etc.  We can set the hugetlb cgroup limit
of each cgroup to 3G to enforce hugetlb fairness, but it is very difficult
to configure memory.max so that overall consumption, including anon,
cache, slab, etc., stays fair.

What we have had to resort to is constantly polling hugetlb usage and
readjusting memory.max.  A similar procedure is applied to other memory
limits (e.g. memory.low).  However, this is cumbersome and error-prone.
Furthermore, when memory limit corrections lag behind (e.g. when hugetlb
usage changes between consecutive runs of the userspace agent), the
system can end up in an over- or underprotected state.

This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller).  Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
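
The charge/uncharge lifecycle described above can be sketched as a small
toy model.  This is not kernel code: ToyMemcg, ToyFolio, pool_alloc(),
use_folio() and free_folio() are hypothetical names chosen for
illustration, sizes are in MiB, and limits and reclaim are ignored.  It
only shows the ordering: nothing is charged while a folio sits in the
pool, the memcg of the user is charged on use, and that same memcg is
uncharged when the folio is freed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMemcg:
    """Hypothetical stand-in for a memory cgroup; usage is tracked in MiB."""
    name: str
    usage: int = 0


@dataclass
class ToyFolio:
    """Hypothetical hugetlb folio; owner is None while it sits in the pool."""
    size: int
    owner: Optional[ToyMemcg] = None


def pool_alloc(size: int) -> ToyFolio:
    # Allocating into the hugetlb pool charges nobody: no memcg owns it yet.
    return ToyFolio(size=size)


def use_folio(folio: ToyFolio, memcg: ToyMemcg) -> None:
    # The folio is charged to the memcg of the task that starts using it.
    folio.owner = memcg
    memcg.usage += folio.size


def free_folio(folio: ToyFolio) -> None:
    # Freeing uncharges the memcg the folio was charged to, if any.
    if folio.owner is not None:
        folio.owner.usage -= folio.size
        folio.owner = None


cg = ToyMemcg("container-a")
page = pool_alloc(1024)  # a 1 GiB gigantic page enters the pool
print(cg.usage)          # 0: not charged at pool allocation
use_folio(page, cg)
print(cg.usage)          # 1024: charged on use
free_folio(page)
print(cg.usage)          # 0: uncharged on free
```

Tying the charge to first use rather than to pool allocation mirrors the
point above that a pooled folio has no owning memcg to charge.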

Some caveats to consider:
  * This feature is only available on cgroup v2.
  * There is no hugetlb pool management involved in the memory
    controller. As stated above, hugetlb folios are only charged towards
    the memory controller when they are used. Host overcommit management
    has to take this into account when configuring hard limits.
  * Failure to charge towards the memcg results in SIGBUS. This could
    happen even if the hugetlb pool still has pages (but the cgroup
    limit is hit and reclaim attempt fails).
  * When this feature is enabled, hugetlb pages contribute to memory
    reclaim protection, so tuning of the memory.low and memory.min limits
    must take hugetlb memory into account.
  * Hugetlb pages used while this option (the memory_hugetlb_accounting
    cgroup v2 mount option) is not selected will not be tracked by the
    memory controller, even if cgroup v2 is remounted with it later on.
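
Two of these caveats can be illustrated with a similarly hypothetical toy
model.  Again this is not kernel code: ToySigbus, ToyMemcg and
ToyHugetlbPool are invented for illustration, sizes are in MiB, and
reclaim is not modeled.  It shows that a charge can fail at the memcg
limit while the pool still has free pages, and that a page handed out
while accounting was off is never charged, so releasing it later does not
uncharge anything.

```python
class ToySigbus(Exception):
    """Hypothetical stand-in for the SIGBUS delivered when the charge fails."""


class ToyMemcg:
    """Hypothetical memory cgroup with a hard limit (memory.max) in MiB."""

    def __init__(self, max_mib: int) -> None:
        self.max = max_mib
        self.usage = 0


class ToyHugetlbPool:
    """Hypothetical pool of free hugetlb pages of one size, in MiB."""

    def __init__(self, free_pages: int, page_size: int) -> None:
        self.free_pages = free_pages
        self.page_size = page_size

    def fault(self, memcg: ToyMemcg, accounting: bool) -> bool:
        """Give one free page to a task in memcg; return True if charged."""
        if self.free_pages == 0:
            raise RuntimeError("pool exhaustion is not modeled in this sketch")
        if accounting:
            # Reclaim is not modeled: at the limit the charge simply fails,
            # even though the pool still has free pages.
            if memcg.usage + self.page_size > memcg.max:
                raise ToySigbus("memcg charge failed")
            memcg.usage += self.page_size
        self.free_pages -= 1
        return accounting

    def release(self, memcg: ToyMemcg, charged: bool) -> None:
        """Return a page to the pool; only pages charged at fault are uncharged."""
        if charged:
            memcg.usage -= self.page_size
        self.free_pages += 1


pool = ToyHugetlbPool(free_pages=6, page_size=1024)
cg = ToyMemcg(max_mib=2048)

charged_early = pool.fault(cg, accounting=False)  # used before opting in
print(cg.usage)                                   # 0: never tracked
pool.fault(cg, accounting=True)
pool.fault(cg, accounting=True)
print(cg.usage)                                   # 2048: at the limit
try:
    pool.fault(cg, accounting=True)
except ToySigbus as e:
    print("SIGBUS:", e, "| free pages left:", pool.free_pages)  # 3 left
pool.release(cg, charged_early)
print(cg.usage)                                   # still 2048
```

Recording whether each page was charged at fault time, rather than
checking the current accounting mode on release, is what keeps
untracked pages from ever being uncharged.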

Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18 14:34:17 -07:00
ABI Docs/ABI/damon: update for DAMOS apply intervals 2023-10-04 10:32:32 -07:00
PCI Merge branch 'pci/misc' 2023-08-29 11:03:57 -05:00
RCU
accel
accounting
admin-guide hugetlb: memcg: account hugetlb-backed memory in memory controller 2023-10-18 14:34:17 -07:00
arch LoongArch fixes for v6.6-rc3 2023-09-23 10:57:03 -07:00
block Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
bpf Including fixes from netfilter and bpf. 2023-09-07 18:33:07 -07:00
cdrom
core-api printk changes for 6.6 2023-09-04 13:20:19 -07:00
cpu-freq
crypto
dev-tools Documentation: *san: drop "the" from article titles 2023-10-18 14:34:15 -07:00
devicetree ARM: SoC fixes for 6.6 2023-09-30 18:41:37 -07:00
doc-guide
driver-api ata changes for 6.6 2023-09-05 12:37:28 -07:00
fault-injection Documentation: Fix typos 2023-08-18 11:29:03 -06:00
fb Documentation: Fix typos 2023-08-18 11:29:03 -06:00
features LoongArch changes for v6.6 2023-09-08 12:16:52 -07:00
filesystems v6.6-rc4.vfs.fixes 2023-09-26 08:50:30 -07:00
firmware-guide Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
firmware_class
fpga
gpu drm ci for 6.6-rc1 2023-09-10 11:55:26 -07:00
hid
hwmon Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
i2c media updates for v6.6-rc1 2023-09-01 12:21:32 -07:00
iio
images
infiniband
input input: docs: pxrc: remove reference to phoenix-sim 2023-08-28 12:43:32 -06:00
isdn
kbuild Documentation: kbuild: explain handling optional dependencies 2023-09-25 16:01:05 +09:00
kernel-hacking
leds
litmus-tests
livepatch Documentation: Fix typos 2023-08-18 11:29:03 -06:00
locking Documentation: Fix typos 2023-08-18 11:29:03 -06:00
maintainer Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
mhi
misc-devices
mm Docs/mm/damon/design: document DAMOS apply intervals 2023-10-04 10:32:31 -07:00
netlabel
netlink doc/netlink: Add spec for rt route messages 2023-08-27 17:17:11 -07:00
networking Documentation: netdev: fix dead link in ax25.rst 2023-09-18 12:56:58 +01:00
nvdimm
nvme
pcmcia
peci
power Documentation: Fix typos 2023-08-18 11:29:03 -06:00
powerpc powerpc updates for 6.6 2023-08-31 12:43:10 -07:00
process Documentation: embargoed-hardware-issues.rst: Add myself for RISC-V 2023-09-13 09:19:49 +02:00
riscv Merge patch series "RISC-V: Probe for misaligned access speed" 2023-09-08 11:24:12 -07:00
rust Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
scheduler Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
scsi SCSI misc on 20230902 2023-09-02 12:02:41 -07:00
security Documentation: Fix typos 2023-08-18 11:29:03 -06:00
sound ALSA: docs: Fix a typo of midi2_ump_probe option for snd-usb-audio 2023-09-12 10:00:46 +02:00
sphinx Documentation: Fix typos 2023-08-18 11:29:03 -06:00
sphinx-static
spi Documentation: Fix typos 2023-08-18 11:29:03 -06:00
staging
target
timers
tools Documentation: Fix typos 2023-08-18 11:29:03 -06:00
trace mm, vmscan: remove ISOLATE_UNMAPPED 2023-10-04 10:32:29 -07:00
translations docs/zh_CN/LoongArch: Update the links of ABI 2023-09-20 14:26:38 +08:00
usb USB / Thunderbolt / PHY driver update for 6.6-rc1 2023-09-01 09:23:34 -07:00
userspace-api Including fixes from netfilter and bpf. 2023-09-07 18:33:07 -07:00
virt ARM: 2023-09-07 13:52:20 -07:00
w1 Documentation: Fix typos 2023-08-18 11:29:03 -06:00
watchdog Documentation: Fix typos 2023-08-18 11:29:03 -06:00
wmi Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
.gitignore
Changes
CodingStyle
Kconfig
Makefile
SubmittingPatches
atomic_bitops.txt
atomic_t.txt
conf.py
docutils.conf
dontdiff
index.rst
memory-barriers.txt
subsystem-apis.rst