linux-stable/Documentation
Johannes Weiner 8a931f8013 mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says.  The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.

Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.

Consider memory in a data center host.  At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing.  Both branches are further subdivided into individual
services, job components etc.

We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other.  Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree.  Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general.  This
results in very inefficient resource distribution.

Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level.  Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads.  It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g.  rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.

It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.

To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.

That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now.  However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size.  Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree.  But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.

E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.

This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.

The floating protection composes naturally with fixed protection.
Consider the following example tree:

		A            A: low = 2G
               / \          A1: low = 1G
              A1 A2         A2: low = 0G

As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.

There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree.  Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior.  Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:28 -07:00
..
ABI TTY/Serial patches for 5.7-rc1 2020-03-31 16:18:55 -07:00
PCI docs: fix pointers to io-mapping.rst and io_ordering.rst files 2020-03-11 14:15:20 -06:00
RCU doc: Add rcutorture scripting to torture.txt 2020-02-27 07:03:14 -08:00
accounting doc: cgroup: improve formatting of references 2020-03-02 12:57:03 -07:00
admin-guide mm: memcontrol: recursive memory.low protection 2020-04-02 09:35:28 -07:00
arm docs: arm: tcm: Fix a few typos 2020-02-19 02:42:21 -07:00
arm64 arm64 updates for 5.7: 2020-03-31 10:05:01 -07:00
block block: Document genhd capability flags 2020-03-12 07:47:22 -06:00
bpf bpf: lsm: Add Documentation 2020-03-30 01:35:12 +02:00
cdrom docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
core-api mm: dump_page(): additional diagnostics for huge pinned pages 2020-04-02 09:35:27 -07:00
cpu-freq docs: cpu-freq: convert cpufreq-stats.txt to ReST 2020-03-06 00:01:02 +01:00
crypto crypto: algapi - make unregistration functions return void 2019-12-20 14:58:35 +08:00
dev-tools This has been a busy cycle for documentation work. Highlights include: 2020-03-30 12:45:23 -07:00
devicetree Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-03-31 17:29:33 -07:00
doc-guide Documentation: build warnings related to missing blank lines after explicit markups has been fixed 2020-02-05 10:30:03 -07:00
driver-api Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2020-03-30 16:13:08 -07:00
fault-injection docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
fb fbdev: fbmem: allow overriding the number of bootup logos 2020-01-03 14:27:40 +01:00
features documentation: vm: Advertise support for pte_special in riscv 2020-02-19 02:41:01 -07:00
filesystems fscrypt updates for 5.7 2020-03-31 12:58:36 -07:00
firmware-guide docs: firmware-guide: ACPI: Replace dma_request_slave_channel() with dma_request_chan() 2019-12-19 22:58:12 +01:00
firmware_class
fpga Documentation: fpga: dfl: add descriptions for thermal/power management interfaces 2019-10-16 19:18:26 -07:00
gpu docs: gpu: i915.rst: fix warnings due to file renames 2020-02-25 03:02:35 -07:00
hid docs: add some documentation dirs to the driver-api book 2019-07-15 11:03:02 -03:00
hwmon docs: hwmon: Update documentation for isl68137 pmbus driver 2020-03-22 16:42:54 -07:00
i2c docs: i2c: writing-clients: properly name the stop condition 2020-01-29 22:02:09 +01:00
ia64 docs: add SPDX tags to new index files 2019-07-15 11:03:03 -03:00
ide docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
iio docs: add some documentation dirs to the driver-api book 2019-07-15 11:03:02 -03:00
infiniband Documentation/infiniband: update name of some functions 2019-09-13 16:55:55 -03:00
input Input: docs: fix spelling mistake "potocol" -> "protocol" 2019-08-06 11:24:49 -06:00
isdn isdn: capi: dead code removal 2019-12-11 09:12:38 +01:00
kbuild Kbuild updates for v5.7 2020-03-31 16:03:39 -07:00
kernel-hacking docs: locking: Drop :c:func: throughout 2020-03-20 17:16:24 -06:00
leds leds: core: Add support for composing LED class device names 2019-07-25 20:07:52 +02:00
livepatch livepatch: Documentation of the new API for tracking system state changes 2019-11-01 13:08:24 +01:00
locking Documentation/locking/locktypes: Minor copy editor fixes 2020-03-28 12:47:34 +01:00
m68k docs: README.buddha: convert to ReST and add to m68k book 2019-07-31 13:30:10 -06:00
maintainer Add a maintainer entry profile for documentation 2020-01-24 09:48:39 -07:00
media media updates for v5.7-rc1 2020-03-30 13:42:05 -07:00
mips docs: mips: remove no longer needed au1xxx_ide.rst documentation 2020-03-24 15:53:48 +01:00
misc-devices docs: Move Intel Many Integrated Core documentation (mic) under misc-devices 2020-03-10 11:12:34 -06:00
netlabel docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
networking Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-03-31 17:29:33 -07:00
nios2 docs: nios2: add it to the main Documentation body 2019-07-31 13:31:51 -06:00
nvdimm docs: nvdimm: use ReST notation for subsection 2020-01-24 09:54:42 -07:00
openrisc docs: openrisc: convert to ReST and add to documentation body 2019-07-31 13:30:20 -06:00
parisc docs: parisc: convert to ReST and add to documentation body 2019-07-31 13:30:15 -06:00
pcmcia docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
power Merge branches 'pm-core', 'pm-sleep', 'pm-acpi' and 'pm-domains' 2020-03-30 14:46:58 +02:00
powerpc docs: prevent warnings due to autosectionlabel 2020-03-20 17:01:29 -06:00
process This has been a busy cycle for documentation work. Highlights include: 2020-03-30 12:45:23 -07:00
riscv It has been a relatively quiet cycle for documentation, but there's still a 2020-01-29 15:27:31 -08:00
s390 Documentation/s390: remove outdated debugging390 documentation 2019-08-21 12:41:43 +02:00
scheduler Documentation/scheduler: fix links in sched-stats 2019-10-29 04:35:41 -06:00
scsi scsi: simplify scsi_partsize 2020-03-24 07:57:07 -06:00
security docs: prevent warnings due to autosectionlabel 2020-03-20 17:01:29 -06:00
sh docs: remove extra conf.py files 2019-07-17 06:57:52 -03:00
sound sound updates for 5.6-rc1 2020-01-28 16:26:57 -08:00
sparc docs: add arch doc directories to the index 2019-07-15 11:03:01 -03:00
sphinx docs: Fix empty parallelism argument 2020-02-25 03:11:04 -07:00
sphinx-static doc-rst: Reduce CSS padding around Field 2019-10-02 10:03:06 -06:00
spi spi: docs: convert to ReST and add it to the kABI bookset 2019-07-31 14:13:13 -06:00
target docs: prevent warnings due to autosectionlabel 2020-03-20 17:01:29 -06:00
timers docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
trace Power management updates for 5.7-rc1 2020-03-30 15:05:01 -07:00
translations media updates for v5.7-rc1 2020-03-30 13:42:05 -07:00
usb usb: gadget: add raw-gadget interface 2020-03-15 11:34:48 +02:00
userspace-api docs: userspace: ioctl-number: remove mc146818rtc conflict 2020-02-13 11:42:02 -07:00
virt KVM: SVM: document KVM_MEM_ENCRYPT_OP, let userspace detect if SEV is available 2020-03-20 13:47:52 -04:00
vm mm/zswap.c: add allocation hysteresis if pool limit is hit 2020-01-31 10:30:39 -08:00
w1 docs: w1: Fix a typo in omap-hdq.rst 2019-12-30 11:58:02 -07:00
watchdog linux-watchdog 5.4-rc1 tag 2019-09-27 11:17:38 -07:00
x86 Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2020-03-31 11:04:05 -07:00
xtensa docs: add arch doc directories to the index 2019-07-15 11:03:01 -03:00
.gitignore
COPYING-logo docs: logo.txt: rename it to COPYING-logo 2019-07-15 09:20:27 -03:00
Changes
CodingStyle
DMA-API-HOWTO.txt docs: DMA-API-HOWTO.txt: fix an unmarked code block 2019-07-15 09:20:24 -03:00
DMA-API.txt dma-mapping: remove dma_release_declared_memory 2019-09-04 11:13:19 +02:00
DMA-ISA-LPC.txt
DMA-attributes.txt dma-mapping: remove the DMA_ATTR_WRITE_BARRIER flag 2019-11-14 12:01:54 -04:00
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
Kconfig
Makefile Kbuild updates for v5.7 2020-03-31 16:03:39 -07:00
SubmittingPatches
asm-annotations.rst Documentation: Call out example SYM_FUNC_* usage as x86-specific 2020-01-16 12:53:16 -07:00
atomic_bitops.txt
atomic_t.txt Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-07-08 16:12:03 -07:00
bus-virt-phys-mapping.txt
conf.py docs: conf.py: avoid thousands of duplicate label warning on Sphinx 2020-03-20 17:01:34 -06:00
crc32.txt
debugging-via-ohci1394.txt
digsig.txt
docutils.conf
dontdiff modpost: dump missing namespaces into a single modules.nsdeps file 2019-11-11 20:10:01 +09:00
futex-requeue-pi.txt
hwspinlock.txt
index.rst Power management updates for 5.7-rc1 2020-03-30 15:05:01 -07:00
irqflags-tracing.txt
kprobes.txt
kref.txt docs: kref: Clarify the use of two kref_put() in example code 2020-02-25 03:39:10 -07:00
logo.gif
lzo.txt
mailbox.txt
memory-barriers.txt Documentation/memory-barriers: Fix typos 2020-02-27 07:03:14 -08:00
nommu-mmap.txt
percpu-rw-semaphore.txt
pi-futex.txt docs: locking: convert docs to ReST and rename to *.rst 2019-07-15 08:53:27 -03:00
preempt-locking.txt
rbtree.txt docs: rbtree.txt: fix Sphinx build warnings 2019-07-15 09:20:24 -03:00
remoteproc.txt
robust-futex-ABI.txt threads: Update PID limit comment according to futex UAPI change 2020-03-21 17:48:13 +01:00
robust-futexes.txt
rpmsg.txt
speculation.txt
static-keys.txt
tee.txt Documentation: tee: add AMD-TEE driver details 2020-01-04 13:49:51 +08:00
this_cpu_ops.txt
unaligned-memory-access.txt
xz.txt