Merge branch 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
 "Documentation updates and the addition of cgroup_parse_float() which
  will be used by new controllers including blk-iocost"

* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  docs: cgroup-v1: convert docs to ReST and rename to *.rst
  cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
  cgroup: add cgroup_parse_float()
Linus Torvalds 2019-07-08 21:35:12 -07:00
commit 92c1d65221
37 changed files with 989 additions and 627 deletions

@@ -705,6 +705,12 @@ Conventions
   informational files on the root cgroup which end up showing global
   information available elsewhere shouldn't exist.
 
+- The default time unit is microseconds. If a different unit is ever
+  used, an explicit unit suffix must be present.
+
+- A parts-per quantity should use a percentage decimal with at least
+  two digit fractional part - e.g. 13.40.
+
 - If a controller implements weight based resource distribution, its
   interface file should be named "weight" and have the range [1,
   10000] with 100 as the default. The values are chosen to allow
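
The "percentage decimal with at least two digit fractional part" convention added in this hunk can be illustrated with a small userspace sketch. This is not the kernel's cgroup_parse_float() (its signature is not shown in this diff); the names `parse_fixed` and `format_percent`, and the default of two fractional digits, are assumptions made for illustration only.

```python
def parse_fixed(text: str, dec_shift: int = 2) -> int:
    """Parse a decimal string such as "13.40" into an integer scaled by 10**dec_shift."""
    text = text.strip()
    sign = 1
    if text.startswith("-"):
        sign, text = -1, text[1:]
    whole, _, frac = text.partition(".")
    if not (whole or frac) or not (whole + frac).isdecimal():
        raise ValueError(f"not a decimal number: {text!r}")
    # Pad or truncate the fractional part to exactly dec_shift digits.
    frac = (frac + "0" * dec_shift)[:dec_shift]
    return sign * (int(whole or "0") * 10**dec_shift + int(frac or "0"))


def format_percent(value: int, dec_shift: int = 2) -> str:
    """Render a scaled integer as a decimal with dec_shift fractional digits."""
    sign = "-" if value < 0 else ""
    whole, frac = divmod(abs(value), 10**dec_shift)
    return f"{sign}{whole}.{frac:0{dec_shift}d}"
```

Storing the value as a scaled integer avoids floating point entirely, so "13.4" and "13.40" both become 1340 and are printed back as "13.40".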

@@ -241,7 +241,7 @@ Guest mitigation mechanisms
 For further information about confining guests to a single or to a group
 of cores consult the cpusets documentation:
 
-https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
+https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.rst
 
 .. _interrupt_isolation:

@@ -4084,7 +4084,7 @@
 relax_domain_level=
 [KNL, SMP] Set scheduler's default relax_domain_level.
-See Documentation/cgroup-v1/cpusets.txt.
+See Documentation/cgroup-v1/cpusets.rst.
 
 reserve= [KNL,BUGS] Force kernel to ignore I/O ports or memory
 Format: <base1>,<size1>[,<base2>,<size2>,...]
@@ -4594,7 +4594,7 @@
 swapaccount=[0|1]
 [KNL] Enable accounting of swap in memory resource
 controller if no parameter or 1 is given or disable
-it if 0 is given (See Documentation/cgroup-v1/memory.txt)
+it if 0 is given (See Documentation/cgroup-v1/memory.rst)
 
 swiotlb= [ARM,IA-64,PPC,MIPS,X86]
 Format: { <int> | force | noforce }

@@ -15,7 +15,7 @@ document attempts to describe the concepts and APIs of the 2.6 memory policy
 support.
 
 Memory policies should not be confused with cpusets
-(``Documentation/cgroup-v1/cpusets.txt``)
+(``Documentation/cgroup-v1/cpusets.rst``)
 which is an administrative mechanism for restricting the nodes from which
 memory may be allocated by a set of processes. Memory policies are a
 programming interface that a NUMA-aware application can take advantage of. When

@@ -539,7 +539,7 @@ As for cgroups-v1 (blkio controller), the exact set of stat files
 created, and kept up-to-date by bfq, depends on whether
 CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
 the stat files documented in
-Documentation/cgroup-v1/blkio-controller.txt. If, instead,
+Documentation/cgroup-v1/blkio-controller.rst. If, instead,
 CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
 blkio.bfq.io_service_bytes
 blkio.bfq.io_service_bytes_recursive

@ -1,5 +1,7 @@
Block IO Controller ===================
=================== Block IO Controller
===================
Overview Overview
======== ========
cgroup subsys "blkio" implements the block io controller. There seems to be cgroup subsys "blkio" implements the block io controller. There seems to be
@ -17,24 +19,27 @@ HOWTO
===== =====
Throttling/Upper Limit policy Throttling/Upper Limit policy
----------------------------- -----------------------------
- Enable Block IO controller - Enable Block IO controller::
CONFIG_BLK_CGROUP=y CONFIG_BLK_CGROUP=y
- Enable throttling in block layer - Enable throttling in block layer::
CONFIG_BLK_DEV_THROTTLING=y CONFIG_BLK_DEV_THROTTLING=y
- Mount blkio controller (see cgroups.txt, Why are cgroups needed?) - Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
- Specify a bandwidth rate on particular device for root group. The format - Specify a bandwidth rate on particular device for root group. The format
for policy is "<major>:<minor> <bytes_per_second>". for policy is "<major>:<minor> <bytes_per_second>"::
echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device
Above will put a limit of 1MB/second on reads happening for root group Above will put a limit of 1MB/second on reads happening for root group
on device having major/minor number 8:16. on device having major/minor number 8:16.
- Run dd to read a file and see if rate is throttled to 1MB/s or not. - Run dd to read a file and see if rate is throttled to 1MB/s or not::
# dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024 # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
1024+0 records in 1024+0 records in
@ -51,7 +56,7 @@ throttling's hierarchy support is enabled iff "sane_behavior" is
enabled from cgroup side, which currently is a development option and enabled from cgroup side, which currently is a development option and
not publicly available. not publicly available.
If somebody created a hierarchy like as follows. If somebody created a hierarchy like as follows::
root root
/ \ / \
@ -66,7 +71,7 @@ directly generated by tasks in that cgroup.
Throttling without "sane_behavior" enabled from cgroup side will Throttling without "sane_behavior" enabled from cgroup side will
practically treat all groups at same level as if it looks like the practically treat all groups at same level as if it looks like the
following. following::
pivot pivot
/ / \ \ / / \ \
@@ -99,27 +104,31 @@ Proportional weight policy files
 These rules override the default value of group weight as specified
 by blkio.weight.
 
-Following is the format.
+Following is the format::
 
   # echo dev_maj:dev_minor weight > blkio.weight_device
-Configure weight=300 on /dev/sdb (8:16) in this cgroup
-# echo 8:16 300 > blkio.weight_device
-# cat blkio.weight_device
-dev weight
-8:16 300
-
-Configure weight=500 on /dev/sda (8:0) in this cgroup
-# echo 8:0 500 > blkio.weight_device
-# cat blkio.weight_device
-dev weight
-8:0 500
-8:16 300
-
-Remove specific weight for /dev/sda in this cgroup
-# echo 8:0 0 > blkio.weight_device
-# cat blkio.weight_device
-dev weight
-8:16 300
+
+Configure weight=300 on /dev/sdb (8:16) in this cgroup::
+
+  # echo 8:16 300 > blkio.weight_device
+  # cat blkio.weight_device
+  dev weight
+  8:16 300
+
+Configure weight=500 on /dev/sda (8:0) in this cgroup::
+
+  # echo 8:0 500 > blkio.weight_device
+  # cat blkio.weight_device
+  dev weight
+  8:0 500
+  8:16 300
+
+Remove specific weight for /dev/sda in this cgroup::
+
+  # echo 8:0 0 > blkio.weight_device
+  # cat blkio.weight_device
+  dev weight
+  8:16 300
 
 - blkio.leaf_weight[_device]
 - Equivalents of blkio.weight[_device] for the purpose of
@ -244,30 +253,30 @@ Throttling/Upper limit policy files
- blkio.throttle.read_bps_device - blkio.throttle.read_bps_device
- Specifies upper limit on READ rate from the device. IO rate is - Specifies upper limit on READ rate from the device. IO rate is
specified in bytes per second. Rules are per device. Following is specified in bytes per second. Rules are per device. Following is
the format. the format::
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
- blkio.throttle.write_bps_device - blkio.throttle.write_bps_device
- Specifies upper limit on WRITE rate to the device. IO rate is - Specifies upper limit on WRITE rate to the device. IO rate is
specified in bytes per second. Rules are per device. Following is specified in bytes per second. Rules are per device. Following is
the format. the format::
echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
- blkio.throttle.read_iops_device - blkio.throttle.read_iops_device
- Specifies upper limit on READ rate from the device. IO rate is - Specifies upper limit on READ rate from the device. IO rate is
specified in IO per second. Rules are per device. Following is specified in IO per second. Rules are per device. Following is
the format. the format::
echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
- blkio.throttle.write_iops_device - blkio.throttle.write_iops_device
- Specifies upper limit on WRITE rate to the device. IO rate is - Specifies upper limit on WRITE rate to the device. IO rate is
specified in io per second. Rules are per device. Following is specified in io per second. Rules are per device. Following is
the format. the format::
echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device echo "<major>:<minor> <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
Note: If both BW and IOPS rules are specified for a device, then IO is Note: If both BW and IOPS rules are specified for a device, then IO is
subjected to both the constraints. subjected to both the constraints.
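
The `<major>:<minor> <rate>` rule format shared by the four throttle files above can be sketched as a small helper. This is a hedged illustration rather than part of the commit: `throttle_rule` and `write_rule` are hypothetical names, and the cgroup directory passed in is whatever path the blkio hierarchy happens to be mounted at.

```python
import os


def throttle_rule(major: int, minor: int, rate: int) -> str:
    """Build a "<major>:<minor> <rate>" rule string for a blkio.throttle.* file."""
    if major < 0 or minor < 0 or rate < 0:
        raise ValueError("major, minor and rate must be non-negative")
    return f"{major}:{minor} {rate}"


def write_rule(cgroup_dir: str, filename: str, rule: str) -> None:
    """Write one rule to a throttle file; an OSError means the write was rejected."""
    with open(os.path.join(cgroup_dir, filename), "w") as f:
        f.write(rule)
```

For instance, `write_rule("/sys/fs/cgroup/blkio", "blkio.throttle.read_bps_device", throttle_rule(8, 16, 1048576))` would correspond to the 1MB/second read limit on device 8:16 used in the HOWTO earlier in this file, assuming the controller is mounted at that path and the caller has the required privileges.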

@ -1,35 +1,39 @@
CGROUPS ==============
------- Control Groups
==============
Written by Paul Menage <menage@google.com> based on Written by Paul Menage <menage@google.com> based on
Documentation/cgroup-v1/cpusets.txt Documentation/cgroup-v1/cpusets.rst
Original copyright statements from cpusets.txt: Original copyright statements from cpusets.txt:
Portions Copyright (C) 2004 BULL SA. Portions Copyright (C) 2004 BULL SA.
Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
Modified by Paul Jackson <pj@sgi.com> Modified by Paul Jackson <pj@sgi.com>
Modified by Christoph Lameter <cl@linux.com> Modified by Christoph Lameter <cl@linux.com>
CONTENTS: .. CONTENTS:
=========
1. Control Groups 1. Control Groups
1.1 What are cgroups ? 1.1 What are cgroups ?
1.2 Why are cgroups needed ? 1.2 Why are cgroups needed ?
1.3 How are cgroups implemented ? 1.3 How are cgroups implemented ?
1.4 What does notify_on_release do ? 1.4 What does notify_on_release do ?
1.5 What does clone_children do ? 1.5 What does clone_children do ?
1.6 How do I use cgroups ? 1.6 How do I use cgroups ?
2. Usage Examples and Syntax 2. Usage Examples and Syntax
2.1 Basic Usage 2.1 Basic Usage
2.2 Attaching processes 2.2 Attaching processes
2.3 Mounting hierarchies by name 2.3 Mounting hierarchies by name
3. Kernel API 3. Kernel API
3.1 Overview 3.1 Overview
3.2 Synchronization 3.2 Synchronization
3.3 Subsystem API 3.3 Subsystem API
4. Extended attributes usage 4. Extended attributes usage
5. Questions 5. Questions
1. Control Groups 1. Control Groups
================= =================
@ -72,7 +76,7 @@ On their own, the only use for cgroups is for simple job
tracking. The intention is that other subsystems hook into the generic tracking. The intention is that other subsystems hook into the generic
cgroup support to provide new attributes for cgroups, such as cgroup support to provide new attributes for cgroups, such as
accounting/limiting the resources which processes in a cgroup can accounting/limiting the resources which processes in a cgroup can
access. For example, cpusets (see Documentation/cgroup-v1/cpusets.txt) allow access. For example, cpusets (see Documentation/cgroup-v1/cpusets.rst) allow
you to associate a set of CPUs and a set of memory nodes with the you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup. tasks in each cgroup.
@ -108,7 +112,7 @@ As an example of a scenario (originally proposed by vatsa@in.ibm.com)
that can benefit from multiple hierarchies, consider a large that can benefit from multiple hierarchies, consider a large
university server with various users - students, professors, system university server with various users - students, professors, system
tasks etc. The resource planning for this server could be along the tasks etc. The resource planning for this server could be along the
following lines: following lines::
CPU : "Top cpuset" CPU : "Top cpuset"
/ \ / \
@ -136,7 +140,7 @@ depending on who launched it (prof/student).
With the ability to classify tasks differently for different resources With the ability to classify tasks differently for different resources
(by putting those resource subsystems in different hierarchies), (by putting those resource subsystems in different hierarchies),
the admin can easily set up a script which receives exec notifications the admin can easily set up a script which receives exec notifications
and depending on who is launching the browser he can and depending on who is launching the browser he can::
# echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
@ -151,7 +155,7 @@ wants to do online gaming :)) OR give one of the student's simulation
apps enhanced CPU power. apps enhanced CPU power.
With ability to write PIDs directly to resource classes, it's just a With ability to write PIDs directly to resource classes, it's just a
matter of: matter of::
# echo pid > /sys/fs/cgroup/network/<new_class>/tasks # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
(after some time) (after some time)
@ -306,7 +310,7 @@ configuration from the parent during initialization.
-------------------------- --------------------------
To start a new job that is to be contained within a cgroup, using To start a new job that is to be contained within a cgroup, using
the "cpuset" cgroup subsystem, the steps are something like: the "cpuset" cgroup subsystem, the steps are something like::
1) mount -t tmpfs cgroup_root /sys/fs/cgroup 1) mount -t tmpfs cgroup_root /sys/fs/cgroup
2) mkdir /sys/fs/cgroup/cpuset 2) mkdir /sys/fs/cgroup/cpuset
@ -320,7 +324,7 @@ the "cpuset" cgroup subsystem, the steps are something like:
For example, the following sequence of commands will setup a cgroup For example, the following sequence of commands will setup a cgroup
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cgroup: and then start a subshell 'sh' in that cgroup::
mount -t tmpfs cgroup_root /sys/fs/cgroup mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/cpuset mkdir /sys/fs/cgroup/cpuset
@ -345,8 +349,9 @@ and then start a subshell 'sh' in that cgroup:
Creating, modifying, using cgroups can be done through the cgroup Creating, modifying, using cgroups can be done through the cgroup
virtual filesystem. virtual filesystem.
To mount a cgroup hierarchy with all available subsystems, type: To mount a cgroup hierarchy with all available subsystems, type::
# mount -t cgroup xxx /sys/fs/cgroup
# mount -t cgroup xxx /sys/fs/cgroup
The "xxx" is not interpreted by the cgroup code, but will appear in The "xxx" is not interpreted by the cgroup code, but will appear in
/proc/mounts so may be any useful identifying string that you like. /proc/mounts so may be any useful identifying string that you like.
@ -355,18 +360,19 @@ Note: Some subsystems do not work without some user input first. For instance,
if cpusets are enabled the user will have to populate the cpus and mems files if cpusets are enabled the user will have to populate the cpus and mems files
for each new cgroup created before that group can be used. for each new cgroup created before that group can be used.
As explained in section `1.2 Why are cgroups needed?' you should create As explained in section `1.2 Why are cgroups needed?` you should create
different hierarchies of cgroups for each single resource or group of different hierarchies of cgroups for each single resource or group of
resources you want to control. Therefore, you should mount a tmpfs on resources you want to control. Therefore, you should mount a tmpfs on
/sys/fs/cgroup and create directories for each cgroup resource or resource /sys/fs/cgroup and create directories for each cgroup resource or resource
group. group::
# mount -t tmpfs cgroup_root /sys/fs/cgroup # mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/rg1 # mkdir /sys/fs/cgroup/rg1
To mount a cgroup hierarchy with just the cpuset and memory To mount a cgroup hierarchy with just the cpuset and memory
subsystems, type: subsystems, type::
# mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
# mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
While remounting cgroups is currently supported, it is not recommend While remounting cgroups is currently supported, it is not recommend
to use it. Remounting allows changing bound subsystems and to use it. Remounting allows changing bound subsystems and
@ -375,9 +381,10 @@ hierarchy is empty and release_agent itself should be replaced with
conventional fsnotify. The support for remounting will be removed in conventional fsnotify. The support for remounting will be removed in
the future. the future.
To Specify a hierarchy's release_agent: To Specify a hierarchy's release_agent::
# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
xxx /sys/fs/cgroup/rg1 # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
xxx /sys/fs/cgroup/rg1
Note that specifying 'release_agent' more than once will return failure. Note that specifying 'release_agent' more than once will return failure.
@ -390,32 +397,39 @@ Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1 tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
is the cgroup that holds the whole system. is the cgroup that holds the whole system.
If you want to change the value of release_agent: If you want to change the value of release_agent::
# echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
# echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
It can also be changed via remount. It can also be changed via remount.
If you want to create a new cgroup under /sys/fs/cgroup/rg1: If you want to create a new cgroup under /sys/fs/cgroup/rg1::
# cd /sys/fs/cgroup/rg1
# mkdir my_cgroup
Now you want to do something with this cgroup. # cd /sys/fs/cgroup/rg1
# cd my_cgroup # mkdir my_cgroup
In this directory you can find several files: Now you want to do something with this cgroup:
# ls
cgroup.procs notify_on_release tasks
(plus whatever files added by the attached subsystems)
Now attach your shell to this cgroup: # cd my_cgroup
# /bin/echo $$ > tasks
In this directory you can find several files::
# ls
cgroup.procs notify_on_release tasks
(plus whatever files added by the attached subsystems)
Now attach your shell to this cgroup::
# /bin/echo $$ > tasks
You can also create cgroups inside your cgroup by using mkdir in this You can also create cgroups inside your cgroup by using mkdir in this
directory. directory::
# mkdir my_sub_cs
To remove a cgroup, just use rmdir: # mkdir my_sub_cs
# rmdir my_sub_cs
To remove a cgroup, just use rmdir::
# rmdir my_sub_cs
This will fail if the cgroup is in use (has cgroups inside, or This will fail if the cgroup is in use (has cgroups inside, or
has processes attached, or is held alive by other subsystem-specific has processes attached, or is held alive by other subsystem-specific
@ -424,19 +438,21 @@ reference).
2.2 Attaching processes 2.2 Attaching processes
----------------------- -----------------------
# /bin/echo PID > tasks ::
# /bin/echo PID > tasks
Note that it is PID, not PIDs. You can only attach ONE task at a time. Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another: If you have several tasks to attach, you have to do it one after another::
# /bin/echo PID1 > tasks # /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks # /bin/echo PID2 > tasks
... ...
# /bin/echo PIDn > tasks # /bin/echo PIDn > tasks
You can attach the current shell task by echoing 0: You can attach the current shell task by echoing 0::
# echo 0 > tasks # echo 0 > tasks
You can use the cgroup.procs file instead of the tasks file to move all You can use the cgroup.procs file instead of the tasks file to move all
threads in a threadgroup at once. Echoing the PID of any task in a threads in a threadgroup at once. Echoing the PID of any task in a
@ -529,7 +545,7 @@ Each subsystem may export the following methods. The only mandatory
methods are css_alloc/free. Any others that are null are presumed to methods are css_alloc/free. Any others that are null are presumed to
be successful no-ops. be successful no-ops.
struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp) ``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)``
(cgroup_mutex held by caller) (cgroup_mutex held by caller)
Called to allocate a subsystem state object for a cgroup. The Called to allocate a subsystem state object for a cgroup. The
@ -544,7 +560,7 @@ identified by the passed cgroup object having a NULL parent (since
it's the root of the hierarchy) and may be an appropriate place for it's the root of the hierarchy) and may be an appropriate place for
initialization code. initialization code.
int css_online(struct cgroup *cgrp) ``int css_online(struct cgroup *cgrp)``
(cgroup_mutex held by caller) (cgroup_mutex held by caller)
Called after @cgrp successfully completed all allocations and made Called after @cgrp successfully completed all allocations and made
@ -554,7 +570,7 @@ callback can be used to implement reliable state sharing and
propagation along the hierarchy. See the comment on propagation along the hierarchy. See the comment on
cgroup_for_each_descendant_pre() for details. cgroup_for_each_descendant_pre() for details.
void css_offline(struct cgroup *cgrp); ``void css_offline(struct cgroup *cgrp);``
(cgroup_mutex held by caller) (cgroup_mutex held by caller)
This is the counterpart of css_online() and called iff css_online() This is the counterpart of css_online() and called iff css_online()
@ -564,7 +580,7 @@ all references it's holding on @cgrp. When all references are dropped,
cgroup removal will proceed to the next step - css_free(). After this cgroup removal will proceed to the next step - css_free(). After this
callback, @cgrp should be considered dead to the subsystem. callback, @cgrp should be considered dead to the subsystem.
void css_free(struct cgroup *cgrp) ``void css_free(struct cgroup *cgrp)``
(cgroup_mutex held by caller) (cgroup_mutex held by caller)
The cgroup system is about to free @cgrp; the subsystem should free The cgroup system is about to free @cgrp; the subsystem should free
@ -573,7 +589,7 @@ is completely unused; @cgrp->parent is still valid. (Note - can also
be called for a newly-created cgroup if an error occurs after this be called for a newly-created cgroup if an error occurs after this
subsystem's create() method has been called for the new cgroup). subsystem's create() method has been called for the new cgroup).
int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset) ``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
(cgroup_mutex held by caller) (cgroup_mutex held by caller)
Called prior to moving one or more tasks into a cgroup; if the Called prior to moving one or more tasks into a cgroup; if the
@ -594,7 +610,7 @@ fork. If this method returns 0 (success) then this should remain valid
while the caller holds cgroup_mutex and it is ensured that either while the caller holds cgroup_mutex and it is ensured that either
attach() or cancel_attach() will be called in future. attach() or cancel_attach() will be called in future.
void css_reset(struct cgroup_subsys_state *css) ``void css_reset(struct cgroup_subsys_state *css)``
(cgroup_mutex held by caller) (cgroup_mutex held by caller)
An optional operation which should restore @css's configuration to the An optional operation which should restore @css's configuration to the
@ -608,7 +624,7 @@ This prevents unexpected resource control from a hidden css and
ensures that the configuration is in the initial state when it is made ensures that the configuration is in the initial state when it is made
visible again later. visible again later.
void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset) ``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
(cgroup_mutex held by caller) (cgroup_mutex held by caller)
Called when a task attach operation has failed after can_attach() has succeeded. Called when a task attach operation has failed after can_attach() has succeeded.
@ -617,26 +633,26 @@ function, so that the subsystem can implement a rollback. If not, not necessary.
This will be called only about subsystems whose can_attach() operation have This will be called only about subsystems whose can_attach() operation have
succeeded. The parameters are identical to can_attach(). succeeded. The parameters are identical to can_attach().
void attach(struct cgroup *cgrp, struct cgroup_taskset *tset) ``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
(cgroup_mutex held by caller) (cgroup_mutex held by caller)
Called after the task has been attached to the cgroup, to allow any Called after the task has been attached to the cgroup, to allow any
post-attachment activity that requires memory allocations or blocking. post-attachment activity that requires memory allocations or blocking.
The parameters are identical to can_attach(). The parameters are identical to can_attach().
void fork(struct task_struct *task) ``void fork(struct task_struct *task)``
Called when a task is forked into a cgroup. Called when a task is forked into a cgroup.
void exit(struct task_struct *task) ``void exit(struct task_struct *task)``
Called during task exit. Called during task exit.
void free(struct task_struct *task) ``void free(struct task_struct *task)``
Called when the task_struct is freed. Called when the task_struct is freed.
void bind(struct cgroup *root) ``void bind(struct cgroup *root)``
(cgroup_mutex held by caller) (cgroup_mutex held by caller)
Called when a cgroup subsystem is rebound to a different hierarchy Called when a cgroup subsystem is rebound to a different hierarchy
@ -649,6 +665,7 @@ that is being created/destroyed (and hence has no sub-cgroups).
cgroup filesystem supports certain types of extended attributes in its cgroup filesystem supports certain types of extended attributes in its
directories and files. The current supported types are: directories and files. The current supported types are:
- Trusted (XATTR_TRUSTED) - Trusted (XATTR_TRUSTED)
- Security (XATTR_SECURITY) - Security (XATTR_SECURITY)
@@ -666,12 +683,13 @@ in containers and systemd for assorted meta data like main PID in a cgroup
 5. Questions
 ============
 
-Q: what's up with this '/bin/echo' ?
-A: bash's builtin 'echo' command does not check calls to write() against
-   errors. If you use it in the cgroup file system, you won't be
-   able to tell whether a command succeeded or failed.
+::
 
-Q: When I attach processes, only the first of the line gets really attached !
-A: We can only return one error code per call to write(). So you should also
-   put only ONE PID.
+  Q: what's up with this '/bin/echo' ?
+  A: bash's builtin 'echo' command does not check calls to write() against
+     errors. If you use it in the cgroup file system, you won't be
+     able to tell whether a command succeeded or failed.
+
+  Q: When I attach processes, only the first of the line gets really attached !
+  A: We can only return one error code per call to write(). So you should also
+     put only ONE PID.
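
The two answers above (check every write for errors, and put only one PID in each write) can be sketched from a script as follows. This is a minimal illustration under stated assumptions: `attach_tasks` is a hypothetical helper name, and `tasks_path` stands for the `tasks` file of whichever cgroup directory is being used.

```python
from typing import Iterable


def attach_tasks(tasks_path: str, pids: Iterable[int]) -> int:
    """Attach each PID with its own write; return how many were written."""
    attached = 0
    for pid in pids:
        # One open/write/close cycle per PID, so a failure (raised as
        # OSError when the file is flushed on close) maps to exactly one PID.
        with open(tasks_path, "w") as f:
            f.write(f"{pid}\n")
        attached += 1
    return attached
```

Because Python raises OSError when the buffered write is flushed, a rejected PID stops the loop at that PID instead of being silently ignored.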

@ -1,5 +1,6 @@
=========================
CPU Accounting Controller CPU Accounting Controller
------------------------- =========================
The CPU accounting controller is used to group tasks using cgroups and The CPU accounting controller is used to group tasks using cgroups and
account the CPU usage of these groups of tasks. account the CPU usage of these groups of tasks.
@ -8,9 +9,9 @@ The CPU accounting controller supports multi-hierarchy groups. An accounting
group accumulates the CPU usage of all of its child groups and the tasks group accumulates the CPU usage of all of its child groups and the tasks
directly present in its group. directly present in its group.
Accounting groups can be created by first mounting the cgroup filesystem. Accounting groups can be created by first mounting the cgroup filesystem::
# mount -t cgroup -ocpuacct none /sys/fs/cgroup # mount -t cgroup -ocpuacct none /sys/fs/cgroup
With the above step, the initial or the parent accounting group becomes With the above step, the initial or the parent accounting group becomes
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
@ -19,11 +20,11 @@ the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
by this group which is essentially the CPU time obtained by all the tasks by this group which is essentially the CPU time obtained by all the tasks
in the system. in the system.
New accounting groups can be created under the parent group /sys/fs/cgroup. New accounting groups can be created under the parent group /sys/fs/cgroup::
# cd /sys/fs/cgroup # cd /sys/fs/cgroup
# mkdir g1 # mkdir g1
# echo $$ > g1/tasks # echo $$ > g1/tasks
The above steps create a new group g1 and move the current shell The above steps create a new group g1 and move the current shell
process (bash) into it. CPU time consumed by this bash and its children process (bash) into it. CPU time consumed by this bash and its children

@ -1,35 +1,36 @@
CPUSETS =======
------- CPUSETS
=======
Copyright (C) 2004 BULL SA. Copyright (C) 2004 BULL SA.
Written by Simon.Derr@bull.net Written by Simon.Derr@bull.net
Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. - Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
Modified by Paul Jackson <pj@sgi.com> - Modified by Paul Jackson <pj@sgi.com>
Modified by Christoph Lameter <cl@linux.com> - Modified by Christoph Lameter <cl@linux.com>
Modified by Paul Menage <menage@google.com> - Modified by Paul Menage <menage@google.com>
Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> - Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
CONTENTS: .. CONTENTS:
=========
1. Cpusets 1. Cpusets
1.1 What are cpusets ? 1.1 What are cpusets ?
1.2 Why are cpusets needed ? 1.2 Why are cpusets needed ?
1.3 How are cpusets implemented ? 1.3 How are cpusets implemented ?
1.4 What are exclusive cpusets ? 1.4 What are exclusive cpusets ?
1.5 What is memory_pressure ? 1.5 What is memory_pressure ?
1.6 What is memory spread ? 1.6 What is memory spread ?
1.7 What is sched_load_balance ? 1.7 What is sched_load_balance ?
1.8 What is sched_relax_domain_level ? 1.8 What is sched_relax_domain_level ?
1.9 How do I use cpusets ? 1.9 How do I use cpusets ?
2. Usage Examples and Syntax 2. Usage Examples and Syntax
2.1 Basic Usage 2.1 Basic Usage
2.2 Adding/removing cpus 2.2 Adding/removing cpus
2.3 Setting flags 2.3 Setting flags
2.4 Attaching processes 2.4 Attaching processes
3. Questions 3. Questions
4. Contact 4. Contact
1. Cpusets 1. Cpusets
========== ==========
@ -48,7 +49,7 @@ hooks, beyond what is already present, required to manage dynamic
job placement on large systems. job placement on large systems.
Cpusets use the generic cgroup subsystem described in Cpusets use the generic cgroup subsystem described in
Documentation/cgroup-v1/cgroups.txt. Documentation/cgroup-v1/cgroups.rst.
Requests by a task, using the sched_setaffinity(2) system call to Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and include CPUs in its CPU affinity mask, and using the mbind(2) and
@ -157,7 +158,7 @@ modifying cpusets is via this cpuset file system.
The /proc/<pid>/status file for each task has four added lines, The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled) displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory), and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example: in the two formats seen in the following example::
Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list: 0-127 Cpus_allowed_list: 0-127
@ -181,6 +182,7 @@ files describing that cpuset:
- cpuset.sched_relax_domain_level: the searching range when migrating tasks - cpuset.sched_relax_domain_level: the searching range when migrating tasks
In addition, only the root cpuset has the following file: In addition, only the root cpuset has the following file:
- cpuset.memory_pressure_enabled flag: compute memory_pressure? - cpuset.memory_pressure_enabled flag: compute memory_pressure?
New cpusets are created using the mkdir system call or shell New cpusets are created using the mkdir system call or shell
@ -266,7 +268,8 @@ to monitor a cpuset for signs of memory pressure. It's up to the
batch manager or other user code to decide what to do about it and batch manager or other user code to decide what to do about it and
take action. take action.
==> Unless this feature is enabled by writing "1" to the special file ==>
Unless this feature is enabled by writing "1" to the special file
/dev/cpuset/memory_pressure_enabled, the hook in the rebalance /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
code of __alloc_pages() for this metric reduces to simply noticing code of __alloc_pages() for this metric reduces to simply noticing
that the cpuset_memory_pressure_enabled flag is zero. So only that the cpuset_memory_pressure_enabled flag is zero. So only
@ -399,6 +402,7 @@ have tasks running on them unless explicitly assigned.
This default load balancing across all CPUs is not well suited for This default load balancing across all CPUs is not well suited for
the following two situations: the following two situations:
1) On large systems, load balancing across many CPUs is expensive. 1) On large systems, load balancing across many CPUs is expensive.
If the system is managed using cpusets to place independent jobs If the system is managed using cpusets to place independent jobs
on separate sets of CPUs, full load balancing is unnecessary. on separate sets of CPUs, full load balancing is unnecessary.
@ -501,6 +505,7 @@ all the CPUs that must be load balanced.
The cpuset code builds a new such partition and passes it to the The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever: as necessary, whenever:
- the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes, - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
- or CPUs come or go from a cpuset with this flag enabled, - or CPUs come or go from a cpuset with this flag enabled,
- or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs
@ -553,13 +558,15 @@ this searching range as you like. This file takes int value which
indicates size of searching range in levels ideally as follows, indicates size of searching range in levels ideally as follows,
otherwise initial value -1 that indicates the cpuset has no request. otherwise initial value -1 that indicates the cpuset has no request.
-1 : no request. use system default or follow request of others. ====== ===========================================================
0 : no search. -1 no request. use system default or follow request of others.
1 : search siblings (hyperthreads in a core). 0 no search.
2 : search cores in a package. 1 search siblings (hyperthreads in a core).
3 : search cpus in a node [= system wide on non-NUMA system] 2 search cores in a package.
4 : search nodes in a chunk of node [on NUMA system] 3 search cpus in a node [= system wide on non-NUMA system]
5 : search system wide [on NUMA system] 4 search nodes in a chunk of node [on NUMA system]
5 search system wide [on NUMA system]
====== ===========================================================
The system default is architecture dependent. The system default The system default is architecture dependent. The system default
can be changed using the relax_domain_level= boot parameter. can be changed using the relax_domain_level= boot parameter.
@ -578,13 +585,14 @@ and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure. Don't modify this file if you are not sure.
If your situation is: If your situation is:
- The migration costs between each cpu can be assumed considerably - The migration costs between each cpu can be assumed considerably
small (for you) due to your special application's behavior or small (for you) due to your special application's behavior or
special hardware support for CPU cache etc. special hardware support for CPU cache etc.
- The searching cost doesn't have an impact (for you) or you can make - The searching cost doesn't have an impact (for you) or you can make
the searching cost small enough by managing cpusets to be compact etc. the searching cost small enough by managing cpusets to be compact etc.
- Low latency is required even if it sacrifices cache hit rate etc. - Low latency is required even if it sacrifices cache hit rate etc.
then increasing 'sched_relax_domain_level' would benefit you. then increasing 'sched_relax_domain_level' would benefit you.
1.9 How do I use cpusets ? 1.9 How do I use cpusets ?
@ -678,7 +686,7 @@ To start a new job that is to be contained within a cpuset, the steps are:
For example, the following sequence of commands will setup a cpuset For example, the following sequence of commands will setup a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset: and then start a subshell 'sh' in that cpuset::
mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
cd /sys/fs/cgroup/cpuset cd /sys/fs/cgroup/cpuset
@ -693,6 +701,7 @@ and then start a subshell 'sh' in that cpuset:
cat /proc/self/cpuset cat /proc/self/cpuset
There are ways to query or modify cpusets: There are ways to query or modify cpusets:
- via the cpuset file system directly, using the various cd, mkdir, echo, - via the cpuset file system directly, using the various cd, mkdir, echo,
cat, rmdir commands from the shell, or their equivalent from C. cat, rmdir commands from the shell, or their equivalent from C.
- via the C library libcpuset. - via the C library libcpuset.
@ -722,115 +731,133 @@ Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
is the cpuset that holds the whole system. is the cpuset that holds the whole system.
If you want to create a new cpuset under /sys/fs/cgroup/cpuset: If you want to create a new cpuset under /sys/fs/cgroup/cpuset::
# cd /sys/fs/cgroup/cpuset
# mkdir my_cpuset
Now you want to do something with this cpuset. # cd /sys/fs/cgroup/cpuset
# cd my_cpuset # mkdir my_cpuset
In this directory you can find several files: Now you want to do something with this cpuset::
# ls
cgroup.clone_children cpuset.memory_pressure # cd my_cpuset
cgroup.event_control cpuset.memory_spread_page
cgroup.procs cpuset.memory_spread_slab In this directory you can find several files::
cpuset.cpu_exclusive cpuset.mems
cpuset.cpus cpuset.sched_load_balance # ls
cpuset.mem_exclusive cpuset.sched_relax_domain_level cgroup.clone_children cpuset.memory_pressure
cpuset.mem_hardwall notify_on_release cgroup.event_control cpuset.memory_spread_page
cpuset.memory_migrate tasks cgroup.procs cpuset.memory_spread_slab
cpuset.cpu_exclusive cpuset.mems
cpuset.cpus cpuset.sched_load_balance
cpuset.mem_exclusive cpuset.sched_relax_domain_level
cpuset.mem_hardwall notify_on_release
cpuset.memory_migrate tasks
Reading them will give you information about the state of this cpuset: Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using the CPUs and Memory Nodes it can use, the processes that are using
it, its properties. By writing to these files you can manipulate it, its properties. By writing to these files you can manipulate
the cpuset. the cpuset.
Set some flags: Set some flags::
# /bin/echo 1 > cpuset.cpu_exclusive
Add some cpus: # /bin/echo 1 > cpuset.cpu_exclusive
# /bin/echo 0-7 > cpuset.cpus
Add some mems: Add some cpus::
# /bin/echo 0-7 > cpuset.mems
Now attach your shell to this cpuset: # /bin/echo 0-7 > cpuset.cpus
# /bin/echo $$ > tasks
Add some mems::
# /bin/echo 0-7 > cpuset.mems
Now attach your shell to this cpuset::
# /bin/echo $$ > tasks
You can also create cpusets inside your cpuset by using mkdir in this You can also create cpusets inside your cpuset by using mkdir in this
directory. directory::
# mkdir my_sub_cs
# mkdir my_sub_cs
To remove a cpuset, just use rmdir::
# rmdir my_sub_cs
To remove a cpuset, just use rmdir:
# rmdir my_sub_cs
This will fail if the cpuset is in use (has cpusets inside, or has This will fail if the cpuset is in use (has cpusets inside, or has
processes attached). processes attached).
Note that for legacy reasons, the "cpuset" filesystem exists as a Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem. wrapper around the cgroup filesystem.
The command The command::
mount -t cpuset X /sys/fs/cgroup/cpuset mount -t cpuset X /sys/fs/cgroup/cpuset
is equivalent to is equivalent to::
mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent
2.2 Adding/removing cpus 2.2 Adding/removing cpus
------------------------ ------------------------
This is the syntax to use when writing in the cpus or mems files This is the syntax to use when writing in the cpus or mems files
in cpuset directories: in cpuset directories::
# /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 # /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
# /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4
To add a CPU to a cpuset, write the new list of CPUs including the To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added. To add 6 to the above cpuset: CPU to be added. To add 6 to the above cpuset::
# /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6 # /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6
Similarly to remove a CPU from a cpuset, write the new list of CPUs Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed. without the CPU to be removed.
To remove all the CPUs: To remove all the CPUs::
# /bin/echo "" > cpuset.cpus -> clear cpus list # /bin/echo "" > cpuset.cpus -> clear cpus list
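The list syntax above can be explored without touching a real cpuset. The following is a minimal Python sketch, not the kernel's parser: the helper names `parse_cpu_list` and `format_cpu_list` are hypothetical, and only the plain numbers and `first-last` ranges shown in this section are handled.

```python
def parse_cpu_list(text):
    """Expand a list such as "1-4,6" into a sorted list of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string means an empty list
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"bad range: {part!r}")
            cpus.update(range(first, last + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def format_cpu_list(cpus):
    """Collapse CPU numbers back into the compact range form."""
    cpus = sorted(set(cpus))
    parts = []
    i = 0
    while i < len(cpus):
        j = i
        # Extend the run while the next CPU number is consecutive.
        while j + 1 < len(cpus) and cpus[j + 1] == cpus[j] + 1:
            j += 1
        parts.append(str(cpus[i]) if i == j else f"{cpus[i]}-{cpus[j]}")
        i = j + 1
    return ",".join(parts)
```

Because cpuset.cpus is always overwritten with the complete list, adding CPU 6 amounts to parsing the current value, appending 6, and writing back the formatted result.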
2.3 Setting flags 2.3 Setting flags
----------------- -----------------
The syntax is very simple: The syntax is very simple::
# /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive' # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive'
# /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive' # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive'
2.4 Attaching processes 2.4 Attaching processes
----------------------- -----------------------
# /bin/echo PID > tasks ::
# /bin/echo PID > tasks
Note that it is PID, not PIDs. You can only attach ONE task at a time. Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another: If you have several tasks to attach, you have to do it one after another::
# /bin/echo PID1 > tasks # /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks # /bin/echo PID2 > tasks
... ...
# /bin/echo PIDn > tasks # /bin/echo PIDn > tasks
3. Questions 3. Questions
============ ============
Q: what's up with this '/bin/echo' ? Q:
A: bash's builtin 'echo' command does not check calls to write() against what's up with this '/bin/echo' ?
A:
bash's builtin 'echo' command does not check calls to write() against
errors. If you use it in the cpuset file system, you won't be errors. If you use it in the cpuset file system, you won't be
able to tell whether a command succeeded or failed. able to tell whether a command succeeded or failed.
Q: When I attach processes, only the first of the line gets really attached ! Q:
A: We can only return one error code per call to write(). So you should also When I attach processes, only the first of the line gets really attached !
A:
We can only return one error code per call to write(). So you should also
put only ONE pid. put only ONE pid.
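The two answers above (check every write for errors, and put only one PID in each write) can be combined in a small helper. This is a hedged Python sketch; `attach_tasks` is a hypothetical name and the path of the tasks file is supplied by the caller.

```python
import os


def attach_tasks(tasks_path, pids):
    """Write each PID to tasks_path with its own write() call.

    Returns a dict mapping every PID that failed to the OSError raised,
    so the caller can tell exactly which attachments did not succeed.
    """
    failures = {}
    for pid in pids:
        try:
            # One open/write/close per PID, like one /bin/echo per PID.
            fd = os.open(tasks_path, os.O_WRONLY)
            try:
                os.write(fd, f"{pid}\n".encode())
            finally:
                os.close(fd)
        except OSError as err:
            failures[pid] = err
    return failures
```

Unlike a shell builtin that ignores write() errors, a failed write here raises OSError, which the helper records against the PID that caused it.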
4. Contact 4. Contact


@ -1,6 +1,9 @@
===========================
Device Whitelist Controller Device Whitelist Controller
===========================
1. Description: 1. Description
==============
Implement a cgroup to track and enforce open and mknod restrictions Implement a cgroup to track and enforce open and mknod restrictions
on device files. A device cgroup associates a device access on device files. A device cgroup associates a device access
@ -16,24 +19,26 @@ devices from the whitelist or add new entries. A child cgroup can
never receive a device access which is denied by its parent. never receive a device access which is denied by its parent.
2. User Interface 2. User Interface
=================
An entry is added using devices.allow, and removed using An entry is added using devices.allow, and removed using
devices.deny. For instance devices.deny. For instance::
echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
allows cgroup 1 to read and mknod the device usually known as allows cgroup 1 to read and mknod the device usually known as
/dev/null. Doing /dev/null. Doing::
echo a > /sys/fs/cgroup/1/devices.deny echo a > /sys/fs/cgroup/1/devices.deny
will remove the default 'a *:* rwm' entry. Doing will remove the default 'a *:* rwm' entry. Doing::
echo a > /sys/fs/cgroup/1/devices.allow echo a > /sys/fs/cgroup/1/devices.allow
will add the 'a *:* rwm' entry to the whitelist. will add the 'a *:* rwm' entry to the whitelist.
3. Security 3. Security
===========
Any task can move itself between cgroups. This clearly won't Any task can move itself between cgroups. This clearly won't
suffice, but we can decide the best way to adequately restrict suffice, but we can decide the best way to adequately restrict
@ -50,6 +55,7 @@ A cgroup may not be granted more permissions than the cgroup's
parent has. parent has.
4. Hierarchy 4. Hierarchy
============
device cgroups maintain hierarchy by making sure a cgroup never has more device cgroups maintain hierarchy by making sure a cgroup never has more
access permissions than its parent. Every time an entry is written to access permissions than its parent. Every time an entry is written to
@ -58,7 +64,8 @@ from their whitelist and all the locally set whitelist entries will be
re-evaluated. In case one of the locally set whitelist entries would provide re-evaluated. In case one of the locally set whitelist entries would provide
more access than the cgroup's parent, it'll be removed from the whitelist. more access than the cgroup's parent, it'll be removed from the whitelist.
Example: Example::
A A
/ \ / \
B B
@ -67,10 +74,12 @@ Example:
A allow "b 8:* rwm", "c 116:1 rw" A allow "b 8:* rwm", "c 116:1 rw"
B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm" B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm"
If a device is denied in group A: If a device is denied in group A::
# echo "c 116:* r" > A/devices.deny # echo "c 116:* r" > A/devices.deny
it'll propagate down and after revalidating B's entries, the whitelist entry it'll propagate down and after revalidating B's entries, the whitelist entry
"c 116:2 rwm" will be removed: "c 116:2 rwm" will be removed::
group whitelist entries denied devices group whitelist entries denied devices
A all "b 8:* rwm", "c 116:* rw" A all "b 8:* rwm", "c 116:* rw"
@ -79,7 +88,8 @@ it'll propagate down and after revalidating B's entries, the whitelist entry
In case parent's exceptions change and local exceptions are not allowed In case parent's exceptions change and local exceptions are not allowed
anymore, they'll be deleted. anymore, they'll be deleted.
Notice that new whitelist entries will not be propagated: Notice that new whitelist entries will not be propagated::
A A
/ \ / \
B B
@ -88,24 +98,30 @@ Notice that new whitelist entries will not be propagated:
A "c 1:3 rwm", "c 1:5 r" all the rest A "c 1:3 rwm", "c 1:5 r" all the rest
B "c 1:3 rwm", "c 1:5 r" all the rest B "c 1:3 rwm", "c 1:5 r" all the rest
when adding "c *:3 rwm": when adding ``c *:3 rwm``::
# echo "c *:3 rwm" >A/devices.allow # echo "c *:3 rwm" >A/devices.allow
the result: the result::
group whitelist entries denied devices group whitelist entries denied devices
A "c *:3 rwm", "c 1:5 r" all the rest A "c *:3 rwm", "c 1:5 r" all the rest
B "c 1:3 rwm", "c 1:5 r" all the rest B "c 1:3 rwm", "c 1:5 r" all the rest
but now it'll be possible to add new entries to B: but now it'll be possible to add new entries to B::
# echo "c 2:3 rwm" >B/devices.allow # echo "c 2:3 rwm" >B/devices.allow
# echo "c 50:3 r" >B/devices.allow # echo "c 50:3 r" >B/devices.allow
or even
or even::
# echo "c *:3 rwm" >B/devices.allow # echo "c *:3 rwm" >B/devices.allow
Allowing or denying all by writing 'a' to devices.allow or devices.deny will Allowing or denying all by writing 'a' to devices.allow or devices.deny will
not be possible once the device cgroup has children. not be possible once the device cgroup has children.
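The rule that a child can only gain an entry its parent already permits can be illustrated with a simplified model. This Python sketch is an assumption-laden approximation, not the kernel's implementation: the function names are hypothetical, and a child entry is accepted only when one single parent entry covers it completely.

```python
def parse_entry(entry):
    """Split an entry such as "c 1:3 rwm" into (type, major, minor, access)."""
    dev_type, numbers, access = entry.split()
    major, minor = numbers.split(":")
    return dev_type, major, minor, set(access)


def covers(parent_entry, child_entry):
    """Return True if parent_entry grants everything child_entry asks for."""
    p_type, p_major, p_minor, p_access = parse_entry(parent_entry)
    c_type, c_major, c_minor, c_access = parse_entry(child_entry)
    if p_type != "a" and p_type != c_type:
        return False
    # "*" in the parent matches any number; "*" in the child needs "*" above.
    if p_major != "*" and p_major != c_major:
        return False
    if p_minor != "*" and p_minor != c_minor:
        return False
    return c_access <= p_access


def allowed_in_child(parent_whitelist, child_entry):
    """Check a proposed child entry against the parent's whitelist."""
    return any(covers(p, child_entry) for p in parent_whitelist)
```

With A's whitelist from the example, "c 2:3 rwm" is rejected for B before "c *:3 rwm" is added to A and accepted afterwards, which matches the sequence described above.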
4.1 Hierarchy (internal implementation) 4.1 Hierarchy (internal implementation)
---------------------------------------
device cgroups is implemented internally using a behavior (ALLOW, DENY) and a device cgroups is implemented internally using a behavior (ALLOW, DENY) and a
list of exceptions. The internal state is controlled using the same user list of exceptions. The internal state is controlled using the same user


@ -1,3 +1,7 @@
==============
Cgroup Freezer
==============
The cgroup freezer is useful to batch job management systems which start The cgroup freezer is useful to batch job management systems which start
and stop sets of tasks in order to schedule the resources of a machine and stop sets of tasks in order to schedule the resources of a machine
according to the desires of a system administrator. This sort of program according to the desires of a system administrator. This sort of program
@ -23,7 +27,7 @@ blocked, or ignored it can be seen by waiting or ptracing parent tasks.
SIGCONT is especially unsuitable since it can be caught by the task. Any SIGCONT is especially unsuitable since it can be caught by the task. Any
programs designed to watch for SIGSTOP and SIGCONT could be broken by programs designed to watch for SIGSTOP and SIGCONT could be broken by
attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
demonstrate this problem using nested bash shells: demonstrate this problem using nested bash shells::
$ echo $$ $ echo $$
16644 16644
@ -93,19 +97,19 @@ The following cgroupfs files are created by cgroup freezer.
The root cgroup is non-freezable and the above interface files don't The root cgroup is non-freezable and the above interface files don't
exist. exist.
* Examples of usage : * Examples of usage::
# mkdir /sys/fs/cgroup/freezer # mkdir /sys/fs/cgroup/freezer
# mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
# mkdir /sys/fs/cgroup/freezer/0 # mkdir /sys/fs/cgroup/freezer/0
# echo $some_pid > /sys/fs/cgroup/freezer/0/tasks # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks
to get status of the freezer subsystem : to get status of the freezer subsystem::
# cat /sys/fs/cgroup/freezer/0/freezer.state # cat /sys/fs/cgroup/freezer/0/freezer.state
THAWED THAWED
to freeze all tasks in the container : to freeze all tasks in the container::
# echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
# cat /sys/fs/cgroup/freezer/0/freezer.state # cat /sys/fs/cgroup/freezer/0/freezer.state
@ -113,7 +117,7 @@ to freeze all tasks in the container :
# cat /sys/fs/cgroup/freezer/0/freezer.state # cat /sys/fs/cgroup/freezer/0/freezer.state
FROZEN FROZEN
to unfreeze all tasks in the container : to unfreeze all tasks in the container::
# echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
# cat /sys/fs/cgroup/freezer/0/freezer.state # cat /sys/fs/cgroup/freezer/0/freezer.state


@ -1,5 +1,6 @@
==================
HugeTLB Controller HugeTLB Controller
------------------- ==================
The HugeTLB controller allows limiting the HugeTLB usage per control group and The HugeTLB controller allows limiting the HugeTLB usage per control group and
enforces the controller limit during page fault. Since HugeTLB doesn't enforces the controller limit during page fault. Since HugeTLB doesn't
@ -16,16 +17,16 @@ With the above step, the initial or the parent HugeTLB group becomes
visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup. the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
New groups can be created under the parent group /sys/fs/cgroup. New groups can be created under the parent group /sys/fs/cgroup::
# cd /sys/fs/cgroup # cd /sys/fs/cgroup
# mkdir g1 # mkdir g1
# echo $$ > g1/tasks # echo $$ > g1/tasks
The above steps create a new group g1 and move the current shell The above steps create a new group g1 and move the current shell
process (bash) into it. process (bash) into it.
Brief summary of control files Brief summary of control files::
hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
@ -33,17 +34,17 @@ Brief summary of control files
hugetlb.<hugepagesize>.failcnt # show the number of allocation failures due to HugeTLB limit hugetlb.<hugepagesize>.failcnt # show the number of allocation failures due to HugeTLB limit
For a system supporting three hugepage sizes (64k, 32M and 1G), the control For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include: files include::
hugetlb.1GB.limit_in_bytes hugetlb.1GB.limit_in_bytes
hugetlb.1GB.max_usage_in_bytes hugetlb.1GB.max_usage_in_bytes
hugetlb.1GB.usage_in_bytes hugetlb.1GB.usage_in_bytes
hugetlb.1GB.failcnt hugetlb.1GB.failcnt
hugetlb.64KB.limit_in_bytes hugetlb.64KB.limit_in_bytes
hugetlb.64KB.max_usage_in_bytes hugetlb.64KB.max_usage_in_bytes
hugetlb.64KB.usage_in_bytes hugetlb.64KB.usage_in_bytes
hugetlb.64KB.failcnt hugetlb.64KB.failcnt
hugetlb.32MB.limit_in_bytes hugetlb.32MB.limit_in_bytes
hugetlb.32MB.max_usage_in_bytes hugetlb.32MB.max_usage_in_bytes
hugetlb.32MB.usage_in_bytes hugetlb.32MB.usage_in_bytes
hugetlb.32MB.failcnt hugetlb.32MB.failcnt
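The naming pattern in the listing above can be reproduced with a short Python sketch. The unit-selection rule (largest of KB, MB, GB that fits) is an assumption inferred from the three sizes shown, and the helper names are hypothetical.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

# The four per-size control files named in the summary above.
SUFFIXES = ("limit_in_bytes", "max_usage_in_bytes", "usage_in_bytes", "failcnt")


def hugepage_label(size_bytes):
    """Render a hugepage size in bytes as a label such as "64KB" or "1GB"."""
    if size_bytes >= GB:
        return f"{size_bytes // GB}GB"
    if size_bytes >= MB:
        return f"{size_bytes // MB}MB"
    return f"{size_bytes // KB}KB"


def control_files(size_bytes):
    """List the control file names for one hugepage size."""
    label = hugepage_label(size_bytes)
    return [f"hugetlb.{label}.{suffix}" for suffix in SUFFIXES]
```

Calling `control_files` for each of 64 KiB, 32 MiB and 1 GiB yields the twelve names in the listing, four per hugepage size.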


@ -0,0 +1,30 @@
:orphan:
========================
Control Groups version 1
========================
.. toctree::
:maxdepth: 1
cgroups
blkio-controller
cpuacct
cpusets
devices
freezer-subsystem
hugetlb
memcg_test
memory
net_cls
net_prio
pids
rdma
.. only:: subproject and html
Indices
=======
* :ref:`genindex`


@ -1,32 +1,43 @@
Memory Resource Controller(Memcg) Implementation Memo. =====================================================
Memory Resource Controller(Memcg) Implementation Memo
=====================================================
Last Updated: 2010/2 Last Updated: 2010/2
Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34). Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
Because VM is getting complex (one of reasons is memcg...), memcg's behavior Because VM is getting complex (one of reasons is memcg...), memcg's behavior
is complex. This is a document for memcg's internal behavior. is complex. This is a document for memcg's internal behavior.
Please note that implementation details can be changed. Please note that implementation details can be changed.
(*) Topics on API should be in Documentation/cgroup-v1/memory.txt) (*) Topics on API should be in Documentation/cgroup-v1/memory.rst)
0. How to record usage ? 0. How to record usage ?
========================
2 objects are used. 2 objects are used.
page_cgroup ....an object per page. page_cgroup ....an object per page.
Allocated at boot or memory hotplug. Freed at memory hot removal. Allocated at boot or memory hotplug. Freed at memory hot removal.
swap_cgroup ... an entry per swp_entry. swap_cgroup ... an entry per swp_entry.
Allocated at swapon(). Freed at swapoff(). Allocated at swapon(). Freed at swapoff().
The page_cgroup has USED bit and double count against a page_cgroup never The page_cgroup has USED bit and double count against a page_cgroup never
occurs. swap_cgroup is used only when a charged page is swapped-out. occurs. swap_cgroup is used only when a charged page is swapped-out.
1. Charge 1. Charge
=========
a page/swp_entry may be charged (usage += PAGE_SIZE) at a page/swp_entry may be charged (usage += PAGE_SIZE) at
mem_cgroup_try_charge() mem_cgroup_try_charge()
2. Uncharge 2. Uncharge
===========
a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
mem_cgroup_uncharge() mem_cgroup_uncharge()
@ -37,9 +48,12 @@ Please note that implementation details can be changed.
disappears. disappears.
3. charge-commit-cancel 3. charge-commit-cancel
=======================
Memcg pages are charged in two steps: Memcg pages are charged in two steps:
mem_cgroup_try_charge()
mem_cgroup_commit_charge() or mem_cgroup_cancel_charge() - mem_cgroup_try_charge()
- mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
At try_charge(), there are no flags to say "this page is charged". At try_charge(), there are no flags to say "this page is charged".
at this point, usage += PAGE_SIZE. at this point, usage += PAGE_SIZE.
@ -51,6 +65,8 @@ Please note that implementation details can be changed.
Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
4. Anonymous 4. Anonymous
============
Anonymous page is newly allocated at Anonymous page is newly allocated at
- page fault into MAP_ANONYMOUS mapping. - page fault into MAP_ANONYMOUS mapping.
- Copy-On-Write. - Copy-On-Write.
@ -78,34 +94,45 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
5. Page Cache 5. Page Cache
Page Cache is charged at =============
Page Cache is charged at
- add_to_page_cache_locked(). - add_to_page_cache_locked().
The logic is very clear. (About migration, see below) The logic is very clear. (About migration, see below)
Note: __remove_from_page_cache() is called by remove_from_page_cache()
and __remove_mapping(). Note:
__remove_from_page_cache() is called by remove_from_page_cache()
and __remove_mapping().
6. Shmem(tmpfs) Page Cache 6. Shmem(tmpfs) Page Cache
===========================
The best way to understand shmem's page state transition is to read The best way to understand shmem's page state transition is to read
mm/shmem.c. mm/shmem.c.
But brief explanation of the behavior of memcg around shmem will be But brief explanation of the behavior of memcg around shmem will be
helpful to understand the logic. helpful to understand the logic.
Shmem's page (just leaf page, not direct/indirect block) can be on Shmem's page (just leaf page, not direct/indirect block) can be on
- radix-tree of shmem's inode. - radix-tree of shmem's inode.
- SwapCache. - SwapCache.
- Both on radix-tree and SwapCache. This happens at swap-in - Both on radix-tree and SwapCache. This happens at swap-in
and swap-out, and swap-out,
It's charged when... It's charged when...
- A new page is added to shmem's radix-tree. - A new page is added to shmem's radix-tree.
- A swp page is read. (move a charge from swap_cgroup to page_cgroup) - A swp page is read. (move a charge from swap_cgroup to page_cgroup)
7. Page Migration 7. Page Migration
=================
mem_cgroup_migrate() mem_cgroup_migrate()
8. LRU 8. LRU
======
Each memcg has its own private LRU. Now, its handling is under global Each memcg has its own private LRU. Now, its handling is under global
VM's control (means that it's handled under global pgdat->lru_lock). VM's control (means that it's handled under global pgdat->lru_lock).
Almost all routines around memcg's LRU are called by global LRU's Almost all routines around memcg's LRU are called by global LRU's
@ -114,163 +141,211 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
A special function is mem_cgroup_isolate_pages(). This scans A special function is mem_cgroup_isolate_pages(). This scans
memcg's private LRU and calls __isolate_lru_page() to extract a page memcg's private LRU and calls __isolate_lru_page() to extract a page
from LRU. from LRU.
(By __isolate_lru_page(), the page is removed from both of global and (By __isolate_lru_page(), the page is removed from both of global and
private LRU.) private LRU.)
9. Typical Tests. 9. Typical Tests.
=================
Tests for racy cases. Tests for racy cases.
9.1 Small limit to memcg. 9.1 Small limit to memcg.
-------------------------
When you do test to do racy case, it's good test to set memcg's limit When you do test to do racy case, it's good test to set memcg's limit
to be very small rather than GB. Many races found in the test under to be very small rather than GB. Many races found in the test under
xKB or xxMB limits. xKB or xxMB limits.
(Memory behavior under GB and Memory behavior under MB shows very
different situation.)
9.2 Shmem (Memory behavior under GB and Memory behavior under MB shows very
different situation.)
9.2 Shmem
---------
Historically, memcg's shmem handling was poor and we saw some amount Historically, memcg's shmem handling was poor and we saw some amount
of troubles here. This is because shmem is page-cache but can be of troubles here. This is because shmem is page-cache but can be
SwapCache. Test with shmem/tmpfs is always good test. SwapCache. Test with shmem/tmpfs is always good test.
9.3 Migration 9.3 Migration
-------------
For NUMA, migration is an another special case. To do easy test, cpuset For NUMA, migration is an another special case. To do easy test, cpuset
is useful. Following is a sample script to do migration. is useful. Following is a sample script to do migration::
mount -t cgroup -o cpuset none /opt/cpuset mount -t cgroup -o cpuset none /opt/cpuset
mkdir /opt/cpuset/01 mkdir /opt/cpuset/01
echo 1 > /opt/cpuset/01/cpuset.cpus echo 1 > /opt/cpuset/01/cpuset.cpus
echo 0 > /opt/cpuset/01/cpuset.mems echo 0 > /opt/cpuset/01/cpuset.mems
echo 1 > /opt/cpuset/01/cpuset.memory_migrate echo 1 > /opt/cpuset/01/cpuset.memory_migrate
mkdir /opt/cpuset/02 mkdir /opt/cpuset/02
echo 1 > /opt/cpuset/02/cpuset.cpus echo 1 > /opt/cpuset/02/cpuset.cpus
echo 1 > /opt/cpuset/02/cpuset.mems echo 1 > /opt/cpuset/02/cpuset.mems
echo 1 > /opt/cpuset/02/cpuset.memory_migrate echo 1 > /opt/cpuset/02/cpuset.memory_migrate
In above set, when you moves a task from 01 to 02, page migration to In above set, when you moves a task from 01 to 02, page migration to
node 0 to node 1 will occur. Following is a script to migrate all node 0 to node 1 will occur. Following is a script to migrate all
under cpuset. under cpuset.::
--
move_task() --
{ move_task()
for pid in $1 {
do for pid in $1
/bin/echo $pid >$2/tasks 2>/dev/null do
echo -n $pid /bin/echo $pid >$2/tasks 2>/dev/null
echo -n " " echo -n $pid
done echo -n " "
echo END done
} echo END
}
G1_TASK=`cat ${G1}/tasks`
G2_TASK=`cat ${G2}/tasks`
move_task "${G1_TASK}" ${G2} &
--
9.4 Memory hotplug
------------------
G1_TASK=`cat ${G1}/tasks`
G2_TASK=`cat ${G2}/tasks`
move_task "${G1_TASK}" ${G2} &
--
9.4 Memory hotplug.
memory hotplug test is one of good test. memory hotplug test is one of good test.
to offline memory, do following.
# echo offline > /sys/devices/system/memory/memoryXXX/state to offline memory, do following::
# echo offline > /sys/devices/system/memory/memoryXXX/state
(XXX is the place of memory) (XXX is the place of memory)
This is an easy way to test page migration, too. This is an easy way to test page migration, too.
9.5 mkdir/rmdir 9.5 mkdir/rmdir
---------------
When using hierarchy, mkdir/rmdir test should be done. When using hierarchy, mkdir/rmdir test should be done.
Use tests like the following. Use tests like the following::
echo 1 >/opt/cgroup/01/memory/use_hierarchy echo 1 >/opt/cgroup/01/memory/use_hierarchy
mkdir /opt/cgroup/01/child_a mkdir /opt/cgroup/01/child_a
mkdir /opt/cgroup/01/child_b mkdir /opt/cgroup/01/child_b
set limit to 01. set limit to 01.
add limit to 01/child_b add limit to 01/child_b
run jobs under child_a and child_b run jobs under child_a and child_b
create/delete following groups at random while jobs are running. create/delete following groups at random while jobs are running::
/opt/cgroup/01/child_a/child_aa
/opt/cgroup/01/child_b/child_bb /opt/cgroup/01/child_a/child_aa
/opt/cgroup/01/child_c /opt/cgroup/01/child_b/child_bb
/opt/cgroup/01/child_c
running new jobs in new group is also good. running new jobs in new group is also good.
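
The random create/delete step in 9.5 could be driven by a small loop along
the following lines. This is only a sketch: the function names
``toggle_group`` and ``stress_groups``, the round count, and the awk-based
seeding are assumptions made for illustration, not part of the original
test procedure.

```shell
# Create the group directory if it is missing, remove it if it exists.
# rmdir failures (for example a cgroup that still has tasks) are ignored.
toggle_group() {
    if [ -d "$1" ]; then
        rmdir "$1" 2>/dev/null || true
    else
        mkdir -p "$1" 2>/dev/null || true
    fi
}

# Run $2 rounds under base directory $1, picking one of the three
# candidate groups pseudo-randomly in each round.
stress_groups() {
    base=$1
    rounds=$2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # The seed scheme (shell pid + round number) is arbitrary.
        pick=$(awk -v seed="$(( $$ + i ))" 'BEGIN { srand(seed); print int(rand() * 3) }')
        case $pick in
            0) toggle_group "$base/child_a/child_aa" ;;
            1) toggle_group "$base/child_b/child_bb" ;;
            2) toggle_group "$base/child_c" ;;
        esac
        i=$(( i + 1 ))
    done
}

# Example (while jobs run in child_a and child_b):
#   stress_groups /opt/cgroup/01 1000
```

Running several instances of the loop in parallel should increase the chance
of hitting mkdir/rmdir races.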
9.6 Mount with other subsystems
-------------------------------

Mounting with other subsystems is a good test because there is a
race and lock dependency with other cgroup subsystems.

example::

    # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

and do task move, mkdir, rmdir etc...under this.

9.7 swapoff
-----------

Besides management of swap is one of complicated parts of memcg,
call path of swap-in at swapoff is not same as usual swap-in path..
It's worth to be tested explicitly.

For example, test like following is good:

(Shell-A)::

    # mount -t cgroup none /cgroup -o memory
    # mkdir /cgroup/test
    # echo 40M > /cgroup/test/memory.limit_in_bytes
    # echo 0 > /cgroup/test/tasks

Run malloc(100M) program under this. You'll see 60M of swaps.

(Shell-B)::

    # move all tasks in /cgroup/test to /cgroup
    # /sbin/swapoff -a
    # rmdir /cgroup/test
    # kill malloc task.

Of course, tmpfs v.s. swapoff test should be tested, too.
9.8 OOM-Killer
--------------

Out-of-memory caused by memcg's limit will kill tasks under
the memcg. When hierarchy is used, a task under hierarchy
will be killed by the kernel.

In this case, panic_on_oom shouldn't be invoked and tasks
in other groups shouldn't be killed.

It's not difficult to cause OOM under memcg as following.

Case A) when you can swapoff::

    #swapoff -a
    #echo 50M > /memory.limit_in_bytes

run 51M of malloc

Case B) when you use mem+swap limitation::

    #echo 50M > memory.limit_in_bytes
    #echo 50M > memory.memsw.limit_in_bytes

run 51M of malloc
9.9 Move charges at task migration
----------------------------------

Charges associated with a task can be moved along with task migration.

(Shell-A)::

    #mkdir /cgroup/A
    #echo $$ >/cgroup/A/tasks

run some programs which uses some amount of memory in /cgroup/A.

(Shell-B)::

    #mkdir /cgroup/B
    #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
    #echo "pid of the program running in group A" >/cgroup/B/tasks

You can see charges have been moved by reading ``*.usage_in_bytes`` or
memory.stat of both A and B.

See 8.2 of Documentation/cgroup-v1/memory.rst to see what value should
be written to move_charge_at_immigrate.
9.10 Memory thresholds
----------------------

Memory controller implements memory thresholds using cgroups notification
API. You can use tools/cgroup/cgroup_event_listener.c to test it.

(Shell-A) Create cgroup and run event listener::

    # mkdir /cgroup/A
    # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M

(Shell-B) Add task to cgroup and try to allocate and free memory::

    # echo $$ >/cgroup/A/tasks
    # a="$(dd if=/dev/zero bs=1M count=10)"
    # a=

You will see message from cgroup_event_listener every time you cross
the thresholds.
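
If cgroup_event_listener is not at hand, crossings can be approximated by
polling the usage file. The sketch below is not what the listener does
internally (it uses the notification API, not polling). The helper names
``crossed_threshold`` and ``watch_usage``, the one-second interval, and the
treatment of a value exactly equal to the threshold are all assumptions.

```shell
# Succeed when usage moved across the threshold between two samples,
# in either direction. Arguments: previous usage, current usage, threshold
# (all in bytes).
crossed_threshold() {
    prev=$1
    cur=$2
    thr=$3
    if [ "$prev" -lt "$thr" ] && [ "$cur" -ge "$thr" ]; then
        return 0
    fi
    if [ "$prev" -ge "$thr" ] && [ "$cur" -lt "$thr" ]; then
        return 0
    fi
    return 1
}

# Sample the usage file $1 once per second, $3 times, and report crossings
# of threshold $2 (in bytes).
watch_usage() {
    file=$1
    thr=$2
    samples=$3
    last=$(cat "$file")
    n=0
    while [ "$n" -lt "$samples" ]; do
        sleep 1
        now=$(cat "$file")
        if crossed_threshold "$last" "$now" "$thr"; then
            echo "crossed $thr: $last -> $now"
        fi
        last=$now
        n=$(( n + 1 ))
    done
}

# Example: watch_usage /cgroup/A/memory.usage_in_bytes 5242880 60
```

Polling can miss short spikes between samples, so it is only a rough
cross-check.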

View File

@ -1,22 +1,26 @@
==========================
Memory Resource Controller
==========================

NOTE:
    This document is hopelessly outdated and it asks for a complete
    rewrite. It still contains a useful information so we are keeping it
    here but make sure to check the current code if you need a deeper
    understanding.

NOTE:
    The Memory Resource Controller has generically been referred to as the
    memory controller in this document. Do not confuse memory controller
    used here with the memory controller that is used in hardware.

(For editors) In this document:
    When we mention a cgroup (cgroupfs's directory) with memory controller,
    we call it "memory cgroup". When you see git-log and source code, you'll
    see patch's title and function names tend to use "memcg".
    In this document, we avoid using it.

Benefits and Purpose of the memory controller
=============================================

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
@ -38,6 +42,7 @@ e. There are several other use cases; find one or use the controller just
Current Status: linux-2.6.34-mmotm(development version of 2010/April)

Features:

- accounting anonymous pages, file caches, swap caches usage and limiting them.
- pages are linked to per-memcg LRU exclusively, and there is no global LRU.
- optionally, memory+swap usage can be accounted and limited.
@ -54,41 +59,48 @@ Features:
Brief summary of control files.

==================================== ==========================================
tasks                                attach a task(thread) and show list of
                                     threads
cgroup.procs                         show list of processes
cgroup.event_control                 an interface for event_fd()
memory.usage_in_bytes                show current usage for memory
                                     (See 5.5 for details)
memory.memsw.usage_in_bytes          show current usage for memory+Swap
                                     (See 5.5 for details)
memory.limit_in_bytes                set/show limit of memory usage
memory.memsw.limit_in_bytes          set/show limit of memory+Swap usage
memory.failcnt                       show the number of memory usage hits limits
memory.memsw.failcnt                 show the number of memory+Swap hits limits
memory.max_usage_in_bytes            show max memory usage recorded
memory.memsw.max_usage_in_bytes      show max memory+Swap usage recorded
memory.soft_limit_in_bytes           set/show soft limit of memory usage
memory.stat                          show various statistics
memory.use_hierarchy                 set/show hierarchical account enabled
memory.force_empty                   trigger forced page reclaim
memory.pressure_level                set memory pressure notifications
memory.swappiness                    set/show swappiness parameter of vmscan
                                     (See sysctl's vm.swappiness)
memory.move_charge_at_immigrate      set/show controls of moving charges
memory.oom_control                   set/show oom controls.
memory.numa_stat                     show the number of memory usage per numa
                                     node

memory.kmem.limit_in_bytes           set/show hard limit for kernel memory
memory.kmem.usage_in_bytes           show current kernel memory allocation
memory.kmem.failcnt                  show the number of kernel memory usage
                                     hits limits
memory.kmem.max_usage_in_bytes       show max kernel memory usage recorded

memory.kmem.tcp.limit_in_bytes       set/show hard limit for tcp buf memory
memory.kmem.tcp.usage_in_bytes       show current tcp buf memory allocation
memory.kmem.tcp.failcnt              show the number of tcp buf memory usage
                                     hits limits
memory.kmem.tcp.max_usage_in_bytes   show max tcp buf memory usage recorded
==================================== ==========================================

1. History
==========

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
@ -103,6 +115,7 @@ at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

2. Memory Control
=================

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
@ -120,6 +133,7 @@ are:
The memory controller is the first controller developed.

2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of
@ -127,6 +141,9 @@ processes associated with the controller. Each cgroup has a memory controller
specific data structure (mem_cgroup) associated with it.

2.2. Accounting
---------------

::

        +--------------------+
        |  mem_cgroup        |
@ -165,6 +182,7 @@ updated. page_cgroup has its own LRU on cgroup.
(*) page_cgroup structure is allocated at boot/memory-hotplug time.

2.2.1 Accounting details
------------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
@ -191,6 +209,7 @@ Note: we just account pages-on-LRU because our purpose is to control amount
of used pages; not-on-LRU pages tend to be out-of-control from VM view.

2.3 Shared Page Accounting
--------------------------

Shared pages are accounted on the basis of the first touch approach. The
cgroup that first touches a page is accounted for the page. The principle
@ -207,11 +226,13 @@ be backed into memory in force, charges for pages are accounted against the
caller of swapoff rather than the users of shmem.

2.4 Swap Extension (CONFIG_MEMCG_SWAP)
--------------------------------------

Swap Extension allows you to record charge for swap. A swapped-in page is
charged back to original page allocator if possible.

When swap is accounted, following files are added.

- memory.memsw.usage_in_bytes.
- memory.memsw.limit_in_bytes.
@ -224,14 +245,16 @@ In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.

**why 'memory+swap' rather than swap**

The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
memory+swap. In other words, when we want to limit the usage of swap without
affecting global LRU, memory+swap limit is better than just limiting swap from
an OS point of view.
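
The accounting argument can be checked with plain arithmetic. The sketch
below is purely illustrative: ``swap_out`` is a hypothetical helper and the
numbers are arbitrary units, not kernel counters.

```shell
# Model a swap-out of $3 units: the charge moves from memory ($1) to
# swap ($2). Prints "memory swap memory+swap" after the move.
swap_out() {
    mem=$1
    swap=$2
    moved=$3
    mem=$(( mem - moved ))
    swap=$(( swap + moved ))
    echo "$mem $swap $(( mem + swap ))"
}

# Example: swap_out 100 20 30
```

Starting from 100 units of memory and 20 of swap (120 in total), moving 30
units leaves 70 and 50, and the memory+swap total stays at 120.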
**What happens when a cgroup hits memory.memsw.limit_in_bytes**

When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by cgroup routine and file
caches are dropped. But as mentioned above, global LRU can do swapout memory
@ -239,6 +262,7 @@ from it for sanity of the system's memory management state. You can't forbid
it by cgroup.

2.5 Reclaim
-----------

Each cgroup maintains a per cgroup LRU which has the same structure as
global VM. When a cgroup goes over its limit, we first try
@ -251,29 +275,36 @@ The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.

NOTE:
    Reclaim does not work for the root cgroup, since we cannot set any
    limits on the root cgroup.

Note2:
    When panic_on_oom is set to "2", the whole system will panic.

When oom event notifier is registered, event will be delivered.
(See oom_control section)

2.6 Locking
-----------

lock_page_cgroup()/unlock_page_cgroup() should not be called under
the i_pages lock.

Other lock order is following:

  PG_locked.
    mm->page_table_lock
      pgdat->lru_lock
        lock_page_cgroup.

In many cases, just lock_page_cgroup() is called.

per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
pgdat->lru_lock, it has no lock of its own.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
@ -288,6 +319,7 @@ Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense.
(currently only for tcp).

The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.
@ -295,36 +327,42 @@ Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------

stack pages:
    every process consumes some stack pages. By accounting into
    kernel memory, we prevent new processes from being created when the kernel
    memory usage is too high.

slab pages:
    pages allocated by the SLAB or SLUB allocator are tracked. A copy
    of each kmem_cache is created every time the cache is touched by the first time
    from inside the memcg. The creation is done lazily, so some objects can still be
    skipped while the cache is being created. All objects in a slab page should
    belong to the same memcg. This only fails to hold when a task is migrated to a
    different memcg during the page allocation by the cache.

sockets memory pressure:
    some sockets protocols have memory pressure
    thresholds. The Memory Controller allows them to be controlled individually
    per cgroup, instead of globally.

tcp memory pressure:
    sockets memory pressure for the tcp protocol.

2.7.2 Common use cases
----------------------

Because the "kmem" counter is fed to the main user counter, kernel memory can
never be limited completely independently of user memory. Say "U" is the user
limit, and "K" the kernel limit. There are three possible ways limits can be
set:

U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before kmem
    accounting. Kernel memory is completely ignored.

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is overcommited.
    Overcommiting kernel memory limits is definitely not recommended, since the
@ -332,19 +370,23 @@ set:
    In this case, the admin could set up K so that the sum of all groups is
    never greater than the total memory, and freely set U at the cost of his
    QoS.

WARNING:
    In the current implementation, memory reclaim will NOT be
    triggered for a cgroup when it hits K while staying below U, which makes
    this setup impractical.

U != 0, K >= U:
    Since kmem charges will also be fed to the user counter and reclaim will be
    triggered for the cgroup for both kinds of memory. This setup gives the
    admin a unified view of memory, and it is also useful for people who just
    want to track kernel memory usage.
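
The three cases above can be told apart with a simple comparison. The
following is a sketch only: ``classify_limits`` and its output labels are
hypothetical, and the literal string ``unlimited`` stands in for whatever
value represents "no kmem limit" on a given system.

```shell
# Classify a (U, K) pair. $1 is the user limit in bytes; $2 is the kernel
# limit in bytes, or the word "unlimited".
classify_limits() {
    u=$1
    k=$2
    if [ "$k" = "unlimited" ]; then
        echo "standard"   # kernel memory is ignored
    elif [ "$k" -lt "$u" ]; then
        echo "subset"     # K < U: no reclaim is triggered when K is hit
    else
        echo "unified"    # K >= U: the user limit drives reclaim for both
    fi
}

# Example: classify_limits 1073741824 unlimited
```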

3. User Interface
=================

3.0. Configuration
------------------

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
@ -352,39 +394,53 @@ c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)

3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------

::

    # mount -t tmpfs none /sys/fs/cgroup
    # mkdir /sys/fs/cgroup/memory
    # mount -t cgroup none /sys/fs/cgroup/memory -o memory

3.2. Make the new group and move bash into it::

    # mkdir /sys/fs/cgroup/memory/0
    # echo $$ > /sys/fs/cgroup/memory/0/tasks

Since now we're in the 0 cgroup, we can alter the memory limit::

    # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

NOTE:
    We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
    mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
    Gibibytes.)

NOTE:
    We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.

NOTE:
    We cannot set limits on the root cgroup any more.

::

    # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
    4194304
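
The value read back is the written ``4M`` expanded with binary multipliers.
The conversion can be reproduced as below. This is a sketch: ``to_bytes`` is
a hypothetical helper that mirrors the suffix rule described in the note,
not the kernel's parser.

```shell
# Convert a size with an optional k/K, m/M or g/G suffix to bytes,
# using powers of 1024 (kibi, mebi, gibi).
to_bytes() {
    value=$1
    num=${value%[kKmMgG]}      # strip one trailing suffix letter, if any
    case $value in
        *[kK]) echo $(( num * 1024 )) ;;
        *[mM]) echo $(( num * 1024 * 1024 )) ;;
        *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
        *)     echo "$value" ;;
    esac
}

# Example: to_bytes 4M
```

For ``4M`` this yields 4 * 1024 * 1024 = 4194304, matching the value shown
by ``cat``.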

We can check the usage::

    # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
    1216512

A successful write to this file does not guarantee a successful setting of
this limit to the value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to guarantee the value committed by the kernel::

    # echo 1 > memory.limit_in_bytes
    # cat memory.limit_in_bytes
    4096
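
The ``1`` becoming ``4096`` is the rounding to a page boundary mentioned
above, on a system with 4096-byte pages. A sketch of that rounding follows.
``round_up_to_page`` is a hypothetical helper, and it covers only the
page-rounding factor, not the other factors the text lists.

```shell
# Round a byte count up to the next multiple of the page size.
# $1 is the byte count; $2 optionally overrides the page size, which
# otherwise comes from getconf.
round_up_to_page() {
    bytes=$1
    page=${2:-$(getconf PAGESIZE)}
    echo $(( (bytes + page - 1) / page * page ))
}

# Example: round_up_to_page 1 4096
```

Comparing this result with the value read back from memory.limit_in_bytes
is one way to tell page rounding apart from other adjustments.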

The memory.failcnt field gives the number of times that the cgroup limit was
exceeded.
@ -393,6 +449,7 @@ The memory.stat file gives accounting information. Now, the number of
caches, RSS and Active pages/Inactive pages are shown.

4. Testing
==========

For testing features and implementation, see memcg_test.txt.
@ -408,6 +465,7 @@ But the above two are testing extreme situations.
Trying usual test under memory controller is always helpful. Trying usual test under memory controller is always helpful.
4.1 Troubleshooting 4.1 Troubleshooting
-------------------
Sometimes a user might find that the application under a cgroup is Sometimes a user might find that the application under a cgroup is
terminated by the OOM killer. There are several causes for this: terminated by the OOM killer. There are several causes for this:
@ -422,6 +480,7 @@ To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
seeing what happens will be helpful. seeing what happens will be helpful.
4.2 Task migration 4.2 Task migration
------------------
When a task migrates from one cgroup to another, its charge is not When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated from the original cgroup still carried forward by default. The pages allocated from the original cgroup still
@ -432,6 +491,7 @@ You can move charges of a task along with task migration.
See 8. "Move charges at task migration" See 8. "Move charges at task migration"
4.3 Removing a cgroup 4.3 Removing a cgroup
---------------------
A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all cgroup might have some charge associated with it, even though all
@ -448,13 +508,15 @@ will be charged as a new owner of it.
About use_hierarchy, see Section 6. About use_hierarchy, see Section 6.
5. Misc. interfaces. 5. Misc. interfaces
===================
5.1 force_empty 5.1 force_empty
---------------
memory.force_empty interface is provided to make cgroup's memory usage empty. memory.force_empty interface is provided to make cgroup's memory usage empty.
When writing anything to this When writing anything to this::
# echo 0 > memory.force_empty # echo 0 > memory.force_empty
the cgroup will be reclaimed and as many pages reclaimed as possible. the cgroup will be reclaimed and as many pages reclaimed as possible.
@ -471,50 +533,61 @@ About use_hierarchy, see Section 6.
About use_hierarchy, see Section 6. About use_hierarchy, see Section 6.
5.2 stat file 5.2 stat file
-------------
memory.stat file includes following statistics memory.stat file includes following statistics
# per-memory cgroup local status per-memory cgroup local status
cache - # of bytes of page cache memory. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rss - # of bytes of anonymous and swap cache memory (includes
=============== ===============================================================
cache # of bytes of page cache memory.
rss # of bytes of anonymous and swap cache memory (includes
transparent hugepages). transparent hugepages).
rss_huge - # of bytes of anonymous transparent hugepages. rss_huge # of bytes of anonymous transparent hugepages.
mapped_file - # of bytes of mapped file (includes tmpfs/shmem) mapped_file # of bytes of mapped file (includes tmpfs/shmem)
pgpgin - # of charging events to the memory cgroup. The charging pgpgin # of charging events to the memory cgroup. The charging
event happens each time a page is accounted as either mapped event happens each time a page is accounted as either mapped
anon page(RSS) or cache page(Page Cache) to the cgroup. anon page(RSS) or cache page(Page Cache) to the cgroup.
pgpgout - # of uncharging events to the memory cgroup. The uncharging pgpgout # of uncharging events to the memory cgroup. The uncharging
event happens each time a page is unaccounted from the cgroup. event happens each time a page is unaccounted from the cgroup.
swap - # of bytes of swap usage swap # of bytes of swap usage
dirty - # of bytes that are waiting to get written back to the disk. dirty # of bytes that are waiting to get written back to the disk.
writeback - # of bytes of file/anon cache that are queued for syncing to writeback # of bytes of file/anon cache that are queued for syncing to
disk. disk.
inactive_anon - # of bytes of anonymous and swap cache memory on inactive inactive_anon # of bytes of anonymous and swap cache memory on inactive
LRU list. LRU list.
active_anon - # of bytes of anonymous and swap cache memory on active active_anon # of bytes of anonymous and swap cache memory on active
LRU list. LRU list.
inactive_file - # of bytes of file-backed memory on inactive LRU list. inactive_file # of bytes of file-backed memory on inactive LRU list.
active_file - # of bytes of file-backed memory on active LRU list. active_file # of bytes of file-backed memory on active LRU list.
unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc). unevictable # of bytes of memory that cannot be reclaimed (mlocked etc).
=============== ===============================================================
# status considering hierarchy (see memory.use_hierarchy settings) status considering hierarchy (see memory.use_hierarchy settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
hierarchical_memory_limit - # of bytes of memory limit with regard to hierarchy ========================= ===================================================
under which the memory cgroup is hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy
hierarchical_memsw_limit - # of bytes of memory+swap limit with regard to under which the memory cgroup is
hierarchy under which memory cgroup is. hierarchical_memsw_limit # of bytes of memory+swap limit with regard to
hierarchy under which memory cgroup is.
total_<counter> - # hierarchical version of <counter>, which in total_<counter> # hierarchical version of <counter>, which in
addition to the cgroup's own value includes the addition to the cgroup's own value includes the
sum of all hierarchical children's values of sum of all hierarchical children's values of
<counter>, i.e. total_cache <counter>, i.e. total_cache
========================= ===================================================
# The following additional stats are dependent on CONFIG_DEBUG_VM. The following additional stats are dependent on CONFIG_DEBUG_VM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) ========================= ========================================
recent_rotated_file - VM internal parameter. (see mm/vmscan.c) recent_rotated_anon VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) recent_rotated_file VM internal parameter. (see mm/vmscan.c)
recent_scanned_file - VM internal parameter. (see mm/vmscan.c) recent_scanned_anon VM internal parameter. (see mm/vmscan.c)
recent_scanned_file VM internal parameter. (see mm/vmscan.c)
========================= ========================================
Memo: Memo:
recent_rotated means recent frequency of LRU rotation. recent_rotated means recent frequency of LRU rotation.
@ -525,12 +598,15 @@ Note:
Only anonymous and swap cache memory is listed as part of 'rss' stat. Only anonymous and swap cache memory is listed as part of 'rss' stat.
This should not be confused with the true 'resident set size' or the This should not be confused with the true 'resident set size' or the
amount of physical memory used by the cgroup. amount of physical memory used by the cgroup.
'rss + mapped_file" will give you resident set size of cgroup. 'rss + mapped_file" will give you resident set size of cgroup.
(Note: file and shmem may be shared among other cgroups. In that case, (Note: file and shmem may be shared among other cgroups. In that case,
mapped_file is accounted only when the memory cgroup is owner of page mapped_file is accounted only when the memory cgroup is owner of page
cache.) cache.)
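
As a rough sketch of the 'rss + mapped_file' calculation, the two counters can be summed from memory.stat-style input with awk. The function name ``resident_bytes`` is hypothetical and the sample values are made up for illustration; on a real system the input would come from the cgroup's memory.stat file.

```shell
# Hypothetical helper: sum the "rss" and "mapped_file" counters from
# memory.stat-style input ("<name> <value>" per line) read on stdin.
resident_bytes() {
    awk '$1 == "rss" || $1 == "mapped_file" { sum += $2 }
         END { printf "%.0f\n", sum }'
}

# Sample values are made up for illustration.
printf 'cache 8192\nrss 4096\nmapped_file 2048\n' | resident_bytes   # prints 6144
```

Matching on the exact field name keeps rss_huge and the total_* counters out of the sum.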
5.3 swappiness 5.3 swappiness
--------------
Overrides /proc/sys/vm/swappiness for the particular group. The tunable Overrides /proc/sys/vm/swappiness for the particular group. The tunable
in the root cgroup corresponds to the global swappiness setting. in the root cgroup corresponds to the global swappiness setting.
@ -541,16 +617,19 @@ there is a swap storage available. This might lead to memcg OOM killer
if there are no file pages to reclaim. if there are no file pages to reclaim.
5.4 failcnt 5.4 failcnt
-----------
A memory cgroup provides memory.failcnt and memory.memsw.failcnt files. A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
This failcnt(== failure count) shows the number of times that a usage counter This failcnt(== failure count) shows the number of times that a usage counter
hit its limit. When a memory cgroup hits a limit, failcnt increases and hit its limit. When a memory cgroup hits a limit, failcnt increases and
memory under it will be reclaimed. memory under it will be reclaimed.
You can reset failcnt by writing 0 to failcnt file. You can reset failcnt by writing 0 to failcnt file::
# echo 0 > .../memory.failcnt
# echo 0 > .../memory.failcnt
5.5 usage_in_bytes 5.5 usage_in_bytes
------------------
For efficiency, as other kernel components, memory cgroup uses some optimization For efficiency, as other kernel components, memory cgroup uses some optimization
to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the to avoid unnecessary cacheline false sharing. usage_in_bytes is affected by the
@ -560,6 +639,7 @@ If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
value in memory.stat(see 5.2). value in memory.stat(see 5.2).
5.6 numa_stat 5.6 numa_stat
-------------
This is similar to numa_maps but operates on a per-memcg basis. This is This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the numa locality information within useful for providing visibility into the numa locality information within
@ -571,22 +651,23 @@ Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
per-node page counts including "hierarchical_<counter>" which sums up all per-node page counts including "hierarchical_<counter>" which sums up all
hierarchical children's values in addition to the memcg's own value. hierarchical children's values in addition to the memcg's own value.
The output format of memory.numa_stat is: The output format of memory.numa_stat is::
total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ... total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ... file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ... hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
The "total" count is sum of file + anon + unevictable. The "total" count is sum of file + anon + unevictable.
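
The relation between the counters can be checked with a small awk sketch. The function name ``check_total`` is hypothetical and the sample numbers are invented; the script reads only the first value of each line and ignores the per-node fields and any hierarchical_<counter> lines.

```shell
# Hypothetical helper: read numa_stat-style lines on stdin and report
# whether total equals file + anon + unevictable.
check_total() {
    awk -F'[= ]' '
        $1 == "total" { total = $2 }
        $1 == "file" || $1 == "anon" || $1 == "unevictable" { sum += $2 }
        END { if (total == sum) print "consistent"; else print "mismatch" }'
}

# Sample values are made up for illustration.
printf 'total=300 N0=100 N1=200\nfile=200 N0=50 N1=150\nanon=90 N0=45 N1=45\nunevictable=10 N0=5 N1=5\n' | check_total   # prints consistent
```

On a live system the counters may change between reads, so an occasional mismatch would not necessarily indicate a problem.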
6. Hierarchy support 6. Hierarchy support
====================
The memory controller supports a deep hierarchy and hierarchical accounting. The memory controller supports a deep hierarchy and hierarchical accounting.
The hierarchy is created by creating the appropriate cgroups in the The hierarchy is created by creating the appropriate cgroups in the
cgroup filesystem. Consider for example, the following cgroup filesystem cgroup filesystem. Consider for example, the following cgroup filesystem
hierarchy hierarchy::
root root
/ | \ / | \
@ -603,24 +684,28 @@ limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
children of the ancestor. children of the ancestor.
6.1 Enabling hierarchical accounting and reclaim 6.1 Enabling hierarchical accounting and reclaim
------------------------------------------------
A memory cgroup by default disables the hierarchy feature. Support A memory cgroup by default disables the hierarchy feature. Support
can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup::
# echo 1 > memory.use_hierarchy # echo 1 > memory.use_hierarchy
The feature can be disabled by The feature can be disabled by::
# echo 0 > memory.use_hierarchy # echo 0 > memory.use_hierarchy
NOTE1: Enabling/disabling will fail if either the cgroup already has other NOTE1:
Enabling/disabling will fail if either the cgroup already has other
cgroups created below it, or if the parent cgroup has use_hierarchy cgroups created below it, or if the parent cgroup has use_hierarchy
enabled. enabled.
NOTE2: When panic_on_oom is set to "2", the whole system will panic in NOTE2:
When panic_on_oom is set to "2", the whole system will panic in
case of an OOM event in any cgroup. case of an OOM event in any cgroup.
7. Soft limits 7. Soft limits
==============
Soft limits allow for greater sharing of memory. The idea behind soft limits Soft limits allow for greater sharing of memory. The idea behind soft limits
is to allow control groups to use as much of the memory as needed, provided is to allow control groups to use as much of the memory as needed, provided
@ -640,22 +725,26 @@ hints/setup. Currently soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd). it gets invoked from balance_pgdat (kswapd).
7.1 Interface 7.1 Interface
-------------
Soft limits can be setup by using the following commands (in this example we Soft limits can be setup by using the following commands (in this example we
assume a soft limit of 256 MiB) assume a soft limit of 256 MiB)::
# echo 256M > memory.soft_limit_in_bytes # echo 256M > memory.soft_limit_in_bytes
If we want to change this to 1G, we can at any time use If we want to change this to 1G, we can at any time use::
# echo 1G > memory.soft_limit_in_bytes # echo 1G > memory.soft_limit_in_bytes
NOTE1: Soft limits take effect over a long period of time, since they involve NOTE1:
Soft limits take effect over a long period of time, since they involve
reclaiming memory for balancing between memory cgroups reclaiming memory for balancing between memory cgroups
NOTE2: It is recommended to set the soft limit always below the hard limit, NOTE2:
It is recommended to set the soft limit always below the hard limit,
otherwise the hard limit will take precedence. otherwise the hard limit will take precedence.
8. Move charges at task migration 8. Move charges at task migration
=================================
Users can move charges associated with a task along with task migration, that Users can move charges associated with a task along with task migration, that
is, uncharge task's pages from the old cgroup and charge them to the new cgroup. is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
@ -663,60 +752,71 @@ This feature is not supported in !CONFIG_MMU environments because of lack of
page tables. page tables.
8.1 Interface 8.1 Interface
-------------
This feature is disabled by default. It can be enabled (and disabled again) by This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup. writing to memory.move_charge_at_immigrate of the destination cgroup.
If you want to enable it: If you want to enable it::
# echo (some positive value) > memory.move_charge_at_immigrate # echo (some positive value) > memory.move_charge_at_immigrate
Note: Each bits of move_charge_at_immigrate has its own meaning about what type Note:
Each bits of move_charge_at_immigrate has its own meaning about what type
of charges should be moved. See 8.2 for details. of charges should be moved. See 8.2 for details.
Note: Charges are moved only when you move mm->owner, in other words, Note:
Charges are moved only when you move mm->owner, in other words,
a leader of a thread group. a leader of a thread group.
Note: If we cannot find enough space for the task in the destination cgroup, we Note:
If we cannot find enough space for the task in the destination cgroup, we
try to make space by reclaiming memory. Task migration may fail if we try to make space by reclaiming memory. Task migration may fail if we
cannot make enough space. cannot make enough space.
Note: It can take several seconds if you move charges much. Note:
It can take several seconds if you move charges much.
And if you want disable it again: And if you want disable it again::
# echo 0 > memory.move_charge_at_immigrate # echo 0 > memory.move_charge_at_immigrate
8.2 Type of charges which can be moved 8.2 Type of charges which can be moved
--------------------------------------
Each bit in move_charge_at_immigrate has its own meaning about what type of Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. But in any case, it must be noted that an account of charges should be moved. But in any case, it must be noted that an account of
a page or a swap can be moved only when it is charged to the task's current a page or a swap can be moved only when it is charged to the task's current
(old) memory cgroup. (old) memory cgroup.
bit | what type of charges would be moved ? +---+--------------------------------------------------------------------------+
-----+------------------------------------------------------------------------ |bit| what type of charges would be moved ? |
0 | A charge of an anonymous page (or swap of it) used by the target task. +===+==========================================================================+
| You must enable Swap Extension (see 2.4) to enable move of swap charges. | 0 | A charge of an anonymous page (or swap of it) used by the target task. |
-----+------------------------------------------------------------------------ | | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) +---+--------------------------------------------------------------------------+
| and swaps of tmpfs file) mmapped by the target task. Unlike the case of | 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
| anonymous pages, file pages (and swaps) in the range mmapped by the task | | and swaps of tmpfs file) mmapped by the target task. Unlike the case of |
| will be moved even if the task hasn't done page fault, i.e. they might | | anonymous pages, file pages (and swaps) in the range mmapped by the task |
| not be the task's "RSS", but other task's "RSS" that maps the same file. | | will be moved even if the task hasn't done page fault, i.e. they might |
| And mapcount of the page is ignored (the page can be moved even if | | not be the task's "RSS", but other task's "RSS" that maps the same file. |
| page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to | | And mapcount of the page is ignored (the page can be moved even if |
| enable move of swap charges. | | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to |
| | enable move of swap charges. |
+---+--------------------------------------------------------------------------+
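
Because each bit selects one type of charge, the value written to memory.move_charge_at_immigrate is the bitwise OR of the wanted bits. A minimal sketch, where the variable names ``MOVE_ANON`` and ``MOVE_FILE`` are illustrative and not kernel identifiers:

```shell
# Bit positions taken from the table above; the names are illustrative.
MOVE_ANON=$((1 << 0))   # bit 0: anonymous pages (and their swap)
MOVE_FILE=$((1 << 1))   # bit 1: mmapped file pages (and their swap)

# Combine both bits into one value.
echo $((MOVE_ANON | MOVE_FILE))   # prints 3
```

Under this reading, writing 1 would move only anonymous charges, 2 only file charges, and 3 both.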
8.3 TODO 8.3 TODO
--------
- All of moving charge operations are done under cgroup_mutex. It's not good - All of moving charge operations are done under cgroup_mutex. It's not good
behavior to hold the mutex too long, so we may need some trick. behavior to hold the mutex too long, so we may need some trick.
9. Memory thresholds 9. Memory thresholds
====================
Memory cgroup implements memory thresholds using the cgroups notification Memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows to register multiple memory and memsw API (see cgroups.txt). It allows to register multiple memory and memsw
thresholds and gets notifications when it crosses. thresholds and gets notifications when it crosses.
To register a threshold, an application must: To register a threshold, an application must:
- create an eventfd using eventfd(2); - create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes; - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to - write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
@ -728,6 +828,7 @@ threshold in any direction.
It's applicable for root and non-root cgroup. It's applicable for root and non-root cgroup.
10. OOM Control 10. OOM Control
===============
memory.oom_control file is for OOM notification and other controls. memory.oom_control file is for OOM notification and other controls.
@ -736,6 +837,7 @@ API (See cgroups.txt). It allows to register multiple OOM notification
delivery and gets notification when OOM happens. delivery and gets notification when OOM happens.
To register a notifier, an application must: To register a notifier, an application must:
- create an eventfd using eventfd(2) - create an eventfd using eventfd(2)
- open memory.oom_control file - open memory.oom_control file
- write string like "<event_fd> <fd of memory.oom_control>" to - write string like "<event_fd> <fd of memory.oom_control>" to
@ -752,8 +854,11 @@ If OOM-killer is disabled, tasks under cgroup will hang/sleep
in memory cgroup's OOM-waitqueue when they request accountable memory. in memory cgroup's OOM-waitqueue when they request accountable memory.
For running them, you have to relax the memory cgroup's OOM status by For running them, you have to relax the memory cgroup's OOM status by
* enlarge limit or reduce usage. * enlarge limit or reduce usage.
To reduce usage, To reduce usage,
* kill some tasks. * kill some tasks.
* move some tasks to other group with account migration. * move some tasks to other group with account migration.
* remove some files (on tmpfs?) * remove some files (on tmpfs?)
@ -761,11 +866,14 @@ To reduce usage,
Then, stopped tasks will work again. Then, stopped tasks will work again.
At reading, current status of OOM is shown. At reading, current status of OOM is shown.
oom_kill_disable 0 or 1 (if 1, oom-killer is disabled)
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may - oom_kill_disable 0 or 1
be stopped.) (if 1, oom-killer is disabled)
- under_oom 0 or 1
(if 1, the memory cgroup is under OOM, tasks may be stopped.)
11. Memory Pressure 11. Memory Pressure
===================
The pressure level notifications can be used to monitor the memory The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement allocation cost; based on the pressure, applications can implement
@ -840,21 +948,22 @@ Test:
Here is a small script example that makes a new cgroup, sets up a Here is a small script example that makes a new cgroup, sets up a
memory limit, sets up a notification in the cgroup and then makes child memory limit, sets up a notification in the cgroup and then makes child
cgroup experience a critical pressure: cgroup experience a critical pressure::
# cd /sys/fs/cgroup/memory/ # cd /sys/fs/cgroup/memory/
# mkdir foo # mkdir foo
# cd foo # cd foo
# cgroup_event_listener memory.pressure_level low,hierarchy & # cgroup_event_listener memory.pressure_level low,hierarchy &
# echo 8000000 > memory.limit_in_bytes # echo 8000000 > memory.limit_in_bytes
# echo 8000000 > memory.memsw.limit_in_bytes # echo 8000000 > memory.memsw.limit_in_bytes
# echo $$ > tasks # echo $$ > tasks
# dd if=/dev/zero | read x # dd if=/dev/zero | read x
(Expect a bunch of notifications, and eventually, the oom-killer will (Expect a bunch of notifications, and eventually, the oom-killer will
trigger.) trigger.)
12. TODO 12. TODO
========
1. Make per-cgroup scanner reclaim not-shared pages first 1. Make per-cgroup scanner reclaim not-shared pages first
2. Teach controller to account for shared-pages 2. Teach controller to account for shared-pages
@ -862,11 +971,13 @@ Test:
not yet hit but the usage is getting closer not yet hit but the usage is getting closer
Summary Summary
=======
Overall, the memory controller has been a stable controller and has been Overall, the memory controller has been a stable controller and has been
commented and discussed quite extensively in the community. commented and discussed quite extensively in the community.
References References
==========
1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ 1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control), 2. Singh, Balbir. Memory Controller (RSS Control),


@ -1,5 +1,6 @@
=========================
Network classifier cgroup Network classifier cgroup
------------------------- =========================
The Network classifier cgroup provides an interface to The Network classifier cgroup provides an interface to
tag network packets with a class identifier (classid). tag network packets with a class identifier (classid).
@ -17,23 +18,27 @@ values is 0xAAAABBBB; AAAA is the major handle number and BBBB
is the minor handle number. is the minor handle number.
Reading net_cls.classid yields a decimal result. Reading net_cls.classid yields a decimal result.
Example: Example::
mkdir /sys/fs/cgroup/net_cls
mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
mkdir /sys/fs/cgroup/net_cls/0
echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
- setting a 10:1 handle.
cat /sys/fs/cgroup/net_cls/0/net_cls.classid mkdir /sys/fs/cgroup/net_cls
1048577 mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
mkdir /sys/fs/cgroup/net_cls/0
echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
configuring tc: - setting a 10:1 handle::
tc qdisc add dev eth0 root handle 10: htb
tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit cat /sys/fs/cgroup/net_cls/0/net_cls.classid
- creating traffic class 10:1 1048577
tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup - configuring tc::
configuring iptables, basic example: tc qdisc add dev eth0 root handle 10: htb
iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
- creating traffic class 10:1::
tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
configuring iptables, basic example::
iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP
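
The relation between the hexadecimal value written and the decimal value read back can be checked with shell arithmetic. This sketch assumes the 0xAAAABBBB layout described above; the function name ``classid_to_handle`` is hypothetical.

```shell
# Hypothetical helper: split a decimal classid into its major:minor
# handle, printed in hexadecimal as tc expects.
classid_to_handle() {
    printf '%x:%x\n' $(( $1 >> 16 )) $(( $1 & 0xffff ))
}

printf '%d\n' 0x100001       # prints 1048577
classid_to_handle 1048577    # prints 10:1
```

This shows why writing 0x100001 reads back as 1048577 and corresponds to the 10:1 handle used in the tc commands.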


@ -1,5 +1,6 @@
=======================
Network priority cgroup Network priority cgroup
------------------------- =======================
The Network priority cgroup provides an interface to allow an administrator to The Network priority cgroup provides an interface to allow an administrator to
dynamically set the priority of network traffic generated by various dynamically set the priority of network traffic generated by various
@ -14,9 +15,9 @@ SO_PRIORITY socket option. This however, is not always possible because:
This cgroup allows an administrator to assign a process to a group which defines This cgroup allows an administrator to assign a process to a group which defines
the priority of egress traffic on a given interface. Network priority groups can the priority of egress traffic on a given interface. Network priority groups can
be created by first mounting the cgroup filesystem. be created by first mounting the cgroup filesystem::
# mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio # mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
With the above step, the initial group acting as the parent accounting group With the above step, the initial group acting as the parent accounting group
becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in
@ -25,17 +26,18 @@ the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.
Each net_prio cgroup contains two files that are subsystem specific Each net_prio cgroup contains two files that are subsystem specific
net_prio.prioidx net_prio.prioidx
This file is read-only, and is simply informative. It contains a unique integer This file is read-only, and is simply informative. It contains a unique
value that the kernel uses as an internal representation of this cgroup. integer value that the kernel uses as an internal representation of this
cgroup.
net_prio.ifpriomap net_prio.ifpriomap
This file contains a map of the priorities assigned to traffic originating from This file contains a map of the priorities assigned to traffic originating
processes in this group and egressing the system on various interfaces. It from processes in this group and egressing the system on various interfaces.
contains a list of tuples in the form <ifname priority>. Contents of this file It contains a list of tuples in the form <ifname priority>. Contents of this
can be modified by echoing a string into the file using the same tuple format. file can be modified by echoing a string into the file using the same tuple
for example: format. For example::
echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap echo "eth0 5" > /sys/fs/cgroups/net_prio/iscsi/net_prio.ifpriomap
This command would force any traffic originating from processes belonging to the This command would force any traffic originating from processes belonging to the
iscsi net_prio cgroup and egressing on interface eth0 to have the priority of iscsi net_prio cgroup and egressing on interface eth0 to have the priority of


@ -1,5 +1,6 @@
Process Number Controller =========================
========================= Process Number Controller
=========================
Abstract Abstract
-------- --------
@ -34,55 +35,58 @@ pids.current tracks all child cgroup hierarchies, so parent/pids.current is a
superset of parent/child/pids.current. superset of parent/child/pids.current.
The pids.events file contains event counters: The pids.events file contains event counters:
- max: Number of times fork failed because limit was hit. - max: Number of times fork failed because limit was hit.
Example Example
------- -------
First, we mount the pids controller: First, we mount the pids controller::
# mkdir -p /sys/fs/cgroup/pids
# mount -t cgroup -o pids none /sys/fs/cgroup/pids
Then we create a hierarchy, set limits and attach processes to it: # mkdir -p /sys/fs/cgroup/pids
# mkdir -p /sys/fs/cgroup/pids/parent/child # mount -t cgroup -o pids none /sys/fs/cgroup/pids
# echo 2 > /sys/fs/cgroup/pids/parent/pids.max
# echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs Then we create a hierarchy, set limits and attach processes to it::
# cat /sys/fs/cgroup/pids/parent/pids.current
2 # mkdir -p /sys/fs/cgroup/pids/parent/child
# # echo 2 > /sys/fs/cgroup/pids/parent/pids.max
# echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs
# cat /sys/fs/cgroup/pids/parent/pids.current
2
#
It should be noted that attempts to overcome the set limit (2 in this case) will It should be noted that attempts to overcome the set limit (2 in this case) will
fail: fail::
# cat /sys/fs/cgroup/pids/parent/pids.current # cat /sys/fs/cgroup/pids/parent/pids.current
2 2
# ( /bin/echo "Here's some processes for you." | cat ) # ( /bin/echo "Here's some processes for you." | cat )
sh: fork: Resource temporary unavailable sh: fork: Resource temporary unavailable
# #
Even if we migrate to a child cgroup (which doesn't have a set limit), we will Even if we migrate to a child cgroup (which doesn't have a set limit), we will
not be able to overcome the most stringent limit in the hierarchy (in this case, not be able to overcome the most stringent limit in the hierarchy (in this case,
parent's): parent's)::
# echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs # echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs
# cat /sys/fs/cgroup/pids/parent/pids.current # cat /sys/fs/cgroup/pids/parent/pids.current
2 2
# cat /sys/fs/cgroup/pids/parent/child/pids.current # cat /sys/fs/cgroup/pids/parent/child/pids.current
2 2
# cat /sys/fs/cgroup/pids/parent/child/pids.max # cat /sys/fs/cgroup/pids/parent/child/pids.max
max max
# ( /bin/echo "Here's some processes for you." | cat ) # ( /bin/echo "Here's some processes for you." | cat )
sh: fork: Resource temporarily unavailable sh: fork: Resource temporarily unavailable
# #
We can set a limit that is smaller than pids.current, which will stop any new We can set a limit that is smaller than pids.current, which will stop any new
processes from being forked at all (note that the shell itself counts towards processes from being forked at all (note that the shell itself counts towards
pids.current): pids.current)::
# echo 1 > /sys/fs/cgroup/pids/parent/pids.max # echo 1 > /sys/fs/cgroup/pids/parent/pids.max
# /bin/echo "We can't even spawn a single process now." # /bin/echo "We can't even spawn a single process now."
sh: fork: Resource temporarily unavailable sh: fork: Resource temporarily unavailable
# echo 0 > /sys/fs/cgroup/pids/parent/pids.max # echo 0 > /sys/fs/cgroup/pids/parent/pids.max
# /bin/echo "We can't even spawn a single process now." # /bin/echo "We can't even spawn a single process now."
sh: fork: Resource temporarily unavailable sh: fork: Resource temporarily unavailable
# #
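The behaviour in the example above can be modelled as a walk over the hierarchy: a fork is refused as soon as any level on the path would exceed its own pids.max. The following is a simplified sketch of that rule, not the kernel's implementation; `struct pids_level`, `PIDS_UNLIMITED` and `fork_allowed()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinel standing in for a pids.max value of "max". */
#define PIDS_UNLIMITED (-1L)

/* One cgroup on the path from the topmost limited cgroup to the leaf. */
struct pids_level {
	long max;	/* value of pids.max, or PIDS_UNLIMITED */
	long current;	/* value of pids.current */
};

/*
 * Return true if one more task may be created in the leaf cgroup.
 * Every level on the path is checked, so the most stringent limit
 * in the hierarchy decides the outcome.
 */
static bool fork_allowed(const struct pids_level *path, size_t depth)
{
	assert(path != NULL || depth == 0);

	for (size_t i = 0; i < depth; i++) {
		if (path[i].max != PIDS_UNLIMITED &&
		    path[i].current + 1 > path[i].max)
			return false;
	}
	return true;
}
```

With parent at max=2, current=2 and child at max unlimited, current=2, the function returns false, which matches the failed fork shown after migrating to the child cgroup.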

@ -1,16 +1,17 @@
RDMA Controller ===============
---------------- RDMA Controller
===============
Contents .. Contents
--------
1. Overview
1. Overview 1-1. What is RDMA controller?
1-1. What is RDMA controller? 1-2. Why RDMA controller needed?
1-2. Why RDMA controller needed? 1-3. How is RDMA controller implemented?
1-3. How is RDMA controller implemented? 2. Usage Examples
2. Usage Examples
1. Overview 1. Overview
===========
1-1. What is RDMA controller? 1-1. What is RDMA controller?
----------------------------- -----------------------------
@ -83,27 +84,34 @@ what is configured by user for a given cgroup and what is supported by
IB device. IB device.
Following resources can be accounted by rdma controller. Following resources can be accounted by rdma controller.
========== =============================
hca_handle Maximum number of HCA Handles hca_handle Maximum number of HCA Handles
hca_object Maximum number of HCA Objects hca_object Maximum number of HCA Objects
========== =============================
2. Usage Examples 2. Usage Examples
----------------- =================
(a) Configure resource limit: (a) Configure resource limit::
echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max
echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max
(b) Query resource limit: echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max
cat /sys/fs/cgroup/rdma/2/rdma.max echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max
#Output:
mlx4_0 hca_handle=2 hca_object=2000
ocrdma1 hca_handle=3 hca_object=max
(c) Query current usage: (b) Query resource limit::
cat /sys/fs/cgroup/rdma/2/rdma.current
#Output:
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23
(d) Delete resource limit: cat /sys/fs/cgroup/rdma/2/rdma.max
echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max #Output:
mlx4_0 hca_handle=2 hca_object=2000
ocrdma1 hca_handle=3 hca_object=max
(c) Query current usage::
cat /sys/fs/cgroup/rdma/2/rdma.current
#Output:
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23
(d) Delete resource limit::
echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
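For tooling that reads rdma.max or rdma.current, each line has the shape `<device> hca_handle=<n|max> hca_object=<n|max>`, as the output above shows. Below is a minimal userspace sketch of a parser for one such line. `struct rdma_limits`, `RDMA_LIMIT_MAX` and `parse_rdma_line()` are hypothetical names, and treating an unlisted resource as "max" is an assumption drawn from the ocrdma1 query output above.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sentinel this sketch uses for the literal "max". */
#define RDMA_LIMIT_MAX INT_MAX

struct rdma_limits {
	char device[64];
	int hca_handle;
	int hca_object;
};

/* Parse "max" or a non-negative decimal integer; 0 on success, -1 on error. */
static int parse_value(const char *s, int *out)
{
	char *end;
	long n;

	if (strcmp(s, "max") == 0) {
		*out = RDMA_LIMIT_MAX;
		return 0;
	}
	n = strtol(s, &end, 10);
	if (end == s || *end != '\0' || n < 0 || n > INT_MAX)
		return -1;
	*out = (int)n;
	return 0;
}

/*
 * Parse one line such as "mlx4_0 hca_handle=2 hca_object=2000".
 * Uses strtok(), so it is not reentrant. Returns 0 on success, -1 on error.
 */
static int parse_rdma_line(const char *line, struct rdma_limits *out)
{
	char buf[256], *tok, *eq;

	assert(line && out);
	if (strlen(line) >= sizeof(buf))
		return -1;
	strcpy(buf, line);

	tok = strtok(buf, " \t\n");
	if (!tok || strlen(tok) >= sizeof(out->device))
		return -1;
	strcpy(out->device, tok);

	/* Assumption: resources not listed on the line are unlimited. */
	out->hca_handle = RDMA_LIMIT_MAX;
	out->hca_object = RDMA_LIMIT_MAX;

	while ((tok = strtok(NULL, " \t\n")) != NULL) {
		eq = strchr(tok, '=');
		if (!eq)
			return -1;
		*eq = '\0';
		if (strcmp(tok, "hca_handle") == 0) {
			if (parse_value(eq + 1, &out->hca_handle))
				return -1;
		} else if (strcmp(tok, "hca_object") == 0) {
			if (parse_value(eq + 1, &out->hca_object))
				return -1;
		} else {
			return -1;	/* unknown resource name */
		}
	}
	return 0;
}
```

Unknown resource names are rejected rather than ignored, so a typo in a key surfaces as an error instead of silently leaving a limit at "max".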

@ -98,7 +98,7 @@ A memory policy with a valid NodeList will be saved, as specified, for
use at file creation time. When a task allocates a file in the file use at file creation time. When a task allocates a file in the file
system, the mount option memory policy will be applied with a NodeList, system, the mount option memory policy will be applied with a NodeList,
if any, modified by the calling task's cpuset constraints if any, modified by the calling task's cpuset constraints
[See Documentation/cgroup-v1/cpusets.txt] and any optional flags, listed [See Documentation/cgroup-v1/cpusets.rst] and any optional flags, listed
below. If the resulting NodeLists is the empty set, the effective memory below. If the resulting NodeLists is the empty set, the effective memory
policy for the file will revert to "default" policy. policy for the file will revert to "default" policy.

@ -652,7 +652,7 @@ CONTENTS
-deadline tasks cannot have an affinity mask smaller than the entire -deadline tasks cannot have an affinity mask smaller than the entire
root_domain they are created on. However, affinities can be specified root_domain they are created on. However, affinities can be specified
through the cpuset facility (Documentation/cgroup-v1/cpusets.txt). through the cpuset facility (Documentation/cgroup-v1/cpusets.rst).
5.1 SCHED_DEADLINE and cpusets HOWTO 5.1 SCHED_DEADLINE and cpusets HOWTO
------------------------------------ ------------------------------------

@ -215,7 +215,7 @@ SCHED_BATCH) tasks.
These options need CONFIG_CGROUPS to be defined, and let the administrator These options need CONFIG_CGROUPS to be defined, and let the administrator
create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
Documentation/cgroup-v1/cgroups.txt for more information about this filesystem. Documentation/cgroup-v1/cgroups.rst for more information about this filesystem.
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem. See example steps below to create group created using the pseudo filesystem. See example steps below to create

@ -133,7 +133,7 @@ This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
to control the CPU time reserved for each control group. to control the CPU time reserved for each control group.
For more information on working with control groups, you should read For more information on working with control groups, you should read
Documentation/cgroup-v1/cgroups.txt as well. Documentation/cgroup-v1/cgroups.rst as well.
Group settings are checked against the following limits in order to keep the Group settings are checked against the following limits in order to keep the
configuration schedulable: configuration schedulable:

@ -67,7 +67,7 @@ nodes. Each emulated node will manage a fraction of the underlying cells'
physical memory. NUMA emulation is useful for testing NUMA kernel and physical memory. NUMA emulation is useful for testing NUMA kernel and
application features on non-NUMA platforms, and as a sort of memory resource application features on non-NUMA platforms, and as a sort of memory resource
management mechanism when used together with cpusets. management mechanism when used together with cpusets.
[see Documentation/cgroup-v1/cpusets.txt] [see Documentation/cgroup-v1/cpusets.rst]
For each node with memory, Linux constructs an independent memory management For each node with memory, Linux constructs an independent memory management
subsystem, complete with its own free page lists, in-use page lists, usage subsystem, complete with its own free page lists, in-use page lists, usage
@ -114,7 +114,7 @@ allocation behavior using Linux NUMA memory policy. [see
System administrators can restrict the CPUs and nodes' memories that a non- System administrators can restrict the CPUs and nodes' memories that a non-
privileged user can specify in the scheduling or NUMA commands and functions privileged user can specify in the scheduling or NUMA commands and functions
using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.txt] using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.rst]
On architectures that do not hide memoryless nodes, Linux will include only On architectures that do not hide memoryless nodes, Linux will include only
zones [nodes] with memory in the zonelists. This means that for a memoryless zones [nodes] with memory in the zonelists. This means that for a memoryless

@ -41,7 +41,7 @@ locations.
Larger installations usually partition the system using cpusets into Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to sections of nodes. Paul Jackson has equipped cpusets with the ability to
move pages when a task is moved to another cpuset (See move pages when a task is moved to another cpuset (See
Documentation/cgroup-v1/cpusets.txt). Documentation/cgroup-v1/cpusets.rst).
Cpusets allows the automation of process locality. If a task is moved to Cpusets allows the automation of process locality. If a task is moved to
a new cpuset then also all its pages are moved with it so that the a new cpuset then also all its pages are moved with it so that the
performance of the process does not sink dramatically. Also the pages performance of the process does not sink dramatically. Also the pages

@ -98,7 +98,7 @@ Memory Control Group Interaction
-------------------------------- --------------------------------
The unevictable LRU facility interacts with the memory control group [aka The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/cgroup-v1/memory.txt] by extending the memory controller; see Documentation/cgroup-v1/memory.rst] by extending the
lru_list enum. lru_list enum.
The memory controller data structure automatically gets a per-zone unevictable The memory controller data structure automatically gets a per-zone unevictable

@ -15,7 +15,7 @@ assign them to cpusets and their attached tasks. This is a way of limiting the
amount of system memory that is available to a certain class of tasks. amount of system memory that is available to a certain class of tasks.
For more information on the features of cpusets, see For more information on the features of cpusets, see
Documentation/cgroup-v1/cpusets.txt. Documentation/cgroup-v1/cpusets.rst.
There are a number of different configurations you can use for your needs. For There are a number of different configurations you can use for your needs. For
more information on the numa=fake command line option and its various ways of more information on the numa=fake command line option and its various ways of
configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt.
@ -40,7 +40,7 @@ A machine may be split as follows with "numa=fake=4*512," as reported by dmesg::
On node 3 totalpages: 131072 On node 3 totalpages: 131072
Now following the instructions for mounting the cpusets filesystem from Now following the instructions for mounting the cpusets filesystem from
Documentation/cgroup-v1/cpusets.txt, you can assign fake nodes (i.e. contiguous memory Documentation/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. contiguous memory
address spaces) to individual cpusets:: address spaces) to individual cpusets::
[root@xroads /]# mkdir exampleset [root@xroads /]# mkdir exampleset

@ -4122,7 +4122,7 @@ W: http://www.bullopensource.org/cpuset/
W: http://oss.sgi.com/projects/cpusets/ W: http://oss.sgi.com/projects/cpusets/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
S: Maintained S: Maintained
F: Documentation/cgroup-v1/cpusets.txt F: Documentation/cgroup-v1/cpusets.rst
F: include/linux/cpuset.h F: include/linux/cpuset.h
F: kernel/cgroup/cpuset.c F: kernel/cgroup/cpuset.c

@ -89,7 +89,7 @@ config BLK_DEV_THROTTLING
one needs to mount and use blkio cgroup controller for creating one needs to mount and use blkio cgroup controller for creating
cgroups and specifying per device IO rate policies. cgroups and specifying per device IO rate policies.
See Documentation/cgroup-v1/blkio-controller.txt for more information. See Documentation/cgroup-v1/blkio-controller.rst for more information.
config BLK_DEV_THROTTLING_LOW config BLK_DEV_THROTTLING_LOW
bool "Block throttling .low limit interface support (EXPERIMENTAL)" bool "Block throttling .low limit interface support (EXPERIMENTAL)"

@ -624,7 +624,7 @@ struct cftype {
/* /*
* Control Group subsystem type. * Control Group subsystem type.
* See Documentation/cgroup-v1/cgroups.txt for details * See Documentation/cgroup-v1/cgroups.rst for details
*/ */
struct cgroup_subsys { struct cgroup_subsys {
struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css); struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css);

@ -131,6 +131,8 @@ void cgroup_free(struct task_struct *p);
int cgroup_init_early(void); int cgroup_init_early(void);
int cgroup_init(void); int cgroup_init(void);
int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v);
/* /*
* Iteration helpers and macros. * Iteration helpers and macros.
*/ */

@ -785,7 +785,7 @@ union bpf_attr {
* based on a user-provided identifier for all traffic coming from * based on a user-provided identifier for all traffic coming from
* the tasks belonging to the related cgroup. See also the related * the tasks belonging to the related cgroup. See also the related
* kernel documentation, available from the Linux sources in file * kernel documentation, available from the Linux sources in file
* *Documentation/cgroup-v1/net_cls.txt*. * *Documentation/cgroup-v1/net_cls.rst*.
* *
* The Linux kernel has two versions for cgroups: there are * The Linux kernel has two versions for cgroups: there are
* cgroups v1 and cgroups v2. Both are available to users, who can * cgroups v1 and cgroups v2. Both are available to users, who can

@ -850,7 +850,7 @@ config BLK_CGROUP
CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set
CONFIG_BLK_DEV_THROTTLING=y. CONFIG_BLK_DEV_THROTTLING=y.
See Documentation/cgroup-v1/blkio-controller.txt for more information. See Documentation/cgroup-v1/blkio-controller.rst for more information.
config DEBUG_BLK_CGROUP config DEBUG_BLK_CGROUP
bool "IO controller debugging" bool "IO controller debugging"

@ -6240,6 +6240,48 @@ struct cgroup *cgroup_get_from_fd(int fd)
} }
EXPORT_SYMBOL_GPL(cgroup_get_from_fd); EXPORT_SYMBOL_GPL(cgroup_get_from_fd);
static u64 power_of_ten(int power)
{
u64 v = 1;
while (power--)
v *= 10;
return v;
}
/**
* cgroup_parse_float - parse a floating number
* @input: input string
* @dec_shift: number of decimal digits to shift
* @v: output
*
* Parse a decimal floating point number in @input and store the result in
* @v with decimal point right shifted @dec_shift times. For example, if
* @input is "12.3456" and @dec_shift is 3, *@v will be set to 12345.
* Returns 0 on success, -errno otherwise.
*
* There's nothing cgroup specific about this function except that it's
* currently the only user.
*/
int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
{
s64 whole, frac = 0;
int fstart = 0, fend = 0, flen;
if (!sscanf(input, "%lld.%n%lld%n", &whole, &fstart, &frac, &fend))
return -EINVAL;
if (frac < 0)
return -EINVAL;
flen = fend > fstart ? fend - fstart : 0;
if (flen < dec_shift)
frac *= power_of_ten(dec_shift - flen);
else
frac = DIV_ROUND_CLOSEST_ULL(frac, power_of_ten(flen - dec_shift));
*v = whole * power_of_ten(dec_shift) + frac;
return 0;
}
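To make the shifting and rounding concrete, here is a userspace sketch of the same parsing scheme. It is an illustration under stated assumptions, not the kernel function: `parse_fixed_point()` is a hypothetical name, the kernel types are replaced with `<stdint.h>` equivalents, the rounding macro is replaced by an explicit round-to-nearest division, and the libc `sscanf()` return value is checked with `< 1` because libc may return EOF on empty input.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* 10^power by repeated multiplication, mirroring the helper above. */
static uint64_t power_of_ten(int power)
{
	uint64_t v = 1;

	while (power-- > 0)
		v *= 10;
	return v;
}

/*
 * Hypothetical userspace counterpart: parse a decimal number and store
 * it in *v with the decimal point shifted right by dec_shift digits.
 * Returns 0 on success, -EINVAL if no number could be parsed.
 */
static int parse_fixed_point(const char *input, unsigned dec_shift, int64_t *v)
{
	long long whole, frac = 0;
	int fstart = 0, fend = 0, flen;

	assert(input && v);

	/* %n records the offsets where the fractional digits start and end. */
	if (sscanf(input, "%lld.%n%lld%n", &whole, &fstart, &frac, &fend) < 1)
		return -EINVAL;
	if (frac < 0)
		return -EINVAL;

	flen = fend > fstart ? fend - fstart : 0;
	if (flen < (int)dec_shift) {
		/* Fewer fractional digits than requested: pad with zeros. */
		frac *= power_of_ten(dec_shift - flen);
	} else {
		/* More digits than requested: round to nearest. */
		uint64_t div = power_of_ten(flen - dec_shift);

		frac = (frac + div / 2) / div;
	}

	*v = whole * (int64_t)power_of_ten(dec_shift) + frac;
	return 0;
}
```

In this sketch, "13.40" with a shift of 2 yields 1340 and "5" with a shift of 2 yields 500. With round-to-nearest, "12.3456" with a shift of 3 yields 12346, because the dropped digit 6 rounds the last kept digit up.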
/* /*
* sock->sk_cgrp_data handling. For more info, see sock_cgroup_data * sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
* definition in cgroup-defs.h. * definition in cgroup-defs.h.
@ -6402,4 +6444,5 @@ static int __init cgroup_sysfs_init(void)
return sysfs_create_group(kernel_kobj, &cgroup_sysfs_attr_group); return sysfs_create_group(kernel_kobj, &cgroup_sysfs_attr_group);
} }
subsys_initcall(cgroup_sysfs_init); subsys_initcall(cgroup_sysfs_init);
#endif /* CONFIG_SYSFS */ #endif /* CONFIG_SYSFS */

@ -729,7 +729,7 @@ static inline int nr_cpusets(void)
* load balancing domains (sched domains) as specified by that partial * load balancing domains (sched domains) as specified by that partial
* partition. * partition.
* *
* See "What is sched_load_balance" in Documentation/cgroup-v1/cpusets.txt * See "What is sched_load_balance" in Documentation/cgroup-v1/cpusets.rst
* for a background explanation of this. * for a background explanation of this.
* *
* Does not return errors, on the theory that the callers of this * Does not return errors, on the theory that the callers of this

@ -509,7 +509,7 @@ static inline int may_allow_all(struct dev_cgroup *parent)
* This is one of the three key functions for hierarchy implementation. * This is one of the three key functions for hierarchy implementation.
* This function is responsible for re-evaluating all the cgroup's active * This function is responsible for re-evaluating all the cgroup's active
* exceptions due to a parent's exception change. * exceptions due to a parent's exception change.
* Refer to Documentation/cgroup-v1/devices.txt for more details. * Refer to Documentation/cgroup-v1/devices.rst for more details.
*/ */
static void revalidate_active_exceptions(struct dev_cgroup *devcg) static void revalidate_active_exceptions(struct dev_cgroup *devcg)
{ {

@ -785,7 +785,7 @@ union bpf_attr {
* based on a user-provided identifier for all traffic coming from * based on a user-provided identifier for all traffic coming from
* the tasks belonging to the related cgroup. See also the related * the tasks belonging to the related cgroup. See also the related
* kernel documentation, available from the Linux sources in file * kernel documentation, available from the Linux sources in file
* *Documentation/cgroup-v1/net_cls.txt*. * *Documentation/cgroup-v1/net_cls.rst*.
* *
* The Linux kernel has two versions for cgroups: there are * The Linux kernel has two versions for cgroups: there are
* cgroups v1 and cgroups v2. Both are available to users, who can * cgroups v1 and cgroups v2. Both are available to users, who can