In cgroups v2, all mounted controllers reside in a single unified
hierarchy. While (different) controllers may be simultaneously
mounted under the v1 and v2 hierarchies, it is not possible to
mount the same controller simultaneously under both the v1 and
the v2 hierarchies.
The new behaviors in cgroups v2 are summarized here, and in some
cases elaborated in the following subsections.
1. Cgroups v2 provides a unified hierarchy against which all
controllers are mounted.
2. "Internal" processes are not permitted. With the exception of
the root cgroup, processes may reside only in leaf nodes
(cgroups that do not themselves contain child cgroups). The
details are somewhat more subtle than this, and are described
below.
3. Active controllers must be specified via the files
cgroup.controllers and cgroup.subtree_control.
4. The tasks file has been removed. In addition, the
cgroup.clone_children file that is employed by the cpuset
controller has been removed.
5. An improved mechanism for notification of empty cgroups is
provided by the cgroup.events file.
For more changes, see the Documentation/admin-guide/cgroup-v2.rst
file in the kernel source (or Documentation/cgroup-v2.txt in
Linux 4.17 and earlier).
Some of the new behaviors listed above saw subsequent
modification with the addition in Linux 4.14 of "thread mode"
(described below).
Cgroups v2 unified hierarchy
In cgroups v1, the ability to mount different controllers against
different hierarchies was intended to allow great flexibility for
application design. In practice, though, the flexibility turned
out to be less useful than expected, and in many cases added
complexity. Therefore, in cgroups v2, all available controllers
are mounted against a single hierarchy. The available
controllers are automatically mounted, meaning that it is not
necessary (or possible) to specify the controllers when mounting
the cgroup v2 filesystem using a command such as the following:
mount -t cgroup2 none /mnt/cgroup2
A cgroup v2 controller is available only if it is not currently
in use via a mount against a cgroup v1 hierarchy. Or, to put
things another way, it is not possible to employ the same
controller against both a v1 hierarchy and the unified v2
hierarchy. This means that it may be necessary first to unmount
a v1 controller (as described above) before that controller is
available in v2. Since systemd(1) makes heavy use of some v1
controllers by default, it can in some cases be simpler to boot
the system with selected v1 controllers disabled. To do this,
specify the cgroup_no_v1=list option on the kernel boot command
line; list is a comma-separated list of the names of the
controllers to disable, or the word all to disable all v1
controllers. (This situation is correctly handled by systemd(1),
which falls back to operating without the specified controllers.)
Note that on many modern systems, systemd(1) automatically mounts
the cgroup2 filesystem at /sys/fs/cgroup/unified during the boot
process.
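Where the unified hierarchy is mounted therefore varies between systems. A minimal sketch (using only standard tools such as awk(1)) that discovers the cgroup2 mount point from /proc/mounts and lists the controllers available in the root cgroup:

```shell
#!/bin/sh
# Sketch: locate the cgroup v2 mount point and list the available
# controllers.  The mount point varies by system (/sys/fs/cgroup on
# pure-v2 systems, /sys/fs/cgroup/unified under systemd's "hybrid"
# mode), so it is read from /proc/mounts rather than hard-coded.
cg2=$(awk '$3 == "cgroup2" { print $2; exit }' /proc/mounts)
if [ -n "$cg2" ]; then
    echo "cgroup2 mounted at: $cg2"
    echo "available controllers: $(cat "$cg2/cgroup.controllers")"
else
    echo "no cgroup2 filesystem is currently mounted"
fi
```

Controllers listed in the root cgroup.controllers file here are exactly those not claimed by a v1 mount, as described above.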
Cgroups v2 mount options
The following options (mount -o) can be specified when mounting
the cgroup v2 filesystem:
nsdelegate (since Linux 4.15)
Treat cgroup namespaces as delegation boundaries. For
details, see below.
memory_localevents (since Linux 5.2)
The memory.events file should show statistics only for the
cgroup itself, and not for any descendant cgroups. This
was the behavior before Linux 5.2. Starting in Linux 5.2,
the default behavior is to include statistics for
descendant cgroups in memory.events, and this mount option
can be used to revert to the legacy behavior. This option
is system wide and can be set on mount or modified through
remount only from the initial mount namespace; it is
silently ignored in noninitial namespaces.
Cgroups v2 controllers
The following controllers, documented in the kernel source file
Documentation/admin-guide/cgroup-v2.rst (or
Documentation/cgroup-v2.txt in Linux 4.17 and earlier), are
supported in cgroups version 2:
cpu (since Linux 4.15)
This is the successor to the version 1 cpu and cpuacct
controllers.
cpuset (since Linux 5.0)
This is the successor of the version 1 cpuset controller.
freezer (since Linux 5.2)
This is the successor of the version 1 freezer controller.
hugetlb (since Linux 5.6)
This is the successor of the version 1 hugetlb controller.
io (since Linux 4.5)
This is the successor of the version 1 blkio controller.
memory (since Linux 4.5)
This is the successor of the version 1 memory controller.
perf_event (since Linux 4.11)
This is the same as the version 1 perf_event controller.
pids (since Linux 4.5)
This is the same as the version 1 pids controller.
rdma (since Linux 4.11)
This is the same as the version 1 rdma controller.
There is no direct equivalent of the net_cls and net_prio
controllers from cgroups version 1. Instead, support has been
added to iptables(8) to allow eBPF filters that hook on cgroup v2
pathnames to make decisions about network traffic on a per-cgroup
basis.
The v2 devices controller provides no interface files; instead,
device control is gated by attaching an eBPF (BPF_CGROUP_DEVICE)
program to a v2 cgroup.
Cgroups v2 subtree control
Each cgroup in the v2 hierarchy contains the following two files:
cgroup.controllers
This read-only file exposes a list of the controllers that
are available in this cgroup. The contents of this file
match the contents of the cgroup.subtree_control file in
the parent cgroup.
cgroup.subtree_control
This is a list of controllers that are active (enabled) in
the cgroup. The set of controllers in this file is a
subset of the set in the cgroup.controllers of this
cgroup. The set of active controllers is modified by
writing strings to this file containing space-delimited
controller names, each preceded by '+' (to enable a
controller) or '-' (to disable a controller), as in the
following example:
echo '+pids -memory' > x/y/cgroup.subtree_control
An attempt to enable a controller that is not present in
cgroup.controllers leads to an ENOENT error when writing to
the cgroup.subtree_control file.
Because the list of controllers in cgroup.subtree_control is a
subset of those in cgroup.controllers, a controller that has been
disabled in one cgroup in the hierarchy can never be re-enabled
in the subtree below that cgroup.
A cgroup's cgroup.subtree_control file determines the set of
controllers that are exercised in the child cgroups. When a
controller (e.g., pids) is present in the cgroup.subtree_control
file of a parent cgroup, then the corresponding controller-
interface files (e.g., pids.max) are automatically created in the
children of that cgroup and can be used to exert resource control
in the child cgroups.
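The mechanism just described can be sketched as follows. This is illustrative only: it assumes root privileges, a writable cgroup2 mount (the mount point $CG2 is an assumption), and that the pids controller is available; when those conditions do not hold, the script prints the steps instead of running them.

```shell
#!/bin/sh
# Sketch: enable the pids controller for the children of a new
# cgroup and observe that pids.* interface files appear in a child.
# $CG2 is an assumed mount point; adjust it for your system.
CG2=${CG2:-/sys/fs/cgroup}
if [ "$(id -u)" -eq 0 ] && [ -w "$CG2/cgroup.subtree_control" ]; then
    mkdir "$CG2/demo" &&
        echo '+pids' > "$CG2/demo/cgroup.subtree_control" &&
        mkdir "$CG2/demo/child" &&
        ls "$CG2"/demo/child/pids.*   # pids.current, pids.events, pids.max
    rmdir "$CG2/demo/child" "$CG2/demo" 2>/dev/null
else
    echo "steps (run as root on a cgroup2 system):"
    echo "  mkdir $CG2/demo"
    echo "  echo '+pids' > $CG2/demo/cgroup.subtree_control"
    echo "  mkdir $CG2/demo/child"
    echo "  ls $CG2/demo/child/pids.*"
fi
demo_done=1
```

Note that the pids.* files appear in the child only because pids was enabled in the parent's cgroup.subtree_control, not in the child's own.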
Cgroups v2 "no internal processes" rule
Cgroups v2 enforces a so-called "no internal processes" rule.
Roughly speaking, this rule means that, with the exception of the
root cgroup, processes may reside only in leaf nodes (cgroups
that do not themselves contain child cgroups). This avoids the
need to decide how to partition resources between processes which
are members of cgroup A and processes in child cgroups of A.
For instance, if cgroup /cg1/cg2 exists, then a process may
reside in /cg1/cg2, but not in /cg1. This is to avoid an
ambiguity in cgroups v1 with respect to the delegation of
resources between processes in /cg1 and its child cgroups. The
recommended approach in cgroups v2 is to create a subdirectory
called leaf for any nonleaf cgroup which should contain
processes, but no child cgroups. Thus, processes which
previously would have gone into /cg1 would now go into /cg1/leaf.
This has the advantage of making explicit the relationship
between processes in /cg1/leaf and /cg1's other children.
The "no internal processes" rule is in fact more subtle than
stated above. More precisely, the rule is that a (nonroot)
cgroup can't both (1) have member processes, and (2) distribute
resources into child cgroups—that is, have a nonempty
cgroup.subtree_control file. Thus, it is possible for a cgroup
to have both member processes and child cgroups, but before
controllers can be enabled for that cgroup, the member processes
must be moved out of the cgroup (e.g., perhaps into the child
cgroups).
With the Linux 4.14 addition of "thread mode" (described below),
the "no internal processes" rule has been relaxed in some cases.
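The "leaf" pattern described above can be illustrated by the following command sequence. The commands are printed rather than executed here, since running them requires root and a writable cgroup v2 mount; the path /sys/fs/cgroup and the cgroup name cg1 are examples only.

```shell
#!/bin/sh
# Illustration of the "leaf" pattern for the "no internal
# processes" rule.  Printed, not executed: the real commands need
# root, and the paths shown are examples.
cat <<'EOF'
mkdir /sys/fs/cgroup/cg1
echo $$ > /sys/fs/cgroup/cg1/cgroup.procs         # cg1 gains a member
echo '+pids' > /sys/fs/cgroup/cg1/cgroup.subtree_control
                                 # fails (EBUSY): cg1 has member processes
mkdir /sys/fs/cgroup/cg1/leaf
echo $$ > /sys/fs/cgroup/cg1/leaf/cgroup.procs    # move the member down
echo '+pids' > /sys/fs/cgroup/cg1/cgroup.subtree_control  # now succeeds
EOF
shown=1
```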
Cgroups v2 cgroup.events file
Each nonroot cgroup in the v2 hierarchy contains a read-only
file, cgroup.events, whose contents are key-value pairs
(delimited by newline characters, with the key and value
separated by spaces) providing state information about the
cgroup:
$ cat mygrp/cgroup.events
populated 1
frozen 0
The following keys may appear in this file:
populated
The value of this key is 1 if this cgroup or any of its
descendants has member processes, and 0 otherwise.
frozen (since Linux 5.2)
The value of this key is 1 if this cgroup is currently
frozen, or 0 if it is not.
The cgroup.events file can be monitored, in order to receive
notification when the value of one of its keys changes. Such
monitoring can be done using inotify(7), which notifies changes
as IN_MODIFY events, or poll(2), which notifies changes by
returning the POLLPRI and POLLERR bits in the revents field.
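One way to sketch such monitoring from the shell is with inotifywait(1); note this assumes the separately installed inotify-tools package, and the cgroup path in $EVENTS is hypothetical.

```shell
#!/bin/sh
# Sketch: report each change to a cgroup's cgroup.events file.
# Assumes inotifywait(1) from inotify-tools; $EVENTS is a
# hypothetical path and must point at a real cgroup.events file.
EVENTS=${EVENTS:-/sys/fs/cgroup/mygrp/cgroup.events}
if command -v inotifywait >/dev/null 2>&1 && [ -r "$EVENTS" ]; then
    while inotifywait -qq -e modify "$EVENTS"; do
        cat "$EVENTS"    # e.g., "populated 0" after the last member exits
    done
else
    echo "inotifywait not installed or $EVENTS not readable"
fi
watch_done=1
```

A program needing lower latency or many watches would use inotify(7) or poll(2) directly, as described above.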
Cgroup v2 release notification
Cgroups v2 provides a new mechanism for obtaining notification
when a cgroup becomes empty. The cgroups v1 release_agent and
notify_on_release files are removed, and replaced by the
populated key in the cgroup.events file. This key either has the
value 0, meaning that the cgroup (and its descendants) contain no
(nonzombie) member processes, or 1, meaning that the cgroup (or
one of its descendants) contains member processes.
The cgroups v2 release-notification mechanism offers the
following advantages over the cgroups v1 release_agent mechanism:
* It allows for cheaper notification, since a single process can
monitor multiple cgroup.events files (using the techniques
described earlier). By contrast, the cgroups v1 mechanism
requires the expense of creating a process for each
notification.
* Notification for different cgroup subhierarchies can be
delegated to different processes. By contrast, the cgroups v1
mechanism allows only one release agent for an entire
hierarchy.
Cgroups v2 cgroup.stat file
Each cgroup in the v2 hierarchy contains a read-only cgroup.stat
file (first introduced in Linux 4.14) that consists of lines
containing key-value pairs. The following keys currently appear
in this file:
nr_descendants
This is the total number of visible (i.e., living)
descendant cgroups underneath this cgroup.
nr_dying_descendants
This is the total number of dying descendant cgroups
underneath this cgroup. A cgroup enters the dying state
after being deleted. It remains in that state for an
undefined period (which will depend on system load) while
resources are freed before the cgroup is destroyed. Note
that the presence of some cgroups in the dying state is
normal, and is not indicative of any problem.
A process can't be made a member of a dying cgroup, and a
dying cgroup can't be brought back to life.
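The key-value format of cgroup.stat is easy to consume with standard tools. The following sketch parses a sample document with awk(1) and flags an accumulation of dying descendants; on a real system, the here-document would be replaced by a path such as /sys/fs/cgroup/mygrp/cgroup.stat (hypothetical), and the threshold of 100 is arbitrary.

```shell
#!/bin/sh
# Sketch: summarize a cgroup.stat file and flag a large number of
# dying descendants.  A sample document stands in for a real file.
out=$(awk '
    $1 == "nr_descendants"       { live  = $2 }
    $1 == "nr_dying_descendants" { dying = $2 }
    END {
        printf "live=%d dying=%d\n", live, dying
        if (dying > 100) print "many dying cgroups: cleanup may be lagging"
    }' <<'EOF'
nr_descendants 3
nr_dying_descendants 1
EOF
)
echo "$out"    # prints "live=3 dying=1"
```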
Limiting the number of descendant cgroups
Each cgroup in the v2 hierarchy contains the following files,
which can be used to view and set limits on the number of
descendant cgroups under that cgroup:
cgroup.max.depth (since Linux 4.14)
This file defines a limit on the depth of nesting of
descendant cgroups. A value of 0 in this file means that
no descendant cgroups can be created. An attempt to
create a descendant whose nesting level exceeds the limit
fails (mkdir(2) fails with the error EAGAIN).
Writing the string "max" to this file means that no limit
is imposed. The default value in this file is "max".
cgroup.max.descendants (since Linux 4.14)
This file defines a limit on the number of live descendant
cgroups that this cgroup may have. An attempt to create
more descendants than allowed by the limit fails (mkdir(2)
fails with the error EAGAIN).
Writing the string "max" to this file means that no limit
is imposed. The default value in this file is "max".
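The effect of cgroup.max.descendants can be sketched as follows. As before, this assumes root privileges and a writable cgroup2 mount (the mount point $CG2 is an assumption); otherwise the steps are printed rather than executed.

```shell
#!/bin/sh
# Sketch: limit a cgroup to one live descendant and show the
# second mkdir(2) failing.  $CG2 is an assumed mount point.
CG2=${CG2:-/sys/fs/cgroup}
if [ "$(id -u)" -eq 0 ] && [ -w "$CG2" ]; then
    mkdir "$CG2/demo" &&
        echo 1 > "$CG2/demo/cgroup.max.descendants" &&
        mkdir "$CG2/demo/a" &&
        { mkdir "$CG2/demo/b" 2>/dev/null ||
              echo "second mkdir failed as expected (EAGAIN)"; }
    rmdir "$CG2/demo/a" "$CG2/demo" 2>/dev/null
else
    echo "steps (run as root on a cgroup2 system):"
    echo "  mkdir $CG2/demo"
    echo "  echo 1 > $CG2/demo/cgroup.max.descendants"
    echo "  mkdir $CG2/demo/a    # succeeds"
    echo "  mkdir $CG2/demo/b    # fails: EAGAIN"
fi
limit_demo=1
```

cgroup.max.depth behaves analogously, except that the limit applies to nesting depth rather than to the count of live descendants.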
CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER