cgroups(7)

Linux control groups

CGROUPS VERSION 2

In cgroups v2, all mounted controllers reside in a single unified hierarchy. While (different) controllers may be simultaneously mounted under the v1 and v2 hierarchies, it is not possible to mount the same controller simultaneously under both the v1 and the v2 hierarchies.

The new behaviors in cgroups v2 are summarized here, and in some cases elaborated in the following subsections.

1. Cgroups v2 provides a unified hierarchy against which all controllers are mounted.

2. "Internal" processes are not permitted. With the exception of the root cgroup, processes may reside only in leaf nodes (cgroups that do not themselves contain child cgroups). The details are somewhat more subtle than this, and are described below.

3. Active controllers must be specified via the files cgroup.controllers and cgroup.subtree_control.

4. The tasks file has been removed. In addition, the cgroup.clone_children file that is employed by the cpuset controller has been removed.

5. An improved mechanism for notification of empty cgroups is provided by the cgroup.events file.

For more changes, see the Documentation/admin-guide/cgroup-v2.rst file in the kernel source (or Documentation/cgroup-v2.txt in Linux 4.17 and earlier).

Some of the new behaviors listed above saw subsequent modification with the addition in Linux 4.14 of "thread mode" (described below).

Cgroups v2 unified hierarchy

In cgroups v1, the ability to mount different controllers against different hierarchies was intended to allow great flexibility for application design. In practice, though, the flexibility turned out to be less useful than expected, and in many cases added complexity. Therefore, in cgroups v2, all available controllers are mounted against a single hierarchy. The available controllers are automatically mounted, meaning that it is not necessary (or possible) to specify the controllers when mounting the cgroup v2 filesystem using a command such as the following:

mount -t cgroup2 none /mnt/cgroup2

A cgroup v2 controller is available only if it is not currently in use via a mount against a cgroup v1 hierarchy. Or, to put things another way, it is not possible to employ the same controller against both a v1 hierarchy and the unified v2 hierarchy. This means that it may be necessary first to unmount a v1 controller (as described above) before that controller is available in v2. Since systemd(1) makes heavy use of some v1 controllers by default, it can in some cases be simpler to boot the system with selected v1 controllers disabled. To do this, specify the cgroup_no_v1=list option on the kernel boot command line; list is a comma-separated list of the names of the controllers to disable, or the word all to disable all v1 controllers. (This situation is correctly handled by systemd(1), which falls back to operating without the specified controllers.)

Note that on many modern systems, systemd(1) automatically mounts the cgroup2 filesystem at /sys/fs/cgroup/unified during the boot process.
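
Whether and where the cgroup2 filesystem is mounted can be checked by scanning the mount table. The following is a minimal sketch rather than part of the manual page: the function name cgroup2_mountpoints is hypothetical, and the optional argument (a file in the same format as /proc/self/mounts) is an assumption made so that the function can be tried on sample data.

```shell
# Print the mount point of every cgroup2 filesystem listed in a mount table.
# $1 (optional): file in /proc/self/mounts format; defaults to the live table.
cgroup2_mountpoints() {
    # Fields per line: device, mount point, filesystem type, options, ...
    awk '$3 == "cgroup2" { print $2 }' "${1:-/proc/self/mounts}"
}
```

Called without arguments, the function prints nothing if no cgroup v2 hierarchy is currently mounted.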

Cgroups v2 mount options

The following options (mount -o) can be specified when mounting the cgroup v2 filesystem:

nsdelegate (since Linux 4.15) Treat cgroup namespaces as delegation boundaries. For details, see below.

memory_localevents (since Linux 5.2) With this option, memory.events shows statistics only for the cgroup itself, and not for any descendant cgroups. This was the behavior before Linux 5.2. Starting in Linux 5.2, the default behavior is to include statistics for descendant cgroups in memory.events, and this mount option can be used to revert to the legacy behavior. This option is system wide and can be set on mount or modified through remount only from the initial mount namespace; it is silently ignored in noninitial namespaces.
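
As an illustration, both options could be passed in one mount command (mount -t cgroup2 -o nsdelegate,memory_localevents none /mnt/cgroup2). The sketch below checks whether such an option is in effect by inspecting the options field of the mount table. It assumes that the option appears in that field for cgroup2 mounts; the function name cgroup2_has_option and the optional file argument are hypothetical.

```shell
# Succeed if some cgroup2 mount in the mount table carries option $1.
# $2 (optional): file in /proc/self/mounts format; defaults to the live table.
cgroup2_has_option() {
    awk -v want="$1" '
        $3 == "cgroup2" {
            # The fourth field is a comma-separated list of mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == want) found = 1
        }
        END { exit !found }
    ' "${2:-/proc/self/mounts}"
}
```

For example, "cgroup2_has_option nsdelegate && echo delegation boundaries enabled" reports on the running system.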

Cgroups v2 controllers

The following controllers, documented in the kernel source file Documentation/admin-guide/cgroup-v2.rst (or Documentation/cgroup-v2.txt in Linux 4.17 and earlier), are supported in cgroups version 2:

cpu (since Linux 4.15) This is the successor to the version 1 cpu and cpuacct controllers.

cpuset (since Linux 5.0) This is the successor of the version 1 cpuset controller.

freezer (since Linux 5.2) This is the successor of the version 1 freezer controller.

hugetlb (since Linux 5.6) This is the successor of the version 1 hugetlb controller.

io (since Linux 4.5) This is the successor of the version 1 blkio controller.

memory (since Linux 4.5) This is the successor of the version 1 memory controller.

perf_event (since Linux 4.11) This is the same as the version 1 perf_event controller.

pids (since Linux 4.5) This is the same as the version 1 pids controller.

rdma (since Linux 4.11) This is the same as the version 1 rdma controller.

There is no direct equivalent of the net_cls and net_prio controllers from cgroups version 1. Instead, support has been added to iptables(8) to allow eBPF filters that hook on cgroup v2 pathnames to make decisions about network traffic on a per-cgroup basis.

The v2 devices controller provides no interface files; instead, device control is gated by attaching an eBPF (BPF_CGROUP_DEVICE) program to a v2 cgroup.

Cgroups v2 subtree control

Each cgroup in the v2 hierarchy contains the following two files:

cgroup.controllers This read-only file exposes a list of the controllers that are available in this cgroup. The contents of this file match the contents of the cgroup.subtree_control file in the parent cgroup.

cgroup.subtree_control This is a list of controllers that are active (enabled) in the cgroup. The set of controllers in this file is a subset of the set in the cgroup.controllers of this cgroup. The set of active controllers is modified by writing strings to this file containing space-delimited controller names, each preceded by '+' (to enable a controller) or '-' (to disable a controller), as in the following example:

echo '+pids -memory' > x/y/cgroup.subtree_control

An attempt to enable a controller that is not present in cgroup.controllers leads to an ENOENT error when writing to the cgroup.subtree_control file.

Because the list of controllers in cgroup.subtree_control is a subset of those in cgroup.controllers, a controller that has been disabled in one cgroup in the hierarchy can never be re-enabled in the subtree below that cgroup.

A cgroup's cgroup.subtree_control file determines the set of controllers that are exercised in the child cgroups. When a controller (e.g., pids) is present in the cgroup.subtree_control file of a parent cgroup, then the corresponding controller-interface files (e.g., pids.max) are automatically created in the children of that cgroup and can be used to exert resource control in the child cgroups.
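
The relationship between the two files can be sketched as follows: a controller is first looked up in cgroup.controllers, and only then enabled through cgroup.subtree_control, which avoids the ENOENT failure described above. This is an illustrative sketch; the function names controller_available and enable_controller are hypothetical, and the cgroup directory is passed as a parameter.

```shell
# Succeed if controller $2 is listed in the cgroup.controllers file of
# cgroup directory $1.
controller_available() {
    # The file holds space-delimited names; match one whole name exactly.
    tr ' ' '\n' < "$1/cgroup.controllers" | grep -qxF -- "$2"
}

# Enable controller $2 for the children of cgroup directory $1, but only
# if it is available there.
enable_controller() {
    if controller_available "$1" "$2"; then
        echo "+$2" > "$1/cgroup.subtree_control"
    else
        echo "controller '$2' is not available in $1" >&2
        return 1
    fi
}
```

After a successful "enable_controller x/y pids", the children of x/y would be expected to contain files such as pids.max.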

Cgroups v2 "no internal processes" rule

Cgroups v2 enforces a so-called "no internal processes" rule. Roughly speaking, this rule means that, with the exception of the root cgroup, processes may reside only in leaf nodes (cgroups that do not themselves contain child cgroups). This avoids the need to decide how to partition resources between processes which are members of cgroup A and processes in child cgroups of A.

For instance, if cgroup /cg1/cg2 exists, then a process may reside in /cg1/cg2, but not in /cg1. This is to avoid an ambiguity in cgroups v1 with respect to the delegation of resources between processes in /cg1 and its child cgroups. The recommended approach in cgroups v2 is to create a subdirectory called leaf for any nonleaf cgroup which should contain processes, but no child cgroups. Thus, processes which previously would have gone into /cg1 would now go into /cg1/leaf. This has the advantage of making explicit the relationship between processes in /cg1/leaf and /cg1's other children.

The "no internal processes" rule is in fact more subtle than stated above. More precisely, the rule is that a (nonroot) cgroup can't both (1) have member processes, and (2) distribute resources into child cgroups—that is, have a nonempty cgroup.subtree_control file. Thus, it is possible for a cgroup to have both member processes and child cgroups, but before controllers can be enabled for that cgroup, the member processes must be moved out of the cgroup (e.g., perhaps into the child cgroups).

With the Linux 4.14 addition of "thread mode" (described below), the "no internal processes" rule has been relaxed in some cases.
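
The precise form of the rule can be expressed as a check on two files. The sketch below assumes the cgroup.procs file (which lists a cgroup's member processes in v2 but is not described in this section) and ignores both the root-cgroup exemption and thread mode. All function names are hypothetical.

```shell
# Succeed if cgroup directory $1 has at least one member process.
# The file is read rather than tested with "-s", because files in
# pseudo-filesystems commonly report a size of 0.
has_member_processes() {
    grep -q . "$1/cgroup.procs"
}

# Succeed if controllers may be enabled in cgroup.subtree_control of $1,
# that is, if the cgroup currently has no member processes.
can_enable_controllers() {
    ! has_member_processes "$1"
}
```

If can_enable_controllers fails for a cgroup such as /cg1, the member processes would first have to be moved elsewhere, for example into /cg1/leaf.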

Cgroups v2 cgroup.events file

Each nonroot cgroup in the v2 hierarchy contains a read-only file, cgroup.events, whose contents are key-value pairs (delimited by newline characters, with the key and value separated by spaces) providing state information about the cgroup:

$ cat mygrp/cgroup.events
populated 1
frozen 0

The following keys may appear in this file:

populated The value of this key is either 1, if this cgroup or any of its descendants has member processes, or otherwise 0.

frozen (since Linux 5.2) The value of this key is 1 if this cgroup is currently frozen, or 0 if it is not.

The cgroup.events file can be monitored, in order to receive notification when the value of one of its keys changes. Such monitoring can be done using inotify(7), which notifies changes as IN_MODIFY events, or poll(2), which notifies changes by returning the POLLPRI and POLLERR bits in the revents field.
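
Because each line holds one key and one value, a single key can be extracted with a short awk filter. The following is a minimal sketch; the function name cgroup_event_value is hypothetical.

```shell
# Print the value of key $2 from the cgroup.events file in cgroup
# directory $1. Prints nothing if the key is absent.
cgroup_event_value() {
    awk -v key="$2" '$1 == key { print $2; exit }' "$1/cgroup.events"
}
```

For the example file shown earlier, "cgroup_event_value mygrp populated" would print 1.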

Cgroup v2 release notification

Cgroups v2 provides a new mechanism for obtaining notification when a cgroup becomes empty. The cgroups v1 release_agent and notify_on_release files are removed, and replaced by the populated key in the cgroup.events file. This key either has the value 0, meaning that the cgroup (and its descendants) contain no (nonzombie) member processes, or 1, meaning that the cgroup (or one of its descendants) contains member processes.

The cgroups v2 release-notification mechanism offers the following advantages over the cgroups v1 release_agent mechanism:

* It allows for cheaper notification, since a single process can monitor multiple cgroup.events files (using the techniques described earlier). By contrast, the cgroups v1 mechanism requires the expense of creating a process for each notification.

* Notification for different cgroup subhierarchies can be delegated to different processes. By contrast, the cgroups v1 mechanism allows only one release agent for an entire hierarchy.
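
As a rough illustration of waiting for a cgroup to become empty, the sketch below rereads the populated key once per second. Note that this is plain sleep-based polling, chosen only because it needs no extra tools; it is not the inotify(7) or poll(2) notification described earlier, which avoids the repeated reads. The function name wait_until_empty and the timeout parameter are hypothetical.

```shell
# Wait until the cgroup in directory $1 (and its descendants) has no member
# processes, checking once per second for at most $2 seconds (default 10).
# Returns 0 once "populated 0" is seen, or 1 on timeout.
wait_until_empty() {
    remaining=${2:-10}
    while :; do
        populated=$(awk '$1 == "populated" { print $2; exit }' "$1/cgroup.events")
        [ "$populated" = "0" ] && return 0
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
}
```

A caller might run "wait_until_empty mygrp 30 && rmdir mygrp" to remove a cgroup once its last process has exited.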

Cgroups v2 cgroup.stat file

Each cgroup in the v2 hierarchy contains a read-only cgroup.stat file (first introduced in Linux 4.14) that consists of lines containing key-value pairs. The following keys currently appear in this file:

nr_descendants This is the total number of visible (i.e., living) descendant cgroups underneath this cgroup.

nr_dying_descendants This is the total number of dying descendant cgroups underneath this cgroup. A cgroup enters the dying state after being deleted. It remains in that state for an undefined period (which will depend on system load) while resources are freed before the cgroup is destroyed. Note that the presence of some cgroups in the dying state is normal, and is not indicative of any problem.

A process can't be made a member of a dying cgroup, and a dying cgroup can't be brought back to life.
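
The two counters can be read together with a short awk filter. This is a sketch only; the function name descendant_counts and its output format are hypothetical.

```shell
# Print the live and dying descendant counts for the cgroup in directory $1,
# in the form "live=<n> dying=<m>".
descendant_counts() {
    awk '
        $1 == "nr_descendants"       { live = $2 }
        $1 == "nr_dying_descendants" { dying = $2 }
        END { printf "live=%d dying=%d\n", live, dying }
    ' "$1/cgroup.stat"
}
```

A nonzero dying count reported this way is expected from time to time and does not by itself indicate a problem.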

Limiting the number of descendant cgroups

Each cgroup in the v2 hierarchy contains the following files, which can be used to view and set limits on the number of descendant cgroups under that cgroup:

cgroup.max.depth (since Linux 4.14) This file defines a limit on the depth of nesting of descendant cgroups. A value of 0 in this file means that no descendant cgroups can be created. An attempt to create a descendant whose nesting level exceeds the limit fails (mkdir(2) fails with the error EAGAIN).

Writing the string "max" to this file means that no limit is imposed. The default value in this file is "max".

cgroup.max.descendants (since Linux 4.14) This file defines a limit on the number of live descendant cgroups that this cgroup may have. An attempt to create more descendants than allowed by the limit fails (mkdir(2) fails with the error EAGAIN).

Writing the string "max" to this file means that no limit is imposed. The default value in this file is "max".
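
Both limit files accept either a number or the string "max", so a small wrapper can validate a value before writing it. The following is a minimal sketch under that assumption; the function name set_descendant_limit is hypothetical.

```shell
# Write limit $3 to limit file $2 (cgroup.max.depth or
# cgroup.max.descendants) in cgroup directory $1.
# The value must be "max" or a non-negative integer.
set_descendant_limit() {
    case "$2" in
        cgroup.max.depth|cgroup.max.descendants) ;;
        *) echo "unknown limit file: $2" >&2; return 1 ;;
    esac
    case "$3" in
        max) ;;
        ''|*[!0-9]*) echo "invalid limit: $3" >&2; return 1 ;;
    esac
    echo "$3" > "$1/$2"
}
```

For example, "set_descendant_limit mygrp cgroup.max.depth 0" would prevent any child cgroups from being created under mygrp, and writing "max" would lift the limit again.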

CGROUPS DELEGATION: DELEGATING A HIERARCHY TO A LESS PRIVILEGED USER