Execution environment configuration
SYSTEM CALL FILTERING
SystemCallFilter=
Takes a space-separated list of system call names. If this
setting is used, all system calls executed by the unit
processes except for the listed ones will result in immediate
process termination with the SIGSYS signal (allow-listing).
(See SystemCallErrorNumber= below for changing the default
action). If the first character of the list is "~", the
effect is inverted: only the listed system calls will result
in immediate process termination (deny-listing). Deny-listed
system calls and system call groups may optionally be
suffixed with a colon (":") and an "errno" error number
(between 0 and 4095) or an errno name such as EPERM, EACCES
or EUCLEAN (see errno(3) for a full list). This value will be
returned when a deny-listed system call is triggered, instead
of terminating the process immediately. The special setting
"kill" can be used to explicitly specify killing. This value takes
precedence over the one given in SystemCallErrorNumber=, see
below. If running in user mode, or in system mode, but
without the CAP_SYS_ADMIN capability (e.g. setting User=),
NoNewPrivileges=yes is implied. This feature makes use of the
Secure Computing Mode 2 interfaces of the kernel ('seccomp
filtering') and is useful for enforcing a minimal sandboxing
environment. Note that the execve(), exit(), exit_group(),
getrlimit(), rt_sigreturn(), sigreturn() system calls and the
system calls for querying time and sleeping are implicitly
allow-listed and do not need to be listed explicitly. This
option may be specified more than once, in which case the
filter masks are merged. If the empty string is assigned, the
filter is reset and all prior assignments have no effect.
This does not affect commands prefixed with "+".
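As an illustration of the per-call error suffix described above, a
deny list might look like the following sketch. The choice of @clock,
@swap and ptrace is only an assumption for the example, not a
recommendation taken from this page.

```ini
[Service]
# Deny-listed calls in @clock return EPERM instead of killing the process.
SystemCallFilter=~@clock:EPERM
# ptrace is explicitly set to kill, overriding SystemCallErrorNumber=.
SystemCallFilter=~ptrace:kill
# @swap carries no suffix, so it falls back to SystemCallErrorNumber=.
SystemCallFilter=~@swap
SystemCallErrorNumber=EACCES
```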
Note that on systems supporting multiple ABIs (such as
x86/x86-64) it is recommended to turn off alternative ABIs
for services, so that they cannot be used to circumvent the
restrictions of this option. Specifically, it is recommended
to combine this option with SystemCallArchitectures=native or
similar.
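A minimal sketch of that combination, assuming a service for which
the @system-service set is sufficient:

```ini
[Service]
SystemCallFilter=@system-service
# Reject system calls made through non-native ABIs (for example 32-bit
# x86 calls on an x86-64 host), so they cannot bypass the filter above.
SystemCallArchitectures=native
```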
Note that strict system call filters may impact execution and
error handling code paths of the service invocation.
Specifically, access to the execve() system call is required
for the execution of the service binary; if it is blocked,
service invocation will necessarily fail. Also, if execution
of the service binary fails for some reason (for example:
missing service executable), the error handling logic might
require access to an additional set of system calls in order
to process and log this failure correctly. It might be
necessary to temporarily disable system call filters in order
to simplify debugging of such failures.
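One way to lift the filter temporarily while debugging is a drop-in
that assigns the empty string, which resets the filter as described
earlier. The unit name and drop-in path below are hypothetical.

```ini
# /etc/systemd/system/example.service.d/10-debug-no-filter.conf
# (hypothetical path; remove this file once debugging is done)
[Service]
# An empty assignment resets the filter, so earlier
# SystemCallFilter= lines from the unit file no longer apply.
SystemCallFilter=
```

After adding or removing such a drop-in, the manager configuration
needs to be reloaded (systemctl daemon-reload) and the unit
restarted for the change to take effect.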
If you specify both types of this option (i.e. allow-listing
and deny-listing), the first encountered will take precedence
and will dictate the default action (termination or approval
of a system call). Subsequent occurrences of this option
will then add or delete the listed system calls from the set
of filtered system calls, depending on their type and the
default action. (For example, if you have started with an
allow list rule for read() and write(), and right after it
add a deny list rule for write(), then write() will be
removed from the set.)
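The ordering rule can be sketched with a common pattern: start with
an allow list, then carve calls out of it with a deny list. The
specific sets removed here are an assumption for illustration.

```ini
[Service]
# The first occurrence is an allow list, so the default action is to
# reject anything that is not listed.
SystemCallFilter=@system-service
# A later deny list removes these sets from the allowed calls.
SystemCallFilter=~@privileged @resources
```

Because the allow list comes first, calls that appear in neither
line remain blocked.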
As the number of possible system calls is large, predefined
sets of system calls are provided. A set name starts with the
"@" character, followed by the name of the set.
Table 3. Currently predefined system call sets
┌────────────────┬──────────────────────────────────────────────────────────┐
│Set             │ Description                                              │
├────────────────┼──────────────────────────────────────────────────────────┤
│@aio            │ Asynchronous I/O (io_setup(2), io_submit(2), and related │
│                │ calls)                                                   │
├────────────────┼──────────────────────────────────────────────────────────┤
│@basic-io       │ System calls for basic I/O: reading, writing, seeking,   │
│                │ file descriptor duplication and closing (read(2),        │
│                │ write(2), and related calls)                             │
├────────────────┼──────────────────────────────────────────────────────────┤
│@chown          │ Changing file ownership (chown(2), fchownat(2), and      │
│                │ related calls)                                           │
├────────────────┼──────────────────────────────────────────────────────────┤
│@clock          │ System calls for changing the system clock (adjtimex(2), │
│                │ settimeofday(2), and related calls)                      │
├────────────────┼──────────────────────────────────────────────────────────┤
│@cpu-emulation  │ System calls for CPU emulation functionality (vm86(2)    │
│                │ and related calls)                                       │
├────────────────┼──────────────────────────────────────────────────────────┤
│@debug          │ Debugging, performance monitoring and tracing            │
│                │ functionality (ptrace(2), perf_event_open(2) and related │
│                │ calls)                                                   │
├────────────────┼──────────────────────────────────────────────────────────┤
│@file-system    │ File system operations: opening, creating files and      │
│                │ directories for read and write, renaming and removing    │
│                │ them, reading file properties, or creating hard and      │
│                │ symbolic links                                           │
├────────────────┼──────────────────────────────────────────────────────────┤
│@io-event       │ Event loop system calls (poll(2), select(2), epoll(7),   │
│                │ eventfd(2) and related calls)                            │
├────────────────┼──────────────────────────────────────────────────────────┤
│@ipc            │ Pipes, SysV IPC, POSIX Message Queues and other IPC      │
│                │ (mq_overview(7), svipc(7))                               │
├────────────────┼──────────────────────────────────────────────────────────┤
│@keyring        │ Kernel keyring access (keyctl(2) and related calls)      │
├────────────────┼──────────────────────────────────────────────────────────┤
│@memlock        │ Locking of memory in RAM (mlock(2), mlockall(2) and      │
│                │ related calls)                                           │
├────────────────┼──────────────────────────────────────────────────────────┤
│@module         │ Loading and unloading of kernel modules (init_module(2), │
│                │ delete_module(2) and related calls)                      │
├────────────────┼──────────────────────────────────────────────────────────┤
│@mount          │ Mounting and unmounting of file systems (mount(2),       │
│                │ chroot(2), and related calls)                            │
├────────────────┼──────────────────────────────────────────────────────────┤
│@network-io     │ Socket I/O (including local AF_UNIX): socket(7), unix(7) │
├────────────────┼──────────────────────────────────────────────────────────┤
│@obsolete       │ Unusual, obsolete or unimplemented (create_module(2),    │
│                │ gtty(2), ...)                                            │
├────────────────┼──────────────────────────────────────────────────────────┤
│@privileged     │ All system calls which need super-user capabilities      │
│                │ (capabilities(7))                                        │
├────────────────┼──────────────────────────────────────────────────────────┤
│@process        │ Process control, execution, namespacing operations       │
│                │ (clone(2), kill(2), namespaces(7), ...)                  │
├────────────────┼──────────────────────────────────────────────────────────┤
│@raw-io         │ Raw I/O port access (ioperm(2), iopl(2),                 │
│                │ pciconfig_read(), ...)                                   │
├────────────────┼──────────────────────────────────────────────────────────┤
│@reboot         │ System calls for rebooting and reboot preparation        │
│                │ (reboot(2), kexec(), ...)                                │
├────────────────┼──────────────────────────────────────────────────────────┤
│@resources      │ System calls for changing resource limits, memory and    │
│                │ scheduling parameters (setrlimit(2), setpriority(2), ...)│
├────────────────┼──────────────────────────────────────────────────────────┤
│@setuid         │ System calls for changing user ID and group ID           │
│                │ credentials (setuid(2), setgid(2), setresuid(2), ...)    │
├────────────────┼──────────────────────────────────────────────────────────┤
│@signal         │ System calls for manipulating and handling process       │
│                │ signals (signal(2), sigprocmask(2), ...)                 │
├────────────────┼──────────────────────────────────────────────────────────┤
│@swap           │ System calls for enabling/disabling swap devices         │
│                │ (swapon(2), swapoff(2))                                  │
├────────────────┼──────────────────────────────────────────────────────────┤
│@sync           │ Synchronizing files and memory to disk (fsync(2),        │
│                │ msync(2), and related calls)                             │
├────────────────┼──────────────────────────────────────────────────────────┤
│@system-service │ A reasonable set of system calls used by common system   │
│                │ services, excluding any special purpose calls. This is   │
│                │ the recommended starting point for allow-listing system  │
│                │ calls for system services, as it contains what is        │
│                │ typically needed by system services, but excludes overly │
│                │ specific interfaces. For example, the following APIs are │
│                │ excluded: "@clock", "@mount", "@swap", "@reboot".        │
├────────────────┼──────────────────────────────────────────────────────────┤
│@timer          │ System calls for scheduling operations by time           │
│                │ (alarm(2), timer_create(2), ...)                         │
├────────────────┼──────────────────────────────────────────────────────────┤
│@known          │ All system calls defined by the kernel. This list is     │
│                │ defined statically in systemd based on a kernel version  │
│                │ that was available when this systemd version was         │
│                │ released. It will become progressively more out-of-date  │
│                │ as the kernel is updated.                                │
└────────────────┴──────────────────────────────────────────────────────────┘
Note that as new system calls are added to the kernel,
additional system calls might be added to the groups above.
Contents of the sets may also change between systemd
versions. In addition, the list of system calls depends on
the kernel version and architecture for which systemd was
compiled. Use systemd-analyze syscall-filter to list the
actual system calls in each filter.
Generally, allow-listing system calls (rather than
deny-listing) is the safer mode of operation. It is
recommended to enforce system call allow lists for all
long-running system services. Specifically, the following
lines are a relatively safe basic choice for the majority of
system services:
[Service]
SystemCallFilter=@system-service
SystemCallErrorNumber=EPERM
Note that various kernel system calls are defined
redundantly: there are multiple system calls for executing
the same operation. For example, the pidfd_send_signal()
system call may be used to execute operations similar to what
can be done with the older kill() system call, hence blocking
the latter without the former only provides weak protection.
Since new system calls are added regularly to the kernel as
development progresses, keeping system call deny lists
comprehensive requires constant work. It is thus recommended
to use allow-listing instead, which offers the benefit that
new system calls are by default implicitly blocked until the
allow list is updated.
Also note that a number of system calls are required to be
accessible for the dynamic linker to work. The dynamic linker
is required for running most regular programs (specifically:
all dynamic ELF binaries, which is how most distributions
build packaged programs). This means that blocking these
system calls (which include open(), openat() or mmap()) will
make most programs typically shipped with generic
distributions unusable.
It is recommended to combine the file system namespacing
related options with SystemCallFilter=~@mount, in order to
prevent the unit's processes from undoing the mappings.
Specifically these are the options PrivateTmp=,
PrivateDevices=, ProtectSystem=, ProtectHome=,
ProtectKernelTunables=, ProtectControlGroups=,
ProtectKernelLogs=, ProtectClock=, ReadOnlyPaths=,
InaccessiblePaths= and ReadWritePaths=.
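A sketch of such a combination, with option values chosen only as an
example:

```ini
[Service]
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
# Block mount(2), chroot(2) and related calls so the namespacing set
# up above cannot be undone from inside the unit.
SystemCallFilter=~@mount
```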
SystemCallErrorNumber=
Takes an "errno" error number (between 1 and 4095) or errno
name such as EPERM, EACCES or EUCLEAN, to return when the
system call filter configured with SystemCallFilter= is
triggered, instead of terminating the process immediately.
See errno(3) for a full list of error codes. When this
setting is not used, or when the empty string or the special
setting "kill" is assigned, the process will be terminated
immediately when the filter is triggered.
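For instance, a drop-in could switch a unit back to immediate
termination by using the special value. This is only a sketch;
whether termination or an error return is preferable depends on the
service.

```ini
[Service]
# Restore the default action: terminate the process when the filter
# is triggered, instead of returning an errno.
SystemCallErrorNumber=kill
```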
SystemCallArchitectures=
Takes a space-separated list of architecture identifiers to
include in the system call filter. The known architecture
identifiers are the same as for ConditionArchitecture=
described in systemd.unit(5), as well as x32, mips64-n32,
mips64-le-n32, and the special identifier native. The special
identifier native implicitly maps to the native architecture
of the system (or more precisely: to the architecture the
system manager is compiled for). If running in user mode, or
in system mode, but without the CAP_SYS_ADMIN capability
(e.g. setting User=), NoNewPrivileges=yes is implied. By
default, this option is set to the empty list, i.e. no
filtering is applied.
If this setting is used, processes of this unit will only be
permitted to call native system calls, and system calls of
the specified architectures. For the purposes of this option,
the x32 architecture is treated as including x86-64 system
calls. However, this setting still fulfills its purpose, as
explained below, on x32.
System call filtering is not equally effective on all
architectures. For example, on x86 filtering of network
socket-related calls is not possible, due to ABI limitations
— a limitation that x86-64 does not have, however. On systems
supporting multiple ABIs at the same time — such as
x86/x86-64 — it is hence recommended to limit the set of
permitted system call architectures so that secondary ABIs
may not be used to circumvent the restrictions applied to the
native ABI of the system. In particular, setting
SystemCallArchitectures=native is a good choice for disabling
non-native ABIs.
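If a service legitimately needs a secondary ABI, it can be listed
next to native. The sketch below assumes an x86-64 host running a
service that also executes 32-bit x86 helpers; x86 is one of the
identifiers accepted by ConditionArchitecture=.

```ini
[Service]
# Permit native (x86-64) system calls plus the 32-bit x86 ABI.
SystemCallArchitectures=native x86
```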
System call architectures may also be restricted system-wide
via the SystemCallArchitectures= option in the global
configuration. See systemd-system.conf(5) for details.
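A system-wide default might be sketched as follows. The [Manager]
section follows systemd-system.conf(5); treat the file path as an
assumption that may differ on your distribution.

```ini
# /etc/systemd/system.conf
[Manager]
SystemCallArchitectures=native
```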
SystemCallLog=
Takes a space-separated list of system call names. If this
setting is used, the listed system calls will be logged when
executed by the unit's processes. If the first character of
the list is "~", the effect is inverted: all system calls
except the listed ones will be logged. If running in user
mode, or in system mode, but without the CAP_SYS_ADMIN
capability (e.g. setting User=), NoNewPrivileges=yes is
implied. This feature makes use of the Secure Computing Mode
2 interfaces of the kernel ('seccomp filtering') and is
useful for auditing or setting up a minimal sandboxing
environment. This option may be specified more than once, in
which case the filter masks are merged. If the empty string
is assigned, the filter is reset and all prior assignments
have no effect. This does not affect commands prefixed with
"+".
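As a sketch, logging could be limited to a few sensitive groups
rather than every call. The sets chosen here are an assumption for
the example.

```ini
[Service]
# Log use of privileged and mount-related calls by this unit.
SystemCallLog=@privileged @mount
```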