Seccomp user-space notification mechanism
Description
This page describes the user-space notification mechanism
provided by the Secure Computing (seccomp) facility. As well as
the use of the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the
SECCOMP_RET_USER_NOTIF action value, and the
SECCOMP_GET_NOTIF_SIZES operation described in seccomp(2), this
mechanism involves the use of a number of related ioctl(2)
operations (described below).
Overview
In conventional usage of a seccomp filter, the decision about how
to treat a system call is made by the filter itself. By
contrast, the user-space notification mechanism allows the
seccomp filter to delegate the handling of the system call to
another user-space process. Note that this mechanism is
explicitly not intended as a method of implementing security
policy; see NOTES.
In the discussion that follows, the thread(s) on which the
seccomp filter is installed is (are) referred to as the target,
and the process that is notified by the user-space notification
mechanism is referred to as the supervisor.
A suitably privileged supervisor can use the user-space
notification mechanism to perform actions on behalf of the
target. The advantage of the user-space notification mechanism
is that the supervisor will usually be able to retrieve
information about the target and the performed system call that
the seccomp filter itself cannot. (A seccomp filter is limited
in the information it can obtain and the actions that it can
perform because it is running on a virtual machine inside the
kernel.)
An overview of the steps performed by the target and the
supervisor is as follows:
1. The target establishes a seccomp filter in the usual manner,
but with two differences:
• The seccomp(2) flags argument includes the flag
SECCOMP_FILTER_FLAG_NEW_LISTENER. Consequently, the return
value of the (successful) seccomp(2) call is a new
"listening" file descriptor that can be used to receive
notifications. Only one "listening" seccomp filter can be
installed for a thread.
• In cases where it is appropriate, the seccomp filter returns
the action value SECCOMP_RET_USER_NOTIF. This return value
will trigger a notification event.
2. In order that the supervisor can obtain notifications using
the listening file descriptor, (a duplicate of) that file
descriptor must be passed from the target to the supervisor.
One way in which this could be done is by passing the file
descriptor over a UNIX domain socket connection between the
target and the supervisor (using the SCM_RIGHTS ancillary
message type described in unix(7)). Another way to do this is
through the use of pidfd_getfd(2).
3. The supervisor will receive notification events on the
listening file descriptor. These events are returned as
structures of type seccomp_notif. Because this structure and
its size may evolve over kernel versions, the supervisor must
first determine the size of this structure using the
seccomp(2) SECCOMP_GET_NOTIF_SIZES operation, which returns a
structure of type seccomp_notif_sizes. The supervisor
allocates a buffer of size seccomp_notif_sizes.seccomp_notif
bytes to receive notification events. In addition, the
supervisor allocates another buffer of size
seccomp_notif_sizes.seccomp_notif_resp bytes for the response
(a struct seccomp_notif_resp structure) that it will provide
to the kernel (and thus the target).
4. The target then performs its workload, which includes system
calls that will be controlled by the seccomp filter. Whenever
one of these system calls causes the filter to return the
SECCOMP_RET_USER_NOTIF action value, the kernel does not (yet)
execute the system call; instead, execution of the target is
temporarily blocked inside the kernel (in a sleep state that
is interruptible by signals) and a notification event is
generated on the listening file descriptor.
5. The supervisor can now repeatedly monitor the listening file
descriptor for SECCOMP_RET_USER_NOTIF-triggered events. To do
this, the supervisor uses the SECCOMP_IOCTL_NOTIF_RECV
ioctl(2) operation to read information about a notification
event; this operation blocks until an event is available. The
operation returns a seccomp_notif structure containing
information about the system call that is being attempted by
the target. (As described in NOTES, the file descriptor can
also be monitored with select(2), poll(2), or epoll(7).)
6. The seccomp_notif structure returned by the
SECCOMP_IOCTL_NOTIF_RECV operation includes the same
information (a seccomp_data structure) that was passed to the
seccomp filter. This information allows the supervisor to
discover the system call number and the arguments for the
target's system call. In addition, the notification event
contains the ID of the thread that triggered the notification
and a unique cookie value that is used in subsequent
SECCOMP_IOCTL_NOTIF_ID_VALID and SECCOMP_IOCTL_NOTIF_SEND
operations.
The information in the notification can be used to discover
the values of pointer arguments for the target's system call.
(This is something that can't be done from within a seccomp
filter.) One way in which the supervisor can do this is to
open the corresponding /proc/[tid]/mem file (see proc(5)) and
read bytes from the location that corresponds to one of the
pointer arguments whose value is supplied in the notification
event. (The supervisor must be careful to avoid a race
condition that can occur when doing this; see the description
of the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) operation below.)
In addition, the supervisor can access other system
information that is visible in user space but which is not
accessible from a seccomp filter.
7. Having obtained information as per the previous step, the
supervisor may then choose to perform an action in response to
the target's system call (which, as noted above, is not
executed when the seccomp filter returns the
SECCOMP_RET_USER_NOTIF action value).
One example use case here relates to containers. The target
may be located inside a container where it does not have
sufficient capabilities to mount a filesystem in the
container's mount namespace. However, the supervisor may be a
more privileged process that does have sufficient capabilities
to perform the mount operation.
8. The supervisor then sends a response to the notification. The
information in this response is used by the kernel to
construct a return value for the target's system call and
provide a value that will be assigned to the errno variable of
the target.
The response is sent using the SECCOMP_IOCTL_NOTIF_SEND
ioctl(2) operation, which is used to transmit a
seccomp_notif_resp structure to the kernel. This structure
must include the cookie value that the supervisor obtained in
the seccomp_notif structure returned by the
SECCOMP_IOCTL_NOTIF_RECV operation; the cookie allows the
kernel to associate the response with the target.
9. Once the response has been sent, the system call in the
target thread unblocks, returning the information that was
provided by the supervisor in the notification response.
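The steps above can be sketched in C as follows. This is a minimal
illustration rather than a complete program: the function names
(install_notify_filter, fill_error_response, handle_one_notification)
are hypothetical, the filter omits the architecture check that a real
filter should perform, and passing the listening file descriptor from
the target to the supervisor (step 2) is left out. The sketch assumes
kernel headers recent enough to define the user-space notification
constants.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Step 1 (target): install a filter that returns SECCOMP_RET_USER_NOTIF
   for one system call and allows all others. Returns the listening file
   descriptor, or -1 on error. A real filter should first check
   seccomp_data.arch; that check is omitted here for brevity. */
int install_notify_filter(int syscall_nr)
{
    struct sock_filter insns[] = {
        /* Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* Match: fall through to USER_NOTIF; no match: skip to ALLOW. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int) syscall_nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short) (sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Unprivileged callers must set no_new_privs before installing a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                         SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}

/* Step 8: build a response that makes the target's system call fail
   with the error number err. The error field holds a negated errno. */
void fill_error_response(struct seccomp_notif_resp *resp, __u64 id, int err)
{
    assert(resp != NULL && err > 0);
    memset(resp, 0, sizeof(*resp));
    resp->id = id;       /* cookie taken from the seccomp_notif structure */
    resp->error = -err;
}

/* Steps 3, 5 and 8 (supervisor): size and allocate the buffers, wait for
   one notification, and answer it with a spoofed error. Returns 0 on
   success, -1 on failure. */
int handle_one_notification(int listener, int err)
{
    struct seccomp_notif_sizes sizes;
    struct seccomp_notif *req = NULL;
    struct seccomp_notif_resp *resp = NULL;
    size_t req_size, resp_size;
    int rc = -1;

    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes) == -1)
        return -1;

    /* Never allocate less than the structures this code was compiled with. */
    req_size = sizes.seccomp_notif > sizeof(*req) ?
               sizes.seccomp_notif : sizeof(*req);
    resp_size = sizes.seccomp_notif_resp > sizeof(*resp) ?
                sizes.seccomp_notif_resp : sizeof(*resp);

    req = calloc(1, req_size);    /* zeroed buffer for the notification */
    resp = calloc(1, resp_size);
    if (req == NULL || resp == NULL)
        goto out;

    /* Blocks until the target triggers SECCOMP_RET_USER_NOTIF. */
    if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, req) == -1)
        goto out;

    /* req->pid and req->data (system call number and arguments) are
       available here for inspection before deciding on a response. */
    fill_error_response(resp, req->id, err);

    if (ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, resp) == -1)
        goto out;
    rc = 0;
out:
    free(req);
    free(resp);
    return rc;
}
```

In such a sketch, the target would call install_notify_filter() and
hand the returned descriptor to the supervisor, which would then call
handle_one_notification() in a loop. A supervisor that performs the
operation on the target's behalf would set the val field instead of
error.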
As a variation on the last two steps, the supervisor can send a
response that tells the kernel that it should execute the target
thread's system call; see the discussion of
SECCOMP_USER_NOTIF_FLAG_CONTINUE, below.
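A response of this kind might be built as in the following sketch.
The helper names (fill_continue_response, send_continue) are
hypothetical, and the caller is assumed to supply a response buffer
sized as described in step 3. The val and error fields are left at
zero, which the kernel requires when this flag is set.

```c
#include <assert.h>
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>

/* Build a response asking the kernel to execute the target's system
   call instead of spoofing a return value. */
void fill_continue_response(struct seccomp_notif_resp *resp, __u64 id)
{
    assert(resp != NULL);
    memset(resp, 0, sizeof(*resp));   /* val and error stay 0 */
    resp->id = id;                    /* cookie from the notification */
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Send the "continue" response on the listening file descriptor.
   Returns 0 on success, or -1 with errno set on failure. */
int send_continue(int listener, struct seccomp_notif_resp *resp, __u64 id)
{
    fill_continue_response(resp, id);
    return ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
```

Because the kernel then runs the system call with whatever arguments
the target has in place at that moment, this variation does not make
the mechanism suitable for enforcing security policy; see NOTES.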