The following ioctl(2) operations are supported by the seccomp
user-space notification file descriptor. For each of these
operations, the first (file descriptor) argument of ioctl(2) is
the listening file descriptor returned by a call to seccomp(2)
with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag.
SECCOMP_IOCTL_NOTIF_RECV
The SECCOMP_IOCTL_NOTIF_RECV operation (available since Linux 5.0)
is used to obtain a user-space notification event. If no
such event is currently pending, the operation blocks until an
event occurs. The third ioctl(2) argument is a pointer to a
structure of the following form which contains information about
the event. This structure must be zeroed out before the call.
struct seccomp_notif {
    __u64  id;                  /* Cookie */
    __u32  pid;                 /* TID of target thread */
    __u32  flags;               /* Currently unused (0) */
    struct seccomp_data data;   /* See seccomp(2) */
};
The fields in this structure are as follows:
id This is a cookie for the notification. Each such cookie
is guaranteed to be unique for the corresponding seccomp
filter.
• The cookie can be used with the SECCOMP_IOCTL_NOTIF_ID_VALID
ioctl(2) operation described below.
• When returning a notification response to the kernel,
the supervisor must include the cookie value in the
seccomp_notif_resp structure that is specified as the
argument of the SECCOMP_IOCTL_NOTIF_SEND operation.
pid This is the thread ID of the target thread that triggered
the notification event.
flags This is a bit mask of flags providing further information
on the event. In the current implementation, this field
is always zero.
data This is a seccomp_data structure containing information
about the system call that triggered the notification.
This is the same structure that is passed to the seccomp
filter. See seccomp(2) for details of this structure.
On success, this operation returns 0; on failure, -1 is returned,
and errno is set to indicate the cause of the error. This
operation can fail with the following errors:
EINVAL (since Linux 5.5)
The seccomp_notif structure that was passed to the call
contained nonzero fields.
ENOENT
The target thread was killed by a signal as the
notification information was being generated, or the
target's (blocked) system call was interrupted by a signal
handler.
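As a minimal sketch of the call described above, the following helper zeroes the structure before each SECCOMP_IOCTL_NOTIF_RECV call, as required. The helper name recv_notification() is hypothetical, and the sketch assumes that the caller supplies a buffer that is large enough; a robust supervisor would normally size the buffer using the SECCOMP_GET_NOTIF_SIZES operation of seccomp(2) rather than sizeof.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper: fetch the next notification from 'notify_fd'.
   Returns 0 on success, or -1 with errno set by ioctl(2). */
int recv_notification(int notify_fd, struct seccomp_notif *req)
{
    /* Since Linux 5.5, nonzero fields cause the call to fail with EINVAL */
    memset(req, 0, sizeof(*req));

    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) == -1)
        return -1;

    return 0;
}
```

On an ENOENT failure the event has already vanished, so a supervisor loop can usually just call the helper again to wait for the next notification.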
SECCOMP_IOCTL_NOTIF_ID_VALID
The SECCOMP_IOCTL_NOTIF_ID_VALID operation (available since Linux
5.0) is used to check that a notification ID returned by an
earlier SECCOMP_IOCTL_NOTIF_RECV operation is still valid (i.e.,
that the target still exists and its system call is still blocked
waiting for a response).
The third ioctl(2) argument is a pointer to the cookie (id)
returned by the SECCOMP_IOCTL_NOTIF_RECV operation.
This operation is necessary to avoid race conditions that can
occur when the thread identified by the pid returned by the
SECCOMP_IOCTL_NOTIF_RECV operation terminates and its ID is
reused by another thread or process. An example of this kind of
race is the following:
1. A notification is generated on the listening file descriptor.
The returned seccomp_notif contains the TID of the target
thread (in the pid field of the structure).
2. The target terminates.
3. Another thread or process is created on the system that by
chance reuses the TID that was freed when the target
terminated.
4. The supervisor open(2)s the /proc/[tid]/mem file for the TID
obtained in step 1, with the intention of (say) inspecting the
memory location(s) that contain the argument(s) of the
system call that triggered the notification in step 1.
In the above scenario, the risk is that the supervisor may try to
access the memory of a process other than the target. This race
can be avoided by following the call to open(2) with a
SECCOMP_IOCTL_NOTIF_ID_VALID operation to verify that the process
that generated the notification is still alive. (Note that if
the target terminates after the latter step, a subsequent read(2)
from the file descriptor may return 0, indicating end of file.)
See NOTES for a discussion of other cases where
SECCOMP_IOCTL_NOTIF_ID_VALID checks must be performed.
On success (i.e., the notification ID is still valid), this
operation returns 0. On failure (i.e., the notification ID is no
longer valid), -1 is returned, and errno is set to ENOENT.
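The open-then-validate ordering described above can be sketched as follows. The helper name open_target_mem() is hypothetical. The sketch assumes that req was filled in by an earlier SECCOMP_IOCTL_NOTIF_RECV call and that /proc is mounted.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: open the target's /proc/[tid]/mem file and confirm
   that the notification is still valid. Returns an open file descriptor,
   or -1 if the open failed or the notification ID is no longer valid. */
int open_target_mem(int notify_fd, const struct seccomp_notif *req)
{
    char path[64];
    int mem_fd;
    __u64 id = req->id;

    snprintf(path, sizeof(path), "/proc/%u/mem", (unsigned int) req->pid);

    mem_fd = open(path, O_RDONLY | O_CLOEXEC);
    if (mem_fd == -1)
        return -1;

    /* The check comes after open(2): if the ID is still valid at this
       point, the file that was just opened belongs to the target and
       not to some later task that reused the TID. */
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1) {
        close(mem_fd);
        return -1;
    }

    return mem_fd;
}
```

Performing the check before open(2) would not close the race, since the target could still terminate and its TID be reused between the check and the open.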
SECCOMP_IOCTL_NOTIF_SEND
The SECCOMP_IOCTL_NOTIF_SEND operation (available since Linux
5.0) is used to send a notification response back to the kernel.
The third ioctl(2) argument is a pointer to a
structure of the following form:
struct seccomp_notif_resp {
    __u64 id;       /* Cookie value */
    __s64 val;      /* Success return value */
    __s32 error;    /* 0 (success) or negative error number */
    __u32 flags;    /* See below */
};
The fields of this structure are as follows:
id This is the cookie value that was obtained using the
SECCOMP_IOCTL_NOTIF_RECV operation. This cookie value
allows the kernel to correctly associate this response
with the system call that triggered the user-space
notification.
val This is the value that will be used for a spoofed success
return for the target's system call; see below.
error This is the value that will be used as the error number
(errno) for a spoofed error return for the target's system
call; see below.
flags This is a bit mask that includes zero or more of the
following flags:
SECCOMP_USER_NOTIF_FLAG_CONTINUE (since Linux 5.5)
Tell the kernel to execute the target's system
call.
Two kinds of response are possible:
• A response to the kernel telling it to execute the target's
system call. In this case, the flags field includes
SECCOMP_USER_NOTIF_FLAG_CONTINUE and the error and val fields
must be zero.
This kind of response can be useful in cases where the
supervisor needs to do deeper analysis of the target's system
call than is possible from a seccomp filter (e.g., examining
the values of pointer arguments), and, having decided that the
system call does not require emulation by the supervisor, the
supervisor wants the system call to be executed normally in the
target.
The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag should be used with
caution; see NOTES.
• A spoofed return value for the target's system call. In this
case, the kernel does not execute the target's system call,
instead causing the system call to return a spoofed value as
specified by fields of the seccomp_notif_resp structure. The
supervisor should set the fields of this structure as follows:
+ flags does not contain SECCOMP_USER_NOTIF_FLAG_CONTINUE.
+ error is set either to 0 for a spoofed "success" return or
to a negative error number for a spoofed "failure" return.
In the former case, the kernel causes the target's system
call to return the value specified in the val field. In the
latter case, the kernel causes the target's system call to
return -1, and errno is assigned the negated error value.
+ val is set to a value that will be used as the return value
for a spoofed "success" return for the target's system call.
The value in this field is ignored if the error field
contains a nonzero value.
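The two kinds of response can be illustrated by small constructor functions. This is only a sketch: the function names are hypothetical, and the kernel headers are assumed to be recent enough (Linux 5.5 or later) to define SECCOMP_USER_NOTIF_FLAG_CONTINUE.

```c
#include <errno.h>
#include <linux/seccomp.h>

/* Response telling the kernel to execute the target's system call;
   error and val must be zero in this case. */
struct seccomp_notif_resp make_continue_resp(__u64 id)
{
    struct seccomp_notif_resp resp = {
        .id = id,
        .val = 0,
        .error = 0,
        .flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE,
    };
    return resp;
}

/* Spoofed "success" return: the target's system call returns 'val'. */
struct seccomp_notif_resp make_success_resp(__u64 id, __s64 val)
{
    struct seccomp_notif_resp resp = {
        .id = id,
        .val = val,
        .error = 0,
        .flags = 0,
    };
    return resp;
}

/* Spoofed "failure" return: 'err' is a positive errno value such as
   EPERM; it is stored negated, and the target's system call then
   returns -1 with errno set to 'err'. */
struct seccomp_notif_resp make_error_resp(__u64 id, int err)
{
    struct seccomp_notif_resp resp = {
        .id = id,
        .val = 0,           /* Ignored when error is nonzero */
        .error = -err,
        .flags = 0,
    };
    return resp;
}
```

For example, make_error_resp(req->id, EPERM) would build a response that makes the target's system call fail with EPERM, where req is assumed to point to the structure filled in by SECCOMP_IOCTL_NOTIF_RECV.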
On success, this operation returns 0; on failure, -1 is returned,
and errno is set to indicate the cause of the error. This
operation can fail with the following errors:
EINPROGRESS
A response to this notification has already been sent.
EINVAL
An invalid value was specified in the flags field.
EINVAL
The flags field contained SECCOMP_USER_NOTIF_FLAG_CONTINUE, and
the error or val field was not zero.
ENOENT
The blocked system call in the target has been interrupted
by a signal handler or the target has terminated.
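One possible way of handling these errors is sketched below. The helper name send_response() and its return convention are hypothetical. The sketch assumes that the supervisor treats ENOENT as a benign condition, since in that case no target remains waiting for the response, and treats every other error as a failure.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Hypothetical helper: send 'resp' to the kernel.
   Returns 0 if the response was delivered, 1 if the target went away
   or its system call was interrupted (ENOENT), and -1 on any other
   error, with errno left as set by ioctl(2). */
int send_response(int notify_fd, struct seccomp_notif_resp *resp)
{
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp) == 0)
        return 0;

    if (errno == ENOENT)
        return 1;       /* Nothing is waiting for this response */

    return -1;
}
```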
SECCOMP_IOCTL_NOTIF_ADDFD
The SECCOMP_IOCTL_NOTIF_ADDFD operation (available since Linux 5.9)
allows the supervisor to install a file descriptor into the
target's file descriptor table. Much like the use of SCM_RIGHTS
messages described in unix(7), this operation is semantically
equivalent to duplicating a file descriptor from the supervisor's
file descriptor table into the target's file descriptor table.
The SECCOMP_IOCTL_NOTIF_ADDFD operation permits the supervisor to
emulate a target system call (such as socket(2) or openat(2))
that generates a file descriptor. The supervisor can perform the
system call that generates the file descriptor (and associated
open file description) and then use this operation to allocate a
file descriptor that refers to the same open file description in
the target. (For an explanation of open file descriptions, see
open(2).)
Once this operation has been performed, the supervisor can close
its copy of the file descriptor.
In the target, the received file descriptor is subject to the
same Linux Security Module (LSM) checks as are applied to a file
descriptor that is received in an SCM_RIGHTS ancillary message.
If the file descriptor refers to a socket, it inherits the cgroup
version 1 network controller settings (classid and netprioidx) of
the target.
The third ioctl(2) argument is a pointer to a structure of the
following form:
struct seccomp_notif_addfd {
    __u64 id;           /* Cookie value */
    __u32 flags;        /* Flags */
    __u32 srcfd;        /* Local file descriptor number */
    __u32 newfd;        /* 0 or desired file descriptor
                           number in target */
    __u32 newfd_flags;  /* Flags to set on target file
                           descriptor */
};
The fields in this structure are as follows:
id This field should be set to the notification ID (cookie
value) that was obtained via SECCOMP_IOCTL_NOTIF_RECV.
flags This field is a bit mask of flags that modify the behavior
of the operation. The following flags are supported:
SECCOMP_ADDFD_FLAG_SETFD
When allocating the file descriptor in the target,
use the file descriptor number specified in the
newfd field.
SECCOMP_ADDFD_FLAG_SEND (since Linux 5.14)
Perform the equivalent of SECCOMP_IOCTL_NOTIF_ADDFD plus
SECCOMP_IOCTL_NOTIF_SEND as an atomic operation. On
successful invocation, the target
process's errno will be 0 and the return value will
be the file descriptor number that was allocated in
the target. If allocating the file descriptor in
the target fails, the target's system call
continues to be blocked until a successful response
is sent.
srcfd This field should be set to the number of the file
descriptor in the supervisor that is to be duplicated.
newfd This field determines which file descriptor number is
allocated in the target. If the SECCOMP_ADDFD_FLAG_SETFD flag
is set, then this field specifies which file
descriptor number should be allocated. If this file
descriptor number is already open in the target, it is
atomically closed and reused. If the descriptor
duplication fails due to an LSM check, or if srcfd is not
a valid file descriptor, the file descriptor newfd will
not be closed in the target process.
If the SECCOMP_ADDFD_FLAG_SETFD flag is not set, then this
field must be 0, and the kernel allocates the lowest
unused file descriptor number in the target.
newfd_flags
This field is a bit mask specifying flags that should be
set on the file descriptor that is received in the target
process. Currently, only the following flag is
implemented:
O_CLOEXEC
Set the close-on-exec flag on the received file
descriptor.
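The interaction between flags and newfd can be illustrated with a small constructor. This is a sketch only: the function name make_addfd() and its convention that a negative newfd means "let the kernel choose" are hypothetical, and the kernel headers are assumed to be from Linux 5.9 or later.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/seccomp.h>

/* Hypothetical helper: fill in a seccomp_notif_addfd structure.
   If 'newfd' is negative, the kernel picks the lowest unused descriptor
   number in the target (flags and newfd are both left as 0); otherwise
   SECCOMP_ADDFD_FLAG_SETFD is set and 'newfd' is requested explicitly. */
struct seccomp_notif_addfd make_addfd(__u64 id, int srcfd, int newfd,
                                      int cloexec)
{
    struct seccomp_notif_addfd addfd = {
        .id = id,
        .flags = 0,
        .srcfd = (__u32) srcfd,
        .newfd = 0,
        .newfd_flags = cloexec ? O_CLOEXEC : 0,
    };

    if (newfd >= 0) {
        addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
        addfd.newfd = (__u32) newfd;
    }

    return addfd;
}
```

Requesting descriptor 0 explicitly therefore differs from the default case only in the flags field, which is why newfd must be 0 whenever SECCOMP_ADDFD_FLAG_SETFD is not set.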
On success, this ioctl(2) call returns the number of the file
descriptor that was allocated in the target. Assuming that the
emulated system call is one that returns a file descriptor as its
function result (e.g., socket(2)), this value can be used as the
return value (resp.val) that is supplied in the response that is
subsequently sent with the SECCOMP_IOCTL_NOTIF_SEND operation.
On error, -1 is returned and errno is set to indicate the cause
of the error.
This operation can fail with the following errors:
EBADF
Allocating the file descriptor in the target would cause
the target's RLIMIT_NOFILE limit to be exceeded (see
getrlimit(2)).
EBUSY
If the SECCOMP_ADDFD_FLAG_SEND flag is used, this means
the operation can't proceed until other
SECCOMP_IOCTL_NOTIF_ADDFD requests are processed.
EINPROGRESS
The user-space notification specified in the id field
exists but has not yet been fetched (by a
SECCOMP_IOCTL_NOTIF_RECV) or has already been responded to
(by a SECCOMP_IOCTL_NOTIF_SEND).
EINVAL
An invalid flag was specified in the flags or newfd_flags
field, or the newfd field is nonzero and the
SECCOMP_ADDFD_FLAG_SETFD flag was not specified in the flags
field.
EMFILE
The file descriptor number specified in newfd exceeds the
limit specified in /proc/sys/fs/nr_open.
ENOENT
The blocked system call in the target has been interrupted
by a signal handler or the target has terminated.
Here is some sample code (with error handling omitted) that uses
the SECCOMP_IOCTL_NOTIF_ADDFD operation (here, to emulate a call
to openat(2)):
int fd, targetFd;
/* 'path' is assumed to have been read from the target's memory */
fd = openat(req->data.args[0], path, req->data.args[2],
            req->data.args[3]);
struct seccomp_notif_addfd addfd;
addfd.id = req->id;   /* Cookie from SECCOMP_IOCTL_NOTIF_RECV */
addfd.srcfd = fd;
addfd.newfd = 0;
addfd.flags = 0;
addfd.newfd_flags = O_CLOEXEC;
/* On success, returns the descriptor number allocated in the target */
targetFd = ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
close(fd);            /* No longer needed in supervisor */
struct seccomp_notif_resp *resp;
/* Code to allocate 'resp' omitted */
resp->id = req->id;
resp->error = 0;      /* "Success" */
resp->val = targetFd;
resp->flags = 0;
ioctl(notifyFd, SECCOMP_IOCTL_NOTIF_SEND, resp);
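On Linux 5.14 and later, the two ioctl(2) calls in the sample above could instead be combined by using the SECCOMP_ADDFD_FLAG_SEND flag. The following is a minimal sketch of that variant. The helper name addfd_and_send() is hypothetical, and the kernel headers are assumed to be recent enough to define the flag.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Hypothetical helper: install 'srcfd' into the target and, in the same
   atomic step, complete the target's system call so that it returns the
   newly allocated descriptor number. Returns that number on success,
   or -1 with errno set by ioctl(2). */
int addfd_and_send(int notify_fd, __u64 id, int srcfd)
{
    struct seccomp_notif_addfd addfd = {
        .id = id,
        .flags = SECCOMP_ADDFD_FLAG_SEND,
        .srcfd = (__u32) srcfd,
        .newfd = 0,                 /* Kernel chooses the number */
        .newfd_flags = O_CLOEXEC,
    };

    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
}
```

If the call fails, the target's system call remains blocked, so the supervisor still has to send a response (for example, a spoofed error) with SECCOMP_IOCTL_NOTIF_SEND.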