One example use case for the user-space notification mechanism is
to allow a container manager (a process which is typically
running with more privilege than the processes inside the
container) to mount block devices or create device nodes for the
container. The mount use case provides an example of where the
SECCOMP_USER_NOTIF_FLAG_CONTINUE flag is useful.
Upon receiving a notification for the mount(2) system call, the
container manager (the "supervisor") can identify a request to
mount a block filesystem (which would not be possible for a
"target" process inside the container) and perform that mount
itself. If, on the other hand, the container manager detects
that the operation could be performed by the process inside the
container (e.g., a mount of a tmpfs(5) filesystem), it can notify
the kernel that the target process's mount(2) system call can
continue.
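The decision described above can be sketched as follows. This is only an illustration, not the method of any particular container manager: the helper names (target_can_mount(), fill_mount_response()) and the tmpfs-only policy are hypothetical, and the sketch assumes that the filesystem type string has already been copied out of the target's memory and validated.

```c
#include <linux/seccomp.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical policy: filesystem types the target can mount by itself. */
static bool target_can_mount(const char *fstype)
{
    return strcmp(fstype, "tmpfs") == 0;
}

/*
 * Fill in the response for a mount(2) notification.
 *
 * 'fstype' is assumed to have been read from the target and validated.
 * 'supervisor_result' is the outcome of the supervisor's own mount
 * attempt (0, or a negated errno value); it is ignored when the
 * target's system call is allowed to continue.
 */
static void fill_mount_response(const struct seccomp_notif *req,
                                const char *fstype, int supervisor_result,
                                struct seccomp_notif_resp *resp)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = req->id;

    if (target_can_mount(fstype)) {
        /* 'error' and 'val' must remain 0 when this flag is set. */
        resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
    } else {
        /* The supervisor performed (or refused) the mount itself. */
        resp->error = supervisor_result;
    }
}
```

The filled-in structure would then be passed to the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.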
select()/poll()/epoll semantics
The file descriptor returned when seccomp(2) is employed with the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using
poll(2), epoll(7), and select(2). These interfaces indicate that
the file descriptor is ready as follows:
• When a notification is pending, these interfaces indicate that
the file descriptor is readable. Following such an indication,
a subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block,
returning either information about a notification or else
failing with the error EINTR if the target has been killed by a
signal or its system call has been interrupted by a signal
handler.
• After the notification has been received (i.e., by the
SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces
indicate that the file descriptor is writable, meaning that a
notification response can be sent using the
SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.
• After the last thread using the filter has terminated and been
reaped using waitpid(2) (or similar), the file descriptor
indicates an end-of-file condition (readable in select(2);
POLLHUP/EPOLLHUP in poll(2)/epoll_wait(2)).
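The three readiness conditions listed above could be distinguished with poll(2) roughly as follows. This is a minimal sketch; the enum and the function name wait_notify_event() are hypothetical and are not part of the seccomp API.

```c
#include <poll.h>

enum notify_state {
    NOTIFY_ERROR = -1,  /* poll(2) failed, or POLLERR/POLLNVAL */
    NOTIFY_TIMEOUT,     /* nothing happened within the timeout */
    NOTIFY_READABLE,    /* a notification is pending */
    NOTIFY_WRITABLE,    /* a received notification awaits a response */
    NOTIFY_HUP          /* no thread is using the filter any more */
};

/*
 * Wait up to 'timeout_ms' milliseconds for an event on the
 * notification file descriptor and classify it. If the descriptor
 * is both readable and writable, "readable" is reported.
 */
static enum notify_state wait_notify_event(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN | POLLOUT };
    int n = poll(&pfd, 1, timeout_ms);

    if (n == -1)
        return NOTIFY_ERROR;
    if (n == 0)
        return NOTIFY_TIMEOUT;
    if (pfd.revents & POLLHUP)
        return NOTIFY_HUP;
    if (pfd.revents & POLLIN)
        return NOTIFY_READABLE;
    if (pfd.revents & POLLOUT)
        return NOTIFY_WRITABLE;
    return NOTIFY_ERROR;
}
```

A supervisor loop might call this function and issue SECCOMP_IOCTL_NOTIF_RECV only when NOTIFY_READABLE is reported, and exit when NOTIFY_HUP is reported.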
Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE
The intent of the user-space notification feature is to allow
system calls to be performed on behalf of the target. The
target's system call should either be handled by the supervisor
or allowed to continue normally in the kernel (where standard
security policies will be applied).
Note well: this mechanism must not be used to make security
policy decisions about the system call, which would be inherently
race-prone for reasons described next.
The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with
caution. If set by the supervisor, the target's system call will
continue. However, there is a time-of-check, time-of-use race
here, since an attacker could exploit the interval of time where
the target is blocked waiting on the "continue" response to do
things such as rewriting the system call arguments.
Note furthermore that a user-space notifier can be bypassed if
the existing filters allow the use of seccomp(2) or prctl(2) to
install a filter that returns an action value with a higher
precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).
It should thus be absolutely clear that the seccomp user-space
notification mechanism cannot be used to implement a security
policy! It should only ever be used in scenarios where a more
privileged process supervises the system calls of a less
privileged target to get around kernel-enforced security
restrictions when the supervisor deems this safe. In other
words, in order to continue a system call, the supervisor should
be sure that another security mechanism or the kernel itself will
sufficiently block the system call if its arguments are rewritten
to something unsafe.
Caveats regarding the use of /proc/[tid]/mem
The discussion above noted the need to use the
SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the
/proc/[tid]/mem file of the target to avoid the possibility of
accessing the memory of the wrong process in the event that the
target terminates and its ID is recycled by another (unrelated)
thread. However, the use of this ioctl(2) operation is also
necessary in other situations, as explained in the following
paragraphs.
Consider the following scenario, where the supervisor tries to
read the pathname argument of a target's blocked mount(2) system
call:
• From one of its functions (func()), the target calls mount(2),
which triggers a user-space notification and causes the target
to block.
• The supervisor receives the notification, opens
/proc/[tid]/mem, and (successfully) performs the
SECCOMP_IOCTL_NOTIF_ID_VALID check.
• The target receives a signal, which causes the mount(2) to
abort.
• The signal handler executes in the target, and returns.
• Upon return from the handler, the execution of func() resumes,
and it returns (and perhaps other functions are called,
overwriting the memory that had been used for the stack frame
of func()).
• Using the address provided in the notification information, the
supervisor reads from the target's memory location that used to
contain the pathname.
• The supervisor now calls mount(2) with some arbitrary bytes
obtained in the previous step.
The conclusion from the above scenario is this: since the
target's blocked system call may be interrupted by a signal
handler, the supervisor must be written to expect that the target
may abandon its system call at any time; in such an event, any
information that the supervisor obtained from the target's memory
must be considered invalid.
To prevent such scenarios, every read from the target's memory
must be separated from use of the bytes so obtained by a
SECCOMP_IOCTL_NOTIF_ID_VALID check. In the above example, the
check would be placed between the two final steps. An example of
such a check is shown in EXAMPLES.
Following on from the above, it should be clear that a write by
the supervisor into the target's memory can never be considered
safe.
Caveats regarding blocking system calls
Suppose that the target performs a blocking system call (e.g.,
accept(2)) that the supervisor should handle. The supervisor
might then in turn execute the same blocking system call.
In this scenario, it is important to note that if the target's
system call is now interrupted by a signal, the supervisor is not
informed of this. If the supervisor does not take suitable steps
to actively discover that the target's system call has been
canceled, various difficulties can occur. Taking the example of
accept(2), the supervisor might remain blocked in its accept(2)
holding a port number that the target (which, after the
interruption by the signal handler, perhaps closed its listening
socket) might expect to be able to reuse in a bind(2) call.
Therefore, when the supervisor wishes to emulate a blocking
system call, it must do so in such a way that it gets informed if
the target's system call is interrupted by a signal handler. For
example, if the supervisor itself executes the same blocking
system call, then it could employ a separate thread that uses the
SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is
still blocked in its system call. Alternatively, in the
accept(2) example, the supervisor might use poll(2) to monitor
both the notification file descriptor (so as to discover when the
target's accept(2) call has been interrupted) and the listening
file descriptor (so as to know when a connection is available).
If the target's system call is interrupted, the supervisor must
take care to release resources (e.g., file descriptors) that it
acquired on behalf of the target.
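One way of combining the two techniques mentioned above is sketched below: the supervisor polls both file descriptors and rechecks the notification ID after every wakeup. This is a sketch under several assumptions: the names wait_ready_or_canceled() and enum wait_result are hypothetical, 'listen_fd' is assumed to be a descriptor held by the supervisor that refers to the target's listening socket, and the 100-millisecond timeout is an arbitrary choice that bounds how long a cancellation can go unnoticed.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <poll.h>
#include <stdint.h>
#include <sys/ioctl.h>

enum wait_result {
    WAIT_ERROR = -1,
    WAIT_READY,         /* a connection is available on listen_fd */
    WAIT_CANCELED       /* the target is no longer blocked in its call */
};

static enum wait_result wait_ready_or_canceled(int notify_fd, uint64_t id,
                                               int listen_fd)
{
    for (;;) {
        struct pollfd fds[2] = {
            { .fd = listen_fd, .events = POLLIN },
            { .fd = notify_fd, .events = POLLIN },
        };
        __u64 kid = id;

        if (poll(fds, 2, 100) == -1 && errno != EINTR)
            return WAIT_ERROR;
        if (fds[0].revents & (POLLERR | POLLHUP | POLLNVAL))
            return WAIT_ERROR;

        /* Check whether the target has abandoned its system call. */
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &kid) == -1)
            return (errno == ENOENT) ? WAIT_CANCELED : WAIT_ERROR;

        if (fds[0].revents & POLLIN)
            return WAIT_READY;
    }
}
```

On WAIT_READY, the supervisor could perform a nonblocking accept(2); on WAIT_CANCELED, it would release anything it had acquired on behalf of the target.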
Interaction with SA_RESTART signal handlers
Consider the following scenario:
• The target process has used sigaction(2) to install a signal
handler with the SA_RESTART flag.
• The target has made a system call that triggered a seccomp
user-space notification and the target is currently blocked
until the supervisor sends a notification response.
• A signal is delivered to the target and the signal handler is
executed.
• When (if) the supervisor attempts to send a notification
response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will
fail with the ENOENT error.
In this scenario, the kernel will restart the target's system
call. Consequently, the supervisor will receive another user-
space notification. Thus, depending on how many times the
blocked system call is interrupted by a signal handler, the
supervisor may receive multiple notifications for the same
instance of a system call in the target.
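A supervisor therefore needs to treat ENOENT from SECCOMP_IOCTL_NOTIF_SEND as an expected outcome rather than a fatal error. A minimal sketch, in which the function name send_response() and its return convention are hypothetical:

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/*
 * Send a notification response.
 *
 * Returns 1 if the response was delivered, 0 if the target was no
 * longer waiting for it (ENOENT), and -1 on any other error.
 */
static int send_response(int notify_fd, struct seccomp_notif_resp *resp)
{
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp) == 0)
        return 1;

    /*
     * The notification is stale: discard the response. If the
     * system call is restarted, it will show up as a further
     * notification in the supervisor's receive loop.
     */
    if (errno == ENOENT)
        return 0;

    return -1;
}
```

When 0 is returned, the supervisor would also undo any work that it had performed on behalf of the abandoned system call before handling the next notification.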
One oddity is that system call restarting as described in this
scenario will occur even for the blocking system calls listed in
signal(7) that would never normally be restarted by the
SA_RESTART flag.
Furthermore, if the supervisor response is a file descriptor
added with SECCOMP_IOCTL_NOTIF_ADDFD, then the flag
SECCOMP_ADDFD_FLAG_SEND can be used to atomically add the file
descriptor and return that value, making sure no file descriptors
are inadvertently leaked into the target.
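The combined "add and respond" step might look as follows. This is a sketch: the helper names fill_addfd_send() and add_fd_and_respond() are hypothetical, and the fallback definition of SECCOMP_ADDFD_FLAG_SEND is included only on the assumption that the installed kernel headers may predate the flag.

```c
#include <linux/seccomp.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#ifndef SECCOMP_ADDFD_FLAG_SEND
#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1)  /* For older kernel headers */
#endif

/* Prepare a request that adds 'src_fd' and responds in one step. */
static void fill_addfd_send(struct seccomp_notif_addfd *addfd,
                            uint64_t id, int src_fd)
{
    memset(addfd, 0, sizeof(*addfd));
    addfd->id = id;
    addfd->srcfd = src_fd;
    addfd->flags = SECCOMP_ADDFD_FLAG_SEND;
    /* 'newfd' and 'newfd_flags' stay 0: the kernel picks the number. */
}

/*
 * Install 'src_fd' in the target and complete the target's system
 * call with the new descriptor number as its return value. Returns
 * the descriptor number allocated in the target, or -1 on error.
 */
static int add_fd_and_respond(int notify_fd, uint64_t id, int src_fd)
{
    struct seccomp_notif_addfd addfd;

    fill_addfd_send(&addfd, id, src_fd);
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
}
```

Because the response is sent by the same ioctl(2) call, no separate SECCOMP_IOCTL_NOTIF_SEND operation follows; if the call fails, no descriptor has been added to the target.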