`seccomp_unotify` ( 2 )

Seccomp user-space notification mechanism

Note

One example use case for the user-space notification mechanism is to allow a container manager (a process which is typically running with more privilege than the processes inside the container) to mount block devices or create device nodes for the container. The mount use case provides an example of where the SECCOMP_USER_NOTIF_FLAG_CONTINUE flag (set in the response sent with the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation) is useful. Upon receiving a notification for the mount(2) system call, the container manager (the "supervisor") can distinguish a request to mount a block filesystem (which would not be possible for a "target" process inside the container) and mount that filesystem. If, on the other hand, the container manager detects that the operation could be performed by the process inside the container (e.g., a mount of a tmpfs(5) filesystem), it can notify the kernel that the target process's mount(2) system call can continue.
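The decision described above can be sketched as follows. This is only an illustration: the helper names target_can_mount() and build_mount_response() are hypothetical, the filesystem type string is assumed to have already been obtained by the supervisor, and supervisor_result stands for the outcome (0 or a negative error number) of a mount that the supervisor performed itself.

```c
#include <assert.h>
#include <linux/seccomp.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical policy helper: filesystem types that the target could
   mount by itself, so the kernel may simply carry out the call. */
static bool target_can_mount(const char *fstype)
{
    return strcmp(fstype, "tmpfs") == 0;
}

/* Fill in the response for one mount(2) notification.
   'supervisor_result' is ignored when the call is continued. */
static void build_mount_response(const struct seccomp_notif *req,
                                 const char *fstype, int supervisor_result,
                                 struct seccomp_notif_resp *resp)
{
    assert(req != NULL && fstype != NULL && resp != NULL);

    memset(resp, 0, sizeof(*resp));
    resp->id = req->id;          /* echo the notification cookie */

    if (target_can_mount(fstype)) {
        /* Let the kernel execute the target's own mount(2);
           'error' and 'val' must remain zero with this flag. */
        resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
    } else {
        /* The supervisor did the mount; report its outcome. */
        resp->error = supervisor_result;
    }
}
```

The filled-in structure would then be passed to the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.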

select()/poll()/epoll semantics

The file descriptor returned when seccomp(2) is employed with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag can be monitored using poll(2), epoll(7), and select(2). These interfaces indicate that the file descriptor is ready as follows:

• When a notification is pending, these interfaces indicate that the file descriptor is readable. Following such an indication, a subsequent SECCOMP_IOCTL_NOTIF_RECV ioctl(2) will not block, returning either information about a notification or else failing with the error EINTR if the target has been killed by a signal or its system call has been interrupted by a signal handler.

• After the notification has been received (i.e., by the SECCOMP_IOCTL_NOTIF_RECV ioctl(2) operation), these interfaces indicate that the file descriptor is writable, meaning that a notification response can be sent using the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation.

• After the last thread using the filter has terminated and been reaped using waitpid(2) (or similar), the file descriptor indicates an end-of-file condition (readable in select(2); POLLHUP/EPOLLHUP in poll(2)/epoll_wait(2)).
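The three readiness conditions listed above might be mapped onto poll(2) results roughly as follows. This is a sketch under stated assumptions: the enum and the function wait_on_listener() are hypothetical names, and the order in which the conditions are tested (hang-up first) is a design choice of the sketch rather than something the interface prescribes.

```c
#include <poll.h>

enum listener_state {
    LISTENER_TIMEOUT,        /* nothing happened within the timeout */
    LISTENER_NOTIF_PENDING,  /* readable: a notification can be received */
    LISTENER_CAN_RESPOND,    /* writable: a response can be sent */
    LISTENER_HANGUP,         /* no thread is using the filter any more */
    LISTENER_ERROR
};

/* Wait up to 'timeout_ms' milliseconds for an event on the
   notification file descriptor and classify what poll(2) reported. */
static enum listener_state wait_on_listener(int notify_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = notify_fd, .events = POLLIN | POLLOUT };

    int n = poll(&pfd, 1, timeout_ms);
    if (n == -1)
        return LISTENER_ERROR;
    if (n == 0)
        return LISTENER_TIMEOUT;

    if (pfd.revents & (POLLERR | POLLNVAL))
        return LISTENER_ERROR;
    if (pfd.revents & POLLHUP)
        return LISTENER_HANGUP;
    if (pfd.revents & POLLIN)
        return LISTENER_NOTIF_PENDING;
    if (pfd.revents & POLLOUT)
        return LISTENER_CAN_RESPOND;
    return LISTENER_ERROR;
}
```

A supervisor loop would typically call SECCOMP_IOCTL_NOTIF_RECV after LISTENER_NOTIF_PENDING and stop after LISTENER_HANGUP.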

Design goals; use of SECCOMP_USER_NOTIF_FLAG_CONTINUE

The intent of the user-space notification feature is to allow system calls to be performed on behalf of the target. The target's system call should either be handled by the supervisor or allowed to continue normally in the kernel (where standard security policies will be applied).

Note well: this mechanism must not be used to make security policy decisions about the system call, which would be inherently race-prone for reasons described next.

The SECCOMP_USER_NOTIF_FLAG_CONTINUE flag must be used with caution. If set by the supervisor, the target's system call will continue. However, there is a time-of-check, time-of-use race here, since an attacker could exploit the interval of time where the target is blocked waiting on the "continue" response to do things such as rewriting the system call arguments.

Note furthermore that a user-space notifier can be bypassed if the existing filters allow the use of seccomp(2) or prctl(2) to install a filter that returns an action value with a higher precedence than SECCOMP_RET_USER_NOTIF (see seccomp(2)).

It should thus be absolutely clear that the seccomp user-space notification mechanism cannot be used to implement a security policy! It should only ever be used in scenarios where a more privileged process supervises the system calls of a lesser privileged target to get around kernel-enforced security restrictions when the supervisor deems this safe. In other words, in order to continue a system call, the supervisor should be sure that another security mechanism or the kernel itself will sufficiently block the system call if its arguments are rewritten to something unsafe.

Caveats regarding the use of /proc/[tid]/mem

The discussion above noted the need to use the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl(2) when opening the /proc/[tid]/mem file of the target to avoid the possibility of accessing the memory of the wrong process in the event that the target terminates and its ID is recycled by another (unrelated) thread. However, the use of this ioctl(2) operation is also necessary in other situations, as explained in the following paragraphs.

Consider the following scenario, where the supervisor tries to read the pathname argument of a target's blocked mount(2) system call:

• From one of its functions (func()), the target calls mount(2), which triggers a user-space notification and causes the target to block.

• The supervisor receives the notification, opens /proc/[tid]/mem, and (successfully) performs the SECCOMP_IOCTL_NOTIF_ID_VALID check.

• The target receives a signal, which causes the mount(2) to abort.

• The signal handler executes in the target, and returns.

• Upon return from the handler, the execution of func() resumes, and it returns (and perhaps other functions are called, overwriting the memory that had been used for the stack frame of func()).

• Using the address provided in the notification information, the supervisor reads from the target's memory location that used to contain the pathname.

• The supervisor now calls mount(2) with some arbitrary bytes obtained in the previous step.

The conclusion from the above scenario is this: since the target's blocked system call may be interrupted by a signal handler, the supervisor must be written to expect that the target may abandon its system call at any time; in such an event, any information that the supervisor obtained from the target's memory must be considered invalid.

To prevent such scenarios, every read from the target's memory must be separated from use of the bytes so obtained by a SECCOMP_IOCTL_NOTIF_ID_VALID check. In the above example, the check would be placed between the two final steps. An example of such a check is shown in EXAMPLES.
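The ordering "read, then check, then use" can be sketched as below. The function read_target_path() is a hypothetical helper, and the sketch assumes that the caller passes the address of the pathname argument taken from the notification. Only the check that follows the read is shown; the check after opening /proc/[tid]/mem discussed above can be added in the same way.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy up to len-1 bytes from address 'addr' in the target's memory.
   Returns 0 only if the notification was still valid after the read;
   on -1 the contents of 'buf' must not be used. */
static int read_target_path(int notify_fd, const struct seccomp_notif *req,
                            uint64_t addr, char *buf, size_t len)
{
    char mem_path[64];
    assert(req != NULL && buf != NULL && len > 1);

    snprintf(mem_path, sizeof(mem_path), "/proc/%d/mem", (int) req->pid);

    int mem_fd = open(mem_path, O_RDONLY | O_CLOEXEC);
    if (mem_fd == -1)
        return -1;

    ssize_t n = pread(mem_fd, buf, len - 1, (off_t) addr);
    close(mem_fd);
    if (n <= 0)
        return -1;
    buf[n] = '\0';

    /* The check sits between the read and any use of the bytes:
       if the target has abandoned its system call, fail here. */
    __u64 id = req->id;
    if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1)
        return -1;

    return 0;
}
```

The sketch fails closed: any error, including a failed validity check, causes the bytes to be rejected.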

Following on from the above, it should be clear that a write by the supervisor into the target's memory can never be considered safe.

Caveats regarding blocking system calls

Suppose that the target performs a blocking system call (e.g., accept(2)) that the supervisor should handle. The supervisor might then in turn execute the same blocking system call.

In this scenario, it is important to note that if the target's system call is now interrupted by a signal, the supervisor is not informed of this. If the supervisor does not take suitable steps to actively discover that the target's system call has been canceled, various difficulties can occur. Taking the example of accept(2), the supervisor might remain blocked in its accept(2) holding a port number that the target (which, after the interruption by the signal handler, perhaps closed its listening socket) might expect to be able to reuse in a bind(2) call.

Therefore, when the supervisor wishes to emulate a blocking system call, it must do so in such a way that it gets informed if the target's system call is interrupted by a signal handler. For example, if the supervisor itself executes the same blocking system call, then it could employ a separate thread that uses the SECCOMP_IOCTL_NOTIF_ID_VALID operation to check if the target is still blocked in its system call. Alternatively, in the accept(2) example, the supervisor might use poll(2) to monitor both the notification file descriptor (so as to discover when the target's accept(2) call has been interrupted) and the listening file descriptor (so as to know when a connection is available).

If the target's system call is interrupted, the supervisor must take care to release resources (e.g., file descriptors) that it acquired on behalf of the target.
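For the accept(2) case, the poll(2)-based approach described above might look like the following sketch. The names are hypothetical, and two assumptions are made: the supervisor is handling a single outstanding notification (other pending notifications would make the loop spin), and a periodic timeout is used so that the SECCOMP_IOCTL_NOTIF_ID_VALID check is repeated even when poll(2) reports nothing.

```c
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <poll.h>
#include <stdint.h>
#include <sys/ioctl.h>

enum wait_result { WAIT_CONN_READY, WAIT_TARGET_GONE, WAIT_ERROR };

/* Wait until a connection is available on 'listen_fd', or until the
   target is no longer blocked in the system call identified by
   'notif_id'. The caller releases resources on WAIT_TARGET_GONE. */
static enum wait_result wait_for_conn_or_cancel(int notify_fd,
                                                uint64_t notif_id,
                                                int listen_fd,
                                                int poll_interval_ms)
{
    /* A negative timeout would block forever and could miss an
       interruption that produces no new notification. */
    assert(poll_interval_ms >= 0);

    for (;;) {
        struct pollfd pfds[2] = {
            { .fd = listen_fd, .events = POLLIN },
            { .fd = notify_fd, .events = POLLIN },
        };

        if (poll(pfds, 2, poll_interval_ms) == -1) {
            if (errno == EINTR)
                continue;
            return WAIT_ERROR;
        }

        /* On every wakeup, first ask whether the target is still
           blocked in the system call being emulated. */
        __u64 id = notif_id;
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) == -1)
            return errno == ENOENT ? WAIT_TARGET_GONE : WAIT_ERROR;

        if (pfds[0].revents & POLLIN)
            return WAIT_CONN_READY;
        if (pfds[0].revents & (POLLERR | POLLHUP | POLLNVAL))
            return WAIT_ERROR;
    }
}
```

Only after WAIT_CONN_READY would the supervisor call accept(2), which should then not block.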

Interaction with SA_RESTART signal handlers

Consider the following scenario:

• The target process has used sigaction(2) to install a signal handler with the SA_RESTART flag.

• The target has made a system call that triggered a seccomp user-space notification and the target is currently blocked until the supervisor sends a notification response.

• A signal is delivered to the target and the signal handler is executed.

• When (if) the supervisor attempts to send a notification response, the SECCOMP_IOCTL_NOTIF_SEND ioctl(2) operation will fail with the ENOENT error.

In this scenario, the kernel will restart the target's system call. Consequently, the supervisor will receive another user-space notification. Thus, depending on how many times the blocked system call is interrupted by a signal handler, the supervisor may receive multiple notifications for the same instance of a system call in the target.

One oddity is that system call restarting as described in this scenario will occur even for the blocking system calls listed in signal(7) that would never normally be restarted by the SA_RESTART flag.
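A supervisor therefore needs to treat ENOENT from SECCOMP_IOCTL_NOTIF_SEND as an expected outcome rather than a fatal error. A minimal sketch, in which the enum and both function names are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/ioctl.h>

enum send_outcome {
    SEND_OK,                  /* response delivered to the target */
    SEND_TARGET_INTERRUPTED,  /* ENOENT: notification no longer exists */
    SEND_FAILED               /* any other error */
};

/* Classify the result of SECCOMP_IOCTL_NOTIF_SEND. */
static enum send_outcome classify_send_errno(int ret, int err)
{
    if (ret == 0)
        return SEND_OK;
    if (err == ENOENT)
        return SEND_TARGET_INTERRUPTED;
    return SEND_FAILED;
}

/* Send a response. On SEND_TARGET_INTERRUPTED the caller should
   release anything acquired for this notification and go back to
   receiving, since the restarted call may produce a new notification. */
static enum send_outcome send_response(int notify_fd,
                                       struct seccomp_notif_resp *resp)
{
    assert(resp != NULL);

    int ret = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
    return classify_send_errno(ret, errno);
}
```

Because the same system call instance may be notified more than once, any work already done for the earlier notification should be undone or made idempotent before the new one is handled.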

Furthermore, if the supervisor response is a file descriptor added with SECCOMP_IOCTL_NOTIF_ADDFD, then the flag SECCOMP_ADDFD_FLAG_SEND can be used to atomically add the file descriptor and return that value, making sure no file descriptors are inadvertently leaked into the target.
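The atomic "add and respond" step might be prepared as in the sketch below. The function names are hypothetical. The fallback definition of SECCOMP_ADDFD_FLAG_SEND is an assumption made for building against older kernel headers; with headers that lack SECCOMP_IOCTL_NOTIF_ADDFD itself the sketch will not compile.

```c
#include <assert.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#ifndef SECCOMP_ADDFD_FLAG_SEND
#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1)  /* assumed: as in newer headers */
#endif

/* Describe a request that installs 'src_fd' in the target and, in the
   same operation, completes the notification 'id'. */
static void prepare_addfd_send(uint64_t id, int src_fd,
                               struct seccomp_notif_addfd *addfd)
{
    assert(addfd != NULL && src_fd >= 0);

    memset(addfd, 0, sizeof(*addfd));
    addfd->id = id;
    addfd->srcfd = (__u32) src_fd;
    addfd->flags = SECCOMP_ADDFD_FLAG_SEND;
    /* newfd and newfd_flags stay zero: the kernel chooses the
       descriptor number in the target. */
}

/* Returns the descriptor number allocated in the target, or -1 on
   error (for example, ENOENT if the target was interrupted). */
static int addfd_and_respond(int notify_fd, uint64_t id, int src_fd)
{
    struct seccomp_notif_addfd addfd;

    prepare_addfd_send(id, src_fd, &addfd);
    return ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
}
```

With this flag, no separate SECCOMP_IOCTL_NOTIF_SEND is issued for the notification, so there is no window in which the descriptor has been added but the response fails.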

Original text at man7.org
