userfaultfd() creates a new userfaultfd object that can be used
for delegation of page-fault handling to a user-space
application, and returns a file descriptor that refers to the new
object. The new userfaultfd object is configured using ioctl(2).
Once the userfaultfd object is configured, the application can
use read(2) to receive userfaultfd notifications. The reads from
userfaultfd may be blocking or non-blocking, depending on the
value of flags used for the creation of the userfaultfd or
subsequent calls to fcntl(2).
The following values may be bitwise ORed in flags to change the
behavior of userfaultfd():
O_CLOEXEC
Enable the close-on-exec flag for the new userfaultfd file
descriptor. See the description of the O_CLOEXEC flag in
open(2).
O_NONBLOCK
Enable non-blocking operation for the userfaultfd object.
See the description of the O_NONBLOCK flag in open(2).
When the last file descriptor referring to a userfaultfd object
is closed, all memory ranges that were registered with the object
are unregistered and unread events are flushed.
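As an illustration of the creation step, here is a minimal sketch that assumes the C library offers no userfaultfd() wrapper, so the call is made through syscall(2). The helper name open_uffd is hypothetical and not part of any API.

```c
#define _GNU_SOURCE
#include <errno.h>        /* errno, for callers checking failures */
#include <fcntl.h>        /* O_CLOEXEC, O_NONBLOCK */
#include <sys/syscall.h>  /* SYS_userfaultfd */
#include <unistd.h>       /* syscall(), close() */

/* Hypothetical helper: returns a new userfaultfd file descriptor,
   or -1 with errno set if the kernel refuses the request. */
int open_uffd(int flags)
{
    return (int)syscall(SYS_userfaultfd, flags);
}
```

A caller would typically pass O_CLOEXEC | O_NONBLOCK and release the descriptor with close(2) when it is no longer needed.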
Userfaultfd supports two modes of registration:
UFFDIO_REGISTER_MODE_MISSING (since 4.10)
When registered with UFFDIO_REGISTER_MODE_MISSING mode,
user-space will receive a page-fault notification when a
missing page is accessed. The faulted thread will be
stopped from execution until the page fault is resolved
from user-space by either an UFFDIO_COPY or an
UFFDIO_ZEROPAGE ioctl.
UFFDIO_REGISTER_MODE_WP (since 5.7)
When registered with UFFDIO_REGISTER_MODE_WP mode,
user-space will receive a page-fault notification when a
write-protected page is written. The faulted thread will be
stopped from execution until user-space write-unprotects
the page using an UFFDIO_WRITEPROTECT ioctl.
Multiple modes can be enabled at the same time for the same
memory range.
Since Linux 4.14, a userfaultfd page-fault notification can
optionally embed the faulting thread ID. This feature must be
enabled explicitly using the UFFD_FEATURE_THREAD_ID feature bit
when initializing the userfaultfd context. By default, thread ID
reporting is disabled.
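The following sketch shows one way a monitor might read the thread ID from a notification. It assumes the caller keeps the feature mask it negotiated through UFFDIO_API; the function name fault_thread_id and the enabled_features parameter are hypothetical.

```c
#include <linux/userfaultfd.h>

/* Returns the faulting thread ID carried by a page-fault message, or 0
   when the message is not a page fault or UFFD_FEATURE_THREAD_ID was
   not enabled. enabled_features is assumed to be the feature mask the
   caller negotiated during the UFFDIO_API handshake. */
unsigned int fault_thread_id(const struct uffd_msg *msg,
                             unsigned long long enabled_features)
{
    if (msg->event != UFFD_EVENT_PAGEFAULT)
        return 0;
    if (!(enabled_features & UFFD_FEATURE_THREAD_ID))
        return 0;
    return msg->arg.pagefault.feat.ptid;
}
```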
Usage
The userfaultfd mechanism is designed to allow a thread in a
multithreaded program to perform user-space paging for the other
threads in the process. When a page fault occurs for one of the
regions registered to the userfaultfd object, the faulting thread
is put to sleep and an event is generated that can be read via
the userfaultfd file descriptor. The fault-handling thread reads
events from this file descriptor and services them using the
operations described in ioctl_userfaultfd(2). When servicing the
page fault events, the fault-handling thread can trigger a wake-
up for the sleeping thread.
It is possible for the faulting threads and the fault-handling
threads to run in the context of different processes. In this
case, these threads may belong to different programs, and the
program that executes the faulting threads will not necessarily
cooperate with the program that handles the page faults. In such
non-cooperative mode, the process that monitors userfaultfd and
handles page faults needs to be aware of the changes in the
virtual memory layout of the faulting process to avoid memory
corruption.
Since Linux 4.11, userfaultfd can also notify the fault-handling
threads about changes in the virtual memory layout of the
faulting process. In addition, if the faulting process invokes
fork(2), the userfaultfd objects associated with the parent may
be duplicated into the child process and the userfaultfd monitor
will be notified (via the UFFD_EVENT_FORK event described below)
about the file descriptor associated with the userfaultfd objects created
for the child process, which allows the userfaultfd monitor to
perform user-space paging for the child process. Unlike page
faults which have to be synchronous and require an explicit or
implicit wakeup, all other events are delivered asynchronously
and the non-cooperative process resumes execution as soon as the
userfaultfd manager executes read(2). The userfaultfd manager
should carefully synchronize calls to UFFDIO_COPY with the
processing of events.
The current asynchronous model of the event delivery is optimal
for single-threaded non-cooperative userfaultfd manager
implementations.
Since Linux 5.7, userfaultfd is able to do synchronous page dirty
tracking using the new write-protect register mode. One should
check against the feature bit UFFD_FEATURE_PAGEFAULT_FLAG_WP
before using this feature. Similar to the original userfaultfd
missing mode, the write-protect mode will generate a userfaultfd
notification when the protected page is written. The user needs
to resolve the page fault by unprotecting the faulted page and
waking the faulted thread so that it can continue. For more
information, refer to the "Userfaultfd write-protect mode" section.
Userfaultfd operation
After the userfaultfd object is created with userfaultfd(), the
application must enable it using the UFFDIO_API ioctl(2)
operation. This operation allows a handshake between the kernel
and user space to determine the API version and supported
features. This operation must be performed before any of the
other ioctl(2) operations described below (or those operations
fail with the EINVAL error).
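The handshake can be sketched as follows. This is a minimal illustration, not a definitive implementation; the helper name uffd_handshake and its parameters are hypothetical, while struct uffdio_api, UFFD_API, and UFFDIO_API come from <linux/userfaultfd.h>.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: performs the UFFDIO_API handshake, requesting
   the feature bits in wanted. On success returns 0 and, if granted is
   not NULL, stores the feature mask reported back by the kernel.
   Returns -1 with errno set on failure. */
int uffd_handshake(int uffd, unsigned long long wanted,
                   unsigned long long *granted)
{
    struct uffdio_api api;

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;
    api.features = wanted;
    if (ioctl(uffd, UFFDIO_API, &api) == -1)
        return -1;
    if (granted != NULL)
        *granted = api.features;
    return 0;
}
```

A monitor that wants thread IDs in its notifications would pass UFFD_FEATURE_THREAD_ID as wanted; passing 0 requests no optional features.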
After a successful UFFDIO_API operation, the application then
registers memory address ranges using the UFFDIO_REGISTER
ioctl(2) operation. After successful completion of a
UFFDIO_REGISTER operation, a page fault occurring in the
requested memory range, and satisfying the mode defined at the
registration time, will be forwarded by the kernel to the
user-space application. The application can then use the
UFFDIO_COPY or UFFDIO_ZEROPAGE ioctl(2) operations to resolve the
page fault.
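Registration and resolution might look like the sketch below. It assumes a page-aligned range and a page size that is a power of two; the helper names page_base, uffd_register_missing, and uffd_resolve_copy are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Round an address down to the start of its page.
   page_size is assumed to be a power of two. */
unsigned long long page_base(unsigned long long addr,
                             unsigned long long page_size)
{
    return addr & ~(page_size - 1);
}

/* Register [addr, addr + len) for missing-page notifications.
   Returns 0 on success, -1 with errno set on failure. */
int uffd_register_missing(int uffd, void *addr, size_t len)
{
    struct uffdio_register reg;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (uintptr_t)addr;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    return ioctl(uffd, UFFDIO_REGISTER, &reg);
}

/* Resolve a fault by copying one page from src_page into the page that
   contains fault_addr. With mode 0 the faulting thread is woken once
   the copy completes. Returns 0 on success, -1 on failure. */
int uffd_resolve_copy(int uffd, unsigned long long fault_addr,
                      const void *src_page, size_t page_size)
{
    struct uffdio_copy copy;

    memset(&copy, 0, sizeof(copy));
    copy.dst = page_base(fault_addr, page_size);
    copy.src = (uintptr_t)src_page;
    copy.len = page_size;
    copy.mode = 0;
    return ioctl(uffd, UFFDIO_COPY, &copy);
}
```

The fault address passed to uffd_resolve_copy would normally come from the pagefault.address field of a notification read from the descriptor.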
Since Linux 4.14, if the application sets the UFFD_FEATURE_SIGBUS
feature bit using the UFFDIO_API ioctl(2), no page-fault
notification will be forwarded to user space. Instead, a SIGBUS
signal is delivered to the faulting process. With this feature,
userfaultfd can be used for robustness purposes to simply catch
any access to areas within the registered address range that do
not have pages allocated, without having to listen to userfaultfd
events. No userfaultfd monitor will be required for dealing with
such memory accesses. For example, this feature can be useful
for applications that want to prevent the kernel from
automatically allocating pages and filling holes in sparse files
when the hole is accessed through a memory mapping.
The UFFD_FEATURE_SIGBUS feature is implicitly inherited through
fork(2) if used in combination with UFFD_FEATURE_FORK.
Details of the various ioctl(2) operations can be found in
ioctl_userfaultfd(2).
Since Linux 4.11, events other than page faults may be enabled
during the UFFDIO_API operation.
Before Linux 4.11, userfaultfd could be used only with anonymous
private memory mappings. Since Linux 4.11, userfaultfd can
also be used with hugetlbfs and shared memory mappings.
Userfaultfd write-protect mode (since 5.7)
Since Linux 5.7, userfaultfd supports write-protect mode. Before
using this feature, the user must first check that it is
available by testing the UFFD_FEATURE_PAGEFAULT_FLAG_WP feature
bit with the UFFDIO_API ioctl.
To register with userfaultfd write-protect mode, the user needs
to initiate the UFFDIO_REGISTER ioctl with mode
UFFDIO_REGISTER_MODE_WP set. Note that it is legal to monitor
the same memory range with multiple modes. For example, the user
can do UFFDIO_REGISTER with the mode set to
UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP. When
only UFFDIO_REGISTER_MODE_WP is registered, user-space will
not receive any notification when a missing page is written.
Instead, user-space will receive a write-protect page-fault
notification only when an existing but write-protected page is
written.
After the UFFDIO_REGISTER ioctl has completed with
UFFDIO_REGISTER_MODE_WP mode set, the user can write-protect any
existing memory within the range using the UFFDIO_WRITEPROTECT
ioctl, where uffdio_writeprotect.mode should be set to
UFFDIO_WRITEPROTECT_MODE_WP.
When a write-protect event happens, user-space will receive a
page-fault notification whose uffd_msg.pagefault.flags will have
the UFFD_PAGEFAULT_FLAG_WP flag set. Note: since only writes
can trigger this kind of fault, write-protect notifications will
always have the UFFD_PAGEFAULT_FLAG_WRITE bit set along with the
UFFD_PAGEFAULT_FLAG_WP bit.
To resolve a write-protection page fault, the user should
initiate another UFFDIO_WRITEPROTECT ioctl, whose
uffdio_writeprotect.mode should have the flag
UFFDIO_WRITEPROTECT_MODE_WP cleared for the faulted page or
range.
Write-protect mode supports only private anonymous memory.
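Both directions of the write-protect operation can be sketched with a single helper. This assumes kernel headers from Linux 5.7 or later and a page-aligned range; the name uffd_set_wp and its protect parameter are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical helper: write-protects (protect != 0) or
   write-unprotects (protect == 0) the range [addr, addr + len).
   Clearing the protection with mode 0 is also how a write-protect
   fault is resolved. Returns 0 on success, -1 with errno set. */
int uffd_set_wp(int uffd, void *addr, size_t len, int protect)
{
    struct uffdio_writeprotect wp;

    memset(&wp, 0, sizeof(wp));
    wp.range.start = (uintptr_t)addr;
    wp.range.len = len;
    wp.mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}
```

A dirty-tracking monitor would call uffd_set_wp(uffd, addr, len, 1) to arm the range, then call it with protect set to 0 on the faulted page after recording the write.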
Reading from the userfaultfd structure
Each read(2) from the userfaultfd file descriptor returns one or
more uffd_msg structures, each of which describes a page-fault
event or an event required for the non-cooperative userfaultfd
usage:
    struct uffd_msg {
        __u8  event;            /* Type of event */
        ...
        union {
            struct {
                __u64 flags;    /* Flags describing fault */
                __u64 address;  /* Faulting address */
                union {
                    __u32 ptid; /* Thread ID of the fault */
                } feat;
            } pagefault;
            struct {            /* Since Linux 4.11 */
                __u32 ufd;      /* Userfault file descriptor
                                   of the child process */
            } fork;
            struct {            /* Since Linux 4.11 */
                __u64 from;     /* Old address of remapped area */
                __u64 to;       /* New address of remapped area */
                __u64 len;      /* Original mapping length */
            } remap;
            struct {            /* Since Linux 4.11 */
                __u64 start;    /* Start address of removed area */
                __u64 end;      /* End address of removed area */
            } remove;
            ...
        } arg;
        /* Padding fields omitted */
    } __packed;
If multiple events are available and the supplied buffer is large
enough, read(2) returns as many events as will fit in the
supplied buffer. If the buffer supplied to read(2) is smaller
than the size of the uffd_msg structure, the read(2) fails with
the error EINVAL.
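A batch read along these lines might be written as follows. This is a sketch only; the helper names uffd_msg_count and uffd_read_events are hypothetical, and treating EAGAIN as "no events pending" assumes the descriptor was opened with O_NONBLOCK.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Number of complete uffd_msg records contained in nbytes. */
size_t uffd_msg_count(size_t nbytes)
{
    return nbytes / sizeof(struct uffd_msg);
}

/* Reads up to cap events into msgs. Returns the number of events read,
   0 if the descriptor is non-blocking and nothing is pending (EAGAIN),
   or -1 with errno set on any other error. */
ssize_t uffd_read_events(int uffd, struct uffd_msg *msgs, size_t cap)
{
    ssize_t nread = read(uffd, msgs, cap * sizeof(*msgs));

    if (nread == -1)
        return (errno == EAGAIN) ? 0 : -1;
    return (ssize_t)uffd_msg_count((size_t)nread);
}
```

Sizing the buffer for several messages lets one read(2) drain multiple pending events; the caller then iterates over the returned count.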
The fields set in the uffd_msg structure are as follows:
event The type of event. Depending on the event type, different
fields of the arg union represent details required for the
event processing. The non-page-fault events are generated
only when the appropriate feature is enabled during the API
handshake with the UFFDIO_API ioctl(2).
The following values can appear in the event field:
UFFD_EVENT_PAGEFAULT (since Linux 4.3)
A page-fault event. The page-fault details are
available in the pagefault field.
UFFD_EVENT_FORK (since Linux 4.11)
Generated when the faulting process invokes fork(2)
(or clone(2) without the CLONE_VM flag). The event
details are available in the fork field.
UFFD_EVENT_REMAP (since Linux 4.11)
Generated when the faulting process invokes
mremap(2). The event details are available in the
remap field.
UFFD_EVENT_REMOVE (since Linux 4.11)
Generated when the faulting process invokes
madvise(2) with MADV_DONTNEED or MADV_REMOVE
advice. The event details are available in the
remove field.
UFFD_EVENT_UNMAP (since Linux 4.11)
Generated when the faulting process unmaps a memory
range, either explicitly using munmap(2) or
implicitly during mmap(2) or mremap(2). The event
details are available in the remove field.
pagefault.address
The address that triggered the page fault.
pagefault.flags
A bit mask of flags that describe the event. For
UFFD_EVENT_PAGEFAULT, the following flags may appear:
UFFD_PAGEFAULT_FLAG_WRITE
If the address is in a range that was registered
with the UFFDIO_REGISTER_MODE_MISSING flag (see
ioctl_userfaultfd(2)) and this flag is set, this is a
write fault; otherwise it is a read fault.
UFFD_PAGEFAULT_FLAG_WP
If the address is in a range that was registered
with the UFFDIO_REGISTER_MODE_WP flag, when this
bit is set, it means it is a write-protect fault.
Otherwise it is a page-missing fault.
pagefault.feat.ptid
The thread ID that triggered the page fault.
fork.ufd
The file descriptor associated with the userfault object
created for the child created by fork(2).
remap.from
The original address of the memory range that was remapped
using mremap(2).
remap.to
The new address of the memory range that was remapped
using mremap(2).
remap.len
The original length of the memory range that was remapped
using mremap(2).
remove.start
The start address of the memory range that was freed using
madvise(2) or unmapped.
remove.end
The end address of the memory range that was freed using
madvise(2) or unmapped.
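The event and flag fields described above could be decoded as in this sketch. It covers only the events and flags listed here; the enum fault_kind and the functions event_name and classify_fault are hypothetical names chosen for illustration.

```c
#include <linux/userfaultfd.h>

enum fault_kind {
    FAULT_NONE,           /* message is not a page-fault event */
    FAULT_MISSING_READ,   /* read access to a missing page */
    FAULT_MISSING_WRITE,  /* write access to a missing page */
    FAULT_WRITE_PROTECT   /* write to a write-protected page */
};

/* Short label for the event field of a uffd_msg. */
const char *event_name(unsigned char event)
{
    switch (event) {
    case UFFD_EVENT_PAGEFAULT: return "pagefault";
    case UFFD_EVENT_FORK:      return "fork";
    case UFFD_EVENT_REMAP:     return "remap";
    case UFFD_EVENT_REMOVE:    return "remove";
    case UFFD_EVENT_UNMAP:     return "unmap";
    default:                   return "unknown";
    }
}

/* Classify a page-fault message from pagefault.flags. The WP bit is
   tested first because write-protect faults also carry the WRITE bit. */
enum fault_kind classify_fault(const struct uffd_msg *msg)
{
    unsigned long long flags;

    if (msg->event != UFFD_EVENT_PAGEFAULT)
        return FAULT_NONE;
    flags = msg->arg.pagefault.flags;
    if (flags & UFFD_PAGEFAULT_FLAG_WP)
        return FAULT_WRITE_PROTECT;
    if (flags & UFFD_PAGEFAULT_FLAG_WRITE)
        return FAULT_MISSING_WRITE;
    return FAULT_MISSING_READ;
}
```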
A read(2) on a userfaultfd file descriptor can fail with the
following errors:
EINVAL
The userfaultfd object has not yet been enabled using the
UFFDIO_API ioctl(2) operation.
If the O_NONBLOCK flag is enabled in the associated open file
description, the userfaultfd file descriptor can be monitored
with poll(2), select(2), and epoll(7). When events are
available, the file descriptor is indicated as readable. If the
O_NONBLOCK flag is not enabled, then poll(2) (always) indicates
the file as having a POLLERR condition, and select(2) indicates
the file descriptor as both readable and writable.
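Waiting on a non-blocking descriptor with poll(2) can be sketched as below. The helper name wait_for_events is hypothetical, and mapping POLLERR to a -1 return is a design choice of this sketch rather than something the interface requires.

```c
#include <poll.h>

/* Waits up to timeout_ms milliseconds for fd to become readable.
   Returns 1 if events can be read, 0 on timeout, and -1 if poll(2)
   fails or reports POLLERR (as happens for a userfaultfd descriptor
   opened without O_NONBLOCK). */
int wait_for_events(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc <= 0)
        return rc;                 /* 0: timeout, -1: poll error */
    if (pfd.revents & POLLERR)
        return -1;
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

A monitor loop would call wait_for_events() and, on a return of 1, follow up with read(2) on the same descriptor.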