linux-stable/Documentation
Andrea Arcangeli adef440691 userfaultfd: UFFDIO_MOVE uABI
Implement the uABI of UFFDIO_MOVE ioctl.
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [2].

We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread's completion time by using UFFDIO_MOVE vs.  UFFDIO_COPY.  This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads). 
More details of the usecase are explained in [2].  Furthermore,
UFFDIO_MOVE enables moving swapped-out pages without touching them within
the same vma.  Today, it can only be done by mremap, however it forces
splitting the vma.

[1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
[2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/

Update for the ioctl_userfaultfd(2)  manpage:

   UFFDIO_MOVE
       (Since Linux xxx)  Move a continuous memory chunk into the
       userfault registered range and optionally wake up the blocked
       thread. The source and destination addresses and the number of
       bytes to move are specified by the src, dst, and len fields of
       the uffdio_move structure pointed to by argp:

           struct uffdio_move {
               __u64 dst;    /* Destination of move */
               __u64 src;    /* Source of move */
               __u64 len;    /* Number of bytes to move */
               __u64 mode;   /* Flags controlling behavior of move */
               __s64 move;   /* Number of bytes moved, or negated error */
           };

       The following value may be bitwise ORed in mode to change the
       behavior of the UFFDIO_MOVE operation:

       UFFDIO_MOVE_MODE_DONTWAKE
              Do not wake up the thread that waits for page-fault
              resolution

       UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
              Allow holes in the source virtual range that is being moved.
              When not specified, the holes will result in ENOENT error.
              When specified, the holes will be accounted as successfully
              moved memory. This is mostly useful to move hugepage aligned
              virtual regions without knowing if there are transparent
              hugepages in the regions or not, but preventing the risk of
              having to split the hugepage during the operation.

       The move field is used by the kernel to return the number of
       bytes that was actually moved, or an error (a negated errno-
       style value).  If the value returned in move doesn't match the
       value that was specified in len, the operation fails with the
       error EAGAIN.  The move field is output-only; it is not read by
       the UFFDIO_MOVE operation.

       The operation may fail for various reasons. Usually, remapping of
       pages that are not exclusive to the given process fail; once KSM
       might deduplicate pages or fork() COW-shares pages during fork()
       with child processes, they are no longer exclusive. Further, the
       kernel might only perform lightweight checks for detecting whether
       the pages are exclusive, and return -EBUSY in case that check fails.
       To make the operation more likely to succeed, KSM should be
       disabled, fork() should be avoided or MADV_DONTFORK should be
       configured for the source VMA before fork().

       This ioctl(2) operation returns 0 on success.  In this case, the
       entire area was moved.  On error, -1 is returned and errno is
       set to indicate the error.  Possible errors include:

       EAGAIN The number of bytes moved (i.e., the value returned in
              the move field) does not equal the value that was
              specified in the len field.

       EINVAL Either dst or len was not a multiple of the system page
              size, or the range specified by src and len or dst and len
              was invalid.

       EINVAL An invalid bit was specified in the mode field.

       ENOENT
              The source virtual memory range has unmapped holes and
              UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.

       EEXIST
              The destination virtual memory range is fully or partially
              mapped.

       EBUSY
              The pages in the source virtual memory range are either
              pinned or not exclusive to the process. The kernel might
              only perform lightweight checks for detecting whether the
              pages are exclusive. To make the operation more likely to
              succeed, KSM should be disabled, fork() should be avoided
              or MADV_DONTFORK should be configured for the source virtual
              memory area before fork().

       ENOMEM Allocating memory needed for the operation failed.

       ESRCH
              The target process has exited at the time of a UFFDIO_MOVE
              operation.

Link: https://lkml.kernel.org/r/20231206103702.3873743-3-surenb@google.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 11:58:24 -08:00
..
ABI Docs/ABI/damon: document DAMOS quota goals 2023-12-12 10:57:04 -08:00
PCI
RCU Merge branches 'rcu/torture', 'rcu/fixes', 'rcu/docs', 'rcu/refscale', 'rcu/tasks' and 'rcu/stall' into rcu/next 2023-10-23 15:24:11 +02:00
accel
accounting
admin-guide userfaultfd: UFFDIO_MOVE uABI 2023-12-29 11:58:24 -08:00
arch Docs/LoongArch: Update links in LoongArch introduction.rst 2023-11-21 15:03:25 +08:00
block The number of commits for documentation is not huge this time around, but 2023-11-01 17:11:41 -10:00
bpf bpf: Add __bpf_kfunc_{start,end}_defs macros 2023-11-01 22:33:53 -07:00
cdrom
core-api maple_tree: update the documentation of maple tree 2023-12-10 16:51:32 -08:00
cpu-freq
crypto crypto: ahash - remove support for nonzero alignmask 2023-10-27 18:04:29 +08:00
dev-tools Many singleton patches against the MM code. The patch series which are 2023-11-02 19:38:47 -10:00
devicetree Some pin control fixes for the v6.7 cycle: 2023-11-29 06:45:22 -08:00
doc-guide docs: doc-guide: mention 'make refcheckdocs' 2023-10-22 20:38:55 -06:00
driver-api media updates for v6.7-rc1 2023-11-06 15:06:06 -08:00
fault-injection
fb
features
filesystems mm: thp: introduce multi-size THP sysfs interface 2023-12-20 14:48:12 -08:00
firmware-guide
firmware_class
fpga
gpu Merge tag 'drm-misc-next-2023-10-27' of git://anongit.freedesktop.org/drm/drm-misc into drm-next 2023-10-31 10:47:50 +10:00
hid
hwmon hwmon: (aquacomputer_d5next) Add support for Aquacomputer High Flow USB and MPS Flow 2023-10-29 22:22:48 -07:00
i2c Documentation: i2c: add fault code for not supporting 10 bit addresses 2023-10-29 21:03:35 +01:00
iio
images
infiniband
input
isdn
kbuild Kbuild updates for v6.7 2023-11-04 08:07:19 -10:00
kernel-hacking
leds
litmus-tests
livepatch
locking
maintainer
mhi
misc-devices eeprom: remove doc and MAINTAINERS section after driver was removed 2023-10-18 10:01:34 +02:00
mm Docs/mm/damon/design: place execution model and data structures at the beginning 2023-12-20 14:48:13 -08:00
netlabel
netlink netlink: specs: devlink: add forgotten port function caps enum values 2023-11-01 22:13:43 -07:00
networking Including fixes from netfilter and bpf. 2023-11-09 17:09:35 -08:00
nvdimm
nvme
pcmcia
peci
power
process docs: netdev: try to guide people on dealing with silence 2023-11-21 14:35:43 -08:00
rust Rust changes for v6.7 2023-10-30 20:30:49 -10:00
scheduler asm-generic updates for v6.7 2023-11-01 15:28:33 -10:00
scsi
security
sound Linux 6.6-rc7 2023-10-23 19:38:22 +01:00
sphinx Documentation/sphinx: Remove the repeated word "the" in comments. 2023-10-22 20:33:38 -06:00
sphinx-static
spi
staging
target
timers
tools
trace Documentation: tracing: Add a note about argument and retval access 2023-11-10 19:59:03 +09:00
translations Docs/zh_CN/LoongArch: Update links in LoongArch introduction.rst 2023-11-21 15:03:26 +08:00
usb USB/Thunderbolt changes for 6.7-rc1 2023-11-03 16:00:42 -10:00
userspace-api drm next and fixes for 6.7-rc1 2023-11-07 17:10:02 -08:00
virt KVM/arm64 updates for 6.7 2023-10-31 16:37:07 -04:00
w1
watchdog
wmi
.gitignore
Changes
CodingStyle
Kconfig
Makefile
SubmittingPatches
atomic_bitops.txt
atomic_t.txt
conf.py
docutils.conf
dontdiff
index.rst
memory-barriers.txt
subsystem-apis.rst