linux-stable/fs
Axel Rasmussen fc71884a5f mm: userfaultfd: add new UFFDIO_POISON ioctl
The basic idea here is to "simulate" memory poisoning for VMs.  A VM
running on some host might encounter a memory error, after which some
page(s) are poisoned (i.e., future accesses SIGBUS).  They expect that
once poisoned, pages can never become "un-poisoned".  So, when we live
migrate the VM, we need to preserve the poisoned status of these pages.

When live migrating, we try to get the guest running on its new host as
quickly as possible.  So, we start it running before all memory has been
copied, and before we're certain which pages should be poisoned or not.

So the basic way to use this new feature is:

- On the new host, the guest's memory is registered with userfaultfd, in
  either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can
  communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_POISON - this places a swap marker
  so any future accesses will SIGBUS. Because the pte is now "present",
  future accesses won't generate more userfaultfd events, they'll just
  SIGBUS directly.

UFFDIO_POISON does not handle unmapping previously-present PTEs.  This
isn't needed, because during live migration we want to intercept all
accesses with userfaultfd (not just writes, so WP mode isn't useful for
this).  So whether minor or missing mode is being used (or both), the PTE
won't be present in any case, so handling that case isn't needed.

Similarly, UFFDIO_POISON won't replace existing PTE markers.  This might
be okay to do, but it seems to be safer to just refuse to overwrite any
existing entry (like a UFFD_WP PTE marker).

Link: https://lkml.kernel.org/r/20230707215540.2324998-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gaosheng Cui <cuigaosheng1@huawei.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: T.J. Alumbaugh <talumbau@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:16 -07:00
..
9p mm, netfs, fscache: stop read optimisation when folio removed from pagecache 2023-08-18 10:12:13 -07:00
adfs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
affs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
afs mm, netfs, fscache: stop read optimisation when folio removed from pagecache 2023-08-18 10:12:13 -07:00
autofs arch/*/configs/*defconfig: Replace AUTOFS4_FS by AUTOFS_FS 2023-07-29 14:08:22 -07:00
befs befs: Replace all non-returning strlcpy with strscpy 2023-05-30 16:42:00 -07:00
bfs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
btrfs for-6.5-rc3-tag 2023-07-27 11:44:08 -07:00
cachefiles mm, netfs, fscache: stop read optimisation when folio removed from pagecache 2023-08-18 10:12:13 -07:00
ceph mm, netfs, fscache: stop read optimisation when folio removed from pagecache 2023-08-18 10:12:13 -07:00
coda coda: Implement splice-read 2023-05-24 08:42:16 -06:00
configfs fs: consolidate duplicate dt_type helpers 2023-04-03 09:23:54 +02:00
cramfs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
crypto fscrypt: Replace 1-element array with flexible array 2023-05-23 19:46:09 -07:00
debugfs debugfs: Correct the 'debugfs_create_str' docs 2023-05-31 19:02:14 +01:00
devpts devpts: simplify two-level sysctl registration for pty_kern_table 2023-03-13 12:36:34 +01:00
dlm dlm for 6.5 2023-06-29 13:27:50 -07:00
ecryptfs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
efivarfs efivarfs: expose used and total size 2023-05-17 18:21:34 +02:00
efs
erofs erofs: fix fsdax unavailability for chunk-based regular files 2023-07-12 00:50:56 +08:00
exfat splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
exportfs exportfs: add explicit flag to request non-decodeable file handles 2023-05-22 18:08:37 +02:00
ext2 \n 2023-06-29 13:39:51 -07:00
ext4 mm: merge folio_has_private()/filemap_release_folio() call pairs 2023-08-18 10:12:12 -07:00
f2fs f2fs update for 6.5-rc1 2023-07-05 14:14:37 -07:00
fat splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
freevxfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
fscache fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work() 2023-01-30 12:51:54 +00:00
fuse fuse update for 6.5 2023-07-19 11:00:27 -07:00
gfs2 gfs2 fixes 2023-07-04 11:45:16 -07:00
hfs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
hfsplus splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
hostfs Landlock updates for v6.5-rc1 2023-06-27 17:10:27 -07:00
hpfs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
hugetlbfs hugetlb: revert use of page_cache_next_miss() 2023-06-23 16:59:32 -07:00
iomap iomap: micro optimize the ki_pos assignment in iomap_file_buffered_write 2023-07-17 08:49:57 -07:00
isofs
jbd2 jbd2: remove __journal_try_to_free_buffer() 2023-07-10 23:09:21 -04:00
jffs2 for-6.5/splice-2023-06-23 2023-06-26 11:52:12 -07:00
jfs Minor bug fixes and cleanups 2023-06-29 13:10:32 -07:00
kernfs driver core changes for 6.5-rc1 2023-07-03 12:56:23 -07:00
lockd NFS client updates for Linux 6.5 2023-07-01 14:38:25 -07:00
minix splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
netfs Move netfs_extract_iter_to_sg() to lib/scatterlist.c 2023-06-08 13:42:33 +02:00
nfs mm, netfs, fscache: stop read optimisation when folio removed from pagecache 2023-08-18 10:12:13 -07:00
nfs_common NFSv4.2: remove MODULE_LICENSE in non-modules 2023-04-13 13:13:52 -07:00
nfsd nfsd-6.5 fixes: 2023-07-25 13:54:04 -07:00
nilfs2 for-6.5/block-2023-06-23 2023-06-26 12:47:20 -07:00
nls fs/nls: make load_nls() take a const parameter 2023-07-25 00:30:02 -05:00
notify fanotify: disallow mount/sb marks on kernel internal pseudo fs 2023-07-04 13:29:29 +02:00
ntfs - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
ntfs3 driver ntfs3 for linux 6.5 2023-07-07 14:59:38 -07:00
ocfs2 fs: convert block_commit_write to return void 2023-08-18 10:12:07 -07:00
omfs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
openpromfs
orangefs orangefs: Provide a splice-read wrapper 2023-05-24 08:42:16 -06:00
overlayfs ovl: Always reevaluate the file signature for IMA 2023-07-25 15:36:22 -07:00
proc ksm: add ksm zero pages for each process 2023-08-18 10:12:10 -07:00
pstore pstore updates for v6.5-rc1 2023-06-27 21:21:32 -07:00
qnx4 qnx4: credit contributors in CREDITS 2023-03-14 12:56:30 -06:00
qnx6 qnx6: credit contributor and mark filesystem orphan 2023-03-14 12:56:30 -06:00
quota quota: fix warning in dqgrab() 2023-06-05 16:50:30 +02:00
ramfs - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
reiserfs - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
romfs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
smb mm, netfs, fscache: stop read optimisation when folio removed from pagecache 2023-08-18 10:12:13 -07:00
squashfs squashfs: fix cache race with migration 2023-07-08 09:29:30 -07:00
sysfs sysfs: Skip empty folders creation 2023-06-15 13:37:53 +02:00
sysv for-6.5/splice-2023-06-23 2023-06-26 11:52:12 -07:00
tracefs fs: port ->mkdir() to pass mnt_idmap 2023-01-19 09:24:26 +01:00
ubifs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
udf fs: convert block_commit_write to return void 2023-08-18 10:12:07 -07:00
ufs splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
unicode unicode: remove MODULE_LICENSE in non-modules 2023-04-13 13:13:54 -07:00
vboxsf hardening updates for v6.5-rc1 2023-06-27 21:24:18 -07:00
verity fsverity: improve documentation for builtin signature support 2023-06-20 22:47:55 -07:00
xfs xfs: convert flex-array declarations in xfs attr shortform objects 2023-07-17 08:48:56 -07:00
zonefs - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
Kconfig mm: make MEMFD_CREATE into a selectable config option 2023-08-18 10:12:01 -07:00
Kconfig.binfmt
Makefile for-6.5/block-2023-06-23 2023-06-26 12:47:20 -07:00
aio.c fs/aio: Stop allocating aio rings from HIGHMEM 2023-06-15 09:22:23 +02:00
anon_inodes.c
attr.c nfs: use vfs setgid helper 2023-03-30 08:51:48 +02:00
bad_inode.c fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
binfmt_elf.c Merge branch 'expand-stack' 2023-06-28 20:35:21 -07:00
binfmt_elf_fdpic.c binfmt: Slightly simplify elf_fdpic_map_file() 2023-05-30 15:49:46 -07:00
binfmt_elf_test.c
binfmt_flat.c
binfmt_misc.c binfmt_misc: fix shift-out-of-bounds in check_special_flags 2022-12-02 13:57:04 -08:00
binfmt_script.c
buffer.c fs: convert block_commit_write to return void 2023-08-18 10:12:07 -07:00
char_dev.c vfs: Replace all non-returning strlcpy with strscpy 2023-05-15 09:42:01 +02:00
compat_binfmt_elf.c
coredump.c v6.5/vfs.misc 2023-06-26 09:50:21 -07:00
d_path.c fs: d_path: include internal.h 2023-05-17 09:16:59 +02:00
dax.c dax: enable dax fault handler to report VM_FAULT_HWPOISON 2023-06-26 07:54:23 -06:00
dcache.c
direct-io.c - Yosry Ahmed brought back some cgroup v1 stats in OOM logs. 2023-06-28 10:28:11 -07:00
drop_caches.c fs: drop_caches: draining pages before dropping caches 2023-08-18 10:12:11 -07:00
eventfd.c eventfd: show the EFD_SEMAPHORE flag in fdinfo 2023-06-15 09:22:23 +02:00
eventpoll.c v6.5/vfs.misc 2023-06-26 09:50:21 -07:00
exec.c \n 2023-06-29 13:31:44 -07:00
fcntl.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
fhandle.c fsnotify: move fsnotify_open() hook into do_dentry_open() 2023-06-12 10:43:45 +02:00
file.c file: always lock position for FMODE_ATOMIC_POS 2023-07-24 10:16:36 -07:00
file_table.c fs: move cleanup from init_file() into its callers 2023-07-02 13:15:49 +02:00
filesystems.c
fs-writeback.c writeback: move wb_over_bg_thresh() call outside lock section 2023-06-09 16:25:14 -07:00
fs_context.c fs: avoid empty option when generating legacy mount string 2023-06-07 21:49:55 +02:00
fs_parser.c ext4: journal_path mount options should follow links 2022-12-01 10:46:54 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c fs: port ->permission() to pass mnt_idmap 2023-01-19 09:24:28 +01:00
inode.c fs: don't assume arguments are non-NULL 2023-07-04 10:21:11 +02:00
internal.h v6.5/vfs.file 2023-06-26 10:14:36 -07:00
ioctl.c fs: port inode_owner_or_capable() to mnt_idmap 2023-01-19 09:24:29 +01:00
kernel_read_file.c
libfs.c fs: factor out a direct_write_fallback helper 2023-06-09 16:25:53 -07:00
locks.c filelocks: use mount idmapping for setlease permission check 2023-03-09 22:36:12 +01:00
mbcache.c ext4: fix deadlock due to mbcache entry corruption 2022-12-08 21:49:25 -05:00
mnt_idmapping.c fs: move mnt_idmap 2023-01-19 09:24:30 +01:00
mount.h
mpage.c mpage: use folios in bio end_io handler 2023-04-18 16:30:02 -07:00
namei.c fs: no need to check source 2023-07-04 10:20:29 +02:00
namespace.c v6.5/vfs.mount 2023-06-26 10:27:04 -07:00
nsfs.c kill the last remaining user of proc_ns_fget() 2023-04-20 22:55:35 -04:00
open.c \n 2023-06-29 13:31:44 -07:00
pipe.c pipe: check for IOCB_NOWAIT alongside O_NONBLOCK 2023-05-12 17:17:27 +02:00
pnode.c fs: allow to mount beneath top mount 2023-05-19 04:30:22 +02:00
pnode.h fs: allow to mount beneath top mount 2023-05-19 04:30:22 +02:00
posix_acl.c acl: don't depend on IOP_XATTR 2023-03-06 09:59:20 +01:00
proc_namespace.c tty, proc, kernfs, random: Use copy_splice_read() 2023-05-24 08:42:16 -06:00
read_write.c splice: Use filemap_splice_read() instead of generic_file_splice_read() 2023-05-24 08:42:17 -06:00
readdir.c readdir: Replace one-element arrays with flexible-array members 2023-06-21 09:06:59 +02:00
remap_range.c fs: use UB-safe check for signed addition overflow in remap_verify_area 2023-05-24 11:03:59 +02:00
select.c
seq_file.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
signalfd.c
splice.c mm: merge folio_has_private()/filemap_release_folio() call pairs 2023-08-18 10:12:12 -07:00
stack.c
stat.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
statfs.c statfs: enforce statfs[64] structure initialization 2023-05-17 15:20:17 +02:00
super.c \n 2023-06-29 13:39:51 -07:00
sync.c
sysctls.c sysctl: Refactor base paths registrations 2023-05-23 21:43:26 -07:00
timerfd.c
userfaultfd.c mm: userfaultfd: add new UFFDIO_POISON ioctl 2023-08-18 10:12:16 -07:00
utimes.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
xattr.c fs: don't call posix_acl_listxattr in generic_listxattr 2023-05-17 15:25:20 +02:00