KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protections. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB
mapping. Decoupling the mapping sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
// SPDX-License-Identifier: GPL-2.0
#include <linux/backing-dev.h>
#include <linux/falloc.h>
#include <linux/kvm_host.h>
#include <linux/pagemap.h>
#include <linux/anon_inodes.h>

#include "kvm_mm.h"

struct kvm_gmem {
	struct kvm *kvm;
	struct xarray bindings;
	struct list_head entry;
};

static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
{
	struct folio *folio;

	/* TODO: Support huge pages. */
	folio = filemap_grab_folio(inode->i_mapping, index);
	if (IS_ERR_OR_NULL(folio))
		return NULL;

	/*
	 * Use the up-to-date flag to track whether or not the memory has been
	 * zeroed before being handed off to the guest.  There is no backing
	 * storage for the memory, so the folio will remain up-to-date until
	 * it's removed.
	 *
	 * TODO: Skip clearing pages when trusted firmware will do it when
	 * assigning memory to the guest.
	 */
	if (!folio_test_uptodate(folio)) {
		unsigned long nr_pages = folio_nr_pages(folio);
		unsigned long i;

		for (i = 0; i < nr_pages; i++)
			clear_highpage(folio_page(folio, i));

		folio_mark_uptodate(folio);
	}

	/*
	 * Ignore accessed, referenced, and dirty flags.  The memory is
	 * unevictable and there is no storage to write back to.
	 */
	return folio;
}

static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
				      pgoff_t end)
{
	bool flush = false, found_memslot = false;
	struct kvm_memory_slot *slot;
	struct kvm *kvm = gmem->kvm;
	unsigned long index;

	xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
		pgoff_t pgoff = slot->gmem.pgoff;

		struct kvm_gfn_range gfn_range = {
			.start = slot->base_gfn + max(pgoff, start) - pgoff,
			.end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
			.slot = slot,
			.may_block = true,
		};

		if (!found_memslot) {
			found_memslot = true;

			KVM_MMU_LOCK(kvm);
			kvm_mmu_invalidate_begin(kvm);
		}

		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
	}

	if (flush)
		kvm_flush_remote_tlbs(kvm);

	if (found_memslot)
		KVM_MMU_UNLOCK(kvm);
}

static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
				    pgoff_t end)
{
	struct kvm *kvm = gmem->kvm;

	if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
		KVM_MMU_LOCK(kvm);
		kvm_mmu_invalidate_end(kvm);
		KVM_MMU_UNLOCK(kvm);
	}
}

static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
{
	struct list_head *gmem_list = &inode->i_mapping->private_list;
	pgoff_t start = offset >> PAGE_SHIFT;
	pgoff_t end = (offset + len) >> PAGE_SHIFT;
	struct kvm_gmem *gmem;

	/*
	 * Bindings must be stable across invalidation to ensure the start+end
	 * are balanced.
	 */
	filemap_invalidate_lock(inode->i_mapping);

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_begin(gmem, start, end);

	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_end(gmem, start, end);

	filemap_invalidate_unlock(inode->i_mapping);

	return 0;
}

static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
{
	struct address_space *mapping = inode->i_mapping;
	pgoff_t start, index, end;
	int r;

	/* Dedicated guest is immutable by default. */
	if (offset + len > i_size_read(inode))
		return -EINVAL;

	filemap_invalidate_lock_shared(mapping);

	start = offset >> PAGE_SHIFT;
	end = (offset + len) >> PAGE_SHIFT;

	r = 0;
	for (index = start; index < end; ) {
		struct folio *folio;

		if (signal_pending(current)) {
			r = -EINTR;
			break;
		}

		folio = kvm_gmem_get_folio(inode, index);
		if (!folio) {
			r = -ENOMEM;
			break;
		}

		index = folio_next_index(folio);

		folio_unlock(folio);
		folio_put(folio);

		/* 64-bit only, wrapping the index should be impossible. */
		if (WARN_ON_ONCE(!index))
			break;

		cond_resched();
	}

	filemap_invalidate_unlock_shared(mapping);

	return r;
}

static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
			       loff_t len)
{
	int ret;

	if (!(mode & FALLOC_FL_KEEP_SIZE))
		return -EOPNOTSUPP;

	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
		return -EOPNOTSUPP;

	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
		return -EINVAL;

	if (mode & FALLOC_FL_PUNCH_HOLE)
		ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
	else
		ret = kvm_gmem_allocate(file_inode(file), offset, len);

	if (!ret)
		file_modified(file);
	return ret;
}

static int kvm_gmem_release(struct inode *inode, struct file *file)
{
	struct kvm_gmem *gmem = file->private_data;
	struct kvm_memory_slot *slot;
	struct kvm *kvm = gmem->kvm;
	unsigned long index;

	/*
	 * Prevent concurrent attempts to *unbind* a memslot.  This is the last
	 * reference to the file and thus no new bindings can be created, but
	 * dereferencing the slot for existing bindings needs to be protected
	 * against memslot updates, specifically so that unbind doesn't race
	 * and free the memslot (kvm_gmem_get_file() will return NULL).
	 */
	mutex_lock(&kvm->slots_lock);

	filemap_invalidate_lock(inode->i_mapping);

	xa_for_each(&gmem->bindings, index, slot)
		rcu_assign_pointer(slot->gmem.file, NULL);

	synchronize_rcu();

	/*
	 * All in-flight operations are gone and new bindings can be created.
	 * Zap all SPTEs pointed at by this file.  Do not free the backing
	 * memory, as its lifetime is associated with the inode, not the file.
	 */
	kvm_gmem_invalidate_begin(gmem, 0, -1ul);
	kvm_gmem_invalidate_end(gmem, 0, -1ul);

	list_del(&gmem->entry);

	filemap_invalidate_unlock(inode->i_mapping);

	mutex_unlock(&kvm->slots_lock);

	xa_destroy(&gmem->bindings);
	kfree(gmem);

	kvm_put_kvm(kvm);

	return 0;
}

Merge branch 'kvm-guestmemfd' into HEAD
Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM
to provide features, enhancements, and optimizations that are kludgy
or outright impossible to implement in a generic memory subsystem.
The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which,
similar to the generic memfd_create(), creates an anonymous file and
returns a file descriptor that refers to it. Again like "regular"
memfd files, guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped.
The key differences from memfd files (and every other memory subsystem)
are that guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
convert a guest memory area between the shared and guest-private states.
A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
specify attributes for a given page of guest memory. In the long term,
it will likely be extended to allow userspace to specify per-gfn RWX
protections, including allowing memory to be writable in the guest
without it also being writable in host userspace.
The immediate and driving use case for guest_memfd are Confidential
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
For such use cases, being able to map memory into KVM guests without
requiring said memory to be mapped into the host is a hard requirement.
While SEV+ and TDX prevent untrusted software from reading guest private
data by encrypting guest memory, pKVM provides confidentiality and
integrity *without* relying on memory encryption. In addition, with
SEV-SNP and especially TDX, accessing guest private memory can be fatal
to the host, i.e. KVM must prevent host userspace from accessing
guest memory irrespective of hardware behavior.
Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
for example hardening userspace against unintentional accesses to guest
memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to
define the allowed guest protections (with an exception granted to mapping
guest memory executable), and similarly KVM currently requires the guest
mapping size to be a strict subset of the host userspace mapping size.
Decoupling the mapping sizes would allow userspace to precisely map
only what is needed and with the required permissions, without impacting
guest performance.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to DMA from or into guest memory).
guest_memfd is the result of 3+ years of development and exploration;
taking on memory management responsibilities in KVM was not the first,
second, or even third choice for supporting CoCo VMs. But after many
failed attempts to avoid KVM-specific backing memory, and looking at
where things ended up, it is quite clear that of all approaches tried,
guest_memfd is the simplest, most robust, and most extensible, and the
right thing to do for KVM and the kernel at-large.
The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.
Pending post-merge work includes:
- hugepage support
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using
the same memory attributes introduced here
- SNP and TDX support
There are two non-KVM patches buried in the middle of this series:
fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
2023-11-13 10:58:30 +00:00
static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
{
	/*
	 * Do not return slot->gmem.file if it has already been closed;
	 * there might be some time between the last fput() and when
	 * kvm_gmem_release() clears slot->gmem.file, and you do not
	 * want to spin in the meanwhile.
	 */
	return get_file_active(&slot->gmem.file);
}
KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem. With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings. E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allow guest protection. Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.
Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping. Decoupling the mappings sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.
Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).
More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd. While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption. And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must be prevent host
userspace from accessing guest memory irrespective of hardware behavior.
Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.
Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.
Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory. That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.
Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem. I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.
Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay. Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.
Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-17-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 10:42:34 +00:00
}

static struct file_operations kvm_gmem_fops = {
	.open		= generic_file_open,
	.release	= kvm_gmem_release,
	.fallocate	= kvm_gmem_fallocate,
};

void kvm_gmem_init(struct module *module)
{
	kvm_gmem_fops.owner = module;
}

static int kvm_gmem_migrate_folio(struct address_space *mapping,
				  struct folio *dst, struct folio *src,
				  enum migrate_mode mode)
{
	WARN_ON_ONCE(1);
	return -EINVAL;
}

static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
{
	struct list_head *gmem_list = &mapping->private_list;
	struct kvm_gmem *gmem;
	pgoff_t start, end;

	filemap_invalidate_lock_shared(mapping);

	start = page->index;
	end = start + thp_nr_pages(page);

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_begin(gmem, start, end);

	/*
	 * Do not truncate the range, what action is taken in response to the
	 * error is userspace's decision (assuming the architecture supports
	 * gracefully handling memory errors).  If/when the guest attempts to
	 * access a poisoned page, kvm_gmem_get_pfn() will return -EHWPOISON,
	 * at which point KVM can either terminate the VM or propagate the
	 * error to userspace.
	 */

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_end(gmem, start, end);

	filemap_invalidate_unlock_shared(mapping);

	return MF_DELAYED;
}

static const struct address_space_operations kvm_gmem_aops = {
	.dirty_folio = noop_dirty_folio,
#ifdef CONFIG_MIGRATION
	.migrate_folio	= kvm_gmem_migrate_folio,
#endif
	.error_remove_page = kvm_gmem_error_page,
};

static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
			    struct kstat *stat, u32 request_mask,
			    unsigned int query_flags)
{
	struct inode *inode = path->dentry->d_inode;

	generic_fillattr(idmap, request_mask, inode, stat);
	return 0;
}

static int kvm_gmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
			    struct iattr *attr)
{
	return -EINVAL;
}

static const struct inode_operations kvm_gmem_iops = {
	.getattr	= kvm_gmem_getattr,
	.setattr	= kvm_gmem_setattr,
};

static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
{
	const char *anon_name = "[kvm-gmem]";
	struct kvm_gmem *gmem;
	struct inode *inode;
	struct file *file;
	int fd, err;

	fd = get_unused_fd_flags(0);
	if (fd < 0)
		return fd;

	gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
	if (!gmem) {
		err = -ENOMEM;
		goto err_fd;
	}

	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
					 O_RDWR, NULL);
	if (IS_ERR(file)) {
		err = PTR_ERR(file);
		goto err_gmem;
	}

	file->f_flags |= O_LARGEFILE;

	inode = file->f_inode;
	WARN_ON(file->f_mapping != inode->i_mapping);

	inode->i_private = (void *)(unsigned long)flags;
	inode->i_op = &kvm_gmem_iops;
	inode->i_mapping->a_ops = &kvm_gmem_aops;
	inode->i_mode |= S_IFREG;
	inode->i_size = size;
	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
	mapping_set_unmovable(inode->i_mapping);
	/* Unmovable mappings are supposed to be marked unevictable as well. */
	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));

	kvm_get_kvm(kvm);
	gmem->kvm = kvm;
	xa_init(&gmem->bindings);
	list_add(&gmem->entry, &inode->i_mapping->private_list);

	fd_install(fd, file);
	return fd;

err_gmem:
	kfree(gmem);
err_fd:
	put_unused_fd(fd);
	return err;
}

int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
{
	loff_t size = args->size;
	u64 flags = args->flags;
	u64 valid_flags = 0;

	if (flags & ~valid_flags)
		return -EINVAL;

	if (size <= 0 || !PAGE_ALIGNED(size))
		return -EINVAL;

	return __kvm_gmem_create(kvm, size, flags);
}

int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
		  unsigned int fd, loff_t offset)
{
	loff_t size = slot->npages << PAGE_SHIFT;
	unsigned long start, end;
	struct kvm_gmem *gmem;
	struct inode *inode;
	struct file *file;
	int r = -EINVAL;

	BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));

	file = fget(fd);
	if (!file)
		return -EBADF;

	if (file->f_op != &kvm_gmem_fops)
		goto err;

	gmem = file->private_data;
	if (gmem->kvm != kvm)
		goto err;

	inode = file_inode(file);

	if (offset < 0 || !PAGE_ALIGNED(offset) ||
	    offset + size > i_size_read(inode))
		goto err;

	filemap_invalidate_lock(inode->i_mapping);

	start = offset >> PAGE_SHIFT;
	end = start + slot->npages;

	if (!xa_empty(&gmem->bindings) &&
	    xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
		filemap_invalidate_unlock(inode->i_mapping);
		goto err;
	}

	/*
	 * No synchronize_rcu() needed, any in-flight readers are guaranteed
	 * to see either a NULL file or this new file, no need for them to go
	 * away.
	 */
	rcu_assign_pointer(slot->gmem.file, file);
	slot->gmem.pgoff = start;

	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
	filemap_invalidate_unlock(inode->i_mapping);

	/*
	 * Drop the reference to the file, even on success.  The file pins KVM,
	 * not the other way 'round.  Active bindings are invalidated if the
	 * file is closed before memslots are destroyed.
	 */
	r = 0;
err:
	fput(file);
	return r;
}

void kvm_gmem_unbind(struct kvm_memory_slot *slot)
{
	unsigned long start = slot->gmem.pgoff;
	unsigned long end = start + slot->npages;
	struct kvm_gmem *gmem;
	struct file *file;

	/*
	 * Nothing to do if the underlying file was already closed (or is being
	 * closed right now), kvm_gmem_release() invalidates all bindings.
	 */
	file = kvm_gmem_get_file(slot);
	if (!file)
		return;

	gmem = file->private_data;

	filemap_invalidate_lock(file->f_mapping);
	xa_store_range(&gmem->bindings, start, end - 1, NULL, GFP_KERNEL);
	rcu_assign_pointer(slot->gmem.file, NULL);
	synchronize_rcu();
	filemap_invalidate_unlock(file->f_mapping);

	fput(file);
}

int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
{
	pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
	struct kvm_gmem *gmem;
	struct folio *folio;
	struct page *page;
	struct file *file;
	int r;

	file = kvm_gmem_get_file(slot);
	if (!file)
		return -EFAULT;

	gmem = file->private_data;

	if (WARN_ON_ONCE(xa_load(&gmem->bindings, index) != slot)) {
		r = -EIO;
		goto out_fput;
	}

	folio = kvm_gmem_get_folio(file_inode(file), index);
	if (!folio) {
		r = -ENOMEM;
		goto out_fput;
	}

	if (folio_test_hwpoison(folio)) {
		r = -EHWPOISON;
		goto out_unlock;
	}

	page = folio_file_page(folio, index);

	*pfn = page_to_pfn(page);
	if (max_order)
		*max_order = 0;

	r = 0;

out_unlock:
	folio_unlock(folio);
out_fput:
	fput(file);

	return r;
}
EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);