linux-stable/arch/x86/kvm/mmu/tdp_mmu.c

// SPDX-License-Identifier: GPL-2.0
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include "mmu.h"
#include "mmu_internal.h"
#include "mmutrace.h"
#include "tdp_iter.h"
#include "tdp_mmu.h"
#include "spte.h"
#include <asm/cmpxchg.h>
#include <trace/events/kvm.h>
/* Initializes the TDP MMU for the VM, if enabled. */
void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
{
	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
}
/* Arbitrarily returns true so that this may be used in if statements. */
static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
							     bool shared)
{
	if (shared)
		lockdep_assert_held_read(&kvm->mmu_lock);
	else
		lockdep_assert_held_write(&kvm->mmu_lock);

	return true;
}
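
/*
 * For example, __for_each_tdp_mmu_root() below folds the assertion directly
 * into the iterator's filtering condition:
 *
 *	if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && ...)
 */
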
void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
{
	/*
	 * Invalidate all roots, which, besides the obvious, schedules all
	 * roots for zapping and thus puts the TDP MMU's reference to each
	 * root, i.e. ultimately frees all roots.
	 */
	kvm_tdp_mmu_invalidate_all_roots(kvm);
	kvm_tdp_mmu_zap_invalidated_roots(kvm);

	WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
	/*
	 * Ensure that all the outstanding RCU callbacks to free shadow pages
	 * can run before the VM is torn down. Putting the last reference to
	 * zapped roots will create new callbacks.
	 */
	rcu_barrier();
}
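
/* Free a shadow page's backing page-table page and its header. */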
static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
{
	free_page((unsigned long)sp->spt);
	kmem_cache_free(mmu_page_header_cache, sp);
}
/*
 * This is called through call_rcu in order to free TDP page table memory
 * safely with respect to other kernel threads that may be operating on
 * the memory.
 *
 * By only accessing TDP MMU page table memory in an RCU read critical
 * section, and freeing it after a grace period, lockless access to that
 * memory won't use it after it is freed.
 */
static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
{
	struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
					       rcu_head);
	tdp_mmu_free_sp(sp);
}
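
/*
 * Drop a reference to @root. When the last reference is put, the root is
 * removed from tdp_mmu_roots and freed after an RCU grace period.
 */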
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
{
	if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
		return;

	/*
	 * The TDP MMU itself holds a reference to each root until the root is
	 * explicitly invalidated, i.e. the final reference should never be
	 * put for a valid root.
	 */
	KVM_BUG_ON(!is_tdp_mmu_page(root) || !root->role.invalid, kvm);
	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
	list_del_rcu(&root->link);
	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

	call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
/*
 * Returns the next root after @prev_root (or the first root if @prev_root is
 * NULL). A reference to the returned root is acquired, and the reference to
 * @prev_root is released (the caller obviously must hold a reference to
 * @prev_root if it's non-NULL).
 *
 * If @only_valid is true, invalid roots are skipped.
 *
 * Returns NULL if the end of tdp_mmu_roots was reached.
 */
static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
					      struct kvm_mmu_page *prev_root,
					      bool only_valid)
{
	struct kvm_mmu_page *next_root;

	/*
	 * While the roots themselves are RCU-protected, fields such as
	 * role.invalid are protected by mmu_lock.
	 */
	lockdep_assert_held(&kvm->mmu_lock);

	rcu_read_lock();
	if (prev_root)
		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
						  &prev_root->link,
						  typeof(*prev_root), link);
	else
		next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
						   typeof(*next_root), link);
	while (next_root) {
		if ((!only_valid || !next_root->role.invalid) &&
		    kvm_tdp_mmu_get_root(next_root))
			break;

		next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
				&next_root->link, typeof(*next_root), link);
	}
	rcu_read_unlock();

	if (prev_root)
		kvm_tdp_mmu_put_root(kvm, prev_root);

	return next_root;
}
/*
 * Note: this iterator gets and puts references to the roots it iterates over.
 * This makes it safe to release the MMU lock and yield within the loop, but
 * if exiting the loop early, the caller must drop the reference to the most
 * recent root. (Unless keeping a live reference is desirable.)
 *
 * If shared is set, this function is operating under the MMU lock in read
 * mode.
 */
#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _only_valid) \
	for (_root = tdp_mmu_next_root(_kvm, NULL, _only_valid); \
	     ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
	     _root = tdp_mmu_next_root(_kvm, _root, _only_valid)) \
		if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
		} else
#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
	__for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, true)
#define for_each_tdp_mmu_root_yield_safe(_kvm, _root) \
for (_root = tdp_mmu_next_root(_kvm, NULL, false); \
({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
_root = tdp_mmu_next_root(_kvm, _root, false))
/*
* Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
* the implication being that any flow that holds mmu_lock for read is
* inherently yield-friendly and should use the yield-safe variant above.
* Holding mmu_lock for write obviates the need for RCU protection as the list
* is guaranteed to be stable.
*/
#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _only_valid) \
list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
((_only_valid) && (_root)->role.invalid))) { \
} else
#define for_each_tdp_mmu_root(_kvm, _root, _as_id) \
__for_each_tdp_mmu_root(_kvm, _root, _as_id, false)
#define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id) \
__for_each_tdp_mmu_root(_kvm, _root, _as_id, true)
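/*
 * Illustrative sketch (not part of the kernel source): a typical walk over
 * the valid roots of a single address space.  mmu_lock must be held for
 * write so that the roots list is stable; the macro asserts as much via
 * kvm_lockdep_assert_mmu_lock_held().  The wrapper name is hypothetical.
 *
 *	static void example_walk_valid_roots(struct kvm *kvm, int as_id)
 *	{
 *		struct kvm_mmu_page *root;
 *
 *		lockdep_assert_held_write(&kvm->mmu_lock);
 *		for_each_valid_tdp_mmu_root(kvm, root, as_id) {
 *			// Operate on @root; invalid roots and roots belonging
 *			// to other address spaces are skipped by the macro.
 *		}
 *	}
 */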
static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
{
struct kvm_mmu_page *sp;
sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
return sp;
}
static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
gfn_t gfn, union kvm_mmu_page_role role)
{
INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
sp->role = role;
sp->gfn = gfn;
sp->ptep = sptep;
sp->tdp_mmu_page = true;
trace_kvm_mmu_get_page(sp, true);
}
static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
struct tdp_iter *iter)
{
struct kvm_mmu_page *parent_sp;
union kvm_mmu_page_role role;
parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
role = parent_sp->role;
role.level--;
tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
}
int kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
union kvm_mmu_page_role role = mmu->root_role;
int as_id = kvm_mmu_role_as_id(role);
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_page *root;
/*
* Check for an existing root before acquiring the pages lock to avoid
* unnecessary serialization if multiple vCPUs are loading a new root.
* E.g. when bringing up secondary vCPUs, KVM will already have created
* a valid root on behalf of the primary vCPU.
*/
read_lock(&kvm->mmu_lock);
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
if (root->role.word == role.word)
goto out_read_unlock;
}
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
/*
* Recheck for an existing root after acquiring the pages lock, another
* vCPU may have raced ahead and created a new usable root. Manually
* walk the list of roots as the standard macros assume that the pages
* lock is *not* held. WARN if grabbing a reference to a usable root
* fails, as the last reference to a root can only be put *after* the
* root has been invalidated, which requires holding mmu_lock for write.
*/
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
if (root->role.word == role.word &&
!WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
goto out_spin_unlock;
}
root = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_sp(root, NULL, 0, role);
/*
* TDP MMU roots are kept until they are explicitly invalidated, either
* by a memslot update or by the destruction of the VM. Initialize the
* refcount to two; one reference for the vCPU, and one reference for
* the TDP MMU itself, which is held until the root is invalidated and
* is ultimately put by kvm_tdp_mmu_zap_invalidated_roots().
*/
refcount_set(&root->tdp_mmu_root_count, 2);
list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
out_spin_unlock:
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
out_read_unlock:
read_unlock(&kvm->mmu_lock);
/*
* Note, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS will prevent entering the guest
* and actually consuming the root if it's invalidated after dropping
* mmu_lock, and the root can't be freed as this vCPU holds a reference.
*/
mmu->root.hpa = __pa(root->spt);
mmu->root.pgd = 0;
return 0;
}
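/*
 * Reference-count lifecycle for TDP MMU roots, as implemented above:
 *  - a freshly allocated root starts with a count of two, one reference for
 *    the allocating vCPU and one for the TDP MMU itself;
 *  - a vCPU that finds an existing usable root takes its own reference via
 *    kvm_tdp_mmu_get_root() while holding tdp_mmu_pages_lock;
 *  - the TDP MMU's reference is put only after the root has been invalidated
 *    and zapped, i.e. by kvm_tdp_mmu_zap_invalidated_roots().
 */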
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte, int level,
bool shared);
static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
kvm_account_pgtable_pages((void *)sp->spt, +1);
atomic64_inc(&kvm->arch.tdp_mmu_pages);
}
static void tdp_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
kvm_account_pgtable_pages((void *)sp->spt, -1);
atomic64_dec(&kvm->arch.tdp_mmu_pages);
}
/**
* tdp_mmu_unlink_sp() - Remove a shadow page from the list of used pages
*
* @kvm: kvm instance
* @sp: the page to be removed
*/
static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
tdp_unaccount_mmu_page(kvm, sp);
if (!sp->nx_huge_page_disallowed)
return;
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
sp->nx_huge_page_disallowed = false;
untrack_possible_nx_huge_page(kvm, sp);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
}
/**
* handle_removed_pt() - handle a page table removed from the TDP structure
*
* @kvm: kvm instance
* @pt: the page removed from the paging structure
* @shared: This operation may not be running under the exclusive use
* of the MMU lock and the operation must synchronize with other
* threads that might be modifying SPTEs.
*
* Given a page table that has been removed from the TDP paging structure,
* iterates through the page table to clear SPTEs and free child page tables.
*
* Note that pt is passed in as a tdp_ptep_t, but it does not need RCU
* protection. Since this thread removed it from the paging structure,
* this thread will be responsible for ensuring the page is freed. Hence the
* early rcu_dereferences in the function.
*/
static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
{
struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
int level = sp->role.level;
gfn_t base_gfn = sp->gfn;
int i;
trace_kvm_mmu_prepare_zap_page(sp);
tdp_mmu_unlink_sp(kvm, sp);
for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
tdp_ptep_t sptep = pt + i;
gfn_t gfn = base_gfn + i * KVM_PAGES_PER_HPAGE(level);
u64 old_spte;
if (shared) {
/*
* Set the SPTE to a nonpresent value that other
* threads will not overwrite. If the SPTE was
			 * already marked as removed, another thread handling
			 * a page fault could later overwrite it, so keep
			 * retrying the write until the SPTE transitions from
			 * some other value to the removed SPTE value.
*/
for (;;) {
old_spte = kvm_tdp_mmu_write_spte_atomic(sptep, REMOVED_SPTE);
if (!is_removed_spte(old_spte))
break;
cpu_relax();
}
} else {
/*
* If the SPTE is not MMU-present, there is no backing
* page associated with the SPTE and so no side effects
* that need to be recorded, and exclusive ownership of
* mmu_lock ensures the SPTE can't be made present.
* Note, zapping MMIO SPTEs is also unnecessary as they
* are guarded by the memslots generation, not by being
* unreachable.
*/
old_spte = kvm_tdp_mmu_read_spte(sptep);
if (!is_shadow_present_pte(old_spte))
continue;
/*
* Use the common helper instead of a raw WRITE_ONCE as
* the SPTE needs to be updated atomically if it can be
* modified by a different vCPU outside of mmu_lock.
* Even though the parent SPTE is !PRESENT, the TLB
* hasn't yet been flushed, and both Intel and AMD
* document that A/D assists can use upper-level PxE
* entries that are cached in the TLB, i.e. the CPU can
* still access the page and mark it dirty.
*
* No retry is needed in the atomic update path as the
* sole concern is dropping a Dirty bit, i.e. no other
* task can zap/remove the SPTE as mmu_lock is held for
* write. Marking the SPTE as a removed SPTE is not
* strictly necessary for the same reason, but using
			 * the removed SPTE value keeps the shared/exclusive
* paths consistent and allows the handle_changed_spte()
* call below to hardcode the new value to REMOVED_SPTE.
*
* Note, even though dropping a Dirty bit is the only
* scenario where a non-atomic update could result in a
* functional bug, simply checking the Dirty bit isn't
* sufficient as a fast page fault could read the upper
* level SPTE before it is zapped, and then make this
* target SPTE writable, resume the guest, and set the
* Dirty bit between reading the SPTE above and writing
* it here.
*/
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
REMOVED_SPTE, level);
}
handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
old_spte, REMOVED_SPTE, level, shared);
}
call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
}
/**
* handle_changed_spte - handle bookkeeping associated with an SPTE change
* @kvm: kvm instance
* @as_id: the address space of the paging structure the SPTE was a part of
* @gfn: the base GFN that was mapped by the SPTE
* @old_spte: The value of the SPTE before the change
* @new_spte: The value of the SPTE after the change
* @level: the level of the PT the SPTE is part of in the paging structure
* @shared: This operation may not be running under the exclusive use of
* the MMU lock and the operation must synchronize with other
* threads that might be modifying SPTEs.
*
* Handle bookkeeping that might result from the modification of a SPTE. Note,
* dirty logging updates are handled in common code, not here (see make_spte()
* and fast_pf_fix_direct_spte()).
*/
static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
u64 old_spte, u64 new_spte, int level,
bool shared)
{
bool was_present = is_shadow_present_pte(old_spte);
bool is_present = is_shadow_present_pte(new_spte);
bool was_leaf = was_present && is_last_spte(old_spte, level);
bool is_leaf = is_present && is_last_spte(new_spte, level);
bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
WARN_ON_ONCE(level < PG_LEVEL_4K);
WARN_ON_ONCE(gfn & (KVM_PAGES_PER_HPAGE(level) - 1));
/*
* If this warning were to trigger it would indicate that there was a
* missing MMU notifier or a race with some notifier handler.
* A present, leaf SPTE should never be directly replaced with another
* present leaf SPTE pointing to a different PFN. A notifier handler
* should be zapping the SPTE before the main MM's page table is
* changed, or the SPTE should be zeroed, and the TLBs flushed by the
* thread before replacement.
*/
if (was_leaf && is_leaf && pfn_changed) {
pr_err("Invalid SPTE change: cannot replace a present leaf\n"
"SPTE with another present leaf SPTE mapping a\n"
"different PFN!\n"
"as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
as_id, gfn, old_spte, new_spte, level);
/*
* Crash the host to prevent error propagation and guest data
* corruption.
*/
BUG();
}
if (old_spte == new_spte)
return;
trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
if (is_leaf)
check_spte_writable_invariants(new_spte);
/*
* The only times a SPTE should be changed from a non-present to
* non-present state is when an MMIO entry is installed/modified/
* removed. In that case, there is nothing to do here.
*/
if (!was_present && !is_present) {
/*
		 * If this change does not involve an MMIO SPTE or removed SPTE,
* it is unexpected. Log the change, though it should not
* impact the guest since both the former and current SPTEs
* are nonpresent.
*/
if (WARN_ON_ONCE(!is_mmio_spte(old_spte) &&
!is_mmio_spte(new_spte) &&
!is_removed_spte(new_spte)))
pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
"should not be replaced with another,\n"
"different nonpresent SPTE, unless one or both\n"
"are MMIO SPTEs, or the new SPTE is\n"
"a temporary removed SPTE.\n"
"as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
as_id, gfn, old_spte, new_spte, level);
return;
}
if (is_leaf != was_leaf)
kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
if (was_leaf && is_dirty_spte(old_spte) &&
(!is_present || !is_dirty_spte(new_spte) || pfn_changed))
kvm_set_pfn_dirty(spte_to_pfn(old_spte));
/*
* Recursively handle child PTs if the change removed a subtree from
* the paging structure. Note the WARN on the PFN changing without the
* SPTE being converted to a hugepage (leaf) or being zapped. Shadow
* pages are kernel allocations and should never be migrated.
*/
if (was_present && !was_leaf &&
(is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
if (was_leaf && is_accessed_spte(old_spte) &&
(!is_present || !is_accessed_spte(new_spte) || pfn_changed))
kvm_set_pfn_accessed(spte_to_pfn(old_spte));
}
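/*
 * Quick reference for the transitions handled by handle_changed_spte():
 *  - present leaf -> present leaf with a different PFN: invalid, BUG()s;
 *  - non-present -> non-present: legal only for MMIO/removed SPTEs, no-op;
 *  - a change in whether the SPTE is a leaf adjusts the page stats for @level;
 *  - dirty/accessed leaf zapped or PFN changed: the old PFN is marked
 *    dirty/accessed as appropriate;
 *  - present non-leaf zapped or replaced by a leaf: the child page table is
 *    torn down via handle_removed_pt().
 */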
/*
* tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
* and handle the associated bookkeeping. Do not mark the page dirty
* in KVM's dirty bitmaps.
*
* If setting the SPTE fails because it has changed, iter->old_spte will be
* refreshed to the current value of the spte.
*
* @kvm: kvm instance
* @iter: a tdp_iter instance currently on the SPTE that should be set
* @new_spte: The value the SPTE should be set to
* Return:
* * 0 - If the SPTE was set.
* * -EBUSY - If the SPTE cannot be set. In this case this function will have
* no side-effects other than setting iter->old_spte to the last
* known value of the spte.
*/
static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter,
u64 new_spte)
{
u64 *sptep = rcu_dereference(iter->sptep);
/*
* The caller is responsible for ensuring the old SPTE is not a REMOVED
* SPTE. KVM should never attempt to zap or manipulate a REMOVED SPTE,
* and pre-checking before inserting a new SPTE is advantageous as it
* avoids unnecessary work.
*/
WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
lockdep_assert_held_read(&kvm->mmu_lock);
/*
* Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
* does not hold the mmu_lock. On failure, i.e. if a different logical
* CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
* the current value, so the caller operates on fresh data, e.g. if it
	 * retries tdp_mmu_set_spte_atomic().
*/
if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
return -EBUSY;
handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
new_spte, iter->level, true);
return 0;
}
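/*
 * Illustrative sketch (not part of the kernel source) of reacting to -EBUSY:
 * because iter->old_spte is refreshed on failure, the caller can recompute
 * the new SPTE from fresh data and retry.  example_make_spte() and the
 * wrapper below are hypothetical.
 *
 *	static int example_update_spte(struct kvm *kvm, struct tdp_iter *iter)
 *	{
 *		u64 new_spte;
 *
 *		do {
 *			// Bail if another task froze the SPTE; retrying with a
 *			// removed old_spte is disallowed (see the WARN above).
 *			if (is_removed_spte(iter->old_spte))
 *				return -EBUSY;
 *			new_spte = example_make_spte(iter->old_spte);
 *		} while (tdp_mmu_set_spte_atomic(kvm, iter, new_spte) == -EBUSY);
 *
 *		return 0;
 *	}
 */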
static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
struct tdp_iter *iter)
{
int ret;
/*
* Freeze the SPTE by setting it to a special,
* non-present value. This will stop other threads from
* immediately installing a present entry in its place
* before the TLBs are flushed.
*/
ret = tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE);
if (ret)
return ret;
kvm_flush_remote_tlbs_gfn(kvm, iter->gfn, iter->level);
/*
* No other thread can overwrite the removed SPTE as they must either
* wait on the MMU lock or use tdp_mmu_set_spte_atomic() which will not
* overwrite the special removed SPTE value. No bookkeeping is needed
* here since the SPTE is going from non-present to non-present. Use
* the raw write helper to avoid an unnecessary check on volatile bits.
*/
__kvm_tdp_mmu_write_spte(iter->sptep, 0);
return 0;
}
/*
* tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
* @kvm: KVM instance
* @as_id: Address space ID, i.e. regular vs. SMM
* @sptep: Pointer to the SPTE
* @old_spte: The current value of the SPTE
* @new_spte: The new value that will be set for the SPTE
* @gfn: The base GFN that was (or will be) mapped by the SPTE
* @level: The level _containing_ the SPTE (its parent PT's level)
*
* Returns the old SPTE value, which _may_ be different than @old_spte if the
 * SPTE had volatile bits.
*/
static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
u64 old_spte, u64 new_spte, gfn_t gfn, int level)
{
lockdep_assert_held_write(&kvm->mmu_lock);
/*
* No thread should be using this function to set SPTEs to or from the
* temporary removed SPTE value.
* If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
* should be used. If operating under the MMU lock in write mode, the
* use of the removed SPTE should not be necessary.
*/
WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
return old_spte;
}
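/*
 * Illustrative note (not part of the upstream file): callers of
 * tdp_mmu_set_spte() are expected to consume the returned value rather than
 * reuse their own snapshot of the old SPTE, as the XCHG may observe
 * Accessed/Dirty bits that hardware set after the snapshot was taken.  The
 * iterator wrapper below is the in-file example of that pattern.
 */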
static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
u64 new_spte)
{
WARN_ON_ONCE(iter->yielded);
iter->old_spte = tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep,
iter->old_spte, new_spte,
iter->gfn, iter->level);
}
#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
for_each_tdp_pte(_iter, _root, _start, _end)
#define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end) \
tdp_root_for_each_pte(_iter, _root, _start, _end) \
if (!is_shadow_present_pte(_iter.old_spte) || \
!is_last_spte(_iter.old_spte, _iter.level)) \
continue; \
else
#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end) \
for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end)
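/*
 * Illustrative sketch (not upstream code): a typical write-mode walker built
 * on the iterator macros above looks roughly like the following; the names
 * root, start, end and new_spte are placeholders chosen for illustration:
 *
 *	rcu_read_lock();
 *	tdp_root_for_each_leaf_pte(iter, root, start, end) {
 *		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, false))
 *			continue;
 *		tdp_mmu_iter_set_spte(kvm, &iter, new_spte);
 *	}
 *	rcu_read_unlock();
 *
 * The leaf variant skips non-present and non-last-level SPTEs, so the loop
 * body only ever sees present, last-level entries.
 */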
/*
* Yield if the MMU lock is contended or this thread needs to return control
* to the scheduler.
*
* If this function should yield and flush is set, it will perform a remote
* TLB flush before yielding.
*
* If this function yields, iter->yielded is set and the caller must skip to
* the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
* over the paging structures to allow the iterator to continue its traversal
* from the paging structure root.
*
* Returns true if this function yielded.
*/
static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
struct tdp_iter *iter,
bool flush, bool shared)
{
WARN_ON_ONCE(iter->yielded);
/* Ensure forward progress has been made before yielding. */
if (iter->next_last_level_gfn == iter->yielded_gfn)
return false;
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
if (flush)
kvm_flush_remote_tlbs(kvm);
rcu_read_unlock();
if (shared)
cond_resched_rwlock_read(&kvm->mmu_lock);
else
cond_resched_rwlock_write(&kvm->mmu_lock);
rcu_read_lock();
WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);
iter->yielded = true;
}
return iter->yielded;
}
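/*
 * Illustrative sketch (not upstream code): zap-style walkers typically track
 * whether a TLB flush is pending and let this helper perform it on yield,
 * clearing their local state afterwards, roughly:
 *
 *	if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, shared)) {
 *		flush = false;
 *		continue;
 *	}
 *
 * Calling the helper at the top of the loop body, before doing any work on
 * the current SPTE, keeps the "yield means skip to the next iteration" rule
 * easy to honor.
 */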
static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
{
/*
* Bound TDP MMU walks at host.MAXPHYADDR. KVM disallows memslots with
* a gpa range that would exceed the max gfn, and KVM does not create
* MMIO SPTEs for "impossible" gfns, instead sending such accesses down
* the slow emulation path every time.
*/
return kvm_mmu_max_gfn() + 1;
}
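/*
 * Worked example (illustrative, assuming a 52-bit host MAXPHYADDR and 4KiB
 * pages): the highest valid gfn is (1ULL << (52 - 12)) - 1, so the exclusive
 * bound returned above is 1ULL << 40, i.e. walks cover gfns [0, 1ULL << 40).
 */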
static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared, int zap_level)
{
struct tdp_iter iter;
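/*
* Zap the full range of gfns the root can map, [0, host.MAXPHYADDR-based
* max gfn).
*/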
gfn_t end = tdp_mmu_max_gfn_exclusive();
gfn_t start = 0;
for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
if (!is_shadow_present_pte(iter.old_spte))
continue;
if (iter.level > zap_level)
continue;
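/*
* With mmu_lock held for write (!shared), the SPTE can be zapped
* outright.  With mmu_lock held for read (shared), the zap must be done
* atomically and retried if another task modified the SPTE concurrently.
*/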
if (!shared)
tdp_mmu_iter_set_spte(kvm, &iter, 0);
else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
goto retry;
}
}
static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
bool shared)
{
/*
* The root must have an elevated refcount so that it's reachable via
* mmu_notifier callbacks, which allows this path to yield and drop
* mmu_lock. When handling an unmap/release mmu_notifier command, KVM
* must drop all references to relevant pages prior to completing the
* callback. Dropping mmu_lock with an unreachable root would result
* in zapping SPTEs after a relevant mmu_notifier callback completes
* and lead to use-after-free as zapping a SPTE triggers "writeback" of
* dirty accessed bits to the SPTE's associated struct page.
*/
WARN_ON_ONCE(!refcount_read(&root->tdp_mmu_root_count));
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
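/* Page table pages zapped by this walk are freed via an RCU callback, so hold RCU across the walk. */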
rcu_read_lock();
/*
* Zap roots in multiple passes of decreasing granularity, i.e. zap at
* 4KiB=>2MiB=>1GiB=>root, in order to better honor need_resched() (all
* preempt models) or mmu_lock contention (full or real-time models).
* Zapping at finer granularity marginally increases the total time of
* the zap, but in most cases the zap itself isn't latency sensitive.
*
* If KVM is configured to prove the MMU, skip the 4KiB and 2MiB zaps
* in order to mimic the page fault path, which can replace a 1GiB page
* table with an equivalent 1GiB hugepage, i.e. can get saddled with
* zapping a 1GiB region that's fully populated with 4KiB SPTEs. This
* allows verifying that KVM can safely zap 1GiB regions, e.g. without
* inducing RCU stalls, without relying on a relatively rare event
* (zapping roots is orders of magnitude more common). Note, because
* zapping a SP recurses on its children, stepping down to PG_LEVEL_4K
* in the iterator itself is unnecessary.
*/
if (!IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
__tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_4K);
__tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_2M);
}
__tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_1G);
__tdp_mmu_zap_root(kvm, root, shared, root->role.level);
rcu_read_unlock();
}
bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
u64 old_spte;
/*
* This helper intentionally doesn't allow zapping a root shadow page,
* which doesn't have a parent page table and thus no associated entry.
*/
if (WARN_ON_ONCE(!sp->ptep))
return false;
old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
return false;
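/*
* Zap the parent entry that points at @sp; mmu_lock is held for write,
* so a non-atomic update is safe.
*/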
tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
sp->gfn, sp->role.level + 1);
return true;
}
/*
* If can_yield is true, will release the MMU lock and reschedule if the
* scheduler needs the CPU or there is contention on the MMU lock. If this
* function cannot yield, it will not release the MMU lock or reschedule and
* the caller must ensure it does not supply too large a GFN range, or the
* operation can cause a soft lockup.
*/
static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t start, gfn_t end, bool can_yield, bool flush)
{
struct tdp_iter iter;
end = min(end, tdp_mmu_max_gfn_exclusive());
lockdep_assert_held_write(&kvm->mmu_lock);
rcu_read_lock();
for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
if (can_yield &&
tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
flush = false;
continue;
}
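/* Only zap leaf SPTEs; present upper-level entries (page tables) are kept. */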
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;
tdp_mmu_iter_set_spte(kvm, &iter, 0);
/*
* Zapping SPTEs in invalid roots doesn't require a TLB flush,
* see kvm_tdp_mmu_zap_invalidated_roots() for details.
*/
if (!root->role.invalid)
flush = true;
}
rcu_read_unlock();
/*
* Because this flow zaps _only_ leaf SPTEs, the caller doesn't need
* to provide RCU protection as no 'struct kvm_mmu_page' will be freed.
*/
return flush;
}
/*
* Zap leaf SPTEs for the range of gfns, [start, end), for all *VALID* roots.
* Returns true if a TLB flush is needed before releasing the MMU lock, i.e. if
* one or more SPTEs were zapped since the MMU lock was last acquired.
*/
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
{
struct kvm_mmu_page *root;
lockdep_assert_held_write(&kvm->mmu_lock);
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, -1)
flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
return flush;
}
void kvm_tdp_mmu_zap_all(struct kvm *kvm)
{
struct kvm_mmu_page *root;
/*
* Zap all roots, including invalid roots, as all SPTEs must be dropped
* before returning to the caller. Zap directly even if the root is
* also being zapped by kvm_tdp_mmu_zap_invalidated_roots().  Walking
* zapped top-level SPTEs isn't all that expensive, and mmu_lock is
* already held for write, which means any in-progress read-side zap has
* yielded, i.e. deferring to that zap instead of zapping here isn't
* guaranteed to be any faster.
*
* A TLB flush is unnecessary, KVM zaps everything if and only if the VM
* is being destroyed or the userspace VMM has exited. In both cases,
* KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
*/
lockdep_assert_held_write(&kvm->mmu_lock);
for_each_tdp_mmu_root_yield_safe(kvm, root)
tdp_mmu_zap_root(kvm, root, false);
}
/*
* Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
* zap" completes.
*/
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
{
struct kvm_mmu_page *root;
read_lock(&kvm->mmu_lock);
for_each_tdp_mmu_root_yield_safe(kvm, root) {
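/*
* Skip roots that weren't flagged by kvm_tdp_mmu_invalidate_all_roots(),
* i.e. valid roots and invalid roots that were already zapped (but not
* yet freed) by a previous call.
*/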
if (!root->tdp_mmu_scheduled_root_to_zap)
continue;
root->tdp_mmu_scheduled_root_to_zap = false;
KVM_BUG_ON(!root->role.invalid, kvm);
/*
* A TLB flush is not necessary as KVM performs a local TLB
* flush when allocating a new root (see kvm_mmu_load()), and
* when migrating a vCPU to a different pCPU. Note, the local
* TLB flush on reuse also invalidates paging-structure-cache
* entries, i.e. TLB entries for intermediate paging structures,
* that may be zapped, as such entries are associated with the
* ASID on both VMX and SVM.
*/
tdp_mmu_zap_root(kvm, root, true);
/*
* The reference needs to be put *after* zapping the root, as the root
* must be reachable by mmu_notifiers while it's being zapped.
*/
kvm_tdp_mmu_put_root(kvm, root);
}
read_unlock(&kvm->mmu_lock);
}
/*
* Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
* is about to be zapped, e.g. in response to a memslots update. The actual
* zapping is done separately so that it happens with mmu_lock held for read,
* whereas invalidating roots must be done with mmu_lock held for write (unless
* the VM is being destroyed).
*
* Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
* See kvm_tdp_mmu_alloc_root().
*/
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
{
struct kvm_mmu_page *root;
/*
* mmu_lock must be held for write to ensure that a root doesn't become
* invalid while there are active readers (invalidating a root while
* there are active readers may or may not be problematic in practice,
* but it's uncharted territory and not supported).
*
* Waive the assertion if there are no users of @kvm, i.e. the VM is
* being destroyed after all references have been put, or if no vCPUs
* have been created (which means there are no roots), i.e. the VM is
* being destroyed in an error path of KVM_CREATE_VM.
*/
if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
refcount_read(&kvm->users_count) && kvm->created_vcpus)
lockdep_assert_held_write(&kvm->mmu_lock);
/*
* As above, mmu_lock isn't held when destroying the VM! There can't
* be other references to @kvm, i.e. nothing else can invalidate roots
* or get/put references to roots.
*/
list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
/*
* Note, invalid roots can outlive a memslot update! Invalid
* roots must be *zapped* before the memslot update completes,
* but a different task can acquire a reference and keep the
 * root alive after it's been zapped.
*/
if (!root->role.invalid) {
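/*
 * Mark the root invalid so that it can't be reused, and flag it as
 * scheduled for zapping so that the root is zapped exactly once even
 * if multiple invalidations race.
 */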
root->tdp_mmu_scheduled_root_to_zap = true;
root->role.invalid = true;
}
}
}
/*
* Installs a last-level SPTE to handle a TDP page fault.
* (NPT/EPT violation/misconfiguration)
*/
static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
struct kvm_page_fault *fault,
struct tdp_iter *iter)
{
struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
u64 new_spte;
int ret = RET_PF_FIXED;
bool wrprot = false;
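/*
 * Bail and retry the fault if the SP's level doesn't match the target
 * level; installing a leaf SPTE at the wrong level would corrupt the
 * paging structures.
 */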
if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
return RET_PF_RETRY;
if (unlikely(!fault->slot))
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
fault->pfn, iter->old_spte, fault->prefetch, true,
fault->map_writable, &new_spte);
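/*
 * If the new SPTE is identical to the old SPTE, a different task has
 * already fixed the fault (spurious).  If the atomic update fails,
 * another task modified the SPTE underneath this one; retry the fault.
 */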
if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
return RET_PF_RETRY;
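/*
 * If a present non-leaf SPTE, i.e. a page table, was replaced with a
 * leaf SPTE, flush remote TLBs so that vCPUs stop walking the stale
 * paging structure.
 */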
else if (is_shadow_present_pte(iter->old_spte) &&
!is_last_spte(iter->old_spte, iter->level))
kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
/*
* If the page fault was caused by a write but the page is write
* protected, emulation is needed. If the emulation was skipped,
* the vCPU would have the same fault again.
*/
if (wrprot) {
if (fault->write)
ret = RET_PF_EMULATE;
}
/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
if (unlikely(is_mmio_spte(new_spte))) {
vcpu->stat.pf_mmio_spte_created++;
trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
new_spte);
ret = RET_PF_EMULATE;
} else {
trace_kvm_mmu_set_spte(iter->level, iter->gfn,
rcu_dereference(iter->sptep));
}
return ret;
}
/*
* tdp_mmu_link_sp - Replace the given spte with an spte pointing to the
* provided page table.
*
* @kvm: kvm instance
* @iter: a tdp_iter instance currently on the SPTE that should be set
* @sp: The new TDP page table to install.
* @shared: This operation is running under the MMU lock in read mode.
*
* Returns: 0 if the new page table was installed. Non-0 if the page table
* could not be installed (e.g. the atomic compare-exchange failed).
*/
static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_mmu_page *sp, bool shared)
{
u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled());
int ret = 0;
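/*
 * When running under the read lock (shared), other tasks may be
 * modifying the same SPTE concurrently, so the update must be done
 * atomically and may fail; under the write lock, a plain write
 * suffices.
 */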
if (shared) {
ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
if (ret)
return ret;
} else {
tdp_mmu_iter_set_spte(kvm, iter, spte);
}
tdp_account_mmu_page(kvm, sp);
return 0;
}
static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_mmu_page *sp, bool shared);
/*
* Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
* page tables and SPTEs to translate the faulting guest physical address.
*/
int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
struct kvm *kvm = vcpu->kvm;
struct tdp_iter iter;
struct kvm_mmu_page *sp;
int ret = RET_PF_RETRY;
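/*
 * Determine the largest page size that can be used to map the fault,
 * e.g. based on the host mapping and memslot constraints.
 */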
kvm_mmu_hugepage_adjust(vcpu, fault);
trace_kvm_mmu_spte_requested(fault);
rcu_read_lock();
tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
int r;
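/*
 * Account for the NX huge page workaround: if an existing page table
 * at this level was created because a huge page is disallowed, map
 * the fault at a lower level instead.
 */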
if (fault->nx_huge_page_workaround_enabled)
disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
/*
* If SPTE has been frozen by another thread, just give up and
 * retry, avoiding unnecessary page table allocation and freeing.
*/
if (is_removed_spte(iter.old_spte))
goto retry;
if (iter.level == fault->goal_level)
goto map_target_level;
/* Step down into the lower level page table if it exists. */
if (is_shadow_present_pte(iter.old_spte) &&
!is_large_pte(iter.old_spte))
continue;
/*
* The SPTE is either non-present or points to a huge page that
* needs to be split.
*/
sp = tdp_mmu_alloc_sp(vcpu);
tdp_mmu_init_child_sp(sp, &iter);
sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
if (is_shadow_present_pte(iter.old_spte))
r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
else
r = tdp_mmu_link_sp(kvm, &iter, sp, true);
/*
* Force the guest to retry if installing an upper level SPTE
* failed, e.g. because a different task modified the SPTE.
*/
if (r) {
tdp_mmu_free_sp(sp);
goto retry;
}
if (fault->huge_page_disallowed &&
fault->req_level >= iter.level) {
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
if (sp->nx_huge_page_disallowed)
track_possible_nx_huge_page(kvm, sp);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
}
}
/*
* The walk aborted before reaching the target level, e.g. because the
* iterator detected an upper level SPTE was frozen during traversal.
*/
WARN_ON_ONCE(iter.level == fault->goal_level);
goto retry;
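/* The walk reached the target level; install the final leaf SPTE. */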
map_target_level:
ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
retry:
rcu_read_unlock();
return ret;
}
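/*
* Zap leaf SPTEs for the range of GFNs being unmapped and return true if a
* TLB flush is needed, i.e. if @flush was already true or any SPTE was
* zapped in a root for the slot's address space.
*/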
bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
bool flush)
{
struct kvm_mmu_page *root;
__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
range->may_block, flush);
return flush;
}
typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_gfn_range *range);
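/*
* Invoke @handler on each leaf SPTE that overlaps the notifier range, in
* every root for the slot's address space, and OR together the results.
*/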
static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
struct kvm_gfn_range *range,
tdp_handler_t handler)
{
struct kvm_mmu_page *root;
struct tdp_iter iter;
bool ret = false;
/*
* Don't support rescheduling: none of the MMU notifiers that funnel
* into this helper allow blocking; it'd be dead, wasteful code.
*/
for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
rcu_read_lock();
tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
ret |= handler(kvm, &iter, range);
rcu_read_unlock();
}
return ret;
}
/*
* Mark the SPTEs mapping GFNs in the range [start, end) as unaccessed and
* return true if any of the GFNs in the range had been accessed.
*
* No need to mark the corresponding PFN as accessed as this call is coming
* from the clear_young() or clear_flush_young() notifier, which uses the
* return value to determine if the page has been accessed.
*/
static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_gfn_range *range)
{
u64 new_spte;
/* If we have a non-accessed entry we don't need to change the pte. */
if (!is_accessed_spte(iter->old_spte))
return false;
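/* If A/D bits are in use, simply clear the accessed bit in place. */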
if (spte_ad_enabled(iter->old_spte)) {
iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
iter->old_spte,
shadow_accessed_mask,
iter->level);
new_spte = iter->old_spte & ~shadow_accessed_mask;
} else {
/*
* Capture the dirty status of the page, so that it doesn't get
* lost when the SPTE is marked for access tracking.
*/
if (is_writable_pte(iter->old_spte))
kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
new_spte = mark_spte_for_access_track(iter->old_spte);
iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
iter->old_spte, new_spte,
iter->level);
}
trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
iter->old_spte, new_spte);
return true;
}
bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
}
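/* Check if the GFN's SPTE has been accessed, without clearing the accessed state. */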
static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_gfn_range *range)
{
return is_accessed_spte(iter->old_spte);
}
bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
}
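/*
* Zap the 4KiB SPTE for a GFN whose host PTE has changed and, if the new
* host PTE is read-only, immediately install a read-only SPTE pointing at
* the new PFN. Writable mappings are simply recreated on the next fault.
*/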
static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_gfn_range *range)
{
u64 new_spte;
/* Huge pages aren't expected to be modified without first being zapped. */
WARN_ON_ONCE(pte_huge(range->arg.pte) || range->start + 1 != range->end);
if (iter->level != PG_LEVEL_4K ||
!is_shadow_present_pte(iter->old_spte))
return false;
/*
* Note, when changing a read-only SPTE, it's not strictly necessary to
* zero the SPTE before setting the new PFN, but doing so preserves the
* invariant that the PFN of a present leaf SPTE can never change.
* See handle_changed_spte().
*/
tdp_mmu_iter_set_spte(kvm, iter, 0);
if (!pte_write(range->arg.pte)) {
new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
pte_pfn(range->arg.pte));
tdp_mmu_iter_set_spte(kvm, iter, new_spte);
}
return true;
}
/*
* Handle the changed_pte MMU notifier for the TDP MMU.
* The new host PTE is provided by the notifier via range->arg.pte.
* Returns true if a TLB flush is needed before releasing the MMU lock.
*/
bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
/*
* No need to handle the remote TLB flush under RCU protection, the
* target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
* shadow page. See the WARN on pfn_changed in handle_changed_spte().
*/
return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
}
/*
* Remove write access from all SPTEs at or above min_level that map GFNs
* [start, end). Returns true if an SPTE has been changed and the TLBs need to
* be flushed.
*/
static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t start, gfn_t end, int min_level)
{
struct tdp_iter iter;
u64 new_spte;
bool spte_set = false;
rcu_read_lock();
BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
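/* Skip SPTEs that are non-present, non-leaf, or already write-protected. */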
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level) ||
!(iter.old_spte & PT_WRITABLE_MASK))
continue;
new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
goto retry;
spte_set = true;
}
rcu_read_unlock();
return spte_set;
}
/*
* Remove write access from all the SPTEs mapping GFNs in the memslot. Will
* only affect leaf SPTEs down to min_level.
* Returns true if an SPTE has been changed and the TLBs need to be flushed.
*/
bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
const struct kvm_memory_slot *slot, int min_level)
{
struct kvm_mmu_page *root;
bool spte_set = false;
lockdep_assert_held_read(&kvm->mmu_lock);
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
slot->base_gfn + slot->npages, min_level);
return spte_set;
}
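/*
* Allocate the page header and page table page used to split a huge page.
* Eager splitting runs from a VM ioctl rather than a vCPU, so the allocation
* comes straight from the slab/page allocators instead of per-vCPU caches.
*/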
static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
{
struct kvm_mmu_page *sp;
gfp |= __GFP_ZERO;
sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
if (!sp)
return NULL;
sp->spt = (void *)__get_free_page(gfp);
if (!sp->spt) {
kmem_cache_free(mmu_page_header_cache, sp);
return NULL;
}
return sp;
}
static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
struct tdp_iter *iter,
bool shared)
{
struct kvm_mmu_page *sp;
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
/*
* Since we are allocating while under the MMU lock, we have to be
* careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
* reclaim and to avoid making any filesystem callbacks (which can end
* up invoking KVM MMU notifiers, resulting in a deadlock).
*
* If this allocation fails we drop the lock and retry with reclaim
* allowed.
*/
sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
if (sp)
return sp;
rcu_read_unlock();
if (shared)
read_unlock(&kvm->mmu_lock);
else
write_unlock(&kvm->mmu_lock);
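/*
* The MMU lock was dropped for the sleepable allocation; flag the iterator
* as having yielded so the caller restarts its walk.
*/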
iter->yielded = true;
sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
if (shared)
read_lock(&kvm->mmu_lock);
else
write_lock(&kvm->mmu_lock);
rcu_read_lock();
return sp;
}
/* Note, the caller is responsible for initializing @sp. */
static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
struct kvm_mmu_page *sp, bool shared)
{
const u64 huge_spte = iter->old_spte;
const int level = iter->level;
int ret, i;
/*
* No need for atomics when writing to sp->spt since the page table has
* not been linked in yet and thus is not reachable from any other CPU.
*/
for (i = 0; i < SPTE_ENT_PER_PAGE; i++)
sp->spt[i] = make_huge_page_split_spte(kvm, huge_spte, sp->role, i);
/*
* Replace the huge spte with a pointer to the populated lower level
* page table. Since we are making this change without a TLB flush, vCPUs
* will see a mix of the split mappings and the original huge mapping,
* depending on what's currently in their TLB. This is fine from a
* correctness standpoint since the translation will be the same either
* way.
*/
ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
if (ret)
goto out;
/*
* tdp_mmu_link_sp() will handle subtracting the huge page we
* are overwriting from the page stats. But we have to manually update
* the page stats with the new present child pages.
*/
kvm_update_page_stats(kvm, level - 1, SPTE_ENT_PER_PAGE);
out:
trace_kvm_mmu_split_huge_page(iter->gfn, huge_spte, level, ret);
return ret;
}
static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
struct kvm_mmu_page *root,
gfn_t start, gfn_t end,
int target_level, bool shared)
{
struct kvm_mmu_page *sp = NULL;
struct tdp_iter iter;
int ret = 0;
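/*
 * TDP MMU page table memory is protected by RCU, so the walk below must
 * be performed under the RCU read lock.
 */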
rcu_read_lock();
/*
* Traverse the page table splitting all huge pages above the target
* level into one lower level. For example, if we encounter a 1GB page,
* we split it into 512 2MB pages.
*
* Since the TDP iterator uses a pre-order traversal, we are guaranteed
* to visit an SPTE before ever visiting its children, which means we
* will correctly recursively split huge pages that are more than one
* level above the target level (e.g. splitting a 1GB page to 512 2MB pages,
* and then splitting each of those to 512 4KB pages).
*/
for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
continue;
if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
continue;
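/*
 * Allocate a new sp only if the previous one, if any, was consumed by a
 * successful split.
 */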
if (!sp) {
sp = tdp_mmu_alloc_sp_for_split(kvm, &iter, shared);
if (!sp) {
ret = -ENOMEM;
trace_kvm_mmu_split_huge_page(iter.gfn,
iter.old_spte,
iter.level, ret);
break;
}
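/*
 * The allocation may have dropped mmu_lock and yielded the iterator; if
 * so, restart the walk before attempting the split.
 */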
if (iter.yielded)
continue;
}
tdp_mmu_init_child_sp(sp, &iter);
if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
goto retry;
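/*
 * The split succeeded and consumed sp; force a fresh allocation for the
 * next huge SPTE encountered.
 */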
sp = NULL;
}
rcu_read_unlock();
/*
* It's possible to exit the loop having never used the last sp if, for
* example, a vCPU doing HugePage NX splitting wins the race and
* installs its own sp in place of the last sp we tried to split.
*/
if (sp)
tdp_mmu_free_sp(sp);
return ret;
}
/*
* Try to split all huge pages mapped by the TDP MMU down to the target level.
*/
void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
const struct kvm_memory_slot *slot,
gfn_t start, gfn_t end,
int target_level, bool shared)
{
struct kvm_mmu_page *root;
int r = 0;
kvm_lockdep_assert_mmu_lock_held(kvm, shared);
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) {
r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
if (r) {
kvm_tdp_mmu_put_root(kvm, root);
break;
}
}
}
static bool tdp_mmu_need_write_protect(struct kvm_mmu_page *sp)
{
/*
* All TDP MMU shadow pages share the same role as their root, aside
* from level, so it is valid to key off any shadow page to determine if
* write protection is needed for an entire tree.
*/
return kvm_mmu_page_ad_need_write_protect(sp) || !kvm_ad_enabled();
}
static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t start, gfn_t end)
{
const u64 dbit = tdp_mmu_need_write_protect(root) ? PT_WRITABLE_MASK :
shadow_dirty_mask;
struct tdp_iter iter;
bool spte_set = false;
rcu_read_lock();
tdp_root_for_each_pte(iter, root, start, end) {
retry:
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
KVM_MMU_WARN_ON(dbit == shadow_dirty_mask &&
spte_ad_need_write_protect(iter.old_spte));
if (!(iter.old_spte & dbit))
continue;
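/*
 * Clear the bit with a compare-and-exchange; if the SPTE was changed by
 * another CPU in the meantime, retry with the updated value.
 */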
if (tdp_mmu_set_spte_atomic(kvm, &iter, iter.old_spte & ~dbit))
goto retry;
spte_set = true;
}
rcu_read_unlock();
return spte_set;
}
/*
* Clear the dirty status (D-bit or W-bit) of all the SPTEs mapping GFNs in the
* memslot. Returns true if an SPTE has been changed and the TLBs need to be
* flushed.
*/
bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
const struct kvm_memory_slot *slot)
{
struct kvm_mmu_page *root;
bool spte_set = false;
lockdep_assert_held_read(&kvm->mmu_lock);
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
spte_set |= clear_dirty_gfn_range(kvm, root, slot->base_gfn,
slot->base_gfn + slot->npages);
return spte_set;
}
static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t gfn, unsigned long mask, bool wrprot)
{
const u64 dbit = (wrprot || tdp_mmu_need_write_protect(root)) ? PT_WRITABLE_MASK :
shadow_dirty_mask;
struct tdp_iter iter;
lockdep_assert_held_write(&kvm->mmu_lock);
rcu_read_lock();
tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
gfn + BITS_PER_LONG) {
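/* Stop walking once every GFN set in the mask has been handled. */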
if (!mask)
break;
KVM_MMU_WARN_ON(dbit == shadow_dirty_mask &&
spte_ad_need_write_protect(iter.old_spte));
if (iter.level > PG_LEVEL_4K ||
!(mask & (1UL << (iter.gfn - gfn))))
continue;
mask &= ~(1UL << (iter.gfn - gfn));
if (!(iter.old_spte & dbit))
continue;
iter.old_spte = tdp_mmu_clear_spte_bits(iter.sptep,
iter.old_spte, dbit,
iter.level);
trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
iter.old_spte,
iter.old_spte & ~dbit);
kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
}
rcu_read_unlock();
}
/*
* Clear the dirty status (D-bit or W-bit) of all the 4k SPTEs mapping GFNs for
* which a bit is set in mask, starting at gfn. The given memslot is expected to
* contain all the GFNs represented by set bits in the mask.
*/
void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
gfn_t gfn, unsigned long mask,
bool wrprot)
{
struct kvm_mmu_page *root;
for_each_valid_tdp_mmu_root(kvm, root, slot->as_id)
clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
}
static void zap_collapsible_spte_range(struct kvm *kvm,
struct kvm_mmu_page *root,
const struct kvm_memory_slot *slot)
{
gfn_t start = slot->base_gfn;
gfn_t end = start + slot->npages;
struct tdp_iter iter;
int max_mapping_level;
rcu_read_lock();
for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
retry:
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;
if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
!is_shadow_present_pte(iter.old_spte))
continue;
/*
* Don't zap leaf SPTEs; if a leaf SPTE could be replaced with
* a large page size, then its parent would have been zapped
* instead of stepping down.
*/
if (is_last_spte(iter.old_spte, iter.level))
continue;
/*
* If iter.gfn resides outside of the slot, i.e. the page for
* the current level overlaps but is not contained by the slot,
* then the SPTE can't be made huge. More importantly, trying
* to query that info from slot->arch.lpage_info will cause an
* out-of-bounds access.
*/
if (iter.gfn < start || iter.gfn >= end)
continue;
max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
iter.gfn, PG_LEVEL_NUM);
if (max_mapping_level < iter.level)
continue;
/* Note, a successful atomic zap also does a remote TLB flush. */
if (tdp_mmu_zap_spte_atomic(kvm, &iter))
goto retry;
}
rcu_read_unlock();
}
/*
* Zap non-leaf SPTEs (and free their associated page tables) which could
* be replaced by huge pages, for GFNs within the slot.
*/
void kvm_tdp_mmu_zap_collapsible_sptes(struct kvm *kvm,
const struct kvm_memory_slot *slot)
{
struct kvm_mmu_page *root;
lockdep_assert_held_read(&kvm->mmu_lock);
for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
zap_collapsible_spte_range(kvm, root, slot);
}
/*
* Removes write access on the last level SPTE mapping this GFN and unsets the
* MMU-writable bit to ensure future writes continue to be intercepted.
* Returns true if an SPTE was set and a TLB flush is needed.
*/
static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t gfn, int min_level)
{
struct tdp_iter iter;
u64 new_spte;
bool spte_set = false;
BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
rcu_read_lock();
for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
if (!is_shadow_present_pte(iter.old_spte) ||
!is_last_spte(iter.old_spte, iter.level))
continue;
new_spte = iter.old_spte &
~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
if (new_spte == iter.old_spte)
break;
tdp_mmu_iter_set_spte(kvm, &iter, new_spte);
spte_set = true;
}
rcu_read_unlock();
return spte_set;
}
/*
* Removes write access on the last level SPTE mapping this GFN and unsets the
* MMU-writable bit to ensure future writes continue to be intercepted.
* Returns true if an SPTE was set and a TLB flush is needed.
*/
bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
struct kvm_memory_slot *slot, gfn_t gfn,
int min_level)
{
struct kvm_mmu_page *root;
bool spte_set = false;
lockdep_assert_held_write(&kvm->mmu_lock);
for_each_valid_tdp_mmu_root(kvm, root, slot->as_id)
spte_set |= write_protect_gfn(kvm, root, gfn, min_level);
return spte_set;
}
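/*
 * Usage sketch matching the lockdep_assert_held_write() above: callers
 * take mmu_lock for write, write-protect the single GFN, and flush TLBs
 * if any SPTE changed, per the "TLB flush is needed" contract in the
 * comment.  Illustrative only; the real caller in mmu.c also handles the
 * shadow MMU's rmaps.
 */
static void example_write_protect_gfn(struct kvm *kvm,
				      struct kvm_memory_slot *slot, gfn_t gfn)
{
	bool flush;

	write_lock(&kvm->mmu_lock);
	flush = kvm_tdp_mmu_write_protect_gfn(kvm, slot, gfn, PG_LEVEL_4K);
	write_unlock(&kvm->mmu_lock);

	if (flush)
		kvm_flush_remote_tlbs(kvm);
}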
/*
* Return the level of the lowest level SPTE added to sptes.
* That SPTE may be non-present.
*
* Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
*/
int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level)
{
struct tdp_iter iter;
struct kvm_mmu *mmu = vcpu->arch.mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;
*root_level = vcpu->arch.mmu->root_role.level;
tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
leaf = iter.level;
sptes[leaf] = iter.old_spte;
}
return leaf;
}
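/*
 * Usage sketch for the contract documented above: the walk must be
 * wrapped in kvm_tdp_mmu_walk_lockless_{begin,end}() and the returned
 * snapshot is only meaningful inside that window.  Illustrative only;
 * the in-tree consumer is the MMIO SPTE walk in mmu.c.
 */
static int example_walk_and_dump(struct kvm_vcpu *vcpu, u64 addr)
{
	u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
	int root_level, leaf, level;

	kvm_tdp_mmu_walk_lockless_begin();
	leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root_level);
	kvm_tdp_mmu_walk_lockless_end();

	if (leaf < 0)
		return leaf;

	for (level = root_level; level >= leaf; level--)
		pr_debug("level %d: spte = 0x%llx\n", level, sptes[level]);

	return leaf;
}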
/*
* Returns the last level spte pointer of the shadow page walk for the given
 * gpa, and sets *spte to the spte value. This spte may be non-present. If no
* walk could be performed, returns NULL and *spte does not contain valid data.
*
* Contract:
* - Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
* - The returned sptep must not be used after kvm_tdp_mmu_walk_lockless_end.
*
* WARNING: This function is only intended to be called during fast_page_fault.
*/
u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
u64 *spte)
{
struct tdp_iter iter;
struct kvm_mmu *mmu = vcpu->arch.mmu;
gfn_t gfn = addr >> PAGE_SHIFT;
tdp_ptep_t sptep = NULL;
tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
*spte = iter.old_spte;
sptep = iter.sptep;
}
/*
* Perform the rcu_dereference to get the raw spte pointer value since
* we are passing it up to fast_page_fault, which is shared with the
* legacy MMU and thus does not retain the TDP MMU-specific __rcu
* annotation.
*
* This is safe since fast_page_fault obeys the contracts of this
* function as well as all TDP MMU contracts around modifying SPTEs
* outside of mmu_lock.
*/
return rcu_dereference(sptep);
}
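/*
 * Sketch of the fast-page-fault-style usage this helper exists for: grab
 * the leaf sptep and cmpxchg it strictly inside the lockless-walk window,
 * so RCU keeps the enclosing page-table page alive.  Purely illustrative;
 * a real fast path must also check host-writability, the MMU-writable
 * bit, access-tracking state, etc. before setting any bits.
 */
static bool example_fast_make_writable(struct kvm_vcpu *vcpu, u64 addr)
{
	u64 old_spte, new_spte;
	u64 *sptep;
	bool ret = false;

	kvm_tdp_mmu_walk_lockless_begin();

	sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, addr, &old_spte);
	if (sptep && is_shadow_present_pte(old_spte)) {
		new_spte = old_spte | PT_WRITABLE_MASK;
		/* A lost race simply means the fast path retries or bails. */
		ret = try_cmpxchg64(sptep, &old_spte, new_spte);
	}

	kvm_tdp_mmu_walk_lockless_end();

	return ret;
}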