linux-stable/arch/x86/kvm/mmu/tdp_iter.c
Sean Christopherson 3a0f64de47 KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator.  Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.

When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.

Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.

Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.

  WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
  RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
  Call Trace:
   <TASK>
   kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
   kvm_destroy_vm+0x162/0x2a0 [kvm]
   kvm_vcpu_release+0x34/0x60 [kvm]
   __fput+0x82/0x240
   task_work_run+0x5c/0x90
   do_exit+0x364/0xa10
   ? futex_unqueue+0x38/0x60
   do_group_exit+0x33/0xa0
   get_signal+0x155/0x850
   arch_do_signal_or_restart+0xed/0x750
   exit_to_user_mode_prepare+0xc5/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x48/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list.  This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.

  WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
  RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
  Call Trace:
   <TASK>
   kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
   __handle_changed_spte+0x92e/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   zap_gfn_range+0x549/0x620 [kvm]
   kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
   mmu_free_root_page+0x219/0x2c0 [kvm]
   kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
   kvm_mmu_unload+0x1c/0xa0 [kvm]
   kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
   kvm_put_kvm+0x3b1/0x8b0 [kvm]
   kvm_vcpu_release+0x4e/0x70 [kvm]
   __fput+0x1f7/0x8c0
   task_work_run+0xf8/0x1a0
   do_exit+0x97b/0x2230
   do_group_exit+0xda/0x2a0
   get_signal+0x3be/0x1e50
   arch_do_signal_or_restart+0x244/0x17f0
   exit_to_user_mode_prepare+0xcb/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x4d/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Note, the underlying bug existed even before commit 1af4a96025 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry.  But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.

Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label.  The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.

Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.

Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-20 08:06:53 -05:00

178 lines
5 KiB
C

// SPDX-License-Identifier: GPL-2.0
#include "mmu_internal.h"
#include "tdp_iter.h"
#include "spte.h"
/*
* Recalculates the pointer to the SPTE for the current GFN and level and
* reread the SPTE.
*/
static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
{
iter->sptep = iter->pt_path[iter->level - 1] +
SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
}
static gfn_t round_gfn_for_level(gfn_t gfn, int level)
{
return gfn & -KVM_PAGES_PER_HPAGE(level);
}
/*
* Return the TDP iterator to the root PT and allow it to continue its
* traversal over the paging structure from there.
*/
void tdp_iter_restart(struct tdp_iter *iter)
{
iter->yielded = false;
iter->yielded_gfn = iter->next_last_level_gfn;
iter->level = iter->root_level;
iter->gfn = round_gfn_for_level(iter->next_last_level_gfn, iter->level);
tdp_iter_refresh_sptep(iter);
iter->valid = true;
}
/*
* Sets a TDP iterator to walk a pre-order traversal of the paging structure
* rooted at root_pt, starting with the walk to translate next_last_level_gfn.
*/
void tdp_iter_start(struct tdp_iter *iter, u64 *root_pt, int root_level,
int min_level, gfn_t next_last_level_gfn)
{
WARN_ON(root_level < 1);
WARN_ON(root_level > PT64_ROOT_MAX_LEVEL);
iter->next_last_level_gfn = next_last_level_gfn;
iter->root_level = root_level;
iter->min_level = min_level;
iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root_pt;
iter->as_id = kvm_mmu_page_as_id(sptep_to_sp(root_pt));
tdp_iter_restart(iter);
}
/*
* Given an SPTE and its level, returns a pointer containing the host virtual
* address of the child page table referenced by the SPTE. Returns null if
* there is no such entry.
*/
tdp_ptep_t spte_to_child_pt(u64 spte, int level)
{
/*
* There's no child entry if this entry isn't present or is a
* last-level entry.
*/
if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
return NULL;
return (tdp_ptep_t)__va(spte_to_pfn(spte) << PAGE_SHIFT);
}
/*
* Steps down one level in the paging structure towards the goal GFN. Returns
* true if the iterator was able to step down a level, false otherwise.
*/
static bool try_step_down(struct tdp_iter *iter)
{
tdp_ptep_t child_pt;
if (iter->level == iter->min_level)
return false;
/*
* Reread the SPTE before stepping down to avoid traversing into page
* tables that are no longer linked from this entry.
*/
iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
child_pt = spte_to_child_pt(iter->old_spte, iter->level);
if (!child_pt)
return false;
iter->level--;
iter->pt_path[iter->level - 1] = child_pt;
iter->gfn = round_gfn_for_level(iter->next_last_level_gfn, iter->level);
tdp_iter_refresh_sptep(iter);
return true;
}
/*
* Steps to the next entry in the current page table, at the current page table
* level. The next entry could point to a page backing guest memory or another
* page table, or it could be non-present. Returns true if the iterator was
* able to step to the next entry in the page table, false if the iterator was
* already at the end of the current page table.
*/
static bool try_step_side(struct tdp_iter *iter)
{
/*
* Check if the iterator is already at the end of the current page
* table.
*/
if (SHADOW_PT_INDEX(iter->gfn << PAGE_SHIFT, iter->level) ==
(PT64_ENT_PER_PAGE - 1))
return false;
iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
iter->next_last_level_gfn = iter->gfn;
iter->sptep++;
iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
return true;
}
/*
* Tries to traverse back up a level in the paging structure so that the walk
* can continue from the next entry in the parent page table. Returns true on a
* successful step up, false if already in the root page.
*/
static bool try_step_up(struct tdp_iter *iter)
{
if (iter->level == iter->root_level)
return false;
iter->level++;
iter->gfn = round_gfn_for_level(iter->gfn, iter->level);
tdp_iter_refresh_sptep(iter);
return true;
}
/*
* Step to the next SPTE in a pre-order traversal of the paging structure.
* To get to the next SPTE, the iterator either steps down towards the goal
* GFN, if at a present, non-last-level SPTE, or over to a SPTE mapping a
* highter GFN.
*
* The basic algorithm is as follows:
* 1. If the current SPTE is a non-last-level SPTE, step down into the page
* table it points to.
* 2. If the iterator cannot step down, it will try to step to the next SPTE
* in the current page of the paging structure.
* 3. If the iterator cannot step to the next entry in the current page, it will
* try to step up to the parent paging structure page. In this case, that
* SPTE will have already been visited, and so the iterator must also step
* to the side again.
*/
void tdp_iter_next(struct tdp_iter *iter)
{
if (iter->yielded) {
tdp_iter_restart(iter);
return;
}
if (try_step_down(iter))
return;
do {
if (try_step_side(iter))
return;
} while (try_step_up(iter));
iter->valid = false;
}