.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- cpus_read_lock() is taken outside kvm_lock

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

- kvm->mn_active_invalidate_count ensures that pairs of
  invalidate_range_start() and invalidate_range_end() callbacks
  use the same memslots array.  kvm->slots_lock and kvm->slots_arch_lock
  are taken on the waiting side when modifying memslots, so MMU notifiers
  must not take either kvm->slots_lock or kvm->slots_arch_lock.
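
The rules above allow, for example, a nesting like the following sketch.
This is for illustration only and is not a real KVM code path; it simply
spells out the ordering that the rules describe::

    mutex_lock(&kvm->lock);           /* outermost of the three */
    mutex_lock(&kvm->slots_lock);     /* taken inside kvm->lock */
    mutex_lock(&kvm->irq_lock);       /* innermost; rarely taken together */
    /* ... critical section ... */
    mutex_unlock(&kvm->irq_lock);
    mutex_unlock(&kvm->slots_lock);
    mutex_unlock(&kvm->lock);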

For SRCU:

- ``synchronize_srcu(&kvm->srcu)`` is called inside critical sections
  for kvm->lock, vcpu->mutex and kvm->slots_lock.  These locks _cannot_
  be taken inside a kvm->srcu read-side critical section; that is, the
  following is broken::

      srcu_read_lock(&kvm->srcu);
      mutex_lock(&kvm->slots_lock);

- kvm->slots_arch_lock instead is released before the call to
  ``synchronize_srcu()``.  It _can_ therefore be taken inside a
  kvm->srcu read-side critical section, for example while processing
  a vmexit.
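
For contrast with the broken pattern above, a sketch of the allowed pattern
for kvm->slots_arch_lock (illustrative only, error handling omitted)::

    idx = srcu_read_lock(&kvm->srcu);
    mutex_lock(&kvm->slots_arch_lock);
    /* ... modify arch-specific memslot fields ... */
    mutex_unlock(&kvm->slots_arch_lock);
    srcu_read_unlock(&kvm->srcu, idx);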

On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock and kvm->arch.xen.xen_lock

- kvm->arch.mmu_lock is an rwlock; critical sections for
  kvm->arch.tdp_mmu_pages_lock and kvm->arch.mmu_unsync_pages_lock must
  also take kvm->arch.mmu_lock

Everything else is a leaf: no other lock is taken inside the critical
sections.

2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86.  Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking.  That means we need to restore the saved R/X bits.  This is
   described in more detail below.

2. Write-Protection: The SPTE is present and the fault is caused by
   write-protection.  That means we just need to change the W bit of the spte.

What we use to avoid all the races is the Host-writable bit and MMU-writable bit
on the spte:

- Host-writable means the gfn is writable in the host kernel page tables and in
  its KVM memslot.

- MMU-writable means the gfn is writable in the guest's mmu and it is not
  write-protected by shadow page write-protection.

On the fast page fault path, we will use cmpxchg to atomically set the spte W
bit if spte.HOST_WRITEABLE = 1 and spte.MMU_WRITEABLE = 1, to restore the saved
R/X bits if the spte is marked for access tracking, or both.  This is safe
because any concurrent change to these bits is detected by the cmpxchg.
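
As a rough sketch (not the exact KVM code; PT_WRITABLE_MASK stands in for the
writable bit and the retry policy is simplified), the lockless update for the
write-protection case looks like::

    u64 old_spte = READ_ONCE(*sptep);
    u64 new_spte = old_spte | PT_WRITABLE_MASK;

    /*
     * If anything in the spte changed between the read and here, the
     * cmpxchg fails and the fast path must be retried or abandoned.
     */
    if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
            return false;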

But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may change, since we can only ensure that the pfn
is not changed during the cmpxchg.  This is an ABA problem; for example, the
following case can happen:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|     gpte = gfn1                                                         |
|     gfn1 is mapped to pfn1 on host                                      |
|     spte is the shadow page table entry corresponding with gpte and     |
|     spte = pfn1                                                         |
+------------------------------------------------------------------------+
| On fast page fault path:                                                |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |     spte = 0;                     |
|                                    |                                   |
|                                    | pfn1 is re-alloced for gfn2.      |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |     spte = pfn1;                  |
+------------------------------------+-----------------------------------+
| ::                                                                      |
|                                                                         |
|   if (cmpxchg(spte, old_spte, old_spte+W)                               |
|      mark_page_dirty(vcpu->kvm, gfn1)                                   |
|           OOPS!!!                                                       |
+------------------------------------------------------------------------+

We dirty-log for gfn1; that means gfn2 is lost in the dirty bitmap.

For a direct sp, we can easily avoid it since the spte of the direct sp is
fixed to the gfn.  For an indirect sp, we disable fast page fault for
simplicity.

A solution for indirect sp could be to pin the gfn, for example via
kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg.  After the pinning:

- We have held the refcount of the pfn; that means the pfn can not be freed
  and reused for another gfn.

- The pfn is writable and therefore it cannot be shared between different gfns
  by KSM.

Then, we can ensure the dirty bitmap is correctly set for the gfn.
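
A sketch of that hypothetical pinning scheme (KVM does not implement it; the
helper usage and the surrounding error handling here are assumptions)::

    pfn = kvm_vcpu_gfn_to_pfn_atomic(vcpu, gfn);
    if (!is_error_noslot_pfn(pfn) && pfn == spte_to_pfn(old_spte)) {
            /* gfn -> pfn cannot change while the reference is held. */
            /* ... cmpxchg on the spte as described above ... */
    }
    kvm_release_pfn_clean(pfn);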

2) Dirty bit tracking

In the original code, the spte can be updated fast (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since neither
the Accessed bit nor the Dirty bit can be lost.

But this is no longer true after fast page fault, since the spte can be
marked writable between reading and updating it, as in the following case:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|     spte.W = 0                                                          |
|     spte.Accessed = 1                                                   |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits()::   |                                   |
|                                    |                                   |
|  old_spte = *spte;                 |                                   |
|                                    |                                   |
|                                    |                                   |
|  /* 'if' condition is satisfied. */|                                   |
|  if (old_spte.Accessed == 1 &&     |                                   |
|      old_spte.W == 0)              |                                   |
|    spte = 0ull;                    |                                   |
+------------------------------------+-----------------------------------+
|                                    | on fast page fault path::         |
|                                    |                                   |
|                                    |   spte.W = 1                      |
|                                    |                                   |
|                                    | memory write on the spte::        |
|                                    |                                   |
|                                    |   spte.Dirty = 1                  |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|  else                              |                                   |
|    old_spte = xchg(spte, 0ull)     |                                   |
|  if (old_spte.Accessed == 1)       |                                   |
|    kvm_set_pfn_accessed(spte.pfn); |                                   |
|  if (old_spte.Dirty == 1)          |                                   |
|    kvm_set_pfn_dirty(spte.pfn);    |                                   |
|    OOPS!!!                         |                                   |
+------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock [see spte_has_volatile_bits()]; it means
the spte is always atomically updated in this case.
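
For example, clearing such an spte has to use an atomic exchange, roughly as
follows (a simplified sketch of what mmu_spte_clear_track_bits() does)::

    if (spte_has_volatile_bits(old_spte))
            /* The atomic exchange captures any concurrently set A/D bits. */
            old_spte = xchg(sptep, 0ull);
    else
            /* No bit can change under us, a plain write is enough. */
            WRITE_ONCE(*sptep, 0ull);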

3) flush tlbs due to spte updated

If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path.  In order to easily audit the path, we check in
mmu_spte_update() whether a TLB flush is needed for this reason, since it is
the common function for updating the spte (present -> present).
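
A sketch of that check (simplified; the real logic lives in mmu_spte_update()
and its helpers)::

    /*
     * Dropping writability means a stale, writable translation may still
     * be cached in some CPU's TLB, so the caller must flush.
     */
    if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
            flush = true;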

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte and the race caused by fast page fault can be
avoided.  See the comments in spte_has_volatile_bits() and mmu_spte_update().

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits.  In this case, PTEs are tagged as A/D disabled (using ignored bits), and
when the KVM MMU notifier is called to track accesses to a page (via
kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
by clearing the RWX bits in the PTE and storing the original R & X bits in more
unused/ignored bits.  When the VM tries to access the page later on, a fault is
generated and the fast page fault mechanism described above is used to
atomically restore the PTE to a Present state.  The W bit is not saved when the
PTE is marked for access tracking and during restoration to the Present state,
the W bit is set depending on whether or not it was a write access.  If it
wasn't, then the W bit will remain clear until a write access happens, at which
time it will be set using the Dirty tracking mechanism described above.
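
A sketch of the marking step (the mask and shift names are modeled on the
definitions in arch/x86/kvm/mmu/spte.c; treat this as an illustration rather
than the exact code)::

    /*
     * Save the R/X bits into an ignored area of the PTE, then clear RWX
     * so that the next guest access faults and can be tracked.
     */
    spte |= (spte & SHADOW_ACC_TRACK_SAVED_BITS_MASK) <<
            SHADOW_ACC_TRACK_SAVED_BITS_SHIFT;
    spte &= ~shadow_acc_track_mask;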

3. Reference
------------

``kvm_lock``
^^^^^^^^^^^^

:Type:     mutex
:Arch:     any
:Protects: - vm_list
           - kvm_usage_count
           - hardware virtualization enable/disable
:Comment:  KVM also disables CPU hotplug via cpus_read_lock() during
           enable/disable.

``kvm->mn_invalidate_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type:     spinlock_t
:Arch:     any
:Protects: mn_active_invalidate_count, mn_memslots_update_rcuwait

``kvm_arch::tsc_write_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type:     raw_spinlock_t
:Arch:     x86
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
           - tsc offset in vmcb
:Comment:  'raw' because updating the tsc offsets must not be preempted.

``kvm->mmu_lock``
^^^^^^^^^^^^^^^^^

:Type:     spinlock_t or rwlock_t
:Arch:     any
:Protects: - shadow page/shadow tlb entry
:Comment:  it is a spinlock since it is used in the MMU notifier.

``kvm->srcu``
^^^^^^^^^^^^^

:Type:     srcu lock
:Arch:     any
:Protects: - kvm->memslots
           - kvm->buses
:Comment:  The srcu read lock must be held while accessing memslots (e.g.
           when using gfn_to_* functions) and while accessing in-kernel
           MMIO/PIO address->device structure mapping (kvm->buses).
           The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
           if it is needed by multiple functions.

``kvm->slots_arch_lock``
^^^^^^^^^^^^^^^^^^^^^^^^

:Type:     mutex
:Arch:     any (only needed on x86 though)
:Protects: any arch-specific fields of memslots that have to be modified
           in a ``kvm->srcu`` read-side critical section.
:Comment:  must be held before reading the pointer to the current memslots,
           until after all changes to the memslots are complete

``wakeup_vcpus_on_cpu_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type:     spinlock_t
:Arch:     x86
:Protects: wakeup_vcpus_on_cpu
:Comment:  This is a per-CPU lock and it is used for VT-d posted-interrupts.
           When VT-d posted-interrupts are supported and the VM has assigned
           devices, we put the blocked vCPU on the list wakeup_vcpus_on_cpu
           protected by wakeup_vcpus_on_cpu_lock.  When VT-d hardware issues
           a wakeup notification event because an external interrupt from an
           assigned device arrives, we find the vCPU on the list and wake it
           up.

``vendor_module_lock``
^^^^^^^^^^^^^^^^^^^^^^

:Type:     mutex
:Arch:     x86
:Protects: loading a vendor module (kvm_amd or kvm_intel)
:Comment:  Exists because using kvm_lock leads to deadlock.  cpu_hotplug_lock
           is taken outside of kvm_lock, e.g. in KVM's CPU online/offline
           callbacks, and many operations need to take cpu_hotplug_lock when
           loading a vendor module, e.g. updating static calls.