Commit graph

1043021 commits

Author SHA1 Message Date
Thomas Huth
22d7108ce4 KVM: selftests: Fix kvm_vm_free() in cr4_cpuid_sync and vmx_tsc_adjust tests
The kvm_vm_free() statement here is currently dead code, since the loop
in front of it can only be left with the "goto done" that jumps right
after the kvm_vm_free(). Fix it by swapping the locations of the "done"
label and the kvm_vm_free().

Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20210826074928.240942-1-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:08 -04:00
Colin Ian King
d22869aff4 kvm: selftests: Fix spelling mistake "missmatch" -> "mismatch"
There is a spelling mistake in an error message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Message-Id: <20210826120752.12633-1-colin.king@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:08 -04:00
Sean Christopherson
25b9784586 KVM: x86: Manually retrieve CPUID.0x1 when getting FMS for RESET/INIT
Manually look for a CPUID.0x1 entry instead of bouncing through
kvm_cpuid() when retrieving the Family-Model-Stepping information for
vCPU RESET/INIT.  This fixes a potential undefined behavior bug due to
kvm_cpuid() using the uninitialized "dummy" param as the ECX _input_,
a.k.a. the index.

A more minimal fix would be to simply zero "dummy", but the extra work in
kvm_cpuid() is wasteful, and KVM should be treating the FMS retrieval as
an out-of-band access, e.g. same as how KVM computes guest.MAXPHYADDR.
Both Intel's SDM and AMD's APM describe the RDX value at RESET/INIT as
holding the CPU's FMS information, not as holding CPUID.0x1.EAX.  KVM's
usage of CPUID entries to get FMS is simply a pragmatic approach to avoid
having yet another way for userspace to provide inconsistent data.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210929222426.1855730-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:08 -04:00
Sean Christopherson
62dd57dd67 KVM: x86: WARN on non-zero CRs at RESET to detect improper initalization
WARN if CR0, CR3, or CR4 are non-zero at RESET, which given the current
KVM implementation, really means WARN if they're not zeroed at vCPU
creation.  VMX in particular has several ->set_*() flows that read other
registers to handle side effects, and because those flows are common to
RESET and INIT, KVM subtly relies on emulated/virtualized registers to be
zeroed at vCPU creation in order to do the right thing at RESET.

Use CRs as a sentinel because they are most likely to be written as side
effects, and because KVM specifically needs CR0.PG and CR0.PE to be '0'
to correctly reflect the state of the vCPU's MMU.  CRs are also loaded
and stored from/to the VMCS, and so adds some level of coverage to verify
that KVM doesn't conflate zero-allocating the VMCS with properly
initializing the VMCS with VMWRITEs.

Note, '0' is somewhat arbitrary, vCPU creation can technically stuff any
value for a register so long as it's coherent with respect to the current
vCPU state.  In practice, '0' works for all registers and is convenient.

Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:07 -04:00
Sean Christopherson
9ebe530b9f KVM: SVM: Move RESET emulation to svm_vcpu_reset()
Move RESET emulation for SVM vCPUs to svm_vcpu_reset(), and drop an extra
init_vmcb() from svm_create_vcpu() in the process.  Hopefully KVM will
someday expose a dedicated RESET ioctl(), and in the meantime separating
"create" from "RESET" is a nice cleanup.

Keep the call to svm_switch_vmcb() so that misuse of svm->vmcb at worst
breaks the guest, e.g. premature accesses doesn't cause a NULL pointer
dereference.

Cc: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:07 -04:00
Sean Christopherson
06692e4b80 KVM: VMX: Move RESET emulation to vmx_vcpu_reset()
Move vCPU RESET emulation, including initializating of select VMCS state,
to vmx_vcpu_reset().  Drop the open coded "vCPU load" sequence, as
->vcpu_reset() is invoked while the vCPU is properly loaded (which is
kind of the point of ->vcpu_reset()...).  Hopefully KVM will someday
expose a dedicated RESET ioctl(), and in the meantime separating "create"
from "RESET" is a nice cleanup.

Deferring VMCS initialization is effectively a nop as it's impossible to
safely access the VMCS between the current call site and its new home, as
both the vCPU and the pCPU are put immediately after init_vmcs(), i.e.
the VMCS isn't guaranteed to be loaded.

Note, task preemption is not a problem as vmx_sched_in() _can't_ touch
the VMCS as ->sched_in() is invoked before the vCPU, and thus VMCS, is
reloaded.  I.e. the preemption path also can't consume VMCS state.

Cc: Reiji Watanabe <reijiw@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:07 -04:00
Sean Christopherson
d06567353e KVM: VMX: Drop explicit zeroing of MSR guest values at vCPU creation
Don't zero out user return and nested MSRs during vCPU creation, and
instead rely on vcpu_vmx being zero-allocated.  Explicitly zeroing MSRs
is not wrong, and is in fact necessary if KVM ever emulates vCPU RESET
outside of vCPU creation, but zeroing only a subset of MSRs is confusing.

Poking directly into KVM's backing is also undesirable in that it doesn't
scale and is error prone.  Ideally KVM would have a common RESET path for
all MSRs, e.g. by expanding kvm_set_msr(), which would obviate the need
for this out-of-bad code (to support standalone RESET).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:07 -04:00
Sean Christopherson
583d369b36 KVM: x86: Fold fx_init() into kvm_arch_vcpu_create()
Move the few bits of relevant fx_init() code into kvm_arch_vcpu_create(),
dropping the superfluous check on vcpu->arch.guest_fpu that was blindly
and wrongly added by commit ed02b21309 ("KVM: SVM: Guest FPU state
save/restore not needed for SEV-ES guest").

Note, KVM currently allocates and then frees FPU state for SEV-ES guests,
rather than avoid the allocation in the first place.  While that approach
is inarguably inefficient and unnecessary, it's a cleanup for the future.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:06 -04:00
Sean Christopherson
e8f65b9bb4 KVM: x86: Remove defunct setting of XCR0 for guest during vCPU create
Drop code to initialize XCR0 during fx_init(), a.k.a. vCPU creation, as
XCR0 has been initialized during kvm_vcpu_reset() (for RESET) since
commit a554d207dc ("KVM: X86: Processor States following Reset or INIT").

Back when XCR0 support was added by commit 2acf923e38 ("KVM: VMX:
Enable XSAVE/XRSTOR for guest"), KVM didn't differentiate between RESET
and INIT.  Ignoring the fact that calling fx_init() for INIT is obviously
wrong, e.g. FPU state after INIT is not the same as after RESET, setting
XCR0 in fx_init() was correct.

Eventually fx_init() got moved to kvm_arch_vcpu_init(), a.k.a. vCPU
creation (ignore the terrible name) by commit 0ee6a51725 ("x86/fpu,
kvm: Simplify fx_init()").  Finally, commit 95a0d01eef ("KVM: x86: Move
all vcpu init code into kvm_arch_vcpu_create()") killed off
kvm_arch_vcpu_init(), leaving behind the oddity of redundant setting of
guest state during vCPU creation.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:06 -04:00
Sean Christopherson
5ebbc470d7 KVM: x86: Remove defunct setting of CR0.ET for guests during vCPU create
Drop code to set CR0.ET for the guest during initialization of the guest
FPU.  The code was added as a misguided bug fix by commit 380102c8e4
("KVM Set the ET flag in CR0 after initializing FX") to resolve an issue
where vcpu->cr0 (now vcpu->arch.cr0) was not correctly initialized on SVM
systems.  While init_vmcb() did set CR0.ET, it only did so in the VMCB,
and subtly did not update vcpu->cr0.  Stuffing CR0.ET worked around the
immediate problem, but did not fix the real bug of vcpu->cr0 and the VMCB
being out of sync.  That underlying bug was eventually remedied by commit
18fa000ae4 ("KVM: SVM: Reset cr0 properly on vcpu reset").

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:06 -04:00
Sean Christopherson
ff8828c84f KVM: x86: Do not mark all registers as avail/dirty during RESET/INIT
Do not blindly mark all registers as available+dirty at RESET/INIT, and
instead rely on writes to registers to go through the proper mutators or
to explicitly mark registers as dirty.  INIT in particular does not blindly
overwrite all registers, e.g. select bits in CR0 are preserved across INIT,
thus marking registers available+dirty without first reading the register
from hardware is incorrect.

In practice this is a benign bug as KVM doesn't let the guest control CR0
bits that are preserved across INIT, and all other true registers are
explicitly written during the RESET/INIT flows.  The PDPTRs and EX_INFO
"registers" are not explicitly written, but accessing those values during
RESET/INIT is nonsensical and would be a KVM bug regardless of register
caching.

Fixes: 66f7b72e11 ("KVM: x86: Make register state after reset conform to specification")
[sean: !!! NOT FOR STABLE !!!]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210921000303.400537-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Sean Christopherson
94c641ba7a KVM: x86: Simplify retrieving the page offset when loading PDTPRs
Replace impressively complex "logic" for computing the page offset from
CR3 when loading PDPTRs.  Unlike other paging modes, the address held in
CR3 for PAE paging is 32-byte aligned, i.e. occupies bits 31:5, thus bits
11:5 need to be used as the offset from the gfn when reading PDPTRs.

The existing calculation originated in commit 1342d3536d ("[PATCH] KVM:
MMU: Load the pae pdptrs on cr3 change like the processor does"), which
read the PDPTRs from guest memory as individual 8-byte loads.  At the
time, the so called "offset" was the base index of PDPTR0 as a _u64_, not
a byte offset.  Naming aside, the computation was useful and arguably
simplified the overall flow.

Unfortunately, when commit 195aefde9c ("KVM: Add general accessors to
read and write guest memory") added accessors with offsets at byte
granularity, the cleverness of the original code was lost and KVM was
left with convoluted code for a simple operation.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210831164224.1119728-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Sean Christopherson
15cabbc259 KVM: x86: Subsume nested GPA read helper into load_pdptrs()
Open code the call to mmu->translate_gpa() when loading nested PDPTRs and
kill off the existing helper, kvm_read_guest_page_mmu(), to discourage
incorrect use.  Reading guest memory straight from an L2 GPA is extremely
rare (as evidenced by the lack of users), as very few constructs in x86
specify physical addresses, even fewer are virtualized by KVM, and even
fewer yet require emulation of L2 by L0 KVM.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210831164224.1119728-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Juergen Gross
a1c42ddedf kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS
KVM_MAX_VCPU_ID is not specifying the highest allowed vcpu-id, but the
number of allowed vcpu-ids. This has already led to confusion, so
rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics more
clear

Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210913135745.13944-3-jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:05 -04:00
Juergen Gross
1e254d0d86 Revert "x86/kvm: fix vcpu-id indexed array sizes"
This reverts commit 76b4f357d0.

The commit has the wrong reasoning, as KVM_MAX_VCPU_ID is not defining the
maximum allowed vcpu-id as its name suggests, but the number of vcpu-ids.
So revert this patch again.

Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210913135745.13944-2-jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:04 -04:00
Vitaly Kuznetsov
620b2438ab KVM: Make kvm_make_vcpus_request_mask() use pre-allocated cpu_kick_mask
kvm_make_vcpus_request_mask() already disables preemption so just like
kvm_make_all_cpus_request_except() it can be switched to using
pre-allocated per-cpu cpumasks. This allows for improvements for both
users of the function: in Hyper-V emulation code 'tlb_flush' can now be
dropped from 'struct kvm_vcpu_hv' and kvm_make_scan_ioapic_request_mask()
gets rid of dynamic allocation.

cpumask_available() checks in kvm_make_vcpu_request() and
kvm_kick_many_cpus() can now be dropped as they checks for an impossible
condition: kvm_init() makes sure per-cpu masks are allocated.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-9-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:04 -04:00
Vitaly Kuznetsov
baff59ccdc KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()
Allocating cpumask dynamically in zalloc_cpumask_var() is not ideal.
Allocation is somewhat slow and can (in theory and when CPUMASK_OFFSTACK)
fail. kvm_make_all_cpus_request_except() already disables preemption so
we can use pre-allocated per-cpu cpumasks instead.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-8-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:04 -04:00
Vitaly Kuznetsov
381cecc5d7 KVM: Drop 'except' parameter from kvm_make_vcpus_request_mask()
Both remaining callers of kvm_make_vcpus_request_mask() pass 'NULL' for
'except' parameter so it can just be dropped.

No functional change intended ©.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:04 -04:00
Vitaly Kuznetsov
ae0946cd36 KVM: Optimize kvm_make_vcpus_request_mask() a bit
Iterating over set bits in 'vcpu_bitmap' should be faster than going
through all vCPUs, especially when just a few bits are set.

Drop kvm_make_vcpus_request_mask() call from kvm_make_all_cpus_request_except()
to avoid handling the special case when 'vcpu_bitmap' is NULL, move the
code to kvm_make_all_cpus_request_except() itself.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:03 -04:00
Vitaly Kuznetsov
6470accc7b KVM: x86: hyper-v: Avoid calling kvm_make_vcpus_request_mask() with vcpu_mask==NULL
In preparation to making kvm_make_vcpus_request_mask() use for_each_set_bit()
switch kvm_hv_flush_tlb() to calling kvm_make_all_cpus_request() for 'all cpus'
case.

Note: kvm_make_all_cpus_request() (unlike kvm_make_vcpus_request_mask())
currently dynamically allocates cpumask on each call and this is suboptimal.
Both kvm_make_all_cpus_request() and kvm_make_vcpus_request_mask() are
going to be switched to using pre-allocated per-cpu masks.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210903075141.403071-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:03 -04:00
Yang Li
11476d277e KVM: use vma_pages() helper
Use vma_pages function on vma object instead of explicit computation.

Fix the following coccicheck warning:
./virt/kvm/kvm_main.c:3526:29-35: WARNING: Consider using vma_pages
helper on vma

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Message-Id: <1632900526-119643-1-git-send-email-yang.lee@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:03 -04:00
Vitaly Kuznetsov
feb3162f9d KVM: nVMX: Reset vmxon_ptr upon VMXOFF emulation.
Currently, 'vmx->nested.vmxon_ptr' is not reset upon VMXOFF
emulation. This is not a problem per se as we never access
it when !vmx->nested.vmxon. But this should be done to avoid
any issue in the future.

Also, initialize the vmxon_ptr when vcpu is created.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210929175154.11396-3-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:02 -04:00
Yu Zhang
64c785082c KVM: nVMX: Use INVALID_GPA for pointers used in nVMX.
Clean up nested.c and vmx.c by using INVALID_GPA instead of "-1ull",
to denote an invalid address in nested VMX. Affected addresses are
the ones of VMXON region, current VMCS, VMCS link pointer, virtual-
APIC page, ENCLS-exiting bitmap, and IO bitmap etc.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Message-Id: <20210929175154.11396-2-yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:27:02 -04:00
Sean Christopherson
7b0035eaa7 KVM: selftests: Ensure all migrations are performed when test is affined
Rework the CPU selection in the migration worker to ensure the specified
number of migrations are performed when the test iteslf is affined to a
subset of CPUs.  The existing logic skips iterations if the target CPU is
not in the original set of possible CPUs, which causes the test to fail
if too many iterations are skipped.

  ==== Test Assertion Failure ====
  rseq_test.c:228: i > (NR_TASK_MIGRATIONS / 2)
  pid=10127 tid=10127 errno=4 - Interrupted system call
     1  0x00000000004018e5: main at rseq_test.c:227
     2  0x00007fcc8fc66bf6: ?? ??:0
     3  0x0000000000401959: _start at ??:?
  Only performed 4 KVM_RUNs, task stalled too much?

Calculate the min/max possible CPUs as a cheap "best effort" to avoid
high runtimes when the test is affined to a small percentage of CPUs.
Alternatively, a list or xarray of the possible CPUs could be used, but
even in a horrendously inefficient setup, such optimizations are not
needed because the runtime is completely dominated by the cost of
migrating the task, and the absolute runtime is well under a minute in
even truly absurd setups, e.g. running on a subset of vCPUs in a VM that
is heavily overcommited (16 vCPUs per pCPU).

Fixes: 61e52f1630 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs")
Reported-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210929234112.1862848-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:25:57 -04:00
Sean Christopherson
e8a747d088 KVM: x86: Swap order of CPUID entry "index" vs. "significant flag" checks
Check whether a CPUID entry's index is significant before checking for a
matching index to hack-a-fix an undefined behavior bug due to consuming
uninitialized data.  RESET/INIT emulation uses kvm_cpuid() to retrieve
CPUID.0x1, which does _not_ have a significant index, and fails to
initialize the dummy variable that doubles as EBX/ECX/EDX output _and_
ECX, a.k.a. index, input.

Practically speaking, it's _extremely_  unlikely any compiler will yield
code that causes problems, as the compiler would need to inline the
kvm_cpuid() call to detect the uninitialized data, and intentionally hose
the kernel, e.g. insert ud2, instead of simply ignoring the result of
the index comparison.

Although the sketchy "dummy" pattern was introduced in SVM by commit
66f7b72e11 ("KVM: x86: Make register state after reset conform to
specification"), it wasn't actually broken until commit 7ff6c03503
("KVM: x86: Remove stateful CPUID handling") arbitrarily swapped the
order of operations such that "index" was checked before the significant
flag.

Avoid consuming uninitialized data by reverting to checking the flag
before the index purely so that the fix can be easily backported; the
offending RESET/INIT code has been refactored, moved, and consolidated
from vendor code to common x86 since the bug was introduced.  A future
patch will directly address the bad RESET/INIT behavior.

The undefined behavior was detected by syzbot + KernelMemorySanitizer.

  BUG: KMSAN: uninit-value in cpuid_entry2_find arch/x86/kvm/cpuid.c:68
  BUG: KMSAN: uninit-value in kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103
  BUG: KMSAN: uninit-value in kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
   cpuid_entry2_find arch/x86/kvm/cpuid.c:68 [inline]
   kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103 [inline]
   kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
   kvm_vcpu_reset+0x13fb/0x1c20 arch/x86/kvm/x86.c:10885
   kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
   vcpu_enter_guest+0xfd2/0x6d80 arch/x86/kvm/x86.c:9534
   vcpu_run+0x7f5/0x18d0 arch/x86/kvm/x86.c:9788
   kvm_arch_vcpu_ioctl_run+0x245b/0x2d10 arch/x86/kvm/x86.c:10020

  Local variable ----dummy@kvm_vcpu_reset created at:
   kvm_vcpu_reset+0x1fb/0x1c20 arch/x86/kvm/x86.c:10812
   kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923

Reported-by: syzbot+f3985126b746b3d59c9d@syzkaller.appspotmail.com
Reported-by: Alexander Potapenko <glider@google.com>
Fixes: 2a24be79b6 ("KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping")
Fixes: 7ff6c03503 ("KVM: x86: Remove stateful CPUID handling")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210929222426.1855730-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:20:33 -04:00
Zelin Deng
773e89ab00 ptp: Fix ptp_kvm_getcrosststamp issue for x86 ptp_kvm
hv_clock is preallocated to have only HVC_BOOT_ARRAY_SIZE (64) elements;
if the PTP_SYS_OFFSET_PRECISE ioctl is executed on vCPUs whose index is
64 of higher, retrieving the struct pvclock_vcpu_time_info pointer with
"src = &hv_clock[cpu].pvti" will result in an out-of-bounds access and
a wild pointer.  Change it to "this_cpu_pvti()" which is guaranteed to
be valid.

Fixes: 95a3d4454b ("Switch kvmclock data to a PER_CPU variable")
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-3-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:08:15 -04:00
Zelin Deng
ad9af93068 x86/kvmclock: Move this_cpu_pvti into kvmclock.h
There're other modules might use hv_clock_per_cpu variable like ptp_kvm,
so move it into kvmclock.h and export the symbol to make it visiable to
other modules.

Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30 04:08:01 -04:00
Oliver Upton
e02c16b9cd selftests: KVM: Don't clobber XMM register when read
There is no need to clobber a register that is only being read from.
Oops. Drop the XMM register from the clobbers list.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210927223621.50178-1-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-28 11:31:29 -04:00
Zhenzhong Duan
5c49d1850d KVM: VMX: Fix a TSX_CTRL_CPUID_CLEAR field mask issue
When updating the host's mask for its MSR_IA32_TSX_CTRL user return entry,
clear the mask in the found uret MSR instead of vmx->guest_uret_msrs[i].
Modifying guest_uret_msrs directly is completely broken as 'i' does not
point at the MSR_IA32_TSX_CTRL entry.  In fact, it's guaranteed to be an
out-of-bounds accesses as is always set to kvm_nr_uret_msrs in a prior
loop. By sheer dumb luck, the fallout is limited to "only" failing to
preserve the host's TSX_CTRL_CPUID_CLEAR.  The out-of-bounds access is
benign as it's guaranteed to clear a bit in a guest MSR value, which are
always zero at vCPU creation on both x86-64 and i386.

Cc: stable@vger.kernel.org
Fixes: 8ea8b8d6f8 ("KVM: VMX: Use common x86's uret MSR list as the one true list")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210926015545.281083-1-zhenzhong.duan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-27 11:25:40 -04:00
Paolo Bonzini
50b0781846 KVM/arm64 fixes for 5.15, take #1
- Add missing FORCE target when building the EL2 object
 - Fix a PMU probe regression on some platforms
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmFNjR4PHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDImkQAL2Y+jxz3fd8oXHPFIOW1TkHTaafyk5DmqmP
 WqwjQLj+WpI/q7bQawbhaoFSonFhkf2OOu4BvsrEdEzQIiONkbuIcdqMHjPKfeHK
 x4dx/JG7K/W/b86WOh479L9M7VQYw0F53tgOwUxf9tHZgXl1gmlUy1K9qxqyLty6
 1aYaoU8/3xuHI4RDEmQTfS/gj7X/1ng7OK3B3ZeUV1U4xOuBxcmS+leevilSE7c2
 QC5mgU9Dsa3UkATdz+TZtmbx9BeTRhyNar0KnrwsaSkTti4miT8DoApoU9JdRVr3
 VoSkgb2hdyT8J/jBNg+6Rs3uQr2Lp0scyXRDgZDdba7FPbUuDzCqKxfpMoXNU8Um
 G5kOJGBg07cI4l8S/giIaqO6r8cZooBNuWKuJDKqED9ikbma4hG8kaLsHP4BtjNZ
 INUVeL39je/5/gp6XaRPYBKqEajp5bRnxbPOzWKqqELV3s4ArtZPs8cVS6c8b0YL
 9O7pv7EXueQHOgoF0Zh1H14wv4iBK5LKHGv3r/uks0ryBc/x93qeAVKcQAZ+l/s2
 dPWfQzDHvwkQybkkYw4XlVE2kLTKxbcvolN+++TIfCzvXyt/pcL5MPlkZIvMfZhd
 3YuN44NH61pnn5h0lHlWk9mNPnBoAswjw5qopa1kr5YVcYCcvJ0MdB/Wb6Xm473p
 /DFZv20H
 =klvK
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-fixes-5.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master

KVM/arm64 fixes for 5.15, take #1

- Add missing FORCE target when building the EL2 object
- Fix a PMU probe regression on some platforms
2021-09-24 06:04:42 -04:00
Oliver Upton
386ca9d7fd selftests: KVM: Explicitly use movq to read xmm registers
Compiling the KVM selftests with clang emits the following warning:

>> include/x86_64/processor.h:297:25: error: variable 'xmm0' is uninitialized when used here [-Werror,-Wuninitialized]
>>                return (unsigned long)xmm0;

where xmm0 is accessed via an uninitialized register variable.

Indeed, this is a misuse of register variables, which really should only
be used for specifying register constraints on variables passed to
inline assembly. Rather than attempting to read xmm registers via
register variables, just explicitly perform the movq from the desired
xmm register.

Fixes: 783e9e5126 ("kvm: selftests: add API testing infrastructure")
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210924005147.1122357-1-oupton@google.com>
Reviewed-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-24 02:32:58 -04:00
Oliver Upton
fbf094ce52 selftests: KVM: Call ucall_init when setting up in rseq_test
While x86 does not require any additional setup to use the ucall
infrastructure, arm64 needs to set up the MMIO address used to signal a
ucall to userspace. rseq_test does not initialize the MMIO address,
resulting in the test spinning indefinitely.

Fix the issue by calling ucall_init() during setup.

Fixes: 61e52f1630 ("KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs")
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210923220033.4172362-1-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-24 02:32:12 -04:00
Lai Jiangshan
6bc6db0002 KVM: Remove tlbs_dirty
There is no user of tlbs_dirty.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 11:01:12 -04:00
Lai Jiangshan
65855ed8b0 KVM: X86: Synchronize the shadow pagetable before link it
If gpte is changed from non-present to present, the guest doesn't need
to flush tlb per SDM.  So the host must synchronze sp before
link it.  Otherwise the guest might use a wrong mapping.

For example: the guest first changes a level-1 pagetable, and then
links its parent to a new place where the original gpte is non-present.
Finally the guest can access the remapped area without flushing
the tlb.  The guest's behavior should be allowed per SDM, but the host
kvm mmu makes it wrong.

Fixes: 4731d4c7a0 ("KVM: MMU: out of sync shadow core")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 11:01:00 -04:00
Lai Jiangshan
f81602958c KVM: X86: Fix missed remote tlb flush in rmap_write_protect()
When kvm->tlbs_dirty > 0, some rmaps might have been deleted
without flushing tlb remotely after kvm_sync_page().  If @gfn
was writable before and it's rmaps was deleted in kvm_sync_page(),
and if the tlb entry is still in a remote running VCPU,  the @gfn
is not safely protected.

To fix the problem, kvm_sync_page() does the remote flush when
needed to avoid the problem.

Fixes: a4ee1ca4a3 ("KVM: MMU: delay flush all tlbs on sync_page path")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210918005636.3675-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 10:28:44 -04:00
Maxim Levitsky
faf6b75562 KVM: x86: nSVM: don't copy virt_ext from vmcb12
These field correspond to features that we don't expose yet to L2

While currently there are no CVE worthy features in this field,
if AMD adds more features to this field, that could allow guest
escapes similar to CVE-2021-3653 and CVE-2021-3656.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-6-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 10:06:46 -04:00
Maxim Levitsky
d1cba6c922 KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround
GP SVM errata workaround made the #GP handler always emulate
the SVM instructions.

However these instructions #GP in case the operand is not 4K aligned,
but the workaround code didn't check this and we ended up
emulating these instructions anyway.

This is only an emulation accuracy check bug as there is no harm for
KVM to read/write unaligned vmcb images.

Fixes: 82a11e9c6f ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 10:05:29 -04:00
Maxim Levitsky
1ad32105d7 KVM: x86: selftests: test simultaneous uses of V_IRQ from L1 and L0
Test that if:

* L1 disables virtual interrupt masking, and INTR intercept.

* L1 setups a virtual interrupt to be injected to L2 and enters L2 with
  interrupts disabled, thus the virtual interrupt is pending.

* Now an external interrupt arrives in L1 and since
  L1 doesn't intercept it, it should be delivered to L2 when
  it enables interrupts.

  to do this L0 (abuses) V_IRQ to setup an
  interrupt window, and returns to L2.

* L2 enables interrupts.
  This should trigger the interrupt window,
  injection of the external interrupt and delivery
  of the virtual interrupt that can now be done.

* Test that now L2 gets those interrupts.

This is the test that demonstrates the issue that was
fixed in the previous patch.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 10:05:07 -04:00
Maxim Levitsky
aee77e1169 KVM: x86: nSVM: restore int_vector in svm_clear_vintr
In svm_clear_vintr we try to restore the virtual interrupt
injection that might be pending, but we fail to restore
the interrupt vector.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210914154825.104886-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23 10:04:40 -04:00
Fares Mehanna
e1fc1553cd kvm: x86: Add AMD PMU MSRs to msrs_to_save_all[]
Intel PMU MSRs is in msrs_to_save_all[], so add AMD PMU MSRs to have a
consistent behavior between Intel and AMD when using KVM_GET_MSRS,
KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST.

We have to add legacy and new MSRs to handle guests running without
X86_FEATURE_PERFCTR_CORE.

Signed-off-by: Fares Mehanna <faresx@amazon.de>
Message-Id: <20210915133951.22389-1-faresx@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 11:30:14 -04:00
Maxim Levitsky
dbab610a5b KVM: x86: nVMX: re-evaluate emulation_required on nested VM exit
If L1 had invalid state on VM entry (can happen on SMM transactions
when we enter from real mode, straight to nested guest),

then after we load 'host' state from VMCS12, the state has to become
valid again, but since we load the segment registers with
__vmx_set_segment we weren't always updating emulation_required.

Update emulation_required explicitly at end of load_vmcs12_host_state.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:47:50 -04:00
Maxim Levitsky
c8607e4a08 KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry
It is possible that when non root mode is entered via special entry
(!from_vmentry), that is from SMM or from loading the nested state,
the L2 state could be invalid in regard to non unrestricted guest mode,
but later it can become valid.

(for example when RSM emulation restores segment registers from SMRAM)

Thus delay the check to VM entry, where we will check this and fail.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-7-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:47:50 -04:00
Maxim Levitsky
c42dec148b KVM: x86: VMX: synthesize invalid VM exit when emulating invalid guest state
Since no actual VM entry happened, the VM exit information is stale.
To avoid this, synthesize an invalid VM guest state VM exit.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:47:49 -04:00
Maxim Levitsky
136a55c054 KVM: x86: nSVM: refactor svm_leave_smm and smm_enter_smm
Use return statements instead of nested if, and fix error
path to free all the maps that were allocated.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:47:43 -04:00
Maxim Levitsky
e85d3e7b49 KVM: x86: SVM: call KVM_REQ_GET_NESTED_STATE_PAGES on exit from SMM mode
Currently the KVM_REQ_GET_NESTED_STATE_PAGES on SVM only reloads PDPTRs,
and MSR bitmap, with former not really needed for SMM as SMM exit code
reloads them again from SMRAM'S CR3, and later happens to work
since MSR bitmap isn't modified while in SMM.

Still it is better to be consistient with VMX.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:17 -04:00
Maxim Levitsky
37687c403a KVM: x86: reset pdptrs_from_userspace when exiting smm
When exiting SMM, pdpts are loaded again from the guest memory.

This fixes a theoretical bug, when exit from SMM triggers entry to the
nested guest which re-uses some of the migration
code which uses this flag as a workaround for a legacy userspace.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:16 -04:00
Maxim Levitsky
e2e6e449d6 KVM: x86: nSVM: restore the L1 host state prior to resuming nested guest on SMM exit
Otherwise guest entry code might see incorrect L1 state (e.g paging state).

Fixes: 37be407b2c ("KVM: nSVM: Fix L1 state corruption upon return from SMM")

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210913140954.165665-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:16 -04:00
Vitaly Kuznetsov
8d68bad6d8 KVM: nVMX: Filter out all unsupported controls when eVMCS was activated
Windows Server 2022 with Hyper-V role enabled failed to boot on KVM when
enlightened VMCS is advertised. Debugging revealed there are two exposed
secondary controls it is not happy with: SECONDARY_EXEC_ENABLE_VMFUNC and
SECONDARY_EXEC_SHADOW_VMCS. These controls are known to be unsupported,
as there are no corresponding fields in eVMCSv1 (see the comment above
EVMCS1_UNSUPPORTED_2NDEXEC definition).

Previously, commit 31de3d2500 ("x86/kvm/hyper-v: move VMX controls
sanitization out of nested_enable_evmcs()") introduced the required
filtering mechanism for VMX MSRs but for some reason put only known
to be problematic (and not full EVMCS1_UNSUPPORTED_* lists) controls
there.

Note, Windows Server 2022 seems to have gained some sanity check for VMX
MSRs: it doesn't even try to launch a guest when there's something it
doesn't like, nested_evmcs_check_controls() mechanism can't catch the
problem.

Let's be bold this time and instead of playing whack-a-mole just filter out
all unsupported controls from VMX MSRs.

Fixes: 31de3d2500 ("x86/kvm/hyper-v: move VMX controls sanitization out of nested_enable_evmcs()")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210907163530.110066-1-vkuznets@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:15 -04:00
Sean Christopherson
0bbc2ca851 KVM: KVM: Use cpumask_available() to check for NULL cpumask when kicking vCPUs
Check for a NULL cpumask_var_t when kicking multiple vCPUs via
cpumask_available(), which performs a !NULL check if and only if cpumasks
are configured to be allocated off-stack.  This is a meaningless
optimization, e.g. avoids a TEST+Jcc and TEST+CMOV on x86, but more
importantly helps document that the NULL check is necessary even though
all callers pass in a local variable.

No functional change intended.

Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210827092516.1027264-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:15 -04:00
Sean Christopherson
85b640450d KVM: Clean up benign vcpu->cpu data races when kicking vCPUs
Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
is read exactly once, and by ensuring the vCPU is booted from guest mode
if kvm_arch_vcpu_should_kick() returns true.  Fix a similar race in
kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
kvm_request_needs_ipi() returns true.

Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
kvm_request_needs_ipi()) means the target vCPU could get migrated (change
vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpud and
reading vcpu->mode.  If that happens, the kick/IPI will be sent to the
old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.

Although failing to kick the vCPU is not exactly ideal, practically
speaking it cannot cause a functional issue unless there is also a bug in
the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
behavior.

The purpose of sending an IPI is purely to get a vCPU into the host (or
out of reading SPTEs) so that the vCPU can recognize a change in state,
e.g. a KVM_REQ_* request.  If vCPU's handling of the state change is
required for correctness, KVM must ensure either the vCPU sees the change
before entering the guest, or that the sender sees the vCPU as running in
guest mode.  All architectures handle this by (a) sending the request
before calling kvm_vcpu_kick() and (b) checking for requests _after_
setting vcpu->mode.

x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
ensure it kicks and waits for vCPUs that started reading SPTEs _before_
MMU changes were finalized, but any vCPU that starts reading after MMU
changes were finalized will see the new state and can continue on
uninterrupted.

For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
upon for functional correctness, e.g. in the dirty log case, userspace
cannot assume it has a 100% complete log if vCPUs are still running.

All that said, eliminate the benign race since the cost of doing so is an
"extra" atomic cmpxchg() in the case where the target vCPU is loaded by
the current pCPU or is not loaded at all.  I.e. the kick will be skipped
due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
opposed to the kick being skipped because of the cpu checks.

Keep the "cpu != me" checks even though they appear useless/impossible at
first glance.  x86 processes guest IPI writes in a fast path that runs in
IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE.  And
calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or
READING_SHADOW_PAGE_TABLES is perfectly reasonable.

Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
persists, e.g. the vCPU could exit guest mode and get offlined between
the cpu_online() check and the sending of smp_send_reschedule().  But,
the online check appears to exist only to avoid a WARN in x86's
native_smp_send_reschedule() that fires if the target CPU is not online.
The reschedule WARN exists because CPU offlining takes the CPU out of the
scheduling pool, i.e. the WARN is intended to detect the case where the
kernel attempts to schedule a task on an offline CPU.  The actual sending
of the IPI is a non-issue as at worst it will simpy be dropped on the
floor.  In other words, KVM's usurping of the reschedule IPI could
theoretically trigger a WARN if the stars align, but there will be no
loss of functionality.

[*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a

Cc: Venkatesh Srinivas <venkateshs@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Fixes: 97222cc831 ("KVM: Emulate local APIC in kernel")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210827092516.1027264-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22 10:33:15 -04:00