Commit graph

1469 commits

Author SHA1 Message Date
Liran Alon
72aeb60c52 KVM: nVMX: Verify eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD
According to TLFS section 16.11.2 Enlightened VMCS, the first u32
field of eVMCS should specify eVMCS VersionNumber.

This version should be in the range of supported eVMCS versions exposed
to guest via CPUID.0x4000000A.EAX[0:15].
The range which KVM expose to guest in this CPUID field should be the
same as the value returned in vmcs_version by nested_enable_evmcs().

According to the above, eVMCS VMPTRLD should verify that version specified
in given eVMCS is in the supported range. However, current code
mistakenly verfies this field against VMCS12_REVISION.

One can also see that when KVM use eVMCS, it makes sure that
alloc_vmcs_cpu() sets allocated eVMCS revision_id to KVM_EVMCS_VERSION.

Obvious fix should just change eVMCS VMPTRLD to verify first u32 field
of eVMCS is equal to KVM_EVMCS_VERSION.
However, it turns out that Microsoft Hyper-V fails to comply to their
own invented interface: When Hyper-V use eVMCS, it just sets first u32
field of eVMCS to revision_id specified in MSR_IA32_VMX_BASIC (In our
case: VMCS12_REVISION). Instead of used eVMCS version number which is
one of the supported versions specified in CPUID.0x4000000A.EAX[0:15].
To overcome Hyper-V bug, we accept either a supported eVMCS version
or VMCS12_REVISION as valid values for first u32 field of eVMCS.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:50:29 +01:00
Leonid Shatz
326e742533 KVM: nVMX/nSVM: Fix bug which sets vcpu->arch.tsc_offset to L1 tsc_offset
Since commit e79f245dde ("X86/KVM: Properly update 'tsc_offset' to
represent the running guest"), vcpu->arch.tsc_offset meaning was
changed to always reflect the tsc_offset value set on active VMCS.
Regardless if vCPU is currently running L1 or L2.

However, above mentioned commit failed to also change
kvm_vcpu_write_tsc_offset() to set vcpu->arch.tsc_offset correctly.
This is because vmx_write_tsc_offset() could set the tsc_offset value
in active VMCS to given offset parameter *plus vmcs12->tsc_offset*.
However, kvm_vcpu_write_tsc_offset() just sets vcpu->arch.tsc_offset
to given offset parameter. Without taking into account the possible
addition of vmcs12->tsc_offset. (Same is true for SVM case).

Fix this issue by changing kvm_x86_ops->write_tsc_offset() to return
actually set tsc_offset in active VMCS and modify
kvm_vcpu_write_tsc_offset() to set returned value in
vcpu->arch.tsc_offset.
In addition, rename write_tsc_offset() callback to write_l1_tsc_offset()
to make it clear that it is meant to set L1 TSC offset.

Fixes: e79f245dde ("X86/KVM: Properly update 'tsc_offset' to represent the running guest")
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Leonid Shatz <leonid.shatz@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:50:10 +01:00
Yi Wang
1e4329ee2c x86/kvm/vmx: fix old-style function declaration
The inline keyword which is not at the beginning of the function
declaration may trigger the following build warnings, so let's fix it:

arch/x86/kvm/vmx.c:1309:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
arch/x86/kvm/vmx.c:5947:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
arch/x86/kvm/vmx.c:5985:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]
arch/x86/kvm/vmx.c:6023:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:50:09 +01:00
Liran Alon
f48b4711dd KVM: VMX: Update shared MSRs to be saved/restored on MSR_EFER.LMA changes
When guest transitions from/to long-mode by modifying MSR_EFER.LMA,
the list of shared MSRs to be saved/restored on guest<->host
transitions is updated (See vmx_set_efer() call to setup_msrs()).

On every entry to guest, vcpu_enter_guest() calls
vmx_prepare_switch_to_guest(). This function should also take care
of setting the shared MSRs to be saved/restored. However, the
function does nothing in case we are already running with loaded
guest state (vmx->loaded_cpu_state != NULL).

This means that even when guest modifies MSR_EFER.LMA which results
in updating the list of shared MSRs, it isn't being taken into account
by vmx_prepare_switch_to_guest() because it happens while we are
running with loaded guest state.

To fix above mentioned issue, add a flag to mark that the list of
shared MSRs has been updated and modify vmx_prepare_switch_to_guest()
to set shared MSRs when running with host state *OR* list of shared
MSRs has been updated.

Note that this issue was mistakenly introduced by commit
678e315e78 ("KVM: vmx: add dedicated utility to access guest's
kernel_gs_base") because previously vmx_set_efer() always called
vmx_load_host_state() which resulted in vmx_prepare_switch_to_guest() to
set shared MSRs.

Fixes: 678e315e78 ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base")
Reported-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:50:08 +01:00
Liran Alon
7f9ad1dfa3 KVM: nVMX: Fix kernel info-leak when enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS more than once
Consider the case that userspace enables KVM_CAP_HYPERV_ENLIGHTENED_VMCS twice:
1) kvm_vcpu_ioctl_enable_cap() is called to enable
KVM_CAP_HYPERV_ENLIGHTENED_VMCS which calls nested_enable_evmcs().
2) nested_enable_evmcs() sets enlightened_vmcs_enabled to true and fills
vmcs_version which is then copied to userspace.
3) kvm_vcpu_ioctl_enable_cap() is called again to enable
KVM_CAP_HYPERV_ENLIGHTENED_VMCS which calls nested_enable_evmcs().
4) This time nested_enable_evmcs() just returns 0 as
enlightened_vmcs_enabled is already true. *Without filling
vmcs_version*.
5) kvm_vcpu_ioctl_enable_cap() continues as usual and copies
*uninitialized* vmcs_version to userspace which leads to kernel info-leak.

Fix this issue by simply changing nested_enable_evmcs() to always fill
vmcs_version output argument. Even when enlightened_vmcs_enabled is
already set to true.

Note that SVM's nested_enable_evmcs() should not be modified because it
always returns a non-zero value (-ENODEV) which results in
kvm_vcpu_ioctl_enable_cap() skipping the copy of vmcs_version to
userspace (as it should).

Fixes: 57b119da35 ("KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability")
Reported-by: syzbot+cfbc368e283d381f8cef@syzkaller.appspotmail.com
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:49:46 +01:00
Luiz Capitulino
a87c99e612 KVM: VMX: re-add ple_gap module parameter
Apparently, the ple_gap parameter was accidentally removed
by commit c8e88717cf. Add it
back.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Cc: stable@vger.kernel.org
Fixes: c8e88717cf
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-11-27 12:48:43 +01:00
Linus Torvalds
0d1e8b8d2b KVM updates for v4.20
ARM:
  - Improved guest IPA space support (32 to 52 bits)
 
  - RAS event delivery for 32bit
 
  - PMU fixes
 
  - Guest entry hardening
 
  - Various cleanups
 
  - Port of dirty_log_test selftest
 
 PPC:
  - Nested HV KVM support for radix guests on POWER9.  The performance is
    much better than with PR KVM.  Migration and arbitrary level of
    nesting is supported.
 
  - Disable nested HV-KVM on early POWER9 chips that need a particular hardware
    bug workaround
 
  - One VM per core mode to prevent potential data leaks
 
  - PCI pass-through optimization
 
  - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base
 
 s390:
  - Initial version of AP crypto virtualization via vfio-mdev
 
  - Improvement for vfio-ap
 
  - Set the host program identifier
 
  - Optimize page table locking
 
 x86:
  - Enable nested virtualization by default
 
  - Implement Hyper-V IPI hypercalls
 
  - Improve #PF and #DB handling
 
  - Allow guests to use Enlightened VMCS
 
  - Add migration selftests for VMCS and Enlightened VMCS
 
  - Allow coalesced PIO accesses
 
  - Add an option to perform nested VMCS host state consistency check
    through hardware
 
  - Automatic tuning of lapic_timer_advance_ns
 
  - Many fixes, minor improvements, and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJb0FINAAoJEED/6hsPKofoI60IAJRS3vOAQ9Fav8cJsO1oBHcX
 3+NexfnBke1bzrjIR3SUcHKGZbdnVPNZc+Q4JjIbPpPmmOMU5jc9BC1dmd5f4Vzh
 BMnQ0yCvgFv3A3fy/Icx1Z8NJppxosdmqdQLrQrNo8aD3cjnqY2yQixdXrAfzLzw
 XEgKdIFCCz8oVN/C9TT4wwJn6l9OE7BM5bMKGFy5VNXzMu7t64UDOLbbjZxNgi1g
 teYvfVGdt5mH0N7b2GPPWRbJmgnz5ygVVpVNQUEFrdKZoCm6r5u9d19N+RRXAwan
 ZYFj10W2T8pJOUf3tryev4V33X7MRQitfJBo4tP5hZfi9uRX89np5zP1CFE7AtY=
 =yEPW
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "ARM:
   - Improved guest IPA space support (32 to 52 bits)

   - RAS event delivery for 32bit

   - PMU fixes

   - Guest entry hardening

   - Various cleanups

   - Port of dirty_log_test selftest

  PPC:
   - Nested HV KVM support for radix guests on POWER9. The performance
     is much better than with PR KVM. Migration and arbitrary level of
     nesting is supported.

   - Disable nested HV-KVM on early POWER9 chips that need a particular
     hardware bug workaround

   - One VM per core mode to prevent potential data leaks

   - PCI pass-through optimization

   - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base

  s390:
   - Initial version of AP crypto virtualization via vfio-mdev

   - Improvement for vfio-ap

   - Set the host program identifier

   - Optimize page table locking

  x86:
   - Enable nested virtualization by default

   - Implement Hyper-V IPI hypercalls

   - Improve #PF and #DB handling

   - Allow guests to use Enlightened VMCS

   - Add migration selftests for VMCS and Enlightened VMCS

   - Allow coalesced PIO accesses

   - Add an option to perform nested VMCS host state consistency check
     through hardware

   - Automatic tuning of lapic_timer_advance_ns

   - Many fixes, minor improvements, and cleanups"

* tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
  Revert "kvm: x86: optimize dr6 restore"
  KVM: PPC: Optimize clearing TCEs for sparse tables
  x86/kvm/nVMX: tweak shadow fields
  selftests/kvm: add missing executables to .gitignore
  KVM: arm64: Safety check PSTATE when entering guest and handle IL
  KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips
  arm/arm64: KVM: Enable 32 bits kvm vcpu events support
  arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension()
  KVM: arm64: Fix caching of host MDCR_EL2 value
  KVM: VMX: enable nested virtualization by default
  KVM/x86: Use 32bit xor to clear registers in svm.c
  kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD
  kvm: vmx: Defer setting of DR6 until #DB delivery
  kvm: x86: Defer setting of CR2 until #PF delivery
  kvm: x86: Add payload operands to kvm_multiple_exception
  kvm: x86: Add exception payload fields to kvm_vcpu_events
  kvm: x86: Add has_payload and payload to kvm_queued_exception
  KVM: Documentation: Fix omission in struct kvm_vcpu_events
  KVM: selftests: add Enlightened VMCS test
  ...
2018-10-25 17:57:35 -07:00
KarimAllah Ahmed
22a7cdcae6 KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
The spec only requires the posted interrupt descriptor address to be
64-bytes aligned (i.e. bits[0:5] == 0). Using page_address_valid also
forces the address to be page aligned.

Only validate that the address does not cross the maximum physical address
without enforcing a page alignment.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Fixes: 6de84e581c ("nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2")
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Krish Sadhuhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-10-24 12:47:16 +02:00
Vitaly Kuznetsov
cbe3f898d1 x86/kvm/nVMX: tweak shadow fields
It seems we have some leftovers from times when 'unrestricted guest'
wasn't exposed to L1. Stop shadowing GUEST_CS_{BASE,LIMIT,AR_SELECTOR}
and GUEST_ES_BASE, shadow GUEST_SS_AR_BYTES as it was found that some
hypervisors (e.g. Hyper-V without Enlightened VMCS) access it pretty
often.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-19 18:45:14 +02:00
Paolo Bonzini
1e58e5e591 KVM: VMX: enable nested virtualization by default
With live migration support and finally a good solution for exception
event injection, nested VMX should be ready for having a stable userspace
ABI.  The results of syzkaller fuzzing are not perfect but not horrible
either (and might be partially due to running on GCE, so that effectively
we're testing three-level nesting on a fork of upstream KVM!).  Enabling
it by default seems like a nice way to conclude the 4.20 pull request. :)

Unfortunately, enabling nested SVM in 2009 (commit 4b6e4dca70) was a
bit premature.  However, until live migration support is in place we can
reasonably expect that it does not offer much in terms of ABI guarantees.
Therefore we are still in time to break things and conform as much as
possible to the interface used for VMX.

Suggested-by: Jim Mattson <jmattson@google.com>
Suggested-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Celebrated-by: Liran Alon <liran.alon@oracle.com>
Celebrated-by: Wanpeng Li <kernellwp@gmail.com>
Celebrated-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 19:10:09 +02:00
Uros Bizjak
43ce76ce73 KVM/x86: Use 32bit xor to clear registers in svm.c
x86_64 zero-extends 32bit xor operation to a full 64bit register.

Also add a comment and remove unnecessary instruction suffix in vmx.c

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 19:08:21 +02:00
Jim Mattson
f10c729ff9 kvm: vmx: Defer setting of DR6 until #DB delivery
When exception payloads are enabled by userspace (which is not yet
possible) and a #DB is raised in L2, defer the setting of DR6 until
later. Under VMX, this allows the L1 hypervisor to intercept the fault
before DR6 is modified. Under SVM, DR6 is modified before L1 can
intercept the fault (as has always been the case with DR7).

Note that the payload associated with a #DB exception includes only
the "new DR6 bits." When the payload is delievered, DR6.B0-B3 will be
cleared and DR6.RTM will be set prior to merging in the new DR6 bits.

Also note that bit 16 in the "new DR6 bits" is set to indicate that a
debug exception (#DB) or a breakpoint exception (#BP) occurred inside
an RTM region while advanced debugging of RTM transactional regions
was enabled. Though the reverse of DR6.RTM, this makes the #DB payload
field compatible with both the pending debug exceptions field under
VMX and the exit qualification for #DB exceptions under VMX.

Reported-by: Jim Mattson <jmattson@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 19:07:43 +02:00
Jim Mattson
da998b46d2 kvm: x86: Defer setting of CR2 until #PF delivery
When exception payloads are enabled by userspace (which is not yet
possible) and a #PF is raised in L2, defer the setting of CR2 until
the #PF is delivered. This allows the L1 hypervisor to intercept the
fault before CR2 is modified.

For backwards compatibility, when exception payloads are not enabled
by userspace, kvm_multiple_exception modifies CR2 when the #PF
exception is raised.

Reported-by: Jim Mattson <jmattson@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 19:07:43 +02:00
Vitaly Kuznetsov
8cab6507f6 x86/kvm/nVMX: nested state migration for Enlightened VMCS
Add support for get/set of nested state when Enlightened VMCS is in use.
A new KVM_STATE_NESTED_EVMCS flag to indicate eVMCS on the vCPU was enabled
is added.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:19 +02:00
Vitaly Kuznetsov
a1b0c1c64d x86/kvm/nVMX: allow bare VMXON state migration
It is perfectly valid for a guest to do VMXON and not do VMPTRLD. This
state needs to be preserved on migration.

Cc: stable@vger.kernel.org
Fixes: 8fcc4b5923
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:18 +02:00
Vitaly Kuznetsov
c4ebd6295a KVM: nVMX: optimize prepare_vmcs02{,_full} for Enlightened VMCS case
When Enlightened VMCS is in use by L1 hypervisor we can avoid vmwriting
VMCS fields which did not change.

Our first goal is to achieve minimal impact on traditional VMCS case so
we're not wrapping each vmwrite() with an if-changed checker. We also can't
utilize static keys as Enlightened VMCS usage is per-guest.

This patch implements the simpliest solution: checking fields in groups.
We skip single vmwrite() statements as doing the check will cost us
something even in non-evmcs case and the win is tiny. Unfortunately, this
makes prepare_vmcs02_full{,_full}() code Enlightened VMCS-dependent (and
a bit ugly).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:16 +02:00
Vitaly Kuznetsov
b8bbab928f KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR
Per Hyper-V TLFS 5.0b:

"The L1 hypervisor may choose to use enlightened VMCSs by writing 1 to
the corresponding field in the VP assist page (see section 7.8.7).
Another field in the VP assist page controls the currently active
enlightened VMCS. Each enlightened VMCS is exactly one page (4 KB) in
size and must be initially zeroed. No VMPTRLD instruction must be
executed to make an enlightened VMCS active or current.

After the L1 hypervisor performs a VM entry with an enlightened VMCS,
the VMCS is considered active on the processor. An enlightened VMCS
can only be active on a single processor at the same time. The L1
hypervisor can execute a VMCLEAR instruction to transition an
enlightened VMCS from the active to the non-active state. Any VMREAD
or VMWRITE instructions while an enlightened VMCS is active is
unsupported and can result in unexpected behavior."

Keep Enlightened VMCS structure for the current L2 guest permanently mapped
from struct nested_vmx instead of mapping it every time.

Suggested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:16 +02:00
Vitaly Kuznetsov
945679e301 KVM: nVMX: add enlightened VMCS state
Adds hv_evmcs pointer and implement copy_enlightened_to_vmcs12() and
copy_enlightened_to_vmcs12().

prepare_vmcs02()/prepare_vmcs02_full() separation is not valid for
Enlightened VMCS, do full sync for now.

Suggested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:15 +02:00
Vitaly Kuznetsov
57b119da35 KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability
Enlightened VMCS is opt-in. The current version does not contain all
fields supported by nested VMX so we must not advertise the
corresponding VMX features if enlightened VMCS is enabled.

Userspace is given the enlightened VMCS version supported by KVM as
part of enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS. The version is to
be advertised to the nested hypervisor, currently done via a cpuid
leaf for Hyper-V.

Suggested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:14 +02:00
Vitaly Kuznetsov
5d7a644336 KVM: VMX: refactor evmcs_sanitize_exec_ctrls()
Split off EVMCS1_UNSUPPORTED_* macros so we can re-use them when
enabling Enlightened VMCS for Hyper-V on KVM.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:14 +02:00
Lan Tianyu
a5c214dad1 KVM/VMX: Change hv flush logic when ept tables are mismatched.
If ept table pointers are mismatched, flushing tlb for each vcpus via
hv flush interface still helps to reduce vmexits which are triggered
by IPI and INEPT emulation.

Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:09 +02:00
Uros Bizjak
44c2d667ce KVM/x86: Use 32bit xor to clear register
x86_64 zero-extends 32bit xor to a full 64bit register. Use %k asm
operand modifier to force 32bit register and save 268 bytes in kvm.o

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:09 +02:00
Uros Bizjak
4b1e54786e KVM/x86: Use assembly instruction mnemonics instead of .byte streams
Recently the minimum required version of binutils was changed to 2.20,
which supports all VMX instruction mnemonics. The patch removes
all .byte #defines and uses real instruction mnemonics instead.

The compiler is now able to pass memory operand to the instruction,
so there is no need for memory clobber anymore. Also, the compiler
adds CC register clobber automatically to all extended asm clauses,
so the patch also removes explicit CC clobber.

The immediate benefit of the patch is removal of many unnecesary
register moves, resulting in 1434 saved bytes in vmx.o:

   text    data     bss     dec     hex filename
 151257   18246    8500  178003   2b753 vmx.o
 152691   18246    8500  179437   2bced vmx-old.o

Some examples of improvement include removal of unneeded moves
of %rsp to %rax in front of invept and invvpid instructions:

    a57e:	b9 01 00 00 00       	mov    $0x1,%ecx
    a583:	48 89 04 24          	mov    %rax,(%rsp)
    a587:	48 89 e0             	mov    %rsp,%rax
    a58a:	48 c7 44 24 08 00 00 	movq   $0x0,0x8(%rsp)
    a591:	00 00
    a593:	66 0f 38 80 08       	invept (%rax),%rcx

to:

    a45c:	48 89 04 24          	mov    %rax,(%rsp)
    a460:	b8 01 00 00 00       	mov    $0x1,%eax
    a465:	48 c7 44 24 08 00 00 	movq   $0x0,0x8(%rsp)
    a46c:	00 00
    a46e:	66 0f 38 80 04 24    	invept (%rsp),%rax

and the ability to use more optimal registers and memory operands
in the instruction:

    8faa:	48 8b 44 24 28       	mov    0x28(%rsp),%rax
    8faf:	4c 89 c2             	mov    %r8,%rdx
    8fb2:	0f 79 d0             	vmwrite %rax,%rdx

to:

    8e7c:	44 0f 79 44 24 28    	vmwrite 0x28(%rsp),%r8

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:08 +02:00
Uros Bizjak
5ebb272b2e KVM/x86: Fix invvpid and invept register operand size in 64-bit mode
Register operand size of invvpid and invept instruction in 64-bit mode
has always 64 bits. Adjust inline function argument type to reflect
correct size.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:07 +02:00
Vitaly Kuznetsov
36d9594dfb x86/kvm/mmu: make space for source data caching in struct kvm_mmu
In preparation to MMU reconfiguration avoidance we need a space to
cache source data. As this partially intersects with kvm_mmu_page_role,
create 64bit sized union kvm_mmu_role holding both base and extended data.
No functional change.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:05 +02:00
Vitaly Kuznetsov
14c07ad89f x86/kvm/mmu: introduce guest_mmu
When EPT is used for nested guest we need to re-init MMU as shadow
EPT MMU (nested_ept_init_mmu_context() does that). When we return back
from L2 to L1 kvm_mmu_reset_context() in nested_vmx_load_cr3() resets
MMU back to normal TDP mode. Add a special 'guest_mmu' so we can use
separate root caches; the improved hit rate is not very important for
single vCPU performance, but it avoids contention on the mmu_lock for
many vCPUs.

On the nested CPUID benchmark, with 16 vCPUs, an L2->L1->L2 vmexit
goes from 42k to 26k cycles.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:04 +02:00
Vitaly Kuznetsov
6a82cd1c7b x86/kvm/mmu.c: add kvm_mmu parameter to kvm_mmu_free_roots()
Add an option to specify which MMU root we want to free. This will
be used when nested and non-nested MMUs for L1 are split.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
2018-10-17 00:30:03 +02:00
Vitaly Kuznetsov
3dc773e745 x86/kvm/mmu.c: set get_pdptr hook in kvm_init_shadow_ept_mmu()
kvm_init_shadow_ept_mmu() doesn't set get_pdptr() hook and is this
not a problem just because MMU context is already initialized and this
hook points to kvm_pdptr_read(). As we're intended to use a dedicated
MMU for shadow EPT MMU set this hook explicitly.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
2018-10-17 00:30:03 +02:00
Vitaly Kuznetsov
44dd3ffa7b x86/kvm/mmu: make vcpu->mmu a pointer to the current MMU
As a preparation to full MMU split between L1 and L2 make vcpu->arch.mmu
a pointer to the currently used mmu. For now, this is always
vcpu->arch.root_mmu. No functional change.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
2018-10-17 00:30:02 +02:00
Sean Christopherson
2768c0cc4a KVM: nVMX: WARN if nested run hits VMFail with early consistency checks enabled
When early consistency checks are enabled, all VMFail conditions
should be caught by nested_vmx_check_vmentry_hw().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:30:00 +02:00
Sean Christopherson
52017608da KVM: nVMX: add option to perform early consistency checks via H/W
KVM defers many VMX consistency checks to the CPU, ostensibly for
performance reasons[1], including checks that result in VMFail (as
opposed to VMExit).  This behavior may be undesirable for some users
since this means KVM detects certain classes of VMFail only after it
has processed guest state, e.g. emulated MSR load-on-entry.  Because
there is a strict ordering between checks that cause VMFail and those
that cause VMExit, i.e. all VMFail checks are performed before any
checks that cause VMExit, we can detect (almost) all VMFail conditions
via a dry run of sorts.  The almost qualifier exists because some
state in vmcs02 comes from L0, e.g. VPID, which means that hardware
will never detect an invalid VPID in vmcs12 because it never sees
said value.  Software must (continue to) explicitly check such fields.

After preparing vmcs02 with all state needed to pass the VMFail
consistency checks, optionally do a "test" VMEnter with an invalid
GUEST_RFLAGS.  If the VMEnter results in a VMExit (due to bad guest
state), then we can safely say that the nested VMEnter should not
VMFail, i.e. any VMFail encountered in nested_vmx_vmexit() must
be due to an L0 bug.  GUEST_RFLAGS is used to induce VMExit as it
is unconditionally loaded on all implementations of VMX, has an
invalid value that is writable on a 32-bit system and its consistency
check is performed relatively early in all implementations (the exact
order of consistency checks is micro-architectural).

Unfortunately, since the "passing" case causes a VMExit, KVM must
be extra diligent to ensure that host state is restored, e.g. DR7
and RFLAGS are reset on VMExit.  Failure to restore RFLAGS.IF is
particularly fatal.

And of course the extra VMEnter and VMExit impacts performance.
The raw overhead of the early consistency checks is ~6% on modern
hardware (though this could easily vary based on configuration),
while the added latency observed from the L1 VMM is ~10%.  The
early consistency checks do not occur in a vacuum, e.g. spending
more time in L0 can lead to more interrupts being serviced while
emulating VMEnter, thereby increasing the latency observed by L1.

Add a module param, early_consistency_checks, to provide control
over whether or not VMX performs the early consistency checks.
In addition to standard on/off behavior, the param accepts a value
of -1, which is essentialy an "auto" setting whereby KVM does
the early checks only when it thinks it's running on bare metal.
When running nested, doing early checks is of dubious value since
the resulting behavior is heavily dependent on L0.  In the future,
the "auto" setting could also be used to default to skipping the
early hardware checks for certain configurations/platforms if KVM
reaches a state where it has 100% coverage of VMFail conditions.

[1] To my knowledge no one has implemented and tested full software
    emulation of the VMFail consistency checks.  Until that happens,
    one can only speculate about the actual performance overhead of
    doing all VMFail consistency checks in software.  Obviously any
    code is slower than no code, but in the grand scheme of nested
    virtualization it's entirely possible the overhead is negligible.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:59 +02:00
Sean Christopherson
5a5e8a15d7 KVM: vmx: write HOST_IA32_EFER in vmx_set_constant_host_state()
EFER is constant in the host and writing it once during setup means
we can skip writing the host value in add_atomic_switch_msr_special().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:59 +02:00
Sean Christopherson
09abb5e3e5 KVM: nVMX: call kvm_skip_emulated_instruction in nested_vmx_{fail,succeed}
... as every invocation of nested_vmx_{fail,succeed} is immediately
followed by a call to kvm_skip_emulated_instruction().  This saves
a bit of code and eliminates some silly paths, e.g. nested_vmx_run()
ended up with a goto label purely used to call and return
kvm_skip_emulated_instruction().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:58 +02:00
Sean Christopherson
c37a6116d8 KVM: nVMX: do not call nested_vmx_succeed() for consistency check VMExit
EFLAGS is set to a fixed value on VMExit, calling nested_vmx_succeed()
is unnecessary and wrong.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:57 +02:00
Sean Christopherson
cb61de2f48 KVM: nVMX: do not skip VMEnter instruction that succeeds
A successful VMEnter is essentially a fancy indirect branch that
pulls the target RIP from the VMCS.  Skipping the instruction is
unnecessary (RIP will get overwritten by the VMExit handler) and
is problematic because it can incorrectly suppress a #DB due to
EFLAGS.TF when a VMFail is detected by hardware (happens after we
skip the instruction).

Now that vmx_nested_run() is not prematurely skipping the instr,
use the full kvm_skip_emulated_instruction() in the VMFail path
of nested_vmx_vmexit().  We also need to explicitly update the
GUEST_INTERRUPTIBILITY_INFO when loading vmcs12 host state.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:57 +02:00
Sean Christopherson
16fb9a46c5 KVM: nVMX: do early preparation of vmcs02 before check_vmentry_postreqs()
In anticipation of using vmcs02 to do early consistency checks, move
the early preparation of vmcs02 prior to checking the postreqs.  The
downside of this approach is that we'll unnecessary load vmcs02 in
the case that check_vmentry_postreqs() fails, but that is essentially
our slow path anyways (not actually slow, but it's the path we don't
really care about optimizing).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:56 +02:00
Sean Christopherson
9d6105b2b5 KVM: nVMX: initialize vmcs02 constant exactly once (per VMCS)
Add a dedicated flag to track if vmcs02 has been initialized, i.e.
the constant state for vmcs02 has been written to the backing VMCS.
The launched flag (in struct loaded_vmcs) gets cleared on logical
CPU migration to mirror hardware behavior[1], i.e. using the launched
flag to determine whether or not vmcs02 constant state needs to be
initialized results in unnecessarily re-initializing the VMCS when
migrating between logical CPUS.

[1] The active VMCS needs to be VMCLEARed before it can be migrated
    to a different logical CPU.  Hardware's VMCS cache is per-CPU
    and is not coherent between CPUs.  VMCLEAR flushes the cache so
    that any dirty data is written back to memory.  A side effect
    of VMCLEAR is that it also clears the VMCS's internal launch
    flag, which KVM must mirror because VMRESUME must be used to
    run a previously launched VMCS.

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:56 +02:00
Sean Christopherson
09abe32002 KVM: nVMX: split pieces of prepare_vmcs02() to prepare_vmcs02_early()
Add prepare_vmcs02_early() and move pieces of prepare_vmcs02() to the
new function.  prepare_vmcs02_early() writes the bits of vmcs02 that
a) must be in place to pass the VMFail consistency checks (assuming
vmcs12 is valid) and b) are needed recover from a VMExit, e.g. host
state that is loaded on VMExit.  Splitting the functionality will
enable KVM to leverage hardware to do VMFail consistency checks via
a dry run of VMEnter and recover from a potential VMExit without
having to fully initialize vmcs02.

Add prepare_vmcs02_constant_state() to handle writing vmcs02 state that
comes from vmcs01 and never changes, i.e. we don't need to rewrite any
of the vmcs02 that is effectively constant once defined.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:55 +02:00
Sean Christopherson
860ff2aa84 KVM: VMX: remove ASSERT() on vmx->pml_pg validity
vmx->pml_pg is allocated by vmx_create_vcpu() and is only nullified
when the vCPU is destroyed by vmx_free_vcpu().  Remove the ASSERTs
on vmx->pml_pg, there is no need to carry debug code that provides
no value to the current code base.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:55 +02:00
Sean Christopherson
39f9c3885c KVM: vVMX: rename label for post-enter_guest_mode consistency check
Rename 'fail' to 'vmentry_fail_vmexit_guest_mode' to make it more
obvious that it's simply a different entry point to the VMExit path,
whose purpose is unwind the updates done prior to calling
prepare_vmcs02().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:54 +02:00
Sean Christopherson
a633e41e73 KVM: nVMX: assimilate nested_vmx_entry_failure() into nested_vmx_enter_non_root_mode()
Handling all VMExits due to failed consistency checks on VMEnter in
nested_vmx_enter_non_root_mode() consolidates all relevant code into
a single location, and removing nested_vmx_entry_failure() eliminates
a confusing function name and label.  For a VMEntry, "fail" and its
derivatives has a very specific meaning due to the different behavior
of a VMEnter VMFail versus VMExit, i.e. it wasn't obvious that
nested_vmx_entry_failure() handled VMExit scenarios.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:53 +02:00
Sean Christopherson
7671ce21b1 KVM: nVMX: move check_vmentry_postreqs() call to nested_vmx_enter_non_root_mode()
In preparation of supporting checkpoint/restore for nested state,
commit ca0bde28f2 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()")
modified check_vmentry_postreqs() to only perform the guest EFER
consistency checks when nested_run_pending is true.  But, in the
normal nested VMEntry flow, nested_run_pending is only set after
check_vmentry_postreqs(), i.e. the consistency check is being skipped.

Alternatively, nested_run_pending could be set prior to calling
check_vmentry_postreqs() in nested_vmx_run(), but placing the
consistency checks in nested_vmx_enter_non_root_mode() allows us
to split prepare_vmcs02() and interleave the preparation with
the consistency checks without having to change the call sites
of nested_vmx_enter_non_root_mode().  In other words, the rest
of the consistency check code in nested_vmx_run() will be joining
the postreqs checks in future patches.

Fixes: ca0bde28f2 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:53 +02:00
Sean Christopherson
d63907dc7d KVM: nVMX: rename enter_vmx_non_root_mode to nested_vmx_enter_non_root_mode
...to be more consistent with the nested VMX nomenclature.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:52 +02:00
Sean Christopherson
3df5c37e55 KVM: nVMX: try to set EFER bits correctly when initializing controls
VM_ENTRY_IA32E_MODE and VM_{ENTRY,EXIT}_LOAD_IA32_EFER will be
explicitly set/cleared as needed by vmx_set_efer(), but attempt
to get the bits set correctly when intializing the control fields.
Setting the value correctly can avoid multiple VMWrites.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:52 +02:00
Sean Christopherson
02343cf207 KVM: vmx: do not unconditionally clear EFER switching
Do not unconditionally call clear_atomic_switch_msr() when updating
EFER.  This adds up to four unnecessary VMWrites in the case where
guest_efer != host_efer, e.g. if the load_on_{entry,exit} bits were
already set.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:51 +02:00
Sean Christopherson
b7031fd40f KVM: nVMX: reset cache/shadows when switching loaded VMCS
Reset the vm_{entry,exit}_controls_shadow variables as well as the
segment cache after loading a new VMCS in vmx_switch_vmcs().  The
shadows/cache track VMCS data, i.e. they're stale every time we
switch to a new VMCS regardless of reason.

This fixes a bug where stale control shadows would be consumed after
a nested VMExit due to a failed consistency check.

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:50 +02:00
Sean Christopherson
1abf23fb42 KVM: nVMX: use vm_exit_controls_init() to write exit controls for vmcs02
Write VM_EXIT_CONTROLS using vm_exit_controls_init() when configuring
vmcs02, otherwise vm_exit_controls_shadow will be stale.  EFER in
particular can be corrupted if VM_EXIT_LOAD_IA32_EFER is not updated
due to an incorrect shadow optimization, which can crash L0 due to
EFER not being loaded on exit.  This does not occur with the current
code base simply because update_transition_efer() unconditionally
clears VM_EXIT_LOAD_IA32_EFER before conditionally setting it, and
because a nested guest always starts with VM_EXIT_LOAD_IA32_EFER
clear, i.e. we'll only ever unnecessarily clear the bit.  That is,
until someone optimizes update_transition_efer()...

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:50 +02:00
Sean Christopherson
5b8ba41daf KVM: nVMX: move vmcs12 EPTP consistency check to check_vmentry_prereqs()
An invalid EPTP causes a VMFail(VMXERR_ENTRY_INVALID_CONTROL_FIELD),
not a VMExit.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:49 +02:00
Sean Christopherson
64a919f7b5 KVM: nVMX: move host EFER consistency checks to VMFail path
Invalid host state related to loading EFER on VMExit causes a
VMFail(VMXERR_ENTRY_INVALID_HOST_STATE_FIELD), not a VMExit.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:49 +02:00
Jim Mattson
3c6e099fa1 KVM: nVMX: Always reflect #NM VM-exits to L1
When bit 3 (corresponding to CR0.TS) of the VMCS12 cr0_guest_host_mask
field is clear, the VMCS12 guest_cr0 field does not necessarily hold
the current value of the L2 CR0.TS bit, so the code that checked for
L2's CR0.TS bit being set was incorrect. Moreover, I'm not sure that
the CR0.TS check was adequate. (What if L2's CR0.EM was set, for
instance?)

Fortunately, lazy FPU has gone away, so L0 has lost all interest in
intercepting #NM exceptions. See commit bd7e5b0899 ("KVM: x86:
remove code for lazy FPU handling"). Therefore, there is no longer any
question of which hypervisor gets first dibs. The #NM VM-exit should
always be reflected to L1. (Note that the corresponding bit must be
set in the VMCS12 exception_bitmap field for there to be an #NM
VM-exit at all.)

Fixes: ccf9844e5d ("kvm, vmx: Really fix lazy FPU on nested guest")
Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Tested-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:47 +02:00