Commit graph

5131 commits

Author SHA1 Message Date
Ben Gardon
4183683918 kvm: vmx: Add memcg accounting to KVM allocations
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.

Tested:
	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
	introduced no new failures.
	Ran a kernel memory accounting test which creates a VM to touch
	memory and then checks that the kernel memory allocated for the
	process is within certain bounds.
	With this patch we account for much more of the vmalloc and slab memory
	allocated for the VM.

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:31 +01:00
Ben Gardon
1ec696470c kvm: svm: Add memcg accounting to KVM allocations
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.

Tested:
	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
	introduced no new failures.
	Ran a kernel memory accounting test which creates a VM to touch
	memory and then checks that the kernel memory allocated for the
	process is within certain bounds.
	With this patch we account for much more of the vmalloc and slab memory
	allocated for the VM.

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:31 +01:00
Ben Gardon
254272ce65 kvm: x86: Add memcg accounting to KVM allocations
There are many KVM kernel memory allocations which are tied to the life of
the VM process and should be charged to the VM process's cgroup. If the
allocations aren't tied to the process, the OOM killer will not know
that killing the process will free the associated kernel memory.
Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
charged to the VM process's cgroup.

Tested:
	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
	introduced no new failures.
	Ran a kernel memory accounting test which creates a VM to touch
	memory and then checks that the kernel memory allocated for the
	process is within certain bounds.
	With this patch we account for much more of the vmalloc and slab memory
	allocated for the VM.

There remain a few allocations which should be charged to the VM's
cgroup but are not. In x86, they include:
	vcpu->arch.pio_data
There allocations are unaccounted in this patch because they are mapped
to userspace, and accounting them to a cgroup causes problems. This
should be addressed in a future patch.

Signed-off-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:30 +01:00
Paolo Bonzini
359a6c3ddc KVM: nVMX: do not start the preemption timer hrtimer unnecessarily
The preemption timer can be started even if there is a vmentry
failure during or after loading guest state.  That is pointless,
move the call after all conditions have been checked.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:29 +01:00
Yu Zhang
d92935979a kvm: vmx: Fix typos in vmentry/vmexit control setting
Previously, 'commit f99e3daf94 ("KVM: x86: Add Intel PT
virtualization work mode")' work mode' offered framework
to support Intel PT virtualization. However, the patch has
some typos in vmx_vmentry_ctrl() and vmx_vmexit_ctrl(), e.g.
used wrong flags and wrong variable, which will cause the
VM entry failure later.

Fixes: 'commit f99e3daf94 ("KVM: x86: Add Intel PT virtualization work mode")'
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:28 +01:00
Paolo Bonzini
b4b65b5642 KVM: x86: cleanup freeing of nested state
Ensure that the VCPU free path goes through vmx_leave_nested and
thus nested_vmx_vmexit, so that the cancellation of the timer does
not have to be in free_nested.  In addition, because some paths through
nested_vmx_vmexit do not go through sync_vmcs12, the cancellation of
the timer is moved there.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:27 +01:00
Luwei Kang
81b016676e KVM: x86: Sync the pending Posted-Interrupts
Some Posted-Interrupts from passthrough devices may be lost or
overwritten when the vCPU is in runnable state.

The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will
be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state
but not running on physical CPU). If a posted interrupt coming at this
time, the irq remmaping facility will set the bit of PIR (Posted
Interrupt Requests) without ON (Outstanding Notification).
So this interrupt can't be sync to APIC virtualization register and
will not be handled by Guest because ON is zero.

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
[Eliminate the pi_clear_sn fast path. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:27 +01:00
Liu Jingqi
c029b5deb0 KVM: x86: expose MOVDIR64B CPU feature into VM.
MOVDIR64B moves 64-bytes as direct-store with 64-bytes write atomicity.
Direct store is implemented by using write combining (WC) for writing
data directly into memory without caching the data.

Availability of the MOVDIR64B instruction is indicated by the presence
of the CPUID feature flag MOVDIR64B (CPUID.0x07.0x0:ECX[bit 28]).

This patch exposes the movdir64b feature to the guest.

The release document ref below link:
https://software.intel.com/sites/default/files/managed/c5/15/\
architecture-instruction-set-extensions-programming-reference.pdf

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Cc: Xu Tao <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:26 +01:00
Liu Jingqi
74f2370bb6 KVM: x86: expose MOVDIRI CPU feature into VM.
MOVDIRI moves doubleword or quadword from register to memory through
direct store which is implemented by using write combining (WC) for
writing data directly into memory without caching the data.

Availability of the MOVDIRI instruction is indicated by the presence of
the CPUID feature flag MOVDIRI(CPUID.0x07.0x0:ECX[bit 27]).

This patch exposes the movdiri feature to the guest.

The release document ref below link:
https://software.intel.com/sites/default/files/managed/c5/15/\
architecture-instruction-set-extensions-programming-reference.pdf

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Cc: Xu Tao <tao3.xu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:26 +01:00
Kai Huang
8acc0993e3 kvm, x86, mmu: Use kernel generic dynamic physical address mask
AMD's SME/SEV is no longer the only case which reduces supported
physical address bits, since Intel introduced Multi-key Total Memory
Encryption (MKTME), which repurposes high bits of physical address as
keyID, thus effectively shrinks supported physical address bits. To
cover both cases (and potential similar future features), kernel MM
introduced generic dynamaic physical address mask instead of hard-coded
__PHYSICAL_MASK in 'commit 94d49eb30e ("x86/mm: Decouple dynamic
__PHYSICAL_MASK from AMD SME")'. KVM should use that too.

Change PT64_BASE_ADDR_MASK to use kernel dynamic physical address mask
when it is enabled, instead of sme_clr. PT64_DIR_BASE_ADDR_MASK is also
deleted since it is not used at all.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:25 +01:00
Paolo Bonzini
e0dfacbfe9 KVM: nVMX: remove useless is_protmode check
VMX is only accessible in protected mode, remove a confusing check
that causes the conditional to lack a final "else" branch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:24 +01:00
Sean Christopherson
34333cc6c2 KVM: nVMX: Ignore limit checks on VMX instructions using flat segments
Regarding segments with a limit==0xffffffff, the SDM officially states:

    When the effective limit is FFFFFFFFH (4 GBytes), these accesses may
    or may not cause the indicated exceptions.  Behavior is
    implementation-specific and may vary from one execution to another.

In practice, all CPUs that support VMX ignore limit checks for "flat
segments", i.e. an expand-up data or code segment with base=0 and
limit=0xffffffff.  This is subtly different than wrapping the effective
address calculation based on the address size, as the flat segment
behavior also applies to accesses that would wrap the 4g boundary, e.g.
a 4-byte access starting at 0xffffffff will access linear addresses
0xffffffff, 0x0, 0x1 and 0x2.

Fixes: f9eb4af67c ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:24 +01:00
Sean Christopherson
8570f9e881 KVM: nVMX: Apply addr size mask to effective address for VMX instructions
The address size of an instruction affects the effective address, not
the virtual/linear address.  The final address may still be truncated,
e.g. to 32-bits outside of long mode, but that happens irrespective of
the address size, e.g. a 32-bit address size can yield a 64-bit virtual
address when using FS/GS with a non-zero base.

Fixes: 064aea7747 ("KVM: nVMX: Decoding memory operands of VMX instructions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:23 +01:00
Sean Christopherson
946c522b60 KVM: nVMX: Sign extend displacements of VMX instr's mem operands
The VMCS.EXIT_QUALIFCATION field reports the displacements of memory
operands for various instructions, including VMX instructions, as a
naturally sized unsigned value, but masks the value by the addr size,
e.g. given a ModRM encoded as -0x28(%ebp), the -0x28 displacement is
reported as 0xffffffd8 for a 32-bit address size.  Despite some weird
wording regarding sign extension, the SDM explicitly states that bits
beyond the instructions address size are undefined:

    In all cases, bits of this field beyond the instruction’s address
    size are undefined.

Failure to sign extend the displacement results in KVM incorrectly
treating a negative displacement as a large positive displacement when
the address size of the VMX instruction is smaller than KVM's native
size, e.g. a 32-bit address size on a 64-bit KVM.

The very original decoding, added by commit 064aea7747 ("KVM: nVMX:
Decoding memory operands of VMX instructions"), sort of modeled sign
extension by truncating the final virtual/linear address for a 32-bit
address size.  I.e. it messed up the effective address but made it work
by adjusting the final address.

When segmentation checks were added, the truncation logic was kept
as-is and no sign extension logic was introduced.  In other words, it
kept calculating the wrong effective address while mostly generating
the correct virtual/linear address.  As the effective address is what's
used in the segment limit checks, this results in KVM incorreclty
injecting #GP/#SS faults due to non-existent segment violations when
a nested VMM uses negative displacements with an address size smaller
than KVM's native address size.

Using the -0x28(%ebp) example, an EBP value of 0x1000 will result in
KVM using 0x100000fd8 as the effective address when checking for a
segment limit violation.  This causes a 100% failure rate when running
a 32-bit KVM build as L1 on top of a 64-bit KVM L0.

Fixes: f9eb4af67c ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:22 +01:00
Suthikulpanit, Suravee
c57cd3c89e svm: Fix improper check when deactivate AVIC
The function svm_refresh_apicv_exec_ctrl() always returning prematurely
as kvm_vcpu_apicv_active() always return false when calling from
the function arch/x86/kvm/x86.c:kvm_vcpu_deactivate_apicv().
This is because the apicv_active is set to false just before calling
refresh_apicv_exec_ctrl().

Also, we need to mark VMCB_AVIC bit as dirty instead of VMCB_INTR.

So, fix svm_refresh_apicv_exec_ctrl() to properly deactivate AVIC.

Fixes: 67034bb9dd ('KVM: SVM: Add irqchip_split() checks before enabling AVIC')
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:22 +01:00
Paolo Bonzini
f7589cca50 KVM: x86: cull apicv code when userspace irqchip is requested
Currently apicv_active can be true even if in-kernel LAPIC
emulation is disabled.  Avoid this by properly initializing
it in kvm_arch_vcpu_init, and then do not do anything to
deactivate APICv when it is actually not used

(Currently APICv is only deactivated by SynIC code that in turn
is only reachable when in-kernel LAPIC is in use.  However, it is
cleaner if kvm_vcpu_deactivate_apicv avoids relying on this.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:21 +01:00
Suthikulpanit, Suravee
98d90582be svm: Fix AVIC DFR and LDR handling
Current SVM AVIC driver makes two incorrect assumptions:
  1. APIC LDR register cannot be zero
  2. APIC DFR for all vCPUs must be the same

LDR=0 means the local APIC does not support logical destination mode.
Therefore, the driver should mark any previously assigned logical APIC ID
table entry as invalid, and return success.  Also, DFR is specific to
a particular local APIC, and can be different among all vCPUs
(as observed on Windows 10).

These incorrect assumptions cause Windows 10 and FreeBSD VMs to fail
to boot with AVIC enabled. So, instead of flush the whole logical APIC ID
table, handle DFR and LDR for each vCPU independently.

Fixes: 18f40c53e1 ('svm: Add VMEXIT handlers for AVIC')
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Julian Stecklina <jsteckli@amazon.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:20 +01:00
Sean Christopherson
4f44c4eec5 KVM: VMX: Reorder clearing of registers in the vCPU-run assembly flow
Move the clearing of the common registers (not 64-bit-only) to the start
of the flow that clears registers holding guest state.  This is
purely a cosmetic change so that the label doesn't point at a blank line
and a #define.

No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:18 +01:00
Sean Christopherson
fc2ba5a27a KVM: VMX: Call vCPU-run asm sub-routine from C and remove clobbering
...now that the sub-routine follows standard calling conventions.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:18 +01:00
Sean Christopherson
3b895ef486 KVM: VMX: Preserve callee-save registers in vCPU-run asm sub-routine
...to make it callable from C code.

Note that because KVM chooses to be ultra paranoid about guest register
values, all callee-save registers are still cleared after VM-Exit even
though the host's values are now reloaded from the stack.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:17 +01:00
Sean Christopherson
e75c3c3a04 KVM: VMX: Return VM-Fail from vCPU-run assembly via standard ABI reg
...to prepare for making the assembly sub-routine callable from C code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:17 +01:00
Sean Christopherson
77df549559 KVM: VMX: Pass @launched to the vCPU-run asm via standard ABI regs
...to prepare for making the sub-routine callable from C code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:16 +01:00
Sean Christopherson
a62fd5a76c KVM: VMX: Use RAX as the scratch register during vCPU-run
...to prepare for making the sub-routine callable from C code.  That
means returning the result in RAX.  Since RAX will be used to return the
result, use it as the scratch register as well to make the code readable
and to document that the scratch register is more or less arbitrary.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:15 +01:00
Sean Christopherson
ee2fc635ef KVM: VMX: Rename ____vmx_vcpu_run() to __vmx_vcpu_run()
...now that the name is no longer usurped by a defunct helper function.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:15 +01:00
Sean Christopherson
c823dd5c0f KVM: VMX: Fold __vmx_vcpu_run() back into vmx_vcpu_run()
...now that the code is no longer tagged with STACK_FRAME_NON_STANDARD.
Arguably, providing __vmx_vcpu_run() to break up vmx_vcpu_run() is
valuable on its own, but the previous split was purposely made as small
as possible to limit the effects STACK_FRAME_NON_STANDARD.  In other
words, the current split is now completely arbitrary and likely not the
most logical.

This also allows renaming ____vmx_vcpu_run() to __vmx_vcpu_run() in a
future patch.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:14 +01:00
Sean Christopherson
5e0781df18 KVM: VMX: Move vCPU-run code to a proper assembly routine
As evidenced by the myriad patches leading up to this moment, using
an inline asm blob for vCPU-run is nothing short of horrific.  It's also
been called "unholy", "an abomination" and likely a whole host of other
names that would violate the Code of Conduct if recorded here and now.

The code is relocated nearly verbatim, e.g. quotes, newlines, tabs and
__stringify need to be dropped, but other than those cosmetic changes
the only functional changees are to add the "call" and replace the final
"jmp" with a "ret".

Note that STACK_FRAME_NON_STANDARD is also dropped from __vmx_vcpu_run().

Suggested-by: Andi Kleen <ak@linux.intel.com>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:13 +01:00
Sean Christopherson
63c73aa07f KVM: VMX: Create a stack frame in vCPU-run
...in preparation for moving to a proper assembly sub-routnine.
vCPU-run isn't a leaf function since it calls vmx_update_host_rsp()
and vmx_vmenter().  And since we need to save/restore RBP anyways,
unconditionally creating the frame costs a single MOV, i.e. don't
bother keying off CONFIG_FRAME_POINTER or using FRAME_BEGIN, etc...

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:13 +01:00
Sean Christopherson
c14f9dd50b KVM: VMX: Use #defines in place of immediates in VM-Enter inline asm
...to prepare for moving the inline asm to a proper asm sub-routine.
Eliminating the immediates allows a nearly verbatim move, e.g. quotes,
newlines, tabs and __stringify need to be dropped, but other than those
cosmetic changes the only function change will be to replace the final
"jmp" with a "ret".

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-20 22:48:12 +01:00
Sean Christopherson
d558920491 KVM: VMX: Use vcpu->arch.regs directly when saving/loading guest state
...now that all other references to struct vcpu_vmx have been removed.

Note that 'vmx' still needs to be passed into the asm blob in _ASM_ARG1
as it is consumed by vmx_update_host_rsp().  And similar to that code,
use _ASM_ARG2 in the assembly code to prepare for moving to proper asm,
while explicitly referencing the exact registers in the clobber list for
clarity in the short term and to avoid additional precompiler games.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:26 +01:00
Sean Christopherson
f78d0971b7 KVM: VMX: Don't save guest registers after VM-Fail
A failed VM-Enter (obviously) didn't succeed, meaning the CPU never
executed an instrunction in guest mode and so can't have changed the
general purpose registers.

In addition to saving some instructions in the VM-Fail case, this also
provides a separate path entirely and thus an opportunity to propagate
the fail condition to vmx->fail via register without introducing undue
pain.  Using a register, as opposed to directly referencing vmx->fail,
eliminates the need to pass the offset of 'fail', which will simplify
moving the code to proper assembly in future patches.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:26 +01:00
Sean Christopherson
217aaff53c KVM: VMX: Invert the ordering of saving guest/host scratch reg at VM-Enter
Switching the ordering allows for an out-of-line path for VM-Fail
that elides saving guest state but still shares the register clearing
with the VM-Exit path.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:25 +01:00
Sean Christopherson
c9afc58cc3 KVM: VMX: Pass "launched" directly to the vCPU-run asm blob
...and remove struct vcpu_vmx's temporary __launched variable.

Eliminating __launched is a bonus, the real motivation is to get to the
point where the only reference to struct vcpu_vmx in the asm code is
to vcpu.arch.regs, which will simplify moving the blob to a proper asm
file.  Note that also means this approach is deliberately different than
what is used in nested_vmx_check_vmentry_hw().

Use BL as it is a callee-save register in both 32-bit and 64-bit ABIs,
i.e. it can't be modified by vmx_update_host_rsp(), to avoid having to
temporarily save/restore the launched flag.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:24 +01:00
Sean Christopherson
c09b03eb7f KVM: VMX: Update VMCS.HOST_RSP via helper C function
Providing a helper function to update HOST_RSP is visibly easier to
read, and more importantly (for the future) eliminates two arguments to
the VM-Enter assembly blob.  Reducing the number of arguments to the asm
blob is for all intents and purposes a prerequisite to moving the code
to a proper assembly routine.  It's not truly mandatory, but it greatly
simplifies the future code, and the cost of the extra CALL+RET is
negligible in the grand scheme.

Note that although _ASM_ARG[1-3] can be used in the inline asm itself,
the intput/output constraints need to be manually defined.  gcc will
actually compile with _ASM_ARG[1-3] specified as constraints, but what
it actually ends up doing with the bogus constraint is unknown.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:24 +01:00
Sean Christopherson
47e97c099b KVM: VMX: Load/save guest CR2 via C code in __vmx_vcpu_run()
...to eliminate its parameter and struct vcpu_vmx offset definition
from the assembly blob.  Accessing CR2 from C versus assembly doesn't
change the likelihood of taking a page fault (and modifying CR2) while
it's loaded with the guest's value, so long as we don't do anything
silly between accessing CR2 and VM-Enter/VM-Exit.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:23 +01:00
Sean Christopherson
5a8781607e KVM: nVMX: Cache host_rsp on a per-VMCS basis
Currently, host_rsp is cached on a per-vCPU basis, i.e. it's stored in
struct vcpu_vmx.  In non-nested usage the caching is for all intents
and purposes 100% effective, e.g. only the first VMLAUNCH needs to
synchronize VMCS.HOST_RSP since the call stack to vmx_vcpu_run() is
identical each and every time.  But when running a nested guest, KVM
must invalidate the cache when switching the current VMCS as it can't
guarantee the new VMCS has the same HOST_RSP as the previous VMCS.  In
other words, the cache loses almost all of its efficacy when running a
nested VM.

Move host_rsp to struct vmcs_host_state, which is per-VMCS, so that it
is cached on a per-VMCS basis and restores its 100% hit rate when
nested VMs are in play.

Note that the host_rsp cache for vmcs02 essentially "breaks" when
nested early checks are enabled as nested_vmx_check_vmentry_hw() will
see a different RSP at the time of its VM-Enter.  While it's possible
to avoid even that VMCS.HOST_RSP synchronization, e.g. by employing a
dedicated VM-Exit stack, there is little motivation for doing so as
the overhead of two VMWRITEs (~55 cycles) is dwarfed by the overhead
of the extra VMX transition (600+ cycles) and is a proverbial drop in
the ocean relative to the total cost of a nested transtion (10s of
thousands of cycles).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:22 +01:00
Sean Christopherson
fbda0fd31a KVM: nVMX: Let the compiler select the reg for holding HOST_RSP
...and provide an explicit name for the constraint.  Naming the input
constraint makes the code self-documenting and also avoids the fragility
of numerically referring to constraints, e.g. %4 breaks badly whenever
the constraints are modified.

Explicitly using RDX was inherited from vCPU-run, i.e. completely
arbitrary.  Even vCPU-run doesn't truly need to explicitly use RDX, but
doing so is more robust as vCPU-run needs tight control over its
register usage.

Note that while the naming "conflict" between host_rsp and HOST_RSP
is slightly confusing, the former will be renamed slightly in a
future patch, at which point HOST_RSP is absolutely what is desired.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:21 +01:00
Sean Christopherson
74dfa2784e KVM: nVMX: Reference vmx->loaded_vmcs->launched directly
Temporarily propagating vmx->loaded_vmcs->launched to vmx->__launched
is not functionally necessary, but rather was done historically to
avoid passing both 'vmx' and 'loaded_vmcs' to the vCPU-run asm blob.
Nested early checks inherited this behavior by virtue of copy+paste.

A future patch will move HOST_RSP caching to be per-VMCS, i.e. store
'host_rsp' in loaded VMCS.  Now that the reference to 'vmx->fail' is
also gone from nested early checks, referencing 'loaded_vmcs' directly
means we can drop the 'vmx' reference when introducing per-VMCS RSP
caching.  And it means __launched can be dropped from struct vcpu_vmx
if/when vCPU-run receives similar treatment.

Note the use of a named register constraint for 'loaded_vmcs'.  Using
RCX to hold 'vmx' was inherited from vCPU-run.  In the vCPU-run case,
the scratch register needs to be explicitly defined as it is crushed
when loading guest state, i.e. deferring to the compiler would corrupt
the pointer.  Since nested early checks never loads guests state, it's
a-ok to let the compiler pick any register.  Naming the constraint
avoids the fragility of referencing constraints via %1, %2, etc.., which
breaks horribly when modifying constraints, and generally makes the asm
blob more readable.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:21 +01:00
Sean Christopherson
bbc0b82392 KVM: nVMX: Capture VM-Fail via CC_{SET,OUT} in nested early checks
...to take advantage of __GCC_ASM_FLAG_OUTPUTS__ when possible.

Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:20 +01:00
Sean Christopherson
f1727b4954 KVM: nVMX: Capture VM-Fail to a local var in nested_vmx_check_vmentry_hw()
Unlike the primary vCPU-run flow, the nested early checks code doesn't
actually want to propagate VM-Fail back to 'vmx'.  Yay copy+paste.

In additional to eliminating the need to clear vmx->fail before
returning, using a local boolean also drops a reference to 'vmx' in the
asm blob.  Dropping the reference to 'vmx' will save a register in the
long run as future patches will shift all pointer references from 'vmx'
to 'vmx->loaded_vmcs'.

Fixes: 52017608da ("KVM: nVMX: add option to perform early consistency checks via H/W")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:19 +01:00
Sean Christopherson
6c1e7e5b40 KVM: nVMX: Explicitly reference the scratch reg in nested early checks
Using %1 to reference RCX, i.e. the 'vmx' pointer', is obtuse and
fragile, e.g. it results in cryptic and infurating compile errors if the
output constraints are touched by anything more than a gentle breeze.

Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:19 +01:00
Sean Christopherson
98ff2acc91 KVM: nVMX: Drop STACK_FRAME_NON_STANDARD from nested_vmx_check_vmentry_hw()
...as it doesn't technically actually do anything non-standard with the
stack even though it modifies RSP in a weird way.  E.g. RSP is loaded
with VMCS.HOST_RSP if the VM-Enter gets far enough to trigger VM-Exit,
but it's simply reloaded with the current value.

Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:18 +01:00
Sean Christopherson
9ce0a07a6f KVM: nVMX: Remove a rogue "rax" clobber from nested_vmx_check_vmentry_hw()
RAX is not touched by nested_vmx_check_vmentry_hw(), directly or
indirectly (e.g. vmx_vmenter()).  Remove it from the clobber list.

Fixes: 52017608da ("KVM: nVMX: add option to perform early consistency checks via H/W")
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:17 +01:00
Sean Christopherson
6f7c6d23b7 KVM: VMX: Let the compiler save/load RDX during vCPU-run
Per commit c20363006a ("KVM: VMX: Let gcc to choose which registers
to save (x86_64)"), the only reason RDX is saved/loaded to/from the
stack is because it was specified as an input, i.e. couldn't be marked
as clobbered (ignoring the fact that "saving" it to a dummy output
would indirectly mark it as clobbered).

Now that RDX is no longer an input, clobber it.

Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:17 +01:00
Sean Christopherson
ccf447434e KVM: VMX: Manually load RDX in vCPU-run asm blob
Load RDX with the VMCS.HOST_RSP field encoding on-demand instead of
delegating to the compiler via an input constraint.  In addition to
saving one whole MOV instruction, this allows RDX to be properly
clobbered (in a future patch) instead of being saved/loaded to/from
the stack.

Despite nested_vmx_check_vmentry_hw() having similar code, leave it
alone, for now.  In that case, RDX is unconditionally used and isn't
clobbered, i.e. sending in HOST_RSP as an input is simpler.

Note that because HOST_RSP is an enum and not a define, it must be
redefined as an immediate instead of using __stringify(HOST_RSP).  The
naming "conflict" between host_rsp and HOST_RSP is slightly confusing,
but the former will be removed in a future patch, at which point
HOST_RSP is absolutely what is desired.

Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:16 +01:00
Sean Christopherson
f3689e3f17 KVM: VMX: Save RSI to an unused output in the vCPU-run asm blob
RSI is clobbered by the vCPU-run asm blob, but it's not marked as such,
probably because GCC doesn't let you mark inputs as clobbered.  "Save"
RSI to a dummy output so that GCC recognizes it as being clobbered.

Fixes: 773e8a0425 ("x86/kvm: use Enlightened VMCS when running on Hyper-V")
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:15 +01:00
Sean Christopherson
831a301129 KVM: VMX: Modify only RSP when creating a placeholder for guest's RCX
In the vCPU-run asm blob, the guest's RCX is temporarily saved onto the
stack after VM-Exit as the exit flow must first load a register with a
pointer to the vCPU's save area in order to save the guest's registers.
RCX is arbitrarily designated as the scratch register.

Since the stack usage is to (1)save host, (2)save guest, (3)load host
and (4)load guest, the code can't conform to the stack's natural FIFO
semantics, i.e. it can't simply do PUSH/POP.  Regardless of whether it
is done for the host's value or guest's value, at some point the code
needs to access the stack using a non-traditional method, e.g. MOV
instead of POP.  vCPU-run opts to create a placeholder on the stack for
guest's RCX (by adjusting RSP) and saves RCX to its place immediately
after VM-Exit (via MOV).

In other words, the purpose of the first 'PUSH RCX' at the start of
the vCPU-run asm blob  is to adjust RSP down, i.e. there's no need to
actually access memory.  Use 'SUB $wordsize, RSP' instead of 'PUSH RCX'
to make it more obvious that the intent is simply to create a gap on
the stack for the guest's RCX.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:14 +01:00
Sean Christopherson
0e0ab73c9a KVM: VMX: Zero out *all* general purpose registers after VM-Exit
...except RSP, which is restored by hardware as part of VM-Exit.

Paolo theorized that restoring registers from the stack after a VM-Exit
in lieu of zeroing them could lead to speculative execution with the
guest's values, e.g. if the stack accesses miss the L1 cache[1].
Zeroing XORs are dirt cheap, so just be ultra-paranoid.

Note that the scratch register (currently RCX) used to save/restore the
guest state is also zeroed as its host-defined value is loaded via the
stack, just with a MOV instead of a POP.

[1] https://patchwork.kernel.org/patch/10771539/#22441255

Fixes: 0cb5b30698 ("kvm: vmx: Scrub hardware GPRs at VM-exit")
Cc: <stable@vger.kernel.org>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:14 +01:00
Sean Christopherson
1ce072cbfd KVM: nVMX: Check a single byte for VMCS "launched" in nested early checks
Nested early checks does a manual comparison of a VMCS' launched status
in its asm blob to execute the correct VM-Enter instruction, i.e.
VMLAUNCH vs. VMRESUME.  The launched flag is a bool, which is a typedef
of _Bool.  C99 does not define an exact size for _Bool, stating only
that is must be large enough to hold '0' and '1'.  Most, if not all,
compilers use a single byte for _Bool, including gcc[1].

The use of 'cmpl' instead of 'cmpb' was not deliberate, but rather the
result of a copy-paste as the asm blob was directly derived from the asm
blob for vCPU-run.

This has not caused any known problems, likely due to compilers aligning
variables to 4-byte or 8-byte boundaries and KVM zeroing out struct
vcpu_vmx during allocation.  I.e. vCPU-run accesses "junk" data, it just
happens to always be zero and so doesn't affect the result.

[1] https://gcc.gnu.org/ml/gcc-patches/2000-10/msg01127.html

Fixes: 52017608da ("KVM: nVMX: add option to perform early consistency checks via H/W")
Cc: <stable@vger.kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:13 +01:00
Sean Christopherson
61c08aa960 KVM: VMX: Compare only a single byte for VMCS' "launched" in vCPU-run
The vCPU-run asm blob does a manual comparison of a VMCS' launched
status to execute the correct VM-Enter instruction, i.e. VMLAUNCH vs.
VMRESUME.  The launched flag is a bool, which is a typedef of _Bool.
C99 does not define an exact size for _Bool, stating only that is must
be large enough to hold '0' and '1'.  Most, if not all, compilers use
a single byte for _Bool, including gcc[1].

Originally, 'launched' was of type 'int' and so the asm blob used 'cmpl'
to check the launch status.  When 'launched' was moved to be stored on a
per-VMCS basis, struct vcpu_vmx's "temporary" __launched flag was added
in order to avoid having to pass the current VMCS into the asm blob.
The new  '__launched' was defined as a 'bool' and not an 'int', but the
'cmp' instruction was not updated.

This has not caused any known problems, likely due to compilers aligning
variables to 4-byte or 8-byte boundaries and KVM zeroing out struct
vcpu_vmx during allocation.  I.e. vCPU-run accesses "junk" data, it just
happens to always be zero and so doesn't affect the result.

[1] https://gcc.gnu.org/ml/gcc-patches/2000-10/msg01127.html

Fixes: d462b81923 ("KVM: VMX: Keep list of loaded VMCSs, instead of vcpus")
Cc: <stable@vger.kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-12 13:12:12 +01:00
Peter Shier
ecec76885b KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
Bugzilla: 1671904

There are multiple code paths where an hrtimer may have been started to
emulate an L1 VMX preemption timer that can result in a call to free_nested
without an intervening L2 exit where the hrtimer is normally
cancelled. Unconditionally cancel in free_nested to cover all cases.

Embargoed until Feb 7th 2019.

Signed-off-by: Peter Shier <pshier@google.com>
Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Cc: stable@kernel.org
Message-Id: <20181011184646.154065-1-pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-02-07 19:03:01 +01:00