linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-10-04 08:08:54 +00:00

Author	SHA1	Message	Date
Paolo Bonzini	0f02bd0ade	KVM: nVMX: check for required but missing VMCS12 in KVM_SET_NESTED_STATE A missing VMCS12 was not causing -EINVAL (it was just read with copy_from_user, so it is not a security issue, but it is still wrong). Test for VMCS12 validity and reject the nested state if a VMCS12 is required but not present. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-27 09:04:49 -04:00
Thomas Gleixner	72c3c0fe54	x86/kvm: Use generic xfer to guest work function Use the generic infrastructure to check for and handle pending work before transitioning into guest mode. This now handles TIF_NOTIFY_RESUME as well which was ignored so far. Handling it is important as this covers task work and task work will be used to offload the heavy lifting of POSIX CPU timers to thread context. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200722220520.979724969@linutronix.de	2020-07-24 15:05:01 +02:00
Mohammed Gamal	3edd68399d	KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which allows userspace to query if the underlying architecture would support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly (e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X) The complications in this patch are due to unexpected (but documented) behaviour we see with NPF vmexit handling in AMD processor. If SVM is modified to add guest physical address checks in the NPF and guest #PF paths, we see the followning error multiple times in the 'access' test in kvm-unit-tests: test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001 Dump mapping: address: 0x123400000000 ------L4: 24c3027 ------L3: 24c4027 ------L2: 24c5021 ------L1: 1002000021 This is because the PTE's accessed bit is set by the CPU hardware before the NPF vmexit. This is handled completely by hardware and cannot be fixed in software. Therefore, availability of the new capability depends on a boolean variable allow_smaller_maxphyaddr which is set individually by VMX and SVM init routines. On VMX it's always set to true, on SVM it's only set to true when NPT is not enabled. CC: Tom Lendacky <thomas.lendacky@amd.com> CC: Babu Moger <babu.moger@amd.com> Signed-off-by: Mohammed Gamal <mgamal@redhat.com> Message-Id: <20200710154811.418214-10-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 17:01:53 -04:00
Paolo Bonzini	8c4182bd27	KVM: VMX: optimize #PF injection when MAXPHYADDR does not match Ignore non-present page faults, since those cannot have reserved bits set. When running access.flat with "-cpu Haswell,phys-bits=36", the number of trapped page faults goes down from `8872644` to 3978948. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-9-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 17:01:52 -04:00
Mohammed Gamal	1dbf5d68af	KVM: VMX: Add guest physical address check in EPT violation and misconfig Check guest physical address against its maximum, which depends on the guest MAXPHYADDR. If the guest's physical address exceeds the maximum (i.e. has reserved bits set), inject a guest page fault with PFERR_RSVD_MASK set. This has to be done both in the EPT violation and page fault paths, as there are complications in both cases with respect to the computation of the correct error code. For EPT violations, unfortunately the only possibility is to emulate, because the access type in the exit qualification might refer to an access to a paging structure, rather than to the access performed by the program. Trapping page faults instead is needed in order to correct the error code, but the access type can be obtained from the original error code and passed to gva_to_gpa. The corrections required in the error code are subtle. For example, imagine that a PTE for a supervisor page has a reserved bit set. On a supervisor-mode access, the EPT violation path would trigger. However, on a user-mode access, the processor will not notice the reserved bit and not include PFERR_RSVD_MASK in the error code. Co-developed-by: Mohammed Gamal <mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-8-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 17:01:52 -04:00
Paolo Bonzini	a0c134347b	KVM: VMX: introduce vmx_need_pf_intercept Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-7-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 17:01:52 -04:00
Paolo Bonzini	6986982fef	KVM: x86: rename update_bp_intercept to update_exception_bitmap We would like to introduce a callback to update the #PF intercept when CPUID changes. Just reuse update_bp_intercept since VMX is already using update_exception_bitmap instead of a bespoke function. While at it, remove an unnecessary assignment in the SVM version, which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug) and has nothing to do with the exception bitmap. Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 13:20:18 -04:00
Vitaly Kuznetsov	d574c539c3	KVM: x86: move MSR_IA32_PERF_CAPABILITIES emulation to common x86 code state_test/smm_test selftests are failing on AMD with: "Unexpected result from KVM_GET_MSRS, r: 51 (failed MSR was 0x345)" MSR_IA32_PERF_CAPABILITIES is an emulated MSR on Intel but it is not known to AMD code, we can move the emulation to common x86 code. For AMD, we basically just allow the host to read and write zero to the MSR. Fixes: `27461da310` ("KVM: x86/pmu: Support full width counting") Suggested-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200710152559.1645827-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 12:52:56 -04:00
Paolo Bonzini	83d31e5271	KVM: nVMX: fixes for preemption timer migration Commit `850448f35a` ("KVM: nVMX: Fix VMX preemption timer migration", 2020-06-01) accidentally broke nVMX live migration from older version by changing the userspace ABI. Restore it and, while at it, ensure that vmx->nested.has_preemption_timer_deadline is always initialized according to the KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE flag. Cc: Makarand Sonare <makarandsonare@google.com> Fixes: `850448f35a` ("KVM: nVMX: Fix VMX preemption timer migration") Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-10 06:15:36 -04:00
Thomas Gleixner	2245d39886	x86/kvm/vmx: Use native read/write_cr2() read/write_cr2() go throuh the paravirt XXL indirection, but nested VMX in a XEN_PV guest is not supported. Use the native variants. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Message-Id: <20200708195322.344731916@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 07:08:42 -04:00
Thomas Gleixner	3ebccdf373	x86/kvm/vmx: Move guest enter/exit into .noinstr.text Move the functions which are inside the RCU off region into the non-instrumentable text section. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200708195322.037311579@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 07:08:40 -04:00
Thomas Gleixner	0642391e21	x86/kvm/vmx: Add hardirq tracing to guest enter/exit Entering guest mode is more or less the same as returning to user space. From an instrumentation point of view both leave kernel mode and the transition to guest or user mode reenables interrupts on the host. In user mode an interrupt is served directly and in guest mode it causes a VM exit which then handles or reinjects the interrupt. The transition from guest mode or user mode to kernel mode disables interrupts, which needs to be recorded in instrumentation to set the correct state again. This is important for e.g. latency analysis because otherwise the execution time in guest or user mode would be wrongly accounted as interrupt disabled and could trigger false positives. Add hardirq tracing to guest enter/exit functions in the same way as it is done in the user mode enter/exit code, respecting the RCU requirements. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200708195321.822002354@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 07:08:39 -04:00
Thomas Gleixner	87fa7f3e98	x86/kvm: Move context tracking where it belongs Context tracking for KVM happens way too early in the vcpu_run() code. Anything after guest_enter_irqoff() and before guest_exit_irqoff() cannot use RCU and should also be not instrumented. The current way of doing this covers way too much code. Move it closer to the actual vmenter/exit code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200708195321.724574345@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 07:08:38 -04:00
Maxim Levitsky	841c2be09f	kvm: x86: replace kvm_spec_ctrl_test_value with runtime test on the host To avoid complex and in some cases incorrect logic in kvm_spec_ctrl_test_value, just try the guest's given value on the host processor instead, and if it doesn't #GP, allow the guest to set it. One such case is when host CPU supports STIBP mitigation but doesn't support IBRS (as is the case with some Zen2 AMD cpus), and in this case we were giving guest #GP when it tried to use STIBP The reason why can can do the host test is that IA32_SPEC_CTRL msr is passed to the guest, after the guest sets it to a non zero value for the first time (due to performance reasons), and as as result of this, it is pointless to emulate #GP condition on this first access, in a different way than what the host CPU does. This is based on a patch from Sean Christopherson, who suggested this idea. Fixes: `6441fa6178` ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200708115731.180097-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 07:08:38 -04:00
Xiaoyao Li	7c1b761be0	KVM: x86: Rename cpuid_update() callback to vcpu_after_set_cpuid() The name of callback cpuid_update() is misleading that it's not about updating CPUID settings of vcpu but updating the configurations of vcpu based on the CPUIDs. So rename it to vcpu_after_set_cpuid(). Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200709043426.92712-5-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-09 07:08:18 -04:00
Sean Christopherson	b2656e4d8b	KVM: nVMX: Wrap VM-Fail valid path in generic VM-Fail helper Add nested_vmx_fail() to wrap VM-Fail paths that _may_ result in VM-Fail Valid to make it clear at the call sites that the Valid flavor isn't guaranteed. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200609015607.6994-1-sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:47 -04:00
Jim Mattson	c967118ddb	kvm: x86: Set last_vmentry_cpu in vcpu_enter_guest Since this field is now in kvm_vcpu_arch, clean things up a little by setting it in vendor-agnostic code: vcpu_enter_guest. Note that it must be set after the call to kvm_x86_ops.run(), since it can't be updated before pre_sev_run(). Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Peter Shier <pshier@google.com> Message-Id: <20200603235623.245638-7-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:46 -04:00
Jim Mattson	8a14fe4f0c	kvm: x86: Move last_cpu into kvm_vcpu_arch as last_vmentry_cpu Both the vcpu_vmx structure and the vcpu_svm structure have a 'last_cpu' field. Move the common field into the kvm_vcpu_arch structure. For clarity, rename it to 'last_vmentry_cpu.' Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Peter Shier <pshier@google.com> Message-Id: <20200603235623.245638-6-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:45 -04:00
Jim Mattson	1aa561b1a4	kvm: x86: Add "last CPU" to some KVM_EXIT information More often than not, a failed VM-entry in an x86 production environment is induced by a defective CPU. To help identify the bad hardware, include the id of the last logical CPU to run a vCPU in the information provided to userspace on a KVM exit for failed VM-entry or for KVM internal errors not associated with emulation. The presence of this additional information is indicated by a new capability, KVM_CAP_LAST_CPU. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Peter Shier <pshier@google.com> Message-Id: <20200603235623.245638-5-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:45 -04:00
Jim Mattson	80a1684c01	kvm: vmx: Add last_cpu to struct vcpu_vmx As we already do in svm, record the last logical processor on which a vCPU has run, so that it can be communicated to userspace for potential hardware errors. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Reviewed-by: Peter Shier <pshier@google.com> Message-Id: <20200603235623.245638-4-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:43 -04:00
Peter Xu	12bc2132b1	KVM: X86: Do the same ignore_msrs check for feature msrs Logically the ignore_msrs and report_ignored_msrs should also apply to feature MSRs. Add them in. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200622220442.21998-3-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:40 -04:00
Sean Christopherson	02f5fb2e69	KVM: x86/mmu: Make .write_log_dirty a nested operation Move .write_log_dirty() into kvm_x86_nested_ops to help differentiate it from the non-nested dirty log hooks. And because it's a nested-only operation. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:38 -04:00
Sean Christopherson	2f1d48aae2	KVM: nVMX: WARN if PML emulation helper is invoked outside of nested guest WARN if vmx_write_pml_buffer() is called outside of guest mode instead of silently ignoring the condition. The only caller is nested EPT's ept_update_accessed_dirty_bits(), which should only be reachable when L2 is active. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-08 16:21:37 -04:00
Sean Christopherson	fa71e9527f	KVM: VMX: Use KVM_POSSIBLE_CR_GUEST_BITS to initialize guest/host masks Use the "common" KVM_POSSIBLE_CR_GUEST_BITS defines to initialize the CR0/CR4 guest host masks instead of duplicating most of the CR4 mask and open coding the CR0 mask. SVM doesn't utilize the masks, i.e. the masks are effectively VMX specific even if they're not named as such. This avoids duplicate code, better documents the guest owned CR0 bit, and eliminates the need for a build-time assertion to keep VMX and x86 synchronized. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703040422.31536-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-03 12:16:33 -04:00
Sean Christopherson	7c83d096ae	KVM: x86: Mark CR4.TSD as being possibly owned by the guest Mark CR4.TSD as being possibly owned by the guest as that is indeed the case on VMX. Without TSD being tagged as possibly owned by the guest, a targeted read of CR4 to get TSD could observe a stale value. This bug is benign in the current code base as the sole consumer of TSD is the emulator (for RDTSC) and the emulator always "reads" the entirety of CR4 when grabbing bits. Add a build-time assertion in to ensure VMX doesn't hand over more CR4 bits without also updating x86. Fixes: `52ce3c21ae` ("x86,kvm,vmx: Don't trap writes to CR4.TSD") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703040422.31536-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-07-03 12:16:28 -04:00
Sean Christopherson	e4553b4976	KVM: VMX: Remove vcpu_vmx's defunct copy of host_pkru Remove vcpu_vmx.host_pkru, which got left behind when PKRU support was moved to common x86 code. No functional change intended. Fixes: `37486135d3` ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200617034123.25647-1-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-23 06:01:29 -04:00
Sean Christopherson	bf09fb6cba	KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL Remove support for context switching between the guest's and host's desired UMWAIT_CONTROL. Propagating the guest's value to hardware isn't required for correct functionality, e.g. KVM intercepts reads and writes to the MSR, and the latency effects of the settings controlled by the MSR are not architecturally visible. As a general rule, KVM should not allow the guest to control power management settings unless explicitly enabled by userspace, e.g. see KVM_CAP_X86_DISABLE_EXITS. E.g. Intel's SDM explicitly states that C0.2 can improve the performance of SMT siblings. A devious guest could disable C0.2 so as to improve the performance of their workloads at the detriment to workloads running in the host or on other VMs. Wholesale removal of UMWAIT_CONTROL context switching also fixes a race condition where updates from the host may cause KVM to enter the guest with the incorrect value. Because updates are are propagated to all CPUs via IPI (SMP function callback), the value in hardware may be stale with respect to the cached value and KVM could enter the guest with the wrong value in hardware. As above, the guest can't observe the bad value, but it's a weird and confusing wart in the implementation. Removal also fixes the unnecessary usage of VMX's atomic load/store MSR lists. Using the lists is only necessary for MSRs that are required for correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on old hardware, or for MSRs that need to-the-uop precision, e.g. perf related MSRs. For UMWAIT_CONTROL, the effects are only visible in the kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in vcpu_vmx_run(). Using the atomic lists is undesirable as they are more expensive than direct RDMSR/WRMSR. Furthermore, even if giving the guest control of the MSR is legitimate, e.g. in pass-through scenarios, it's not clear that the benefits would outweigh the overhead. E.g. saving and restoring an MSR across a VMX roundtrip costs ~250 cycles, and if the guest diverged from the host that cost would be paid on every run of the guest. In other words, if there is a legitimate use case then it should be enabled by a new per-VM capability. Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can correctly expose other WAITPKG features to the guest, e.g. TPAUSE, UMWAIT and UMONITOR. Fixes: `6e3ba4abce` ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Cc: stable@vger.kernel.org Cc: Jingqi Liu <jingqi.liu@intel.com> Cc: Tao Xu <tao3.xu@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-22 20:54:57 -04:00
Sean Christopherson	2dbebf7ae1	KVM: nVMX: Plumb L2 GPA through to PML emulation Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all intents and purposes is vmx_write_pml_buffer(), instead of having the latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS. If the dirty bit update is the result of KVM emulation (rare for L2), then the GPA in the VMCS may be stale and/or hold a completely unrelated GPA. Fixes: `c5f983f6e8` ("nVMX: Implement emulated Page Modification Logging") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-22 18:23:03 -04:00
Vitaly Kuznetsov	49097762fa	Revert "KVM: VMX: Micro-optimize vmexit time when not exposing PMU" Guest crashes are observed on a Cascade Lake system when 'perf top' is launched on the host, e.g. BUG: unable to handle kernel paging request at fffffe0000073038 PGD 7ffa7067 P4D 7ffa7067 PUD 7ffa6067 PMD 7ffa5067 PTE ffffffffff120 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 1 Comm: systemd Not tainted 4.18.0+ #380 ... Call Trace: serial8250_console_write+0xfe/0x1f0 call_console_drivers.constprop.0+0x9d/0x120 console_unlock+0x1ea/0x460 Call traces are different but the crash is imminent. The problem was blindly bisected to the commit `041bc42ce2` ("KVM: VMX: Micro-optimize vmexit time when not exposing PMU"). It was also confirmed that the issue goes away if PMU is exposed to the guest. With some instrumentation of the guest we can see what is being switched (when we do atomic_switch_perf_msrs()): vmx_vcpu_run: switching 2 msrs vmx_vcpu_run: switching MSR38f guest: 70000000d host: 70000000f vmx_vcpu_run: switching MSR3f1 guest: 0 host: 2 The current guess is that PEBS (MSR_IA32_PEBS_ENABLE, 0x3f1) is to blame. Regardless of whether PMU is exposed to the guest or not, PEBS needs to be disabled upon switch. This reverts commit `041bc42ce2`. Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200619094046.654019-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-19 08:13:40 -04:00
Thomas Gleixner	6758034e4d	x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE save_fsgs_for_kvm() is invoked via vcpu_enter_guest() kvm_x86_ops.prepare_guest_switch(vcpu) vmx_prepare_switch_to_guest() save_fsgs_for_kvm() with preemption disabled, but interrupts enabled. The upcoming FSGSBASE based GS safe needs interrupts to be disabled. This could be done in the helper function, but that function is also called from switch_to() which has interrupts disabled already. Disable interrupts inside save_fsgs_for_kvm() and rename the function to current_save_fsgs() so it can be invoked from other places. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200528201402.1708239-7-sashal@kernel.org	2020-06-18 15:47:01 +02:00
Sean Christopherson	88c200d929	KVM: VMX: Add helpers to identify interrupt type from intr_info Add is_intr_type() and is_intr_type_n() to consolidate the boilerplate code for querying a specific type of interrupt given an encoded value from VMCS.VM_{ENTER,EXIT}_INTR_INFO, with and without an associated vector respectively. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200609014518.26756-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-15 12:12:20 -04:00
Linus Torvalds	076f14be7f	The X86 entry, exception and interrupt code rework This all started about 6 month ago with the attempt to move the Posix CPU timer heavy lifting out of the timer interrupt code and just have lockless quick checks in that code path. Trivial 5 patches. This unearthed an inconsistency in the KVM handling of task work and the review requested to move all of this into generic code so other architectures can share. Valid request and solved with another 25 patches but those unearthed inconsistencies vs. RCU and instrumentation. Digging into this made it obvious that there are quite some inconsistencies vs. instrumentation in general. The int3 text poke handling in particular was completely unprotected and with the batched update of trace events even more likely to expose to endless int3 recursion. In parallel the RCU implications of instrumenting fragile entry code came up in several discussions. The conclusion of the X86 maintainer team was to go all the way and make the protection against any form of instrumentation of fragile and dangerous code pathes enforcable and verifiable by tooling. A first batch of preparatory work hit mainline with commit `d5f744f9a2`. The (almost) full solution introduced a new code section '.noinstr.text' into which all code which needs to be protected from instrumentation of all sorts goes into. Any call into instrumentable code out of this section has to be annotated. objtool has support to validate this. Kprobes now excludes this section fully which also prevents BPF from fiddling with it and all 'noinstr' annotated functions also keep ftrace off. The section, kprobes and objtool changes are already merged. The major changes coming with this are: - Preparatory cleanups - Annotating of relevant functions to move them into the noinstr.text section or enforcing inlining by marking them __always_inline so the compiler cannot misplace or instrument them. - Splitting and simplifying the idtentry macro maze so that it is now clearly separated into simple exception entries and the more interesting ones which use interrupt stacks and have the paranoid handling vs. CR3 and GS. - Move quite some of the low level ASM functionality into C code: - enter_from and exit to user space handling. The ASM code now calls into C after doing the really necessary ASM handling and the return path goes back out without bells and whistels in ASM. - exception entry/exit got the equivivalent treatment - move all IRQ tracepoints from ASM to C so they can be placed as appropriate which is especially important for the int3 recursion issue. - Consolidate the declaration and definition of entry points between 32 and 64 bit. They share a common header and macros now. - Remove the extra device interrupt entry maze and just use the regular exception entry code. - All ASM entry points except NMI are now generated from the shared header file and the corresponding macros in the 32 and 64 bit entry ASM. - The C code entry points are consolidated as well with the help of DEFINE_IDTENTRY() macros. This allows to ensure at one central point that all corresponding entry points share the same semantics. The actual function body for most entry points is in an instrumentable and sane state. There are special macros for the more sensitive entry points, e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF. They allow to put the whole entry instrumentation and RCU handling into safe places instead of the previous pray that it is correct approach. - The INT3 text poke handling is now completely isolated and the recursion issue banned. Aside of the entry rework this required other isolation work, e.g. the ability to force inline bsearch. - Prevent #DB on fragile entry code, entry relevant memory and disable it on NMI, #MC entry, which allowed to get rid of the nested #DB IST stack shifting hackery. - A few other cleanups and enhancements which have been made possible through this and already merged changes, e.g. consolidating and further restricting the IDT code so the IDT table becomes RO after init which removes yet another popular attack vector - About 680 lines of ASM maze are gone. There are a few open issues: - An escape out of the noinstr section in the MCE handler which needs some more thought but under the aspect that MCE is a complete trainwreck by design and the propability to survive it is low, this was not high on the priority list. - Paravirtualization When PV is enabled then objtool complains about a bunch of indirect calls out of the noinstr section. There are a few straight forward ways to fix this, but the other issues vs. general correctness were more pressing than parawitz. - KVM KVM is inconsistent as well. Patches have been posted, but they have not yet been commented on or picked up by the KVM folks. - IDLE Pretty much the same problems can be found in the low level idle code especially the parts where RCU stopped watching. This was beyond the scope of the more obvious and exposable problems and is on the todo list. The lesson learned from this brain melting exercise to morph the evolved code base into something which can be validated and understood is that once again the violation of the most important engineering principle "correctness first" has caused quite a few people to spend valuable time on problems which could have been avoided in the first place. The "features first" tinkering mindset really has to stop. With that I want to say thanks to everyone involved in contributing to this effort. Special thanks go to the following people (alphabetical order): Alexandre Chartre Andy Lutomirski Borislav Petkov Brian Gerst Frederic Weisbecker Josh Poimboeuf Juergen Gross Lai Jiangshan Macro Elver Paolo Bonzini Paul McKenney Peter Zijlstra Vitaly Kuznetsov Will Deacon -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl7j510THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoU2WD/4refvaNm08fG7aiVYem3JJzr0+Pq5O /opwnI/1D973ApApj5W/Nd53sN5tVqOiXncSKgywRBWZxRCAGjVYypl9rjpvXu4l HlMjhEKBmWkDryxxrM98Vr7hl3hnId5laR56oFfH+G4LUsItaV6Uak/HfXZ4Mq1k iYVbEtl2CN+KJjvSgZ6Y1l853Ab5mmGvmeGNHHWCj8ZyjF3cOLoelDTQNnsb0wXM crKXBcXJSsCWKYyJ5PTvB82crQCET7Su+LgwK06w/ZbW1//2hVIjSCiN5o/V+aRJ 06BZNMj8v9tfglkN8LEQvRIjTlnEQ2sq3GxbrVtA53zxkzbBCBJQ96w8yYzQX0ux yhqQ/aIZJ1wTYEjJzSkftwLNMRHpaOUnKvJndXRKAYi+eGI7syF61qcZSYGKuAQ/ bK3b/CzU6QWr1235oTADxh4isEwxA0Pg5wtJCfDDOG0MJ9ALMSOGUkhoiz5EqpkU mzFAwfG/Uj7hRjlkms7Yj2OjZfnU7iypj63GgpXghLjr5ksRFKEOMw8e1GXltVHs zzwghUjqp2EPq0VOOQn3lp9lol5Prc3xfFHczKpO+CJW6Rpa4YVdqJmejBqJy/on Hh/T/ST3wa2qBeAw89vZIeWiUJZZCsQ0f//+2hAbzJY45Y6DuR9vbTAPb9agRgOM xg+YaCfpQqFc1A== =llba -----END PGP SIGNATURE----- Merge tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry updates from Thomas Gleixner: "The x86 entry, exception and interrupt code rework This all started about 6 month ago with the attempt to move the Posix CPU timer heavy lifting out of the timer interrupt code and just have lockless quick checks in that code path. Trivial 5 patches. This unearthed an inconsistency in the KVM handling of task work and the review requested to move all of this into generic code so other architectures can share. Valid request and solved with another 25 patches but those unearthed inconsistencies vs. RCU and instrumentation. Digging into this made it obvious that there are quite some inconsistencies vs. instrumentation in general. The int3 text poke handling in particular was completely unprotected and with the batched update of trace events even more likely to expose to endless int3 recursion. In parallel the RCU implications of instrumenting fragile entry code came up in several discussions. The conclusion of the x86 maintainer team was to go all the way and make the protection against any form of instrumentation of fragile and dangerous code pathes enforcable and verifiable by tooling. A first batch of preparatory work hit mainline with commit `d5f744f9a2` ("Pull x86 entry code updates from Thomas Gleixner") That (almost) full solution introduced a new code section '.noinstr.text' into which all code which needs to be protected from instrumentation of all sorts goes into. Any call into instrumentable code out of this section has to be annotated. objtool has support to validate this. Kprobes now excludes this section fully which also prevents BPF from fiddling with it and all 'noinstr' annotated functions also keep ftrace off. The section, kprobes and objtool changes are already merged. The major changes coming with this are: - Preparatory cleanups - Annotating of relevant functions to move them into the noinstr.text section or enforcing inlining by marking them __always_inline so the compiler cannot misplace or instrument them. - Splitting and simplifying the idtentry macro maze so that it is now clearly separated into simple exception entries and the more interesting ones which use interrupt stacks and have the paranoid handling vs. CR3 and GS. - Move quite some of the low level ASM functionality into C code: - enter_from and exit to user space handling. The ASM code now calls into C after doing the really necessary ASM handling and the return path goes back out without bells and whistels in ASM. - exception entry/exit got the equivivalent treatment - move all IRQ tracepoints from ASM to C so they can be placed as appropriate which is especially important for the int3 recursion issue. - Consolidate the declaration and definition of entry points between 32 and 64 bit. They share a common header and macros now. - Remove the extra device interrupt entry maze and just use the regular exception entry code. - All ASM entry points except NMI are now generated from the shared header file and the corresponding macros in the 32 and 64 bit entry ASM. - The C code entry points are consolidated as well with the help of DEFINE_IDTENTRY() macros. This allows to ensure at one central point that all corresponding entry points share the same semantics. The actual function body for most entry points is in an instrumentable and sane state. There are special macros for the more sensitive entry points, e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF. They allow to put the whole entry instrumentation and RCU handling into safe places instead of the previous pray that it is correct approach. - The INT3 text poke handling is now completely isolated and the recursion issue banned. Aside of the entry rework this required other isolation work, e.g. the ability to force inline bsearch. - Prevent #DB on fragile entry code, entry relevant memory and disable it on NMI, #MC entry, which allowed to get rid of the nested #DB IST stack shifting hackery. - A few other cleanups and enhancements which have been made possible through this and already merged changes, e.g. consolidating and further restricting the IDT code so the IDT table becomes RO after init which removes yet another popular attack vector - About 680 lines of ASM maze are gone. There are a few open issues: - An escape out of the noinstr section in the MCE handler which needs some more thought but under the aspect that MCE is a complete trainwreck by design and the propability to survive it is low, this was not high on the priority list. - Paravirtualization When PV is enabled then objtool complains about a bunch of indirect calls out of the noinstr section. There are a few straight forward ways to fix this, but the other issues vs. general correctness were more pressing than parawitz. - KVM KVM is inconsistent as well. Patches have been posted, but they have not yet been commented on or picked up by the KVM folks. - IDLE Pretty much the same problems can be found in the low level idle code especially the parts where RCU stopped watching. This was beyond the scope of the more obvious and exposable problems and is on the todo list. The lesson learned from this brain melting exercise to morph the evolved code base into something which can be validated and understood is that once again the violation of the most important engineering principle "correctness first" has caused quite a few people to spend valuable time on problems which could have been avoided in the first place. The "features first" tinkering mindset really has to stop. With that I want to say thanks to everyone involved in contributing to this effort. Special thanks go to the following people (alphabetical order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai Jiangshan, Macro Elver, Paolo Bonzin,i Paul McKenney, Peter Zijlstra, Vitaly Kuznetsov, and Will Deacon" * tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits) x86/entry: Force rcu_irq_enter() when in idle task x86/entry: Make NMI use IDTENTRY_RAW x86/entry: Treat BUG/WARN as NMI-like entries x86/entry: Unbreak __irqentry_text_start/end magic x86/entry: __always_inline CR2 for noinstr lockdep: __always_inline more for noinstr x86/entry: Re-order #DB handler to avoid SAN instrumentation x86/entry: __always_inline arch_atomic_ for noinstr x86/entry: __always_inline irqflags for noinstr x86/entry: __always_inline debugreg for noinstr x86/idt: Consolidate idt functionality x86/idt: Cleanup trap_init() x86/idt: Use proper constants for table size x86/idt: Add comments about early #PF handling x86/idt: Mark init only functions __init x86/entry: Rename trace_hardirqs_off_prepare() x86/entry: Clarify irq_{enter,exit}_rcu() x86/entry: Remove DBn stacks x86/entry: Remove debug IDT frobbing x86/entry: Optimize local_db_save() for virt ...	2020-06-13 10:05:47 -07:00
Paolo Bonzini	77f81f37fb	Merge branch 'kvm-basic-exit-reason' into HEAD Using a topic branch so that stable branches can simply cherry-pick the patch. Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-11 12:35:14 -04:00
Sean Christopherson	2ebac8bb3c	KVM: nVMX: Consult only the "basic" exit reason when routing nested exit Consult only the basic exit reason, i.e. bits 15:0 of vmcs.EXIT_REASON, when determining whether a nested VM-Exit should be reflected into L1 or handled by KVM in L0. For better or worse, the switch statement in nested_vmx_exit_reflected() currently defaults to "true", i.e. reflects any nested VM-Exit without dedicated logic. Because the case statements only contain the basic exit reason, any VM-Exit with modifier bits set will be reflected to L1, even if KVM intended to handle it in L0. Practically speaking, this only affects EXIT_REASON_MCE_DURING_VMENTRY, i.e. a #MC that occurs on nested VM-Enter would be incorrectly routed to L1, as "failed VM-Entry" is the only modifier that KVM can currently encounter. The SMM modifiers will never be generated as KVM doesn't support/employ a SMI Transfer Monitor. Ditto for "exit from enclave", as KVM doesn't yet support virtualizing SGX, i.e. it's impossible to enter an enclave in a KVM guest (L1 or L2). Fixes: `644d711aa0` ("KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit") Cc: Jim Mattson <jmattson@google.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200227174430.26371-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-11 11:28:11 -04:00
Peter Zijlstra	84b6a34915	x86/entry: Optimize local_db_save() for virt Because DRn access is 'difficult' with virt; but the DR7 read is cheaper than a cacheline miss on native, add a virt specific fast path to local_db_save(), such that when breakpoints are not in use to avoid touching DRn entirely. Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200529213321.187833200@infradead.org	2020-06-11 15:15:22 +02:00
Thomas Gleixner	8cd501c1fa	x86/entry: Convert Machine Check to IDTENTRY_IST Convert #MC to IDTENTRY_MCE: - Implement the C entry points with DEFINE_IDTENTRY_MCE - Emit the ASM stub with DECLARE_IDTENTRY_MCE - Remove the ASM idtentry in 64bit - Remove the open coded ASM entry code in 32bit - Fixup the XEN/PV code - Remove the old prototypes - Remove the error code from *machine_check_vector() as it is always 0 and not used by any of the functions it can point to. Fixup all the functions as well. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20200505135314.334980426@linutronix.de	2020-06-11 15:14:57 +02:00
Vitaly Kuznetsov	7a35e515a7	KVM: VMX: Properly handle kvm_read/write_guest_virt() result Syzbot reports the following issue: WARNING: CPU: 0 PID: 6819 at arch/x86/kvm/x86.c:618 kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618 ... Call Trace: ... RIP: 0010:kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618 ... nested_vmx_get_vmptr+0x1f9/0x2a0 arch/x86/kvm/vmx/nested.c:4638 handle_vmon arch/x86/kvm/vmx/nested.c:4767 [inline] handle_vmon+0x168/0x3a0 arch/x86/kvm/vmx/nested.c:4728 vmx_handle_exit+0x29c/0x1260 arch/x86/kvm/vmx/vmx.c:6067 'exception' we're trying to inject with kvm_inject_emulated_page_fault() comes from: nested_vmx_get_vmptr() kvm_read_guest_virt() kvm_read_guest_virt_helper() vcpu->arch.walk_mmu->gva_to_gpa() but it is only set when GVA to GPA conversion fails. In case it doesn't but we still fail kvm_vcpu_read_guest_page(), X86EMUL_IO_NEEDED is returned and nested_vmx_get_vmptr() calls kvm_inject_emulated_page_fault() with zeroed 'exception'. This happen when the argument is MMIO. Paolo also noticed that nested_vmx_get_vmptr() is not the only place in KVM code where kvm_read/write_guest_virt() return result is mishandled. VMX instructions along with INVPCID have the same issue. This was already noticed before, e.g. see commit `541ab2aeb2` ("KVM: x86: work around leak of uninitialized stack contents") but was never fully fixed. KVM could've handled the request correctly by going to userspace and performing I/O but there doesn't seem to be a good need for such requests in the first place. Introduce vmx_handle_memory_failure() as an interim solution. Note, nested_vmx_get_vmptr() now has three possible outcomes: OK, PF, KVM_EXIT_INTERNAL_ERROR and callers need to know if userspace exit is needed (for KVM_EXIT_INTERNAL_ERROR) in case of failure. We don't seem to have a good enum describing this tristate, just add "int *ret" to nested_vmx_get_vmptr() interface to pass the information. Reported-by: syzbot+2a7156e11dc199bdbd8a@syzkaller.appspotmail.com Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200605115906.532682-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-08 07:59:08 -04:00
Babu Moger	fa44b82eb8	KVM: x86: Move MPK feature detection to common code Both Intel and AMD support (MPK) Memory Protection Key feature. Move the feature detection from VMX to the common code. It should work for both the platforms now. Signed-off-by: Babu Moger <babu.moger@amd.com> Message-Id: <158932795627.44260.15144185478040178638.stgit@naples-babu.amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-04 12:35:06 -04:00
Sean Christopherson	f4f6bd93fd	KVM: VMX: Always treat MSR_IA32_PERF_CAPABILITIES as a valid PMU MSR Unconditionally return true when querying the validity of MSR_IA32_PERF_CAPABILITIES so as to defer the validity check to intel_pmu_{get,set}_msr(), which can properly give the MSR a pass when the access is initiated from host userspace. The MSR is emulated so there is no underlying hardware dependency to worry about. Fixes: `27461da310` ("KVM: x86/pmu: Support full width counting") Cc: Like Xu <like.xu@linux.intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200603203303.28545-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-04 12:20:44 -04:00
Makarand Sonare	8d7fbf01f9	KVM: selftests: VMX preemption timer migration test When a nested VM with a VMX-preemption timer is migrated, verify that the nested VM and its parent VM observe the VMX-preemption timer exit close to the original expiration deadline. Signed-off-by: Makarand Sonare <makarandsonare@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200526215107.205814-3-makarandsonare@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-01 04:26:10 -04:00
Peter Shier	850448f35a	KVM: nVMX: Fix VMX preemption timer migration Add new field to hold preemption timer expiration deadline appended to struct kvm_vmx_nested_state_hdr. This is to prevent the first VM-Enter after migration from incorrectly restarting the timer with the full timer value instead of partially decayed timer value. KVM_SET_NESTED_STATE restarts timer using migrated state regardless of whether L1 sets VM_EXIT_SAVE_VMX_PREEMPTION_TIMER. Fixes: `cf8b84f48a` ("kvm: nVMX: Prepare for checkpointing L2 state") Signed-off-by: Peter Shier <pshier@google.com> Signed-off-by: Makarand Sonare <makarandsonare@google.com> Message-Id: <20200526215107.205814-2-makarandsonare@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-01 04:26:10 -04:00
Like Xu	27461da310	KVM: x86/pmu: Support full width counting Intel CPUs have a new alternative MSR range (starting from MSR_IA32_PMC0) for GP counters that allows writing the full counter width. Enable this range from a new capability bit (IA32_PERF_CAPABILITIES.FW_WRITE[bit 13]). The guest would query CPUID to get the counter width, and sign extends the counter values as needed. The traditional MSRs always limit to 32bit, even though the counter internally is larger (48 or 57 bits). When the new capability is set, use the alternative range which do not have these restrictions. This lowers the overhead of perf stat slightly because it has to do less interrupts to accumulate the counter value. Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20200529074347.124619-3-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-01 04:26:09 -04:00
Wei Wang	cbd717585b	KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in Change kvm_pmu_get_msr() to get the msr_data struct, as the host_initiated field from the struct could be used by get_msr. This also makes this API consistent with kvm_pmu_set_msr. No functional changes. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Message-Id: <20200529074347.124619-2-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-01 04:26:08 -04:00
Vitaly Kuznetsov	68fd66f100	KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info Currently, APF mechanism relies on the #PF abuse where the token is being passed through CR2. If we switch to using interrupts to deliver page-ready notifications we need a different way to pass the data. Extent the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While on it, rename 'reason' to 'flags'. This doesn't change the semantics as we only have reasons '1' and '2' and these can be treated as bit flags but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery making 'reason' name misleading. The newly introduced apf_put_user_ready() temporary puts both flags and token information, this will be changed to put token only when we switch to interrupt based notifications. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-01 04:26:06 -04:00
Gustavo A. R. Silva	f4a9fdd5f1	KVM: VMX: Replace zero-length array with flexible-array The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit `7649773293` ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Message-Id: <20200507185618.GA14831@embeddedor> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-01 04:26:05 -04:00
Paolo Bonzini	cc440cdad5	KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE Similar to VMX, the state that is captured through the currently available IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was running at the moment when the process was interrupted to save its state. In particular, the SVM-specific state for nested virtualization includes the L1 saved state (including the interrupt flag), the cached L2 controls, and the GIF. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-06-01 04:26:05 -04:00
Paolo Bonzini	df7e0681dd	KVM: nVMX: always update CR3 in VMCS vmx_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as an optimization, but this is only correct before the nested vmentry. If userspace is modifying CR3 with KVM_SET_SREGS after the VM has already been put in guest mode, the value of CR3 will not be updated. Remove the optimization, which almost never triggers anyway. Fixes: `04f11ef458` ("KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-28 11:46:18 -04:00
Paolo Bonzini	c9d40913ac	KVM: x86: enable event window in inject_pending_event In case an interrupt arrives after nested.check_events but before the call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt window even if the interrupt is actually going to be a vmexit. This is useless rather than harmful, but it really complicates reasoning about SVM's handling of the VINTR intercept. We'd like to never bother with the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in that case there is no interrupt window and we can just exit the nested guest whenever we want. This patch moves the opening of the interrupt window inside inject_pending_event. This consolidates the check for pending interrupt/NMI/SMI in one place, and makes KVM's usage of immediate exits more consistent, extending it beyond just nested virtualization. There are two functional changes here. They only affect corner cases, but overall they simplify the inject_pending_event. - re-injection of still-pending events will also use req_immediate_exit instead of using interrupt-window intercepts. This should have no impact on performance on Intel since it simply replaces an interrupt-window or NMI-window exit for a preemption-timer exit. On AMD, which has no equivalent of the preemption time, it may incur some overhead but an actual effect on performance should only be visible in pathological cases. - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true if an interrupt, NMI or SMI is blocked by nested_run_pending. This makes sense because entering the VM will allow it to make progress and deliver the event. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-28 11:41:46 -04:00
Miaohe Lin	a8cfbae592	KVM: VMX: replace "fall through" with "return" to indicate different case The second "/* fall through */" in rmode_exception() makes code harder to read. Replace it with "return" to indicate they are different cases, only the #DB and #BP check vcpu->guest_debug, while others don't care. And this also improves the readability. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Message-Id: <1582080348-20827-1-git-send-email-linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-27 13:11:09 -04:00
Sean Christopherson	cb97c2d680	KVM: x86: Take an unsigned 32-bit int for has_emulated_msr()'s index Take a u32 for the index in has_emulated_msr() to match hardware, which treats MSR indices as unsigned 32-bit values. Functionally, taking a signed int doesn't cause problems with the current code base, but could theoretically cause problems with 32-bit KVM, e.g. if the index were checked via a less-than statement, which would evaluate incorrectly for MSR indices with bit 31 set. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200218234012.7110-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-27 13:11:08 -04:00
Paolo Bonzini	7529e767c2	Merge branch 'kvm-master' into HEAD Merge AMD fixes before doing more development work.	2020-05-27 13:10:29 -04:00
Paolo Bonzini	e7581caca4	KVM: x86: simplify is_mmio_spte We can simply look at bits 52-53 to identify MMIO entries in KVM's page tables. Therefore, there is no need to pass a mask to kvm_mmu_set_mmio_spte_mask. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-27 13:08:29 -04:00
Maxim Levitsky	0abcc8f65c	KVM: VMX: enable X86_FEATURE_WAITPKG in KVM capabilities Even though we might not allow the guest to use WAITPKG's new instructions, we should tell KVM that the feature is supported by the host CPU. Note that vmx_waitpkg_supported checks that WAITPKG _can_ be set in secondary execution controls as specified by VMX capability MSR, rather that we actually enable it for a guest. Cc: stable@vger.kernel.org Fixes: `e69e72faa3` ("KVM: x86: Add support for user wait instructions") Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200523161455.3940-2-mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-27 13:08:03 -04:00
Jim Mattson	93dff2fed2	KVM: nVMX: Migrate the VMX-preemption timer The hrtimer used to emulate the VMX-preemption timer must be pinned to the same logical processor as the vCPU thread to be interrupted if we want to have any hope of adhering to the architectural specification of the VMX-preemption timer. Even with this change, the emulated VMX-preemption timer VM-exit occasionally arrives too late. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-4-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:26 -04:00
Jim Mattson	ada0098df6	KVM: nVMX: Change emulated VMX-preemption timer hrtimer to absolute Prepare for migration of this hrtimer, by changing it from relative to absolute. (I couldn't get migration to work with a relative timer.) Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-3-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:25 -04:00
Jim Mattson	1739f3d56d	KVM: nVMX: Really make emulated nested preemption timer pinned The PINNED bit is ignored by hrtimer_init. It is only considered when starting the timer. When the hrtimer isn't pinned to the same logical processor as the vCPU thread to be interrupted, the emulated VMX-preemption timer often fails to adhere to the architectural specification. Fixes: `f15a75eedc` ("KVM: nVMX: make emulated nested preemption timer pinned") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200508203643.85477-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:24 -04:00
Sean Christopherson	6c1c6e5835	KVM: nVMX: Remove unused 'ops' param from nested_vmx_hardware_setup() Remove a 'struct kvm_x86_ops' param that got left behind when the nested ops were moved to their own struct. Fixes: `33b2217245` ("KVM: x86: move nested-related kvm_x86_ops to a separate struct") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200506204653.14683-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:24 -04:00
Wanpeng Li	26efe2fd92	KVM: VMX: Handle preemption timer fastpath This patch implements a fastpath for the preemption timer vmexit. The vmexit can be handled quickly so it can be performed with interrupts off and going back directly to the guest. Testing on SKX Server. cyclictest in guest(w/o mwait exposed, adaptive advance lapic timer is default -1): 5540.5ns -> 4602ns 17% kvm-unit-test/vmexit.flat: w/o avanced timer: tscdeadline_immed: 3028.5 -> 2494.75 17.6% tscdeadline: 5765.7 -> 5285 8.3% w/ adaptive advance timer default -1: tscdeadline_immed: 3123.75 -> 2583 17.3% tscdeadline: 4663.75 -> 4537 2.7% Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-8-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:22 -04:00
Paolo Bonzini	199a8b84c4	KVM: x86: introduce kvm_can_use_hv_timer Replace the ad hoc test in vmx_set_hv_timer with a test in the caller, start_hv_timer. This test is not Intel-specific and would be duplicated when introducing the fast path for the TSC deadline MSR. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:21 -04:00
Wanpeng Li	379a3c8ee4	KVM: VMX: Optimize posted-interrupt delivery for timer fastpath While optimizing posted-interrupt delivery especially for the timer fastpath scenario, I measured kvm_x86_ops.deliver_posted_interrupt() to introduce substantial latency because the processor has to perform all vmentry tasks, ack the posted interrupt notification vector, read the posted-interrupt descriptor etc. This is not only slow, it is also unnecessary when delivering an interrupt to the current CPU (as is the case for the LAPIC timer) because PIR->IRR and IRR->RVI synchronization is already performed on vmentry Therefore skip kvm_vcpu_trigger_posted_interrupt in this case, and instead do vmx_sync_pir_to_irr() on the EXIT_FASTPATH_REENTER_GUEST fastpath as well. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-6-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:20 -04:00
Wanpeng Li	404d5d7bff	KVM: X86: Introduce more exit_fastpath_completion enum values Adds a fastpath_t typedef since enum lines are a bit long, and replace EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion enum values. - EXIT_FASTPATH_EXIT_HANDLED kvm will still go through it's full run loop, but it would skip invoking the exit handler. - EXIT_FASTPATH_REENTER_GUEST complete fastpath, guest can be re-entered without invoking the exit handler or going back to vcpu_run Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1588055009-12677-4-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:19 -04:00
Wanpeng Li	dcf068da7e	KVM: VMX: Introduce generic fastpath handler Introduce generic fastpath handler to handle MSR fastpath, VMX-preemption timer fastpath etc; move it after vmx_complete_interrupts() in order to catch events delivered to the guest, and abort the fast path in later patches. While at it, move the kvm_exit tracepoint so that it is printed for fastpath vmexits as well. There is no observed performance effect for the IPI fastpath after this patch. Tested-by: Haiwei Li <lihaiwei@tencent.com> Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <1588055009-12677-2-git-send-email-wanpengli@tencent.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:17 -04:00
Sean Christopherson	9e826feb8f	KVM: nVMX: Drop superfluous VMREAD of vmcs02.GUEST_SYSENTER_* Don't propagate GUEST_SYSENTER_* from vmcs02 to vmcs12 on nested VM-Exit as the vmcs12 fields are updated in vmx_set_msr(), and writes to the corresponding MSRs are always intercepted by KVM when running L2. Dropping the propagation was intended to be done in the same commit that added vmcs12 writes in vmx_set_msr()[1], but for reasons unknown was only shuffled around[2][3]. [1] https://patchwork.kernel.org/patch/10933215 [2] https://patchwork.kernel.org/patch/10933215/#22682289 [3] https://lore.kernel.org/patchwork/patch/1088643 Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428231025.12766-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:17 -04:00
Sean Christopherson	2408500dfc	KVM: nVMX: Truncate writes to vmcs.SYSENTER_EIP/ESP for 32-bit vCPU Explicitly truncate the data written to vmcs.SYSENTER_EIP/ESP on WRMSR if the virtual CPU doesn't support 64-bit mode. The SYSENTER address fields in the VMCS are natural width, i.e. bits 63:32 are dropped if the CPU doesn't support Intel 64 architectures. This behavior is visible to the guest after a VM-Exit/VM-Exit roundtrip, e.g. if the guest sets bits 63:32 in the actual MSR. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428231025.12766-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:16 -04:00
Uros Bizjak	551896e0e0	KVM: VMX: Improve handle_external_interrupt_irqoff inline assembly Improve handle_external_interrupt_irqoff inline assembly in several ways: - remove unneeded %c operand modifiers and "$" prefixes - use %rsp instead of _ASM_SP, since we are in CONFIG_X86_64 part - use $-16 immediate to align %rsp - remove unneeded use of __ASM_SIZE macro - define "ss" named operand only for X86_64 The patch introduces no functional changes. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Message-Id: <20200504155706.2516956-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:16 -04:00
Uros Bizjak	c16312f4fa	KVM: VMX: Remove unneeded __ASM_SIZE usage with POP instruction POP [mem] defaults to the word size, and the only legal non-default size is 16 bits, e.g. a 32-bit POP will #UD in 64-bit mode and vice versa, no need to use __ASM_SIZE macro to force operating mode. Changes since v1: - Fix commit message. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Message-Id: <20200427205035.1594232-1-ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:13 -04:00
Sean Christopherson	3bae0459bc	KVM: x86/mmu: Drop KVM's hugepage enums in favor of the kernel's enums Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's enums are borderline impossible to remember and result in code that is visually difficult to audit, e.g. if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PT_PDPE_LEVEL; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PT_DIRECTORY_LEVEL; else ept_lpage_level = PT_PAGE_TABLE_LEVEL; versus if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PG_LEVEL_1G; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PG_LEVEL_2M; else ept_lpage_level = PG_LEVEL_4K; No functional change intended. Suggested-by: Barret Rhoden <brho@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:11 -04:00
Sean Christopherson	68cda40d9f	KVM: nVMX: Tweak handling of failure code for nested VM-Enter failure Use an enum for passing around the failure code for a failed VM-Enter that results in VM-Exit to provide a level of indirection from the final resting place of the failure code, vmcs.EXIT_QUALIFICATION. The exit qualification field is an unsigned long, e.g. passing around 'u32 exit_qual' throws up red flags as it suggests KVM may be dropping bits when reporting errors to L1. This is a red herring because the only defined failure codes are 0, 2, 3, and 4, i.e. don't come remotely close to overflowing a u32. Setting vmcs.EXIT_QUALIFICATION on entry failure is further complicated by the MSR load list, which returns the (1-based) entry that failed, and the number of MSRs to load is a 32-bit VMCS field. At first blush, it would appear that overflowing a u32 is possible, but the number of MSRs that can be loaded is hardcapped at 4096 (limited by MSR_IA32_VMX_MISC). In other words, there are two completely disparate types of data that eventually get stuffed into vmcs.EXIT_QUALIFICATION, neither of which is an 'unsigned long' in nature. This was presumably the reasoning for switching to 'u32' when the related code was refactored in commit `ca0bde28f2` ("kvm: nVMX: Split VMCS checks from nested_vmx_run()"). Using an enum for the failure code addresses the technically-possible- but-will-never-happen scenario where Intel defines a failure code that doesn't fit in a 32-bit integer. The enum variables and values will either be automatically sized (gcc 5.4 behavior) or be subjected to some combination of truncation. The former case will simply work, while the latter will trigger a compile-time warning unless the compiler is being particularly unhelpful. Separating the failure code from the failed MSR entry allows for disassociating both from vmcs.EXIT_QUALIFICATION, which avoids the conundrum where KVM has to choose between 'u32 exit_qual' and tracking values as 'unsigned long' that have no business being tracked as such. To cement the split, set vmcs12->exit_qualification directly from the entry error code or failed MSR index instead of bouncing through a local variable. Opportunistically rename the variables in load_vmcs12_host_state() and vmx_set_nested_state() to call out that they're ignored, set exit_reason on demand on nested VM-Enter failure, and add a comment in nested_vmx_load_msr() to call out that returning 'i + 1' can't wrap. No functional change intended. Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200511220529.11402-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:07:31 -04:00
Sean Christopherson	e93fd3b3e8	KVM: x86/mmu: Capture TDP level when updating CPUID Snapshot the TDP level now that it's invariant (SVM) or dependent only on host capabilities and guest CPUID (VMX). This avoids having to call kvm_x86_ops.get_tdp_level() when initializing a TDP MMU and/or calculating the page role, and thus avoids the associated retpoline. Drop the WARN in vmx_get_tdp_level() as updating CPUID while L2 is active is legal, if dodgy. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:15:14 -04:00
Sean Christopherson	0047fcade4	KVM: VMX: Move nested EPT out of kvm_x86_ops.get_tdp_level() hook Separate the "core" TDP level handling from the nested EPT path to make it clear that kvm_x86_ops.get_tdp_level() is used if and only if nested EPT is not in use (kvm_init_shadow_ept_mmu() calculates the level from the passed in vmcs12->eptp). Add a WARN_ON() to enforce that the kvm_x86_ops hook is not called for nested EPT. This sets the stage for snapshotting the non-"nested EPT" TDP page level during kvm_cpuid_update() to avoid the retpoline associated with kvm_x86_ops.get_tdp_level() when resetting the MMU, a relatively frequent operation when running a nested guest. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:15:13 -04:00
Sean Christopherson	bd31fe495d	KVM: VMX: Add proper cache tracking for CR0 Move CR0 caching into the standard register caching mechanism in order to take advantage of the availability checks provided by regs_avail. This avoids multiple VMREADs in the (uncommon) case where kvm_read_cr0() is called multiple times in a single VM-Exit, and more importantly eliminates a kvm_x86_ops hook, saves a retpoline on SVM when reading CR0, and squashes the confusing naming discrepancy of "cache_reg" vs. "decache_cr0_guest_bits". No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:15:12 -04:00
Sean Christopherson	f98c1e7712	KVM: VMX: Add proper cache tracking for CR4 Move CR4 caching into the standard register caching mechanism in order to take advantage of the availability checks provided by regs_avail. This avoids multiple VMREADs and retpolines (when configured) during nested VMX transitions as kvm_read_cr4_bits() is invoked multiple times on each transition, e.g. when stuffing CR0 and CR3. As an added bonus, this eliminates a kvm_x86_ops hook, saves a retpoline on SVM when reading CR4, and squashes the confusing naming discrepancy of "cache_reg" vs. "decache_cr4_guest_bits". No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:15:10 -04:00
Sean Christopherson	0cc69204e7	KVM: nVMX: Unconditionally validate CR3 during nested transitions Unconditionally check the validity of the incoming CR3 during nested VM-Enter/VM-Exit to avoid invoking kvm_read_cr3() in the common case where the guest isn't using PAE paging. If vmcs.GUEST_CR3 hasn't yet been cached (common case), kvm_read_cr3() will trigger a VMREAD. The VMREAD (~30 cycles) alone is likely slower than nested_cr3_valid() (~5 cycles if vcpu->arch.maxphyaddr gets a cache hit), and the poor exchange only gets worse when retpolines are enabled as the call to kvm_x86_ops.cache_reg() will incur a retpoline (60+ cycles). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:15:09 -04:00
Sean Christopherson	56ba77a459	KVM: x86: Save L1 TSC offset in 'struct kvm_vcpu_arch' Save L1's TSC offset in 'struct kvm_vcpu_arch' and drop the kvm_x86_ops hook read_l1_tsc_offset(). This avoids a retpoline (when configured) when reading L1's effective TSC, which is done at least once on every VM-Exit. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200502043234.12481-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:15:04 -04:00
Sean Christopherson	1af1bb0562	KVM: nVMX: Skip IBPB when temporarily switching between vmcs01 and vmcs02 Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS switch when temporarily loading vmcs02 to synchronize it to vmcs12, i.e. give copy_vmcs02_to_vmcs12_rare() the same treatment as vmx_switch_vmcs(). Make vmx_vcpu_load() static now that it's only referenced within vmx.c. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200506235850.22600-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:15:03 -04:00
Sean Christopherson	5c911beff2	KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02 Skip the Indirect Branch Prediction Barrier that is triggered on a VMCS switch when running with spectre_v2_user=on/auto if the switch is between two VMCSes in the same guest, i.e. between vmcs01 and vmcs02. The IBPB is intended to prevent one guest from attacking another, which is unnecessary in the nested case as it's the same guest from KVM's perspective. This all but eliminates the overhead observed for nested VMX transitions when running with CONFIG_RETPOLINE=y and spectre_v2_user=on/auto, which can be significant, e.g. roughly 3x on current systems. Reported-by: Alexander Graf <graf@amazon.com> Cc: KarimAllah Raslan <karahmed@amazon.de> Cc: stable@vger.kernel.org Fixes: `15d4507152` ("KVM/x86: Add IBPB support") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200501163117.4655-1-sean.j.christopherson@intel.com> [Invert direction of bool argument. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:15:02 -04:00
Sean Christopherson	f27ad73a6e	KVM: VMX: Use accessor to read vmcs.INTR_INFO when handling exception Use vmx_get_intr_info() when grabbing the cached vmcs.INTR_INFO in handle_exception_nmi() to ensure the cache isn't stale. Bypassing the caching accessor doesn't cause any known issues as the cache is always refreshed by handle_exception_nmi_irqoff(), but the whole point of adding the proper caching mechanism was to avoid such dependencies. Fixes: `8791585837` ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200427171837.22613-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:15:01 -04:00
Paolo Bonzini	fede8076aa	KVM: x86: handle wrap around 32-bit address space KVM is not handling the case where EIP wraps around the 32-bit address space (that is, outside long mode). This is needed both in vmx.c and in emulate.c. SVM with NRIPS is okay, but it can still print an error to dmesg due to integer overflow. Reported-by: Nick Peterson <everdox@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:59 -04:00
Paolo Bonzini	c300ab9f08	KVM: x86: Replace late check_nested_events() hack with more precise fix Add an argument to interrupt_allowed and nmi_allowed, to checking if interrupt injection is blocked. Use the hook to handle the case where an interrupt arrives between check_nested_events() and the injection logic. Drop the retry of check_nested_events() that hack-a-fixed the same condition. Blocking injection is also a bit of a hack, e.g. KVM should do exiting and non-exiting interrupt processing in a single pass, but it's a more precise hack. The old comment is also misleading, e.g. KVM_REQ_EVENT is purely an optimization, setting it on every run loop (which KVM doesn't do) should not affect functionality, only performance. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-13-sean.j.christopherson@intel.com> [Extend to SVM, add SMI and NMI. Even though NMI and SMI cannot come asynchronously right now, making the fix generic is easy and removes a special case. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:49 -04:00
Sean Christopherson	7ab0abdb55	KVM: VMX: Use vmx_get_rflags() to query RFLAGS in vmx_interrupt_blocked() Use vmx_get_rflags() instead of manually reading vmcs.GUEST_RFLAGS when querying RFLAGS.IF so that multiple checks against interrupt blocking in a single run loop only require a single VMREAD. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-14-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:48 -04:00
Sean Christopherson	db43859280	KVM: VMX: Use vmx_interrupt_blocked() directly from vmx_handle_exit() Use vmx_interrupt_blocked() instead of bouncing through vmx_interrupt_allowed() when handling edge cases in vmx_handle_exit(). The nested_run_pending check in vmx_interrupt_allowed() should never evaluate true in the VM-Exit path. Hoist the WARN in handle_invalid_guest_state() up to vmx_handle_exit() to enforce the above assumption for the !enable_vnmi case, and to detect any other potential bugs with nested VM-Enter. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-12-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:47 -04:00
Sean Christopherson	1cd2f0b0dd	KVM: nVMX: Prioritize SMI over nested IRQ/NMI Check for an unblocked SMI in vmx_check_nested_events() so that pending SMIs are correctly prioritized over IRQs and NMIs when the latter events will trigger VM-Exit. This also fixes an issue where an SMI that was marked pending while processing a nested VM-Enter wouldn't trigger an immediate exit, i.e. would be incorrectly delayed until L2 happened to take a VM-Exit. Fixes: `64d6067057` ("KVM: x86: stubs for SMM support") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:43 -04:00
Sean Christopherson	15ff0b450b	KVM: nVMX: Preserve IRQ/NMI priority irrespective of exiting behavior Short circuit vmx_check_nested_events() if an unblocked IRQ/NMI is pending and needs to be injected into L2, priority between coincident events is not dependent on exiting behavior. Fixes: `b6b8a1451f` ("KVM: nVMX: Rework interception of IRQs and NMIs") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:42 -04:00
Sean Christopherson	1b660b6baa	KVM: VMX: Split out architectural interrupt/NMI blocking checks Move the architectural (non-KVM specific) interrupt/NMI blocking checks to a separate helper so that they can be used in a future patch by vmx_check_nested_events(). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:39 -04:00
Sean Christopherson	429ab576f3	KVM: nVMX: Report NMIs as allowed when in L2 and Exit-on-NMI is set Report NMIs as allowed when the vCPU is in L2 and L2 is being run with Exit-on-NMI enabled, as NMIs are always unblocked from L1's perspective in this case. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:32 -04:00
Paolo Bonzini	a9fa7cb6aa	KVM: x86: replace is_smm checks with kvm_x86_ops.smi_allowed Do not hardcode is_smm so that all the architectural conditions for blocking SMIs are listed in a single place. Well, in two places because this introduces some code duplication between Intel and AMD. This ensures that nested SVM obeys GIF in kvm_vcpu_has_events. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:31 -04:00
Sean Christopherson	88c604b66e	KVM: x86: Make return for {interrupt_nmi,smi}_allowed() a bool instead of int Return an actual bool for kvm_x86_ops' {interrupt_nmi}_allowed() hook to better reflect the return semantics, and to avoid creating an even bigger mess when the related VMX code is refactored in upcoming patches. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:29 -04:00
Sean Christopherson	d2060bd42e	KVM: nVMX: Open a window for pending nested VMX preemption timer Add a kvm_x86_ops hook to detect a nested pending "hypervisor timer" and use it to effectively open a window for servicing the expired timer. Like pending SMIs on VMX, opening a window simply means requesting an immediate exit. This fixes a bug where an expired VMX preemption timer (for L2) will be delayed and/or lost if a pending exception is injected into L2. The pending exception is rightly prioritized by vmx_check_nested_events() and injected into L2, with the preemption timer left pending. Because no window opened, L2 is free to run uninterrupted. Fixes: `f4124500c2` ("KVM: nVMX: Fully emulate preemption timer") Reported-by: Jim Mattson <jmattson@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-3-sean.j.christopherson@intel.com> [Check it in kvm_vcpu_has_events too, to ensure that the preemption timer is serviced promptly even if the vCPU is halted and L1 is not intercepting HLT. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:27 -04:00
Sean Christopherson	6ce347af14	KVM: nVMX: Preserve exception priority irrespective of exiting behavior Short circuit vmx_check_nested_events() if an exception is pending and needs to be injected into L2, priority between coincident events is not dependent on exiting behavior. This fixes a bug where a single-step #DB that is not intercepted by L1 is incorrectly dropped due to servicing a VMX Preemption Timer VM-Exit. Injected exceptions also need to be blocked if nested VM-Enter is pending or an exception was already injected, otherwise injecting the exception could overwrite an existing event injection from L1. Technically, this scenario should be impossible, i.e. KVM shouldn't inject its own exception during nested VM-Enter. This will be addressed in a future patch. Note, event priority between SMI, NMI and INTR is incorrect for L2, e.g. SMI should take priority over VM-Exit on NMI/INTR, and NMI that is injected into L2 should take priority over VM-Exit INTR. This will also be addressed in a future patch. Fixes: `b6b8a1451f` ("KVM: nVMX: Rework interception of IRQs and NMIs") Reported-by: Jim Mattson <jmattson@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423022550.15113-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 12:14:25 -04:00
Paolo Bonzini	4aef2ec902	Merge branch 'kvm-amd-fixes' into HEAD	2020-05-13 12:14:05 -04:00
Babu Moger	37486135d3	KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c Though rdpkru and wrpkru are contingent upon CR4.PKE, the PKRU resource isn't. It can be read with XSAVE and written with XRSTOR. So, if we don't set the guest PKRU value here(kvm_load_guest_xsave_state), the guest can read the host value. In case of kvm_load_host_xsave_state, guest with CR4.PKE clear could potentially use XRSTOR to change the host PKRU value. While at it, move pkru state save/restore to common code and the host_pkru field to kvm_vcpu_arch. This will let SVM support protection keys. Cc: stable@vger.kernel.org Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Message-Id: <158932794619.44260.14508381096663848853.stgit@naples-babu.amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-13 11:27:41 -04:00
Paolo Bonzini	45981dedf5	KVM: VMX: pass correct DR6 for GD userspace exit When KVM_EXIT_DEBUG is raised for the disabled-breakpoints case (DR7.GD), DR6 was incorrectly copied from the value in the VM. Instead, DR6.BD should be set in order to catch this case. On AMD this does not need any special code because the processor triggers a #DB exception that is intercepted. However, the testcase would fail without the previous patch because both DR6.BS and DR6.BD would be set. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-08 07:44:31 -04:00
Paolo Bonzini	d67668e9dd	KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6 There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the different handling of DR6 on intercepted #DB exceptions on Intel and AMD. On Intel, #DB exceptions transmit the DR6 value via the exit qualification field of the VMCS, and the exit qualification only contains the description of the precise event that caused a vmexit. On AMD, instead the DR6 field of the VMCB is filled in as if the #DB exception was to be injected into the guest. This has two effects when guest debugging is in use: * the guest DR6 is clobbered * the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather than just the last one that happened (the testcase in the next patch covers this issue). This patch fixes both issues by emulating, so to speak, the Intel behavior on AMD processors. The important observation is that (after the previous patches) the VMCB value of DR6 is only ever observable from the guest is KVM_DEBUGREG_WONT_EXIT is set. Therefore we can actually set vmcb->save.dr6 to any value we want as long as KVM_DEBUGREG_WONT_EXIT is clear, which it will be if guest debugging is enabled. Therefore it is possible to enter the guest with an all-zero DR6, reconstruct the #DB payload from the DR6 we get at exit time, and let kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6. Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT is set, but this is harmless. This may not be the most optimized way to deal with this, but it is simple and, being confined within SVM code, it gets rid of the set_dr6 callback and kvm_update_dr6. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-08 07:44:31 -04:00
Paolo Bonzini	5679b803e4	KVM: SVM: keep DR6 synchronized with vcpu->arch.dr6 kvm_x86_ops.set_dr6 is only ever called with vcpu->arch.dr6 as the second argument. Ensure that the VMCB value is synchronized to vcpu->arch.dr6 on #DB (both "normal" and nested) and nested vmentry, so that the current value of DR6 is always available in vcpu->arch.dr6. The get_dr6 callback can just access vcpu->arch.dr6 and becomes redundant. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-08 07:43:47 -04:00
Peter Xu	13196638d5	KVM: X86: Set RTM for DB_VECTOR too for KVM_EXIT_DEBUG RTM should always been set even with KVM_EXIT_DEBUG on #DB. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20200505205000.188252-2-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-07 06:13:41 -04:00
Paolo Bonzini	4d5523cfd5	KVM: x86: fix DR6 delivery for various cases of #DB injection Go through kvm_queue_exception_p so that the payload is correctly delivered through the exit qualification, and add a kvm_update_dr6 call to kvm_deliver_exception_payload that is needed on AMD. Reported-by: Peter Xu <peterx@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-07 06:13:41 -04:00
Sean Christopherson	c7cb2d650c	KVM: VMX: Explicitly clear RFLAGS.CF and RFLAGS.ZF in VM-Exit RSB path Clear CF and ZF in the VM-Exit path after doing __FILL_RETURN_BUFFER so that KVM doesn't interpret clobbered RFLAGS as a VM-Fail. Filling the RSB has always clobbered RFLAGS, its current incarnation just happens clear CF and ZF in the processs. Relying on the macro to clear CF and ZF is extremely fragile, e.g. commit `089dd8e531` ("x86/speculation: Change FILL_RETURN_BUFFER to work with objtool") tweaks the loop such that the ZF flag is always set. Reported-by: Qian Cai <cai@lca.pw> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: stable@vger.kernel.org Fixes: `f2fde6a5bc` ("KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200506035355.2242-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-06 06:51:35 -04:00
Sean Christopherson	f9336e3281	KVM: nVMX: Replace a BUG_ON(1) with BUG() to squash clang warning Use BUG() in the impossible-to-hit default case when switching on the scope of INVEPT to squash a warning with clang 11 due to clang treating the BUG_ON() as conditional. >> arch/x86/kvm/vmx/nested.c:5246:3: warning: variable 'roots_to_free' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] BUG_ON(1); Reported-by: kbuild test robot <lkp@intel.com> Fixes: `ce8fe7b77b` ("KVM: nVMX: Free only the affected contexts when emulating INVEPT") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200504153506.28898-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-04 11:58:55 -04:00
Sean Christopherson	87796555d4	KVM: nVMX: Store vmcs.EXIT_QUALIFICATION as an unsigned long, not u32 Use an unsigned long for 'exit_qual' in nested_vmx_reflect_vmexit(), the EXIT_QUALIFICATION field is naturally sized, not a 32-bit field. The bug is most easily observed by doing VMXON (or any VMX instruction) in L2 with a negative displacement, in which case dropping the upper bits on nested VM-Exit results in L1 calculating the wrong virtual address for the memory operand, e.g. "vmxon -0x8(%rbp)" yields: Unhandled cpu exception 14 #PF at ip 0000000000400553 rbp=0000000000537000 cr2=0000000100536ff8 Fixes: `fbdd502503` ("KVM: nVMX: Move VM-Fail check out of nested_vmx_exit_reflected()") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200423001127.13490-1-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-24 12:51:21 -04:00
Sean Christopherson	9bd4af240f	KVM: nVMX: Drop a redundant call to vmx_get_intr_info() Drop nested_vmx_l1_wants_exit()'s initialization of intr_info from vmx_get_intr_info() that was inadvertantly introduced along with the caching mechanism. EXIT_REASON_EXCEPTION_NMI, the only consumer of intr_info, populates the variable before using it. Fixes: bb53120d67cd ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200421075328.14458-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-23 18:24:28 -04:00
Paolo Bonzini	33b2217245	KVM: x86: move nested-related kvm_x86_ops to a separate struct Clean up some of the patching of kvm_x86_ops, by moving kvm_x86_ops related to nested virtualization into a separate struct. As a result, these ops will always be non-NULL on VMX. This is not a problem: * check_nested_events is only called if is_guest_mode(vcpu) returns true * get_nested_state treats VMXOFF state the same as nested being disabled * set_nested_state fails if you attempt to set nested state while nesting is disabled * nested_enable_evmcs could already be called on a CPU without VMX enabled in CPUID. * nested_get_evmcs_version was fixed in the previous patch Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-23 09:04:57 -04:00
Paolo Bonzini	25091990ef	KVM: eVMCS: check if nesting is enabled In the next patch nested_get_evmcs_version will be always set in kvm_x86_ops for VMX, even if nesting is disabled. Therefore, check whether VMX (aka nesting) is available in the function, the caller will not do the check anymore. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-23 09:04:56 -04:00
Paolo Bonzini	3bda03865f	KVM: s390: Fix for 5.7 and maintainer update - Silence false positive lockdep warning - add Claudio as reviewer -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAABAgAGBQJenY6AAAoJEBF7vIC1phx8bykQAK+QZyD+H/zGNuqeUVn0sh8e yKUVMR+kuE+l57q77nt2AYVxqpCD9xSKRR+SOSLzhVH/HJf625nm+Ny/WOWMebwJ EA/KK+v15T5rga8gFza+4cPg4v/pHwjHhSbjTb1JWg+8cJR1BTj6OxRuTtWr5+25 GF4RhkJOit/VhNbCo1aIgs7/7F1pPALstdPAUsHYe1PeULdRMVqSVluXT2KTPhpi /kzDw8sKKcYgv/eaVdcNoHv+VX1AWIRDAKEttCywyocfbu0ESwadmR7C0qlm1446 HqowP6F0xCF0Whi/65aN4ZOv7wjO/qrV08DZ7JLA3/oKlXtZ1ieyiE2q/P1frSo1 gvmuHiH5/UI6t6a/BSCpJwqcilxKYArqAAYBKoGiJhTbsJStqw0wl41klWTKXlTq VrCvjoUxQ9JMjFCQ1GXOU+ODNyX2IwZYptJ5vF24HYzBJwUBe3HPG9/BA8YcodzG qGQ5IKv0Q1IFTwOqnt557H0MjcBtNIEx54aLJrPy3wldsiNSj39Ft0cuvnbR+Q4F QhKk88dHtd7NW1IirfgYmLGe0rB1ANKM7wUGEdM5w2y5Eg8wCs8/P4KeGh0YyFI9 xPqZDfwof6KkDjOGFXr/CeD/thi+km0/FpePb7cL5Ow4a+JmrCvqQiXrf0TbnFpv t5ZlHnGzoSHsEaRgmJ+X =d46L -----END PGP SIGNATURE----- Merge tag 'kvm-s390-master-5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master KVM: s390: Fix for 5.7 and maintainer update - Silence false positive lockdep warning - add Claudio as reviewer	2020-04-21 09:37:13 -04:00
Wanpeng Li	a9ab13ff6e	KVM: X86: Improve latency for single target IPI fastpath IPI and Timer cause the main MSRs write vmexits in cloud environment observation, let's optimize virtual IPI latency more aggressively to inject target IPI as soon as possible. Running kvm-unit-tests/vmexit.flat IPI testing on SKX server, disable adaptive advance lapic timer and adaptive halt-polling to avoid the interference, this patch can give another 7% improvement. w/o fastpath -> x86.c fastpath 4238 -> 3543 16.4% x86.c fastpath -> vmx.c fastpath 3543 -> 3293 7% w/o fastpath -> vmx.c fastpath 4238 -> 3293 22.3% Cc: Haiwei Li <lihaiwei@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200410174703.1138-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:10 -04:00
Sean Christopherson	873e1da169	KVM: VMX: Optimize handling of VM-Entry failures in vmx_vcpu_run() Mark the VM-Fail, VM-Exit on VM-Enter, and #MC on VM-Enter paths as 'unlikely' so as to improve code generation so that it favors successful VM-Enter. The performance of successful VM-Enter is for more important, irrespective of whether or not success is actually likely. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200410174703.1138-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:09 -04:00
Sean Christopherson	b8d295f96b	KVM: nVMX: Remove non-functional "support" for CR3 target values Remove all references to cr3_target_value[0-3] and replace the fields in vmcs12 with "dead_space" to preserve the vmcs12 layout. KVM doesn't support emulating CR3-target values, despite a variety of code that implies otherwise, as KVM unconditionally reports '0' for the number of supported CR3-target values. This technically fixes a bug where KVM would incorrectly allow VMREAD and VMWRITE to nonexistent fields, i.e. cr3_target_value[0-3]. Per Intel's SDM, the number of supported CR3-target values reported in VMX_MISC also enumerates the existence of the associated VMCS fields: If a future implementation supports more than 4 CR3-target values, they will be encoded consecutively following the 4 encodings given here. Alternatively, the "bug" could be fixed by actually advertisting support for 4 CR3-target values, but that'd likely just enable kvm-unit-tests given that no one has complained about lack of support for going on ten years, e.g. KVM, Xen and HyperV don't use CR3-target values. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200416000739.9012-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:09 -04:00
Sean Christopherson	8791585837	KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags Introduce a new "extended register" type, EXIT_INFO_2 (to pair with the nomenclature in .get_exit_info()), and use it to cache VMX's vmcs.EXIT_INTR_INFO. Drop a comment in vmx_recover_nmi_blocking() that is obsoleted by the generic caching mechanism. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:07 -04:00
Sean Christopherson	5addc23519	KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags Introduce a new "extended register" type, EXIT_INFO_1 (to pair with the nomenclature in .get_exit_info()), and use it to cache VMX's vmcs.EXIT_QUALIFICATION. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:07 -04:00
Sean Christopherson	ec0241f3bb	KVM: nVMX: Drop manual clearing of segment cache on nested VMCS switch Drop the call to vmx_segment_cache_clear() in vmx_switch_vmcs() now that the entire register cache is reset when switching the active VMCS, e.g. vmx_segment_cache_test_set() will reset the segment cache due to VCPU_EXREG_SEGMENTS being unavailable. Move vmx_segment_cache_clear() to vmx.c now that it's no longer invoked by the nested code. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:06 -04:00
Sean Christopherson	e5d03de593	KVM: nVMX: Reset register cache (available and dirty masks) on VMCS switch Reset the per-vCPU available and dirty register masks when switching between vmcs01 and vmcs02, as the masks track state relative to the current VMCS. The stale masks don't cause problems in the current code base because the registers are either unconditionally written on nested transitions or, in the case of segment registers, have an additional tracker that is manually reset. Note, by dropping (previously implicitly, now explicitly) the dirty mask when switching the active VMCS, KVM is technically losing writes to the associated fields. But, the only regs that can be dirtied (RIP, RSP and PDPTRs) are unconditionally written on nested transitions, e.g. explicit writeback is a waste of cycles, and a WARN_ON would be rather pointless. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:06 -04:00
Sean Christopherson	9932b49e5a	KVM: nVMX: Invoke ept_save_pdptrs() if and only if PAE paging is enabled Invoke ept_save_pdptrs() when restoring L1's host state on a "late" VM-Fail if and only if PAE paging is enabled. This saves a CALL in the common case where L1 is a 64-bit host, and avoids incorrectly marking the PDPTRs as dirty. WARN if ept_save_pdptrs() is called with PAE disabled now that the nested usage pre-checks is_pae_paging(). Barring a bug in KVM's MMU, attempting to read the PDPTRs with PAE disabled is now impossible. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:06 -04:00
Sean Christopherson	4dcefa312a	KVM: nVMX: Rename exit_reason to vm_exit_reason for nested VM-Exit Use "vm_exit_reason" for code related to injecting a nested VM-Exit to VM-Exits to make it clear that nested_vmx_vmexit() expects the full exit eason, not just the basic exit reason. The basic exit reason (bits 15:0 of vmcs.VM_EXIT_REASON) is colloquially referred to as simply "exit reason". Note, other flows, e.g. vmx_handle_exit(), are intentionally left as is. A future patch will convert vmx->exit_reason to a union + bit-field, and the exempted flows will interact with the unionized of "exit_reason". Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:05 -04:00
Sean Christopherson	2a7833899f	KVM: nVMX: Cast exit_reason to u16 to check for nested EXTERNAL_INTERRUPT Explicitly check only the basic exit reason when emulating an external interrupt VM-Exit in nested_vmx_vmexit(). Checking the full exit reason doesn't currently cause problems, but only because the only exit reason modifier support by KVM is FAILED_VMENTRY, which is mutually exclusive with EXTERNAL_INTERRUPT. Future modifiers, e.g. ENCLAVE_MODE, will coexist with EXTERNAL_INTERRUPT. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:05 -04:00
Sean Christopherson	f47baaed4f	KVM: nVMX: Pull exit_reason from vcpu_vmx in nested_vmx_reflect_vmexit() Grab the exit reason from the vcpu struct in nested_vmx_reflect_vmexit() instead of having the exit reason explicitly passed from the caller. This fixes a discrepancy between VM-Fail and VM-Exit handling, as the VM-Fail case is already handled by checking vcpu_vmx, e.g. the exit reason previously passed on the stack is bogus if vmx->fail is set. Not taking the exit reason on the stack also avoids having to document that nested_vmx_reflect_vmexit() requires the full exit reason, as opposed to just the basic exit reason, which is not at all obvious since the only usages of the full exit reason are for tracing and way down in prepare_vmcs12() where it's propagated to vmcs12. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:04 -04:00
Sean Christopherson	1d283062c9	KVM: nVMX: Drop a superfluous WARN on reflecting EXTERNAL_INTERRUPT Drop the WARN in nested_vmx_reflect_vmexit() that fires if KVM attempts to reflect an external interrupt. The WARN is blatantly impossible to hit now that nested_vmx_l0_wants_exit() is called from nested_vmx_reflect_vmexit() unconditionally returns true for EXTERNAL_INTERRUPT. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:04 -04:00
Sean Christopherson	2c1f332380	KVM: nVMX: Split VM-Exit reflection logic into L0 vs. L1 wants Split the logic that determines whether a nested VM-Exit is reflected into L1 into "L0 wants" and "L1 wants" to document the core control flow at a high level. If L0 wants the VM-Exit, e.g. because the exit is due to a hardware event that isn't passed through to L1, then KVM should handle the exit in L0 without considering L1's configuration. Then, if L0 doesn't want the exit, KVM needs to query L1's wants to determine whether or not L1 "caused" the exit, e.g. by setting an exiting control, versus the exit occurring due to an L0 setting, e.g. when L0 intercepts an action that L1 chose to pass-through. Note, this adds an extra read on vmcs.VM_EXIT_INTR_INFO for exception. This will be addressed in a future patch via a VMX-wide enhancement, rather than pile on another case where vmx->exit_intr_info is conditionally available. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:03 -04:00
Sean Christopherson	236871b674	KVM: nVMX: Move nested VM-Exit tracepoint into nested_vmx_reflect_vmexit() Move the tracepoint for nested VM-Exits in preparation of splitting the reflection logic into L1 wants the exit vs. L0 always handles the exit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:03 -04:00
Sean Christopherson	fbdd502503	KVM: nVMX: Move VM-Fail check out of nested_vmx_exit_reflected() Check for VM-Fail on nested VM-Enter in nested_vmx_reflect_vmexit() in preparation for separating nested_vmx_exit_reflected() into separate "L0 wants exit exit" and "L1 wants the exit" helpers. Explicitly set exit_intr_info and exit_qual to zero instead of reading them from vmcs02, as they are invalid on VM-Fail (and thankfully ignored by nested_vmx_vmexit() for nested VM-Fail). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:02 -04:00
Sean Christopherson	7b7bd87dbd	KVM: nVMX: Uninline nested_vmx_reflect_vmexit(), i.e. move it to nested.c Uninline nested_vmx_reflect_vmexit() in preparation of refactoring nested_vmx_exit_reflected() to split up the reflection logic into more consumable chunks, e.g. VM-Fail vs. L1 wants the exit vs. L0 always handles the exit. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:02 -04:00
Sean Christopherson	789afc5ccd	KVM: nVMX: Move reflection check into nested_vmx_reflect_vmexit() Move the call to nested_vmx_exit_reflected() from vmx_handle_exit() into nested_vmx_reflect_vmexit() and change the semantics of the return value for nested_vmx_reflect_vmexit() to indicate whether or not the exit was reflected into L1. nested_vmx_exit_reflected() and nested_vmx_reflect_vmexit() are intrinsically tied together, calling one without simultaneously calling the other makes little sense. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415175519.14230-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:13:01 -04:00
Sean Christopherson	be100ef136	KVM: VMX: Clean cr3/pgd handling in vmx_load_mmu_pgd() Rename @cr3 to @pgd in vmx_load_mmu_pgd() to reflect that it will be loaded into vmcs.EPT_POINTER and not vmcs.GUEST_CR3 when EPT is enabled. Similarly, load guest_cr3 with @pgd if and only if EPT is disabled. This fixes one of the last, if not _the_ last, cases in KVM where a variable that is not strictly a cr3 value uses "cr3" instead of "pgd". Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-38-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:59 -04:00
Sean Christopherson	be01e8e2c6	KVM: x86: Replace "cr3" with "pgd" in "new cr3/pgd" related code Rename functions and variables in kvm_mmu_new_cr3() and related code to replace "cr3" with "pgd", i.e. continue the work started by commit `727a7e27cf` ("KVM: x86: rename set_cr3 callback and related flags to load_mmu_pgd"). kvm_mmu_new_cr3() and company are not always loading a new CR3, e.g. when nested EPT is enabled "cr3" is actually an EPTP. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-37-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:59 -04:00
Sean Christopherson	ce8fe7b77b	KVM: nVMX: Free only the affected contexts when emulating INVEPT Add logic to handle_invept() to free only those roots that match the target EPT context when emulating a single-context INVEPT. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-36-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:58 -04:00
Sean Christopherson	9805c5f74b	KVM: nVMX: Don't flush TLB on nested VMX transition Unconditionally skip the TLB flush triggered when reusing a root for a nested transition as nested_vmx_transition_tlb_flush() ensures the TLB is flushed when needed, regardless of whether the MMU can reuse a cached root (or the last root). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-35-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:58 -04:00
Sean Christopherson	41fab65e7c	KVM: nVMX: Skip MMU sync on nested VMX transition when possible Skip the MMU sync when reusing a cached root if EPT is enabled or L1 enabled VPID for L2. If EPT is enabled, guest-physical mappings aren't flushed even if VPID is disabled, i.e. L1 can't expect stale TLB entries to be flushed if it has enabled EPT and L0 isn't shadowing PTEs (for L1 or L2) if L1 has EPT disabled. If VPID is enabled (and EPT is disabled), then L1 can't expect stale TLB entries to be flushed (for itself or L2). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-34-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:57 -04:00
Sean Christopherson	4a632ac6ca	KVM: x86/mmu: Add separate override for MMU sync during fast CR3 switch Add a separate "skip" override for MMU sync, a future change to avoid TLB flushes on nested VMX transitions may need to sync the MMU even if the TLB flush is unnecessary. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-32-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:56 -04:00
Sean Christopherson	4de1f9d469	KVM: VMX: Don't reload APIC access page if its control is disabled Don't reload the APIC access page if its control is disabled, e.g. if the guest is running with x2APIC (likely) or with the local APIC disabled (unlikely), to avoid unnecessary TLB flushes and VMWRITEs. Unconditionally reload the APIC access page and flush the TLB when the guest's virtual APIC transitions to "xAPIC enabled", as any changes to the APIC access page's mapping will not be recorded while the guest's virtual APIC is disabled. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-30-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:55 -04:00
Sean Christopherson	a4148b7ca2	KVM: VMX: Retrieve APIC access page HPA only when necessary Move the retrieval of the HPA associated with L1's APIC access page into VMX code to avoid unnecessarily calling gfn_to_page(), e.g. when the vCPU is in guest mode (L2). Alternatively, the optimization logic in VMX could be mirrored into the common x86 code, but that will get ugly fast when further optimizations are introduced. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-29-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:55 -04:00
Sean Christopherson	1196cb970b	KVM: nVMX: Reload APIC access page on nested VM-Exit only if necessary Defer reloading L1's APIC page by logging the need for a reload and processing it during nested VM-Exit instead of unconditionally reloading the APIC page on nested VM-Exit. This eliminates a TLB flush on the majority of VM-Exits as the APIC page rarely needs to be reloaded. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-28-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:54 -04:00
Sean Christopherson	c51e1ffee5	KVM: nVMX: Selectively use TLB_FLUSH_CURRENT for nested VM-Enter/VM-Exit Flush only the current context, as opposed to all contexts, when requesting a TLB flush to handle the scenario where a L1 does not expect a TLB flush, but one is required because L1 and L2 shared an ASID. This occurs if EPT is disabled (no per-EPTP tag), VPID is enabled (hardware doesn't flush unconditionally) and vmcs02 does not have its own VPID due to exhaustion of available VPIDs. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-27-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:54 -04:00
Sean Christopherson	eeeb4f67a6	KVM: x86: Introduce KVM_REQ_TLB_FLUSH_CURRENT to flush current ASID Add KVM_REQ_TLB_FLUSH_CURRENT to allow optimized TLB flushing of VMX's EPTP/VPID contexts[] from the KVM MMU and/or in a deferred manner, e.g. to flush L2's context during nested VM-Enter. Convert KVM_REQ_TLB_FLUSH to KVM_REQ_TLB_FLUSH_CURRENT in flows where the flush is directly associated with vCPU-scoped instruction emulation, i.e. MOV CR3 and INVPCID. Add a comment in vmx_vcpu_load_vmcs() above its KVM_REQ_TLB_FLUSH to make it clear that it deliberately requests a flush of all contexts. Service any pending flush request on nested VM-Exit as it's possible a nested VM-Exit could occur after requesting a flush for L2. Add the same logic for nested VM-Enter even though it's _extremely_ unlikely for flush to be pending on nested VM-Enter, but theoretically possible (in the future) due to RSM (SMM) emulation. [] Intel also has an Address Space Identifier (ASID) concept, e.g. EPTP+VPID+PCID == ASID, it's just not documented in the SDM because the rules of invalidation are different based on which piece of the ASID is being changed, i.e. whether the EPTP, VPID, or PCID context must be invalidated. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-25-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:53 -04:00
Sean Christopherson	50b265a4ee	KVM: nVMX: Add helper to handle TLB flushes on nested VM-Enter/VM-Exit Add a helper to determine whether or not a full TLB flush needs to be performed on nested VM-Enter/VM-Exit, as the logic is identical for both flows and needs a fairly beefy comment to boot. This also provides a common point to make future adjustments to the logic. Handle vpid12 changes the new helper as well even though it is specific to VM-Enter. The vpid12 logic is an extension of the flushing logic, and it's worth the extra bool parameter to provide a single location for the flushing logic. Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-24-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:52 -04:00
Sean Christopherson	7780938cc7	KVM: x86: Rename ->tlb_flush() to ->tlb_flush_all() Rename ->tlb_flush() to ->tlb_flush_all() in preparation for adding a new hook to flush only the current ASID/context. Opportunstically replace the comment in vmx_flush_tlb() that explains why it flushes all EPTP/VPID contexts with a comment explaining why it unconditionally uses INVEPT when EPT is enabled. I.e. rely on the "all" part of the name to clarify why it does global INVEPT/INVVPID. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-23-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:52 -04:00
Sean Christopherson	33d19ec9b1	KVM: VMX: Introduce vmx_flush_tlb_current() Add a helper to flush TLB entries only for the current EPTP/VPID context and use it for the existing direct invocations of vmx_flush_tlb(). TLB flushes that are specific to the current vCPU state do not need to flush other contexts. Note, both converted call sites happen to be related to the APIC access page, this is purely coincidental. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-21-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:51 -04:00
Sean Christopherson	25d8b84376	KVM: nVMX: Move nested_get_vpid02() to vmx/nested.h Move nested_get_vpid02() to vmx/nested.h so that a future patch can reference it from vmx.c to implement context-specific TLB flushing. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-20-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:51 -04:00
Sean Christopherson	5058b692c6	KVM: VMX: Move vmx_flush_tlb() to vmx.c Move vmx_flush_tlb() to vmx.c and make it non-inline static now that all its callers live in vmx.c. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-19-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:50 -04:00
Sean Christopherson	f55ac304ca	KVM: x86: Drop @invalidate_gpa param from kvm_x86_ops' tlb_flush() Drop @invalidate_gpa from ->tlb_flush() and kvm_vcpu_flush_tlb() now that all callers pass %true for said param, or ignore the param (SVM has an internal call to svm_flush_tlb() in svm_flush_tlb_guest that somewhat arbitrarily passes %false). Remove __vmx_flush_tlb() as it is no longer used. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-17-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:49 -04:00
Sean Christopherson	ad104b5e43	KVM: VMX: Clean up vmx_flush_tlb_gva() Refactor vmx_flush_tlb_gva() to remove a superfluous local variable and clean up its comment, which is oddly located below the code it is commenting. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-16-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-21 09:12:49 -04:00
Sean Christopherson	e64419d991	KVM: x86: Move "flush guest's TLB" logic to separate kvm_x86_ops hook Add a dedicated hook to handle flushing TLB entries on behalf of the guest, i.e. for a paravirtualized TLB flush, and use it directly instead of bouncing through kvm_vcpu_flush_tlb(). For VMX, change the effective implementation implementation to never do INVEPT and flush only the current context, i.e. to always flush via INVVPID(SINGLE_CONTEXT). The INVEPT performed by __vmx_flush_tlb() when @invalidate_gpa=false and enable_vpid=0 is unnecessary, as it will only flush guest-physical mappings; linear and combined mappings are flushed by VM-Enter when VPID is disabled, and changes in the guest pages tables do not affect guest-physical mappings. When EPT and VPID are enabled, doing INVVPID is not required (by Intel's architecture) to invalidate guest-physical mappings, i.e. TLB entries that cache guest-physical mappings can live across INVVPID as the mappings are associated with an EPTP, not a VPID. The intent of @invalidate_gpa is to inform vmx_flush_tlb() that it must "invalidate gpa mappings", i.e. do INVEPT and not simply INVVPID. Other than nested VPID handling, which now calls vpid_sync_context() directly, the only scenario where KVM can safely do INVVPID instead of INVEPT (when EPT is enabled) is if KVM is flushing TLB entries from the guest's perspective, i.e. is only required to invalidate linear mappings. For SVM, flushing TLB entries from the guest's perspective can be done by flushing the current ASID, as changes to the guest's page tables are associated only with the current ASID. Adding a dedicated ->tlb_flush_guest() paves the way toward removing @invalidate_gpa, which is a potentially dangerous control flag as its meaning is not exactly crystal clear, even for those who are familiar with the subtleties of what mappings Intel CPUs are/aren't allowed to keep across various invalidation scenarios. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-15-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:26:10 -04:00
Sean Christopherson	bc41d0c40e	KVM: nVMX: Use vpid_sync_vcpu_addr() to emulate INVVPID with address Use vpid_sync_vcpu_addr() to emulate the "individual address" variant of INVVPID now that said function handles the fallback case of the (host) CPU not supporting "individual address". Note, the "vpid == 0" checks in the vpid_sync_*() helpers aren't actually redundant with the "!operand.vpid" check in handle_invvpid(), as the vpid passed to vpid_sync_vcpu_addr() is a KVM (host) controlled value, i.e. vpid02 can be zero even if operand.vpid is non-zero. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-14-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:26:09 -04:00
Sean Christopherson	ca431c0cc3	KVM: VMX: Drop redundant capability checks in low level INVVPID helpers Remove the INVVPID capabilities checks from vpid_sync_vcpu_single() and vpid_sync_vcpu_global() now that all callers ensure the INVVPID variant is supported. Note, in some cases the guarantee is provided in concert with hardware_setup(), which enables VPID if and only if at least of invvpid_single() or invvpid_global() is supported. Drop the WARN_ON_ONCE() from vmx_flush_tlb() as vpid_sync_vcpu_single() will trigger a WARN() on INVVPID failure, i.e. if SINGLE_CONTEXT isn't supported. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-13-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:26:08 -04:00
Sean Christopherson	ab4b3597ff	KVM: VMX: Handle INVVPID fallback logic in vpid_sync_vcpu_addr() Directly invoke vpid_sync_context() to do a global INVVPID when the individual address variant is not supported instead of deferring such behavior to the caller. This allows for additional consolidation of code as the logic is basically identical to the emulation of the individual address variant in handle_invvpid(). No functional change intended. Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-12-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:26:08 -04:00
Sean Christopherson	8a8b097c6c	KVM: VMX: Move vpid_sync_vcpu_addr() down a few lines Move vpid_sync_vcpu_addr() below vpid_sync_context() so that it can be refactored in a future patch to call vpid_sync_context() directly when the "individual address" INVVPID variant isn't supported. No functional change intended. Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:26:07 -04:00
Sean Christopherson	446ace4bca	KVM: VMX: Use vpid_sync_context() directly when possible Use vpid_sync_context() directly for flows that run if and only if enable_vpid=1, or more specifically, nested VMX flows that are gated by vmx->nested.msrs.secondary_ctls_high.SECONDARY_EXEC_ENABLE_VPID being set, which is allowed if and only if enable_vpid=1. Because these flows call __vmx_flush_tlb() with @invalidate_gpa=false, the if-statement that decides between INVEPT and INVVPID will always go down the INVVPID path, i.e. call vpid_sync_context() because "enable_ept && (invalidate_gpa \|\| !enable_vpid)" always evaluates false. This helps pave the way toward removing @invalidate_gpa and @vpid from __vmx_flush_tlb() and its callers. Opportunstically drop unnecessary brackets in handle_invvpid() around an affected __vmx_flush_tlb()->vpid_sync_context() conversion. No functional change intended. Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:26:06 -04:00
Sean Christopherson	c746b3a4b8	KVM: VMX: Skip global INVVPID fallback if vpid==0 in vpid_sync_context() Skip the global INVVPID in the unlikely scenario that vpid==0 and the SINGLE_CONTEXT variant of INVVPID is unsupported. If vpid==0, there's no need to INVVPID as it's impossible to do VM-Enter with VPID enabled and vmcs.VPID==0, i.e. there can't be any TLB entries for the vCPU with vpid==0. The fact that the SINGLE_CONTEXT variant isn't supported is irrelevant. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:26:06 -04:00
Junaid Shahid	ee1fa209f5	KVM: x86: Sync SPTEs when injecting page/EPT fault into L1 When injecting a page fault or EPT violation/misconfiguration, KVM is not syncing any shadow PTEs associated with the faulting address, including those in previous MMUs that are associated with L1's current EPTP (in a nested EPT scenario), nor is it flushing any hardware TLB entries. All this is done by kvm_mmu_invalidate_gva. Page faults that are either !PRESENT or RSVD are exempt from the flushing, as the CPU is not allowed to cache such translations. Signed-off-by: Junaid Shahid <junaids@google.com> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:26:05 -04:00
Junaid Shahid	d6e3f8385d	KVM: nVMX: Invalidate all roots when emulating INVVPID without EPT Free all roots when emulating INVVPID for L1 and EPT is disabled, as outstanding changes to the page tables managed by L1 need to be recognized. Because L1 and L2 share an MMU when EPT is disabled, and because VPID is not tracked by the MMU role, all roots in the current MMU (root_mmu) need to be freed, otherwise a future nested VM-Enter or VM-Exit could do a fast CR3 switch (without a flush/sync) and consume stale SPTEs. Fixes: `5c614b3583` ("KVM: nVMX: nested VPID emulation") Signed-off-by: Junaid Shahid <junaids@google.com> [sean: ported to upstream KVM, reworded the comment and changelog] Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-5-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-15 12:08:49 -04:00
Sean Christopherson	f8aa7e3958	KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1 Free all L2 (guest_mmu) roots when emulating INVEPT for L1. Outstanding changes to the EPT tables managed by L1 need to be recognized, and relying on KVM to always flush L2's EPTP context on nested VM-Enter is dangerous. Similar to handle_invpcid(), rely on kvm_mmu_free_roots() to do a remote TLB flush if necessary, e.g. if L1 has never entered L2 then there is nothing to be done. Nuking all L2 roots is overkill for the single-context variant, but it's the safe and easy bet. A more precise zap mechanism will be added in the future. Add a TODO to call out that KVM only needs to invalidate affected contexts. Fixes: `14c07ad89f` ("x86/kvm/mmu: introduce guest_mmu") Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-4-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-15 12:08:49 -04:00
Sean Christopherson	eed0030e4c	KVM: nVMX: Validate the EPTP when emulating INVEPT(EXTENT_CONTEXT) Signal VM-Fail for the single-context variant of INVEPT if the specified EPTP is invalid. Per the INEVPT pseudocode in Intel's SDM, it's subject to the standard EPT checks: If VM entry with the "enable EPT" VM execution control set to 1 would fail due to the EPTP value then VMfail(Invalid operand to INVEPT/INVVPID); Fixes: `bfd0a56b90` ("nEPT: Nested INVEPT") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-3-sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-15 12:08:48 -04:00
Sean Christopherson	e8eff28215	KVM: VMX: Flush all EPTP/VPID contexts on remote TLB flush Flush all EPTP/VPID contexts if a TLB flush _may_ have been triggered by a remote or deferred TLB flush, i.e. by KVM_REQ_TLB_FLUSH. Remote TLB flushes require all contexts to be invalidated, not just the active contexts, e.g. all mappings in all contexts for a given HVA need to be invalidated on a mmu_notifier invalidation. Similarly, the instigator of the deferred TLB flush may be expecting all contexts to be flushed, e.g. vmx_vcpu_load_vmcs(). Without nested VMX, flushing only the current EPTP/VPID context isn't problematic because KVM uses a constant VPID for each vCPU, and mmu_alloc_direct_roots() all but guarantees KVM will use a single EPTP for L1. In the rare case where a different EPTP is created or reused, KVM (currently) unconditionally flushes the new EPTP context prior to entering the guest. With nested VMX, KVM conditionally uses a different VPID for L2, and unconditionally uses a different EPTP for L2. Because KVM doesn't _intentionally_ guarantee L2's EPTP/VPID context is flushed on nested VM-Enter, it'd be possible for a malicious L1 to attack the host and/or different VMs by exploiting the lack of flushing for L2. 1) Launch nested guest from malicious L1. 2) Nested VM-Enter to L2. 3) Access target GPA 'g'. CPU inserts TLB entry tagged with L2's ASID mapping 'g' to host PFN 'x'. 2) Nested VM-Exit to L1. 3) L1 triggers kernel same-page merging (ksm) by duplicating/zeroing the page for PFN 'x'. 4) Host kernel merges PFN 'x' with PFN 'y', i.e. unmaps PFN 'x' and remaps the page to PFN 'y'. mmu_notifier sends invalidate command, KVM flushes TLB only for L1's ASID. 4) Host kernel reallocates PFN 'x' to some other task/guest. 5) Nested VM-Enter to L2. KVM does not invalidate L2's EPTP or VPID. 6) L2 accesses GPA 'g' and gains read/write access to PFN 'x' via its stale TLB entry. However, current KVM unconditionally flushes L1's EPTP/VPID context on nested VM-Exit. But, that behavior is mostly unintentional, KVM doesn't go out of its way to flush EPTP/VPID on nested VM-Enter/VM-Exit, rather a TLB flush is guaranteed to occur prior to re-entering L1 due to __kvm_mmu_new_cr3() always being called with skip_tlb_flush=false. On nested VM-Enter, this happens via kvm_init_shadow_ept_mmu() (nested EPT enabled) or in nested_vmx_load_cr3() (nested EPT disabled). On nested VM-Exit it occurs via nested_vmx_load_cr3(). This also fixes a bug where a deferred TLB flush in the context of L2, with EPT disabled, would flush L1's VPID instead of L2's VPID, as vmx_flush_tlb() flushes L1's VPID regardless of is_guest_mode(). Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Junaid Shahid <junaids@google.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: John Haxby <john.haxby@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Fixes: `efebf0aaec` ("KVM: nVMX: Do not flush TLB on L1<->L2 transitions if L1 uses VPID and EPT") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-15 12:08:48 -04:00

1 2 3 4 5 ...

743 commits