linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-11-01 17:08:10 +00:00

Author	SHA1	Message	Date
Alexander Graf	1ae099540e	KVM: x86: Allow deflecting unknown MSR accesses to user space MSRs are weird. Some of them are normal control registers, such as EFER. Some however are registers that really are model specific, not very interesting to virtualization workloads, and not performance critical. Others again are really just windows into package configuration. Out of these MSRs, only the first category is necessary to implement in kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against certain CPU models and MSRs that contain information on the package level are much better suited for user space to process. However, over time we have accumulated a lot of MSRs that are not the first category, but still handled by in-kernel KVM code. This patch adds a generic interface to handle WRMSR and RDMSR from user space. With this, any future MSR that is part of the latter categories can be handled in user space. Furthermore, it allows us to replace the existing "ignore_msrs" logic with something that applies per-VM rather than on the full system. That way you can run productive VMs in parallel to experimental ones where you don't care about proper MSR handling. Signed-off-by: Alexander Graf <graf@amazon.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200925143422.21718-3-graf@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:58:04 -04:00
Alexander Graf	90218e434c	KVM: x86: Return -ENOENT on unimplemented MSRs When we find an MSR that we can not handle, bubble up that error code as MSR error return code. Follow up patches will use that to expose the fact that an MSR is not handled by KVM to user space. Suggested-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20200925143422.21718-2-graf@amazon.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:58:02 -04:00
Sean Christopherson	802145c56a	KVM: VMX: Rename vmx_uret_msr's "index" to "slot" Rename "index" to "slot" in struct vmx_uret_msr to align with the terminology used by common x86's kvm_user_return_msrs, and to avoid conflating "MSR's ECX index" with "MSR's index into an array". No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-16-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:58:02 -04:00
Sean Christopherson	14a61b642d	KVM: VMX: Rename "vmx_msr_index" to "vmx_uret_msrs_list" Rename "vmx_msr_index" to "vmx_uret_msrs_list" to associate it with the uret MSRs array, and to avoid conflating "MSR's ECX index" with "MSR's index into an array". Similarly, don't use "slot" in the name as that terminology is claimed by the common x86 "user_return_msrs" mechanism. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-15-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:58:01 -04:00
Sean Christopherson	7bf662bb5e	KVM: VMX: Rename "vmx_set_guest_msr" to "vmx_set_guest_uret_msr" Add "uret" to vmx_set_guest_msr() to explicitly associate it with the guest_uret_msrs array, and to differentiate it from vmx_set_msr() as well as VMX's load/store MSRs. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-14-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:58:01 -04:00
Sean Christopherson	d85a8034c0	KVM: VMX: Rename "find_msr_entry" to "vmx_find_uret_msr" Rename "find_msr_entry" to scope it to VMX and to associate it with guest_uret_msrs. Drop the "entry" so that the function name pairs with the existing __vmx_find_uret_msr(), which intentionally uses a double underscore prefix instead of appending "index" or "slot" as those names are already claimed by other pieces of the user return MSR stack. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-13-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:58:00 -04:00
Sean Christopherson	bd65ba82b3	KVM: VMX: Add vmx_setup_uret_msr() to handle lookup and swap Add vmx_setup_uret_msr() to wrap the lookup and manipulation of the uret MSRs array during setup_msrs(). In addition to consolidating code, this eliminates move_msr_up(), which while being a very literally description of the function, isn't exacly helpful in understanding the net effect of the code. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-12-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:58:00 -04:00
Sean Christopherson	86e3e494fe	KVM: VMX: Move uret MSR lookup into update_transition_efer() Move checking for the existence of MSR_EFER in the uret MSR array into update_transition_efer() so that the lookup and manipulation of the array in setup_msrs() occur back-to-back. This paves the way toward adding a helper to wrap the lookup and manipulation. To avoid unnecessary overhead, defer the lookup until the uret array would actually be modified in update_transition_efer(). EFER obviously exists on CPUs that support the dedicated VMCS fields for switching EFER, and EFER must exist for the guest and host EFER.NX value to diverge, i.e. there is no danger of attempting to read/write EFER when it doesn't exist. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-11-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:59 -04:00
Sean Christopherson	ef1d2ee12e	KVM: VMX: Check guest support for RDTSCP before processing MSR_TSC_AUX Check for RDTSCP support prior to checking if MSR_TSC_AUX is in the uret MSRs array so that the array lookup and manipulation are back-to-back. This paves the way toward adding a helper to wrap the lookup and manipulation. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:59 -04:00
Sean Christopherson	1e7a483037	KVM: VMX: Rename "__find_msr_index" to "__vmx_find_uret_msr" Rename "__find_msr_index" to scope it to VMX, associate it with guest_uret_msrs, and to avoid conflating "MSR's ECX index" with "MSR's array index". Similarly, don't use "slot" in the name so as to avoid colliding the common x86's half of "user_return_msrs" (the slot in kvm_user_return_msrs is not the same slot in guest_uret_msrs). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:58 -04:00
Sean Christopherson	658ece84f5	KVM: VMX: Rename vcpu_vmx's "guest_msrs_ready" to "guest_uret_msrs_loaded" Add "uret" to "guest_msrs_ready" to explicitly associate it with the "guest_uret_msrs" array, and replace "ready" with "loaded" to more precisely reflect what it tracks, e.g. "ready" could be interpreted as meaning ready for processing (setup_msrs() has run), which is wrong. "loaded" also aligns with the similar "guest_state_loaded" field. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:58 -04:00
Sean Christopherson	e9bb1ae92d	KVM: VMX: Rename vcpu_vmx's "save_nmsrs" to "nr_active_uret_msrs" Add "uret" into the name of "save_nmsrs" to explicitly associate it with the guest_uret_msrs array, and replace "save" with "active" (for lack of a better word) to better describe what is being tracked. While "save" is more or less accurate when viewed as a literal description of the field, e.g. it holds the number of MSRs that were saved into the array the last time setup_msrs() was invoked, it can easily be misinterpreted by the reader, e.g. as meaning the number of MSRs that were saved from hardware at some point in the past, or as the number of MSRs that need to be saved at some point in the future, both of which are wrong. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:57 -04:00
Sean Christopherson	fbc1800738	KVM: VMX: Rename vcpu_vmx's "nmsrs" to "nr_uret_msrs" Rename vcpu_vmx.nsmrs to vcpu_vmx.nr_uret_msrs to explicitly associate it with the guest_uret_msrs array. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:57 -04:00
Sean Christopherson	eb3db1b137	KVM: VMX: Rename the "shared_msr_entry" struct to "vmx_uret_msr" Rename struct "shared_msr_entry" to "vmx_uret_msr" to align with x86's rename of "shared_msrs" to "user_return_msrs", and to call out that the struct is specific to VMX, i.e. not part of the generic "shared_msrs" framework. Abbreviate "user_return" as "uret" to keep line lengths marginally sane and code more or less readable. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:56 -04:00
Sean Christopherson	a128a934f2	KVM: VMX: Rename "vmx_find_msr_index" to "vmx_find_loadstore_msr_slot" Add "loadstore" to vmx_find_msr_index() to differentiate it from the so called shared MSRs helpers (which will soon be renamed), and replace "index" with "slot" to better convey that the helper returns slot in the array, not the MSR index (the value that gets stuffed into ECX). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:55 -04:00
Sean Christopherson	ce833b2324	KVM: VMX: Prepend "MAX_" to MSR array size defines Add "MAX" to the LOADSTORE and so called SHARED MSR defines to make it more clear that the define controls the array size, as opposed to the actual number of valid entries that are in the array. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:55 -04:00
Sean Christopherson	7e34fbd05c	KVM: x86: Rename "shared_msrs" to "user_return_msrs" Rename the "shared_msrs" mechanism, which is used to defer restoring MSRs that are only consumed when running in userspace, to a more banal but less likely to be confusing "user_return_msrs". The "shared" nomenclature is confusing as it's not obvious who is sharing what, e.g. reasonable interpretations are that the guest value is shared by vCPUs in a VM, or that the MSR value is shared/common to guest and host, both of which are wrong. "shared" is also misleading as the MSR value (in hardware) is not guaranteed to be shared/reused between VMs (if that's indeed the correct interpretation of the name), as the ability to share values between VMs is simply a side effect (albiet a very nice side effect) of deferring restoration of the host value until returning from userspace. "user_return" avoids the above confusion by describing the mechanism itself instead of its effects. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:54 -04:00
Sean Christopherson	04d28e3752	KVM: x86/mmu: Move individual kvm_mmu initialization into common helper Move initialization of 'struct kvm_mmu' fields into alloc_mmu_pages() to consolidate code, and rename the helper to __kvm_mmu_create(). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923163314.8181-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:54 -04:00
Sean Christopherson	02f1965ff8	KVM: nVMX: Read EXIT_QUAL and INTR_INFO only when needed for nested exit Read vmcs.EXIT_QUALIFICATION and vmcs.VM_EXIT_INTR_INFO only if the VM-Exit is being reflected to L1 now that they are no longer passed directly to the kvm_nested_vmexit tracepoint. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923201349.16097-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:53 -04:00
Sean Christopherson	cc167bd7ee	KVM: x86: Use common definition for kvm_nested_vmexit tracepoint Use the newly introduced TRACE_EVENT_KVM_EXIT to define the guts of kvm_nested_vmexit so that it captures and prints the same information as kvm_exit. This has the bonus side effect of fixing the interrupt info and error code printing for the case where they're invalid, e.g. if the exit was a failed VM-Entry. This also sets the stage for retrieving EXIT_QUALIFICATION and VM_EXIT_INTR_INFO in nested_vmx_reflect_vmexit() if and only if the VM-Exit is being routed to L1. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923201349.16097-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:52 -04:00
Sean Christopherson	029e8c8ad6	KVM: x86: Add macro wrapper for defining kvm_exit tracepoint Macrofy the definition of kvm_exit so that the definition can be reused verbatim by kvm_nested_vmexit. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923201349.16097-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:52 -04:00
Sean Christopherson	235ba74f00	KVM: x86: Add intr/vectoring info and error code to kvm_exit tracepoint Extend the kvm_exit tracepoint to align it with kvm_nested_vmexit in terms of what information is captured. On SVM, add interrupt info and error code, while on VMX it add IDT vectoring and error code. This sets the stage for macrofying the kvm_exit tracepoint definition so that it can be reused for kvm_nested_vmexit without loss of information. Opportunistically stuff a zero for VM_EXIT_INTR_INFO if the VM-Enter failed, as the field is guaranteed to be invalid. Note, it'd be possible to further filter the interrupt/exception fields based on the VM-Exit reason, but the helper is intended only for tracepoints, i.e. an extra VMREAD or two is a non-issue, the failed VM-Enter case is just low hanging fruit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923201349.16097-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:51 -04:00
Sean Christopherson	f315f2b140	KVM: VMX: Add a helper to test for a valid error code given an intr info Add a helper, is_exception_with_error_code(), to provide the simple but difficult to read code of checking for a valid exception with an error code given a vmcs.VM_EXIT_INTR_INFO value. The helper will gain another user, vmx_get_exit_info(), in a future patch. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923201349.16097-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:50 -04:00
Sean Christopherson	a9d7d76c66	KVM: x86: Read guest RIP from within the kvm_nested_vmexit tracepoint Use kvm_rip_read() to read the guest's RIP for the nested VM-Exit tracepoint instead of having the caller pass in an argument. Params that are passed into a tracepoint are evaluated even if the tracepoint is disabled, i.e. passing in RIP for VMX incurs a VMREAD and retpoline to retrieve a value that may never be used, e.g. if the exit is due to a hardware interrupt. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923201349.16097-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:50 -04:00
Sean Christopherson	b2d522552c	KVM: x86: Add RIP to the kvm_entry, i.e. VM-Enter, tracepoint Add RIP to the kvm_entry tracepoint to help debug if the kvm_exit tracepoint is disabled or if VM-Enter fails, in which case the kvm_exit tracepoint won't be hit. Read RIP from within the tracepoint itself to avoid a potential VMREAD and retpoline if the guest's RIP isn't available. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923201349.16097-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:49 -04:00
Sean Christopherson	138534a810	KVM: nVMX: WARN on attempt to switch the currently loaded VMCS WARN if KVM attempts to switch to the currently loaded VMCS. Now that nested_vmx_free_vcpu() doesn't blindly call vmx_switch_vmcs(), all paths that lead to vmx_switch_vmcs() are implicitly guarded by guest vs. host mode, e.g. KVM should never emulate VMX instructions when guest mode is active, and nested_vmx_vmexit() should never be called when host mode is active. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923184452.980-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:49 -04:00
Sean Christopherson	ebec153a05	KVM: nVMX: Drop redundant VMCS switch and free_nested() call Remove the explicit switch to vmcs01 and the call to free_nested() in nested_vmx_free_vcpu(). free_nested(), which is called unconditionally by vmx_leave_nested(), ensures vmcs01 is loaded prior to freeing vmcs02 and friends. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923184452.980-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:46 -04:00
Sean Christopherson	df82a24b29	KVM: nVMX: Ensure vmcs01 is the loaded VMCS when freeing nested state Add a WARN in free_nested() to ensure vmcs01 is loaded prior to freeing vmcs02 and friends, and explicitly switch to vmcs01 if it's not. KVM is supposed to keep is_guest_mode() and loaded_vmcs==vmcs02 synchronized, but bugs happen and freeing vmcs02 while it's in use will escalate a KVM error to a use-after-free and potentially crash the kernel. Do the WARN and switch even in the !vmxon case to help detect latent bugs. free_nested() is not a hot path, and the check is cheap. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923184452.980-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:46 -04:00
Sean Christopherson	c61ca2fcbc	KVM: nVMX: Move free_nested() below vmx_switch_vmcs() Move free_nested() down below vmx_switch_vmcs() so that a future patch can do an "emergency" invocation of vmx_switch_vmcs() if vmcs01 is not the loaded VMCS when freeing nested resources. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923184452.980-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:45 -04:00
Sean Christopherson	2ba4493a8b	KVM: nVMX: Explicitly check for valid guest state for !unrestricted guest Call guest_state_valid() directly instead of querying emulation_required when checking if L1 is attempting VM-Enter with invalid guest state. If emulate_invalid_guest_state is false, KVM will fixup segment regs to avoid emulation and will never set emulation_required, i.e. KVM will incorrectly miss the associated consistency checks because the nested path stuffs segments directly into vmcs02. Opportunsitically add Consistency Check tracing to make future debug suck a little less. Fixes: `2bb8cafea8` ("KVM: vVMX: signal failure for nested VMEntry if emulation_required") Fixes: `3184a995f7` ("KVM: nVMX: fix vmentry failure code when L2 state would require emulation") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923184452.980-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:45 -04:00
Sean Christopherson	b89d5ad00e	KVM: nVMX: Reload vmcs01 if getting vmcs12's pages fails Reload vmcs01 when bailing from nested_vmx_enter_non_root_mode() as KVM expects vmcs01 to be loaded when is_guest_mode() is false. Fixes: `671ddc700f` ("KVM: nVMX: Don't leak L1 MMIO regions to L2") Cc: stable@vger.kernel.org Cc: Dan Cross <dcross@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923184452.980-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:44 -04:00
Sean Christopherson	fc387d8daf	KVM: nVMX: Reset the segment cache when stuffing guest segs Explicitly reset the segment cache after stuffing guest segment regs in prepare_vmcs02_rare(). Although the cache is reset when switching to vmcs02, there is nothing that prevents KVM from re-populating the cache prior to writing vmcs02 with vmcs12's values. E.g. if the vCPU is preempted after switching to vmcs02 but before prepare_vmcs02_rare(), kvm_arch_vcpu_put() will dereference GUEST_SS_AR_BYTES via .get_cpl() and cache the stale vmcs02 value. While the current code base only caches stale data in the preemption case, it's theoretically possible future code could read a segment register during the nested flow itself, i.e. this isn't technically illegal behavior in kvm_arch_vcpu_put(), although it did introduce the bug. This manifests as an unexpected nested VM-Enter failure when running with unrestricted guest disabled if the above preemption case coincides with L1 switching L2's CPL, e.g. when switching from a L2 vCPU at CPL3 to to a L2 vCPU at CPL0. stack_segment_valid() will see the new SS_SEL but the old SS_AR_BYTES and incorrectly mark the guest state as invalid due to SS.dpl != SS.rpl. Don't bother updating the cache even though prepare_vmcs02_rare() writes every segment. With unrestricted guest, guest segments are almost never read, let alone L2 guest segments. On the other hand, populating the cache requires a large number of memory writes, i.e. it's unlikely to be a net win. Updating the cache would be a win when unrestricted guest is not supported, as guest_state_valid() will immediately cache all segment registers. But, nested virtualization without unrestricted guest is dirt slow, saving some VMREADs won't change that, and every CPU manufactured in the last decade supports unrestricted guest. In other words, the extra (minor) complexity isn't worth the trouble. Note, kvm_arch_vcpu_put() may see stale data when querying guest CPL depending on when preemption occurs. This is "ok" in that the usage is imperfect by nature, i.e. it's used heuristically to improve performance but doesn't affect functionality. kvm_arch_vcpu_put() could be "fixed" by also disabling preemption while loading segments, but that's pointless and misleading as reading state from kvm_sched_{in,out}() is guaranteed to see stale data in one form or another. E.g. even if all the usage of regs_avail is fixed to call kvm_register_mark_available() after the associated state is set, the individual state might still be stale with respect to the overall vCPU state. I.e. making functional decisions in an asynchronous hook is doomed from the get go. Thankfully KVM doesn't do that. Fixes: `de63ad4cf4` ("KVM: X86: implement the logic for spinlock optimization") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923184452.980-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:44 -04:00
Sean Christopherson	e88b809369	KVM: x86/mmu: Track write/user faults using bools Use bools to track write and user faults throughout the page fault paths and down into mmu_set_spte(). The actual usage is purely boolean, but that's not obvious without digging into all paths as the current code uses a mix of bools (TDP and try_async_pf) and ints (shadow paging and mmu_set_spte()). No true functional change intended (although the pgprintk() will now print 0/1 instead of 0/PFERR_WRITE_MASK). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183735.584-9-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:43 -04:00
Sean Christopherson	dcc7065170	KVM: x86/mmu: Hoist ITLB multi-hit workaround check up a level Move the "ITLB multi-hit workaround enabled" check into the callers of disallowed_hugepage_adjust() to make it more obvious that the helper is specific to the workaround, and to be consistent with the accounting, i.e. account_huge_nx_page() is called if and only if the workaround is enabled. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183735.584-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:43 -04:00
Sean Christopherson	1d4a7372e1	KVM: x86/mmu: Rename 'hlevel' to 'level' in FNAME(fetch) Rename 'hlevel', which presumably stands for 'host level', to simply 'level' in FNAME(fetch). The variable hasn't tracked the host level for quite some time. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183735.584-7-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:42 -04:00
Sean Christopherson	5bcaf3e171	KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested Condition the accounting of a disallowed huge NX page on the original requested level of the page being greater than the current iterator level. This does two things: accounts the page if and only if a huge page was actually disallowed, and accounts the shadow page if and only if it was the level at which the huge page was disallowed. For the latter case, the previous logic would account all shadow pages used to create the translation for the forced small page, e.g. even PML4, which can't be a huge page on current hardware, would be accounted as having been a disallowed huge page when using 5-level EPT. The overzealous accounting is purely a performance issue, i.e. the recovery thread will spuriously zap shadow pages, but otherwise the bad behavior is harmless. Cc: Junaid Shahid <junaids@google.com> Fixes: `b8e8c8303f` ("kvm: mmu: ITLB_MULTIHIT mitigation") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183735.584-6-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:41 -04:00
Sean Christopherson	3cf066127e	KVM: x86/mmu: Capture requested page level before NX huge page workaround Apply the "huge page disallowed" adjustment of the max level only after capturing the original requested level. The requested level will be used in a future patch to skip adding pages to the list of disallowed huge pages if a huge page wasn't possible anyways, e.g. if the page isn't mapped as a huge page in the host. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183735.584-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:41 -04:00
Sean Christopherson	6c2fd34f5c	KVM: x86/mmu: Move "huge page disallowed" calculation into mapping helpers Calculate huge_page_disallowed in __direct_map() and FNAME(fetch) in preparation for reworking the calculation so that it preserves the requested map level and eventually to avoid flagging a shadow page as being disallowed for being used as a large/huge page when it couldn't have been huge in the first place, e.g. because the backing page in the host is not large. Pass the error code into the helpers and use it to recalcuate exec and write_fault instead adding yet more booleans to the parameters. Opportunistically use huge_page_disallowed instead of lpage_disallowed to match the nomenclature used within the mapping helpers (though even they have existing inconsistencies). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183735.584-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:40 -04:00
Sean Christopherson	7d919c7a38	KVM: x86/mmu: Refactor the zap loop for recovering NX lpages Refactor the zap loop in kvm_recover_nx_lpages() to be a for loop that iterates on to_zap and drop the !to_zap check that leads to the in-loop calling of kvm_mmu_commit_zap_page(). The in-loop commit when to_zap hits zero is superfluous now that there's an unconditional commit after the loop to handle the case where lpage_disallowed_mmu_pages is emptied. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183735.584-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:39 -04:00
Sean Christopherson	e89505698c	KVM: x86/mmu: Commit zap of remaining invalid pages when recovering lpages Call kvm_mmu_commit_zap_page() after exiting the "prepare zap" loop in kvm_recover_nx_lpages() to finish zapping pages in the unlikely event that the loop exited due to lpage_disallowed_mmu_pages being empty. Because the recovery thread drops mmu_lock() when rescheduling, it's possible that lpage_disallowed_mmu_pages could be emptied by a different thread without to_zap reaching zero despite to_zap being derived from the number of disallowed lpages. Fixes: `1aa9b9572b` ("kvm: x86: mmu: Recovery of shattered NX large pages") Cc: Junaid Shahid <junaids@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183735.584-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:39 -04:00
Sean Christopherson	5a085326d5	KVM: VMX: Rename ops.h to vmx_ops.h Rename ops.h to vmx_ops.h to allow adding a tdx_ops.h in the future without causing massive confusion. Trust Domain Extensions (TDX) is built on VMX, but KVM cannot directly access the VMCS(es) for a TDX guest, thus TDX will need its own "ops" implementation for wrapping the low level operations. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183112.3030-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:38 -04:00
Xiaoyao Li	8888cdd099	KVM: VMX: Extract posted interrupt support to separate files Extract the posted interrupt code so that it can be reused for Trust Domain Extensions (TDX), which requires posted interrupts and can use KVM VMX's implementation almost verbatim. TDX is different enough from raw VMX that it is highly desirable to implement the guts of TDX in a separate file, i.e. reusing posted interrupt code by shoving TDX support into vmx.c would be a mess. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923183112.3030-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:38 -04:00
Sean Christopherson	127037591c	KVM: x86/mmu: Bail early from final #PF handling on spurious faults Detect spurious page faults, e.g. page faults that occur when multiple vCPUs simultaneously access a not-present page, and skip the SPTE write, prefetch, and stats update for spurious faults. Note, the performance benefits of skipping the write and prefetch are likely negligible, and the false positive stats adjustment is probably lost in the noise. The primary motivation is to play nice with TDX's SEPT in the long term. SEAMCALLs (to program SEPT entries) are quite costly, e.g. thousands of cycles, and a spurious SEPT update will result in a SEAMCALL error (which KVM will ideally treat as fatal). Reported-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923220425.18402-5-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:37 -04:00
Sean Christopherson	c4371c2a68	KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed Introduce RET_PF_FIXED and RET_PF_SPURIOUS to provide unique return values instead of overloading RET_PF_RETRY. In the short term, the unique values add clarity to the code and RET_PF_SPURIOUS will be used by set_spte() to avoid unnecessary work for spurious faults. In the long term, TDX will use RET_PF_FIXED to deterministically map memory during pre-boot. The page fault flow may bail early for benign reasons, e.g. if the mmu_notifier fires for an unrelated address. With only RET_PF_RETRY, it's impossible for the caller to distinguish between "cool, page is mapped" and "darn, need to try again", and thus cannot handle benign cases like the mmu_notifier retry. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923220425.18402-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:36 -04:00
Sean Christopherson	83a2ba4cb2	KVM: x86/mmu: Invert RET_PF_* check when falling through to emulation Explicitly check for RET_PF_EMULATE instead of implicitly doing the same by checking for !RET_PF_RETRY (RET_PF_INVALID is handled earlier). This will adding new RET_PF_ types in future patches without breaking the emulation path. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923220425.18402-3-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:36 -04:00
Sean Christopherson	7b367bc9a6	KVM: x86/mmu: Return -EIO if page fault returns RET_PF_INVALID Exit to userspace with an error if the MMU is buggy and returns RET_PF_INVALID when servicing a page fault. This will allow a future patch to invert the emulation path, i.e. emulate only on RET_PF_EMULATE instead of emulating on anything but RET_PF_RETRY. This technically means that KVM will exit to userspace instead of emulating on RET_PF_INVALID, but practically speaking it's a nop as the MMU never returns RET_PF_INVALID. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923220425.18402-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:35 -04:00
Ben Gardon	2de4085ccc	KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent Recursively zap all to-be-orphaned children, unsynced or otherwise, when zapping a shadow page for a nested TDP MMU. KVM currently only zaps the unsynced child pages, but not the synced ones. This can create problems over time when running many nested guests because it leaves unlinked pages which will not be freed until the page quota is hit. With the default page quota of 20 shadow pages per 1000 guest pages, this looks like a memory leak and can degrade MMU performance. In a recent benchmark, substantial performance degradation was observed: An L1 guest was booted with 64G memory. 2G nested Windows guests were booted, 10 at a time for 20 iterations. (200 total boots) Windows was used in this benchmark because they touch all of their memory on startup. By the end of the benchmark, the nested guests were taking ~10% longer to boot. With this patch there is no degradation in boot time. Without this patch the benchmark ends with hundreds of thousands of stale EPT02 pages cluttering up rmaps and the page hash map. As a result, VM shutdown is also much slower: deleting memslot 0 was observed to take over a minute. With this patch it takes just a few miliseconds. Cc: Peter Shier <pshier@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923221406.16297-3-sean.j.christopherson@intel.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:35 -04:00
Sean Christopherson	ace569e015	KVM: x86/mmu: Move flush logic from mmu_page_zap_pte() to FNAME(invlpg) Move the logic that controls whether or not FNAME(invlpg) needs to flush fully into FNAME(invlpg) so that mmu_page_zap_pte() doesn't return a value. This allows a future patch to redefine the return semantics for mmu_page_zap_pte() so that it can recursively zap orphaned child shadow pages for nested TDP MMUs. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923221406.16297-2-sean.j.christopherson@intel.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:34 -04:00
Vitaly Kuznetsov	dbcf3f96fa	KVM: x86: hyper-v: disallow configuring SynIC timers with no SynIC Hyper-V Synthetic timers require SynIC but we don't seem to check that upon HV_X64_MSR_STIMER[X]_CONFIG/HV_X64_MSR_STIMER0_COUNT writes. Make the behavior match synic_set_msr(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200924145757.1035782-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:33 -04:00
Sean Christopherson	4d710de964	KVM: x86/mmu: Stash 'kvm' in a local variable in kvm_mmu_free_roots() To make kvm_mmu_free_roots() a bit more readable, capture 'kvm' in a local variable instead of doing vcpu->kvm over and over (and over). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923191204.8410-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-09-28 07:57:32 -04:00

1 2 3 4 5 ...

36383 commits