linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-20 09:31:09 +00:00

Author	SHA1	Message	Date
Paolo Bonzini	bcc4f2bc50	KVM: MMU: mark page dirty in make_spte This simplifies set_spte, which we want to remove, and unifies code between the shadow MMU and the TDP MMU. The warning will be added back later to make_spte as well. There is a small disadvantage in the TDP MMU; it may unnecessarily mark a page as dirty twice if two vCPUs end up mapping the same page twice. However, this is a very small cost for a case that is already rare. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:53 -04:00
David Matlack	68be1306ca	KVM: x86/mmu: Fold rmap_recycle into rmap_add Consolidate rmap_recycle and rmap_add into a single function since they are only ever called together (and only from one place). This has a nice side effect of eliminating an extra kvm_vcpu_gfn_to_memslot(). In addition it makes mmu_set_spte(), which is a very long function, a little shorter. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:52 -04:00
Sean Christopherson	b1a429fb18	KVM: x86/mmu: Verify shadow walk doesn't terminate early in page faults WARN and bail if the shadow walk for faulting in a SPTE terminates early, i.e. doesn't reach the expected level because the walk encountered a terminal SPTE. The shadow walks for page faults are subtle in that they install non-leaf SPTEs (zapping leaf SPTEs if necessary!) in the loop body, and consume the newly created non-leaf SPTE in the loop control, e.g. __shadow_walk_next(). In other words, the walks guarantee that the walk will stop if and only if the target level is reached by installing non-leaf SPTEs to guarantee the walk remains valid. Opportunistically use fault->goal-level instead of it.level in FNAME(fetch) to further clarify that KVM always installs the leaf SPTE at the target level. Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210906122547.263316-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:52 -04:00
Paolo Bonzini	f0066d94c9	KVM: MMU: change tracepoints arguments to kvm_page_fault Pass struct kvm_page_fault to tracepoints instead of extracting the arguments from the struct. This also lets the kvm_mmu_spte_requested tracepoint pick the gfn directly from fault->gfn, instead of using the address. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:52 -04:00
Paolo Bonzini	536f0e6ace	KVM: MMU: change disallowed_hugepage_adjust() arguments to kvm_page_fault Pass struct kvm_page_fault to disallowed_hugepage_adjust() instead of extracting the arguments from the struct. Tweak a bit the conditions to avoid long lines. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:51 -04:00
Paolo Bonzini	73a3c65947	KVM: MMU: change kvm_mmu_hugepage_adjust() arguments to kvm_page_fault Pass struct kvm_page_fault to kvm_mmu_hugepage_adjust() instead of extracting the arguments from the struct; the results are also stored in the struct, so the callers are adjusted consequently. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:51 -04:00
Paolo Bonzini	3c8ad5a675	KVM: MMU: change fast_page_fault() arguments to kvm_page_fault Pass struct kvm_page_fault to fast_page_fault() instead of extracting the arguments from the struct. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:51 -04:00
Paolo Bonzini	cdc47767a0	KVM: MMU: change tdp_mmu_map_handle_target_level() arguments to kvm_page_fault Pass struct kvm_page_fault to tdp_mmu_map_handle_target_level() instead of extracting the arguments from the struct. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:50 -04:00
Paolo Bonzini	2f6305dd56	KVM: MMU: change kvm_tdp_mmu_map() arguments to kvm_page_fault Pass struct kvm_page_fault to kvm_tdp_mmu_map() instead of extracting the arguments from the struct. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:50 -04:00
Paolo Bonzini	9c03b1821a	KVM: MMU: change FNAME(fetch)() arguments to kvm_page_fault Pass struct kvm_page_fault to FNAME(fetch)() instead of extracting the arguments from the struct. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:50 -04:00
Paolo Bonzini	43b74355ef	KVM: MMU: change __direct_map() arguments to kvm_page_fault Pass struct kvm_page_fault to __direct_map() instead of extracting the arguments from the struct. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:50 -04:00
Paolo Bonzini	3a13f4fea3	KVM: MMU: change handle_abnormal_pfn() arguments to kvm_page_fault Pass struct kvm_page_fault to handle_abnormal_pfn() instead of extracting the arguments from the struct. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:49 -04:00
Paolo Bonzini	3647cd04b7	KVM: MMU: change kvm_faultin_pfn() arguments to kvm_page_fault Add fields to struct kvm_page_fault corresponding to outputs of kvm_faultin_pfn(). For now they have to be extracted again from struct kvm_page_fault in the subsequent steps, but this is temporary until other functions in the chain are switched over as well. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:49 -04:00
Paolo Bonzini	b8a5d55115	KVM: MMU: change page_fault_handle_page_track() arguments to kvm_page_fault Add fields to struct kvm_page_fault corresponding to the arguments of page_fault_handle_page_track(). The fields are initialized in the callers, and page_fault_handle_page_track() receives a struct kvm_page_fault instead of having to extract the arguments out of it. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:49 -04:00
Paolo Bonzini	4326e57ef4	KVM: MMU: change direct_page_fault() arguments to kvm_page_fault Add fields to struct kvm_page_fault corresponding to the arguments of direct_page_fault(). The fields are initialized in the callers, and direct_page_fault() receives a struct kvm_page_fault instead of having to extract the arguments out of it. Also adjust FNAME(page_fault) to store the max_level in struct kvm_page_fault, to keep it similar to the direct map path. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:49 -04:00
Paolo Bonzini	c501040abc	KVM: MMU: change mmu->page_fault() arguments to kvm_page_fault Pass struct kvm_page_fault to mmu->page_fault() instead of extracting the arguments from the struct. FNAME(page_fault) can use the precomputed bools from the error code. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:48 -04:00
Paolo Bonzini	6defd9bb17	KVM: MMU: Introduce struct kvm_page_fault Create a single structure for arguments that are passed from kvm_mmu_do_page_fault to the page fault handlers. Later the structure will grow to include various output parameters that are passed back to the next steps in the page fault handling. Suggested-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:48 -04:00
Paolo Bonzini	d055f028a5	KVM: MMU: pass unadulterated gpa to direct_page_fault Do not bother removing the low bits of the gpa. This masking dates back to the very first commit of KVM but it is unnecessary, as exemplified by the other call in kvm_tdp_page_fault. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:48 -04:00
Oliver Upton	55c0cefbdb	KVM: x86: Fix potential race in KVM_GET_CLOCK Sean noticed that KVM_GET_CLOCK was checking kvm_arch.use_master_clock outside of the pvclock sync lock. This is problematic, as the clock value written to the user may or may not actually correspond to a stable TSC. Fix the race by populating the entire kvm_clock_data structure behind the pvclock_gtod_sync_lock. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-4-oupton@google.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:47 -04:00
Paolo Bonzini	45e6c2fac0	KVM: x86: extract KVM_GET_CLOCK/KVM_SET_CLOCK to separate functions No functional change intended. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:47 -04:00
Paolo Bonzini	6b6fcd2804	kvm: x86: abstract locking around pvclock_update_vm_gtod_copy Updates to the kvmclock parameters needs to do a complicated dance of KVM_REQ_MCLOCK_INPROGRESS and KVM_REQ_CLOCK_UPDATE in addition to taking pvclock_gtod_sync_lock. Place that in two functions that can be called on all of master clock update, KVM_SET_CLOCK, and Hyper-V reenlightenment. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:47 -04:00
Lai Jiangshan	3e44dce4d0	KVM: X86: Move PTE present check from loop body to __shadow_walk_next() So far, the loop bodies already ensure the PTE is present before calling __shadow_walk_next(): Some loop bodies simply exit with a !PRESENT directly and some other loop bodies, i.e. FNAME(fetch) and __direct_map() do not currently guard their walks with is_shadow_present_pte, but only because they install present non-leaf SPTEs in the loop itself. But checking pte present in __shadow_walk_next() (which is called from shadow_walk_okay()) is more prudent; walking past a !PRESENT SPTE would lead to attempting to read a the next level SPTE from a garbage iter->shadow_addr. It also allows to remove the is_shadow_present_pte() checks from the loop bodies. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210906122547.263316-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:46 -04:00
Maxim Levitsky	5228eb96a4	KVM: x86: nSVM: implement nested TSC scaling This was tested by booting a nested guest with TSC=1Ghz, observing the clocks, and doing about 100 cycles of migration. Note that qemu patch is needed to support migration because of a new MSR that needs to be placed in the migration state. The patch will be sent to the qemu mailing list soon. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210914154825.104886-14-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:46 -04:00
Maxim Levitsky	f800650a4e	KVM: x86: SVM: add module param to control TSC scaling This allows to easily simulate a CPU without this feature. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210914154825.104886-13-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:44:46 -04:00
Paolo Bonzini	36e8194dcd	KVM: x86: SVM: don't set VMLOAD/VMSAVE intercepts on vCPU reset Commit `adc2a23734` ("KVM: nSVM: improve SYSENTER emulation on AMD"), made init_vmcb set vmload/vmsave intercepts unconditionally, and relied on svm_vcpu_after_set_cpuid to clear them when possible. However init_vmcb is also called when the vCPU is reset, and it is not followed by another call to svm_vcpu_after_set_cpuid because the CPUID is already set. This mistake makes the VMSAVE/VMLOAD intercept to be set when it is not needed, and harms performance of the nested guest. Extract the relevant parts of svm_vcpu_after_set_cpuid so that they can be called again on reset. Fixes: `adc2a23734` ("KVM: nSVM: improve SYSENTER emulation on AMD") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-01 03:37:34 -04:00
Maxim Levitsky	4c84926e22	KVM: x86: SVM: add module param to control LBR virtualization This is useful for debug and also makes it consistent with the rest of the SVM optional features. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210914154825.104886-9-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:11 -04:00
Maxim Levitsky	0226a45c46	KVM: x86: nSVM: don't copy pause related settings According to the SDM, the CPU never modifies these settings. It loads them on VM entry and updates an internal copy instead. Also don't load them from the vmcb12 as we don't expose these features to the nested guest yet. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210914154825.104886-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:11 -04:00
Longpeng(Mike)	515a0c79e7	kvm: irqfd: avoid update unmodified entries of the routing All of the irqfds would to be updated when update the irq routing, it's too expensive if there're too many irqfds. However we can reduce the cost by avoid some unnecessary updates. For irqs of MSI type on X86, the update can be saved if the msi values are not change. The vfio migration could receives benefit from this optimi- zaiton. The test VM has 128 vcpus and 8 VF (with 65 vectors enabled), so the VM has more than 520 irqfds. We mesure the cost of the vfio_msix_enable (in QEMU, it would set routing for each irqfd) for each VF, and we can see the total cost can be significantly reduced. Origin Apply this Patch 1st 8 4 2nd 15 5 3rd 22 6 4th 24 6 5th 36 7 6th 44 7 7th 51 8 8th 58 8 Total 258ms 51ms We're also tring to optimize the QEMU part [1], but it's still worth to optimize the KVM to gain more benefits. [1] https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04215.html Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Message-Id: <20210827080003.2689-1-longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:10 -04:00
Lai Jiangshan	8b8f9d753b	KVM: X86: Don't check unsync if the original spte is writible If the original spte is writable, the target gfn should not be the gfn of synchronized shadowpage and can continue to be writable. When !can_unsync, speculative must be false. So when the check of "!can_unsync" is removed, we need to move the label of "out" up. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-11-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:10 -04:00
Lai Jiangshan	f1c4a88c41	KVM: X86: Don't unsync pagetables when speculative We'd better only unsync the pagetable when there just was a really write fault on a level-1 pagetable. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-10-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:10 -04:00
Lai Jiangshan	cc2a8e66bb	KVM: X86: Remove FNAME(update_pte) Its solo caller is changed to use FNAME(prefetch_gpte) directly. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-9-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:10 -04:00
Lai Jiangshan	5591c0694d	KVM: X86: Zap the invalid list after remote tlb flushing In mmu_sync_children(), it can zap the invalid list after remote tlb flushing. Emptifying the invalid list ASAP might help reduce a remote tlb flushing in some cases. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-8-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:09 -04:00
Lai Jiangshan	c3e5e415bc	KVM: X86: Change kvm_sync_page() to return true when remote flush is needed Currently kvm_sync_page() returns true when there is any present spte. But the return value is ignored in the callers. Changing kvm_sync_page() to return true when remote flush is needed and changing mmu->sync_page() not to directly flush can combine and reduce remote flush requests. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-7-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:09 -04:00
Lai Jiangshan	06152b2dec	KVM: X86: Remove kvm_mmu_flush_or_zap() Because local_flush is useless, kvm_mmu_flush_or_zap() can be removed and kvm_mmu_remote_flush_or_zap is used instead. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-6-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:09 -04:00
Lai Jiangshan	bd047e5440	KVM: X86: Don't flush current tlb on shadow page modification After any shadow page modification, flushing tlb only on current VCPU is weird due to other VCPU's tlb might still be stale. In other words, if there is any mandatory tlb-flushing after shadow page modification, SET_SPTE_NEED_REMOTE_TLB_FLUSH or remote_flush should be set and the tlbs of all VCPUs should be flushed. There is not point to only flush current tlb except when the request is from vCPU's or pCPU's activities. If there was any bug that mandatory tlb-flushing is required and SET_SPTE_NEED_REMOTE_TLB_FLUSH/remote_flush is failed to set, this patch would expose the bug in a more destructive way. The related code paths are checked and no missing SET_SPTE_NEED_REMOTE_TLB_FLUSH is found yet. Currently, there is no optional tlb-flushing after sync page related code is changed to flush tlb timely. So we can just remove these local flushing code. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:09 -04:00
Sean Christopherson	c6cecc4b93	KVM: x86/mmu: Complete prefetch for trailing SPTEs for direct, legacy MMU Make a final call to direct_pte_prefetch_many() if there are "trailing" SPTEs to prefetch, i.e. SPTEs for GFNs following the faulting GFN. The call to direct_pte_prefetch_many() in the loop only handles the case where there are !PRESENT SPTEs preceding a PRESENT SPTE. E.g. if the faulting GFN is a multiple of 8 (the prefetch size) and all SPTEs for the following GFNs are !PRESENT, the loop will terminate with "start = sptep+1" and not prefetch any SPTEs. Prefetching trailing SPTEs as intended can drastically reduce the number of guest page faults, e.g. accessing the first byte of every 4kb page in a 6gb chunk of virtual memory, in a VM with 8gb of preallocated memory, the number of pf_fixed events observed in L0 drops from ~1.75M to <0.27M. Note, this only affects memory that is backed by 4kb pages as KVM doesn't prefetch when installing hugepages. Shadow paging prefetching is not affected as it does not batch the prefetches due to the need to process the corresponding guest PTE. The TDP MMU is not affected because it doesn't have prefetching, yet... Fixes: `957ed9effd` ("KVM: MMU: prefetch ptes when intercepted guest #PF") Cc: Sergey Senozhatsky <senozhatsky@google.com> Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210818235615.2047588-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:08 -04:00
Thomas Huth	22d7108ce4	KVM: selftests: Fix kvm_vm_free() in cr4_cpuid_sync and vmx_tsc_adjust tests The kvm_vm_free() statement here is currently dead code, since the loop in front of it can only be left with the "goto done" that jumps right after the kvm_vm_free(). Fix it by swapping the locations of the "done" label and the kvm_vm_free(). Signed-off-by: Thomas Huth <thuth@redhat.com> Message-Id: <20210826074928.240942-1-thuth@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:08 -04:00
Colin Ian King	d22869aff4	kvm: selftests: Fix spelling mistake "missmatch" -> "mismatch" There is a spelling mistake in an error message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Message-Id: <20210826120752.12633-1-colin.king@canonical.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:08 -04:00
Sean Christopherson	25b9784586	KVM: x86: Manually retrieve CPUID.0x1 when getting FMS for RESET/INIT Manually look for a CPUID.0x1 entry instead of bouncing through kvm_cpuid() when retrieving the Family-Model-Stepping information for vCPU RESET/INIT. This fixes a potential undefined behavior bug due to kvm_cpuid() using the uninitialized "dummy" param as the ECX _input_, a.k.a. the index. A more minimal fix would be to simply zero "dummy", but the extra work in kvm_cpuid() is wasteful, and KVM should be treating the FMS retrieval as an out-of-band access, e.g. same as how KVM computes guest.MAXPHYADDR. Both Intel's SDM and AMD's APM describe the RDX value at RESET/INIT as holding the CPU's FMS information, not as holding CPUID.0x1.EAX. KVM's usage of CPUID entries to get FMS is simply a pragmatic approach to avoid having yet another way for userspace to provide inconsistent data. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20210929222426.1855730-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:08 -04:00
Sean Christopherson	62dd57dd67	KVM: x86: WARN on non-zero CRs at RESET to detect improper initalization WARN if CR0, CR3, or CR4 are non-zero at RESET, which given the current KVM implementation, really means WARN if they're not zeroed at vCPU creation. VMX in particular has several ->set_*() flows that read other registers to handle side effects, and because those flows are common to RESET and INIT, KVM subtly relies on emulated/virtualized registers to be zeroed at vCPU creation in order to do the right thing at RESET. Use CRs as a sentinel because they are most likely to be written as side effects, and because KVM specifically needs CR0.PG and CR0.PE to be '0' to correctly reflect the state of the vCPU's MMU. CRs are also loaded and stored from/to the VMCS, and so adds some level of coverage to verify that KVM doesn't conflate zero-allocating the VMCS with properly initializing the VMCS with VMWRITEs. Note, '0' is somewhat arbitrary, vCPU creation can technically stuff any value for a register so long as it's coherent with respect to the current vCPU state. In practice, '0' works for all registers and is convenient. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:07 -04:00
Sean Christopherson	9ebe530b9f	KVM: SVM: Move RESET emulation to svm_vcpu_reset() Move RESET emulation for SVM vCPUs to svm_vcpu_reset(), and drop an extra init_vmcb() from svm_create_vcpu() in the process. Hopefully KVM will someday expose a dedicated RESET ioctl(), and in the meantime separating "create" from "RESET" is a nice cleanup. Keep the call to svm_switch_vmcb() so that misuse of svm->vmcb at worst breaks the guest, e.g. premature accesses doesn't cause a NULL pointer dereference. Cc: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:07 -04:00
Sean Christopherson	06692e4b80	KVM: VMX: Move RESET emulation to vmx_vcpu_reset() Move vCPU RESET emulation, including initializating of select VMCS state, to vmx_vcpu_reset(). Drop the open coded "vCPU load" sequence, as ->vcpu_reset() is invoked while the vCPU is properly loaded (which is kind of the point of ->vcpu_reset()...). Hopefully KVM will someday expose a dedicated RESET ioctl(), and in the meantime separating "create" from "RESET" is a nice cleanup. Deferring VMCS initialization is effectively a nop as it's impossible to safely access the VMCS between the current call site and its new home, as both the vCPU and the pCPU are put immediately after init_vmcs(), i.e. the VMCS isn't guaranteed to be loaded. Note, task preemption is not a problem as vmx_sched_in() _can't_ touch the VMCS as ->sched_in() is invoked before the vCPU, and thus VMCS, is reloaded. I.e. the preemption path also can't consume VMCS state. Cc: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:07 -04:00
Sean Christopherson	d06567353e	KVM: VMX: Drop explicit zeroing of MSR guest values at vCPU creation Don't zero out user return and nested MSRs during vCPU creation, and instead rely on vcpu_vmx being zero-allocated. Explicitly zeroing MSRs is not wrong, and is in fact necessary if KVM ever emulates vCPU RESET outside of vCPU creation, but zeroing only a subset of MSRs is confusing. Poking directly into KVM's backing is also undesirable in that it doesn't scale and is error prone. Ideally KVM would have a common RESET path for all MSRs, e.g. by expanding kvm_set_msr(), which would obviate the need for this out-of-bad code (to support standalone RESET). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:07 -04:00
Sean Christopherson	583d369b36	KVM: x86: Fold fx_init() into kvm_arch_vcpu_create() Move the few bits of relevant fx_init() code into kvm_arch_vcpu_create(), dropping the superfluous check on vcpu->arch.guest_fpu that was blindly and wrongly added by commit `ed02b21309` ("KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest"). Note, KVM currently allocates and then frees FPU state for SEV-ES guests, rather than avoid the allocation in the first place. While that approach is inarguably inefficient and unnecessary, it's a cleanup for the future. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:06 -04:00
Sean Christopherson	e8f65b9bb4	KVM: x86: Remove defunct setting of XCR0 for guest during vCPU create Drop code to initialize XCR0 during fx_init(), a.k.a. vCPU creation, as XCR0 has been initialized during kvm_vcpu_reset() (for RESET) since commit `a554d207dc` ("KVM: X86: Processor States following Reset or INIT"). Back when XCR0 support was added by commit `2acf923e38` ("KVM: VMX: Enable XSAVE/XRSTOR for guest"), KVM didn't differentiate between RESET and INIT. Ignoring the fact that calling fx_init() for INIT is obviously wrong, e.g. FPU state after INIT is not the same as after RESET, setting XCR0 in fx_init() was correct. Eventually fx_init() got moved to kvm_arch_vcpu_init(), a.k.a. vCPU creation (ignore the terrible name) by commit `0ee6a51725` ("x86/fpu, kvm: Simplify fx_init()"). Finally, commit `95a0d01eef` ("KVM: x86: Move all vcpu init code into kvm_arch_vcpu_create()") killed off kvm_arch_vcpu_init(), leaving behind the oddity of redundant setting of guest state during vCPU creation. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:06 -04:00
Sean Christopherson	5ebbc470d7	KVM: x86: Remove defunct setting of CR0.ET for guests during vCPU create Drop code to set CR0.ET for the guest during initialization of the guest FPU. The code was added as a misguided bug fix by commit `380102c8e4` ("KVM Set the ET flag in CR0 after initializing FX") to resolve an issue where vcpu->cr0 (now vcpu->arch.cr0) was not correctly initialized on SVM systems. While init_vmcb() did set CR0.ET, it only did so in the VMCB, and subtly did not update vcpu->cr0. Stuffing CR0.ET worked around the immediate problem, but did not fix the real bug of vcpu->cr0 and the VMCB being out of sync. That underlying bug was eventually remedied by commit `18fa000ae4` ("KVM: SVM: Reset cr0 properly on vcpu reset"). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:06 -04:00
Sean Christopherson	ff8828c84f	KVM: x86: Do not mark all registers as avail/dirty during RESET/INIT Do not blindly mark all registers as available+dirty at RESET/INIT, and instead rely on writes to registers to go through the proper mutators or to explicitly mark registers as dirty. INIT in particular does not blindly overwrite all registers, e.g. select bits in CR0 are preserved across INIT, thus marking registers available+dirty without first reading the register from hardware is incorrect. In practice this is a benign bug as KVM doesn't let the guest control CR0 bits that are preserved across INIT, and all other true registers are explicitly written during the RESET/INIT flows. The PDPTRs and EX_INFO "registers" are not explicitly written, but accessing those values during RESET/INIT is nonsensical and would be a KVM bug regardless of register caching. Fixes: `66f7b72e11` ("KVM: x86: Make register state after reset conform to specification") [sean: !!! NOT FOR STABLE !!!] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:05 -04:00
Sean Christopherson	94c641ba7a	KVM: x86: Simplify retrieving the page offset when loading PDTPRs Replace impressively complex "logic" for computing the page offset from CR3 when loading PDPTRs. Unlike other paging modes, the address held in CR3 for PAE paging is 32-byte aligned, i.e. occupies bits 31:5, thus bits 11:5 need to be used as the offset from the gfn when reading PDPTRs. The existing calculation originated in commit `1342d3536d` ("[PATCH] KVM: MMU: Load the pae pdptrs on cr3 change like the processor does"), which read the PDPTRs from guest memory as individual 8-byte loads. At the time, the so called "offset" was the base index of PDPTR0 as a _u64_, not a byte offset. Naming aside, the computation was useful and arguably simplified the overall flow. Unfortunately, when commit `195aefde9c` ("KVM: Add general accessors to read and write guest memory") added accessors with offsets at byte granularity, the cleverness of the original code was lost and KVM was left with convoluted code for a simple operation. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210831164224.1119728-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:05 -04:00
Sean Christopherson	15cabbc259	KVM: x86: Subsume nested GPA read helper into load_pdptrs() Open code the call to mmu->translate_gpa() when loading nested PDPTRs and kill off the existing helper, kvm_read_guest_page_mmu(), to discourage incorrect use. Reading guest memory straight from an L2 GPA is extremely rare (as evidenced by the lack of users), as very few constructs in x86 specify physical addresses, even fewer are virtualized by KVM, and even fewer yet require emulation of L2 by L0 KVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210831164224.1119728-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:05 -04:00
Juergen Gross	a1c42ddedf	kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS KVM_MAX_VCPU_ID is not specifying the highest allowed vcpu-id, but the number of allowed vcpu-ids. This has already led to confusion, so rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics more clear Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210913135745.13944-3-jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-09-30 04:27:05 -04:00

1 2 3 4 5 ...

1043108 commits