linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-20 09:31:09 +00:00

Author	SHA1	Message	Date
David Edmondson	a9d496d8e0	KVM: x86: Clarify the kvm_run.emulation_failure structure layout Until more flags for kvm_run.emulation_failure flags are defined, it is undetermined whether new payload elements corresponding to those flags will be additive or alternative. As a hint to userspace that an alternative is possible, wrap the current payload elements in a union. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Edmondson <david.edmondson@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210920103737.2696756-2-david.edmondson@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-25 06:48:24 -04:00
Jim Mattson	ed290e1c20	KVM: selftests: Fix nested SVM tests when built with clang Though gcc conveniently compiles a simple memset to "rep stos," clang prefers to call the libc version of memset. If a test is dynamically linked, the libc memset isn't available in L1 (nor is the PLT or the GOT, for that matter). Even if the test is statically linked, the libc memset may choose to use some CPU features, like AVX, which may not be enabled in L1. Note that __builtin_memset doesn't solve the problem, because (a) the compiler is free to call memset anyway, and (b) __builtin_memset may also choose to use features like AVX, which may not be available in L1. To avoid a myriad of problems, use an explicit "rep stos" to clear the VMCB in generic_svm_setup(), which is called both from L0 and L1. Reported-by: Ricardo Koller <ricarkol@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Fixes: `20ba262f86` ("selftests: KVM: AMD Nested test infrastructure") Message-Id: <20210930003649.4026553-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 12:46:37 -04:00
Jim Mattson	dfd3c713a9	kvm: x86: Remove stale declaration of kvm_no_apic_vcpu This variable was renamed to kvm_has_noapic_vcpu in commit `6e4e3b4df4` ("KVM: Stop using deprecated jump label APIs"). Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20211021185449.3471763-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 12:46:37 -04:00
Sean Christopherson	ec5a4919fa	KVM: VMX: Unregister posted interrupt wakeup handler on hardware unsetup Unregister KVM's posted interrupt wakeup handler during unsetup so that a spurious interrupt that arrives after kvm_intel.ko is unloaded doesn't call into freed memory. Fixes: `bf9f6ac8d7` ("KVM: Update Posted-Interrupts Descriptor when vCPU is blocked") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009001107.3936588-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 12:46:37 -04:00
Sean Christopherson	6ff53f6a43	x86/irq: Ensure PI wakeup handler is unregistered before module unload Add a synchronize_rcu() after clearing the posted interrupt wakeup handler to ensure all readers, i.e. in-flight IRQ handlers, see the new handler before returning to the caller. If the caller is an exiting module and is unregistering its handler, failure to wait could result in the IRQ handler jumping into an unloaded module. The registration path doesn't require synchronization, as it's the caller's responsibility to not generate interrupts it cares about until after its handler is registered. Fixes: `f6b3c72c23` ("x86/irq: Define a global vector for VT-d Posted-Interrupts") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211009001107.3936588-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 12:45:35 -04:00
Sean Christopherson	187c8833de	KVM: x86: Use rw_semaphore for APICv lock to allow vCPU parallelism Use a rw_semaphore instead of a mutex to coordinate APICv updates so that vCPUs responding to requests can take the lock for read and run in parallel. Using a mutex forces serialization of vCPUs even though kvm_vcpu_update_apicv() only touches data local to that vCPU or is protected by a different lock, e.g. SVM's ir_list_lock. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022004927.1448382-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 11:20:16 -04:00
Sean Christopherson	ee49a89329	KVM: x86: Move SVM's APICv sanity check to common x86 Move SVM's assertion that vCPU's APICv state is consistent with its VM's state out of svm_vcpu_run() and into x86's common inner run loop. The assertion and underlying logic is not unique to SVM, it's just that SVM has more inhibiting conditions and thus is more likely to run headfirst into any KVM bugs. Add relevant comments to document exactly why the update path has unusual ordering between the update the kick, why said ordering is safe, and also the basic rules behind the assertion in the run loop. Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022004927.1448382-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 11:20:16 -04:00
Lukas Bulwahn	9b4eb77099	riscv: do not select non-existing config ANON_INODES Commit `99cdc6c18c` ("RISC-V: Add initial skeletal KVM support") selects the config ANON_INODES in config KVM, but the config ANON_INODES is removed since commit `5dd50aaeb1` ("Make anon_inodes unconditional") in 2018. Hence, ./scripts/checkkconfigsymbols.py warns on non-existing symbols: ANON_INODES Referencing files: arch/riscv/kvm/Kconfig Remove selecting the non-existing config ANON_INODES. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Message-Id: <20211022061514.25946-1-lukas.bulwahn@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:53:37 -04:00
Sean Christopherson	21fa324654	KVM: x86/mmu: Extract zapping of rmaps for gfn range to separate helper Extract the zapping of rmaps, a.k.a. legacy MMU, for a gfn range to a separate helper to clean up the unholy mess that kvm_zap_gfn_range() has become. In addition to deep nesting, the rmaps zapping spreads out the declaration of several variables and is generally a mess. Clean up the mess now so that future work to improve the memslots implementation doesn't need to deal with it. Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022010005.1454978-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:51:52 -04:00
Sean Christopherson	e8be2a5ba8	KVM: x86/mmu: Drop a redundant remote TLB flush in kvm_zap_gfn_range() Remove an unnecessary remote TLB flush in kvm_zap_gfn_range() now that said function holds mmu_lock for write for its entire duration. The flush was added by the now-reverted commit to allow TDP MMU to flush while holding mmu_lock for read, as the transition from write=>read required dropping the lock and thus a pending flush needed to be serviced. Fixes: `5a324c24b6` ("Revert "KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock"") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022010005.1454978-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:51:46 -04:00
Sean Christopherson	bc3b3c1002	KVM: x86/mmu: Drop a redundant, broken remote TLB flush A recent commit to fix the calls to kvm_flush_remote_tlbs_with_address() in kvm_zap_gfn_range() inadvertantly added yet another flush instead of fixing the existing flush. Drop the redundant flush, and fix the params for the existing flush. Cc: stable@vger.kernel.org Fixes: `2822da4466` ("KVM: x86/mmu: fix parameters to kvm_flush_remote_tlbs_with_address") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211022010005.1454978-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:51:30 -04:00
Lai Jiangshan	61b05a9fd4	KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest() kvm_mmu_unload() destroys all the PGD caches. Use the lighter kvm_mmu_sync_roots() and kvm_mmu_sync_prev_roots() instead. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:44:43 -04:00
Lai Jiangshan	264d3dc1d3	KVM: X86: pair smp_wmb() of mmu_try_to_unsync_pages() with smp_rmb() The commit `578e1c4db2` ("kvm: x86: Avoid taking MMU lock in kvm_mmu_sync_roots if no sync is needed") added smp_wmb() in mmu_try_to_unsync_pages(), but the corresponding smp_load_acquire() isn't used on the load of SPTE.W. smp_load_acquire() orders _subsequent_ loads after sp->is_unsync; it does not order _earlier_ loads before the load of sp->is_unsync. This has no functional change; smp_rmb() is a NOP on x86, and no compiler barrier is required because there is a VMEXIT between the load of SPTE.W and kvm_mmu_snc_roots. Cc: Junaid Shahid <junaids@google.com> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:43:21 -04:00
Lai Jiangshan	509bfe3d97	KVM: X86: Cache CR3 in prev_roots when PCID is disabled The commit `21823fbda5` ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush") invalidates all PGDs for the specific PCID and in the case of PCID is disabled, it includes all PGDs in the prev_roots and the commit made prev_roots totally unused in this case. Not using prev_roots fixes a problem when CR4.PCIDE is changed 0 -> 1 before the said commit: (CR4.PCIDE=0, CR4.PGE=1; CR3=cr3_a; the page for the guest RIP is global; cr3_b is cached in prev_roots) modify page tables under cr3_b the shadow root of cr3_b is unsync in kvm INVPCID single context the guest expects the TLB is clean for PCID=0 change CR4.PCIDE 0 -> 1 switch to cr3_b with PCID=0,NOFLUSH=1 No sync in kvm, cr3_b is still unsync in kvm jump to the page that was modified in step 1 shadow page tables point to the wrong page It is a very unlikely case, but it shows that stale prev_roots can be a problem after CR4.PCIDE changes from 0 to 1. However, to fix this case, the commit disabled caching CR3 in prev_roots altogether when PCID is disabled. Not all CPUs have PCID; especially the PCID support for AMD CPUs is kind of recent. To restore the prev_roots optimization for CR4.PCIDE=0, flush the whole MMU (including all prev_roots) when CR4.PCIDE changes. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:30 -04:00
Lai Jiangshan	e45e9e3998	KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid() The KVM doesn't know whether any TLB for a specific pcid is cached in the CPU when tdp is enabled. So it is better to flush all the guest TLB when invalidating any single PCID context. The case is very rare or even impossible since KVM generally doesn't intercept CR3 write or INVPCID instructions when tdp is enabled, so the fix is mostly for the sake of overall robustness. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211019110154.4091-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:30 -04:00
Lai Jiangshan	a91a7c7096	KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE X86_CR4_PGE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset. It is only required to flush all the guest tlb. It is also inconsistent that X86_CR4_PGE is in KVM_MMU_CR4_ROLE_BITS while kvm_mmu_role doesn't use X86_CR4_PGE. So X86_CR4_PGE is also removed from KVM_MMU_CR4_ROLE_BITS. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:29 -04:00
Lai Jiangshan	552617382c	KVM: X86: Don't reset mmu context when X86_CR4_PCIDE 1->0 X86_CR4_PCIDE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset. It is only required to flush all the guest tlb. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:29 -04:00
Michael Roth	413eaa4ecd	KVM: selftests: set CPUID before setting sregs in vcpu creation Recent kernels have checks to ensure the GPA values in special-purpose registers like CR3 are within the maximum physical address range and don't overlap with anything in the upper/reserved range. In the case of SEV kselftest guests booting directly into 64-bit mode, CR3 needs to be initialized to the GPA of the page table root, with the encryption bit set. The kernel accounts for this encryption bit by removing it from reserved bit range when the guest advertises the bit position via KVM_SET_CPUID, but kselftests currently call KVM_SET_SREGS as part of vm_vcpu_add_default(), before KVM_SET_CPUID. As a result, KVM_SET_SREGS will return an error in these cases. Address this by moving vcpu_set_cpuid() (which calls KVM_SET_CPUID*) ahead of vcpu_setup() (which calls KVM_SET_SREGS). While there, address a typo in the assertion that triggers when KVM_SET_SREGS fails. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20211006203617.13045-1-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Nathan Tempelman <natet@google.com>	2021-10-22 05:19:29 -04:00
Wanpeng Li	9ae7f6c9b5	KVM: emulate: Comment on difference between RDPMC implementation and manual SDM mentioned that, RDPMC: IF (((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0)) and (ECX indicates a supported counter)) THEN EAX := counter[31:0]; EDX := ZeroExtend(counter[MSCB:32]); ELSE (* ECX is not valid or CR4.PCE is 0 and CPL is 1, 2, or 3 and CR0.PE is 1 *) #GP(0); FI; Let's add a comment why CR0.PE isn't tested since it's impossible for CPL to be >0 if CR0.PE=0. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1634724836-73721-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:29 -04:00
Sean Christopherson	9dadfc4a61	KVM: x86: Add vendor name to kvm_x86_ops, use it for error messages Paul pointed out the error messages when KVM fails to load are unhelpful in understanding exactly what went wrong if userspace probes the "wrong" module. Add a mandatory kvm_x86_ops field to track vendor module names, kvm_intel and kvm_amd, and use the name for relevant error message when KVM fails to load so that the user knows which module failed to load. Opportunistically tweak the "disabled by bios" error message to clarify that _support_ was disabled, not that the module itself was magically disabled by BIOS. Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211018183929.897461-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:28 -04:00
Junaid Shahid	4dfe4f40d8	kvm: x86: mmu: Make NX huge page recovery period configurable Currently, the NX huge page recovery thread wakes up every minute and zaps 1/nx_huge_pages_recovery_ratio of the total number of split NX huge pages at a time. This is intended to ensure that only a relatively small number of pages get zapped at a time. But for very large VMs (or more specifically, VMs with a large number of executable pages), a period of 1 minute could still result in this number being too high (unless the ratio is changed significantly, but that can result in split pages lingering on for too long). This change makes the period configurable instead of fixing it at 1 minute. Users of large VMs can then adjust the period and/or the ratio to reduce the number of pages zapped at one time while still maintaining the same overall duration for cycling through the entire list. By default, KVM derives a period from the ratio such that a page will remain on the list for 1 hour on average. Signed-off-by: Junaid Shahid <junaids@google.com> Message-Id: <20211020010627.305925-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:28 -04:00
Wanpeng Li	540c7abe61	KVM: vPMU: Fill get_msr MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0 SDM section 18.2.3 mentioned that: "IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear overflow indicator(s) of any general-purpose or fixed-function counters via a single WRMSR." It is R/W mentioned by SDM, we read this msr on bare-metal during perf testing, the value is always 0 for ICX/SKX boxes on hands. Let's fill get_msr MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0 as hardware behavior and drop global_ovf_ctrl variable. Tested-by: Like Xu <likexu@tencent.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1634631160-67276-2-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:28 -04:00
David Matlack	610265ea3d	KVM: x86/mmu: Rename slot_handle_leaf to slot_handle_level_4k slot_handle_leaf is a misnomer because it only operates on 4K SPTEs whereas "leaf" is used to describe any valid terminal SPTE (4K or large page). Rename slot_handle_leaf to slot_handle_level_4k to avoid confusion. Making this change makes it more obvious there is a benign discrepency between the legacy MMU and the TDP MMU when it comes to dirty logging. The legacy MMU only iterates through 4K SPTEs when zapping for collapsing and when clearing D-bits. The TDP MMU, on the other hand, iterates through SPTEs on all levels. The TDP MMU behavior of zapping SPTEs at all levels is technically overkill for its current dirty logging implementation, which always demotes to 4k SPTES, but both the TDP MMU and legacy MMU zap if and only if the SPTE can be replaced by a larger page, i.e. will not spuriously zap 2m (or larger) SPTEs. Opportunistically add comments to explain this discrepency in the code. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20211019162223.3935109-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:27 -04:00
Xiaoyao Li	e099f3eb0e	KVM: VMX: RTIT_CTL_BRANCH_EN has no dependency on other CPUID bit Per Intel SDM, RTIT_CTL_BRANCH_EN bit has no dependency on any CPUID leaf 0x14. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-5-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:27 -04:00
Xiaoyao Li	f4d3a902a5	KVM: VMX: Rename pt_desc.addr_range to pt_desc.num_address_ranges To better self explain the meaning of this field and match the PT_CAP_num_address_ranges constatn. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-4-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:27 -04:00
Xiaoyao Li	ba51d62723	KVM: VMX: Use precomputed vmx->pt_desc.addr_range The number of valid PT ADDR MSRs for the guest is precomputed in vmx->pt_desc.addr_range. Use it instead of calculating again. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-3-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:26 -04:00
Xiaoyao Li	2e6e0d683b	KVM: VMX: Restore host's MSR_IA32_RTIT_CTL when it's not zero A minor optimization to WRMSR MSR_IA32_RTIT_CTL when necessary. Opportunistically refine the comment to call out that KVM requires VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-2-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:26 -04:00
Paolo Bonzini	2839180ce5	KVM: x86/mmu: clean up prefetch/prefault/speculative naming "prefetch", "prefault" and "speculative" are used throughout KVM to mean the same thing. Use a single name, standardizing on "prefetch" which is already used by various functions such as direct_pte_prefetch, FNAME(prefetch_gpte), FNAME(pte_prefetch), etc. Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:26 -04:00
David Stevens	1e76a3ce0d	KVM: cleanup allocation of rmaps and page tracking data Unify the flags for rmaps and page tracking data, using a single flag in struct kvm_arch and a single loop to go over all the address spaces and memslots. This avoids code duplication between alloc_all_memslots_rmaps and kvm_page_track_enable_mmu_write_tracking. Signed-off-by: David Stevens <stevensd@chromium.org> [This patch is the delta between David's v2 and v3, with conflicts fixed and my own commit message. - Paolo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-22 05:19:25 -04:00
Oliver Upton	3f9808cac0	selftests: KVM: Introduce system counter offset test Introduce a KVM selftest to verify that userspace manipulation of the TSC (via the new vCPU attribute) results in the correct behavior within the guest. Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181555.973085-6-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:46 -04:00
Oliver Upton	c895513453	selftests: KVM: Add helpers for vCPU device attributes vCPU file descriptors are abstracted away from test code in KVM selftests, meaning that tests cannot directly access a vCPU's device attributes. Add helpers that tests can use to get at vCPU device attributes. Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181555.973085-5-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:46 -04:00
Oliver Upton	c1901feef5	selftests: KVM: Fix kvm device helper ioctl assertions The KVM_CREATE_DEVICE and KVM_{GET,SET}_DEVICE_ATTR ioctls are defined to return a value of zero on success. As such, tighten the assertions in the helper functions to only pass if the return code is zero. Suggested-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181555.973085-4-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:46 -04:00
Oliver Upton	61fb1c5485	selftests: KVM: Add test for KVM_{GET,SET}_CLOCK Add a selftest for the new KVM clock UAPI that was introduced. Ensure that the KVM clock is consistent between userspace and the guest, and that the difference in realtime will only ever cause the KVM clock to advance forward. Cc: Andrew Jones <drjones@redhat.com> Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181555.973085-3-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:45 -04:00
Oliver Upton	5000653934	tools: arch: x86: pull in pvclock headers Copy over approximately clean versions of the pvclock headers into tools. Reconcile headers/symbols missing in tools that are unneeded. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181555.973085-2-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:45 -04:00
Oliver Upton	828ca89628	KVM: x86: Expose TSC offset controls to userspace To date, VMM-directed TSC synchronization and migration has been a bit messy. KVM has some baked-in heuristics around TSC writes to infer if the VMM is attempting to synchronize. This is problematic, as it depends on host userspace writing to the guest's TSC within 1 second of the last write. A much cleaner approach to configuring the guest's views of the TSC is to simply migrate the TSC offset for every vCPU. Offsets are idempotent, and thus not subject to change depending on when the VMM actually reads/writes values from/to KVM. The VMM can then read the TSC once with KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when the guest is paused. Cc: David Matlack <dmatlack@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210916181538.968978-8-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:45 -04:00
Oliver Upton	58d4277be9	KVM: x86: Refactor tsc synchronization code Refactor kvm_synchronize_tsc to make a new function that allows callers to specify TSC parameters (offset, value, nanoseconds, etc.) explicitly for the sake of participating in TSC synchronization. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-7-oupton@google.com> [Make sure kvm->arch.cur_tsc_generation and vcpu->arch.this_tsc_generation are equal at the end of __kvm_synchronize_tsc, if matched is false. Reported by Maxim Levitsky. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:45 -04:00
Paolo Bonzini	869b44211a	kvm: x86: protect masterclock with a seqcount Protect the reference point for kvmclock with a seqcount, so that kvmclock updates for all vCPUs can proceed in parallel. Xen runstate updates will also run in parallel and not bounce the kvmclock cacheline. Of the variables that were protected by pvclock_gtod_sync_lock, nr_vcpus_matched_tsc is different because it is updated outside pvclock_update_vm_gtod_copy and read inside it. Therefore, we need to keep it protected by a spinlock. In fact it must now be a raw spinlock, because pvclock_update_vm_gtod_copy, being the write-side of a seqcount, is non-preemptible. Since we already have tsc_write_lock which is a raw spinlock, we can just use tsc_write_lock as the lock that protects the write-side of the seqcount. Co-developed-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-6-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:44 -04:00
Oliver Upton	c68dc1b577	KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK Handling the migration of TSCs correctly is difficult, in part because Linux does not provide userspace with the ability to retrieve a (TSC, realtime) clock pair for a single instant in time. In lieu of a more convenient facility, KVM can report similar information in the kvm_clock structure. Provide userspace with a host TSC & realtime pair iff the realtime clock is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid realtime value, advance the KVM clock by the amount of elapsed time. Do not step the KVM clock backwards, though, as it is a monotonic oscillator. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210916181538.968978-5-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:44 -04:00
Paolo Bonzini	3d5e7a28b1	KVM: x86: avoid warning with -Wbitwise-instead-of-logical This is a new warning in clang top-of-tree (will be clang 14): In file included from arch/x86/kvm/mmu/mmu.c:27: arch/x86/kvm/mmu/spte.h:318:9: error: use of bitwise '\|' with boolean operands [-Werror,-Wbitwise-instead-of-logical] return __is_bad_mt_xwr(rsvd_check, spte) \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \|\| arch/x86/kvm/mmu/spte.h:318:9: note: cast one or both operands to int to silence this warning The code is fine, but change it anyway to shut up this clever clogs of a compiler. Reported-by: torvic9@mailbox.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:43:42 -04:00
Paolo Bonzini	a25c78d04c	Merge commit 'kvm-pagedata-alloc-fixes' into HEAD	2021-10-18 14:13:37 -04:00
Paolo Bonzini	fa13843d15	KVM: X86: fix lazy allocation of rmaps If allocation of rmaps fails, but some of the pointers have already been written, those pointers can be cleaned up when the memslot is freed, or even reused later for another attempt at allocating the rmaps. Therefore there is no need to WARN, as done for example in memslot_rmap_alloc, but the allocation must be skipped lest KVM will overwrite the previous pointer and will indeed leak memory. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 14:07:17 -04:00
Andrei Vagin	a7cc099f2e	KVM: x86/mmu: kvm_faultin_pfn has to return false if pfh is returned This looks like a typo in `8f32d5e563`. This change didn't intend to do any functional changes. The problem was caught by gVisor tests. Fixes: `8f32d5e563` ("KVM: x86/mmu: allow kvm_faultin_pfn to return page fault handling code") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrei Vagin <avagin@gmail.com> Message-Id: <20211015163221.472508-1-avagin@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-18 08:19:18 -04:00
Paolo Bonzini	e2b6d941ec	KVM/arm64 fixes for 5.15, take #2 - Properly refcount pages used as a concatenated stage-2 PGD - Fix missing unlock when detecting the use of MTE+VM_SHARED -----BEGIN PGP SIGNATURE----- iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmFpPQwPHG1hekBrZXJu ZWwub3JnAAoJECPQ0LrRPXpDasoP/iNiTIEw7zrHs37Vx4dtTjz15RNF9gmFd2iR EXCg76V+2VN8/87bIcYdKeHkjXtERJ2tTCOJq9X/dn8MixvyShhCxJnk5chc1eZE 2W4GY0gkuKO6E5rDAe10kl+zeFKVAd77zAeUezZYGGfRlm1Ly8sXrwzfR7ZXtvDH 1DL89adycUJlh28lwcX73ScXWpiAsFXdWFjsVpHJngDZ1z3aYUqtPJmESH+mwtuK J85tTq4PZCsnubmUpuCYopgBJnogjh4KTulVOyenykvQyXZ+EVBjTA4Xh8iCyi77 LMxqazvVkFnL3tBw2H7OCWPhQ/xfmWclpKcngDvnfPMIW6b3Tl5dCIY6uNWoLMnV 8COh4ggahhqgsswHVCiFPBkJ/J3G+L8vLz2PHXtHJw6yOzPTEPD5yXH8oXoJhQ+Y U2aT83bWdc5sK9ZvhRHSpVXFZ6bWSUnMztP9szz1DM4QzyT6RHgXJE7TxWXFHhPR Hs0VGX13hz1f/AYN/+BDxicCtH9r/SbG6hlTPrt9+IFUAiz9EZ5Xy28fT1+c93Cf ArylAulF+HQyZMMe4RV2quAEXx5q5ag8pMsm/KQrh9fn0srrO4AnaoBGtuIjKJu3 ccBd/1vbIAU9AxLw82LHW558gAzdq3LskPApgjkTA19oj7ffdMEqqs/cIDkN7nDx QEDaVxFv =eQRY -----END PGP SIGNATURE----- Merge tag 'kvmarm-fixes-5.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 5.15, take #2 - Properly refcount pages used as a concatenated stage-2 PGD - Fix missing unlock when detecting the use of MTE+VM_SHARED	2021-10-15 04:47:55 -04:00
Paolo Bonzini	019057bd73	KVM: SEV-ES: fix length of string I/O The size of the data in the scratch buffer is not divided by the size of each port I/O operation, so vcpu->arch.pio.count ends up being larger than it should be by a factor of size. Cc: stable@vger.kernel.org Fixes: `7ed9abfe8e` ("KVM: SVM: Support string IO operations for an SEV-ES guest") Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-10-15 04:47:36 -04:00
Quentin Perret	6e6a8ef088	KVM: arm64: Release mmap_lock when using VM_SHARED with MTE VM_SHARED mappings are currently forbidden in a memslot with MTE to prevent two VMs racing to sanitise the same page. However, this check is performed while holding current->mm's mmap_lock, but fails to release it. Fix this by releasing the lock when needed. Fixes: `ea7fc1bb1c` ("KVM: arm64: Introduce MTE VM feature") Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211005122031.809857-1-qperret@google.com	2021-10-05 13:22:45 +01:00
Quentin Perret	7615c2a514	KVM: arm64: Report corrupted refcount at EL2 Some of the refcount manipulation helpers used at EL2 are instrumented to catch a corrupted state, but not all of them are treated equally. Let's make things more consistent by instrumenting hyp_page_ref_dec_and_test() as well. Acked-by: Will Deacon <will@kernel.org> Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211005090155.734578-6-qperret@google.com	2021-10-05 13:02:54 +01:00
Quentin Perret	1d58a17ef5	KVM: arm64: Fix host stage-2 PGD refcount The KVM page-table library refcounts the pages of concatenated stage-2 PGDs individually. However, when running KVM in protected mode, the host's stage-2 PGD is currently managed by EL2 as a single high-order compound page, which can cause the refcount of the tail pages to reach 0 when they shouldn't, hence corrupting the page-table. Fix this by introducing a new hyp_split_page() helper in the EL2 page allocator (matching the kernel's split_page() function), and make use of it from host_s2_zalloc_pages_exact(). Fixes: `1025c8c0c6` ("KVM: arm64: Wrap the host with a stage 2") Acked-by: Will Deacon <will@kernel.org> Suggested-by: Will Deacon <will@kernel.org> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211005090155.734578-5-qperret@google.com	2021-10-05 13:02:54 +01:00
Paolo Bonzini	542a2640a2	Initial KVM RISC-V support Following features are supported by the initial KVM RISC-V support: 1. No RISC-V specific KVM IOCTL 2. Loadable KVM RISC-V module 3. Minimal possible KVM world-switch which touches only GPRs and few CSRs 4. Works on both RV64 and RV32 host 5. Full Guest/VM switch via vcpu_get/vcpu_put infrastructure 6. KVM ONE_REG interface for VCPU register access from KVM user-space 7. Interrupt controller emulation in KVM user-space 8. Timer and IPI emuation in kernel 9. Both Sv39x4 and Sv48x4 supported for RV64 host 10. MMU notifiers supported 11. Generic dirty log supported 12. FP lazy save/restore supported 13. SBI v0.1 emulation for Guest/VM 14. Forward unhandled SBI calls to KVM user-space 15. Hugepage support for Guest/VM 16. IOEVENTFD support for Vhost -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmFb7+kACgkQrUjsVaLH LAfvBRAAiD/sUCbrGNk4IAOpI3drxBUTJAC3cCClOAlffHjgUgSf3Qlz+9NMNfuV 6h3Jk9cheIVwrsQLPG4Zawnk2bujFpaJ4rn+kO3aA1ArT+ECGR2XVie2CpBuJ5g7 ysRlrhcr9YVXXlt8ZczUBwYSjDkDaM1gaDQCt3ep65RCm8NU++Y/TlJOK4xFqC2Z Km2jlX2FAxKQLjNJstCPFGiXHhvRK9tu9TK18Fl9Ibk+oO7/H+tZ6VwCivzjzdiW kYAa6xdX4NT80H6UGeR/R72b2CekD2RJDnyaN4LE2ticIzuysZpxcI3806gOazDm 8zVXB9627LRzHzJt4JjdKw76BghuKE3DW8V8hg4WuTpVa6Nx6EHxgpOpGA2sFZDr QZPv4gWQeEGBsTxQJpVJ8D6sIEsQdNQQERUemQToAzX9Hl/kkKaSQSfocSOYqgVP iMNY4LsJfNg/vDEDzMub/kF9qH0jPH8pkQuNX/X0A8nAtJkznjxVN9NaTxDD25Tp udw5/easWD1qDGE6i4Kk/JyDb9NbVAK0ZK6CM5JjtqDtOAnVXCXy+rXgEPsNs3Td kZVa34hKAlTpVhMfI0yj0s5jX0d/2w5KuQX147C/IWi3pBdIYZLvcqAb/1cxSoff uyW4dryfU7cQQdOfbIKHgaLU3b221/q5wYmEi3zqtVbixhIUtNE= =MMFv -----END PGP SIGNATURE----- Merge tag 'kvm-riscv-5.16-1' of git://github.com/kvm-riscv/linux into HEAD Initial KVM RISC-V support Following features are supported by the initial KVM RISC-V support: 1. No RISC-V specific KVM IOCTL 2. Loadable KVM RISC-V module 3. Minimal possible KVM world-switch which touches only GPRs and few CSRs 4. Works on both RV64 and RV32 host 5. Full Guest/VM switch via vcpu_get/vcpu_put infrastructure 6. KVM ONE_REG interface for VCPU register access from KVM user-space 7. Interrupt controller emulation in KVM user-space 8. Timer and IPI emuation in kernel 9. Both Sv39x4 and Sv48x4 supported for RV64 host 10. MMU notifiers supported 11. Generic dirty log supported 12. FP lazy save/restore supported 13. SBI v0.1 emulation for Guest/VM 14. Forward unhandled SBI calls to KVM user-space 15. Hugepage support for Guest/VM 16. IOEVENTFD support for Vhost	2021-10-05 04:19:24 -04:00
Anup Patel	24b699d12c	RISC-V: KVM: Add MAINTAINERS entry Add myself as maintainer for KVM RISC-V and Atish as designated reviewer. Signed-off-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alexander Graf <graf@amazon.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>	2021-10-04 16:14:10 +05:30
Anup Patel	da40d85805	RISC-V: KVM: Document RISC-V specific parts of KVM API Document RISC-V specific parts of the KVM API, such as: - The interrupt numbers passed to the KVM_INTERRUPT ioctl. - The states supported by the KVM_{GET,SET}_MP_STATE ioctls. - The registers supported by the KVM_{GET,SET}_ONE_REG interface and the encoding of those register ids. - The exit reason KVM_EXIT_RISCV_SBI for SBI calls forwarded to userspace tool. CC: Jonathan Corbet <corbet@lwn.net> CC: linux-doc@vger.kernel.org Signed-off-by: Anup Patel <anup.patel@wdc.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>	2021-10-04 16:12:34 +05:30

1 2 3 4 5 ...

1043141 commits