Commit graph

40528 commits

Author SHA1 Message Date
Linus Torvalds
f0e18b03fc - Free shmem backing storage for SGX enclave pages when those are
swapped back into EPC memory
 
 - Prevent do_int3() from being kprobed, to avoid recursion
 
 - Remap setup_data and setup_indirect structures properly when accessing
 their members
 
 - Correct the alternatives patching order for modules too
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmItzJgACgkQEsHwGGHe
 VUqaow/8C115xuZEBn+iT+adcQxbqrg3S2en/Hq0aJOEhkNkbOhgAW0OWHvj7Gs3
 +2taD35MqzneEOfa0Gv46600V4+SV5K5NAFndr4PA2FVgIw01rEQios2oc4QSQBP
 PVJgvGyIMpN71ODKTiZ8w4ihp3J7MWDkCP1z4hbO/lfM4tOXcYzh2Lv1fE8hHr5b
 qFtPDyYgfEKUVFa+sv2sE1cJw670UFDcFqGAIjxUUm0r78GKmPz08gZm9YiBTJgV
 jrxySdpAh/eaPeHNfFH9RzAD2ZGZppgIkPCp33ZdrMEhnZmwLz7vc76BMbkD2P6w
 1fBmBZ5F8yOMaaLHSGh4Ek5Gs3p9DjmaZdEWwz+yiIe1RFLKyOQu6gsmGbAyuQx4
 KSfPFnfkOfw/7cz6BSp3Sh6zgrGPqloIVcHkWRth/LJZSV/fVgM8bPg3VLJP6WFi
 o4WTcNAq/fNMAmGwtIVpTUW/QJafXvOauKkDGQkMQ87U68QSh6uDrvrvMHPF8W+Y
 SPcYrdsAPagLxq0GCCQ6doSvBjWNTolXfTnfAoATZpae0URmrvu9ddgUbIlgeQWY
 n/rK+cKk+iuLTEZC55+v5OALwEMOM3Tuz4Ghko8re0pkD/kE61m3Az6w5sKN3Inc
 c21tvO/dxHhAnHV+34d2LM27PU4qoFdVO2mPup702x68XT+X0/g=
 =YLph
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.17_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Free shmem backing storage for SGX enclave pages when those are
   swapped back into EPC memory

 - Prevent do_int3() from being kprobed, to avoid recursion

 - Remap setup_data and setup_indirect structures properly when
   accessing their members

 - Correct the alternatives patching order for modules too

* tag 'x86_urgent_for_v5.17_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Free backing memory after faulting the enclave page
  x86/traps: Mark do_int3() NOKPROBE_SYMBOL
  x86/boot: Add setup_indirect support in early_memremap_is_setup_data()
  x86/boot: Fix memremap of setup_indirect structures
  x86/module: Fix the paravirt vs alternative order
2022-03-13 10:36:38 -07:00
Jarkko Sakkinen
08999b2489 x86/sgx: Free backing memory after faulting the enclave page
There is a limited amount of SGX memory (EPC) on each system.  When that
memory is used up, SGX has its own swapping mechanism which is similar
in concept but totally separate from the core mm/* code.  Instead of
swapping to disk, SGX swaps from EPC to normal RAM.  That normal RAM
comes from a shared memory pseudo-file and can itself be swapped by the
core mm code.  There is a hierarchy like this:

	EPC <-> shmem <-> disk

After data is swapped back in from shmem to EPC, the shmem backing
storage needs to be freed.  Currently, the backing shmem is not freed.
This effectively wastes the shmem while the enclave is running.  The
memory is recovered when the enclave is destroyed and the backing
storage freed.

Sort this out by freeing memory with shmem_truncate_range(), as soon as
a page is faulted back to the EPC.  In addition, free the memory for
PCMD pages as soon as all PCMD's in a page have been marked as unused
by zeroing its contents.

Cc: stable@vger.kernel.org
Fixes: 1728ab54b4 ("x86/sgx: Add a page reclaimer")
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220303223859.273187-1-jarkko@kernel.org
2022-03-11 10:31:06 -08:00
Li Huafei
a365a65f9c x86/traps: Mark do_int3() NOKPROBE_SYMBOL
Since kprobe_int3_handler() is called in do_int3(), probing do_int3()
can cause a breakpoint recursion and crash the kernel. Therefore,
do_int3() should be marked as NOKPROBE_SYMBOL.

Fixes: 21e28290b3 ("x86/traps: Split int3 handler up")
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220310120915.63349-1-lihuafei1@huawei.com
2022-03-11 19:19:30 +01:00
Ross Philipson
445c1470b6 x86/boot: Add setup_indirect support in early_memremap_is_setup_data()
The x86 boot documentation describes the setup_indirect structures and
how they are used. Only one of the two functions in ioremap.c that needed
to be modified to be aware of the introduction of setup_indirect
functionality was updated. Adds comparable support to the other function
where it was missing.

Fixes: b3c72fc9a7 ("x86/boot: Introduce setup_indirect")
Signed-off-by: Ross Philipson <ross.philipson@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/1645668456-22036-3-git-send-email-ross.philipson@oracle.com
2022-03-09 12:49:46 +01:00
Ross Philipson
7228918b34 x86/boot: Fix memremap of setup_indirect structures
As documented, the setup_indirect structure is nested inside
the setup_data structures in the setup_data list. The code currently
accesses the fields inside the setup_indirect structure but only
the sizeof(struct setup_data) is being memremapped. No crash
occurred but this is just due to how the area is remapped under the
covers.

Properly memremap both the setup_data and setup_indirect structures
in these cases before accessing them.

Fixes: b3c72fc9a7 ("x86/boot: Introduce setup_indirect")
Signed-off-by: Ross Philipson <ross.philipson@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/1645668456-22036-2-git-send-email-ross.philipson@oracle.com
2022-03-09 12:49:44 +01:00
Peter Zijlstra
5adf349439 x86/module: Fix the paravirt vs alternative order
Ever since commit

  4e6292114c ("x86/paravirt: Add new features for paravirt patching")

there is an ordering dependency between patching paravirt ops and
patching alternatives, the module loader still violates this.

Fixes: 4e6292114c ("x86/paravirt: Add new features for paravirt patching")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220303112825.068773913@infradead.org
2022-03-08 14:15:25 +01:00
Linus Torvalds
4a01e748a5 - Mitigate Spectre v2-type Branch History Buffer attacks on machines
which support eIBRS, i.e., the hardware-assisted speculation restriction
 after it has been shown that such machines are vulnerable even with the
 hardware mitigation.
 
 - Do not use the default LFENCE-based Spectre v2 mitigation on AMD as it
 is insufficient to mitigate such attacks. Instead, switch to retpolines
 on all AMD by default.
 
 - Update the docs and add some warnings for the obviously vulnerable
 cmdline configurations.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmIkktUACgkQEsHwGGHe
 VUo7ZQ/+O4hzL/tHY0V/ekkDxCrJ3q3Hp+DcxUl2ee5PC3Qgxv1Z1waH6ppK8jQs
 marAGr7FYbvzY039ON7irxhpSIckBCpx9tM2F43zsPxxY8EdxGojkHbmaqso5HtW
 l3/O28AcZYoKN/fF8rRAIJy4hrTVascKrNJ2fOiYWYBT62ZIoPm0FusgXbKTZPD+
 gT7iUMoyPjBnKdWDT9L6kKOxDF9TivX1Y6JdDHbnnBsgRkeFatkeq9BJ93M73q63
 Ziq9c8ZcEXyKez+cGFCfXM7+pNYmfsiL48lilTyf+v+GXahDJQOkFw39j5zXEALm
 Nk6yB3PRQ74pEwm5WbK7KO8iwPpblmnDB978mfUcpk+9xWJD8pyoUcItAmCBsXh1
 LjIImYPqL6YihUb9udh+PEDISsfzWNzr4T+kgW9/yXXG4ZmGy3TLInhTK+rNAxJa
 EshWZExEZj6yJvt83Vu08W9fppYJq976tJvl8LWOYthaxqY7IQz0q7mYd799yxk0
 MLPqvZP1+4pHzqn2c9yeHgrwHwMmoqcyMx6B3EA5maYQPdlT7Fk9RCBeCdIA/ieF
 OgGxy1WwMH+cvUa5MaBy3Y32LeYU3bUJh0yPFq/7BxEYGG9PJtLhg2xTo1Ui8F1d
 fKrcSFcjZKVJ9UE5HaqOcp4ka+Q220I9IDGURXkAFQlnOU7X7CE=
 =Athd
 -----END PGP SIGNATURE-----

Merge tag 'x86_bugs_for_v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 spectre fixes from Borislav Petkov:

 - Mitigate Spectre v2-type Branch History Buffer attacks on machines
   which support eIBRS, i.e., the hardware-assisted speculation
   restriction after it has been shown that such machines are vulnerable
   even with the hardware mitigation.

 - Do not use the default LFENCE-based Spectre v2 mitigation on AMD as
   it is insufficient to mitigate such attacks. Instead, switch to
   retpolines on all AMD by default.

 - Update the docs and add some warnings for the obviously vulnerable
   cmdline configurations.

* tag 'x86_bugs_for_v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation: Warn about eIBRS + LFENCE + Unprivileged eBPF + SMT
  x86/speculation: Warn about Spectre v2 LFENCE mitigation
  x86/speculation: Update link to AMD speculation whitepaper
  x86/speculation: Use generic retpoline by default on AMD
  x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting
  Documentation/hw-vuln: Update spectre doc
  x86/speculation: Add eIBRS + Retpoline options
  x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE
2022-03-07 17:29:47 -08:00
Linus Torvalds
f81664f760 x86 guest:
* Tweaks to the paravirtualization code, to avoid using them
 when they're pointless or harmful
 
 x86 host:
 
 * Fix for SRCU lockdep splat
 
 * Brown paper bag fix for the propagation of errno
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmIkkdsUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP15Qf7B8BXNMlNkret5WN/4pGf06gNdIY6
 ZqC8t/Lx1+fCkzGk+VtAw0bxRscOF4z1XzvfywO5ZI5bxQB/b2xTyBkVY90SqhsB
 shug5QpikejpmvVZJXxwD3+loCUah2T6FUT6QJa0sKVhW+XiqOva8fAmYLG5agaa
 VGvqFXTXiVmbiw/O9ZI/CfUC0WNrn+I1iDO+oGWyhv/22tePxGCizVczRFJn6DAD
 Vh5P6AfOqXjmzdpUeOiU544FQZPHAZehb7/xYc0T9GSW4fPnTmHwRzwhUqgJnx7d
 3E+eWGwny+Q/OrpKf7SbxtB65yn7lHRmdN/YtCHygl4sjs6CdjSPY8/9jQ==
 =PPz1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "x86 guest:

   - Tweaks to the paravirtualization code, to avoid using them when
     they're pointless or harmful

  x86 host:

   - Fix for SRCU lockdep splat

   - Brown paper bag fix for the propagation of errno"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
  KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
  KVM: x86: Yield to IPI target vCPU only if it is busy
  x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
  x86/kvm: Don't waste memory if kvmclock is disabled
  x86/kvm: Don't use PV TLB/yield when mwait is advertised
2022-03-06 12:08:42 -08:00
Josh Poimboeuf
0de05d056a x86/speculation: Warn about eIBRS + LFENCE + Unprivileged eBPF + SMT
The commit

   44a3918c82 ("x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting")

added a warning for the "eIBRS + unprivileged eBPF" combination, which
has been shown to be vulnerable against Spectre v2 BHB-based attacks.

However, there's no warning about the "eIBRS + LFENCE retpoline +
unprivileged eBPF" combo. The LFENCE adds more protection by shortening
the speculation window after a mispredicted branch. That makes an attack
significantly more difficult, even with unprivileged eBPF. So at least
for now the logic doesn't warn about that combination.

But if you then add SMT into the mix, the SMT attack angle weakens the
effectiveness of the LFENCE considerably.

So extend the "eIBRS + unprivileged eBPF" warning to also include the
"eIBRS + LFENCE + unprivileged eBPF + SMT" case.

  [ bp: Massage commit message. ]

Suggested-by: Alyssa Milburn <alyssa.milburn@linux.intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-03-05 09:30:47 +01:00
Josh Poimboeuf
eafd987d4a x86/speculation: Warn about Spectre v2 LFENCE mitigation
With:

  f8a66d608a ("x86,bugs: Unconditionally allow spectre_v2=retpoline,amd")

it became possible to enable the LFENCE "retpoline" on Intel. However,
Intel doesn't recommend it, as it has some weaknesses compared to
retpoline.

Now AMD doesn't recommend it either.

It can still be left available as a cmdline option. It's faster than
retpoline but is weaker in certain scenarios -- particularly SMT, but
even non-SMT may be vulnerable in some cases.

So just unconditionally warn if the user requests it on the cmdline.

  [ bp: Massage commit message. ]

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-03-05 09:16:24 +01:00
Paolo Bonzini
8d25b7beca KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
kvm_arch_vcpu_ioctl_run is already doing srcu_read_lock/unlock in two
places, namely vcpu_run and post_kvm_run_save, and a third is actually
needed around the call to vcpu->arch.complete_userspace_io to avoid
the following splat:

  WARNING: suspicious RCU usage
  arch/x86/kvm/pmu.c:190 suspicious rcu_dereference_check() usage!
  other info that might help us debug this:
  rcu_scheduler_active = 2, debug_locks = 1
  1 lock held by CPU 28/KVM/370841:
  #0: ff11004089f280b8 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x87/0x730 [kvm]
  Call Trace:
   <TASK>
   dump_stack_lvl+0x59/0x73
   reprogram_fixed_counter+0x15d/0x1a0 [kvm]
   kvm_pmu_trigger_event+0x1a3/0x260 [kvm]
   ? free_moved_vector+0x1b4/0x1e0
   complete_fast_pio_in+0x8a/0xd0 [kvm]

This splat is not at all unexpected, since complete_userspace_io callbacks
can execute similar code to vmexits.  For example, SVM with nrips=false
will call into the emulator from svm_skip_emulated_instruction().

While it's tempting to never acquire kvm->srcu for an uninitialized vCPU,
practically speaking there's no penalty to acquiring kvm->srcu "early"
as the KVM_MP_STATE_UNINITIALIZED path is a one-time thing per vCPU.  On
the other hand, seemingly innocuous helpers like kvm_apic_accept_events()
and sync_regs() can theoretically reach code that might access
SRCU-protected data structures, e.g. sync_regs() can trigger forced
existing of nested mode via kvm_vcpu_ioctl_x86_set_vcpu_events().

Reported-by: Like Xu <likexu@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-02 10:55:58 -05:00
Like Xu
c6c937d673 KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
Just like on the optional mmu_alloc_direct_roots() path, once shadow
path reaches "r = -EIO" somewhere, the caller needs to know the actual
state in order to enter error handling and avoid something worse.

Fixes: 4a38162ee9 ("KVM: MMU: load PDPTRs outside mmu_lock")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220301124941.48412-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-03-02 10:55:58 -05:00
Linus Torvalds
fb184c4af9 The bigger part of the change is a revert for x86 hosts. Here the
second patch was supposed to fix the first, but in reality it was
 just as broken, so both have to go.
 
 x86 host:
 
 * Revert incorrect assumption that cr3 changes come with preempt notifier
   callbacks (they don't when static branches are changed, for example)
 
 ARM host:
 
 * Correctly synchronise PMR and co on PSCI CPU_SUSPEND
 
 * Skip tests that depend on GICv3 when the HW isn't available
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmIeGnUUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMCYgf9GPVUOQUbHVxVqvKB1ABnY3ZIZuS3
 +/XLgVifSmXb2sPQmcKIPk7eQkxlzpnVdbznJO5qFMtVKRv/ppj+/ly2wwF8l+rR
 m1XvyYo2sukK5vTpBrQiRm3aWY7vpx0ds4DStLnrnBuPF/U7x6WlHSL/BqXaNcSJ
 e+SXd/UFhkg7dEQaU3eqXyf2/mMfR2ZLdUb4v+/UiV7kfzzvRqNERd8HUoVk2FcM
 VYBr07ChaV4XB/dZsCDVSz2Z7f7rH3sMMW82ZHKjuFUEW4Dij9NiX2ycaeRvkSLG
 tnliTuROCY2bOQeIVCTHf5XqCAAm7sA1AoClFaUy30+UW9s9j45NuhUQbA==
 =nuHK
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "The bigger part of the change is a revert for x86 hosts. Here the
  second patch was supposed to fix the first, but in reality it was just
  as broken, so both have to go.

  x86 host:

   - Revert incorrect assumption that cr3 changes come with preempt
     notifier callbacks (they don't when static branches are changed,
     for example)

  ARM host:

   - Correctly synchronise PMR and co on PSCI CPU_SUSPEND

   - Skip tests that depend on GICv3 when the HW isn't available"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: selftests: aarch64: Skip tests if we can't create a vgic-v3
  Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"
  Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()"
  KVM: arm64: Don't miss pending interrupts for suspended vCPU
2022-03-01 12:01:18 -08:00
Kim Phillips
244d00b5dd x86/speculation: Use generic retpoline by default on AMD
AMD retpoline may be susceptible to speculation. The speculation
execution window for an incorrect indirect branch prediction using
LFENCE/JMP sequence may potentially be large enough to allow
exploitation using Spectre V2.

By default, don't use retpoline,lfence on AMD.  Instead, use the
generic retpoline.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
2022-02-28 18:37:08 +01:00
Li RongQing
9ee83635d8 KVM: x86: Yield to IPI target vCPU only if it is busy
When sending a call-function IPI-many to vCPUs, yield to the
IPI target vCPU which is marked as preempted.

but when emulating HLT, an idling vCPU will be voluntarily
scheduled out and mark as preempted from the guest kernel
perspective. yielding to idle vCPU is pointless and increase
unnecessary vmexit, maybe miss the true preempted vCPU

so yield to IPI target vCPU only if vCPU is busy and preempted

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-Id: <1644380201-29423-1-git-send-email-lirongqing@baidu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 10:09:35 -05:00
Dexuan Cui
92e68cc558 x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
When Linux runs as an Isolated VM on Hyper-V, it supports AMD SEV-SNP
but it's partially enlightened, i.e. cc_platform_has(
CC_ATTR_GUEST_MEM_ENCRYPT) is true but sev_active() is false.

Commit 4d96f91091 per se is good, but with it now
kvm_setup_vsyscall_timeinfo() -> kvmclock_init_mem() calls
set_memory_decrypted(), and later gets stuck when trying to zere out
the pages pointed by 'hvclock_mem', if Linux runs as an Isolated VM on
Hyper-V. The cause is that here now the Linux VM should no longer access
the original guest physical addrss (GPA); instead the VM should do
memremap() and access the original GPA + ms_hyperv.shared_gpa_boundary:
see the example code in drivers/hv/connection.c: vmbus_connect() or
drivers/hv/ring_buffer.c: hv_ringbuffer_init(). If the VM tries to
access the original GPA, it keepts getting injected a fault by Hyper-V
and gets stuck there.

Here the issue happens only when the VM has >=65 vCPUs, because the
global static array hv_clock_boot[] can hold 64 "struct
pvclock_vsyscall_time_info" (the sizeof of the struct is 64 bytes), so
kvmclock_init_mem() only allocates memory in the case of vCPUs > 64.

Since the 'hvclock_mem' pages are only useful when the kvm clock is
supported by the underlying hypervisor, fix the issue by returning
early when Linux VM runs on Hyper-V, which doesn't support kvm clock.

Fixes: 4d96f91091 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()")
Tested-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Message-Id: <20220225084600.17817-1-decui@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 10:09:34 -05:00
Wanpeng Li
3c51d0a6c7 x86/kvm: Don't waste memory if kvmclock is disabled
Even if "no-kvmclock" is passed in cmdline parameter, the guest kernel
still allocates hvclock_mem which is scaled by the number of vCPUs,
let's check kvmclock enable in advance to avoid this memory waste.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1645520523-30814-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 10:09:34 -05:00
Wanpeng Li
40cd58dbf1 x86/kvm: Don't use PV TLB/yield when mwait is advertised
MWAIT is advertised in host is not overcommitted scenario, however, PV
TLB/sched yield should be enabled in host overcommitted scenario. Let's
add the MWAIT checking when enabling PV TLB/sched yield.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1645777780-2581-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 10:09:34 -05:00
Sean Christopherson
1a71581012 Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"
Revert back to refreshing vmcs.HOST_CR3 immediately prior to VM-Enter.
The PCID (ASID) part of CR3 can be bumped without KVM being scheduled
out, as the kernel will switch CR3 during __text_poke(), e.g. in response
to a static key toggling.  If switch_mm_irqs_off() chooses a new ASID for
the mm associate with KVM, KVM will do VM-Enter => VM-Exit with a stale
vmcs.HOST_CR3.

Add a comment to explain why KVM must wait until VM-Enter is imminent to
refresh vmcs.HOST_CR3.

The following splat was captured by stashing vmcs.HOST_CR3 in kvm_vcpu
and adding a WARN in load_new_mm_cr3() to fire if a new ASID is being
loaded for the KVM-associated mm while KVM has a "running" vCPU:

  static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
  {
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	...

	WARN(vcpu && (vcpu->cr3 & GENMASK(11, 0)) != (new_mm_cr3 & GENMASK(11, 0)) &&
	     (vcpu->cr3 & PHYSICAL_PAGE_MASK) == (new_mm_cr3 & PHYSICAL_PAGE_MASK),
	     "KVM is hosed, loading CR3 = %lx, vmcs.HOST_CR3 = %lx", new_mm_cr3, vcpu->cr3);
  }

  ------------[ cut here ]------------
  KVM is hosed, loading CR3 = 8000000105393004, vmcs.HOST_CR3 = 105393003
  WARNING: CPU: 4 PID: 20717 at arch/x86/mm/tlb.c:291 load_new_mm_cr3+0x82/0xe0
  Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel
  CPU: 4 PID: 20717 Comm: stable Tainted: G        W         5.17.0-rc3+ #747
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:load_new_mm_cr3+0x82/0xe0
  RSP: 0018:ffffc9000489fa98 EFLAGS: 00010082
  RAX: 0000000000000000 RBX: 8000000105393004 RCX: 0000000000000027
  RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277d1b788
  RBP: 0000000000000004 R08: ffff888277d1b780 R09: ffffc9000489f8b8
  R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
  R13: ffff88810678a800 R14: 0000000000000004 R15: 0000000000000c33
  FS:  00007fa9f0e72700(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 00000001001b5003 CR4: 0000000000172ea0
  Call Trace:
   <TASK>
   switch_mm_irqs_off+0x1cb/0x460
   __text_poke+0x308/0x3e0
   text_poke_bp_batch+0x168/0x220
   text_poke_finish+0x1b/0x30
   arch_jump_label_transform_apply+0x18/0x30
   static_key_slow_inc_cpuslocked+0x7c/0x90
   static_key_slow_inc+0x16/0x20
   kvm_lapic_set_base+0x116/0x190
   kvm_set_apic_base+0xa5/0xe0
   kvm_set_msr_common+0x2f4/0xf60
   vmx_set_msr+0x355/0xe70 [kvm_intel]
   kvm_set_msr_ignored_check+0x91/0x230
   kvm_emulate_wrmsr+0x36/0x120
   vmx_handle_exit+0x609/0x6c0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x146f/0x1b80
   kvm_vcpu_ioctl+0x279/0x690
   __x64_sys_ioctl+0x83/0xb0
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>
  ---[ end trace 0000000000000000 ]---

This reverts commit 15ad9762d6.

Fixes: 15ad9762d6 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
Reported-by: Wanpeng Li <kernellwp@gmail.com>
Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
Message-Id: <20220224191917.3508476-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 04:02:18 -05:00
Sean Christopherson
bca06b85fc Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()"
Undo a nested VMX fix as a step toward reverting the commit it fixed,
15ad9762d6 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"),
as the underlying premise that "host CR3 in the vcpu thread can only be
changed when scheduling" is wrong.

This reverts commit a9f2705ec8.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220224191917.3508476-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25 04:02:18 -05:00
Linus Torvalds
1f840c0ef4 x86 host:
* Expose KVM_CAP_ENABLE_CAP since it is supported
 
 * Disable KVM_HC_CLOCK_PAIRING in TSC catchup mode
 
 * Ensure async page fault token is nonzero
 
 * Fix lockdep false negative
 
 * Fix FPU migration regression from the AMX changes
 
 x86 guest:
 
 * Don't use PV TLB/IPI/yield on uniprocessor guests
 
 PPC:
 * reserve capability id (topic branch for ppc/kvm)
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmIXyQAUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroPKJQf/T9NeXOFIPIIlH4ZKM7155qlwX8dx
 NR2YV+RNYd27MDkaEm9w4ucXacGpPuBPPx9v7UiLlAqAN+NP7nF3rQKC0SpQMC6H
 EKFtm+8al8EzyDYP36fqnwDne/xWHlOeGXRRJMKPGhXBSoXoY5cK35IXmNZjfteQ
 hK7siBs2saJ2VFqMCbJ9Pqdu1NDO6OEt8HWz2Dnx6EUd90O0pHWZy5JvWOYfyLjL
 Y2pP0dZQxuB/PmqkpVj2gV9jK2Zhj33eerzDV4tVXPV7le8fgGeTaJ8ft+SUIizS
 YCcPR89+u5c9yzlwY2i7mvloayKnuqkECiGtRG6VHNlrPZTPijems8tH1w==
 =lWjy
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "x86 host:

   - Expose KVM_CAP_ENABLE_CAP since it is supported

   - Disable KVM_HC_CLOCK_PAIRING in TSC catchup mode

   - Ensure async page fault token is nonzero

   - Fix lockdep false negative

   - Fix FPU migration regression from the AMX changes

  x86 guest:

   - Don't use PV TLB/IPI/yield on uniprocessor guests

  PPC:

   - reserve capability id (topic branch for ppc/kvm)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: nSVM: disallow userspace setting of MSR_AMD64_TSC_RATIO to non default value when tsc scaling disabled
  KVM: x86/mmu: make apf token non-zero to fix bug
  KVM: PPC: reserve capability 210 for KVM_CAP_PPC_AIL_MODE_3
  x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU
  x86/kvm: Fix compilation warning in non-x86_64 builds
  x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0
  x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
  kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode
  KVM: Fix lockdep false negative during host resume
  KVM: x86: Add KVM_CAP_ENABLE_CAP to x86
2022-02-24 14:05:49 -08:00
Maxim Levitsky
e910a53fb4 KVM: x86: nSVM: disallow userspace setting of MSR_AMD64_TSC_RATIO to non default value when tsc scaling disabled
If nested tsc scaling is disabled, MSR_AMD64_TSC_RATIO should
never have non default value.

Due to way nested tsc scaling support was implmented in qemu,
it would set this msr to 0 when nested tsc scaling was disabled.
Ignore that value for now, as it causes no harm.

Fixes: 5228eb96a4 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: stable@vger.kernel.org

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220223115649.319134-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-24 13:04:47 -05:00
Liang Zhang
6f3c1fc53d KVM: x86/mmu: make apf token non-zero to fix bug
In current async pagefault logic, when a page is ready, KVM relies on
kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
a READY event to the Guest. This function test token value of struct
kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a
READY event is finished by Guest. If value is zero meaning that a READY
event is done, so the KVM can deliver another.
But the kvm_arch_setup_async_pf() may produce a valid token with zero
value, which is confused with previous mention and may lead the loss of
this READY event.

This bug may cause task blocked forever in Guest:
 INFO: task stress:7532 blocked for more than 1254 seconds.
       Not tainted 5.10.0 #16
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:stress          state:D stack:    0 pid: 7532 ppid:  1409
 flags:0x00000080
 Call Trace:
  __schedule+0x1e7/0x650
  schedule+0x46/0xb0
  kvm_async_pf_task_wait_schedule+0xad/0xe0
  ? exit_to_user_mode_prepare+0x60/0x70
  __kvm_handle_async_pf+0x4f/0xb0
  ? asm_exc_page_fault+0x8/0x30
  exc_page_fault+0x6f/0x110
  ? asm_exc_page_fault+0x8/0x30
  asm_exc_page_fault+0x1e/0x30
 RIP: 0033:0x402d00
 RSP: 002b:00007ffd31912500 EFLAGS: 00010206
 RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
 RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
 RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
 R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
 R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000

Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-24 13:04:46 -05:00
Josh Poimboeuf
44a3918c82 x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting
With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable
to Spectre v2 BHB-based attacks.

When both are enabled, print a warning message and report it in the
'spectre_v2' sysfs vulnerabilities file.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2022-02-21 10:21:47 +01:00
Peter Zijlstra
1e19da8522 x86/speculation: Add eIBRS + Retpoline options
Thanks to the chaps at VUsec it is now clear that eIBRS is not
sufficient, therefore allow enabling of retpolines along with eIBRS.

Add spectre_v2=eibrs, spectre_v2=eibrs,lfence and
spectre_v2=eibrs,retpoline options to explicitly pick your preferred
means of mitigation.

Since there's new mitigations there's also user visible changes in
/sys/devices/system/cpu/vulnerabilities/spectre_v2 to reflect these
new mitigations.

  [ bp: Massage commit message, trim error messages,
    do more precise eIBRS mode checking. ]

Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Patrick Colp <patrick.colp@oracle.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2022-02-21 10:21:35 +01:00
Peter Zijlstra (Intel)
d45476d983 x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE
The RETPOLINE_AMD name is unfortunate since it isn't necessarily
AMD only, in fact Hygon also uses it. Furthermore it will likely be
sufficient for some Intel processors. Therefore rename the thing to
RETPOLINE_LFENCE to better describe what it is.

Add the spectre_v2=retpoline,lfence option as an alias to
spectre_v2=retpoline,amd to preserve existing setups. However, the output
of /sys/devices/system/cpu/vulnerabilities/spectre_v2 will be changed.

  [ bp: Fix typos, massage. ]

Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2022-02-21 10:21:28 +01:00
Linus Torvalds
222177397a - Fix the ptrace regset xfpregs_set() callback to behave according to the ABI
- Handle poisoned pages properly in the SGX reclaimer code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmISMboACgkQEsHwGGHe
 VUoo0hAAvohA86G6CCfDvQUjSoimYqsKockWCS594/Fz0eEEi7wf+1e48OOtEVnW
 4TCqd/dCNXnzOSj0KE2/nY+X3SaOv+2gLr1uyv+A8iX8nkTt+3Hkrzjm6mkSE5Bq
 C6aKANwdqfzmyy7efYHYHOb4hbcMe8t1df7rPsAwaT+MVJCOgjnICLAx4+YuhNqP
 hvBs6o5wFgPx6/LJzmgaL1QA/qxstmhagLncU1MXZPte6xo+Dp1/6qMm2QFiJgIr
 clxZU8Q87NlzIAOGFohwzs0fm3DID529PzTfArqsRJPKB8TX0EUoZ8d2OPzD03I1
 aOdw6dhdNxfgORI0+5glffJ6aGbUi076x+1K4r4wYzGHW+hjd/gyNxIfnGOOgKOG
 ZZdf01wiOpcvzBHk/yx+jSfmwI5Zt079DCTD9MnutV8k952dtqZNmJ+Vrkmj9aCS
 Lsv/AZJDGa4HEBAI+lvYCZb1xFzBIDCinVgl/PrFrcMFuRi737J2C50LzYeZpg/x
 gllkk3U5kTlZ1iGiQlRzBlDu28vXVIV9koF5xoOwNLYjpFr6Jog62tiioP5IS9ym
 VVxAaYlzYNNMjwwtypS55sMmYK+dnaK/jrmCp9xI3br5e9OXhZCr3dq/od03ijPH
 jhFCr0Ec7IPIk5uHj9uvcUDKyvrqjDWCG8aK1Yc65Ed5pLYX7cQ=
 =5Bl4
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Fix the ptrace regset xfpregs_set() callback to behave according to
   the ABI

 - Handle poisoned pages properly in the SGX reclaimer code

* tag 'x86_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ptrace: Fix xfpregs_set()'s incorrect xmm clearing
  x86/sgx: Fix missing poison handling in reclaimer
2022-02-20 12:46:21 -08:00
Andy Lutomirski
44cad52cc1 x86/ptrace: Fix xfpregs_set()'s incorrect xmm clearing
xfpregs_set() handles 32-bit REGSET_XFP and 64-bit REGSET_FP. The actual
code treats these regsets as modern FX state (i.e. the beginning part of
XSTATE). The declarations of the regsets thought they were the legacy
i387 format. The code thought they were the 32-bit (no xmm8..15) variant
of XSTATE and, for good measure, made the high bits disappear by zeroing
the wrong part of the buffer. The latter broke ptrace, and everything
else confused anyone trying to understand the code. In particular, the
nonsense definitions of the regsets confused me when I wrote this code.

Clean this all up. Change the declarations to match reality (which
shouldn't change the generated code, let alone the ABI) and fix
xfpregs_set() to clear the correct bits and to only do so for 32-bit
callers.

Fixes: 6164331d15 ("x86/fpu: Rewrite xfpregs_set()")
Reported-by: Luís Ferreira <contact@lsferreira.net>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215524
Link: https://lore.kernel.org/r/YgpFnZpF01WwR8wU@zn.tnic
2022-02-18 11:23:21 +01:00
Wanpeng Li
ec756e40e2 x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU
Inspired by commit 3553ae5690 (x86/kvm: Don't use pvqspinlock code if
only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable
pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1645171838-2855-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18 03:36:24 -05:00
Leonardo Bras
ba1f77c546 x86/kvm: Fix compilation warning in non-x86_64 builds
On non-x86_64 builds, helpers gtod_is_based_on_tsc() and
kvm_guest_supported_xfd() are defined but never used.  Because these are
static inline but are in a .c file, some compilers do warn for them with
-Wunused-function, which becomes an error if -Werror is present.

Add #ifdef so they are only defined in x86_64 builds.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Message-Id: <20220218034100.115702-1-leobras@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18 03:33:45 -05:00
Reinette Chatre
e5733d8c89 x86/sgx: Fix missing poison handling in reclaimer
The SGX reclaimer code lacks page poison handling in its main
free path. This can lead to avoidable machine checks if a
poisoned page is freed and reallocated instead of being
isolated.

A troublesome scenario is:
 1. Machine check (#MC) occurs (asynchronous, !MF_ACTION_REQUIRED)
 2. arch_memory_failure() is eventually called
 3. (SGX) page->poison set to 1
 4. Page is reclaimed
 5. Page added to normal free lists by sgx_reclaim_pages()
    ^ This is the bug (poison pages should be isolated on the
    sgx_poison_page_list instead)
 6. Page is reallocated by some innocent enclave, a second (synchronous)
    in-kernel #MC is induced, probably during EADD instruction.
    ^ This is the fallout from the bug

(6) is unfortunate and can be avoided by replacing the open coded
enclave page freeing code in the reclaimer with sgx_free_epc_page()
to obtain support for poison page handling that includes placing the
poisoned page on the correct list.

Fixes: d6d261bded ("x86/sgx: Add new sgx_epc_page flag bit to mark free pages")
Fixes: 992801ae92 ("x86/sgx: Initial poison handling for dirty and free pages")
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/dcc95eb2aaefb042527ac50d0a50738c7c160dac.1643830353.git.reinette.chatre@intel.com
2022-02-17 10:24:50 -08:00
Leonardo Bras
988896bb61 x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0
kvm_vcpu_arch currently contains the guest supported features in both
guest_supported_xcr0 and guest_fpu.fpstate->user_xfeatures field.

Currently both fields are set to the same value in
kvm_vcpu_after_set_cpuid() and are not changed anywhere else after that.

Since it's not good to keep duplicated data, remove guest_supported_xcr0.

To keep the code more readable, introduce kvm_guest_supported_xcr()
and kvm_guest_supported_xfd() to replace the previous usages of
guest_supported_xcr0.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Message-Id: <20220217053028.96432-3-leobras@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17 10:06:49 -05:00
Leonardo Bras
ad856280dd x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
During host/guest switch (like in kvm_arch_vcpu_ioctl_run()), the kernel
swaps the fpu between host/guest contexts, by using fpu_swap_kvm_fpstate().

When xsave feature is available, the fpu swap is done by:
- xsave(s) instruction, with guest's fpstate->xfeatures as mask, is used
  to store the current state of the fpu registers to a buffer.
- xrstor(s) instruction, with (fpu_kernel_cfg.max_features &
  XFEATURE_MASK_FPSTATE) as mask, is used to put the buffer into fpu regs.

For xsave(s) the mask is used to limit what parts of the fpu regs will
be copied to the buffer. Likewise on xrstor(s), the mask is used to
limit what parts of the fpu regs will be changed.

The mask for xsave(s), the guest's fpstate->xfeatures, is defined on
kvm_arch_vcpu_create(), which (in summary) sets it to all features
supported by the cpu which are enabled on kernel config.

This means that xsave(s) will save to guest buffer all the fpu regs
contents the cpu has enabled when the guest is paused, even if they
are not used.

This would not be an issue, if xrstor(s) would also do that.

xrstor(s)'s mask for host/guest swap is basically every valid feature
contained in kernel config, except XFEATURE_MASK_PKRU.
Accordingto kernel src, it is instead switched in switch_to() and
flush_thread().

Then, the following happens with a host supporting PKRU starts a
guest that does not support it:
1 - Host has XFEATURE_MASK_PKRU set. 1st switch to guest,
2 - xsave(s) fpu regs to host fpustate (buffer has XFEATURE_MASK_PKRU)
3 - xrstor(s) guest fpustate to fpu regs (fpu regs have XFEATURE_MASK_PKRU)
4 - guest runs, then switch back to host,
5 - xsave(s) fpu regs to guest fpstate (buffer now have XFEATURE_MASK_PKRU)
6 - xrstor(s) host fpstate to fpu regs.
7 - kvm_vcpu_ioctl_x86_get_xsave() copy guest fpstate to userspace (with
    XFEATURE_MASK_PKRU, which should not be supported by guest vcpu)

On 5, even though the guest does not support PKRU, it does have the flag
set on guest fpstate, which is transferred to userspace via vcpu ioctl
KVM_GET_XSAVE.

This becomes a problem when the user decides on migrating the above guest
to another machine that does not support PKRU: the new host restores
guest's fpu regs to as they were before (xrstor(s)), but since the new
host don't support PKRU, a general-protection exception ocurs in xrstor(s)
and that crashes the guest.

This can be solved by making the guest's fpstate->user_xfeatures hold
a copy of guest_supported_xcr0. This way, on 7 the only flags copied to
userspace will be the ones compatible to guest requirements, and thus
there will be no issue during migration.

As a bonus, it will also fail if userspace tries to set fpu features
(with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
configuration.  Such features will never be returned by KVM_GET_XSAVE
or KVM_GET_XSAVE2.

Also, since kvm_vcpu_after_set_cpuid() now sets fpstate->user_xfeatures,
there is not need to set it in kvm_check_cpuid(). So, change
fpstate_realloc() so it does not touch fpstate->user_xfeatures if a
non-NULL guest_fpu is passed, which is the case when kvm_check_cpuid()
calls it.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Message-Id: <20220217053028.96432-2-leobras@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17 10:05:57 -05:00
Anton Romanov
3a55f72924 kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode
If vcpu has tsc_always_catchup set each request updates pvclock data.
KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on tsc read on
host's side and do hypercall inside pvclock_read_retry loop leading to
infinite loop in such situation.

v3:
    Removed warn
    Changed return code to KVM_EFAULT
v2:
    Added warn

Signed-off-by: Anton Romanov <romanton@google.com>
Message-Id: <20220216182653.506850-1-romanton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17 09:52:50 -05:00
Aaron Lewis
127770ac0d KVM: x86: Add KVM_CAP_ENABLE_CAP to x86
Follow the precedent set by other architectures that support the VCPU
ioctl, KVM_ENABLE_CAP, and advertise the VM extension, KVM_CAP_ENABLE_CAP.
This way, userspace can ensure that KVM_ENABLE_CAP is available on a
vcpu before using it.

Fixes: 5c919412fe ("kvm/x86: Hyper-V synthetic interrupt controller")
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Message-Id: <20220214212950.1776943-1-aaronlewis@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17 09:52:50 -05:00
Linus Torvalds
c5d9ae265b ARM:
* Read HW interrupt pending state from the HW
 
 x86:
 
 * Don't truncate the performance event mask on AMD
 
 * Fix Xen runstate updates to be atomic when preempting vCPU
 
 * Fix for AMD AVIC interrupt injection race
 
 * Several other AMD fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmIL4G4UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNkQQf/Z75dnmdRl8sHHnGjwH2IhWHwAg+h
 5O+mJphYt4cvVMexP5dj69b7mHtKMeg/0TxPvPfwCLlhzKkW1gQFwwBAq/YuBCKw
 cnMuVPeCSWo6znpS+jYUF4FAJgPKkzfFR9UwYAR5UexSWyOwU8rLcvSxj8vJjO/l
 sIke+f767Ks2KgcTMIudObg+vDcgnQXI8n8ztI7hF1WJKYHdTKFkYN7BYRxQ9BW6
 4fq51218DhRMv6S7so5dhYC473f+D0t8b5S/Mygur/x6mzsdQJKeOmi8aWGoDa/B
 Bmse+X0lHoOkdXaxqpBgQCYeyrXohNcXx7cpGRVFnS45Jf7MLG4OfVHWNQ==
 =kD2l
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Read HW interrupt pending state from the HW

  x86:

   - Don't truncate the performance event mask on AMD

   - Fix Xen runstate updates to be atomic when preempting vCPU

   - Fix for AMD AVIC interrupt injection race

   - Several other AMD fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
  KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
  KVM: SVM: fix race between interrupt delivery and AVIC inhibition
  KVM: SVM: set IRR in svm_deliver_interrupt
  KVM: SVM: extract avic_ring_doorbell
  selftests: kvm: Remove absent target file
  KVM: arm64: vgic: Read HW interrupt pending state from the HW
  KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU
  KVM: x86: SVM: move avic definitions from AMD's spec to svm.h
  KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it
  KVM: x86: nSVM: deal with L1 hypervisor that intercepts interrupts but lets L2 control them
  KVM: x86: nSVM: expose clean bit support to the guest
  KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM
  KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state
  KVM: x86: nSVM: fix potential NULL derefernce on nested migration
  KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case
  Revert "svm: Add warning message for AVIC IPI invalid target"
2022-02-15 11:07:59 -08:00
Jim Mattson
710c476514 KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW
AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of
a PerfEvtSeln MSR. Don't mask off the high nybble when configuring a
RAW perf event.

Fixes: ca724305a2 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220203014813.2130559-2-jmattson@google.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-14 07:44:51 -05:00
Jim Mattson
b8bfee85f1 KVM: x86/pmu: Don't truncate the PerfEvtSeln MSR when creating a perf event
AMD's event select is 3 nybbles, with the high nybble in bits 35:32 of
a PerfEvtSeln MSR. Don't drop the high nybble when setting up the
config field of a perf_event_attr structure for a call to
perf_event_create_kernel_counter().

Fixes: ca724305a2 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20220203014813.2130559-1-jmattson@google.com>
Reviewed-by: David Dunn <daviddunn@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-14 07:43:46 -05:00
Linus Torvalds
42964a18f8 - Fix a case where objtool would mistakenly warn about instructions being unreachable
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmII/BUACgkQEsHwGGHe
 VUr5VBAAhK4/Hqe5wgbbKE2CwUfvCd7KBbKQb0FbBguO9Sg9NkQhD0ROE5qQCsRE
 AMBJKuC4AahoVIsPuJ2nr73iaM5UY3xx6Dth6qPnA7VEwpWM/h/A2t0D9bgm5PdR
 1FK5QnmIP+o/astYTvP8YpIlhRqHK97fZM0KYZX8SfLG2sbnYiBANj7YiwafNLQs
 7FR/J+LH2PRAU1sBrdrhvd/u4jvSTjcbutPEETGuTTMoPiJj4TGa0XyZNQR4/br+
 WTnbzLFJpRabBMVE2ELzbzSXnEeUZpSKe79G9LW5AFaGa9UpyooXVUn4PcPNXTia
 Q8zkMKusNFUlQc0pbcUxaS/g+UnC7koFn/XvPxp5k9tL7x4exq9hhOn/F+K9Ctuw
 jUVrg63+/VFYtMwbZpYd81k5I/rx+o7t8rkmUrk0Wz/gpE9CDgbjfhGyAFqmXAFU
 mGAGcFHbBaG6J5/XGqOGZR42yajlg9lwxpaj+taCtfbf46/48E6mTtLG2qQRDAqW
 QlGVS8H9t+vmwfO8oAt2tWwLTyZqt+6VmNTbwerKSEblEC+yB/lLO5AcWsnNwHVl
 9ZwSRTPw8ejkj/AdyoTSedMJHzcsAXT8PtIr2rSjuX1b8SubN8GmbsApyQmzV3G0
 n5DLTcKvpCPjtWjtpF44yjP1rfdDLFBIh+RYHF7iNPFo0uhQFOg=
 =+cTD
 -----END PGP SIGNATURE-----

Merge tag 'objtool_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool fix from Borislav Petkov:
 "Fix a case where objtool would mistakenly warn about instructions
  being unreachable"

* tag 'objtool_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bug: Merge annotate_reachable() into _BUG_FLAGS() asm
2022-02-13 09:43:34 -08:00
Linus Torvalds
808f0ab221 - Prevent softlockups when tearing down large SGX enclaves
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmII99AACgkQEsHwGGHe
 VUp03RAAo2KE+kb5kbMRqF0eJjGCl34EGbcfkl2Kr+M1tbi7aG41VbnGuZIyHoOQ
 U6ZnN8v/ft5tQy7PfXrZW7iNKQgLrdEOfI6lUoyr5LGgJprXj/SSJvFMngfypuQi
 mPhkvRs+hTIe4ylQvbqQKgQxFIMgF90+XsdmBz0vaJDun8dkOr8ghS2bBPHf1y4o
 1sQ14SgDCU0hVtPhH5HQQIcanqmHXNbYreXuTToHRwgqy4CcoX5vaQQUGgjQK5SK
 ektmDxGiBI90jscL+ZoXg730dchXku04WY5tfoYJazUnIMKNSlpmTGQLBsQnlDf5
 Cxi94h91GYaLM0u44OICyfHNPUi+xx19o31dzBnNviSpKLKoK/lquFCNUXNqVn9E
 lwOhMysSYOuxgtnYqLXMUSQWMwY3rbISrUqPOR7vMYzg/b+LKPTc76I9B7C9/UYW
 6lKCchDicAAv/rLh1+0JOOKWTaz1F8dVasxRCRGYreL8ZxT14jsB41sn4xxSRXRZ
 d/iEobh/LFL1c37ju0sWdHj5fSK9c4pPIM560o2ftBwGypvryVdBFDpbe3B1W/AD
 IJXRsVW3LU7BbnGEXYcobcX5vXeBKULtcTliS9VTQjIQVZkcqPp7t8GSJOMf9qos
 k889Fi9NfQktaUQDQztTujmwuQrP7JPaejrmvQU0xwam88ZkvwM=
 =yZ3t
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Borislav Petkov:
 "Prevent softlockups when tearing down large SGX enclaves"

* tag 'x86_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Silence softlockup detection when releasing large enclaves
2022-02-13 09:22:52 -08:00
Linus Torvalds
4a387c98b3 xen: branch for v5.17-rc4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYgd+vgAKCRCAXGG7T9hj
 vrMKAQCGIOlp3iLisC9ZbzZF2SkeEPW602QF0LC3hexPKgtD/wD/dPeU33MtzkIC
 d53GcdcDUBv4ByYKz6/tGPiZhzQSEwI=
 =nm20
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.17a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:

 - Two small cleanups

 - Another fix for addressing the EFI framebuffer above 4GB when running
   as Xen dom0

 - A patch to let Xen guests use reserved bits in MSI- and IO-APIC-
   registers for extended APIC-IDs the same way KVM guests are doing it
   already

* tag 'for-linus-5.17a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/pci: Make use of the helper macro LIST_HEAD()
  xen/x2apic: Fix inconsistent indenting
  xen/x86: detect support for extended destination ID
  xen/x86: obtain full video frame buffer address for Dom0 also under EFI
2022-02-12 09:08:57 -08:00
Maxim Levitsky
66fa226c13 KVM: SVM: fix race between interrupt delivery and AVIC inhibition
If svm_deliver_avic_intr is called just after the target vcpu's AVIC got
inhibited, it might read a stale value of vcpu->arch.apicv_active
which can lead to the target vCPU not noticing the interrupt.

To fix this use load-acquire/store-release so that, if the target vCPU
is IN_GUEST_MODE, we're guaranteed to see a previous disabling of the
AVIC.  If AVIC has been disabled in the meanwhile, proceed with the
KVM_REQ_EVENT-based delivery.

Incomplete IPI vmexit has the same races as svm_deliver_avic_intr, and
in fact it can be handled in exactly the same way; the only difference
lies in who has set IRR, whether svm_deliver_interrupt or the processor.
Therefore, svm_complete_interrupt_delivery can be used to fix incomplete
IPI vmexits as well.

Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-11 12:53:02 -05:00
Paolo Bonzini
30811174f0 KVM: SVM: set IRR in svm_deliver_interrupt
SVM has to set IRR for both the AVIC and the software-LAPIC case,
so pull it up to the common function that handles both configurations.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-11 12:53:02 -05:00
Maxim Levitsky
0a5f784273 KVM: SVM: extract avic_ring_doorbell
The check on the current CPU adds an extra level of indentation to
svm_deliver_avic_intr and conflates documentation on what happens
if the vCPU exits (of interest to svm_deliver_avic_intr) and migrates
(only of interest to avic_ring_doorbell, which calls get/put_cpu()).
Extract the wrmsr to a separate function and rewrite the
comment in svm_deliver_avic_intr().

Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-11 12:53:02 -05:00
Rafael J. Wysocki
27a98fe60b Merge branch 'acpi-x86'
Merge a revert of a problematic commit for 5.17-rc4.

* acpi-x86:
  x86/PCI: revert "Ignore E820 reservations for bridge windows on newer systems"
2022-02-11 17:32:20 +01:00
Reinette Chatre
8795359e35 x86/sgx: Silence softlockup detection when releasing large enclaves
Vijay reported that the "unclobbered_vdso_oversubscribed" selftest
triggers the softlockup detector.

Actual SGX systems have 128GB of enclave memory or more.  The
"unclobbered_vdso_oversubscribed" selftest creates one enclave which
consumes all of the enclave memory on the system. Tearing down such a
large enclave takes around a minute, most of it in the loop where
the EREMOVE instruction is applied to each individual 4k enclave page.

Spending one minute in a loop triggers the softlockup detector.

Add a cond_resched() to give other tasks a chance to run and placate
the softlockup detector.

Cc: stable@vger.kernel.org
Fixes: 1728ab54b4 ("x86/sgx: Add a page reclaimer")
Reported-by: Vijay Dhanraj <vijay.dhanraj@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>  (kselftest as sanity check)
Link: https://lkml.kernel.org/r/ced01cac1e75f900251b0a4ae1150aa8ebd295ec.1644345232.git.reinette.chatre@intel.com
2022-02-10 15:58:14 -08:00
David Woodhouse
fcb732d8f8 KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU
There are circumstances whem kvm_xen_update_runstate_guest() should not
sleep because it ends up being called from __schedule() when the vCPU
is preempted:

[  222.830825]  kvm_xen_update_runstate_guest+0x24/0x100
[  222.830878]  kvm_arch_vcpu_put+0x14c/0x200
[  222.830920]  kvm_sched_out+0x30/0x40
[  222.830960]  __schedule+0x55c/0x9f0

To handle this, make it use the same trick as __kvm_xen_has_interrupt(),
of using the hva from the gfn_to_hva_cache directly. Then it can use
pagefault_disable() around the accesses and just bail out if the page
is absent (which is unlikely).

I almost switched to using a gfn_to_pfn_cache here and bailing out if
kvm_map_gfn() fails, like kvm_steal_time_set_preempted() does — but on
closer inspection it looks like kvm_map_gfn() will *always* fail in
atomic context for a page in IOMEM, which means it will silently fail
to make the update every single time for such guests, AFAICT. So I
didn't do it that way after all. And will probably fix that one too.

Cc: stable@vger.kernel.org
Fixes: 30b5c851af ("KVM: x86/xen: Add support for vCPU runstate information")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <b17a93e5ff4561e57b1238e3e7ccd0b613eb827e.camel@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10 13:39:06 -05:00
Jiapeng Chong
afea27dc31 xen/x2apic: Fix inconsistent indenting
Eliminate the follow smatch warning:

arch/x86/xen/enlighten_hvm.c:189 xen_cpu_dead_hvm() warn: inconsistent
indenting.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20220207103506.102008-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2022-02-10 11:10:20 +01:00
Roger Pau Monne
e07e98da92 xen/x86: detect support for extended destination ID
Xen allows the usage of some previously reserved bits in the IO-APIC
RTE and the MSI address fields in order to store high bits for the
target APIC ID. Such feature is already implemented by QEMU/KVM and
HyperV, so in order to enable it just add the handler that checks for
it's presence.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20220120152527.7524-3-roger.pau@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2022-02-10 11:10:17 +01:00
Jan Beulich
f34c4f2dd2 xen/x86: obtain full video frame buffer address for Dom0 also under EFI
The initial change would not work when Xen was booted from EFI: There is
an early exit from the case block in that case. Move the necessary code
ahead of that.

Fixes: 335e4dd67b ("xen/x86: obtain upper 32 bits of video frame buffer address for Dom0")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

Link: https://lore.kernel.org/r/2501ce9d-40e5-b49d-b0e5-435544d17d4a@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2022-02-10 11:07:23 +01:00