Commit graph

43601 commits

Author SHA1 Message Date
Michal Luczaj
e73ba25fdc KVM: x86: Simplify msr_io()
As of commit bccf2150fe ("KVM: Per-vcpu inodes"), __msr_io() doesn't
return a negative value. Remove unnecessary checks.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-7-mhal@rbox.co
[sean: call out commit which left behind the unnecessary check]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03 15:55:17 -08:00
Michal Luczaj
4559e6cf45 KVM: x86: Remove unnecessary initialization in kvm_vm_ioctl_set_msr_filter()
Do not initialize the value of `r`, as it will be overwritten.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-6-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03 15:30:45 -08:00
Michal Luczaj
1fdefb8bd8 KVM: x86: Explicitly state lockdep condition of msr_filter update
Replace `1` with the actual mutex_is_locked() check.
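
A minimal sketch of the resulting call (field and lock names assumed from the rest of this series):

  old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter,
                                   mutex_is_locked(&kvm->lock));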

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-5-mhal@rbox.co
[sean: delete the comment that explained the hardcoded '1']
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03 15:30:39 -08:00
Michal Luczaj
4d85cfcaa8 KVM: x86: Simplify msr_filter update
Replace srcu_dereference()+rcu_assign_pointer() sequence with
a single rcu_replace_pointer().
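
A sketch of the change (field names assumed from the rest of this series):

  /* before: two steps */
  old_filter = srcu_dereference(kvm->arch.msr_filter, &kvm->srcu);
  rcu_assign_pointer(kvm->arch.msr_filter, new_filter);

  /* after: one step */
  old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter, 1);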

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-4-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03 15:30:32 -08:00
Michal Luczaj
708f799d22 KVM: x86: Optimize kvm->lock and SRCU interaction (KVM_X86_SET_MSR_FILTER)
Reduce time spent holding kvm->lock: unlock mutex before calling
synchronize_srcu().  There is no need to hold kvm->lock until all vCPUs
have been kicked; KVM only needs to guarantee that all vCPUs will switch
to the new filter before exiting to userspace.
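
A sketch of the resulting flow (final shape after this series; names assumed):

  mutex_lock(&kvm->lock);
  old_filter = rcu_replace_pointer(kvm->arch.msr_filter, new_filter,
                                   mutex_is_locked(&kvm->lock));
  mutex_unlock(&kvm->lock);

  synchronize_srcu(&kvm->srcu);           /* now outside kvm->lock */
  kvm_make_all_cpus_request(kvm, KVM_REQ_MSR_FILTER_CHANGED);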

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-3-mhal@rbox.co
[sean: expand changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03 15:30:17 -08:00
Michal Luczaj
95744a90db KVM: x86: Optimize kvm->lock and SRCU interaction (KVM_SET_PMU_EVENT_FILTER)
Reduce time spent holding kvm->lock: unlock mutex before calling
synchronize_srcu_expedited().  There is no need to hold kvm->lock until
all vCPUs have been kicked; KVM only needs to guarantee that all vCPUs
will switch to the new filter before exiting to userspace.  Protecting
the write to __reprogram_pmi is also unnecessary as a vCPU may process
a set bit before receiving the final KVM_REQ_PMU, but the per-vCPU writes
are guaranteed to occur after all vCPUs have switched to the new filter.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230107001256.2365304-2-mhal@rbox.co
[sean: expand changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03 15:19:22 -08:00
Michal Luczaj
096691e0d2 KVM: x86/emulator: Fix comment in __load_segment_descriptor()
The comment refers to the same condition twice. Make it reflect what the
code actually does.

No functional change intended.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230126013405.2967156-3-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03 13:34:02 -08:00
Michal Luczaj
0735d1c34e KVM: x86/emulator: Fix segment load privilege level validation
Intel SDM describes what steps are taken by the CPU to verify if a
memory segment can actually be used at a given privilege level. Loading
DS/ES/FS/GS involves checking segment's type as well as making sure that
neither selector's RPL nor caller's CPL are greater than segment's DPL.

Emulator implements Intel's pseudocode in __load_segment_descriptor(),
even quoting the pseudocode in the comments. Although the pseudocode is
correctly translated, the implementation is incorrect. This is most
likely because the SDM, at the time, was wrong.

This patch fixes the emulator's logic and updates the pseudocode in the comment.
Below are historical notes.

Emulator code for handling segment descriptors appears to have been
introduced in March 2010 in commit 38ba30ba51 ("KVM: x86 emulator:
Emulate task switch in emulator.c"). Intel SDM Vol 2A: Instruction Set
Reference, A-M (Order Number: 253666-034US, _March 2010_) lists the
steps for loading segment registers in section related to MOV
instruction:

  IF DS, ES, FS, or GS is loaded with non-NULL selector
  THEN
    IF segment selector index is outside descriptor table limits
    or segment is not a data or readable code segment
    or ((segment is a data or nonconforming code segment)
    and (both RPL and CPL > DPL))   <---
      THEN #GP(selector); FI;

This is precisely what __load_segment_descriptor() quotes and
implements. But there's a twist; a few SDM revisions later
(253667-044US), in August 2012, the snippet above becomes:

  IF DS, ES, FS, or GS is loaded with non-NULL selector
  THEN
    IF segment selector index is outside descriptor table limits
    or segment is not a data or readable code segment
    or ((segment is a data or nonconforming code segment)
      [note: missing or superfluous parenthesis?]
    or ((RPL > DPL) and (CPL > DPL))   <---
      THEN #GP(selector); FI;

Many SDMs later (253667-065US), in December 2017, pseudocode reaches
what seems to be its final form:

  IF DS, ES, FS, or GS is loaded with non-NULL selector
  THEN
    IF segment selector index is outside descriptor table limits
    OR segment is not a data or readable code segment
    OR ((segment is a data or nonconforming code segment)
        AND ((RPL > DPL) or (CPL > DPL)))   <---
      THEN #GP(selector); FI;

which also matches the behavior described in AMD's APM, which states that
a #GP occurs if:

  The DS, ES, FS, or GS register was loaded and the segment pointed to
  was a data or non-conforming code segment, but the RPL or CPL was
  greater than the DPL.
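
In C terms, the fix boils down to (sketch; local variable names illustrative):

  /* old, per the 2010 SDM: #GP only if BOTH RPL and CPL exceed DPL */
  if (rpl > dpl && cpl > dpl)
          goto exception;

  /* new, per the current SDM and AMD APM: #GP if EITHER exceeds DPL */
  if (rpl > dpl || cpl > dpl)
          goto exception;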

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230126013405.2967156-2-mhal@rbox.co
[sean: add blurb to changelog calling out AMD agrees]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-03 13:31:59 -08:00
Ard Biesheuvel
234fa51db9 efi: Drop minimum EFI version check at boot
We currently pass a minimum major version to the generic EFI helper that
checks the system table magic and version, and refuse to boot if the
value is lower.

The motivation for this check is unknown, and even the code that uses
major version 2 as the minimum (ARM, arm64 and RISC-V) should make it
past this check without problems, and boot to a point where we have
access to a console or some other means to inform the user that the
firmware's major revision number made us unhappy. (Revision 2.0 of the
UEFI specification was released in January 2006, whereas ARM, arm64 and
RISC-V support was added in 2009, 2013 and 2017, respectively, so
checking for major version 2 or higher is completely arbitrary)

So just drop the check.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-02-03 18:01:07 +01:00
Song Liu
0c05e7bd2d livepatch,x86: Clear relocation targets on a module removal
Josh reported a bug:

  When the object to be patched is a module, and that module is
  rmmod'ed and reloaded, it fails to load with:

  module: x86/modules: Skipping invalid relocation target, existing value is nonzero for type 2, loc 00000000ba0302e9, val ffffffffa03e293c
  livepatch: failed to initialize patch 'livepatch_nfsd' for module 'nfsd' (-8)
  livepatch: patch 'livepatch_nfsd' failed for module 'nfsd', refusing to load module 'nfsd'

  The livepatch module has a relocation which references a symbol
  in the _previous_ loading of nfsd. When apply_relocate_add()
  tries to replace the old relocation with a new one, it sees that
  the previous one is nonzero and it errors out.

He also proposed three different solutions. We could remove the error
check in apply_relocate_add() introduced by commit eda9cec4c9
("x86/module: Detect and skip invalid relocations"). However, the check
is useful for detecting corrupted modules.

We could also prevent patched modules from being removed; if that proved
to be a major drawback for users, we could still implement a different
approach later. However, that solution would also complicate the
existing code a lot.

We thus decided to reverse the relocation patching (clear all relocation
targets on x86_64). The solution is not universal and is quite
arch-specific, but it may prove to be simpler in the end.

Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
Originally-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Joe Lawrence <joe.lawrence@redhat.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230125185401.279042-2-song@kernel.org
2023-02-03 11:28:22 +01:00
Song Liu
bbb93362a4 x86/module: remove unused code in __apply_relocate_add
This "#if 0" block has been untouched for many years. Remove it to clean
up the code.

Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20230125185401.279042-1-song@kernel.org
2023-02-03 11:27:23 +01:00
Ricardo Ribalda
828dfc0f7b scripts/spelling.txt: add `permitted'
Patch series "spelling: Fix some trivial typos".

Seems like permitted has two t's :). Let's add that to spelling.txt to
help others.


This patch (of 3):

Add another common typo, noticed when I sent a patch containing it; the
same misspelling also appears in kvm and of.

[ribalda@chromium.org: fix trivial typo]
  Link: https://lkml.kernel.org/r/20221220-permited-v1-2-52ea9857fa61@chromium.org
Link: https://lkml.kernel.org/r/20221220-permited-v1-1-52ea9857fa61@chromium.org
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:50:01 -08:00
Matthew Wilcox (Oracle)
6bc56a4d85 mm: add vma_alloc_zeroed_movable_folio()
Replace alloc_zeroed_user_highpage_movable().  The main difference is
returning a folio containing a single page instead of returning the page,
but take the opportunity to rename the function to match other allocation
functions a little better and rewrite the documentation to place more
emphasis on the zeroing rather than the highmem aspect.
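
A usage sketch of the new helper (signature as described in this changelog):

  struct folio *folio = vma_alloc_zeroed_movable_folio(vma, vaddr);

  if (!folio)
          return VM_FAULT_OOM;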

Link: https://lkml.kernel.org/r/20230116191813.2145215-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:18 -08:00
David Hildenbrand
950fe885a8 mm: remove __HAVE_ARCH_PTE_SWP_EXCLUSIVE
__HAVE_ARCH_PTE_SWP_EXCLUSIVE is now supported by all architectures that
support swp PTEs, so let's drop it.

Link: https://lkml.kernel.org/r/20230113171026.582290-27-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:11 -08:00
David Hildenbrand
93c0eac40d x86/mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE also on 32bit
Let's support __HAVE_ARCH_PTE_SWP_EXCLUSIVE just like we already do on
x86-64.  After deciphering the PTE layout it becomes clear that there are
still unused bits for 2-level and 3-level page tables that we should be
able to use.  Reusing a bit avoids stealing one bit from the swap offset.

While at it, mask the type in __swp_entry(); use some helper definitions
to make the macros easier to grasp.

Link: https://lkml.kernel.org/r/20230113171026.582290-25-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:10 -08:00
Alexander Potapenko
1e15d374bb Revert "x86: kmsan: sync metadata pages on page fault"
This reverts commit 3f1e2c7a90.

As noticed by Qun-Wei Lin, arch_sync_kernel_mappings() in
arch/x86/mm/fault.c is only used with CONFIG_X86_32, whereas KMSAN is only
supported on x86_64, where this code is not compiled.

The patch in question dates back to the downstream KMSAN branch based on
v5.8-rc5; it sneaked into upstream unnoticed in v6.1.

Link: https://lkml.kernel.org/r/20230111101806.3236991-1-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Qun-Wei Lin <qun-wei.lin@mediatek.com>
  Link: https://github.com/google/kmsan/issues/91
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-02 22:33:01 -08:00
Peter Lafreniere
8a1955f958 crypto: x86 - exit fpu context earlier in ECB/CBC macros
Currently the ecb/cbc macros hold fpu context unnecessarily when using
scalar cipher routines (e.g. when handling odd sizes of blocks per walk).

Change the macros to drop fpu context as soon as the fpu is out of use.
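
A sketch of the reordering (the SIMD and scalar helper names are hypothetical):

  kernel_fpu_begin();
  n = nbytes & ~(bsize - 1);               /* full blocks only */
  simd_encrypt_blocks(ctx, dst, src, n);   /* hypothetical SIMD helper */
  kernel_fpu_end();                        /* drop the FPU here... */

  if (nbytes - n)                          /* ...before the scalar tail */
          scalar_encrypt_block(ctx, dst + n, src + n);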

No performance impact found (on Intel Haswell).

Signed-off-by: Peter Lafreniere <peter@n8pjl.ca>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-02-03 12:54:54 +08:00
Kirill A. Shutemov
1e70c68037 x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall()
If compiled with CONFIG_FRAME_POINTER=y, objtool is not happy that
__tdx_hypercall() messes up RBP:

  objtool: __tdx_hypercall+0x7f: return with modified stack frame

Rework the function to store TDX_HCALL_ flags on stack instead of RBP.

[ dhansen: minor changelog tweaks ]

Fixes: c30c4b2555 ("x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/202301290255.buUBs99R-lkp@intel.com
Link: https://lore.kernel.org/all/20230130135354.27674-1-kirill.shutemov%40linux.intel.com
2023-02-02 16:31:25 -08:00
Jakub Kicinski
82b4a9412b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/core/gro.c
  7d2c89b325 ("skb: Do mix page pool and page referenced frags in GRO")
  b1a78b9b98 ("net: add support for ipv4 big tcp")
https://lore.kernel.org/all/20230203094454.5766f160@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-02 14:49:55 -08:00
Paul E. McKenney
efc8b329c7 clocksource: Verify HPET and PMTMR when TSC unverified
On systems with two or fewer sockets, when the boot CPU has CONSTANT_TSC,
NONSTOP_TSC, and TSC_ADJUST, clocksource watchdog verification of the
TSC is disabled.  This works well much of the time, but there is the
occasional production-level system that meets all of these criteria, but
which still has a TSC that skews significantly from atomic-clock time.
This is usually attributed to a firmware or hardware fault.  Yes, the
various NTP daemons do express their opinions of userspace-to-atomic-clock
time skew, but they put them in various places, depending on the daemon
and distro in question.  It would therefore be good for the kernel to
have some clue that there is a problem.

The old behavior of marking the TSC unstable is a non-starter because a
great many workloads simply cannot tolerate the overheads and latencies
of the various non-TSC clocksources.  In addition, NTP-corrected systems
sometimes can tolerate significant kernel-space time skew as long as
the userspace time sources are within epsilon of atomic-clock time.

Therefore, when watchdog verification of TSC is disabled, enable it for
HPET and PMTMR (AKA ACPI PM timer).  This provides the needed in-kernel
time-skew diagnostic without degrading the system's performance.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Waiman Long <longman@redhat.com>
Cc: <x86@kernel.org>
Tested-by: Feng Tang <feng.tang@intel.com>
2023-02-02 14:23:02 -08:00
Feng Tang
a7ec817d55 x86/tsc: Add option to force frequency recalibration with HW timer
The kernel assumes that the TSC frequency which is provided by the
hardware / firmware via MSRs or CPUID(0x15) is correct after applying
a few basic consistency checks. This disables the TSC recalibration
against HPET or PM timer.

As a result there is no mechanism to validate that frequency in cases
where a firmware or hardware defect is suspected. And there was a case
where a user measured the TSC frequency against an atomic clock and
reported an inaccuracy issue, which was later fixed in firmware.

Add a 'recalibrate' option for the 'tsc' kernel parameter to force TSC
frequency recalibration with the HPET or PM timer, and warn if the
deviation from the previous value is more than about 500 PPM. This
provides a way to verify the data from hardware / firmware.
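
For example (usage sketch, option name per this changelog), boot with:

  tsc=recalibrate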

There is no functional change to the existing workflow.

Recently there was a real-world case: "The 40ms/s divergence between
TSC and HPET was observed on hardware that is quite recent" [1]. On
that platform the TSC frequency of 1896 MHz was obtained from
CPUID(0x15), and forced recalibration with both HPET and PMTIMER
calibrated a value of 1975 MHz, which also matched the measurement from
the 'chronyd' software, indicating a BIOS or firmware problem.

[Thanks tglx for helping improving the commit log]
[ paulmck: Wordsmith Kconfig help text. ]

[1]. https://lore.kernel.org/lkml/20221117230910.GI4001@paulmck-ThinkPad-P17-Gen-1/
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: <x86@kernel.org>
Cc: <linux-doc@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-02-02 14:22:52 -08:00
Like Xu
13738a3647 perf/x86/intel: Expose EPT-friendly PEBS for SPR and future models
According to Intel SDM, the EPT-friendly PEBS is supported by all the
platforms after ICX, ADL and the future platforms with PEBS format 5.

Currently the only in-kernel user of this capability is KVM, which has
very limited support for hybrid core pmu, so ADL and its successors do
not currently expose this capability. When both hybrid core and PEBS
format 5 are present, KVM will decide on its own merits.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-perf-users@vger.kernel.org
Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221109082802.27543-4-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01 16:42:36 -08:00
Like Xu
974850be01 KVM: x86/pmu: Add PDIR++ and PDist support for SPR and later models
The pebs capability on the SPR is basically the same as Ice Lake Server
with the exception of two special facilities that have been enhanced and
require special handling.

Upon triggering a PEBS assist, there will be a finite delay between the
time the counter overflows and when the microcode starts to carry out
its data collection obligations. Even if the delay is constant in core
clock space, it invariably manifests as variable "skids" in instruction
address space.

On the Ice Lake Server, the Precise Distribution of Instructions Retire
(PDIR) facility mitigates the "skid" problem by providing an early
indication of when the counter is about to overflow. On SPR, the PDIR
counter available (Fixed 0) is unchanged, but the capability is enhanced
to Instruction-Accurate PDIR (PDIR++), where PEBS is taken on the
next instruction after the one that caused the overflow.

SPR also introduces a new Precise Distribution (PDist) facility only on
general programmable counter 0. Per Intel SDM, PDist eliminates any
skid or shadowing effects from PEBS. With PDist, the PEBS record will
be generated precisely upon completion of the instruction or operation
that causes the counter to overflow (there is no "wait for next occurrence"
by default).

In terms of KVM handling, when the guest accesses those special
counters, KVM needs to request counters with the same index via the
perf_event kernel subsystem to ensure that the guest uses the correct
PEBS hardware counter (PDIR++ or PDist). This is mainly achieved by
raising the event's precise level to the maximum; the semantics of this
magic number are defined by the internal software context of perf_event,
and it is also backwards compatible as part of the user space interface.

Opportunistically, refine confusing comments on TNT+, as the only
ones that currently support pebs_ept are Ice Lake server and SPR (GLC+).

Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20221109082802.27543-3-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01 16:42:36 -08:00
Emanuele Giuseppe Esposito
052c3b99cb KVM: x86: Reinitialize xAPIC ID when userspace forces x2APIC => xAPIC
Reinitialize the xAPIC ID to the vCPU ID when userspace forces the APIC
to transition directly from x2APIC to xAPIC mode, e.g. to emulate RESET.
KVM already stuffs the xAPIC ID when the APIC is transitioned from
DISABLED to xAPIC (commit 49bd29ba1d ("KVM: x86: reset APIC ID when
enabling LAPIC")), i.e. userspace is conditioned to expect KVM to update
the xAPIC ID, but KVM doesn't handle the architecturally-impossible case
where userspace forces x2APIC=>xAPIC via KVM_SET_MSRS.

On its own, the "bug" is benign, as userspace emulation of RESET will also
stuff APIC registers via KVM_SET_LAPIC, i.e. will manually set the xAPIC
ID.  However, commit 3743c2f025 ("KVM: x86: inhibit APICv/AVIC on
changes to APIC ID or APIC base") introduced a bug, fixed by commit
ef40757743 ("KVM: x86: fix APICv/x2AVIC disabled when vm reboot
by itself"), that caused KVM to fail to properly update the xAPIC ID when
handling KVM_SET_LAPIC.  Refresh the xAPIC ID even though it's not
strictly necessary so that KVM provides consistent behavior.

Note, KVM follows Intel architecture with regard to handling the xAPIC ID
and x2APIC IDs across mode transitions.  For the APIC DISABLED case
(commit 49bd29ba1d), Intel's SDM says the xAPIC ID _may_ be
reinitialized

    10.4.3 Enabling or Disabling the Local APIC

    When IA32_APIC_BASE[11] is set to 0, prior initialization to the APIC
    may be lost and the APIC may return to the state described in Section
    10.4.7.1, “Local APIC State After Power-Up or Reset.”

    10.4.7.1 Local APIC State After Power-Up or Reset

    ... The local APIC ID register is set to a unique APIC ID. ...

i.e. KVM's behavior is legal as per Intel's architecture.  In practice,
Intel's behavior is N/A as modern Intel CPUs (since at least Haswell) make
the xAPIC ID fully read-only.

And for xAPIC => x2APIC transitions (commit 257b9a5faa ("KVM: x86: use
correct APIC ID on x2APIC transition")), Intel's SDM says:

  Any APIC ID value written to the memory-mapped local APIC ID register
  is not preserved.

AMD's APM says nothing (that I could find) about the xAPIC ID when the
APIC is DISABLED, but testing on bare metal (Rome) shows that the xAPIC ID
is preserved when the APIC is DISABLED and re-enabled in xAPIC mode.  AMD
also preserves the xAPIC ID when the APIC is transitioned from xAPIC to
x2APIC, i.e. allows a backdoor write of the x2APIC ID, which is again not
emulated by KVM.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Link: https://lore.kernel.org/all/20230109130605.2013555-2-eesposit@redhat.com
[sean: rewrite changelog, set xAPIC ID iff APIC is enabled]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01 15:37:13 -08:00
Vipin Sharma
db9cf24cea KVM: x86: hyper-v: Add extended hypercall support in Hyper-v
Add support for extended hypercalls in Hyper-V. The Hyper-V TLFS 6.0b
describes hypercalls above call code 0x8000 as extended hypercalls.

A guest VM on a Hyper-V hypervisor detects the availability of extended
hypercalls via CPUID.0x40000003.EBX BIT(20). If the bit is set, then the
guest can call extended hypercalls.
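
A guest-side availability check might look like (sketch; leaf and bit per the TLFS as quoted above):

  u32 eax, ebx, ecx, edx;

  cpuid(0x40000003, &eax, &ebx, &ecx, &edx);
  ext_hypercalls_available = !!(ebx & BIT(20));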

All extended hypercalls will exit to userspace by default. This allows
for easy support of future hypercalls without being dependent on KVM
releases.

If there is ever a need to process a hypercall in KVM instead of
userspace, then KVM can create a capability which userspace can query to
know which hypercalls can be handled by KVM and enable handling
of those hypercalls.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20221212183720.4062037-10-vipinsh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01 13:46:00 -08:00
Vipin Sharma
1a9df3262a KVM: x86: hyper-v: Use common code for hypercall userspace exit
Remove duplicate code to exit to userspace for hyper-v hypercalls and
use a common place to exit.

No functional change intended.

Signed-off-by: Vipin Sharma <vipinsh@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20221212183720.4062037-9-vipinsh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-02-01 13:41:44 -08:00
Maxim Levitsky
32e69f232d KVM: x86: Use emulator callbacks instead of duplicating "host flags"
Instead of re-defining the "host flags" bits, just expose dedicated
helpers for each of the two remaining flags that are consumed by the
emulator.  The emulator never consumes both "is guest" and "is SMM" in
close proximity, so there is no motivation to avoid additional indirect
branches.

Also while at it, garbage collect the recently removed host flags.

No functional change is intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-6-mlevitsk@redhat.com
[sean: fix CONFIG_KVM_SMM=n builds, tweak names of wrappers]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31 17:29:09 -08:00
Maxim Levitsky
916b54a768 KVM: x86: Move HF_NMI_MASK and HF_IRET_MASK into "struct vcpu_svm"
Move HF_NMI_MASK and HF_IRET_MASK (a.k.a. "waiting for IRET") out of the
common "hflags" and into dedicated flags in "struct vcpu_svm".  The flags
are used only by SVM and thus should not be in hflags.

Tracking NMI masking in software isn't SVM specific, e.g. VMX has a
similar flag (soft_vnmi_blocked), but that's much more of a hack as VMX
can't intercept IRET.  That flag is useful only for ancient CPUs, i.e.
will hopefully be removed at some point, and again the exact behavior is
vendor specific and shouldn't ever be referenced in common code.

No functional change is intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-5-mlevitsk@redhat.com
[sean: split from HF_GIF_MASK patch]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31 12:56:42 -08:00
Maxim Levitsky
c760e86f27 KVM: x86: Move HF_GIF_MASK into "struct vcpu_svm" as "guest_gif"
Move HF_GIF_MASK out of the common "hflags" and into vcpu_svm.guest_gif.
GIF is an SVM-only concept and should never be consulted outside of
SVM-specific code.

No functional change is intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-5-mlevitsk@redhat.com
[sean: split to separate patch]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31 12:56:42 -08:00
Maxim Levitsky
8957cbcfed KVM: nSVM: Don't sync tlb_ctl back to vmcb12 on nested VM-Exit
Don't sync the TLB control field from vmcb02 to vmcb12 on nested VM-Exit.
Per AMD's APM, the field is not modified by hardware:

  The VMRUN instruction reads, but does not change, the value of the
  TLB_CONTROL field

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <Santosh.Shukla@amd.com>
Link: https://lore.kernel.org/r/20221129193717.513824-2-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-31 12:56:26 -08:00
Alexey Kardashevskiy
7914695743 x86/amd: Cache debug register values in percpu variables
Reading DR[0-3]_ADDR_MASK MSRs takes about 250 cycles, which is going to
be noticeable with the AMD KVM SEV-ES DebugSwap feature enabled.  KVM is
going to store the host's DR[0-3] and DR[0-3]_ADDR_MASK before switching to
a guest; the hardware is going to swap these on VMRUN and VMEXIT.

Store MSR values passed to set_dr_addr_mask() in percpu variables
(when changed) and return them via the new amd_get_dr_addr_mask().
The gain here is about 10x.
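
A sketch of the caching scheme (per-CPU storage shape and the MSR lookup helper are illustrative):

  static DEFINE_PER_CPU(unsigned long, dr_addr_mask[4]);

  void amd_set_dr_addr_mask(unsigned long mask, unsigned int dr)
  {
          if (__this_cpu_read(dr_addr_mask[dr]) == mask)
                  return;                          /* skip the slow WRMSR */

          wrmsrl(dr_addr_mask_to_msr(dr), mask);   /* hypothetical lookup */
          __this_cpu_write(dr_addr_mask[dr], mask);
  }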

As set_dr_addr_mask() uses the array too, change the @dr type to
unsigned to avoid checking for <0. And give it the amd_ prefix to match
the new helper as the whole DR_ADDR_MASK feature is AMD-specific anyway.

While at it, replace deprecated boot_cpu_has() with cpu_feature_enabled()
in set_dr_addr_mask().

Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230120031047.628097-2-aik@amd.com
2023-01-31 20:09:26 +01:00
Ashok Raj
25d0dc4b95 x86/microcode: Allow only "1" as a late reload trigger value
Microcode gets reloaded late only if "1" is written to the reload file.
However, the code silently treats any other unsigned integer as a
successful write even though no actions are performed to load microcode.

Make the loader more strict to accept only "1" as a trigger value and
return an error otherwise.
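
A sketch of the strict parse (exact error handling illustrative):

  unsigned long val;

  if (kstrtoul(buf, 0, &val) || val != 1)
          return -EINVAL;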

  [ bp: Massage commit message. ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230130213955.6046-3-ashok.raj@intel.com
2023-01-31 16:47:03 +01:00
Peter Zijlstra
923510c88d x86/static_call: Add support for Jcc tail-calls
Clang likes to create conditional tail calls like:

  0000000000000350 <amd_pmu_add_event>:
  350:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1) 351: R_X86_64_NONE      __fentry__-0x4
  355:       48 83 bf 20 01 00 00 00         cmpq   $0x0,0x120(%rdi)
  35d:       0f 85 00 00 00 00       jne    363 <amd_pmu_add_event+0x13>     35f: R_X86_64_PLT32     __SCT__amd_pmu_branch_add-0x4
  363:       e9 00 00 00 00          jmp    368 <amd_pmu_add_event+0x18>     364: R_X86_64_PLT32     __x86_return_thunk-0x4

Where 0x35d is a static call site that's turned into a conditional
tail-call using the Jcc class of instructions.

Teach the in-line static call text patching about this.

Notably, since there is no conditional-ret, in that case patch the Jcc
to point at an empty stub function that does the ret -- or the return
thunk when needed.

Reported-by: "Erhard F." <erhard_f@mailbox.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/Y9Kdg9QjHkr9G5b5@hirez.programming.kicks-ass.net
2023-01-31 15:05:31 +01:00
Peter Zijlstra
ac0ee0a956 x86/alternatives: Teach text_poke_bp() to patch Jcc.d32 instructions
In order to re-write Jcc.d32 instructions, text_poke_bp() needs to be
taught about them.

The biggest hurdle is that the whole machinery is currently made for 5
byte instructions and extending this would grow struct text_poke_loc
which is currently a nice 16 bytes and used in an array.

However, since text_poke_loc contains a full copy of the (s32)
displacement, it is possible to map the Jcc.d32 2 byte opcodes to
Jcc.d8 1 byte opcode for the int3 emulation.

This then leaves the replacement bytes; fudge that by only storing the
last 5 bytes and adding the rule that a 'length == 6' instruction will
be prefixed with a 0x0f byte.
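
The opcode mapping itself is simple (sketch): Jcc.d8 is encoded as 0x70+cc and Jcc.d32 as 0x0f 0x80+cc, so:

  /* assumes insn[0] == 0x0f, i.e. a Jcc.d32 instruction */
  u8 jcc_d8 = 0x70 | (insn[1] & 0x0f);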

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20230123210607.115718513@infradead.org
2023-01-31 15:05:31 +01:00
Peter Zijlstra
db7adcfd1c x86/alternatives: Introduce int3_emulate_jcc()
Move the kprobe Jcc emulation into int3_emulate_jcc() so it can be
used by more code -- specifically static_call() will need this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20230123210607.057678245@infradead.org
2023-01-31 15:05:30 +01:00
Peter Zijlstra
8739c68115 sched/clock/x86: Mark sched_clock() noinstr
In order to use sched_clock() from noinstr code, mark it and all its
implementations noinstr.

The whole pvclock thing (used by KVM/Xen) is a bit of a pain, since it
calls out to watchdogs; create a pvclock_clocksource_read_nowd() variant
that doesn't do that and can be noinstr.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230126151323.702003578@infradead.org
2023-01-31 15:01:47 +01:00
Uros Bizjak
5c9da9fe82 x86/pvclock: Improve atomic update of last_value in pvclock_clocksource_read()
Improve the atomic update of last_value in pvclock_clocksource_read():

- Atomic update can be skipped if the "last_value" is already
  equal to "ret".

- The detection of atomic update failure is not correct. The value
  returned by atomic64_cmpxchg() should be compared to the old value
  from the location to be updated. If these two are the same, then the
  atomic update succeeded and the "last_value" location is updated to
  "ret" in an atomic way. Otherwise, the atomic update failed and
  it should be retried with the value from "last_value" - exactly
  what atomic64_try_cmpxchg() does in a correct and more optimal way
  (see the sketch below).
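
A sketch of the corrected loop (variable types and names illustrative):

  s64 last = atomic64_read(&last_value);

  do {
          if (ret <= last)
                  return last;   /* another CPU already advanced last_value */
  } while (!atomic64_try_cmpxchg(&last_value, &last, ret));

  return ret;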

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20230118202330.3740-1-ubizjak@gmail.com
Link: https://lore.kernel.org/r/20230126151323.643408110@infradead.org
2023-01-31 15:01:46 +01:00
Peter Zijlstra
7aab7aa4b4 x86/atomics: Always inline arch_atomic64*()
As already done for regular arch_atomic*(), always inline arch_atomic64*().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230126151323.585115019@infradead.org
2023-01-31 15:01:46 +01:00
Ingo Molnar
57a30218fa Linux 6.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ
 E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg
 nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG
 TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp
 s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER
 ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh
 SDc/Y/c=
 =zpaD
 -----END PGP SIGNATURE-----

Merge tag 'v6.2-rc6' into sched/core, to pick up fixes

Pick up fixes before merging another batch of cpuidle updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-31 15:01:20 +01:00
Joerg Roedel
9d2c7203ff x86/debug: Fix stack recursion caused by wrongly ordered DR7 accesses
In kernels compiled with CONFIG_PARAVIRT=n, the compiler re-orders the
DR7 read in exc_nmi() to happen before the call to sev_es_ist_enter().

This is problematic when running as an SEV-ES guest because in this
environment the DR7 read might cause a #VC exception, and taking #VC
exceptions is not safe in exc_nmi() before sev_es_ist_enter() has run.

The result is stack recursion if the NMI was caused on the #VC IST
stack, because a subsequent #VC exception in the NMI handler will
overwrite the stack frame of the interrupted #VC handler.

As there are no compiler barriers affecting the ordering of DR7
reads/writes, make the accesses to this register volatile, forbidding
the compiler from reordering them.
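
A sketch of the accessor shape after the fix (helper condensed for illustration):

  static __always_inline unsigned long native_get_dr7(void)
  {
          unsigned long val;

          asm volatile("mov %%db7, %0" : "=r" (val));
          return val;
  }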

  [ bp: Massage text, make them volatile too, to make sure some
  aggressive compiler optimization pass doesn't discard them. ]

Fixes: 315562c9af ("x86/sev-es: Adjust #VC IST Stack on entering NMI handler")
Reported-by: Alexey Kardashevskiy <aik@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230127035616.508966-1-aik@amd.com
2023-01-31 12:51:19 +01:00
Evgeniy Baskov
79729f26b0 efi/libstub: Add memory attribute protocol definitions
EFI_MEMORY_ATTRIBUTE_PROTOCOL serves as a better alternative to the
DXE services for setting memory attributes in the EFI Boot Services
environment. This protocol is better because it is part of the UEFI
specification itself, rather than the UEFI PI specification, as the
DXE services are.

Add EFI_MEMORY_ATTRIBUTE_PROTOCOL definitions.
Support mixed mode properly for its calls.

Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Evgeniy Baskov <baskov@ispras.ru>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-01-30 13:11:34 +01:00
Linus Torvalds
bc6bc34b10 - Start checking for -mindirect-branch-cs-prefix clang support too now that LLVM
16 will support it
 
 - Fix a NULL ptr deref when suspending with Xen PV
 
 - Have a SEV-SNP guest check explicitly for features enabled by the hypervisor
   and fail gracefully if some are unsupported by the guest instead of failing in
   a non-obvious and hard-to-debug way
 
 - Fix a MSI descriptor leakage under Xen
 
 - Mark Xen's MSI domain as supporting MSI-X
 
 - Prevent legacy PIC interrupts from being resent in software by marking them
 level triggered, as they should be, which led to a NULL ptr deref
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPWdaAACgkQEsHwGGHe
 VUrHUBAAvRh7cDVKvr2wPNJ+9RFjPFubcLq/+4yTe4Bu14wnYn8+gb2S7P2bPJWl
 TxbZhJGV2IeYyB+aJybjGwX96AzE5lWD6epQLRLI/BOVmqLA8Eyw9te0NEQDd9Wq
 hgY1/hhUmLdpXubq09YjIht5nlxVZ+JFfppouTR7LVtZl7MFvQbAOU8rWzpcbm7l
 UlA4GLakFeOYpbI5g+zzsZ1a1M0cSz8wQ43plD5aAtNy2wKbWiA4QDXJo9J7+ZQg
 1FmOHZ3ChPjEhf9k/N2uAZ6RUHVEDIJ7hDfsM3j/qsKGzCV4cho2Ify+0PFWo4pt
 FO2bRh1yDFqdz4m6ulhnmuMnoRnEuswwrTzrG+HYu1ntpUt72dULmmRBpJ0s6C7W
 BHpCThNWIlcfHbLNY+VAOJ2hiOqfU/8ld+8R0sn7xmomjgCRYHh9kDGOeuvh1hJl
 jfITDuL2gjcj3Ph6+xh6KvOga41ff3EcxkfvqolZ/emRllPiDWsKYXVgUIl22FHt
 23xH7gutPCp27MdoXM1EaWs5s/PQfctvm4LrW6XS8IjWURLo4RrkU6y7YD5VKVy8
 KCKcpF+JMdE1Ao1WpMFfjDG4ObyMYyqnD790nQVM1e5kIhOpeWaa7WzkGFRyIo6Q
 5BU3GGDyMT8SHy7NFhQL62YJe0P2ZctNIDjiTQlYDnWWhjmvk8I=
 =4QMN
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Start checking for -mindirect-branch-cs-prefix clang support too now
   that LLVM 16 will support it

 - Fix a NULL ptr deref when suspending with Xen PV

 - Have a SEV-SNP guest check explicitly for features enabled by the
   hypervisor and fail gracefully if some are unsupported by the guest
   instead of failing in a non-obvious and hard-to-debug way

 - Fix a MSI descriptor leakage under Xen

 - Mark Xen's MSI domain as supporting MSI-X

 - Prevent legacy PIC interrupts from being resent in software by
   marking them level triggered, as they should be, which led to a NULL
   ptr deref

* tag 'x86_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/build: Move '-mindirect-branch-cs-prefix' out of GCC-only block
  acpi: Fix suspend with Xen PV
  x86/sev: Add SEV-SNP guest feature negotiation support
  x86/pci/xen: Fixup fallout from the PCI/MSI overhaul
  x86/pci/xen: Set MSI_FLAG_PCI_MSIX support in Xen MSI domain
  x86/i8259: Mark legacy PIC interrupts with IRQ_LEVEL
2023-01-29 11:17:34 -08:00
Jakub Kicinski
2d104c390f bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY9RqJgAKCRDbK58LschI
 gw2IAP9G5uhFO5abBzYLupp6SY3T5j97MUvPwLfFqUEt7EXmuwEA2lCUEWeW0KtR
 QX+QmzCa6iHxrW7WzP4DUYLue//FJQY=
 =yYqA
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2023-01-28

We've added 124 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 6386 insertions(+), 1827 deletions(-).

The main changes are:

1) Implement XDP hints via kfuncs with initial support for RX hash and
   timestamp metadata kfuncs, from Stanislav Fomichev and
   Toke Høiland-Jørgensen.
   Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk

2) Extend libbpf's bpf_tracing.h support for tracing arguments of
   kprobes/uprobes and syscall as a special case, from Andrii Nakryiko.

3) Significantly reduce the search time for module symbols by livepatch
   and BPF, from Jiri Olsa and Zhen Lei.

4) Enable cpumasks to be used as kptrs, which is useful for tracing
   programs tracking which tasks end up running on which CPUs
   in different time intervals, from David Vernet.

5) Fix several issues in the dynptr processing such as stack slot liveness
   propagation, missing checks for PTR_TO_STACK variable offset, etc,
   from Kumar Kartikeya Dwivedi.

6) Various performance improvements, fixes, and introduction of more
   than just one XDP program to XSK selftests, from Magnus Karlsson.

7) Big batch to BPF samples to reduce deprecated functionality,
   from Daniel T. Lee.

8) Enable struct_ops programs to be sleepable in verifier,
   from David Vernet.

9) Reduce pr_warn() noise on BTF mismatches when they are expected under
   the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien.

10) Describe modulo and division by zero behavior of the BPF runtime
    in BPF's instruction specification document, from Dave Thaler.

11) Several improvements to libbpf API documentation in libbpf.h,
    from Grant Seltzer.

12) Improve resolve_btfids header dependencies related to subcmd and add
    proper support for HOSTCC, from Ian Rogers.

13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room()
    helper along with BPF selftests, from Ziyang Xuan.

14) Simplify the parsing logic of structure parameters for BPF trampoline
    in the x86-64 JIT compiler, from Pu Lehui.

15) Get BTF working for kernels with CONFIG_RUST enabled by excluding
    Rust compilation units with pahole, from Martin Rodriguez Reboredo.

16) Get bpf_setsockopt() working for kTLS on top of TCP sockets,
    from Kui-Feng Lee.

17) Disable stack protection for BPF objects in bpftool given BPF backends
    don't support it, from Holger Hoffstätte.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits)
  selftest/bpf: Make crashes more debuggable in test_progs
  libbpf: Add documentation to map pinning API functions
  libbpf: Fix malformed documentation formatting
  selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata
  selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket.
  bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().
  bpf/selftests: Verify struct_ops prog sleepable behavior
  bpf: Pass const struct bpf_prog * to .check_member
  libbpf: Support sleepable struct_ops.s section
  bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable
  selftests/bpf: Fix vmtest static compilation error
  tools/resolve_btfids: Alter how HOSTCC is forced
  tools/resolve_btfids: Install subcmd headers
  bpf/docs: Document the nocast aliasing behavior of ___init
  bpf/docs: Document how nested trusted fields may be defined
  bpf/docs: Document cpumask kfuncs in a new file
  selftests/bpf: Add selftest suite for cpumask kfuncs
  selftests/bpf: Add nested trust selftests suite
  bpf: Enable cpumasks to be queried and used as kptrs
  bpf: Disallow NULLable pointers for trusted kfuncs
  ...
====================

Link: https://lore.kernel.org/r/20230128004827.21371-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-28 00:00:14 -08:00
Jakub Kicinski
b568d3072a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/ethernet/intel/ice/ice_main.c
  418e53401e ("ice: move devlink port creation/deletion")
  643ef23bd9 ("ice: Introduce local var for readability")
https://lore.kernel.org/all/20230127124025.0dacef40@canb.auug.org.au/
https://lore.kernel.org/all/20230124005714.3996270-1-anthony.l.nguyen@intel.com/

drivers/net/ethernet/engleder/tsnep_main.c
  3d53aaef43 ("tsnep: Fix TX queue stop/wake for multiple queues")
  25faa6a4c5 ("tsnep: Replace TX spin_lock with __netif_tx_lock")
https://lore.kernel.org/all/20230127123604.36bb3e99@canb.auug.org.au/

net/netfilter/nf_conntrack_proto_sctp.c
  13bd9b31a9 ("Revert "netfilter: conntrack: add sctp DATA_SENT state"")
  a44b765148 ("netfilter: conntrack: unify established states for SCTP paths")
  f71cb8f45d ("netfilter: conntrack: sctp: use nf log infrastructure for invalid packets")
https://lore.kernel.org/all/20230127125052.674281f9@canb.auug.org.au/
https://lore.kernel.org/all/d36076f3-6add-a442-6d4b-ead9f7ffff86@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27 22:56:18 -08:00
Kirill A. Shutemov
8de62af018 x86/tdx: Disable NOTIFY_ENABLES
== Background ==

There is a class of side-channel attacks against SGX enclaves called
"SGX Step"[1]. These attacks create lots of exceptions inside of
enclaves. Basically, run an in-enclave instruction, cause an exception.
Over and over.

There is a concern that a VMM could attack a TDX guest in the same way
by causing lots of #VE's. The TDX architecture includes new
countermeasures for these attacks. It basically counts the number of
exceptions and can send another *special* exception once the number of
VMM-induced #VE's hits a critical threshold[2].

== Problem ==

But, these special exceptions are independent of any action that the
guest takes. They can occur anywhere that the guest executes. This
includes sensitive areas like the entry code. The (non-paranoid) #VE
handler is incapable of handling exceptions in these areas.

== Solution ==

Fortunately, the special exceptions can be disabled by the guest via
a write to the NOTIFY_ENABLES TDCS field. NOTIFY_ENABLES is disabled by
default, but might be enabled by a bootloader, firmware or an earlier
kernel before the current kernel runs.

Disable the NOTIFY_ENABLES feature explicitly and unconditionally. Any
NOTIFY_ENABLES-based #VE's that occur before this point will end up
in the early #VE exception handler and die due to an unexpected exit
reason.

[1] https://github.com/jovanbulck/sgx-step
[2] https://intel.github.io/ccc-linux-guest-hardening-docs/security-spec.html#safety-against-ve-in-kernel-code

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-8-kirill.shutemov%40linux.intel.com
2023-01-27 09:46:05 -08:00
Kirill A. Shutemov
47e67cf317 x86/tdx: Relax SEPT_VE_DISABLE check for debug TD
A "SEPT #VE" occurs when a TDX guest touches memory that is not properly
mapped into the "secure EPT".  This can be the result of hypervisor
attacks or bugs, *OR* guest bugs.  Most notably, buggy guests might
touch unaccepted memory for lots of different memory safety bugs like
buffer overflows.

TDX guests do not want to continue in the face of hypervisor attacks or
hypervisor bugs.  They want to terminate as fast and safely as possible.
SEPT_VE_DISABLE ensures that TDX guests *can't* continue in the face of
these kinds of issues.

But, that causes a problem.  TDX guests that can't continue can't spit
out oopses or other debugging info.  In essence SEPT_VE_DISABLE=1 guests
are not debuggable.

Relax the SEPT_VE_DISABLE check to a warning on a debug TD, and panic()
in the #VE handler on an EPT violation on private memory. This will
produce a useful backtrace.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-7-kirill.shutemov%40linux.intel.com
2023-01-27 09:46:05 -08:00
Kirill A. Shutemov
71acdcd7cd x86/tdx: Use ReportFatalError to report missing SEPT_VE_DISABLE
Linux TDX guests require that the SEPT_VE_DISABLE "attribute" be set.
If it is not set, the kernel is theoretically required to handle
exceptions anywhere that kernel memory is accessed, including places
like NMI handlers and in the syscall entry gap.

Rather than even try to handle these exceptions, the kernel refuses to
run if SEPT_VE_DISABLE is unset.

However, the SEPT_VE_DISABLE detection and refusal code happens very
early in boot, even before earlyprintk runs.  Calling panic() will
effectively just hang the system.

Instead, call a TDX-specific panic() function.  This makes a very simple
TDVMCALL which gets a short error string out to the hypervisor without
any console infrastructure.

Use TDG.VP.VMCALL<ReportFatalError> to report the error. The hypercall
can encode a message of up to 64 bytes in eight registers.
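
A sketch of the message packing (variable names illustrative):

  /* pack up to 64 error-message bytes into the eight GPRs that
   * TDG.VP.VMCALL<ReportFatalError> consumes */
  u64 regs[8] = { 0 };

  memcpy(regs, msg, strnlen(msg, 64));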

[ dhansen: tweak comment and remove while loop brackets. ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-6-kirill.shutemov%40linux.intel.com
2023-01-27 09:45:55 -08:00
Kirill A. Shutemov
752d13305c x86/tdx: Expand __tdx_hypercall() to handle more arguments
So far __tdx_hypercall() only handles six arguments for VMCALL.
Expanding it by six more registers would allow it to cover more
use-cases like ReportFatalError() and Hyper-V hypercalls.

With all preparations in place, the expansion is pretty
straightforward.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-5-kirill.shutemov%40linux.intel.com
2023-01-27 09:42:09 -08:00
Kirill A. Shutemov
c30c4b2555 x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments
RDI is the first argument to __tdx_hypercall() and is used to pass a
pointer to struct tdx_hypercall_args. RSI is the second argument and
contains flags, such as TDX_HCALL_HAS_OUTPUT and TDX_HCALL_ISSUE_STI.

RDI and RSI can also be used as arguments to TDVMCALL leafs. Move RDI to
RAX and RSI to RBP to free them up for the hypercall arguments.

RAX is saved on the stack during TDCALL as the status code is returned
in that register.

The RBP value has to be restored before returning from __tdx_hypercall()
as it is a callee-saved register.

This is a preparatory patch. No functional change.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-4-kirill.shutemov%40linux.intel.com
2023-01-27 09:42:09 -08:00
Kirill A. Shutemov
0da908c291 x86/tdx: Add more registers to struct tdx_hypercall_args
struct tdx_hypercall_args is used to pass down hypercall arguments to
the __tdx_hypercall() assembly routine.

Currently __tdx_hypercall() handles up to 6 arguments. In preparation
for changes to __tdx_hypercall(), expand the structure by 6 more
registers and generate asm offsets for them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-3-kirill.shutemov%40linux.intel.com
2023-01-27 09:42:09 -08:00
Kirill A. Shutemov
3543f8830b x86/tdx: Fix typo in comment in __tdx_hypercall()
The comment in __tdx_hypercall() claims that RAX==0 indicates TDVMCALL
failure, which is the opposite of the truth: RAX==0 is success.

Fix the comment. No functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-2-kirill.shutemov%40linux.intel.com
2023-01-27 09:42:09 -08:00
Alexander Potapenko
ce3ba2af96 x86: Suppress KMSAN reports in arch_within_stack_frames()
arch_within_stack_frames() performs stack walking and may confuse
KMSAN by stepping on stale shadow values. To prevent false positive
reports, disable KMSAN checks in this function.

This fixes KMSAN's interoperability with CONFIG_HARDENED_USERCOPY.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Eric Biggers <ebiggers@google.com>
Link: https://github.com/google/kmsan/issues/89
Link: https://lore.kernel.org/lkml/Y3b9AAEKp2Vr3e6O@sol.localdomain/
Link: https://lore.kernel.org/all/20221118172305.3321253-1-glider%40google.com
2023-01-27 09:00:56 -08:00
Jakub Kicinski
68f4eae781 net: checksum: drop the linux/uaccess.h include
net/checksum.h pulls in linux/uaccess.h which is large.

In the x86 header the include does not seem to be needed at all.
ARM, on the other hand, does not include uaccess.h even though
it calls access_ok().

In the generic implementation guard the include of linux/uaccess.h
with the same condition as the code that needs it.

With this change pre-processed net/checksum.h shrinks on x86
from 30616 lines to just 1193.
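
A sketch of the guard (macro name assumed; the generic helper that calls access_ok() sits under the same condition):

  #ifndef _HAVE_ARCH_COPY_AND_CSUM_FROM_USER
  #include <linux/uaccess.h>
  /* ... generic csum_and_copy_from_user() definition ... */
  #endif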

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-27 11:19:46 +00:00
Sean Christopherson
2de154f541 KVM: x86/pmu: Provide "error" semantics for unsupported-but-known PMU MSRs
Provide "error" semantics (read zeros, drop writes) for userspace accesses
to MSRs that are ultimately unsupported for whatever reason, but for which
KVM told userspace to save and restore the MSR, i.e. for MSRs that KVM
included in KVM_GET_MSR_INDEX_LIST.

Previously, KVM special cased a few PMU MSRs that were problematic at one
point or another.  Extend the treatment to all PMU MSRs, e.g. to avoid
spurious unsupported accesses.

Note, the logic can also be used for non-PMU MSRs, but as of today only
PMU MSRs can end up being unsupported after KVM told userspace to save and
restore them.

Link: https://lore.kernel.org/r/20230124234905.3774678-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26 18:03:42 -08:00
Like Xu
e33b6d79ac KVM: x86/pmu: Don't tell userspace to save MSRs for non-existent fixed PMCs
Limit the set of MSRs for fixed PMU counters based on the number of fixed
counters actually supported by the host so that userspace doesn't waste
time saving and restoring dummy values.

Signed-off-by: Like Xu <likexu@tencent.com>
[sean: split for !enable_pmu logic, drop min(), write changelog]
Link: https://lore.kernel.org/r/20230124234905.3774678-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26 18:03:42 -08:00
Sean Christopherson
c3531edc79 KVM: x86/pmu: Don't tell userspace to save PMU MSRs if PMU is disabled
Omit all PMU MSRs from the "MSRs to save" list if the PMU is disabled so
that userspace doesn't waste time saving and restoring dummy values.  KVM
provides "error" semantics (read zeros, drop writes) for such known-but-
unsupported MSRs, i.e. has fudged around this issue for quite some time.
Keep the "error" semantics as-is for now, the logic will be cleaned up in
a separate patch.

Cc: Aaron Lewis <aaronlewis@google.com>
Cc: Weijiang Yang <weijiang.yang@intel.com>
Link: https://lore.kernel.org/r/20230124234905.3774678-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26 18:03:42 -08:00
Sean Christopherson
2374b7310b KVM: x86/pmu: Use separate array for defining "PMU MSRs to save"
Move all potential to-be-saved PMU MSRs into a separate array so that a
future patch can easily omit all PMU MSRs from the list when the PMU is
disabled.

No functional change intended.

Link: https://lore.kernel.org/r/20230124234905.3774678-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26 18:03:42 -08:00
Sean Christopherson
e76ae52747 KVM: x86/pmu: Gate all "unimplemented MSR" prints on report_ignored_msrs
Add helpers to print unimplemented MSR accesses and condition all such
prints on report_ignored_msrs, i.e. honor userspace's request to not
print unimplemented MSRs.  Even though vcpu_unimpl() is ratelimited,
printing can still be problematic, e.g. if a print gets stalled when host
userspace is writing MSRs during live migration, an effective stall can
result in very noticeable disruption in the guest.

E.g. the profile below was taken while calling KVM_SET_MSRS on the PMU
counters while the PMU was disabled in KVM.

  -   99.75%     0.00%  [.] __ioctl
   - __ioctl
      - 99.74% entry_SYSCALL_64_after_hwframe
           do_syscall_64
           sys_ioctl
         - do_vfs_ioctl
            - 92.48% kvm_vcpu_ioctl
               - kvm_arch_vcpu_ioctl
                  - 85.12% kvm_set_msr_ignored_check
                       svm_set_msr
                       kvm_set_msr_common
                       printk
                       vprintk_func
                       vprintk_default
                       vprintk_emit
                       console_unlock
                       call_console_drivers
                       univ8250_console_write
                       serial8250_console_write
                       uart_console_write

Reported-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20230124234905.3774678-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26 18:03:42 -08:00
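
A sketch of the helper shape the changelog implies (names approximate,
not a quote of the patch):

  /* Funnel every "unimplemented MSR" print through a helper that
   * honors the report_ignored_msrs module param. */
  static inline void kvm_pr_unimpl_wrmsr(struct kvm_vcpu *vcpu, u32 msr,
                                         u64 data)
  {
          if (report_ignored_msrs)
                  vcpu_unimpl(vcpu, "unhandled wrmsr: 0x%x data 0x%llx\n",
                              msr, data);
  }
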
Sean Christopherson
8911ce6669 KVM: x86/pmu: Cap kvm_pmu_cap.num_counters_gp at KVM's internal max
Limit kvm_pmu_cap.num_counters_gp during kvm_init_pmu_capability() based
on the vendor PMU capabilities so that consuming num_counters_gp naturally
does the right thing.  This fixes a mostly theoretical bug where KVM could
over-report its PMU support in KVM_GET_SUPPORTED_CPUID for leaf 0xA, e.g.
if the number of counters reported by perf is greater than KVM's
hardcoded internal limit.  Incorporating input from the AMD PMU also
avoids over-reporting MSRs to save when running on AMD.

Link: https://lore.kernel.org/r/20230124234905.3774678-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26 18:03:42 -08:00
Like Xu
2a3003e950 KVM: x86/pmu: Drop event_type and rename "struct kvm_event_hw_type_mapping"
After commit 02791a5c362b ("KVM: x86/pmu: Use PERF_TYPE_RAW to merge
reprogram_{gp,fixed}counter()"), vPMU directly uses the hardware event's
eventsel and unit_mask to reprogram the perf_event, and the event_type
field in "struct kvm_event_hw_type_mapping" is simply no longer used.

Convert the struct into an anonymous struct: the current name is
obsolete now that the structure no longer has any mapping semantics, and
placing the struct definition directly above its sole user makes it
easier to understand what the array is filling in.

Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20221205122048.16023-1-likexu@tencent.com
[sean: drop new comment, use anonymous struct]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-26 14:13:44 -08:00
Uros Bizjak
890a0794b3 x86/ACPI/boot: Use try_cmpxchg() in __acpi_{acquire,release}_global_lock()
Use try_cmpxchg() instead of "cmpxchg(*ptr, old, new) == old" in
__acpi_{acquire,release}_global_lock().  The x86 CMPXCHG instruction
returns success in the ZF flag, so this change saves a compare after
CMPXCHG (and the related MOV instruction in front of CMPXCHG).

Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when CMPXCHG
fails. There is no need to re-read the value in the loop.

Note that the value from *ptr should be read using READ_ONCE() to prevent
the compiler from merging, refetching or reordering the read.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20230116162522.4072-1-ubizjak@gmail.com
2023-01-26 11:49:40 +01:00
Uros Bizjak
50fd4d5e69 x86/PAT: Use try_cmpxchg() in set_page_memtype()
Use try_cmpxchg() instead of "cmpxchg(*ptr, old, new) == old" in
set_page_memtype().  The x86 CMPXCHG instruction returns success in
the ZF flag, so this change saves a compare after cmpxchg (and the
related move instruction in front of cmpxchg).

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails. There is no need to re-read the value in the loop.

Note that the value from *ptr should be read using READ_ONCE to prevent
the compiler from merging, refetching or reordering the read.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230116163446.4734-1-ubizjak@gmail.com
2023-01-26 11:49:39 +01:00
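
Both conversions share the same before/after shape; a pattern sketch
using the primitives named in the changelogs, with compute_next()
standing in for the real update logic:

  static void update_before(unsigned int *ptr)
  {
          unsigned int old, new;

          do {
                  old = READ_ONCE(*ptr);           /* re-read every pass */
                  new = compute_next(old);
          } while (cmpxchg(ptr, old, new) != old); /* compare by hand */
  }

  static void update_after(unsigned int *ptr)
  {
          unsigned int old = READ_ONCE(*ptr);      /* one read up front */
          unsigned int new;

          do {
                  new = compute_next(old);
                  /* On failure, try_cmpxchg() refreshes "old" from *ptr,
                   * and on x86 the comparison comes free in CMPXCHG's
                   * ZF flag. */
          } while (!try_cmpxchg(ptr, &old, new));
  }
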
Borislav Petkov (AMD)
793207bad7 x86/resctrl: Fix a silly -Wunused-but-set-variable warning
clang correctly complains

  arch/x86/kernel/cpu/resctrl/rdtgroup.c:1456:6: warning: variable \
     'h' set but not used [-Wunused-but-set-variable]
          u32 h;
              ^

but it can't know whether this use is innocuous or really a problem.
There's a reason why those warning switches are behind a W=1 and not
enabled by default - yes, one needs to do:

  make W=1 CC=clang HOSTCC=clang arch/x86/kernel/cpu/resctrl/

with clang 14 in order to trigger it.

I would normally not take a silly fix like that but this one is simple
and doesn't make the code uglier so...

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/202301242015.kbzkVteJ-lkp@intel.com
2023-01-26 11:15:20 +01:00
Nick Desaulniers
994f5f7816 x86/boot/compressed: prefer cc-option for CFLAGS additions
as-option tests new options using KBUILD_CFLAGS, which causes problems
when using as-option to update KBUILD_AFLAGS because many compiler
options are not valid assembler options.

This will be fixed in a follow up patch. Before doing so, move the
assembler test for -Wa,-mrelax-relocations=no from using as-option to
cc-option.

Link: https://lore.kernel.org/llvm/CAK7LNATcHt7GcXZ=jMszyH=+M_LC9Qr6yeAGRCBbE6xriLxtUQ@mail.gmail.com/
Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Linux Kernel Functional Testing <lkft@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2023-01-26 12:40:47 +09:00
Kim Phillips
8c19b6f257 KVM: x86: Propagate the AMD Automatic IBRS feature to the guest
Add the AMD Automatic IBRS feature bit to those being propagated to the guest,
and enable the guest EFER bit.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-9-kim.phillips@amd.com
2023-01-25 17:21:40 +01:00
Kim Phillips
e7862eda30 x86/cpu: Support AMD Automatic IBRS
The AMD Zen4 core supports a new feature called Automatic IBRS.

It is a "set-and-forget" feature: like Intel's Enhanced IBRS, hardware
manages its IBRS mitigation resources automatically across CPL transitions.

The feature is advertised by CPUID_Fn80000021_EAX bit 8 and is enabled by
setting MSR C000_0080 (EFER) bit 21.

Enable Automatic IBRS by default if the CPU feature is present.  It typically
provides greater performance than the incumbent generic retpolines mitigation.

Reuse the SPECTRE_V2_EIBRS spectre_v2_mitigation enum.  AMD Automatic IBRS and
Intel Enhanced IBRS have similar enablement.  Add NO_EIBRS_PBRSB to
cpu_vuln_whitelist, since AMD Automatic IBRS isn't affected by PBRSB-eIBRS.

The kernel command line option spectre_v2=eibrs is used to select AMD Automatic
IBRS, if available.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-8-kim.phillips@amd.com
2023-01-25 17:16:01 +01:00
Kim Phillips
faabfcb194 x86/cpu, kvm: Add the SMM_CTL MSR not present feature
The SMM_CTL MSR not present feature was being open-coded for KVM.
Add it to its newly added CPUID leaf 0x80000021 EAX proper.

Also drop the bit description comments now that the code is more
self-describing.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-7-kim.phillips@amd.com
2023-01-25 16:37:20 +01:00
Kim Phillips
5b909d4ae5 x86/cpu, kvm: Add the Null Selector Clears Base feature
The Null Selector Clears Base feature was being open-coded for KVM.
Add it to its newly added native CPUID leaf 0x80000021 EAX proper.

Also drop the bit description comments now that it's more self-describing.

  [ bp: Convert test in check_null_seg_clears_base() too. ]

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-6-kim.phillips@amd.com
2023-01-25 16:25:46 +01:00
Kim Phillips
84168ae786 x86/cpu, kvm: Move X86_FEATURE_LFENCE_RDTSC to its native leaf
The LFENCE always serializing feature bit was defined as scattered
LFENCE_RDTSC and its native leaf bit position open-coded for KVM.  Add
it to its newly added CPUID leaf 0x80000021 EAX proper.  With
LFENCE_RDTSC in its proper place, the kernel's set_cpu_cap() will
effectively synthesize the feature for KVM going forward.

Also, DE_CFG[1] doesn't need to be set on such CPUs anymore.

  [ bp: Massage and merge diff from Sean. ]

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-5-kim.phillips@amd.com
2023-01-25 13:06:13 +01:00
Kim Phillips
a9dc9ec5a1 x86/cpu, kvm: Add the NO_NESTED_DATA_BP feature
The "Processor ignores nested data breakpoints" feature was being
open-coded for KVM.  Add the feature to its newly introduced CPUID leaf
0x80000021 EAX proper.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-4-kim.phillips@amd.com
2023-01-25 12:36:34 +01:00
Jens Axboe
cb3ea4b767 x86/fpu: Don't set TIF_NEED_FPU_LOAD for PF_IO_WORKER threads
We don't set it on PF_KTHREAD threads as they never return to userspace,
and PF_IO_WORKER threads are identical in that regard. As they keep
running in the kernel until they die, skip setting the FPU flag on them.

More of a cosmetic thing that was found while debugging an issue
and pondering why the FPU flag is set on these threads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/560c844c-f128-555b-40c6-31baff27537f@kernel.dk
2023-01-25 12:35:15 +01:00
Brian Gerst
4c382d723e x86/vdso: Move VDSO image init to vdso2c generated code
Generate an init function for each VDSO image, replacing init_vdso() and
sysenter_setup().

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230124184019.26850-1-brgerst@gmail.com
2023-01-25 12:33:40 +01:00
Kim Phillips
c35ac8c4bf KVM: x86: Move open-coded CPUID leaf 0x80000021 EAX bit propagation code
Move code from __do_cpuid_func() to kvm_set_cpu_caps() in preparation for adding
the features in their native leaf.

Also drop the bit description comments as it will be more self-describing once
the individual features are added.

Whilst there, switch to using the more efficient cpu_feature_enabled() instead
of static_cpu_has().

Note, LFENCE_RDTSC and "NULL selector clears base" are currently synthetic,
Linux-defined feature flags as Linux tracking of the features predates AMD's
definition.  Keep the manual propagation of the flags from their synthetic
counterparts until the kernel fully converts to AMD's definition, otherwise KVM
would stop synthesizing the flags as intended.

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-3-kim.phillips@amd.com
2023-01-25 12:33:13 +01:00
Kim Phillips
8415a74852 x86/cpu, kvm: Add support for CPUID_80000021_EAX
Add support for CPUID leaf 80000021, EAX. The majority of the features will be
used in the kernel and thus a separate leaf is appropriate.

Include KVM's reverse_cpuid entry because features are used by VM guests, too.

  [ bp: Massage commit message. ]

Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-2-kim.phillips@amd.com
2023-01-25 12:33:06 +01:00
Randy Dunlap
54628de679 x86/Kconfig: Fix spellos & punctuation
Fix spelling (reported by codespell) & punctuation in arch/x86/ Kconfig.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230124181753.19309-1-rdunlap@infradead.org
2023-01-25 12:21:04 +01:00
Borislav Petkov (AMD)
ebd3ad60a6 x86/cpu: Use cpu_feature_enabled() when checking global pages support
X86_FEATURE_PGE determines whether the CPU has enabled global page
translations support. Use the faster cpu_feature_enabled() check to
shave off some more cycles when flushing all TLB entries, including the
global ones.

What this practically saves is:

   mov    0x82eb308(%rip),%rax        # 0xffffffff8935bec8 <boot_cpu_data+40>
   test   $0x20,%ah

... which tests the bit. Not a lot, but TLB flushing is a timing-sensitive
path, so anything to make it even faster.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230125075013.9292-1-bp@alien8.de
2023-01-25 10:32:06 +01:00
Linus Torvalds
b2f317173e ARM64:
- Pass the correct address to mte_clear_page_tags() on initialising
   a tagged page
 
 - Plug a race against a GICv4.1 doorbell interrupt while saving
   the vgic-v3 pending state.
 
 x86:
 
 - A command line parsing fix and a clang compilation fix for selftests
 
 - A fix for a longstanding VMX issue, that surprisingly was only found
   now to affect real world guests

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM64:

   - Pass the correct address to mte_clear_page_tags() on initialising a
     tagged page

   - Plug a race against a GICv4.1 doorbell interrupt while saving the
     vgic-v3 pending state.

  x86:

   - A command line parsing fix and a clang compilation fix for
     selftests

   - A fix for a longstanding VMX issue, that surprisingly was only
     found now to affect real world guests"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: selftests: Make reclaim_period_ms input always be positive
  KVM: x86/vmx: Do not skip segment attributes if unusable bit is set
  selftests: kvm: move declaration at the beginning of main()
  KVM: arm64: GICv4.1: Fix race with doorbell on VPE activation/deactivation
  KVM: arm64: Pass the actual page address to mte_clear_page_tags()
2023-01-24 17:48:09 -08:00
ye xingchen
2eb398df77 KVM: x86: Replace IS_ERR() with IS_ERR_VALUE()
Avoid type casts that are needed for IS_ERR() and use
IS_ERR_VALUE() instead.

Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Link: https://lore.kernel.org/r/202211161718436948912@zte.com.cn
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:42:09 -08:00
Sean Christopherson
11df586d77 KVM: VMX: Handle NMI VM-Exits in noinstr region
Move VMX's handling of NMI VM-Exits into vmx_vcpu_enter_exit() so that
the NMI is handled prior to leaving the safety of noinstr.  Handling the
NMI after leaving noinstr exposes the kernel to potential ordering
problems as an instrumentation-induced fault, e.g. #DB, #BP, #PF, etc.
will unblock NMIs when IRETing back to the faulting instruction.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:41 -08:00
Sean Christopherson
4f76e86f7e KVM: VMX: Provide separate subroutines for invoking NMI vs. IRQ handlers
Split the asm subroutines for handling NMIs versus IRQs that occur in the
guest so that the NMI handler can be called from a noinstr section.  As a
bonus, the NMI path doesn't need an indirect branch.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:41 -08:00
Sean Christopherson
54a3b70a75 x86/entry: KVM: Use dedicated VMX NMI entry for 32-bit kernels too
Use a dedicated entry for invoking the NMI handler from KVM VMX's VM-Exit
path for 32-bit even though using a dedicated entry for 32-bit isn't
strictly necessary.  Exposing a single symbol will allow KVM to reference
the entry point in assembly code without having to resort to more #ifdefs
(or #defines).  idtentry.h is intended to be included from asm files only
once, and so simply including idtentry.h in KVM assembly isn't an option.

Bypassing the ESP fixup and CR3 switching in the standard NMI entry code
is safe as KVM always handles NMIs that occur in the guest on a kernel
stack, with a kernel CR3.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:40 -08:00
Sean Christopherson
432727f1cb KVM: VMX: Always inline to_vmx() and to_kvm_vmx()
Tag to_vmx() and to_kvm_vmx() __always_inline as they both just reflect
the passed in pointer (the embedded struct is the first field in the
container), and drop the @vmx param from vmx_vcpu_enter_exit(), which
likely existed purely to make noinstr validation happy.

Amusingly, when the compiler decides to not inline the helpers, e.g. for
KASAN builds, to_vmx() and to_kvm_vmx() may end up pointing at the same
symbol, which generates very confusing objtool warnings.  E.g. the use of
to_vmx() in a future patch led to objtool complaining about to_kvm_vmx(),
and only once all use of to_kvm_vmx() was commented out did to_vmx() pop
up in the objtool report.

  vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x160: call to to_kvm_vmx()
                               leaves .noinstr.text section

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:40 -08:00
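
Per the changelog, the helpers reduce to pointer math on the first
field; their likely shape:

  /* The embedded struct is the first member of the container, so
   * container_of() compiles down to a no-op cast; __always_inline
   * keeps the helpers usable from noinstr code even on KASAN builds
   * that would otherwise out-of-line them. */
  static __always_inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
  {
          return container_of(vcpu, struct vcpu_vmx, vcpu);
  }

  static __always_inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
  {
          return container_of(kvm, struct kvm_vmx, kvm);
  }
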
Sean Christopherson
11633f6950 KVM: VMX: Always inline eVMCS read/write helpers
Tag all evmcs_{read,write}() helpers __always_inline so that they can be
freely used in noinstr sections, e.g. to get the VM-Exit reason in
vcpu_vmx_enter_exit() (in a future patch).  For consistency and to avoid
more spot fixes in the future, e.g. see commit 010050a863 ("x86/kvm:
Always inline evmcs_write64()"), tag all accessors even though
evmcs_read32() is the only anticipated use case in the near future.  In
practice, non-KASAN builds are all but guaranteed to inline the helpers
anyways.

  vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x107: call to evmcs_read32()
                               leaves .noinstr.text section

Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:27 -08:00
Sean Christopherson
8578f59657 KVM: VMX: Allow VM-Fail path of VMREAD helper to be instrumented
Allow instrumentation in the VM-Fail path of __vmcs_readl() so that the
helper can be used in noinstr functions, e.g. to get the exit reason in
vmx_vcpu_enter_exit() in order to handle NMI VM-Exits in the noinstr
section.  While allowing instrumentation isn't technically safe, KVM has
much bigger problems if VMREAD fails in a noinstr section.

Note, all other VMX instructions also allow instrumentation in their
VM-Fail paths for similar reasons, VMREAD was simply omitted by commit
3ebccdf373 ("x86/kvm/vmx: Move guest enter/exit into .noinstr.text")
because VMREAD wasn't used in a noinstr section at the time.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:36:26 -08:00
Sean Christopherson
fc9465be8a KVM: x86: Make vmx_get_exit_qual() and vmx_get_intr_info() noinstr-friendly
Add an extra special noinstr-friendly helper to test+mark a "register"
available and use it when caching vmcs.EXIT_QUALIFICATION and
vmcs.VM_EXIT_INTR_INFO.  Make the caching helpers __always_inline too so
that they can be used in noinstr functions.

A future fix will move VMX's handling of NMI exits into the noinstr
vmx_vcpu_enter_exit() so that the NMI is processed before any kind of
instrumentation can trigger a fault and thus IRET, i.e. so that KVM
doesn't invoke the NMI handler with NMIs enabled.

Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221213060912.654668-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:35:38 -08:00
Alexey Dobriyan
e8733482f5 KVM: VMX: don't use "unsigned long" in vmx_vcpu_enter_exit()
__vmx_vcpu_run_flags() returns "unsigned int" and uses only 2 bits of it,
so using "unsigned long" is very much pointless.  Furthermore,
__vmx_vcpu_run() and vmx_spec_ctrl_restore_host() take an "unsigned int",
i.e. actually relying on an "unsigned long" value won't work.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/Y3e7UW0WNV2AZmsZ@p183
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:07:01 -08:00
Sean Christopherson
5a7a64779e KVM: VMX: Access @flags as a 32-bit value in __vmx_vcpu_run()
Access @flags using 32-bit operands when saving and testing @flags for
VMX_RUN_VMRESUME, as using 8-bit operands is unnecessarily fragile due
to relying on VMX_RUN_VMRESUME being in bits 0-7.  The behavior of
treating @flags as a single byte is a holdover from when the param was
"bool launched", i.e. is not deliberate.

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20221119003747.2615229-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:07:00 -08:00
Anish Ghulati
a31b531cd2 KVM: SVM: Account scratch allocations used to decrypt SEV guest memory
Account the temp/scratch allocation used to decrypt unaligned debug
accesses to SEV guest memory, the allocation is very much tied to the
target VM.

Reported-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Anish Ghulati <aghulati@google.com>
Link: https://lore.kernel.org/r/20230113220923.2834699-1-aghulati@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:06:48 -08:00
Like Xu
36b0256789 KVM: svm/avic: Drop "struct kvm_x86_ops" for avic_hardware_setup()
Even in commit 4bdec12aa8 ("KVM: SVM: Detect X2APIC virtualization
(x2AVIC) support"), where avic_hardware_setup() was first introduced,
its only pass-in parameter "struct kvm_x86_ops *ops" is not used at all.
Clean it up a bit to avoid a warning from the LLVM toolchain.

Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221109115952.92816-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:06:48 -08:00
zhang songyi
5f6015b26b KVM: SVM: remove redundant ret variable
Return value from svm_nmi_blocked() directly instead of taking
this in another redundant variable.

Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn>
Link: https://lore.kernel.org/r/202211282003389362484@zte.com.cn
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:06:47 -08:00
Aaron Lewis
14329b825f KVM: x86/pmu: Introduce masked events to the pmu event filter
When building a list of filter events, it can sometimes be a challenge
to fit all the events needed to adequately restrict the guest into the
limited space available in the pmu event filter.  This stems from the
fact that the pmu event filter requires each event (i.e. event select +
unit mask) be listed, when the intention might be to restrict the
event select altogether, regardless of its unit mask.  Instead of
increasing the number of filter events in the pmu event filter, add a
new encoding that is able to do a more generalized match on the unit mask.

Introduce masked events as another encoding the pmu event filter
understands.  Masked events have the fields: mask, match, and exclude.
When filtering based on these events, the mask is applied to the guest's
unit mask to see if it matches the match value (i.e. umask & mask ==
match).  The exclude bit can then be used to exclude events from that
match.  E.g. for a given event select, if it's easier to say which unit
mask values shouldn't be filtered, a masked event can be set up to match
all possible unit mask values, then another masked event can be set up to
match the unit mask values that shouldn't be filtered.

Userspace can query to see if this feature exists by looking for the
capability, KVM_CAP_PMU_EVENT_MASKED_EVENTS.

This feature is enabled by setting the flags field in the pmu event
filter to KVM_PMU_EVENT_FLAG_MASKED_EVENTS.

Events can be encoded by using KVM_PMU_ENCODE_MASKED_ENTRY().

It is an error to have a bit set outside the valid bits for a masked
event, and calls to KVM_SET_PMU_EVENT_FILTER will return -EINVAL in
such cases, including the high bits of the event select (35:32) if
called on Intel.

With these updates the filter matching code has been updated to match on
a common event.  Masked events were flexible enough to handle both event
types, so they were used as the common event.  This changes how guest
events get filtered because regardless of the type of event used in the
uAPI, they will be converted to masked events.  Because of this there
could be a slight performance hit because instead of matching the filter
event with a lookup on event select + unit mask, it does a lookup on event
select then walks the unit masks to find the match.  This shouldn't be a
big problem because I would expect the set of common event selects to be
small, and if they aren't the set can likely be reduced by using masked
events to generalize the unit mask.  Using one type of event when
filtering guest events allows for a common code path to be used.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20221220161236.555143-5-aaronlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:06:12 -08:00
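
A sketch of the match rule spelled out above; the field layout is
illustrative (the uAPI packs these fields via
KVM_PMU_ENCODE_MASKED_ENTRY()):

  struct masked_event {
          u64 select;     /* event select this entry applies to */
          u8 mask;        /* bits of the guest's unit mask to inspect */
          u8 match;       /* required value of (guest umask & mask) */
          bool exclude;   /* a hit vetoes an otherwise-allowed event */
  };

  static bool masked_event_hits(const struct masked_event *e,
                                u64 guest_select, u8 guest_umask)
  {
          return e->select == guest_select &&
                 (guest_umask & e->mask) == e->match;
  }
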
Aaron Lewis
c5a287fa0d KVM: x86/pmu: prepare the pmu event filter for masked events
Refactor check_pmu_event_filter() in preparation for masked events.

No functional changes intended.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20221220161236.555143-4-aaronlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:06:11 -08:00
Aaron Lewis
8589827fd5 KVM: x86/pmu: Remove impossible events from the pmu event filter
If it's not possible for an event in the pmu event filter to match a
pmu event being programmed by the guest, it's pointless to have it in
the list.  Opt for a shorter list by removing those events.

Because this is established uAPI, the pmu event filter can't outright
reject these events as garbage and return an error.  Instead, play
nice and remove them from the list.

Also, opportunistically rewrite the comment when the filter is set to
clarify that it guards against *all* TOCTOU attacks on the verified
data.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20221220161236.555143-3-aaronlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:06:11 -08:00
Aaron Lewis
6a5cba7bed KVM: x86/pmu: Correct the mask used in a pmu event filter lookup
When checking if a pmu event the guest is attempting to program should
be filtered, only consider the event select + unit mask in that
decision. Use an architecture specific mask to mask out all other bits,
including bits 35:32 on Intel.  Those bits are not part of the event
select and should not be considered in that decision.

Fixes: 66bb8a065f ("KVM: x86: PMU Event Filter")
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Link: https://lore.kernel.org/r/20221220161236.555143-2-aaronlewis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:06:10 -08:00
Christophe JAILLET
11b36fe7d4 KVM: x86/mmu: Use kstrtobool() instead of strtobool()
strtobool() is the same as kstrtobool().
However, the latter is more used within the kernel.

In order to remove strtobool() and slightly simplify kstrtox.h, switch to
the other function name.

While at it, include the corresponding header file (<linux/kstrtox.h>)

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/670882aa04dbdd171b46d3b20ffab87158454616.1673689135.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:49 -08:00
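
A usage sketch (handler name and context hypothetical): kstrtobool()
parses "y"/"n", "1"/"0" and "on"/"off" style strings and returns
-EINVAL for anything else, matching the old strtobool() behavior:

  #include <linux/kstrtox.h>
  #include <linux/moduleparam.h>

  static int bool_param_set(const char *val, const struct kernel_param *kp)
  {
          bool enable;
          int err = kstrtobool(val, &enable);

          if (err)
                  return err;
          *(bool *)kp->arg = enable;
          return 0;
  }
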
Hou Wenlong
4ad980aea7 KVM: x86/mmu: Cleanup range-based flushing for given page
Use the new kvm_flush_remote_tlbs_gfn() helper to cleanup the call sites
of range-based flushing for given page, which makes the code clear.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/593ee1a876ece0e819191c0b23f56b940d6686db.1665214747.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:48 -08:00
Hou Wenlong
3cdf93746f KVM: x86/mmu: Fix wrong gfn range of tlb flushing in validate_direct_spte()
The spte pointing to the children SP is dropped, so the whole gfn range
covered by the children SP should be flushed. Although Hyper-V may
treat a 1-page flush the same way if the address points to a huge page,
it would still be better to use the correct size of the huge page.

Fixes: c3134ce240 ("KVM: Replace old tlb flush function with new one to flush a specified range.")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/5f297c566f7d7ff2ea6da3c66d050f69ce1b8ede.1665214747.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:48 -08:00
Hou Wenlong
1b2dc73604 KVM: x86/mmu: Fix wrong start gfn of tlb flushing with range
When a spte is dropped, the start gfn of tlb flushing should be the gfn
of spte not the base gfn of SP which contains the spte. Also introduce a
helper function to do range-based flushing when a spte is dropped, which
would help prevent future buggy use of
kvm_flush_remote_tlbs_with_address() in such case.

Fixes: c3134ce240 ("KVM: Replace old tlb flush function with new one to flush a specified range.")
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/72ac2169a261976f00c1703e88cda676dfb960f5.1665214747.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:47 -08:00
Hou Wenlong
1e203847aa KVM: x86/mmu: Reduce gfn range of tlb flushing in tdp_mmu_map_handle_target_level()
Since the children SP is zapped, the gfn range of tlb flushing should be
the range covered by children SP not parent SP. Replace sp->gfn which is
the base gfn of parent SP with iter->gfn and use the correct size of gfn
range for children SP to reduce tlb flushing range.

Fixes: bb95dfb9e2 ("KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages")
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/528ab9c784a486e9ce05f61462ad9260796a8732.1665214747.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:46 -08:00
Hou Wenlong
9ffe926537 KVM: x86/mmu: Fix wrong gfn range of tlb flushing in kvm_set_pte_rmapp()
When the spte of a huge page is dropped in kvm_set_pte_rmapp(), the whole
gfn range covered by the spte should be flushed. However,
rmap_walk_init_level() doesn't align down the gfn for the new level like
the tdp iterator does, so the gfn used in kvm_set_pte_rmapp() is not the
base gfn of the huge page, and the size of the gfn range is wrong too.
Use the base gfn and the size of the huge page when flushing tlbs for the
huge page. Also introduce a helper function to flush the given page (huge
or not) of guest memory, which would help prevent future buggy use of
kvm_flush_remote_tlbs_with_address() in such cases.

Fixes: c3134ce240 ("KVM: Replace old tlb flush function with new one to flush a specified range.")
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/0ce24d7078fa5f1f8d64b0c59826c50f32f8065e.1665214747.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:46 -08:00
Hou Wenlong
c667a3baed KVM: x86/mmu: Move round_gfn_for_level() helper into mmu_internal.h
Rounding down the GFN to a huge page size is a common pattern throughout
KVM, so move round_gfn_for_level() helper in tdp_iter.c to
mmu_internal.h for common usage. Also rename it as gfn_round_for_level()
to use gfn_* prefix and clean up the other call sites.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/r/415c64782f27444898db650e21cf28eeb6441dfa.1665214747.git.houwenlong.hwl@antgroup.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:45 -08:00
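
The likely shape of the renamed helper: KVM_PAGES_PER_HPAGE(level) is a
power of two (the number of 4KiB pages one entry maps at that level),
so negating it yields the alignment mask for rounding down:

  static inline gfn_t gfn_round_for_level(gfn_t gfn, int level)
  {
          return gfn & -KVM_PAGES_PER_HPAGE(level);
  }
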
Wei Liu
a7e48ef77f KVM: x86/mmu: fix an incorrect comment in kvm_mmu_new_pgd()
There is no function named kvm_mmu_ensure_valid_pgd().

Fix the comment and remove the pair of braces to conform to Linux kernel
coding style.

Signed-off-by: Wei Liu <wei.liu@kernel.org>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20221128214709.224710-1-wei.liu@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:45 -08:00
Lai Jiangshan
9e3fbdfd9b kvm: x86/mmu: Don't clear write flooding for direct SP
Although it does no harm, there is no point in clearing write
flooding for a direct SP.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230105100310.6700-1-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:44 -08:00
Lai Jiangshan
dc1ae59fc4 kvm: x86/mmu: Rename SPTE_TDP_AD_ENABLED_MASK to SPTE_TDP_AD_ENABLED
SPTE_TDP_AD_ENABLED_MASK, SPTE_TDP_AD_DISABLED_MASK and
SPTE_TDP_AD_WRPROT_ONLY_MASK are actual values, not masks.

Remove "MASK" from their names.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230105100204.6521-1-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:44 -08:00
Sean Christopherson
a2b07fa7b9 x86/reboot: Disable SVM, not just VMX, when stopping CPUs
Disable SVM and more importantly force GIF=1 when halting a CPU or
rebooting the machine.  Similar to VMX, SVM allows software to block
INITs via CLGI, and thus can be problematic for a crash/reboot.  The
window for failure is smaller with SVM as INIT is only blocked while
GIF=0, i.e. between CLGI and STGI, but the window does exist.

Fixes: fba4f472b3 ("x86/reboot: Turn off KVM when halting a CPU")
Cc: stable@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130233650.1404148-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:22 -08:00
Sean Christopherson
d81f952aa6 x86/reboot: Disable virtualization in an emergency if SVM is supported
Disable SVM on all CPUs via NMI shootdown during an emergency reboot.
Like VMX, SVM can block INIT, e.g. if the emergency reboot is triggered
between CLGI and STGI, and thus can prevent bringing up other CPUs via
INIT-SIPI-SIPI.

Cc: stable@vger.kernel.org
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130233650.1404148-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:22 -08:00
Sean Christopherson
6a3236580b x86/virt: Force GIF=1 prior to disabling SVM (for reboot flows)
Set GIF=1 prior to disabling SVM to ensure that INIT is recognized if the
kernel is disabling SVM in an emergency, e.g. if the kernel is about to
jump into a crash kernel or may reboot without doing a full CPU RESET.
If GIF is left cleared, the new kernel (or firmware) will be unable to
awaken APs.  Eat faults on STGI (due to EFER.SVME=0) as it's possible
that SVM could be disabled via NMI shootdown between reading EFER.SVME
and executing STGI.

Link: https://lore.kernel.org/all/cbcb6f35-e5d7-c1c9-4db9-fe5cc4de579a@amd.com
Cc: stable@vger.kernel.org
Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130233650.1404148-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:21 -08:00
Sean Christopherson
26044aff37 x86/crash: Disable virt in core NMI crash handler to avoid double shootdown
Disable virtualization in crash_nmi_callback() and rework the
emergency_vmx_disable_all() path to do an NMI shootdown if and only if a
shootdown has not already occurred.  NMI crash shootdown fundamentally
can't support multiple invocations as responding CPUs are deliberately
put into halt state without unblocking NMIs.  But the emergency reboot
path doesn't have any work of its own; it simply cares about disabling
virtualization, i.e. so long as a shootdown occurred, emergency reboot
doesn't care who initiated the shootdown, or when.

If "crash_kexec_post_notifiers" is specified on the kernel command line,
panic() will invoke crash_smp_send_stop() and result in a second call to
nmi_shootdown_cpus() during native_machine_emergency_restart().

Invoke the callback _before_ disabling virtualization, as the current
VMCS needs to be cleared before doing VMXOFF.  Note, this results in a
subtle change in ordering between disabling virtualization and stopping
Intel PT on the responding CPUs.  While VMX and Intel PT do interact,
VMXOFF and writes to MSR_IA32_RTIT_CTL do not induce faults between one
another, which is all that matters when panicking.

Harden nmi_shootdown_cpus() against multiple invocations to try and
capture any such kernel bugs via a WARN instead of hanging the system
during a crash/dump, e.g. prior to the recent hardening of
register_nmi_handler(), re-registering the NMI handler would trigger a
double list_add() and hang the system if CONFIG_BUG_ON_DATA_CORRUPTION=y.

 list_add double add: new=ffffffff82220800, prev=ffffffff8221cfe8, next=ffffffff82220800.
 WARNING: CPU: 2 PID: 1319 at lib/list_debug.c:29 __list_add_valid+0x67/0x70
 Call Trace:
  __register_nmi_handler+0xcf/0x130
  nmi_shootdown_cpus+0x39/0x90
  native_machine_emergency_restart+0x1c9/0x1d0
  panic+0x237/0x29b

Extract the disabling logic to a common helper to deduplicate code, and
to prepare for doing the shootdown in the emergency reboot path if SVM
is supported.

Note, prior to commit ed72736183 ("x86/reboot: Force all cpus to exit
VMX root if VMX is supported"), nmi_shootdown_cpus() was subtly protected
against a second invocation by a cpu_vmx_enabled() check as the kdump
handler would disable VMX if it ran first.

Fixes: ed72736183 ("x86/reboot: Force all cpus to exit VMX root if VMX is supported")
Cc: stable@vger.kernel.org
Reported-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/all/20220427224924.592546-2-gpiccoli@igalia.com
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20221130233650.1404148-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:21 -08:00
Paul Durrant
f422f853af KVM: x86/xen: update Xen CPUID Leaf 4 (tsc info) sub-leaves, if present
The scaling information in subleaf 1 should match the values set by KVM in
the 'vcpu_info' sub-structure 'time_info' (a.k.a. pvclock_vcpu_time_info)
which is shared with the guest, but is not directly available to the VMM.
The offset values are not set since a TSC offset is already applied.
The TSC frequency should also be set in sub-leaf 2.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230106103600.528-3-pdurrant@amazon.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:20 -08:00
Paul Durrant
48639df8a9 KVM: x86/cpuid: generalize kvm_update_kvm_cpuid_base() and also capture limit
A subsequent patch will need to acquire the CPUID leaf range for emulated
Xen so explicitly pass the signature of the hypervisor we're interested in
to the new function. Also introduce a new kvm_hypervisor_cpuid structure
so we can neatly store both the base and limit leaf indices.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20230106103600.528-2-pdurrant@amazon.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:19 -08:00
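
Per the changelog, the new structure simply pairs the two leaf indices;
its probable shape:

  struct kvm_hypervisor_cpuid {
          u32 base;       /* first leaf of the hypervisor's CPUID range */
          u32 limit;      /* maximum leaf reported at that base */
  };
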
David Matlack
ee661d8ea9 KVM: x86: Replace cpu_dirty_logging_count with nr_memslots_dirty_logging
Drop cpu_dirty_logging_count in favor of nr_memslots_dirty_logging.
Both fields count the number of memslots that have dirty-logging enabled,
with the only difference being that cpu_dirty_logging_count is only
incremented when using PML. So while nr_memslots_dirty_logging is not a
direct replacement for cpu_dirty_logging_count, it can be combined with
enable_pml to get the same information.

Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20230105214303.2919415-1-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:19 -08:00
Kees Cook
6213b701a9 KVM: x86: Replace 0-length arrays with flexible arrays
Zero-length arrays are deprecated[1]. Replace struct kvm_nested_state's
"data" union 0-length arrays with flexible arrays. (How are the
sizes of these arrays verified?) Detected with GCC 13, using
-fstrict-flex-arrays=3:

arch/x86/kvm/svm/nested.c: In function 'svm_get_nested_state':
arch/x86/kvm/svm/nested.c:1536:17: error: array subscript 0 is outside array bounds of 'struct kvm_svm_nested_state_data[0]' [-Werror=array-bounds=]
 1536 |                 &user_kvm_nested_state->data.svm[0];
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from include/uapi/linux/kvm.h:15,
                 from include/linux/kvm_host.h:40,
                 from arch/x86/kvm/svm/nested.c:18:
arch/x86/include/uapi/asm/kvm.h:511:50: note: while referencing 'svm'
  511 |                 struct kvm_svm_nested_state_data svm[0];
      |                                                  ^~~

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays

Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230105190548.never.323-kees@kernel.org
Link: https://lore.kernel.org/r/20230118195905.gonna.693-kees@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:18 -08:00
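
A before/after sketch of the member change (surrounding struct
abbreviated for illustration; names other than "svm" are stand-ins):

  struct nested_state_sketch {
          __u32 size;     /* stand-in for the preceding fixed fields */
          /* struct kvm_svm_nested_state_data svm[0];   old, deprecated */
          struct kvm_svm_nested_state_data svm[];  /* C99 flexible array */
  };

Note that the real members sit inside a union, where a bare flexible
array member is not standard C; the kernel's __DECLARE_FLEX_ARRAY()
helper wraps the member in an anonymous struct for exactly this case.
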
Jim Mattson
2a4209d6a9 KVM: x86: Advertise fast REP string features inherent to the CPU
Fast zero-length REP MOVSB, fast short REP STOSB, and fast short REP
{CMPSB,SCASB} are inherent features of the processor that cannot be
hidden by the hypervisor. When these features are present on the host,
enumerate them in KVM_GET_SUPPORTED_CPUID.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220901211811.2883855-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:18 -08:00
Jim Mattson
f8df91e73a x86/cpufeatures: Add macros for Intel's new fast rep string features
KVM_GET_SUPPORTED_CPUID should reflect these host CPUID bits. The bits
are already cached in word 12. Give the bits X86_FEATURE names, so
that they can be easily referenced. Hide these bits from
/proc/cpuinfo, since the host kernel makes no use of them at present.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20220901211811.2883855-1-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:05:17 -08:00
Li RongQing
8e6ed96cdd KVM: x86: fire timer when it is migrated and expired, and in oneshot mode
When the vCPU is migrated and its timer has already expired, KVM should
fire the timer ASAP; zeroing the deadline here causes the timer to fire
immediately on the destination.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Peter Shier <pshier@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://lore.kernel.org/r/20230106040625.8404-1-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:04:37 -08:00
Sean Christopherson
02efd818a6 KVM: VMX: Intercept reads to invalid and write-only x2APIC registers
Intercept reads to invalid (non-existent) and write-only x2APIC registers
when configuring VMX's MSR bitmaps for x2APIC+APICv.  When APICv is fully
enabled, Intel hardware doesn't validate the registers on RDMSR and
instead blindly retrieves data from the vAPIC page, i.e. it's software's
responsibility to intercept reads to non-existent and write-only MSRs.

Fixes: 8d14695f95 ("x86, apicv: add virtual x2apic support")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20230107011025.565472-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:04:36 -08:00
Sean Christopherson
c39857ce8d KVM: VMX: Always intercept accesses to unsupported "extended" x2APIC regs
Don't clear the "read" bits for x2APIC registers above SELF_IPI (APIC regs
0x400 - 0xff0, MSRs 0x840 - 0x8ff).  KVM doesn't emulate registers in that
space (there are a smattering of AMD-only extensions) and so should
intercept reads in order to inject #GP.  When APICv is fully enabled,
Intel hardware doesn't validate the registers on RDMSR and instead blindly
retrieves data from the vAPIC page, i.e. it's software's responsibility to
intercept reads to non-existent MSRs.

Fixes: 8d14695f95 ("x86, apicv: add virtual x2apic support")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20230107011025.565472-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:04:36 -08:00
Sean Christopherson
b5fcc59be7 KVM: x86: Split out logic to generate "readable" APIC regs mask to helper
Move the generation of the readable APIC regs bitmask to a standalone
helper so that VMX can use the mask for its MSR interception bitmaps.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20230107011025.565472-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:04:35 -08:00
Sean Christopherson
b223649576 KVM: x86: Mark x2APIC DFR reg as non-existent for x2APIC
Mark APIC_DFR as being invalid/non-existent in x2APIC mode instead of
handling it as a one-off check in kvm_x2apic_msr_read().  This will allow
reusing "valid_reg_mask" to generate VMX's interception bitmaps for
x2APIC.  Handling DFR in the common read path may also fix the Hyper-V
PV MSR interface, if that can coexist with x2APIC.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20230107011025.565472-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:04:35 -08:00
Sean Christopherson
ab52be1b31 KVM: x86: Inject #GP on x2APIC WRMSR that sets reserved bits 63:32
Reject attempts to set bits 63:32 for 32-bit x2APIC registers, i.e. all
x2APIC registers except ICR.  Per Intel's SDM:

  Non-zero writes (by WRMSR instruction) to reserved bits to these
  registers will raise a general protection fault exception

Opportunistically fix a typo in a nearby comment.

Reported-by: Marc Orr <marcorr@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20230107011025.565472-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:04:34 -08:00
Sean Christopherson
ba5838abb0 KVM: x86: Inject #GP if WRMSR sets reserved bits in APIC Self-IPI
Inject a #GP if the guest attempts to set reserved bits in the x2APIC-only
Self-IPI register.  Bits 7:0 hold the vector, all other bits are reserved.

Reported-by: Marc Orr <marcorr@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Venkatesh Srinivas <venkateshs@chromium.org>
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20230107011025.565472-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:04:34 -08:00
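
A sketch of the two reserved-bit checks described above (function shape
illustrative; returning non-zero from the MSR write path is what makes
KVM inject #GP):

  static int x2apic_msr_write_check(struct kvm_lapic *apic, u32 reg, u64 data)
  {
          /* All x2APIC registers except ICR are 32 bits: 63:32 reserved. */
          if (reg != APIC_ICR && (data >> 32))
                  return 1;       /* #GP */

          /* SELF_IPI carries only a vector in bits 7:0; rest reserved. */
          if (reg == APIC_SELF_IPI && (data & ~APIC_VECTOR_MASK))
                  return 1;       /* #GP */

          return kvm_lapic_reg_write(apic, reg, (u32)data);
  }
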
zhang songyi
85e64d09f7 KVM: x86: remove redundant ret variable
Return value from apic_get_tmcct() directly instead of taking
this in another redundant variable.

Signed-off-by: zhang songyi <zhang.songyi@zte.com.cn>
Link: https://lore.kernel.org/r/202211231704457807160@zte.com.cn
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-01-24 10:04:33 -08:00
Paolo Bonzini
f15a87c006 Merge branch 'kvm-lapic-fix-and-cleanup' into HEAD
The first half or so of the patches fix semi-urgent, real-world relevant
APICv and AVIC bugs.

The second half fixes a variety of AVIC and optimized APIC map bugs
where KVM doesn't play nice with various edge cases that are
architecturally legal(ish), but are unlikely to occur in most real world
scenarios.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-24 06:08:01 -05:00
Paolo Bonzini
dc7c31e922 Merge branch 'kvm-v6.2-rc4-fixes' into HEAD
ARM:

* Fix the PMCR_EL0 reset value after the PMU rework

* Correctly handle S2 fault triggered by a S1 page table walk
  by not always classifying it as a write, as this breaks on
  R/O memslots

* Document why we cannot exit with KVM_EXIT_MMIO when taking
  a write fault from a S1 PTW on a R/O memslot

* Put the Apple M2 on the naughty list for not being able to
  correctly implement the vgic SEIS feature, just like the M1
  before it

* Reviewer updates: Alex is stepping down, replaced by Zenghui

x86:

* Fix various rare locking issues in Xen emulation and teach lockdep
  to detect them

* Documentation improvements

* Do not return host topology information from KVM_GET_SUPPORTED_CPUID
2023-01-24 06:05:23 -05:00
Babu Moger
4fe61bff5a x86/resctrl: Add interface to write mbm_local_bytes_config
The event configuration for mbm_local_bytes can be changed by the
user by writing to the configuration file
/sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config.

The event configuration settings are domain specific and will affect all
the CPUs in the domain.

Following are the types of events supported:

  ====  ===========================================================
  Bits   Description
  ====  ===========================================================
  6      Dirty Victims from the QOS domain to all types of memory
  5      Reads to slow memory in the non-local NUMA domain
  4      Reads to slow memory in the local NUMA domain
  3      Non-temporal writes to non-local NUMA domain
  2      Non-temporal writes to local NUMA domain
  1      Reads to memory in the non-local NUMA domain
  0      Reads to memory in the local NUMA domain
  ====  ===========================================================

For example, to change the mbm_local_bytes_config to count all the
non-temporal writes on domain 0, bits 2 and 3 need to be set, which is
1100b (0xc in hex).
Run the command:

  $echo  0=0xc > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config

To change the mbm_local_bytes to count only reads to local NUMA domain 1,
bit 0 needs to be set, which is 1b (0x1 in hex). Run the command:

  $echo  1=0x1 > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-13-babu.moger@amd.com
2023-01-23 17:40:32 +01:00
Babu Moger
92bd5a1390 x86/resctrl: Add interface to write mbm_total_bytes_config
The event configuration for mbm_total_bytes can be changed by the user by
writing to the file /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config.

The event configuration settings are domain specific and affect all the
CPUs in the domain.

Following are the types of events supported:

  ====  ===========================================================
  Bits   Description
  ====  ===========================================================
  6      Dirty Victims from the QOS domain to all types of memory
  5      Reads to slow memory in the non-local NUMA domain
  4      Reads to slow memory in the local NUMA domain
  3      Non-temporal writes to non-local NUMA domain
  2      Non-temporal writes to local NUMA domain
  1      Reads to memory in the non-local NUMA domain
  0      Reads to memory in the local NUMA domain
  ====  ===========================================================

For example:

To change the mbm_total_bytes to count only reads on domain 0, bits
0, 1, 4 and 5 need to be set, which is 110011b (0x33 in hex).
Run the command:

  $echo  0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config

To change mbm_total_bytes to count all the slow memory reads on domain 1,
bits 4 and 5 need to be set, which is 110000b (0x30 in hex). Run the command:
Run the command:

  $echo  1=0x30 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-12-babu.moger@amd.com
2023-01-23 17:40:30 +01:00
Babu Moger
73afb2d3ce x86/resctrl: Add interface to read mbm_local_bytes_config
The event configuration can be viewed by the user by reading the configuration
file /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config.  The event
configuration settings are domain specific and will affect all the CPUs in the
domain.

Following are the types of events supported:

  ====  ===========================================================
  Bits   Description
  ====  ===========================================================
  6      Dirty Victims from the QOS domain to all types of memory
  5      Reads to slow memory in the non-local NUMA domain
  4      Reads to slow memory in the local NUMA domain
  3      Non-temporal writes to non-local NUMA domain
  2      Non-temporal writes to local NUMA domain
  1      Reads to memory in the non-local NUMA domain
  0      Reads to memory in the local NUMA domain
  ====  ===========================================================

By default, the mbm_local_bytes_config is set to 0x15 to count all the local
event types.

For example:

  $cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
  0=0x15;1=0x15;2=0x15;3=0x15

In this case, the event mbm_local_bytes is configured with 0x15 on
domains 0 to 3.
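
The default decomposes into the local event bits from the table above; a
small sketch of the arithmetic (not kernel code):

  /* bits 0, 2 and 4: local reads, local NT writes, local slow reads */
  unsigned int cfg = (1U << 0) | (1U << 2) | (1U << 4); /* == 0x15 */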

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-11-babu.moger@amd.com
2023-01-23 17:40:27 +01:00
Babu Moger
dc2a3e8579 x86/resctrl: Add interface to read mbm_total_bytes_config
The event configuration can be viewed by the user by reading the
configuration file /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config.  The
event configuration settings are domain specific and will affect all the CPUs in
the domain.

Following are the types of events supported:

  ====  ===========================================================
  Bits   Description
  ====  ===========================================================
  6      Dirty Victims from the QOS domain to all types of memory
  5      Reads to slow memory in the non-local NUMA domain
  4      Reads to slow memory in the local NUMA domain
  3      Non-temporal writes to non-local NUMA domain
  2      Non-temporal writes to local NUMA domain
  1      Reads to memory in the non-local NUMA domain
  0      Reads to memory in the local NUMA domain
  ====  ===========================================================

By default, the mbm_total_bytes_config is set to 0x7f to count all the
event types.

For example:

  $cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
  0=0x7f;1=0x7f;2=0x7f;3=0x7f

In this case, the event mbm_total_bytes is configured with 0x7f on
domains 0 to 3.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-10-babu.moger@amd.com
2023-01-23 17:40:24 +01:00
Babu Moger
d507f83ced x86/resctrl: Support monitor configuration
Add a new field in struct mon_evt to support Bandwidth Monitoring Event
Configuration (BMEC) and also update the "mon_features" display.

The resctrl file "mon_features" will display the supported events
and files that can be used to configure those events if monitor
configuration is supported.

Before the change:

  $ cat /sys/fs/resctrl/info/L3_MON/mon_features
  llc_occupancy
  mbm_total_bytes
  mbm_local_bytes

After the change when BMEC is supported:

  $ cat /sys/fs/resctrl/info/L3_MON/mon_features
  llc_occupancy
  mbm_total_bytes
  mbm_total_bytes_config
  mbm_local_bytes
  mbm_local_bytes_config

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-9-babu.moger@amd.com
2023-01-23 17:40:21 +01:00
Babu Moger
bd334c86b5 x86/resctrl: Add __init attribute to rdt_get_mon_l3_config()
In an upcoming change, rdt_get_mon_l3_config() needs to call rdt_cpu_has() to
query the monitor related features. It cannot be called right now because
rdt_cpu_has() has the __init attribute but rdt_get_mon_l3_config() doesn't.

Add the __init attribute to rdt_get_mon_l3_config(), which is only called by
get_rdt_mon_resources(), which already has the __init attribute. Also make
rdt_cpu_has() available to rdt_get_mon_l3_config() via the internal header
file.
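
A rough sketch of the resulting annotations (signatures assumed from the
description above, not quoted from the tree):

  /* both carry __init: only reachable from boot-time setup code */
  bool __init rdt_cpu_has(int flag);
  int __init rdt_get_mon_l3_config(struct rdt_resource *r);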

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-8-babu.moger@amd.com
2023-01-23 17:40:11 +01:00
Babu Moger
5b6fac3fa4 x86/resctrl: Detect and configure Slow Memory Bandwidth Allocation
The QoS slow memory configuration details are available via
CPUID_Fn80000020_EDX_x02. Detect the available details and
initialize the rest to defaults.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-7-babu.moger@amd.com
2023-01-23 17:38:44 +01:00
Babu Moger
a76f65c89f x86/resctrl: Include new features in command line options
Add the command line options to enable or disable the new resctrl features:

smba: Slow Memory Bandwidth Allocation
bmec: Bandwidth Monitor Event Configuration.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-6-babu.moger@amd.com
2023-01-23 17:38:38 +01:00
Babu Moger
78335aac61 x86/cpufeatures: Add Bandwidth Monitoring Event Configuration feature flag
Newer AMD processors support the new feature Bandwidth Monitoring Event
Configuration (BMEC).

The feature support is identified via CPUID Fn8000_0020_EBX_x0[3]: EVT_CFG -
Bandwidth Monitoring Event Configuration (BMEC).

The bandwidth monitoring events mbm_total_bytes and mbm_local_bytes are set to
count all the total and local reads/writes, respectively. With the introduction
of slow memory, the two counters are not enough to count all the different types
of memory events. Therefore, BMEC provides the option to configure
mbm_total_bytes and mbm_local_bytes to count the specific type of events.

Each BMEC event has a configuration MSR which contains one field for each
bandwidth type that can be used to configure the bandwidth event to track any
combination of supported bandwidth types. The event will count requests from
every bandwidth type bit that is set in the corresponding configuration
register.

Following are the types of events supported:

  ====    ========================================================
  Bits    Description
  ====    ========================================================
  6       Dirty Victims from the QOS domain to all types of memory
  5       Reads to slow memory in the non-local NUMA domain
  4       Reads to slow memory in the local NUMA domain
  3       Non-temporal writes to non-local NUMA domain
  2       Non-temporal writes to local NUMA domain
  1       Reads to memory in the non-local NUMA domain
  0       Reads to memory in the local NUMA domain
  ====    ========================================================

By default, the mbm_total_bytes configuration is set to 0x7F to count
all the event types and the mbm_local_bytes configuration is set to 0x15 to
count all the local memory events.
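
As an illustration only, programming one of these configuration MSRs could
look like the following sketch; the helper name and the MSR base parameter
are assumptions, not the actual kernel symbols:

  /* hypothetical: write a mask of bandwidth-type bits for one event */
  static void bmec_config_event(u32 msr_base, u32 index, u64 type_mask)
  {
          wrmsrl(msr_base + index, type_mask);
  }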

The feature description is available in the specification "AMD64 Technology
Platform Quality of Service Extensions", Revision 1.03, at
https://bugzilla.kernel.org/attachment.cgi?id=301365

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-5-babu.moger@amd.com
2023-01-23 17:38:31 +01:00
Babu Moger
a5b6996655 x86/resctrl: Add a new resource type RDT_RESOURCE_SMBA
Add a new resource type RDT_RESOURCE_SMBA to handle the QoS enforcement
policies on the external slow memory.

Mostly initialization of the essentials. Setting fflags to RFTYPE_RES_MB
configures the SMBA resource to have the same resctrl files as the
existing MBA resource. The SMBA resource has identical properties to
the existing MBA resource. These properties will be enumerated in an
upcoming change and exposed via resctrl because of this flag.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-4-babu.moger@amd.com
2023-01-23 17:38:22 +01:00
Babu Moger
f334f723a6 x86/cpufeatures: Add Slow Memory Bandwidth Allocation feature flag
Add the new AMD feature X86_FEATURE_SMBA. With it, the QOS enforcement policies
can be applied to external slow memory connected to the host. QOS enforcement is
accomplished by assigning a Class Of Service (COS) to a processor and specifying
allocations or limits for that COS for each resource to be allocated.

This feature is identified by the CPUID function 0x8000_0020_EBX_x0[2]:
L3SBE - L3 external slow memory bandwidth enforcement.

CXL.memory is the only supported "slow" memory device. With SMBA, the hardware
enables bandwidth allocation on the slow memory devices.  If there are multiple
slow memory devices in the system, then the throttling logic groups all the slow
sources together and applies the limit on them as a whole.

The presence of the SMBA feature (with CXL.memory) is independent of whether
a slow memory device is actually present in the system. If there is no slow
memory in the system, then setting an SMBA limit will have no impact on the
performance of the system.

Presence of CXL memory can be identified by the numactl command:

  $numactl -H
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
  node 0 size: 63678 MB node 0 free: 59542 MB
  node 1 cpus:
  node 1 size: 16122 MB
  node 1 free: 15627 MB
  node distances:
  node   0   1
     0:  10  50
     1:  50  10

The CPU list for the CXL memory node will be empty. The CPU-to-CXL node
distance is greater than CPU-to-CPU distances. Node 1 has the CXL memory in
this case. CXL memory can also be identified using the ACPI SRAT table and
memory maps.

The feature description is available in the specification "AMD64 Technology
Platform Quality of Service Extensions", Publication # 56375, Revision 1.03,
Issue Date: February 2022, at
https://bugzilla.kernel.org/attachment.cgi?id=301365

See also https://www.amd.com/en/support/tech-docs/amd64-technology-platform-quality-service-extensions

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-3-babu.moger@amd.com
2023-01-23 17:38:17 +01:00
Babu Moger
fc3b618c87 x86/resctrl: Replace smp_call_function_many() with on_each_cpu_mask()
on_each_cpu_mask() runs the function on each CPU specified by cpumask,
which may include the local processor.

Replace smp_call_function_many() with on_each_cpu_mask() to simplify
the code.
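
A before/after sketch of the pattern (the callback and argument names are
placeholders):

  /* before: the local CPU has to be handled separately, if at all */
  smp_call_function_many(cpu_mask, update_fn, info, 1);

  /* after: runs update_fn on every CPU in cpu_mask, local CPU included */
  on_each_cpu_mask(cpu_mask, update_fn, info, 1);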

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20230113152039.770054-2-babu.moger@amd.com
2023-01-23 17:38:04 +01:00
Linus Torvalds
2475bf0250 - Make sure the scheduler doesn't use stale frequency scaling values when the
latter get disabled due to a value error
 
 - Fix a NULL pointer access on UP configs
 
 - Use the proper locking when updating CPU capacity
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPNKRsACgkQEsHwGGHe
 VUr8NA/9GTqGUpOq0iRQkEOE1oAdT4ZxJ9dQpaWjwWSNv40HT7XJBzjtvzAyiOBF
 LTUVClBct5i1/j0tVcNq1zXEj3im2e24Ki6A1TWukejgGAMT/7siSkuChEDAMg2M
 b79nKCMpuIZMzkJND3qTkW/aMPpAyU82G8BeLCjw7vgPBsbVkjgbxGFKYdHgpLZa
 kdX/GhOufu40jcGeUxA4zWTpXfuXT3OG7JYLrlHeJ/HEdzy9kLCkWH4jHHllzPQw
 c4JwG7UnNkDKD6zkG0Guzi1zQy39egh/kaj7FQmVap9sneq+x69T4ta0BfXdBXnQ
 Vsqc/nnOybUzr9Gjg5W6KRZWk0k6hK9n3+cye88BfRvzMY0KAgxeCiZEn7cuKqZp
 15Agzz77vcwt32QsxjSp+hKQxJtePcBCurDhkfuAZyPELNIBeCW6inrXArprfTIg
 IfEF068GKsvDGu3I/z49VXFZ9YBoerQREHbN/xL1VNeB8VoAxAE227OMUvhMcIu1
 jxVvkESwc9BIybTcJbUjgg2i9t+Fv3gtZBQ6GFfDboHneia4U5a9aTyN6QGjX+5P
 SIXPhePbFCNrhW7JbSGqKBd96MczbFNQinRhgX1lBU0241cAnchY4nH9fUvv7Zrt
 b/QDg58tb2bJkm1Z08L256ZETELOU9nLJVxUbtBNSSO9dfkvoBs=
 =aNSK
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Make sure the scheduler doesn't use stale frequency scaling values
   when the latter get disabled due to a value error

 - Fix a NULL pointer access on UP configs

 - Use the proper locking when updating CPU capacity

* tag 'sched_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/aperfmperf: Erase stale arch_freq_scale values when disabling frequency invariance readings
  sched/core: Fix NULL pointer access fault in sched_setaffinity() with non-SMP configs
  sched/fair: Fixes for capacity inversion detection
  sched/uclamp: Fix a uninitialized variable warnings
2023-01-22 12:14:58 -08:00
Linus Torvalds
2b299a1cd4 - Add Emerald Rapids model support to more perf machinery
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPNJQMACgkQEsHwGGHe
 VUpplw//XSPjrP7Ikajhwv6qQVX0ZEfSnte5qA5B+t5qu59T1DQGhBllkV7pay+V
 PuEESYkOt/MfZ6lCyoG75w80dBMlVIGi3b0HsZ9bw0ejIxm33DMM/E1ZPsKBh8i5
 H/9u26Yo77LE2HMr8Nf8HyOlLrdvBqgQBV3VHgk/eLFZnJR3Yk/T1+8ZZzLpx+TA
 5BfKvXFHWqkpe8KF20A6OxC2FQfLx5hCiY0ozOk0ZHIKzTy9g+BVaDw3XV049nj/
 BttTvhV9bdimCxVgBWxUBt5n7QoR7zE5vfief4o9UkNHeut0LmjnwcfymFwL0j21
 8yVo+layNzhTYX/7wXQchz3LV+8ZvFvVZ1Iw3hhP0yQknbDBLtDHGLTy+p9t2cgS
 1nuMNX1bZHDAQ4vV3ehYqC+MmGIh55dTRDggLvxMOzOv+v1vXMR+mlVELL+Q8KJy
 7ynNfnP88EjCQfBpxe33D8VMFJ7LqsWAIjYu2upXrlVVR8mKbp2FnSjn9cK+Np75
 Y4132ESWSvn7mHSmmcRHOjdSUxmi1UgBsL689hF9Zkx8zpHJSLhUWwW4xKKwE/dd
 /jvwZ50KzjuJYWrn/ueJpubgyeoMSZY6Y+iaCl4aYZna0X7h+PF7qY4A8pB6YXpH
 A0uVCAlNL+lkGFKAq+TkOx1lCw/eR+zg2s2ClVUdB6e8qsYYX2o=
 =9JcB
 -----END PGP SIGNATURE-----

Merge tag 'perf_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

 - Add Emerald Rapids model support to more perf machinery

* tag 'perf_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/cstate: Add Emerald Rapids
  perf/x86/intel: Add Emerald Rapids
2023-01-22 12:06:18 -08:00
Masahiro Yamada
6ae4b9868a kbuild: allow to combine multiple V= levels
Commit a6de553da0 ("kbuild: Allow to combine multiple W= levels")
supported W=123 to enable all the extra warning groups.

I think a similar idea is applicable to the V= option.

  V=1 echoes the whole command
  V=2 prints the reason for rebuilding

These are orthogonal, and can be enabled at the same time.

This commit supports V=12 to enable both of them.
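
For example, to get both the full commands and the rebuild reasons:

  $ make V=12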

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Nicolas Schier <nicolas@fjasle.eu>
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
2023-01-22 23:43:32 +09:00
Nathan Chancellor
27b5de622e x86/build: Move '-mindirect-branch-cs-prefix' out of GCC-only block
LLVM 16 will have support for this flag so move it out of the GCC-only
block to allow LLVM builds to take advantage of it.

Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/1665
Link: 6f867f9102
Link: https://lore.kernel.org/r/20230120165826.2469302-1-nathan@kernel.org
2023-01-22 11:36:45 +01:00
Hendrik Borghorst
a44b331614 KVM: x86/vmx: Do not skip segment attributes if unusable bit is set
When serializing and deserializing kvm_sregs, attributes of the segment
descriptors are stored by user space. For unusable segments,
vmx_segment_access_rights skips all attributes and sets them to 0.

This means we zero out the DPL (Descriptor Privilege Level) for unusable
entries.

Unusable segments are - contrary to their name - usable in 64-bit mode and
are used by guests to, for example, create a linear map through the
NULL selector.

VMENTER checks if SS.DPL is correct depending on the CS segment type.
For types 9 (Execute Only) and 11 (Execute Read), CS.DPL must be equal to
SS.DPL [1].

We have seen real world guests setting CS to a usable segment with DPL=3
and SS to an unusable segment with DPL=3. Once we go through an sregs
get/set cycle, SS.DPL turns to 0. This causes the virtual machine to crash
reproducibly.

This commit changes the attribute logic to always preserve attributes for
unusable segments. According to [2] SS.DPL is always saved on VM exits,
regardless of the unusable bit so user space applications should have saved
the information on serialization correctly.

[3] specifies that besides SS.DPL the rest of the attributes of the
descriptors are undefined after VM entry if unusable bit is set. So, there
should be no harm in setting them all to the previous state.

[1] Intel SDM Vol 3C 26.3.1.2 Checks on Guest Segment Registers
[2] Intel SDM Vol 3C 27.3.2 Saving Segment Registers and Descriptor-Table
Registers
[3] Intel SDM Vol 3C 26.3.2.2 Loading Guest Segment Registers and
Descriptor-Table Registers
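
As a sketch of the idea (bit positions follow the SDM access-rights layout;
this is illustrative, not the exact KVM diff):

  var->unusable = (ar >> 16) & 1;
  var->type     = ar & 15;
  var->s        = (ar >> 4) & 1;
  var->dpl      = (ar >> 5) & 3;  /* now preserved even when unusable */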

Cc: Alexander Graf <graf@amazon.de>
Cc: stable@vger.kernel.org
Signed-off-by: Hendrik Borghorst <hborghor@amazon.de>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Alexander Graf <graf@amazon.com>
Message-Id: <20221114164823.69555-1-hborghor@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-22 04:09:28 -05:00
Ashok Raj
a9a5cac225 x86/microcode/intel: Print old and new revision during early boot
Make early loading message match late loading message and print both old
and new revisions.

This is helpful to know what the BIOS loaded revision is before an early
update.

Cache the early BIOS revision before the microcode update and have
print_ucode_info() print both the old and new revision in the same
format as microcode_reload_late().

  [ bp: Massage, remove useless comment. ]

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230120161923.118882-6-ashok.raj@intel.com
2023-01-21 14:55:21 +01:00
Ashok Raj
174f1b909a x86/microcode/intel: Pass the microcode revision to print_ucode_info() directly
print_ucode_info() takes a struct ucode_cpu_info pointer as a parameter.
Its sole purpose is to print the microcode revision.

The only available ucode_cpu_info always describes the currently loaded
microcode revision. After a microcode update is successful, this is the
new revision, or on failure it is the original revision.

In preparation for future changes, replace the struct ucode_cpu_info
pointer parameter with a plain integer which contains the revision
number and adjust the call sites accordingly.

No functional change.

  [ bp:
    - Fix + cleanup commit message.
    - Revert arbitrary, unrelated change.
  ]

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230120161923.118882-5-ashok.raj@intel.com
2023-01-21 14:55:20 +01:00
Ashok Raj
6eab3abac7 x86/microcode: Adjust late loading result reporting message
During late microcode loading, the "Reload completed" message is issued
unconditionally, regardless of success or failure.

Adjust the message to report the result of the update.

  [ bp: Massage. ]

Fixes: 9bd681251b ("x86/microcode: Announce reload operation's completion")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/lkml/874judpqqd.ffs@tglx/
2023-01-21 14:55:20 +01:00
Ashok Raj
c0dd9245aa x86/microcode: Check CPU capabilities after late microcode update correctly
The kernel caches each CPU's feature bits at boot in an x86_capability[]
structure. However, the capabilities in the BSP's copy can be turned off
as a result of certain command line parameters or configuration
restrictions, for example the SGX bit. This can cause a mismatch when
comparing the values before and after the microcode update.

Another example is X86_FEATURE_SRBDS_CTRL which gets added only after
microcode update:

  --- cpuid.before	2023-01-21 14:54:15.652000747 +0100
  +++ cpuid.after	2023-01-21 14:54:26.632001024 +0100
  @@ -10,7 +10,7 @@ CPU:
      0x00000004 0x04: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
      0x00000005 0x00: eax=0x00000040 ebx=0x00000040 ecx=0x00000003 edx=0x11142120
      0x00000006 0x00: eax=0x000027f7 ebx=0x00000002 ecx=0x00000001 edx=0x00000000
  -   0x00000007 0x00: eax=0x00000000 ebx=0x029c6fbf ecx=0x40000000 edx=0xbc002400
  +   0x00000007 0x00: eax=0x00000000 ebx=0x029c6fbf ecx=0x40000000 edx=0xbc002e00
  									     ^^^

and which proves for a gazillionth time that late loading is a bad bad
idea.

microcode_check() is called after an update to report any previously
cached CPUID bits which might have changed due to the update.

Therefore, store the cached CPU caps before the update and compare them
with the CPU caps after the microcode update has succeeded.

Thus, the comparison is done between the CPUID *hardware* bits before
and after the upgrade instead of using the cached, possibly runtime
modified values in BSP's boot_cpu_data copy.

As a result, false warnings about CPUID bits changes are avoided.

  [ bp:
  	- Massage.
	- Add SRBDS_CTRL example.
	- Add kernel-doc.
	- Incorporate forgotten review feedback from dhansen.
	]

Fixes: 1008c52c09 ("x86/CPU: Add a microcode loader callback")
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230109153555.4986-3-ashok.raj@intel.com
2023-01-21 14:53:20 +01:00
Kan Liang
5d515ee40c perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table
The kernel warning message is triggered when SPR MCC is used.

[   17.945331] ------------[ cut here ]------------
[   17.946305] WARNING: CPU: 65 PID: 1 at
arch/x86/events/intel/uncore_discovery.c:184
intel_uncore_has_discovery_tables+0x4c0/0x65c
[   17.946305] Modules linked in:
[   17.946305] CPU: 65 PID: 1 Comm: swapper/0 Not tainted
5.4.17-2136.313.1-X10-2c+ #4

It's caused by the broken discovery table of UPI.

The discovery tables are from hardware. Except for dropping the broken
information, there is nothing Linux can do. Using WARN_ON_ONCE() is
overkill.

Use pr_info() instead of WARN_ON_ONCE(), and specify which uncore unit
is dropped and the reason why.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Petlan <mpetlan@redhat.com>
Link: https://lore.kernel.org/r/20230112200105.733466-6-kan.liang@linux.intel.com
2023-01-21 00:06:13 +01:00
Kan Liang
65248a9a9e perf/x86/uncore: Add a quirk for UPI on SPR
The discovery table of UPI on some SPR variants, e.g., MCC, is broken.
The third UPI table may include a wrong address which points to a
non-existent device. The bug impacts both UPI and M3UPI uncore PMON.

Use a pre-defined UPI and M3UPI table to replace the broken table.

Different BIOS may populate a device into a different domain or a
different BUS. The accurate location can only be retrieved at load time.
Add spr_update_device_location() to update the location of the UPI and
M3UPI in the pre-defined table.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Petlan <mpetlan@redhat.com>
Link: https://lore.kernel.org/r/20230112200105.733466-5-kan.liang@linux.intel.com
2023-01-21 00:06:13 +01:00
Kan Liang
bd9514a4d5 perf/x86/uncore: Ignore broken units in discovery table
Some units in a discovery table may be broken, e.g., UPI of SPR MCC.
A generic method is required to ignore the broken units.

Add uncore_units_ignore in the struct intel_uncore_init_fun, which
indicates the type IDs of the broken units. It will be assigned by the
platform-specific code later when the platform has a broken discovery
table.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Petlan <mpetlan@redhat.com>
Link: https://lore.kernel.org/r/20230112200105.733466-4-kan.liang@linux.intel.com
2023-01-21 00:06:12 +01:00
Kan Liang
3af548f236 perf/x86/uncore: Fix potential NULL pointer in uncore_get_alias_name
The current code assumes that the discovery table provides valid
box_ids for the normal units. It's not the case anymore since some units
in the discovery table are broken on some SPR variants.

Factor out uncore_get_box_id(). Check the existence of the type->box_ids
before using it. If it's not available, use pmu_idx.
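
A sketch of the described fallback (the helper's shape is assumed, not
quoted from the tree):

  static int uncore_get_box_id(struct intel_uncore_type *type,
                               struct intel_uncore_pmu *pmu)
  {
          return type->box_ids ? type->box_ids[pmu->pmu_idx] : pmu->pmu_idx;
  }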

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Petlan <mpetlan@redhat.com>
Link: https://lore.kernel.org/r/20230112200105.733466-3-kan.liang@linux.intel.com
2023-01-21 00:06:12 +01:00
Kan Liang
dbf061b262 perf/x86/uncore: Factor out uncore_device_to_die()
The same code is used to retrieve the logical die ID with a given PCI
device in both the discovery code and the code that supports a system
with > 8 nodes.

Factor out uncore_device_to_die() to replace the duplicate code.

No functional change.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Petlan <mpetlan@redhat.com>
Link: https://lore.kernel.org/r/20230112200105.733466-2-kan.liang@linux.intel.com
2023-01-21 00:06:11 +01:00
Ashok Raj
ab31c74455 x86/microcode: Add a parameter to microcode_check() to store CPU capabilities
Add a parameter to store CPU capabilities before performing a microcode
update so that CPU capabilities can be compared before and after update.
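
A sketch of the intended flow (helper names assumed from this series):

  struct cpuinfo_x86 prev_info;

  store_cpu_caps(&prev_info);    /* snapshot CPUID bits before the update */
  /* ... load the microcode ... */
  microcode_check(&prev_info);   /* compare against the new hardware bits */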

  [ bp: Massage. ]

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230109153555.4986-2-ashok.raj@intel.com
2023-01-20 21:45:13 +01:00
Jakub Kicinski
b3c588cd55 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ipa/ipa_interrupt.c
drivers/net/ipa/ipa_interrupt.h
  9ec9b2a308 ("net: ipa: disable ipa interrupt during suspend")
  8e461e1f09 ("net: ipa: introduce ipa_interrupt_enable()")
  d50ed35587 ("net: ipa: enable IPA interrupt handlers separate from registration")
https://lore.kernel.org/all/20230119114125.5182c7ab@canb.auug.org.au/
https://lore.kernel.org/all/79e46152-8043-a512-79d9-c3b905462774@tessares.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-20 12:28:23 -08:00
Li RongQing
716ff71ae2 cpuidle-haltpoll: Replace default_idle() with arch_cpu_idle()
When a KVM guest has MWAIT, mwait_idle() is used as the default idle
function.

However, the cpuidle-haltpoll driver calls default_idle() from
default_enter_idle() directly and that one uses HLT instead of MWAIT,
which may affect performance adversely, because MWAIT is preferred to
HLT as explained by the changelog of commit aebef63cf7 ("x86: Remove
vendor checks from prefer_mwait_c1_over_halt").

Make default_enter_idle() call arch_cpu_idle(), which can use MWAIT,
instead of default_idle() to address this issue.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
[ rjw: Changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-01-20 17:33:52 +01:00
Taehee Yoo
d6b7ec1106 crypto: x86/aria-avx512 - fix build failure with old binutils
The minimum version of binutils for a kernel build is currently 2.23,
which doesn't support GFNI.
So, building aria-avx512 fails if an old binutils is used.
aria-avx512 requires GFNI, so it should not be built in that case.
Add the AS_AVX512 and AS_GFNI symbols to the Kconfig to disable building
aria-avx512 when an old binutils is used.

Fixes: c970d42001 ("crypto: x86/aria - implement aria-avx512")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-01-20 18:29:32 +08:00
Taehee Yoo
1eb468b3c7 crypto: x86/aria-avx2 - fix build failure with old binutils
The minimum version of binutils for a kernel build is currently 2.23,
which doesn't support GFNI.
So, building aria-avx2 fails if an old binutils is used.
The code using GFNI is an optional part of aria-avx2, so disable the
GFNI part when an old binutils is used.

Fixes: 37d8d3ae7a ("crypto: x86/aria - implement aria-avx2")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-01-20 18:29:32 +08:00
Taehee Yoo
e3cf2f8794 crypto: x86/aria-avx - fix build failure with old binutils
The minimum version of binutils for a kernel build is currently 2.23,
which doesn't support GFNI.
So, building aria-avx fails if an old binutils is used.
The code using GFNI is an optional part of aria-avx, so disable the
GFNI part when an old binutils is used.
To check whether the binutils in use supports GFNI, add the AS_GFNI
Kconfig symbol.

Fixes: ba3579e6e4 ("crypto: aria-avx - add AES-NI/AVX/x86_64/GFNI assembler implementation of aria cipher")
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-01-20 18:29:31 +08:00
Paul E. McKenney
344da544f1 x86/nmi: Print reasons why backtrace NMIs are ignored
Instrument nmi_trigger_cpumask_backtrace() to dump out diagnostics based
on evidence accumulated by exc_nmi().  These diagnostics are dumped for
CPUs that ignored an NMI backtrace request for more than 10 seconds.

[ paulmck: Apply Ingo Molnar feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
2023-01-19 15:55:12 -08:00
Paul E. McKenney
1a3ea611fc x86/nmi: Accumulate NMI-progress evidence in exc_nmi()
CPUs ignoring NMIs is often a sign of those CPUs going bad, but there
are quite a few other reasons why a CPU might ignore NMIs.  Therefore,
accumulate evidence within exc_nmi() as to what might be preventing a
given CPU from responding to an NMI.

[ paulmck: Apply Peter Zijlstra feedback. ]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
2023-01-19 15:54:41 -08:00
Juergen Gross
fe0ba8c23f acpi: Fix suspend with Xen PV
Commit f1e5250094 ("x86/boot: Skip realmode init code when running as
Xen PV guest") missed one code path accessing real_mode_header, leading
to dereferencing NULL when suspending the system under Xen:

    [  348.284004] PM: suspend entry (deep)
    [  348.289532] Filesystems sync: 0.005 seconds
    [  348.291545] Freezing user space processes ... (elapsed 0.000 seconds) done.
    [  348.292457] OOM killer disabled.
    [  348.292462] Freezing remaining freezable tasks ... (elapsed 0.104 seconds) done.
    [  348.396612] printk: Suspending console(s) (use no_console_suspend to debug)
    [  348.749228] PM: suspend devices took 0.352 seconds
    [  348.769713] ACPI: EC: interrupt blocked
    [  348.816077] BUG: kernel NULL pointer dereference, address: 000000000000001c
    [  348.816080] #PF: supervisor read access in kernel mode
    [  348.816081] #PF: error_code(0x0000) - not-present page
    [  348.816083] PGD 0 P4D 0
    [  348.816086] Oops: 0000 [#1] PREEMPT SMP NOPTI
    [  348.816089] CPU: 0 PID: 6764 Comm: systemd-sleep Not tainted 6.1.3-1.fc32.qubes.x86_64 #1
    [  348.816092] Hardware name: Star Labs StarBook/StarBook, BIOS 8.01 07/03/2022
    [  348.816093] RIP: e030:acpi_get_wakeup_address+0xc/0x20

Fix that by adding an optional ACPI callback which allows skipping the
setting of the wakeup address, as in the Xen PV case this will be handled
by the hypervisor anyway.

Fixes: f1e5250094 ("x86/boot: Skip realmode init code when running as Xen PV guest")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/all/20230117155724.22940-1-jgross%40suse.com
2023-01-19 13:52:05 -08:00
Yu Zhang
cecafc0a83 KVM: MMU: Make the definition of 'INVALID_GPA' common
KVM already has a 'GPA_INVALID' defined as (~(gpa_t)0) in kvm_types.h,
and it is used by ARM code. We do not need another definition of
'INVALID_GPA' for X86 specifically.

Instead of using the common 'GPA_INVALID' for X86, rename it to
'INVALID_GPA' and change the users of 'GPA_INVALID' so that the diff
can be smaller. The name 'INVALID_GPA' also reads better: it tells the
user we are using an invalid GPA, while the name 'GPA_INVALID' emphasizes
that the GPA is an invalid one.

No functional change intended.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230105130127.866171-1-yu.c.zhang@linux.intel.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-01-19 21:48:38 +00:00
Gustavo A. R. Silva
aa81cb9d97 x86/fpu: Replace zero-length array in struct xregs_state with flexible-array member
Zero-length arrays are deprecated [1] and have to be replaced by C99
flexible-array members.

This helps with the ongoing efforts to tighten the FORTIFY_SOURCE routines
on memcpy() and help to make progress towards globally enabling
-fstrict-flex-arrays=3 [2]
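
Roughly, the change is the classic [0] -> [] conversion on the struct named
in the subject (layout sketched, not quoted):

  struct xregs_state {
          struct fxregs_state     i387;
          struct xstate_header    header;
          u8                      extended_state_area[]; /* was: [0] */
  } __attribute__((packed, aligned(64)));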

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays [1]
Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [2]
Link: https://github.com/KSPP/linux/issues/78
Link: https://lore.kernel.org/r/Y7zCFpa2XNs/o9YQ@work
2023-01-19 21:59:35 +01:00
Nikunj A Dadhania
8c29f01654 x86/sev: Add SEV-SNP guest feature negotiation support
The hypervisor can enable various new features (SEV_FEATURES[1:63]) and start a
SNP guest. Some of these features need guest side implementation. If any of
these features are enabled without it, the behavior of the SNP guest will be
undefined.  It may fail booting in a non-obvious way, making it difficult to
debug.

Instead of allowing the guest to continue and have it fail randomly later,
detect this early and fail gracefully.

The SEV_STATUS MSR indicates features which the hypervisor has enabled.  While
booting, SNP guests should ascertain that all the enabled features have guest
side implementation. In case a feature is not implemented in the guest, the
guest terminates booting with a GHCB protocol Non-Automatic Exit (NAE) termination
request event, see "SEV-ES Guest-Hypervisor Communication Block Standardization"
document (currently at https://developer.amd.com/wp-content/resources/56421.pdf),
section "Termination Request".

Populate SW_EXITINFO2 with mask of unsupported features that the hypervisor can
easily report to the user.

More details in the AMD64 APM Vol 2, Section "SEV_STATUS MSR".

  [ bp:
    - Massage.
    - Move snp_check_features() call to C code.
    Note: the CC:stable@ aspect here is to be able to protect older, stable
    kernels when running on newer hypervisors. Or not "running" but fail
    reliably and in a well-defined manner instead of randomly. ]

Fixes: cbd3d4f7c4 ("x86/sev: Check SEV-SNP features support")
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20230118061943.534309-1-nikunj@amd.com
2023-01-19 17:29:58 +01:00
Mike Kravetz
e9adcfecf5 mm: remove zap_page_range and create zap_vma_pages
zap_page_range was originally designed to unmap pages within an address
range that could span multiple vmas.  While working on [1], it was
discovered that all callers of zap_page_range pass a range entirely within
a single vma.  In addition, the mmu notification call within
zap_page_range does not correctly handle ranges that span multiple vmas.  When
crossing a vma boundary, a new mmu_notifier_range_init/end call pair with
the new vma should be made.

Instead of fixing zap_page_range, do the following:
- Create a new routine zap_vma_pages() that will remove all pages within
  the passed vma.  Most users of zap_page_range pass the entire vma and
  can use this new routine.
- For callers of zap_page_range not passing the entire vma, instead call
  zap_page_range_single().
- Remove zap_page_range.

[1] https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.kravetz@oracle.com/
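
A minimal sketch of the new helper under the semantics described above:

  static inline void zap_vma_pages(struct vm_area_struct *vma)
  {
          zap_page_range_single(vma, vma->vm_start,
                                vma->vm_end - vma->vm_start, NULL);
  }
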
Link: https://lkml.kernel.org/r/20230104002732.232573-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Peter Xu <peterx@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Xu <peterx@redhat.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>	[s390]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:55 -08:00
Peter Xu
f1eb1bacfb mm/uffd: always wr-protect pte in pte|pmd_mkuffd_wp()
This patch is a cleanup to always wr-protect pte/pmd in mkuffd_wp paths.

The reasons I still think this patch is worthwhile, are:

  (1) It is a cleanup already; diffstat tells.

  (2) It just feels natural after I thought about this, if the pte is uffd
      protected, let's remove the write bit no matter what it was.

  (3) Since x86 is the only arch that supports uffd-wp, it also redefines
      pte|pmd_mkuffd_wp() in that it should always contain removals of
      write bits.  It means any future arch that want to implement uffd-wp
      should naturally follow this rule too.  It's good to make it a
      default, even if with vm_page_prot changes on VM_UFFD_WP.

  (4) It covers more than vm_page_prot.  So no chance of any potential
      future "accident" (like pte_mkdirty() sparc64 or loongarch, even
      though it just got its pte_mkdirty fixed <1 month ago).  It'll be
      fairly clear when reading the code too that we don't worry anything
      before a pte_mkuffd_wp() on uncertainty of the write bit.

We may call pte_wrprotect() one more time in some paths (e.g.  thp split),
but that should be fully local bitop instruction so the overhead should be
negligible.
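
On x86 the resulting rule can be sketched as follows (close to, but not
quoting, the patch):

  static inline pte_t pte_mkuffd_wp(pte_t pte)
  {
          /* marking uffd-wp now always implies dropping the write bit */
          return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD_WP));
  }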

Although this patch should logically also fix the recently known issues
with uffd-wp on page migration (not for numa hint recovery - that
may need another explicit pte_wrprotect), this is not the plan for that
fix.  So no Fixes tag, and stable doesn't need this.

Link: https://lkml.kernel.org/r/20221214201533.1774616-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ives van Hoorne <ives@codesandbox.io>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-01-18 17:12:37 -08:00
Kan Liang
5a8a05f165 perf/x86/intel/cstate: Add Emerald Rapids
From the perspective of Intel cstate residency counters,
Emerald Rapids is the same as Sapphire Rapids and Ice Lake.
Add the Emerald Rapids model.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230106160449.3566477-2-kan.liang@linux.intel.com
2023-01-18 12:42:49 +01:00
Kan Liang
6795e558e9 perf/x86/intel: Add Emerald Rapids
From the core PMU's perspective, Emerald Rapids is the same as Sapphire
Rapids. The only difference is the event list, which will be
supported in the perf tool later.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230106160449.3566477-1-kan.liang@linux.intel.com
2023-01-18 12:42:49 +01:00
Guangju Wang[baidu]
59047d942b x86/microcode: Use the DEVICE_ATTR_RO() macro
Use DEVICE_ATTR_RO() helper instead of open-coded DEVICE_ATTR(),
which makes the code a bit shorter and easier to read.

No change in functionality.
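
For illustration, with a hypothetical read-only attribute backed by a
version_show() callback:

  static DEVICE_ATTR(version, 0444, version_show, NULL); /* before */
  static DEVICE_ATTR_RO(version);                        /* after  */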

Signed-off-by: Guangju Wang[baidu] <wgj900@163.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230118023554.1898-1-wgj900@163.com
2023-01-18 12:02:20 +01:00
Namhyung Kim
f6e707156e perf/core: Introduce perf_prepare_header()
Factor out perf_prepare_header() so that it can call
perf_prepare_sample() without a header if not needed.

Also it checks the filtered_sample_type to avoid duplicate
work when perf_prepare_sample() is called twice (or more).

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-8-namhyung@kernel.org
2023-01-18 11:57:20 +01:00
Namhyung Kim
eb55b455ef perf/core: Add perf_sample_save_brstack() helper
When we save the branch stack to the perf sample data, we need to
update the sample flags and the dynamic size.  To make sure this is
done consistently, add the perf_sample_save_brstack() helper and
convert all call sites.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-5-namhyung@kernel.org
2023-01-18 11:57:20 +01:00
Namhyung Kim
0a9081cf0a perf/core: Add perf_sample_save_raw_data() helper
When we save the raw_data to the perf sample data, we need to update
the sample flags and the dynamic size.  To make sure this is done
consistently, add the perf_sample_save_raw_data() helper and convert
all call sites.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-4-namhyung@kernel.org
2023-01-18 11:57:19 +01:00
Namhyung Kim
31046500c1 perf/core: Add perf_sample_save_callchain() helper
When we save the callchain to the perf sample data, we need to update
the sample flags and the dynamic size.  To ensure this is done consistently,
add the perf_sample_save_callchain() helper and convert all call sites.
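
A sketch of what such a helper centralizes (field names assumed from this
series):

  static inline void perf_sample_save_callchain(struct perf_sample_data *data,
                                                struct perf_event *event,
                                                struct pt_regs *regs)
  {
          data->callchain = perf_callchain(event, regs);
          data->dyn_size += (1 + data->callchain->nr) * sizeof(u64);
          data->sample_flags |= PERF_SAMPLE_CALLCHAIN;
  }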

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230118060559.615653-3-namhyung@kernel.org
2023-01-18 11:57:19 +01:00
Ingo Molnar
65adf3a57c Linux 6.2-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPEGkMeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGE4YH/1sxdve3nlWT7eyf
 VDanaCZFc2u6Sk1td23HyhvOVbFjj1E39UokHK9YfzD4hhn+u9SFvDgoXHGhHzYq
 IhDvNM3mLsKpEtjwNbbi/lt6tryJchhZ96O5leRiUxZaw/GBVnHrKGR56vCjC/DB
 zq8td4I+TmTPMpsL3e3WZqJoX3bjTqgZFNShA1mZFYtuQRpAVkJkafMRwuAySIMQ
 +FyK/zZ+MpGJ7y/UmWfaBmS5GtjBz0fva7fTbEkUqk8bDtNUWMBhsTKtgU1LBpBe
 Yrkhs3sdQdpKSm9yNaNrlLhSjFwkusV5kR09p6/5w+wthdPmjnoUZuEnAZpYyydE
 ZY7+Vqk=
 =q9Ac
 -----END PGP SIGNATURE-----

Merge tag 'v6.2-rc4' into perf/core, to pick up fixes

Move from the -rc1 base to the fresher -rc4 kernel that
has various fixes included, before applying a larger
patchset.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-18 11:56:57 +01:00
Jinank Jain
f0d2f5c2c0 x86/hyperv: Add an interface to do nested hypercalls
According to the TLFS, in order to communicate with the L0 hypervisor, an
additional bit needs to be set in the control register. This communication
is required to perform privileged instructions which can only be
performed by the L0 hypervisor. An example of that could be setting up the
VMBus infrastructure.

Signed-off-by: Jinank Jain <jinankjain@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/24f9d46d5259a688113e6e5e69e21002647f4949.1672639707.git.jinankjain@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-01-17 13:37:19 +00:00
Jinank Jain
7fec185a56 Drivers: hv: Setup synic registers in case of nested root partition
Child partitions are free to allocate the SynIC message and event pages, but
the root partition must use the pages allocated by the Microsoft
Hypervisor (MSHV). The base addresses for these pages can be found using
synthetic MSRs exposed by MSHV. There is a slight difference in those MSRs
for a nested vs. a non-nested root partition.

Signed-off-by: Jinank Jain <jinankjain@linux.microsoft.com>
Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/cb951fb1ad6814996fc54f4a255c5841a20a151f.1672639707.git.jinankjain@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-01-17 13:36:43 +00:00
Thomas Gleixner
6c796996ee x86/pci/xen: Fixup fallout from the PCI/MSI overhaul
David reported that the recent PCI/MSI rework results in MSI descriptor
leakage under XEN.

This is caused by:

  1) The missing MSI_FLAG_FREE_MSI_DESCS flag in the XEN MSI domain info,
     which is required now that PCI/MSI delegates descriptor freeing to
     the core MSI code.

  2) Not disassociating the interrupts on teardown, by setting the
     msi_desc::irq to 0. This was not required before because the teardown
     was unconditional and did not check whether a MSI descriptor was still
     connected to a Linux interrupt.

On further inspection it came to light that MSI_FLAG_DEV_SYSFS is also
missing in the XEN MSI domain info; it too is needed to restore the pre-6.2
status quo.

Add the missing MSI flags and disassociate the MSI descriptor from the
Linux interrupt in the XEN specific teardown function.

Fixes: b2bdda205c ("PCI/MSI: Let the MSI core free descriptors")
Fixes: 2f2940d168 ("genirq/msi: Remove filter from msi_free_descs_free_range()")
Fixes: ffd84485e6 ("PCI/MSI: Let the irq code handle sysfs groups")
Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/871qnunycr.ffs@tglx
2023-01-16 20:40:44 +01:00
David Woodhouse
0a3a58de31 x86/pci/xen: Set MSI_FLAG_PCI_MSIX support in Xen MSI domain
The Xen MSI → PIRQ magic does support MSI-X, so advertise it.

(In fact it's better off with MSI-X than MSI, because it's actually
broken by design for 32-bit MSI, since it puts the high bits of the
PIRQ# into the high 32 bits of the MSI message address, instead of the
Extended Destination ID field which is in bits 4-11.

Strictly speaking, this really fixes a much older commit 2e4386eba0
("x86/xen: Wrap XEN MSI management into irqdomain") which failed to set
the flag. But that never really mattered until __pci_enable_msix_range()
started to check and bail out early. So in 6.2-rc we see failures e.g.
to bring up networking on an Amazon EC2 m4.16xlarge instance:

[   41.498694] ena 0000:00:03.0 (unnamed net_device) (uninitialized): Failed to enable MSI-X. irq_cnt -524
[   41.498705] ena 0000:00:03.0: Can not reserve msix vectors
[   41.498712] ena 0000:00:03.0: Failed to enable and set the admin interrupts

Side note: This is the first bug found, and first patch tested, by running
Xen guests under QEMU/KVM instead of running under actual Xen.

Fixes: 99f3d27976 ("PCI/MSI: Reject MSI-X early")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/4bffa69a949bfdc92c4a18e5a1c3cbb3b94a0d32.camel@infradead.org
2023-01-16 20:40:44 +01:00
Thomas Gleixner
5fa5595072 x86/i8259: Mark legacy PIC interrupts with IRQ_LEVEL
Baoquan reported that after triggering a crash the subsequent crash-kernel
fails to boot about half of the time. It triggers a NULL pointer
dereference in the periodic tick code.

This happens because the legacy timer interrupt (IRQ0) is resent in
software which happens in soft interrupt (tasklet) context. In this context
get_irq_regs() returns NULL which leads to the NULL pointer dereference.

The reason for the resend is a spurious APIC interrupt on the IRQ0 vector
which is captured and leads to a resend when the legacy timer interrupt is
enabled. This is wrong because the legacy PIC interrupts are level
triggered and therefore should never be resent in software, but nothing
ever sets the IRQ_LEVEL flag on those interrupts, so the core code does not
know about their trigger type.

Ensure that IRQ_LEVEL is set when the legacy PIC interrupts are set up.
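
A sketch of the fix's shape (its exact placement in the ISA IRQ setup is
assumed here):

  for (i = 0; i < nr_legacy_irqs(); i++)
          irq_set_status_flags(i, IRQ_LEVEL);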

Fixes: a4633adcdb ("[PATCH] genirq: add genirq sw IRQ-retrigger")
Reported-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Baoquan He <bhe@redhat.com>
Link: https://lore.kernel.org/r/87mt6rjrra.ffs@tglx
2023-01-16 17:24:56 +01:00
Yair Podemsky
5f5cc9ed99 x86/aperfmperf: Erase stale arch_freq_scale values when disabling frequency invariance readings
Once disable_freq_invariance_work is called, the scale_freq_tick function
will not compute or update the arch_freq_scale values.
However, the scheduler will still read these values and use them.
The result is that the scheduler might make unfair decisions based on stale
values.

This patch adds the step of setting the arch_freq_scale values for all
CPUs to the default (max) value SCHED_CAPACITY_SCALE. Once all CPUs
have the same arch_freq_scale value, the scaling is meaningless.
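
A sketch of that step (the iteration scope is assumed):

  for_each_possible_cpu(cpu)
          per_cpu(arch_freq_scale, cpu) = SCHED_CAPACITY_SCALE;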

Signed-off-by: Yair Podemsky <ypodemsk@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230110160206.75912-1-ypodemsk@redhat.com
2023-01-16 10:19:15 +01:00
Linus Torvalds
f0f70ddb8f - Make sure the poking PGD is pinned for Xen PV as it requires it this way
- Fixes for two resctrl races when moving a task or creating a new monitoring
   group
 
 - Fix SEV-SNP guests running under HyperV where MTRRs are disabled to not return
    a UC- mapping type on memremap() and thus cause a serious slowdown
 
 - Fix insn mnemonics in bioscall.S now that binutils is starting to fix
   confusing insn suffixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmPD5xsACgkQEsHwGGHe
 VUr65g/8CkfKQKIQ/kPn1B+M/PI4S8DBmz7CdufQTbB66GSfDwRpGnxIKJKZM1UG
 pyOmP1kHVXGGCsvFQimalxnBtFx6t3wFS+R+c/l5pRtgU63bwQpNjJ+vBcBpb4xs
 J7f06VgF6jB8qk0NqXBKNJt4kauZA50wiErYm8PU/yf0tmHratjrvDfKQmus9pI4
 AMcQEudRhDDo8xqn8fvIC/S7xm9/TgNzKxYP+HciPa74HZm5vrjS0hIyZkIsh5d7
 Q4MsP4VaBH12W6MbpF9TBw5VTDomwY8oS4xTNfkhyOTcHi8uLrMjA66VmaJwWXpe
 EZjOk4+KhtxYigaI5oQO9M5e9IINK6ZfbzU65P74wTvP3orszsBPG7QvodIzLc76
 YI1GEea/bXgxYgPuMR6spZoGMS58K7BXgViVMVRwZ9bZk17+5pfbLkzKqZsfw/s/
 nj3n8z4ayoc+ffDPllh0471Dn16ugKMf+EvW2Su1Q1QoaA7icNNkZrKaM6eSuoFr
 bClsrHeglQFadwy4kmY0fi7BUfiKTp+c5Ur7lM7VojrjH0XcoZreThVg0tNrk3ZR
 IAyBjtCdtg1rzrRfb7qBf6B+WNGdxYWGuDgOPiH1Xh+/plvyCqQ2/3AGWQAbu/qP
 to+IYmZg09mRpVBDYpCY4z6IzRbMJ9rRho4rNXCQUeAzSVMCrZE=
 =W2zR
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Make sure the poking PGD is pinned for Xen PV as it requires it this
   way

 - Fixes for two resctrl races when moving a task or creating a new
   monitoring group

 - Fix SEV-SNP guests running under HyperV where MTRRs are disabled to
   not return a UC- mapping type on memremap() and thus cause a
   serious slowdown

 - Fix insn mnemonics in bioscall.S now that binutils is starting to fix
   confusing insn suffixes

* tag 'x86_urgent_for_v6.2_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: fix poking_init() for Xen PV guests
  x86/resctrl: Fix event counts regression in reused RMIDs
  x86/resctrl: Fix task CLOSID/RMID update race
  x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case
  x86/boot: Avoid using Intel mnemonics in AT&T syntax asm
2023-01-15 07:17:44 -06:00
Christophe JAILLET
ef6dfc4b23 x86/signal: Fix the value returned by strict_sas_size()
Functions used with __setup() return 1 when the argument has been
successfully parsed.

Reverse the returned value so that 1 is returned when kstrtobool() is
successful (i.e. returns 0).
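
A sketch of the corrected convention (the name of the flag variable is
assumed):

  static int __init strict_sas_size(char *arg)
  {
          /* __setup() handlers return 1 when the argument was consumed */
          return kstrtobool(arg, &strict_sigaltstack_size) == 0;
  }
  __setup("strict_sas_size", strict_sas_size);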

My understanding of these __setup() functions is that returning 1 or 0
does not change much anyway - so this is more of a cleanup than a
functional fix.

I spotted it and found it spurious while looking at something else.
Even if the output is not perfect, you'll get the idea with:

   $ git grep -B2 -A10 retu.*kstrtobool | grep __setup -B10

Fixes: 3aac3ebea0 ("x86/signal: Implement sigaltstack size validation")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/73882d43ebe420c9d8fb82d0560021722b243000.1673717552.git.christophe.jaillet@wanadoo.fr
2023-01-15 09:54:27 +01:00
Linus Torvalds
9e058c2952 pci-v6.2-fixes-1

Merge tag 'pci-v6.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull pci fixes from Bjorn Helgaas:

 - Work around an apparent firmware issue that made Linux reject
   MMCONFIG space, which broke PCI extended config space (Bjorn Helgaas)

 - Fix CONFIG_PCIE_BT1 dependency due to a mid-air collision between a
   PCI_MSI_IRQ_DOMAIN -> PCI_MSI change and the addition of PCIE_BT1
   (Lukas Bulwahn)

* tag 'pci-v6.2-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space
  x86/pci: Simplify is_mmconf_reserved() messages
  PCI: dwc: Adjust to recent removal of PCI_MSI_IRQ_DOMAIN
2023-01-13 17:32:22 -06:00
Linus Torvalds
92783a90bc ARM:

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Fix the PMCR_EL0 reset value after the PMU rework

   - Correctly handle an S2 fault triggered by an S1 page table walk by
     not always classifying it as a write, as this breaks on R/O memslots

   - Document why we cannot exit with KVM_EXIT_MMIO when taking a write
     fault from an S1 PTW on an R/O memslot

   - Put the Apple M2 on the naughty list for not being able to
     correctly implement the vgic SEIS feature, just like the M1 before
     it

   - Reviewer updates: Alex is stepping down, replaced by Zenghui

  x86:

   - Fix various rare locking issues in Xen emulation and teach lockdep
     to detect them

   - Documentation improvements

   - Do not return host topology information from KVM_GET_SUPPORTED_CPUID"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock leaf node lock
  KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering rule
  KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest()
  KVM: x86/xen: Fix lockdep warning on "recursive" gpc locking
  Documentation: kvm: fix SRCU locking order docs
  KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID
  KVM: nSVM: clarify recalc_intercepts() wrt CR8
  MAINTAINERS: Remove myself as a KVM/arm64 reviewer
  MAINTAINERS: Add Zenghui Yu as a KVM/arm64 reviewer
  KVM: arm64: vgic: Add Apple M2 cpus to the list of broken SEIS implementations
  KVM: arm64: Convert FSC_* over to ESR_ELx_FSC_*
  KVM: arm64: Document the behaviour of S1PTW faults on RO memslots
  KVM: arm64: Fix S1PTW handling on RO memslots
  KVM: arm64: PMU: Fix PMCR_EL0 reset value
2023-01-13 14:41:50 -06:00
Bjorn Helgaas
fd3a8cff4d x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space
Normally we reject ECAM space unless it is reported as reserved in the E820
table or via a PNP0C02 _CRS method (PCI Firmware, r3.3, sec 4.1.2).

07eab0901e ("efi/x86: Remove EfiMemoryMappedIO from E820 map"), removes
E820 entries that correspond to EfiMemoryMappedIO regions because some
other firmware uses EfiMemoryMappedIO for PCI host bridge windows, and the
E820 entries prevent Linux from allocating BAR space for hot-added devices.

Some firmware doesn't report ECAM space via PNP0C02 _CRS methods, but does
mention it as an EfiMemoryMappedIO region via EFI GetMemoryMap(), which is
normally converted to an E820 entry by a bootloader or EFI stub.  After
07eab0901e, that E820 entry is removed, so we reject this ECAM space,
which makes PCI extended config space (offsets 0x100-0xfff) inaccessible.

The lack of extended config space breaks anything that relies on it,
including perf, VSEC telemetry, EDAC, QAT, SR-IOV, etc.

Allow use of ECAM for extended config space when the region is covered by
an EfiMemoryMappedIO region, even if it's not included in E820 or PNP0C02
_CRS.
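
A minimal sketch of the extra acceptance check (illustrative only: it
tests just the start of the range, whereas the real code walks the EFI
memory map to ensure the whole ECAM region is covered):

  static bool ecam_covered_by_efi_mmio(u64 start)
  {
          /* Firmware typed the range EfiMemoryMappedIO, so treat it
           * as a reservation of the ECAM space. */
          return efi_mem_type(start) == EFI_MEMORY_MAPPED_IO;
  }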

Link: https://lore.kernel.org/r/ac2693d8-8ba3-72e0-5b66-b3ae008d539d@linux.intel.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216891
Fixes: 07eab0901e ("efi/x86: Remove EfiMemoryMappedIO from E820 map")
Link: https://lore.kernel.org/r/20230110180243.1590045-3-helgaas@kernel.org
Reported-by: Kan Liang <kan.liang@linux.intel.com>
Reported-by: Tony Luck <tony.luck@intel.com>
Reported-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reported-by: Yunying Sun <yunying.sun@intel.com>
Reported-by: Baowen Zheng <baowen.zheng@corigine.com>
Reported-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reported-by: Yang Lixiao <lixiao.yang@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Yunying Sun <yunying.sun@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael@kernel.org>
2023-01-13 11:53:54 -06:00
Sean Christopherson
72c70ceeaf KVM: x86: Add helpers to recalc physical vs. logical optimized APIC maps
Move the guts of kvm_recalculate_apic_map()'s main loop to two separate
helpers to handle recalculating the physical and logical pieces of the
optimized map.  Having 100+ lines of code in the for-loop makes it hard
to understand what is being calculated where.
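
With the split, the loop body reduces to roughly the following (a
sketch; the helper names are assumptions in the spirit of this series):

  struct kvm_vcpu *vcpu;
  unsigned long i;

  kvm_for_each_vcpu(i, vcpu, kvm) {
          if (!kvm_apic_present(vcpu))
                  continue;

          kvm_recalculate_phys_map(new, vcpu, &xapic_id_mismatch);
          kvm_recalculate_logical_map(new, vcpu);
  }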

No functional change intended.

Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-34-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:35 -05:00
Greg Edwards
d471bd853d KVM: x86: Allow APICv APIC ID inhibit to be cleared
Legacy kernels prior to commit 4399c03c67 ("x86/apic: Remove
verify_local_APIC()") write the APIC ID of the boot CPU twice to verify
a functioning local APIC.  This results in APIC acceleration inhibited
on these kernels for reason APICV_INHIBIT_REASON_APIC_ID_MODIFIED.

Allow the APICV_INHIBIT_REASON_APIC_ID_MODIFIED inhibit reason to be
cleared if/when all APICs in xAPIC mode set their APIC ID back to the
expected vcpu_id value.

Fold the functionality previously in kvm_lapic_xapic_id_updated() into
kvm_recalculate_apic_map(), as this allows examining all APICs in one
pass.
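
Doing the check during recalculation means the inhibit can be set or
cleared in one spot, e.g. (sketch; the bool tracking the mismatch is an
assumption):

  kvm_set_or_clear_apicv_inhibit(kvm,
                                 APICV_INHIBIT_REASON_APIC_ID_MODIFIED,
                                 xapic_id_mismatch);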

Fixes: 3743c2f025 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base")
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Link: https://lore.kernel.org/r/20221117183247.94314-1-gedwards@ddn.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-33-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:35 -05:00
Sean Christopherson
b3f257a846 KVM: x86: Track required APICv inhibits with variable, not callback
Track the per-vendor required APICv inhibits with a variable instead of
calling into vendor code every time KVM wants to query the set of
required inhibits.  The required inhibits are a property of the vendor's
virtualization architecture, i.e. are 100% static.

Using a variable allows the compiler to inline the check, e.g. generate
a single-uop TEST+Jcc, and thus eliminates any desire to avoid checking
inhibits for performance reasons.
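
Roughly (a sketch of the shape, not the literal diff):

  struct kvm_x86_ops {
          /* ... */
          const unsigned long required_apicv_inhibits;
  };

  /* The query compiles down to a TEST against a constant mask: */
  if (kvm_x86_ops.required_apicv_inhibits & BIT(reason))
          return true;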

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-32-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:34 -05:00
Sean Christopherson
e2ed3e64a2 Revert "KVM: SVM: Do not throw warning when calling avic_vcpu_load on a running vcpu"
Turns out that some warnings exist for good reasons.  Restore the warning
in avic_vcpu_load() that guards against calling avic_vcpu_load() on a
running vCPU now that KVM avoids doing so when switching between x2APIC
and xAPIC.  The entire point of the WARN is to highlight that KVM should
not be reloading an AVIC.

Opportunistically convert the WARN_ON() to WARN_ON_ONCE() to avoid
spamming the kernel if it does fire.
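
For reference, the two macros differ only in rate limiting:

  WARN_ON(cond);       /* warns every time cond evaluates true */
  WARN_ON_ONCE(cond);  /* warns on the first occurrence, then stays quiet */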

This reverts commit c0caeee65a.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-31-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:34 -05:00
Sean Christopherson
a790e338c7 KVM: SVM: Ignore writes to Remote Read Data on AVIC write traps
Drop writes to APIC_RRR, a.k.a. Remote Read Data Register, on AVIC
unaccelerated write traps.  The register is read-only and isn't emulated
by KVM.  Sending the register through kvm_apic_write_nodecode() will
result in screaming when x2APIC is enabled due to the unexpected failure
to retrieve the MSR (KVM expects that only "legal" accesses will trap).
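
Conceptually, the trap handler grows an early out along these lines
(sketch):

  case APIC_RRR:
          /* Remote Read Data Register is read-only; drop the write
           * instead of forwarding it to kvm_apic_write_nodecode(). */
          break;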

Fixes: 4d1d7942e3 ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230106011306.85230-30-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:34 -05:00
Sean Christopherson
bbfc7aa62a KVM: SVM: Handle multiple logical targets in AVIC kick fastpath
Iterate over all target logical IDs in the AVIC kick fastpath instead of
bailing if there is more than one target.  Now that KVM inhibits AVIC if
vCPUs aren't mapped 1:1 with logical IDs, each bit in the destination is
guaranteed to match at most one vCPU, i.e. iterating over the bitmap is
guaranteed to kick each valid target exactly once.
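
A sketch of the fastpath iteration, assuming a 16-bit per-cluster
bitmap and a hypothetical cluster[] lookup (the real kick helper
differs):

  unsigned long dest = logical_id_bitmap;
  unsigned int bit;

  for_each_set_bit(bit, &dest, 16) {
          struct kvm_vcpu *target = cluster[bit];  /* at most one vCPU */

          if (target)
                  kvm_vcpu_wake_up(target);
  }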

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-29-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:33 -05:00
Sean Christopherson
1808c95095 KVM: SVM: Require logical ID to be power-of-2 for AVIC entry
Do not modify AVIC's logical ID table if the logical ID portion of the
LDR is not a power-of-2, i.e. if the LDR has multiple bits set.  Taking
only the first bit means that KVM will fail to match MDAs that intersect
with "higher" bits in the "ID"

The "ID" acts as a bitmap, but is referred to as an ID because there's an
implicit, unenforced "requirement" that software only set one bit.  This
edge case is arguably out-of-spec behavior, but KVM cleanly handles it
in all other cases, e.g. the optimized logical map (and AVIC!) is also
disabled in this scenario.
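
The guard itself is a one-liner, roughly (assuming the xAPIC LDR
layout; note that is_power_of_2(0) is false, so the zero case is
covered too):

  u32 logical_id = GET_APIC_LOGICAL_ID(ldr);

  if (!is_power_of_2(logical_id))
          return;  /* zero or multiple bits set: leave the table alone */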

Refactor the code to consolidate the checks, and so that the code looks
more like avic_kick_target_vcpus_fast().

Fixes: 18f40c53e1 ("svm: Add VMEXIT handlers for AVIC")
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-28-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:33 -05:00
Sean Christopherson
4f160b7bd4 KVM: SVM: Update svm->ldr_reg cache even if LDR is "bad"
Update SVM's cache of the LDR even if the new value is "bad".  Leaving
stale information in the cache can result in KVM missing updates and/or
invalidating the wrong entry, e.g. if avic_invalidate_logical_id_entry()
is triggered after a different vCPU has "claimed" the old LDR.
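
In other words, cache first and only then decide whether the value is
usable (sketch; both helpers here are hypothetical names):

  svm->ldr_reg = ldr;                  /* cache even a "bad" value */
  if (ldr_is_usable(ldr))              /* hypothetical validity check */
          avic_update_logical_entry(svm, ldr);  /* hypothetical */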

Fixes: 18f40c53e1 ("svm: Add VMEXIT handlers for AVIC")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-27-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:32 -05:00
Sean Christopherson
1ba59a4454 KVM: SVM: Always update local APIC on writes to logical dest register
Update the vCPU's local (virtual) APIC on LDR writes even if the write
"fails".  The APIC needs to recalc the optimized logical map even if the
LDR is invalid or zero, e.g. if the guest clears its LDR, the optimized
map will be left as is and the vCPU will receive interrupts using its
old LDR.
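
Conceptually (a sketch of the ordering, not the literal diff):

  /* The vendor-specific update may reject the new LDR... */
  ret = avic_handle_ldr_update(vcpu);
  /* ...but the local APIC must still record it so the optimized
   * logical map is recalculated against the current value. */
  kvm_apic_set_ldr(apic, ldr);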

Fixes: 18f40c53e1 ("svm: Add VMEXIT handlers for AVIC")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-26-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:32 -05:00
Sean Christopherson
9a364857ab KVM: SVM: Inhibit AVIC if vCPUs are aliased in logical mode
Inhibit SVM's AVIC if multiple vCPUs are aliased to the same logical ID.
Architecturally, all CPUs whose logical ID matches the MDA are supposed
to receive the interrupt; overwriting existing entries in AVIC's
logical=>physical map can result in missed IPIs.

Fixes: 18f40c53e1 ("svm: Add VMEXIT handlers for AVIC")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-25-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:31 -05:00
Sean Christopherson
5063c41beb KVM: x86: Inhibit APICv/AVIC if the optimized physical map is disabled
Inhibit APICv/AVIC if the optimized physical map is disabled so that KVM
provides consistent APIC behavior if xAPIC IDs are aliased due to
vcpu_id being truncated and the x2APIC hotplug hack isn't enabled.  If
the hotplug hack is disabled, events that are emulated by KVM will follow
architectural behavior (all matching vCPUs receive events, even if the
"match" is due to truncation), whereas APICv and AVIC will deliver events
only to the first matching vCPU, i.e. the vCPU that matches without
truncation.

Note, the "extra" inhibit is needed because  KVM deliberately ignores
mismatches due to truncation when applying the APIC_ID_MODIFIED inhibit
so that large VMs (>255 vCPUs) can run with APICv/AVIC.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-24-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:31 -05:00
Sean Christopherson
5b84b02917 KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs
Apply KVM's hotplug hack if and only if userspace has enabled 32-bit IDs
for x2APIC.  If 32-bit IDs are not enabled, disable the optimized map to
honor x86 architectural behavior if multiple vCPUs share a physical APIC
ID.  As called out in the changelog that added the hack, all CPUs whose
(possibly truncated) APIC ID matches the target are supposed to receive
the IPI.

  KVM intentionally differs from real hardware, because real hardware
  (Knights Landing) does just "x2apic_id & 0xff" to decide whether to
  accept the interrupt in xAPIC mode and it can deliver one interrupt to
  more than one physical destination, e.g. 0x123 to 0x123 and 0x23.
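
Concretely, with 8-bit xAPIC destinations:

  0x123 & 0xff = 0x23

  => an IPI to xAPIC destination 0x23 architecturally matches both the
     CPU with APIC ID 0x23 and the one whose x2APIC ID 0x123 truncates
     to 0x23.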

Applying the hack even when x2APIC is not fully enabled means KVM doesn't
correctly handle scenarios where the guest has aliased xAPIC IDs across
multiple vCPUs, as only the vCPU with the lowest vCPU ID will receive any
interrupts.  It's extremely unlikely any real world guest aliases APIC
IDs, or even modifies APIC IDs, but KVM's behavior is arbitrary, e.g. the
lowest vCPU ID "wins" regardless of which vCPU is "aliasing" and which
vCPU is "normal".

Furthermore, the hack is _not_ guaranteed to work!  The hack works if and
only if the optimized APIC map is successfully allocated.  If the map
allocation fails (unlikely), KVM will fall back to its unoptimized
behavior, which _does_ honor the architectural behavior.

Pivot on 32-bit x2APIC IDs being enabled as that is required to take
advantage of the hotplug hack (see kvm_apic_state_fixup()), i.e. won't
break existing setups unless they are way, way off in the weeds.

Add an entry to KVM's errata to document the hack.  Alternatively, KVM
could provide an actual x2APIC quirk and document the hack that way, but
there's unlikely to ever be a use case for disabling the quirk.  Go the
errata route to avoid having to validate a quirk no one cares about.

Fixes: 5bd5db385b ("KVM: x86: allow hotplug of VCPU with APIC ID over 0xff")
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:30 -05:00
Sean Christopherson
2970052481 KVM: x86: Disable APIC logical map if vCPUs are aliased in logical mode
Disable the optimized APIC logical map if multiple vCPUs are aliased to
the same logical ID.  Architecturally, all CPUs whose logical ID matches
the MDA are supposed to receive the interrupt; overwriting existing map
entries can result in missed IPIs.

Fixes: 1e08ec4a13 ("KVM: optimize apic interrupt delivery")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230106011306.85230-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:30 -05:00
Sean Christopherson
2bf934aadc KVM: x86: Disable APIC logical map if logical ID covers multiple MDAs
Disable the optimized APIC logical map if a logical ID covers multiple
MDAs, i.e. if a vCPU has multiple bits set in its ID.  In logical mode,
events match if "ID & MDA != 0", i.e. creating an entry for only the
first bit can cause interrupts to be missed.
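
Worked example:

  ID  = 0b0011
  ID & 0b0001 != 0  -> match
  ID & 0b0010 != 0  -> match, but missed if only bit 0 got an entry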

Note, creating an entry for every bit is also wrong as KVM would generate
IPIs for every matching bit.  It would be possible to teach KVM to play
nice with this edge case, but it is very much an edge case and probably
not used in any real world OS, i.e. it's not worth optimizing.

Fixes: 1e08ec4a13 ("KVM: optimize apic interrupt delivery")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230106011306.85230-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:30 -05:00
Sean Christopherson
76e527509d KVM: x86: Skip redundant x2APIC logical mode optimized cluster setup
Skip the optimized cluster[] setup for x2APIC logical mode, as KVM reuses
the optimized map's phys_map[] and doesn't actually need to insert the
target apic into the cluster[].  The LDR is derived from the x2APIC ID,
and both are read-only in KVM, thus the vCPU's cluster[ldr] is guaranteed
to be the same entry as the vCPU's phys_map[x2apic_id] entry.
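
For reference, the architectural derivation (this is what KVM's
kvm_apic_calc_x2apic_ldr() computes):

  /* LDR[31:16] = cluster = ID >> 4, LDR[15:0] = 1 << (ID & 0xf) */
  u32 ldr = ((x2apic_id >> 4) << 16) | (1 << (x2apic_id & 0xf));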

Skipping the unnecessary setup will allow a future fix for aliased xAPIC
logical IDs to simply require that cluster[ldr] is non-NULL, i.e. won't
have to special case x2APIC.

Alternatively, the future check could allow "cluster[ldr] == apic", but
that ends up being terribly confusing because cluster[ldr] is only set
at the very end, i.e. it's only possible due to x2APIC's shenanigans.

Another alternative would be to send x2APIC down a separate path _after_
the calculation and then assert that all of the above, but the resulting
code is rather messy, and it's arguably unnecessary since asserting that
the actual LDR matches the expected LDR means that simply testing that
interrupts are delivered correctly provides the same guarantees.

Reported-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:29 -05:00
Sean Christopherson
3536690101 KVM: x86: Explicitly track all possibilities for APIC map's logical modes
Track all possibilities for the optimized APIC map's logical modes
instead of overloading the pseudo-bitmap and treating any "unknown" value
as "invalid".

As documented by the now-stale comment above the mode values, the values
did have meaning when the optimized map was originally added.  That
dependent logic was removed by commit e45115b62f ("KVM: x86: use
physical LAPIC array for logical x2APIC"), but the obfuscated behavior
and its comment were left behind.

Opportunistically rename "mode" to "logical_mode", partly to make it
clear that the "disabled" case applies only to the logical map, but also
to prove that there is no lurking code that expects "mode" to be a bitmap.
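
The tracked states end up looking roughly like (a sketch of the enum's
shape; exact names may differ):

  enum kvm_apic_logical_mode {
          KVM_APIC_MODE_SW_DISABLED,
          KVM_APIC_MODE_XAPIC_CLUSTER,
          KVM_APIC_MODE_XAPIC_FLAT,
          KVM_APIC_MODE_X2APIC,
          KVM_APIC_MODE_MAP_DISABLED,
  };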

Functionally, this is a glorified nop.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230106011306.85230-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:29 -05:00
Sean Christopherson
6ea567ca00 KVM: x86: Explicitly skip optimized logical map setup if vCPU's LDR==0
Explicitly skip the optimized map setup if the vCPU's LDR is '0', i.e. if
the vCPU will never respond to logical mode interrupts.  KVM already
skips setup in this case, but relies on kvm_apic_map_get_logical_dest()
to generate mask==0.  KVM still needs the mask==0 check as a non-zero LDR
can yield mask==0 depending on the mode, but explicitly handling the LDR
will make it simpler to clean up the logical mode tracking in the future.
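
I.e. roughly:

  if (!ldr)
          return;  /* a zero LDR never matches a logical-mode MDA */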

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230106011306.85230-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-13 10:45:28 -05:00