Commit graph

12823 commits

Author SHA1 Message Date
Vitaly Kuznetsov
b4f69df0f6 KVM: x86: Make Hyper-V emulation optional
Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be
desirable to not compile it in to reduce module sizes as well as the attack
surface. Introduce CONFIG_KVM_HYPERV option to make it possible.

Note, there's room for further nVMX/nSVM code optimizations when
!CONFIG_KVM_HYPERV, this will be done in follow-up patches.

Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files
are grouped together.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:57 -08:00
Vitaly Kuznetsov
cfef5af3cb KVM: x86: Move Hyper-V partition assist page out of Hyper-V emulation context
Hyper-V partition assist page is used when KVM runs on top of Hyper-V and
is not used for Windows/Hyper-V guests on KVM, this means that 'hv_pa_pg'
placement in 'struct kvm_hv' is unfortunate. As a preparation to making
Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it
under CONFIG_HYPERV.

While on it, introduce hv_get_partition_assist_page() helper to allocate
partition assist page. Move the comment explaining why we use a single page
for all vCPUs from VMX and expand it a bit.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:34:01 -08:00
Vitaly Kuznetsov
87562052c9 KVM: x86/xen: Remove unneeded xen context from kvm_arch when !CONFIG_KVM_XEN
Saving a few bytes of memory per KVM VM is certainly great but what's more
important is the ability to see where the code accesses Xen emulation
context while CONFIG_KVM_XEN is not enabled. Currently, kvm_cpu_get_extint()
is the only such place and it is harmless: kvm_xen_has_interrupt() always
returns '0' when !CONFIG_KVM_XEN.

No functional change intended.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-2-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-07 09:33:42 -08:00
James Morse
5b95f94c3b x86/topology: Switch over to GENERIC_CPU_DEVICES
Now that GENERIC_CPU_DEVICES calls arch_register_cpu(), which can be
overridden by the arch code, switch over to this to allow common code
to choose when the register_cpu() call is made.

x86's struct cpus come from struct x86_cpu, which has no other members
or users. Remove this and use the version defined by common code.

This is an intermediate step to the logic being moved to drivers/acpi,
where GENERIC_CPU_DEVICES will do the work when booting with acpi=off.

This patch also has the effect of moving the registration of CPUs from
subsys to driver core initialisation, prior to any initcalls running.

----
Changes since RFC:
 * Fixed the second copy of arch_register_cpu() used for non-hotplug
Changes since RFC v2:
 * Remove duplicate of the weak generic arch_register_cpu(), spotted
   by Jonathan Cameron. Add note about initialisation order change.
Changes since RFC v3:
 * Adapt to removal of EXPORT_SYMBOL()s

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1r5R3l-00Cszm-UA@rmk-PC.armlinux.org.uk
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-06 12:41:49 +09:00
Paolo Bonzini
250ce1b4d2 KVM: x86/mmu: always take tdp_mmu_pages_lock
It is cheap to take tdp_mmu_pages_lock in all write-side critical sections.
We already do it all the time when zapping with read_lock(), so it is not
a problem to do it from the kvm_tdp_mmu_zap_all() path (aka
kvm_arch_flush_shadow_all(), aka VM destruction and MMU notifier release).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-4-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 07:52:08 -08:00
Sean Christopherson
fd89499a51 KVM: x86/pmu: Track emulated counter events instead of previous counter
Explicitly track emulated counter events instead of using the common
counter value that's shared with the hardware counter owned by perf.
Bumping the common counter requires snapshotting the pre-increment value
in order to detect overflow from emulation, and the snapshot approach is
inherently flawed.

Snapshotting the previous counter at every increment assumes that there is
at most one emulated counter event per emulated instruction (or rather,
between checks for KVM_REQ_PMU).  That's mostly holds true today because
KVM only emulates (branch) instructions retired, but the approach will
fall apart if KVM ever supports event types that don't have a 1:1
relationship with instructions.

And KVM already has a relevant bug, as handle_invalid_guest_state()
emulates multiple instructions without checking KVM_REQ_PMU, i.e. could
miss an overflow event due to clobbering pmc->prev_counter.  Not checking
KVM_REQ_PMU is problematic in both cases, but at least with the emulated
counter approach, the resulting behavior is delayed overflow detection,
as opposed to completely lost detection.

Tracking the emulated count fixes another bug where the snapshot approach
can signal spurious overflow due to incorporating both the emulated count
and perf's count in the check, i.e. if overflow is detected by perf, then
KVM's emulation will also incorrectly signal overflow.  Add a comment in
the related code to call out the need to process emulated events *after*
pausing the perf event (big kudos to Mingwei for figuring out that
particular wrinkle).

Cc: Mingwei Zhang <mizhang@google.com>
Cc: Roman Kagan <rkagan@amazon.de>
Cc: Jim Mattson <jmattson@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231103230541.352265-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:55 -08:00
Sean Christopherson
cbb359d81a KVM: x86/pmu: Move PMU reset logic to common x86 code
Move the common (or at least "ignored") aspects of resetting the vPMU to
common x86 code, along with the stop/release helpers that are no used only
by the common pmu.c.

There is no need to manually handle fixed counters as all_valid_pmc_idx
tracks both fixed and general purpose counters, and resetting the vPMU is
far from a hot path, i.e. the extra bit of overhead to the PMC from the
index is a non-issue.

Zero fixed_ctr_ctrl in common code even though it's Intel specific.
Ensuring it's zero doesn't harm AMD/SVM in any way, and stopping the fixed
counters via all_valid_pmc_idx, but not clearing the associated control
bits, would be odd/confusing.

Make the .reset() hook optional as SVM no longer needs vendor specific
handling.

Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 12:52:54 -08:00
Sean Christopherson
0277022a77 KVM: x86/mmu: Declare flush_remote_tlbs{_range}() hooks iff HYPERV!=n
Declare the kvm_x86_ops hooks used to wire up paravirt TLB flushes when
running under Hyper-V if and only if CONFIG_HYPERV!=n.  Wrapping yet more
code with IS_ENABLED(CONFIG_HYPERV) eliminates a handful of conditional
branches, and makes it super obvious why the hooks *might* be valid.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231018192325.1893896-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-29 14:44:47 -08:00
Peter Zijlstra
edc8fc01f6 x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram
intel_idle_irq() re-enables IRQs very early. As a result, an interrupt
may fire before mwait() is eventually called. If such an interrupt queues
a timer, it may go unnoticed until mwait returns and the idle loop
handles the tick re-evaluation. And monitoring TIF_NEED_RESCHED doesn't
help because a local timer enqueue doesn't set that flag.

The issue is mitigated by the fact that this idle handler is only invoked
for shallow C-states when, presumably, the next tick is supposed to be
close enough. There may still be rare cases though when the next tick
is far away and the selected C-state is shallow, resulting in a timer
getting ignored for a while.

Fix this with using sti_mwait() whose IRQ-reenablement only triggers
upon calling mwait(), dealing with the race while keeping the interrupt
latency within acceptable bounds.

Fixes: c227233ad6 (intel_idle: enable interrupts before C1 on Xeons)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231115151325.6262-3-frederic@kernel.org
2023-11-29 15:44:01 +01:00
Frederic Weisbecker
7d09a052a3 x86: Add a comment about the "magic" behind shadow sti before mwait
Add a note to make sure we never miss and break the requirements behind
it.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231115151325.6262-2-frederic@kernel.org
2023-11-29 15:44:01 +01:00
Borislav Petkov (AMD)
30fa92832f x86/CPU/AMD: Add ZenX generations flags
Add X86_FEATURE flags for each Zen generation. They should be used from
now on instead of checking f/m/s.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/r/20231120104152.13740-2-bp@alien8.de
2023-11-29 12:11:01 +01:00
Robert Hoo
93d1c9f498 KVM: x86: Virtualize LAM for supervisor pointer
Add support to allow guests to set the new CR4 control bit for LAM and add
implementation to get untagged address for supervisor pointers.

LAM modifies the canonicality check applied to 64-bit linear addresses for
data accesses, allowing software to use of the untranslated address bits for
metadata and masks the metadata bits before using them as linear addresses
to access memory. LAM uses CR4.LAM_SUP (bit 28) to configure and enable LAM
for supervisor pointers. It also changes VMENTER to allow the bit to be set
in VMCS's HOST_CR4 and GUEST_CR4 to support virtualization. Note CR4.LAM_SUP
is allowed to be set even not in 64-bit mode, but it will not take effect
since LAM only applies to 64-bit linear addresses.

Move CR4.LAM_SUP out of CR4_RESERVED_BITS, its reservation depends on vcpu
supporting LAM or not. Leave it intercepted to prevent guest from setting
the bit if LAM is not exposed to guest as well as to avoid vmread every time
when KVM fetches its value, with the expectation that guest won't toggle the
bit frequently.

Set CR4.LAM_SUP bit in the emulated IA32_VMX_CR4_FIXED1 MSR for guests to
allow guests to enable LAM for supervisor pointers in nested VMX operation.

Hardware is not required to do TLB flush when CR4.LAM_SUP toggled, KVM
doesn't need to emulate TLB flush based on it.  There's no other features
or vmx_exec_controls connection, and no other code needed in
{kvm,vmx}_set_cr4().

Skip address untag for instruction fetches (which includes branch targets),
operand of INVLPG instructions, and implicit system accesses, all of which
are not subject to untagging.  Note, get_untagged_addr() isn't invoked for
implicit system accesses as there is no reason to do so, but check the
flag anyways for documentation purposes.

Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-11-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:07 -08:00
Binbin Wu
37a41847b7 KVM: x86: Introduce get_untagged_addr() in kvm_x86_ops and call it in emulator
Introduce a new interface get_untagged_addr() to kvm_x86_ops to untag
the metadata from linear address.  Call the interface in linearization
of instruction emulator for 64-bit mode.

When enabled feature like Intel Linear Address Masking (LAM) or AMD Upper
Address Ignore (UAI), linear addresses may be tagged with metadata that
needs to be dropped prior to canonicality checks, i.e. the metadata is
ignored.

Introduce get_untagged_addr() to kvm_x86_ops to hide the vendor specific
code, as sadly LAM and UAI have different semantics.  Pass the emulator
flags to allow vendor specific implementation to precisely identify the
access type (LAM doesn't untag certain accesses).

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-9-binbin.wu@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28 17:54:06 -08:00
Muralidhara M K
47b744ea5e x86/MCE/AMD: Add new MA_LLC, USR_DP, and USR_CP bank types
Add HWID and McaType values for new SMCA bank types.

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231102114225.2006878-3-muralimk@amd.com
2023-11-28 16:26:55 +01:00
Yazen Ghannam
ff03ff328f x86/mce/amd, EDAC/mce_amd: Move long names to decoder module
The long names of the SMCA banks are only used by the MCE decoder
module.

Move them out of the arch code and into the decoder module.

  [ bp: Name the long names array "smca_long_names", drop local ptr in
    decode_smca_error(), constify arrays. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231118193248.1296798-5-yazen.ghannam@amd.com
2023-11-27 12:16:51 +01:00
Arnd Bergmann
42874e4eb3 arch: vdso: consolidate gettime prototypes
The VDSO functions are defined as globals in the kernel sources but intended
to be called from userspace, so there is no need to declare them in a kernel
side header.

Without a prototype, this now causes warnings such as

arch/mips/vdso/vgettimeofday.c:14:5: error: no previous prototype for '__vdso_clock_gettime' [-Werror=missing-prototypes]
arch/mips/vdso/vgettimeofday.c:28:5: error: no previous prototype for '__vdso_gettimeofday' [-Werror=missing-prototypes]
arch/mips/vdso/vgettimeofday.c:36:5: error: no previous prototype for '__vdso_clock_getres' [-Werror=missing-prototypes]
arch/mips/vdso/vgettimeofday.c:42:5: error: no previous prototype for '__vdso_clock_gettime64' [-Werror=missing-prototypes]
arch/sparc/vdso/vclock_gettime.c:254:1: error: no previous prototype for '__vdso_clock_gettime' [-Werror=missing-prototypes]
arch/sparc/vdso/vclock_gettime.c:282:1: error: no previous prototype for '__vdso_clock_gettime_stick' [-Werror=missing-prototypes]
arch/sparc/vdso/vclock_gettime.c:307:1: error: no previous prototype for '__vdso_gettimeofday' [-Werror=missing-prototypes]
arch/sparc/vdso/vclock_gettime.c:343:1: error: no previous prototype for '__vdso_gettimeofday_stick' [-Werror=missing-prototypes]

Most architectures have already added workarounds for these by adding
declarations somewhere, but since these are all compatible, we should
really just have one copy, with an #ifdef check for the 32-bit vs
64-bit variant and use that everywhere.

Unfortunately, the sparc an um versions are currently incompatible
since they never added support for __vdso_clock_gettime64() in 32-bit
userland. For the moment, I'm leaving this one out, as I can't
easily test it and it requires a larger rework.

Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-11-23 11:32:32 +01:00
Arnd Bergmann
f717a8d164 arch: include linux/cpu.h for trap_init() prototype
some architectures run into a -Wmissing-prototypes warning
for trap_init()

arch/microblaze/kernel/traps.c:21:6: warning: no previous prototype for 'trap_init' [-Wmissing-prototypes]

Include the right header to avoid this consistently, removing
the extra declarations on m68k and x86 that were added as local
workarounds already.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-11-23 11:32:31 +01:00
Arnd Bergmann
64bac5ea17 arch: consolidate arch_irq_work_raise prototypes
The prototype was hidden in an #ifdef on x86, which causes a warning:

kernel/irq_work.c:72:13: error: no previous prototype for 'arch_irq_work_raise' [-Werror=missing-prototypes]

Some architectures have a working prototype, while others don't.
Fix this by providing it in only one place that is always visible.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-11-23 11:32:29 +01:00
Yuntao Wang
03f111710a x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro
Commit 1e8f93e183 ("x86: Consolidate port I/O helpers") moved some
port I/O helpers to <asm/shared/io.h>, which caused the 'bw' parameter
in the BUILDIO() macro to become unused. Remove it.

Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231123034911.217791-1-ytcoode@gmail.com
2023-11-23 10:57:36 +01:00
Andrew Cooper
5a7d6d26af x86/apic: Drop struct local_apic
This type predates recorded history in tglx/history.git, making it older
than Feb 5th 2002.

This structure is literally old enough to drink in most juristictions in
the world, and has not been used once in that time.

Lay it to rest in /dev/null.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20231102-x86-apic-v1-3-bf049a2a0ed6@citrix.com
2023-11-21 17:23:35 +01:00
Andrew Cooper
855da7cdf9 x86/apic: Drop enum apic_delivery_modes
The type is not used any more.

Replace the constants with plain defines so they can live outside of an
__ASSEMBLY__ block, allowing for more cleanup in subsequent changes.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20231102-x86-apic-v1-2-bf049a2a0ed6@citrix.com
2023-11-21 17:05:06 +01:00
Andrew Cooper
07e8f88568 x86/apic: Drop apic::delivery_mode
This field is set to APIC_DELIVERY_MODE_FIXED in all cases, and is read
exactly once.  Fold the constant in uv_program_mmr() and drop the field.

Searching for the origin of the stale HyperV comment reveals commit
a31e58e129 ("x86/apic: Switch all APICs to Fixed delivery mode") which
notes:

  As a consequence of this change, the apic::irq_delivery_mode field is
  now pointless, but this needs to be cleaned up in a separate patch.

6 years is long enough for this technical debt to have survived.

  [ bp: Fold in
    https://lore.kernel.org/r/20231121123034.1442059-1-andrew.cooper3@citrix.com
  ]

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20231102-x86-apic-v1-1-bf049a2a0ed6@citrix.com
2023-11-21 16:58:54 +01:00
Peter Zijlstra
5d2d4a9f60 Merge branch 'tip/perf/urgent'
Avoid conflicts, base on fixes.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2023-11-15 10:15:40 +01:00
Paolo Bonzini
6c370dc653 Merge branch 'kvm-guestmemfd' into HEAD
Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd.  Guest-first memory allows KVM
to provide features, enhancements, and optimizations that are kludgly
or outright impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which
similar to the generic memfd_create(), creates an anonymous file and
returns a file descriptor that refers to it.  Again like "regular"
memfd files, guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped.
The key differences between memfd files (and every other memory subystem)
is that guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
convert a guest memory area between the shared and guest-private states.

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
specify attributes for a given page of guest memory.  In the long term,
it will likely be extended to allow userspace to specify per-gfn RWX
protections, including allowing memory to be writable in the guest
without it also being writable in host userspace.

The immediate and driving use case for guest_memfd are Confidential
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
For such use cases, being able to map memory into KVM guests without
requiring said memory to be mapped into the host is a hard requirement.
While SEV+ and TDX prevent untrusted software from reading guest private
data by encrypting guest memory, pKVM provides confidentiality and
integrity *without* relying on memory encryption.  In addition, with
SEV-SNP and especially TDX, accessing guest private memory can be fatal
to the host, i.e. KVM must be prevent host userspace from accessing
guest memory irrespective of hardware behavior.

Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
for example hardening userspace against unintentional accesses to guest
memory.  As mentioned earlier, KVM's ABI uses userspace VMA protections to
define the allow guest protection (with an exception granted to mapping
guest memory executable), and similarly KVM currently requires the guest
mapping size to be a strict subset of the host userspace mapping size.
Decoupling the mappings sizes would allow userspace to precisely map
only what is needed and with the required permissions, without impacting
guest performance.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to DMA from or into guest memory).

guest_memfd is the result of 3+ years of development and exploration;
taking on memory management responsibilities in KVM was not the first,
second, or even third choice for supporting CoCo VMs.  But after many
failed attempts to avoid KVM-specific backing memory, and looking at
where things ended up, it is quite clear that of all approaches tried,
guest_memfd is the simplest, most robust, and most extensible, and the
right thing to do for KVM and the kernel at-large.

The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.

Pending post-merge work includes:
- hugepage support
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using
  the same memory attributes introduced here
- SNP and TDX support

There are two non-KVM patches buried in the middle of this series:

  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
2023-11-14 08:31:31 -05:00
Sean Christopherson
89ea60c2c7 KVM: x86: Add support for "protected VMs" that can utilize private memory
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that's isn't universally accessible.  I.e. relying SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-24-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14 08:01:05 -05:00
Sean Christopherson
eed52e434b KVM: Allow arch code to track number of memslot address spaces per VM
Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs.  Confidentials VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14 08:01:05 -05:00
Sean Christopherson
2333afa17a KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
KVM_ADDRESS_SPACE_NUM.

No functional change intended.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14 08:01:04 -05:00
Chao Peng
90b4fe1798 KVM: x86: Disallow hugepages when memory attributes are mixed
Disallow creating hugepages with mixed memory attributes, e.g. shared
versus private, as mapping a hugepage in this case would allow the guest
to access memory with the wrong attributes, e.g. overlaying private memory
with a shared hugepage.

Tracking whether or not attributes are mixed via the existing
disallow_lpage field, but use the most significant bit in 'disallow_lpage'
to indicate a hugepage has mixed attributes instead using the normal
refcounting.  Whether or not attributes are mixed is binary; either they
are or they aren't.  Attempting to squeeze that info into the refcount is
unnecessarily complex as it would require knowing the previous state of
the mixed count when updating attributes.  Using a flag means KVM just
needs to ensure the current status is reflected in the memslots.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14 08:01:04 -05:00
Hou Wenlong
fe22bc430c x86/paravirt: Make the struct paravirt_patch_site packed
Similar to struct alt_instr, make the struct paravirt_patch_site packed
and get rid of all the .align directives and save 2 bytes for one
PARA_SITE entry on X86_64.

  [ bp: Massage commit message. ]

Suggested-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/6dcb20159ded36586c5f7f2ae159e4e030256627.1686301237.git.houwenlong.hwl@antgroup.com
2023-11-13 12:43:50 +01:00
Hou Wenlong
5c22c4726e x86/paravirt: Use relative reference for the original instruction offset
Similar to the alternative patching, use a relative reference for original
instruction offset rather than absolute one, which saves 8 bytes for one
PARA_SITE entry on x86_64.  As a result, a R_X86_64_PC32 relocation is
generated instead of an R_X86_64_64 one, which also reduces relocation
metadata on relocatable builds. Hardcode the alignment to 4 now.

  [ bp: Massage commit message. ]

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/9e6053107fbaabc0d33e5d2865c5af2c67ec9925.1686301237.git.houwenlong.hwl@antgroup.com
2023-11-13 12:23:27 +01:00
Sean Christopherson
f128cf8cfb KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER
Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior.  Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared.  PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.

Acked-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13 05:29:09 -05:00
Borislav Petkov (AMD)
04c3024560 x86/barrier: Do not serialize MSR accesses on AMD
AMD does not have the requirement for a synchronization barrier when
acccessing a certain group of MSRs. Do not incur that unnecessary
penalty there.

There will be a CPUID bit which explicitly states that a MFENCE is not
needed. Once that bit is added to the APM, this will be extended with
it.

While at it, move to processor.h to avoid include hell. Untangling that
file properly is a matter for another day.

Some notes on the performance aspect of why this is relevant, courtesy
of Kishon VijayAbraham <Kishon.VijayAbraham@amd.com>:

On a AMD Zen4 system with 96 cores, a modified ipi-bench[1] on a VM
shows x2AVIC IPI rate is 3% to 4% lower than AVIC IPI rate. The
ipi-bench is modified so that the IPIs are sent between two vCPUs in the
same CCX. This also requires to pin the vCPU to a physical core to
prevent any latencies. This simulates the use case of pinning vCPUs to
the thread of a single CCX to avoid interrupt IPI latency.

In order to avoid run-to-run variance (for both x2AVIC and AVIC), the
below configurations are done:

  1) Disable Power States in BIOS (to prevent the system from going to
     lower power state)

  2) Run the system at fixed frequency 2500MHz (to prevent the system
     from increasing the frequency when the load is more)

With the above configuration:

*) Performance measured using ipi-bench for AVIC:
  Average Latency:  1124.98ns [Time to send IPI from one vCPU to another vCPU]

  Cumulative throughput: 42.6759M/s [Total number of IPIs sent in a second from
  				     48 vCPUs simultaneously]

*) Performance measured using ipi-bench for x2AVIC:
  Average Latency:  1172.42ns [Time to send IPI from one vCPU to another vCPU]

  Cumulative throughput: 40.9432M/s [Total number of IPIs sent in a second from
  				     48 vCPUs simultaneously]

From above, x2AVIC latency is ~4% more than AVIC. However, the expectation is
x2AVIC performance to be better or equivalent to AVIC. Upon analyzing
the perf captures, it is observed significant time is spent in
weak_wrmsr_fence() invoked by x2apic_send_IPI().

With the fix to skip weak_wrmsr_fence()

*) Performance measured using ipi-bench for x2AVIC:
  Average Latency:  1117.44ns [Time to send IPI from one vCPU to another vCPU]

  Cumulative throughput: 42.9608M/s [Total number of IPIs sent in a second from
  				     48 vCPUs simultaneously]

Comparing the performance of x2AVIC with and without the fix, it can be seen
the performance improves by ~4%.

Performance captured using an unmodified ipi-bench using the 'mesh-ipi' option
with and without weak_wrmsr_fence() on a Zen4 system also showed significant
performance improvement without weak_wrmsr_fence(). The 'mesh-ipi' option ignores
CCX or CCD and just picks random vCPU.

  Average throughput (10 iterations) with weak_wrmsr_fence(),
        Cumulative throughput: 4933374 IPI/s

  Average throughput (10 iterations) without weak_wrmsr_fence(),
        Cumulative throughput: 6355156 IPI/s

[1] https://github.com/bytedance/kvm-utils/tree/master/microbenchmark/ipi-bench

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230622095212.20940-1-bp@alien8.de
2023-11-13 10:09:45 +01:00
Yuntao Wang
f7a25cf1d4 x86/setup: Make relocated_ramdisk a local variable of relocate_initrd()
After

  0b62f6cb07 ("x86/microcode/32: Move early loading after paging enable"),

the global variable relocated_ramdisk is no longer used anywhere except
for the relocate_initrd() function. Make it a local variable of that
function.

Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Link: https://lore.kernel.org/r/20231113034026.130679-1-ytcoode@gmail.com
2023-11-13 09:09:37 +01:00
Roger Pau Monne
bfa993b355 acpi/processor: sanitize _OSC/_PDC capabilities for Xen dom0
The Processor capability bits notify ACPI of the OS capabilities, and
so ACPI can adjust the return of other Processor methods taking the OS
capabilities into account.

When Linux is running as a Xen dom0, the hypervisor is the entity
in charge of processor power management, and hence Xen needs to make
sure the capabilities reported by _OSC/_PDC match the capabilities of
the driver in Xen.

Introduce a small helper to sanitize the buffer when running as Xen
dom0.

When Xen supports HWP, this serves as the equivalent of commit
a21211672c ("ACPI / processor: Request native thermal interrupt
handling via _OSC") to avoid SMM crashes.  Xen will set bit
ACPI_PROC_CAP_COLLAB_PROC_PERF (bit 12) in the capability bits and the
_OSC/_PDC call will apply it.

[ jandryuk: Mention Xen HWP's need.  Support _OSC & _PDC ]
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20231108212517.72279-1-jandryuk@gmail.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2023-11-13 07:22:00 +01:00
Arnd Bergmann
abc28463c8 kprobes: unify kprobes_exceptions_nofify() prototypes
Most architectures that support kprobes declare this function in their
own asm/kprobes.h header and provide an override, but some are missing
the prototype, which causes a warning for the __weak stub implementation:

kernel/kprobes.c:1865:12: error: no previous prototype for 'kprobe_exceptions_notify' [-Werror=missing-prototypes]
 1865 | int __weak kprobe_exceptions_notify(struct notifier_block *self,

Move the prototype into linux/kprobes.h so it is visible to all
the definitions.

Link: https://lore.kernel.org/all/20231108125843.3806765-4-arnd@kernel.org/

Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-11-10 19:59:05 +09:00
Linus Torvalds
5e2cb28dd7 configfs-tsm for v6.7
- Introduce configfs-tsm as a shared ABI for confidential computing
   attestation reports
 
 - Convert sev-guest to additionally support configfs-tsm alongside its
   vendor specific ioctl()
 
 - Added signed attestation report retrieval to the tdx-guest driver
   forgoing a new vendor specific ioctl()
 
 - Misc. cleanups and a new __free() annotation for kvfree()
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCZUQhiQAKCRDfioYZHlFs
 Z2gMAQCJdtP0f2kH+pvf3oxAkA1OubKBqJqWOppeyrhTsNMpDQEA9ljXH9h7eRB/
 2NQ6USrU6jqcdu3gB5Tzq8J/ZZabMQU=
 =1Eiv
 -----END PGP SIGNATURE-----

Merge tag 'tsm-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux

Pull unified attestation reporting from Dan Williams:
 "In an ideal world there would be a cross-vendor standard attestation
  report format for confidential guests along with a common device
  definition to act as the transport.

  In the real world the situation ended up with multiple platform
  vendors inventing their own attestation report formats with the
  SEV-SNP implementation being a first mover to define a custom
  sev-guest character device and corresponding ioctl(). Later, this
  configfs-tsm proposal intercepted an attempt to add a tdx-guest
  character device and a corresponding new ioctl(). It also anticipated
  ARM and RISC-V showing up with more chardevs and more ioctls().

  The proposal takes for granted that Linux tolerates the vendor report
  format differentiation until a standard arrives. From talking with
  folks involved, it sounds like that standardization work is unlikely
  to resolve anytime soon. It also takes the position that kernfs ABIs
  are easier to maintain than ioctl(). The result is a shared configfs
  mechanism to return per-vendor report-blobs with the option to later
  support a standard when that arrives.

  Part of the goal here also is to get the community into the
  "uncomfortable, but beneficial to the long term maintainability of the
  kernel" state of talking to each other about their differentiation and
  opportunities to collaborate. Think of this like the device-driver
  equivalent of the common memory-management infrastructure for
  confidential-computing being built up in KVM.

  As for establishing an "upstream path for cross-vendor
  confidential-computing device driver infrastructure" this is something
  I want to discuss at Plumbers. At present, the multiple vendor
  proposals for assigning devices to confidential computing VMs likely
  needs a new dedicated repository and maintainer team, but that is a
  discussion for v6.8.

  For now, Greg and Thomas have acked this approach and this is passing
  is AMD, Intel, and Google tests.

  Summary:

   - Introduce configfs-tsm as a shared ABI for confidential computing
     attestation reports

   - Convert sev-guest to additionally support configfs-tsm alongside
     its vendor specific ioctl()

   - Added signed attestation report retrieval to the tdx-guest driver
     forgoing a new vendor specific ioctl()

   - Misc cleanups and a new __free() annotation for kvfree()"

* tag 'tsm-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux:
  virt: tdx-guest: Add Quote generation support using TSM_REPORTS
  virt: sevguest: Add TSM_REPORTS support for SNP_GET_EXT_REPORT
  mm/slab: Add __free() support for kvfree
  virt: sevguest: Prep for kernel internal get_ext_report()
  configfs-tsm: Introduce a shared ABI for attestation reports
  virt: coco: Add a coco/Makefile and coco/Kconfig
  virt: sevguest: Fix passing a stack buffer as a scatterlist target
2023-11-04 15:58:13 -10:00
Linus Torvalds
0a23fb262d Major microcode loader restructuring, cleanup and improvements by Thomas
Gleixner:
 
 - Restructure the code needed for it and add a temporary initrd mapping
   on 32-bit so that the loader can access the microcode blobs. This in
   itself is a preparation for the next major improvement:
 
 - Do not load microcode on 32-bit before paging has been enabled.
   Handling this has caused an endless stream of headaches, issues, ugly
   code and unnecessary hacks in the past. And there really wasn't any
   sensible reason to do that in the first place. So switch the 32-bit
   loading to happen after paging has been enabled and turn the loader
   code "real purrty" again
 
 - Drop mixed microcode steppings loading on Intel - there, a single patch
   loaded on the whole system is sufficient
 
 - Rework late loading to track which CPUs have updated microcode
   successfully and which haven't, act accordingly
 
 - Move late microcode loading on Intel in NMI context in order to
   guarantee concurrent loading on all threads
 
 - Make the late loading CPU-hotplug-safe and have the offlined threads
   be woken up for the purpose of the update
 
 - Add support for a minimum revision which determines whether late
   microcode loading is safe on a machine and the microcode does not
   change software visible features which the machine cannot use anyway
   since feature detection has happened already. Roughly, the minimum
   revision is the smallest revision number which must be loaded
   currently on the system so that late updates can be allowed
 
 - Other nice leanups, fixess, etc all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmVE0xkACgkQEsHwGGHe
 VUrCuBAAhOqqwkYPiGXPWd2hvdn1zGtD5fvEdXn3Orzd+Lwc6YaQTsCxCjIO/0ws
 8inpPFuOeGz4TZcplzipi3G5oatPVc7ORDuW+/BvQQQljZOsSKfhiaC29t6dvS6z
 UG3sbCXKVwlJ5Kwv3Qe4eWur4Ex6GeFDZkIvBCmbaAdGPFlfu1i2uO1yBooNP1Rs
 GiUkp+dP1/KREWwR/dOIsHYL2QjWIWfHQEWit/9Bj46rxE9ERx/TWt3AeKPfKriO
 Wp0JKp6QY78jg6a0a2/JVmbT1BKz69Z9aPp6hl4P2MfbBYOnqijRhdezFW0NyqV2
 pn6nsuiLIiXbnSOEw0+Wdnw5Q0qhICs5B5eaBfQrwgfZ8pxPHv2Ir777GvUTV01E
 Dv0ZpYsHa+mHe17nlK8V3+4eajt0PetExcXAYNiIE+pCb7pLjjKkl8e+lcEvEsO0
 QSL3zG5i5RWUMPYUvaFRgepWy3k/GPIoDQjRcUD3P+1T0GmnogNN10MMNhmOzfWU
 pyafe4tJUOVsq0HJ7V/bxIHk2p+Q+5JLKh5xBm9janE4BpabmSQnvFWNblVfK4ig
 M9ohjI/yMtgXROC4xkNXgi8wE5jfDKBghT6FjTqKWSV45vknF1mNEjvuaY+aRZ3H
 MB4P3HCj+PKWJimWHRYnDshcytkgcgVcYDiim8va/4UDrw8O2ks=
 =JOZu
 -----END PGP SIGNATURE-----

Merge tag 'x86_microcode_for_v6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 microcode loading updates from Borislac Petkov:
 "Major microcode loader restructuring, cleanup and improvements by
  Thomas Gleixner:

   - Restructure the code needed for it and add a temporary initrd
     mapping on 32-bit so that the loader can access the microcode
     blobs. This in itself is a preparation for the next major
     improvement:

   - Do not load microcode on 32-bit before paging has been enabled.

     Handling this has caused an endless stream of headaches, issues,
     ugly code and unnecessary hacks in the past. And there really
     wasn't any sensible reason to do that in the first place. So switch
     the 32-bit loading to happen after paging has been enabled and turn
     the loader code "real purrty" again

   - Drop mixed microcode steppings loading on Intel - there, a single
     patch loaded on the whole system is sufficient

   - Rework late loading to track which CPUs have updated microcode
     successfully and which haven't, act accordingly

   - Move late microcode loading on Intel in NMI context in order to
     guarantee concurrent loading on all threads

   - Make the late loading CPU-hotplug-safe and have the offlined
     threads be woken up for the purpose of the update

   - Add support for a minimum revision which determines whether late
     microcode loading is safe on a machine and the microcode does not
     change software visible features which the machine cannot use
     anyway since feature detection has happened already. Roughly, the
     minimum revision is the smallest revision number which must be
     loaded currently on the system so that late updates can be allowed

   - Other nice leanups, fixess, etc all over the place"

* tag 'x86_microcode_for_v6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  x86/microcode/intel: Add a minimum required revision for late loading
  x86/microcode: Prepare for minimal revision check
  x86/microcode: Handle "offline" CPUs correctly
  x86/apic: Provide apic_force_nmi_on_cpu()
  x86/microcode: Protect against instrumentation
  x86/microcode: Rendezvous and load in NMI
  x86/microcode: Replace the all-in-one rendevous handler
  x86/microcode: Provide new control functions
  x86/microcode: Add per CPU control field
  x86/microcode: Add per CPU result state
  x86/microcode: Sanitize __wait_for_cpus()
  x86/microcode: Clarify the late load logic
  x86/microcode: Handle "nosmt" correctly
  x86/microcode: Clean up mc_cpu_down_prep()
  x86/microcode: Get rid of the schedule work indirection
  x86/microcode: Mop up early loading leftovers
  x86/microcode/amd: Use cached microcode for AP load
  x86/microcode/amd: Cache builtin/initrd microcode early
  x86/microcode/amd: Cache builtin microcode too
  x86/microcode/amd: Use correct per CPU ucode_cpu_info
  ...
2023-11-04 08:46:37 -10:00
Linus Torvalds
8f6f76a6a2 As usual, lots of singleton and doubleton patches all over the tree and
there's little I can say which isn't in the individual changelogs.
 
 The lengthier patch series are
 
 - "kdump: use generic functions to simplify crashkernel reservation in
   arch", from Baoquan He.  This is mainly cleanups and consolidation of
   the "crashkernel=" kernel parameter handling.
 
 - After much discussion, David Laight's "minmax: Relax type checks in
   min() and max()" is here.  Hopefully reduces some typecasting and the
   use of min_t() and max_t().
 
 - A group of patches from Oleg Nesterov which clean up and slightly fix
   our handling of reads from /proc/PID/task/...  and which remove
   task_struct.therad_group.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA
 jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4
 63BNcG3+hM9nwGJHb5lyh5m79nBMRg0=
 =On4u
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "As usual, lots of singleton and doubleton patches all over the tree
  and there's little I can say which isn't in the individual changelogs.

  The lengthier patch series are

   - 'kdump: use generic functions to simplify crashkernel reservation
     in arch', from Baoquan He. This is mainly cleanups and
     consolidation of the 'crashkernel=' kernel parameter handling

   - After much discussion, David Laight's 'minmax: Relax type checks in
     min() and max()' is here. Hopefully reduces some typecasting and
     the use of min_t() and max_t()

   - A group of patches from Oleg Nesterov which clean up and slightly
     fix our handling of reads from /proc/PID/task/... and which remove
     task_struct.thread_group"

* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
  scripts/gdb/vmalloc: disable on no-MMU
  scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
  .mailmap: add address mapping for Tomeu Vizoso
  mailmap: update email address for Claudiu Beznea
  tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
  .mailmap: map Benjamin Poirier's address
  scripts/gdb: add lx_current support for riscv
  ocfs2: fix a spelling typo in comment
  proc: test ProtectionKey in proc-empty-vm test
  proc: fix proc-empty-vm test with vsyscall
  fs/proc/base.c: remove unneeded semicolon
  do_io_accounting: use sig->stats_lock
  do_io_accounting: use __for_each_thread()
  ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
  ocfs2: fix a typo in a comment
  scripts/show_delta: add __main__ judgement before main code
  treewide: mark stuff as __ro_after_init
  fs: ocfs2: check status values
  proc: test /proc/${pid}/statm
  compiler.h: move __is_constexpr() to compiler.h
  ...
2023-11-02 20:53:31 -10:00
Linus Torvalds
ecae0bd517 Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
 
 - Kemeng Shi has contributed some compation maintenance work in the
   series "Fixes and cleanups to compaction".
 
 - Joel Fernandes has a patchset ("Optimize mremap during mutual
   alignment within PMD") which fixes an obscure issue with mremap()'s
   pagetable handling during a subsequent exec(), based upon an
   implementation which Linus suggested.
 
 - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
   following patch series:
 
 	mm/damon: misc fixups for documents, comments and its tracepoint
 	mm/damon: add a tracepoint for damos apply target regions
 	mm/damon: provide pseudo-moving sum based access rate
 	mm/damon: implement DAMOS apply intervals
 	mm/damon/core-test: Fix memory leaks in core-test
 	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
 
 - In the series "Do not try to access unaccepted memory" Adrian Hunter
   provides some fixups for the recently-added "unaccepted memory' feature.
   To increase the feature's checking coverage.  "Plug a few gaps where
   RAM is exposed without checking if it is unaccepted memory".
 
 - In the series "cleanups for lockless slab shrink" Qi Zheng has done
   some maintenance work which is preparation for the lockless slab
   shrinking code.
 
 - Qi Zheng has redone the earlier (and reverted) attempt to make slab
   shrinking lockless in the series "use refcount+RCU method to implement
   lockless slab shrink".
 
 - David Hildenbrand contributes some maintenance work for the rmap code
   in the series "Anon rmap cleanups".
 
 - Kefeng Wang does more folio conversions and some maintenance work in
   the migration code.  Series "mm: migrate: more folio conversion and
   unification".
 
 - Matthew Wilcox has fixed an issue in the buffer_head code which was
   causing long stalls under some heavy memory/IO loads.  Some cleanups
   were added on the way.  Series "Add and use bdev_getblk()".
 
 - In the series "Use nth_page() in place of direct struct page
   manipulation" Zi Yan has fixed a potential issue with the direct
   manipulation of hugetlb page frames.
 
 - In the series "mm: hugetlb: Skip initialization of gigantic tail
   struct pages if freed by HVO" has improved our handling of gigantic
   pages in the hugetlb vmmemmep optimizaton code.  This provides
   significant boot time improvements when significant amounts of gigantic
   pages are in use.
 
 - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
   rationalization and folio conversions in the hugetlb code.
 
 - Yin Fengwei has improved mlock()'s handling of large folios in the
   series "support large folio for mlock"
 
 - In the series "Expose swapcache stat for memcg v1" Liu Shixin has
   added statistics for memcg v1 users which are available (and useful)
   under memcg v2.
 
 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
   prctl so that userspace may direct the kernel to not automatically
   propagate the denial to child processes.  The series is named "MDWE
   without inheritance".
 
 - Kefeng Wang has provided the series "mm: convert numa balancing
   functions to use a folio" which does what it says.
 
 - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
   makes is possible for a process to propagate KSM treatment across
   exec().
 
 - Huang Ying has enhanced memory tiering's calculation of memory
   distances.  This is used to permit the dax/kmem driver to use "high
   bandwidth memory" in addition to Optane Data Center Persistent Memory
   Modules (DCPMM).  The series is named "memory tiering: calculate
   abstract distance based on ACPI HMAT"
 
 - In the series "Smart scanning mode for KSM" Stefan Roesch has
   optimized KSM by teaching it to retain and use some historical
   information from previous scans.
 
 - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
   series "mm: memcg: fix tracking of pending stats updates values".
 
 - In the series "Implement IOCTL to get and optionally clear info about
   PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
   us to atomically read-then-clear page softdirty state.  This is mainly
   used by CRIU.
 
 - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
   - a bunch of relatively minor maintenance tweaks to this code.
 
 - Matthew Wilcox has increased the use of the VMA lock over file-backed
   page faults in the series "Handle more faults under the VMA lock".  Some
   rationalizations of the fault path became possible as a result.
 
 - In the series "mm/rmap: convert page_move_anon_rmap() to
   folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
   and folio conversions.
 
 - In the series "various improvements to the GUP interface" Lorenzo
   Stoakes has simplified and improved the GUP interface with an eye to
   providing groundwork for future improvements.
 
 - Andrey Konovalov has sent along the series "kasan: assorted fixes and
   improvements" which does those things.
 
 - Some page allocator maintenance work from Kemeng Shi in the series
   "Two minor cleanups to break_down_buddy_pages".
 
 - In thes series "New selftest for mm" Breno Leitao has developed
   another MM self test which tickles a race we had between madvise() and
   page faults.
 
 - In the series "Add folio_end_read" Matthew Wilcox provides cleanups
   and an optimization to the core pagecache code.
 
 - Nhat Pham has added memcg accounting for hugetlb memory in the series
   "hugetlb memcg accounting".
 
 - Cleanups and rationalizations to the pagemap code from Lorenzo
   Stoakes, in the series "Abstract vma_merge() and split_vma()".
 
 - Audra Mitchell has fixed issues in the procfs page_owner code's new
   timestamping feature which was causing some misbehaviours.  In the
   series "Fix page_owner's use of free timestamps".
 
 - Lorenzo Stoakes has fixed the handling of new mappings of sealed files
   in the series "permit write-sealed memfd read-only shared mappings".
 
 - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
   series "Batch hugetlb vmemmap modification operations".
 
 - Some buffer_head folio conversions and cleanups from Matthew Wilcox in
   the series "Finish the create_empty_buffers() transition".
 
 - As a page allocator performance optimization Huang Ying has added
   automatic tuning to the allocator's per-cpu-pages feature, in the series
   "mm: PCP high auto-tuning".
 
 - Roman Gushchin has contributed the patchset "mm: improve performance
   of accounted kernel memory allocations" which improves their performance
   by ~30% as measured by a micro-benchmark.
 
 - folio conversions from Kefeng Wang in the series "mm: convert page
   cpupid functions to folios".
 
 - Some kmemleak fixups in Liu Shixin's series "Some bugfix about
   kmemleak".
 
 - Qi Zheng has improved our handling of memoryless nodes by keeping them
   off the allocation fallback list.  This is done in the series "handle
   memoryless nodes more appropriately".
 
 - khugepaged conversions from Vishal Moola in the series "Some
   khugepaged folio conversions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
 jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
 FgeUPAD1oasg6CP+INZvCj34waNxwAc=
 =E+Y4
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Kemeng Shi has contributed some compation maintenance work in the
     series 'Fixes and cleanups to compaction'

   - Joel Fernandes has a patchset ('Optimize mremap during mutual
     alignment within PMD') which fixes an obscure issue with mremap()'s
     pagetable handling during a subsequent exec(), based upon an
     implementation which Linus suggested

   - More DAMON/DAMOS maintenance and feature work from SeongJae Park i
     the following patch series:

	mm/damon: misc fixups for documents, comments and its tracepoint
	mm/damon: add a tracepoint for damos apply target regions
	mm/damon: provide pseudo-moving sum based access rate
	mm/damon: implement DAMOS apply intervals
	mm/damon/core-test: Fix memory leaks in core-test
	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

   - In the series 'Do not try to access unaccepted memory' Adrian
     Hunter provides some fixups for the recently-added 'unaccepted
     memory' feature. To increase the feature's checking coverage. 'Plug
     a few gaps where RAM is exposed without checking if it is
     unaccepted memory'

   - In the series 'cleanups for lockless slab shrink' Qi Zheng has done
     some maintenance work which is preparation for the lockless slab
     shrinking code

   - Qi Zheng has redone the earlier (and reverted) attempt to make slab
     shrinking lockless in the series 'use refcount+RCU method to
     implement lockless slab shrink'

   - David Hildenbrand contributes some maintenance work for the rmap
     code in the series 'Anon rmap cleanups'

   - Kefeng Wang does more folio conversions and some maintenance work
     in the migration code. Series 'mm: migrate: more folio conversion
     and unification'

   - Matthew Wilcox has fixed an issue in the buffer_head code which was
     causing long stalls under some heavy memory/IO loads. Some cleanups
     were added on the way. Series 'Add and use bdev_getblk()'

   - In the series 'Use nth_page() in place of direct struct page
     manipulation' Zi Yan has fixed a potential issue with the direct
     manipulation of hugetlb page frames

   - In the series 'mm: hugetlb: Skip initialization of gigantic tail
     struct pages if freed by HVO' has improved our handling of gigantic
     pages in the hugetlb vmmemmep optimizaton code. This provides
     significant boot time improvements when significant amounts of
     gigantic pages are in use

   - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
     rationalization and folio conversions in the hugetlb code

   - Yin Fengwei has improved mlock()'s handling of large folios in the
     series 'support large folio for mlock'

   - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
     added statistics for memcg v1 users which are available (and
     useful) under memcg v2

   - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
     prctl so that userspace may direct the kernel to not automatically
     propagate the denial to child processes. The series is named 'MDWE
     without inheritance'

   - Kefeng Wang has provided the series 'mm: convert numa balancing
     functions to use a folio' which does what it says

   - In the series 'mm/ksm: add fork-exec support for prctl' Stefan
     Roesch makes is possible for a process to propagate KSM treatment
     across exec()

   - Huang Ying has enhanced memory tiering's calculation of memory
     distances. This is used to permit the dax/kmem driver to use 'high
     bandwidth memory' in addition to Optane Data Center Persistent
     Memory Modules (DCPMM). The series is named 'memory tiering:
     calculate abstract distance based on ACPI HMAT'

   - In the series 'Smart scanning mode for KSM' Stefan Roesch has
     optimized KSM by teaching it to retain and use some historical
     information from previous scans

   - Yosry Ahmed has fixed some inconsistencies in memcg statistics in
     the series 'mm: memcg: fix tracking of pending stats updates
     values'

   - In the series 'Implement IOCTL to get and optionally clear info
     about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
     which permits us to atomically read-then-clear page softdirty
     state. This is mainly used by CRIU

   - Hugh Dickins contributed the series 'shmem,tmpfs: general
     maintenance', a bunch of relatively minor maintenance tweaks to
     this code

   - Matthew Wilcox has increased the use of the VMA lock over
     file-backed page faults in the series 'Handle more faults under the
     VMA lock'. Some rationalizations of the fault path became possible
     as a result

   - In the series 'mm/rmap: convert page_move_anon_rmap() to
     folio_move_anon_rmap()' David Hildenbrand has implemented some
     cleanups and folio conversions

   - In the series 'various improvements to the GUP interface' Lorenzo
     Stoakes has simplified and improved the GUP interface with an eye
     to providing groundwork for future improvements

   - Andrey Konovalov has sent along the series 'kasan: assorted fixes
     and improvements' which does those things

   - Some page allocator maintenance work from Kemeng Shi in the series
     'Two minor cleanups to break_down_buddy_pages'

   - In thes series 'New selftest for mm' Breno Leitao has developed
     another MM self test which tickles a race we had between madvise()
     and page faults

   - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
     and an optimization to the core pagecache code

   - Nhat Pham has added memcg accounting for hugetlb memory in the
     series 'hugetlb memcg accounting'

   - Cleanups and rationalizations to the pagemap code from Lorenzo
     Stoakes, in the series 'Abstract vma_merge() and split_vma()'

   - Audra Mitchell has fixed issues in the procfs page_owner code's new
     timestamping feature which was causing some misbehaviours. In the
     series 'Fix page_owner's use of free timestamps'

   - Lorenzo Stoakes has fixed the handling of new mappings of sealed
     files in the series 'permit write-sealed memfd read-only shared
     mappings'

   - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
     series 'Batch hugetlb vmemmap modification operations'

   - Some buffer_head folio conversions and cleanups from Matthew Wilcox
     in the series 'Finish the create_empty_buffers() transition'

   - As a page allocator performance optimization Huang Ying has added
     automatic tuning to the allocator's per-cpu-pages feature, in the
     series 'mm: PCP high auto-tuning'

   - Roman Gushchin has contributed the patchset 'mm: improve
     performance of accounted kernel memory allocations' which improves
     their performance by ~30% as measured by a micro-benchmark

   - folio conversions from Kefeng Wang in the series 'mm: convert page
     cpupid functions to folios'

   - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
     kmemleak'

   - Qi Zheng has improved our handling of memoryless nodes by keeping
     them off the allocation fallback list. This is done in the series
     'handle memoryless nodes more appropriately'

   - khugepaged conversions from Vishal Moola in the series 'Some
     khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been
  resolved as per Stephen Rothwell in

     https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/

  with help from Qi Zheng.

  The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...
2023-11-02 19:38:47 -10:00
Linus Torvalds
6803bd7956 ARM:
* Generalized infrastructure for 'writable' ID registers, effectively
   allowing userspace to opt-out of certain vCPU features for its guest
 
 * Optimization for vSGI injection, opportunistically compressing MPIDR
   to vCPU mapping into a table
 
 * Improvements to KVM's PMU emulation, allowing userspace to select
   the number of PMCs available to a VM
 
 * Guest support for memory operation instructions (FEAT_MOPS)
 
 * Cleanups to handling feature flags in KVM_ARM_VCPU_INIT, squashing
   bugs and getting rid of useless code
 
 * Changes to the way the SMCCC filter is constructed, avoiding wasted
   memory allocations when not in use
 
 * Load the stage-2 MMU context at vcpu_load() for VHE systems, reducing
   the overhead of errata mitigations
 
 * Miscellaneous kernel and selftest fixes
 
 LoongArch:
 
 * New architecture.  The hardware uses the same model as x86, s390
   and RISC-V, where guest/host mode is orthogonal to supervisor/user
   mode.  The virtualization extensions are very similar to MIPS,
   therefore the code also has some similarities but it's been cleaned
   up to avoid some of the historical bogosities that are found in
   arch/mips.  The kernel emulates MMU, timer and CSR accesses, while
   interrupt controllers are only emulated in userspace, at least for
   now.
 
 RISC-V:
 
 * Support for the Smstateen and Zicond extensions
 
 * Support for virtualizing senvcfg
 
 * Support for virtualized SBI debug console (DBCN)
 
 S390:
 
 * Nested page table management can be monitored through tracepoints
   and statistics
 
 x86:
 
 * Fix incorrect handling of VMX posted interrupt descriptor in KVM_SET_LAPIC,
   which could result in a dropped timer IRQ
 
 * Avoid WARN on systems with Intel IPI virtualization
 
 * Add CONFIG_KVM_MAX_NR_VCPUS, to allow supporting up to 4096 vCPUs without
   forcing more common use cases to eat the extra memory overhead.
 
 * Add virtualization support for AMD SRSO mitigation (IBPB_BRTYPE and
   SBPB, aka Selective Branch Predictor Barrier).
 
 * Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
   creating the original vCPU would cause KVM to try to synchronize the vCPU's
   TSC and thus clobber the correct TSC being set by userspace.
 
 * Compute guest wall clock using a single TSC read to avoid generating an
   inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
 
 * "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
   about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
   Likewise "virtualize" (ignore) MSR_AMD64_TW_CFG to appease Windows Server
   2022.
 
 * Don't apply side effects to Hyper-V's synthetic timer on writes from
   userspace to fix an issue where the auto-enable behavior can trigger
   spurious interrupts, i.e. do auto-enabling only for guest writes.
 
 * Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
   without PML enabled.
 
 * Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
 
 * Harden the fast page fault path to guard against encountering an invalid
   root when walking SPTEs.
 
 * Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
 
 * Use the fast path directly from the timer callback when delivering Xen
   timer events, instead of waiting for the next iteration of the run loop.
   This was not done so far because previously proposed code had races,
   but now care is taken to stop the hrtimer at critical points such as
   restarting the timer or saving the timer information for userspace.
 
 * Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
 
 * Optimize injection of PMU interrupts that are simultaneous with NMIs.
 
 * Usual handful of fixes for typos and other warts.
 
 x86 - MTRR/PAT fixes and optimizations:
 
 * Clean up code that deals with honoring guest MTRRs when the VM has
   non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.
 
 * Zap EPT entries when non-coherent DMA assignment stops/start to prevent
   using stale entries with the wrong memtype.
 
 * Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y.
   This was done as a workaround for virtual machine BIOSes that did not
   bother to clear CR0.CD (because ancient KVM/QEMU did not bother to
   set it, in turn), and there's zero reason to extend the quirk to
   also ignore guest PAT.
 
 x86 - SEV fixes:
 
 * Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
   running an SEV-ES guest.
 
 * Clean up the recognition of emulation failures on SEV guests, when KVM would
   like to "skip" the instruction but it had already been partially emulated.
   This makes it possible to drop a hack that second guessed the (insufficient)
   information provided by the emulator, and just do the right thing.
 
 Documentation:
 
 * Various updates and fixes, mostly for x86
 
 * MTRR and PAT fixes and optimizations:
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmVBZc0UHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroP1LQf+NgsmZ1lkGQlKdSdijoQ856w+k0or
 l2SV1wUwiEdFPSGK+RTUlHV5Y1ni1dn/CqCVIJZKEI3ZtZ1m9/4HKIRXvbMwFHIH
 hx+E4Lnf8YUjsGjKTLd531UKcpphztZavQ6pXLEwazkSkDEra+JIKtooI8uU+9/p
 bd/eF1V+13a8CHQf1iNztFJVxqBJbVlnPx4cZDRQQvewskIDGnVDtwbrwCUKGtzD
 eNSzhY7si6O2kdQNkuA8xPhg29dYX9XLaCK2K1l8xOUm8WipLdtF86GAKJ5BVuOL
 6ek/2QCYjZ7a+coAZNfgSEUi8JmFHEqCo7cnKmWzPJp+2zyXsdudqAhT1g==
 =UIxm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Generalized infrastructure for 'writable' ID registers, effectively
     allowing userspace to opt-out of certain vCPU features for its
     guest

   - Optimization for vSGI injection, opportunistically compressing
     MPIDR to vCPU mapping into a table

   - Improvements to KVM's PMU emulation, allowing userspace to select
     the number of PMCs available to a VM

   - Guest support for memory operation instructions (FEAT_MOPS)

   - Cleanups to handling feature flags in KVM_ARM_VCPU_INIT, squashing
     bugs and getting rid of useless code

   - Changes to the way the SMCCC filter is constructed, avoiding wasted
     memory allocations when not in use

   - Load the stage-2 MMU context at vcpu_load() for VHE systems,
     reducing the overhead of errata mitigations

   - Miscellaneous kernel and selftest fixes

  LoongArch:

   - New architecture for kvm.

     The hardware uses the same model as x86, s390 and RISC-V, where
     guest/host mode is orthogonal to supervisor/user mode. The
     virtualization extensions are very similar to MIPS, therefore the
     code also has some similarities but it's been cleaned up to avoid
     some of the historical bogosities that are found in arch/mips. The
     kernel emulates MMU, timer and CSR accesses, while interrupt
     controllers are only emulated in userspace, at least for now.

  RISC-V:

   - Support for the Smstateen and Zicond extensions

   - Support for virtualizing senvcfg

   - Support for virtualized SBI debug console (DBCN)

  S390:

   - Nested page table management can be monitored through tracepoints
     and statistics

  x86:

   - Fix incorrect handling of VMX posted interrupt descriptor in
     KVM_SET_LAPIC, which could result in a dropped timer IRQ

   - Avoid WARN on systems with Intel IPI virtualization

   - Add CONFIG_KVM_MAX_NR_VCPUS, to allow supporting up to 4096 vCPUs
     without forcing more common use cases to eat the extra memory
     overhead.

   - Add virtualization support for AMD SRSO mitigation (IBPB_BRTYPE and
     SBPB, aka Selective Branch Predictor Barrier).

   - Fix a bug where restoring a vCPU snapshot that was taken within 1
     second of creating the original vCPU would cause KVM to try to
     synchronize the vCPU's TSC and thus clobber the correct TSC being
     set by userspace.

   - Compute guest wall clock using a single TSC read to avoid
     generating an inaccurate time, e.g. if the vCPU is preempted
     between multiple TSC reads.

   - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which
     complain about a "Firmware Bug" if the bit isn't set for select
     F/M/S combos. Likewise "virtualize" (ignore) MSR_AMD64_TW_CFG to
     appease Windows Server 2022.

   - Don't apply side effects to Hyper-V's synthetic timer on writes
     from userspace to fix an issue where the auto-enable behavior can
     trigger spurious interrupts, i.e. do auto-enabling only for guest
     writes.

   - Remove an unnecessary kick of all vCPUs when synchronizing the
     dirty log without PML enabled.

   - Advertise "support" for non-serializing FS/GS base MSR writes as
     appropriate.

   - Harden the fast page fault path to guard against encountering an
     invalid root when walking SPTEs.

   - Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.

   - Use the fast path directly from the timer callback when delivering
     Xen timer events, instead of waiting for the next iteration of the
     run loop. This was not done so far because previously proposed code
     had races, but now care is taken to stop the hrtimer at critical
     points such as restarting the timer or saving the timer information
     for userspace.

   - Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future
     flag.

   - Optimize injection of PMU interrupts that are simultaneous with
     NMIs.

   - Usual handful of fixes for typos and other warts.

  x86 - MTRR/PAT fixes and optimizations:

   - Clean up code that deals with honoring guest MTRRs when the VM has
     non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.

   - Zap EPT entries when non-coherent DMA assignment stops/start to
     prevent using stale entries with the wrong memtype.

   - Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y

     This was done as a workaround for virtual machine BIOSes that did
     not bother to clear CR0.CD (because ancient KVM/QEMU did not bother
     to set it, in turn), and there's zero reason to extend the quirk to
     also ignore guest PAT.

  x86 - SEV fixes:

   - Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts
     SHUTDOWN while running an SEV-ES guest.

   - Clean up the recognition of emulation failures on SEV guests, when
     KVM would like to "skip" the instruction but it had already been
     partially emulated. This makes it possible to drop a hack that
     second guessed the (insufficient) information provided by the
     emulator, and just do the right thing.

  Documentation:

   - Various updates and fixes, mostly for x86

   - MTRR and PAT fixes and optimizations"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (164 commits)
  KVM: selftests: Avoid using forced target for generating arm64 headers
  tools headers arm64: Fix references to top srcdir in Makefile
  KVM: arm64: Add tracepoint for MMIO accesses where ISV==0
  KVM: arm64: selftest: Perform ISB before reading PAR_EL1
  KVM: arm64: selftest: Add the missing .guest_prepare()
  KVM: arm64: Always invalidate TLB for stage-2 permission faults
  KVM: x86: Service NMI requests after PMI requests in VM-Enter path
  KVM: arm64: Handle AArch32 SPSR_{irq,abt,und,fiq} as RAZ/WI
  KVM: arm64: Do not let a L1 hypervisor access the *32_EL2 sysregs
  KVM: arm64: Refine _EL2 system register list that require trap reinjection
  arm64: Add missing _EL2 encodings
  arm64: Add missing _EL12 encodings
  KVM: selftests: aarch64: vPMU test for validating user accesses
  KVM: selftests: aarch64: vPMU register test for unimplemented counters
  KVM: selftests: aarch64: vPMU register test for implemented counters
  KVM: selftests: aarch64: Introduce vpmu_counter_access test
  tools: Import arm_pmuv3.h
  KVM: arm64: PMU: Allow userspace to limit PMCR_EL0.N for the guest
  KVM: arm64: Sanitize PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} before first run
  KVM: arm64: Add {get,set}_user for PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
  ...
2023-11-02 15:45:15 -10:00
Linus Torvalds
8999ad99f4 * Refactor and clean up TDX hypercall/module call infrastructure
* Handle retrying/resuming page conversion hypercalls
  * Make sure to use the (shockingly) reliable TSC in TDX guests
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmVBlqMACgkQaDWVMHDJ
 krBrhBAArKay0MvzmdzS4IQs8JqkmuMEHI6WabYv2POPjJNXrn5MelLH972pLuX9
 NJ3+yeOLmNMYwqu5qwLCxyeO5CtqEyT2lNumUrxAtHQG4+oS2RYJYUalxMuoGxt8
 fAHxbItFg0TobBSUtwcnN2R2WdXwPuUW0Co+pJfLlZV4umVM7QANO1nf1g8YmlDD
 sVtpDaeKJRdylmwgWgAyGow0tDKd6oZB9j/vOHvZRrEQ+DMjEtG75fjwbjbu43Cl
 tI/fbxKjzAkOFcZ7PEPsQ8jE1h9DXU+JzTML9Nu/cPMalxMeBg3Dol/JOEbqgreI
 4W8Lg7g071EkUcQDxpwfe4aS6rsfsbwUIV4gJVkg9ZhlT7RayWsFik2CfBpJ4IMQ
 TM8BxtCEGCz3cxVvg3mstX9rRA7eNlXOzcKE/8Y7cpSsp94bA9jtf2GgUSUoi9St
 y+fIEei8mgeHutdiFh8psrmR7hp6iX/ldMFqHtjNo6xatf2KjdVHhVSU13Jz544z
 43ATNi1gZeHOgfwlAlIxLPDVDJidHuux3f6g2vfMkAqItyEqFauC1HA1pIDgckoY
 9FpBPp9vNUToSPp6reB6z/PkEBIrG2XtQh82JLt2CnCb6aTUtnPds+psjtT4sSE/
 a9SQvZLWWmpj+BlI2yrtfJzhy7SwhltgdjItQHidmCNEn0PYfTc=
 =FJ1Y
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 TDX updates from Dave Hansen:
 "The majority of this is a rework of the assembly and C wrappers that
  are used to talk to the TDX module and VMM. This is a nice cleanup in
  general but is also clearing the way for using this code when Linux is
  the TDX VMM.

  There are also some tidbits to make TDX guests play nicer with Hyper-V
  and to take advantage the hardware TSC.

  Summary:

   - Refactor and clean up TDX hypercall/module call infrastructure

   - Handle retrying/resuming page conversion hypercalls

   - Make sure to use the (shockingly) reliable TSC in TDX guests"

[ TLA reminder: TDX is "Trust Domain Extensions", Intel's guest VM
  confidentiality technology ]

* tag 'x86_tdx_for_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tdx: Mark TSC reliable
  x86/tdx: Fix __noreturn build warning around __tdx_hypercall_failed()
  x86/virt/tdx: Make TDX_MODULE_CALL handle SEAMCALL #UD and #GP
  x86/virt/tdx: Wire up basic SEAMCALL functions
  x86/tdx: Remove 'struct tdx_hypercall_args'
  x86/tdx: Reimplement __tdx_hypercall() using TDX_MODULE_CALL asm
  x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL
  x86/tdx: Extend TDX_MODULE_CALL to support more TDCALL/SEAMCALL leafs
  x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
  x86/tdx: Rename __tdx_module_call() to __tdcall()
  x86/tdx: Make macros of TDCALLs consistent with the spec
  x86/tdx: Skip saving output regs when SEAMCALL fails with VMFailInvalid
  x86/tdx: Zero out the missing RSI in TDX_HYPERCALL macro
  x86/tdx: Retry partially-completed page conversion hypercalls
2023-11-01 10:28:32 -10:00
Linus Torvalds
7d461b291e drm for 6.7-rc1
kernel:
 - add initial vmemdup-user-array
 
 core:
 - fix platform remove() to return void
 - drm_file owner updated to reflect owner
 - move size calcs to drm buddy allocator
 - let GPUVM build as a module
 - allow variable number of run-queues in scheduler
 
 edid:
 - handle bad h/v sync_end in EDIDs
 
 panfrost:
 - add Boris as maintainer
 
 fbdev:
 - use fb_ops helpers more
 - only allow logo use from fbcon
 - rename fb_pgproto to pgprot_framebuffer
 - add HPD state to drm_connector_oob_hotplug_event
 - convert to fbdev i/o mem helpers
 
 i915:
 - Enable meteorlake by default
 - Early Xe2 LPD/Lunarlake display enablement
 - Rework subplatforms into IP version checks
 - GuC based TLB invalidation for Meteorlake
 - Display rework for future Xe driver integration
 - LNL FBC features
 - LNL display feature capability reads
 - update recommended fw versions for DG2+
 - drop fastboot module parameter
 - added deviceid for Arrowlake-S
 - drop preproduction workarounds
 - don't disable preemption for resets
 - cleanup inlines in headers
 - PXP firmware loading fix
 - Fix sg list lengths
 - DSC PPS state readout/verification
 - Add more RPL P/U PCI IDs
 - Add new DG2-G12 stepping
 - DP enhanced framing support to state checker
 - Improve shared link bandwidth management
 - stop using GEM macros in display code
 - refactor related code into display code
 - locally enable W=1 warnings
 - remove PSR watchdog timers on LNL
 
 amdgpu:
 - RAS/FRU EEPROM updatse
 - IP discovery updatses
 - GC 11.5 support
 - DCN 3.5 support
 - VPE 6.1 support
 - NBIO 7.11 support
 - DML2 support
 - lots of IP updates
 - use flexible arrays for bo list handling
 - W=1 fixes
 - Enable seamless boot in more cases
 - Enable context type property for HDMI
 - Rework GPUVM TLB flushing
 - VCN IB start/size alignment fixes
 
 amdkfd:
 - GC 10/11 fixes
 - GC 11.5 support
 - use partial migration in GPU faults
 
 radeon:
 - W=1 Fixes
 - fix some possible buffer overflow/NULL derefs
 nouveau:
 - update uapi for NO_PREFETCH
 - scheduler/fence fixes
 - rework suspend/resume for GSP-RM
 - rework display in preparation for GSP-RM
 
 habanalabs:
 - uapi: expose tsc clock
 - uapi: block access to eventfd through control device
 - uapi: force dma-buf export to PAGE_SIZE alignments
 - complete move to accel subsystem
 - move firmware interface include files
 - perform hard reset on PCIe AXI drain event
 - optimise user interrupt handling
 
 msm:
 - DP: use existing helpers for DPCD
 - DPU: interrupts reworked
 - gpu: a7xx (a730/a740) support
 - decouple msm_drv from kms for headless devices
 
 mediatek:
 - MT8188 dsi/dp/edp support
 - DDP GAMMA - 12 bit LUT support
 - connector dynamic selection capability
 
 rockchip:
 - rv1126 mipi-dsi/vop support
 - add planar formats
 
 ast:
 - rename constants
 
 panels:
 - Mitsubishi AA084XE01
 - JDI LPM102A188A
 - LTK050H3148W-CTA6
 
 ivpu:
 - power management fixes
 
 qaic:
 - add detach slice bo api
 
 komeda:
 - add NV12 writeback
 
 tegra:
 - support NVSYNC/NHSYNC
 - host1x suspend fixes
 
 ili9882t:
 - separate into own driver
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmVAgzYACgkQDHTzWXnE
 hr7ZEQ//UXne3tyGOsU3X8r+lstLFDMa90a3hvTg6hX+Q0MjHd/clwkKFkLpkipL
 n7gIZlaHl11dRs0FzrIZA5EVAAgjMLKmIl10NBDFec6ZFA3VERcggx8y61uifI15
 VviMR1VbLHYZaCdyrQOK0A4wcktWnKXyoXp7cwy9crdc2GOBMUZkdIqtvD7jHxQx
 UMIFnzi1CyKUX/Fjt/JceYcNk9y2ZGkzakYO3sHcUdv4DPu9qX4kNzpjF691AZBP
 UeKWvCswTRVg2M0kuo/RYIBzqaTmOlk6dHLWBognIeZPyuyhCcaGC2d64c6tShwQ
 dtHdi+IgyQ8s2qb350ymKTQUP7xA/DfZBwH7LvrZALBxeQGYQN1CnsgDMOS2wcUc
 XrRFiS7PxEOtMMBctcPBnnoV5ttnsLLlPpzM9puh9sUFMn6CgLzcAMqXdqxzMajH
 +dz2aD1N0vMqq4varozOg9SC2QamgUiPN/TQfrulhCTCfQaXczy5x1OYiIz65+Sl
 mKoe2WASuP9Ve8do4N/wEwH5SZY2ItipBdUTRxttY9NTanmV0X5DjZBXH5b9XGci
 Zl5Ar613f9zwm5T5BVA5k6s3ZbGY6QcP5pDNTCPaSgitfFXIdReBZ2CaYzK3MPg/
 Wit/TXrud9yT6VPpI1igboMyasf5QubV1MY1K83kOCWr9u8R2CM=
 =l79u
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm

Pull drm updates from Dave Airlie:
 "Highlights:
   - AMD adds some more upcoming HW platforms
   - Intel made Meteorlake stable and started adding Lunarlake
   - nouveau has a bunch of display rework in prepartion for the NVIDIA
     GSP firmware support
   - msm adds a7xx support
   - habanalabs has finished migration to accel subsystem

  Detail summary:

  kernel:
   - add initial vmemdup-user-array

  core:
   - fix platform remove() to return void
   - drm_file owner updated to reflect owner
   - move size calcs to drm buddy allocator
   - let GPUVM build as a module
   - allow variable number of run-queues in scheduler

  edid:
   - handle bad h/v sync_end in EDIDs

  panfrost:
   - add Boris as maintainer

  fbdev:
   - use fb_ops helpers more
   - only allow logo use from fbcon
   - rename fb_pgproto to pgprot_framebuffer
   - add HPD state to drm_connector_oob_hotplug_event
   - convert to fbdev i/o mem helpers

  i915:
   - Enable meteorlake by default
   - Early Xe2 LPD/Lunarlake display enablement
   - Rework subplatforms into IP version checks
   - GuC based TLB invalidation for Meteorlake
   - Display rework for future Xe driver integration
   - LNL FBC features
   - LNL display feature capability reads
   - update recommended fw versions for DG2+
   - drop fastboot module parameter
   - added deviceid for Arrowlake-S
   - drop preproduction workarounds
   - don't disable preemption for resets
   - cleanup inlines in headers
   - PXP firmware loading fix
   - Fix sg list lengths
   - DSC PPS state readout/verification
   - Add more RPL P/U PCI IDs
   - Add new DG2-G12 stepping
   - DP enhanced framing support to state checker
   - Improve shared link bandwidth management
   - stop using GEM macros in display code
   - refactor related code into display code
   - locally enable W=1 warnings
   - remove PSR watchdog timers on LNL

  amdgpu:
   - RAS/FRU EEPROM updatse
   - IP discovery updatses
   - GC 11.5 support
   - DCN 3.5 support
   - VPE 6.1 support
   - NBIO 7.11 support
   - DML2 support
   - lots of IP updates
   - use flexible arrays for bo list handling
   - W=1 fixes
   - Enable seamless boot in more cases
   - Enable context type property for HDMI
   - Rework GPUVM TLB flushing
   - VCN IB start/size alignment fixes

  amdkfd:
   - GC 10/11 fixes
   - GC 11.5 support
   - use partial migration in GPU faults

  radeon:
   - W=1 Fixes
   - fix some possible buffer overflow/NULL derefs

  nouveau:
   - update uapi for NO_PREFETCH
   - scheduler/fence fixes
   - rework suspend/resume for GSP-RM
   - rework display in preparation for GSP-RM

  habanalabs:
   - uapi: expose tsc clock
   - uapi: block access to eventfd through control device
   - uapi: force dma-buf export to PAGE_SIZE alignments
   - complete move to accel subsystem
   - move firmware interface include files
   - perform hard reset on PCIe AXI drain event
   - optimise user interrupt handling

  msm:
   - DP: use existing helpers for DPCD
   - DPU: interrupts reworked
   - gpu: a7xx (a730/a740) support
   - decouple msm_drv from kms for headless devices

  mediatek:
   - MT8188 dsi/dp/edp support
   - DDP GAMMA - 12 bit LUT support
   - connector dynamic selection capability

  rockchip:
   - rv1126 mipi-dsi/vop support
   - add planar formats

  ast:
   - rename constants

  panels:
   - Mitsubishi AA084XE01
   - JDI LPM102A188A
   - LTK050H3148W-CTA6

  ivpu:
   - power management fixes

  qaic:
   - add detach slice bo api

  komeda:
   - add NV12 writeback

  tegra:
   - support NVSYNC/NHSYNC
   - host1x suspend fixes

  ili9882t:
   - separate into own driver"

* tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm: (1803 commits)
  drm/amdgpu: Remove unused variables from amdgpu_show_fdinfo
  drm/amdgpu: Remove duplicate fdinfo fields
  drm/amd/amdgpu: avoid to disable gfxhub interrupt when driver is unloaded
  drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems
  drm/amdgpu: Retrieve CE count from ce_count_lo_chip in EccInfo table
  drm/amdgpu: Identify data parity error corrected in replay mode
  drm/amdgpu: Fix typo in IP discovery parsing
  drm/amd/display: fix S/G display enablement
  drm/amdxcp: fix amdxcp unloads incompletely
  drm/amd/amdgpu: fix the GPU power print error in pm info
  drm/amdgpu: Use pcie domain of xcc acpi objects
  drm/amd: check num of link levels when update pcie param
  drm/amdgpu: Add a read to GFX v9.4.3 ring test
  drm/amd/pm: call smu_cmn_get_smc_version in is_mode1_reset_supported.
  drm/amdgpu: get RAS poison status from DF v4_6_2
  drm/amdgpu: Use discovery table's subrevision
  drm/amd/display: 3.2.256
  drm/amd/display: add interface to query SubVP status
  drm/amd/display: Read before writing Backlight Mode Set Register
  drm/amd/display: Disable SYMCLK32_SE RCO on DCN314
  ...
2023-11-01 06:28:35 -10:00
Linus Torvalds
59fff63cc2 platform-drivers-x86 for v6.7-1
Highlights:
  - asus-wmi:		Support for screenpad and solve brightness key
 			press duplication
  - int3472:		Eliminate the last use of deprecated GPIO functions
  - mlxbf-pmc:		New HW support
  - msi-ec:		Support new EC configurations
  - thinkpad_acpi:	Support reading aux MAC address during passthrough
  - wmi: 		Fixes & improvements
  - x86-android-tablets:	Detection fix and avoid use of GPIO private APIs
  - Debug & metrics interface improvements
  - Miscellaneous cleanups / fixes / improvements
 
 The following is an automated shortlog grouped by driver:
 
 acer-wmi:
  -  Remove void function return
 
 amd/hsmp:
  -  add support for metrics tbl
  -  create plat specific struct
  -  Fix iomem handling
  -  improve the error log
 
 amd/pmc:
  -  Add dump_custom_stb module parameter
  -  Add PMFW command id to support S2D force flush
  -  Handle overflow cases where the num_samples range is higher
  -  Use flex array when calling amd_pmc_stb_debugfs_open_v2()
 
 asus-wireless:
  -  Replace open coded acpi_match_acpi_device()
 
 asus-wmi:
  -  add support for ASUS screenpad
  -  Do not report brightness up/down keys when also reported by acpi_video
 
 gpiolib: acpi:
  -  Add a ignore interrupt quirk for Peaq C1010
  -  Check if a GPIO is listed in ignore_interrupt earlier
 
 hp-bioscfg:
  -  Annotate struct bios_args with __counted_by
 
 inspur-platform-profile:
  -  Add platform profile support
 
 int3472:
  -  Add new skl_int3472_fill_gpiod_lookup() helper
  -  Add new skl_int3472_gpiod_get_from_temp_lookup() helper
  -  Stop using gpiod_toggle_active_low()
  -  Switch to devm_get_gpiod()
 
 intel: bytcrc_pwrsrc:
  -  Convert to platform remove callback returning void
 
 intel/ifs:
  -  Add new CPU support
  -  Add new error code
  -  ARRAY BIST for Sierra Forest
  -  Gen2 scan image loading
  -  Gen2 Scan test support
  -  Metadata validation for start_chunk
  -  Refactor image loading code
  -  Store IFS generation number
  -  Validate image size
 
 intel_speed_select_if:
  -  Remove hardcoded map size
  -  Use devm_ioremap_resource
 
 intel/tpmi:
  -  Add debugfs support for read/write blocked
  -  Add defines to get version information
 
 intel-uncore-freq:
  -  Ignore minor version change
 
 ISST:
  -  Allow level 0 to be not present
  -  Ignore minor version change
  -  Use fuse enabled mask instead of allowed levels
 
 mellanox:
  -  Fix misspelling error in routine name
  -  Rename some init()/exit() functions for consistent naming
 
 mlxbf-bootctl:
  -  Convert to platform remove callback returning void
 
 mlxbf-pmc:
  -  Add support for BlueField-3
 
 mlxbf-tmfifo:
  -  Convert to platform remove callback returning void
 
 mlx-Convert to platform remove callback returning void:
  - mlx-Convert to platform remove callback returning void
 
 mlxreg-hotplug:
  -  Convert to platform remove callback returning void
 
 mlxreg-io:
  -  Convert to platform remove callback returning void
 
 mlxreg-lc:
  -  Convert to platform remove callback returning void
 
 msi-ec:
  -  Add more EC configs
  -  rename fn_super_swap
 
 nvsw-sn2201:
  -  Convert to platform remove callback returning void
 
 sel3350-Convert to platform remove callback returning void:
  - sel3350-Convert to platform remove callback returning void
 
 siemens: simatic-ipc-batt-apollolake:
  -  Convert to platform remove callback returning void
 
 siemens: simatic-ipc-batt:
  -  Convert to platform remove callback returning void
 
 siemens: simatic-ipc-batt-elkhartlake:
  -  Convert to platform remove callback returning void
 
 siemens: simatic-ipc-batt-f7188x:
  -  Convert to platform remove callback returning void
 
 siemens: simatic-ipc-batt:
  -  Simplify simatic_ipc_batt_remove()
 
 surface: acpi-notify:
  -  Convert to platform remove callback returning void
 
 surface: aggregator:
  -  Annotate struct ssam_event with __counted_by
 
 surface: aggregator-cdev:
  -  Convert to platform remove callback returning void
 
 surface: aggregator-registry:
  -  Convert to platform remove callback returning void
 
 surface: dtx:
  -  Convert to platform remove callback returning void
 
 surface: gpe:
  -  Convert to platform remove callback returning void
 
 surface: hotplug:
  -  Convert to platform remove callback returning void
 
 surface: surface3-wmi:
  -  Convert to platform remove callback returning void
 
 think-lmi:
  -  Add bulk save feature
  -  Replace kstrdup() + strreplace() with kstrdup_and_replace()
  -  Use strreplace() to replace a character by nul
 
 thinkpad_acpi:
  -  Add battery quirk for Thinkpad X120e
  -  replace deprecated strncpy with memcpy
  -  sysfs interface to auxmac
 
 tools/power/x86/intel-speed-select:
  -  Display error for core-power support
  -  Increase max CPUs in one request
  -  No TRL for non compute domains
  -  Sanitize integer arguments
  -  turbo-mode enable disable swapped
  -  Update help for TRL
  -  Use cgroup isolate for CPU 0
  -  v1.18 release
 
 wmi:
  -  Decouple probe deferring from wmi_block_list
  -  Decouple WMI device removal from wmi_block_list
  -  Fix opening of char device
  -  Fix probe failure when failing to register WMI devices
  -  Fix refcounting of WMI devices in legacy functions
 
 x86-android-tablets:
  -  Add a comment about x86_android_tablet_get_gpiod()
  -  Create a platform_device from module_init()
  -  Drop "linux,power-supply-name" from lenovo_yt3_bq25892_0_props[]
  -  Fix Lenovo Yoga Tablet 2 830F/L vs 1050F/L detection
  -  Remove invalid_aei_gpiochip from Peaq C1010
  -  Remove invalid_aei_gpiochip support
  -  Stop using gpiolib private APIs
  -  Use platform-device as gpio-keys parent
 
 xo15-ebook:
  -  Replace open coded acpi_match_acpi_device()
 
 Merges:
  -  Merge branch 'pdx86/platform-drivers-x86-int3472' into review-ilpo
  -  Merge branch 'pdx86/platform-drivers-x86-mellanox-init' into review-ilpo
  -  Merge remote-tracking branch 'intel-speed-select/intel-sst' into review-ilpo
  -  Merge remote-tracking branch 'pdx86/platform-drivers-x86-android-tablets' into review-hans
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSCSUwRdwTNL2MhaBlZrE9hU+XOMQUCZT+lBwAKCRBZrE9hU+XO
 Mck0AQCFU7dYLCF4d1CXtHf1eZhSXLpYdhcO+C08JGGoM+MqSgD+Jyb9KJHk4pxE
 FvKG51I9neyAne9lvNrLodHRzxCYgAo=
 =duM8
 -----END PGP SIGNATURE-----

Merge tag 'platform-drivers-x86-v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver updates from Ilpo Järvinen:

 - asus-wmi: Support for screenpad and solve brightness key press
   duplication

 - int3472: Eliminate the last use of deprecated GPIO functions

 - mlxbf-pmc: New HW support

 - msi-ec: Support new EC configurations

 - thinkpad_acpi: Support reading aux MAC address during passthrough

 - wmi: Fixes & improvements

 - x86-android-tablets: Detection fix and avoid use of GPIO private APIs

 - Debug & metrics interface improvements

 - Miscellaneous cleanups / fixes / improvements

* tag 'platform-drivers-x86-v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (80 commits)
  platform/x86: inspur-platform-profile: Add platform profile support
  platform/x86: thinkpad_acpi: Add battery quirk for Thinkpad X120e
  platform/x86: wmi: Decouple WMI device removal from wmi_block_list
  platform/x86: wmi: Fix opening of char device
  platform/x86: wmi: Fix probe failure when failing to register WMI devices
  platform/x86: wmi: Fix refcounting of WMI devices in legacy functions
  platform/x86: wmi: Decouple probe deferring from wmi_block_list
  platform/x86/amd/hsmp: Fix iomem handling
  platform/x86: asus-wmi: Do not report brightness up/down keys when also reported by acpi_video
  platform/x86: thinkpad_acpi: replace deprecated strncpy with memcpy
  tools/power/x86/intel-speed-select: v1.18 release
  tools/power/x86/intel-speed-select: Use cgroup isolate for CPU 0
  tools/power/x86/intel-speed-select: Increase max CPUs in one request
  tools/power/x86/intel-speed-select: Display error for core-power support
  tools/power/x86/intel-speed-select: No TRL for non compute domains
  tools/power/x86/intel-speed-select: turbo-mode enable disable swapped
  tools/power/x86/intel-speed-select: Update help for TRL
  tools/power/x86/intel-speed-select: Sanitize integer arguments
  platform/x86: acer-wmi: Remove void function return
  platform/x86/amd/pmc: Add dump_custom_stb module parameter
  ...
2023-10-31 17:53:00 -10:00
Paolo Bonzini
be47941980 KVM SVM changes for 6.7:
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
    running an SEV-ES guest.
 
  - Clean up handling "failures" when KVM detects it can't emulate the "skip"
    action for an instruction that has already been partially emulated.  Drop a
    hack in the SVM code that was fudging around the emulator code not giving
    SVM enough information to do the right thing.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8GHYSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5hwkQAIR8l1gWz/caz29biBzmRnDS+aZOXcYM
 8V8WBJqJgMKE9egibF4sADAlhInXzg19Xr7bQs6VfuvmdXrCn0UJ/nLorX+H85A2
 pph6iNlWO6tyQAjvk/AieaeUyZOqpCFmKOgxfN2Fr/Lrn7u3AdjXC20qPeFJSLXr
 YOTCQ704yvjjJp4yVA8JlclAQu38hanKiO5SZdlLzbuhUgWwQk4DVP2ZsYnhX+RO
 F6exxORvMnYF/LJe/kR2/DMLf2JWvyUmjRrGWoeRoksOw5BlXMc5HyTPHSJ2jDac
 lJaNtmZkTY1bDVWZk7N03ze5aFJa4DaqJdIFLtgujrFW8thog0P48aH6vmKi4UAA
 bXme9GFYbmJTkemaGRnrzidFV12uPNvvanS+1PDOw4sn4HpscoMSpZw5PeH2kBwV
 6uKNCJCwLtk8oe50yroKD7rJ/ASB7CeoqzbIL9s2TA0HSAskIf65T4eZp01uniyd
 Q98yCdrG2mudsg5aU5yMfe0LwZby5BB5kUCqIe4hyRC68GJR8wkAzhaFRgCn4aJE
 yaTyjnT2V3PGMEEJOPFdSF3VQGztljzQiXlEvBVj3zvMGQNTo2NhmS3ka4W+wW5G
 avRYv8dITlGRs6J2gV1vp8Eb5LzDrwRpRURSmzeP5rR58saKdljTZgNfOzfLeFr1
 WhLzonLz52IS
 =U0fq
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-svm-6.7' of https://github.com/kvm-x86/linux into HEAD

KVM SVM changes for 6.7:

 - Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
   running an SEV-ES guest.

 - Clean up handling "failures" when KVM detects it can't emulate the "skip"
   action for an instruction that has already been partially emulated.  Drop a
   hack in the SVM code that was fudging around the emulator code not giving
   SVM enough information to do the right thing.
2023-10-31 10:22:43 -04:00
Paolo Bonzini
e122d7a100 KVM x86 Xen changes for 6.7:
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
 
  - Use the fast path directly from the timer callback when delivering Xen timer
    events.  Avoid the problematic races with using the fast path by ensuring
    the hrtimer isn't running when (re)starting the timer or saving the timer
    information (for userspace).
 
  - Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8He8SHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5KyQP+wUH3n6hhJGScsSCpWXK6r8q+Y2ZBftY
 ecXuoTfeBJmsoTbnExF7K600DtbxHY5jjxt3ROmoUCertCFRCoq6pi5v4rbRDDQ1
 fmGkht43A6zAuHQ0Ntvkq4rNEmISAbzLP4EXOxZJ/Hxld91T8IutMFo7NN/YfOSx
 nb+qgb7B25T7ODGvzahRjxnoevCHBN/TdKeDrvsoWeMpVw+CDYqquQOcLfHMaBAN
 DqGwZzpdVqRQqg3TOuBGCiv5IcvskjkFUh0y6cEYkCR/MruLoT6CygoLImEV2naW
 RU0ZU9Y4cjf+BV/faQEdP6mDQwwCUHWLxDpXUVn03KQYQHlA7q6UgRKxy35ixZ5w
 Euxvg4m2ZGgJjsVLqTTMUlbLSNxD6wWZAVxGH7w8XghKrNmoj1IoajPZS+1rwyO2
 5rUynMKf3HMT6oeqqZH95aChlUMiAvaPYPc+ogku8Bt1zJQVv/xnk/6T95Vw6C/t
 KfYsV80rmJd/EL/fUXYX3mCMcZGHyv80QlOEc0uR4f25HGszCG8qHiSaUtnvQUjQ
 xaguSuO1Cf7sdhHPWj4p/US+Jerrgd8nzoQGvKUOkdLsQzU71xwjvTZNlmmBYKKO
 zgGIXZfaXa4JibAqnRrC+V8UdDPOwKvOEzmH0joLEzkTISnIG2LycvZ6tG7sTcMU
 0sIg2dvhJx/G
 =Z2eM
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-xen-6.7' of https://github.com/kvm-x86/linux into HEAD

KVM x86 Xen changes for 6.7:

 - Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.

 - Use the fast path directly from the timer callback when delivering Xen timer
   events.  Avoid the problematic races with using the fast path by ensuring
   the hrtimer isn't running when (re)starting the timer or saving the timer
   information (for userspace).

 - Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
2023-10-31 10:21:42 -04:00
Paolo Bonzini
f292dc8aad KVM x86 misc changes for 6.7:
- Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
    forcing more common use cases to eat the extra memory overhead.
 
  - Add IBPB and SBPB virtualization support.
 
  - Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
    creating the original vCPU would cause KVM to try to synchronize the vCPU's
    TSC and thus clobber the correct TSC being set by userspace.
 
  - Compute guest wall clock using a single TSC read to avoid generating an
    inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
 
  - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
     about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
 
  - Don't apply side effects to Hyper-V's synthetic timer on writes from
    userspace to fix an issue where the auto-enable behavior can trigger
    spurious interrupts, i.e. do auto-enabling only for guest writes.
 
  - Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
    without PML enabled.
 
  - Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
 
  - Use octal notation for file permissions through KVM x86.
 
  - Fix a handful of typo fixes and warts.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8EugSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5xS0P+gPTDO81CUZO70LrO2W4E7toRBf/F9x1
 /v5D/76p9hG32Z6+BJs/xxDxJFagw75MtoR5oKivtXiip3TxbfOyDOlaQkIRo85E
 /d95il/LRidL3Mv3TXRj1lykXnxSSz9tigAGEZti1Y9Fn9fXEIwurJH7dU5cBI1E
 fin5bsDaTNRjG4jjTiEUbnKPRTlD/S7CQJn4CaYvZhMv/eJkYDLyBBVy4VLoLzvD
 ctL6VJQLGPVxbxr9mEmulaqMrSuDIQQLkRVQJAViKyerBInTEc5d/GPCHuE8O3zi
 0r/QSJbMS9titWLz07NhJ1UH4VJNyaEhRlyJPSFhBW4h6dzUb3EXdUe0Hwa+JH/S
 H2cVqsANItTCIhvDtuEGIRDahu0eD+63h90InJ0gEVL1kSJS+UWZHB71PkUEQgAV
 2OsuT1D26fuxrv+0b9ioBZURycqKw++zGsrwyVhe77eBgqBJ12tbL4TAD+QNjaQ5
 HZTCe6YV83gZoOMeVkoTGSf96s9lGORgxsaAIXmFuLB9RVCVXhVh0ph2HZsnV8Hw
 ZXEXpBEFo7GUhb0NIvsk2W73QL87A3fLv15yITWc8KuC7/dXP9z6KpSKjFySS69X
 uWD1MVx6shhvbg97UzoJlXc3/z0aVzmdZJudE5d0gcFvAjIItqp6ICPOoKxfj8pT
 tqRZu3kVHd61
 =sfp8
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.7' of https://github.com/kvm-x86/linux into HEAD

KVM x86 misc changes for 6.7:

 - Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
   forcing more common use cases to eat the extra memory overhead.

 - Add IBPB and SBPB virtualization support.

 - Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
   creating the original vCPU would cause KVM to try to synchronize the vCPU's
   TSC and thus clobber the correct TSC being set by userspace.

 - Compute guest wall clock using a single TSC read to avoid generating an
   inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.

 - "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
    about a "Firmware Bug" if the bit isn't set for select F/M/S combos.

 - Don't apply side effects to Hyper-V's synthetic timer on writes from
   userspace to fix an issue where the auto-enable behavior can trigger
   spurious interrupts, i.e. do auto-enabling only for guest writes.

 - Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
   without PML enabled.

 - Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.

 - Use octal notation for file permissions through KVM x86.

 - Fix a handful of typo fixes and warts.
2023-10-31 10:15:15 -04:00
Paolo Bonzini
f233646760 KVM x86 APIC changes for 6.7:
- Purge VMX's posted interrupt descriptor *before* loading APIC state when
    handling KVM_SET_LAPIC.  Purging the PID after loading APIC state results in
    lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC
    state, i.e. can immediately pend an IRQ if the expiry is in the past.
 
  - Clear the ICR.BUSY bit when handling trap-like x2APIC writes to suppress a
    WARN due to KVM expecting the BUSY bit to be cleared when sending IPIs.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8DrkSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5Z54P/jQbjmwbvAVDa1g2HGr16ouFb3v+rCgV
 eXiN2/8tDlQl9JkvZsjOY1AtJauS3RZu/SncT3IxCCvHVrKKt0xgnhWT//rAMX3N
 zrrFxXJmilFdd2ijsuK8flHVFO5i06Kx1NRjSmAMb76et1et1EI0I62F1rom8ocn
 l0tryTkV0qhWKlgxzLBNzzhzxxA2Mvb6vFhut+nfa0IisnQ+roh5dV4JS7Cdhy+5
 qIQ4J86JdDR6kmraZLCv8r7vIQn5EQPVCAC2WWalds7iWUW3ixy01NLweVLaVtA3
 xWRHYK7azvu3SJ9HvmSMRZzzTElQBYnE2d37ycyYewHbDUJ6FCXxh7H+BehYaVBW
 CK6GLaMHY8vyXLcB2cRAzZ7GgBSSDevL11/CMc2ueGc51AOsDCxFh6v69TU+/PiQ
 dMOYiNTuwV9QC3raIXPCc2Q9rqGp20xBqXIv5XLEyIxqcFQESvtRggg6zJdBZ3zB
 T5/xAXDf++rKFiu1wskFAbwsk3M4T36u1WlK9iayC9dPvCDGRzsle3tt9npmoj2X
 3e0Uz7fkRLki1YZW1qLmvqoYJ9bX9B9BjuKkORIqSqXuwcgPYEzHt1GcRByI2FSk
 KSTnNwgs/Bs498LsDdU1Z740ZkYFGmu2nK+CUvwxP+uejPsnlhik5kGqLSIeA7zA
 9kDpcdV+tZKJ
 =atZ2
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-apic-6.7' of https://github.com/kvm-x86/linux into HEAD

KVM x86 APIC changes for 6.7:

 - Purge VMX's posted interrupt descriptor *before* loading APIC state when
   handling KVM_SET_LAPIC.  Purging the PID after loading APIC state results in
   lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC
   state, i.e. can immediately pend an IRQ if the expiry is in the past.

 - Clear the ICR.BUSY bit when handling trap-like x2APIC writes.  This avoids a
   WARN, due to KVM expecting the BUSY bit to be cleared when sending IPIs.
2023-10-31 10:11:19 -04:00
Linus Torvalds
eb55307e67 X86 core code updates:
- Limit the hardcoded topology quirk for Hygon CPUs to those which have a
     model ID less than 4. The newer models have the topology CPUID leaf 0xB
     correctly implemented and are not affected.
 
   - Make SMT control more robust against enumeration failures
 
     SMT control was added to allow controlling SMT at boottime or
     runtime. The primary purpose was to provide a simple mechanism to
     disable SMT in the light of speculation attack vectors.
 
     It turned out that the code is sensible to enumeration failures and
     worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration
     which means the primary thread mask is not set up correctly. By chance
     a XEN/PV boot ends up with smp_num_siblings == 2, which makes the
     hotplug control stay at its default value "enabled". So the mask is
     never evaluated.
 
     The ongoing rework of the topology evaluation caused XEN/PV to end up
     with smp_num_siblings == 1, which sets the SMT control to "not
     supported" and the empty primary thread mask causes the hotplug core to
     deny the bringup of the APS.
 
     Make the decision logic more robust and take 'not supported' and 'not
     implemented' into account for the decision whether a CPU should be
     booted or not.
 
   - Fake primary thread mask for XEN/PV
 
     Pretend that all XEN/PV vCPUs are primary threads, which makes the
     usage of the primary thread mask valid on XEN/PV. That is consistent
     with because all of the topology information on XEN/PV is fake or even
     non-existent.
 
   - Encapsulate topology information in cpuinfo_x86
 
     Move the randomly scattered topology data into a separate data
     structure for readability and as a preparatory step for the topology
     evaluation overhaul.
 
   - Consolidate APIC ID data type to u32
 
     It's fixed width hardware data and not randomly u16, int, unsigned long
     or whatever developers decided to use.
 
   - Cure the abuse of cpuinfo for persisting logical IDs.
 
     Per CPU cpuinfo is used to persist the logical package and die
     IDs. That's really not the right place simply because cpuinfo is
     subject to be reinitialized when a CPU goes through an offline/online
     cycle.
 
     Use separate per CPU data for the persisting to enable the further
     topology management rework. It will be removed once the new topology
     management is in place.
 
   - Provide a debug interface for inspecting topology information
 
     Useful in general and extremly helpful for validating the topology
     management rework in terms of correctness or "bug" compatibility.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+yX0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoROUD/4vlvKEcpm9rbI5DzLcaq4DFHKbyEZF
 cQtzuOSM/9vTc9DHnuoNNLl9TWSYxiVYnejf3E21evfsqspYlzbTH8bId9XBCUid
 6B68AJW842M2erNuwj0b0HwF1z++zpDmBDyhGOty/KQhoM8pYOHMvntAmbzJbuso
 Dgx6BLVFcboTy6RwlfRa0EE8f9W5V+JbmG/VBDpdyCInal7VrudoVFZmWQnPIft7
 zwOJpAoehkp8OKq7geKDf79yWxu9a1sNPd62HtaVEvfHwehHqE6OaMLss1us+0vT
 SJ/D6gmRQBOwcXaZL0wL1dG7Km9Et4AisOvzhXGvTa5b2D5oljVoqJ7V7FTf5g3u
 y3aqWbeUJzERUbeJt1HoGVAKyA4GtZOvg+TNIysf6F1Z4khl9alfa9jiqjj4g1au
 zgItq/ZMBEBmJ7X4FxQUEUVBG2CDsEidyNBDRcimWQUDfBakV/iCs0suD8uu8ZOD
 K5jMx8Hi2+xFx7r1YqsfsyMBYOf/zUZw65RbNe+kI992JbJ9nhcODbnbo5MlAsyv
 vcqlK5FwXgZ4YAC8dZHU/tyTiqAW7oaOSkqKwTP5gcyNEqsjQHV//q6v+uqtjfYn
 1C4oUsRHT2vJiV9ktNJTA4GQHIYF4geGgpG8Ih2SjXsSzdGtUd3DtX1iq0YiLEOk
 eHhYsnniqsYB5g==
 =xrz8
 -----END PGP SIGNATURE-----

Merge tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 core updates from Thomas Gleixner:

 - Limit the hardcoded topology quirk for Hygon CPUs to those which have
   a model ID less than 4.

   The newer models have the topology CPUID leaf 0xB correctly
   implemented and are not affected.

 - Make SMT control more robust against enumeration failures

   SMT control was added to allow controlling SMT at boottime or
   runtime. The primary purpose was to provide a simple mechanism to
   disable SMT in the light of speculation attack vectors.

   It turned out that the code is sensible to enumeration failures and
   worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration
   which means the primary thread mask is not set up correctly. By
   chance a XEN/PV boot ends up with smp_num_siblings == 2, which makes
   the hotplug control stay at its default value "enabled". So the mask
   is never evaluated.

   The ongoing rework of the topology evaluation caused XEN/PV to end up
   with smp_num_siblings == 1, which sets the SMT control to "not
   supported" and the empty primary thread mask causes the hotplug core
   to deny the bringup of the APS.

   Make the decision logic more robust and take 'not supported' and 'not
   implemented' into account for the decision whether a CPU should be
   booted or not.

 - Fake primary thread mask for XEN/PV

   Pretend that all XEN/PV vCPUs are primary threads, which makes the
   usage of the primary thread mask valid on XEN/PV. That is consistent
   with because all of the topology information on XEN/PV is fake or
   even non-existent.

 - Encapsulate topology information in cpuinfo_x86

   Move the randomly scattered topology data into a separate data
   structure for readability and as a preparatory step for the topology
   evaluation overhaul.

 - Consolidate APIC ID data type to u32

   It's fixed width hardware data and not randomly u16, int, unsigned
   long or whatever developers decided to use.

 - Cure the abuse of cpuinfo for persisting logical IDs.

   Per CPU cpuinfo is used to persist the logical package and die IDs.
   That's really not the right place simply because cpuinfo is subject
   to be reinitialized when a CPU goes through an offline/online cycle.

   Use separate per CPU data for the persisting to enable the further
   topology management rework. It will be removed once the new topology
   management is in place.

 - Provide a debug interface for inspecting topology information

   Useful in general and extremly helpful for validating the topology
   management rework in terms of correctness or "bug" compatibility.

* tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/apic, x86/hyperv: Use u32 in hv_snp_boot_ap() too
  x86/cpu: Provide debug interface
  x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids
  x86/apic: Use u32 for wakeup_secondary_cpu[_64]()
  x86/apic: Use u32 for [gs]et_apic_id()
  x86/apic: Use u32 for phys_pkg_id()
  x86/apic: Use u32 for cpu_present_to_apicid()
  x86/apic: Use u32 for check_apicid_used()
  x86/apic: Use u32 for APIC IDs in global data
  x86/apic: Use BAD_APICID consistently
  x86/cpu: Move cpu_l[l2]c_id into topology info
  x86/cpu: Move logical package and die IDs into topology info
  x86/cpu: Remove pointless evaluation of x86_coreid_bits
  x86/cpu: Move cu_id into topology info
  x86/cpu: Move cpu_core_id into topology info
  hwmon: (fam15h_power) Use topology_core_id()
  scsi: lpfc: Use topology_core_id()
  x86/cpu: Move cpu_die_id into topology info
  x86/cpu: Move phys_proc_id into topology info
  x86/cpu: Encapsulate topology information in cpuinfo_x86
  ...
2023-10-30 17:37:47 -10:00
Linus Torvalds
943af0e73a Updates for the X86 APIC:
- Make the quirk for non-maskable MSI interrupts in the affinity setter
     functional again.
 
     It was broken by a MSI core code update, which restructured the code in
     a way that the quirk flag was not longer set correctly.
 
     Trying to restore the core logic caused a deeper inspection and it
     turned out that the extra quirk flag is not required at all because
     it's the inverse of the reservation mode bit, which only can be set
     when the MSI interrupt is maskable.
 
     So the trivial fix is to use the reservation mode check in the affinity
     setter function and remove almost 40 lines of code related to the
     no-mask quirk flag.
 
   - Cure a Kconfig dependency issue which causes compile fails by correcting
     the conditionals in the affected heaer files.
 
   - Clean up coding style in the UV APIC driver.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+w2ETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYodRIEACugQpAiND53Cz9MTJE1XWGheBn8n+y
 WZtbVckK9LXHfnf99dtigtq3ycPBg2Mx/pItT1c71iN/NyVPNCC/3mM+ntP53etX
 06v7VoIgRpF+GRQsIbjvVfBkKWS3G5zqHq6hazE0lcPhPEiOIMSBcusSQaA3C3sp
 BT23rrx/59JKCb/187pum0Obx+n1oH+MG91wJ1v0zNOGgfx1u5gBPyKDZ0JAsMPx
 r05991Lp4K80ooEsk/PKgpcZuPZD3QRwFDLHJwh2ITjsC22ItOnhN/c/1J8MZtk9
 5YBaIzy+vyQd+nksqmDtB48FD0ioW/WLUR5zV9MMpD3RQAGgrxQmXeTqScu/j8tn
 I4QZGi80HZSWLwFjWL2DoAw5LHZDh8ksxIYZBHE+8LHx0ymyX+7//naZnNFn9cXM
 K5orouZFQi+dvfD7MAla73fiibPs6cHjGzRKfsfNlLBNNj+/ffcEolLlKGhEX9fx
 R1w7gbtNs/RDnfoa45cvmez8UxJB5zei7l2HWbSecR2DDTQ9aGfEU/0NEr9muBzK
 cuZIpmgD09VvS84jCxdYbaXA5Cau3NLOE5Os6UN9Wa3dUjWP5PArWkc8RzxP3Bgx
 PANtytesqJ9YeDDg9SyylPw6d3EYNldtFZDXlD32K2O2SDmUZxxJa+0og8K9YNJN
 aIUcqiSaYhg5rQ==
 =2lhV
 -----END PGP SIGNATURE-----

Merge tag 'x86-apic-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 APIC updates from Thomas Gleixner:

 - Make the quirk for non-maskable MSI interrupts in the affinity setter
   functional again.

   It was broken by a MSI core code update, which restructured the code
   in a way that the quirk flag was not longer set correctly.

   Trying to restore the core logic caused a deeper inspection and it
   turned out that the extra quirk flag is not required at all because
   it's the inverse of the reservation mode bit, which only can be set
   when the MSI interrupt is maskable.

   So the trivial fix is to use the reservation mode check in the
   affinity setter function and remove almost 40 lines of code related
   to the no-mask quirk flag.

 - Cure a Kconfig dependency issue which causes compile failures by
   correcting the conditionals in the affected header files.

 - Clean up coding style in the UV APIC driver.

* tag 'x86-apic-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic/msi: Fix misconfigured non-maskable MSI quirk
  x86/msi: Fix compile error caused by CONFIG_GENERIC_MSI_IRQ=y && !CONFIG_X86_LOCAL_APIC
  x86/platform/uv/apic: Clean up inconsistent indenting
2023-10-30 17:27:56 -10:00
Linus Torvalds
f0d25b5d0f x86 MM handling code changes for v6.7:
- Add new NX-stack self-test
  - Improve NUMA partial-CFMWS handling
  - Fix #VC handler bugs resulting in SEV-SNP boot failures
  - Drop the 4MB memory size restriction on minimal NUMA nodes
  - Reorganize headers a bit, in preparation to header dependency reduction efforts
  - Misc cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU9Ek4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gIJQ/+Mg6mzMaThyNXqhJszeZJBmDaBv2sqjAB
 5tcferg1nJBdNBzX8bJ95UFt9fIqeYAcgH00qlQCYSmyzbC1TQTk9U2Pre1zbOw4
 042ONK8sygKSje1zdYleHoBeqwnxD2VNM0NwBElhGjumwHRng/tbLiI9wx6qiz+C
 VsFXavkBszHGA1pjy9wZLGixYIH5jCygMpH134Wp+CIhpS+C4nftcGdIL1D5Oil1
 6Tm2XeI6uyfiQhm9IOwDjfoYeC7gUjx1rp8rHseGUMJxyO/BX9q5j1ixbsVriqfW
 97ucYuRL9mza7ic516C9v7OlAA3AGH2xWV+SYOGK88i9Co4kYzP4WnamxXqOsD8+
 popxG55oa6QelhaouTBZvgERpZ4fWupSDs/UccsDaE9leMCerNEbGHEzt/Mm/2sw
 xopjMQ0y5Kn6/fS0dLv8U+XHu4ANkvXJkFd6Ny0h/WfgGefuQOOTG9ruYgfeqqB8
 dViQ4R7CO8ySjD45KawAZl/EqL86x1M/CI1nlt0YY4vNwUuOJbebL7Jn8w3Fjxm5
 FVfUlDmcPdhZfL9Vnrsi6MIou1cU1yJPw4D6sXJ4sg4s7A4ebBcRRrjayVQ4msjv
 Q7cvBOMnWEHhOV11pvP50FmQuj74XW3bUqiuWrnK1SypvnhHavF6kc1XYpBLs1xZ
 y8nueJW2qPw=
 =tT5F
 -----END PGP SIGNATURE-----

Merge tag 'x86-mm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm handling updates from Ingo Molnar:

 - Add new NX-stack self-test

 - Improve NUMA partial-CFMWS handling

 - Fix #VC handler bugs resulting in SEV-SNP boot failures

 - Drop the 4MB memory size restriction on minimal NUMA nodes

 - Reorganize headers a bit, in preparation to header dependency
   reduction efforts

 - Misc cleanups & fixes

* tag 'x86-mm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size
  selftests/x86/lam: Zero out buffer for readlink()
  x86/sev: Drop unneeded #include
  x86/sev: Move sev_setup_arch() to mem_encrypt.c
  x86/tdx: Replace deprecated strncpy() with strtomem_pad()
  selftests/x86/mm: Add new test that userspace stack is in fact NX
  x86/sev: Make boot_ghcb_page[] static
  x86/boot: Move x86_cache_alignment initialization to correct spot
  x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach
  x86/sev-es: Allow copy_from_kernel_nofault() in earlier boot
  x86_64: Show CR4.PSE on auxiliaries like on BSP
  x86/iommu/docs: Update AMD IOMMU specification document URL
  x86/sev/docs: Update document URL in amd-memory-encryption.rst
  x86/mm: Move arch_memory_failure() and arch_is_platform_page() definitions from <asm/processor.h> to <asm/pgtable.h>
  ACPI/NUMA: Apply SRAT proximity domain to entire CFMWS window
  x86/numa: Introduce numa_fill_memblks()
2023-10-30 15:40:57 -10:00
Linus Torvalds
ed766c2611 Changes to the x86 entry code in v6.7:
- Make IA32_EMULATION boot time configurable with
    the new ia32_emulation=<bool> boot option.
 
  - Clean up fast syscall return validation code: convert
    it to C and refactor the code.
 
  - As part of this, optimize the canonical RIP test code.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU9DiARHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iNAw//cLn9gBXMVPDiCDVUOTqjkZ+OwIF11Y9v
 WatksSe5hrw0Bzl5CiSvtrWpTkKPnhyM8Lc1WD8l0YSMKprdkQfNAvQOPv0IMLjk
 XP1pgQhAiXwB87XL/G2sA6RunuK56zlnl7KJiDrQThrS/WOfrq3UkB2vyYEP4GtP
 69WZ/WM++u74uEml0+HZ0Z9HVvzwYl1VQPdTYfl52S4H3U8MXL89YEsPr13Ttq88
 FMKdXJ/VvItuVM/ZHHqFkGvRJjUtDWePLu29b684Ap6onDJ7uMMw86Gj5UxXtdpB
 Axsjuwlca8sCPotcqohay6IdyxIth6lMdvjPv0KhA+/QMrHbDaluv88YQs4k7Add
 1GPULH6oeDTHxMPOcJmFuSTpMY8HP6O9ZIXB6ogQRkLaDJKaWr5UQU7L2VBQ/WUy
 NRa6mba0XHYrz6U7DmtsdL0idWBJeJokHmaIcGJ/pp6gMznvufm2+SoJ6w6wcYva
 VTSTyrAAj/N9/TzJ5i8S2+yDPI9GanFpZJfYbW/rT9XGutvXWVKe3AmUNgR8O+hE
 JiEMfpR0TtXXlrik74jur/RPZhaFIE8MeCvJrkJ3oxQlPThYSTMBAlUOtD7kOfNT
 onjPrumREX4hOIBU+nnC9VrJMqxX9lz4xDzqw3jvX99Ma0o8Wx/UndWELX8tAYwd
 j8M8NWAbv90=
 =YkaP
 -----END PGP SIGNATURE-----

Merge tag 'x86-entry-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 entry updates from Ingo Molnar:

 - Make IA32_EMULATION boot time configurable with
   the new ia32_emulation=<bool> boot option

 - Clean up fast syscall return validation code: convert
   it to C and refactor the code

 - As part of this, optimize the canonical RIP test code

* tag 'x86-entry-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry/32: Clean up syscall fast exit tests
  x86/entry/64: Use TASK_SIZE_MAX for canonical RIP test
  x86/entry/64: Convert SYSRET validation tests to C
  x86/entry/32: Remove SEP test for SYSEXIT
  x86/entry/32: Convert do_fast_syscall_32() to bool return type
  x86/entry/compat: Combine return value test from syscall handler
  x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
  x86: Make IA32_EMULATION boot time configurable
  x86/entry: Make IA32 syscalls' availability depend on ia32_enabled()
  x86/elf: Make loading of 32bit processes depend on ia32_enabled()
  x86/entry: Compile entry_SYSCALL32_ignore() unconditionally
  x86/entry: Rename ignore_sysret()
  x86: Introduce ia32_enabled()
2023-10-30 15:27:27 -10:00
Linus Torvalds
5780e39edb x86 assembly code improvements for v6.7 are:
- Micro-optimize the x86 bitops code
 - Define target-specific {raw,this}_cpu_try_cmpxchg{64,128}() to improve code generation
 - Define and use raw_cpu_try_cmpxchg() preempt_count_set()
 - Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op
 - Remove the unused __sw_hweight64() implementation on x86-32
 - Misc fixes and cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU9BJkRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iBKBAAl++uUEmjJM2nfS1AtKS5Rn0YJGoj7ZLy
 MMjFJwYzw7nWqbkCZJqQATicmOVvdEUacYibYYfX4QkH3ylC9Av3ta1HeUEzqLpX
 qG8W4jJPu/qlAOtGI4mQkq/yVminea6xy6l0vcCF+pezUKnwADP/YSL2Zsg03UsX
 Nelty29NrpN/qCcLUJk40CHRjBhfYBVEk+HtCMahnftzLSNZWHYTgoYA9x4zm4Hg
 LGiION+dwJKwaafaNw8/k1ikRJYfc5c1ZImi7YoOeCXXtBq7VHOHf76axok42m4s
 2FJ7QefioQ/Gs1Gojxd99080F4H/Elt6uj05hR2493ncN4WVNKqYBOMezrlG686D
 CuoOZvMaJ1LyaEntu1YWF3dtHIDTVjHhe5dyVclgu2a+v1gP/16xTLIf2z/cWtuX
 whOcalFc7AdGXnLtGACHnhbc5Yh/Ex9Y49+6WYgBrDcgNDCGdTOa/m8kFmKFgOjh
 x8Ot1xUjFI3utGgMlfKpYqw281ws+DjPiVvGi+fmUf8eBsq2IJuO9/f9oyfFfFSj
 1bzwrL5Qop/ccZAOxtuLnVVVZ+Cz/5wHmMTI0rRE5BsoiVUrtWZfD+E0/vqdavjG
 0eabDTdwUgXNBp6ex0TajXKtQY7FKg3tzIuPqHRvhRCvlt77MoSdcWRCc37ac5GJ
 M6mAeLcitu8=
 =7WUt
 -----END PGP SIGNATURE-----

Merge tag 'x86-asm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 assembly code updates from Ingo Molnar:

 - Micro-optimize the x86 bitops code

 - Define target-specific {raw,this}_cpu_try_cmpxchg{64,128}() to
   improve code generation

 - Define and use raw_cpu_try_cmpxchg() preempt_count_set()

 - Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op

 - Remove the unused __sw_hweight64() implementation on x86-32

 - Misc fixes and cleanups

* tag 'x86-asm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/lib: Address kernel-doc warnings
  x86/entry: Fix typos in comments
  x86/entry: Remove unused argument %rsi passed to exc_nmi()
  x86/bitops: Remove unused __sw_hweight64() assembly implementation on x86-32
  x86/percpu: Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op
  x86/percpu: Use raw_cpu_try_cmpxchg() in preempt_count_set()
  x86/percpu: Define raw_cpu_try_cmpxchg and this_cpu_try_cmpxchg()
  x86/percpu: Define {raw,this}_cpu_try_cmpxchg{64,128}
  x86/asm/bitops: Use __builtin_clz{l|ll} to evaluate constant expressions
2023-10-30 14:18:00 -10:00
Linus Torvalds
2b95bb0526 Changes to the x86 boot code in v6.7:
- Rework PE header generation, primarily to generate a modern, 4k aligned
    kernel image view with narrower W^X permissions.
 
  - Further refine init-lifetime annotations
 
  - Misc cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU9B6ERHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jXOg/+NAOQKhIYK0uFqAM+CEhZX4cqsJ9Ck0ze
 bqQ8pf5iCkbVZ+6ByiMSOszScTgVTSalRfKMYR+Fa9PVkLK4SNAeYPnGYugmLRoj
 U3lZYFpNDEwsZOmFwvqn7p+bGBQcBYKZuVI6bQh5U7Go4v6ujPjK4zTAK8SWDdTp
 DtEzhj9tELcYlm1NSV2OYu/k0IWAFV3Fc++G3WAm85xOK7oXVOYeMIlaVkpOyAXu
 th3yCw+Q0u1tuBS++77FwsEPt1KTzKGcTL7HpPrb4e4e4snOhmri+KAM/Noef7Vm
 lWqo8fTAeYwpYQ80oFsXVDhuI5LsfsuQgQid20sy1cWwswe1o1A73/AeP4pRogWl
 zLJuRcuNg2/VhPvMLdBWn5QdgJjH7CngeH+r/YkZPssPo6tfwa5UW7HOTCQvLsO9
 a+xy098qkk9d+8Za0sYMuv8/4+Ev5II2haP8edLgNWQ8S5qKIUQaY+r6268pIN/F
 0fGP9B3wblBjiNWCnd8UBh6T271g1O4vaMUt2URdcW3QObEq2EGnNiTc5tx9OPnP
 ZxQdAIl6pB0H0HIe9/7PABF40biKn84zmSl+KuXrhvh1f5FjYjJWVNyKlAKdSpSR
 wjvzg1KbhLiAHV05oQSHR7txMHJxfjpxAKmus0Hpqo6qVQ9FgrKiru9VHKocIpKU
 z66g+wEKUuY=
 =sxZJ
 -----END PGP SIGNATURE-----

Merge tag 'x86-boot-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot updates from Ingo Molnar:

 - Rework PE header generation, primarily to generate a modern, 4k
   aligned kernel image view with narrower W^X permissions.

 - Further refine init-lifetime annotations

 - Misc cleanups & fixes

* tag 'x86-boot-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/boot: efistub: Assign global boot_params variable
  x86/boot: Rename conflicting 'boot_params' pointer to 'boot_params_ptr'
  x86/head/64: Move the __head definition to <asm/init.h>
  x86/head/64: Add missing __head annotation to startup_64_load_idt()
  x86/head/64: Mark 'startup_gdt[]' and 'startup_gdt_descr' as __initdata
  x86/boot: Harmonize the style of array-type parameter for fixup_pointer() calls
  x86/boot: Fix incorrect startup_gdt_descr.size
  x86/boot: Compile boot code with -std=gnu11 too
  x86/boot: Increase section and file alignment to 4k/512
  x86/boot: Split off PE/COFF .data section
  x86/boot: Drop PE/COFF .reloc section
  x86/boot: Construct PE/COFF .text section from assembler
  x86/boot: Derive file size from _edata symbol
  x86/boot: Define setup size in linker script
  x86/boot: Set EFI handover offset directly in header asm
  x86/boot: Grab kernel_info offset from zoffset header directly
  x86/boot: Drop references to startup_64
  x86/boot: Drop redundant code setting the root device
  x86/boot: Omit compression buffer from PE/COFF image memory footprint
  x86/boot: Remove the 'bugger off' message
  ...
2023-10-30 14:11:57 -10:00
Linus Torvalds
3b8b4b4fc4 Replace <asm/export.h> uses with <linux/export.h> and then remove <asm/export.h>.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU9DyYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iZqQ/+MZif5+F024nXciPOeogsMxtT5zStTd9a
 D+rBOCg6iodBIKKcfoqWUkjwXiREwnMxLq9JbArWIqOIBEEfMUsxz7S7PsnFA5dS
 /wXtJ6LHxiudeW6Vs41zLpJVANeAwxFXXyGGuRFk3Hxb7Zwh5jwhU76PKfBiUxrh
 7PzAqfOcVcbvlMXD/oMn9PlhRA4U+nrz+4ao00NCrbWBH/+ECHLb24IoB39jV8Gb
 XMDSsS5vxmCAR+ObRZm209lEf+tIv0T4I57GF85i5fX2aOrwPVvJDtnNMZY8xBDh
 TOyrq+JUTrl6JDKsRfiqvwxIG52bvytK0RHGfzUfObVtVa1AqlXwZaOUYItxlVjx
 f99DkIKhdUV6B3fVvYjsVP+g551kFKbh4EzYnLyp9XlMDIW9WVJv6u4GhR84piju
 3gxkYYBqnD58d/sf3xlAm5NDiqm2vGqw6qWKPkw7HrhX9QGaVkqIKMZTGVk2zM95
 LPVtXfqM+D3nscOmzoD9IsWmi7Z5ep8O0OfM0plOTRLYUlON2F5ewi4l0dNURH1g
 xHO3kr/TpzKeyv2NsG3ugqcO1evItJuCDx/PdzpR88HML/aXSeBh6IlEpMEnXQeq
 wQqB3YpgpTpV0/onIKfRs9qI4posWg0z0Rnfpp+nvVAjpsNrvndddZr65p5lkynZ
 9ccJQ7ipaw0=
 =O8Ce
 -----END PGP SIGNATURE-----

Merge tag 'x86-headers-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 header file cleanup from Ingo Molnar:
 "Replace <asm/export.h> uses with <linux/export.h> and then remove
  <asm/export.h>"

* tag 'x86-headers-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/headers: Remove <asm/export.h>
  x86/headers: Replace #include <asm/export.h> with #include <linux/export.h>
  x86/headers: Remove unnecessary #include <asm/export.h>
2023-10-30 14:04:23 -10:00
Linus Torvalds
bceb7accb7 Performance events changes for v6.7 are:
- Add AMD Unified Memory Controller (UMC) events introduced with Zen 4
  - Simplify & clean up the uncore management code
  - Fall back from RDPMC to RDMSR on certain uncore PMUs
  - Improve per-package and cstate event reading
  - Extend the Intel ref-cycles event to GP counters
  - Fix Intel MTL event constraints
  - Improve the Intel hybrid CPU handling code
  - Micro-optimize the RAPL code
  - Optimize perf_cgroup_switch()
  - Improve large AUX area error handling
 
  - Misc fixes and cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU89YsRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iQqQ/9EF9mG4te5By4qN+B7jADCmE71xG5ViKz
 sp4Thl86SHxhwFuiHn8dMUixrp+qbcemi5yTbQ9TF8cKl4s3Ju2CihU8jaauUp0a
 iS5W0IliMqLD1pxQoXAPLuPVInVYgrNOCbR4l6l7D6ervh5Z6PVEf7SVeAP3L5wo
 QV/V3NKkrYeNQL+FoKhCH8Vhxw0HxUmKJO7UhW6yuCt7BAok9Es18h3OVnn+7es4
 BB7VI/JvdmXf2ioKhTPnDXJjC+vh5vnwiBoTcdQ2W9ADhWUvfL4ozxOXT6z7oC3A
 nwBOdXf8w8Rqnqqd8hduop1QUrusMxlEVgOMCk27qHx97uWgPceZWdoxDXGHBiRK
 fqJAwXERf9wp5/M57NDlPwyf/43Hocdx2CdLkQBpfD78/k/sB5hW0KxnzY0FUI9x
 jBRQyWD05IDJATBaMHz+VbrexS+Itvjp2QvSiSm9zislYD4zA9fQ3lAgFhEpcUbA
 ZA/nN4t+CbiGEAsJEuBPlvSC1ahUwVP/0nz3PFlVWFDqAx0mXgVNKBe083A9yh7I
 dVisVY6KPAVDzyOc1LqzU8WFXNFnIkIIaLrb6fRHJVEM8MDfpLPS/a+7AHdRcDP4
 yq6fjVVjyP7e9lSQLYBUP3/3uiVnWQj92l6V6CrcgDMX5rDOb0VN+BQrmhPR6fWY
 WEim6WZrZj4=
 =OLsv
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull performance event updates from Ingo Molnar:
 - Add AMD Unified Memory Controller (UMC) events introduced with Zen 4
 - Simplify & clean up the uncore management code
 - Fall back from RDPMC to RDMSR on certain uncore PMUs
 - Improve per-package and cstate event reading
 - Extend the Intel ref-cycles event to GP counters
 - Fix Intel MTL event constraints
 - Improve the Intel hybrid CPU handling code
 - Micro-optimize the RAPL code
 - Optimize perf_cgroup_switch()
 - Improve large AUX area error handling
 - Misc fixes and cleanups

* tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  perf/x86/amd/uncore: Pass through error code for initialization failures, instead of -ENODEV
  perf/x86/amd/uncore: Fix uninitialized return value in amd_uncore_init()
  x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations
  perf: Optimize perf_cgroup_switch()
  perf/x86/amd/uncore: Add memory controller support
  perf/x86/amd/uncore: Add group exclusivity
  perf/x86/amd/uncore: Use rdmsr if rdpmc is unavailable
  perf/x86/amd/uncore: Move discovery and registration
  perf/x86/amd/uncore: Refactor uncore management
  perf/core: Allow reading package events from perf_event_read_local
  perf/x86/cstate: Allow reading the package statistics from local CPU
  perf/x86/intel/pt: Fix kernel-doc comments
  perf/x86/rapl: Annotate 'struct rapl_pmus' with __counted_by
  perf/core: Rename perf_proc_update_handler() -> perf_event_max_sample_rate_handler(), for readability
  perf/x86/rapl: Fix "Using plain integer as NULL pointer" Sparse warning
  perf/x86/rapl: Use local64_try_cmpxchg in rapl_event_update()
  perf/x86/rapl: Stop doing cpu_relax() in the local64_cmpxchg() loop in rapl_event_update()
  perf/core: Bail out early if the request AUX area is out of bound
  perf/x86/intel: Extend the ref-cycles event to GP counters
  perf/x86/intel: Fix broken fixed event constraints extension
  ...
2023-10-30 13:44:35 -10:00
Linus Torvalds
cd063c8b9e Misc fixes and cleanups:
- Fix potential MAX_NAME_LEN limit related build failures
  - Fix scripts/faddr2line symbol filtering bug
  - Fix scripts/faddr2line on LLVM=1
  - Fix scripts/faddr2line to accept readelf output with mapping symbols
  - Minor cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU88VYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g2rQ//dvzezrAs+ZEhKLbRLSabbAlCeJ+J9zuP
 c0xBmaLwUh47sSDKfBLLEFN3IMDfgMdKjfb3E32vT/WQ+ASdfEMs6FfwRtaErypG
 XfZFpfC2WE1+Gq0MAgrXYuQgDv1Lygdimoy0aCwMlrgb7ZgWL1xorG0VSEemyKhd
 CoRFURKjeJIKJN1oOvTXKhp/SZyk39KHXeF4qSAjIGkrzsfDtEUSNR6NjBmeGUS4
 zNVWus/CucHK/6MMpHtdWw1/Ygemc1CBzYC3ZSMGimqy4Rqe2RsiGa0Y3XhlMCyn
 ekNFuUm9bxStaTknM3ZXga0xHPdKnTPkihxykLDzo0Nh9eysuFlmFrFJ2xL/B87k
 IxlpXvwxjxTSmGDhGQFVnXma6M2le3YFWGClS8UyhSPG08qg09ClwZ8OtVDi8ITI
 rj0VoFbFLuc8aeHF/tyF2t323JmcMHq0aHi+kMUElszm6+B+fPnD54gHU+REXVxO
 YIRkK9RY52mfU4KFf8xlO/UhFF6nP8pgE8pVnNF4lC034M0t4z+i/TLjOsspjVt3
 yMoZakD7sfUkAaCBq4mVfdWwo5UzTVse0BarbEcKxoME6wLEfN+efE850zGdy7n1
 iRC9AddddEyo4BnSHbWdWu/PDYJKPiH7dAtHBcfnEMJjLQewnRHlsHHbCA55jtrX
 363jNE3x6K4=
 =9U5x
 -----END PGP SIGNATURE-----

Merge tag 'objtool-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Ingo Molnar:
 "Misc fixes and cleanups:

   - Fix potential MAX_NAME_LEN limit related build failures

   - Fix scripts/faddr2line symbol filtering bug

   - Fix scripts/faddr2line on LLVM=1

   - Fix scripts/faddr2line to accept readelf output with mapping
     symbols

   - Minor cleanups"

* tag 'objtool-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  scripts/faddr2line: Skip over mapping symbols in output from readelf
  scripts/faddr2line: Use LLVM addr2line and readelf if LLVM=1
  scripts/faddr2line: Don't filter out non-function symbols from readelf
  objtool: Remove max symbol name length limitation
  objtool: Propagate early errors
  objtool: Use 'the fallthrough' pseudo-keyword
  x86/speculation, objtool: Use absolute relocations for annotations
  x86/unwind/orc: Remove redundant initialization of 'mid' pointer in __orc_find()
2023-10-30 13:20:02 -10:00
Linus Torvalds
63ce50fff9 Scheduler changes for v6.7 are:
- Fair scheduler (SCHED_OTHER) improvements:
 
     - Remove the old and now unused SIS_PROP code & option
     - Scan cluster before LLC in the wake-up path
     - Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
 
  - NUMA scheduling improvements:
 
     - Improve the VMA access-PID code to better skip/scan VMAs
     - Extend tracing to cover VMA-skipping decisions
     - Improve/fix the recently introduced sched_numa_find_nth_cpu() code
     - Generalize numa_map_to_online_node()
 
  - Energy scheduling improvements:
 
     - Remove the EM_MAX_COMPLEXITY limit
     - Add tracepoints to track energy computation
     - Make the behavior of the 'sched_energy_aware' sysctl more consistent
     - Consolidate and clean up access to a CPU's max compute capacity
     - Fix uclamp code corner cases
 
  - RT scheduling improvements:
 
     - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
     - Drive the ->rto_mask with rt_rq->pushable_tasks updates
 
  - Scheduler scalability improvements:
 
     - Rate-limit updates to tg->load_avg
     - On x86 disable IBRS when CPU is offline to improve single-threaded performance
     - Micro-optimize in_task() and in_interrupt()
     - Micro-optimize the PSI code
     - Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
 
  - Core scheduler infrastructure improvements:
 
     - Use saved_state to reduce some spurious freezer wakeups
     - Bring in a handful of fast-headers improvements to scheduler headers
     - Make the scheduler UAPI headers more widely usable by user-space
     - Simplify the control flow of scheduler syscalls by using lock guards
     - Fix sched_setaffinity() vs. CPU hotplug race
 
  - Scheduler debuggability improvements:
     - Disallow writing invalid values to sched_rt_period_us
     - Fix a race in the rq-clock debugging code triggering warnings
     - Fix a warning in the bandwidth distribution code
     - Micro-optimize in_atomic_preempt_off() checks
     - Enforce that the tasklist_lock is held in for_each_thread()
     - Print the TGID in sched_show_task()
     - Remove the /proc/sys/kernel/sched_child_runs_first sysctl
 
  - Misc cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU8/NoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gN+xAAvKGYNZBCBG4jowxccgqAbCx81KOhhsy/
 KUaOmdLPg9WaXuqjZ5sggXQCMT0wUqBYAmqV7ts53VhWcma2I1ap4dCM6Jj+RLrc
 vNwkeNetsikiZtarMoCJs5NahL8ULh3liBaoAkkToPjQ5r43aZ/eKwDovEdIKc+g
 +Vgn7jUY8ssIrAOKT1midSwY1y8kAU2AzWOSFDTgedkJP4PgOu9/lBl9jSJ2sYaX
 N4XqONYPXTwOHUtvmzkYILxLz0k0GgJ7hmt78E8Xy2rC4taGCRwCfCMBYxREuwiP
 huo3O1P/iIe5svm4/EBUvcpvf44eAWTV+CD0dnJPwOc9IvFhpSzqSZZAsyy/JQKt
 Lnzmc/xmyc1PnXCYJfHuXrw2/m+MyUHaegPzh5iLJFrlqa79GavOElj0jNTAMzbZ
 39fybzPtuFP+64faRfu0BBlQZfORPBNc/oWMpPKqgP58YGuveKTWaUF5rl5lM7Ne
 nm07uOmq02JVR8YzPl/FcfhU2dPMawWuMwUjEr2eU+lAunY3PF88vu0FALj7iOBd
 66F8qrtpDHJanOxrdEUwSJ7hgw79qY1iw66Db7cQYjMazFKZONxArQPqFUZ0ngLI
 n9hVa7brg1bAQKrQflqjcIAIbpVu3SjPEl15cKpAJTB/gn5H66TQgw8uQ6HfG+h2
 GtOsn1nlvuk=
 =GDqb
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Fair scheduler (SCHED_OTHER) improvements:
   - Remove the old and now unused SIS_PROP code & option
   - Scan cluster before LLC in the wake-up path
   - Use candidate prev/recent_used CPU if scanning failed for cluster
     wakeup

  NUMA scheduling improvements:
   - Improve the VMA access-PID code to better skip/scan VMAs
   - Extend tracing to cover VMA-skipping decisions
   - Improve/fix the recently introduced sched_numa_find_nth_cpu() code
   - Generalize numa_map_to_online_node()

  Energy scheduling improvements:
   - Remove the EM_MAX_COMPLEXITY limit
   - Add tracepoints to track energy computation
   - Make the behavior of the 'sched_energy_aware' sysctl more
     consistent
   - Consolidate and clean up access to a CPU's max compute capacity
   - Fix uclamp code corner cases

  RT scheduling improvements:
   - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
   - Drive the ->rto_mask with rt_rq->pushable_tasks updates

  Scheduler scalability improvements:
   - Rate-limit updates to tg->load_avg
   - On x86 disable IBRS when CPU is offline to improve single-threaded
     performance
   - Micro-optimize in_task() and in_interrupt()
   - Micro-optimize the PSI code
   - Avoid updating PSI triggers and ->rtpoll_total when there are no
     state changes

  Core scheduler infrastructure improvements:
   - Use saved_state to reduce some spurious freezer wakeups
   - Bring in a handful of fast-headers improvements to scheduler
     headers
   - Make the scheduler UAPI headers more widely usable by user-space
   - Simplify the control flow of scheduler syscalls by using lock
     guards
   - Fix sched_setaffinity() vs. CPU hotplug race

  Scheduler debuggability improvements:
   - Disallow writing invalid values to sched_rt_period_us
   - Fix a race in the rq-clock debugging code triggering warnings
   - Fix a warning in the bandwidth distribution code
   - Micro-optimize in_atomic_preempt_off() checks
   - Enforce that the tasklist_lock is held in for_each_thread()
   - Print the TGID in sched_show_task()
   - Remove the /proc/sys/kernel/sched_child_runs_first sysctl

  ... and misc cleanups & fixes"

* tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
  sched/fair: Remove SIS_PROP
  sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
  sched/fair: Scan cluster before scanning LLC in wake-up path
  sched: Add cpus_share_resources API
  sched/core: Fix RQCF_ACT_SKIP leak
  sched/fair: Remove unused 'curr' argument from pick_next_entity()
  sched/nohz: Update comments about NEWILB_KICK
  sched/fair: Remove duplicate #include
  sched/psi: Update poll => rtpoll in relevant comments
  sched: Make PELT acronym definition searchable
  sched: Fix stop_one_cpu_nowait() vs hotplug
  sched/psi: Bail out early from irq time accounting
  sched/topology: Rename 'DIE' domain to 'PKG'
  sched/psi: Delete the 'update_total' function parameter from update_triggers()
  sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
  sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
  sched/numa: Complete scanning of inactive VMAs when there is no alternative
  sched/numa: Complete scanning of partial VMAs regardless of PID activity
  sched/numa: Move up the access pid reset logic
  sched/numa: Trace decisions related to skipping VMAs
  ...
2023-10-30 13:12:15 -10:00
Linus Torvalds
3cf3fabccb Locking changes in this cycle are:
- Futex improvements:
 
     - Add the 'futex2' syscall ABI, which is an attempt to get away from the
       multiplex syscall and adds a little room for extentions, while lifting
       some limitations.
 
     - Fix futex PI recursive rt_mutex waiter state bug
 
     - Fix inter-process shared futexes on no-MMU systems
 
     - Use folios instead of pages
 
  - Micro-optimizations of locking primitives:
 
     - Improve arch_spin_value_unlocked() on asm-generic ticket spinlock
       architectures, to improve lockref code generation.
 
     - Improve the x86-32 lockref_get_not_zero() main loop by adding
       build-time CMPXCHG8B support detection for the relevant lockref code,
       and by better interfacing the CMPXCHG8B assembly code with the compiler.
 
     - Introduce arch_sync_try_cmpxchg() on x86 to improve sync_try_cmpxchg()
       code generation. Convert some sync_cmpxchg() users to sync_try_cmpxchg().
 
     - Micro-optimize rcuref_put_slowpath()
 
  - Locking debuggability improvements:
 
     - Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well
 
     - Enforce atomicity of sched_submit_work(), which is de-facto atomic but
       was un-enforced previously.
 
     - Extend <linux/cleanup.h>'s no_free_ptr() with __must_check semantics
 
     - Fix ww_mutex self-tests
 
     - Clean up const-propagation in <linux/seqlock.h> and simplify
       the API-instantiation macros a bit.
 
  - RT locking improvements:
 
     - Provide the rt_mutex_*_schedule() primitives/helpers and use them
       in the rtmutex code to avoid recursion vs. rtlock on the PI state.
 
     - Add nested blocking lockdep asserts to rt_mutex_lock(), rtlock_lock()
       and rwbase_read_lock().
 
  - Plus misc fixes & cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU877IRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g9jw/+N7rxQ78dmFCYh4UWnLCYvuKP0/ivHErG
 493JcB8MupuA2tfJHIkDdr4aM2mNq2E61w69/WlZAQWWD6pdOhwgF5Xf5eoEcJm0
 vsAhWBGLxihXdtevPuMAx0dEpg3AMp2wc6i5PkN831KdPUgCNsrKq9Bfnfef7/G8
 MQTSHjmtba6jxleyxfEa4tE2xe5PJX825nRfkX2e1cf+stkYua+uJFxVxUfxFWGE
 4pBy70D9OC7MsJ44WWOA1gwkVtMMiBTmRPNjlP8Gz2GQ0f3ERHRwYk3jDHOPHZI6
 0GNt7pE3IMXQn2UuDtfkvv9IFTd+U5qD+APnWIn2ntWXqzGLFqOlmovMrobVn7El
 olYDCyweWPG71m1Qblsb1VK2QjRPQVJ9NAEg8RlDHIu2ThxHbMysDVGPVOYnPFq4
 S8QFpmldzbNoPU4rDJyT1fAmoUIrusBHkl+Us3yGfC74iM+fHnDEvaSoMZbzEdY1
 x/Nocj9XgKEgfXdYzrCWFmZ9xXqHkO25/wDL6yKqBdQtvaEalXuHTT6mQcYxrUPm
 Xx1BPan2Jg7p4u2oOFcVtKewUtRH9KBx8qytr5S+JK4PJbrBsixMnr84HLd/3X2V
 ykYkO+367T5MTYv4TnJDE5vdurzUqekKSCFPY3skPujPJfdLj1vsPzYf9iMkCLdo
 hU2f/R+Wpdk=
 =36Ff
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Info Molnar:
 "Futex improvements:

   - Add the 'futex2' syscall ABI, which is an attempt to get away from
     the multiplex syscall and adds a little room for extentions, while
     lifting some limitations.

   - Fix futex PI recursive rt_mutex waiter state bug

   - Fix inter-process shared futexes on no-MMU systems

   - Use folios instead of pages

  Micro-optimizations of locking primitives:

   - Improve arch_spin_value_unlocked() on asm-generic ticket spinlock
     architectures, to improve lockref code generation

   - Improve the x86-32 lockref_get_not_zero() main loop by adding
     build-time CMPXCHG8B support detection for the relevant lockref
     code, and by better interfacing the CMPXCHG8B assembly code with
     the compiler

   - Introduce arch_sync_try_cmpxchg() on x86 to improve
     sync_try_cmpxchg() code generation. Convert some sync_cmpxchg()
     users to sync_try_cmpxchg().

   - Micro-optimize rcuref_put_slowpath()

  Locking debuggability improvements:

   - Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well

   - Enforce atomicity of sched_submit_work(), which is de-facto atomic
     but was un-enforced previously.

   - Extend <linux/cleanup.h>'s no_free_ptr() with __must_check
     semantics

   - Fix ww_mutex self-tests

   - Clean up const-propagation in <linux/seqlock.h> and simplify the
     API-instantiation macros a bit

  RT locking improvements:

   - Provide the rt_mutex_*_schedule() primitives/helpers and use them
     in the rtmutex code to avoid recursion vs. rtlock on the PI state.

   - Add nested blocking lockdep asserts to rt_mutex_lock(),
     rtlock_lock() and rwbase_read_lock()

  .. plus misc fixes & cleanups"

* tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  futex: Don't include process MM in futex key on no-MMU
  locking/seqlock: Fix grammar in comment
  alpha: Fix up new futex syscall numbers
  locking/seqlock: Propagate 'const' pointers within read-only methods, remove forced type casts
  locking/lockdep: Fix string sizing bug that triggers a format-truncation compiler-warning
  locking/seqlock: Change __seqprop() to return the function pointer
  locking/seqlock: Simplify SEQCOUNT_LOCKNAME()
  locking/atomics: Use atomic_try_cmpxchg_release() to micro-optimize rcuref_put_slowpath()
  locking/atomic, xen: Use sync_try_cmpxchg() instead of sync_cmpxchg()
  locking/atomic/x86: Introduce arch_sync_try_cmpxchg()
  locking/atomic: Add generic support for sync_try_cmpxchg() and its fallback
  locking/seqlock: Fix typo in comment
  futex/requeue: Remove unnecessary ‘NULL’ initialization from futex_proxy_trylock_atomic()
  locking/local, arch: Rewrite local_add_unless() as a static inline function
  locking/debug: Fix debugfs API return value checks to use IS_ERR()
  locking/ww_mutex/test: Make sure we bail out instead of livelock
  locking/ww_mutex/test: Fix potential workqueue corruption
  locking/ww_mutex/test: Use prng instead of rng to avoid hangs at bootup
  futex: Add sys_futex_requeue()
  futex: Add flags2 argument to futex_requeue()
  ...
2023-10-30 12:38:48 -10:00
Linus Torvalds
f155f3b3ed - Make sure PCI function 4 IDs of AMD family 0x19, models 0x60-0x7f are
actually used in the amd_nb.c enumeration
 
 - Add support for extracting NUMA information from devicetree for
   Hyper-V usages
 
 - Add PCI device IDs for the new AMD MI300 AI accelerators
 
 - Annotate an array in struct uv_rtc_timer_head with the new
   __counted_by attribute
 
 - Rework UV's NMI action parameter handling
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmU78I4ACgkQEsHwGGHe
 VUoDFQ//a8KYzcj2wC+GA+tAScMCP4lT3hsSYoHvFE6Fd4HBngMUBcaEHY0InxWc
 jfXRHFZBS54sLMH2XPVOnozTBDLUEF+4cnApj9iVf//RwjSz/mnJ1P/fqwGQHmfY
 eyOKn38nzkdcf+MW69HAyOeOLxOhOcxEdrzJ4AuPijjTH0GfHWXtuZrqCgaRB6XZ
 yYw7PwO9azSBIYA3c87IPo5uGDa0Ht/BWWHv+XD21DjH0E2n2jaGPgExojBwpI1F
 zJZkx8xJWlQANSD9qiYWoYFdEluGAJ6Fyn989hMRVUqFLbGB9ppEu0BD5TWor08A
 UMX3LYIXM+706EnpOYKc5m5dajbUNOEUl73M6S6J67ugvRUWjCZpoejEx1Aejup7
 aCcYrkbSiuVPFn8Wj9/JdkItu+zyzzsp3fjEdYlYVfefLh3EY4IrFAkQGUk7A0cA
 p6Idau8j4Cx+4lvNEFGMk/u7SAxcInJx7NDy8rJL9CG1BdbjzkwJH+GJFELuI/zM
 dSd+N2r5YUMClHR8Tor17PDh61i+ovg1kPhOyDyf7bNv0HQG0Hbcded4lyiVrsXa
 m5GNYRXEQFZXMeBxo6dJKDAj2rdSPsACC2FBTa9B+bciklcWid4FVTvf/9sSFeQv
 cvGFfGbFdW/uddAHsWqHMiuTfBsQneZ9UbZCKq/kAkHAtfdhRa8=
 =7JVZ
 -----END PGP SIGNATURE-----

Merge tag 'x86_platform_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 platform updates from Borislav Petkov:

 - Make sure PCI function 4 IDs of AMD family 0x19, models 0x60-0x7f are
   actually used in the amd_nb.c enumeration

 - Add support for extracting NUMA information from devicetree for
   Hyper-V usages

 - Add PCI device IDs for the new AMD MI300 AI accelerators

 - Annotate an array in struct uv_rtc_timer_head with the new
   __counted_by attribute

 - Rework UV's NMI action parameter handling

* tag 'x86_platform_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/amd_nb: Use Family 19h Models 60h-7Fh Function 4 IDs
  x86/numa: Add Devicetree support
  x86/of: Move the x86_flattree_get_config() call out of x86_dtb_init()
  x86/amd_nb: Add AMD Family MI300 PCI IDs
  x86/platform/uv: Annotate struct uv_rtc_timer_head with __counted_by
  x86/platform/uv: Rework NMI "action" modparam handling
2023-10-30 12:32:48 -10:00
Linus Torvalds
ca2e9c3bee - Make sure the "svm" feature flag is cleared from /proc/cpuinfo when
virtualization support is disabled in the BIOS on AMD and Hygon
   platforms
 
 - A minor cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmU77KoACgkQEsHwGGHe
 VUrophAAtfsB+WhRydin0V6kjQeH+RbiWyx/jOw6eNqvzOzaOPxVXn0cAHRSgAO4
 +S8tKIqaWpXNNNKpOIKBVaDkh9qr50/p36/jfVkXi8GOLYrK633F0BMjcG4+/vYQ
 A9b5iNiJhZ7xWE6+qRrqdg+o+a6UyPUGz34HNp3KwJVTdaHU2OnXXwuWeiUkgRrJ
 uQSfLc4+UIeefIzNy8Tqg083iaENBYMya7U90rzewD64NF0bsA15AEPut/6tnUVq
 ej3UU3cqO7nKXyhuZX+zpt856MZFa1rNYVXUAfoAO4xhqdN0Q5LFWO506sqajNx/
 hqbT+hKDoC03zuLmbZO21s/uWQdtVFo63FU0h9QBRp1m6Ug5P3rQQCK8ydJc5xwr
 Yd7je6UPK9jIKBo9VP1qmsyzGwADNevNf1qGExHI2T6Wml7HgDmPysAHnGiKqRGI
 1o9+Yqa+VBt8Wml9M8Ny+dLyr5F/2uq8sMrQedQlXdFMSzVm2JYecukJ5BvUWE/r
 Qyll8mTpIdgGXjBt56lMrgH7ibMC5ct/4MvTHOHuA997g/PwuwtWj7QyKXpUq2Rf
 o/c3zKKWIFxevjzwU86haCBaz+5xAQlB6dJw61ExxsmUuT/kZzkN15w6aqGZtpns
 PsARwnvuwZJ7vfqFLIa0ZkPN4OgnkRX7HlNqrVyKpONDTocZd9E=
 =i9On
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpuid updates from Borislav Petkov:

 - Make sure the "svm" feature flag is cleared from /proc/cpuinfo when
   virtualization support is disabled in the BIOS on AMD and Hygon
   platforms

 - A minor cleanup

* tag 'x86_cpu_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/amd: Remove redundant 'break' statement
  x86/cpu: Clear SVM feature if disabled by BIOS
2023-10-30 12:10:24 -10:00
Linus Torvalds
f84a52eef5 - A bunch of improvements, cleanups and fixlets to the SRSO mitigation
machinery and other, general cleanups to the hw mitigations code,
   by Josh Poimboeuf
 
 - Improve the return thunk detection by objtool as it is absolutely
   important that the default return thunk is not used after returns
   have been patched. Future work to detect and report this better is
   pending
 
 - Other misc cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmU7mFEACgkQEsHwGGHe
 VUpbBxAAtS4X5LCntPWUsDEBU80SBYAunEp0Wd0ttYEj+UrEk4tvnWVGFiIEr47A
 PrRKK9JCJtC6ko0+dwPtMi66L/T7mCpoNPI1kzfRG1IHJBfvCTGJhzZsesogvkA2
 1X9Je+RCVW4xVybIryxhjMGdB6jUoGEU1a4DmQXq481qiLB3ilvA1bIAaNo9BBYP
 rxKPrPcdOxn2NjxuOWg+FXjSc8LuAVSu3HqsgCW2AHJ6XIKEYWEq9FkXhwj9OJOr
 ax1F4qD1IY++jYZO9DJiltjeJyj0wC+yp8kDDURoLbcTk85WHlpD5vK0g64mELOA
 y0375thHep+vsrtQ/qZAmi/eVTaTekgbi7McahjoZebK7FbKOYRk6GZ+5+m29AVr
 DfQSJ7xQQqbCbpimeFmZ+gQf7mFexyDWvjUPyBl+OelOY1umdPM9IZVTnqib5LPr
 D2M+uqWfJhSwACi2o05LRv0gyhkAz0bGHrwZPmCVuxE5kBbhOpj4aT87fetUp/MW
 8lEFa3PHx/gkh2VOJ7ZgKzpeD75Vjo8TRAXOe4O2jn/L54gNEJ+1mukvrjW3+lp1
 ShmcZokl3ldPq6F5ioE+u45hVAfHkaruWM+5Rj3hsA/fdFN3isTVLhIRIsypPTKc
 p1ITT8Yhek8vkm9PcRBE5xWRmEZ2XE5ooDld930nJxra8QNVVQw=
 =E7c4
 -----END PGP SIGNATURE-----

Merge tag 'x86_bugs_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 hw mitigation updates from Borislav Petkov:

 - A bunch of improvements, cleanups and fixlets to the SRSO mitigation
   machinery and other, general cleanups to the hw mitigations code, by
   Josh Poimboeuf

 - Improve the return thunk detection by objtool as it is absolutely
   important that the default return thunk is not used after returns
   have been patched. Future work to detect and report this better is
   pending

 - Other misc cleanups and fixes

* tag 'x86_bugs_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  x86/retpoline: Document some thunk handling aspects
  x86/retpoline: Make sure there are no unconverted return thunks due to KCSAN
  x86/callthunks: Delete unused "struct thunk_desc"
  x86/vdso: Run objtool on vdso32-setup.o
  objtool: Fix return thunk patching in retpolines
  x86/srso: Remove unnecessary semicolon
  x86/pti: Fix kernel warnings for pti= and nopti cmdline options
  x86/calldepth: Rename __x86_return_skl() to call_depth_return_thunk()
  x86/nospec: Refactor UNTRAIN_RET[_*]
  x86/rethunk: Use SYM_CODE_START[_LOCAL]_NOALIGN macros
  x86/srso: Disentangle rethunk-dependent options
  x86/srso: Move retbleed IBPB check into existing 'has_microcode' code block
  x86/bugs: Remove default case for fully switched enums
  x86/srso: Remove 'pred_cmd' label
  x86/srso: Unexport untraining functions
  x86/srso: Improve i-cache locality for alias mitigation
  x86/srso: Fix unret validation dependencies
  x86/srso: Fix vulnerability reporting for missing microcode
  x86/srso: Print mitigation for retbleed IBPB case
  x86/srso: Print actual mitigation if requested mitigation isn't possible
  ...
2023-10-30 11:48:49 -10:00
Linus Torvalds
01ae815c50 - Specify what error addresses reported on AMD are actually usable
memory error addresses for further decoding
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmU7kb8ACgkQEsHwGGHe
 VUq3/A//VTrsON+RRS+M7PVewXMiTbwjVytum/9gWXtuUBEFdWCQjCe4TSaI6+mX
 v8inAomBE7s3SoQYkosF1VO2l0r68aJLOm6hczzbjz+ZjGvramDiv5qCs0iMM8m4
 Nvwyjeo1+2G6JeaX2rR7fqnZkA4NcYE1/s05pksNEaXMsAhpSOWenRgUK1EyQXLE
 y1u63G5GLMT4cpjEmEcbp9Lb02WwQzB9inZ1f4MFoujkI5VJ/9b68D+DpGwHd4Ag
 HNrg6LR/YpVwioVnsa+xEiQSxxwuRCHvS8kbc27d3qhfT4cRhmtAIsHYkyeO75TJ
 jkXU1Gme/k2RDEYHOz7heSVWgmG3y9/swc3UZJFE0QAnjdajY7mUsM9+o5uCm4Y6
 rALf8z7t0+oMpG1YML5Y+0wvgcPk9pih6Mm9tbBlFCXPi2OQ5bieNkHe7RQXHcQx
 xwFoQI0ByWvW7omu+jqA8iN4YSLaQhST2wzghPF1Wu7KAewu5lpU+9kmgmL7utme
 aHIQFdhRFusYEABqlr8XQSew5FMtIfOeWWCdgWHghUQp6LCsA0QeUxcQR9VdY9Th
 IgY1j4G2lQeLpDlnWE9VPMCWk4cuTABRyVWEu1B4wScU3xRWD8jOh+LcF76RdoYz
 k7GW0d68DGDCRgU7q86LNM/bNH0zyIIteTj64uHBzQm0ygJcdZU=
 =NcVX
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 RAS updates from Borislav Petkov:

 - Specify what error addresses reported on AMD are actually usable
   memory error addresses for further decoding

* tag 'ras_core_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Cleanup mce_usable_address()
  x86/mce: Define amd_mce_usable_address()
  x86/MCE/AMD: Split amd_mce_is_memory_error()
2023-10-30 11:47:03 -10:00
Linus Torvalds
df9c65b5fc vfs-6.7.iov_iter
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppQwAKCRCRxhvAZXjc
 om2kAP4u+eLsrhJHfUPUttGUEkSkZE+5/s/f1A/1GcV51usLSgEAu8urxAnP49GW
 INaDABXaFfKx8/KI/H2YFZPKGwlNEwY=
 =PDDE
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull iov_iter updates from Christian Brauner:
 "This contain's David's iov_iter cleanup work to convert the iov_iter
  iteration macros to inline functions:

   - Remove last_offset from iov_iter as it was only used by ITER_PIPE

   - Add a __user tag on copy_mc_to_user()'s dst argument on x86 to
     match that on powerpc and get rid of a sparse warning

   - Convert iter->user_backed to user_backed_iter() in the sound PCM
     driver

   - Convert iter->user_backed to user_backed_iter() in a couple of
     infiniband drivers

   - Renumber the type enum so that the ITER_* constants match the order
     in iterate_and_advance*()

   - Since the preceding patch puts UBUF and IOVEC at 0 and 1, change
     user_backed_iter() to just use the type value and get rid of the
     extra flag

   - Convert the iov_iter iteration macros to always-inline functions to
     make the code easier to follow. It uses function pointers, but they
     get optimised away

   - Move the check for ->copy_mc to _copy_from_iter() and
     copy_page_from_iter_atomic() rather than in memcpy_from_iter_mc()
     where it gets repeated for every segment. Instead, we check once
     and invoke a side function that can use iterate_bvec() rather than
     iterate_and_advance() and supply a different step function

   - Move the copy-and-csum code to net/ where it can be in proximity
     with the code that uses it

   - Fold memcpy_and_csum() in to its two users

   - Move csum_and_copy_from_iter_full() out of line and merge in
     csum_and_copy_from_iter() since the former is the only caller of
     the latter

   - Move hash_and_copy_to_iter() to net/ where it can be with its only
     caller"

* tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  iov_iter, net: Move hash_and_copy_to_iter() to net/
  iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
  iov_iter, net: Fold in csum_and_memcpy()
  iov_iter, net: Move csum_and_copy_to/from_iter() to net/
  iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
  iov_iter: Convert iterate*() to inline funcs
  iov_iter: Derive user-backedness from the iterator type
  iov_iter: Renumber ITER_* constants
  infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC
  sound: Fix snd_pcm_readv()/writev() to use iov access functions
  iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
  iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
2023-10-30 09:24:21 -10:00
Thomas Gleixner
128b0c9781 x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
David and a few others reported that on certain newer systems some legacy
interrupts fail to work correctly.

Debugging revealed that the BIOS of these systems leaves the legacy PIC in
uninitialized state which makes the PIC detection fail and the kernel
switches to a dummy implementation.

Unfortunately this fallback causes quite some code to fail as it depends on
checks for the number of legacy PIC interrupts or the availability of the
real PIC.

In theory there is no reason to use the PIC on any modern system when
IO/APIC is available, but the dependencies on the related checks cannot be
resolved trivially and on short notice. This needs lots of analysis and
rework.

The PIC detection has been added to avoid quirky checks and force selection
of the dummy implementation all over the place, especially in VM guest
scenarios. So it's not an option to revert the relevant commit as that
would break a lot of other scenarios.

One solution would be to try to initialize the PIC on detection fail and
retry the detection, but that puts the burden on everything which does not
have a PIC.

Fortunately the ACPI/MADT table header has a flag field, which advertises
in bit 0 that the system is PCAT compatible, which means it has a legacy
8259 PIC.

Evaluate that bit and if set avoid the detection routine and keep the real
PIC installed, which then gets initialized (for nothing) and makes the rest
of the code with all the dependencies work again.

Fixes: e179f69141 ("x86, irq, pic: Probe for legacy PIC and set legacy_pic appropriately")
Reported-by: David Lazar <dlazar@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: David Lazar <dlazar@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Cc: stable@vger.kernel.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218003
Link: https://lore.kernel.org/r/875y2u5s8g.ffs@tglx
2023-10-27 20:36:49 +02:00
Tony Luck
b99d70c0d1 x86/cpu: Add model number for Intel Arrow Lake mobile processor
For "reasons" Intel has code-named this CPU with a "_H" suffix.

[ dhansen: As usual, apply this and send it upstream quickly to
	   make it easier for anyone who is doing work that
	   consumes this. ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231025202513.12358-1-tony.luck%40intel.com
2023-10-27 10:19:26 -07:00
Kan Liang
3374491619 perf/x86/intel: Support branch counters logging
The branch counters logging (A.K.A LBR event logging) introduces a
per-counter indication of precise event occurrences in LBRs. It can
provide a means to attribute exposed retirement latency to combinations
of events across a block of instructions. It also provides a means of
attributing Timed LBR latencies to events.

The feature is first introduced on SRF/GRR. It is an enhancement of the
ARCH LBR. It adds new fields in the LBR_INFO MSRs to log the occurrences
of events on the GP counters. The information is displayed by the order
of counters.

The design proposed in this patch requires that the events which are
logged must be in a group with the event that has LBR. If there are
more than one LBR group, the counters logging information only from the
current group (overflowed) are stored for the perf tool, otherwise the
perf tool cannot know which and when other groups are scheduled
especially when multiplexing is triggered. The user can ensure it uses
the maximum number of counters that support LBR info (4 by now) by
making the group large enough.

The HW only logs events by the order of counters. The order may be
different from the order of enabling which the perf tool can understand.
When parsing the information of each branch entry, convert the counter
order to the enabled order, and store the enabled order in the extension
space.

Unconditionally reset LBRs for an LBR event group when it's deleted. The
logged counter information is only valid for the current LBR group. If
another LBR group is scheduled later, the information from the stale
LBRs would be otherwise wrongly interpreted.

Add a sanity check in intel_pmu_hw_config(). Disable the feature if other
counter filters (inv, cmask, edge, in_tx) are set or LBR call stack mode
is enabled. (For the LBR call stack mode, we cannot simply flush the
LBR, since it will break the call stack. Also, there is no obvious usage
with the call stack mode for now.)

Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't require any branch
stack setup.

Expose the maximum number of supported counters and the width of the
counters into the sysfs. The perf tool can use the information to parse
the logged counters in each branch.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-5-kan.liang@linux.intel.com
2023-10-27 15:05:11 +02:00
Ashok Raj
cf5ab01c87 x86/microcode/intel: Add a minimum required revision for late loading
In general users, don't have the necessary information to determine
whether late loading of a new microcode version is safe and does not
modify anything which the currently running kernel uses already, e.g.
removal of CPUID bits or behavioural changes of MSRs.

To address this issue, Intel has added a "minimum required version"
field to a previously reserved field in the microcode header.  Microcode
updates should only be applied if the current microcode version is equal
to, or greater than this minimum required version.

Thomas made some suggestions on how meta-data in the microcode file could
provide Linux with information to decide if the new microcode is suitable
candidate for late loading. But even the "simpler" option requires a lot of
metadata and corresponding kernel code to parse it, so the final suggestion
was to add the 'minimum required version' field in the header.

When microcode changes visible features, microcode will set the minimum
required version to its own revision which prevents late loading.

Old microcode blobs have the minimum revision field always set to 0, which
indicates that there is no information and the kernel considers it
unsafe.

This is a pure OS software mechanism. The hardware/firmware ignores this
header field.

For early loading there is no restriction because OS visible features
are enumerated after the early load and therefore a change has no
effect.

The check is always enabled, but by default not enforced. It can be
enforced via Kconfig or kernel command line.

If enforced, the kernel refuses to late load microcode with a minimum
required version field which is zero or when the currently loaded
microcode revision is smaller than the minimum required revision.

If not enforced the load happens independent of the revision check to
stay compatible with the existing behaviour, but it influences the
decision whether the kernel is tainted or not. If the check signals that
the late load is safe, then the kernel is not tainted.

Early loading is not affected by this.

[ tglx: Massaged changelog and fixed up the implementation ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.776467264@linutronix.de
2023-10-24 15:05:55 +02:00
Thomas Gleixner
8f849ff63b x86/microcode: Handle "offline" CPUs correctly
Offline CPUs need to be parked in a safe loop when microcode update is
in progress on the primary CPU. Currently, offline CPUs are parked in
mwait_play_dead(), and for Intel CPUs, its not a safe instruction,
because the MWAIT instruction can be patched in the new microcode update
that can cause instability.

  - Add a new microcode state 'UCODE_OFFLINE' to report status on per-CPU
  basis.
  - Force NMI on the offline CPUs.

Wake up offline CPUs while the update is in progress and then return
them back to mwait_play_dead() after microcode update is complete.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.660850472@linutronix.de
2023-10-24 15:05:55 +02:00
Thomas Gleixner
9cab5fb776 x86/apic: Provide apic_force_nmi_on_cpu()
When SMT siblings are soft-offlined and parked in one of the play_dead()
variants they still react on NMI, which is problematic on affected Intel
CPUs. The default play_dead() variant uses MWAIT on modern CPUs, which is
not guaranteed to be safe when updated concurrently.

Right now late loading is prevented when not all SMT siblings are online,
but as they still react on NMI, it is possible to bring them out of their
park position into a trivial rendezvous handler.

Provide a function which allows to do that. I does sanity checks whether
the target is in the cpus_booted_once_mask and whether the APIC driver
supports it.

Mark X2APIC and XAPIC as capable, but exclude 32bit and the UV and NUMACHIP
variants as that needs feedback from the relevant experts.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.603100036@linutronix.de
2023-10-24 15:05:55 +02:00
Thomas Gleixner
7eb314a228 x86/microcode: Rendezvous and load in NMI
stop_machine() does not prevent the spin-waiting sibling from handling
an NMI, which is obviously violating the whole concept of rendezvous.

Implement a static branch right in the beginning of the NMI handler
which is nopped out except when enabled by the late loading mechanism.

The late loader enables the static branch before stop_machine() is
invoked. Each CPU has an nmi_enable in its control structure which
indicates whether the CPU should go into the update routine.

This is required to bridge the gap between enabling the branch and
actually being at the point where it is required to enter the loader
wait loop.

Each CPU which arrives in the stopper thread function sets that flag and
issues a self NMI right after that. If the NMI function sees the flag
clear, it returns. If it's set it clears the flag and enters the
rendezvous.

This is safe against a real NMI which hits in between setting the flag
and sending the NMI to itself. The real NMI will be swallowed by the
microcode update and the self NMI will then let stuff continue.
Otherwise this would end up with a spurious NMI.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.489900814@linutronix.de
2023-10-24 15:05:55 +02:00
Thomas Gleixner
b7fcd995b2 x86/microcode/intel: Rework intel_find_matching_signature()
Take a cpu_signature argument and work from there. Move the match()
helper next to the callsite as there is no point for having it in
a header.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115902.797820205@linutronix.de
2023-10-24 15:05:54 +02:00
Thomas Gleixner
164aa1ca53 x86/microcode/intel: Rework intel_cpu_collect_info()
Nothing needs struct ucode_cpu_info. Make it take struct cpu_signature,
let it return a boolean and simplify the implementation. Rename it now
that the silly name clash with collect_cpu_info() is gone.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.851573238@linutronix.de
2023-10-24 15:05:53 +02:00
Ingo Molnar
4e5b65a22b Linux 6.6-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmU1ngkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGrsIH/0k/+gdBBYFFdEym
 foRhKir9WV3ZX4oIozJjA1f7T+qVYclKs6kaYm3gNepRBb6AoG8pdgv4MMAqhYsf
 QMe2XHi0MrO/qKBgfNfivxEa9jq+0QK5uvTbqCRqCAB8LfwVyDqapCmg3EuiZcPW
 UbMITmnwLIfXgPxvp9rabmCsTqO6FLbf0GDOVIkNSAIDBXMpcO1iffjrWUbhRa7n
 oIoiJmWJLcXLxPWDsRKbpJwzw2cIG08YhfQYAiQnC3YaeRm1FKLDIICRBsmfYzja
 rWv9r4dn4TDfV4/AnjggQnsZvz2yPCxNaFSQIT88nIeiLvyuUTJ9j8aidsSfMZQf
 xZAbzbA=
 =NoQv
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-rc7' into sched/core, to pick up fixes

Pick up recent sched/urgent fixes merged upstream.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-23 11:32:25 +02:00
Dave Airlie
7cd62eab9b Linux 6.6-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmU1ngkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGrsIH/0k/+gdBBYFFdEym
 foRhKir9WV3ZX4oIozJjA1f7T+qVYclKs6kaYm3gNepRBb6AoG8pdgv4MMAqhYsf
 QMe2XHi0MrO/qKBgfNfivxEa9jq+0QK5uvTbqCRqCAB8LfwVyDqapCmg3EuiZcPW
 UbMITmnwLIfXgPxvp9rabmCsTqO6FLbf0GDOVIkNSAIDBXMpcO1iffjrWUbhRa7n
 oIoiJmWJLcXLxPWDsRKbpJwzw2cIG08YhfQYAiQnC3YaeRm1FKLDIICRBsmfYzja
 rWv9r4dn4TDfV4/AnjggQnsZvz2yPCxNaFSQIT88nIeiLvyuUTJ9j8aidsSfMZQf
 xZAbzbA=
 =NoQv
 -----END PGP SIGNATURE-----

BackMerge tag 'v6.6-rc7' into drm-next

This is needed to add the msm pr which is based on a higher base.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2023-10-23 18:20:06 +10:00
Josh Poimboeuf
99ee56c765 x86/calldepth: Rename __x86_return_skl() to call_depth_return_thunk()
For consistency with the other return thunks, rename __x86_return_skl()
to call_depth_return_thunk().

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/ae44e9f9976934e3b5b47a458d523ccb15867561.1693889988.git.jpoimboe@kernel.org
2023-10-20 12:45:48 +02:00
Josh Poimboeuf
e8efc0800b x86/nospec: Refactor UNTRAIN_RET[_*]
Factor out the UNTRAIN_RET[_*] common bits into a helper macro.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/f06d45489778bd49623297af2a983eea09067a74.1693889988.git.jpoimboe@kernel.org
2023-10-20 12:41:57 +02:00
Josh Poimboeuf
34a3cae747 x86/srso: Disentangle rethunk-dependent options
CONFIG_RETHUNK, CONFIG_CPU_UNRET_ENTRY and CONFIG_CPU_SRSO are all
tangled up.  De-spaghettify the code a bit.

Some of the rethunk-related code has been shuffled around within the
'.text..__x86.return_thunk' section, but otherwise there are no
functional changes.  srso_alias_untrain_ret() and srso_alias_safe_ret()
((which are very address-sensitive) haven't moved.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/2845084ed303d8384905db3b87b77693945302b4.1693889988.git.jpoimboe@kernel.org
2023-10-20 12:30:50 +02:00
Josh Poimboeuf
eb54be26b0 x86/srso: Unexport untraining functions
These functions aren't called outside of retpoline.S.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/1ae080f95ce7266c82cba6d2adde82349b832654.1693889988.git.jpoimboe@kernel.org
2023-10-20 12:21:59 +02:00
Josh Poimboeuf
eeb9f34df0 x86/srso: Fix unret validation dependencies
CONFIG_CPU_SRSO isn't dependent on CONFIG_CPU_UNRET_ENTRY (AMD
Retbleed), so the two features are independently configurable.  Fix
several issues for the (presumably rare) case where CONFIG_CPU_SRSO is
enabled but CONFIG_CPU_UNRET_ENTRY isn't.

Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/299fb7740174d0f2335e91c58af0e9c242b4bac1.1693889988.git.jpoimboe@kernel.org
2023-10-20 11:46:59 +02:00
Mike Rapoport (IBM)
a1e2b8b368 x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size
Qi Zheng reported crashes in a production environment and provided a
simplified example as a reproducer:

 |  For example, if we use Qemu to start a two NUMA node kernel,
 |  one of the nodes has 2M memory (less than NODE_MIN_SIZE),
 |  and the other node has 2G, then we will encounter the
 |  following panic:
 |
 |    BUG: kernel NULL pointer dereference, address: 0000000000000000
 |    <...>
 |    RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
 |    <...>
 |    Call Trace:
 |      <TASK>
 |      deactivate_slab()
 |      bootstrap()
 |      kmem_cache_init()
 |      start_kernel()
 |      secondary_startup_64_no_verify()

The crashes happen because of inconsistency between the nodemask that
has nodes with less than 4MB as memoryless, and the actual memory fed
into the core mm.

The commit:

  9391a3f9c7 ("[PATCH] x86_64: Clear more state when ignoring empty node in SRAT parsing")

... that introduced minimal size of a NUMA node does not explain why
a node size cannot be less than 4MB and what boot failures this
restriction might fix.

Fixes have been submitted to the core MM code to tighten up the
memory topologies it accepts and to not crash on weird input:

  mm: page_alloc: skip memoryless nodes entirely
  mm: memory_hotplug: drop memoryless node from fallback lists

Andrew has accepted them into the -mm tree, but there are no
stable SHA1's yet.

This patch drops the limitation for minimal node size on x86:

  - which works around the crash without the fixes to the core MM.
  - makes x86 topologies less weird,
  - removes an arbitrary and undocumented limitation on NUMA topologies.

[ mingo: Improved changelog clarity. ]

Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/ZS+2qqjEO5/867br@gmail.com
2023-10-20 10:40:22 +02:00
Kuppuswamy Sathyanarayanan
f4738f56d1 virt: tdx-guest: Add Quote generation support using TSM_REPORTS
In TDX guest, the attestation process is used to verify the TDX guest
trustworthiness to other entities before provisioning secrets to the
guest. The first step in the attestation process is TDREPORT
generation, which involves getting the guest measurement data in the
format of TDREPORT, which is further used to validate the authenticity
of the TDX guest. TDREPORT by design is integrity-protected and can
only be verified on the local machine.

To support remote verification of the TDREPORT in a SGX-based
attestation, the TDREPORT needs to be sent to the SGX Quoting Enclave
(QE) to convert it to a remotely verifiable Quote. SGX QE by design can
only run outside of the TDX guest (i.e. in a host process or in a
normal VM) and guest can use communication channels like vsock or
TCP/IP to send the TDREPORT to the QE. But for security concerns, the
TDX guest may not support these communication channels. To handle such
cases, TDX defines a GetQuote hypercall which can be used by the guest
to request the host VMM to communicate with the SGX QE. More details
about GetQuote hypercall can be found in TDX Guest-Host Communication
Interface (GHCI) for Intel TDX 1.0, section titled
"TDG.VP.VMCALL<GetQuote>".

Trusted Security Module (TSM) [1] exposes a common ABI for Confidential
Computing Guest platforms to get the measurement data via ConfigFS.
Extend the TSM framework and add support to allow an attestation agent
to get the TDX Quote data (included usage example below).

  report=/sys/kernel/config/tsm/report/report0
  mkdir $report
  dd if=/dev/urandom bs=64 count=1 > $report/inblob
  hexdump -C $report/outblob
  rmdir $report

GetQuote TDVMCALL requires TD guest pass a 4K aligned shared buffer
with TDREPORT data as input, which is further used by the VMM to copy
the TD Quote result after successful Quote generation. To create the
shared buffer, allocate a large enough memory and mark it shared using
set_memory_decrypted() in tdx_guest_init(). This buffer will be re-used
for GetQuote requests in the TDX TSM handler.

Although this method reserves a fixed chunk of memory for GetQuote
requests, such one time allocation can help avoid memory fragmentation
related allocation failures later in the uptime of the guest.

Since the Quote generation process is not time-critical or frequently
used, the current version uses a polling model for Quote requests and
it also does not support parallel GetQuote requests.

Link: https://lore.kernel.org/lkml/169342399185.3934343.3035845348326944519.stgit@dwillia2-xfh.jf.intel.com/ [1]
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Erdem Aktas <erdemaktas@google.com>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Tested-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-10-19 18:12:00 -07:00
Maciej S. Szmigiero
2770d47220 KVM: x86: Ignore MSR_AMD64_TW_CFG access
Hyper-V enabled Windows Server 2022 KVM VM cannot be started on Zen1 Ryzen
since it crashes at boot with SYSTEM_THREAD_EXCEPTION_NOT_HANDLED +
STATUS_PRIVILEGED_INSTRUCTION (in other words, because of an unexpected #GP
in the guest kernel).

This is because Windows tries to set bit 8 in MSR_AMD64_TW_CFG and can't
handle receiving a #GP when doing so.

Give this MSR the same treatment that commit 2e32b71906
("x86, kvm: Add MSR_AMD64_BU_CFG2 to the list of ignored MSRs") gave
MSR_AMD64_BU_CFG2 under justification that this MSR is baremetal-relevant
only.
Although apparently it was then needed for Linux guests, not Windows as in
this case.

With this change, the aforementioned guest setup is able to finish booting
successfully.

This issue can be reproduced either on a Summit Ridge Ryzen (with
just "-cpu host") or on a Naples EPYC (with "-cpu host,stepping=1" since
EPYC is ordinarily stepping 2).

Alternatively, userspace could solve the problem by using MSR filters, but
forcing every userspace to define a filter isn't very friendly and doesn't
add much, if any, value.  The only potential hiccup is if one of these
"baremetal-only" MSRs ever requires actual emulation and/or has F/M/S
specific behavior.  But if that happens, then KVM can still punt *that*
handling to userspace since userspace MSR filters "win" over KVM's default
handling.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/1ce85d9c7c9e9632393816cf19c902e0a3f411f1.1697731406.git.maciej.szmigiero@oracle.com
[sean: call out MSR filtering alternative]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-19 10:55:14 -07:00
Liang Chen
122ae01c51 KVM: x86: remove the unused assigned_dev_head from kvm_arch
Legacy device assignment was dropped years ago. This field is not used
anymore.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Link: https://lore.kernel.org/r/20231019043336.8998-1-liangchen.linux@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-19 08:42:22 -07:00
Matthew Wilcox (Oracle)
247dbcdbf7 bitops: add xor_unlock_is_negative_byte()
Replace clear_bit_and_unlock_is_negative_byte() with
xor_unlock_is_negative_byte().  We have a few places that like to lock a
folio, set a flag and unlock it again.  Allow for the possibility of
combining the latter two operations for efficiency.  We are guaranteed
that the caller holds the lock, so it is safe to unlock it with the xor. 
The caller must guarantee that nobody else will set the flag without
holding the lock; it is not safe to do this with the PG_dirty flag, for
example.

Link: https://lkml.kernel.org/r/20231004165317.1061855-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18 14:34:16 -07:00
Jim Mattson
329369caec x86: KVM: Add feature flag for CPUID.80000021H:EAX[bit 1]
Define an X86_FEATURE_* flag for CPUID.80000021H:EAX.[bit 1], and
advertise the feature to userspace via KVM_GET_SUPPORTED_CPUID.

Per AMD's "Processor Programming Reference (PPR) for AMD Family 19h
Model 61h, Revision B1 Processors (56713-B1-PUB)," this CPUID bit
indicates that a WRMSR to MSR_FS_BASE, MSR_GS_BASE, or
MSR_KERNEL_GS_BASE is non-serializing. This is a change in previously
architected behavior.

Effectively, this CPUID bit is a "defeature" bit, or a reverse
polarity feature bit. When this CPUID bit is clear, the feature
(serialization on WRMSR to any of these three MSRs) is available. When
this CPUID bit is set, the feature is not available.

KVM_GET_SUPPORTED_CPUID must pass this bit through from the underlying
hardware, if it is set. Leaving the bit clear claims that WRMSR to
these three MSRs will be serializing in a guest running under
KVM. That isn't true. Though KVM could emulate the feature by
intercepting writes to the specified MSRs, it does not do so
today. The guest is allowed direct read/write access to these MSRs
without interception, so the innate hardware behavior is preserved
under KVM.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231005031237.1652871-1-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-18 13:50:28 -07:00
Thomas Gleixner
0b62f6cb07 x86/microcode/32: Move early loading after paging enable
32-bit loads microcode before paging is enabled. The commit which
introduced that has zero justification in the changelog. The cover
letter has slightly more content, but it does not give any technical
justification either:

  "The problem in current microcode loading method is that we load a
   microcode way, way too late; ideally we should load it before turning
   paging on.  This may only be practical on 32 bits since we can't get
   to 64-bit mode without paging on, but we should still do it as early
   as at all possible."

Handwaving word salad with zero technical content.

Someone claimed in an offlist conversation that this is required for
curing the ATOM erratum AAE44/AAF40/AAG38/AAH41. That erratum requires
an microcode update in order to make the usage of PSE safe. But during
early boot, PSE is completely irrelevant and it is evaluated way later.

Neither is it relevant for the AP on single core HT enabled CPUs as the
microcode loading on the AP is not doing anything.

On dual core CPUs there is a theoretical problem if a split of an
executable large page between enabling paging including PSE and loading
the microcode happens. But that's only theoretical, it's practically
irrelevant because the affected dual core CPUs are 64bit enabled and
therefore have paging and PSE enabled before loading the microcode on
the second core. So why would it work on 64-bit but not on 32-bit?

The erratum:

  "AAG38 Code Fetch May Occur to Incorrect Address After a Large Page is
   Split Into 4-Kbyte Pages

   Problem: If software clears the PS (page size) bit in a present PDE
   (page directory entry), that will cause linear addresses mapped through
   this PDE to use 4-KByte pages instead of using a large page after old
   TLB entries are invalidated. Due to this erratum, if a code fetch uses
   this PDE before the TLB entry for the large page is invalidated then it
   may fetch from a different physical address than specified by either the
   old large page translation or the new 4-KByte page translation. This
   erratum may also cause speculative code fetches from incorrect addresses."

The practical relevance for this is exactly zero because there is no
splitting of large text pages during early boot-time, i.e. between paging
enable and microcode loading, and neither during CPU hotplug.

IOW, this load microcode before paging enable is yet another voodoo
programming solution in search of a problem. What's worse is that it causes
at least two serious problems:

 1) When stackprotector is enabled, the microcode loader code has the
    stackprotector mechanics enabled. The read from the per CPU variable
    __stack_chk_guard is always accessing the virtual address either
    directly on UP or via %fs on SMP. In physical address mode this
    results in an access to memory above 3GB. So this works by chance as
    the hardware returns the same value when there is no RAM at this
    physical address. When there is RAM populated above 3G then the read
    is by chance the same as nothing changes that memory during the very
    early boot stage. That's not necessarily true during runtime CPU
    hotplug.

 2) When function tracing is enabled, the relevant microcode loader
    functions and the functions invoked from there will call into the
    tracing code and evaluate global and per CPU variables in physical
    address mode. What could potentially go wrong?

Cure this and move the microcode loading after the early paging enable, use
the new temporary initrd mapping and remove the gunk in the microcode
loader which is required to handle physical address mode.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.348298216@linutronix.de
2023-10-18 22:15:01 +02:00
Thomas Gleixner
4c585af718 x86/boot/32: Temporarily map initrd for microcode loading
Early microcode loading on 32-bit runs in physical address mode because
the initrd is not covered by the initial page tables. That results in
a horrible mess all over the microcode loader code.

Provide a temporary mapping for the initrd in the initial page tables by
appending it to the actual initial mapping starting with a new PGD or
PMD depending on the configured page table levels ([non-]PAE).

The page table entries are located after _brk_end so they are not
permanently using memory space. The mapping is invalidated right away in
i386_start_kernel() after the early microcode loader has run.

This prepares for removing the physical address mode oddities from all
over the microcode loader code, which in turn allows further cleanups.

Provide the map and unmap code and document the place where the
microcode loader needs to be invoked with a comment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.292291436@linutronix.de
2023-10-18 18:28:41 +02:00
Thomas Gleixner
a62f4ca106 x86/boot/32: De-uglify the 2/3 level paging difference in mk_early_pgtbl_32()
Move the ifdeffery out of the function and use proper typedefs to make it
work for both 2 and 3 level paging.

No functional change.

  [ bp: Move mk_early_pgtbl_32() declaration into a header. ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.111059491@linutronix.de
2023-10-18 18:27:30 +02:00
Ard Biesheuvel
d55d5bc5d9 x86/boot: Rename conflicting 'boot_params' pointer to 'boot_params_ptr'
The x86 decompressor is built and linked as a separate executable, but
it shares components with the kernel proper, which are either #include'd
as C files, or linked into the decompresor as a static library (e.g, the
EFI stub)

Both the kernel itself and the decompressor define a global symbol
'boot_params' to refer to the boot_params struct, but in the former
case, it refers to the struct directly, whereas in the decompressor, it
refers to a global pointer variable referring to the struct boot_params
passed by the bootloader or constructed from scratch.

This ambiguity is unfortunate, and makes it impossible to assign this
decompressor variable from the x86 EFI stub, given that declaring it as
extern results in a clash. So rename the decompressor version (whose
scope is limited) to boot_params_ptr.

[ mingo: Renamed 'boot_params_p' to 'boot_params_ptr' for clarity ]

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
2023-10-18 12:03:03 +02:00
Mingwei Zhang
5a989bbead KVM: x86: Update the variable naming in kvm_x86_ops.sched_in()
Update the variable with name 'kvm' in kvm_x86_ops.sched_in() to 'vcpu' to
avoid confusions. Variable naming in KVM has a clear convention that 'kvm'
refers to pointer of type 'struct kvm *', while 'vcpu' refers to pointer of
type 'struct kvm_vcpu *'.

Fix this 9-year old naming issue for fun.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231017232610.4008690-1-mizhang@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-17 16:40:01 -07:00
Hou Wenlong
d2a285d65b x86/head/64: Move the __head definition to <asm/init.h>
Move the __head section definition to a header to widen its use.

An upcoming patch will mark the code as __head in mem_encrypt_identity.c too.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/0583f57977be184689c373fe540cbd7d85ca2047.1697525407.git.houwenlong.hwl@antgroup.com
2023-10-17 14:51:14 +02:00
Linus Torvalds
86d6a628a2 ARM:
- Fix the handling of the phycal timer offset when FEAT_ECV
   and CNTPOFF_EL2 are implemented.
 
 - Restore the functionnality of Permission Indirection that
   was broken by the Fine Grained Trapping rework
 
 - Cleanup some PMU event sharing code
 
 MIPS:
 
 - Fix W=1 build.
 
 s390:
 
 - One small fix for gisa to avoid stalls.
 
 x86:
 
 - Truncate writes to PMU counters to the counter's width to avoid spurious
   overflows when emulating counter events in software.
 
 - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined
   architectural behavior).
 
 - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to
   kick the guest out of emulated halt.
 
 - Fix for loading XSAVE state from an old kernel into a new one.
 
 - Fixes for AMD AVIC
 
 selftests:
 
 - Play nice with %llx when formatting guest printf and assert statements.
 
 - Clean up stale test metadata.
 
 - Zero-initialize structures in memslot perf test to workaround a suspected
   "may be used uninitialized" false positives from GCC.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmUtvXgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOE3gf/Q0Xvi/oU/+iDMuvfCbMZg/nhbrsa
 WmE/zXLrCF0DknppAsWulkhLGL2ceL6X+f37f2vWpBdG9SVDG/vSAg+QQDwsXiKN
 hTJoaybtMMPZM64emPZKeLMrq3A/U32qIMmWMJkoQRyz6dftUhGqZEuy1jw8oomJ
 n9idRDCMkbo+bick4URt0FEuI3Tf6dPIlG7P5hObFTw+nenzzxTjoxWZ432Mgyod
 yqveEke4hcQ+6K1zdAcDNZqT9ZhxeTxAO4yrBAYfnFoPLhUXKDUumkqAQPNOhKTo
 YN+b29kHBm+HvYkHN785FQla/13wjE1aq5TUj5J7NEDv4uRXDefDq2OAeg==
 =b9AY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Fix the handling of the phycal timer offset when FEAT_ECV and
     CNTPOFF_EL2 are implemented

   - Restore the functionnality of Permission Indirection that was
     broken by the Fine Grained Trapping rework

   - Cleanup some PMU event sharing code

  MIPS:

   - Fix W=1 build

  s390:

   - One small fix for gisa to avoid stalls

  x86:

   - Truncate writes to PMU counters to the counter's width to avoid
     spurious overflows when emulating counter events in software

   - Set the LVTPC entry mask bit when handling a PMI (to match
     Intel-defined architectural behavior)

   - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work
     to kick the guest out of emulated halt

   - Fix for loading XSAVE state from an old kernel into a new one

   - Fixes for AMD AVIC

  selftests:

   - Play nice with %llx when formatting guest printf and assert
     statements

   - Clean up stale test metadata

   - Zero-initialize structures in memslot perf test to workaround a
     suspected 'may be used uninitialized' false positives from GCC"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
  KVM: arm64: timers: Correctly handle TGE flip with CNTPOFF_EL2
  KVM: arm64: POR{E0}_EL1 do not need trap handlers
  KVM: arm64: Add nPIR{E0}_EL1 to HFG traps
  KVM: MIPS: fix -Wunused-but-set-variable warning
  KVM: arm64: pmu: Drop redundant check for non-NULL kvm_pmu_events
  KVM: SVM: Fix build error when using -Werror=unused-but-set-variable
  x86: KVM: SVM: refresh AVIC inhibition in svm_leave_nested()
  x86: KVM: SVM: add support for Invalid IPI Vector interception
  x86: KVM: SVM: always update the x2avic msr interception
  KVM: selftests: Force load all supported XSAVE state in state test
  KVM: selftests: Load XSAVE state into untouched vCPU during state test
  KVM: selftests: Touch relevant XSAVE state in guest for state test
  KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
  x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
  KVM: selftests: Zero-initialize entire test_result in memslot perf test
  KVM: selftests: Remove obsolete and incorrect test case metadata
  KVM: selftests: Treat %llx like %lx when formatting guest printf
  KVM: x86/pmu: Synthesize at most one PMI per VM-exit
  KVM: x86: Mask LVTPC when handling a PMI
  KVM: x86/pmu: Truncate counter value to allowed width on write
  ...
2023-10-16 18:34:17 -07:00
Yazen Ghannam
1bae0cfe4a x86/mce: Cleanup mce_usable_address()
Move Intel-specific checks into a helper function.

Explicitly use "bool" for return type.

No functional change intended.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613141142.36801-4-yazen.ghannam@amd.com
2023-10-16 15:37:01 +02:00
Linus Torvalds
fbe1bf1e5f Revert "x86/smp: Put CPUs into INIT on shutdown if possible"
This reverts commit 45e34c8af5, and the
two subsequent fixes to it:

  3f874c9b2a ("x86/smp: Don't send INIT to non-present and non-booted CPUs")
  b1472a60a5 ("x86/smp: Don't send INIT to boot CPU")

because it seems to result in hung machines at shutdown.  Particularly
some Dell machines, but Thomas says

 "The rest seems to be Lenovo and Sony with Alderlake/Raptorlake CPUs -
  at least that's what I could figure out from the various bug reports.

  I don't know which CPUs the DELL machines have, so I can't say it's a
  pattern.

  I agree with the revert for now"

Ashok Raj chimes in:

 "There was a report (probably this same one), and it turns out it was a
  bug in the BIOS SMI handler.

  The client BIOS's were waiting for the lowest APICID to be the SMI
  rendevous master. If this is MeteorLake, the BSP wasn't the one with
  the lowest APIC and it triped here.

  The BIOS change is also being pushed to others for assimilation :)

  Server BIOS's had this correctly for a while now"

and it does look likely to be some bad interaction between SMI and the
non-BSP cores having put into INIT (and thus unresponsive until reset).

Link: https://bbs.archlinux.org/viewtopic.php?pid=2124429
Link: https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
Link: https://forum.artixlinux.org/index.php/topic,5997.0.html
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2241279
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-10-15 12:02:02 -07:00
Linus Torvalds
ddf2085598 Fix a Longsoon build warning by harmonizing the arch_[un]register_cpu()
prototypes between architectures.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUrobMRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ggUw/7BGHe360tsrsMAcOHcwvvGhnQ0UKuoqLs
 IJl3dfsdX+JnL7cpbNcRBVDqgH2seIwdQFa73gALColcxntEBbnC/gVS4QLLSxSp
 HIq5C1OELT/jPMOjc6aimJx/qPvW/CLgo2WJx78rv0ykkf1RJIzqCTVKf8VQX6Vu
 t0/9jEhBNuL8DZthJ5ZV448WtlJcdnWXVGxq/UHEheV219Rjtp3NGf8s/K+WMzF3
 x9Zhmb+/UPgjhaZtrQDP2mf7ZYgmVhLvJTRSQdQNrcDe/ZaNrCrEGOwHuOpQ0vXw
 v42rd1AVGV/xgIhfBOABLdb5snBbQMDvYLcma04bkBd6H6WPFJZ1PvnGovTagxUO
 FP4117VBA5ARwZemxwGEPJkNF9lVEPSBVDv7bx2OO0zVCViuAbKJXxGzCW/GiSGA
 BRk5FogxJ7TcjWsYWWaZfYlq8RFI5UI3K/IEQIUpQKtC9OMhScdp532xEP48dc8u
 pHjPjVoYCXtoD/ZD73ZJJduN6Hn30HEE/IJ6+YKJRo4EZquUrGtbgG46QbFBqxqW
 xIPPTx7OPAaAfq020+c6BuUnta1iEY6I4De/+XbRdQf+AIWqfLsWNI8en8aO4t5p
 rtlrYD2Si1F0KBWtDWR7JhCm8CD2klWVWrD9d4DLpz9ljHXKa9d7BYp3BxkvgcQm
 x8f1D9yC9X4=
 =20Na
 -----END PGP SIGNATURE-----

Merge tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU hotplug fix from Ingo Molnar:
 "Fix a Longsoon build warning by harmonizing the
  arch_[un]register_cpu() prototypes between architectures"

* tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu-hotplug: Provide prototypes for arch CPU registration
2023-10-15 08:44:56 -07:00
Paolo Bonzini
88e4cd893f KVM x86/pmu fixes for 6.6:
- Truncate writes to PMU counters to the counter's width to avoid spurious
    overflows when emulating counter events in software.
 
  - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined
    architectural behavior).
 
  - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to
    kick the guest out of emulated halt.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmUp1FESHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5IRsQAIsk+UwTP+q+ZzkpkSOJ+ocmKU97/GbW
 snB+F5FwNXnWEPzHIV+Ldv+WUpmHilTrylk2t5jLyew783TPxTnLmNAa+D3iSSBP
 jSGzCIqR2uRHOxhuJgkKvdOkfuS7vob1KcKrfOwKCSss78VhKGkMGIi66/81RTxo
 zxpzva+F2YtbCwKWXewOvR4CsWhjVqOGRTCmjF6t8PpFDGqwZdu0ornBHC2gvkUI
 iDHWVBg5Rz/akqxjEVL94SP5qdFSaVG+F3Z8xpnn+tfPncEK/xPFdGHGKwOy5Jvt
 4dQLc6TGmS2+NGPU3eAJOr+GZKryQth1CI+5RDlnoKQXjQ3laJwjmgyCRbUYLoZh
 /R7f5YJrhGheUvCCmagY1g2x41qp/CTG1RnX1SVTIGH9h+5LSVcCukCL9Tx2/B4v
 eU8nrzhUuijSqG6TiyAV5hvFqMQf3LWWcjSSW58kIWmXLpqdb/Xp6wiFHjOM7wZM
 c1br+6AwKZwKNdqn3/cnlBnLc+1jq/PWFnuF9svjKn5JTOyg8kddmyWUkDqiLOeZ
 /jqqwRJQUZppy4DxFHdkuQxnTsrztNzs/vhQtF6MIgFRULrs4FaiTUxuAs72skqm
 Fv/IIuyHWjST9HY8dgTx8PLqUevEc7zekmhN1Cj5KwhlHxKYWSZfew80CO7h2qhJ
 IvAC70QC+BsW
 =g8g3
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.6-fixes' of https://github.com/kvm-x86/linux into HEAD

KVM x86/pmu fixes for 6.6:

 - Truncate writes to PMU counters to the counter's width to avoid spurious
   overflows when emulating counter events in software.

 - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined
   architectural behavior).

 - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to
   kick the guest out of emulated halt.
2023-10-15 08:24:18 -04:00
Linus Torvalds
dc9b2e683b Fix a false-positive KASAN warning, fix an AMD erratum on Zen4 CPUs,
and fix kernel-doc build warnings.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUrEOQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hG+g/+OUdKjz6u3rod9ZR0/u0koyXFsat2V+CD
 9OSUsDvff/7alRhztUoyj3g1t+brp1GuvyPTJ+KN3jJO08PHfQgm7eH873dguPSs
 k4o45eKglYuCb+9CQvOJggqOp7W2tvA1BW8pU/YdNgwxOARZFBeY6nkxW7oii9Ff
 TLRtzVfBAShYm7QK8HIae7DpPwqt1RcINcWRa4HX93/Gh364Q3Cywj+8DfARweWq
 J+47bgrMP5o+AHWiZs0AGsflXy0PYmrmFzh/d6iMqLSZuAhyG6heyzuThAimze5L
 5O89Dh/cPHTwQdY1pOAYezMVRNECecsrv6nlu9uTjCkvutopd1EB+WfZh2z1qaPB
 3t6JKliUSRUrckk/RtM8qIcb6VV7G0+6ez9n6EGo+9ci2On8afN2va16qo/rrwU7
 dfCyRLxaoLA4AZWjvtVSMHRlA4Iox+pH7AlBCeALIStHXZfaPJIjNkfyO5iT3Mug
 6KyI8QDpVvGfWz5GwE+4GaB6IKekcGOqROubK9X8mtNg+FhDU7xcIEM2f28etmYt
 kmKWD9yJdKiaW+bwJ3C2krUhtxQgW4wNjLrrJTU3m8tsc+7SwEKbUls+OHHEUyMD
 rE7X64fuL9E+VXyOg58bN4XbUpLm94z6E4IuBPsLthdKPsz9Lq6R2BZlpZQ4cVyG
 efglSGvEHnU=
 =5l6l
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
 "Fix a false-positive KASAN warning, fix an AMD erratum on Zen4 CPUs,
  and fix kernel-doc build warnings"

* tag 'x86-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternatives: Disable KASAN in apply_alternatives()
  x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs
  x86/resctrl: Fix kernel-doc warnings
2023-10-14 15:32:20 -07:00
Brian Gerst
ca282b486a x86/entry/64: Convert SYSRET validation tests to C
No change in functionality expected.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231011224351.130935-2-brgerst@gmail.com
2023-10-13 13:05:28 +02:00
Ingo Molnar
92fe9bb77b x86/apic, x86/hyperv: Use u32 in hv_snp_boot_ap() too
The data type for APIC IDs was standardized to 'u32' in the
following recent commit:

   db4a4086a2 ("x86/apic: Use u32 for wakeup_secondary_cpu[_64]()")

Which changed the function arguments type signature of the
apic->wakeup_secondary_cpu() APIC driver function.

Propagate this to hv_snp_boot_ap() as well, which also addresses a
'assignment from incompatible pointer type' build warning that triggers
under the -Werror=incompatible-pointer-types GCC warning.

Fixes: db4a4086a2 ("x86/apic: Use u32 for wakeup_secondary_cpu[_64]()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230814085113.233274223@linutronix.de
2023-10-13 12:26:58 +02:00
Borislav Petkov
deedec0a15 x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations
The comments introduced in <asm/msr-index.h> in the merge conflict fixup in:

  8f4156d587 ("Merge branch 'x86/urgent' into perf/core, to resolve conflict")

... aren't right: AMD naming schemes are more complex than implied,
family 0x17 is Zen1 and 2, family 0x19 is spread around Zen 3 and 4.

So there's indeed four separate MSR namespaces for:

  MSR_F17H_
  MSR_F19H_
  MSR_ZEN2_
  MSR_ZEN4_

... and the namespaces cannot be merged.

Fix it up. No change in functionality.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/D99589F4-BC5D-430B-87B2-72C20370CF57@exactcode.com
2023-10-12 20:10:39 +02:00
Maxim Levitsky
2dcf37abf9 x86: KVM: SVM: add support for Invalid IPI Vector interception
In later revisions of AMD's APM, there is a new 'incomplete IPI' exit code:

"Invalid IPI Vector - The vector for the specified IPI was set to an
illegal value (VEC < 16)"

Note that tests on Zen2 machine show that this VM exit doesn't happen and
instead AVIC just does nothing.

Add support for this exit code by doing nothing, instead of filling
the kernel log with errors.

Also replace an unthrottled 'pr_err()' if another unknown incomplete
IPI exit happens with vcpu_unimpl()

(e.g in case AMD adds yet another 'Invalid IPI' exit reason)

Cc: <stable@vger.kernel.org>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928173354.217464-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:08:59 -04:00
Sean Christopherson
18164f66e6 x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
Plumb an xfeatures mask into __copy_xstate_to_uabi_buf() so that KVM can
constrain which xfeatures are saved into the userspace buffer without
having to modify the user_xfeatures field in KVM's guest_fpu state.

KVM's ABI for KVM_GET_XSAVE{2} is that features that are not exposed to
guest must not show up in the effective xstate_bv field of the buffer.
Saving only the guest-supported xfeatures allows userspace to load the
saved state on a different host with a fewer xfeatures, so long as the
target host supports the xfeatures that are exposed to the guest.

KVM currently sets user_xfeatures directly to restrict KVM_GET_XSAVE{2} to
the set of guest-supported xfeatures, but doing so broke KVM's historical
ABI for KVM_SET_XSAVE, which allows userspace to load any xfeatures that
are supported by the *host*.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928001956.924301-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:08:58 -04:00
Suma Hegde
5150542b8e
platform/x86/amd/hsmp: add support for metrics tbl
AMD MI300 MCM provides GET_METRICS_TABLE message to retrieve
all the system management information from SMU.

The metrics table is made available as hexadecimal sysfs binary file
under per socket sysfs directory created at
/sys/devices/platform/amd_hsmp/socket%d/metrics_bin

Metrics table definitions will be documented as part of Public PPR.
The same is defined in the amd_hsmp.h header.

Signed-off-by: Suma Hegde <suma.hegde@amd.com>
Reviewed-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Link: https://lore.kernel.org/r/20231010120310.3464066-2-suma.hegde@amd.com
[ij: lseek -> lseek(), dram -> DRAM in dev_err()]
[ij: added period to terminate a documentation sentence]
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2023-10-12 16:29:58 +03:00
Thomas Zimmermann
052ddf7b86 fbdev: Replace fb_pgprotect() with pgprot_framebuffer()
Rename the fbdev mmap helper fb_pgprotect() to pgprot_framebuffer().
The helper sets VMA page-access flags for framebuffers in device I/O
memory.

Also clean up the helper's parameters and return value. Instead of
the VMA instance, pass the individial parameters separately: existing
page-access flags, the VMAs start and end addresses and the offset
in the underlying device memory rsp file. Return the new page-access
flags. These changes align pgprot_framebuffer() with other pgprot_()
functions.

v4:
	* fix commit message (Christophe)
v3:
	* rename fb_pgprotect() to pgprot_framebuffer() (Arnd)

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230922080636.26762-3-tzimmermann@suse.de
2023-10-12 09:20:46 +02:00
Lu Yao
441ccc3512 x86/msi: Fix compile error caused by CONFIG_GENERIC_MSI_IRQ=y && !CONFIG_X86_LOCAL_APIC
When compiling the x86 kernel, if X86_LOCAL_APIC is not enabled but
GENERIC_MSI_IRQ is selected in '.config', the following compilation
error will occur:

  include/linux/gpio/driver.h:38:19: error:
    field 'msiinfo' has incomplete type

  kernel/irq/msi.c:752:5: error: invalid use of incomplete typedef
    'msi_alloc_info_t' {aka 'struct irq_alloc_info'}

  kernel/irq/msi.c:740:1: error: control reaches end of non-void function

This is because file such as 'kernel/irq/msi.c' only depends on
'GENERIC_MSI_IRQ', and uses 'struct msi_alloc_info_t'. However,
this struct depends on 'X86_LOCAL_APIC'.

When enable 'GENERIC_MSI_IRQ' or 'X86_LOCAL_APIC' will select
'IRQ_DOMAIN_HIERARCHY', so exposing this struct using
'IRQ_DOMAIN_HIERARCHY' rather than 'X86_LOCAL_APIC'.

Under the above conditions, if 'HPET_TIMER' is selected, the following
compilation error will occur:

  arch/x86/kernel/hpet.c:550:13: error: ‘x86_vector_domain’ undeclared

  arch/x86/kernel/hpet.c:600:9: error: implicit declaration of
    function ‘init_irq_alloc_info’

This is because 'x86_vector_domain' is defined in 'kernel/apic/vector.c'
which is compiled only when 'X86_LOCAL_APIC' is enabled. Besides,
function 'msi_domain_set_affinity' is defined in 'include/linux/msi.h'
which depends on 'GENERIC_MSI_IRQ'. So use 'X86_LOCAL_APIC' and
'GENERIC_MSI_IRQ' to expose these code.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Lu Yao <yaolu@kylinos.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231012032659.323251-1-yaolu@kylinos.cn
2023-10-12 08:13:27 +02:00
Ingo Molnar
8f4156d587 Merge branch 'x86/urgent' into perf/core, to resolve conflict
Resolve an MSR enumeration conflict.

Conflicts:
	arch/x86/include/asm/msr-index.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-11 22:54:03 +02:00
Russell King (Oracle)
c4dd854f74 cpu-hotplug: Provide prototypes for arch CPU registration
Provide common prototypes for arch_register_cpu() and
arch_unregister_cpu(). These are called by acpi_processor.c, with weak
versions, so the prototype for this is already set. It is generally not
necessary for function prototypes to be conditional on preprocessor macros.

Some architectures (e.g. Loongarch) are missing the prototype for this, and
rather than add it to Loongarch's asm/cpu.h, do the job once for everyone.

Since this covers everyone, remove the now unnecessary prototypes in
asm/cpu.h, and therefore remove the 'static' from one of ia64's
arch_register_cpu() definitions.

[ tglx: Bring back the ia64 part and remove the ACPI prototypes ]

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1qkoRr-0088Q8-Da@rmk-PC.armlinux.org.uk
2023-10-11 14:27:37 +02:00
Borislav Petkov (AMD)
f454b18e07 x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs
Fix erratum #1485 on Zen4 parts where running with STIBP disabled can
cause an #UD exception. The performance impact of the fix is negligible.

Reported-by: René Rebe <rene@exactcode.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: René Rebe <rene@exactcode.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/D99589F4-BC5D-430B-87B2-72C20370CF57@exactcode.com
2023-10-11 11:00:11 +02:00
Alexander Shishkin
6e74b12515 x86/sev: Move sev_setup_arch() to mem_encrypt.c
Since commit:

  4d96f91091 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()")

... the SWIOTLB bounce buffer size adjustment and restricted virtio memory
setting also inadvertently apply to TDX: the code is using
cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) as a gatekeeping condition,
which is also true for TDX, and this is also what we want.

To reflect this, move the corresponding code to generic mem_encrypt.c.

No functional changes intended.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20231010145220.3960055-2-alexander.shishkin@linux.intel.com
2023-10-11 10:15:47 +02:00
Linus Torvalds
b711538a40 hyperv-fixes for v6.6-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmUk4fcTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXhhqCACWsBYTB0EJ3oMJnzfnHeuN418ZDx/O
 AL0k0O5MT6roEFmvGUhzJ/jsoxL+W+Wj3aFwzReyOSQpgjTTF/Ja26LPvxRzDxKi
 sZPojnR2ykW31l7y+eh1p9qSM/aYvTMDP5zO7L1fBnWMAGMv8w8RezpCJ7bh4BgA
 FTMZZrvKYVT9hCGkYqKUZGBtDTPZ56WE+MCiRxTWQvF+4QKaIff0tpno8V7203bE
 D/b4+Ouh19RXFTC5dUq/0JtAdV2AadrPHnScUupc8Hk/MMFiU5CzvH4bAqiwXBcU
 YqqlD3kZbIqqbKE93+03jvyrRDvDGlq+rpA3KMk5MBAfrkM4DytpWvMs
 =SVq1
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - fixes for Hyper-V VTL code (Saurabh Sengar and Olaf Hering)

 - fix hv_kvp_daemon to support keyfile based connection profile
   (Shradha Gupta)

* tag 'hyperv-fixes-signed-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hv/hv_kvp_daemon:Support for keyfile based connection profile
  hyperv: reduce size of ms_hyperv_info
  x86/hyperv: Add common print prefix "Hyper-V" in hv_init
  x86/hyperv: Remove hv_vtl_early_init initcall
  x86/hyperv: Restrict get_vtl to only VTL platforms
2023-10-10 11:01:21 -07:00
Thomas Gleixner
db4a4086a2 x86/apic: Use u32 for wakeup_secondary_cpu[_64]()
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.233274223@linutronix.de
2023-10-10 14:38:19 +02:00
Thomas Gleixner
59f7928cd4 x86/apic: Use u32 for [gs]et_apic_id()
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.172569282@linutronix.de
2023-10-10 14:38:19 +02:00
Thomas Gleixner
01ccf9bbd2 x86/apic: Use u32 for phys_pkg_id()
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width even if that callback going to be removed soonish.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.113097126@linutronix.de
2023-10-10 14:38:19 +02:00
Thomas Gleixner
8aa2a4178d x86/apic: Use u32 for cpu_present_to_apicid()
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width and fixup a few related usage sites for consistency sake.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.054064391@linutronix.de
2023-10-10 14:38:19 +02:00
Thomas Gleixner
5d376b8fb1 x86/apic: Use u32 for check_apicid_used()
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width and move the default implementation to local.h as there are
no users outside the apic directory.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.981956102@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner
4705243d23 x86/apic: Use u32 for APIC IDs in global data
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width and fixup the most obvious usage sites of that.

The APIC callbacks will be addressed separately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.922905727@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner
6e29032340 x86/cpu: Move cpu_l[l2]c_id into topology info
The topology IDs which identify the LLC and L2 domains clearly belong to
the per CPU topology information.

Move them into cpuinfo_x86::cpuinfo_topo and get rid of the extra per CPU
data and the related exports.

This also paves the way to do proper topology evaluation during early boot
because it removes the only per CPU dependency for that.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.803864641@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner
22dc963162 x86/cpu: Move logical package and die IDs into topology info
Yet another topology related data pair. Rename logical_proc_id to
logical_pkg_id so it fits the common naming conventions.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.745139505@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner
e3c0c5d52a x86/cpu: Move cu_id into topology info
No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.628405546@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner
e95256335d x86/cpu: Move cpu_core_id into topology info
Rename it to core_id and stick it to the other ID fields.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.566519388@linutronix.de
2023-10-10 14:38:17 +02:00
Thomas Gleixner
8a169ed40f x86/cpu: Move cpu_die_id into topology info
Move the next member.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.388185134@linutronix.de
2023-10-10 14:38:17 +02:00
Thomas Gleixner
02fb601d27 x86/cpu: Move phys_proc_id into topology info
Rename it to pkg_id which is the terminology used in the kernel.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.329006989@linutronix.de
2023-10-10 14:38:17 +02:00
Thomas Gleixner
b9655e702d x86/cpu: Encapsulate topology information in cpuinfo_x86
The topology related information is randomly scattered across cpuinfo_x86.

Create a new structure cpuinfo_topo and move in a first step initial_apicid
and apicid into it.

Aside of being better readable this is in preparation for replacing the
horribly fragile CPU topology evaluation code further down the road.

Consolidate APIC ID fields to u32 as that represents the hardware type.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.269787744@linutronix.de
2023-10-10 14:38:17 +02:00
Like Xu
bf328e22e4 KVM: x86: Don't sync user-written TSC against startup values
The legacy API for setting the TSC is fundamentally broken, and only
allows userspace to set a TSC "now", without any way to account for
time lost between the calculation of the value, and the kernel eventually
handling the ioctl.

To work around this, KVM has a hack which, if a TSC is set with a value
which is within a second's worth of the last TSC "written" to any vCPU in
the VM, assumes that userspace actually intended the two TSC values to be
in sync and adjusts the newly-written TSC value accordingly.

Thus, when a VMM restores a guest after suspend or migration using the
legacy API, the TSCs aren't necessarily *right*, but at least they're
in sync.

This trick falls down when restoring a guest which genuinely has been
running for less time than the 1 second of imprecision KVM allows for in
in the legacy API.  On *creation*, the first vCPU starts its TSC counting
from zero, and the subsequent vCPUs synchronize to that.  But then when
the VMM tries to restore a vCPU's intended TSC, because the VM has been
alive for less than 1 second and KVM's default TSC value for new vCPU's is
'0', the intended TSC is within a second of the last "written" TSC and KVM
incorrectly adjusts the intended TSC in an attempt to synchronize.

But further hacks can be piled onto KVM's existing hackish ABI, and
declare that the *first* value written by *userspace* (on any vCPU)
should not be subject to this "correction", i.e. KVM can assume that the
first write from userspace is not an attempt to sync up with TSC values
that only come from the kernel's default vCPU creation.

To that end: Add a flag, kvm->arch.user_set_tsc, protected by
kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in
the VM *has* been set by userspace, and make the 1-second slop hack only
trigger if user_set_tsc is already set.

Note that userspace can explicitly request a *synchronization* of the
TSC by writing zero. For the purpose of user_set_tsc, an explicit
synchronization counts as "setting" the TSC, i.e. if userspace then
subsequently writes an explicit non-zero value which happens to be within
1 second of the previous value, the new value will be "corrected".  This
behavior is deliberate, as treating explicit synchronization as "setting"
the TSC preserves KVM's existing behaviour inasmuch as possible (KVM
always applied the 1-second "correction" regardless of whether the write
came from userspace vs. the kernel).

Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Original-by: Oliver Upton <oliver.upton@linux.dev>
Original-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Tested-by: Yong He <alexyonghe@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231008025335.7419-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-09 17:29:52 -07:00
Uros Bizjak
636d6a8b85 locking/atomic/x86: Introduce arch_sync_try_cmpxchg()
Introduce the arch_sync_try_cmpxchg() macro to improve code using
sync_try_cmpxchg() locking primitive. The new definitions use existing
__raw_try_cmpxchg() macros, but use its own "lock; " prefix.

The new macros improve assembly of the cmpxchg loop in
evtchn_fifo_unmask() from drivers/xen/events/events_fifo.c from:

 57a:	85 c0                	test   %eax,%eax
 57c:	78 52                	js     5d0 <...>
 57e:	89 c1                	mov    %eax,%ecx
 580:	25 ff ff ff af       	and    $0xafffffff,%eax
 585:	c7 04 24 00 00 00 00 	movl   $0x0,(%rsp)
 58c:	81 e1 ff ff ff ef    	and    $0xefffffff,%ecx
 592:	89 4c 24 04          	mov    %ecx,0x4(%rsp)
 596:	89 44 24 08          	mov    %eax,0x8(%rsp)
 59a:	8b 74 24 08          	mov    0x8(%rsp),%esi
 59e:	8b 44 24 04          	mov    0x4(%rsp),%eax
 5a2:	f0 0f b1 32          	lock cmpxchg %esi,(%rdx)
 5a6:	89 04 24             	mov    %eax,(%rsp)
 5a9:	8b 04 24             	mov    (%rsp),%eax
 5ac:	39 c1                	cmp    %eax,%ecx
 5ae:	74 07                	je     5b7 <...>
 5b0:	a9 00 00 00 40       	test   $0x40000000,%eax
 5b5:	75 c3                	jne    57a <...>
 <...>

to:

 578:	a9 00 00 00 40       	test   $0x40000000,%eax
 57d:	74 2b                	je     5aa <...>
 57f:	85 c0                	test   %eax,%eax
 581:	78 40                	js     5c3 <...>
 583:	89 c1                	mov    %eax,%ecx
 585:	25 ff ff ff af       	and    $0xafffffff,%eax
 58a:	81 e1 ff ff ff ef    	and    $0xefffffff,%ecx
 590:	89 4c 24 04          	mov    %ecx,0x4(%rsp)
 594:	89 44 24 08          	mov    %eax,0x8(%rsp)
 598:	8b 4c 24 08          	mov    0x8(%rsp),%ecx
 59c:	8b 44 24 04          	mov    0x4(%rsp),%eax
 5a0:	f0 0f b1 0a          	lock cmpxchg %ecx,(%rdx)
 5a4:	89 44 24 04          	mov    %eax,0x4(%rsp)
 5a8:	75 30                	jne    5da <...>
 <...>
 5da:	8b 44 24 04          	mov    0x4(%rsp),%eax
 5de:	eb 98                	jmp    578 <...>

The new code removes move instructions from 585: 5a6: and 5a9:
and the compare from 5ac:. Additionally, the compiler assumes that
cmpxchg success is more probable and optimizes code flow accordingly.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
2023-10-09 18:14:25 +02:00
Ingo Molnar
fdb8b7a1af Linux 6.6-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmUjFeceHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNCAH/RDI8G44DCV9Ps5U
 rl/FMf6iLUxU6fCS3Wwe8vtppLjPP7Y16AH5HKMumoDIqTfh9ZAUVKhZfT+PTgz3
 /oFXcGzZQLTcdbtH7XK2/zk7N/RI25/rDiCDd1uIJVCNii+hsBKS6Ihc4wXadxaR
 0z3lwoEKp2egeaeqmJWMzJLdjRrYhLs33+SEciVYqTiIvlWsM5QBm/sMvES7V57s
 TXrs5/y7yXtDBZ2PgYNCBRLyBazjqB28x07aQoePOAs6nFXl5N/wWPW/4wirWFHT
 s9LYZlmVo+O+RHWj10ASm/2l+ihgn959ZfRj1VekK2AWU1x/VzSPcuCXKvsrUoa+
 xEjL+vM=
 =efE3
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-09 18:09:23 +02:00
Sandipan Das
25e5684782 perf/x86/amd/uncore: Add memory controller support
Unified Memory Controller (UMC) events were introduced with Zen 4 as a
part of the Performance Monitoring Version 2 (PerfMonV2) enhancements.
An event is specified using the EventSelect bits and the RdWrMask bits
can be used for additional filtering of read and write requests.

As of now, a maximum of 12 channels of DDR5 are available on each socket
and each channel is controlled by a dedicated UMC. Each UMC, in turn,
has its own set of performance monitoring counters.

Since the MSR address space for the UMC PERF_CTL and PERF_CTR registers
are reused across sockets, uncore groups are created on the basis of
socket IDs. Hence, group exclusivity is mandatory while opening events
so that events for an UMC can only be opened on CPUs which are on the
same socket as the corresponding memory channel.

For each socket, the total number of available UMC counters and active
memory channels are determined from CPUID leaf 0x80000022 EBX and ECX
respectively. Usually, on Zen 4, each UMC has four counters.

MSR assignments are determined on the basis of active UMCs. E.g. if
UMCs 1, 4 and 9 are active for a given socket, then

  * UMC 1 gets MSRs 0xc0010800 to 0xc0010807 as PERF_CTLs and PERF_CTRs
  * UMC 4 gets MSRs 0xc0010808 to 0xc001080f as PERF_CTLs and PERF_CTRs
  * UMC 9 gets MSRs 0xc0010810 to 0xc0010817 as PERF_CTLs and PERF_CTRs

If there are sockets without any online CPUs when the amd_uncore driver
is loaded, UMCs for such sockets will not be discoverable since the
mechanism relies on executing the CPUID instruction on an online CPU
from the socket.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/b25f391205c22733493abec1ed850b71784edc5f.1696425185.git.sandipan.das@amd.com
2023-10-09 16:12:25 +02:00
Waiman Long
e3e3bab184 x86/speculation: Add __update_spec_ctrl() helper
Add a new __update_spec_ctrl() helper which is a variant of
update_spec_ctrl() that can be used in a noinstr function.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20230727184600.26768-2-longman@redhat.com
2023-10-07 11:33:28 +02:00
Jithu Joseph
97a5e801b3
platform/x86/intel/ifs: Store IFS generation number
IFS generation number is reported via MSR_INTEGRITY_CAPS.  As IFS
support gets added to newer CPUs, some differences are expected during
IFS image loading and test flows.

Define MSR bitmasks to extract and store the generation in driver data,
so that driver can modify its MSR interaction appropriately.

Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Link: https://lore.kernel.org/r/20231005195137.3117166-2-jithu.joseph@intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2023-10-06 13:05:13 +03:00
Brian Gerst
0d3109ad2e x86/entry/32: Convert do_fast_syscall_32() to bool return type
Doesn't have to be 'long' - this simplifies the code a bit.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230721161018.50214-5-brgerst@gmail.com
2023-10-05 10:06:42 +02:00
Ingo Molnar
3fc18b06b8 Linux 6.6-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmUZ4WEeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGnIYH/07zef2U1nlqI+ro
 HRL2GlWGIs9yE70Oax+A3eYUYsjJIPu0yiDhFHUgOV3VyAALo44ZX/WNwKCGsI3e
 zhuOeItyyVcLGZXVC/jxSu0uveyfEiEYIWRYGyQ6Sna8Ksdk/qwhNgQNotdWdQG5
 7xt8z32couglu0uOkxcGqjTxmbjO6WSM5qi7Ts+xLsgrcS5cRuNhAg/vezp9bfeL
 1IUieCih4RJFgar/6LPOiB8uoVXEBonVbtlTRRqYdnqcsSIC+ACR9ZFk/+X88b5z
 S+Ta5VTcOAPu+2M/lSGe+PlUECvoBNK0SIYnaVCP2paPmDxfDXOFvSy/qJE87/7L
 9BeonFw=
 =8FTr
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-rc4' into x86/entry, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-05 10:05:51 +02:00
Sean Christopherson
aeb904f6b9 KVM: x86: Refactor can_emulate_instruction() return to be more expressive
Refactor and rename can_emulate_instruction() to allow vendor code to
return more than true/false, e.g. to explicitly differentiate between
"retry", "fault", and "unhandleable".  For now, just do the plumbing, a
future patch will expand SVM's implementation to signal outright failure
if KVM attempts EMULTYPE_SKIP on an SEV guest.

No functional change intended (or rather, none that are visible to the
guest or userspace).

Link: https://lore.kernel.org/r/20230825013621.2845700-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 15:08:53 -07:00
Peng Hao
ee11ab6bb0 KVM: X86: Reduce size of kvm_vcpu_arch structure when CONFIG_KVM_XEN=n
When CONFIG_KVM_XEN=n, the size of kvm_vcpu_arch can be reduced
from 5100+ to 4400+ by adding macro control.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/all/CAPm50aKwbZGeXPK5uig18Br8CF1hOS71CE2j_dLX+ub7oJdpGg@mail.gmail.com
[sean: fix whitespace damage]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 12:26:02 -07:00
Baoquan He
9c08a2a139 x86: kdump: use generic interface to simplify crashkernel reservation code
With the help of newly changed function parse_crashkernel() and generic
reserve_crashkernel_generic(), crashkernel reservation can be simplified
by steps:

1) Add a new header file <asm/crash_core.h>, and define CRASH_ALIGN,
   CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX and
   DEFAULT_CRASH_KERNEL_LOW_SIZE in <asm/crash_core.h>;

2) Add arch_reserve_crashkernel() to call parse_crashkernel() and
   reserve_crashkernel_generic(), and do the ARCH specific work if
   needed.

3) Add ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION Kconfig in
   arch/x86/Kconfig.

When adding DEFAULT_CRASH_KERNEL_LOW_SIZE, add crash_low_size_default() to
calculate crashkernel low memory because x86_64 has special requirement.

The old reserve_crashkernel_low() and reserve_crashkernel() can be
removed.

[bhe@redhat.com: move crash_low_size_default() code into <asm/crash_core.h>]
  Link: https://lkml.kernel.org/r/ZQpeAjOmuMJBFw1/@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20230914033142.676708-7-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:41:58 -07:00
Uros Bizjak
5e0eb67974 locking/local, arch: Rewrite local_add_unless() as a static inline function
Rewrite local_add_unless() as a static inline function with boolean
return value, similar to the arch_atomic_add_unless() arch fallbacks.

The function is currently unused.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230731084458.28096-1-ubizjak@gmail.com
2023-10-04 11:38:11 +02:00
Masahiro Yamada
8b01de8030 x86/headers: Remove <asm/export.h>
All *.S files under arch/x86/ have been converted to include
<linux/export.h> instead of <asm/export.h>.

Remove <asm/export.h>.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230806145958.380314-3-masahiroy@kernel.org
2023-10-03 10:38:08 +02:00
Saurabh Sengar
0d294c8c4e x86/of: Move the x86_flattree_get_config() call out of x86_dtb_init()
Fetching the device tree configuration before initmem_init() is necessary
to allow the parsing of NUMA node information. However moving the entire
x86_dtb_init() call before initmem_init() is not correct as APIC/IO-APIC enumeration
has to be after initmem_init().

Thus, move the x86_flattree_get_config() call out of x86_dtb_init(),
into setup_arch(), to call it before initmem_init(), and
leave the ACPI/IOAPIC registration sequence as-is.

[ mingo: Updated the changelog for clarity. ]

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/1692949657-16446-1-git-send-email-ssengar@linux.microsoft.com
2023-10-02 21:30:09 +02:00
Linus Torvalds
d2c5231581 Fourteen hotfixes, eleven of which are cc:stable. The remainder pertain
to issues which were introduced after 6.5.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZRmSDAAKCRDdBJ7gKXxA
 jlSaAQCe3SnBdjRmuzbp5iIfNJOY7GXLN4NwMsArRUxRGY27IwD+KWhXZP/ydVnt
 ZgS4x9rmarHuh5Pxds+6SRGhihRz/Ak=
 =sf/5
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Fourteen hotfixes, eleven of which are cc:stable. The remainder
  pertain to issues which were introduced after 6.5"

* tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Crash: add lock to serialize crash hotplug handling
  selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
  mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
  mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions()
  mm, memcg: reconsider kmem.limit_in_bytes deprecation
  mm: zswap: fix potential memory corruption on duplicate store
  arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries
  mm: hugetlb: add huge page size param to set_huge_pte_at()
  maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states
  maple_tree: add mas_is_active() to detect in-tree walks
  nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
  mm: abstract moving to the next PFN
  mm: report success more often from filemap_map_folio_range()
  fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
2023-10-01 13:33:25 -07:00
Matthew Wilcox (Oracle)
ce60f27bb6 mm: abstract moving to the next PFN
In order to fix the L1TF vulnerability, x86 can invert the PTE bits for
PROT_NONE VMAs, which means we cannot move from one PTE to the next by
adding 1 to the PFN field of the PTE.  This results in the BUG reported at
[1].

Abstract advancing the PTE to the next PFN through a pte_next_pfn()
function/macro.

Link: https://lkml.kernel.org/r/20230920040958.866520-1-willy@infradead.org
Fixes: bcc6cc8325 ("mm: add default definition of set_ptes()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: syzbot+55cc72f8cc3a549119df@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000d099fa0604f03351@google.com [1]
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29 17:20:46 -07:00
Haitao Shan
9cfec6d097 KVM: x86: Fix lapic timer interrupt lost after loading a snapshot.
When running android emulator (which is based on QEMU 2.12) on
certain Intel hosts with kernel version 6.3-rc1 or above, guest
will freeze after loading a snapshot. This is almost 100%
reproducible. By default, the android emulator will use snapshot
to speed up the next launching of the same android guest. So
this breaks the android emulator badly.

I tested QEMU 8.0.4 from Debian 12 with an Ubuntu 22.04 guest by
running command "loadvm" after "savevm". The same issue is
observed. At the same time, none of our AMD platforms is impacted.
More experiments show that loading the KVM module with
"enable_apicv=false" can workaround it.

The issue started to show up after commit 8e6ed96cdd ("KVM: x86:
fire timer when it is migrated and expired, and in oneshot mode").
However, as is pointed out by Sean Christopherson, it is introduced
by commit 967235d320 ("KVM: vmx: clear pending interrupts on
KVM_SET_LAPIC"). commit 8e6ed96cdd ("KVM: x86: fire timer when
it is migrated and expired, and in oneshot mode") just makes it
easier to hit the issue.

Having both commits, the oneshot lapic timer gets fired immediately
inside the KVM_SET_LAPIC call when loading the snapshot. On Intel
platforms with APIC virtualization and posted interrupt processing,
this eventually leads to setting the corresponding PIR bit. However,
the whole PIR bits get cleared later in the same KVM_SET_LAPIC call
by apicv_post_state_restore. This leads to timer interrupt lost.

The fix is to move vmx_apicv_post_state_restore to the beginning of
the KVM_SET_LAPIC call and rename to vmx_apicv_pre_state_restore.
What vmx_apicv_post_state_restore does is actually clearing any
former apicv state and this behavior is more suitable to carry out
in the beginning.

Fixes: 967235d320 ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Haitao Shan <hshan@google.com>
Link: https://lore.kernel.org/r/20230913000215.478387-1-hshan@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-28 10:19:29 -07:00
Kyle Meyer
f10a570b09 KVM: x86: Add CONFIG_KVM_MAX_NR_VCPUS to allow up to 4096 vCPUs
Add a Kconfig entry to set the maximum number of vCPUs per KVM guest and
set the default value to 4096 when MAXSMP is enabled, as there are use
cases that want to create more than the currently allowed 1024 vCPUs and
are more than happy to eat the memory overhead.

The Hyper-V TLFS doesn't allow more than 64 sparse banks, i.e. allows a
maximum of 4096 virtual CPUs. Cap KVM's maximum number of virtual CPUs
to 4096 to avoid exceeding Hyper-V's limit as KVM support for Hyper-V is
unconditional, and alternatives like dynamically disabling Hyper-V
enlightenments that rely on sparse banks would require non-trivial code
changes.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://lore.kernel.org/r/20230824215244.3897419-1-kyle.meyer@hpe.com
[sean: massage changelog with --verbose, document #ifdef mess]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-28 09:25:19 -07:00
Jim Mattson
73554b29bd KVM: x86/pmu: Synthesize at most one PMI per VM-exit
When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
VM-exit that also invokes __kvm_perf_overflow() as a result of
instruction emulation, kvm_pmu_deliver_pmi() will be called twice
before the next VM-entry.

Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
KVM sets the LVTPC mask bit when delivering a PMI.  But using IRQ work to
trigger the PMI is still broken, albeit very theoretically.

E.g. if the self-IPI to trigger IRQ work is be delayed long enough for the
vCPU to be migrated to a different pCPU, then it's possible for
kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
KVM_REQ_PMI and still generate two PMIs.

KVM could set the mask bit using an atomic operation, but that'd just be
piling on unnecessary code to workaround what is effectively a hack.  The
*only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
event, e.g. if the vCPU just executed HLT.

Remove the irq_work callback for synthesizing a PMI, and all of the
logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().

Fixes: 9cd803d496 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Jim Mattson <jmattson@google.com>
Tested-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20230925173448.3518223-2-mizhang@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-25 14:42:52 -07:00
David Howells
066baf92be
iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
copy_mc_to_user() has the destination marked __user on powerpc, but not on
x86; the latter results in a sparse warning in lib/iov_iter.c.

Fix this by applying the tag on x86 too.

Fixes: ec6347bb43 ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230925120309.1731676-3-dhowells@redhat.com
cc: Dan Williams <dan.j.williams@intel.com>
cc: Thomas Gleixner <tglx@linutronix.de>
cc: Ingo Molnar <mingo@redhat.com>
cc: Borislav Petkov <bp@alien8.de>
cc: Dave Hansen <dave.hansen@linux.intel.com>
cc: "H. Peter Anvin" <hpa@zytor.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: x86@kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-25 14:30:27 +02:00
Linus Torvalds
8a511e7efc ARM:
* Fix EL2 Stage-1 MMIO mappings where a random address was used
 
 * Fix SMCCC function number comparison when the SVE hint is set
 
 RISC-V:
 
 * Fix KVM_GET_REG_LIST API for ISA_EXT registers
 
 * Fix reading ISA_EXT register of a missing extension
 
 * Fix ISA_EXT register handling in get-reg-list test
 
 * Fix filtering of AIA registers in get-reg-list test
 
 x86:
 
 * Fixes for TSC_AUX virtualization
 
 * Stop zapping page tables asynchronously, since we don't
   zap them as often as before
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmUQU5YUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNcdwf/X8eHQ5yfAE0J70xs4VZ1z7B8i77q
 P54401z/q0FyQ4yyTHwbUv/FgVYscZ0efYogrkd3uuoPNtLmN2xKj1tM95A2ncP/
 v318ljevZ0FWZ6J471Xu9MM3u15QmjC3Wai9z6IP4tz0S2rUhOYTJdFqlNf6gQSu
 P8n9l2j3ZLAiUNizXa8M7350gCUFCBi37dvLLVTYOxbu17hZtmNjhNpz5G7YNc9y
 zmJIJh30ZnMGUgMylLfcW0ZoqQFNIkNg3yyr9YjY68bTW5aspXdhp9u0zI+01xYF
 sT+tOXBPPLi9MBuckX+oLMsvNXEZWxos2oMow3qziMo83neG+jU+WhjLHg==
 =+sqe
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM:

   - Fix EL2 Stage-1 MMIO mappings where a random address was used

   - Fix SMCCC function number comparison when the SVE hint is set

  RISC-V:

   - Fix KVM_GET_REG_LIST API for ISA_EXT registers

   - Fix reading ISA_EXT register of a missing extension

   - Fix ISA_EXT register handling in get-reg-list test

   - Fix filtering of AIA registers in get-reg-list test

  x86:

   - Fixes for TSC_AUX virtualization

   - Stop zapping page tables asynchronously, since we don't zap them as
     often as before"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX
  KVM: SVM: Fix TSC_AUX virtualization setup
  KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway
  KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
  KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe()
  KVM: x86/mmu: Open code leaf invalidation from mmu_notifier
  KVM: riscv: selftests: Selectively filter-out AIA registers
  KVM: riscv: selftests: Fix ISA_EXT register handling in get-reg-list
  RISC-V: KVM: Fix riscv_vcpu_get_isa_ext_single() for missing extensions
  RISC-V: KVM: Fix KVM_GET_REG_LIST API for ISA_EXT registers
  KVM: selftests: Assert that vasprintf() is successful
  KVM: arm64: nvhe: Ignore SVE hint in SMCCC function ID
  KVM: arm64: Properly return allocated EL2 VA from hyp_alloc_private_va_range()
2023-09-24 14:14:35 -07:00
Paolo Bonzini
5804c19b80 KVM/riscv fixes for 6.6, take #1
- Fix KVM_GET_REG_LIST API for ISA_EXT registers
 - Fix reading ISA_EXT register of a missing extension
 - Fix ISA_EXT register handling in get-reg-list test
 - Fix filtering of AIA registers in get-reg-list test
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmUMDssACgkQrUjsVaLH
 LAckSg/+IZ5DPvPs81rUpL3i1Z5SrK4jXWL2zyvMIksBEYmFD2NPNvinVZ4Sxv6u
 IzWNKJcAp4nA/+qPGPLXCURDe1W6PCDvO4SShjYm2UkPtNIfiskmFr3MunXZysgm
 I7USJgj9ev+46yfOnwlYrwpZ8sQk7Z6nLTI/6Osk4Q7Sm0Vjoobh6krub7LNjeKQ
 y6p+vxrXj+Owc5H9bgl0wAi6GOmOJKAM+cZU5DygQxjOgiUgNbOzsVgbLDTvExNq
 gISUU4PoAO7/U1NKEaaopbe7C0KNQHTnegedtXsDzg7WTBah77/MNBt4snbfiR27
 6rODklZlG/kAGIHdVtYC+zf8AfPqvGTIT8SLGmzQlyVlHujFBGn0L41NmMzW+EeA
 7UpfUk8vPiiGhefBE+Ml3yqiReogo+aRhL1mZoI39rPusd7DMnwx97KpBlAcYuTI
 PTgqycIMRmq2dSCHya+nrOVpwwV3Qx4G8Alpq1jOa7XDMeGMj4h521NQHjWckIK2
 IJ2a0RtzB10+Z91nLV+amdAno326AnxJC7dR26O6uqVSPJy/nHE2GAc49gFKeWq6
 QmzgzY1sU2Y02/TM8miyKSl3i+bpZSIPzXCKlOm1TowBKO+sfJzn/yMon9sVaVhb
 4Wjgg3vgE74y9FVsL4JXR/PfrZH5Aq77J1R+/pMtsNTtVYrt1Sk=
 =ytFs
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-fixes-6.6-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv fixes for 6.6, take #1

- Fix KVM_GET_REG_LIST API for ISA_EXT registers
- Fix reading ISA_EXT register of a missing extension
- Fix ISA_EXT register handling in get-reg-list test
- Fix filtering of AIA registers in get-reg-list test
2023-09-23 05:35:55 -04:00
Sean Christopherson
0df9dab891 KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
Stop zapping invalidate TDP MMU roots via work queue now that KVM
preserves TDP MMU roots until they are explicitly invalidated.  Zapping
roots asynchronously was effectively a workaround to avoid stalling a vCPU
for an extended during if a vCPU unloaded a root, which at the time
happened whenever the guest toggled CR0.WP (a frequent operation for some
guest kernels).

While a clever hack, zapping roots via an unbound worker had subtle,
unintended consequences on host scheduling, especially when zapping
multiple roots, e.g. as part of a memslot.  Because the work of zapping a
root is no longer bound to the task that initiated the zap, things like
the CPU affinity and priority of the original task get lost.  Losing the
affinity and priority can be especially problematic if unbound workqueues
aren't affined to a small number of CPUs, as zapping multiple roots can
cause KVM to heavily utilize the majority of CPUs in the system, *beyond*
the CPUs KVM is already using to run vCPUs.

When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
zap can result in KVM occupying all logical CPUs for ~8ms, and result in
high priority tasks not being scheduled in in a timely manner.  In v5.15,
which doesn't preserve unloaded roots, the issues were even more noticeable
as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.

Consuming all CPUs for an extended duration can lead to significant jitter
throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
is a semi-frequent operation as memslots are deleted and recreated with
different host virtual addresses to react to host GPU drivers allocating
and freeing GPU blobs.  On ChromeOS, the jitter manifests as audio blips
during games due to the audio server's tasks not getting scheduled in
promptly, despite the tasks having a high realtime priority.

Deleting memslots isn't exactly a fast path and should be avoided when
possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
memslot shenanigans, but KVM is squarely in the wrong.  Not to mention
that removing the async zapping eliminates a non-trivial amount of
complexity.

Note, one of the subtle behaviors hidden behind the async zapping is that
KVM would zap invalidated roots only once (ignoring partial zaps from
things like mmu_notifier events).  Preserve this behavior by adding a flag
to identify roots that are scheduled to be zapped versus roots that have
already been zapped but not yet freed.

Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
encounter invalid roots, as it's not at all obvious why zapping
invalidated roots shouldn't simply zap all invalid roots.

Reported-by: Pattara Teerapong <pteerapong@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Yiwei Zhang<zzyiwei@google.com>
Cc: Paul Hsia <paulhsia@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230916003916.2545000-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23 05:35:48 -04:00
Linus Torvalds
e583bffeb8 Misc x86 fixes:
- Fix a kexec bug,
  - Fix an UML build bug,
  - Fix a handful of SRSO related bugs,
  - Fix a shadow stacks handling bug & robustify related code.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUNbQIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jVIg/9EChW7qFTda8joR41Uayg07VIOpGirDLu
 7hjzOnt4Ni93cfFNUBkKDKXoWxGAiOD+cRDnT6+zsJAvAZR26Y3UNVLYlAy+lFKK
 9kRxeDM7nOEKqCC+zneinMFcIKmZRttMLpj8O901jB2S08x4UarnNx5SaWEcqYbn
 rf1XIEuEvlxqMfafNueS/TRadV52qVU8Y+2+inkDnM7dDNwt+rCs5tQ9ebJos8QO
 tsAoQes1G+0mjPrpqAgsIic5e3QCHliwVr8ASQrKZ9KR+fokEJRbSBNjgHUCNeVN
 0bHBhuDJBSniC7QmAQGEizrZWmHl2HxwYYnCE0gd0g24b7sDTwWBFSXWratCrPdX
 e4qYq36xonWdQcbpVF8ijMXF/R810vDyis/yc/BTeo5EBWf6aM/cx1dmS9DUxRpA
 QsIW2amSCPVYwYE5MAC+K/DM9nq1tk8YnKi4Mob3L28+W3CmVwSwT9S86z2QLlZu
 +KgVV4yBtJY1FJqVur5w3awhFtqLfBdfIvs6uQCd9VZXVPbBfS8+rOQmmhFixEDu
 FSPlAChmXYTAM2R+UxcEvw1Zckrtd2BCOa8UrY2lq57mduBK1EymdpfjlrUAChLG
 x7fQBOGNgOTLwYcsurIdS5jAqiudpnJ/KDG8ZAmKsVoW96JCPp9B3tVZMp9tT30C
 8HRwSPX4384=
 =58St
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes from Ingo Molnar:

 - Fix a kexec bug

 - Fix an UML build bug

 - Fix a handful of SRSO related bugs

 - Fix a shadow stacks handling bug & robustify related code

* tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/shstk: Add warning for shadow stack double unmap
  x86/shstk: Remove useless clone error handling
  x86/shstk: Handle vfork clone failure correctly
  x86/srso: Fix SBPB enablement for spec_rstack_overflow=off
  x86/srso: Don't probe microcode in a guest
  x86/srso: Set CPUID feature bits independently of bug or mitigation status
  x86/srso: Fix srso_show_state() side effect
  x86/asm: Fix build of UML with KASAN
  x86/mm, kexec, ima: Use memblock_free_late() from ima_free_kexec_buffer()
2023-09-22 12:26:42 -07:00
Saurabh Sengar
14058f72cf x86/hyperv: Remove hv_vtl_early_init initcall
There has been cases reported where HYPERV_VTL_MODE is enabled by mistake,
on a non Hyper-V platforms. This causes the hv_vtl_early_init function to
be called in an non Hyper-V/VTL platforms which results the memory
corruption.

Remove the early_initcall for hv_vtl_early_init and call it at the end of
hyperv_init to make sure it is never called in a non Hyper-V platform by
mistake.

Reported-by: Mathias Krause <minipli@grsecurity.net>
Closes: https://lore.kernel.org/lkml/40467722-f4ab-19a5-4989-308225b1f9f0@grsecurity.net/
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Acked-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/1695358720-27681-1-git-send-email-ssengar@linux.microsoft.com
2023-09-22 18:41:29 +00:00
Fangrui Song
b8ec60e118 x86/speculation, objtool: Use absolute relocations for annotations
.discard.retpoline_safe sections do not have the SHF_ALLOC flag.  These
sections referencing text sections' STT_SECTION symbols with PC-relative
relocations like R_386_PC32 [0] is conceptually not suitable.  Newer
LLD will report warnings for REL relocations even for relocatable links [1]:

    ld.lld: warning: vmlinux.a(drivers/i2c/busses/i2c-i801.o):(.discard.retpoline_safe+0x120): has non-ABS relocation R_386_PC32 against symbol ''

Switch to absolute relocations instead, which indicate link-time
addresses.  In a relocatable link, these addresses are also output
section offsets, used by checks in tools/objtool/check.c.  When linking
vmlinux, these .discard.* sections will be discarded, therefore it is
not a problem that R_X86_64_32 cannot represent a kernel address.

Alternatively, we could set the SHF_ALLOC flag for .discard.* sections,
but I think non-SHF_ALLOC for sections to be discarded makes more sense.

Note: if we decide to never support REL architectures (e.g. arm, i386),
we can utilize R_*_NONE relocations (.reloc ., BFD_RELOC_NONE, sym),
making .discard.* sections zero-sized.  That said, the section content
waste is 4 bytes per entry, much smaller than sizeof(Elf{32,64}_Rel).

  [0] commit 1c0c1faf56 ("objtool: Use relative pointers for annotations")
  [1] https://github.com/ClangBuiltLinux/linux/issues/1937

Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20230920001728.1439947-1-maskray@google.com
2023-09-22 11:41:24 +02:00
Paolo Bonzini
7deda2ce5b x86/cpu: Clear SVM feature if disabled by BIOS
When SVM is disabled by BIOS, one cannot use KVM but the
SVM feature is still shown in the output of /proc/cpuinfo.
On Intel machines, VMX is cleared by init_ia32_feat_ctl(),
so do the same on AMD and Hygon processors.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230921114940.957141-1-pbonzini@redhat.com
2023-09-22 10:55:26 +02:00