linux-stable

Commit Graph

Author	SHA1	Message	Date
Ingo Molnar	3fc18b06b8	Linux 6.6-rc4 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmUZ4WEeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGnIYH/07zef2U1nlqI+ro HRL2GlWGIs9yE70Oax+A3eYUYsjJIPu0yiDhFHUgOV3VyAALo44ZX/WNwKCGsI3e zhuOeItyyVcLGZXVC/jxSu0uveyfEiEYIWRYGyQ6Sna8Ksdk/qwhNgQNotdWdQG5 7xt8z32couglu0uOkxcGqjTxmbjO6WSM5qi7Ts+xLsgrcS5cRuNhAg/vezp9bfeL 1IUieCih4RJFgar/6LPOiB8uoVXEBonVbtlTRRqYdnqcsSIC+ACR9ZFk/+X88b5z S+Ta5VTcOAPu+2M/lSGe+PlUECvoBNK0SIYnaVCP2paPmDxfDXOFvSy/qJE87/7L 9BeonFw= =8FTr -----END PGP SIGNATURE----- Merge tag 'v6.6-rc4' into x86/entry, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2023-10-05 10:05:51 +02:00
Sean Christopherson	aeb904f6b9	KVM: x86: Refactor can_emulate_instruction() return to be more expressive Refactor and rename can_emulate_instruction() to allow vendor code to return more than true/false, e.g. to explicitly differentiate between "retry", "fault", and "unhandleable". For now, just do the plumbing, a future patch will expand SVM's implementation to signal outright failure if KVM attempts EMULTYPE_SKIP on an SEV guest. No functional change intended (or rather, none that are visible to the guest or userspace). Link: https://lore.kernel.org/r/20230825013621.2845700-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-04 15:08:53 -07:00
Peng Hao	ee11ab6bb0	KVM: X86: Reduce size of kvm_vcpu_arch structure when CONFIG_KVM_XEN=n When CONFIG_KVM_XEN=n, the size of kvm_vcpu_arch can be reduced from 5100+ to 4400+ by adding macro control. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Link: https://lore.kernel.org/all/CAPm50aKwbZGeXPK5uig18Br8CF1hOS71CE2j_dLX+ub7oJdpGg@mail.gmail.com [sean: fix whitespace damage] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-10-04 12:26:02 -07:00
Baoquan He	9c08a2a139	x86: kdump: use generic interface to simplify crashkernel reservation code With the help of newly changed function parse_crashkernel() and generic reserve_crashkernel_generic(), crashkernel reservation can be simplified by steps: 1) Add a new header file <asm/crash_core.h>, and define CRASH_ALIGN, CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX and DEFAULT_CRASH_KERNEL_LOW_SIZE in <asm/crash_core.h>; 2) Add arch_reserve_crashkernel() to call parse_crashkernel() and reserve_crashkernel_generic(), and do the ARCH specific work if needed. 3) Add ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION Kconfig in arch/x86/Kconfig. When adding DEFAULT_CRASH_KERNEL_LOW_SIZE, add crash_low_size_default() to calculate crashkernel low memory because x86_64 has special requirement. The old reserve_crashkernel_low() and reserve_crashkernel() can be removed. [bhe@redhat.com: move crash_low_size_default() code into <asm/crash_core.h>] Link: https://lkml.kernel.org/r/ZQpeAjOmuMJBFw1/@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20230914033142.676708-7-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Jiahao <chenjiahao16@huawei.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-10-04 10:41:58 -07:00
Uros Bizjak	5e0eb67974	locking/local, arch: Rewrite local_add_unless() as a static inline function Rewrite local_add_unless() as a static inline function with boolean return value, similar to the arch_atomic_add_unless() arch fallbacks. The function is currently unused. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230731084458.28096-1-ubizjak@gmail.com	2023-10-04 11:38:11 +02:00
Masahiro Yamada	8b01de8030	x86/headers: Remove <asm/export.h> All *.S files under arch/x86/ have been converted to include <linux/export.h> instead of <asm/export.h>. Remove <asm/export.h>. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230806145958.380314-3-masahiroy@kernel.org	2023-10-03 10:38:08 +02:00
Saurabh Sengar	0d294c8c4e	x86/of: Move the x86_flattree_get_config() call out of x86_dtb_init() Fetching the device tree configuration before initmem_init() is necessary to allow the parsing of NUMA node information. However moving the entire x86_dtb_init() call before initmem_init() is not correct as APIC/IO-APIC enumeration has to be after initmem_init(). Thus, move the x86_flattree_get_config() call out of x86_dtb_init(), into setup_arch(), to call it before initmem_init(), and leave the ACPI/IOAPIC registration sequence as-is. [ mingo: Updated the changelog for clarity. ] Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/1692949657-16446-1-git-send-email-ssengar@linux.microsoft.com	2023-10-02 21:30:09 +02:00
Linus Torvalds	d2c5231581	Fourteen hotfixes, eleven of which are cc:stable. The remainder pertain to issues which were introduced after 6.5. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZRmSDAAKCRDdBJ7gKXxA jlSaAQCe3SnBdjRmuzbp5iIfNJOY7GXLN4NwMsArRUxRGY27IwD+KWhXZP/ydVnt ZgS4x9rmarHuh5Pxds+6SRGhihRz/Ak= =sf/5 -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Fourteen hotfixes, eleven of which are cc:stable. The remainder pertain to issues which were introduced after 6.5" * tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: Crash: add lock to serialize crash hotplug handling selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions() mm, memcg: reconsider kmem.limit_in_bytes deprecation mm: zswap: fix potential memory corruption on duplicate store arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries mm: hugetlb: add huge page size param to set_huge_pte_at() maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states maple_tree: add mas_is_active() to detect in-tree walks nilfs2: fix potential use after free in nilfs_gccache_submit_read_data() mm: abstract moving to the next PFN mm: report success more often from filemap_map_folio_range() fs: binfmt_elf_efpic: fix personality for ELF-FDPIC	2023-10-01 13:33:25 -07:00
Matthew Wilcox (Oracle)	ce60f27bb6	mm: abstract moving to the next PFN In order to fix the L1TF vulnerability, x86 can invert the PTE bits for PROT_NONE VMAs, which means we cannot move from one PTE to the next by adding 1 to the PFN field of the PTE. This results in the BUG reported at [1]. Abstract advancing the PTE to the next PFN through a pte_next_pfn() function/macro. Link: https://lkml.kernel.org/r/20230920040958.866520-1-willy@infradead.org Fixes: `bcc6cc8325` ("mm: add default definition of set_ptes()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: syzbot+55cc72f8cc3a549119df@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/000000000000d099fa0604f03351@google.com [1] Reviewed-by: Yin Fengwei <fengwei.yin@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-09-29 17:20:46 -07:00
Haitao Shan	9cfec6d097	KVM: x86: Fix lapic timer interrupt lost after loading a snapshot. When running android emulator (which is based on QEMU 2.12) on certain Intel hosts with kernel version 6.3-rc1 or above, guest will freeze after loading a snapshot. This is almost 100% reproducible. By default, the android emulator will use snapshot to speed up the next launching of the same android guest. So this breaks the android emulator badly. I tested QEMU 8.0.4 from Debian 12 with an Ubuntu 22.04 guest by running command "loadvm" after "savevm". The same issue is observed. At the same time, none of our AMD platforms is impacted. More experiments show that loading the KVM module with "enable_apicv=false" can workaround it. The issue started to show up after commit `8e6ed96cdd` ("KVM: x86: fire timer when it is migrated and expired, and in oneshot mode"). However, as is pointed out by Sean Christopherson, it is introduced by commit `967235d320` ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC"). commit `8e6ed96cdd` ("KVM: x86: fire timer when it is migrated and expired, and in oneshot mode") just makes it easier to hit the issue. Having both commits, the oneshot lapic timer gets fired immediately inside the KVM_SET_LAPIC call when loading the snapshot. On Intel platforms with APIC virtualization and posted interrupt processing, this eventually leads to setting the corresponding PIR bit. However, the whole PIR bits get cleared later in the same KVM_SET_LAPIC call by apicv_post_state_restore. This leads to timer interrupt lost. The fix is to move vmx_apicv_post_state_restore to the beginning of the KVM_SET_LAPIC call and rename to vmx_apicv_pre_state_restore. What vmx_apicv_post_state_restore does is actually clearing any former apicv state and this behavior is more suitable to carry out in the beginning. Fixes: `967235d320` ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Haitao Shan <hshan@google.com> Link: https://lore.kernel.org/r/20230913000215.478387-1-hshan@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-09-28 10:19:29 -07:00
Kyle Meyer	f10a570b09	KVM: x86: Add CONFIG_KVM_MAX_NR_VCPUS to allow up to 4096 vCPUs Add a Kconfig entry to set the maximum number of vCPUs per KVM guest and set the default value to 4096 when MAXSMP is enabled, as there are use cases that want to create more than the currently allowed 1024 vCPUs and are more than happy to eat the memory overhead. The Hyper-V TLFS doesn't allow more than 64 sparse banks, i.e. allows a maximum of 4096 virtual CPUs. Cap KVM's maximum number of virtual CPUs to 4096 to avoid exceeding Hyper-V's limit as KVM support for Hyper-V is unconditional, and alternatives like dynamically disabling Hyper-V enlightenments that rely on sparse banks would require non-trivial code changes. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://lore.kernel.org/r/20230824215244.3897419-1-kyle.meyer@hpe.com [sean: massage changelog with --verbose, document #ifdef mess] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-09-28 09:25:19 -07:00
Jim Mattson	73554b29bd	KVM: x86/pmu: Synthesize at most one PMI per VM-exit When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a VM-exit that also invokes __kvm_perf_overflow() as a result of instruction emulation, kvm_pmu_deliver_pmi() will be called twice before the next VM-entry. Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that KVM sets the LVTPC mask bit when delivering a PMI. But using IRQ work to trigger the PMI is still broken, albeit very theoretically. E.g. if the self-IPI to trigger IRQ work is be delayed long enough for the vCPU to be migrated to a different pCPU, then it's possible for kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from KVM_REQ_PMI and still generate two PMIs. KVM could set the mask bit using an atomic operation, but that'd just be piling on unnecessary code to workaround what is effectively a hack. The only reason KVM uses IRQ work is to ensure the PMI is treated as a wake event, e.g. if the vCPU just executed HLT. Remove the irq_work callback for synthesizing a PMI, and all of the logic for invoking it. Instead, to prevent a vcpu from leaving C0 with a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events(). Fixes: `9cd803d496` ("KVM: x86: Update vPMCs when retiring instructions") Signed-off-by: Jim Mattson <jmattson@google.com> Tested-by: Mingwei Zhang <mizhang@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230925173448.3518223-2-mizhang@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-09-25 14:42:52 -07:00
David Howells	066baf92be	iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user() copy_mc_to_user() has the destination marked __user on powerpc, but not on x86; the latter results in a sparse warning in lib/iov_iter.c. Fix this by applying the tag on x86 too. Fixes: `ec6347bb43` ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230925120309.1731676-3-dhowells@redhat.com cc: Dan Williams <dan.j.williams@intel.com> cc: Thomas Gleixner <tglx@linutronix.de> cc: Ingo Molnar <mingo@redhat.com> cc: Borislav Petkov <bp@alien8.de> cc: Dave Hansen <dave.hansen@linux.intel.com> cc: "H. Peter Anvin" <hpa@zytor.com> cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Christoph Hellwig <hch@lst.de> cc: Christian Brauner <christian@brauner.io> cc: Matthew Wilcox <willy@infradead.org> cc: Linus Torvalds <torvalds@linux-foundation.org> cc: David Laight <David.Laight@ACULAB.COM> cc: x86@kernel.org cc: linux-block@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-09-25 14:30:27 +02:00
Linus Torvalds	8a511e7efc	ARM: * Fix EL2 Stage-1 MMIO mappings where a random address was used * Fix SMCCC function number comparison when the SVE hint is set RISC-V: * Fix KVM_GET_REG_LIST API for ISA_EXT registers * Fix reading ISA_EXT register of a missing extension * Fix ISA_EXT register handling in get-reg-list test * Fix filtering of AIA registers in get-reg-list test x86: * Fixes for TSC_AUX virtualization * Stop zapping page tables asynchronously, since we don't zap them as often as before -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmUQU5YUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNcdwf/X8eHQ5yfAE0J70xs4VZ1z7B8i77q P54401z/q0FyQ4yyTHwbUv/FgVYscZ0efYogrkd3uuoPNtLmN2xKj1tM95A2ncP/ v318ljevZ0FWZ6J471Xu9MM3u15QmjC3Wai9z6IP4tz0S2rUhOYTJdFqlNf6gQSu P8n9l2j3ZLAiUNizXa8M7350gCUFCBi37dvLLVTYOxbu17hZtmNjhNpz5G7YNc9y zmJIJh30ZnMGUgMylLfcW0ZoqQFNIkNg3yyr9YjY68bTW5aspXdhp9u0zI+01xYF sT+tOXBPPLi9MBuckX+oLMsvNXEZWxos2oMow3qziMo83neG+jU+WhjLHg== =+sqe -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "ARM: - Fix EL2 Stage-1 MMIO mappings where a random address was used - Fix SMCCC function number comparison when the SVE hint is set RISC-V: - Fix KVM_GET_REG_LIST API for ISA_EXT registers - Fix reading ISA_EXT register of a missing extension - Fix ISA_EXT register handling in get-reg-list test - Fix filtering of AIA registers in get-reg-list test x86: - Fixes for TSC_AUX virtualization - Stop zapping page tables asynchronously, since we don't zap them as often as before" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX KVM: SVM: Fix TSC_AUX virtualization setup KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe() KVM: x86/mmu: Open code leaf invalidation from mmu_notifier KVM: riscv: selftests: Selectively filter-out AIA registers KVM: riscv: selftests: Fix ISA_EXT register handling in get-reg-list RISC-V: KVM: Fix riscv_vcpu_get_isa_ext_single() for missing extensions RISC-V: KVM: Fix KVM_GET_REG_LIST API for ISA_EXT registers KVM: selftests: Assert that vasprintf() is successful KVM: arm64: nvhe: Ignore SVE hint in SMCCC function ID KVM: arm64: Properly return allocated EL2 VA from hyp_alloc_private_va_range()	2023-09-24 14:14:35 -07:00
Paolo Bonzini	5804c19b80	KVM/riscv fixes for 6.6, take #1 - Fix KVM_GET_REG_LIST API for ISA_EXT registers - Fix reading ISA_EXT register of a missing extension - Fix ISA_EXT register handling in get-reg-list test - Fix filtering of AIA registers in get-reg-list test -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmUMDssACgkQrUjsVaLH LAckSg/+IZ5DPvPs81rUpL3i1Z5SrK4jXWL2zyvMIksBEYmFD2NPNvinVZ4Sxv6u IzWNKJcAp4nA/+qPGPLXCURDe1W6PCDvO4SShjYm2UkPtNIfiskmFr3MunXZysgm I7USJgj9ev+46yfOnwlYrwpZ8sQk7Z6nLTI/6Osk4Q7Sm0Vjoobh6krub7LNjeKQ y6p+vxrXj+Owc5H9bgl0wAi6GOmOJKAM+cZU5DygQxjOgiUgNbOzsVgbLDTvExNq gISUU4PoAO7/U1NKEaaopbe7C0KNQHTnegedtXsDzg7WTBah77/MNBt4snbfiR27 6rODklZlG/kAGIHdVtYC+zf8AfPqvGTIT8SLGmzQlyVlHujFBGn0L41NmMzW+EeA 7UpfUk8vPiiGhefBE+Ml3yqiReogo+aRhL1mZoI39rPusd7DMnwx97KpBlAcYuTI PTgqycIMRmq2dSCHya+nrOVpwwV3Qx4G8Alpq1jOa7XDMeGMj4h521NQHjWckIK2 IJ2a0RtzB10+Z91nLV+amdAno326AnxJC7dR26O6uqVSPJy/nHE2GAc49gFKeWq6 QmzgzY1sU2Y02/TM8miyKSl3i+bpZSIPzXCKlOm1TowBKO+sfJzn/yMon9sVaVhb 4Wjgg3vgE74y9FVsL4JXR/PfrZH5Aq77J1R+/pMtsNTtVYrt1Sk= =ytFs -----END PGP SIGNATURE----- Merge tag 'kvm-riscv-fixes-6.6-1' of https://github.com/kvm-riscv/linux into HEAD KVM/riscv fixes for 6.6, take #1 - Fix KVM_GET_REG_LIST API for ISA_EXT registers - Fix reading ISA_EXT register of a missing extension - Fix ISA_EXT register handling in get-reg-list test - Fix filtering of AIA registers in get-reg-list test	2023-09-23 05:35:55 -04:00
Sean Christopherson	0df9dab891	KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously Stop zapping invalidate TDP MMU roots via work queue now that KVM preserves TDP MMU roots until they are explicitly invalidated. Zapping roots asynchronously was effectively a workaround to avoid stalling a vCPU for an extended during if a vCPU unloaded a root, which at the time happened whenever the guest toggled CR0.WP (a frequent operation for some guest kernels). While a clever hack, zapping roots via an unbound worker had subtle, unintended consequences on host scheduling, especially when zapping multiple roots, e.g. as part of a memslot. Because the work of zapping a root is no longer bound to the task that initiated the zap, things like the CPU affinity and priority of the original task get lost. Losing the affinity and priority can be especially problematic if unbound workqueues aren't affined to a small number of CPUs, as zapping multiple roots can cause KVM to heavily utilize the majority of CPUs in the system, beyond the CPUs KVM is already using to run vCPUs. When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root zap can result in KVM occupying all logical CPUs for ~8ms, and result in high priority tasks not being scheduled in in a timely manner. In v5.15, which doesn't preserve unloaded roots, the issues were even more noticeable as KVM would zap roots more frequently and could occupy all CPUs for 50ms+. Consuming all CPUs for an extended duration can lead to significant jitter throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots is a semi-frequent operation as memslots are deleted and recreated with different host virtual addresses to react to host GPU drivers allocating and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips during games due to the audio server's tasks not getting scheduled in promptly, despite the tasks having a high realtime priority. Deleting memslots isn't exactly a fast path and should be avoided when possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the memslot shenanigans, but KVM is squarely in the wrong. Not to mention that removing the async zapping eliminates a non-trivial amount of complexity. Note, one of the subtle behaviors hidden behind the async zapping is that KVM would zap invalidated roots only once (ignoring partial zaps from things like mmu_notifier events). Preserve this behavior by adding a flag to identify roots that are scheduled to be zapped versus roots that have already been zapped but not yet freed. Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can encounter invalid roots, as it's not at all obvious why zapping invalidated roots shouldn't simply zap all invalid roots. Reported-by: Pattara Teerapong <pteerapong@google.com> Cc: David Stevens <stevensd@google.com> Cc: Yiwei Zhang<zzyiwei@google.com> Cc: Paul Hsia <paulhsia@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-09-23 05:35:48 -04:00
Linus Torvalds	e583bffeb8	Misc x86 fixes: - Fix a kexec bug, - Fix an UML build bug, - Fix a handful of SRSO related bugs, - Fix a shadow stacks handling bug & robustify related code. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUNbQIRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jVIg/9EChW7qFTda8joR41Uayg07VIOpGirDLu 7hjzOnt4Ni93cfFNUBkKDKXoWxGAiOD+cRDnT6+zsJAvAZR26Y3UNVLYlAy+lFKK 9kRxeDM7nOEKqCC+zneinMFcIKmZRttMLpj8O901jB2S08x4UarnNx5SaWEcqYbn rf1XIEuEvlxqMfafNueS/TRadV52qVU8Y+2+inkDnM7dDNwt+rCs5tQ9ebJos8QO tsAoQes1G+0mjPrpqAgsIic5e3QCHliwVr8ASQrKZ9KR+fokEJRbSBNjgHUCNeVN 0bHBhuDJBSniC7QmAQGEizrZWmHl2HxwYYnCE0gd0g24b7sDTwWBFSXWratCrPdX e4qYq36xonWdQcbpVF8ijMXF/R810vDyis/yc/BTeo5EBWf6aM/cx1dmS9DUxRpA QsIW2amSCPVYwYE5MAC+K/DM9nq1tk8YnKi4Mob3L28+W3CmVwSwT9S86z2QLlZu +KgVV4yBtJY1FJqVur5w3awhFtqLfBdfIvs6uQCd9VZXVPbBfS8+rOQmmhFixEDu FSPlAChmXYTAM2R+UxcEvw1Zckrtd2BCOa8UrY2lq57mduBK1EymdpfjlrUAChLG x7fQBOGNgOTLwYcsurIdS5jAqiudpnJ/KDG8ZAmKsVoW96JCPp9B3tVZMp9tT30C 8HRwSPX4384= =58St -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Fix a kexec bug - Fix an UML build bug - Fix a handful of SRSO related bugs - Fix a shadow stacks handling bug & robustify related code * tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/shstk: Add warning for shadow stack double unmap x86/shstk: Remove useless clone error handling x86/shstk: Handle vfork clone failure correctly x86/srso: Fix SBPB enablement for spec_rstack_overflow=off x86/srso: Don't probe microcode in a guest x86/srso: Set CPUID feature bits independently of bug or mitigation status x86/srso: Fix srso_show_state() side effect x86/asm: Fix build of UML with KASAN x86/mm, kexec, ima: Use memblock_free_late() from ima_free_kexec_buffer()	2023-09-22 12:26:42 -07:00
Saurabh Sengar	14058f72cf	x86/hyperv: Remove hv_vtl_early_init initcall There has been cases reported where HYPERV_VTL_MODE is enabled by mistake, on a non Hyper-V platforms. This causes the hv_vtl_early_init function to be called in an non Hyper-V/VTL platforms which results the memory corruption. Remove the early_initcall for hv_vtl_early_init and call it at the end of hyperv_init to make sure it is never called in a non Hyper-V platform by mistake. Reported-by: Mathias Krause <minipli@grsecurity.net> Closes: https://lore.kernel.org/lkml/40467722-f4ab-19a5-4989-308225b1f9f0@grsecurity.net/ Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Acked-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/1695358720-27681-1-git-send-email-ssengar@linux.microsoft.com	2023-09-22 18:41:29 +00:00
Fangrui Song	b8ec60e118	x86/speculation, objtool: Use absolute relocations for annotations .discard.retpoline_safe sections do not have the SHF_ALLOC flag. These sections referencing text sections' STT_SECTION symbols with PC-relative relocations like R_386_PC32 [0] is conceptually not suitable. Newer LLD will report warnings for REL relocations even for relocatable links [1]: ld.lld: warning: vmlinux.a(drivers/i2c/busses/i2c-i801.o):(.discard.retpoline_safe+0x120): has non-ABS relocation R_386_PC32 against symbol '' Switch to absolute relocations instead, which indicate link-time addresses. In a relocatable link, these addresses are also output section offsets, used by checks in tools/objtool/check.c. When linking vmlinux, these .discard.* sections will be discarded, therefore it is not a problem that R_X86_64_32 cannot represent a kernel address. Alternatively, we could set the SHF_ALLOC flag for .discard.* sections, but I think non-SHF_ALLOC for sections to be discarded makes more sense. Note: if we decide to never support REL architectures (e.g. arm, i386), we can utilize R__NONE relocations (.reloc ., BFD_RELOC_NONE, sym), making .discard. sections zero-sized. That said, the section content waste is 4 bytes per entry, much smaller than sizeof(Elf{32,64}_Rel). [0] commit `1c0c1faf56` ("objtool: Use relative pointers for annotations") [1] https://github.com/ClangBuiltLinux/linux/issues/1937 Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/r/20230920001728.1439947-1-maskray@google.com	2023-09-22 11:41:24 +02:00
Paolo Bonzini	7deda2ce5b	x86/cpu: Clear SVM feature if disabled by BIOS When SVM is disabled by BIOS, one cannot use KVM but the SVM feature is still shown in the output of /proc/cpuinfo. On Intel machines, VMX is cleared by init_ia32_feat_ctl(), so do the same on AMD and Hygon processors. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230921114940.957141-1-pbonzini@redhat.com	2023-09-22 10:55:26 +02:00
Ingo Molnar	d73a105586	x86/mm: Move arch_memory_failure() and arch_is_platform_page() definitions from <asm/processor.h> to <asm/pgtable.h> <linux/mm.h> relies on these definitions being included first, which is true currently due to historic header spaghetti, but in the future <asm/processor.h> will not guaranteed to be included by the MM code. Move these definitions over into a suitable MM header. This is a preparatory patch for x86 header dependency simplifications and reductions. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org	2023-09-22 09:32:03 +02:00
Uros Bizjak	7c097ca50d	x86/percpu: Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op The fallback alternative uses %rsi register to manually load pointer to the percpu variable before the call to the emulation function. This is unoptimal, because the load is hidden from the compiler. Move the load of %rsi outside inline asm, so the compiler can reuse the value. The code in slub.o improves from: 55ac: 49 8b 3c 24 mov (%r12),%rdi 55b0: 48 8d 4a 40 lea 0x40(%rdx),%rcx 55b4: 49 8b 1c 07 mov (%r15,%rax,1),%rbx 55b8: 4c 89 f8 mov %r15,%rax 55bb: 48 8d 37 lea (%rdi),%rsi 55be: e8 00 00 00 00 callq 55c3 <...> 55bf: R_X86_64_PLT32 this_cpu_cmpxchg16b_emu-0x4 55c3: 75 a3 jne 5568 <...> 55c5: ... 0000000000000000 <.altinstr_replacement>: 5: 65 48 0f c7 0f cmpxchg16b %gs:(%rdi) to: 55ac: 49 8b 34 24 mov (%r12),%rsi 55b0: 48 8d 4a 40 lea 0x40(%rdx),%rcx 55b4: 49 8b 1c 07 mov (%r15,%rax,1),%rbx 55b8: 4c 89 f8 mov %r15,%rax 55bb: e8 00 00 00 00 callq 55c0 <...> 55bc: R_X86_64_PLT32 this_cpu_cmpxchg16b_emu-0x4 55c0: 75 a6 jne 5568 <...> 55c2: ... Where the alternative replacement instruction now uses %rsi: 0000000000000000 <.altinstr_replacement>: 5: 65 48 0f c7 0e cmpxchg16b %gs:(%rsi) The instruction (effectively a reg-reg move) at 55bb: in the original assembly is removed. Also, both the CALL and replacement CMPXCHG16B are 5 bytes long, removing the need for NOPs in the asm code. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230918151452.62344-1-ubizjak@gmail.com	2023-09-21 09:35:50 +02:00
Rick Edgecombe	331955600d	x86/shstk: Handle vfork clone failure correctly Shadow stacks are allocated automatically and freed on exit, depending on the clone flags. The two cases where new shadow stacks are not allocated are !CLONE_VM (fork()) and CLONE_VFORK (vfork()). For !CLONE_VM, although a new stack is not allocated, it can be freed normally because it will happen in the child's copy of the VM. However, for CLONE_VFORK the parent and the child are actually using the same shadow stack. So the kernel doesn't need to allocate or free a shadow stack for a CLONE_VFORK child. CLONE_VFORK children already need special tracking to avoid returning to userspace until the child exits or execs. Shadow stack uses this same tracking to avoid freeing CLONE_VFORK shadow stacks. However, the tracking is not setup until the clone has succeeded (internally). Which means, if a CLONE_VFORK fails, the existing logic will not know it is a CLONE_VFORK and proceed to unmap the parents shadow stack. This error handling cleanup logic runs via exit_thread() in the bad_fork_cleanup_thread label in copy_process(). The issue was seen in the glibc test "posix/tst-spawn3-pidfd" while running with shadow stack using currently out-of-tree glibc patches. Fix it by not unmapping the vfork shadow stack in the error case as well. Since clone is implemented in core code, it is not ideal to pass the clone flags along the error path in order to have shadow stack code have symmetric logic in the freeing half of the thread shadow stack handling. Instead use the existing state for thread shadow stacks to track whether the thread is managing its own shadow stack. For CLONE_VFORK, simply set shstk->base and shstk->size to 0, and have it mean the thread is not managing a shadow stack and so should skip cleanup work. Implement this by breaking up the CLONE_VFORK and !CLONE_VM cases in shstk_alloc_thread_stack() to separate conditionals since, the logic is now different between them. In the case of CLONE_VFORK && !CLONE_VM, the existing behavior is to not clean up the shadow stack in the child (which should go away quickly with either be exit or exec), so maintain that behavior by handling the CLONE_VFORK case first in the allocation path. This new logioc cleanly handles the case of normal, successful CLONE_VFORK's skipping cleaning up their shadow stack's on exit as well. So remove the existing, vfork shadow stack freeing logic. This is in deactivate_mm() where vfork_done is used to tell if it is a vfork child that can skip cleaning up the thread shadow stack. Fixes: `b2926a36b9` ("x86/shstk: Handle thread shadow stack") Reported-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: H.J. Lu <hjl.tools@gmail.com> Link: https://lore.kernel.org/all/20230908203655.543765-2-rick.p.edgecombe%40intel.com	2023-09-19 09:18:34 -07:00
Josh Poimboeuf	91857ae203	x86/srso: Set CPUID feature bits independently of bug or mitigation status Booting with mitigations=off incorrectly prevents the X86_FEATURE_{IBPB_BRTYPE,SBPB} CPUID bits from getting set. Also, future CPUs without X86_BUG_SRSO might still have IBPB with branch type prediction flushing, in which case SBPB should be used instead of IBPB. The current code doesn't allow for that. Also, cpu_has_ibpb_brtype_microcode() has some surprising side effects and the setting of these feature bits really doesn't belong in the mitigation code anyway. Move it to earlier. Fixes: `fb3bd914b3` ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/869a1709abfe13b673bdd10c2f4332ca253a40bc.1693889988.git.jpoimboe@kernel.org	2023-09-19 10:54:07 +02:00
Juergen Gross	49147beb0c	x86/xen: allow nesting of same lazy mode When running as a paravirtualized guest under Xen, Linux is using "lazy mode" for issuing hypercalls which don't need to take immediate effect in order to improve performance (examples are e.g. multiple PTE changes). There are two different lazy modes defined: MMU and CPU lazy mode. Today it is not possible to nest multiple lazy mode sections, even if they are of the same kind. A recent change in memory management added nesting of MMU lazy mode sections, resulting in a regression when running as Xen PV guest. Technically there is no reason why nesting of multiple sections of the same kind of lazy mode shouldn't be allowed. So add support for that for fixing the regression. Fixes: `bcc6cc8325` ("mm: add default definition of set_ptes()") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20230913113828.18421-4-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>	2023-09-19 07:04:49 +02:00
Juergen Gross	a4a7644c15	x86/xen: move paravirt lazy code Only Xen is using the paravirt lazy mode code, so it can be moved to Xen specific sources. This allows to make some of the functions static or to merge them into their only call sites. While at it do a rename from "paravirt" to "xen" for all moved specifiers. No functional change. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20230913113828.18421-3-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com>	2023-09-19 07:04:49 +02:00
Vincent Whitchurch	10f4c9b9a3	x86/asm: Fix build of UML with KASAN Building UML with KASAN fails since commit `69d4c0d321` ("entry, kasan, x86: Disallow overriding mem() functions") with the following errors: $ tools/testing/kunit/kunit.py run --kconfig_add CONFIG_KASAN=y ... ld: mm/kasan/shadow.o: in function `memset': shadow.c:(.text+0x40): multiple definition of `memset'; arch/x86/lib/memset_64.o:(.noinstr.text+0x0): first defined here ld: mm/kasan/shadow.o: in function `memmove': shadow.c:(.text+0x90): multiple definition of `memmove'; arch/x86/lib/memmove_64.o:(.noinstr.text+0x0): first defined here ld: mm/kasan/shadow.o: in function `memcpy': shadow.c:(.text+0x110): multiple definition of `memcpy'; arch/x86/lib/memcpy_64.o:(.noinstr.text+0x0): first defined here UML does not use GENERIC_ENTRY and is still supposed to be allowed to override the mem() functions, so use weak aliases in that case. Fixes: `69d4c0d321` ("entry, kasan, x86: Disallow overriding mem*() functions") Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230918-uml-kasan-v3-1-7ad6db477df6@axis.com	2023-09-18 19:30:08 +02:00
Kai Huang	518755a7ee	x86/tdx: Fix __noreturn build warning around __tdx_hypercall_failed() LKP reported below build warning: vmlinux.o: warning: objtool: __tdx_hypercall+0x128: __tdx_hypercall_failed() is missing a __noreturn annotation The __tdx_hypercall_failed() function definition already has __noreturn annotation, but it turns out the __noreturn must be annotated to the function declaration. PeterZ explains: "FWIW, the reason being that... The point of noreturn is that the caller should know to stop generating code. For that the declaration needs the attribute, because call sites typically do not have access to the function definition in C." Add __noreturn annotation to the declaration of __tdx_hypercall_failed() to fix. It's not a bad idea to document the __noreturn nature at the definition site either, so keep the annotation at the definition. Note <asm/shared/tdx.h> is also included by TDX related assembly files. Include <linux/compiler_attributes.h> only in case of !__ASSEMBLY__ otherwise compiling assembly file would trigger build error. Also, following the objtool documentation, add __tdx_hypercall_failed() to "tools/objtool/noreturns.h". Fixes: `c641cfb5c1` ("x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230918041858.331234-1-kai.huang@intel.com Closes: https://lore.kernel.org/oe-kbuild-all/202309140828.9RdmlH2Z-lkp@intel.com/	2023-09-18 09:11:39 +02:00
Linus Torvalds	e789286468	Misc fixes: - Fix an UV boot crash, - Skip spurious ENDBR generation on _THIS_IP_, - Fix ENDBR use in putuser() asm methods, - Fix corner case boot crashes on 5-level paging, - and fix a false positive WARNING on LTO kernels. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUHOrwRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1j6Jw/+PjUfh/4+KM/Z8VzcBy2UhY3NMF2ptGCN 47FPLy+8ADqOvIfgsPsBEO9VXwdvkHfH64YWRUlULjvPNOSs+37drBYMe9AI9xKE u6NhoBHmsnOtoLkBFIQYlJys9GyAeoMlwSSHxzRwQS+3VztRjoH636jiBcg/h7DR zhakfnJD1SSOZuEyyDPnO0uIUarrcqC2jdZwucSqZnvZFdA/pexUHQEe2RtMXLln EIA5kuEo7UdADcSr/Lbty7MKO+6xpRTjxF0yNItPtwPWsnxrSAC7P+dQ37uB975U Z0CJ/kw54XG5Sdoech7XCWYmJzDxSPhiziA1USKpZJMo5phAG/apTJK6NG4TG94r U+3QhLopUoSd8N74/VtZn0FjOrMsk7YKD5o8twlTbpCd2VaiJk4X5oBIns6Tvq05 0Vmsx15XO3mEzg1wWbbdjhjHW0czRgBpikS9QyexZKVkukYcW8QT6bk1gK1ypg94 do4PHKB53QIt31dedy/ddpQj4u1sJ4+a9/+IG29kjUB33M4+uFedtw11vfe+CDN0 XLRc6HfPblogIIEO/figJgwSq/TPCOsNHTwKkejq+D1Ey6DsrnX9Gg4BWVz/3dDW 84SW7TaO2FGEDh4NkR2ijkZlbpofFnCvhCh/uohodPlqDrTVTuRKCZW9I5plmGVa qeXUpNDFs1E= =BMjT -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes: - Fix an UV boot crash - Skip spurious ENDBR generation on _THIS_IP_ - Fix ENDBR use in putuser() asm methods - Fix corner case boot crashes on 5-level paging - and fix a false positive WARNING on LTO kernels" * tag 'x86-urgent-2023-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/purgatory: Remove LTO flags x86/boot/compressed: Reserve more memory for page tables x86/ibt: Avoid duplicate ENDBR in __put_user_nocheck*() x86/ibt: Suppress spurious ENDBR x86/platform/uv: Use alternate source for socket to node data	2023-09-17 11:13:37 -07:00
Kirill A. Shutemov	f530ee95b7	x86/boot/compressed: Reserve more memory for page tables The decompressor has a hard limit on the number of page tables it can allocate. This limit is defined at compile-time and will cause boot failure if it is reached. The kernel is very strict and calculates the limit precisely for the worst-case scenario based on the current configuration. However, it is easy to forget to adjust the limit when a new use-case arises. The worst-case scenario is rarely encountered during sanity checks. In the case of enabling 5-level paging, a use-case was overlooked. The limit needs to be increased by one to accommodate the additional level. This oversight went unnoticed until Aaron attempted to run the kernel via kexec with 5-level paging and unaccepted memory enabled. Update wost-case calculations to include 5-level paging. To address this issue, let's allocate some extra space for page tables. 128K should be sufficient for any use-case. The logic can be simplified by using a single value for all kernel configurations. [ Also add a warning, should this memory run low - by Dave Hansen. ] Fixes: `34bbb0009f` ("x86/boot/compressed: Enable 5-level paging during decompression stage") Reported-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230915070221.10266-1-kirill.shutemov@linux.intel.com	2023-09-17 09:48:57 +02:00
Uros Bizjak	b8e3dfa16e	x86/percpu: Use raw_cpu_try_cmpxchg() in preempt_count_set() Use raw_cpu_try_cmpxchg() instead of raw_cpu_cmpxchg(ptr, old, new) == old. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after CMPXCHG (and related MOV instruction in front of CMPXCHG). Also, raw_cpu_try_cmpxchg() implicitly assigns old ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230830151623.3900-2-ubizjak@gmail.com	2023-09-15 13:19:22 +02:00
Uros Bizjak	5f863897d9	x86/percpu: Define raw_cpu_try_cmpxchg and this_cpu_try_cmpxchg() Define target-specific raw_cpu_try_cmpxchg_N() and this_cpu_try_cmpxchg_N() macros. These definitions override the generic fallback definitions and enable target-specific optimized implementations. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230830151623.3900-1-ubizjak@gmail.com	2023-09-15 13:18:23 +02:00
Uros Bizjak	54cd971c6f	x86/percpu: Define {raw,this}_cpu_try_cmpxchg{64,128} Define target-specific {raw,this}_cpu_try_cmpxchg64() and {raw,this}_cpu_try_cmpxchg128() macros. These definitions override the generic fallback definitions and enable target-specific optimized implementations. Several places in mm/slub.o improve from e.g.: 53bc: 48 8d 4f 40 lea 0x40(%rdi),%rcx 53c0: 48 89 fa mov %rdi,%rdx 53c3: 49 8b 5c 05 00 mov 0x0(%r13,%rax,1),%rbx 53c8: 4c 89 e8 mov %r13,%rax 53cb: 49 8d 30 lea (%r8),%rsi 53ce: e8 00 00 00 00 call 53d3 <...> 53cf: R_X86_64_PLT32 this_cpu_cmpxchg16b_emu-0x4 53d3: 48 31 d7 xor %rdx,%rdi 53d6: 4c 31 e8 xor %r13,%rax 53d9: 48 09 c7 or %rax,%rdi 53dc: 75 ae jne 538c <...> to: 53bc: 48 8d 4a 40 lea 0x40(%rdx),%rcx 53c0: 49 8b 1c 07 mov (%r15,%rax,1),%rbx 53c4: 4c 89 f8 mov %r15,%rax 53c7: 48 8d 37 lea (%rdi),%rsi 53ca: e8 00 00 00 00 call 53cf <...> 53cb: R_X86_64_PLT32 this_cpu_cmpxchg16b_emu-0x4 53cf: 75 bb jne 538c <...> reducing the size of mm/slub.o by 80 bytes: text data bss dec hex filename 39758 5337 4208 49303 c097 slub-new.o 39838 5337 4208 49383 c0e7 slub-old.o Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230906185941.53527-1-ubizjak@gmail.com	2023-09-15 13:16:35 +02:00
Nikolay Borisov	61382281e9	x86/entry: Make IA32 syscalls' availability depend on ia32_enabled() Another major aspect of supporting running of 32bit processes is the ability to access 32bit syscalls. Such syscalls can be invoked by using the legacy int 0x80 handler and sysenter/syscall instructions. If IA32 emulation is disabled ensure that each of those 3 distinct mechanisms are also disabled. For int 0x80 a #GP exception would be generated since the respective descriptor is not going to be loaded at all. Invoking sysenter will also result in a #GP since IA32_SYSENTER_CS contains an invalid segment. Finally, syscall instruction cannot really be disabled so it's configured to execute a minimal handler. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230623111409.3047467-6-nik.borisov@suse.com	2023-09-14 13:19:53 +02:00
Nikolay Borisov	5ae2702d7c	x86/elf: Make loading of 32bit processes depend on ia32_enabled() Major aspect of ia32 emulation is the ability to load 32bit processes. That's currently decided (among others) by compat_elf_check_arch(). Make the macro use ia32_enabled() to decide if IA32 compat is enabled before loading a 32bit process. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230623111409.3047467-5-nik.borisov@suse.com	2023-09-14 13:19:53 +02:00
Nikolay Borisov	f71e1d2ff8	x86/entry: Rename ignore_sysret() The SYSCALL instruction cannot really be disabled in compatibility mode. The best that can be done is to configure the CSTAR msr to point to a minimal handler. Currently this handler has a rather misleading name - ignore_sysret() as it's not really doing anything with sysret. Give it a more descriptive name. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230623111409.3047467-3-nik.borisov@suse.com	2023-09-14 13:19:53 +02:00
Nikolay Borisov	1da5c9bc11	x86: Introduce ia32_enabled() IA32 support on 64bit kernels depends on whether CONFIG_IA32_EMULATION is selected or not. As it is a compile time option it doesn't provide the flexibility to have distributions set their own policy for IA32 support and give the user the flexibility to override it. As a first step introduce ia32_enabled() which abstracts whether IA32 compat is turned on or off. Upcoming patches will implement the ability to set IA32 compat state at boot time. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230623111409.3047467-2-nik.borisov@suse.com	2023-09-14 13:19:53 +02:00
Kai Huang	7b804135d4	x86/virt/tdx: Make TDX_MODULE_CALL handle SEAMCALL #UD and #GP SEAMCALL instruction causes #UD if the CPU isn't in VMX operation. Currently the TDX_MODULE_CALL assembly doesn't handle #UD, thus making SEAMCALL when VMX is disabled would cause Oops. Unfortunately, there are legal cases that SEAMCALL can be made when VMX is disabled. For instance, VMX can be disabled due to emergency reboot while there are still TDX guests running. Extend the TDX_MODULE_CALL assembly to return an error code for #UD to handle this case gracefully, e.g., KVM can then quietly eat all SEAMCALL errors caused by emergency reboot. SEAMCALL instruction also causes #GP when TDX isn't enabled by the BIOS. Use _ASM_EXTABLE_FAULT() to catch both exceptions with the trap number recorded, and define two new error codes by XORing the trap number to the TDX_SW_ERROR. This opportunistically handles #GP too while using the same simple assembly code. A bonus is when kernel mistakenly calls SEAMCALL when CPU isn't in VMX operation, or when TDX isn't enabled by the BIOS, or when the BIOS is buggy, the kernel can get a nicer error code rather than a less understandable Oops. This is basically based on Peter's code. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/de975832a367f476aab2d0eb0d9de66019a16b54.1692096753.git.kai.huang%40intel.com	2023-09-12 16:30:27 -07:00
Kai Huang	c33621b4c5	x86/virt/tdx: Wire up basic SEAMCALL functions Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. A CPU-attested software module called 'the TDX module' runs inside a new isolated memory range as a trusted hypervisor to manage and run protected VMs. TDX introduces a new CPU mode: Secure Arbitration Mode (SEAM). This mode runs only the TDX module itself or other code to load the TDX module. The host kernel communicates with SEAM software via a new SEAMCALL instruction. This is conceptually similar to a guest->host hypercall, except it is made from the host to SEAM software instead. The TDX module establishes a new SEAMCALL ABI which allows the host to initialize the module and to manage VMs. The SEAMCALL ABI is very similar to the TDCALL ABI and leverages much TDCALL infrastructure. Wire up basic functions to make SEAMCALLs for the basic support of running TDX guests: __seamcall(), __seamcall_ret(), and __seamcall_saved_ret() for TDH.VP.ENTER. All SEAMCALLs involved in the basic TDX support don't use "callee-saved" registers as input and output, except the TDH.VP.ENTER. To start to support TDX, create a new arch/x86/virt/vmx/tdx/tdx.c for TDX host kernel support. Add a new Kconfig option CONFIG_INTEL_TDX_HOST to opt-in TDX host kernel support (to distinguish with TDX guest kernel support). So far only KVM uses TDX. Make the new config option depend on KVM_INTEL. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Isaku Yamahata <isaku.yamahata@intel.com> Link: https://lore.kernel.org/all/4db7c3fc085e6af12acc2932294254ddb3d320b3.1692096753.git.kai.huang%40intel.com	2023-09-12 16:30:27 -07:00
Kai Huang	8a8544bde8	x86/tdx: Remove 'struct tdx_hypercall_args' Now 'struct tdx_hypercall_args' is basically 'struct tdx_module_args' minus RCX. Although from __tdx_hypercall()'s perspective RCX isn't used as shared register thus not part of input/output registers, it's not worth to have a separate structure just due to one register. Remove the 'struct tdx_hypercall_args' and use 'struct tdx_module_args' instead in __tdx_hypercall() related code. This also saves the memory copy between the two structures within __tdx_hypercall(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/798dad5ce24e9d745cf0e16825b75ccc433ad065.1692096753.git.kai.huang%40intel.com	2023-09-12 16:30:14 -07:00
Kai Huang	90f5ecd37f	x86/tdx: Reimplement __tdx_hypercall() using TDX_MODULE_CALL asm Now the TDX_HYPERCALL asm is basically identical to the TDX_MODULE_CALL with both '\saved' and '\ret' enabled, with two minor things though: 1) The way to restore the structure pointer is different The TDX_HYPERCALL uses RCX as spare to restore the structure pointer, but the TDX_MODULE_CALL assumes no spare register can be used. In other words, TDX_MODULE_CALL already covers what TDX_HYPERCALL does. 2) TDX_MODULE_CALL only clears shared registers for TDH.VP.ENTER For this just need to make that code available for the non-host case. Thus, remove the TDX_HYPERCALL and reimplement the __tdx_hypercall() using the TDX_MODULE_CALL. Extend the TDX_MODULE_CALL to cover "clear shared registers" for TDG.VP.VMCALL. Introduce a new __tdcall_saved_ret() to replace the temporary __tdcall_hypercall(). The __tdcall_saved_ret() can also be used for those new TDCALLs which require more input/output registers than the basic TDCALLs do. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/e68a2473fb6f5bcd78b078cae7510e9d0753b3df.1692096753.git.kai.huang%40intel.com	2023-09-12 16:30:14 -07:00
Kai Huang	c641cfb5c1	x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro share similar code pattern too. The __tdx_hypercall() and __tdcall() should be unified to use the same assembly code. As a preparation to unify them, simplify the TDX_HYPERCALL to make it more like the TDX_MODULE_CALL. The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as function call argument, and does below extra things comparing to the TDX_MODULE_CALL: 1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally; 2) It sets RCX to the (fixed) bitmap of shared registers internally; 3) It calls __tdx_hypercall_failed() internally (and panics) when the TDCALL instruction itself fails; 4) After TDCALL, it moves R10 to RAX to return the return code of the VMCALL leaf, regardless the '\ret' asm macro argument; Firstly, change the TDX_HYPERCALL to take the same function call arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer to 'struct tdx_module_args'. Then 1) and 2) can be moved to the caller: - TDG.VP.VMCALL leaf ID can be passed via the function call argument; - 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus the bitmap of shared registers can be passed via RCX in the structure. Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL always save output registers to the structure. The caller then can: - Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error; - Return R10 in the structure as the return code of the VMCALL leaf; With above changes, change the asm function from __tdx_hypercall() to __tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper of it. This avoids having to add another wrapper of __tdx_hypercall() (_tdx_hypercall() is already taken). The __tdcall_hypercall() will be replaced with a __tdcall() variant using TDX_MODULE_CALL in a later commit as the final goal is to have one assembly to handle both TDCALL and TDVMCALL. Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep this unchanged, annotate __tdx_hypercall(), which is a C function now, as 'noinstr'. Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so. Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the compressed code. Opportunistically fix a checkpatch error complaining using space around parenthesis '(' and ')' while moving the bitmap of shared registers to <asm/shared/tdx.h>. [ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com	2023-09-12 16:28:13 -07:00
Alison Schofield	8f012db27c	x86/numa: Introduce numa_fill_memblks() numa_fill_memblks() fills in the gaps in numa_meminfo memblks over an physical address range. The ACPI driver will use numa_fill_memblks() to implement a new Linux policy that prescribes extending proximity domains in a portion of a CFMWS window to the entire window. Dan Williams offered this explanation of the policy: A CFWMS is an ACPI data structure that indicates potential locations where CXL memory can be placed. It is the playground where the CXL driver has free reign to establish regions. That space can be populated by BIOS created regions, or driver created regions, after hotplug or other reconfiguration. When BIOS creates a region in a CXL Window it additionally describes that subset of the Window range in the other typical ACPI tables SRAT, SLIT, and HMAT. The rationale for BIOS not pre-describing the entire CXL Window in SRAT, SLIT, and HMAT is that it can not predict the future. I.e. there is nothing stopping higher or lower performance devices being placed in the same Window. Compare that to ACPI memory hotplug that just onlines additional capacity in the proximity domain with little freedom for dynamic performance differentiation. That leaves the OS with a choice, should unpopulated window capacity match the proximity domain of an existing region, or should it allocate a new one? This patch takes the simple position of minimizing proximity domain proliferation by reusing any proximity domain intersection for the entire Window. If the Window has no intersections then allocate a new proximity domain. Note that SRAT, SLIT and HMAT information can be enumerated dynamically in a standard way from device provided data. Think of CXL as the end of ACPI needing to describe memory attributes, CXL offers a standard discovery model for performance attributes, but Linux still needs to interoperate with the old regime. Reported-by: Derick Marks <derick.w.marks@intel.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Alison Schofield <alison.schofield@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Derick Marks <derick.w.marks@intel.com> Link: https://lore.kernel.org/all/ef078a6f056ca974e5af85997013c0fda9e3326d.1689018477.git.alison.schofield%40intel.com	2023-09-12 16:13:05 -07:00
Peter Zijlstra	25e73b7e3f	x86/ibt: Suppress spurious ENDBR It was reported that under certain circumstances GCC emits ENDBR instructions for _THIS_IP_ usage. Specifically, when it appears at the start of a basic block -- but not elsewhere. Since _THIS_IP_ is never used for control flow, these ENDBR instructions are completely superfluous. Override the _THIS_IP_ definition for x86_64 to avoid this. Less ENDBR instructions is better. Fixes: `156ff4a544` ("x86/ibt: Base IBT bits") Reported-by: David Kaplan <David.Kaplan@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230802110323.016197440@infradead.org	2023-09-12 17:50:53 +02:00
Kai Huang	12f34ed862	x86/tdx: Extend TDX_MODULE_CALL to support more TDCALL/SEAMCALL leafs The TDX guest live migration support (TDX 1.5) adds new TDCALL/SEAMCALL leaf functions. Those new TDCALLs/SEAMCALLs take additional registers for input (R10-R13) and output (R12-R13). TDG.SERVTD.RD is an example. Also, the current TDX_MODULE_CALL doesn't aim to handle TDH.VP.ENTER SEAMCALL, which monitors the TDG.VP.VMCALL in input/output registers when it returns in case of VMCALL from TDX guest. With those new TDCALLs/SEAMCALLs and the TDH.VP.ENTER covered, the TDX_MODULE_CALL macro basically needs to handle the same input/output registers as the TDX_HYPERCALL does. And as a result, they also share similar logic in the assembly, thus should be unified to use one common assembly. Extend the TDX_MODULE_CALL asm to support the new TDCALLs/SEAMCALLs and also the TDH.VP.ENTER SEAMCALL. Eventually it will be unified with the TDX_HYPERCALL. The new input/output registers fit with the "callee-saved" registers in the x86 calling convention. Add a new "saved" parameter to support those new TDCALLs/SEAMCALLs and TDH.VP.ENTER and keep the existing TDCALLs/SEAMCALLs minimally impacted. For TDH.VP.ENTER, after it returns the registers shared by the guest contain guest's values. Explicitly clear them to prevent speculative use of guest's values. Note most TDX live migration related SEAMCALLs may also clobber AVX* state ("AVX, AVX2 and AVX512 state: may be reset to the architectural INIT state" -- see TDH.EXPORT.MEM for example). And TDH.VP.ENTER also clobbers XMM0-XMM15 when the corresponding bit is set in RCX. Don't handle them in the TDX_MODULE_CALL macro but let the caller save and restore when needed. This is basically based on Peter's code. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/d4785de7c392f7c5684407f6c24a73b92148ec49.1692096753.git.kai.huang%40intel.com	2023-09-11 16:33:51 -07:00
Kai Huang	57a420bb81	x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and SEAMCALL, takes one parameter for each input register and an optional 'struct tdx_module_output' (a collection of output registers) as output. This is different from the TDX_HYPERCALL macro which uses a single 'struct tdx_hypercall_args' to carry all input/output registers. The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more input/output registers. Also, the TDH.VP.ENTER (which isn't covered by the current TDX_MODULE_CALL macro) basically can use all registers that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't extendible to cover those cases. Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro to use a single structure 'struct tdx_module_args' to carry all the input/output registers. Currently, R10/R11 are only used as output register but not as input by any TDCALL/SEAMCALL. Change to also use R10/R11 as input register to make input/output registers symmetric. Currently, the TDX_MODULE_CALL macro depends on the caller to pass a non-NULL 'struct tdx_module_output' to get additional output registers. Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to take a new 'ret' macro argument to indicate whether to save the output registers to the 'struct tdx_module_args'. Also introduce a new __tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret(). Note the tdcall(), which is a wrapper of __tdcall(), is called by three callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init(). The former two need the additional output but the last one doesn't. For simplicity, make tdcall() always call __tdcall_ret() to avoid another "_ret()" wrapper. The last caller tdx_early_init() isn't performance critical anyway. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com	2023-09-11 16:33:38 -07:00
Kai Huang	5efb96289e	x86/tdx: Rename __tdx_module_call() to __tdcall() __tdx_module_call() is only used by the TDX guest to issue TDCALL to the TDX module. Rename it to __tdcall() to match its behaviour, e.g., it cannot be used to make host-side SEAMCALL. Also rename tdx_module_call() which is a wrapper of __tdx_module_call() to tdcall(). No functional change intended. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/785d20d99fbcd0db8262c94da6423375422d8c75.1692096753.git.kai.huang%40intel.com	2023-09-11 16:33:32 -07:00
Kai Huang	f0024dbfc4	x86/tdx: Make macros of TDCALLs consistent with the spec The TDX spec names all TDCALLs with prefix "TDG". Currently, the kernel doesn't follow such convention for the macros of those TDCALLs but uses prefix "TDX_" for all of them. Although it's arguable whether the TDX spec names those TDCALLs properly, it's better for the kernel to follow the spec when naming those macros. Change all macros of TDCALLs to make them consistent with the spec. As a bonus, they get distinguished easily from the host-side SEAMCALLs, which all have prefix "TDH". No functional change intended. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/516dccd0bd8fb9a0b6af30d25bb2d971aa03d598.1692096753.git.kai.huang%40intel.com	2023-09-11 16:33:16 -07:00
Dexuan Cui	019b383d11	x86/tdx: Retry partially-completed page conversion hypercalls TDX guest memory is private by default and the VMM may not access it. However, in cases where the guest needs to share data with the VMM, the guest and the VMM can coordinate to make memory shared between them. The guest side of this protocol includes the "MapGPA" hypercall. This call takes a guest physical address range. The hypercall spec (aka. the GHCI) says that the MapGPA call is allowed to return partial progress in mapping this range and indicate that fact with a special error code. A guest that sees such partial progress is expected to retry the operation for the portion of the address range that was not completed. Hyper-V does this partial completion dance when set_memory_decrypted() is called to "decrypt" swiotlb bounce buffers that can be up to 1GB in size. It is evidently the only VMM that does this, which is why nobody noticed this until now. [ dhansen: rewrite changelog ] Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20230811021246.821-2-decui%40microsoft.com	2023-09-11 16:19:33 -07:00
Ard Biesheuvel	762f169f5d	efi/x86: Move EFI runtime call setup/teardown helpers out of line Only the arch_efi_call_virt() macro that some architectures override needs to be a macro, given that it is variadic and encapsulates calls via function pointers that have different prototypes. The associated setup and teardown code are not special in this regard, and don't need to be instantiated at each call site. So turn them into ordinary C functions and move them out of line. Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2023-09-11 06:37:50 +00:00
Linus Torvalds	e56b2b6057	Fix preemption delays in the SGX code, remove unnecessarily UAPI-exported code, fix a ld.lld linker (in)compatibility quirk and make the x86 SMP init code a bit more conservative to fix kexec() lockups. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmT97boRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jObA//X7nug+d+IMLIs+c4579z4ZhkltMRxJVI Btf8sdHpwgTUtKaOLmJnGiJ7f0GK5NtoaNtGUJF28aQETVOhco0Fvg/R8k1FE2Tc CJqw6oy2FjVqD9qzZPCXh6QCvTtGjN5GF+xmoUbf7eZ9U8IVvOxBG+7yDMorQw3P zzjIccLLg/aDvNLN/yZD2oqw6UGHZuh/Qr/0Q4PkZ7zL+yWV8EC+HOx3rlQklq0x hh6YMwa4LGr3przUObHsfNS11EDzLDhg2WtTQMr12vlnpUB2eXnXWLklr6rpWjcz qJiMxkrEkygB7seXnuQ0b4KHN/17zdkJ+t6vZoznUTXs1ohIDLWdiNTSl03qCs9B V98a1x3MPT6aro9O/5ywyAJwPb0hvsg2S0ODFWab0Z3oRUbIG/k6dTEYlP7qZw8v EFMtLy6M2EILXetj8q2ZGcA0rKz7pj/z9SosWDzqNj76w7xGwDKrSWoKJckkCwG+ j+ycBuKfrpxVYOF4ywvONSf35QTIW8BR0sM9Lg1GZuwaeincFwLf0cmS4ljGRyZ1 Vsi0SfpIgVQkeY/17onTa1C5X6c2wIE9nq253M58Xnc9B2EWpYImr+4PVZk6s4GI GExvdPC/rIIwYa0LmvYTTlpHEd7f5qIAhfcEtMAuGSjVDLvmdDGFkaU7TgJ6Jcw2 D12wKSAAgPU= =S38E -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Fix preemption delays in the SGX code, remove unnecessarily UAPI-exported code, fix a ld.lld linker (in)compatibility quirk and make the x86 SMP init code a bit more conservative to fix kexec() lockups" * tag 'x86-urgent-2023-09-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sgx: Break up long non-preemptible delays in sgx_vepc_release() x86: Remove the arch_calc_vm_prot_bits() macro from the UAPI x86/build: Fix linker fill bytes quirk/incompatibility for ld.lld x86/smp: Don't send INIT to non-present and non-booted CPUs	2023-09-10 10:39:31 -07:00
Linus Torvalds	0c02183427	ARM: * Clean up vCPU targets, always returning generic v8 as the preferred target * Trap forwarding infrastructure for nested virtualization (used for traps that are taken from an L2 guest and are needed by the L1 hypervisor) * FEAT_TLBIRANGE support to only invalidate specific ranges of addresses when collapsing a table PTE to a block PTE. This avoids that the guest refills the TLBs again for addresses that aren't covered by the table PTE. * Fix vPMU issues related to handling of PMUver. * Don't unnecessary align non-stack allocations in the EL2 VA space * Drop HCR_VIRT_EXCP_MASK, which was never used... * Don't use smp_processor_id() in kvm_arch_vcpu_load(), but the cpu parameter instead * Drop redundant call to kvm_set_pfn_accessed() in user_mem_abort() * Remove prototypes without implementations RISC-V: * Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for guest * Added ONE_REG interface for SATP mode * Added ONE_REG interface to enable/disable multiple ISA extensions * Improved error codes returned by ONE_REG interfaces * Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V * Added get-reg-list selftest for KVM RISC-V s390: * PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch) Allows a PV guest to use crypto cards. Card access is governed by the firmware and once a crypto queue is "bound" to a PV VM every other entity (PV or not) looses access until it is not bound anymore. Enablement is done via flags when creating the PV VM. * Guest debug fixes (Ilya) x86: * Clean up KVM's handling of Intel architectural events * Intel bugfixes * Add support for SEV-ES DebugSwap, allowing SEV-ES guests to use debug registers and generate/handle #DBs * Clean up LBR virtualization code * Fix a bug where KVM fails to set the target pCPU during an IRTE update * Fix fatal bugs in SEV-ES intrahost migration * Fix a bug where the recent (architecturally correct) change to reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it) * Retry APIC map recalculation if a vCPU is added/enabled * Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM * Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR cannot diverge from the default when TSC scaling is disabled up related code * Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID * Rip out the ancient MMU_DEBUG crud and replace the useful bits with CONFIG_KVM_PROVE_MMU * Fix KVM's handling of !visible guest roots to avoid premature triple fault injection * Overhaul KVM's page-track APIs, and KVMGT's usage, to reduce the API surface that is needed by external users (currently only KVMGT), and fix a variety of issues in the process This last item had a silly one-character bug in the topic branch that was sent to me. Because it caused pretty bad selftest failures in some configurations, I decided to squash in the fix. So, while the exact commit ids haven't been in linux-next, the code has (from the kvm-x86 tree). Generic: * Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass action specific data without needing to constantly update the main handlers. * Drop unused function declarations Selftests: * Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs * Add support for printf() in guest code and covert all guest asserts to use printf-based reporting * Clean up the PMU event filter test and add new testcases * Include x86 selftests in the KVM x86 MAINTAINERS entry -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmT1m0kUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroMNgggAiN7nz6UC423FznuI+yO3TLm8tkx1 CpKh5onqQogVtchH+vrngi97cfOzZb1/AtifY90OWQi31KEWhehkeofcx7G6ERhj 5a9NFADY1xGBsX4exca/VHDxhnzsbDWaWYPXw5vWFWI6erft9Mvy3tp1LwTvOzqM v8X4aWz+g5bmo/DWJf4Wu32tEU6mnxzkrjKU14JmyqQTBawVmJ3RYvHVJ/Agpw+n hRtPAy7FU6XTdkmq/uCT+KUHuJEIK0E/l1js47HFAqSzwdW70UDg14GGo1o4ETxu RjZQmVNvL57yVgi6QU38/A0FWIsWQm5IlaX1Ug6x8pjZPnUKNbo9BY4T1g== =W+4p -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM: - Clean up vCPU targets, always returning generic v8 as the preferred target - Trap forwarding infrastructure for nested virtualization (used for traps that are taken from an L2 guest and are needed by the L1 hypervisor) - FEAT_TLBIRANGE support to only invalidate specific ranges of addresses when collapsing a table PTE to a block PTE. This avoids that the guest refills the TLBs again for addresses that aren't covered by the table PTE. - Fix vPMU issues related to handling of PMUver. - Don't unnecessary align non-stack allocations in the EL2 VA space - Drop HCR_VIRT_EXCP_MASK, which was never used... - Don't use smp_processor_id() in kvm_arch_vcpu_load(), but the cpu parameter instead - Drop redundant call to kvm_set_pfn_accessed() in user_mem_abort() - Remove prototypes without implementations RISC-V: - Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for guest - Added ONE_REG interface for SATP mode - Added ONE_REG interface to enable/disable multiple ISA extensions - Improved error codes returned by ONE_REG interfaces - Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V - Added get-reg-list selftest for KVM RISC-V s390: - PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch) Allows a PV guest to use crypto cards. Card access is governed by the firmware and once a crypto queue is "bound" to a PV VM every other entity (PV or not) looses access until it is not bound anymore. Enablement is done via flags when creating the PV VM. - Guest debug fixes (Ilya) x86: - Clean up KVM's handling of Intel architectural events - Intel bugfixes - Add support for SEV-ES DebugSwap, allowing SEV-ES guests to use debug registers and generate/handle #DBs - Clean up LBR virtualization code - Fix a bug where KVM fails to set the target pCPU during an IRTE update - Fix fatal bugs in SEV-ES intrahost migration - Fix a bug where the recent (architecturally correct) change to reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it) - Retry APIC map recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR cannot diverge from the default when TSC scaling is disabled up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID - Rip out the ancient MMU_DEBUG crud and replace the useful bits with CONFIG_KVM_PROVE_MMU - Fix KVM's handling of !visible guest roots to avoid premature triple fault injection - Overhaul KVM's page-track APIs, and KVMGT's usage, to reduce the API surface that is needed by external users (currently only KVMGT), and fix a variety of issues in the process Generic: - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass action specific data without needing to constantly update the main handlers. - Drop unused function declarations Selftests: - Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs - Add support for printf() in guest code and covert all guest asserts to use printf-based reporting - Clean up the PMU event filter test and add new testcases - Include x86 selftests in the KVM x86 MAINTAINERS entry" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (279 commits) KVM: x86/mmu: Include mmu.h in spte.h KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots KVM: x86/mmu: Disallow guest from using !visible slots for page tables KVM: x86/mmu: Harden TDP MMU iteration against root w/o shadow page KVM: x86/mmu: Harden new PGD against roots without shadow pages KVM: x86/mmu: Add helper to convert root hpa to shadow page drm/i915/gvt: Drop final dependencies on KVM internal details KVM: x86/mmu: Handle KVM bookkeeping in page-track APIs, not callers KVM: x86/mmu: Drop @slot param from exported/external page-track APIs KVM: x86/mmu: Bug the VM if write-tracking is used but not enabled KVM: x86/mmu: Assert that correct locks are held for page write-tracking KVM: x86/mmu: Rename page-track APIs to reflect the new reality KVM: x86/mmu: Drop infrastructure for multiple page-track modes KVM: x86/mmu: Use page-track notifiers iff there are external users KVM: x86/mmu: Move KVM-only page-track declarations to internal header KVM: x86: Remove the unused page-track hook track_flush_slot() drm/i915/gvt: switch from ->track_flush_slot() to ->track_remove_region() KVM: x86: Add a new page-track hook to handle memslot deletion drm/i915/gvt: Don't bother removing write-protection on to-be-deleted slot KVM: x86: Reject memslot MOVE operations if KVMGT is attached ...	2023-09-07 13:52:20 -07:00
Nick Desaulniers	3dae5c43ba	x86/asm/bitops: Use __builtin_clz{l\|ll} to evaluate constant expressions Micro-optimize the bitops code some more, similar to commits: `fdb6649ab7` ("x86/asm/bitops: Use __builtin_ctzl() to evaluate constant expressions") `2fcff790dc` ("powerpc: Use builtin functions for fls()/__fls()/fls64()") From a recent discussion, I noticed that x86 is lacking an optimization that appears in arch/powerpc/include/asm/bitops.h related to constant folding. If you add a BUILD_BUG_ON(__builtin_constant_p(param)) to these functions, you'll find that there were cases where the use of inline asm pessimized the compiler's ability to perform constant folding resulting in runtime calculation of a value that could have been computed at compile time. Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230828-x86_fls-v1-1-e6a31b9f79c3@google.com	2023-09-07 00:05:50 +02:00
Thomas Huth	659df86a7b	x86: Remove the arch_calc_vm_prot_bits() macro from the UAPI The arch_calc_vm_prot_bits() macro uses VM_PKEY_BIT0 etc. which are not part of the UAPI, so the macro is completely useless for userspace. It is also hidden behind the CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS config switch which we shouldn't expose to userspace. Thus let's move this macro into a new internal header instead. Fixes: `8f62c88322` ("x86/mm/pkeys: Add arch-specific VMA protection bits") Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu> Acked-by: Dave Hansen <dave.hansen@intel.com> Link: https://lore.kernel.org/r/20230906162658.142511-1-thuth@redhat.com	2023-09-06 23:50:46 +02:00
Linus Torvalds	0b90c5637d	hyperv-next for v6.6 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmT0EE8THHdlaS5saXVA a2VybmVsLm9yZwAKCRB2FHBfkEGgXg5FCACGJ6n2ikhtRHAENHIVY/mTh+HbhO07 ERzjADfqKF43u1Nt9cslgT4MioqwLjQsAu/A0YcJgVxVSOtg7dnbDmurRAjrGT/3 iKqcVvnaiwSV44TkF8evpeMttZSOg29ImmpyQjoZJJvDMfpxleEK53nuKB9EsjKL Mz/0gSPoNc79bWF+85cVhgPnGIh9nBarxHqVsuWjMhc+UFhzjf9mOtk34qqPfJ1Q 4RsKGEjkVkeXoG6nGd6Gl/+8WoTpenOZQLchhInocY+k9FlAzW1Kr+ICLDx+Topw 8OJ6fv2rMDOejT9aOaA3/imf7LMer0xSUKb6N0sqQAQX8KzwcOYyKtQJ =rC/v -----END PGP SIGNATURE----- Merge tag 'hyperv-next-signed-20230902' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Support for SEV-SNP guests on Hyper-V (Tianyu Lan) - Support for TDX guests on Hyper-V (Dexuan Cui) - Use SBRM API in Hyper-V balloon driver (Mitchell Levy) - Avoid dereferencing ACPI root object handle in VMBus driver (Maciej Szmigiero) - A few misecllaneous fixes (Jiapeng Chong, Nathan Chancellor, Saurabh Sengar) * tag 'hyperv-next-signed-20230902' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits) x86/hyperv: Remove duplicate include x86/hyperv: Move the code in ivm.c around to avoid unnecessary ifdef's x86/hyperv: Remove hv_isolation_type_en_snp x86/hyperv: Use TDX GHCI to access some MSRs in a TDX VM with the paravisor Drivers: hv: vmbus: Bring the post_msg_page back for TDX VMs with the paravisor x86/hyperv: Introduce a global variable hyperv_paravisor_present Drivers: hv: vmbus: Support >64 VPs for a fully enlightened TDX/SNP VM x86/hyperv: Fix serial console interrupts for fully enlightened TDX guests Drivers: hv: vmbus: Support fully enlightened TDX guests x86/hyperv: Support hypercalls for fully enlightened TDX guests x86/hyperv: Add hv_isolation_type_tdx() to detect TDX guests x86/hyperv: Fix undefined reference to isolation_type_en_snp without CONFIG_HYPERV x86/hyperv: Add missing 'inline' to hv_snp_boot_ap() stub hv: hyperv.h: Replace one-element array with flexible-array member Drivers: hv: vmbus: Don't dereference ACPI root object handle x86/hyperv: Add hyperv-specific handling for VMMCALL under SEV-ES x86/hyperv: Add smp support for SEV-SNP guest clocksource: hyper-v: Mark hyperv tsc page unencrypted in sev-snp enlightened guest x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest drivers: hv: Mark percpu hvcall input arg page unencrypted in SEV-SNP enlightened guest ...	2023-09-04 11:26:29 -07:00
Linus Torvalds	2fcbb03847	* Mark all Skylake CPUs as vulnerable to GDS * Fix PKRU covert channel * Fix -Wmissing-variable-declarations warning for ia32_xyz_class * Fix kernel-doc annotation warning -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTyK2sACgkQaDWVMHDJ krCPyA//WbT65hhGZt5UHayoW3Cv8NmS4kRf6FTVMmt1yIvbOtVmWVZeOGMypprC jYBPOdTf1SdoUxeD1cTTd/rtbYknczaBLC5VGilGb2/663yQl9ThT3ePc6OgUpPW 28ItLslfHGchHbrOgCgUK8Aiv8xRdfNUJoByMTOoce6eGUQsgYaNOq74z1YaLsZ4 afbitcd/vWO9eJfW5DNl26rIP+ksOgA/Iue8uRiEczoPltnbUxGGLCtuuR9B2Xxi FqfttvcnUy9yFJBnELx4721VtTcmK4Dq5O/ZoJnmVQqRPY9aXpkbwZnH9co8r/uh GueQj30l6VhEO6XnwPvjsOghXI2vOMrxlL3zGMw5y6pjfUWa4W6f33EV6mByYfml dfPTXIz9K1QxznHtk6xVhn1X8CFw7L1L7CC/2CPQPETiuqyjWrHqvFBXcsWpKq4N 1GjnpoveNZmWcE7UY9qEHAXGh7ndI+EIux/0WfJJHOE/fzCr06GphQdxNwPGdi5O T3JkZf6R4GBvfi6e0F6Qn2B02sDl18remmAMi9bAn/M31g+mvAoHqKLlI9EzAGGz 91PzJdHecdvalU6zsCKqi0ETzq11WG7zCfSXs+jeHBe1jsUITpcZCkpVi2pGA2aq X7qLjJyW+Sx0hdE2IWFhCOfiwGFpXxfegCl1uax3HrZg6rrORbw= =prt2 -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2023-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Dave Hansen: "The most important fix here adds a missing CPU model to the recent Gather Data Sampling (GDS) mitigation list to ensure that mitigations are available on that CPU. There are also a pair of warning fixes, and closure of a covert channel that pops up when protection keys are disabled. Summary: - Mark all Skylake CPUs as vulnerable to GDS - Fix PKRU covert channel - Fix -Wmissing-variable-declarations warning for ia32_xyz_class - Fix kernel-doc annotation warning" * tag 'x86-urgent-2023-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu/xstate: Fix PKRU covert channel x86/irq/i8259: Fix kernel-doc annotation warning x86/speculation: Mark all Skylake CPUs as vulnerable to GDS x86/audit: Fix -Wmissing-variable-declarations warning for ia32_xyz_class	2023-09-01 16:40:19 -07:00
Linus Torvalds	df57721f9a	Add x86 shadow stack support Convert IBT selftest to asm to fix objtool warning -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTv1QQACgkQaDWVMHDJ krAUwhAAn6TOwHJK8BSkHeiQhON1nrlP3c5cv0AyZ2NP8RYDrZrSZvhpYBJ6wgKC Cx5CGq5nn9twYsYS3KsktLKDfR3lRdsQ7K9qtyFtYiaeaVKo+7gEKl/K+klwai8/ gninQWHk0zmSCja8Vi77q52WOMkQKapT8+vaON9EVDO8dVEi+CvhAIfPwMafuiwO Rk4X86SzoZu9FP79LcCg9XyGC/XbM2OG9eNUTSCKT40qTTKm5y4gix687NvAlaHR ko5MTsdl0Wfp6Qk0ohT74LnoA2c1g/FluvZIM33ci/2rFpkf9Hw7ip3lUXqn6CPx rKiZ+pVRc0xikVWkraMfIGMJfUd2rhelp8OyoozD7DB7UZw40Q4RW4N5tgq9Fhe9 MQs3p1v9N8xHdRKl365UcOczUxNAmv4u0nV5gY/4FMC6VjldCl2V9fmqYXyzFS4/ Ogg4FSd7c2JyGFKPs+5uXyi+RY2qOX4+nzHOoKD7SY616IYqtgKoz5usxETLwZ6s VtJOmJL0h//z0A7tBliB0zd+SQ5UQQBDC2XouQH2fNX2isJMn0UDmWJGjaHgK6Hh 8jVp6LNqf+CEQS387UxckOyj7fu438hDky1Ggaw4YqowEOhQeqLVO4++x+HITrbp AupXfbJw9h9cMN63Yc0gVxXQ9IMZ+M7UxLtZ3Cd8/PVztNy/clA= =3UUm -----END PGP SIGNATURE----- Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 shadow stack support from Dave Hansen: "This is the long awaited x86 shadow stack support, part of Intel's Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace. The main use case for shadow stack is providing protection against return oriented programming attacks. It works by maintaining a secondary (shadow) stack using a special memory type that has protections against modification. When executing a CALL instruction, the processor pushes the return address to both the normal stack and to the special permission shadow stack. Upon RET, the processor pops the shadow stack copy and compares it to the normal stack copy. For more information, refer to the links below for the earlier versions of this patch set" Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/ Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/ * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) x86/shstk: Change order of __user in type x86/ibt: Convert IBT selftest to asm x86/shstk: Don't retry vm_munmap() on -EINTR x86/kbuild: Fix Documentation/ reference x86/shstk: Move arch detail comment out of core mm x86/shstk: Add ARCH_SHSTK_STATUS x86/shstk: Add ARCH_SHSTK_UNLOCK x86: Add PTRACE interface for shadow stack selftests/x86: Add shadow stack test x86/cpufeatures: Enable CET CR4 bit for shadow stack x86/shstk: Wire in shadow stack interface x86: Expose thread features in /proc/$PID/status x86/shstk: Support WRSS for userspace x86/shstk: Introduce map_shadow_stack syscall x86/shstk: Check that signal frame is shadow stack mem x86/shstk: Check that SSP is aligned on sigreturn x86/shstk: Handle signals for shadow stack x86/shstk: Introduce routines modifying shstk x86/shstk: Handle thread shadow stack x86/shstk: Add user-mode shadow stack support ...	2023-08-31 12:20:12 -07:00
Sean Christopherson	f22b1e8500	KVM: x86/mmu: Handle KVM bookkeeping in page-track APIs, not callers Get/put references to KVM when a page-track notifier is (un)registered instead of relying on the caller to do so. Forcing the caller to do the bookkeeping is unnecessary and adds one more thing for users to get wrong, e.g. see commit `9ed1fdee9e` ("drm/i915/gvt: Get reference to KVM iff attachment to VM is successful"). Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:19 -04:00
Sean Christopherson	96316a0670	KVM: x86/mmu: Drop @slot param from exported/external page-track APIs Refactor KVM's exported/external page-track, a.k.a. write-track, APIs to take only the gfn and do the required memslot lookup in KVM proper. Forcing users of the APIs to get the memslot unnecessarily bleeds KVM internals into KVMGT and complicates usage of the APIs. No functional change intended. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:18 -04:00
Sean Christopherson	7b574863e7	KVM: x86/mmu: Rename page-track APIs to reflect the new reality Rename the page-track APIs to capture that they're all about tracking writes, now that the facade of supporting multiple modes is gone. Opportunstically replace "slot" with "gfn" in anticipation of removing the @slot param from the external APIs. No functional change intended. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:15 -04:00
Sean Christopherson	338068b5be	KVM: x86/mmu: Drop infrastructure for multiple page-track modes Drop "support" for multiple page-track modes, as there is no evidence that array-based and refcounted metadata is the optimal solution for other modes, nor is there any evidence that other use cases, e.g. for access-tracking, will be a good fit for the page-track machinery in general. E.g. one potential use case of access-tracking would be to prevent guest access to poisoned memory (from the guest's perspective). In that case, the number of poisoned pages is likely to be a very small percentage of the guest memory, and there is no need to reference count the number of access-tracking users, i.e. expanding gfn_track[] for a new mode would be grossly inefficient. And for poisoned memory, host userspace would also likely want to trap accesses, e.g. to inject #MC into the guest, and that isn't currently supported by the page-track framework. A better alternative for that poisoned page use case is likely a variation of the proposed per-gfn attributes overlay (linked), which would allow efficiently tracking the sparse set of poisoned pages, and by default would exit to userspace on access. Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com Cc: Ben Gardon <bgardon@google.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:14 -04:00
Sean Christopherson	e998fb1a30	KVM: x86/mmu: Use page-track notifiers iff there are external users Disable the page-track notifier code at compile time if there are no external users, i.e. if CONFIG_KVM_EXTERNAL_WRITE_TRACKING=n. KVM itself now hooks emulated writes directly instead of relying on the page-track mechanism. Provide a stub for "struct kvm_page_track_notifier_node" so that including headers directly from the command line, e.g. for testing include guards, doesn't fail due to a struct having an incomplete type. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:14 -04:00
Sean Christopherson	58ea7cf700	KVM: x86/mmu: Move KVM-only page-track declarations to internal header Bury the declaration of the page-track helpers that are intended only for internal KVM use in a "private" header. In addition to guarding against unwanted usage of the internal-only helpers, dropping their definitions avoids exposing other structures that should be KVM-internal, e.g. for memslots. This is a baby step toward making kvm_host.h a KVM-internal header in the very distant future. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:08:13 -04:00
Yan Zhao	d104d5bbbc	KVM: x86: Remove the unused page-track hook track_flush_slot() Remove ->track_remove_slot(), there are no longer any users and it's unlikely a "flush" hook will ever be the correct API to provide to an external page-track user. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:07:26 -04:00
Yan Zhao	b83ab124de	KVM: x86: Add a new page-track hook to handle memslot deletion Add a new page-track hook, track_remove_region(), that is called when a memslot DELETE operation is about to be committed. The "remove" hook will be used by KVMGT and will effectively replace the existing track_flush_slot() altogether now that KVM itself doesn't rely on the "flush" hook either. The "flush" hook is flawed as it's invoked before the memslot operation is guaranteed to succeed, i.e. KVM might ultimately keep the existing memslot without notifying external page track users, a.k.a. KVMGT. In practice, this can't currently happen on x86, but there are no guarantees that won't change in the future, not to mention that "flush" does a very poor job of describing what is happening. Pass in the gfn+nr_pages instead of the slot itself so external users, i.e. KVMGT, don't need to exposed to KVM internals (memslots). This will help set the stage for additional cleanups to the page-track APIs. Opportunistically align the existing srcu_read_lock_held() usage so that the new case doesn't stand out like a sore thumb (and not aligning the new code makes bots unhappy). Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230729013535.1070024-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:07:25 -04:00
Sean Christopherson	c70934e0ab	KVM: x86: Reject memslot MOVE operations if KVMGT is attached Disallow moving memslots if the VM has external page-track users, i.e. if KVMGT is being used to expose a virtual GPU to the guest, as KVMGT doesn't correctly handle moving memory regions. Note, this is potential ABI breakage! E.g. userspace could move regions that aren't shadowed by KVMGT without harming the guest. However, the only known user of KVMGT is QEMU, and QEMU doesn't move generic memory regions. KVM's own support for moving memory regions was also broken for multiple years (albeit for an edge case, but arguably moving RAM is itself an edge case), e.g. see commit `edd4fa37ba` ("KVM: x86: Allocate new rmap and large page tracking when moving memslot"). Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 14:07:23 -04:00
Sean Christopherson	b271e17def	KVM: drm/i915/gvt: Drop @vcpu from KVM's ->track_write() hook Drop @vcpu from KVM's ->track_write() hook provided for external users of the page-track APIs now that KVM itself doesn't use the page-track mechanism. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:49:01 -04:00
Sean Christopherson	932844462a	KVM: x86/mmu: Don't bounce through page-track mechanism for guest PTEs Don't use the generic page-track mechanism to handle writes to guest PTEs in KVM's MMU. KVM's MMU needs access to information that should not be exposed to external page-track users, e.g. KVM needs (for some definitions of "need") the vCPU to query the current paging mode, whereas external users, i.e. KVMGT, have no ties to the current vCPU and so should never need the vCPU. Moving away from the page-track mechanism will allow dropping use of the page-track mechanism for KVM's own MMU, and will also allow simplifying and cleaning up the page-track APIs. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:49:00 -04:00
Sean Christopherson	db0d70e610	KVM: x86/mmu: Move kvm_arch_flush_shadow_{all,memslot}() to mmu.c Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only for kvm_arch_flush_shadow_all(). This will allow refactoring kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping everything in mmu.c will also likely simplify supporting TDX, which intends to do zap only relevant SPTEs on memslot updates. No functional change intended. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-08-31 13:48:59 -04:00
Paolo Bonzini	6d5e3c318a	KVM x86 changes for 6.6: - Misc cleanups - Retry APIC optimized recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR can diverge from the default iff TSC scaling is enabled, and clean up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTueMwSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5hp4P/i/UmIJEJupryUrD/ZXcSjqmupCtv4JS Z2o1KIAPbM5GUX4iyF1cnZrI4Ac5zMtULN8Tp3ATOp3AqKy72AqB1Z82e+v6SKis KfSXlDFCPFisrwv3Ys7JEu9vIS8oqITHmSBk8OAmElwujdQ5jYLZjwGbCXbM9qas yCFGLqD4fjX8XqkZLmXggjT99MPSgiTPoKL592Wq4JR8mY4hyQqJzBepDjb94sT7 wrsAv1B+BchGDguk0+nOdmHM4emGrZU7fVqi3OFPofSlwAAdkqZObleb422KB058 5bcpNow+9VH5pzgq8XSAU7DLNgH9aXH0PcVU8ASU6P0D9fceKoOFuL47nnFbwz0t vKafcXNWFs8xHE4iyzvAAsZK/X8GR0ngNByPnamATMsjt2tTmsa5BOyAPkIN+GpT DzZCIk27SbdGC3lGYlSV+5ob/+sOr6m384DkvSZnU6JiiFLlZiTxURj1/9Zvfka8 2co2wnf8cJxnKFUThFfuxs9XpKgvhkOE8LauwCSo4MAQM95Pen+NAK960RBWj0xl wof5kIGmKbwmMXyg2Sr+EKqe5KRPba22Yi3x24tURAXafKK/AW7T8dgEEXOll7dp pKmTPAevwUk9wYIGultjhEBXKYgMOeD2BVoTa5je5h1Da28onrSJ7aLQUixHHs0J gLdtzs8M9K9t =yGM1 -----END PGP SIGNATURE----- Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEAD KVM x86 changes for 6.6: - Misc cleanups - Retry APIC optimized recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR can diverge from the default iff TSC scaling is enabled, and clean up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID	2023-08-31 13:36:33 -04:00
Paolo Bonzini	bd7fe98b35	KVM: x86: SVM changes for 6.6: - Add support for SEV-ES DebugSwap, i.e. allow SEV-ES guests to use debug registers and generate/handle #DBs - Clean up LBR virtualization code - Fix a bug where KVM fails to set the target pCPU during an IRTE update - Fix fatal bugs in SEV-ES intrahost migration - Fix a bug where the recent (architecturally correct) change to reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it) -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmTue8YSHHNlYW5qY0Bn b29nbGUuY29tAAoJEGCRIgFNDBL5aqUP/jF7DyMXyQGYMKoQhFxWyGRhfqV8Ov8i 7sUpEKSx5WTxOsFHBgdGeNU+m9eBJHWVmrJM9imI4OCUvJmxasRRsnyhvEUvBIUE amQT45aVm2xqjRNRUkOCUUHiDKtUdwpSRlOSyhnDTKmlMbNoH+fO3SLJ1oB/fsae wnmyiF98j2vT/5mD6F/F87hlNMq4CqG/OZWJ9Kk8GfvfJpUcC8r/H0NsMgSMF2/L Q+Hn+r/XDfMSrBiyEzevWyPbJi7nL+WF9EQDJASf+aAkmFMHK6AU4XNITwVw3XcZ FGtSP/NzvnePhd5gqtbiW9hRQkWcKjqnydtyI3ZDVVBpEbJ6OJn3+UFoLZ8NoSE+ D3EDs1PA7Qjty6kYx9/NZpXz5BAMd9mikkTL7PTrlrAZAEimToqoHx7mBjmLp4E+ diKrpG2w1OTtO/Pafi0z0zZN6Yc9MJOyZVK78DpIiLey3rNip9SawWGh+wV14WNC nbn7Wpf8EGE1E8n00mwrGMRCuRm7LQhLbcVXITiGKrbpxUzam6sqDIgt73Q7xma2 NWcPizeFNy47uurNOA2V9xHkbEAYjWaM12uyzmGzILvvmvNnpU0NuZ78cgV5ZWMk 4US53CAQbG4+qUCJWhIDoriluaLXjL9tLiZgJW0T6cus3nL5NWYqvlq6TWYyK00J zjiK7vky77Pq =WC5V -----END PGP SIGNATURE----- Merge tag 'kvm-x86-svm-6.6' of https://github.com/kvm-x86/linux into HEAD KVM: x86: SVM changes for 6.6: - Add support for SEV-ES DebugSwap, i.e. allow SEV-ES guests to use debug registers and generate/handle #DBs - Clean up LBR virtualization code - Fix a bug where KVM fails to set the target pCPU during an IRTE update - Fix fatal bugs in SEV-ES intrahost migration - Fix a bug where the recent (architecturally correct) change to reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it)	2023-08-31 13:32:40 -04:00
Paolo Bonzini	e0fb12c673	KVM/arm64 updates for Linux 6.6 - Add support for TLB range invalidation of Stage-2 page tables, avoiding unnecessary invalidations. Systems that do not implement range invalidation still rely on a full invalidation when dealing with large ranges. - Add infrastructure for forwarding traps taken from a L2 guest to the L1 guest, with L0 acting as the dispatcher, another baby step towards the full nested support. - Simplify the way we deal with the (long deprecated) 'CPU target', resulting in a much needed cleanup. - Fix another set of PMU bugs, both on the guest and host sides, as we seem to never have any shortage of those... - Relax the alignment requirements of EL2 VA allocations for non-stack allocations, as we were otherwise wasting a lot of that precious VA space. - The usual set of non-functional cleanups, although I note the lack of spelling fixes... -----BEGIN PGP SIGNATURE----- iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmTsXrUPHG1hekBrZXJu ZWwub3JnAAoJECPQ0LrRPXpDZpIQAJUM1rNEOJ8ExYRfoG1LaTfcOm5TD6D1IWlO uCUx4xLMBudw/55HusmUSdiomQ3Xg5UdRaU7vX5OYwPbdoWebjEUfgdP3jCA/TiW mZTMv3x9hOvp+EOS/UnS469cERvg1/KfwcdOQsWL0HsCFZnu2XmQHWPD++vovLNp F1892ij875mC6C6mOR60H2nyjIiCuqWh/8eKBkp65CARCbFDYxWhqBnmcmTvoquh E87pQDPdtgXc0KlOWCABh5bYOu1WGVEXE5f3ixtdY9cQakkSI3NkFKw27/mIWS4q TCsagByNnPFDXTglb1dJopNdluLMFi1iXhRJX78R/PYaHxf4uFafWcQk1U7eDdLg 1kPANggwYe4KNAQZUvRhH7lIPWHCH0r4c1qHV+FsiOZVoDOSKHo4RW1ZFtirJSNW LNJMdk+8xyae0S7z164EpZB/tpFttX4gl3YvUT/T+4gH8+CRFAaoAlK39CoGDPpk f+P2GE1Z5YupF16YjpZtBnan55KkU1b6eORl5zpnAtoaz5WGXqj1t4qo0Q6e9WB9 X4rdDVhH7vRUmhjmSP6PuEygb84hnITLdGpkH2BmWj/4uYuCN+p+U2B2o/QdMJoo cPxdflLOU/+1gfAFYPtHVjVKCqzhwbw3iLXQpO12gzRYqE13rUnAr7RuGDf5fBVC LW7Pv81o =DKhx -----END PGP SIGNATURE----- Merge tag 'kvmarm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 6.6 - Add support for TLB range invalidation of Stage-2 page tables, avoiding unnecessary invalidations. Systems that do not implement range invalidation still rely on a full invalidation when dealing with large ranges. - Add infrastructure for forwarding traps taken from a L2 guest to the L1 guest, with L0 acting as the dispatcher, another baby step towards the full nested support. - Simplify the way we deal with the (long deprecated) 'CPU target', resulting in a much needed cleanup. - Fix another set of PMU bugs, both on the guest and host sides, as we seem to never have any shortage of those... - Relax the alignment requirements of EL2 VA allocations for non-stack allocations, as we were otherwise wasting a lot of that precious VA space. - The usual set of non-functional cleanups, although I note the lack of spelling fixes...	2023-08-31 13:18:53 -04:00
Linus Torvalds	1687d8aca5	* Rework apic callbacks, getting rid of unnecessary ones and coalescing lots of silly duplicates. * Use static_calls() instead of indirect calls for apic->foo() * Tons of cleanups an crap removal along the way -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTvfO8ACgkQaDWVMHDJ krAP2A//ccii/LuvtTnNEIMMR5w2rwTdHv91ancgFkC8pOeNk37Z8sSLq8tKuLFA vgjBIysVIqunuRcNCJ+eqwIIxYfU+UGCWHppzLwO+DY3Q7o9EoTL0BgytdAqxpQQ ntEVarqWq25QYXKFoAqbUTJ1UXa42/8HfiXAX/jvP+ACXfilkGPZre6ASxlXeOhm XbgPuNQPmXi2WYQH9GCQEsz2Nh80hKap8upK2WbQzzJ3lXsm+xA//4klab0HCYwl Uc302uVZozyXRMKbAlwmgasTFOLiV8KKriJ0oHoktBpWgkpdR9uv/RDeSaFR3DAl aFmecD4k/Hqezg4yVl+4YpEn2KjxiwARCm4PMW5AV7lpWBPBHAOOai65yJlAi9U6 bP8pM0+aIx9xg7oWfsTnQ7RkIJ+GZ0w+KZ9LXFM59iu3eV1pAJE3UVyUehe/J1q9 n8OcH0UeHRlAb8HckqVm1AC7IPvfHw4OAPtUq7z3NFDwbq6i651Tu7f+i2bj31cX 77Ames+fx6WjxUjyFbJwaK44E7Qez3waztdBfn91qw+m0b+gnKE3ieDNpJTqmm5b mKulV7KJwwS6cdqY3+Kr+pIlN+uuGAv7wGzVLcaEAXucDsVn/YAMJHY2+v97xv+n J9N+yeaYtmSXVlDsJ6dndMrTQMmcasK1CVXKxs+VYq5Lgf+A68w= =eoKm -----END PGP SIGNATURE----- Merge tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic updates from Dave Hansen: "This includes a very thorough rework of the 'struct apic' handlers. Quite a variety of them popped up over the years, especially in the 32-bit days when odd apics were much more in vogue. The end result speaks for itself, which is a removal of a ton of code and static calls to replace indirect calls. If there's any breakage here, it's likely to be around the 32-bit museum pieces that get light to no testing these days. Summary: - Rework apic callbacks, getting rid of unnecessary ones and coalescing lots of silly duplicates. - Use static_calls() instead of indirect calls for apic->foo() - Tons of cleanups an crap removal along the way" * tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) x86/apic: Turn on static calls x86/apic: Provide static call infrastructure for APIC callbacks x86/apic: Wrap IPI calls into helper functions x86/apic: Mark all hotpath APIC callback wrappers __always_inline x86/xen/apic: Mark apic __ro_after_init x86/apic: Convert other overrides to apic_update_callback() x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb() x86/apic: Provide apic_update_callback() x86/xen/apic: Use standard apic driver mechanism for Xen PV x86/apic: Provide common init infrastructure x86/apic: Wrap apic->native_eoi() into a helper x86/apic: Nuke ack_APIC_irq() x86/apic: Remove pointless arguments from [native_]eoi_write() x86/apic/noop: Tidy up the code x86/apic: Remove pointless NULL initializations x86/apic: Sanitize APIC ID range validation x86/apic: Prepare x2APIC for using apic::max_apic_id x86/apic: Simplify X2APIC ID validation x86/apic: Add max_apic_id member x86/apic: Wrap APIC ID validation into an inline ...	2023-08-30 10:44:46 -07:00
Linus Torvalds	87fa732dc5	X86 core updates: - Prevent kprobes on compiler generated CFI checking code. The compiler generates a instruction sequence for indirect call checks. If this sequence is modified with a kprobe, then the check fails. So the instructions must be protected against probing. - A few minor cleanups for the SMP code Thanks, tglx -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTvO50THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoSS/D/4xIz99FKNzowrhR9LYQpvjBLmrFY2L Hz/l+iHarNU8VE9eeTpYt0F3Ho7cUaMbC4juqxNthfjPOtkGotO6FLLOxWfkVkzz k54N2o4cUX7Hwi/HEpKwjzQ1ggLwxEUafeczvFg5aU3PlKcpzscT5todxxRrw+vz WA9JIuMlwVRwl+Jodzp+LsUGWaF1gT1aVBvFZGXFJ12GB58NMn+Nm9g77BmsYN2H jUQjsKLq32S1aU8Bq+YNIr9XWagTvDicJ617cwyzy1HivKadeBF2bj5+uJioxLFZ 8YN9PyXtICkYJ/TrOsVJ5XxuNWNpyZ2x8OwCVdYboqeooylOJTHx1pUBufNQIpAr qx2npX6NOK29QFVXiTJoP3K+4sv/G4EL6FO3Sf2J92/4HPiTlhINGCjm/UAqbJcJ 5Y3QfnIqB6ohCm7/e648ht5U+vdiNMzMAFj/jgNyqV8KBUQoXG58ygNP42V+Qq8+ Is5oDv4M1BgynUG8FaIQV3LXFHFLA6dielgdJwTy7TZry0O9rU5qieUDskD6GBVv 6ayiEqYKyXyJZw6YEzY/kDUxnpIRnqq0Lzbdn7TjH5cetnlfImc477jv8m3Um8ZP FMVe20MN79yuKPVlvkt2tCPEYttDXufQDHtDJp2WCp1/TizmGEW51sdsacw+YXrS M1uUljSmEeSM6Q== =nODz -----END PGP SIGNATURE----- Merge tag 'x86-core-2023-08-30-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 core updates from Thomas Gleixner: - Prevent kprobes on compiler generated CFI checking code. The compiler generates an instruction sequence for indirect call checks. If this sequence is modified with a kprobe, then the check fails. So the instructions must be protected against probing. - A few minor cleanups for the SMP code * tag 'x86-core-2023-08-30-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kprobes: Prohibit probing on compiler generated CFI checking code x86/smpboot: Change smp_store_boot_cpu_info() to static x86/smp: Remove a non-existent function declaration x86/smpboot: Remove a stray comment about CPU hotplug	2023-08-30 10:10:31 -07:00
Linus Torvalds	9855922705	- Remove unnecessary "INVPCID single" feature tracking - Include PAT in page protection modify mask -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTuUrwACgkQaDWVMHDJ krD7jRAAt37pNfAJLd+pJtBAtsZYlmPq1aYHSuLPQFaFebYgN8j4ekMwRNRBbgQF 6dWQXYSRMFnmJzbxBcHTkzYzR1Noh1o8U0SKUp3CNFfA3gGAq1mpoOKc7l1KjVGN x6x0+5aroT0DCtxqid0iBY38IJb/qmJ63NLGT0oJm8NZ9CTwd1UaN6KXWz1mawgk BvIY1zgMLibB2aYi2Eib2JlhQ6DWHSJMAZpMMEdPay/lr6ONlQZ3Sckjvz5huskQ ikGIvzF3L6BFDsxIYjE4uoFoImDcs4Q3gIGoqsn/Ig79mCnttgoRQ7HVFmUrVKq1 nxa+o+uqWNJjRSwbHUKX1ReyiFF5Re+7csODEnIzHr761YXWTcm94sR8jb4bCMqV QiWkzt5wcdzpAZC72gcRLqL2K3uMwm2rpxhw7az/LgDzNcdkWqFbFurGGN/3Ro6e RM9FvTIi+a40cSjc+zCNDSSwb90Oe8ZINFb9g0ta++5mFQXG7bsydwnWVq5pRY0V 5qNtWNvusW01c5GmOf0iJY7M84jegf4dzPNZcQd6XblWf5XyR+YnjCLU8g1Y3s8y H3BC8xHvgIb2Ln/XX4V6er7Ey+SS3XGumeqRn6gi3foa4DNODzbsVuIVpZAZoqyn hY4eGmwVS+OS7B+wOE44Z3hqMq4K0eXo+PsXCov4HFAbuMCtsrA= =TKMq -----END PGP SIGNATURE----- Merge tag 'x86_mm_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Dave Hansen: "A pair of small x86/mm updates. The INVPCID one is purely a cleanup. The PAT one fixes a real issue, albeit a relatively obscure one (graphics device passthrough under Xen). The fix also makes the code much more readable. Summary: - Remove unnecessary "INVPCID single" feature tracking - Include PAT in page protection modify mask" * tag 'x86_mm_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Remove "INVPCID single" feature tracking x86/mm: Fix PAT bit missing from page protection modify mask	2023-08-30 09:54:00 -07:00
Mateusz Guzik	ca96b162bf	x86: bring back rep movsq for user access on CPUs without ERMS Intel CPUs ship with ERMS for over a decade, but this is not true for AMD. In particular one reasonably recent uarch (EPYC 7R13) does not have it (or at least the bit is inactive when running on the Amazon EC2 cloud -- I found rather conflicting information about AMD CPUs vs the extension). Hand-rolled mov loops executing in this case are quite pessimal compared to rep movsq for bigger sizes. While the upper limit depends on uarch, everyone is well south of 1KB AFAICS and sizes bigger than that are common. While technically ancient CPUs may be suffering from rep usage, gcc has been emitting it for years all over kernel code, so I don't think this is a legitimate concern. Sample result from read1_processes from will-it-scale (4KB reads/s): before: 1507021 after: 1721828 (+14%) Note that the cutoff point for rep usage is set to 64 bytes, which is way too conservative but I'm sticking to what was done in `47ee3f1dd9` ("x86: re-introduce support for ERMS copies for user space accesses"). That is to say some copies will now go slower, which is fixable but beyond the scope of this patch. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-08-30 09:45:12 -07:00
Justin Stitt	e8f13e061d	x86/audit: Fix -Wmissing-variable-declarations warning for ia32_xyz_class When building x86 defconfig with Clang-18 I get the following warnings: \| arch/x86/ia32/audit.c:6:10: warning: no previous extern declaration for non-static variable 'ia32_dir_class' [-Wmissing-variable-declarations] \| 6 \| unsigned ia32_dir_class[] = { \| arch/x86/ia32/audit.c:11:10: warning: no previous extern declaration for non-static variable 'ia32_chattr_class' [-Wmissing-variable-declarations] \| 11 \| unsigned ia32_chattr_class[] = { \| arch/x86/ia32/audit.c:16:10: warning: no previous extern declaration for non-static variable 'ia32_write_class' [-Wmissing-variable-declarations] \| 16 \| unsigned ia32_write_class[] = { \| arch/x86/ia32/audit.c:21:10: warning: no previous extern declaration for non-static variable 'ia32_read_class' [-Wmissing-variable-declarations] \| 21 \| unsigned ia32_read_class[] = { \| arch/x86/ia32/audit.c:26:10: warning: no previous extern declaration for non-static variable 'ia32_signal_class' [-Wmissing-variable-declarations] \| 26 \| unsigned ia32_signal_class[] = { These warnings occur due to their respective extern declarations being scoped inside of audit_classes_init as well as only being enabled with `CONFIG_IA32_EMULATION=y`: \| static int __init audit_classes_init(void) \| { \| #ifdef CONFIG_IA32_EMULATION \| extern __u32 ia32_dir_class[]; \| extern __u32 ia32_write_class[]; \| extern __u32 ia32_read_class[]; \| extern __u32 ia32_chattr_class[]; \| audit_register_class(AUDIT_CLASS_WRITE_32, ia32_write_class); \| audit_register_class(AUDIT_CLASS_READ_32, ia32_read_class); \| audit_register_class(AUDIT_CLASS_DIR_WRITE_32, ia32_dir_class); \| audit_register_class(AUDIT_CLASS_CHATTR_32, ia32_chattr_class); \| #endif \| audit_register_class(AUDIT_CLASS_WRITE, write_class); \| audit_register_class(AUDIT_CLASS_READ, read_class); \| audit_register_class(AUDIT_CLASS_DIR_WRITE, dir_class); \| audit_register_class(AUDIT_CLASS_CHATTR, chattr_class); \| return 0; \| } Lift the extern declarations to their own header and resolve scoping issues (and thus fix the warnings). Moreover, change __u32 to unsigned so that we match the definitions: \| unsigned ia32_dir_class[] = { \| #include <asm-generic/audit_dir_write.h> \| ~0U \| }; \| \| unsigned ia32_chattr_class[] = { \| #include <asm-generic/audit_change_attr.h> \| ~0U \| }; \| ... This patch is similar to commit: `0e5e3d4461` ("x86/audit: Fix a -Wmissing-prototypes warning for ia32_classify_syscall()") [1] Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/all/20200516123816.2680-1-b.thiel@posteo.de/ [1] Link: https://github.com/ClangBuiltLinux/linux/issues/1920 Link: https://lore.kernel.org/r/20230829-missingvardecl-audit-v1-1-34efeb7f3539@google.com	2023-08-30 10:11:16 +02:00
Linus Torvalds	6c1b980a7e	dma-maping updates for Linux 6.6 - allow dynamic sizing of the swiotlb buffer, to cater for secure virtualization workloads that require all I/O to be bounce buffered (Petr Tesarik) - move a declaration to a header (Arnd Bergmann) - check for memory region overlap in dma-contiguous (Binglei Wang) - remove the somewhat dangerous runtime swiotlb-xen enablement and unexport is_swiotlb_active (Christoph Hellwig, Juergen Gross) - per-node CMA improvements (Yajun Deng) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmTuDHkLHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYOqvhAApMk2/ceTgVH17sXaKE822+xKvgv377O6TlggMeGG W4zA0KD69DNz0AfaaCc5U5f7n8Ld/YY1RsvkHW4b3jgw+KRTeQr0jjitBgP5kP2M A1+qxdyJpCTwiPt9s2+JFVPeyZ0s52V6OJODKRG3s0ore55R+U09VySKtASON+q3 GMKfWqQteKC+thg7NkrQ7JUixuo84oICws+rZn4K9ifsX2O0HYW6aMW0feRfZjJH r0TgqZc4RdPTSaF22oapR9Ls39+7hp/pBvoLm5sBNA3cl5C3X4VWo9ERMU1jW9h+ VYQv39NycUspgskWJmpbU06/+ooYqQlwHSR/vdNusmFIvxo4tf6/UX72YO5F8Dar ap0wYGauiEwTjSnhVxPTXk3obWyWEsgFAeRnPdTlH2CNmv38QZU2HLb8eU1pcXxX j+WI2Ewy9z22uBVYiPOKpdW1jkSfmlmfPp/8SbAdua7I3YQ90rQN6AvU06zAi/cL NQTgO81E4jPkygqAVgS/LeYziWAQ73yM7m9ExThtTgqFtHortwhJ4Fd8XKtvtvEb viXAZ/WZtQBv/CIKAW98NhgIDP/SPOT8ym6V35WK+kkNFMS6LMSQUfl9GgbHGyFa n9icMm7BmbDtT1+AKNafG9En4DtAf9M9QNidAVOyfrsIk6S0gZoZwvIStkA7on8a cNY= =kVVr -----END PGP SIGNATURE----- Merge tag 'dma-mapping-6.6-2023-08-29' of git://git.infradead.org/users/hch/dma-mapping Pull dma-maping updates from Christoph Hellwig: - allow dynamic sizing of the swiotlb buffer, to cater for secure virtualization workloads that require all I/O to be bounce buffered (Petr Tesarik) - move a declaration to a header (Arnd Bergmann) - check for memory region overlap in dma-contiguous (Binglei Wang) - remove the somewhat dangerous runtime swiotlb-xen enablement and unexport is_swiotlb_active (Christoph Hellwig, Juergen Gross) - per-node CMA improvements (Yajun Deng) * tag 'dma-mapping-6.6-2023-08-29' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: optimize get_max_slots() swiotlb: move slot allocation explanation comment where it belongs swiotlb: search the software IO TLB only if the device makes use of it swiotlb: allocate a new memory pool when existing pools are full swiotlb: determine potential physical address limit swiotlb: if swiotlb is full, fall back to a transient memory pool swiotlb: add a flag whether SWIOTLB is allowed to grow swiotlb: separate memory pool data from other allocator data swiotlb: add documentation and rename swiotlb_do_find_slots() swiotlb: make io_tlb_default_mem local to swiotlb.c swiotlb: bail out of swiotlb_init_late() if swiotlb is already allocated dma-contiguous: check for memory region overlap dma-contiguous: support numa CMA for specified node dma-contiguous: support per-numa CMA for all architectures dma-mapping: move arch_dma_set_mask() declaration to header swiotlb: unexport is_swiotlb_active x86: always initialize xen-swiotlb when xen-pcifront is enabling xen/pci: add flag for PCI passthrough being possible	2023-08-29 20:32:10 -07:00
Linus Torvalds	d68b4b6f30	- An extensive rework of kexec and crash Kconfig from Eric DeVolder ("refactor Kconfig to consolidate KEXEC and CRASH options"). - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a couple of macros to args.h"). - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper commands"). - vsprintf inclusion rationalization from Andy Shevchenko ("lib/vsprintf: Rework header inclusions"). - Switch the handling of kdump from a udev scheme to in-kernel handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory hot un/plug"). - Many singleton patches to various parts of the tree -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO2GpAAKCRDdBJ7gKXxA juW3AQD1moHzlSN6x9I3tjm5TWWNYFoFL8af7wXDJspp/DWH/AD/TO0XlWWhhbYy QHy7lL0Syha38kKLMXTM+bN6YQHi9AU= =WJQa -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - An extensive rework of kexec and crash Kconfig from Eric DeVolder ("refactor Kconfig to consolidate KEXEC and CRASH options") - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a couple of macros to args.h") - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper commands") - vsprintf inclusion rationalization from Andy Shevchenko ("lib/vsprintf: Rework header inclusions") - Switch the handling of kdump from a udev scheme to in-kernel handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory hot un/plug") - Many singleton patches to various parts of the tree * tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits) document while_each_thread(), change first_tid() to use for_each_thread() drivers/char/mem.c: shrink character device's devlist[] array x86/crash: optimize CPU changes crash: change crash_prepare_elf64_headers() to for_each_possible_cpu() crash: hotplug support for kexec_load() x86/crash: add x86 crash hotplug support crash: memory and CPU hotplug sysfs attributes kexec: exclude elfcorehdr from the segment digest crash: add generic infrastructure for crash hotplug support crash: move a few code bits to setup support of crash hotplug kstrtox: consistently use _tolower() kill do_each_thread() nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse scripts/bloat-o-meter: count weak symbol sizes treewide: drop CONFIG_EMBEDDED lockdep: fix static memory detection even more lib/vsprintf: declare no_hash_pointers in sprintf.h lib/vsprintf: split out sprintf() and friends kernel/fork: stop playing lockless games for exe_file replacement adfs: delete unused "union adfs_dirtail" definition ...	2023-08-29 14:53:51 -07:00
Linus Torvalds	b96a3e9142	- Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6 Nfio7afS1ncD+hPYT8947UnLxTgn+ww= =Afws -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...	2023-08-29 14:25:26 -07:00
Linus Torvalds	330235e874	ACPI updates for 6.6-rc1 - Update the ACPICA code in the kernel to upstream revision 20230628 including the following changes: * Suppress a GCC 12 dangling-pointer warning (Philip Prindeville). * Reformat the ACPI_STATE_COMMON macro and its users (George Guo). * Replace the ternary operator with ACPI_MIN() (Jiangshan Yi). * Add support for _DSC as per ACPI 6.5 (Saket Dumbre). * Remove a duplicate macro from zephyr header (Najumon B.A). * Add data structures for GED and _EVT tracking (Jose Marinho). * Fix misspelled CDAT DSMAS define (Dave Jiang). * Simplify an error message in acpi_ds_result_push() (Christophe Jaillet). * Add a struct size macro related to SRAT (Dave Jiang). * Add AML_NO_OPERAND_RESOLVE flag to Timer (Abhishek Mainkar). * Add support for RISC-V external interrupt controllers in MADT (Sunil V L). * Add RHCT flags, CMO and MMU nodes (Sunil V L). * Change ACPICA version to 20230628 (Bob Moore). - Introduce new wrappers for ACPICA notify handler install/remove and convert multiple drivers to using their own Notify() handlers instead of the ACPI bus type .notify() slated for removal (Michal Wilczynski). - Add backlight=native DMI quirk for Apple iMac12,1 and iMac12,2 (Hans de Goede). - Put ACPI video and its child devices explicitly into D0 on boot to avoid platform firmware confusion (Kai-Heng Feng). - Add backlight=native DMI quirk for Lenovo Ideapad Z470 (Jiri Slaby). - Support obtaining physical CPU ID from MADT on LoongArch (Bibo Mao). - Convert ACPI CPU initialization to using _OSC instead of _PDC that has been depreceted since 2018 and dropped from the specification in ACPI 6.5 (Michal Wilczynski, Rafael Wysocki). - Drop non-functional nocrt parameter from ACPI thermal (Mario Limonciello). - Clean up the ACPI thermal driver, rework the handling of firmware notifications in it and make it provide a table of generic trip point structures to the core during initialization (Rafael Wysocki). - Defer enumeration of devices with _DEP pointing to IVSC (Wentong Wu). - Install SystemCMOS address space handler for ACPI000E (TAD) to meet platform firmware expectations on some platforms (Zhang Rui). - Fix finding the generic error data in the ACPi extlog driver for compatibility with old and new firmware interface versions (Xiaochun Lee). - Remove assorted unused declarations of functions (Yue Haibing). - Move AMBA bus scan handling into arm64 specific directory (Sudeep Holla). - Fix and clean up suspend-to-idle interface for AMD systems (Mario Limonciello, Andy Shevchenko). - Fix string truncation warning in pnpacpi_add_device() (Sunil V L). -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmTslAkSHHJqd0Byand5 c29ja2kubmV0AAoJEILEb/54YlRxtvgQAIsYAHXgalMFCLTqxeRZ8IMMLh/oxVfp RDbZ7lsBNzhYKXNv0jczjDApvfOctplh+qVuRIyFPrVrD+xzPgI9eFaZx41B3UPy 3UG/KphjtF+sAu8rSFM2qcL2P+NGWe1Okr2G8Mbvu6bkGjfHShX6HGR7YgIyrZVO VMVLWqLvjvvg/jkraozArTtlvMwTrpBFxgWQVPK9TLRwx+JfB7WV6Z1XkgbllpR5 8Q+O9wyVWfAIrLk/HZJdYNkdwREi46d226ePs6tJkwzE8KCUna0U4WbVVPRjbSfI Mkpy7Bnn91EZdN43/3ZzPBNO6NJi7UsD46EwRMzf1v7KyfsttnKw/v/SMke6gKQW 43gi0trdbVxEdnN33CmBb2k82YQ5tjVuNOwKNy4xvFZ4vTE+WTz2imXMdW28biLT sU1eYJXPzBUQ4Ja2x56a1DOp2C1uOcSHbTKNzmtFsmT2FIWOKN+9PrOjaFEn7cyU FwWSzcONLE/QC0cSr7cWYym06aY8INtAlCm0hrlpwBDyiVDACz/QZ3NQAmbXKr5z 5JIDQ+8v6YvpIx48yDlTYaQPMEskkZ2udkmwZCh6Vs0fMGyo3DDWvAIltPmHuIoj KekfDK5pkJjftK67IlHioAkS3KGHy4VTybi4Sx8KjYy9Q4GPQ4TL2h18kInXGu4f tv7U9J6zx9mJ =aXc6 -----END PGP SIGNATURE----- Merge tag 'acpi-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI updates from Rafael Wysocki: "These include new ACPICA material, a rework of the ACPI thermal driver, a switch-over of the ACPI processor driver to using _OSC instead of (long deprecated) _PDC for CPU initialization, a rework of firmware notifications handling in several drivers, fixes and cleanups for suspend-to-idle handling on AMD systems, ACPI backlight driver updates and more. Specifics: - Update the ACPICA code in the kernel to upstream revision 20230628 including the following changes: - Suppress a GCC 12 dangling-pointer warning (Philip Prindeville) - Reformat the ACPI_STATE_COMMON macro and its users (George Guo) - Replace the ternary operator with ACPI_MIN() (Jiangshan Yi) - Add support for _DSC as per ACPI 6.5 (Saket Dumbre) - Remove a duplicate macro from zephyr header (Najumon B.A) - Add data structures for GED and _EVT tracking (Jose Marinho) - Fix misspelled CDAT DSMAS define (Dave Jiang) - Simplify an error message in acpi_ds_result_push() (Christophe Jaillet) - Add a struct size macro related to SRAT (Dave Jiang) - Add AML_NO_OPERAND_RESOLVE flag to Timer (Abhishek Mainkar) - Add support for RISC-V external interrupt controllers in MADT (Sunil V L) - Add RHCT flags, CMO and MMU nodes (Sunil V L) - Change ACPICA version to 20230628 (Bob Moore) - Introduce new wrappers for ACPICA notify handler install/remove and convert multiple drivers to using their own Notify() handlers instead of the ACPI bus type .notify() slated for removal (Michal Wilczynski) - Add backlight=native DMI quirk for Apple iMac12,1 and iMac12,2 (Hans de Goede) - Put ACPI video and its child devices explicitly into D0 on boot to avoid platform firmware confusion (Kai-Heng Feng) - Add backlight=native DMI quirk for Lenovo Ideapad Z470 (Jiri Slaby) - Support obtaining physical CPU ID from MADT on LoongArch (Bibo Mao) - Convert ACPI CPU initialization to using _OSC instead of _PDC that has been depreceted since 2018 and dropped from the specification in ACPI 6.5 (Michal Wilczynski, Rafael Wysocki) - Drop non-functional nocrt parameter from ACPI thermal (Mario Limonciello) - Clean up the ACPI thermal driver, rework the handling of firmware notifications in it and make it provide a table of generic trip point structures to the core during initialization (Rafael Wysocki) - Defer enumeration of devices with _DEP pointing to IVSC (Wentong Wu) - Install SystemCMOS address space handler for ACPI000E (TAD) to meet platform firmware expectations on some platforms (Zhang Rui) - Fix finding the generic error data in the ACPi extlog driver for compatibility with old and new firmware interface versions (Xiaochun Lee) - Remove assorted unused declarations of functions (Yue Haibing) - Move AMBA bus scan handling into arm64 specific directory (Sudeep Holla) - Fix and clean up suspend-to-idle interface for AMD systems (Mario Limonciello, Andy Shevchenko) - Fix string truncation warning in pnpacpi_add_device() (Sunil V L)" * tag 'acpi-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (66 commits) ACPI: x86: s2idle: Add a function to get LPS0 constraint for a device ACPI: x86: s2idle: Add for_each_lpi_constraint() helper ACPI: x86: s2idle: Add more debugging for AMD constraints parsing ACPI: x86: s2idle: Fix a logic error parsing AMD constraints table ACPI: x86: s2idle: Catch multiple ACPI_TYPE_PACKAGE objects ACPI: x86: s2idle: Post-increment variables when getting constraints ACPI: Adjust #ifdef for *_lps0_dev use ACPI: TAD: Install SystemCMOS address space handler for ACPI000E ACPI: Remove assorted unused declarations of functions ACPI: extlog: Fix finding the generic error data for v3 structure PNP: ACPI: Fix string truncation warning ACPI: Remove unused extern declaration acpi_paddr_to_node() ACPI: video: Add backlight=native DMI quirk for Apple iMac12,1 and iMac12,2 ACPI: video: Put ACPI video and its child devices into D0 on boot ACPI: processor: LoongArch: Get physical ID from MADT ACPI: scan: Defer enumeration of devices with a _DEP pointing to IVSC device ACPI: thermal: Eliminate code duplication from acpi_thermal_notify() ACPI: thermal: Drop unnecessary thermal zone callbacks ACPI: thermal: Rework thermal_get_trend() ACPI: thermal: Use trip point table to register thermal zones ...	2023-08-28 17:58:39 -07:00
Linus Torvalds	6383cb42ac	xen: branch for v6.6-rc1 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZOdp9QAKCRCAXGG7T9hj vii+AQDgY+/PhND0pS40yKHPZpfbe1dGdNlbuD0ktumHQ3a5nAEAu6WNFEEG8NPx Unopygv7irUdytK4RY+DPgB7FKRCuA4= =XrQ0 -----END PGP SIGNATURE----- Merge tag 'for-linus-6.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - a bunch of minor cleanups - a patch adding irqfd support for virtio backends running as user daemon supporting Xen guests * tag 'for-linus-6.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: privcmd: Add support for irqfd xen/xenbus: Avoid a lockdep warning when adding a watch xen: Fix one kernel-doc comment xen: xenbus: Use helper function IS_ERR_OR_NULL() xen: Switch to use kmemdup() helper xen-pciback: Remove unused function declarations x86/xen: Make virt_to_pfn() a static inline xen: remove a confusing comment on auto-translated guest I/O xen/evtchn: Remove unused function declaration xen_set_affinity_evtchn()	2023-08-28 17:41:30 -07:00
Linus Torvalds	97efd28334	Misc x86 cleanups. The following commit deserves special mention: `22dc02f81c` Revert "sched/fair: Move unused stub functions to header" This is in x86/cleanups, because the revert is a re-application of a number of cleanups that got removed inadvertedly. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDkoRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jCMw//UvQGM8yxsTa57r0/ZpJHS2++P5pJxOsz 45kBb3aBiDV6idArce4EHpthp3MvF3Pycibp9w0qg//NOtIHTKeagXv52abxsu1W hmS6gXJZDXZvjO1BFaUlmv97iYtzGfKnQppj32C4tMr9SaP49h3KvOHH1Z8CR3mP 1nZaJJwYIi2qBh7msnmLGG+F0drb85O/dfHdoLX6iVJw9UP4n5nu9u8u1E0iC7J7 2GC6AwP60A0EBRTK9EHQQEYwy9uvdS/TG5f2Qk1VP87KA9TTocs8MyapMG4DQu79 hZKVEGuVQAlV3rYe9cJBNpDx1mTu3rmuMH0G71KEe3T6UcG5QRUiAPm8UfA9prPD uWjY4zm5o0W3tUio4V1MqqiLFIaBU63WmTY9RyM0QH8Ms8r8GugWKmnrTIuHfEC3 9D+Uhyb5d8ID6qFGLTOvPm0g+v64lnH71qq83PcVJgsmZvUb2XvFA3d/A0h9JzLT 2In/yfU10UsLUFTiNRyAgcLccjaGhliDB2oke9Kp0OyOTSQRcWmiq8kByVxCPImP auOWWcNXjcuOgjlnziEkMTDuRY12MgUB2If4zhELvdEFibIaaNW5sNCbY2msWaN1 CUD7fcj0L3HZvzujUm72l5hxL2brJMuPwVNJfuOe4T8wzy569d6VJULrd1URBM1B vfaPs1Dz46Q= =kiAA -----END PGP SIGNATURE----- Merge tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 cleanups from Ingo Molnar: "The following commit deserves special mention: `22dc02f81c` Revert "sched/fair: Move unused stub functions to header" This is in x86/cleanups, because the revert is a re-application of a number of cleanups that got removed inadvertedly" [ This also effectively undoes the amd_check_microcode() microcode declaration change I had done in my microcode loader merge in commit `42a7f6e3ff` ("Merge tag 'x86_microcode_for_v6.6_rc1' [...]"). I picked the declaration change by Arnd from this branch instead, which put it in <asm/processor.h> instead of <asm/microcode.h> like I had done in my merge resolution - Linus ] * tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/platform/uv: Refactor code using deprecated strncpy() interface to use strscpy() x86/hpet: Refactor code using deprecated strncpy() interface to use strscpy() x86/platform/uv: Refactor code using deprecated strcpy()/strncpy() interfaces to use strscpy() x86/qspinlock-paravirt: Fix missing-prototype warning x86/paravirt: Silence unused native_pv_lock_init() function warning x86/alternative: Add a __alt_reloc_selftest() prototype x86/purgatory: Include header for warn() declaration x86/asm: Avoid unneeded __div64_32 function definition Revert "sched/fair: Move unused stub functions to header" x86/apic: Hide unused safe_smp_processor_id() on 32-bit UP x86/cpu: Fix amd_check_microcode() declaration	2023-08-28 17:05:58 -07:00
Linus Torvalds	1a7c611546	Perf events changes for v6.6: - AMD IBS improvements - Intel PMU driver updates - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events - Micro-optimize software events and the ring-buffer code - Misc cleanups & fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtBscRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hHoQ/+IBQ8Xi/rcdd40n8OqEB/VBWVuSjNT3uN 3pHHcTl2Pio9CxBeat42NekNijlRILCKJrZ3Lt3JWBmWyWv5l3KFabelj+lDF2xa TVCjTnQNe1+HvrODYnF4ECIs5vaoMVjcJ9jg8+VDgAcOQr1nZs4m5TVAd6TLqPpV urBEQVULkkzk7ZRhfrugKhw+wrpWFefgGCx0RV8ijZB7TLMHc2wE+Q/sTxKdKceL wNaJaDgV33pZh0aImwR9pKUE532hF1FiBdLuehkh61PZa1L82jzAX1xjw2s1hSa4 eIWemPHJIYfivRlENbJsDWc4N8gk6ijVHwrxGcr4Axu+NN+zPtQ3ddhaGMAyKdTo qUKXH3MZSMIl++jI5Fkc6xM+XLvY1rML62epSzMwu/cc7Z5MeyWdQcri0N9YFuO7 wUUNnFpU00lwQBLbyyUQ3Zi8E0QV7NuPW4axTkmntiIjMpLagaEvVSf6nf8qLpbE WTT16s707t19hUZNazNZ7ONmhly4ALbHFQEH65J2KoYn99fYqy9z68Hwk+xnmykw bc3qvfhpw0MImQQ+DqHiBwb4n4UuvY2WlkkZI3FfNeSG63DaM2mZikfpElpXYjn6 9iOIXvx21Wiq/n0cbLhidI2q/ZzFCzYLCk6ikZ320wb+rhvd7EoSlZil6QSzn3pH Qdk+NEZgWQY= =ZT6+ -----END PGP SIGNATURE----- Merge tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event updates from Ingo Molnar: - AMD IBS improvements - Intel PMU driver updates - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events - Micro-optimize software events and the ring-buffer code - Misc cleanups & fixes * tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/uncore: Remove unnecessary ?: operator around pcibios_err_to_errno() call perf/x86/intel: Add Crestmont PMU x86/cpu: Update Hybrids x86/cpu: Fix Crestmont uarch x86/cpu: Fix Gracemont uarch perf: Remove unused extern declaration arch_perf_get_page_size() perf: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability arm_pmu: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability perf/x86: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability perf/x86/ibs: Set mem_lvl_num, mem_remote and mem_hops for data_src perf/mem: Add PERF_MEM_LVLNUM_NA to PERF_MEM_NA perf/mem: Introduce PERF_MEM_LVLNUM_UNC perf/ring_buffer: Use local_try_cmpxchg in __perf_output_begin locking/arch: Avoid variable shadowing in local_try_cmpxchg() perf/core: Use local64_try_cmpxchg in perf_swevent_set_period perf/x86: Use local64_try_cmpxchg perf/amd: Prevent grouping of IBS events	2023-08-28 16:35:01 -07:00
Linus Torvalds	d7dd9b449f	EFI updates for v6.6 - one bugfix for x86 mixed mode that did not make it into v6.5 - first pass of cleanup for the EFI runtime wrappers - some cosmetic touchups -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZOx+LAAKCRAwbglWLn0t XKzLAPwK8wIZ14NlC55NCtdvKEzK3N/muxKRqg2MAHfrbnREFgD9GgUxSbIEU2gz BXQM9GLaP86qCXkZlPYktjP6RVfDYAk= =e2mx -----END PGP SIGNATURE----- Merge tag 'efi-next-for-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI updates from Ard Biesheuvel: "This primarily covers some cleanup work on the EFI runtime wrappers, which are shared between all EFI architectures except Itanium, and which provide some level of isolation to prevent faults occurring in the firmware code (which runs at the same privilege level as the kernel) from bringing down the system. Beyond that, there is a fix that did not make it into v6.5, and some doc fixes and dead code cleanup. - one bugfix for x86 mixed mode that did not make it into v6.5 - first pass of cleanup for the EFI runtime wrappers - some cosmetic touchups" * tag 'efi-next-for-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: x86/efistub: Fix PCI ROM preservation in mixed mode efi/runtime-wrappers: Clean up white space and add __init annotation acpi/prmt: Use EFI runtime sandbox to invoke PRM handlers efi/runtime-wrappers: Don't duplicate setup/teardown code efi/runtime-wrappers: Remove duplicated macro for service returning void efi/runtime-wrapper: Move workqueue manipulation out of line efi/runtime-wrappers: Use type safe encapsulation of call arguments efi/riscv: Move EFI runtime call setup/teardown helpers out of line efi/arm64: Move EFI runtime call setup/teardown helpers out of line efi/riscv: libstub: Fix comment about absolute relocation efi: memmap: Remove kernel-doc warnings efi: Remove unused extern declaration efi_lookup_mapped_addr()	2023-08-28 16:25:45 -07:00
Linus Torvalds	42a7f6e3ff	- The first, cleanup part of the microcode loader reorg tglx has been working on. This part makes the loader core code as it is practically enabled on pretty much every baremetal machine so there's no need to have the Kconfig items. In addition, there are cleanups which prepare for future feature enablement. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTskfcACgkQEsHwGGHe VUo/hBAAiqVqdc0WASHgVYoO9mD1M2T3oHFz8ceX2pGKf/raZ5UJyit7ybWEZEWG rWXFORRlqOKoKImQm6JhyJsu29Xmi9sTb1WNwEyT8YMdhx8v57hOch3alX7sm2BF 9eOl/77hxt7Pt8keN6gY5w5cydEgBvi8bVe8sfU3hJMwieAMH0q5syRx7fDizcVd qoTicHRjfj5Q8BL5NXtdPEMENYPyV89DVjnUM1HVPpCkoHxmOujewgjs4gY7PsGp qAGB1+IG3aqOWHM9SDIJp5U9tNX2huhqRTcZsNEe8qHTXCv8F8zRzK0J8giM1wed 5aAGC4AfMh/gjryXeMj1nnwoAf5TQw4dK+y+BYXIykdQDV1up+HdDtjrBmJ5Kslf n/8uGLdLLqMQVEE2hT3r1Ft1RqVf3UwWOxzc+KASjKCj0F5djt+F2JDNGoN0sMD9 JDj3Dtvo2e1+aZlcvXWSmMCVMT0By1mbFqirEXT3i1sYwHDx23s+KwY71CdT8gx8 nbEWsaADPRynNbTAI5Z5TFq0Cohs9AUNuotRHYCc0Et5NBlzoN8yAKNW39twUDEt a/Knq1Vnybrp18pE/rDphm+p/K261OuEXfFFVTASSlvgMnVM0UAZZZka7A0DmN+g mvZ2A9hByFk6sELm3QeNrOdex8djeichyY7+EQ13K25wMd/YsX0= =QXDh -----END PGP SIGNATURE----- Merge tag 'x86_microcode_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode loading updates from Borislav Petkov: "The first, cleanup part of the microcode loader reorg tglx has been working on. The other part wasn't fully ready in time so it will follow on later. This part makes the loader core code as it is practically enabled on pretty much every baremetal machine so there's no need to have the Kconfig items. In addition, there are cleanups which prepare for future feature enablement" * tag 'x86_microcode_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode: Remove remaining references to CONFIG_MICROCODE_AMD x86/microcode/intel: Remove pointless mutex x86/microcode/intel: Remove debug code x86/microcode: Move core specific defines to local header x86/microcode/intel: Rename get_datasize() since its used externally x86/microcode: Make reload_early_microcode() static x86/microcode: Include vendor headers into microcode.h x86/microcode/intel: Move microcode functions out of cpu/intel.c x86/microcode: Hide the config knob x86/mm: Remove unused microcode.h include x86/microcode: Remove microcode_mutex x86/microcode/AMD: Rip out static buffers	2023-08-28 15:55:20 -07:00
Linus Torvalds	f31f663fa9	- Handle the case where the beginning virtual address of the address range whose SEV encryption status needs to change, is not page aligned so that callers which round up the number of pages to be decrypted, would mark a trailing page as decrypted and thus cause corruption during live migration. - Return an error from the #VC handler on AMD SEV-* guests when the debug registers swapping is enabled as a DR7 access should not happen then - that register is guest/host switched. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTsje4ACgkQEsHwGGHe VUrKWxAAqiTpQSjJCB32ReioSsLv3kl7vtLO3xtE42VpF0F7pAAPzRsh+bgDjGSM uqcEgbX1YtPlb8wK6yh5dyNLLvtzxhaAQkUfbfuEN2oqbvIEcJmhWAm/xw1yCsh2 GDphFPtvqgT4KUCkEHj8tC9eQzG+L0bwymPzqXooVDnm4rL0ulEl6ONffhHfJFVg bmL8UjmJNodFcO6YBfosQIDDfc4ayuwm9f/rGltNFl+jwCi62kMJaVdU1112agsV LE73DRoRpfHKLslj9o9ubRcvaHKS24y2Amflnj1tas0h8I2uXBRwIgxjQXl5vtXV pu5/5VHM9X13x8XKpKVkEohXkBzFRigs8yfHq+JlpyWXXB/ymW8Acbqqnvll12r4 JSy+XfBNa6V5Y/NDS/1faJiX6XSi5ZyZHZG70sf52XVoBYhzoms5kxqTJnHHisnY X50677/tQF3V9WsmKD0aj0Um2ztiq0/TNMI7FT3lzYRDNJb1ln3ZK9f04i8L5jA4 bsrSV5oCVpLkW4eQaAJwxttTB+dRb5MwwkeS7D/eTuJ1pgUmJMIbZp2YbJH7NP2F 6FShQdwHi8KYN7mxUM+WwOk7goaBm5L61w5UtRlt6aDE7LdEbMAeSSdmD3HlEZHR ntBqcEx4SkAT+Ru0izVXjsoWmtkn8+DY44oUC2X6eZxUSAT4Cm4= =td9F -----END PGP SIGNATURE----- Merge tag 'x86_sev_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Handle the case where the beginning virtual address of the address range whose SEV encryption status needs to change, is not page aligned so that callers which round up the number of pages to be decrypted, would mark a trailing page as decrypted and thus cause corruption during live migration. - Return an error from the #VC handler on AMD SEV-* guests when the debug registers swapping is enabled as a DR7 access should not happen then - that register is guest/host switched. * tag 'x86_sev_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Make enc_dec_hypercall() accept a size instead of npages x86/sev: Do not handle #VC for DR7 read/write	2023-08-28 15:28:54 -07:00
Linus Torvalds	bd9e99f790	- Avoid the baremetal decompressor code when booting on an EFI machine. This is mandated by the current tightening of EFI executables requirements when used in a secure boot scenario. More specifically, an EFI executable cannot have a single section with RWX permissions, which conflicts with the in-place kernel decompression that is done today. Instead, the things required by the booting kernel image are done in the EFI stub now. Work by Ard Biesheuvel. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTsMGAACgkQEsHwGGHe VUqqGg/+LJq7dapnyMiqafO9Xdpm+h7wTylChdfev7UB3wri9IJGXXFkwZZUNF8Z p2c/ECk75CG+z7ZSv+WirZLWlmlBT1D51YjI1lHmg3Vmh3YXZWJHkwaG1DI4euL7 C4gvd/gmpXGMOxjZLV/D5d7wwoO2REyVkDeCy0lqHCisvd4GRXqUJHe6hJ3GxVv/ VqbT7ZHMP4BlbWK4/yTq9+wKSC9T0OfAOW2b+K/SAh19TpRqFBpkMOxaZx0nj2Eh UxQUd2hntGtEGpPf71HPQMYCkWyvJD1LpTYfwld7tNvzXSt2xg2YZAf7WzJ0Od3b kxg2mItHhVRCcwEc9OWlphvgAYCzYECzeMniesLFeJYLMvHNtORzxNPkDLrOYjT6 0K05DEWJlEqZhIWtaKJ/SSCj7JnfmgubfK3tG3cerCelQZIvUix6ysvDBwPnqdUr 3O1d7iMlM1UtqMo0GLeIp0q1aRKMsNfRtUvqjaiQ2NjSuyJqZl8t0jfKd6DHcnuy WidBMhREULa5qYGE9dxosdTJK1wOhqomTmBSC8EeDUPevLV+DNQUn+oBOVhFvJVk gpmsr2mi9+hwC9iNh1K4wslLl9Jc/BlIY0wwC0ysh6eTWQfrvpfXrtHRpmTvwcVW g2lt+shYiPDa7q3bvqQL7iqFU6bkMGCs89EFGJEpNbBO076uNZY= =2Oej -----END PGP SIGNATURE----- Merge tag 'x86_boot_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Borislav Petkov: "Avoid the baremetal decompressor code when booting on an EFI machine. This is mandated by the current tightening of EFI executables requirements when used in a secure boot scenario. More specifically, an EFI executable cannot have a single section with RWX permissions, which conflicts with the in-place kernel decompression that is done today. Instead, the things required by the booting kernel image are done in the EFI stub now. Work by Ard Biesheuvel" * tag 'x86_boot_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/efistub: Avoid legacy decompressor when doing EFI boot x86/efistub: Perform SNP feature test while running in the firmware efi/libstub: Add limit argument to efi_random_alloc() x86/decompressor: Factor out kernel decompression and relocation x86/decompressor: Move global symbol references to C code decompress: Use 8 byte alignment x86/efistub: Prefer EFI memory attributes protocol over DXE services x86/efistub: Perform 4/5 level paging switch from the stub x86/decompressor: Merge trampoline cleanup with switching code x86/decompressor: Pass pgtable address to trampoline directly x86/decompressor: Only call the trampoline when changing paging levels x86/decompressor: Call trampoline directly from C code x86/decompressor: Avoid the need for a stack in the 32-bit trampoline x86/decompressor: Use standard calling convention for trampoline x86/decompressor: Call trampoline as a normal function x86/decompressor: Assign paging related global variables earlier x86/decompressor: Store boot_params pointer in callee save register x86/efistub: Clear BSS in EFI handover protocol entrypoint x86/decompressor: Avoid magic offsets for EFI handover entrypoint x86/efistub: Simplify and clean up handover entry code ...	2023-08-28 15:15:37 -07:00
Linus Torvalds	6f49693a6c	Updates for the CPU hotplug core: - Support partial SMT enablement. So far the sysfs SMT control only allows to toggle between SMT on and off. That's sufficient for x86 which usually has at max two threads except for the Xeon PHI platform which has four threads per core. Though PowerPC has up to 16 threads per core and so far it's only possible to control the number of enabled threads per core via a command line option. There is some way to control this at runtime, but that lacks enforcement and the usability is awkward. This update expands the sysfs interface and the core infrastructure to accept numerical values so PowerPC can build SMT runtime control for partial SMT enablement on top. The core support has also been provided to the PowerPC maintainers who added the PowerPC related changes on top. - Minor cleanups and documentation updates. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmTsj4wTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoaszEADKMd/6m7/Bq7RU2OJ+IXw8yfMEF9nS 6HPrFu71a4cDufb/G8UckQOvkwdTFWD7bZ0snJe2sBDFTOtzK/inYkgPZTxlm7si JcJmFnHKUM7OTwNZb7Tv1bd9Csz4JhggAYUw6P8CqsCmhQ+p6ECemx3bHDlYiywm 5eW2yzI9EM4dbsHPwUOvjI0WazGvAf0esSDAS8JTnhBXbd8FAckbMV+xuRPcCUK+ dBqbqr+3Nf4/wcXTro/gZIc7sEATAHH6m7zHlLVBSyVPnBxre8NLz6KciW4SezyJ GWFnDV03mmG2KxQ2ugwI8n6M3zDJQtfEJFwW/x4t2M5RK+ka2a6G6GtCLHYOXLWR akIuBXtTAC57BgpqzBihGej9eiC1BJ1QMa9ZK+6WDXSZtMTFOLlbwdY2/qyfxpfw LfepWb+UMtFy5YyW84S1O5/AqpOtKD2kPTqfDjvDxWIAigispU+qwAKxcMzMjtwz aAlf2Z/iX0R9DkRzGD2gaFG5AUsRich8RtVO7u+WDwYSsi8ywrvryiPlZrDDBkSQ sRzdoHeXNGVY/FgkbZmEyBj4udrypymkR6ivqn6C2OrysgznSiv5NC983uS6TfJX cVqdUv6CNYYNiNu0x0Qf0MluYT2s5c1Fa4bjCBJL+KwORwjM3+TCN9RA1KtFrW2T G3Ta1KqI6wRonA== =JQRJ -----END PGP SIGNATURE----- Merge tag 'smp-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug updates from Thomas Gleixner: "Updates for the CPU hotplug core: - Support partial SMT enablement. So far the sysfs SMT control only allows to toggle between SMT on and off. That's sufficient for x86 which usually has at max two threads except for the Xeon PHI platform which has four threads per core Though PowerPC has up to 16 threads per core and so far it's only possible to control the number of enabled threads per core via a command line option. There is some way to control this at runtime, but that lacks enforcement and the usability is awkward This update expands the sysfs interface and the core infrastructure to accept numerical values so PowerPC can build SMT runtime control for partial SMT enablement on top The core support has also been provided to the PowerPC maintainers who added the PowerPC related changes on top - Minor cleanups and documentation updates" * tag 'smp-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation: core-api/cpuhotplug: Fix state names cpu/hotplug: Remove unused function declaration cpu_set_state_online() cpu/SMT: Fix cpu_smt_possible() comment cpu/SMT: Allow enabling partial SMT states via sysfs cpu/SMT: Create topology_smt_thread_allowed() cpu/SMT: Remove topology_smt_supported() cpu/SMT: Store the current/max number of threads cpu/SMT: Move smt/control simple exit cases earlier cpu/SMT: Move SMT prototypes into cpu_smt.h cpu/hotplug: Remove dependancy against cpu_primary_thread_mask	2023-08-28 15:04:43 -07:00
Rafael J. Wysocki	9bd0c413b9	Merge branch 'acpi-processor' Merge ACPI processor driver changes for 6.6-rc1: - Support obtaining physical CPU ID from MADT on LoongArch (Bibo Mao). - Convert ACPI CPU initialization to using _OSC instead of _PDC that has been depreceted since 2018 and dropped from the specification in ACPI 6.5 (Michal Wilczynski, Rafael Wysocki). * acpi-processor: ACPI: processor: LoongArch: Get physical ID from MADT ACPI: processor: Refine messages in acpi_early_processor_control_setup() ACPI: processor: Remove acpi_hwp_native_thermal_lvt_osc() ACPI: processor: Use _OSC to convey OSPM processor support information ACPI: processor: Introduce acpi_processor_osc() ACPI: processor: Set CAP_SMP_T_SWCOORD in arch_acpi_set_proc_cap_bits() ACPI: processor: Clear C_C2C3_FFH and C_C1_FFH in arch_acpi_set_proc_cap_bits() ACPI: processor: Rename ACPI_PDC symbols ACPI: processor: Refactor arch_acpi_set_pdc_bits() ACPI: processor: Move processor_physically_present() to acpi_processor.c ACPI: processor: Move MWAIT quirk out of acpi_processor.c	2023-08-25 20:40:50 +02:00
Steve Rutherford	ac3f9c9f1b	x86/sev: Make enc_dec_hypercall() accept a size instead of npages enc_dec_hypercall() accepted a page count instead of a size, which forced its callers to round up. As a result, non-page aligned vaddrs caused pages to be spuriously marked as decrypted via the encryption status hypercall, which in turn caused consistent corruption of pages during live migration. Live migration requires accurate encryption status information to avoid migrating pages from the wrong perspective. Fixes: `064ce6c550` ("mm: x86: Invoke hypercall when page encryption status is changed") Signed-off-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Tested-by: Ben Hillier <bhillier@google.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230824223731.2055016-1-srutherford@google.com	2023-08-25 13:33:48 +02:00
Dexuan Cui	e3131f1c81	x86/hyperv: Remove hv_isolation_type_en_snp In ms_hyperv_init_platform(), do not distinguish between a SNP VM with the paravisor and a SNP VM without the paravisor. Replace hv_isolation_type_en_snp() with !ms_hyperv.paravisor_present && hv_isolation_type_snp(). The hv_isolation_type_en_snp() in drivers/hv/hv.c and drivers/hv/hv_common.c can be changed to hv_isolation_type_snp() since we know !ms_hyperv.paravisor_present is true there. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20230824080712.30327-10-decui@microsoft.com	2023-08-25 00:04:57 +00:00
Dexuan Cui	b9b4fe3a72	x86/hyperv: Use TDX GHCI to access some MSRs in a TDX VM with the paravisor When the paravisor is present, a SNP VM must use GHCB to access some special MSRs, including HV_X64_MSR_GUEST_OS_ID and some SynIC MSRs. Similarly, when the paravisor is present, a TDX VM must use TDX GHCI to access the same MSRs. Implement hv_tdx_msr_write() and hv_tdx_msr_read(), and use the helper functions hv_ivm_msr_read() and hv_ivm_msr_write() to access the MSRs in a unified way for SNP/TDX VMs with the paravisor. Do not export hv_tdx_msr_write() and hv_tdx_msr_read(), because we never really used hv_ghcb_msr_write() and hv_ghcb_msr_read() in any module. Update arch/x86/include/asm/mshyperv.h so that the kernel can still build if CONFIG_AMD_MEM_ENCRYPT or CONFIG_INTEL_TDX_GUEST is not set, or neither is set. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20230824080712.30327-9-decui@microsoft.com	2023-08-25 00:04:57 +00:00
Dexuan Cui	d3a9d7e49d	x86/hyperv: Introduce a global variable hyperv_paravisor_present The new variable hyperv_paravisor_present is set only when the VM is a SNP/TDX VM with the paravisor running: see ms_hyperv_init_platform(). We introduce hyperv_paravisor_present because we can not use ms_hyperv.paravisor_present in arch/x86/include/asm/mshyperv.h: struct ms_hyperv_info is defined in include/asm-generic/mshyperv.h, which is included at the end of arch/x86/include/asm/mshyperv.h, but at the beginning of arch/x86/include/asm/mshyperv.h, we would already need to use struct ms_hyperv_info in hv_do_hypercall(). We use hyperv_paravisor_present only in include/asm-generic/mshyperv.h, and use ms_hyperv.paravisor_present elsewhere. In the future, we'll introduce a hypercall function structure for different VM types, and at boot time, the right function pointers would be written into the structure so that runtime testing of TDX vs. SNP vs. normal will be avoided and hyperv_paravisor_present will no longer be needed. Call hv_vtom_init() when it's a VBS VM or when ms_hyperv.paravisor_present is true, i.e. the VM is a SNP VM or TDX VM with the paravisor. Enhance hv_vtom_init() for a TDX VM with the paravisor. In hv_common_cpu_init(), don't decrypt the hyperv_pcpu_input_arg for a TDX VM with the paravisor, just like we don't decrypt the page for a SNP VM with the paravisor. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20230824080712.30327-7-decui@microsoft.com	2023-08-25 00:04:57 +00:00
Dexuan Cui	d6e0228d26	x86/hyperv: Support hypercalls for fully enlightened TDX guests A fully enlightened TDX guest on Hyper-V (i.e. without the paravisor) only uses the GHCI call rather than hv_hypercall_pg. Do not initialize hypercall_pg for such a guest. In hv_common_cpu_init(), the hyperv_pcpu_input_arg page needs to be decrypted in such a guest. Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20230824080712.30327-3-decui@microsoft.com	2023-08-25 00:04:56 +00:00
Dexuan Cui	08e9d12077	x86/hyperv: Add hv_isolation_type_tdx() to detect TDX guests No logic change to SNP/VBS guests. hv_isolation_type_tdx() will be used to instruct a TDX guest on Hyper-V to do some TDX-specific operations, e.g. for a fully enlightened TDX guest (i.e. without the paravisor), hv_do_hypercall() should use __tdx_hypercall() and such a guest on Hyper-V should handle the Hyper-V Event/Message/Monitor pages specially. Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20230824080712.30327-2-decui@microsoft.com	2023-08-25 00:04:56 +00:00
Eric DeVolder	a72bbec70d	crash: hotplug support for kexec_load() The hotplug support for kexec_load() requires changes to the userspace kexec-tools and a little extra help from the kernel. Given a kdump capture kernel loaded via kexec_load(), and a subsequent hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites it to reflect the hotplug change. That is the desired outcome, however, at kernel panic time, the purgatory integrity check fails (because the elfcorehdr changed), and the capture kernel does not boot and no vmcore is generated. Therefore, the userspace kexec-tools/kexec must indicate to the kernel that the elfcorehdr can be modified (because the kexec excluded the elfcorehdr from the digest, and sized the elfcorehdr memory buffer appropriately). To facilitate hotplug support with kexec_load(): - a new kexec flag KEXEC_UPATE_ELFCOREHDR indicates that it is safe for the kernel to modify the kexec_load()'d elfcorehdr - the /sys/kernel/crash_elfcorehdr_size node communicates the preferred size of the elfcorehdr memory buffer - The sysfs crash_hotplug nodes (ie. /sys/devices/system/[cpu\|memory]/crash_hotplug) dynamically take into account kexec_file_load() vs kexec_load() and KEXEC_UPDATE_ELFCOREHDR. This is critical so that the udev rule processing of crash_hotplug is all that is needed to determine if the userspace unload-then-load of the kdump image is to be skipped, or not. The proposed udev rule change looks like: # The kernel updates the crash elfcorehdr for CPU and memory changes SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" The table below indicates the behavior of kexec_load()'d kdump image updates (with the new udev crash_hotplug rule in place): Kernel \|Kexec -------+-----+---- Old \|Old \|New \| a \| a -------+-----+---- New \| a \| b -------+-----+---- where kexec 'old' and 'new' delineate kexec-tools has the needed modifications for the crash hotplug feature, and kernel 'old' and 'new' delineate the kernel supports this crash hotplug feature. Behavior 'a' indicates the unload-then-reload of the entire kdump image. For the kexec 'old' column, the unload-then-reload occurs due to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with 'new' kexec) does not present the crash_hotplug sysfs node, which leads to the unload-then-reload of the kdump image. Behavior 'b' indicates the desired optimized behavior of the kernel directly modifying the elfcorehdr and avoiding the unload-then-reload of the kdump image. If the udev rule is not updated with crash_hotplug node check, then no matter any combination of kernel or kexec is new or old, the kdump image continues to be unload-then-reload on hotplug changes. To fully support crash hotplug feature, there needs to be a rollout of kernel, kexec-tools and udev rule changes. However, the order of the rollout of these pieces does not matter; kexec_load()'d kdump images still function for hotplug as-is. Link: https://lkml.kernel.org/r/20230814214446.6659-7-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Suggested-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Akhil Raj <lf32.dev@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:25:14 -07:00
Eric DeVolder	ea53ad9cf7	x86/crash: add x86 crash hotplug support When CPU or memory is hot un/plugged, or off/onlined, the crash elfcorehdr, which describes the CPUs and memory in the system, must also be updated. A new elfcorehdr is generated from the available CPUs and memory and replaces the existing elfcorehdr. The segment containing the elfcorehdr is identified at run-time in crash_core:crash_handle_hotplug_event(). No modifications to purgatory (see 'kexec: exclude elfcorehdr from the segment digest') or boot_params (as the elfcorehdr= capture kernel command line parameter pointer remains unchanged and correct) are needed, just elfcorehdr. For kexec_file_load(), the elfcorehdr segment size is based on NR_CPUS and CRASH_MAX_MEMORY_RANGES in order to accommodate a growing number of CPU and memory resources. For kexec_load(), the userspace kexec utility needs to size the elfcorehdr segment in the same/similar manner. To accommodate kexec_load() syscall in the absence of kexec_file_load() syscall support, prepare_elf_headers() and dependents are moved outside of CONFIG_KEXEC_FILE. [eric.devolder@oracle.com: correct unused function build error] Link: https://lkml.kernel.org/r/20230821182644.2143-1-eric.devolder@oracle.com Link: https://lkml.kernel.org/r/20230814214446.6659-6-eric.devolder@oracle.com Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Akhil Raj <lf32.dev@gmail.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Sean Christopherson <seanjc@google.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:25:14 -07:00
Matthew Wilcox (Oracle)	a3e1c9372c	x86: implement the new page table range API Add PFN_PTE_SHIFT and a noop update_mmu_cache_range(). Link: https://lkml.kernel.org/r/20230802151406.3735276-29-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:20:24 -07:00
Matthew Wilcox (Oracle)	a379322022	mm: convert page_table_check_pte_set() to page_table_check_ptes_set() Tell the page table check how many PTEs & PFNs we want it to check. Link: https://lkml.kernel.org/r/20230802151406.3735276-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:20:18 -07:00
Nathan Chancellor	f0a3d1de89	x86/hyperv: Add missing 'inline' to hv_snp_boot_ap() stub When building without CONFIG_AMD_MEM_ENCRYPT, there are several repeated instances of -Wunused-function due to missing 'inline' on the stub of hy_snp_boot_ap(): In file included from drivers/hv/hv_common.c:29: ./arch/x86/include/asm/mshyperv.h:272:12: error: 'hv_snp_boot_ap' defined but not used [-Werror=unused-function] 272 \| static int hv_snp_boot_ap(int cpu, unsigned long start_ip) { return 0; } \| ^~~~~~~~~~~~~~ cc1: all warnings being treated as errors Add 'inline' to fix the warnings. Fixes: `44676bb9d5` ("x86/hyperv: Add smp support for SEV-SNP guest") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20230822-hv_snp_boot_ap-missing-inline-v1-1-e712dcb2da0f@kernel.org	2023-08-23 05:41:26 +00:00
Ard Biesheuvel	e38abdab44	efi/runtime-wrappers: Remove duplicated macro for service returning void __efi_call_virt() exists as an alternative for efi_call_virt() for the sole reason that ResetSystem() returns void, and so we cannot use a call to it in the RHS of an assignment. Given that there is only a single user, let's drop the macro, and expand it into the caller. That way, the remaining macro can be tightened somewhat in terms of type safety too. Note that the use of typeof() on the runtime service invocation does not result in an actual call being made, but it does require a few pointer types to be fixed up and converted into the proper function pointer prototypes. Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2023-08-22 10:39:26 +02:00
Tianyu Lan	44676bb9d5	x86/hyperv: Add smp support for SEV-SNP guest In the AMD SEV-SNP guest, AP needs to be started up via sev es save area and Hyper-V requires to call HVCALL_START_VP hypercall to pass the gpa of sev es save area with AP's vp index and VTL(Virtual trust level) parameters. Override wakeup_secondary_cpu_64 callback with hv_snp_boot_ap. Reviewed-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20230818102919.1318039-8-ltykernel@gmail.com	2023-08-22 00:38:20 +00:00
Tianyu Lan	48b1f68372	x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest In sev-snp enlightened guest, Hyper-V hypercall needs to use vmmcall to trigger vmexit and notify hypervisor to handle hypercall request. Signed-off-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20230818102919.1318039-6-ltykernel@gmail.com	2023-08-22 00:38:20 +00:00
Tianyu Lan	8387ce06d7	x86/hyperv: Set Virtual Trust Level in VMBus init message SEV-SNP guests on Hyper-V can run at multiple Virtual Trust Levels (VTL). During boot, get the VTL at which we're running using the GET_VP_REGISTERs hypercall, and save the value for future use. Then during VMBus initialization, set the VTL with the saved value as required in the VMBus init message. Reviewed-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20230818102919.1318039-3-ltykernel@gmail.com	2023-08-22 00:38:20 +00:00
Tianyu Lan	d6e2d65244	x86/hyperv: Add sev-snp enlightened guest static key Introduce static key isolation_type_en_snp for enlightened sev-snp guest check. Reviewed-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/20230818102919.1318039-2-ltykernel@gmail.com	2023-08-22 00:38:20 +00:00
Helge Deller	0a6b58c5cd	lockdep: fix static memory detection even more On the parisc architecture, lockdep reports for all static objects which are in the __initdata section (e.g. "setup_done" in devtmpfs, "kthreadd_done" in init/main.c) this warning: INFO: trying to register non-static key. The warning itself is wrong, because those objects are in the __initdata section, but the section itself is on parisc outside of range from _stext to _end, which is why the static_obj() functions returns a wrong answer. While fixing this issue, I noticed that the whole existing check can be simplified a lot. Instead of checking against the _stext and _end symbols (which include code areas too) just check for the .data and .bss segments (since we check a data object). This can be done with the existing is_kernel_core_data() macro. In addition objects in the __initdata section can be checked with init_section_contains(), and is_kernel_rodata() allows keys to be in the _ro_after_init section. This partly reverts and simplifies commit `bac59d18c7` ("x86/setup: Fix static memory detection"). Link: https://lkml.kernel.org/r/ZNqrLRaOi/3wPAdp@p100 Fixes: `bac59d18c7` ("x86/setup: Fix static memory detection") Signed-off-by: Helge Deller <deller@gmx.de> Cc: Borislav Petkov <bp@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-21 13:46:24 -07:00
Linus Walleij	067e4f174f	x86/xen: Make virt_to_pfn() a static inline Making virt_to_pfn() a static inline taking a strongly typed (const void ) makes the contract of a passing a pointer of that type to the function explicit and exposes any misuse of the macro virt_to_pfn() acting polymorphic and accepting many types such as (void ), (unitptr_t) or (unsigned long) as arguments without warnings. Also fix all offending call sites to pass a (void *) rather than an unsigned long. Since virt_to_mfn() is wrapping virt_to_pfn() this function has become polymorphic as well so the usage need to be fixed up. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230810-virt-to-phys-x86-xen-v1-1-9e966d333e7a@linaro.org Signed-off-by: Juergen Gross <jgross@suse.com>	2023-08-21 09:52:29 +02:00
Douglas Anderson	8d539b84f1	nmi_backtrace: allow excluding an arbitrary CPU The APIs that allow backtracing across CPUs have always had a way to exclude the current CPU. This convenience means callers didn't need to find a place to allocate a CPU mask just to handle the common case. Let's extend the API to take a CPU ID to exclude instead of just a boolean. This isn't any more complex for the API to handle and allows the hardlockup detector to exclude a different CPU (the one it already did a trace for) without needing to find space for a CPU mask. Arguably, this new API also encourages safer behavior. Specifically if the caller wants to avoid tracing the current CPU (maybe because they already traced the current CPU) this makes it more obvious to the caller that they need to make sure that the current CPU ID can't change. [akpm@linux-foundation.org: fix trigger_allbutcpu_cpu_backtrace() stub] Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: kernel test robot <lkp@intel.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:19:00 -07:00
Andy Shevchenko	15beb1b746	x86/asm: replace custom COUNT_ARGS() & CONCATENATE() implementations Replace custom implementation of the macros from args.h. Link: https://lkml.kernel.org/r/20230718211147.18647-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Daniel Latypov <dlatypov@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Gow <davidgow@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:18:56 -07:00
Alistair Popple	1af5a81099	mmu_notifiers: rename invalidate_range notifier There are two main use cases for mmu notifiers. One is by KVM which uses mmu_notifier_invalidate_range_start()/end() to manage a software TLB. The other is to manage hardware TLBs which need to use the invalidate_range() callback because HW can establish new TLB entries at any time. Hence using start/end() can lead to memory corruption as these callbacks happen too soon/late during page unmap. mmu notifier users should therefore either use the start()/end() callbacks or the invalidate_range() callbacks. To make this usage clearer rename the invalidate_range() callback to arch_invalidate_secondary_tlbs() and update documention. Link: https://lkml.kernel.org/r/6f77248cd25545c8020a54b4e567e8b72be4dca1.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:41 -07:00
Alistair Popple	6bbd42e2df	mmu_notifiers: call invalidate_range() when invalidating TLBs The invalidate_range() is going to become an architecture specific mmu notifier used to keep the TLB of secondary MMUs such as an IOMMU in sync with the CPU page tables. Currently it is called from separate code paths to the main CPU TLB invalidations. This can lead to a secondary TLB not getting invalidated when required and makes it hard to reason about when exactly the secondary TLB is invalidated. To fix this move the notifier call to the architecture specific TLB maintenance functions for architectures that have secondary MMUs requiring explicit software invalidations. This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades require a TLB invalidation. This invalidation is done by the architecture specific ptep_set_access_flags() which calls flush_tlb_page() if required. However this doesn't call the notifier resulting in infinite faults being generated by devices using the SMMU if it has previously cached a read-only PTE in it's TLB. Moving the invalidations into the TLB invalidation functions ensures all invalidations happen at the same time as the CPU invalidation. The architecture specific flush_tlb_all() routines do not call the notifier as none of the IOMMUs require this. Link: https://lkml.kernel.org/r/0287ae32d91393a582897d6c4db6f7456b1001f2.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Tested-by: SeongJae Park <sj@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Andrew Donnellan <ajd@linux.ibm.com> Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Cc: Frederic Barrat <fbarrat@linux.ibm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nicolin Chen <nicolinc@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Zhi Wang <zhi.wang.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:41 -07:00
Yicong Yang	db6c1f6f23	mm/tlbbatch: introduce arch_flush_tlb_batched_pending() Currently we'll flush the mm in flush_tlb_batched_pending() to avoid race between reclaim unmaps pages by batched TLB flush and mprotect/munmap/etc. Other architectures like arm64 may only need a synchronization barrier(dsb) here rather than a full mm flush. So add arch_flush_tlb_batched_pending() to allow an arch-specific implementation here. This intends no functional changes on x86 since still a full mm flush for x86. Link: https://lkml.kernel.org/r/20230717131004.12662-4-yangyicong@huawei.com Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Barry Song <baohua@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Darren Hart <darren@os.amperecomputing.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: lipeifeng <lipeifeng@oppo.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Miao <realmz6@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zeng Tao <prime.zeng@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:37 -07:00
Barry Song	f73419bb89	mm/tlbbatch: rename and extend some functions This patch does some preparation works to extend batched TLB flush to arm64. Including: - Extend set_tlb_ubc_flush_pending() and arch_tlbbatch_add_mm() to accept an additional argument for address, architectures like arm64 may need this for tlbi. - Rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending() to match its current function since we don't need to handle mm on architectures like arm64 and add_mm is not proper, add_pending will make sense to both as on x86 we're pending the TLB flush operations while on arm64 we're pending the synchronize operations. This intends no functional changes on x86. Link: https://lkml.kernel.org/r/20230717131004.12662-3-yangyicong@huawei.com Tested-by: Yicong Yang <yangyicong@hisilicon.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Nadav Amit <namit@vmware.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Barry Song <baohua@kernel.org> Cc: Darren Hart <darren@os.amperecomputing.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: lipeifeng <lipeifeng@oppo.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Miao <realmz6@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Zeng Tao <prime.zeng@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:36 -07:00
Anshuman Khandual	65c8d30e67	mm/tlbbatch: introduce arch_tlbbatch_should_defer() Patch series "arm64: support batched/deferred tlb shootdown during page reclamation/migration", v11. Though ARM64 has the hardware to do tlb shootdown, the hardware broadcasting is not free. A simplest micro benchmark shows even on snapdragon 888 with only 8 cores, the overhead for ptep_clear_flush is huge even for paging out one page mapped by only one process: 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush While pages are mapped by multiple processes or HW has more CPUs, the cost should become even higher due to the bad scalability of tlb shootdown. The same benchmark can result in 16.99% CPU consumption on ARM64 server with around 100 cores according to the test on patch 4/4. This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by 1. only send tlbi instructions in the first stage - arch_tlbbatch_add_mm() 2. wait for the completion of tlbi by dsb while doing tlbbatch sync in arch_tlbbatch_flush() Testing on snapdragon shows the overhead of ptep_clear_flush is removed by the patchset. The micro benchmark becomes 5% faster even for one page mapped by single process on snapdragon 888. Since BATCHED_UNMAP_TLB_FLUSH is implemented only on x86, the patchset does some renaming/extension for the current implementation first (Patch 1-3), then add the support on arm64 (Patch 4). This patch (of 4): The entire scheme of deferred TLB flush in reclaim path rests on the fact that the cost to refill TLB entries is less than flushing out individual entries by sending IPI to remote CPUs. But architecture can have different ways to evaluate that. Hence apart from checking TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be architecture specific. [yangyicong@hisilicon.com: rebase and fix incorrect return value type] Link: https://lkml.kernel.org/r/20230717131004.12662-1-yangyicong@huawei.com Link: https://lkml.kernel.org/r/20230717131004.12662-2-yangyicong@huawei.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> [https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khandual@linux.vnet.ibm.com/] Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Punit Agrawal <punit.agrawal@bytedance.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Darren Hart <darren@os.amperecomputing.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: lipeifeng <lipeifeng@oppo.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Miao <realmz6@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Zeng Tao <prime.zeng@hisilicon.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <namit@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:36 -07:00
Baoquan He	0b1f77e74b	asm-generic/iomap.h: remove ARCH_HAS_IOREMAP_xx macros Patch series "mm: ioremap: Convert architectures to take GENERIC_IOREMAP way", v8. Motivation and implementation: ============================== Currently, many architecutres have't taken the standard GENERIC_IOREMAP way to implement ioremap_prot(), iounmap(), and ioremap_xx(), but make these functions specifically under each arch's folder. Those cause many duplicated code of ioremap() and iounmap(). In this patchset, firstly introduce generic_ioremap_prot() and generic_iounmap() to extract the generic code for GENERIC_IOREMAP. By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(), generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and iounmap() are all visible and available to arch. Arch needs to provide wrapper functions to override the generic version if there's arch specific handling in its corresponding ioremap_prot(), ioremap() or iounmap(). With these changes, duplicated ioremap/iounmap() code uder ARCH-es are removed, and the equivalent functioality is kept as before. Background info: ================ 1) The converting more architectures to take GENERIC_IOREMAP way is suggested by Christoph in below discussion: https://lore.kernel.org/all/Yp7h0Jv6vpgt6xdZ@infradead.org/T/#u 2) In the previous v1 to v3, it's basically further action after arm64 has converted to GENERIC_IOREMAP way in below patchset. It's done by adding hook ioremap_allowed() and iounmap_allowed() in ARCH to add ARCH specific handling the middle of ioremap_prot() and iounmap(). [PATCH v5 0/6] arm64: Cleanup ioremap() and support ioremap_prot() https://lore.kernel.org/all/20220607125027.44946-1-wangkefeng.wang@huawei.com/T/#u Later, during v3 reviewing, Christophe Leroy suggested to introduce generic_ioremap_prot() and generic_iounmap() to generic codes, and ARCH can provide wrapper function ioremap_prot(), ioremap() or iounmap() if needed. Christophe made a RFC patchset as below to specially demonstrate his idea. This is what v4 and now v5 is doing. [RFC PATCH 0/8] mm: ioremap: Convert architectures to take GENERIC_IOREMAP way https://lore.kernel.org/all/cover.1665568707.git.christophe.leroy@csgroup.eu/T/#u Testing: ======== In v8, I only applied this patchset onto the latest linus's tree to build and run on arm64 and s390. This patch (of 19): Let's use '#define ioremap_xx' and "#ifdef ioremap_xx" instead. To remove defined ARCH_HAS_IOREMAP_xx macros in <asm/io.h> of each ARCH, the ARCH's own ioremap_wc\|wt\|np definition need be above "#include <asm-generic/iomap.h>. Otherwise the redefinition error would be seen during compiling. So the relevant adjustments are made to avoid compiling error: loongarch: - doesn't include <asm-generic/iomap.h>, defining ARCH_HAS_IOREMAP_WC is redundant, so simply remove it. m68k: - selected GENERIC_IOMAP, <asm-generic/iomap.h> has been added in <asm-generic/io.h>, and <asm/kmap.h> is included above <asm-generic/iomap.h>, so simply remove ARCH_HAS_IOREMAP_WT defining. mips: - move "#include <asm-generic/iomap.h>" below ioremap_wc definition in <asm/io.h> powerpc: - remove "#include <asm-generic/iomap.h>" in <asm/io.h> because it's duplicated with the one in <asm-generic/io.h>, let's rely on the latter. x86: - selected GENERIC_IOMAP, remove #include <asm-generic/iomap.h> in the middle of <asm/io.h>. Let's rely on <asm-generic/io.h>. Link: https://lkml.kernel.org/r/20230706154520.11257-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Helge Deller <deller@gmx.de> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Stafford Horne <shorne@gmail.com> Cc: Brian Cain <bcain@quicinc.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Rich Felker <dalias@libc.org> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:32 -07:00
Kemeng Shi	6d144436d9	mm/page_table_check: remove unused parameter in [__]page_table_check_pud_set Remove unused addr in __page_table_check_pud_set and page_table_check_pud_set. Link: https://lkml.kernel.org/r/20230713172636.1705415-9-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:29 -07:00
Kemeng Shi	a3b837130b	mm/page_table_check: remove unused parameter in [__]page_table_check_pmd_set Remove unused addr in __page_table_check_pmd_set and page_table_check_pmd_set. Link: https://lkml.kernel.org/r/20230713172636.1705415-8-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:29 -07:00
Kemeng Shi	1066293d42	mm/page_table_check: remove unused parameter in [__]page_table_check_pte_set Remove unused addr in __page_table_check_pte_set and page_table_check_pte_set. Link: https://lkml.kernel.org/r/20230713172636.1705415-7-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:29 -07:00
Kemeng Shi	931c38e164	mm/page_table_check: remove unused parameter in [__]page_table_check_pud_clear Remove unused addr in __page_table_check_pud_clear and page_table_check_pud_clear. Link: https://lkml.kernel.org/r/20230713172636.1705415-6-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:28 -07:00
Kemeng Shi	1831414cd7	mm/page_table_check: remove unused parameter in [__]page_table_check_pmd_clear Remove unused addr in page_table_check_pmd_clear and __page_table_check_pmd_clear. Link: https://lkml.kernel.org/r/20230713172636.1705415-5-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:28 -07:00
Kemeng Shi	aa232204c4	mm/page_table_check: remove unused parameter in [__]page_table_check_pte_clear Remove unused addr in page_table_check_pte_clear and __page_table_check_pte_clear. Link: https://lkml.kernel.org/r/20230713172636.1705415-4-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-18 10:12:28 -07:00
Sean Christopherson	fe60e8f65f	KVM: x86: Use KVM-governed feature framework to track "XSAVES enabled" Use the governed feature framework to track if XSAVES is "enabled", i.e. if XSAVES can be used by the guest. Add a comment in the SVM code to explain the very unintuitive logic of deliberately NOT checking if XSAVES is enumerated in the guest CPUID model. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20230815203653.519297-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-17 11:38:28 -07:00
Sean Christopherson	662f681578	KVM: VMX: Rename XSAVES control to follow KVM's preferred "ENABLE_XYZ" Rename the XSAVES secondary execution control to follow KVM's preferred style so that XSAVES related logic can use common macros that depend on KVM's preferred style. No functional change intended. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20230815203653.519297-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-17 11:38:28 -07:00
Sean Christopherson	42764413d1	KVM: x86: Add a framework for enabling KVM-governed x86 features Introduce yet another X86_FEATURE flag framework to manage and cache KVM governed features (for lack of a better name). "Governed" in this case means that KVM has some level of involvement and/or vested interest in whether or not an X86_FEATURE can be used by the guest. The intent of the framework is twofold: to simplify caching of guest CPUID flags that KVM needs to frequently query, and to add clarity to such caching, e.g. it isn't immediately obvious that SVM's bundle of flags for "optional nested SVM features" track whether or not a flag is exposed to L1. Begrudgingly define KVM_MAX_NR_GOVERNED_FEATURES for the size of the bitmap to avoid exposing governed_features.h in arch/x86/include/asm/, but add a FIXME to call out that it can and should be cleaned up once "struct kvm_vcpu_arch" is no longer expose to the kernel at large. Cc: Zeng Guang <guang.zeng@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20230815203653.519297-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-17 11:38:27 -07:00
Manali Shukla	f67063414c	KVM: SVM: correct the size of spec_ctrl field in VMCB save area Correct the spec_ctrl field in the VMCB save area based on the AMD Programmer's manual. Originally, the spec_ctrl was listed as u32 with 4 bytes of reserved area. The AMD Programmer's Manual now lists the spec_ctrl as 8 bytes in VMCB save area. The Public Processor Programming reference for Genoa, shows SPEC_CTRL as 64b register, but the AMD Programmer's Manual lists SPEC_CTRL as 32b register. This discrepancy will be cleaned up in next revision of the AMD Programmer's Manual. Since remaining bits above bit 7 are reserved bits in SPEC_CTRL MSR and thus, not being used, the spec_ctrl added as u32 in the VMCB save area is currently not an issue. Fixes: `3dd2775b74` ("KVM: SVM: Create a separate mapping for the SEV-ES save area") Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Manali Shukla <manali.shukla@amd.com> Link: https://lore.kernel.org/r/20230717041903.85480-1-manali.shukla@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-17 11:36:24 -07:00
Josh Poimboeuf	c6cfcbd8ca	x86/ibt: Convert IBT selftest to asm The following warning is reported when frame pointers and kernel IBT are enabled: vmlinux.o: warning: objtool: ibt_selftest+0x11: sibling call from callable instruction with modified stack frame The problem is that objtool interprets the indirect branch in ibt_selftest() as a sibling call, and GCC inserts a (partial) frame pointer prologue before it: 0000 000000000003f550 <ibt_selftest>: 0000 3f550: f3 0f 1e fa endbr64 0004 3f554: e8 00 00 00 00 call 3f559 <ibt_selftest+0x9> 3f555: R_X86_64_PLT32 __fentry__-0x4 0009 3f559: 55 push %rbp 000a 3f55a: 48 8d 05 02 00 00 00 lea 0x2(%rip),%rax # 3f563 <ibt_selftest_ip> 0011 3f561: ff e0 jmp *%rax Note the inline asm is missing ASM_CALL_CONSTRAINT, so the 'push %rbp' happens before the indirect branch and the 'mov %rsp, %rbp' happens afterwards. Simplify the generated code and make it easier to understand for both tools and humans by moving the selftest to proper asm. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/99a7e16b97bda97bf0a04aa141d6241cd8a839a2.1680912949.git.jpoimboe@kernel.org	2023-08-17 17:07:09 +02:00
David Matlack	d478899605	KVM: Allow range-based TLB invalidation from common code Make kvm_flush_remote_tlbs_range() visible in common code and create a default implementation that just invalidates the whole TLB. This paves the way for several future features/cleanups: - Introduction of range-based TLBI on ARM. - Eliminating kvm_arch_flush_remote_tlbs_memslot() - Moving the KVM/x86 TDP MMU to common code. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Anup Patel <anup@brainfault.org> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-6-rananta@google.com	2023-08-17 09:40:35 +01:00
David Matlack	a1342c8027	KVM: Rename kvm_arch_flush_remote_tlb() to kvm_arch_flush_remote_tlbs() Rename kvm_arch_flush_remote_tlb() and the associated macro __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB to kvm_arch_flush_remote_tlbs() and __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS respectively. Making the name plural matches kvm_flush_remote_tlbs() and makes it more clear that this function can affect more than one remote TLB. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-2-rananta@google.com	2023-08-17 09:35:14 +01:00
Peter Zijlstra	864bcaa38e	x86/cpu/kvm: Provide UNTRAIN_RET_VM Similar to how it doesn't make sense to have UNTRAIN_RET have two untrain calls, it also doesn't make sense for VMEXIT to have an extra IBPB call. This cures VMEXIT doing potentially unret+IBPB or double IBPB. Also, the (SEV) VMEXIT case seems to have been overlooked. Redefine the meaning of the synthetic IBPB flags to: - ENTRY_IBPB -- issue IBPB on entry (was: entry + VMEXIT) - IBPB_ON_VMEXIT -- issue IBPB on VMEXIT And have 'retbleed=ibpb' set BOTH feature flags to ensure it retains the previous behaviour and issues IBPB on entry+VMEXIT. The new 'srso=ibpb_vmexit' option only sets IBPB_ON_VMEXIT. Create UNTRAIN_RET_VM specifically for the VMEXIT case, and have that check IBPB_ON_VMEXIT. All this avoids having the VMEXIT case having to check both ENTRY_IBPB and IBPB_ON_VMEXIT and simplifies the alternatives. Fixes: `fb3bd914b3` ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230814121149.109557833@infradead.org	2023-08-16 21:58:59 +02:00
Peter Zijlstra	e7c25c441e	x86/cpu: Cleanup the untrain mess Since there can only be one active return_thunk, there only needs be one (matching) untrain_ret. It fundamentally doesn't make sense to allow multiple untrain_ret at the same time. Fold all the 3 different untrain methods into a single (temporary) helper stub. Fixes: `fb3bd914b3` ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230814121149.042774962@infradead.org	2023-08-16 21:58:59 +02:00
Peter Zijlstra	42be649dd1	x86/cpu: Rename srso_(.*)_alias to srso_alias_\1 For a more consistent namespace. [ bp: Fixup names in the doc too. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230814121148.976236447@infradead.org	2023-08-16 21:58:53 +02:00
Peter Zijlstra	d025b7bac0	x86/cpu: Rename original retbleed methods Rename the original retbleed return thunk and untrain_ret to retbleed_return_thunk() and retbleed_untrain_ret(). No functional changes. Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230814121148.909378169@infradead.org	2023-08-16 21:47:53 +02:00
Peter Zijlstra	d43490d0ab	x86/cpu: Clean up SRSO return thunk mess Use the existing configurable return thunk. There is absolute no justification for having created this __x86_return_thunk alternative. To clarify, the whole thing looks like: Zen3/4 does: srso_alias_untrain_ret: nop2 lfence jmp srso_alias_return_thunk int3 srso_alias_safe_ret: // aliasses srso_alias_untrain_ret just so add $8, %rsp ret int3 srso_alias_return_thunk: call srso_alias_safe_ret ud2 While Zen1/2 does: srso_untrain_ret: movabs $foo, %rax lfence call srso_safe_ret (jmp srso_return_thunk ?) int3 srso_safe_ret: // embedded in movabs instruction add $8,%rsp ret int3 srso_return_thunk: call srso_safe_ret ud2 While retbleed does: zen_untrain_ret: test $0xcc, %bl lfence jmp zen_return_thunk int3 zen_return_thunk: // embedded in the test instruction ret int3 Where Zen1/2 flush the BTB entry using the instruction decoder trick (test,movabs) Zen3/4 use BTB aliasing. SRSO adds a return sequence (srso_safe_ret()) which forces the function return instruction to speculate into a trap (UD2). This RET will then mispredict and execution will continue at the return site read from the top of the stack. Pick one of three options at boot (evey function can only ever return once). [ bp: Fixup commit message uarch details and add them in a comment in the code too. Add a comment about the srso_select_mitigation() dependency on retbleed_select_mitigation(). Add moar ifdeffery for 32-bit builds. Add a dummy srso_untrain_ret_alias() definition for 32-bit alternatives needing the symbol. ] Fixes: `fb3bd914b3` ("x86/srso: Add a Speculative RAS Overflow mitigation") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230814121148.842775684@infradead.org	2023-08-16 21:47:24 +02:00
Peter Zijlstra	095b8303f3	x86/alternative: Make custom return thunk unconditional There is infrastructure to rewrite return thunks to point to any random thunk one desires, unwrap that from CALL_THUNKS, which up to now was the sole user of that. [ bp: Make the thunks visible on 32-bit and add ifdeffery for the 32-bit builds. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230814121148.775293785@infradead.org	2023-08-16 09:39:16 +02:00
Petr Pavlu	833fd800bf	x86/retpoline,kprobes: Skip optprobe check for indirect jumps with retpolines and IBT The kprobes optimization check can_optimize() calls insn_is_indirect_jump() to detect indirect jump instructions in a target function. If any is found, creating an optprobe is disallowed in the function because the jump could be from a jump table and could potentially land in the middle of the target optprobe. With retpolines, insn_is_indirect_jump() additionally looks for calls to indirect thunks which the compiler potentially used to replace original jumps. This extra check is however unnecessary because jump tables are disabled when the kernel is built with retpolines. The same is currently the case with IBT. Based on this observation, remove the logic to look for calls to indirect thunks and skip the check for indirect jumps altogether if the kernel is built with retpolines or IBT. Remove subsequently the symbols __indirect_thunk_start and __indirect_thunk_end which are no longer needed. Dropping this logic indirectly fixes a problem where the range [__indirect_thunk_start, __indirect_thunk_end] wrongly included also the return thunk. It caused that machines which used the return thunk as a mitigation and didn't have it patched by any alternative ended up not being able to use optprobes in any regular function. Fixes: `0b53c374b9` ("x86/retpoline: Use -mfunction-return") Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20230711091952.27944-3-petr.pavlu@suse.com	2023-08-14 11:46:51 +02:00
Borislav Petkov (AMD)	f58d6fbcb7	x86/CPU/AMD: Fix the DIV(0) initial fix attempt Initially, it was thought that doing an innocuous division in the #DE handler would take care to prevent any leaking of old data from the divider but by the time the fault is raised, the speculation has already advanced too far and such data could already have been used by younger operations. Therefore, do the innocuous division on every exit to userspace so that userspace doesn't see any potentially old data from integer divisions in kernel space. Do the same before VMRUN too, to protect host data from leaking into the guest too. Fixes: `77245f1c3c` ("x86/CPU/AMD: Do not leak quotient data after a division by 0") Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230811213824.10025-1-bp@alien8.de	2023-08-14 11:02:50 +02:00
Thomas Gleixner	d02a0efd0f	x86/microcode: Move core specific defines to local header There is no reason to expose all of this globally. Move everything which is not required outside of the microcode specific code to local header files and into the respective source files. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230812195727.952876381@linutronix.de	2023-08-13 18:42:55 +02:00
Ashok Raj	b0e67db12d	x86/microcode/intel: Rename get_datasize() since its used externally Rename get_datasize() to intel_microcode_get_datasize() and make it an inline. [ tglx: Make the argument typed and fix up the IFS code ] Suggested-by: Boris Petkov <bp@alien8.de> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230812195727.894165745@linutronix.de	2023-08-13 18:42:55 +02:00
Thomas Gleixner	18648dbd33	x86/microcode: Make reload_early_microcode() static `fe055896c0` ("x86/microcode: Merge the early microcode loader") left this needlessly public. Git archaeology provided by Borislav. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230812195727.834943153@linutronix.de	2023-08-13 18:42:55 +02:00
Ashok Raj	82ad097b02	x86/microcode: Include vendor headers into microcode.h Currently vendor specific headers are included explicitly when used in common code. Instead, include the vendor specific headers in microcode.h, and include that in all usages. No functional change. Suggested-by: Boris Petkov <bp@alien8.de> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230812195727.776541545@linutronix.de	2023-08-13 18:42:55 +02:00
Thomas Gleixner	4da2131fac	x86/microcode/intel: Move microcode functions out of cpu/intel.c There is really no point to have that in the CPUID evaluation code. Move it into the Intel-specific microcode handling along with the data structures, defines and helpers required by it. The exports need to stay for IFS. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230812195727.719202319@linutronix.de	2023-08-13 18:42:48 +02:00
Thomas Gleixner	e6bcfdd75d	x86/microcode: Hide the config knob In reality CONFIG_MICROCODE is enabled in any reasonable configuration when Intel or AMD support is enabled. Accommodate to reality. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230812195727.660453052@linutronix.de	2023-08-13 10:26:39 +02:00
Mateusz Guzik	c8afaa1b0f	locking: remove spin_lock_prefetch The only remaining consumer is new_inode, where it showed up in 2001 as commit c37fa164f793 ("v2.4.9.9 -> v2.4.9.10") in a historical repo [1] with a changelog which does not mention it. Since then the line got only touched up to keep compiling. While it may have been of benefit back in the day, it is guaranteed to at best not get in the way in the multicore setting -- as the code performs a lot of work between the prefetch and actual lock acquire, any contention means the cacheline is already invalid by the time the routine calls spin_lock(). It adds spurious traffic, for short. On top of it prefetch is notoriously tricky to use for single-threaded purposes, making it questionable from the get go. As such, remove it. I admit upfront I did not see value in benchmarking this change, but I can do it if that is deemed appropriate. Removal from new_inode and of the entire thing are in the same patch as requested by Linus, so whatever weird looks can be directed at that guy. Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/fs/inode.c?id=c37fa164f793735b32aa3f53154ff1a7659e6442 [1] Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-08-12 09:18:47 -07:00
Linus Torvalds	43972cf2de	- Do not parse the confidential computing blob on non-AMD hardware as it leads to an EFI config table ending up unmapped - Use the correct segment selector in the 32-bit version of getcpu() in the vDSO - Make sure vDSO and VVAR regions are placed in the 47-bit VA range even on 5-level paging systems - Add models 0x90-0x91 to the range of AMD Zenbleed-affected CPUs -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTXVWMACgkQEsHwGGHe VUrrpg//f0BFZV4dbANuYAX47A/hhrm8n5KwV9Mjq8JDeG/TumUN/vAK9+n8outO pLpPOoXRSCSPybXBSmI1ryVwLaDq6BJy0fYyq4kHOvBFXYKodfTNdO8Oec+le3V/ DBGnY6TQ5x1PgWuAUE9WoeGQuTN1d3Dxm0V/pG2LU/qW3mr+GlBXsSKUVjwp9/OW JPNw1XQDFVuxT+heRLxQONPkdTxkwwKxZBDRwnSSj7chbJ/jSbnX9a5xinwBvMZY Q6nelt/AMwAgfO2oz38y1tnR0bfd8fM08SUUgoajWWghKemZK5uNAgZZJd4tPNq6 lBonNc6jF9UGohzgQbNOAjfmDomtyxe4JYGl2SnyWcfwFzGANpcSKSbM9H91LMaI Sh7hKykZQNGmDctbLsxvlPFgIoxWFLK1DfNeM2yQRzlayqDF8CRlDIcWXHEtYOXf AOZFMJtOJBZjLbSeWIBW278atNG3ONWEb75kqiKRYxsX9QzvwAZYCYZH+NGe6sLO kkCm7g3NcAItf1qrPNGZd/k9fA/+3RXEWNdjYsmegMvU8vQBPY0w4NVwGtU9LCkq jspQxnNlVy1ayqr/TQXRhzn5+d7CQ1PLNwVsGh4S+diCEFu2aEdOhrBhS1uaujLv 5iLpmyyh0yO9aHebK/u4cciAvwVB7WuzQqWYIGzdsolrc+lbY54= =zPFr -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Do not parse the confidential computing blob on non-AMD hardware as it leads to an EFI config table ending up unmapped - Use the correct segment selector in the 32-bit version of getcpu() in the vDSO - Make sure vDSO and VVAR regions are placed in the 47-bit VA range even on 5-level paging systems - Add models 0x90-0x91 to the range of AMD Zenbleed-affected CPUs * tag 'x86_urgent_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu/amd: Enable Zenbleed fix for AMD Custom APU 0405 x86/mm: Fix VDSO and VVAR placement on 5-level paging machines x86/linkage: Fix typo of BUILD_VDSO in asm/linkage.h x86/vdso: Choose the right GDT_ENTRY_CPUNODE for 32-bit getcpu() on 64-bit kernel x86/sev: Do not try to parse for the CC blob on non-AMD hardware	2023-08-12 08:47:01 -07:00
Linus Torvalds	272b86ba9d	- A first series of cleanups/unifications and documentation improvements to the SRSO and GDS mitigations code which got postponed to after the embargo date - Fix the SRSO aliasing addresses assertion so that the LLVM linker can parse it too -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTXULMACgkQEsHwGGHe VUqAKxAAlGm4YodsbEX+SQTVDLsECBw9lpJ3wU8lD0fBmIAmGiWFy9jqn4FYyAi8 YlKNI8Ru/mME5cM5BBV5ZZ5PNIZPD8OmqbbaHkSjcjLJfdNg03D/RDUPELSyY7T5 HviLaknJFjn6HwvLkSLdBIsAAkVqF5lYitP8x5OgBp6Lc9PO7/xWjSZoVhOUkrFe Pjc8sT0DgtC2PWkxbB66/uxhdUnFqpioWL+06akeSWuHweIQDQ+P7sxfCB8NZO0u YV4hmd/JfjoVc0DtvS3HOm14Ruhmru/oiKg/XcJO7uGPBKxuVK8xsHqeUyGMdTeS +sNXA0XjbvaUV9IihuvVHrX8nMirkW7u0NWMNlJCO9QF5eJPfc0I07VLpKJGEsph wKSNCN7F64GfjkRGl4jPo26tX+fXGMm32+gGgpqsCYnTBu+nrqprXck4DJQZBNl4 6Le7sfUky2PSllbFh5MnKaylfeWKcqlOzfko7tjWtFm7raOHEGy31m92igKms0hM IlCyEe6mJUcMJ60QzYwaB9FJ+50jIZXeckRnud/mExgaAGQqe7RcVbwurQCCDtYq vd4sb9TV9vU07Uqz1NBxmzl6GbYM1ORV9hnlpj/eDnh/ArBzj44UwiGB1bVQ31Iy OMBJZ+RQtspa12xq7Zu++mjc+9XTeX9JK81PYg6UU+5ogQapdx4= =P0vQ -----END PGP SIGNATURE----- Merge tag 'x86_bugs_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mitigation fixes from Borislav Petkov: "The first set of fallout fixes after the embargo madness. There will be another set next week too. - A first series of cleanups/unifications and documentation improvements to the SRSO and GDS mitigations code which got postponed to after the embargo date - Fix the SRSO aliasing addresses assertion so that the LLVM linker can parse it too" * tag 'x86_bugs_for_v6.5_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: driver core: cpu: Fix the fallback cpu_show_gds() name x86: Move gds_ucode_mitigated() declaration to header x86/speculation: Add cpu_show_gds() prototype driver core: cpu: Make cpu_show_not_affected() static x86/srso: Fix build breakage with the LLVM linker Documentation/srso: Document IBPB aspect and fix formatting driver core: cpu: Unify redundant silly stubs Documentation/hw-vuln: Unify filename specification in index	2023-08-12 08:34:20 -07:00
Linus Torvalds	29d99aae13	ACPI fixes for 6.5-rc6 Rework the handling of interrupt overrides on AMD Zen-based machines to avoid recently introduced regressions (Hans de Goede). -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmTWgQMSHHJqd0Byand5 c29ja2kubmV0AAoJEILEb/54YlRxaggP/iYft7/0saIW/gQEWUngTdwwYpDdL5c6 cnym9fO7dNvZ7svGgv1oQj3Nge79hWWWKTOVB5F3ZKPEmYHlhAekSYb+SI+/YUcp fpIyVKO/y8nO+Cnw5t9EoTteJ65WJaadBNTuRUw28j4ETs/4AewWFFXSogzO2IIU x4UoWPHEXhhjYTnYL+ivRAKJG5iTIbfQS16muO7p2z630FneUYGDvUxx0XuIvbf0 dXVjKi/rxgfozNyL8HmyeDE3LSjG1b7h7h6MS/ZrdM2LoE4V2pp/Z7qtxs53sIeG Y+dsBZEp8G/YReuMPush31Sgepf6l4OQpjKkq54vRMD6su36IMojMYXayLWHmWK7 RP7n896takZ80tC7GH/pDJUUCoiyvE08/RUxnwg8QPCj5PCMmp2w0/DhhYi1FeFD 4QxYx7UBYGGFy94Bx97LvBVZXM484F8CJnQreXm7csniy0BME5MeV6S6YNjXsXJH M+O0uwdlk24g4SH7E21jUQl6Ch+YtbcH0LRtivGnC6YkcItWkcNRccph3H27wNHH jMR43tJsWp6yjXeyWw+vrML9VZ0aK3yyUKPAqcXhC4aIp8DSgVaVuLdRwXHmBqK5 Phx5+7zuxJ4UpOAQb1RtFOCgRAtfNN57QaXJ0hPuOuzopVOWPHjmcuw4fUdW5YCG esjxuO9VAyQZ =4kD6 -----END PGP SIGNATURE----- Merge tag 'acpi-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI fixes from Rafael Wysocki: "Rework the handling of interrupt overrides on AMD Zen-based machines to avoid recently introduced regressions (Hans de Goede). Note that this is intended as a short-term mitigation for 6.5 and the long-term approach will be to attempt to use the configuration left by the BIOS, but it requires more investigation" * tag 'acpi-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: resource: Add IRQ override quirk for PCSpecialist Elimina Pro 16 M ACPI: resource: Honor MADT INT_SRC_OVR settings for IRQ1 on AMD Zen ACPI: resource: Always use MADT override IRQ settings for all legacy non i8042 IRQs ACPI: resource: revert "Remove "Zen" specific match and quirks"	2023-08-11 12:30:00 -07:00
Arnd Bergmann	eb3515dc99	x86: Move gds_ucode_mitigated() declaration to header The declaration got placed in the .c file of the caller, but that causes a warning for the definition: arch/x86/kernel/cpu/bugs.c:682:6: error: no previous prototype for 'gds_ucode_mitigated' [-Werror=missing-prototypes] Move it to a header where both sides can observe it instead. Fixes: `81ac7e5d74` ("KVM: Add GDS_NO support to KVM") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Cc: stable@kernel.org Link: https://lore.kernel.org/all/20230809130530.1913368-2-arnd%40kernel.org	2023-08-10 09:13:21 -07:00
Peter Zijlstra	535445621a	x86/cpu: Update Hybrids Give the hybrid thingies their own section, appropriately between Core and Atom. Add the Raptor Lake uarch names. Put Lunar Lake after Arrow Lake per interweb guidance. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230807150405.828551866@infradead.org	2023-08-09 21:51:07 +02:00
Peter Zijlstra	0cfd8fbadd	x86/cpu: Fix Crestmont uarch Sierra Forest and Grand Ridge are both E-core only using Crestmont micro-architecture, They fit the pre-existing naming scheme prefectly fine, adhere to it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230807150405.757666627@infradead.org	2023-08-09 21:51:06 +02:00
Peter Zijlstra	882cdb06b6	x86/cpu: Fix Gracemont uarch Alderlake N is an E-core only product using Gracemont micro-architecture. It fits the pre-existing naming scheme perfectly fine, adhere to it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230807150405.686834933@infradead.org	2023-08-09 21:51:06 +02:00
Hans de Goede	c6a1fd910d	ACPI: resource: Honor MADT INT_SRC_OVR settings for IRQ1 on AMD Zen On AMD Zen acpi_dev_irq_override() by default prefers the DSDT IRQ 1 settings over the MADT settings. This causes the keyboard to malfunction on some laptop models (see Links), all models from the Links have an INT_SRC_OVR MADT entry for IRQ 1. Fixes: `a9c4a912b7` ("ACPI: resource: Remove "Zen" specific match and quirks") Link: https://bugzilla.kernel.org/show_bug.cgi?id=217336 Link: https://bugzilla.kernel.org/show_bug.cgi?id=217394 Link: https://bugzilla.kernel.org/show_bug.cgi?id=217406 Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2023-08-09 21:18:46 +02:00
Thomas Gleixner	f8542a5549	x86/apic: Turn on static calls Convert all the APIC callback inline wrappers from apic->foo() to static_call(apic_call_foo)(), except for the safe_wait_icr_idle() one which is only used during SMP bringup when sending INIT/SIPI. That really can do the conditional callback. The regular wait_icr_idle() matters as it is used in irq_work_raise(), so X2APIC machines spare the conditional. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest) Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com>	2023-08-09 12:00:55 -07:00
Thomas Gleixner	3b7c27e678	x86/apic: Provide static call infrastructure for APIC callbacks Declare and define the static calls for the hotpath APIC callbacks. Note this deliberately uses STATIC_CALL_NULL() because otherwise it would be required to have the definitions in the 32bit and the 64bit default APIC implementations and it's hard to keep the calls in sync. The other option would be to have stub functions for each callback type. Not pretty either So the NULL capable calls are used and filled in during early boot after the static key infrastructure has been initialized. The calls will be static_call() except for the wait_irc_idle() callback which is valid to be NULL for X2APIC systems. Update the calls when a new APIC driver is installed and when a callback override is invoked. Export the trampolines for the two calls which are used in KVM and MCE error inject modules. Test the setup and let the next step convert the inline wrappers to make it effective. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 12:00:55 -07:00
Dave Hansen	28b8235238	x86/apic: Wrap IPI calls into helper functions Move them to one place so the static call conversion gets simpler. No functional change. [ dhansen: merge against recent x86/apic changes ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 12:00:55 -07:00
Thomas Gleixner	54271fb0b7	x86/apic: Mark all hotpath APIC callback wrappers __always_inline There is no value for instrumentation to look at those wrappers and with the upcoming conversion to static calls even less so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 12:00:55 -07:00
Thomas Gleixner	2744a7ce34	x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb() Switch them over to apic_update_callback() and remove the code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 12:00:46 -07:00
Thomas Gleixner	bef4f379e9	x86/apic: Provide apic_update_callback() There are already two variants of update mechanism for particular callbacks and virtualization just writes into the data structure. Provide an interface and use a shadow data structure to preserve callbacks so they can be reapplied when the APIC driver is replaced. The extra data structure is intentional as any new callback needs to be also updated in the core code. This also prepares for static calls. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 12:00:46 -07:00
Thomas Gleixner	3af1e415e4	x86/apic: Provide common init infrastructure In preparation for converting the hotpath APIC callbacks to static keys, provide common initialization infrastructure. Lift apic_install_drivers() from probe_64.c and convert all places which switch the apic instance by storing the pointer to use apic_install_driver() as a first step. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:34 -07:00
Thomas Gleixner	0fa075769c	x86/apic: Wrap apic->native_eoi() into a helper Prepare for converting the hotpath APIC callbacks to static calls. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:34 -07:00
Dave Hansen	670c04add6	x86/apic: Nuke ack_APIC_irq() Yet another wrapper of a wrapper gone along with the outdated comment that this compiles to a single instruction. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:34 -07:00
Thomas Gleixner	185c8f33a0	x86/apic: Remove pointless arguments from [native_]eoi_write() Every callsite hands in the same constants which is a pointless exercise and cannot be optimized by the compiler due to the indirect calls. Use the constants in the eoi() callbacks and remove the arguments. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Wei Liu <wei.liu@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:33 -07:00
Thomas Gleixner	d8666cf780	x86/apic: Sanitize APIC ID range validation Now that everything has apic::max_apic_id set and the eventual update for the x2APIC case is in place, switch the apic_id_valid() helper to use apic::max_apic_id and remove the apic::apic_id_valid() callback. [ dhansen: Fix subject typo ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:32 -07:00
Thomas Gleixner	b5a5ce58d3	x86/apic: Prepare x2APIC for using apic::max_apic_id In order to remove the apic::apic_id_valid() callback and switch to checking apic::max_apic_id, it is required to update apic::max_apic_id when the APIC initialization code overrides it via x2apic_set_max_apicid(). Make the existing booleans a bitfield and add a flag which lets the update function and the core code which switches the driver detect whether the apic instance wants to have that update or not and apply it if required. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:31 -07:00
Thomas Gleixner	d92e5e7cf5	x86/apic: Add max_apic_id member There is really no point to have a callback which compares numbers. Add a field which allows each APIC to store the maximum APIC ID supported and fill it in for all APIC incarnations. The next step will remove the callback. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:31 -07:00
Thomas Gleixner	9132d720eb	x86/apic: Wrap APIC ID validation into an inline Prepare for removing the callback and making this as simple comparison to an upper limit, which is the obvious solution to do for limit checks... Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:30 -07:00
Thomas Gleixner	e7b6a023d2	x86/apic: Move safe wait_icr_idle() next to apic_mem_wait_icr_idle() Move it next to apic_mem_wait_icr_idle(), rename it so that it's clear what it does and rewrite it in readable form. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:29 -07:00
Thomas Gleixner	13d779fd26	x86/apic: Allow apic::safe_wait_icr_idle() to be NULL Remove tons of NOOP callbacks by making the invocation of safe_wait_icr_idle() conditional in the inline wrapper. Will be replaced by a static_call_cond() later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:28 -07:00
Thomas Gleixner	ee513d9da3	x86/apic: Allow apic::wait_icr_idle() to be NULL Nuke more NOOP callbacks and make the invocation conditional. Will be replaced with a static call later. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:28 -07:00
Thomas Gleixner	cfebd0077f	x86/apic: Consolidate wait_icr_idle() implementations Two copies and also needlessly public. Move it into ipi.c so it can be inlined. Rename it to apic_mem_wait_icr_idle(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:28 -07:00
Thomas Gleixner	5a3a46bd16	x86/apic: Mop up apic::apic_id_registered() Really not a hotpath and again no reason for having a gazillion of empty callbacks returning 1. Make it return bool and provide one shared implementation for the remaining users. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:27 -07:00
Thomas Gleixner	9d87f5b67e	x86/apic: Mop up *setup_apic_routing() default_setup_apic_routing() is a complete misnomer. On 64bit it does the actual APIC probing and on 32bit it is used to force select the bigsmp APIC and to emit a redundant message in the apic::setup_apic_routing() callback. Rename the 64bit and 32bit function so they reflect what they are doing and remove the useless APIC callback. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:26 -07:00
Thomas Gleixner	9a2a637af0	x86/apic: Nuke apic::apicid_to_cpu_present() This is only used on 32bit and is a wrapper around physid_set_mask_of_physid() in all 32bit APIC drivers. Remove the callback and use physid_set_mask_of_physid() in the code directly, Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:26 -07:00
Thomas Gleixner	2f6df03f80	x86/apic: Nuke empty init_apic_ldr() callbacks apic::init_apic_ldr() is only invoked when the APIC is initialized. So there is really no point in having: - Default empty callbacks all over the place - Two implementations of the actual LDR init function where one is just unreadable gunk but does exactly the same as the other. Make the apic::init_apic_ldr() invocation conditional, remove the empty callbacks and consolidate the two implementation into one. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:25 -07:00
Thomas Gleixner	79c9a17c16	x86/apic/32: Decrapify the def_bigsmp mechanism If the system has more than 8 CPUs then XAPIC and the bigsmp APIC driver is required. This is ensured via: 1) Enumerating all possible CPUs up to NR_CPUS 2) Checking at boot CPU APIC setup time whether the system has more than 8 CPUs and has an XAPIC. If that's the case then it's attempted to install the bigsmp APIC driver and a magic variable 'def_to_bigsmp' is set to one. 3) If that magic variable is set and CONFIG_X86_BIGSMP=n and the system has more than 8 CPUs smp_sanity_check() removes all CPUs >= #8 from the present and possible mask in the most convoluted way. This logic is completely broken for the case where the bigsmp driver is enabled, but not selected due to a command line option specifying the default APIC. In that case the system boots with default APIC in logical destination mode and fails to reduce the number of CPUs. That aside the above which is sprinkled over 3 different places is yet another piece of art. It would have been too obvious to check the requirements upfront and limit nr_cpu_ids _before_ enumerating tons of CPUs and then removing them again. Implement exactly this. Check the bigsmp requirement when the boot APIC is registered which happens _before_ ACPI/MPTABLE parsing and limit the number of CPUs to 8 if it can't be used. Switch it over when the boot CPU apic is set up if necessary. [ dhansen: fix nr_cpu_ids off-by-one in default_setup_apic_routing() ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:25 -07:00
Thomas Gleixner	d75baa260c	x86/apic/32: Remove pointless default_acpi_madt_oem_check() On 32bit there is no APIC implementing the acpi_madt_oem_check() except XEN PV, but that does not matter at all. generic_apic_probe() runs before ACPI tables are parsed. This selects the XEN APIC if there is no command line override because the XEN APIC driver is the first to be probed. If there is a command line override then the XEN PV driver won't be selected in the MADT OEM check either. As there is no other MADT check implemented for 32bit APICs, this whole excercise is a NOOP and can be removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:24 -07:00
Thomas Gleixner	f2bb0b4f15	x86/apic/32: Remove x86_cpu_to_logical_apicid This per CPU variable is just yet another form of voodoo programming. The boot ordering is: per_cpu(x86_cpu_to_logical_apicid, cpu) = 1U << cpu; ..... setup_apic() apic->init_apic_ldr() default_init_apic_ldr() apic_write(SET_APIC_LOGICAL_ID(1UL << smp_processor_id(), APIC_LDR); id = GET_APIC_LOGICAL_ID(apic_read(APIC_LDR); WARN_ON(id != per_cpu(x86_cpu_to_logical_apicid, cpu)); per_cpu(x86_cpu_to_logical_apicid, cpu) = id; So first write the default into LDR and then validate it against the same default which was set up during early boot APIC enumeration. Brilliant, isn't it? The comment above the per CPU variable declaration describes it well: 'Let's keep it ugly for now.' Remove the useless gunk and use '1U << cpu' consistently all over the place. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:23 -07:00
Thomas Gleixner	e120e58ec2	x86/apic/32: Sanitize logical APIC ID handling apic::x86_32_early_logical_apicid() is yet another historical joke. It is used to preset the x86_cpu_to_logical_apicid per CPU variable during APIC enumeration with: - 1 shifted left by the CPU number - the physical APIC ID in case of bigsmp The latter is hillarious because bigsmp uses physical destination mode which never can use the logical APIC ID. It gets even worse. As bigsmp can be enforced late in the boot process the probe function overwrites the per CPU variable which is never used for this APIC type once again. Remove that gunk and store 1 << cpunr unconditionally if and only if the CPU number is less than 8, because the default logical destination mode only allows up to 8 CPUs. This is just an intermediate step before removing the per CPU insanity completely. Stay tuned. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:23 -07:00
Thomas Gleixner	f52e2c3e96	x86/apic: Remove check_phys_apicid_present() The only silly usage site is gone. Remove the gunk which was even outright wrong in the bigsmp_32 case which returned true unconditionally. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:22 -07:00
Thomas Gleixner	81287ad65d	x86/apic: Sanitize APIC address setup Convert places which just write mp_lapic_addr and let them register the local APIC address directly instead of relying on magic other code to do so. Add a WARN_ON() into register_lapic_address() which is raised when register_lapic_address() is invoked more than once during boot. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:20 -07:00
Thomas Gleixner	1751adedbd	x86/apic: Make some APIC init functions bool Quite some APIC init functions are pure boolean, but use the success = 0, fail < 0 model. That's confusing as hell when reading through the code. Convert them to boolean. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:20 -07:00
Thomas Gleixner	249ada2c82	x86/apic: Remove the pointless APIC version check This historical leftover is really uninteresting today. Whatever MPTABLE or MADT delivers we only trust the hardware anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:19 -07:00
Thomas Gleixner	d10a904435	x86/apic: Consolidate boot_cpu_physical_apicid initialization sites boot_cpu_physical_apicid is written in random places and in the last consequence filled with the APIC ID read from the local APIC. That causes it to have inconsistent state when the MPTABLE is broken. As a consequence tons of moronic checks are sprinkled all over the place. Consolidate the code and read it exactly once when either X2APIC mode is detected early or when the APIC mapping is established. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:18 -07:00
Thomas Gleixner	1d90c9f731	x86/apic: Nuke unused apic::inquire_remote_apic() Put it to the other historical leftovers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:18 -07:00
Thomas Gleixner	a6625b473b	x86/apic: Get rid of hard_smp_processor_id() No point in having a wrapper around read_apic_id(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:17 -07:00
Thomas Gleixner	d23c977fb0	x86/apic: Remove pointless x86_bios_cpu_apicid It's a useless copy of x86_cpu_to_apicid. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:17 -07:00
Thomas Gleixner	ecf600f894	x86/apic/ioapic: Rename skip_ioapic_setup Another variable name which is confusing at best. Convert to bool. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:16 -07:00
Thomas Gleixner	49062454a3	x86/apic: Rename disable_apic It reflects a state and not a command. Make it bool while at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:16 -07:00
Thomas Gleixner	13d88dcb1a	x86/cpu: Remove unused physid_*() nonsense Tons of silly unused bitmap wrappers... Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:15 -07:00
Thomas Gleixner	3ba3fdfe2c	x86/cpu: Make identify_boot_cpu() static It's not longer used outside the source file. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)	2023-08-09 11:58:15 -07:00
Borislav Petkov (AMD)	77245f1c3c	x86/CPU/AMD: Do not leak quotient data after a division by 0 Under certain circumstances, an integer division by 0 which faults, can leave stale quotient data from a previous division operation on Zen1 microarchitectures. Do a dummy division 0/1 before returning from the #DE exception handler in order to avoid any leaks of potentially sensitive data. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-08-09 07:55:00 -07:00
Jinghao Jia	7324f74d39	x86/linkage: Fix typo of BUILD_VDSO in asm/linkage.h The BUILD_VDSO macro was incorrectly spelled as BULID_VDSO in asm/linkage.h. This causes the !defined(BULID_VDSO) directive to always evaluate to true. Correct the spelling to BUILD_VDSO. Fixes: `bea75b3389` ("x86/Kconfig: Introduce function padding") Signed-off-by: Jinghao Jia <jinghao@linux.ibm.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230808182353.76218-1-jinghao@linux.ibm.com	2023-08-08 23:13:17 +02:00
Xin Li	39163d5479	x86/vdso: Choose the right GDT_ENTRY_CPUNODE for 32-bit getcpu() on 64-bit kernel The vDSO getcpu() reads CPU ID from the GDT_ENTRY_CPUNODE entry when the RDPID instruction is not available. And GDT_ENTRY_CPUNODE is defined as 28 on 32-bit Linux kernel and 15 on 64-bit. But the 32-bit getcpu() on 64-bit Linux kernel is compiled with 32-bit Linux kernel GDT_ENTRY_CPUNODE, i.e., 28, beyond the 64-bit Linux kernel GDT limit. Thus, it just fails _silently_. When BUILD_VDSO32_64 is defined, choose the 64-bit Linux kernel GDT definitions to compile the 32-bit getcpu(). Fixes: `877cff5296` ("x86/vdso: Fake 32bit VDSO build on 64bit compile for vgetcpu") Reported-by: kernel test robot <yujie.liu@intel.com> Reported-by: Shan Kang <shan.kang@intel.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230322061758.10639-1-xin3.li@intel.com Link: https://lore.kernel.org/oe-lkp/202303020903.b01fd1de-yujie.liu@intel.com	2023-08-08 09:31:43 +02:00
Linus Torvalds	64094e7e31	Mitigate Gather Data Sampling issue * Add Base GDS mitigation * Support GDS_NO under KVM * Fix a documentation typo -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTJh5YACgkQaDWVMHDJ krAzAw/8DzjhAYEa7a1AodCBMNg8uNOPnLNoRPPNhaN5Iw6W3zXYDBDKT9PyjAIx RoIM0aHx/oY9nCpK441o25oCWAAyzk6E5/+q9hMa7B4aHUGKqiDUC6L9dC8UiiSN yvoBv4g7F81QnmyazwYI64S6vnbr4Cqe7K/mvVqQ/vbJiugD25zY8mflRV9YAuMk Oe7Ff/mCA+I/kqyKhJE3cf3qNhZ61FsFI886fOSvIE7g4THKqo5eGPpIQxR4mXiU Ri2JWffTaeHr2m0sAfFeLH4VTZxfAgBkNQUEWeG6f2kDGTEKibXFRsU4+zxjn3gl xug+9jfnKN1ceKyNlVeJJZKAfr2TiyUtrlSE5d+subIRKKBaAGgnCQDasaFAluzd aZkOYz30PCebhN+KTrR84FySHCaxnev04jqdtVGAQEDbTvyNagFUdZFGhWijJShV l2l4A0gFSYJmPfPVuuAwOJnnZtA1sRH9oz/Sny3+z9BKloZh+Nc/+Cu9zC8SLjaU BF3Qv2gU9HKTJ+MSy2JrGS52cONfpO5ngFHoOMilZ1KBHrfSb1eiy32PDT+vK60Y PFEmI8SWl7bmrO1snVUCfGaHBsHJSu5KMqwBGmM4xSRzJpyvRe493xC7+nFvqNLY vFOFc4jGeusOXgiLPpfGduppkTGcM7sy75UMLwTSLcQbDK99mus= =ZAPY -----END PGP SIGNATURE----- Merge tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/gds fixes from Dave Hansen: "Mitigate Gather Data Sampling issue: - Add Base GDS mitigation - Support GDS_NO under KVM - Fix a documentation typo" * tag 'gds-for-linus-2023-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation/x86: Fix backwards on/off logic about YMM support KVM: Add GDS_NO support to KVM x86/speculation: Add Kconfig option for GDS x86/speculation: Add force option to GDS mitigation x86/speculation: Add Gather Data Sampling mitigation	2023-08-07 17:03:54 -07:00
Linus Torvalds	138bcddb86	Add a mitigation for the speculative RAS (Return Address Stack) overflow vulnerability on AMD processors. In short, this is yet another issue where userspace poisons a microarchitectural structure which can then be used to leak privileged information through a side channel. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTQs1gACgkQEsHwGGHe VUo1UA/8C34PwJveZDcerdkaxSF+WKx7AjOI/L2ws1qn9YVFA3ItFMgVuFTrlY6c 1eYKYB3FS9fVN3KzGOXGyhho6seHqfY0+8cyYupR+PVLn9rSy7GqHaIMr37FdQ2z yb9xu26v+gsvuPEApazS6MxijYS98u71rHhmg97qsHCnUiMJ01+TaGucntukNJv8 FfwjZJvgeUiBPQ/6IeA/O0413tPPJ9weawPyW+sV1w7NlXjaUVkNXwiq/Xxbt9uI sWwMBjFHpSnhBRaDK8W5Blee/ZfsS6qhJ4jyEKUlGtsElMnZLPHbnrbpxxqA9gyE K+3ZhoHf/W1hhvcZcALNoUHLx0CvVekn0o41urAhPfUutLIiwLQWVbApmuW80fgC DhPedEFu7Wp6Okj5+Bqi/XOsOOWN2WRDSzdAq10o1C+e+fzmkr6y4E6gskfz1zXU ssD9S4+uAJ5bccS5lck4zLffsaA03nAYTlvl1KRP4pOz5G9ln6eyO20ar1WwfGAV o5ZsTJVGQMyVA49QFkksj+kOI3chkmDswPYyGn2y8OfqYXU4Ip4eN+VkjorIAo10 zIec3Z0bCGZ9UUMylUmdtH3KAm8q0wVNoFrUkMEmO8j6nn7ew2BhwLMn4uu+nOnw lX2AG6PNhRLVDVaNgDsWMwejaDsitQPoWRuCIAZ0kQhbeYuwfpM= =73JY -----END PGP SIGNATURE----- Merge tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/srso fixes from Borislav Petkov: "Add a mitigation for the speculative RAS (Return Address Stack) overflow vulnerability on AMD processors. In short, this is yet another issue where userspace poisons a microarchitectural structure which can then be used to leak privileged information through a side channel" * tag 'x86_bugs_srso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/srso: Tie SBPB bit setting to microcode patch detection x86/srso: Add a forgotten NOENDBR annotation x86/srso: Fix return thunks in generated code x86/srso: Add IBPB on VMEXIT x86/srso: Add IBPB x86/srso: Add SRSO_NO support x86/srso: Add IBPB_BRTYPE support x86/srso: Add a Speculative RAS Overflow mitigation x86/bugs: Increase the x86 bugs vector size to two u32s	2023-08-07 16:35:44 -07:00
Ard Biesheuvel	a1b87d54f4	x86/efistub: Avoid legacy decompressor when doing EFI boot The bare metal decompressor code was never really intended to run in a hosted environment such as the EFI boot services, and does a few things that are becoming problematic in the context of EFI boot now that the logo requirements are getting tighter: EFI executables will no longer be allowed to consist of a single executable section that is mapped with read, write and execute permissions if they are intended for use in a context where Secure Boot is enabled (and where Microsoft's set of certificates is used, i.e., every x86 PC built to run Windows). To avoid stepping on reserved memory before having inspected the E820 tables, and to ensure the correct placement when running a kernel build that is non-relocatable, the bare metal decompressor moves its own executable image to the end of the allocation that was reserved for it, in order to perform the decompression in place. This means the region in question requires both write and execute permissions, which either need to be given upfront (which EFI will no longer permit), or need to be applied on demand using the existing page fault handling framework. However, the physical placement of the kernel is usually randomized anyway, and even if it isn't, a dedicated decompression output buffer can be allocated anywhere in memory using EFI APIs when still running in the boot services, given that EFI support already implies a relocatable kernel. This means that decompression in place is never necessary, nor is moving the compressed image from one end to the other. Since EFI already maps all of memory 1:1, it is also unnecessary to create new page tables or handle page faults when decompressing the kernel. That means there is also no need to replace the special exception handlers for SEV. Generally, there is little need to do any of the things that the decompressor does beyond - initialize SEV encryption, if needed, - perform the 4/5 level paging switch, if needed, - decompress the kernel - relocate the kernel So do all of this from the EFI stub code, and avoid the bare metal decompressor altogether. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230807162720.545787-24-ardb@kernel.org	2023-08-07 21:07:43 +02:00
Ard Biesheuvel	31c77a5099	x86/efistub: Perform SNP feature test while running in the firmware Before refactoring the EFI stub boot flow to avoid the legacy bare metal decompressor, duplicate the SNP feature check in the EFI stub before handing over to the kernel proper. The SNP feature check can be performed while running under the EFI boot services, which means it can force the boot to fail gracefully and return an error to the bootloader if the loaded kernel does not implement support for all the features that the hypervisor enabled. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230807162720.545787-23-ardb@kernel.org	2023-08-07 21:03:53 +02:00
Ard Biesheuvel	8338151935	x86/decompressor: Factor out kernel decompression and relocation Factor out the decompressor sequence that invokes the decompressor, parses the ELF and applies the relocations so that it can be called directly from the EFI stub. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230807162720.545787-21-ardb@kernel.org	2023-08-07 20:59:13 +02:00
Thomas Gleixner	bdc1dad299	x86/vector: Replace IRQ_MOVE_CLEANUP_VECTOR with a timer callback The left overs of a moved interrupt are cleaned up once the interrupt is raised on the new target CPU. Keeping the vector valid on the original target CPU guarantees that there can't be an interrupt lost if the affinity change races with an concurrent interrupt from the device. This cleanup utilizes the lowest priority interrupt vector for this cleanup, which makes sure that in the unlikely case when the to be cleaned up interrupt is pending in the local APICs IRR the cleanup vector does not live lock. But there is no real reason to use an interrupt vector for cleaning up the leftovers of a moved interrupt. It's not a high performance operation. The only requirement is that it happens on the original target CPU. Convert it to use a timer instead and adjust the code accordingly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230621171248.6805-3-xin3.li@intel.com	2023-08-06 14:15:10 +02:00
Thomas Gleixner	a539cc86a1	x86/vector: Rename send_cleanup_vector() to vector_schedule_cleanup() Rename send_cleanup_vector() to vector_schedule_cleanup() to prepare for replacing the vector cleanup IPI with a timer callback. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Link: https://lore.kernel.org/r/20230621171248.6805-2-xin3.li@intel.com	2023-08-06 14:15:09 +02:00
Linus Torvalds	024ff300db	hyperv-fixes for 6.5-rc5 -----BEGIN PGP SIGNATURE----- iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmTNh9UTHHdlaS5saXVA a2VybmVsLm9yZwAKCRB2FHBfkEGgXkhFCACvhcb/scp31lcBjaGb8AWogejPBFfW sb38M9X0E50omXoWahYXCMe2Dot7sgxEPuCr7PgwMBGj7Mbdy+kJg9OgdQc7mvks XVENJmfe5H4pyNw69eVjGz/ceNP+GzpTEO0ut+suCxa8+JiBDNnTfLd1W/UYNCJ8 JOGauWnvkzA5B6/lgAB6JSldDTijKQr1NQGYGZOO6LMxVSUYb/9jFRceyEaGRL/S HQdQU3FVcBYK2R2bqp4KCmQWFF352szmYuKMioMhpXQZTo868/llLPVarIwL/fUA 2u7NLLmza0N+DR3esZnKmMASa8wuNJgxV1Ros3nRHBa50xsTwfbvICjW =MaTo -----END PGP SIGNATURE----- Merge tag 'hyperv-fixes-signed-20230804' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Fix a bug in a python script for Hyper-V (Ani Sinha) - Workaround a bug in Hyper-V when IBT is enabled (Michael Kelley) - Fix an issue parsing MP table when Linux runs in VTL2 (Saurabh Sengar) - Several cleanup patches (Nischala Yelchuri, Kameron Carr, YueHaibing, ZhiHu) * tag 'hyperv-fixes-signed-20230804' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: Drivers: hv: vmbus: Remove unused extern declaration vmbus_ontimer() x86/hyperv: add noop functions to x86_init mpparse functions vmbus_testing: fix wrong python syntax for integer value comparison x86/hyperv: fix a warning in mshyperv.h x86/hyperv: Disable IBT when hypercall page lacks ENDBR instruction x86/hyperv: Improve code for referencing hyperv_pcpu_input_arg Drivers: hv: Change hv_free_hyperv_page() to take void * argument	2023-08-04 17:16:14 -07:00
Sean Christopherson	2d63699099	KVM: x86: Always write vCPU's current TSC offset/ratio in vendor hooks Drop the @offset and @multiplier params from the kvm_x86_ops hooks for propagating TSC offsets/multipliers into hardware, and instead have the vendor implementations pull the information directly from the vCPU structure. The respective vCPU fields _must_ be written at the same time in order to maintain consistent state, i.e. it's not random luck that the value passed in by all callers is grabbed from the vCPU. Explicitly grabbing the value from the vCPU field in SVM's implementation in particular will allow for additional cleanup without introducing even more subtle dependencies. Specifically, SVM can skip the WRMSR if guest state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the correct value for the vCPU prior to entering the guest. This also reconciles KVM's handling of related values that are stored in the vCPU, as svm_write_tsc_offset() already assumes/requires the caller to have updated l1_tsc_offset. Link: https://lore.kernel.org/r/20230729011608.1065019-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 17:16:29 -07:00
Sean Christopherson	76ab816108	x86/virt: KVM: Move "disable SVM" helper into KVM SVM Move cpu_svm_disable() into KVM proper now that all hardware virtualization management is routed through KVM. Remove the now-empty virtext.h. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 15:37:15 -07:00
Sean Christopherson	85fd29dd5f	x86/virt: KVM: Open code cpu_has_svm() into kvm_is_svm_supported() Fold the guts of cpu_has_svm() into kvm_is_svm_supported(), its sole remaining user. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 15:37:15 -07:00
Sean Christopherson	5df8ecfe36	x86/virt: Drop unnecessary check on extended CPUID level in cpu_has_svm() Drop the explicit check on the extended CPUID level in cpu_has_svm(), the kernel's cached CPUID info will leave the entire SVM leaf unset if said leaf is not supported by hardware. Prior to using cached information, the check was needed to avoid false positives due to Intel's rather crazy CPUID behavior of returning the values of the maximum supported leaf if the specified leaf is unsupported. Fixes: `682a810887` ("x86/kvm/svm: Simplify cpu_has_svm()") Link: https://lore.kernel.org/r/20230721201859.2307736-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 15:37:15 -07:00
Sean Christopherson	22e420e127	x86/virt: KVM: Move VMXOFF helpers into KVM VMX Now that VMX is disabled in emergencies via the virt callbacks, move the VMXOFF helpers into KVM, the only remaining user. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 15:37:15 -07:00
Sean Christopherson	b6a6af0d19	x86/virt: KVM: Open code cpu_has_vmx() in KVM VMX Fold the raw CPUID check for VMX into kvm_is_vmx_supported(), its sole user. Keep the check even though KVM also checks X86_FEATURE_VMX, as the intent is to provide a unique error message if VMX is unsupported by hardware, whereas X86_FEATURE_VMX may be clear due to firmware and/or kernel actions. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 15:37:14 -07:00
Sean Christopherson	261cd5ed93	x86/reboot: Expose VMCS crash hooks if and only if KVM_{INTEL,AMD} is enabled Expose the crash/reboot hooks used by KVM to disable virtualization in hardware and unblock INIT only if there's a potential in-tree user, i.e. either KVM_INTEL or KVM_AMD is enabled. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 15:37:14 -07:00
Sean Christopherson	baeb4de7ad	x86/reboot: KVM: Disable SVM during reboot via virt/KVM reboot callback Use the virt callback to disable SVM (and set GIF=1) during an emergency instead of blindly attempting to disable SVM. Like the VMX case, if a hypervisor, i.e. KVM, isn't loaded/active, SVM can't be in use. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 15:37:14 -07:00
Sean Christopherson	119b5cb4ff	x86/reboot: KVM: Handle VMXOFF in KVM's reboot callback Use KVM VMX's reboot/crash callback to do VMXOFF in an emergency instead of manually and blindly doing VMXOFF. There's no need to attempt VMXOFF if a hypervisor, i.e. KVM, isn't loaded/active, i.e. if the CPU can't possibly be post-VMXON. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 15:37:14 -07:00
Sean Christopherson	5e408396c6	x86/reboot: Harden virtualization hooks for emergency reboot Provide dedicated helpers to (un)register virt hooks used during an emergency crash/reboot, and WARN if there is an attempt to overwrite the registered callback, or an attempt to do an unpaired unregister. Opportunsitically use rcu_assign_pointer() instead of RCU_INIT_POINTER(), mainly so that the set/unset paths are more symmetrical, but also because any performance gains from using RCU_INIT_POINTER() are meaningless for this code. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230721201859.2307736-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 15:37:14 -07:00
Sean Christopherson	b23c83ad2c	x86/reboot: VMCLEAR active VMCSes before emergency reboot VMCLEAR active VMCSes before any emergency reboot, not just if the kernel may kexec into a new kernel after a crash. Per Intel's SDM, the VMX architecture doesn't require the CPU to flush the VMCS cache on INIT. If an emergency reboot doesn't RESET CPUs, cached VMCSes could theoretically be kept and only be written back to memory after the new kernel is booted, i.e. could effectively corrupt memory after reboot. Opportunistically remove the setting of the global pointer to NULL to make checkpatch happy. Cc: Andrew Cooper <Andrew.Cooper3@citrix.com> Link: https://lore.kernel.org/r/20230721201859.2307736-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-08-03 15:37:14 -07:00
Dave Hansen	54e3d9434e	x86/mm: Remove "INVPCID single" feature tracking From: Dave Hansen <dave.hansen@linux.intel.com> tl;dr: Replace a synthetic X86_FEATURE with a hardware X86_FEATURE and check of existing per-cpu state. == Background == There are three features in play here: 1. Good old Page Table Isolation (PTI) 2. Process Context IDentifiers (PCIDs) which allow entries from multiple address spaces to be in the TLB at once. 3. Support for the "Invalidate PCID" (INVPCID) instruction, specifically the "individual address" mode (aka. mode 0). When all three of these are in place, INVPCID can and should be used to flush out individual addresses in the PTI user address space. But there's a wrinkle or two: First, this INVPCID mode is dependent on CR4.PCIDE. Even if X86_FEATURE_INVPCID==1, the instruction may #GP without setting up CR4. Second, TLB flushing is done very early, even before CR4 is fully set up. That means even if PTI, PCID and INVPCID are supported, there is still a window where INVPCID can #GP. == Problem == The current code seems to work, but mostly by chance and there are a bunch of ways it can go wrong. It's also somewhat hard to follow since X86_FEATURE_INVPCID_SINGLE is set far away from its lone user. == Solution == Make "INVPCID single" more robust and easier to follow by placing all the logic in one place. Remove X86_FEATURE_INVPCID_SINGLE. Make two explicit checks before using INVPCID: 1. Check that the system supports INVPCID itself (boot_cpu_has()) 2. Then check the CR4.PCIDE shadow to ensures that the CPU can safely use INVPCID for individual address invalidation. The CR4 check always works and is not affected by any X86_FEATURE_* twiddling or inconsistencies between the boot and secondary CPUs. This has been tested on non-Meltdown hardware by using pti=on and then flipping PCID and INVPCID support with qemu. == Aside == How does this code even work today? By chance, I think. First, PTI is initialized around the same time that the boot CPU sets CR4.PCIDE=1. There are currently no TLB invalidations when PTI=1 but CR4.PCIDE=0. That means that the X86_FEATURE_INVPCID_SINGLE check is never even reached. this_cpu_has() is also very nasty to use in this context because the boot CPU reaches here before cpu_data(0) has been initialized. It happens to work for X86_FEATURE_INVPCID_SINGLE since it's a software-defined feature but it would fall over for a hardware- derived X86_FEATURE. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230718170630.7922E235%40davehans-spike.ostc.intel.com	2023-08-03 10:34:05 -07:00
Arnd Bergmann	8874a414f8	x86/qspinlock-paravirt: Fix missing-prototype warning __pv_queued_spin_unlock_slowpath() is defined in a header file as a global function, and designed to be called from inline asm, but there is no prototype visible in the definition: kernel/locking/qspinlock_paravirt.h:493:1: error: no previous \ prototype for '__pv_queued_spin_unlock_slowpath' [-Werror=missing-prototypes] Add this to the x86 header that contains the inline asm calling it, and ensure this gets included before the definition, rather than after it. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230803082619.1369127-8-arnd@kernel.org	2023-08-03 17:15:05 +02:00
Arnd Bergmann	ce0a1b608b	x86/paravirt: Silence unused native_pv_lock_init() function warning The native_pv_lock_init() function is only used in SMP configurations and declared in asm/qspinlock.h which is not used in UP kernels, but the function is still defined for both, which causes a warning: arch/x86/kernel/paravirt.c:76:13: error: no previous prototype for 'native_pv_lock_init' [-Werror=missing-prototypes] Move the declaration to asm/paravirt.h so it is visible even with CONFIG_SMP but short-circuit the definition to turn it into an empty function. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230803082619.1369127-7-arnd@kernel.org	2023-08-03 16:50:19 +02:00
Arnd Bergmann	65412c8d72	x86/asm: Avoid unneeded __div64_32 function definition The __div64_32() function is provided for 32-bit architectures that don't have a custom do_div() implementation. x86_32 has one, and does not use the header file that declares the function prototype, so the definition causes a W=1 warning: lib/math/div64.c:31:32: error: no previous prototype for '__div64_32' [-Werror=missing-prototypes] Define an empty macro to prevent the function definition from getting built, which avoids the warning and saves a little .text space. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230803082619.1369127-4-arnd@kernel.org	2023-08-03 12:08:35 +02:00
Rick Edgecombe	67840ad0fa	x86/shstk: Add ARCH_SHSTK_STATUS CRIU and GDB need to get the current shadow stack and WRSS enablement status. This information is already available via /proc/pid/status, but this is inconvenient for CRIU because it involves parsing the text output in an area of the code where this is difficult. Provide a status arch_prctl(), ARCH_SHSTK_STATUS for retrieving the status. Have arg2 be a userspace address, and make the new arch_prctl simply copy the features out to userspace. Suggested-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-43-rick.p.edgecombe%40intel.com	2023-08-02 15:01:51 -07:00
Rick Edgecombe	2fab02b25a	x86: Add PTRACE interface for shadow stack Some applications (like GDB) would like to tweak shadow stack state via ptrace. This allows for existing functionality to continue to work for seized shadow stack applications. Provide a regset interface for manipulating the shadow stack pointer (SSP). There is already ptrace functionality for accessing xstate, but this does not include supervisor xfeatures. So there is not a completely clear place for where to put the shadow stack state. Adding it to the user xfeatures regset would complicate that code, as it currently shares logic with signals which should not have supervisor features. Don't add a general supervisor xfeature regset like the user one, because it is better to maintain flexibility for other supervisor xfeatures to define their own interface. For example, an xfeature may decide not to expose all of it's state to userspace, as is actually the case for shadow stack ptrace functionality. A lot of enum values remain to be used, so just put it in dedicated shadow stack regset. The only downside to not having a generic supervisor xfeature regset, is that apps need to be enlightened of any new supervisor xfeature exposed this way (i.e. they can't try to have generic save/restore logic). But maybe that is a good thing, because they have to think through each new xfeature instead of encountering issues when a new supervisor xfeature was added. By adding a shadow stack regset, it also has the effect of including the shadow stack state in a core dump, which could be useful for debugging. The shadow stack specific xstate includes the SSP, and the shadow stack and WRSS enablement status. Enabling shadow stack or WRSS in the kernel involves more than just flipping the bit. The kernel is made aware that it has to do extra things when cloning or handling signals. That logic is triggered off of separate feature enablement state kept in the task struct. So the flipping on HW shadow stack enforcement without notifying the kernel to change its behavior would severely limit what an application could do without crashing, and the results would depend on kernel internal implementation details. There is also no known use for controlling this state via ptrace today. So only expose the SSP, which is something that userspace already has indirect control over. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-41-rick.p.edgecombe%40intel.com	2023-08-02 15:01:51 -07:00
Rick Edgecombe	05e36022c0	x86/shstk: Handle signals for shadow stack When a signal is handled, the context is pushed to the stack before handling it. For shadow stacks, since the shadow stack only tracks return addresses, there isn't any state that needs to be pushed. However, there are still a few things that need to be done. These things are visible to userspace and which will be kernel ABI for shadow stacks. One is to make sure the restorer address is written to shadow stack, since the signal handler (if not changing ucontext) returns to the restorer, and the restorer calls sigreturn. So add the restorer on the shadow stack before handling the signal, so there is not a conflict when the signal handler returns to the restorer. The other thing to do is to place some type of checkable token on the thread's shadow stack before handling the signal and check it during sigreturn. This is an extra layer of protection to hamper attackers calling sigreturn manually as in SROP-like attacks. For this token the shadow stack data format defined earlier can be used. Have the data pushed be the previous SSP. In the future the sigreturn might want to return back to a different stack. Storing the SSP (instead of a restore offset or something) allows for future functionality that may want to restore to a different stack. So, when handling a signal push - the SSP pointing in the shadow stack data format - the restorer address below the restore token. In sigreturn, verify SSP is stored in the data format and pop the shadow stack. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-32-rick.p.edgecombe%40intel.com	2023-08-02 15:01:50 -07:00
Rick Edgecombe	928054769d	x86/shstk: Introduce routines modifying shstk Shadow stacks are normally written to via CALL/RET or specific CET instructions like RSTORSSP/SAVEPREVSSP. However, sometimes the kernel will need to write to the shadow stack directly using the ring-0 only WRUSS instruction. A shadow stack restore token marks a restore point of the shadow stack, and the address in a token must point directly above the token, which is within the same shadow stack. This is distinctively different from other pointers on the shadow stack, since those pointers point to executable code area. Introduce token setup and verify routines. Also introduce WRUSS, which is a kernel-mode instruction but writes directly to user shadow stack. In future patches that enable shadow stack to work with signals, the kernel will need something to denote the point in the stack where sigreturn may be called. This will prevent attackers calling sigreturn at arbitrary places in the stack, in order to help prevent SROP attacks. To do this, something that can only be written by the kernel needs to be placed on the shadow stack. This can be accomplished by setting bit 63 in the frame written to the shadow stack. Userspace return addresses can't have this bit set as it is in the kernel range. It also can't be a valid restore token. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-31-rick.p.edgecombe%40intel.com	2023-08-02 15:01:50 -07:00
Rick Edgecombe	b2926a36b9	x86/shstk: Handle thread shadow stack When a process is duplicated, but the child shares the address space with the parent, there is potential for the threads sharing a single stack to cause conflicts for each other. In the normal non-CET case this is handled in two ways. With regular CLONE_VM a new stack is provided by userspace such that the parent and child have different stacks. For vfork, the parent is suspended until the child exits. So as long as the child doesn't return from the vfork()/CLONE_VFORK calling function and sticks to a limited set of operations, the parent and child can share the same stack. For shadow stack, these scenarios present similar sharing problems. For the CLONE_VM case, the child and the parent must have separate shadow stacks. Instead of changing clone to take a shadow stack, have the kernel just allocate one and switch to it. Use stack_size passed from clone3() syscall for thread shadow stack size. A compat-mode thread shadow stack size is further reduced to 1/4. This allows more threads to run in a 32-bit address space. The clone() does not pass stack_size, which was added to clone3(). In that case, use RLIMIT_STACK size and cap to 4 GB. For shadow stack enabled vfork(), the parent and child can share the same shadow stack, like they can share a normal stack. Since the parent is suspended until the child terminates, the child will not interfere with the parent while executing as long as it doesn't return from the vfork() and overwrite up the shadow stack. The child can safely overwrite down the shadow stack, as the parent can just overwrite this later. So CET does not add any additional limitations for vfork(). Free the shadow stack on thread exit by doing it in mm_release(). Skip this when exiting a vfork() child since the stack is shared in the parent. During this operation, the shadow stack pointer of the new thread needs to be updated to point to the newly allocated shadow stack. Since the ability to do this is confined to the FPU subsystem, change fpu_clone() to take the new shadow stack pointer, and update it internally inside the FPU subsystem. This part was suggested by Thomas Gleixner. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-30-rick.p.edgecombe%40intel.com	2023-08-02 15:01:50 -07:00
Rick Edgecombe	2d39a6add4	x86/shstk: Add user-mode shadow stack support Introduce basic shadow stack enabling/disabling/allocation routines. A task's shadow stack is allocated from memory with VM_SHADOW_STACK flag and has a fixed size of min(RLIMIT_STACK, 4GB). Keep the task's shadow stack address and size in thread_struct. This will be copied when cloning new threads, but needs to be cleared during exec, so add a function to do this. 32 bit shadow stack is not expected to have many users and it will complicate the signal implementation. So do not support IA32 emulation or x32. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-29-rick.p.edgecombe%40intel.com	2023-08-02 15:01:50 -07:00
Rick Edgecombe	a5f6c2ace9	x86/shstk: Add user control-protection fault handler A control-protection fault is triggered when a control-flow transfer attempt violates Shadow Stack or Indirect Branch Tracking constraints. For example, the return address for a RET instruction differs from the copy on the shadow stack. There already exists a control-protection fault handler for handling kernel IBT faults. Refactor this fault handler into separate user and kernel handlers, like the page fault handler. Add a control-protection handler for usermode. To avoid ifdeffery, put them both in a new file cet.c, which is compiled in the case of either of the two CET features supported in the kernel: kernel IBT or user mode shadow stack. Move some static inline functions from traps.c into a header so they can be used in cet.c. Opportunistically fix a comment in the kernel IBT part of the fault handler that is on the end of the line instead of preceding it. Keep the same behavior for the kernel side of the fault handler, except for converting a BUG to a WARN in the case of a #CP happening when the feature is missing. This unifies the behavior with the new shadow stack code, and also prevents the kernel from crashing under this situation which is potentially recoverable. The control-protection fault handler works in a similar way as the general protection fault handler. It provides the si_code SEGV_CPERR to the signal handler. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-28-rick.p.edgecombe%40intel.com	2023-08-02 15:01:50 -07:00
Rick Edgecombe	98cfa46309	x86: Introduce userspace API for shadow stack Add three new arch_prctl() handles: - ARCH_SHSTK_ENABLE/DISABLE enables or disables the specified feature. Returns 0 on success or a negative value on error. - ARCH_SHSTK_LOCK prevents future disabling or enabling of the specified feature. Returns 0 on success or a negative value on error. The features are handled per-thread and inherited over fork(2)/clone(2), but reset on exec(). Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-27-rick.p.edgecombe%40intel.com	2023-08-02 15:01:50 -07:00
Rick Edgecombe	6ee836687a	x86/fpu: Add helper for modifying xstate Just like user xfeatures, supervisor xfeatures can be active in the registers or present in the task FPU buffer. If the registers are active, the registers can be modified directly. If the registers are not active, the modification must be performed on the task FPU buffer. When the state is not active, the kernel could perform modifications directly to the buffer. But in order for it to do that, it needs to know where in the buffer the specific state it wants to modify is located. Doing this is not robust against optimizations that compact the FPU buffer, as each access would require computing where in the buffer it is. The easiest way to modify supervisor xfeature data is to force restore the registers and write directly to the MSRs. Often times this is just fine anyway as the registers need to be restored before returning to userspace. Do this for now, leaving buffer writing optimizations for the future. Add a new function fpregs_lock_and_load() that can simultaneously call fpregs_lock() and do this restore. Also perform some extra sanity checks in this function since this will be used in non-fpu focused code. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-26-rick.p.edgecombe%40intel.com	2023-08-02 15:01:50 -07:00
Rick Edgecombe	8970ef027b	x86/fpu/xstate: Introduce CET MSR and XSAVES supervisor states Shadow stack register state can be managed with XSAVE. The registers can logically be separated into two groups: * Registers controlling user-mode operation * Registers controlling kernel-mode operation The architecture has two new XSAVE state components: one for each group of those groups of registers. This lets an OS manage them separately if it chooses. Future patches for host userspace and KVM guests will only utilize the user-mode registers, so only configure XSAVE to save user-mode registers. This state will add 16 bytes to the xsave buffer size. Future patches will use the user-mode XSAVE area to save guest user-mode CET state. However, VMCS includes new fields for guest CET supervisor states. KVM can use these to save and restore guest supervisor state, so host supervisor XSAVE support is not required. Adding this exacerbates the already unwieldy if statement in check_xstate_against_struct() that handles warning about unimplemented xfeatures. So refactor these check's by having XCHECK_SZ() set a bool when it actually check's the xfeature. This ends up exceeding 80 chars, but was better on balance than other options explored. Pass the bool as pointer to make it clear that XCHECK_SZ() can change the variable. While configuring user-mode XSAVE, clarify kernel-mode registers are not managed by XSAVE by defining the xfeature in XFEATURE_MASK_SUPERVISOR_UNSUPPORTED, like is done for XFEATURE_MASK_PT. This serves more of a documentation as code purpose, and functionally, only enables a few safety checks. Both XSAVE state components are supervisor states, even the state controlling user-mode operation. This is a departure from earlier features like protection keys where the PKRU state is a normal user (non-supervisor) state. Having the user state be supervisor-managed ensures there is no direct, unprivileged access to it, making it harder for an attacker to subvert CET. To facilitate this privileged access, define the two user-mode CET MSRs, and the bits defined in those MSRs relevant to future shadow stack enablement patches. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-25-rick.p.edgecombe%40intel.com	2023-08-02 15:01:50 -07:00
Rick Edgecombe	6beb99580b	mm: Don't allow write GUPs to shadow stack memory The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. In userspace, shadow stack memory is writable only in very specific, controlled ways. However, since userspace can, even in the limited ways, modify shadow stack contents, the kernel treats it as writable memory. As a result, without additional work there would remain many ways for userspace to trigger the kernel to write arbitrary data to shadow stacks via get_user_pages(, FOLL_WRITE) based operations. To help userspace protect their shadow stacks, make this a little less exposed by blocking writable get_user_pages() operations for shadow stack VMAs. Still allow FOLL_FORCE to write through shadow stack protections, as it does for read-only protections. This is required for debugging use cases. [ dhansen: fix rebase goof, readd writable_file_mapping_allowed() hunk ] Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-23-rick.p.edgecombe%40intel.com	2023-08-02 15:01:20 -07:00
Christoph Hellwig	f9a38ea517	x86: always initialize xen-swiotlb when xen-pcifront is enabling Remove the dangerous late initialization of xen-swiotlb in pci_xen_swiotlb_init_late and instead just always initialize xen-swiotlb in the boot code if CONFIG_XEN_PCIDEV_FRONTEND is enabled and Xen PV PCI is possible. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Juergen Gross <jgross@suse.com>	2023-07-31 17:54:27 +02:00
Arnd Bergmann	566ffa3ae9	x86/cpu: Fix amd_check_microcode() declaration The newly added amd_check_microcode() function has two conflicting definitions if CONFIG_CPU_SUP_AMD is enabled and CONFIG_MICROCODE_AMD is disabled. Since the header with the stub definition is not included in cpu/amd.c, this only causes a -Wmissing-prototype warning with W=1: arch/x86/kernel/cpu/amd.c:1289:6: error: no previous prototype for 'amd_check_microcode' [-Werror=missing-prototypes] Adding the missing #include shows the other problem: arch/x86/kernel/cpu/amd.c:1290:6: error: redefinition of 'amd_check_microcode' arch/x86/include/asm/microcode_amd.h:58:20: note: previous definition of 'amd_check_microcode' with type 'void(void)' Move the declaration into a more appropriate header that is already included, with the #ifdef check changed to match the definition's. Fixes: `522b1d6921` ("x86/cpu/amd: Add a Zenbleed fix") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230725121751.2007665-1-arnd@kernel.org	2023-07-31 11:03:16 +02:00
Linus Torvalds	98a05fe8cd	x86: * Do not register IRQ bypass consumer if posted interrupts not supported * Fix missed device interrupt due to non-atomic update of IRR * Use GFP_KERNEL_ACCOUNT for pid_table in ipiv * Make VMREAD error path play nice with noinstr * x86: Acquire SRCU read lock when handling fastpath MSR writes * Support linking rseq tests statically against glibc 2.35+ * Fix reference count for stats file descriptors * Detect userspace setting invalid CR0 Non-KVM: * Remove coccinelle script that has caused multiple confusion ("debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE() usage", acked by Greg) -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmTGZycUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroOoxQf+OFUHJwtYWJplE/KYHW1Fyo4NE1xx IGyakObkA7sYrij43lH0VV4hL0IYv6Z5R6bU4uXyhFjJHsriEmr8Hq+Zug9XE09+ dsP8vZcai9t1ZZLKdI7uCrm4erDAVbeBrFLjUDb6GmPraWOVQOvJe+C3sZQfDWgp 26OO2EsjTM8liq46URrEUF8qzeWkl7eR9uYPpCKJJ5u3DYuXeq6znHRkEu1U2HYr kuFCayhVZHDMAPGm20/pxK4PX+MU/5une/WLJlqEfOEMuAnbcLxNTJkHF7ntlH+V FNIM3bWdIaNUH+tgaix3c4RdqWzUq9ubTiN+DyG1kPnDt7K2rmUFBvj1jg== =9fND -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "x86: - Do not register IRQ bypass consumer if posted interrupts not supported - Fix missed device interrupt due to non-atomic update of IRR - Use GFP_KERNEL_ACCOUNT for pid_table in ipiv - Make VMREAD error path play nice with noinstr - x86: Acquire SRCU read lock when handling fastpath MSR writes - Support linking rseq tests statically against glibc 2.35+ - Fix reference count for stats file descriptors - Detect userspace setting invalid CR0 Non-KVM: - Remove coccinelle script that has caused multiple confusion ("debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE() usage", acked by Greg)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits) KVM: selftests: Expand x86's sregs test to cover illegal CR0 values KVM: VMX: Don't fudge CR0 and CR4 for restricted L2 guest KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalid Revert "debugfs, coccinelle: check for obsolete DEFINE_SIMPLE_ATTRIBUTE() usage" KVM: selftests: Verify stats fd is usable after VM fd has been closed KVM: selftests: Verify stats fd can be dup()'d and read KVM: selftests: Verify userspace can create "redundant" binary stats files KVM: selftests: Explicitly free vcpus array in binary stats test KVM: selftests: Clean up stats fd in common stats_test() helper KVM: selftests: Use pread() to read binary stats header KVM: Grab a reference to KVM for VM and vCPU stats file descriptors selftests/rseq: Play nice with binaries statically linked against glibc 2.35+ Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid" KVM: x86: Acquire SRCU read lock when handling fastpath MSR writes KVM: VMX: Use vmread_error() to report VM-Fail in "goto" path KVM: VMX: Make VMREAD error path play nice with noinstr KVM: x86/irq: Conditionally register IRQ bypass consumer again KVM: X86: Use GFP_KERNEL_ACCOUNT for pid_table in ipiv KVM: x86: check the kvm_cpu_get_interrupt result before using it KVM: x86: VMX: set irr_pending in kvm_apic_update_irr ...	2023-07-30 11:19:08 -07:00
Sean Christopherson	26a0652cb4	KVM: x86: Disallow KVM_SET_SREGS{2} if incoming CR0 is invalid Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid, e.g. due to setting bits 63:32, illegal combinations, or to a value that isn't allowed in VMX (non-)root mode. The VMX checks in particular are "fun" as failure to disallow Real Mode for an L2 that is configured with unrestricted guest disabled, when KVM itself has unrestricted guest enabled, will result in KVM forcing VM86 mode to virtual Real Mode for L2, but then fail to unwind the related metadata when synthesizing a nested VM-Exit back to L1 (which has unrestricted guest enabled). Opportunistically fix a benign typo in the prototype for is_valid_cr4(). Cc: stable@vger.kernel.org Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230613203037.1968489-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-07-29 11:05:31 -04:00
Alexey Kardashevskiy	d1f85fbe83	KVM: SEV: Enable data breakpoints in SEV-ES Add support for "DebugSwap for SEV-ES guests", which provides support for swapping DR[0-3] and DR[0-3]_ADDR_MASK on VMRUN and VMEXIT, i.e. allows KVM to expose debug capabilities to SEV-ES guests. Without DebugSwap support, the CPU doesn't save/load most _guest_ debug registers (except DR6/7), and KVM cannot manually context switch guest DRs due the VMSA being encrypted. Enable DebugSwap if and only if the CPU also supports NoNestedDataBp, which causes the CPU to ignore nested #DBs, i.e. #DBs that occur when vectoring a #DB. Without NoNestedDataBp, a malicious guest can DoS the host by putting the CPU into an infinite loop of vectoring #DBs (see https://bugzilla.redhat.com/show_bug.cgi?id=1278496) Set the features bit in sev_es_sync_vmsa() which is the last point when VMSA is not encrypted yet as sev_(es_)init_vmcb() (where the most init happens) is called not only when VCPU is initialised but also on intrahost migration when VMSA is encrypted. Eliminate DR7 intercepts as KVM can't modify guest DR7, and intercepting DR7 would completely defeat the purpose of enabling DebugSwap. Make X86_FEATURE_DEBUG_SWAP appear in /proc/cpuinfo (by not adding "") to let the operator know if the VM can debug. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Link: https://lore.kernel.org/r/20230615063757.3039121-7-aik@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-07-28 16:12:56 -07:00
Sohil Mehta	d7114f83ee	x86/smpboot: Change smp_store_boot_cpu_info() to static The function is only used locally. Convert it to a static one. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230727180533.3119660-4-sohil.mehta@intel.com	2023-07-28 10:17:53 +02:00
Sohil Mehta	54bfd02bbf	x86/smp: Remove a non-existent function declaration x86_idle_thread_init() does not exist anywhere. Remove its declaration from the header. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230727180533.3119660-3-sohil.mehta@intel.com	2023-07-28 10:17:53 +02:00
Laurent Dufour	91b4a7dbfe	cpu/SMT: Remove topology_smt_supported() Since the maximum number of threads is now passed to cpu_smt_set_num_threads(), checking that value is enough to know whether SMT is supported. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230705145143.40545-6-ldufour@linux.ibm.com	2023-07-28 09:53:37 +02:00
Michael Ellerman	3f9169196b	cpu/SMT: Move SMT prototypes into cpu_smt.h In order to export the cpuhp_smt_control enum as part of the interface between generic and architecture code, the architecture code needs to include asm/topology.h. But that leads to circular header dependencies. So split the enum and related declarations into a separate header. [ ldufour: Reworded the commit's description ] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20230705145143.40545-3-ldufour@linux.ibm.com	2023-07-28 09:53:36 +02:00
Borislav Petkov (AMD)	d893832d0e	x86/srso: Add IBPB on VMEXIT Add the option to flush IBPB only on VMEXIT in order to protect from malicious guests but one otherwise trusts the software that runs on the hypervisor. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-07-27 11:07:19 +02:00
Borislav Petkov (AMD)	233d6f68b9	x86/srso: Add IBPB Add the option to mitigate using IBPB on a kernel entry. Pull in the Retbleed alternative so that the IBPB call from there can be used. Also, if Retbleed mitigation is done using IBPB, the same mitigation can and must be used here. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-07-27 11:07:19 +02:00
Borislav Petkov (AMD)	1b5277c0ea	x86/srso: Add SRSO_NO support Add support for the CPUID flag which denotes that the CPU is not affected by SRSO. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-07-27 11:07:19 +02:00
Borislav Petkov (AMD)	79113e4060	x86/srso: Add IBPB_BRTYPE support Add support for the synthetic CPUID flag which "if this bit is 1, it indicates that MSR 49h (PRED_CMD) bit 0 (IBPB) flushes all branch type predictions from the CPU branch predictor." This flag is there so that this capability in guests can be detected easily (otherwise one would have to track microcode revisions which is impossible for guests). It is also needed only for Zen3 and -4. The other two (Zen1 and -2) always flush branch type predictions by default. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-07-27 11:07:19 +02:00
Borislav Petkov (AMD)	fb3bd914b3	x86/srso: Add a Speculative RAS Overflow mitigation Add a mitigation for the speculative return address stack overflow vulnerability found on AMD processors. The mitigation works by ensuring all RET instructions speculate to a controlled location, similar to how speculation is controlled in the retpoline sequence. To accomplish this, the __x86_return_thunk forces the CPU to mispredict every function return using a 'safe return' sequence. To ensure the safety of this mitigation, the kernel must ensure that the safe return sequence is itself free from attacker interference. In Zen3 and Zen4, this is accomplished by creating a BTB alias between the untraining function srso_untrain_ret_alias() and the safe return function srso_safe_ret_alias() which results in evicting a potentially poisoned BTB entry and using that safe one for all function returns. In older Zen1 and Zen2, this is accomplished using a reinterpretation technique similar to Retbleed one: srso_untrain_ret() and srso_safe_ret(). Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-07-27 11:07:14 +02:00
Borislav Petkov (AMD)	05e91e7211	x86/microcode/AMD: Rip out static buffers Load straight from the containers (initrd or builtin, for example). There's no need to cache the patch per node. This even simplifies the code a bit with the opportunity for more cleanups later. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/r/20230720202813.3269888-1-john.allen@amd.com	2023-07-27 10:04:54 +02:00
ZhiHu	060f2b979c	x86/hyperv: fix a warning in mshyperv.h The following checkpatch warning is removed: WARNING: Use #include <linux/io.h> instead of <asm/io.h> Signed-off-by: ZhiHu <huzhi001@208suo.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>	2023-07-23 23:12:47 +00:00
Daniel Sneddon	8974eb5882	x86/speculation: Add Gather Data Sampling mitigation Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to DISABLE it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and STAY enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>	2023-07-19 16:45:37 -07:00
Borislav Petkov (AMD)	0e52740ffd	x86/bugs: Increase the x86 bugs vector size to two u32s There was never a doubt in my mind that they would not fit into a single u32 eventually. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-07-18 09:35:38 +02:00
Janusz Krzysztofik	548cb93205	x86/mm: Fix PAT bit missing from page protection modify mask Visible glitches have been observed when running graphics applications on Linux under Xen hypervisor. Those observations have been confirmed with failures from kms_pwrite_crc Intel GPU test that verifies data coherency of DRM frame buffer objects using hardware CRC checksums calculated by display controllers, exposed to userspace via debugfs. Affected processing paths have then been identified with new IGT test variants that mmap the objects using different methods and caching modes [1]. When running as a Xen PV guest, Linux uses Xen provided PAT configuration which is different from its native one. In particular, Xen specific PTE encoding of write-combining caching, likely used by graphics applications, differs from the Linux default one found among statically defined minimal set of supported modes. Since Xen defines PTE encoding of the WC mode as _PAGE_PAT, it no longer belongs to the minimal set, depends on correct handling of _PAGE_PAT bit, and can be mismatched with write-back caching. When a user calls mmap() for a DRM buffer object, DRM device specific .mmap file operation, called from mmap_region(), takes care of setting PTE encoding bits in a vm_page_prot field of an associated virtual memory area structure. Unfortunately, _PAGE_PAT bit is not preserved when the vma's .vm_flags are then applied to .vm_page_prot via vm_set_page_prot(). Bits to be preserved are determined with _PAGE_CHG_MASK symbol that doesn't cover _PAGE_PAT. As a consequence, WB caching is requested instead of WC when running under Xen (also, WP is silently changed to WT, and UC downgraded to UC_MINUS). When running on bare metal, WC is not affected, but WP and WT extra modes are unintentionally replaced with WC and UC, respectively. WP and WT modes, encoded with _PAGE_PAT bit set, were introduced by commit `281d4078be` ("x86: Make page cache mode a real type"). Care was taken to extend _PAGE_CACHE_MASK symbol with that additional bit, but that symbol has never been used for identification of bits preserved when applying page protection flags. Support for all cache modes under Xen, including the problematic WC mode, was then introduced by commit `47591df505` ("xen: Support Xen pv-domains using PAT"). The issue needs to be fixed by including _PAGE_PAT bit into a bitmask used by pgprot_modify() for selecting bits to be preserved. We can do that either internally to pgprot_modify() (as initially proposed), or by making _PAGE_PAT a part of _PAGE_CHG_MASK. If we go for the latter then, since _PAGE_PAT is the same as _PAGE_PSE, we need to note that _HPAGE_CHG_MASK -- a huge pmds' counterpart of _PAGE_CHG_MASK, introduced by commit `c489f1257b` ("thp: add pmd_modify"), defined as (_PAGE_CHG_MASK \| _PAGE_PSE) -- will no longer differ from _PAGE_CHG_MASK. If such modification of _PAGE_CHG_MASK was irrelevant to its users then one might wonder why that new _HPAGE_CHG_MASK symbol was introduced instead of reusing the existing one with that otherwise irrelevant bit (_PAGE_PSE in that case) added. Add _PAGE_PAT to _PAGE_CHG_MASK and _PAGE_PAT_LARGE to _HPAGE_CHG_MASK for symmetry. Split out common bits from both symbols to a common symbol for clarity. [ dhansen: tweak the solution changelog description ] [1] https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/commit/0f0754413f14 Fixes: `281d4078be` ("x86: Make page cache mode a real type") Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Link: https://gitlab.freedesktop.org/drm/intel/-/issues/7648 Link: https://lore.kernel.org/all/20230710073613.8006-2-janusz.krzysztofik%40linux.intel.com	2023-07-17 07:41:45 -07:00
Borislav Petkov (AMD)	522b1d6921	x86/cpu/amd: Add a Zenbleed fix Add a fix for the Zen2 VZEROUPPER data corruption bug where under certain circumstances executing VZEROUPPER can cause register corruption or leak data. The optimal fix is through microcode but in the case the proper microcode revision has not been applied, enable a fallback fix using a chicken bit. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-07-17 15:48:10 +02:00
Michal Wilczynski	5ba30be7fd	ACPI: processor: Introduce acpi_processor_osc() The processor _OSC method is already used for a workaround introduced in commit `a21211672c` ("ACPI / processor: Request native thermal interrupt handling via _OSC"), but in accordance with ACPI 6.5 (and earlier), it should be used for negotiating all of the processor capabilities instead of _PDC (which has been deprecated since ACPI 3.0 and got removed from ACPI 6.5 entirely). Create a new callback function called acpi_processor_osc() to be invoked for every processor object and processor device in the ACPI namespace, in analogy with the already existing acpi_hwp_native_thermal_lvt_osc(). Make this function implement the workaround mentioned above and convey all of the OSPM processor support information to the platform firmware by setting all of the appropriate processor capabilities bits before evaluating _OSC for the given processor. For this purpose, make it call arch_acpi_set_proc_cap_bits() and modify the latter to set ACPI_PROC_CAP_COLLAB_PROC_PERF along with the other processor capabilities bits. Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> [ rjw: Subject and changelog edits, whitespace fixup ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2023-07-14 17:59:40 +02:00
Michal Wilczynski	b9e8d0168a	ACPI: processor: Set CAP_SMP_T_SWCOORD in arch_acpi_set_proc_cap_bits() Currently, ACPI_PROC_CAP_SMP_T_SWCOORD is set in acpi_set_pdc_bits(), but it is not _PDC-specific. It should be set along with the other processor capability bits. Move the setting of ACPI_PROC_CAP_SMP_T_SWCOORD to arch_acpi_set_proc_cap_bits(). Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2023-07-14 17:59:40 +02:00
Michal Wilczynski	4f37ab5e05	ACPI: processor: Clear C_C2C3_FFH and C_C1_FFH in arch_acpi_set_proc_cap_bits() Currently arch_acpi_set_proc_cap_bits() clears ACPI_PDC_C_C2C3_FFH bit in case MWAIT instruction is not supported. It should also clear ACPI_PDC_C_C1_FFH, as when MWAIT is not supported, C1 is entered by executing the HLT instruction. Quote from the C_C1_FFH description: "If set, OSPM is capable of performing native C State instructions (beyond halt) for the C1 handler in multi-processor configurations". As without MWAIT there is no native C-state instructions beyond HALT, this bit should be toggled off." Clear ACPI_PDC_C_C1_FFH and ACPI_PDC_C_C2C3_FFH in arch_acpi_set_proc_cap_bits() in case MWAIT is not supported or overridden. Remove setting those bits from the processor_pdc.c code. Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2023-07-14 17:59:40 +02:00

... 3 4 5 6 7 ...

12521 Commits