linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-15 23:25:07 +00:00

Author	SHA1	Message	Date
Yonghong Song	ad96f1c913	bpf: Fix a bpf_jit_dump issue for x86_64 with sysctl bpf_jit_enable. The sysctl net/core/bpf_jit_enable does not work now due to commit `1022a5498f` ("bpf, x86_64: Use bpf_jit_binary_pack_alloc"). The commit saved the jitted insns into 'rw_image' instead of 'image' which caused bpf_jit_dump not dumping proper content. With 'echo 2 > /proc/sys/net/core/bpf_jit_enable', run './test_progs -t fentry_test'. Without this patch, one of jitted image for one particular prog is: flen=17 proglen=92 pass=4 image=0000000014c64883 from=test_progs pid=1807 00000000: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00000010: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00000020: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00000030: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00000040: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 00000050: cc cc cc cc cc cc cc cc cc cc cc cc With this patch, the jitte image for the same prog is: flen=17 proglen=92 pass=4 image=00000000b90254b7 from=test_progs pid=1809 00000000: f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3 00000010: 0f 1e fa 31 f6 48 8b 57 00 48 83 fa 07 75 2b 48 00000020: 8b 57 10 83 fa 09 75 22 48 8b 57 08 48 81 e2 ff 00000030: 00 00 00 48 83 fa 08 75 11 48 8b 7f 18 be 01 00 00000040: 00 00 48 83 ff 0a 74 02 31 f6 48 bf 18 d0 14 00 00000050: 00 c9 ff ff 48 89 77 00 31 c0 c9 c3 Fixes: `1022a5498f` ("bpf, x86_64: Use bpf_jit_binary_pack_alloc") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20230609005439.3173569-1-yhs@fb.com	2023-06-12 16:47:18 +02:00
Borislav Petkov (AMD)	a32b0f0db3	x86/microcode/AMD: Load late on both threads too Do the same as early loading - load on both threads. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230605141332.25948-1-bp@alien8.de	2023-06-12 11:02:17 +02:00
Linus Torvalds	4c605260bc	- Set up the kernel CS earlier in the boot process in case EFI boots the kernel after bypassing the decompressor and the CS descriptor used ends up being the EFI one which is not mapped in the identity page table, leading to early SEV/SNP guest communication exceptions resulting in the guest crashing -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSFqi8ACgkQEsHwGGHe VUoJWhAAqSYKHVMuOebeCiS4mF7gKIdwE5UwxF5vVWa7nwOLxRUygdFTweyjqV2n bKoGGNwEquYxhKUoRjrBr+dxZXx6qapDS8oGUL9Ndus93Zs/zIe6KF23EJbkZKoy 4uh37D0G2lgA+E3Ke/hX5ac94AYtcTd8cfSw63GIs10vt/bsupdhreSY3e2Z8zcQ e2OngvL/PUX06g4/wXYsAlQszRwyPwJO++y82OdqisJXJLV65fxui7YRS8g4Koh2 DjKHmQyuFTN3D40C7F7vlH0iq9+kAhDpMaG6lL2/QaWGQSAA4sw9qArxy9JTDATw 0Dk+L4iHimPFopke1z6rPG13TRhtLt0cWGCp1/cKQA6w6/eBvpOdpkxUeL7kF8UP yZMh3XeBRWIOewfXeN+sjhsetVHSK3dMRPppN/pry0bgTUWbGBo7nnl3QgrdYyrk l+BigY0JTOBRzkk3ECqCcR88jYcI1jNm/iaqCuPwqhkpanElKryD268cu4FINz60 UFDlrKiEVmQhMrf6MJji3eJec6CezjDtTfuboPPLyUxz/At/5khcxEtAalmieYpy WZmj2hlG9Mzdfv5TA3JX5BvOt7ODicf7wtxJ3W3qxz2iUMJ3uCO0fSn1b9sJz0L7 LwRZ7uskHz5J122AQhEEDq0T6n0rY4GBdZhzONN64wXttSGJTqY= =yaY/ -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.4_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: - Set up the kernel CS earlier in the boot process in case EFI boots the kernel after bypassing the decompressor and the CS descriptor used ends up being the EFI one which is not mapped in the identity page table, leading to early SEV/SNP guest communication exceptions resulting in the guest crashing * tag 'x86_urgent_for_v6.4_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/head/64: Switch to KERNEL_CS as soon as new GDT is installed	2023-06-11 10:14:02 -07:00
Arnd Bergmann	af0a76e126	thread_info: move function declarations to linux/thread_info.h There are a few __weak functions in kernel/fork.c, which architectures can override. If there is no prototype, the compiler warns about them: kernel/fork.c:164:13: error: no previous prototype for 'arch_release_task_struct' [-Werror=missing-prototypes] kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes] kernel/fork.c:1086:12: error: no previous prototype for 'arch_dup_task_struct' [-Werror=missing-prototypes] There are already prototypes in a number of architecture specific headers that have addressed those warnings before, but it's much better to have these in a single place so the warning no longer shows up anywhere. Link: https://lkml.kernel.org/r/20230517131102.934196-14-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Paris <eparis@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 17:44:16 -07:00
Arnd Bergmann	ad1a48301f	init: consolidate prototypes in linux/init.h The init/main.c file contains some extern declarations for functions defined in architecture code, and it defines some other functions that are called from architecture code with a custom prototype. Both of those result in warnings with 'make W=1': init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes] init/main.c:790:20: error: no previous prototype for 'mem_encrypt_init' [-Werror=missing-prototypes] init/main.c:792:20: error: no previous prototype for 'poking_init' [-Werror=missing-prototypes] arch/arm64/kernel/irq.c:122:13: error: no previous prototype for 'init_IRQ' [-Werror=missing-prototypes] arch/arm64/kernel/time.c:55:13: error: no previous prototype for 'time_init' [-Werror=missing-prototypes] arch/x86/kernel/process.c:935:13: error: no previous prototype for 'arch_post_acpi_subsys_init' [-Werror=missing-prototypes] init/calibrate.c:261:37: error: no previous prototype for 'calibrate_delay_is_known' [-Werror=missing-prototypes] kernel/fork.c:991:20: error: no previous prototype for 'arch_task_cache_init' [-Werror=missing-prototypes] Add prototypes for all of these in include/linux/init.h or another appropriate header, and remove the duplicate declarations from architecture specific code. [sfr@canb.auug.org.au: declare time_init_early()] Link: https://lkml.kernel.org/r/20230519124311.5167221c@canb.auug.org.au Link: https://lkml.kernel.org/r/20230517131102.934196-12-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Paris <eparis@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 17:44:16 -07:00
Lorenzo Stoakes	54d020692b	mm/gup: remove unused vmas parameter from get_user_pages() Patch series "remove the vmas parameter from GUP APIs", v6. (pin_/get)_user_pages[_remote]() each provide an optional output parameter for an array of VMA objects associated with each page in the input range. These provide the means for VMAs to be returned, as long as mm->mmap_lock is never released during the GUP operation (i.e. the internal flag FOLL_UNLOCKABLE is not specified). In addition, these VMAs can only be accessed with the mmap_lock held and become invalidated the moment it is released. The vast majority of invocations do not use this functionality and of those that do, all but one case retrieve a single VMA to perform checks upon. It is not egregious in the single VMA cases to simply replace the operation with a vma_lookup(). In these cases we duplicate the (fast) lookup on a slow path already under the mmap_lock, abstracted to a new get_user_page_vma_remote() inline helper function which also performs error checking and reference count maintenance. The special case is io_uring, where io_pin_pages() specifically needs to assert that the VMAs underlying the range do not result in broken long-term GUP file-backed mappings. As GUP now internally asserts that FOLL_LONGTERM mappings are not file-backed in a broken fashion (i.e. requiring dirty tracking) - as implemented in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings" - this logic is no longer required and so we can simply remove it altogether from io_uring. Eliminating the vmas parameter eliminates an entire class of danging pointer errors that might have occured should the lock have been incorrectly released. In addition, the API is simplified and now clearly expresses what it is intended for - applying the specified GUP flags and (if pinning) returning pinned pages. This change additionally opens the door to further potential improvements in GUP and the possible marrying of disparate code paths. I have run this series against gup_test with no issues. Thanks to Matthew Wilcox for suggesting this refactoring! This patch (of 6): No invocation of get_user_pages() use the vmas parameter, so remove it. The GUP API is confusing and caveated. Recent changes have done much to improve that, however there is more we can do. Exporting vmas is a prime target as the caller has to be extremely careful to preclude their use after the mmap_lock has expired or otherwise be left with dangling pointers. Removing the vmas parameter focuses the GUP functions upon their primary purpose - pinning (and outputting) pages as well as performing the actions implied by the input flags. This is part of a patch series aiming to remove the vmas parameter altogether. Link: https://lkml.kernel.org/r/cover.1684350871.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/589e0c64794668ffc799651e8d85e703262b1e9d.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Christian König <christian.koenig@amd.com> (for radeon parts) Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Sean Christopherson <seanjc@google.com> (KVM) Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:25 -07:00
Nhat Pham	cf264e1329	cachestat: implement cachestat syscall There is currently no good way to query the page cache state of large file sets and directory trees. There is mincore(), but it scales poorly: the kernel writes out a lot of bitmap data that userspace has to aggregate, when the user really doesn not care about per-page information in that case. The user also needs to mmap and unmap each file as it goes along, which can be quite slow as well. Some use cases where this information could come in handy: * Allowing database to decide whether to perform an index scan or direct table queries based on the in-memory cache state of the index. * Visibility into the writeback algorithm, for performance issues diagnostic. * Workload-aware writeback pacing: estimating IO fulfilled by page cache (and IO to be done) within a range of a file, allowing for more frequent syncing when and where there is IO capacity, and batching when there is not. * Computing memory usage of large files/directory trees, analogous to the du tool for disk usage. More information about these use cases could be found in the following thread: https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/ This patch implements a new syscall that queries cache state of a file and summarizes the number of cached pages, number of dirty pages, number of pages marked for writeback, number of (recently) evicted pages, etc. in a given range. Currently, the syscall is only wired in for x86 architecture. NAME cachestat - query the page cache statistics of a file. SYNOPSIS #include <sys/mman.h> struct cachestat_range { __u64 off; __u64 len; }; struct cachestat { __u64 nr_cache; __u64 nr_dirty; __u64 nr_writeback; __u64 nr_evicted; __u64 nr_recently_evicted; }; int cachestat(unsigned int fd, struct cachestat_range cstat_range, struct cachestat cstat, unsigned int flags); DESCRIPTION cachestat() queries the number of cached pages, number of dirty pages, number of pages marked for writeback, number of evicted pages, number of recently evicted pages, in the bytes range given by `off` and `len`. An evicted page is a page that is previously in the page cache but has been evicted since. A page is recently evicted if its last eviction was recent enough that its reentry to the cache would indicate that it is actively being used by the system, and that there is memory pressure on the system. These values are returned in a cachestat struct, whose address is given by the `cstat` argument. The `off` and `len` arguments must be non-negative integers. If `len` > 0, the queried range is [`off`, `off` + `len`]. If `len` == 0, we will query in the range from `off` to the end of the file. The `flags` argument is unused for now, but is included for future extensibility. User should pass 0 (i.e no flag specified). Currently, hugetlbfs is not supported. Because the status of a page can change after cachestat() checks it but before it returns to the application, the returned values may contain stale information. RETURN VALUE On success, cachestat returns 0. On error, -1 is returned, and errno is set to indicate the error. ERRORS EFAULT cstat or cstat_args points to an invalid address. EINVAL invalid flags. EBADF invalid file descriptor. EOPNOTSUPP file descriptor is of a hugetlbfs file [nphamcs@gmail.com: replace rounddown logic with the existing helper] Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Brian Foster <bfoster@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-06-09 16:25:16 -07:00
Ingo Molnar	301cf77e21	x86/orc: Make the is_callthunk() definition depend on CONFIG_BPF_JIT=y Recent commit: `020126239b` Revert "x86/orc: Make it callthunk aware" Made the only user of is_callthunk() depend on CONFIG_BPF_JIT=y, while the definition of the helper function is unconditional. Move is_callthunk() inside the #ifdef block. Addresses this build failure: arch/x86/kernel/callthunks.c:296:13: error: ‘is_callthunk’ defined but not used [-Werror=unused-function] Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: linux-kernel@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org>	2023-06-09 11:09:04 +02:00
Juergen Gross	78841cd185	x86/mm: Remove Xen-PV leftovers from init_32.c There are still some unneeded paravirt calls in arch/x86/mm/init_32.c. Remove them. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230609055100.12633-1-jgross@suse.com	2023-06-09 11:00:21 +02:00
Michael Kelley	504dba50b0	x86/irq: Add hardcoded hypervisor interrupts to /proc/stat Some hypervisor interrupts (such as for Hyper-V VMbus and Hyper-V timers) have hardcoded interrupt vectors on x86 and don't have Linux IRQs assigned. These interrupts are shown in /proc/interrupts, but are not reported in the first field of the "intr" line in /proc/stat because the x86 version of arch_irq_stat_cpu() doesn't include them. Fix this by adding code to arch_irq_stat_cpu() to include these interrupts, similar to existing interrupts that don't have Linux IRQs. Use #if IS_ENABLED() because unlike all the other nearby #ifdefs, CONFIG_HYPERV can be built as a module. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/1677523568-50263-1-git-send-email-mikelley%40microsoft.com	2023-06-08 08:28:08 -07:00
Josh Poimboeuf	a9da824762	drm/vmwgfx: Add unwind hints around RBP clobber VMware high-bandwidth hypercalls take the RBP register as input. This breaks basic frame pointer convention, as RBP should never be clobbered. So frame pointer unwinding is broken for the instructions surrounding the hypercalls. Fortunately this doesn't break live patching with CONFIG_FRAME_POINTER, as it only unwinds from blocking tasks, and stack traces from preempted tasks are already marked unreliable anyway. However, for live patching with ORC, this could actually be a theoretical problem if vmw_port_hb_{in,out}() were still compiled with a frame pointer due to having an aligned stack. In practice that hasn't seemed to be an issue since the objtool warnings have only been seen with CONFIG_FRAME_POINTER. Add unwind hint annotations to tell the ORC unwinder to mark stack traces as unreliable. Fixes the following warnings: vmlinux.o: warning: objtool: vmw_port_hb_in+0x1df: return with modified stack frame vmlinux.o: warning: objtool: vmw_port_hb_out+0x1dd: return with modified stack frame Fixes: `89da76fde6` ("drm/vmwgfx: Add VMWare host messaging capability") Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305160135.97q0Elax-lkp@intel.com/ Link: https://lore.kernel.org/r/4c795f2d87bc0391cf6543bcb224fa540b55ce4b.1685981486.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>	2023-06-07 10:03:12 -07:00
Josh Poimboeuf	ac27ecf68a	x86/entry: Move thunk restore code into thunk functions There's no need for both thunk functions to jump to the same shared thunk restore code which lives outside the thunk function boundaries. It disrupts i-cache locality and confuses objtool. Keep it simple by keeping each thunk's restore code self-contained within the function. Fixes a bunch of false positive "missing __noreturn" warnings like: vmlinux.o: warning: objtool: do_arch_prctl_common+0xf4: preempt_schedule_thunk() is missing a __noreturn annotation Fixes: `fedb724c3d` ("objtool: Detect missing __noreturn annotations") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202305281037.3PaI3tW4-lkp@intel.com/ Link: https://lore.kernel.org/r/46aa8aeb716f302e22e1673ae15ee6fe050b41f4.1685488050.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>	2023-06-07 09:54:45 -07:00
Josh Poimboeuf	020126239b	Revert "x86/orc: Make it callthunk aware" Commit `396e0b8e09` ("x86/orc: Make it callthunk aware") attempted to deal with the fact that function prefix code didn't have ORC coverage. However, it didn't work as advertised. Use of the "null" ORC entry just caused affected unwinds to end early. The root cause has now been fixed with commit `5743654f5e` ("objtool: Generate ORC data for __pfx code"). Revert most of commit `396e0b8e09` ("x86/orc: Make it callthunk aware"). The is_callthunk() function remains as it's now used by other code. Link: https://lore.kernel.org/r/a05b916ef941da872cbece1ab3593eceabd05a79.1684245404.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>	2023-06-07 09:48:57 -07:00
Bo Liu	7e980867ce	x86/mm: Remove repeated word in comments Remove the repeated word "the" in comments. Signed-off-by: Bo Liu <liubo03@inspur.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230504085446.2574-1-liubo03@inspur.com	2023-06-07 12:47:34 +02:00
Peter Newman	8da2b938eb	x86/resctrl: Implement rename op for mon groups To change the resources allocated to a large group of tasks, such as an application container, a container manager must write all of the tasks' IDs into the tasks file interface of the new control group. This is challenging when the container's task list is always changing. In addition, if the container manager is using monitoring groups to separately track the bandwidth of containers assigned to the same control group, when moving a container, it must first move the container's tasks to the default monitoring group of the new control group before it can move these tasks into the container's replacement monitoring group under the destination control group. This is undesirable because it makes bandwidth usage during the move unattributable to the correct tasks and resets monitoring event counters and cache usage information for the group. Implement the rename operation only for resctrlfs monitor groups to enable users to move a monitoring group from one control group to another. This effects a change in resources allocated to all the tasks in the monitoring group while otherwise leaving the monitoring data intact. Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20230419125015.693566-3-peternewman@google.com	2023-06-07 12:40:36 +02:00
Peter Newman	c45c06d4ae	x86/resctrl: Factor rdtgroup lock for multi-file ops rdtgroup_kn_lock_live() can only release a kernfs reference for a single file before waiting on the rdtgroup_mutex, limiting its usefulness for operations on multiple files, such as rename. Factor the work needed to respectively break and unbreak active protection on an individual file into rdtgroup_kn_{get,put}(). No functional change. Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20230419125015.693566-2-peternewman@google.com	2023-06-07 12:15:18 +02:00
Like Xu	94cdeebd82	KVM: x86/cpuid: Add AMD CPUID ExtPerfMonAndDbg leaf 0x80000022 CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new performance monitoring features for AMD processors. Bit 0 of EAX indicates support for Performance Monitoring Version 2 (PerfMonV2) features. If found to be set during PMU initialization, the EBX bits of the same CPUID function can be used to determine the number of available PMCs for different PMU types. Expose the relevant bits via KVM_GET_SUPPORTED_CPUID so that guests can make use of the PerfMonV2 features. Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230603011058.1038821-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	4a2771895c	KVM: x86/svm/pmu: Add AMD PerfMonV2 support If AMD Performance Monitoring Version 2 (PerfMonV2) is detected by the guest, it can use a new scheme to manage the Core PMCs using the new global control and status registers. In addition to benefiting from the PerfMonV2 functionality in the same way as the host (higher precision), the guest also can reduce the number of vm-exits by lowering the total number of MSRs accesses. In terms of implementation details, amd_is_valid_msr() is resurrected since three newly added MSRs could not be mapped to one vPMC. The possibility of emulating PerfMonV2 on the mainframe has also been eliminated for reasons of precision. Co-developed-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: drop "Based on the observed HW." comments] Link: https://lore.kernel.org/r/20230603011058.1038821-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	fe8d76c1a6	KVM: x86/cpuid: Add a KVM-only leaf to redirect AMD PerfMonV2 flag Add a KVM-only leaf for AMD's PerfMonV2 to redirect the kernel's scattered version to its architectural location, e.g. so that KVM can query guest support via guest_cpuid_has(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230603011058.1038821-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	1c2bf8a6b0	KVM: x86/pmu: Constrain the num of guest counters with kvm_pmu_cap Cap the number of general purpose counters enumerated on AMD to what KVM actually supports, i.e. don't allow userspace to coerce KVM into thinking there are more counters than actually exist, e.g. by enumerating X86_FEATURE_PERFCTR_CORE in guest CPUID when its not supported. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230603011058.1038821-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	d338d8789e	KVM: x86/pmu: Advertise PERFCTR_CORE iff the min nr of counters is met Enable and advertise PERFCTR_CORE if and only if the minimum number of required counters are available, i.e. if perf says there are less than six general purpose counters. Opportunistically, use kvm_cpu_cap_check_and_set() instead of open coding the check for host support. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: massage shortlog and changelog] Link: https://lore.kernel.org/r/20230603011058.1038821-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	6a08083f29	KVM: x86/pmu: Disable vPMU if the minimum num of counters isn't met Disable PMU support when running on AMD and perf reports fewer than four general purpose counters. All AMD PMUs must define at least four counters due to AMD's legacy architecture hardcoding the number of counters without providing a way to enumerate the number of counters to software, e.g. from AMD's APM: The legacy architecture defines four performance counters (PerfCtrn) and corresponding event-select registers (PerfEvtSeln). Virtualizing fewer than four counters can lead to guest instability as software expects four counters to be available. Rather than bleed AMD details into the common code, just define a const unsigned int and provide a convenient location to document why Intel and AMD have different mins (in particular, AMD's lack of any way to enumerate less than four counters to the guest). Keep the minimum number of counters at Intel at one, even though old P6 and Core Solo/Duo processor effectively require a minimum of two counters. KVM can, and more importantly has up until this point, supported a vPMU so long as the CPU has at least one counter. Perf's support for P6/Core CPUs does require two counters, but perf will happily chug along with a single counter when running on a modern CPU. Cc: Jim Mattson <jmattson@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: set Intel min to '1', not '2'] Link: https://lore.kernel.org/r/20230603011058.1038821-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	6593039d33	KVM: x86: Explicitly zero cpuid "0xa" leaf when PMU is disabled Add an explicit !enable_pmu check as relying on kvm_pmu_cap to be zeroed isn't obvious. Although when !enable_pmu, KVM will have zero-padded kvm_pmu_cap to do subsequent CPUID leaf assignments. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230603011058.1038821-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	13afa29ae4	KVM: x86/pmu: Provide Intel PMU's pmc_is_enabled() as generic x86 code Move the Intel PMU implementation of pmc_is_enabled() to common x86 code as pmc_is_globally_enabled(), and drop AMD's implementation. AMD PMU currently supports only v1, and thus not PERF_GLOBAL_CONTROL, thus the semantics for AMD are unchanged. And when support for AMD PMU v2 comes along, the common behavior will also Just Work. Signed-off-by: Like Xu <likexu@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230603011058.1038821-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	c85cdc1cc1	KVM: x86/pmu: Move handling PERF_GLOBAL_CTRL and friends to common x86 Move the handling of GLOBAL_CTRL, GLOBAL_STATUS, and GLOBAL_OVF_CTRL, a.k.a. GLOBAL_STATUS_RESET, from Intel PMU code to generic x86 PMU code. AMD PerfMonV2 defines three registers that have the same semantics as Intel's variants, just with different names and indices. Conveniently, since KVM virtualizes GLOBAL_CTRL on Intel only for PMU v2 and above, and AMD's version shows up in v2, KVM can use common code for the existence check as well. Signed-off-by: Like Xu <likexu@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230603011058.1038821-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	30dab5c0b6	KVM: x86/pmu: Reject userspace attempts to set reserved GLOBAL_STATUS bits Reject userspace writes to MSR_CORE_PERF_GLOBAL_STATUS that attempt to set reserved bits. Allowing userspace to stuff reserved bits doesn't harm KVM itself, but it's architecturally wrong and the guest can't clear the unsupported bits, e.g. makes the guest's PMI handler very confused. Signed-off-by: Like Xu <likexu@tencent.com> [sean: rewrite changelog to avoid use of #GP, rebase on name change] Link: https://lore.kernel.org/r/20230603011058.1038821-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Like Xu	8de18543df	KVM: x86/pmu: Move reprogram_counters() to pmu.h Move reprogram_counters() out of Intel specific PMU code and into pmu.h so that it can be used to implement AMD PMU v2 support. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: rewrite changelog] Link: https://lore.kernel.org/r/20230603011058.1038821-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Sean Christopherson	53550b8922	KVM: x86/pmu: Rename global_ovf_ctrl_mask to global_status_mask Rename global_ovf_ctrl_mask to global_status_mask to avoid confusion now that Intel has renamed GLOBAL_OVF_CTRL to GLOBAL_STATUS_RESET in PMU v4. GLOBAL_OVF_CTRL and GLOBAL_STATUS_RESET are the same MSR index, i.e. are just different names for the same thing, but the SDM provides different entries in the IA-32 Architectural MSRs table, which gets really confusing when looking at PMU v4 definitions since it looks like GLOBAL_STATUS has bits that don't exist in GLOBAL_OVF_CTRL, but in reality the bits are simply defined in the GLOBAL_STATUS_RESET entry. No functional change intended. Cc: Like Xu <like.xu.linux@gmail.com> Link: https://lore.kernel.org/r/20230603011058.1038821-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:31:44 -07:00
Michal Luczaj	e12fa4b92a	KVM: x86: Clean up: remove redundant bool conversions As test_bit() returns bool, explicitly converting result to bool is unnecessary. Get rid of '!!'. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20230605200158.118109-1-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 17:13:55 -07:00
Kirill A. Shutemov	94142c9d1b	x86/mm: Fix enc_status_change_finish_noop() enc_status_change_finish_noop() is now defined as always-fail, which doesn't make sense for noop. The change has no user-visible effect because it is only called if the platform has CC_ATTR_MEM_ENCRYPT. All platforms with the attribute override the callback with their own implementation. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20230606095622.1939-4-kirill.shutemov%40linux.intel.com	2023-06-06 16:24:27 -07:00
Kirill A. Shutemov	195edce08b	x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad() tl;dr: There is a race in the TDX private<=>shared conversion code which could kill the TDX guest. Fix it by changing conversion ordering to eliminate the window. TDX hardware maintains metadata to track which pages are private and shared. Additionally, TDX guests use the guest x86 page tables to specify whether a given mapping is intended to be private or shared. Bad things happen when the intent and metadata do not match. So there are two thing in play: 1. "the page" -- the physical TDX page metadata 2. "the mapping" -- the guest-controlled x86 page table intent For instance, an unrecoverable exit to VMM occurs if a guest touches a private mapping that points to a shared physical page. In summary: * Private mapping => Private Page == OK (obviously) * Shared mapping => Shared Page == OK (obviously) * Private mapping => Shared Page == BIG BOOM! * Shared mapping => Private Page == OK-ish (It will read generate a recoverable #VE via handle_mmio()) Enter load_unaligned_zeropad(). It can touch memory that is adjacent but otherwise unrelated to the memory it needs to touch. It will cause one of those unrecoverable exits (aka. BIG BOOM) if it blunders into a shared mapping pointing to a private page. This is a problem when __set_memory_enc_pgtable() converts pages from shared to private. It first changes the mapping and second modifies the TDX page metadata. It's moving from: * Shared mapping => Shared Page == OK to: * Private mapping => Shared Page == BIG BOOM! This means that there is a window with a shared mapping pointing to a private page where load_unaligned_zeropad() can strike. Add a TDX handler for guest.enc_status_change_prepare(). This converts the page from shared to private before the page becomes private. This ensures that there is never a private mapping to a shared page. Leave a guest.enc_status_change_finish() in place but only use it for private=>shared conversions. This will delay updating the TDX metadata marking the page private until after the mapping matches the metadata. This also ensures that there is never a private mapping to a shared page. [ dhansen: rewrite changelog ] Fixes: `7dbde76316` ("x86/mm/cpa: Add support for TDX shared memory") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20230606095622.1939-3-kirill.shutemov%40linux.intel.com	2023-06-06 16:24:02 -07:00
Sean Christopherson	056b9919a1	KVM: x86: Use cpu_feature_enabled() for PKU instead of #ifdef Replace an #ifdef on CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS with a cpu_feature_enabled() check on X86_FEATURE_PKU. The macro magic of DISABLED_MASK_BIT_SET() means that cpu_feature_enabled() provides the same end result (no code generated) when PKU is disabled by Kconfig. No functional change intended. Cc: Jon Kohler <jon@nutanix.com> Reviewed-by: Jon Kohler <jon@nutanix.com> Link: https://lore.kernel.org/r/20230602010550.785722-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 15:15:06 -07:00
Sean Christopherson	0a3869e14d	KVM: x86/mmu: Trigger APIC-access page reload iff vendor code cares Request an APIC-access page reload when the backing page is migrated (or unmapped) if and only if vendor code actually plugs the backing pfn into structures that reside outside of KVM's MMU. This avoids kicking all vCPUs in the (hopefully infrequent) scenario where the backing page is migrated/invalidated. Unlike VMX's APICv, SVM's AVIC doesn't plug the backing pfn directly into the VMCB and so doesn't need a hook to invalidate an out-of-MMU "mapping". Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 15:07:05 -07:00
Sean Christopherson	0a8a5f2c8c	KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access page Now that KVM honors past and in-progress mmu_notifier invalidations when reloading the APIC-access page, use KVM's "standard" invalidation hooks to trigger a reload and delete the one-off usage of invalidate_range(). Aside from eliminating one-off code in KVM, dropping KVM's use of invalidate_range() will allow common mmu_notifier to redefine the API to be more strictly focused on invalidating secondary TLBs that share the primary MMU's page tables. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 15:07:05 -07:00
Sean Christopherson	878940b33d	KVM: VMX: Retry APIC-access page reload if invalidation is in-progress Re-request an APIC-access page reload if there is a relevant mmu_notifier invalidation in-progress when KVM retrieves the backing pfn, i.e. stall vCPUs until the backing pfn for the APIC-access page is "officially" stable. Relying on the primary MMU to not make changes after invoking ->invalidate_range() works, e.g. any additional changes to a PRESENT PTE would also trigger an ->invalidate_range(), but using ->invalidate_range() to fudge around KVM not honoring past and in-progress invalidations is a bit hacky. Honoring invalidations will allow using KVM's standard mmu_notifier hooks to detect APIC-access page reloads, which will in turn allow removing KVM's implementation of ->invalidate_range() (the APIC-access page case is a true one-off). Opportunistically add a comment to explain why doing nothing if a memslot isn't found is functionally correct. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 15:07:04 -07:00
Kirill A. Shutemov	3f6819dd19	x86/mm: Allow guest.enc_status_change_prepare() to fail TDX code is going to provide guest.enc_status_change_prepare() that is able to fail. TDX will use the call to convert the GPA range from shared to private. This operation can fail. Add a way to return an error from the callback. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20230606095622.1939-2-kirill.shutemov%40linux.intel.com	2023-06-06 11:07:01 -07:00
Alexander Mikhalitsyn	6d1bc9754b	KVM: SVM: enhance info printk's in SEV init Let's print available ASID ranges for SEV/SEV-ES guests. This information can be useful for system administrator to debug if SEV/SEV-ES fails to enable. There are a few reasons. SEV: - NPT is disabled (module parameter) - CPU lacks some features (sev, decodeassists) - Maximum SEV ASID is 0 SEV-ES: - mmio_caching is disabled (module parameter) - CPU lacks sev_es feature - Minimum SEV ASID value is 1 (can be adjusted in BIOS/UEFI) Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Stéphane Graber <stgraber@ubuntu.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20230522161249.800829-3-aleksandr.mikhalitsyn@canonical.com [sean: print '0' for min SEV-ES ASID if there are no available ASIDs] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 10:41:29 -07:00
Tom Lendacky	6c32117963	x86/sev: Add SNP-specific unaccepted memory support Add SNP-specific hooks to the unaccepted memory support in the boot path (__accept_memory()) and the core kernel (accept_memory()) in order to support booting SNP guests when unaccepted memory is present. Without this support, SNP guests will fail to boot and/or panic() when unaccepted memory is present in the EFI memory map. The process of accepting memory under SNP involves invoking the hypervisor to perform a page state change for the page to private memory and then issuing a PVALIDATE instruction to accept the page. Since the boot path and the core kernel paths perform similar operations, move the pvalidate_pages() and vmgexit_psc() functions into sev-shared.c to avoid code duplication. Create the new header file arch/x86/boot/compressed/sev.h because adding the function declaration to any of the existing SEV related header files pulls in too many other header files, causing the build to fail. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/a52fa69f460fd1876d70074b20ad68210dfc31dd.1686063086.git.thomas.lendacky@amd.com	2023-06-06 18:31:37 +02:00
Tom Lendacky	15d9088779	x86/sev: Use large PSC requests if applicable In advance of providing support for unaccepted memory, request 2M Page State Change (PSC) requests when the address range allows for it. By using a 2M page size, more PSC operations can be handled in a single request to the hypervisor. The hypervisor will determine if it can accommodate the larger request by checking the mapping in the nested page table. If mapped as a large page, then the 2M page request can be performed, otherwise the 2M page request will be broken down into 512 4K page requests. This is still more efficient than having the guest perform multiple PSC requests in order to process the 512 4K pages. In conjunction with the 2M PSC requests, attempt to perform the associated PVALIDATE instruction of the page using the 2M page size. If PVALIDATE fails with a size mismatch, then fallback to validating 512 4K pages. To do this, page validation is modified to work with the PSC structure and not just a virtual address range. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/050d17b460dfc237b51d72082e5df4498d3513cb.1686063086.git.thomas.lendacky@amd.com	2023-06-06 18:29:35 +02:00
Tom Lendacky	7006b75592	x86/sev: Allow for use of the early boot GHCB for PSC requests Using a GHCB for a page stage change (as opposed to the MSR protocol) allows for multiple pages to be processed in a single request. In prep for early PSC requests in support of unaccepted memory, update the invocation of vmgexit_psc() to be able to use the early boot GHCB and not just the per-CPU GHCB structure. In order to use the proper GHCB (early boot vs per-CPU), set a flag that indicates when the per-CPU GHCBs are available and registered. For APs, the per-CPU GHCBs are created before they are started and registered upon startup, so this flag can be used globally for the BSP and APs instead of creating a per-CPU flag. This will allow for a significant reduction in the number of MSR protocol page state change requests when accepting memory. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/d6cbb21f87f81eb8282dd3bf6c34d9698c8a4bbc.1686063086.git.thomas.lendacky@amd.com	2023-06-06 18:29:00 +02:00
Tom Lendacky	69dcb1e3bb	x86/sev: Put PSC struct on the stack in prep for unaccepted memory support In advance of providing support for unaccepted memory, switch from using kmalloc() for allocating the Page State Change (PSC) structure to using a local variable that lives on the stack. This is needed to avoid a possible recursive call into set_pages_state() if the kmalloc() call requires (more) memory to be accepted, which would result in a hang. The current size of the PSC struct is 2,032 bytes. To make the struct more stack friendly, reduce the number of PSC entries from 253 down to 64, resulting in a size of 520 bytes. This is a nice compromise on struct size and total PSC requests while still allowing parallel PSC operations across vCPUs. If the reduction in PSC entries results in any kind of performance issue (that is not seen at the moment), use of a larger static PSC struct, with fallback to the smaller stack version, can be investigated. For more background info on this decision, see the subthread in the Link: tag below. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/lkml/658c455c40e8950cb046dd885dd19dc1c52d060a.1659103274.git.thomas.lendacky@amd.com	2023-06-06 18:28:25 +02:00
Tom Lendacky	5dee19b6b2	x86/sev: Fix calculation of end address based on number of pages When calculating an end address based on an unsigned int number of pages, any value greater than or equal to 0x100000 that is shift PAGE_SHIFT bits results in a 0 value, resulting in an invalid end address. Change the number of pages variable in various routines from an unsigned int to an unsigned long to calculate the end address correctly. Fixes: `5e5ccff60a` ("x86/sev: Add helper for validating pages in early enc attribute changes") Fixes: `dc3f3d2474` ("x86/mm: Validate memory when changing the C-bit") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/6a6e4eea0e1414402bac747744984fa4e9c01bb6.1686063086.git.thomas.lendacky@amd.com	2023-06-06 18:27:20 +02:00
Kirill A. Shutemov	75d090fd16	x86/tdx: Add unaccepted memory support Hookup TDX-specific code to accept memory. Accepting the memory is done with ACCEPT_PAGE module call on every page in the range. MAP_GPA hypercall is not required as the unaccepted memory is considered private already. Extract the part of tdx_enc_status_changed() that does memory acceptance in a new helper. Move the helper tdx-shared.c. It is going to be used by both main kernel and decompressor. [ bp: Fix the INTEL_TDX_GUEST=y, KVM_GUEST=n build. ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230606142637.5171-10-kirill.shutemov@linux.intel.com	2023-06-06 18:25:57 +02:00
Chao Gao	02f1b0b736	KVM: x86: Correct the name for skipping VMENTER l1d flush There is no VMENTER_L1D_FLUSH_NESTED_VM. It should be ARCH_CAP_SKIP_VMENTRY_L1DFLUSH. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230524061634.54141-3-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-06 09:14:52 -07:00
Kirill A. Shutemov	c2b353ae24	x86/tdx: Refactor try_accept_one() Rework try_accept_one() to return accepted size instead of modifying 'start' inside the helper. It makes 'start' in-only argument and streamlines code on the caller side. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230606142637.5171-9-kirill.shutemov@linux.intel.com	2023-06-06 17:38:50 +02:00
Kirill A. Shutemov	ff40b5769a	x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Memory acceptance requires a hypercall and one or multiple module calls. Make helpers for the calls available in boot stub. It has to accept memory where kernel image and initrd are placed. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230606142637.5171-8-kirill.shutemov@linux.intel.com	2023-06-06 17:31:23 +02:00
Kirill A. Shutemov	2053bc57f3	efi: Add unaccepted memory support efi_config_parse_tables() reserves memory that holds unaccepted memory configuration table so it won't be reused by page allocator. Core-mm requires few helpers to support unaccepted memory: - accept_memory() checks the range of addresses against the bitmap and accept memory if needed. - range_contains_unaccepted_memory() checks if anything within the range requires acceptance. Architectural code has to provide efi_get_unaccepted_table() that returns pointer to the unaccepted memory configuration table. arch_accept_memory() handles arch-specific part of memory acceptance. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230606142637.5171-6-kirill.shutemov@linux.intel.com	2023-06-06 17:22:20 +02:00
Kirill A. Shutemov	3fd1239a78	x86/boot/compressed: Handle unaccepted memory The firmware will pre-accept the memory used to run the stub. But, the stub is responsible for accepting the memory into which it decompresses the main kernel. Accept memory just before decompression starts. The stub is also responsible for choosing a physical address in which to place the decompressed kernel image. The KASLR mechanism will randomize this physical address. Since the accepted memory region is relatively small, KASLR would be quite ineffective if it only used the pre-accepted area (EFI_CONVENTIONAL_MEMORY). Ensure that KASLR randomizes among the entire physical address space by also including EFI_UNACCEPTED_MEMORY. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230606142637.5171-5-kirill.shutemov@linux.intel.com	2023-06-06 17:17:24 +02:00
Kirill A. Shutemov	745e3ed85f	efi/libstub: Implement support for unaccepted memory UEFI Specification version 2.9 introduces the concept of memory acceptance: Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, requiring memory to be accepted before it can be used by the guest. Accepting happens via a protocol specific for the Virtual Machine platform. Accepting memory is costly and it makes VMM allocate memory for the accepted guest physical address range. It's better to postpone memory acceptance until memory is needed. It lowers boot time and reduces memory overhead. The kernel needs to know what memory has been accepted. Firmware communicates this information via memory map: a new memory type -- EFI_UNACCEPTED_MEMORY -- indicates such memory. Range-based tracking works fine for firmware, but it gets bulky for the kernel: e820 (or whatever the arch uses) has to be modified on every page acceptance. It leads to table fragmentation and there's a limited number of entries in the e820 table. Another option is to mark such memory as usable in e820 and track if the range has been accepted in a bitmap. One bit in the bitmap represents a naturally aligned power-2-sized region of address space -- unit. For x86, unit size is 2MiB: 4k of the bitmap is enough to track 64GiB or physical address space. In the worst-case scenario -- a huge hole in the middle of the address space -- It needs 256MiB to handle 4PiB of the address space. Any unaccepted memory that is not aligned to unit_size gets accepted upfront. The bitmap is allocated and constructed in the EFI stub and passed down to the kernel via EFI configuration table. allocate_e820() allocates the bitmap if unaccepted memory is present, according to the size of unaccepted region. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20230606142637.5171-4-kirill.shutemov@linux.intel.com	2023-06-06 16:58:23 +02:00
Randy Dunlap	b26d3d054d	x86/lib/msr: Clean up kernel-doc notation Convert x86/lib/msr.c comments to kernel-doc notation to eliminate kernel-doc warnings: msr.c:30: warning: This comment starts with '/**', but isn't \ a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst ... Fixes: `22085a66c2` ("x86: Add another set of MSR accessor functions") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/oe-kbuild-all/202304120048.v4uqUq9Q-lkp@intel.com/	2023-06-06 15:19:50 +02:00
Peter Zijlstra	5c5e9a2b25	x86/tsc: Provide sched_clock_noinstr() With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by providing a noinstr sched_clock_noinstr() function. Specifically, preempt_enable_*() calls out to schedule(), which upsets noinstr validation efforts. vmlinux.o: warning: objtool: native_sched_clock+0x96: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section vmlinux.o: warning: objtool: kvm_clock_read+0x22: call to preempt_schedule_notrace_thunk() leaves .noinstr.text section Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.910937674@infradead.org	2023-06-05 21:11:08 +02:00
Peter Zijlstra	e39acc37db	clocksource: hyper-v: Provide noinstr sched_clock() With the intent to provide local_clock_noinstr(), a variant of local_clock() that's safe to be called from noinstr code (with the assumption that any such code will already be non-preemptible), prepare for things by making the Hyper-V TSC and MSR sched_clock implementations noinstr. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.843039089@infradead.org	2023-06-05 21:11:08 +02:00
Peter Zijlstra	9397fa2ea3	clocksource: hyper-v: Adjust hv_read_tsc_page_tsc() to avoid special casing U64_MAX Currently hv_read_tsc_page_tsc() (ab)uses the (valid) time value of U64_MAX as an error return. This breaks the clean wrap-around of the clock. Modify the function signature to return a boolean state and provide another u64 pointer to store the actual time on success. This obviates the need to steal one time value and restores the full counter width. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.775630881@infradead.org	2023-06-05 21:11:07 +02:00
Peter Zijlstra	77750f78b0	x86/vdso: Fix gettimeofday masking Because of how the virtual clocks use U64_MAX as an exception value instead of a valid time, the clocks can no longer be assumed to wrap cleanly. This is then compounded by arch_vdso_cycles_ok() rejecting everything with the MSB/Sign-bit set. Therefore, the effective mask becomes S64_MAX, and the comment with vdso_calc_delta() that states the mask is U64_MAX and isn't optimized out is just plain silly. Now, the code has a negative filter -- to deal with TSC wobbles: if (cycles > last) which is just plain wrong, because it should've been written as: if ((s64)(cycles - last) > 0) to take wrapping into account, but per all the above, we don't actually wrap on u64 anymore. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Kelley <mikelley@microsoft.com> # Hyper-V Link: https://lore.kernel.org/r/20230519102715.704767397@infradead.org	2023-06-05 21:11:07 +02:00
Peter Zijlstra	8f2d6c41e5	x86/sched: Rewrite topology setup Instead of having a number of fixed topologies to pick from; build one on the fly. This is both simpler now and simpler to extend in the future. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230601153522.GB559993%40hirez.programming.kicks-ass.net	2023-06-05 21:11:03 +02:00
Yazen Ghannam	c35977b00f	x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errors The MI200 (Aldebaran) series of devices introduced a new SMCA bank type for Unified Memory Controllers. The MCE subsystem already has support for this new type. The MCE decoder module will decode the common MCA error information for the new bank type, but it will not pass the information to the AMD64 EDAC module for detailed memory error decoding. Have the MCE decoder module recognize the new bank type as an SMCA UMC memory error and pass the MCA information to AMD64 EDAC. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230515113537.1052146-3-muralimk@amd.com	2023-06-05 12:27:11 +02:00
Borislav Petkov (AMD)	f5e87cd511	x86/amd_nb: Re-sort and re-indent PCI defines Sort them by family, model and type and align them vertically for better readability. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230531094212.GHZHcWdMDkCpAp4daj@fat_crate.local	2023-06-05 12:26:54 +02:00
Yazen Ghannam	e15885689c	x86/amd_nb: Add MI200 PCI IDs The AMD MI200 series accelerators are data center GPUs. They include unified memory controllers and a data fabric similar to those used in AMD x86 CPU products. The memory controllers report errors using MCA, though these errors are generally handled through GPU drivers that directly manage the accelerator device. In some configurations, memory errors from these devices will be reported through MCA and managed by x86 CPUs. The OS is expected to handle these errors in similar fashion to MCA errors originating from memory controllers on the CPUs. In Linux, this flow includes passing MCA errors to a notifier chain with handlers in the EDAC subsystem. The AMD64 EDAC module requires information from the memory controllers and data fabric in order to provide detailed decoding of memory errors. The information is read from hardware registers accessed through interfaces in the data fabric. The accelerator data fabrics are visible to the host x86 CPUs as PCI devices just like x86 CPU data fabrics are already. However, the accelerator fabrics have new and unique PCI IDs. Add PCI IDs for the MI200 series of accelerator devices in order to enable EDAC support. The data fabrics of the accelerator devices will be enumerated as any other fabric already supported. System-specific implementation details will be handled within the AMD64 EDAC module. [ bp: Scrub off marketing speak. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Co-developed-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230515113537.1052146-2-muralimk@amd.com	2023-06-05 12:26:37 +02:00
Mark Rutland	ef558b4b7b	locking/atomic: treewide: delete arch_atomic_() kerneldoc Currently several architectures have kerneldoc comments for arch_atomic_(), which is unhelpful as these live in a shared namespace where they clash, and the arch_atomic_() ops are now an implementation detail of the raw_atomic_() ops, which no-one should use those directly. Delete the kerneldoc comments for arch_atomic_(), along with pseudo-kerneldoc comments which are in the correct style but are missing the leading '/*' necessary to be true kerneldoc comments. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230605070124.3741859-28-mark.rutland@arm.com	2023-06-05 09:57:24 +02:00
Mark Rutland	0f613bfa82	locking/atomic: treewide: use raw_atomic_<op>() Now that we have raw_atomic_<op>() definitions, there's no need to use arch_atomic_<op>() definitions outside of the low-level atomic definitions. Move treewide users of arch_atomic_<op>() over to the equivalent raw_atomic*_<op>(). There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230605070124.3741859-19-mark.rutland@arm.com	2023-06-05 09:57:20 +02:00
Mark Rutland	5bef003538	locking/atomic: x86: add preprocessor symbols Some atomics can be implemented in several different ways, e.g. FULL/ACQUIRE/RELEASE ordered atomics can be implemented in terms of RELAXED atomics, and ACQUIRE/RELEASE/RELAXED can be implemented in terms of FULL ordered atomics. Other atomics are optional, and don't exist in some configurations (e.g. not all architectures implement the 128-bit cmpxchg ops). Subsequent patches will require that architectures define a preprocessor symbol for any atomic (or ordering variant) which is optional. This will make the fallback ifdeffery more robust, and simplify future changes. Add the required definitions to arch/x86. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230605070124.3741859-13-mark.rutland@arm.com	2023-06-05 09:57:17 +02:00
Peter Zijlstra	febe950dbf	arch: Remove cmpxchg_double No moar users, remove the monster. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132323.991907085@infradead.org	2023-06-05 09:36:39 +02:00
Peter Zijlstra	6d12c8d308	percpu: Wire up cmpxchg128 In order to replace cmpxchg_double() with the newly minted cmpxchg128() family of functions, wire it up in this_cpu_cmpxchg(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132323.654945124@infradead.org	2023-06-05 09:36:37 +02:00
Peter Zijlstra	b23e139d0b	arch: Introduce arch_{,try_}_cmpxchg128{,_local}() For all architectures that currently support cmpxchg_double() implement the cmpxchg128() family of functions that is basically the same but with a saner interface. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132323.452120708@infradead.org	2023-06-05 09:36:35 +02:00
Linus Torvalds	b066935bf8	ARM: * Address some fallout of the locking rework, this time affecting the way the vgic is configured * Fix an issue where the page table walker frees a subtree and then proceeds with walking what it has just freed... * Check that a given PA donated to the guest is actually memory (only affecting pKVM) * Correctly handle MTE CMOs by Set/Way * Fix the reported address of a watchpoint forwarded to userspace * Fix the freeing of the root of stage-2 page tables * Stop creating spurious PMU events to perform detection of the default PMU and use the existing PMU list instead. x86: * Fix a memslot lookup bug in the NX recovery thread that could theoretically let userspace bypass the NX hugepage mitigation * Fix a s/BLOCKING/PENDING bug in SVM's vNMI support * Account exit stats for fastpath VM-Exits that never leave the super tight run-loop * Fix an out-of-bounds bug in the optimized APIC map code, and add a regression test for the race. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmR7k1QUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroNblwf/faUVOBMv7mQBGsGa7FNcmaNhYeIT U1k4pFNlo7dNNuNJrGdpo+sOGP5A8CRLNSVvlyjgCHF1Qc9gVtXNvZ9PnA6nAYmB qqvUz/TDw9/NLTlJEkbSs05B4am4yfd5pV6R/32jrPIbXOW++6ae2LpILS/NPBrB y0tGiVUJrO3zVXdBKa4PFmlO8jsXPmMEiicEJa5v2Boeo5SFyFfErw9zDNwSMsQc 27bzbs3O2daXTNMFnwVCCpWUxt1EqWYUXGvBjsChAUI0K10F2/GW9f6YeFsGXqKI d8g1QuCukSt/CvN0pT+g/540mR6i0Azpek1myQfuCu2IhQ1jCJaSWOjoEw== =8VrO -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "ARM: - Address some fallout of the locking rework, this time affecting the way the vgic is configured - Fix an issue where the page table walker frees a subtree and then proceeds with walking what it has just freed... - Check that a given PA donated to the guest is actually memory (only affecting pKVM) - Correctly handle MTE CMOs by Set/Way - Fix the reported address of a watchpoint forwarded to userspace - Fix the freeing of the root of stage-2 page tables - Stop creating spurious PMU events to perform detection of the default PMU and use the existing PMU list instead x86: - Fix a memslot lookup bug in the NX recovery thread that could theoretically let userspace bypass the NX hugepage mitigation - Fix a s/BLOCKING/PENDING bug in SVM's vNMI support - Account exit stats for fastpath VM-Exits that never leave the super tight run-loop - Fix an out-of-bounds bug in the optimized APIC map code, and add a regression test for the race" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: selftests: Add test for race in kvm_recalculate_apic_map() KVM: x86: Bail from kvm_recalculate_phys_map() if x2APIC ID is out-of-bounds KVM: x86: Account fastpath-only VM-Exits in vCPU stats KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK KVM: x86/mmu: Grab memslot for correct address space in NX recovery worker KVM: arm64: Document default vPMU behavior on heterogeneous systems KVM: arm64: Iterate arm_pmus list to probe for default PMU KVM: arm64: Drop last page ref in kvm_pgtable_stage2_free_removed() KVM: arm64: Populate fault info for watchpoint KVM: arm64: Reload PTE after invoking walker callback on preorder traversal KVM: arm64: Handle trap of tagged Set/Way CMOs arm64: Add missing Set/Way CMO encodings KVM: arm64: Prevent unconditional donation of unmapped regions from the host KVM: arm64: vgic: Fix a comment KVM: arm64: vgic: Fix locking comment KVM: arm64: vgic: Wrap vgic_its_create() with config_lock KVM: arm64: vgic: Fix a circular locking issue	2023-06-04 07:16:53 -04:00
Sean Christopherson	4364b28798	KVM: x86: Bail from kvm_recalculate_phys_map() if x2APIC ID is out-of-bounds Bail from kvm_recalculate_phys_map() and disable the optimized map if the target vCPU's x2APIC ID is out-of-bounds, i.e. if the vCPU was added and/or enabled its local APIC after the map was allocated. This fixes an out-of-bounds access bug in the !x2apic_format path where KVM would write beyond the end of phys_map. Check the x2APIC ID regardless of whether or not x2APIC is enabled, as KVM's hardcodes x2APIC ID to be the vCPU ID, i.e. it can't change, and the map allocation in kvm_recalculate_apic_map() doesn't check for x2APIC being enabled, i.e. the check won't get false postivies. Note, this also affects the x2apic_format path, which previously just ignored the "x2apic_id > new->max_apic_id" case. That too is arguably a bug fix, as ignoring the vCPU meant that KVM would not send interrupts to the vCPU until the next map recalculation. In practice, that "bug" is likely benign as a newly present vCPU/APIC would immediately trigger a recalc. But, there's no functional downside to disabling the map, and a future patch will gracefully handle the -E2BIG case by retrying instead of simply disabling the optimized map. Opportunistically add a sanity check on the xAPIC ID size, along with a comment explaining why the xAPIC ID is guaranteed to be "good". Reported-by: Michal Luczaj <mhal@rbox.co> Fixes: `5b84b02917` ("KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230602233250.1014316-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-02 17:20:50 -07:00
Tom Lendacky	a37f2699c3	x86/head/64: Switch to KERNEL_CS as soon as new GDT is installed The call to startup_64_setup_env() will install a new GDT but does not actually switch to using the KERNEL_CS entry until returning from the function call. Commit `bcce829083` ("x86/sev: Detect/setup SEV/SME features earlier in boot") moved the call to sme_enable() earlier in the boot process and in between the call to startup_64_setup_env() and the switch to KERNEL_CS. An SEV-ES or an SEV-SNP guest will trigger #VC exceptions during the call to sme_enable() and if the CS pushed on the stack as part of the exception and used by IRETQ is not mapped by the new GDT, then problems occur. Today, the current CS when entering startup_64 is the kernel CS value because it was set up by the decompressor code, so no issue is seen. However, a recent patchset that looked to avoid using the legacy decompressor during an EFI boot exposed this bug. At entry to startup_64, the CS value is that of EFI and is not mapped in the new kernel GDT. So when a #VC exception occurs, the CS value used by IRETQ is not valid and the guest boot crashes. Fix this issue by moving the block that switches to the KERNEL_CS value to be done immediately after returning from startup_64_setup_env(). Fixes: `bcce829083` ("x86/sev: Detect/setup SEV/SME features earlier in boot") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Link: https://lore.kernel.org/all/6ff1f28af2829cc9aea357ebee285825f90a431f.1684340801.git.thomas.lendacky%40amd.com	2023-06-02 16:59:57 -07:00
Sean Christopherson	791a089861	KVM: SVM: Invoke trace_kvm_exit() for fastpath VM-Exits Move SVM's call to trace_kvm_exit() from the "slow" VM-Exit handler to svm_vcpu_run() so that KVM traces fastpath VM-Exits that re-enter the guest without bouncing through the slow path. This bug is benign in the current code base as KVM doesn't currently support any such exits on SVM. Fixes: `a9ab13ff6e` ("KVM: X86: Improve latency for single target IPI fastpath") Link: https://lore.kernel.org/r/20230602011920.787844-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-02 16:40:27 -07:00
Sean Christopherson	8b703a49c9	KVM: x86: Account fastpath-only VM-Exits in vCPU stats Increment vcpu->stat.exits when handling a fastpath VM-Exit without going through any part of the "slow" path. Not bumping the exits stat can result in wildly misleading exit counts, e.g. if the primary reason the guest is exiting is to program the TSC deadline timer. Fixes: `404d5d7bff` ("KVM: X86: Introduce more exit_fastpath_completion enum values") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230602011920.787844-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-02 16:37:49 -07:00
Maciej S. Szmigiero	b2ce899788	KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK While testing Hyper-V enabled Windows Server 2019 guests on Zen4 hardware I noticed that with vCPU count large enough (> 16) they sometimes froze at boot. With vCPU count of 64 they never booted successfully - suggesting some kind of a race condition. Since adding "vnmi=0" module parameter made these guests boot successfully it was clear that the problem is most likely (v)NMI-related. Running kvm-unit-tests quickly showed failing NMI-related tests cases, like "multiple nmi" and "pending nmi" from apic-split, x2apic and xapic tests and the NMI parts of eventinj test. The issue was that once one NMI was being serviced no other NMI was allowed to be set pending (NMI limit = 0), which was traced to svm_is_vnmi_pending() wrongly testing for the "NMI blocked" flag rather than for the "NMI pending" flag. Fix this by testing for the right flag in svm_is_vnmi_pending(). Once this is done, the NMI-related kvm-unit-tests pass successfully and the Windows guest no longer freezes at boot. Fixes: `fa4c027a79` ("KVM: x86: Add support for SVM's Virtual NMI") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/be4ca192eb0c1e69a210db3009ca984e6a54ae69.1684495380.git.maciej.szmigiero@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-02 16:34:20 -07:00
Sean Christopherson	817fa99836	KVM: x86/mmu: Grab memslot for correct address space in NX recovery worker Factor in the address space (non-SMM vs. SMM) of the target shadow page when recovering potential NX huge pages, otherwise KVM will retrieve the wrong memslot when zapping shadow pages that were created for SMM. The bug most visibly manifests as a WARN on the memslot being non-NULL, but the worst case scenario is that KVM could unaccount the shadow page without ensuring KVM won't install a huge page, i.e. if the non-SMM slot is being dirty logged, but the SMM slot is not. ------------[ cut here ]------------ WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015 kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm] CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm] RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450 R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98 R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0 Call Trace: <TASK> __pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm] kvm_vm_worker_thread+0x106/0x1c0 [kvm] kthread+0xd9/0x100 ret_from_fork+0x2c/0x50 </TASK> ---[ end trace 0000000000000000 ]--- This bug was exposed by commit `edbdb43fc9` ("KVM: x86: Preserve TDP MMU roots until they are explicitly invalidated"), which allowed KVM to retain SMM TDP MMU roots effectively indefinitely. Before commit `edbdb43fc9`, KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU shadow pages once all vCPUs exited SMM, which made the window where this bug (recovering an SMM NX huge page) could be encountered quite tiny. To hit the bug, the NX recovery thread would have to run while at least one vCPU was in SMM. Most VMs typically only use SMM during boot, and so the problematic shadow pages were gone by the time the NX recovery thread ran. Now that KVM preserves TDP MMU roots until they are explicitly invalidated (e.g. by a memslot deletion), the window to trigger the bug is effectively never closed because most VMMs don't delete memslots after boot (except for a handful of special scenarios). Fixes: `eb29860570` ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages") Reported-by: Fabio Coatti <fabio.coatti@gmail.com> Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230602010137.784664-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-02 16:34:10 -07:00
Sean Christopherson	c3a1e119a3	KVM: VMX: Inject #GP, not #UD, if SGX2 ENCLS leafs are unsupported Per Intel's SDM, unsupported ENCLS leafs result in a #GP, not a #UD. SGX1 is a special snowflake as the SGX1 flag is used by the CPU as a "soft" disable, e.g. if software disables machine check reporting, i.e. having SGX but not SGX1 is effectively "SGX completely unsupported" and and thus #UDs. Fixes: `9798adbc04` ("KVM: VMX: Frame in ENCLS handler for SGX virtualization") Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230405234556.696927-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-02 10:12:03 -07:00
Sean Christopherson	5e50082c8c	KVM: VMX: Inject #GP on ENCLS if vCPU has paging disabled (CR0.PG==0) Inject a #GP when emulating/forwarding a valid ENCLS leaf if the vCPU has paging disabled, e.g. if KVM is intercepting ECREATE to enforce additional restrictions. The pseudocode in the SDM lists all #GP triggers, including CR0.PG=0, as being checked after the ENLCS-exiting checks, i.e. the VM-Exit will occur before the CPU performs the CR0.PG check. Fixes: `70210c044b` ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions") Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230405234556.696927-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-02 10:10:54 -07:00
Nadav Amit	5516c89d58	x86/lib: Make get/put_user() exception handling a visible symbol The .L-prefixed exception handling symbols of get_user() and put_user() do get discarded from the symbol table of the final kernel image. This confuses tools which parse that symbol table and try to map the chunk of code to a symbol. And, in general, from toolchain perspective, it is a good practice to have all code belong to a symbol, and the correct one at that. ( Currently, objdump displays that exception handling chunk as part of the previous symbol which is a "fallback" of sorts and not correct. ) While at it, rename them to something more descriptive. [ bp: Rewrite commit message, rename symbols. ] Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230525184244.2311-1-namit@vmware.com	2023-06-02 10:51:46 +02:00
Jon Kohler	331f229768	KVM: VMX: restore vmx_vmexit alignment Commit `8bd200d23e` ("KVM: VMX: Flatten __vmx_vcpu_run()") changed vmx_vmexit from SYM_FUNC_START to SYM_INNER_LABEL, accidentally removing 16 byte alignment as SYM_FUNC_START uses SYM_A_ALIGN and SYM_INNER_LABEL does not. Josh mentioned [1] this was unintentional. Fix by changing to SYM_INNER_LABEL_ALIGN instead. [1] https://lore.kernel.org/lkml/Y3adkSe%2FJ70PqUyt@p183 Fixes: `8bd200d23e` ("KVM: VMX: Flatten __vmx_vcpu_run()") Signed-off-by: Jon Kohler <jon@nutanix.com> Suggested-by: Alexey Dobriyan <adobriyan@gmail.com> CC: Josh Poimboeuf <jpoimboe@kernel.org> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20230531155821.80590-1-jon@nutanix.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 14:20:04 -07:00
Mike Christie	f9010dbdce	fork, vhost: Use CLONE_THREAD to fix freezer/ps regression When switching from kthreads to vhost_tasks two bugs were added: 1. The vhost worker tasks's now show up as processes so scripts doing ps or ps a would not incorrectly detect the vhost task as another process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's didn't disable or add support for them. To fix both bugs, this switches the vhost task to be thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/STOP support is required because CLONE_THREAD requires CLONE_SIGHAND which requires those 2 signals to be supported. This is a modified version of the patch written by Mike Christie <michael.christie@oracle.com> which was a modified version of patch originally written by Linus. Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER. Including ignoring signals, setting up the register state, and having get_signal return instead of calling do_group_exit. Tidied up the vhost_task abstraction so that the definition of vhost_task only needs to be visible inside of vhost_task.c. Making it easier to review the code and tell what needs to be done where. As part of this the main loop has been moved from vhost_worker into vhost_task_fn. vhost_worker now returns true if work was done. The main loop has been updated to call get_signal which handles SIGSTOP, freezing, and collects the message that tells the thread to exit as part of process exit. This collection clears __fatal_signal_pending. This collection is not guaranteed to clear signal_pending() so clear that explicitly so the schedule() sleeps. For now the vhost thread continues to exist and run work until the last file descriptor is closed and the release function is called as part of freeing struct file. To avoid hangs in the coredump rendezvous and when killing threads in a multi-threaded exec. The coredump code and de_thread have been modified to ignore vhost threads. Remvoing the special case for exec appears to require teaching vhost_dev_flush how to directly complete transactions in case the vhost thread is no longer running. Removing the special case for coredump rendezvous requires either the above fix needed for exec or moving the coredump rendezvous into get_signal. Fixes: `6e890c5d50` ("vhost: use vhost_tasks for worker threads") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Co-developed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-06-01 17:15:33 -04:00
Sean Christopherson	ab322c43cc	KVM: x86: Update number of entries for KVM_GET_CPUID2 on success, not failure Update cpuid->nent if and only if kvm_vcpu_ioctl_get_cpuid2() succeeds. The sole caller copies @cpuid to userspace only on success, i.e. the existing code effectively does nothing. Arguably, KVM should report the number of entries when returning -E2BIG so that userspace doesn't have to guess the size, but all other similar KVM ioctls() don't report the size either, i.e. userspace is conditioned to guess. Suggested-by: Takahiro Itazuri <itazur@amazon.com> Link: https://lore.kernel.org/all/20230410141820.57328-1-itazur@amazon.com Link: https://lore.kernel.org/r/20230526210340.2799158-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 14:07:14 -07:00
Jinrong Liang	33ab767c26	KVM: x86/pmu: Remove redundant check for MSR_IA32_DS_AREA set handler After commit `2de154f541` ("KVM: x86/pmu: Provide "error" semantics for unsupported-but-known PMU MSRs"), the guest_cpuid_has(DS) check is not necessary any more since if the guest supports X86_FEATURE_DS, it never returns 1. And if the guest does not support this feature, the set_msr handler will get false from kvm_pmu_is_valid_msr() before reaching this point. Therefore, the check will not be true in all cases and can be safely removed, which also simplifies the code and improves its readability. Signed-off-by: Jinrong Liang <cloudliang@tencent.com> Link: https://lore.kernel.org/r/20230411130338.8592-1-cloudliang@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 14:04:24 -07:00
Jinliang Zheng	0d42522bde	KVM: x86: Fix poll command According to the hardware manual, when the Poll command is issued, the byte returned by the I/O read is 1 in Bit 7 when there is an interrupt, and the highest priority binary code in Bits 2:0. The current pic simulation code is not implemented strictly according to the above expression. Fix the implementation of pic_poll_read(), set Bit 7 when there is an interrupt. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Link: https://lore.kernel.org/r/20230419021924.1342184-1-alexjlzheng@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 13:44:13 -07:00
Sean Christopherson	dee321977a	KVM: x86: Move common handling of PAT MSR writes to kvm_set_msr_common() Move the common check-and-set handling of PAT MSR writes out of vendor code and into kvm_set_msr_common(). This aligns writes with reads, which are already handled in common code, i.e. makes the handling of reads and writes symmetrical in common code. Alternatively, the common handling in kvm_get_msr_common() could be moved to vendor code, but duplicating code is generally undesirable (even though the duplicatated code is trivial in this case), and guest writes to PAT should be rare, i.e. the overhead of the extra function call is a non-issue in practice. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 13:41:06 -07:00
Sean Christopherson	3a5f49078e	KVM: x86: Make kvm_mtrr_valid() static now that there are no external users Make kvm_mtrr_valid() local to mtrr.c now that it's not used to check the validity of a PAT MSR value. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 13:41:06 -07:00
Sean Christopherson	bc7fe2f0b7	KVM: x86: Move PAT MSR handling out of mtrr.c Drop handling of MSR_IA32_CR_PAT from mtrr.c now that SVM and VMX handle writes without bouncing through kvm_set_msr_common(). PAT isn't truly an MTRR even though it affects memory types, and more importantly KVM enables hardware virtualization of guest PAT (by NOT setting "ignore guest PAT") when a guest has non-coherent DMA, i.e. KVM doesn't need to zap SPTEs when the guest PAT changes. The read path is and always has been trivial, i.e. burying it in the MTRR code does more harm than good. WARN and continue for the PAT case in kvm_set_msr_common(), as that code is _currently_ reached if and only if KVM is buggy. Defer cleaning up the lack of symmetry between the read and write paths to a future patch. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 13:41:06 -07:00
Sean Christopherson	34a83deac3	KVM: x86: Use MTRR macros to define possible MTRR MSR ranges Use the MTRR macros to identify the ranges of possible MTRR MSRs instead of bounding the ranges with a mismash of open coded values and unrelated MSR indices. Carving out the gap for the machine check MSRs in particular is confusing, as it's easy to incorrectly think the case statement handles MCE MSRs instead of skipping them. Drop the range-based funneling of MSRs between the end of the MCE MSRs and MTRR_DEF_TYPE, i.e. 0x2A0-0x2FF, and instead handle MTTR_DEF_TYPE as the one-off case that it is. Extract PAT (0x277) as well in anticipation of dropping PAT "handling" from the MTRR code. Keep the range-based handling for the variable+fixed MTRRs even though capturing unknown MSRs 0x214-0x24F is arguably "wrong". There is a gap in the fixed MTRRs, 0x260-0x267, i.e. the MTRR code needs to filter out unknown MSRs anyways, and using a single range generates marginally better code for the big switch statement. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 13:41:06 -07:00
Sean Christopherson	9ae38b4fb1	KVM: x86: Add helper to get variable MTRR range from MSR index Add a helper to dedup the logic for retrieving a variable MTRR range structure given a variable MTRR MSR index. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 13:41:06 -07:00
Sean Christopherson	ebda79e505	KVM: x86: Add helper to query if variable MTRR MSR is base (versus mask) Add a helper to query whether a variable MTRR MSR is a base versus as mask MSR. Replace the unnecessarily complex math with a simple check on bit 0; base MSRs are even, mask MSRs are odd. Link: https://lore.kernel.org/r/20230511233351.635053-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 13:41:06 -07:00
Ke Guo	7aeae02761	KVM: SVM: Use kvm_pat_valid() directly instead of kvm_mtrr_valid() Use kvm_pat_valid() directly instead of bouncing through kvm_mtrr_valid(). The PAT is not an MTRR, and kvm_mtrr_valid() just redirects to kvm_pat_valid(), i.e. is exempt from KVM's "zap SPTEs" logic that's needed to honor guest MTRRs when the VM has a passthrough device with non-coherent DMA (KVM does NOT set "ignore guest PAT" in this case, and so enables hardware virtualization of the guest's PAT, i.e. doesn't need to manually emulate the PAT memtype). Signed-off-by: Ke Guo <guoke@uniontech.com> [sean: massage changelog] Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 13:41:05 -07:00
Wenyao Hai	a33ba1bf0d	KVM: VMX: Open code writing vCPU's PAT in VMX's MSR handler Open code setting "vcpu->arch.pat" in vmx_set_msr() instead of bouncing through kvm_set_msr_common() to get to the same code in kvm_mtrr_set_msr(). This aligns VMX with SVM, avoids hiding a very simple operation behind a relatively complicated function call (finding the PAT MSR case in kvm_set_msr_common() is non-trivial), and most importantly, makes it clear that not unwinding the VMCS updates if kvm_set_msr_common() isn't a bug (because kvm_set_msr_common() can never fail for PAT). Opportunistically set vcpu->arch.pat before updating the VMCS info so that a future patch can move the common bits (back) into kvm_set_msr_common() without a functional change. Note, MSR_IA32_CR_PAT is 0x277, and is very subtly handled by case 0x200 ... MSR_IA32_MC0_CTL2 - 1: in kvm_set_msr_common(). Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Wenyao Hai <haiwenyao@uniontech.com> [sean: massage changelog, hoist setting vcpu->arch.pat up] Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230511233351.635053-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 13:41:05 -07:00
Mingwei Zhang	0d3518d2f8	KVM: SVM: Remove TSS reloading code after VMEXIT Remove the dedicated post-VMEXIT TSS reloading code now that KVM uses VMLOAD to load host segment state, which includes TSS state. Fixes: `e79b91bb3c` ("KVM: SVM: use vmsave/vmload for saving/restoring additional host state") Reported-by: Venkatesh Srinivas <venkateshs@google.com> Suggested-by: Jim Mattson <jmattson@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230523165635.4002711-1-mizhang@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-06-01 13:38:16 -07:00
Kees Cook	91218d7d70	x86/purgatory: Do not use fortified string functions With the addition of -fstrict-flex-arrays=3, struct sha256_state's trailing array is no longer ignored by CONFIG_FORTIFY_SOURCE: struct sha256_state { u32 state[SHA256_DIGEST_SIZE / 4]; u64 count; u8 buf[SHA256_BLOCK_SIZE]; }; This means that the memcpy() calls with "buf" as a destination in sha256.c's code will attempt to perform run-time bounds checking, which could lead to calling missing functions, specifically a potential WARN_ONCE, which isn't callable from purgatory. Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Closes: https://lore.kernel.org/lkml/175578ec-9dec-7a9c-8d3a-43f24ff86b92@leemhuis.info/ Bisected-by: "Joan Bruguera Micó" <joanbrugueram@gmail.com> Fixes: `df8fc4e934` ("kbuild: Enable -fstrict-flex-arrays=3") Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Alyssa Ross <hi@alyssa.is> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Thorsten Leemhuis <linux@leemhuis.info> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230531003345.never.325-kees@kernel.org	2023-06-01 11:24:51 -07:00
Borislav Petkov (AMD)	7c1dee734f	x86/mtrr: Unify debugging printing Put all the debugging output behind "mtrr=debug" and get rid of "mtrr_cleanup_debug" which wasn't even documented anywhere. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230531174857.GDZHeIib57h5lT5Vh1@fat_crate.local	2023-06-01 15:04:33 +02:00
Juergen Gross	08611a3a9a	x86/mtrr: Remove unused code mtrr_centaur_report_mcr() isn't used by anyone, so it can be removed. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-17-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:33 +02:00
Juergen Gross	12f0dd8df1	x86/mm: Only check uniform after calling mtrr_type_lookup() Today pud_set_huge() and pmd_set_huge() test for the MTRR type to be WB or INVALID after calling mtrr_type_lookup(). Those tests can be dropped as the only reason not to use a large mapping would be uniform being 0. Any MTRR type can be accepted as long as it applies to the whole memory range covered by the mapping, as the alternative would only be to map the same region with smaller pages instead, using the same PAT type as for the large mapping. [ bp: Massage commit message. ] Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-16-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:33 +02:00
Juergen Gross	973df19420	x86/mtrr: Don't let mtrr_type_lookup() return MTRR_TYPE_INVALID mtrr_type_lookup() should always return a valid memory type. In case there is no information available, it should return the default UC. This will remove the last case where mtrr_type_lookup() can return MTRR_TYPE_INVALID, so adjust the comment in include/uapi/asm/mtrr.h. Note that removing the MTRR_TYPE_INVALID #define from that header could break user code, so it has to stay. At the same time the mtrr_type_lookup() stub for the !CONFIG_MTRR case should set uniform to 1, as if the memory range would be covered by no MTRR at all. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-15-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:33 +02:00
Juergen Gross	8227f40ade	x86/mtrr: Use new cache_map in mtrr_type_lookup() Instead of crawling through the MTRR register state, use the new cache_map for looking up the cache type(s) of a memory region. This allows now to set the uniform parameter according to the uniformity of the cache mode of the region, instead of setting it only if the complete region is mapped by a single MTRR. This now includes even the region covered by the fixed MTRR registers. Make sure uniform is always set. [ bp: Massage. ] [ jgross: Explain mtrr_type_lookup() logic. ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-14-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:33 +02:00
Juergen Gross	a431660353	x86/mtrr: Add mtrr=debug command line option Add a new command line option "mtrr=debug" for getting debug output after building the new cache mode map. The output will include MTRR register values and the resulting map. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-13-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:33 +02:00
Juergen Gross	061b984aab	x86/mtrr: Construct a memory map with cache modes After MTRR initialization construct a memory map with cache modes from MTRR values. This will speed up lookups via mtrr_lookup_type() especially in case of overlapping MTRRs. This will be needed when switching the semantics of the "uniform" parameter of mtrr_lookup_type() from "only covered by one MTRR" to "memory range has a uniform cache mode", which is the data the callers really want to know. Today this information is not easily available, in case MTRRs are not well sorted regarding base address. The map will be built in __initdata. When memory management is up, the map will be moved to dynamically allocated memory, in order to avoid the need of an overly large array. The size of this array is calculated using the number of variable MTRR registers and the needed size for fixed entries. Only add the map creation and expansion for now. The lookup will be added later. When writing new MTRR entries in the running system rebuild the map inside the call from mtrr_rendezvous_handler() in order to avoid nasty race conditions with concurrent lookups. [ bp: Move out rebuild_map() call and rename it. ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-12-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:33 +02:00
Juergen Gross	1ca1209904	x86/mtrr: Add get_effective_type() service function Add a service function for obtaining the effective cache mode of overlapping MTRR registers. Make use of that function in check_type_overlap(). Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-11-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:33 +02:00
Juergen Gross	961c6a4326	x86/mtrr: Allocate mtrr_value array dynamically The mtrr_value[] array is a static variable which is used only in a few configurations. Consuming 6kB is ridiculous for this case, especially as the array doesn't need to be that large and it can easily be allocated dynamically. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-10-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:33 +02:00
Juergen Gross	b5d3c72829	x86/mtrr: Move 32-bit code from mtrr.c to legacy.c There is some code in mtrr.c which is relevant for old 32-bit CPUs only. Move it to a new source legacy.c. While modifying mtrr_init_finalize() fix spelling of its name. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-9-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:33 +02:00
Juergen Gross	34cf2d1955	x86/mtrr: Have only one set_mtrr() variant Today there are two variants of set_mtrr(): one calling stop_machine() and one calling stop_machine_cpuslocked(). The first one (set_mtrr()) has only one caller, and this caller is running only when resuming from suspend when the interrupts are still off and only one CPU is active. Additionally this code is used only on rather old 32-bit CPUs not supporting SMP. For these reasons the first variant can be replaced by a simple call of mtrr_if->set(). Rename the second variant set_mtrr_cpuslocked() to set_mtrr() now that there is only one variant left, in order to have a shorter function name. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-8-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:32 +02:00
Juergen Gross	0340906952	x86/mtrr: Replace vendor tests in MTRR code Modern CPUs all share the same MTRR interface implemented via generic_mtrr_ops. At several places in MTRR code this generic interface is deduced via is_cpu(INTEL) tests, which is only working due to X86_VENDOR_INTEL being 0 (the is_cpu() macro is testing mtrr_if->vendor, which isn't explicitly set in generic_mtrr_ops). Test the generic CPU feature X86_FEATURE_MTRR instead. The only other place where the .vendor member of struct mtrr_ops is being used is in set_num_var_ranges(), where depending on the vendor the number of MTRR registers is determined. This can easily be changed by replacing .vendor with the static number of MTRR registers. It should be noted that the test "is_cpu(HYGON)" wasn't ever returning true, as there is no struct mtrr_ops with that vendor information. [ bp: Use mtrr_enabled() before doing mtrr_if-> accesses, esp. in mtrr_trim_uncached_memory() which gets called independently from whether mtrr_if is set or not. ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230502120931.20719-7-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:32 +02:00
Juergen Gross	a153f254e5	x86/xen: Set MTRR state when running as Xen PV initial domain When running as Xen PV initial domain (aka dom0), MTRRs are disabled by the hypervisor, but the system should nevertheless use correct cache memory types. This has always kind of worked, as disabled MTRRs resulted in disabled PAT, too, so that the kernel avoided code paths resulting in inconsistencies. This bypassed all of the sanity checks the kernel is doing with enabled MTRRs in order to avoid memory mappings with conflicting memory types. This has been changed recently, leading to PAT being accepted to be enabled, while MTRRs stayed disabled. The result is that mtrr_type_lookup() no longer is accepting all memory type requests, but started to return WB even if UC- was requested. This led to driver failures during initialization of some devices. In reality MTRRs are still in effect, but they are under complete control of the Xen hypervisor. It is possible, however, to retrieve the MTRR settings from the hypervisor. In order to fix those problems, overwrite the MTRR state via mtrr_overwrite_state() with the MTRR data from the hypervisor, if the system is running as a Xen dom0. Fixes: `72cbc8f04f` ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-6-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:32 +02:00
Juergen Gross	c957f1f3c4	x86/hyperv: Set MTRR state when running as SEV-SNP Hyper-V guest In order to avoid mappings using the UC- cache attribute, set the MTRR state to use WB caching as the default. This is needed in order to cope with the fact that PAT is enabled, while MTRRs are not supported by the hypervisor. Fixes: `90b926e68f` ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/20230502120931.20719-5-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:32 +02:00
Juergen Gross	29055dc742	x86/mtrr: Support setting MTRR state for software defined MTRRs When running virtualized, MTRR access can be reduced (e.g. in Xen PV guests or when running as a SEV-SNP guest under Hyper-V). Typically, the hypervisor will not advertize the MTRR feature in CPUID data, resulting in no MTRR memory type information being available for the kernel. This has turned out to result in problems (Link tags below): - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't - Xen PV dom0 mapping memory as WB which should be UC- instead Solve those problems by allowing an MTRR static state override, overwriting the empty state used today. In case such a state has been set, don't call get_mtrr_state() in mtrr_bp_init(). The set state will only be used by mtrr_type_lookup(), as in all other cases mtrr_enabled() is being checked, which will return false. Accept the overwrite call only for selected cases when running as a guest. Disable X86_FEATURE_MTRR in order to avoid any MTRR modifications by just refusing them. [ bp: Massage. ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/all/4fe9541e-4d4c-2b2a-f8c8-2d34a7284930@nerdbynature.de/ Link: https://lore.kernel.org/lkml/BYAPR21MB16883ABC186566BD4D2A1451D7FE9@BYAPR21MB1688.namprd21.prod.outlook.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:32 +02:00
Juergen Gross	d053b481a5	x86/mtrr: Replace size_or_mask and size_and_mask with a much easier concept Replace size_or_mask and size_and_mask with the much easier concept of high reserved bits. While at it, instead of using constants in the MTRR code, use some new [ bp: - Drop mtrr_set_mask() - Unbreak long lines - Move struct mtrr_state_type out of the uapi header as it doesn't belong there. It also fixes a HDRTEST breakage "unknown type name ‘bool’" as Reported-by: kernel test robot <lkp@intel.com> - Massage. ] Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230502120931.20719-3-jgross@suse.com Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>	2023-06-01 15:04:23 +02:00
Steve Wahl	73b3108dfd	x86/platform/uv: Update UV[23] platform code for SNC Previous Sub-NUMA Clustering changes need not just a count of blades present, but a count that includes any missing ids for blades not present; in other words, the range from lowest to highest blade id. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230519190752.3297140-9-steve.wahl%40hpe.com	2023-05-31 09:35:00 -07:00
Steve Wahl	89827568a8	x86/platform/uv: Remove remaining BUG_ON() and BUG() calls Replace BUG and BUG_ON with WARN_ON_ONCE and carry on as best as we can. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230519190752.3297140-8-steve.wahl%40hpe.com	2023-05-31 09:35:00 -07:00
Steve Wahl	8a50c58519	x86/platform/uv: UV support for sub-NUMA clustering Sub-NUMA clustering (SNC) invalidates previous assumptions of a 1:1 relationship between blades, sockets, and nodes. Fix these assumptions and build tables correctly when SNC is enabled. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230519190752.3297140-7-steve.wahl%40hpe.com	2023-05-31 09:34:59 -07:00
Steve Wahl	45e9f9a995	x86/platform/uv: Helper functions for allocating and freeing conversion tables Add alloc_conv_table() and FREE_1_TO_1_TABLE() to reduce duplicated code among the conversion tables we use. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230519190752.3297140-6-steve.wahl%40hpe.com	2023-05-31 09:34:59 -07:00
Steve Wahl	35bd896ccc	x86/platform/uv: When searching for minimums, start at INT_MAX not 99999 Using a starting value of INT_MAX rather than 999999 or 99999 means this algorithm won't fail should the numbers being compared ever exceed this value. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230519190752.3297140-5-steve.wahl%40hpe.com	2023-05-31 09:34:59 -07:00
Steve Wahl	e4860f0377	x86/platform/uv: Fix printed information in calc_mmioh_map Fix incorrect mask names and values in calc_mmioh_map() that caused it to print wrong NASID information. And an unused blade position is not an error condition, but will yield an invalid NASID value, so change the invalid NASID message from an error to a debug message. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230519190752.3297140-4-steve.wahl%40hpe.com	2023-05-31 09:34:59 -07:00
Steve Wahl	8c646cee0a	x86/platform/uv: Introduce helper function uv_pnode_to_socket. Add and use uv_pnode_to_socket() function, which parallels other helper functions in here, and will enable avoiding duplicate code in an upcoming patch. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230519190752.3297140-3-steve.wahl%40hpe.com	2023-05-31 09:34:59 -07:00
Steve Wahl	fd27bea340	x86/platform/uv: Add platform resolving #defines for misc GAM_MMIOH_REDIRECT* Upcoming changes will require use of new #defines UVH_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK and UVH_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK, which provide the appropriate values on different uv platforms. Also, fix typo that defined a couple of "_CONFIG0_" values twice where "_CONFIG1_" was intended. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230519190752.3297140-2-steve.wahl%40hpe.com	2023-05-31 09:34:59 -07:00
Thomas Gleixner	ff3cfcb0d4	x86/smpboot: Fix the parallel bringup decision The decision to allow parallel bringup of secondary CPUs checks CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use parallel bootup because accessing the local APIC is intercepted and raises a #VC or #VE, which cannot be handled at that point. The check works correctly, but only for AMD encrypted guests. TDX does not set that flag. As there is no real connection between CC attributes and the inability to support parallel bringup, replace this with a generic control flag in x86_cpuinit and let SEV-ES and TDX init code disable it. Fixes: `0c7ffa32db` ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it") Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/87ilc9gd2d.ffs@tglx	2023-05-31 16:49:34 +02:00
Peter Zijlstra	3496d1c64a	x86/nospec: Shorten RESET_CALL_DEPTH RESET_CALL_DEPTH is a pretty fat monster and blows up UNTRAIN_RET to 20 bytes: 19: 48 c7 c0 80 00 00 00 mov $0x80,%rax 20: 48 c1 e0 38 shl $0x38,%rax 24: 65 48 89 04 25 00 00 00 00 mov %rax,%gs:0x0 29: R_X86_64_32S pcpu_hot+0x10 Shrink it by 4 bytes: 0: 31 c0 xor %eax,%eax 2: 48 0f ba e8 3f bts $0x3f,%rax 7: 65 48 89 04 25 00 00 00 00 mov %rax,%gs:0x0 Shrink RESET_CALL_DEPTH_FROM_CALL by 5 bytes by only setting %al, the other bits are shifted out (the same could be done for RESET_CALL_DEPTH, but the XOR+BTS sequence has less dependencies due to the zeroing). Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230515093020.729622326@infradead.org	2023-05-31 13:40:57 +02:00
Peter Zijlstra	df25edbac3	x86/alternatives: Add longer 64-bit NOPs By adding support for longer NOPs there are a few more alternatives that can turn into a single instruction. Add up to NOP11, the same limit where GNU as .nops also stops generating longer nops. This is because a number of uarchs have severe decode penalties for more than 3 prefixes. [ bp: Sync up with the version in tools/ while at it. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230515093020.661756940@infradead.org	2023-05-31 10:21:21 +02:00
Shawn Wang	2997d94b5d	x86/resctrl: Only show tasks' pid in current pid namespace When writing a task id to the "tasks" file in an rdtgroup, rdtgroup_tasks_write() treats the pid as a number in the current pid namespace. But when reading the "tasks" file, rdtgroup_tasks_show() shows the list of global pids from the init namespace, which is confusing and incorrect. To be more robust, let the "tasks" file only show pids in the current pid namespace. Fixes: `e02737d5b8` ("x86/intel_rdt: Add tasks files") Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Acked-by: Fenghua Yu <fenghua.yu@intel.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/all/20230116071246.97717-1-shawnwang@linux.alibaba.com/	2023-05-30 20:57:39 +02:00
Thomas Gleixner	33e20b07be	x86/realmode: Make stack lock work in trampoline_compat() The stack locking and stack assignment macro LOAD_REALMODE_ESP fails to work when invoked from the 64bit trampoline entry point: trampoline_start64 trampoline_compat LOAD_REALMODE_ESP <- lock Accessing tr_lock is only possible from 16bit mode. For the compat entry point this needs to be pa_tr_lock so that the required relocation entry is generated. Otherwise it locks the non-relocated address which is aside of being wrong never cleared in secondary_startup_64() causing all but the first CPU to get stuck on the lock. Make the macro take an argument lock_pa which defaults to 0 and rename it to LOCK_AND_LOAD_REALMODE_ESP to make it clear what this is about. Fixes: `f6f1ae9128` ("x86/smpboot: Implement a bit spinlock to protect the realmode stack") Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/87h6rujdvl.ffs@tglx	2023-05-30 14:11:47 +02:00
Thomas Gleixner	5da80b28bf	x86/smp: Initialize cpu_primary_thread_mask late Marking primary threads in the cpumask during early boot is only correct in certain configurations, but broken e.g. for the legacy hyperthreading detection. This is due to the complete mess in the CPUID evaluation code which initializes smp_num_siblings only half during early init and fixes it up later when identify_boot_cpu() is invoked. So using smp_num_siblings before identify_boot_cpu() leads to incorrect results. Fixing the early CPU init code to provide the proper data is a larger scale surgery as the code has dependencies on data structures which are not initialized during early boot. Move the initialization of cpu_primary_thread_mask wich depends on smp_num_siblings being correct to an early initcall so that it is set up correctly before SMP bringup. Fixes: `f54d4434c2` ("x86/apic: Provide cpu_primary_thread mask") Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/87sfbhlwp9.ffs@tglx	2023-05-29 21:31:23 +02:00
Nathan Chancellor	2fe1e67e69	x86/csum: Fix clang -Wuninitialized in csum_partial() Clang warns: arch/x86/lib/csum-partial_64.c:74:20: error: variable 'result' is uninitialized when used here [-Werror,-Wuninitialized] return csum_tail(result, temp64, odd); ^~~~~~ arch/x86/lib/csum-partial_64.c:48:22: note: initialize the variable 'result' to silence this warning unsigned odd, result; ^ = 0 1 error generated. The only initialization and uses of result in csum_partial() were moved into csum_tail() but result is still being passed by value to csum_tail() (clang's -Wuninitialized does not do interprocedural analysis to realize that result is always assigned in csum_tail() however). Sink the declaration of result into csum_tail() to clear up the warning. Closes: https://lore.kernel.org/202305262039.3HUYjWJk-lkp@intel.com/ Fixes: `688eb8191b` ("x86/csum: Improve performance of `csum_partial`") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230526-csum_partial-wuninitialized-v1-1-ebc0108dcec1%40kernel.org	2023-05-29 06:52:32 -07:00
Linus Torvalds	7a6c8e512f	This push fixes an alignment crash in x86/aria. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmRt4tIACgkQxycdCkmx i6fKpw/6A52EyoKBOjfVE+FKc40Fs/1ecmMUumN0kqnpPFKMHvJehnFsb6U0/gCS l6BI3CsFVogeICpBYAo1tjvrt3B4FO+zhBWME+il3aOzYMpdryXf3oLMMORqWlCl a+TA0LZH+NmySfYKy6rg1fYTQwKICyv25dp0aFC2n/Nj5SQxVtSVN2TFZ8uXNwcT ubjMcVUdmGXVls0KO7/ZbhRf9t/2VELzrS/WovhxUJxxXXW4Qhu1tI7X4UJPbnqc m14/dj6aZuaElst8B82BT3OSqZdT2rEbomfcnKOo7ePTp5tkJcpXmtM8eDO40AXn lzwZ4d9TIPiXalZWZ4ZL0htvbaC6QAvioSlcFmPPZJudSXYm4n31eimMYBmYlgLq uAm5wG4Tl84USumoekTNusZJXinDi0oNDXR7RV3gACy1TQJ17+5vgXKr+1/kNH2u clHoBIuon9oFAvxZ44iNyBDc4770D+gU+dNsuFCreBUT/8fvAtFWeptpGY08iptY c1hu5Tv3kK68JfEgxTgVWs+ObX1pLWO0bEPeQummd5O9jNXRLLPrSB7wOZrkqrpq nl4AchdGE9u9OmB5h0VvNj0tUjJqxr8h04a+Hkhu7NqkREYajGWPJkTZQjNzrTSP lu52zuvflim1wyWMozMbFPg1O0J78JqI8EQxs/nuFAjF77u2xgs= =JBjy -----END PGP SIGNATURE----- Merge tag 'v6.4-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "Fix an alignment crash in x86/aria" * tag 'v6.4-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: x86/aria - Use 16 byte alignment for GFNI constant vectors	2023-05-29 07:05:49 -04:00
Linus Torvalds	f8b2507c26	A single fix for x86: - Prevent a bogus setting for the number of HT siblings, which is caused by the CPUID evaluation trainwreck of X86. That recomputes the value for each CPU, so the last CPU "wins". That can cause completely bogus sibling values. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRzBxMTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYodfbEACp+oRBD2zIcmf8kpakMxIbH2CkDAxl AgqGHYVKAvsqKPQZhYmOFEsk+0isMzssuVaub9gMeLRmpMQxUqv7K20pluvWmuQm h7MTYEAvCZpBU2BGB1mixp57tQ/gpLSB4uaJU2Q5hh9po6uImvalIdqThf7ArxgA mm3BhtbFObLGwRaV2981zdsEs8Cjs9RusnAOrX5LbBBb6ibcnrqWy7DAhySMA0zr 7RfaGyG/T9z6/BgX54FBQ7aDNU/RkZ8YYQoMj0kAFXdM3V+sa3oPfMRQpACPHKm4 kwskJfWVdMq06kgOD7N9+WrOfBAgIsmzasT6VhTuGAGNo4CiEOthLqXDHwf4NLhN hj0XReHVyBiQ1KuI2lGNJ71c30KRkh7SCdGNiEFtnlpteXnS4bRnzWtfMf/+2SCC WOziRRYRzuAvvoTwoQS1BYJrDAB9f5jOX8FTYbC12rEk/RPOB6t4iFNLUzid7u1H yPYphoftXlQlli7EajVc5pSZCftMHRCzWernptiPFU5tMvyagYppcg25JY54rquC CTJQqoa/HPAEZhBjgZElwO9QBgu+VyJUq3R/Z+LbkQuz5z+PVDfqeDuDWhaqbcz1 WGtdGdNo4bqVhwqgwyHjgKCFe05mF6tPuotnYZE/VSkXUCvLkjGEVg8HQ7WiPhJd IYDqxkIhX5h1Mg== =l43p -----END PGP SIGNATURE----- Merge tag 'x86-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu fix from Thomas Gleixner: "A single fix for x86: - Prevent a bogus setting for the number of HT siblings, which is caused by the CPUID evaluation trainwreck of X86. That recomputes the value for each CPU, so the last CPU "wins". That can cause completely bogus sibling values" * tag 'x86-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/topology: Fix erroneous smp_num_siblings on Intel Hybrid platforms	2023-05-28 07:42:05 -04:00
Linus Torvalds	2d5438f4c6	A small set of perf fixes: - Make the CHA discovery based on MSR readout to work around broken discovery tables in some SPR firmwares. - Prevent saving PEBS configuration which has software bits set that cause a crash when restored into the relevant MSR. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRzBlETHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYofUsEAC6VMmscn6MDmCpO71+o80IhFUeN/Pa /Qu4kKCQmeT/gonMG+bKCYSNobgdi4a7KNDRZypTjzMORTDmS3lT4aTbzy3ry+eI 7rUTolAo3HiY0ZwPhU/evrGDf0O2d/19nd9pXiLZFinH9WsMf7mKsXhAJvlq6SKu SliGCQlB7+0bicdYgBkJqWPUbKY7QvcCcXiPx1Ujxg7CbMTcuj7MLzu/TVYmwZYH Je/k+vBnhl54PF9nGiqNYZGBPh4cBgkpHEFGFrvwplBRBdwHrjapNxhIRcA2N66u mr+f7Cl6StOP16Bn4dzG9nHLVTDz5NVgsbtPStVTcOoIzfOhch7mOL+tu/pafXdt zRL7U6usfbQqt1rcBBtPoeFlEifQwHoz3ZGFSyIZn5IRfXtXlDJHyPxlRYMB5/uz WjR10GzE0tZFLetvW4PdlMHcwwrOZNU+crvADmCk4KzzVDip9QLbomiGfxHVHI2J Fjstd6ZaNkuDHRy/r76YfwRDDzQWPvtsaWRLzD4aFQQxYdHHZOabBittK9UvENvu MjLRO+1ymN52Ap6TdcK9ZMNpb8ol/Xn5aYivNRlHsUjJ8RxJj94sElIlNHLI2yoa hICdMcGSCd7H+FIkXdgT2O8OBxEsf139xSrG2oSTOZFCjkR1BSxtyqNY47MFVSiA KBtbnJMkde48pg== =Z3GU -----END PGP SIGNATURE----- Merge tag 'perf-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A small set of perf fixes: - Make the MSR-readout based CHA discovery work around broken discovery tables in some SPR firmwares. - Prevent saving PEBS configuration which has software bits set that cause a crash when restored into the relevant MSR" * tag 'perf-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/uncore: Correct the number of CHAs on SPR perf/x86/intel: Save/restore cpuc->active_pebs_data_cfg when using guest PEBS	2023-05-28 07:37:23 -04:00
Linus Torvalds	abbf7fa15b	A set of unwinder and tooling fixes: - Ensure that the stack pointer on x86 is aligned again so that the unwinder does not read past the end of the stack - Discard .note.gnu.property section which has a pointlessly different alignment than the other note sections. That confuses tooling of all sorts including readelf, libbpf and pahole. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRzBW0THHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoczHD/4vCopMPBFsT5w5HhSTQSX5Kglx3UQL g+qa0Z/uJKLpMgGkBzWBQGXHH/UqyY4sk5T3DDOkVQjdSfD+zmJu4R44wAonEU+y SDW2MeBciyp1WtjwPNWNY9QgDUQj7Cr2dSLTVqBR9akladTcj7EKHApq0pgaqsbi MnmfXzPyH17PRfyW8FbsBjdQvhpkYZSZntQ0cU+W5VtsUtZvg2w4I8IU76ri9L0E OYBbk2zV8RabGVJBv79fnm4fXb78+Zkp/xvs5sTBRKRS+3T5FNNJjo8tNp1AtxdF eJMlOeY7PjSFyAL6+/+/uRqYNCnH4Rw2MlMqJJIhzKYCw7BFo0FAhi+mro63Z85b byGuriv1g7suOgOCK8IzTl7e1nTDP4YCZ0cMDDvcRceIiqR6Rrk3C/B5cBVz0Bng iFYFUJSlLT4G+tGZwupV2UUGsCwS5+vVCG/FVWes8Mz4g4zr2Dmrh6YMLzrcDshE eAY2yKMypD/JD9cJa4KOVJuFARaSDJDiBpsO1BjRC6MxLFaw1biB0mE2hkwn3LJe fLVvU/sWnIQ+dQoy7+GBzm7Pi5SmKBuLDsgOt/ocbCwfUwUQMGf+I7bNg+08UOOS dNzM6/agdd8rLZTS/Ma7EvnA367ZwVOb5COVMqwxAoKZbPkvgz1xvRGkFQCHYapx BwMILF9FIM0kvg== =TPwY -----END PGP SIGNATURE----- Merge tag 'objtool-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull unwinder fixes from Thomas Gleixner: "A set of unwinder and tooling fixes: - Ensure that the stack pointer on x86 is aligned again so that the unwinder does not read past the end of the stack - Discard .note.gnu.property section which has a pointlessly different alignment than the other note sections. That confuses tooling of all sorts including readelf, libbpf and pahole" * tag 'objtool-urgent-2023-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/show_trace_log_lvl: Ensure stack pointer is aligned, again vmlinux.lds.h: Discard .note.gnu.property section	2023-05-28 07:33:29 -04:00
Linus Torvalds	4e893b5aa4	xen: branch for v6.4-rc4 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCZHGVRAAKCRCAXGG7T9hj vqtJAQDizKasLE7tSnfs/FrZ/4xPaDLe3bQifMx2C1dtYCjRcAD/ciZSa1L0WzZP dpEZnlYRzsR3bwLktQEMQFOvlbh1SwE= =K860 -----END PGP SIGNATURE----- Merge tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - a double free fix in the Xen pvcalls backend driver - a fix for a regression causing the MSI related sysfs entries to not being created in Xen PV guests - a fix in the Xen blkfront driver for handling insane input data better * tag 'for-linus-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/pci/xen: populate MSI sysfs entries xen/pvcalls-back: fix double frees with pvcalls_new_active_socket() xen/blkfront: Only check REQ_FUA for writes	2023-05-27 09:42:56 -07:00
Sean Christopherson	023cfa6fc2	KVM: VMX: Use proper accessor to read guest CR4 in handle_desc() Use kvm_is_cr4_bit_set() to read guest CR4.UMIP when sanity checking that a descriptor table VM-Exit occurs if and only if guest.CR4.UMIP=1. UMIP can't be guest-owned, i.e. using kvm_read_cr4_bits() to decache guest- owned bits isn't strictly necessary, but eliminating raw reads of vcpu->arch.cr4 is desirable as it makes it easy to visually audit KVM for correctness. Opportunistically add a compile-time assertion that UMIP isn't guest-owned as letting the guest own UMIP isn't compatible with emulation (or any CR4 bit that is emulated by KVM). Opportunistically change the WARN_ON() to a ONCE variant. When the WARN fires, it fires _a lot_, and spamming the kernel logs ends up doing more harm than whatever led to KVM's unnecessary emulation. Reported-by: Robert Hoo <robert.hu@intel.com> Link: https://lore.kernel.org/all/20230310125718.1442088-4-robert.hu@intel.com Link: https://lore.kernel.org/r/20230413231914.1482782-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-05-26 13:50:42 -07:00
Sean Christopherson	3243b93c16	KVM: VMX: Treat UMIP as emulated if and only if the host doesn't have UMIP Advertise UMIP as emulated if and only if the host doesn't natively support UMIP, otherwise vmx_umip_emulated() is misleading when the host _does_ support UMIP. Of the four users of vmx_umip_emulated(), two already check for native support, and the logic in vmx_set_cpu_caps() is relevant if and only if UMIP isn't natively supported as UMIP is set in KVM's caps by kvm_set_cpu_caps() when UMIP is present in hardware. That leaves KVM's stuffing of X86_CR4_UMIP into the default cr4_fixed1 value enumerated for nested VMX. In that case, checking for (lack of) host support is actually a bug fix of sorts, as enumerating UMIP support based solely on descriptor table exiting works only because KVM doesn't sanity check MSR_IA32_VMX_CR4_FIXED1. E.g. if a (very theoretical) host supported UMIP in hardware but didn't allow UMIP+VMX, KVM would advertise UMIP but not actually emulate UMIP. Of course, KVM would explode long before it could run a nested VM on said theoretical CPU, as KVM doesn't modify host CR4 when enabling VMX, i.e. would load an "illegal" value into vmcs.HOST_CR4. Reported-by: Robert Hoo <robert.hu@intel.com> Link: https://lore.kernel.org/all/20230310125718.1442088-2-robert.hu@intel.com Link: https://lore.kernel.org/r/20230413231914.1482782-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-05-26 13:50:42 -07:00
Xiaoyao Li	82dc11b82b	KVM: VMX: Move the comment of CR4.MCE handling right above the code Move the comment about keeping the hosts CR4.MCE loaded in hardware above the code that actually modifies the hardware CR4 value. No functional change indented. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230410125017.1305238-3-xiaoyao.li@intel.com [sean: elaborate in changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-05-26 13:43:16 -07:00
Xiaoyao Li	334006b78c	KVM: VMX: Use kvm_read_cr4() to get cr4 value Directly use vcpu->arch.cr4 is not recommended since it gets stale value if the cr4 is not available. Use kvm_read_cr4() instead to ensure correct value. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20230410125017.1305238-2-xiaoyao.li@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-05-26 13:41:43 -07:00
Linus Torvalds	47ee3f1dd9	x86: re-introduce support for ERMS copies for user space accesses I tried to streamline our user memory copy code fairly aggressively in commit `adfcf4231b` ("x86: don't use REP_GOOD or ERMS for user memory copies"), in order to then be able to clean up the code and inline the modern FSRM case in commit `577e6a7fd5` ("x86: inline the 'rep movs' in user copies for the FSRM case"). We had reports [1] of that causing regressions earlier with blogbench, but that turned out to be a horrible benchmark for that case, and not a sufficient reason for re-instating "rep movsb" on older machines. However, now Eric Dumazet reported [2] a regression in performance that seems to be a rather more real benchmark, where due to the removal of "rep movs" a TCP stream over a 100Gbps network no longer reaches line speed. And it turns out that with the simplified the calling convention for the non-FSRM case in commit `427fda2c8a` ("x86: improve on the non-rep 'copy_user' function"), re-introducing the ERMS case is actually fairly simple. Of course, that "fairly simple" is glossing over several missteps due to having to fight our assembler alternative code. This code really wanted to rewrite a conditional branch to have two different targets, but that made objtool sufficiently unhappy that this instead just ended up doing a choice between "jump to the unrolled loop, or use 'rep movsb' directly". Let's see if somebody finds a case where the kernel memory copies also care (see commit `68674f94ff`: "x86: don't use REP_GOOD or ERMS for small memory copies"). But Eric does argue that the user copies are special because networking tries to copy up to 32KB at a time, if order-3 pages allocations are possible. In-kernel memory copies are typically small, unless they are the special "copy pages at a time" kind that still use "rep movs". Link: https://lore.kernel.org/lkml/202305041446.71d46724-yujie.liu@intel.com/ [1] Link: https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/ [2] Reported-and-tested-by: Eric Dumazet <edumazet@google.com> Fixes: `adfcf4231b` ("x86: don't use REP_GOOD or ERMS for user memory copies") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-05-26 12:34:20 -07:00
Like Xu	762b33eb90	KVM: x86/mmu: Assert on @mmu in the __kvm_mmu_invalidate_addr() Add assertion to track that "mmu == vcpu->arch.mmu" is always true in the context of __kvm_mmu_invalidate_addr(). for_each_shadow_entry_using_root() and kvm_sync_spte() operate on vcpu->arch.mmu, but the only reason that doesn't cause explosions is because handle_invept() frees roots instead of doing a manual invalidation. As of now, there are no major roadblocks to switching INVEPT emulation over to use kvm_mmu_invalidate_addr(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230523032947.60041-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-05-26 11:24:52 -07:00
Uros Bizjak	12ced09595	KVM: x86/mmu: Add comment on try_cmpxchg64 usage in tdp_mmu_set_spte_atomic Commit `aee98a6838` ("KVM: x86/mmu: Use try_cmpxchg64 in tdp_mmu_set_spte_atomic") removed the comment that iter->old_spte is updated when different logical CPU modifies the page table entry. Although this is what try_cmpxchg does implicitly, it won't hurt if this fact is explicitly mentioned in a restored comment. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20230425113932.3148-1-ubizjak@gmail.com [sean: extend comment above try_cmpxchg64()] Signed-off-by: Sean Christopherson <seanjc@google.com>	2023-05-26 11:24:52 -07:00
Dave Airlie	b8887e796e	drm-misc-next for v6.5: UAPI Changes: Cross-subsystem Changes: * fbdev: Move framebuffer I/O helpers to <asm/fb.h>, fix naming * firmware: Init sysfb as early as possible Core Changes: * DRM scheduler: Rename interfaces * ttm: Store ttm_device_funcs in .rodata * Replace strlcpy() with strscpy() in various places * Cleanups Driver Changes: * bridge: analogix: Fix endless probe loop; samsung-dsim: Support swapping clock/data polarity; tc358767: Use devm_ Cleanups; * gma500: Fix I/O-memory access * panel: boe-tv101wum-nl6: Improve initialization; sharp-ls043t1le001: Mode fixes; simple: Add BOE EV121WXM-N10-1850 plus DT bindings; AddS6D7AA0 plus DT bindings; Cleanups * ssd1307x: Style fixes * sun4i: Release clocks * msm: Fix I/O-memory access * nouveau: Cleanups * shmobile: Support Renesas; Enable framebuffer console; Various fixes * vkms: Fix RGB565 conversion -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEchf7rIzpz2NEoWjlaA3BHVMLeiMFAmRuBXEACgkQaA3BHVML eiPLkwgAqCa7IuSDQhFMWVOI0EJpPPEHtHM8SCT1Pp8aniXk23Ru+E16c5zck53O uf4tB+zoFrwD9npy60LIvX1OZmXS1KI4+ZO8itYFk6GSjxqbTWbjNFREBeWFdIpa OG54nEqjFQZzEXY+gJYDpu5zqLy3xLN07ZgQkcMyfW3O/Krj4LLzfQTDl+jP5wkO 7/v5Eu5CG5QjupMxIjb4e+ruUflp73pynur5bhZsfS1bPNGFTnxHlwg7NWnBXU7o Hg23UYfCuZZWPmuO26EeUDlN33rCoaycmVgtpdZft2eznca5Mg74Loz1Qc3GQfjw LLvKsAIlBcZvEIhElkzhtXitBoe7LQ== =/9zV -----END PGP SIGNATURE----- Merge tag 'drm-misc-next-2023-05-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for v6.5: UAPI Changes: Cross-subsystem Changes: * fbdev: Move framebuffer I/O helpers to <asm/fb.h>, fix naming * firmware: Init sysfb as early as possible Core Changes: * DRM scheduler: Rename interfaces * ttm: Store ttm_device_funcs in .rodata * Replace strlcpy() with strscpy() in various places * Cleanups Driver Changes: * bridge: analogix: Fix endless probe loop; samsung-dsim: Support swapping clock/data polarity; tc358767: Use devm_ Cleanups; * gma500: Fix I/O-memory access * panel: boe-tv101wum-nl6: Improve initialization; sharp-ls043t1le001: Mode fixes; simple: Add BOE EV121WXM-N10-1850 plus DT bindings; AddS6D7AA0 plus DT bindings; Cleanups * ssd1307x: Style fixes * sun4i: Release clocks * msm: Fix I/O-memory access * nouveau: Cleanups * shmobile: Support Renesas; Enable framebuffer console; Various fixes * vkms: Fix RGB565 conversion Signed-off-by: Dave Airlie <airlied@redhat.com> # -----BEGIN PGP SIGNATURE----- # # iQEzBAABCAAdFiEEchf7rIzpz2NEoWjlaA3BHVMLeiMFAmRuBXEACgkQaA3BHVML # eiPLkwgAqCa7IuSDQhFMWVOI0EJpPPEHtHM8SCT1Pp8aniXk23Ru+E16c5zck53O # uf4tB+zoFrwD9npy60LIvX1OZmXS1KI4+ZO8itYFk6GSjxqbTWbjNFREBeWFdIpa # OG54nEqjFQZzEXY+gJYDpu5zqLy3xLN07ZgQkcMyfW3O/Krj4LLzfQTDl+jP5wkO # 7/v5Eu5CG5QjupMxIjb4e+ruUflp73pynur5bhZsfS1bPNGFTnxHlwg7NWnBXU7o # Hg23UYfCuZZWPmuO26EeUDlN33rCoaycmVgtpdZft2eznca5Mg74Loz1Qc3GQfjw # LLvKsAIlBcZvEIhElkzhtXitBoe7LQ== # =/9zV # -----END PGP SIGNATURE----- # gpg: Signature made Wed 24 May 2023 22:39:13 AEST # gpg: using RSA key 7217FBAC8CE9CF6344A168E5680DC11D530B7A23 # gpg: Can't check signature: No public key # Conflicts: # MAINTAINERS From: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230524124237.GA25416@linux-uq9g	2023-05-26 14:23:29 +10:00
Noah Goldstein	688eb8191b	x86/csum: Improve performance of `csum_partial` 1) Add special case for len == 40 as that is the hottest value. The nets a ~8-9% latency improvement and a ~30% throughput improvement in the len == 40 case. 2) Use multiple accumulators in the 64-byte loop. This dramatically improves ILP and results in up to a 40% latency/throughput improvement (better for more iterations). Results from benchmarking on Icelake. Times measured with rdtsc() len lat_new lat_old r tput_new tput_old r 8 3.58 3.47 1.032 3.58 3.51 1.021 16 4.14 4.02 1.028 3.96 3.78 1.046 24 4.99 5.03 0.992 4.23 4.03 1.050 32 5.09 5.08 1.001 4.68 4.47 1.048 40 5.57 6.08 0.916 3.05 4.43 0.690 48 6.65 6.63 1.003 4.97 4.69 1.059 56 7.74 7.72 1.003 5.22 4.95 1.055 64 6.65 7.22 0.921 6.38 6.42 0.994 96 9.43 9.96 0.946 7.46 7.54 0.990 128 9.39 12.15 0.773 8.90 8.79 1.012 200 12.65 18.08 0.699 11.63 11.60 1.002 272 15.82 23.37 0.677 14.43 14.35 1.005 440 24.12 36.43 0.662 21.57 22.69 0.951 952 46.20 74.01 0.624 42.98 53.12 0.809 1024 47.12 78.24 0.602 46.36 58.83 0.788 1552 72.01 117.30 0.614 71.92 96.78 0.743 2048 93.07 153.25 0.607 93.28 137.20 0.680 2600 114.73 194.30 0.590 114.28 179.32 0.637 3608 156.34 268.41 0.582 154.97 254.02 0.610 4096 175.01 304.03 0.576 175.89 292.08 0.602 There is no such thing as a free lunch, however, and the special case for len == 40 does add overhead to the len != 40 cases. This seems to amount to be ~5% throughput and slightly less in terms of latency. Testing: Part of this change is a new kunit test. The tests check all alignment X length pairs in [0, 64) X [0, 512). There are three cases. 1) Precomputed random inputs/seed. The expected results where generated use the generic implementation (which is assumed to be non-buggy). 2) An input of all 1s. The goal of this test is to catch any case a carry is missing. 3) An input that never carries. The goal of this test si to catch any case of incorrectly carrying. More exhaustive tests that test all alignment X length pairs in [0, 8192) X [0, 8192] on random data are also available here: https://github.com/goldsteinn/csum-reproduction The reposity also has the code for reproducing the above benchmark numbers. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230511011002.935690-1-goldstein.w.n%40gmail.com	2023-05-25 10:55:18 -07:00
Zhang Rui	edc0a2b595	x86/topology: Fix erroneous smp_num_siblings on Intel Hybrid platforms Traditionally, all CPUs in a system have identical numbers of SMT siblings. That changes with hybrid processors where some logical CPUs have a sibling and others have none. Today, the CPU boot code sets the global variable smp_num_siblings when every CPU thread is brought up. The last thread to boot will overwrite it with the number of siblings of that thread. That last thread to boot will "win". If the thread is a Pcore, smp_num_siblings == 2. If it is an Ecore, smp_num_siblings == 1. smp_num_siblings describes if the system supports SMT. It should specify the maximum number of SMT threads among all cores. Ensure that smp_num_siblings represents the system-wide maximum number of siblings by always increasing its value. Never allow it to decrease. On MeteorLake-P platform, this fixes a problem that the Ecore CPUs are not updated in any cpu sibling map because the system is treated as an UP system when probing Ecore CPUs. Below shows part of the CPU topology information before and after the fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore). ... -/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff -/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11 +/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff +/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21 ... -/sys/devices/system/cpu/cpu12/topology/package_cpus:001000 -/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12 +/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff +/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21 Notice that the "before" 'package_cpus_list' has only one CPU. This means that userspace tools like lscpu will see a little laptop like an 11-socket system: -Core(s) per socket: 1 -Socket(s): 11 +Core(s) per socket: 16 +Socket(s): 1 This is also expected to make the scheduler do rather wonky things too. [ dhansen: remove CPUID detail from changelog, add end user effects ] CC: stable@kernel.org Fixes: `bbb65d2d36` ("x86: use cpuid vector 0xb when available for detecting cpu topology") Fixes: `95f3d39ccf` ("x86/cpu/topology: Provide detect_extended_topology_early()") Suggested-by: Len Brown <len.brown@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20230323015640.27906-1-rui.zhang%40intel.com	2023-05-25 10:48:42 -07:00
Kan Liang	38776cc45e	perf/x86/uncore: Correct the number of CHAs on SPR The number of CHAs from the discovery table on some SPR variants is incorrect, because of a firmware issue. An accurate number can be read from the MSR UNC_CBO_CONFIG. Fixes: `949b11381f` ("perf/x86/intel/uncore: Add Sapphire Rapids server CHA support") Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Stephane Eranian <eranian@google.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230508140206.283708-1-kan.liang@linux.intel.com	2023-05-24 22:19:41 +02:00
Maximilian Heyne	335b422346	x86/pci/xen: populate MSI sysfs entries Commit `bf5e758f02` ("genirq/msi: Simplify sysfs handling") reworked the creation of sysfs entries for MSI IRQs. The creation used to be in msi_domain_alloc_irqs_descs_locked after calling ops->domain_alloc_irqs. Then it moved into __msi_domain_alloc_irqs which is an implementation of domain_alloc_irqs. However, Xen comes with the only other implementation of domain_alloc_irqs and hence doesn't run the sysfs population code anymore. Commit `6c796996ee` ("x86/pci/xen: Fixup fallout from the PCI/MSI overhaul") set the flag MSI_FLAG_DEV_SYSFS for the xen msi_domain_info but that doesn't actually have an effect because Xen uses it's own domain_alloc_irqs implementation. Fix this by making use of the fallback functions for sysfs population. Fixes: `bf5e758f02` ("genirq/msi: Simplify sysfs handling") Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230503131656.15928-1-mheyne@amazon.de Signed-off-by: Juergen Gross <jgross@suse.com>	2023-05-24 18:08:49 +02:00
Ard Biesheuvel	6ab39f9992	crypto: x86/aria - Use 16 byte alignment for GFNI constant vectors The GFNI routines in the AVX version of the ARIA implementation now use explicit VMOVDQA instructions to load the constant input vectors, which means they must be 16 byte aligned. So ensure that this is the case, by dropping the section split and the incorrect .align 8 directive, and emitting the constants into the 16-byte aligned section instead. Note that the AVX2 version of this code deviates from this pattern, and does not require a similar fix, given that it loads these contants as 8-byte memory operands, for which AVX2 permits any alignment. Cc: Taehee Yoo <ap420073@gmail.com> Fixes: `8b84475318` ("crypto: x86/aria-avx - Do not use avx2 instructions") Reported-by: syzbot+a6abcf08bad8b18fd198@syzkaller.appspotmail.com Tested-by: syzbot+a6abcf08bad8b18fd198@syzkaller.appspotmail.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2023-05-24 18:10:27 +08:00
Nikolay Borisov	122333d6bd	x86/tdx: Wrap exit reason with hcall_func() TDX reuses VMEXIT "reasons" in its guest->host hypercall ABI. This is confusing because there might not be a VMEXIT involved at all. These instances are supposed to document situation and reduce confusion by wrapping VMEXIT reasons with hcall_func(). The decompression code does not follow this convention. Unify the TDX decompression code with the other TDX use of VMEXIT reasons. No functional changes. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20230505120332.1429957-1-nik.borisov%40suse.com	2023-05-23 07:01:45 -07:00
Like Xu	3c845304d2	perf/x86/intel: Save/restore cpuc->active_pebs_data_cfg when using guest PEBS After commit `b752ea0c28` ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG"), the cpuc->pebs_data_cfg may save some bits that are not supported by real hardware, such as PEBS_UPDATE_DS_SW. This would cause the VMX hardware MSR switching mechanism to save/restore invalid values for PEBS_DATA_CFG MSR, thus crashing the host when PEBS is used for guest. Fix it by using the active host value from cpuc->active_pebs_data_cfg. Fixes: `b752ea0c28` ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG") Signed-off-by: Like Xu <likexu@tencent.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20230517133808.67885-1-likexu@tencent.com	2023-05-23 10:01:13 +02:00
Linus Torvalds	ae8373a5ad	Work around a TLB invalidation issue in recent hybrid CPUs -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRr6MMACgkQaDWVMHDJ krAaIA/+IIV2rijvxzlnN41RksfiuI8lmDXuOeAqFCFVAukm6/faiDHmXQC/4ipw JTdgq4JbB6wTZmYsijSMdwgWDt3BGXZAAXK7sSOXLNRkaf0EuPk6h59P7QXBAkRD A6ibjWPipV6/k/Mczm8o+WnTSHUln6CVz8yB13gF3eB6+QMO7yYznL4EJ4iUBdZC OWCukkI8zT07p1u7nqpOmTixXZEqSauC3qB9tg856w8UJoS1evOocgExZr2d8GT3 Ex9Q+Hj8ULkM6tDeke9mfoDDEdHtHDkQIMH5L63siTc0LzvRIs/EZEnZGdZWNy9F zw6AaIShzJQ9Z9KTP/SRxQ3nvUh/UuI2z7KnZLLxHYqLZZVMNQtinK+ZDGlzvvqg xr3ZI2PXUXoQT8Mg0BzzpeLwuuSXS8DFyj63SErValRuyIQQs07MhIRFiU3abQtq A4nfXMzfCwfRn5ggm9HIPLKOtiw1ZlsQtDxmHULpXsUqGgBxEfmCF5f210TmkOhN yrNlDmNBr39j1xlUeFlJRXWm4badlo39G8s/70oqFTushvatfet7Gyt7fZseGhJL KNQCfxyCu8bYYmtYfuW6WTc8nq5sJmnTdWEjcT0ftbgDeYNtBjfKfhJ+8+q04FWl wBxu5nqfKMT1bAKIOLse9eVbaMXNuHJJkStQZAZP5/wUMoVvfHI= =TZgh -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Dave Hansen: "This works around and issue where the INVLPG instruction may miss invalidating kernel TLB entries in recent hybrid CPUs. I do expect an eventual microcode fix for this. When the microcode version numbers are known, we can circle back around and add them the model table to disable this workaround" * tag 'x86_urgent_for_6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Avoid incomplete Global INVLPG flushes	2023-05-22 16:31:28 -07:00
Andrew Cooper	6a4be69845	x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils "x86/smpboot: Support parallel startup of secondary CPUs" adds the first use of X2APIC_ENABLE in assembly, but older binutils don't tolerate the UL suffix. Switch to using BIT() instead. Fixes: `7e75178a09` ("x86/smpboot: Support parallel startup of secondary CPUs") Reported-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Link: https://lore.kernel.org/r/20230522105738.2378364-1-andrew.cooper3@citrix.com	2023-05-22 14:06:33 +02:00
Linus Torvalds	a35747c310	ARM: * Plug a race in the stage-2 mapping code where the IPA and the PA would end up being out of sync * Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...) * FP/SVE/SME documentation update, in the hope that this field becomes clearer... * Add workaround for Apple SEIS brokenness to a new SoC * Random comment fixes x86: * add MSR_IA32_TSX_CTRL into msrs_to_save * fixes for XCR0 handling in SGX enclaves Generic: * Fix vcpu_array[0] races * Fix race between starting a VM and "reboot -f" -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmRp0WIUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPqVwf+OFayNPpURAFqfrOuISYW7hoCL24+ sCtXyVv4Ei0np1vGekit2h/m8GmxO12xEBibcFeYj+YQItIqu9HvC08fRxAKaMeE N3p9iLuS1zcM3cEuZpg0r6QN+pKybttdadl70yho43CtagEM4FmB7dgyAo9AhyXk pZUaVfoO6beBQ/J6A6V/Q5xlue1LvHk1+K4rmNcYVTYn6ZOd+yYgvqng1nv5/h9b 0HgW0aUWkEHAB67/sSnUUro707loMNTowsZlMCtgDk4Fzf8RwQ7qc8lClLk1UPjJ DHB6Hif9F0Q5mkrwn+c7xyVlKARaY6/FOshS2Q620q19+4fq5fUD9HgjrQ== =ARzH -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "ARM: - Plug a race in the stage-2 mapping code where the IPA and the PA would end up being out of sync - Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...) - FP/SVE/SME documentation update, in the hope that this field becomes clearer... - Add workaround for Apple SEIS brokenness to a new SoC - Random comment fixes x86: - add MSR_IA32_TSX_CTRL into msrs_to_save - fixes for XCR0 handling in SGX enclaves Generic: - Fix vcpu_array[0] races - Fix race between starting a VM and 'reboot -f'" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: add MSR_IA32_TSX_CTRL into msrs_to_save KVM: x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM) KVM: VMX: Don't rely _only_ on CPUID to enforce XCR0 restrictions for ECREATE KVM: Fix vcpu_array[0] races KVM: VMX: Fix header file dependency of asm/vmx.h KVM: Don't enable hardware after a restart/shutdown is initiated KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown KVM: arm64: vgic: Add Apple M2 PRO/MAX cpus to the list of broken SEIS implementations KVM: arm64: Clarify host SME state management KVM: arm64: Restructure check for SVE support in FP trap handler KVM: arm64: Document check for TIF_FOREIGN_FPSTATE KVM: arm64: Fix repeated words in comments KVM: arm64: Constify start/end/phys fields of the pgtable walker data KVM: arm64: Infer PA offset from VA in hyp map walker KVM: arm64: Infer the PA offset from IPA in stage-2 map walker KVM: arm64: Use the bitmap API to allocate bitmaps KVM: arm64: Slightly optimize flush_context()	2023-05-21 13:58:37 -07:00
Mingwei Zhang	b9846a698c	KVM: VMX: add MSR_IA32_TSX_CTRL into msrs_to_save Add MSR_IA32_TSX_CTRL into msrs_to_save[] to explicitly tell userspace to save/restore the register value during migration. Missing this may cause userspace that relies on KVM ioctl(KVM_GET_MSR_INDEX_LIST) fail to port the value to the target VM. In addition, there is no need to add MSR_IA32_TSX_CTRL when ARCH_CAP_TSX_CTRL_MSR is not supported in kvm_get_arch_capabilities(). So add the checking in kvm_probe_msr_to_save(). Fixes: `c11f83e062` ("KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality") Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20230509032348.1153070-1-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-05-21 04:05:51 -04:00
Sean Christopherson	275a87244e	KVM: x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM) Drop KVM's manipulation of guest's CPUID.0x12.1 ECX and EDX, i.e. the allowed XFRM of SGX enclaves, now that KVM explicitly checks the guest's allowed XCR0 when emulating ECREATE. Note, this could theoretically break a setup where userspace advertises a "bad" XFRM and relies on KVM to provide a sane CPUID model, but QEMU is the only known user of KVM SGX, and QEMU explicitly sets the SGX CPUID XFRM subleaf based on the guest's XCR0. Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230503160838.3412617-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-05-21 04:05:51 -04:00
Sean Christopherson	ad45413d22	KVM: VMX: Don't rely _only_ on CPUID to enforce XCR0 restrictions for ECREATE Explicitly check the vCPU's supported XCR0 when determining whether or not the XFRM for ECREATE is valid. Checking CPUID works because KVM updates guest CPUID.0x12.1 to restrict the leaf to a subset of the guest's allowed XCR0, but that is rather subtle and KVM should not modify guest CPUID except for modeling true runtime behavior (allowed XFRM is most definitely not "runtime" behavior). Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230503160838.3412617-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-05-21 04:05:51 -04:00
Jacob Xu	3367eeab97	KVM: VMX: Fix header file dependency of asm/vmx.h Include a definition of WARN_ON_ONCE() before using it. Fixes: `bb1fcc70d9` ("KVM: nVMX: Allow L1 to use 5-level page walks for nested EPT") Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Jacob Xu <jacobhxu@google.com> [reworded commit message; changed <asm/bug.h> to <linux/bug.h>] Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220225012959.1554168-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2023-05-19 13:56:25 -04:00
Dave Airlie	33a8617088	drm-misc-next for 6.5: UAPI Changes: Cross-subsystem Changes: - arch: Consolidate <asm/fb.h> Core Changes: - aperture: Ignore firmware framebuffers with non-primary devices - fbdev: Use fbdev's I/O helpers - sysfs: Expose DRM connector ID - tests: More tests for drm_rect Driver Changes: - armada: Implement fbdev emulation as a client - bridge: - fsl-ldb: Support i.MX6SX - lt9211: Remove blanking packets - lt9611: Remove blanking packets - tc358768: Implement input bus formats reporting, fix various timings and clocks settings - ti-sn65dsi86: Implement wait_hpd_asserted - nouveau: Improve NULL pointer checks before dereference - panel: - nt36523: Support Lenovo J606F - st7703: Support Anbernic RG353V-V2 - new panels: InnoLux G070ACE-L01 - sun4i: Fix MIPI-DSI dotclock - vc4: RGB Range toggle property, BT601 and BT2020 support for HDMI - vkms: Convert to drmm helpers, Add reflection and rotation support -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRcEzekXsqa64kGDp7j7w1vZxhRxQUCZFyY2wAKCRDj7w1vZxhR xWfwAP47UUe2pd7DjVM3TZETgNDXAWxhMAiMhHDpDJIgHArVGgD+PNom4rm2efsr uiUt9yrwcIKUEfPPa+GfNIBFQ66hwg4= =HvL5 -----END PGP SIGNATURE----- Merge tag 'drm-misc-next-2023-05-11' of git://anongit.freedesktop.org/drm/drm-misc into drm-next drm-misc-next for 6.5: UAPI Changes: Cross-subsystem Changes: - arch: Consolidate <asm/fb.h> Core Changes: - aperture: Ignore firmware framebuffers with non-primary devices - fbdev: Use fbdev's I/O helpers - sysfs: Expose DRM connector ID - tests: More tests for drm_rect Driver Changes: - armada: Implement fbdev emulation as a client - bridge: - fsl-ldb: Support i.MX6SX - lt9211: Remove blanking packets - lt9611: Remove blanking packets - tc358768: Implement input bus formats reporting, fix various timings and clocks settings - ti-sn65dsi86: Implement wait_hpd_asserted - nouveau: Improve NULL pointer checks before dereference - panel: - nt36523: Support Lenovo J606F - st7703: Support Anbernic RG353V-V2 - new panels: InnoLux G070ACE-L01 - sun4i: Fix MIPI-DSI dotclock - vc4: RGB Range toggle property, BT601 and BT2020 support for HDMI - vkms: Convert to drmm helpers, Add reflection and rotation support Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maxime Ripard <maxime@cerno.tech> Link: https://patchwork.freedesktop.org/patch/msgid/2pxmxdzsk2ekjy6xvbpj67zrhtwvkkhfspuvdm5pfm5i54hed6@sooct7yq6z4w	2023-05-19 11:37:59 +10:00
Arnd Bergmann	454a348714	x86/platform: Avoid missing-prototype warnings for OLPC There are two functions in the olpc platform that have no prototype: arch/x86/platform/olpc/olpc_dt.c:237:13: error: no previous prototype for 'olpc_dt_fixup' [-Werror=missing-prototypes] arch/x86/platform/olpc/olpc-xo1-pm.c:73:26: error: no previous prototype for 'xo1_do_sleep' [-Werror=missing-prototypes] The first one should just be marked 'static' as there are no other callers, while the second one is called from assembler and is just a false-positive warning that can be silenced by adding a prototype. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-21-arnd%40kernel.org	2023-05-18 11:56:19 -07:00
Arnd Bergmann	3b939ba0c2	x86/usercopy: Include arch_wb_cache_pmem() declaration arch_wb_cache_pmem() is declared in a global header but defined by the architecture. On x86, the implementation needs to include the header to avoid this warning: arch/x86/lib/usercopy_64.c:39:6: error: no previous prototype for 'arch_wb_cache_pmem' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-18-arnd%40kernel.org	2023-05-18 11:56:18 -07:00
Arnd Bergmann	3e0bd4dd35	x86/vdso: Include vdso/processor.h __vdso_getcpu is declared in a header but this is not included before the definition, causing a W=1 warning: arch/x86/entry/vdso/vgetcpu.c:13:1: error: no previous prototype for '__vdso_getcpu' [-Werror=missing-prototypes] arch/x86/entry/vdso/vdso32/../vgetcpu.c:13:1: error: no previous prototype for '__vdso_getcpu' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-17-arnd%40kernel.org	2023-05-18 11:56:18 -07:00
Arnd Bergmann	e9c2a283e7	x86/mce: Add copy_mc_fragile_handle_tail() prototype copy_mc_fragile_handle_tail() is only called from assembler, but 'make W=1' complains about a missing prototype: arch/x86/lib/copy_mc.c:26:1: warning: no previous prototype for ‘copy_mc_fragile_handle_tail’ [-Wmissing-prototypes] 26 \| copy_mc_fragile_handle_tail(char to, char from, unsigned len) Add the prototype to avoid the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-16-arnd%40kernel.org	2023-05-18 11:56:18 -07:00
Arnd Bergmann	29bf464cb8	x86/fbdev: Include asm/fb.h as needed fb_is_primary_device() is defined as a global function on x86, unlike the others that have an inline version. The file that defines is however needs to include the declaration to avoid a warning: arch/x86/video/fbdev.c:14:5: error: no previous prototype for 'fb_is_primary_device' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-15-arnd%40kernel.org	2023-05-18 11:56:18 -07:00
Arnd Bergmann	f34f0d3c10	x86/entry: Add do_SYSENTER_32() prototype The 32-bit system call entry points can be called on both 32-bit and 64-bit kernels, but on the former the declarations are hidden: arch/x86/entry/common.c:238:24: error: no previous prototype for 'do_SYSENTER_32' [-Werror=missing-prototypes] Move them all out of the #ifdef block to avoid the warnings. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-12-arnd%40kernel.org	2023-05-18 11:56:18 -07:00
Arnd Bergmann	056b44a4d1	x86/quirks: Include linux/pnp.h for arch_pnpbios_disabled() arch_pnpbios_disabled() is defined in architecture code on x86, but this does not include the appropriate header, causing a warning: arch/x86/kernel/platform-quirks.c:42:13: error: no previous prototype for 'arch_pnpbios_disabled' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-10-arnd%40kernel.org	2023-05-18 11:56:18 -07:00
Arnd Bergmann	b963d12aa6	x86/mm: Include asm/numa.h for set_highmem_pages_init() The set_highmem_pages_init() function is declared in asm/numa.h, which must be included in the file that defines it to avoid a W=1 warning: arch/x86/mm/highmem_32.c:7:13: error: no previous prototype for 'set_highmem_pages_init' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-9-arnd%40kernel.org	2023-05-18 11:56:18 -07:00
Arnd Bergmann	c966483930	x86: Avoid missing-prototype warnings for doublefault code Two functions in the 32-bit doublefault code are lacking a prototype: arch/x86/kernel/doublefault_32.c:23:36: error: no previous prototype for 'doublefault_shim' [-Werror=missing-prototypes] 23 \| asmlinkage noinstr void __noreturn doublefault_shim(void) \| ^~~~~~~~~~~~~~~~ arch/x86/kernel/doublefault_32.c:114:6: error: no previous prototype for 'doublefault_init_cpu_tss' [-Werror=missing-prototypes] 114 \| void doublefault_init_cpu_tss(void) The first one is only called from assembler, while the second one is declared in doublefault.h, but this file is not included. Include the header file and add the other declaration there as well. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-8-arnd%40kernel.org	2023-05-18 11:56:18 -07:00
Arnd Bergmann	16db7e9c6e	x86/fpu: Include asm/fpu/regset.h The fpregs_soft_set/fpregs_soft_get functions are declared in a header that is not included in the file that defines them, causing a W=1 warning: /home/arnd/arm-soc/arch/x86/math-emu/fpu_entry.c:638:5: error: no previous prototype for 'fpregs_soft_set' [-Werror=missing-prototypes] 638 \| int fpregs_soft_set(struct task_struct target, \| ^~~~~~~~~~~~~~~ /home/arnd/arm-soc/arch/x86/math-emu/fpu_entry.c:690:5: error: no previous prototype for 'fpregs_soft_get' [-Werror=missing-prototypes] 690 \| int fpregs_soft_get(struct task_struct target, Include the file here to avoid the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-7-arnd%40kernel.org	2023-05-18 11:56:18 -07:00
Arnd Bergmann	2eb5d1df2a	x86: Add dummy prototype for mk_early_pgtbl_32() 'make W=1' warns about a function without a prototype in the x86-32 head code: arch/x86/kernel/head32.c:72:13: error: no previous prototype for 'mk_early_pgtbl_32' [-Werror=missing-prototypes] This is called from assembler code, so it does not actually need a prototype. I could not find an appropriate header for it, so just declare it in front of the definition to shut up the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-6-arnd%40kernel.org	2023-05-18 11:56:16 -07:00
Arnd Bergmann	0253b04d5b	x86/pci: Mark local functions as 'static' Two functions in this file are global but have no prototype in a header and are not called from elsewhere, so they should be static: arch/x86/pci/ce4100.c:86:6: error: no previous prototype for 'sata_revid_init' [-Werror=missing-prototypes] arch/x86/pci/ce4100.c:175:5: error: no previous prototype for 'bridge_read' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-3-arnd%40kernel.org	2023-05-18 11:56:10 -07:00
Arnd Bergmann	26c3379a69	x86/ftrace: Move prepare_ftrace_return prototype to header On 32-bit builds, the prepare_ftrace_return() function only has a global definition, but no prototype before it, which causes a warning: arch/x86/kernel/ftrace.c:625:6: warning: no previous prototype for ‘prepare_ftrace_return’ [-Wmissing-prototypes] 625 \| void prepare_ftrace_return(unsigned long ip, unsigned long *parent, Move the prototype that is already needed for some configurations into a header file where it can be seen unconditionally. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-2-arnd%40kernel.org	2023-05-18 11:56:01 -07:00
Linus Torvalds	2d1bcbc6cd	Probes fixes for 6.4-rc1: - Initialize 'ret' local variables on fprobe_handler() to fix the smatch warning. With this, fprobe function exit handler is not working randomly. - Fix to use preempt_enable/disable_notrace for rethook handler to prevent recursive call of fprobe exit handler (which is based on rethook) - Fix recursive call issue on fprobe_kprobe_handler(). - Fix to detect recursive call on fprobe_exit_handler(). - Fix to make all arch-dependent rethook code notrace. (the arch-independent code is already notrace) -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmRmKgQACgkQ2/sHvwUr PxvlCgf+OJk5O9IJlTgqDV6JNPsTzFS7qqyAyQmZW9Bj8STfWAIRxa0zeGbZE58K 5LwgzAj+SqzYRwIvzzZ3xsA5j7f1Wj7wG0TQgmpnIW+hprwDrLsUhoZ5s1D/Ojel A4rAnqCrgnh5m5SenU2QCUngGKn004j4RASaZvRELDyvyIkBSqNhswCH8ZWGPror KuCu5AmEnFagYl0lmNL3H2aCITAg3QEK+fE6iR+lYsqfR3xbs4YAcqiylHBdY0wX ssK7LVdRmv7O6TxSj4P2ohDvLJP3eL9bVirsJpg0OVbqWJCs65T2rJJjXiKojYXf vSVWFJFK5oV98ZHfXTG9R7x0DEwc+g== =jO68 -----END PGP SIGNATURE----- Merge tag 'probes-fixes-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - Initialize 'ret' local variables on fprobe_handler() to fix the smatch warning. With this, fprobe function exit handler is not working randomly. - Fix to use preempt_enable/disable_notrace for rethook handler to prevent recursive call of fprobe exit handler (which is based on rethook) - Fix recursive call issue on fprobe_kprobe_handler() - Fix to detect recursive call on fprobe_exit_handler() - Fix to make all arch-dependent rethook code notrace (the arch-independent code is already notrace)" * tag 'probes-fixes-v6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rethook, fprobe: do not trace rethook related functions fprobe: add recursion detection in fprobe_exit_handler fprobe: make fprobe_kprobe_handler recursion free rethook: use preempt_{disable, enable}_notrace in rethook_trampoline_handler tracing: fprobe: Initialize ret valiable to fix smatch error	2023-05-18 09:04:45 -07:00
Thomas Zimmermann	8ff1541da3	fbdev: Include <linux/fb.h> instead of <asm/fb.h> Replace include statements for <asm/fb.h> with <linux/fb.h>. Fixes the coding style: if a header is available in asm/ and linux/, it is preferable to include the header from linux/. This only affects a few source files, most of which already include <linux/fb.h>. Suggested-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Sui Jingfeng <suijingfeng@loongson.cn> Reviewed-by: Sam Ravnborg <sam@ravnborg.org> Link: https://patchwork.freedesktop.org/patch/msgid/20230512102444.5438-6-tzimmermann@suse.de	2023-05-18 11:06:21 +02:00
Ze Gao	571a2a50a8	rethook, fprobe: do not trace rethook related functions These functions are already marked as NOKPROBE to prevent recursion and we have the same reason to blacklist them if rethook is used with fprobe, since they are beyond the recursion-free region ftrace can guard. Link: https://lore.kernel.org/all/20230517034510.15639-5-zegao@tencent.com/ Fixes: `f3a112c0c4` ("x86,rethook,kprobes: Replace kretprobe with rethook on x86") Signed-off-by: Ze Gao <zegao@tencent.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>	2023-05-18 07:08:01 +09:00
Anisse Astier	d86ff3333c	efivarfs: expose used and total size When writing EFI variables, one might get errors with no other message on why it fails. Being able to see how much is used by EFI variables helps analyzing such issues. Since this is not a conventional filesystem, block size is intentionally set to 1 instead of PAGE_SIZE. x86 quirks of reserved size are taken into account; so that available and free size can be different, further helping debugging space issues. With this patch, one can see the remaining space in EFI variable storage via efivarfs, like this: $ df -h /sys/firmware/efi/efivars/ Filesystem Size Used Avail Use% Mounted on efivarfs 176K 106K 66K 62% /sys/firmware/efi/efivars Signed-off-by: Anisse Astier <an.astier@criteo.com> [ardb: - rename efi_reserved_space() to efivar_reserved_space() - whitespace/coding style tweaks] Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2023-05-17 18:21:34 +02:00
Dave Hansen	ce0b15d11a	x86/mm: Avoid incomplete Global INVLPG flushes The INVLPG instruction is used to invalidate TLB entries for a specified virtual address. When PCIDs are enabled, INVLPG is supposed to invalidate TLB entries for the specified address for both the current PCID and Global entries. (Note: Only kernel mappings set Global=1.) Unfortunately, some INVLPG implementations can leave Global translations unflushed when PCIDs are enabled. As a workaround, never enable PCIDs on affected processors. I expect there to eventually be microcode mitigations to replace this software workaround. However, the exact version numbers where that will happen are not known today. Once the version numbers are set in stone, the processor list can be tweaked to only disable PCIDs on affected processors with affected microcode. Note: if anyone wants a quick fix that doesn't require patching, just stick 'nopcid' on your kernel command-line. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org	2023-05-17 08:55:02 -07:00
Borislav Petkov (AMD)	f220125b99	x86/retbleed: Add __x86_return_thunk alignment checks Add a linker assertion and compute the 0xcc padding dynamically so that __x86_return_thunk is always cacheline-aligned. Leave the SYM_START() macro in as the untraining doesn't need ENDBR annotations anyway. Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Link: https://lore.kernel.org/r/20230515140726.28689-1-bp@alien8.de	2023-05-17 12:14:21 +02:00
Arnd Bergmann	ef104443bf	procfs: consolidate arch_report_meminfo declaration The arch_report_meminfo() function is provided by four architectures, with a __weak fallback in procfs itself. On architectures that don't have a custom version, the __weak version causes a warning because of the missing prototype. Remove the architecture specific prototypes and instead add one in linux/proc_fs.h. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for arch/x86 Acked-by: Helge Deller <deller@gmx.de> # parisc Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Message-Id: <20230516195834.551901-1-arnd@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-05-17 09:24:49 +02:00
Josh Poimboeuf	89da5a69a8	x86/unwind/orc: Add 'unwind_debug' cmdline option Sometimes the one-line ORC unwinder warnings aren't very helpful. Add a new 'unwind_debug' cmdline option which will dump the full stack contents of the current task when an error condition is encountered. Reviewed-by: Miroslav Benes <mbenes@suse.cz> Link: https://lore.kernel.org/r/6afb9e48a05fd2046bfad47e69b061b43dfd0e0e.1681331449.git.jpoimboe@kernel.org Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>	2023-05-16 06:31:50 -07:00
Vernon Lovejoy	2e4be0d011	x86/show_trace_log_lvl: Ensure stack pointer is aligned, again The commit `e335bb51cc` ("x86/unwind: Ensure stack pointer is aligned") tried to align the stack pointer in show_trace_log_lvl(), otherwise the "stack < stack_info.end" check can't guarantee that the last read does not go past the end of the stack. However, we have the same problem with the initial value of the stack pointer, it can also be unaligned. So without this patch this trivial kernel module #include <linux/module.h> static int init(void) { asm volatile("sub $0x4,%rsp"); dump_stack(); asm volatile("add $0x4,%rsp"); return -EAGAIN; } module_init(init); MODULE_LICENSE("GPL"); crashes the kernel. Fixes: `e335bb51cc` ("x86/unwind: Ensure stack pointer is aligned") Signed-off-by: Vernon Lovejoy <vlovejoy@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20230512104232.GA10227@redhat.com Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>	2023-05-16 06:31:04 -07:00
Jiapeng Chong	95f0e3a209	x86/unwind/orc: Use swap() instead of open coding it Swap is a function interface that provides exchange function. To avoid code duplication, we can use swap function. ./arch/x86/kernel/unwind_orc.c:235:16-17: WARNING opportunity for swap(). Reported-by: Abaci Robot <abaci@linux.alibaba.com> Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4641 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Link: https://lore.kernel.org/r/20230330020014.40489-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>	2023-05-16 06:06:56 -07:00
Yazen Ghannam	e40879b6d7	x86/MCE: Check a hw error's address to determine proper recovery action Make sure that machine check errors with a usable address are properly marked as poison. This is needed for errors that occur on memory which have MCG_STATUS[RIPV] clear - i.e., the interrupted process cannot be restarted reliably. One example is data poison consumption through the instruction fetch units on AMD Zen-based systems. The MF_MUST_KILL flag is passed to memory_failure() when MCG_STATUS[RIPV] is not set. So the associated process will still be killed. What this does, practically, is get rid of one more check to kill_current_task with the eventual goal to remove it completely. Also, make the handling identical to what is done on the notifier path (uc_decode_notifier() does that address usability check too). The scenario described above occurs when hardware can precisely identify the address of poisoned memory, but execution cannot reliably continue for the interrupted hardware thread. [ bp: Massage commit message. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20230322005131.174499-1-tony.luck@intel.com	2023-05-16 12:16:22 +02:00
Lukas Bulwahn	7583e8fbdc	x86/cpu: Remove X86_FEATURE_NAMES While discussing to change the visibility of X86_FEATURE_NAMES (see Link) in order to remove CONFIG_EMBEDDED, Boris suggested to simply make the X86_FEATURE_NAMES functionality unconditional. As the need for really tiny kernel images has gone away and kernel images with !X86_FEATURE_NAMES are hardly tested, remove this config and the whole ifdeffery in the source code. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20230509084007.24373-1-lukas.bulwahn@gmail.com/ Link: https://lore.kernel.org/r/20230510065713.10996-3-lukas.bulwahn@gmail.com	2023-05-15 20:03:08 +02:00
Lukas Bulwahn	424e23fd6c	x86/Kconfig: Make X86_FEATURE_NAMES non-configurable in prompt While discussing to change the visibility of X86_FEATURE_NAMES (see Link) in order to remove CONFIG_EMBEDDED, Boris suggested to simply make the X86_FEATURE_NAMES functionality unconditional. As a first step, make X86_FEATURE_NAMES disappear from Kconfig. So, as X86_FEATURE_NAMES defaults to yes, to disable it, one now needs to modify the .config file before compiling the kernel. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/all/20230509084007.24373-1-lukas.bulwahn@gmail.com/	2023-05-15 19:56:19 +02:00
Thomas Gleixner	0c7ffa32db	x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it Implement the validation function which tells the core code whether parallel bringup is possible. The only condition for now is that the kernel does not run in an encrypted guest as these will trap the RDMSR via #VC, which cannot be handled at that point in early startup. There was an earlier variant for AMD-SEV which used the GHBC protocol for retrieving the APIC ID via CPUID, but there is no guarantee that the initial APIC ID in CPUID is the same as the real APIC ID. There is no enforcement from the secure firmware and the hypervisor can assign APIC IDs as it sees fit as long as the ACPI/MADT table is consistent with that assignment. Unfortunately there is no RDMSR GHCB protocol at the moment, so enabling AMD-SEV guests for parallel startup needs some more thought. Intel-TDX provides a secure RDMSR hypercall, but supporting that is outside the scope of this change. Fixup announce_cpu() as e.g. on Hyper-V CPU1 is the secondary sibling of CPU0, which makes the @cpu == 1 logic in announce_cpu() fall apart. [ mikelley: Reported the announce_cpu() fallout Originally-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.467571745@linutronix.de	2023-05-15 13:45:05 +02:00
David Woodhouse	7e75178a09	x86/smpboot: Support parallel startup of secondary CPUs In parallel startup mode the APs are kicked alive by the control CPU quickly after each other and run through the early startup code in parallel. The real-mode startup code is already serialized with a bit-spinlock to protect the real-mode stack. In parallel startup mode the smpboot_control variable obviously cannot contain the Linux CPU number so the APs have to determine their Linux CPU number on their own. This is required to find the CPUs per CPU offset in order to find the idle task stack and other per CPU data. To achieve this, export the cpuid_to_apicid[] array so that each AP can find its own CPU number by searching therein based on its APIC ID. Introduce a flag in the top bits of smpboot_control which indicates that the AP should find its CPU number by reading the APIC ID from the APIC. This is required because CPUID based APIC ID retrieval can only provide the initial APIC ID, which might have been overruled by the firmware. Some AMD APUs come up with APIC ID = initial APIC ID + 0x10, so the APIC ID to CPU number lookup would fail miserably if based on CPUID. Also virtualization can make its own APIC ID assignements. The only requirement is that the APIC IDs are consistent with the APCI/MADT table. For the boot CPU or in case parallel bringup is disabled the control bits are empty and the CPU number is directly available in bit 0-23 of smpboot_control. [ tglx: Initial proof of concept patch with bitlock and APIC ID lookup ] [ dwmw2: Rework and testing, commit message, CPUID 0x1 and CPU0 support ] [ seanc: Fix stray override of initial_gs in common_cpu_up() ] [ Oleksandr Natalenko: reported suspend/resume issue fixed in x86_acpi_suspend_lowlevel ] [ tglx: Make it read the APIC ID from the APIC instead of using CPUID, split the bitlock part out ] Co-developed-by: Thomas Gleixner <tglx@linutronix.de> Co-developed-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.411554373@linutronix.de	2023-05-15 13:45:04 +02:00
Thomas Gleixner	f6f1ae9128	x86/smpboot: Implement a bit spinlock to protect the realmode stack Parallel AP bringup requires that the APs can run fully parallel through the early startup code including the real mode trampoline. To prepare for this implement a bit-spinlock to serialize access to the real mode stack so that parallel upcoming APs are not going to corrupt each others stack while going through the real mode startup code. Co-developed-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.355425551@linutronix.de	2023-05-15 13:45:03 +02:00
Thomas Gleixner	bea629d57d	x86/apic: Save the APIC virtual base address For parallel CPU brinugp it's required to read the APIC ID in the low level startup code. The virtual APIC base address is a constant because its a fix-mapped address. Exposing that constant which is composed via macros to assembly code is non-trivial due to header inclusion hell. Aside of that it's constant only because of the vsyscall ABI requirement. Once vsyscall is out of the picture the fixmap can be placed at runtime. Avoid header hell, stay flexible and store the address in a variable which can be exposed to the low level startup code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.299231005@linutronix.de	2023-05-15 13:45:03 +02:00
Thomas Gleixner	f54d4434c2	x86/apic: Provide cpu_primary_thread mask Make the primary thread tracking CPU mask based in preparation for simpler handling of parallel bootup. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.186599880@linutronix.de	2023-05-15 13:45:02 +02:00
Thomas Gleixner	8b5a0f957c	x86/smpboot: Enable split CPU startup The x86 CPU bringup state currently does AP wake-up, wait for AP to respond and then release it for full bringup. It is safe to be split into a wake-up and and a separate wait+release state. Provide the required functions and enable the split CPU bringup, which prepares for parallel bringup, where the bringup of the non-boot CPUs takes two iterations: One to prepare and wake all APs and the second to wait and release them. Depending on timing this can eliminate the wait time completely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.133453992@linutronix.de	2023-05-15 13:45:01 +02:00
Thomas Gleixner	2711b8e2b7	x86/smpboot: Switch to hotplug core state synchronization The new AP state tracking and synchronization mechanism in the CPU hotplug core code allows to remove quite some x86 specific code: 1) The AP alive synchronization based on cpumasks 2) The decision whether an AP can be brought up again Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.529657366@linutronix.de	2023-05-15 13:44:56 +02:00
Thomas Gleixner	ab24eb9abb	x86/xen/hvm: Get rid of DEAD_FROZEN handling No point in this conditional voodoo. Un-initializing the lock mechanism is safe to be called unconditionally even if it was already invoked when the CPU died. Remove the invocation of xen_smp_intr_free() as that has been already cleaned up in xen_cpu_dead_hvm(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.423407127@linutronix.de	2023-05-15 13:44:55 +02:00
Thomas Gleixner	2de7fd26d9	x86/xen/smp_pv: Remove wait for CPU online Now that the core code drops sparse_irq_lock after the idle thread synchronized, it's pointless to wait for the AP to mark itself online. Whether the control CPU runs in a wait loop or sleeps in the core code waiting for the online operation to complete makes no difference. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.369512093@linutronix.de	2023-05-15 13:44:54 +02:00
Thomas Gleixner	e464640cf7	x86/smpboot: Remove wait for cpu_online() Now that the core code drops sparse_irq_lock after the idle thread synchronized, it's pointless to wait for the AP to mark itself online. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.316417181@linutronix.de	2023-05-15 13:44:54 +02:00
Thomas Gleixner	c8b7fb09d1	x86/smpboot: Remove cpu_callin_mask Now that TSC synchronization is SMP function call based there is no reason to wait for the AP to be set in smp_callin_mask. The control CPU waits for the AP to set itself in the online mask anyway. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.206394064@linutronix.de	2023-05-15 13:44:53 +02:00
Thomas Gleixner	9d349d47f0	x86/smpboot: Make TSC synchronization function call based Spin-waiting on the control CPU until the AP reaches the TSC synchronization is just a waste especially in the case that there is no synchronization required. As the synchronization has to run with interrupts disabled the control CPU part can just be done from a SMP function call. The upcoming AP issues that call async only in the case that synchronization is required. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.148255496@linutronix.de	2023-05-15 13:44:53 +02:00
Thomas Gleixner	d4f28f07c2	x86/smpboot: Move synchronization masks to SMP boot code The usage is in smpboot.c and not in the CPU initialization code. The XEN_PV usage of cpu_callout_mask is obsolete as cpu_init() not longer waits and cacheinfo has its own CPU mask now, so cpu_callout_mask can be made static too. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.091511483@linutronix.de	2023-05-15 13:44:52 +02:00
Thomas Gleixner	a32226fa3b	x86/cpu/cacheinfo: Remove cpu_callout_mask dependency cpu_callout_mask is used for the stop machine based MTRR/PAT init. In preparation of moving the BP/AP synchronization to the core hotplug code, use a private CPU mask for cacheinfo and manage it in the starting/dying hotplug state. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205256.035041005@linutronix.de	2023-05-15 13:44:52 +02:00
Thomas Gleixner	e94cd1503b	x86/smpboot: Get rid of cpu_init_secondary() The synchronization of the AP with the control CPU is a SMP boot problem and has nothing to do with cpu_init(). Open code cpu_init_secondary() in start_secondary() and move wait_for_master_cpu() into the SMP boot code. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.981999763@linutronix.de	2023-05-15 13:44:51 +02:00
David Woodhouse	2b3be65d2e	x86/smpboot: Split up native_cpu_up() into separate phases and document them There are four logical parts to what native_cpu_up() does on the BSP (or on the controlling CPU for a later hotplug): 1) Wake the AP by sending the INIT/SIPI/SIPI sequence. 2) Wait for the AP to make it as far as wait_for_master_cpu() which sets that CPU's bit in cpu_initialized_mask, then sets the bit in cpu_callout_mask to let the AP proceed through cpu_init(). 3) Wait for the AP to finish cpu_init() and get as far as the smp_callin() call, which sets that CPU's bit in cpu_callin_mask. 4) Perform the TSC synchronization and wait for the AP to actually mark itself online in cpu_online_mask. In preparation to allow these phases to operate in parallel on multiple APs, split them out into separate functions and document the interactions a little more clearly in both the BP and AP code paths. No functional change intended. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.928917242@linutronix.de	2023-05-15 13:44:51 +02:00
Thomas Gleixner	c7f15dd3f0	x86/smpboot: Remove unnecessary barrier() Peter stumbled over the barrier() after the invocation of smp_callin() in start_secondary(): "...this barrier() and it's comment seem weird vs smp_callin(). That function ends with an atomic bitop (it has to, at the very least it must not be weaker than store-release) but also has an explicit wmb() to order setup vs CPU_STARTING. There is no way the smp_processor_id() referred to in this comment can land before cpu_init() even without the barrier()." The barrier() along with the comment was added in 2003 with commit d8f19f2cac70 ("[PATCH] x86-64 merge") in the history tree. One of those well documented combo patches of that time which changes world and some more. The context back then was: /* * Dont put anything before smp_callin(), SMP * booting is too fragile that we want to limit the * things done here to the most necessary things. / cpu_init(); smp_callin(); + / otherwise gcc will move up smp_processor_id before the cpu_init */ + barrier(); Dprintk("cpu %d: waiting for commence\n", smp_processor_id()); Even back in 2003 the compiler was not allowed to reorder that smp_processor_id() invocation before the cpu_init() function call. Especially not as smp_processor_id() resolved to: asm volatile("movl %%gs:%c1,%0":"=r" (ret__):"i"(pda_offset(field)):"memory"); There is no trace of this change in any mailing list archive including the back then official x86_64 list discuss@x86-64.org, which would explain the problem this change solved. The debug prints are gone by now and the the only smp_processor_id() invocation today is farther down in start_secondary() after locking vector_lock which itself prevents reordering. Even if the compiler would be allowed to reorder this, the code would still be correct as GSBASE is set up early in the assembly code and is valid when the CPU reaches start_secondary(), while the code at the time when this barrier was added did the GSBASE setup in cpu_init(). As the barrier has zero value, remove it. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.875713771@linutronix.de	2023-05-15 13:44:50 +02:00
Thomas Gleixner	cded367976	x86/smpboot: Restrict soft_restart_cpu() to SEV Now that the CPU0 hotplug cruft is gone, the only user is AMD SEV. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.822234014@linutronix.de	2023-05-15 13:44:50 +02:00
Thomas Gleixner	5475abbde7	x86/smpboot: Remove the CPU0 hotplug kludge This was introduced with commit `e1c467e690` ("x86, hotplug: Wake up CPU0 via NMI instead of INIT, SIPI, SIPI") to eventually support physical hotplug of CPU0: "We'll change this code in the future to wake up hard offlined CPU0 if real platform and request are available." 11 years later this has not happened and physical hotplug is not officially supported. Remove the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.768845190@linutronix.de	2023-05-15 13:44:49 +02:00
Thomas Gleixner	e59e74dc48	x86/topology: Remove CPU0 hotplug option This was introduced together with commit `e1c467e690` ("x86, hotplug: Wake up CPU0 via NMI instead of INIT, SIPI, SIPI") to eventually support physical hotplug of CPU0: "We'll change this code in the future to wake up hard offlined CPU0 if real platform and request are available." 11 years later this has not happened and physical hotplug is not officially supported. Remove the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.715707999@linutronix.de	2023-05-15 13:44:49 +02:00
Thomas Gleixner	666e1156b2	x86/smpboot: Rename start_cpu0() to soft_restart_cpu() This is used in the SEV play_dead() implementation to re-online CPUs. But that has nothing to do with CPU0. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.662319599@linutronix.de	2023-05-15 13:44:48 +02:00
Thomas Gleixner	134a12827b	x86/smpboot: Avoid pointless delay calibration if TSC is synchronized When TSC is synchronized across sockets then there is no reason to calibrate the delay for the first CPU which comes up on a socket. Just reuse the existing calibration value. This removes 100ms pointlessly wasted time from CPU hotplug per socket. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.608773568@linutronix.de	2023-05-15 13:44:48 +02:00
Thomas Gleixner	ba831b7b1a	cpu/hotplug: Mark arch_disable_smp_support() and bringup_nonboot_cpus() __init No point in keeping them around. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.551974164@linutronix.de	2023-05-15 13:44:47 +02:00
Thomas Gleixner	5107e3ebb8	x86/smpboot: Cleanup topology_phys_to_logical_pkg()/die() Make topology_phys_to_logical_pkg_die() static as it's only used in smpboot.c and fixup the kernel-doc warnings for both functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205255.493750666@linutronix.de	2023-05-15 13:44:47 +02:00
Linus Torvalds	ef21831c2e	- Make sure the PEBS buffer is flushed before reprogramming the hardware so that the correct record sizes are used - Update the sample size for AMD BRS events - Fix a confusion with using the same on-stack struct with different events in the event processing path -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmRgzWIACgkQEsHwGGHe VUpdgw//a1toWyjwrIV1YMu8lEpsrPKpOqIFuDQcLSl1vsYrmTRJ47PI1j/ZTQeo HgNkEE6lxAa9h/lKAjlE/lACE6Hr59xnQmu0BdG/SS+hlhWkT+oKLEUWz5qD4MuE bWdpxwHOhMIFR1ASAMThy/mE9V4TKsI/tsd7lMXUo6/skDGCmCGIgRq//3NUB5fV 0ivp5lv6NXFnUwS34Ot3fbWj/be7rr2vkYgN8WbwMAaEbpCIyseh6Tz+5ZRbENfP dMdh6ryuJ2BJ9BcDe9XlcEvPcaTvz7LVnzOVFz/AnBgtBTIOw/26xt17pgXBH7NK kpTKQTPp0mnt6ysnX5zYkeumKaxxqvVWaf18AQHkupj1HwggjiEFPnKK9KfslSy4 1tcED/D3i5QLOx+A8lCtA4ACwGl0Cvwgvw98Gp9imLst/zmMKa4MK96BYCodirKJ iDKN5aFA6c3pKJ4KTE7N6KKFzwhslTrehTHAJIL7BiVw3aMGin6514OnMELZBzam /zud81OWAKywWWRSwg7wy+K8RGH0R6K5dhwFrrm2BMqAluMq+rX1pRY9pEsL6jDj bCl45L52IsXZBSz2JTwWHGTssPyeDIe157ICFDOBnIx08u4KzJ+Knxsbaq2Jjs3R 9wm5H9yp/+q7//3XcEkdFjQwDVh2LJkY0QinH+6rPiAseBC9ukU= =OCba -----END PGP SIGNATURE----- Merge tag 'perf_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Make sure the PEBS buffer is flushed before reprogramming the hardware so that the correct record sizes are used - Update the sample size for AMD BRS events - Fix a confusion with using the same on-stack struct with different events in the event processing path * tag 'perf_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG perf/x86: Fix missing sample size update on AMD BRS perf/core: Fix perf_sample_data not properly initialized for different swevents in perf_tp_event()	2023-05-14 07:56:51 -07:00
Linus Torvalds	011e33ee48	- Add the required PCI IDs so that the generic SMN accesses provided by amd_nb.c work for drivers which switch to them. Add a PCI device ID to k10temp's table so that latter is loaded on such systems too -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmRgx6IACgkQEsHwGGHe VUpzJQ//WL//vtzAyQyEQQ0HP5cUNwtts79rEWpNyHJD0nFymbOrR7ho97HTi7pu a+wc3p9gwTC626GJD9dFVZs9YPqXI/Q2BKndQ6vrct9eaVnxWiJ45B5yJYLCUppE AHAChrDZ8ErXQjV1jeHtkIZY1mSa5Q7slfn8KFLzY5gIWA/3ds2pmIDK4r2wqQUR xIn+kDaaprsu8vhffWtHKIj5n1zRkLwPNZZ/2rZbQ7NJHt0yqu4Ld/L6myC6cQVj cX1Z1Wg++T3rZHxChx69w/jYuCY2EeM5JHXEbQqUG1tnfkpInZqvyjz1FUs9iYOY NOv9hcod9jIRjGvMooKn7MDhVy3yCtyrR590tKVP4qj3Wb+wKKJtPECjMfCo6I0H nVIiQPMZAbbo9UVLVA9UickSxKUgNCvhANUcnYWEQp+DM8wYxbV35PT8lomNz6Cj 3mKo/uqQPfGn3yUR1GIQagYSxHZJ1NTv50lPQOV0tZtfP2Gk/H4uT2w6ZFJKiJIV KTiE5vFW9VhbVye/VcHWyJ3+mB5SgClaA+t87zGpgJiPjfVWo2bnB6JkwZodEWi9 V8c53Hq+NG78QnFYZR5WNDYQ96LYWfurhcuqol8Sr836Y9LaOCGEAIsPMvG7esB2 y/GHRWdhkjOf5uNHduGoOVzlH95qW9mBtdFacVpaM/eocb1Zghk= =dn2p -----END PGP SIGNATURE----- Merge tag 'x86_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Borislav Petkov: - Add the required PCI IDs so that the generic SMN accesses provided by amd_nb.c work for drivers which switch to them. Add a PCI device ID to k10temp's table so that latter is loaded on such systems too * tag 'x86_urgent_for_v6.4_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hwmon: (k10temp) Add PCI ID for family 19, model 78h x86/amd_nb: Add PCI ID for family 19h model 78h	2023-05-14 07:44:48 -07:00

... 2 3 4 5 6 ...

44289 commits