Replace uses of sme_active() with the more generic cc_platform_has()
using CC_ATTR_HOST_MEM_ENCRYPT. If future support is added for other
memory encryption technologies, the use of CC_ATTR_HOST_MEM_ENCRYPT
can be updated, as required.
This also replaces two usages of sev_active() that are really geared
towards detecting if SME is active.
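As an illustration only (not a hunk from this patch), the conversion at the
call sites follows this pattern; cc_platform_has() takes a CC_ATTR_*
attribute and returns a bool, and the helper below is a made-up placeholder:

  #include <linux/cc_platform.h>
  #include <linux/printk.h>

  static void report_host_mem_encrypt(void)
  {
      /* Before: if (sme_active()) ... */

      /* After: vendor-neutral attribute query */
      if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
          pr_info("Host memory encryption is active\n");
  }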
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-6-bp@alien8.de
Introduce an x86 version of the cc_platform_has() function. This will be
used to replace vendor specific calls like sme_active(), sev_active(),
etc.
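A rough sketch of the shape of the x86 implementation (the attribute checks
below are simplified and may differ from the actual arch/x86 code):

  #include <linux/cc_platform.h>
  #include <linux/mem_encrypt.h>
  #include <linux/export.h>

  bool cc_platform_has(enum cc_attr attr)
  {
      switch (attr) {
      case CC_ATTR_MEM_ENCRYPT:
          return sme_me_mask != 0;
      case CC_ATTR_HOST_MEM_ENCRYPT:
          /* SME on the host, but not a SEV guest */
          return sme_me_mask && !sev_active();
      case CC_ATTR_GUEST_MEM_ENCRYPT:
          return sev_active();
      default:
          return false;
      }
  }
  EXPORT_SYMBOL_GPL(cc_platform_has);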
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210928191009.32551-4-bp@alien8.de
Replace audit syscall class magic numbers with macros.
This required putting the macros into a new header file,
include/linux/audit_arch.h, since the syscall macros were
included for both 64-bit and 32-bit in any compat code, causing
redefinition warnings.
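For illustration, the new header carries named syscall classes roughly along
these lines (treat the exact set of values as a sketch; the authoritative
list is in the header itself):

  /* include/linux/audit_arch.h -- sketch */
  #ifndef _LINUX_AUDIT_ARCH_H_
  #define _LINUX_AUDIT_ARCH_H_

  enum auditsc_class_t {
      AUDITSC_NATIVE = 0,
      AUDITSC_COMPAT,
      AUDITSC_OPEN,
      AUDITSC_OPENAT,
      AUDITSC_SOCKETCALL,
      AUDITSC_EXECVE,
      AUDITSC_NVALS    /* count */
  };

  #endif

The audit classification code then passes e.g. AUDITSC_OPENAT instead of a
bare magic number.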
Link: https://lore.kernel.org/r/2300b1083a32aade7ae7efb95826e8f3f260b1df.1621363275.git.rgb@redhat.com
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
[PM: renamed header to audit_arch.h after consulting with Richard]
Signed-off-by: Paul Moore <paul@paul-moore.com>
All Zen or newer CPUs which support C3 share cache. It is not necessary to
flush the caches in software before entering C3. Doing so causes a drop in
performance for the cores which share some caches. ARB_DIS is not used
with the current AMD C-state implementation. So set the related flags correctly.
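A condensed sketch of the resulting flag setup for Zen (family 0x17) and
newer parts; the real hunk lives in the x86 ACPI C-state code and may differ
in detail:

  /* sketch: skip software cache flushing and ARB_DIS on AMD Zen+ */
  if (c->x86_vendor == X86_VENDOR_AMD && c->x86 >= 0x17) {
      /* Caches stay coherent, no WBINVD needed before entering C3 */
      flags->bm_check = 1;
      /* ARB_DIS is not used, so don't touch bus master arbitration */
      flags->bm_control = 0;
  }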
Signed-off-by: Deepak Sharma <deepak.sharma@amd.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm fixes from Paolo Bonzini:
"Small x86 fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: selftests: Ensure all migrations are performed when test is affined
KVM: x86: Swap order of CPUID entry "index" vs. "significant flag" checks
ptp: Fix ptp_kvm_getcrosststamp issue for x86 ptp_kvm
x86/kvmclock: Move this_cpu_pvti into kvmclock.h
selftests: KVM: Don't clobber XMM register when read
KVM: VMX: Fix a TSX_CTRL_CPUID_CLEAR field mask issue
On recent Intel systems the HPET stops working when the system reaches PC10
idle state.
The approach of adding PCI ids to the early quirks to disable HPET on
these systems is a whack-a-mole game which makes no sense.
Check for PC10 instead and force disable HPET if supported. The check is
overbroad as it does not take ACPI, intel_idle enablement and command
line parameters into account. That's fine as long as there is at least
PMTIMER available to calibrate the TSC frequency. The decision can be
overruled by adding "hpet=force" on the kernel command line.
Remove the related early PCI quirks for the affected Ice Lake and Coffee
Lake systems as they are no longer required. That should also cover all
other systems, i.e. Tiger Lake and newer generations, which are most
likely affected by this as well.
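Conceptually the new check queries the MWAIT sub-states via CPUID and looks
for a PC10-capable entry, roughly like the sketch below (simplified; the
real helper performs additional vendor and feature checks):

  /* sketch: does this CPU advertise a PC10 MWAIT sub-state? */
  static bool __init mwait_pc10_supported(void)
  {
      unsigned int eax, ebx, ecx, mwait_substates;

      if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
          !boot_cpu_has(X86_FEATURE_MWAIT))
          return false;

      cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);

      /* Bits 31:28 of EDX encode the number of C10 sub-states */
      return mwait_substates & (0xFU << 28);
  }

If that returns true, HPET is force-disabled unless the user overrides it
with hpet=force.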
Fixes: Yet another hardware trainwreck
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Rafael J. Wysocki <rafael@kernel.org>
Cc: stable@vger.kernel.org
Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
After returning from a VMGEXIT NAE event, SW_EXITINFO1[31:0] is checked
for a value of 1, which indicates an error and that SW_EXITINFO2
contains exception information. However, future versions of the GHCB
specification may define new values for SW_EXITINFO1[31:0], so really
any non-zero value should be treated as an error.
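A hedged sketch of the changed condition in the VMGEXIT completion path
(simplified; the real code also decodes the exception details from
SW_EXITINFO2):

  u64 info1 = ghcb->save.sw_exit_info_1;

  /* Before: if (lower_32_bits(info1) == 1) ... */

  /* After: any non-zero value in SW_EXITINFO1[31:0] is an error */
  if (lower_32_bits(info1)) {
      /* exception information, if any, is in SW_EXITINFO2 */
      return ES_EXCEPTION;
  }

  return ES_OK;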
Fixes: 597cfe4821 ("x86/boot/compressed/64: Setup a GHCB-based VC Exception handler")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 5.10+
Link: https://lkml.kernel.org/r/efc772af831e9e7f517f0439b13b41f56bad8784.1633063321.git.thomas.lendacky@amd.com
In x86, the fake return address on the stack saved by
__kretprobe_trampoline() will be replaced with the real return
address after returning from trampoline_handler(). Before fixing
the return address, the real return address can be found in the
'current->kretprobe_instances'.
However, since there is a window between updating the
'current->kretprobe_instances' and fixing the address on the stack,
if an interrupt happens in that window and the interrupt handler
does a stacktrace, it may fail to unwind because it can not get
the correct return address from 'current->kretprobe_instances'.
Eliminate that window by fixing the return address
right before updating 'current->kretprobe_instances'.
Link: https://lkml.kernel.org/r/163163057094.489837.9044470370440745866.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since the kretprobe replaces the function return address with
the kretprobe_trampoline on the stack, x86 unwinders can not
continue the stack unwinding at that point, or they record
kretprobe_trampoline instead of the correct return address.
To fix this issue, find the correct return address in the task's
kretprobe_instances, like the function-graph tracer does.
With this fix, the unwinder can correctly unwind the stack
from a kretprobe event on x86, as below.
<...>-135 [003] ...1 6.722338: r_full_proxy_read_0: (vfs_read+0xab/0x1a0 <- full_proxy_read)
<...>-135 [003] ...1 6.722377: <stack trace>
=> kretprobe_trace_func+0x209/0x2f0
=> kretprobe_dispatcher+0x4a/0x70
=> __kretprobe_trampoline_handler+0xca/0x150
=> trampoline_handler+0x44/0x70
=> kretprobe_trampoline+0x2a/0x50
=> vfs_read+0xab/0x1a0
=> ksys_read+0x5f/0xe0
=> do_syscall_64+0x33/0x40
=> entry_SYSCALL_64_after_hwframe+0x44/0xae
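Conceptually, the unwinder consults the kretprobe bookkeeping whenever it
encounters the trampoline address; a rough sketch of that hook (names follow
this series, details are illustrative):

  /* sketch: recover the real return address saved by the kretprobe */
  static unsigned long
  unwind_recover_kretprobe(struct unwind_state *state, unsigned long addr,
                           unsigned long *addr_p)
  {
      if (is_kretprobe_trampoline(addr))
          return kretprobe_find_ret_addr(state->task, addr_p,
                                         &state->kr_cur);
      return addr;
  }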
Link: https://lkml.kernel.org/r/163163055130.489837.5161749078833497255.stgit@devnote2
Reported-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Change __kretprobe_trampoline() to push the address of the
__kretprobe_trampoline() as a fake return address at the bottom
of the stack frame. This fake return address will be replaced
with the correct return address in the trampoline_handler().
With this change, the ORC unwinder can check whether the return
address is modified by kretprobes or not.
Link: https://lkml.kernel.org/r/163163054185.489837.14338744048957727386.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add UNWIND_HINT_FUNC on __kretprobe_trampoline() code so that ORC
information is generated on the __kretprobe_trampoline() correctly.
Also, this uses STACK_FRAME_NON_STANDARD_FP(), the CONFIG_FRAME_POINTER-
specific version of STACK_FRAME_NON_STANDARD().
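For reference, the CONFIG_FRAME_POINTER-conditional variant is a thin
wrapper, roughly:

  /* sketch of the conditional annotation */
  #ifdef CONFIG_FRAME_POINTER
  #define STACK_FRAME_NON_STANDARD_FP(func)  STACK_FRAME_NON_STANDARD(func)
  #else
  #define STACK_FRAME_NON_STANDARD_FP(func)
  #endif

  /* with ORC, the UNWIND_HINT_FUNC annotation supplies the metadata instead */
  STACK_FRAME_NON_STANDARD_FP(__kretprobe_trampoline);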
Link: https://lkml.kernel.org/r/163163049242.489837.11970969750993364293.stgit@devnote2
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now that there is kretprobe_trampoline_addr() for referring to the
address of the kretprobe trampoline code, we don't need to access
kretprobe_trampoline directly.
Make it harder to refer to directly by renaming it to __kretprobe_trampoline().
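The accessor mentioned above looks roughly like this (sketch; on most
architectures the function descriptor dereference is a no-op):

  void __kretprobe_trampoline(void);

  /* sketch: the one sanctioned way to get the trampoline address */
  static nokprobe_inline void *kretprobe_trampoline_addr(void)
  {
      return dereference_kernel_function_descriptor(__kretprobe_trampoline);
  }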
Link: https://lkml.kernel.org/r/163163045446.489837.14510577516938803097.stgit@devnote2
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The __kretprobe_trampoline_handler() callback, called from low level
arch kprobes methods, has the 'trampoline_address' parameter, which is
entirely superfluous as it basically just replicates:
dereference_kernel_function_descriptor(kretprobe_trampoline)
In fact we had bugs in arch code where it wasn't replicated correctly.
So remove this superfluous parameter and use kretprobe_trampoline_addr()
instead.
Link: https://lkml.kernel.org/r/163163044546.489837.13505751885476015002.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Since get_optimized_kprobe() is only used inside kprobes,
it doesn't need to use the 'unsigned long' type for the 'addr' parameter.
Make it use 'kprobe_opcode_t *' for the 'addr' parameter, and make the
subsequent call of arch_within_optimized_kprobe() also use
'kprobe_opcode_t *'.
Note that MAX_OPTIMIZED_LENGTH and RELATIVEJUMP_SIZE are defined
in bytes, but the size of 'kprobe_opcode_t' depends on the
architecture. Therefore, be careful when calculating
addresses using those macros.
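A sketch of the resulting signature and the byte-scaled arithmetic the note
above warns about (illustrative, not the verbatim hunk):

  /* sketch: look up an optimized kprobe covering 'addr' */
  static struct optimized_kprobe *get_optimized_kprobe(kprobe_opcode_t *addr)
  {
      struct kprobe *p = NULL;
      int i;

      /* MAX_OPTIMIZED_LENGTH is in bytes, so scale it to opcode units */
      for (i = 1; !p && i < MAX_OPTIMIZED_LENGTH / sizeof(kprobe_opcode_t); i++)
          p = get_kprobe(addr - i);

      if (p && kprobe_optimized(p)) {
          struct optimized_kprobe *op =
              container_of(p, struct optimized_kprobe, kp);

          if (arch_within_optimized_kprobe(op, addr))
              return op;
      }

      return NULL;
  }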
Link: https://lkml.kernel.org/r/163163040680.489837.12133032364499833736.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Other modules, such as ptp_kvm, might use the hv_clock_per_cpu variable,
so move it into kvmclock.h and export the symbol to make it visible to
other modules.
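After the move, the header exposes roughly the following (sketch based on
this description; the exact contents may differ), and hv_clock_per_cpu gains
an EXPORT_SYMBOL_GPL() in kvmclock.c:

  /* arch/x86/include/asm/kvmclock.h -- sketch */
  #include <asm/pvclock.h>

  DECLARE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);

  static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
  {
      return &this_cpu_read(hv_clock_per_cpu)->pvti;
  }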
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Cc: <stable@vger.kernel.org>
Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PREEMPT_RT preempts softirqs and the current implementation avoids
do_softirq_own_stack() and only uses __do_softirq().
Disable the unused softirq stacks on PREEMPT_RT to save some memory and
ensure that do_softirq_own_stack() is not used, as it is not expected.
[ bigeasy: commit description. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210924161245.2357247-1-bigeasy@linutronix.de
After four years in the wild, those have not fulfilled their
initial purpose of pushing people to fix their software to not use
UMIP-emulated instructions, and to warn users about the degraded
emulation performance.
Yet, the only thing that "degrades" performance is overflowing dmesg
with those:
[Di Sep 7 00:24:05 2021] umip_printk: 1345 callbacks suppressed
[Di Sep 7 00:24:05 2021] umip: someapp.exe[29231] ip:14064cdba sp:11b7c0: SIDT instruction cannot be used by applications.
[Di Sep 7 00:24:05 2021] umip: someapp.exe[29231] ip:14064cdba sp:11b7c0: For now, expensive software emulation returns the result.
...
[Di Sep 7 00:26:06 2021] umip_printk: 2227 callbacks suppressed
[Di Sep 7 00:26:06 2021] umip: someapp.exe[29231] ip:14064cdba sp:11b940: SIDT instruction cannot be used by applications.
and users don't really care about that - they just want to play their
games in Wine.
So convert those to debug loglevel - in case someone is still interested
in them, they can boot with "debug" on the kernel cmdline.
Reported-by: Marcus Rückert <mrueckert@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Link: https://lkml.kernel.org/r/20210907200454.30458-1-bp@alien8.de
Avoid having indirect calls and use a normal function which returns the
proper MSR address based on ->smca setting.
No functional changes.
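The shape of the replacement helper, as a sketch (the enum name follows this
series; the MSR macros are the existing ones):

  /* sketch: compute the MCA register MSR for a bank, SMCA-aware */
  static u32 mca_msr_reg(int bank, enum mca_msr reg)
  {
      if (mce_flags.smca) {
          switch (reg) {
          case MCA_CTL:    return MSR_AMD64_SMCA_MCx_CTL(bank);
          case MCA_STATUS: return MSR_AMD64_SMCA_MCx_STATUS(bank);
          case MCA_ADDR:   return MSR_AMD64_SMCA_MCx_ADDR(bank);
          case MCA_MISC:   return MSR_AMD64_SMCA_MCx_MISC(bank);
          }
      }

      switch (reg) {
      case MCA_CTL:    return MSR_IA32_MCx_CTL(bank);
      case MCA_STATUS: return MSR_IA32_MCx_STATUS(bank);
      case MCA_ADDR:   return MSR_IA32_MCx_ADDR(bank);
      case MCA_MISC:   return MSR_IA32_MCx_MISC(bank);
      }

      return 0;
  }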
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-4-bp@alien8.de
Get rid of the indirect function pointer and use flags settings instead
to steer execution.
Now that it is not an indirect call any longer, drop the instrumentation
annotation for objtool too.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-3-bp@alien8.de
Turn it into a normal function which calls an AMD- or Intel-specific
variant depending on the CPU it runs on.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20210922165101.18951-2-bp@alien8.de
Fix the missing return code polarity in save_xstate_epilog().
[ bp: Massage, use the right commit in the Fixes: tag ]
Fixes: 2af07f3a6e ("x86/fpu/signal: Change return type of copy_fpregs_to_sigframe() helpers to boolean")
Reported-by: Remi Duraffort <remi.duraffort@linaro.org>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1461
Link: https://lkml.kernel.org/r/20210922200901.1823741-1-anders.roxell@linaro.org
The current code has an explicit check for hitting the task stack guard,
but overflowing any of the other stacks will get you a nondescript,
general #DF warning.
Improve matters by using get_stack_info_noinstr() to determine if and
which stack guard page got hit, enabling a better stack warning.
Specifically, Michael Wang reported what turned out to be an NMI
exception stack overflow, which is now clearly reported as such:
[] BUG: NMI stack guard page was hit at 0000000085fd977b (stack is 000000003a55b09e..00000000d8cce1a5)
Reported-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Wang <yun.wang@linux.alibaba.com>
Link: https://lkml.kernel.org/r/YUTE/NuqnaWbST8n@hirez.programming.kicks-ass.net
Since commit c8137ace56 ("x86/iopl: Restrict iopl() permission
scope") it's possible to emulate iopl(3) using ioperm(), except for
the CLI/STI usage.
Userspace CLI/STI usage is very dubious (read broken), since any
exception taken during that window can lead to rescheduling anyway (or
worse). The IOPL(2) manpage even states that usage of CLI/STI is highly
discouraged and might even crash the system.
Of course, that won't stop people and HP has the dubious honour of
being the first vendor to be found using this in their hp-health
package.
In order to enable this 'software' to still 'work', have the #GP handler
treat the CLI/STI instructions as NOPs when iopl(3) is in effect. Warn the
user that their program is doing dubious things.
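A condensed, illustrative sketch of the resulting #GP fixup (the real code
also resolves the faulting IP properly and warns only once per task):

  /* sketch: treat CLI (0xfa) / STI (0xfb) as NOPs under iopl(3) emulation */
  static bool fixup_iopl_exception(struct pt_regs *regs)
  {
      struct thread_struct *t = &current->thread;
      unsigned char byte;

      if (!IS_ENABLED(CONFIG_X86_IOPL_IOPERM) || t->iopl_emul != 3)
          return false;

      if (get_user(byte, (const char __user *)regs->ip))
          return false;

      if (byte != 0xfa && byte != 0xfb)
          return false;

      pr_warn_ratelimited("%s[%d] uses CLI/STI, pretending it is a NOP\n",
                          current->comm, task_pid_nr(current));

      regs->ip += 1;    /* skip the one-byte instruction */
      return true;
  }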
Fixes: a24ca99768 ("x86/iopl: Remove legacy IOPL option")
Reported-by: Ondrej Zary <linux@zary.sk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org # v5.5+
Link: https://lkml.kernel.org/r/20210918090641.GD5106@worktop.programming.kicks-ass.net
Commit in Fixes introduced early_reserve_memory() to do all needed
initial memblock_reserve() calls in one function. Unfortunately, the call
of early_reserve_memory() is done too late for Xen dom0, as in some
cases a Xen hook called by e820__memory_setup() will need those memory
reservations to have happened already.
Move the call of early_reserve_memory() before the call of
e820__memory_setup() in order to avoid such problems.
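In setup_arch() the change amounts to reordering the two calls, along these
lines (sketch):

  /*
   * Do the memblock reservations before e820__memory_setup() so
   * that the Xen dom0 hook invoked from there already sees them.
   */
  early_reserve_memory();

  /* ... */

  e820__memory_setup();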
Fixes: a799c2bd29 ("x86/setup: Consolidate early memory reservations")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210920120421.29276-1-jgross@suse.com
Sending a SIGBUS for a copy from user is not the correct semantic.
System calls should return -EFAULT (or a short count for write(2)).
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com
Merge tag 'x86_urgent_for_v5.15_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Prevent an infinite loop in the MCE recovery on return to user space,
which was caused by a second MCE queueing work for the same page and
thereby creating a circular work list.
- Make kern_addr_valid() handle existing PMD entries, which are marked
not present in the higher level page table, correctly instead of
blindly dereferencing them.
- Pass a valid address to sanitize_phys(). This was caused by the
mixture of inclusive and exclusive ranges. memtype_reserve() expects
'end' to be exclusive, but sanitize_phys() wants it inclusive. This
worked so far, but with end being the end of the physical address
space the failure is exposed.
- Increase the maximum supported GPIO numbers for 64bit. Newer SoCs
exceed the previous maximum.
* tag 'x86_urgent_for_v5.15_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Avoid infinite loop for copy from user recovery
x86/mm: Fix kern_addr_valid() to cope with existing but not present entries
x86/platform: Increase maximum GPIO number for X86_64
x86/pat: Pass valid address to sanitize_phys()
Coverity warns of an unused value in arch_scale_freq_tick():
CID 100778 (#1 of 1): Unused value (UNUSED_VALUE)
assigned_value: Assigning value 1024ULL to freq_scale here, but that stored
value is overwritten before it can be used.
It was introduced by commit:
e2b0d619b4 ("x86, sched: check for counters overflow in frequency invariant accounting")
Remove the variable initializer.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Link: https://lkml.kernel.org/r/20210910184405.24422-1-tim.gardner@canonical.com
The boot-time allocation interface for memblock is a mess, with
'memblock_alloc()' returning a virtual pointer, but then you are
supposed to free it with 'memblock_free()' that takes a _physical_
address.
Not only is that all kinds of strange and illogical, but it actually
causes bugs, when people then use it like a normal allocation function,
and it fails spectacularly on a NULL pointer:
https://lore.kernel.org/all/20210912140820.GD25450@xsang-OptiPlex-9020/
or just random memory corruption if the debug checks don't catch it:
https://lore.kernel.org/all/61ab2d0c-3313-aaab-514c-e15b7aa054a0@suse.cz/
I really don't want to apply patches that treat the symptoms, when the
fundamental cause is this horribly confusing interface.
I started out looking at just automating a sane replacement sequence,
but because of this mix of virtual and physical addresses, and because
people have used the "__pa()" macro that can take either a regular
kernel pointer, or just the raw "unsigned long" address, it's all quite
messy.
So this just introduces a new saner interface for freeing a virtual
address that was allocated using 'memblock_alloc()', and that was kept
as a regular kernel pointer. And then it converts a couple of users
that are obvious and easy to test, including the 'xbc_nodes' case in
lib/bootconfig.c that caused problems.
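The new helper is deliberately tiny; conceptually (sketch, see the patch for
the exact form):

  /* sketch: free a pointer that came from memblock_alloc() */
  void memblock_free_ptr(void *ptr, size_t size)
  {
      if (ptr)
          memblock_free(__pa(ptr), size);
  }

Callers like the xbc_nodes case then pass the kernel pointer straight in,
instead of open-coding the __pa() conversion.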
Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: 40caa127f3 ("init: bootconfig: Remove all bootconfig data when the init memory is removed")
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__fpu_sig_restore() only needs information about success or failure and no
real error code.
This cleans up the confusing conversion of the trap number, which is
returned by the *RSTOR() exception fixups, to an error code.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132526.084109938@linutronix.de
__fpu_sig_restore() only needs success/fail information and no detailed
error code.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132526.024024598@linutronix.de
Now that fpu__restore_sig() returns a boolean get rid of the individual
error codes in __fpu_restore_sig() as well.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.966197097@linutronix.de
None of the call sites cares about the error code. All they need to know is
whether the function succeeded or not.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.909065931@linutronix.de
None of the call sites cares about the return code. All they are interested
in is success or failure.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.851280949@linutronix.de
Now that copy_fpregs_to_sigframe() returns boolean the individual return
codes in the related helper functions do not make sense anymore. Change
them to return boolean success/fail.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.794334915@linutronix.de
None of the call sites cares about the actual return code. Change the
return type to boolean and return 'true' on success.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.736773588@linutronix.de
When the direct saving of the FPU registers to the user space sigframe
fails, copy_fpregs_to_sigframe() attempts to clear the user buffer.
The most likely reason for such a fail is a page fault. As
copy_fpregs_to_sigframe() is invoked with pagefaults disabled the chance
that __clear_user() succeeds is minuscule.
Move the clearing out into the caller which replaces the
fault_in_pages_writeable() in that error handling path.
The return value confusion will be cleaned up separately.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.679356300@linutronix.de
There is no reason to have the header zeroing in the pagefault disabled
region. Do it upfront once.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.621674721@linutronix.de
There are two cases for machine check recovery:
1) The machine check was triggered by ring3 (application) code.
This is the simpler case. The machine check handler simply queues
work to be executed on return to user. That code unmaps the page
from all users and arranges to send a SIGBUS to the task that
triggered the poison.
2) The machine check was triggered in kernel code that is covered by
an exception table entry. In this case the machine check handler
still queues a work entry to unmap the page, etc. but this will
not be called right away because the #MC handler returns to the
fix up code address in the exception table entry.
Problems occur if the kernel triggers another machine check before the
return to user processes the first queued work item.
Specifically, the work is queued using the ->mce_kill_me callback
structure in the task struct for the current thread. Attempting to queue
a second work item using this same callback results in a loop in the
linked list of work functions to call. So when the kernel does return to
user, it enters an infinite loop, processing the same entry forever.
There are some legitimate scenarios where the kernel may take a second
machine check before returning to the user.
1) Some code (e.g. futex) first tries a get_user() with page faults
disabled. If this fails, the code retries with page faults enabled
expecting that this will resolve the page fault.
2) Copy from user code retries a copy in byte-at-time mode to check
whether any additional bytes can be copied.
On the other side of the fence are some bad drivers that do not check
the return value from individual get_user() calls and may access
multiple user addresses without noticing that some/all calls have
failed.
Fix by adding a counter (current->mce_count) to keep track of repeated
machine checks before task_work() is called. First machine check saves
the address information and calls task_work_add(). Subsequent machine
checks before that task_work callback is executed check that the address
is in the same page as the first machine check (since the callback will
offline exactly one page).
Expected worst case is four machine checks before moving on (e.g. one
user access with page faults disabled, then a repeat to the same address
with page faults enabled ... repeat in copy tail bytes). Just in case
there is some code that loops forever enforce a limit of 10.
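A condensed sketch of the counting logic in the work-queuing path (field and
function names follow the description above; details simplified):

  /* sketch: arm the task_work only once, sanity-check any repeats */
  static void queue_task_work(struct mce *m, char *msg)
  {
      int count = ++current->mce_count;

      if (count == 1) {
          /* First #MC: record the address and arm the callback */
          current->mce_addr = m->addr;
          task_work_add(current, &current->mce_kill_me, TWA_RESUME);
          return;
      }

      if (count > 10)
          mce_panic("Too many consecutive machine checks while accessing user data",
                    m, msg);

      /* Later #MCs must hit the same page as the first one */
      if ((current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
          mce_panic("Consecutive machine checks to different user pages",
                    m, msg);
  }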
[ bp: Massage commit message, drop noinstr, fix typo, extend panic
messages. ]
Fixes: 5567d11c21 ("x86/mce: Send #MC singal from task work")
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
A number of systems are showing "hotplug capable" CPUs when they
are not really hotpluggable. This is because the MADT has extra
CPU entries to support different CPUs that may be inserted into
the socket with different numbers of cores.
Starting with ACPI 6.3 the spec has an Online Capable bit in the
MADT used to determine whether or not a CPU is hotplug capable
when the enabled bit is not set.
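Conceptually, a disabled MADT LAPIC entry is then only treated as a possible
(hotpluggable) CPU if it is also marked online capable; a very rough sketch
(the flag names are the ACPICA ones, the helper itself is hypothetical):

  /* sketch: may this MADT LAPIC entry ever become an online CPU? */
  static bool lapic_entry_usable(u32 lapic_flags)
  {
      if (lapic_flags & ACPI_MADT_ENABLED)
          return true;

      /* ACPI 6.3+: disabled now, but hot-pluggable into this socket */
      return lapic_flags & ACPI_MADT_ONLINE_CAPABLE;
  }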
Link: https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/05_ACPI_Software_Programming_Model/ACPI_Software_Programming_Model.html?#local-apic-flags
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
FPU restore from a signal frame can trigger various exceptions. The
exceptions are caught with an exception table entry. The handler of this
entry stores the trap number in EAX. The FPU specific fixup negates that
trap number to convert it into a negative error code.
Any other exception than #PF is fatal and recovery is not possible. This
relies on the fact that the #PF exception number is the same as EFAULT, but
that's not really obvious.
Remove the negation from the exception fixup as it really has no value and
check for X86_TRAP_PF at the call site.
There is still confusion due to the return code conversion for the error
case which will be cleaned up separately.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.506192488@linutronix.de
Now that the MC safe copy and FPU have been converted to use the MCE safe
fixup types remove EX_TYPE_FAULT from the list of types which MCE considers
to be safe to be recovered in kernel.
This removes the SGX exception handling of ENCLS from the #MC safe
handling, but according to the SGX wizards the current SGX implementations
cannot survive #MC on ENCLS:
https://lore.kernel.org/r/YS+upEmTfpZub3s9@google.com
The code relies on the trap number being stored if ENCLS raised an
exception. That's still working, but it no longer tricks the MCE code
into assuming that #MC is handled correctly for ENCLS.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.445255957@linutronix.de
Provide exception fixup types which can be used to identify fixups which
allow in kernel #MC recovery and make them invoke the existing handlers.
These will be used at places where #MC recovery is handled correctly by the
caller.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210908132525.269689153@linutronix.de
The exception table entries contain the instruction address, the fixup
address and the handler address. All addresses are relative. Storing the
handler address has a few downsides:
1) Most handlers need to be exported
2) Handlers can be defined everywhere and there is no overview about the
handler types
3) MCE needs to check the handler type to decide whether an in kernel #MC
can be recovered. The functionality of the handler itself is not in any
way special, but for these checks there need to be separate functions
which in the worst case have to be exported.
Some of these 'recoverable' exception fixups are pretty obscure and
just reuse some other handler to spare code. That obfuscates e.g. the
#MC safe copy functions. Cleaning that up would require more handlers
and exports.
Rework the exception fixup mechanics by storing a fixup type number instead
of the handler address and invoke the proper handler for each fixup
type. Also teach the extable sort to leave the type field alone.
This makes most handlers static except for special cases like the MCE
MSR fixup and the BPF fixup. This allows adding more types for cleaning up
the obscure places without adding more handler code and exports.
There is a marginal code size reduction for a production config and it
removes _eight_ exported symbols.
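In rough terms, the entry loses its handler pointer and gains a type id, and
fixup_exception() becomes a switch over those types; a simplified sketch
with only a few of the types shown:

  /* sketch: type-based exception table entries (all fields relative) */
  struct exception_table_entry {
      int insn, fixup, type;
  };

  /* sketch of the central dispatch in fixup_exception() */
  switch (e->type) {
  case EX_TYPE_DEFAULT:
      return ex_handler_default(e, regs);
  case EX_TYPE_FAULT:
      return ex_handler_fault(e, regs, trapnr);
  case EX_TYPE_UACCESS:
      return ex_handler_uaccess(e, regs, trapnr);
  default:
      WARN_ON_ONCE(1);
      return false;
  }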
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lkml.kernel.org/r/20210908132525.211958725@linutronix.de
Ensure that all usage sites of get/put_online_cpus() except for the
straggler in drivers/thermal are gone, so the last user and the deprecated
inlines can be removed.
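The deprecated inlines being removed were plain wrappers, so any leftover
user converts mechanically (sketch):

  /* the removed compatibility wrappers looked roughly like this */
  static inline void get_online_cpus(void) { cpus_read_lock(); }
  static inline void put_online_cpus(void) { cpus_read_unlock(); }

  /* callers now use the cpus_read_*() API directly */
  cpus_read_lock();
  /* ... CPU-hotplug-sensitive work ... */
  cpus_read_unlock();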
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual
PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
s390:
- enable interpretation of specification exceptions
- fix a vcpu_idx vs vcpu_id mixup
x86:
- fast (lockless) page fault support for the new MMU
- new MMU now the default
- increased maximum allowed VCPU count
- allow inhibit IRQs on KVM_RUN while debugging guests
- let Hyper-V-enabled guests run with virtualized LAPIC as long as
they do not enable the Hyper-V "AutoEOI" feature
- fixes and optimizations for the toggling of AMD AVIC (virtualized
LAPIC)
- tuning for the case when two-dimensional paging (EPT/NPT) is
disabled
- bugfixes and cleanups, especially with respect to vCPU reset and
choosing a paging mode based on CR0/CR4/EFER
- support for 5-level page table on AMD processors
Generic:
- MMU notifier invalidation callbacks do not take mmu_lock unless
necessary
- improved caching of LRU kvm_memory_slot
- support for histogram statistics
- add statistics for halt polling and remote TLB flush requests"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (210 commits)
KVM: Drop unused kvm_dirty_gfn_invalid()
KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
KVM: MMU: mark role_regs and role accessors as maybe unused
KVM: MIPS: Remove a "set but not used" variable
x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait
KVM: stats: Add VM stat for remote tlb flush requests
KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count()
KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page
KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality
Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"
KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page
kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710
kvm: x86: Increase MAX_VCPUS to 1024
kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS
KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation
KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host
KVM: s390: index kvm->arch.idle_mask by vcpu_idx
KVM: s390: Enable specification exception interpretation
KVM: arm64: Trim guest debug exception handling
KVM: SVM: Add 5-level page table support for SVM
...
Merge tag 'kvmarm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 5.15
- Page ownership tracking between host EL1 and EL2
- Rely on userspace page tables to create large stage-2 mappings
- Fix incompatibility between pKVM and kmemleak
- Fix the PMU reset state, and improve the performance of the virtual PMU
- Move over to the generic KVM entry code
- Address PSCI reset issues w.r.t. save/restore
- Preliminary rework for the upcoming pKVM fixed feature
- A bunch of MM cleanups
- a vGIC fix for timer spurious interrupts
- Various cleanups
Commit f4e61f0c9a ("x86/kvm: Fix broken irq restoration in kvm_wait")
replaced "local_irq_restore() when IRQ enabled" with "local_irq_enable()
when IRQ enabled" to suppress a warning.
Although there is no similar debugging warning for calling
local_irq_enable() with IRQs already enabled as there is for
local_irq_restore(), doing so is no less broken, and we'd better avoid it.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20210814035129.154242-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge misc updates from Andrew Morton:
"173 patches.
Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
oom-kill, migration, ksm, percpu, vmstat, and madvise)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
mm/madvise: add MADV_WILLNEED to process_madvise()
mm/vmstat: remove unneeded return value
mm/vmstat: simplify the array size calculation
mm/vmstat: correct some wrong comments
mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
selftests: vm: add COW time test for KSM pages
selftests: vm: add KSM merging time test
mm: KSM: fix data type
selftests: vm: add KSM merging across nodes test
selftests: vm: add KSM zero page merging test
selftests: vm: add KSM unmerge test
selftests: vm: add KSM merge test
mm/migrate: correct kernel-doc notation
mm: wire up syscall process_mrelease
mm: introduce process_mrelease system call
memblock: make memblock_find_in_range method private
mm/mempolicy.c: use in_task() in mempolicy_slab_node()
mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
mm/mempolicy: advertise new MPOL_PREFERRED_MANY
mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
...
There are a lot of uses of memblock_find_in_range() along with
memblock_reserve() from the times memblock allocation APIs did not exist.
memblock_find_in_range() is the very core of memblock allocations, so any
future changes to its internal behaviour would mandate updates of all the
users outside memblock.
Replace the calls to memblock_find_in_range() with equivalent calls to
memblock_phys_alloc() and memblock_phys_alloc_range(), and make
memblock_find_in_range() a private method of memblock.
This simplifies the callers, ensures that (unlikely) errors in
memblock_reserve() are handled and improves maintainability of
memblock_find_in_range().
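The typical conversion in the callers looks like this (sketch; error
handling abbreviated):

  phys_addr_t addr;

  /* Before: find + reserve as two separate, easily-mismatched steps */
  addr = memblock_find_in_range(start, end, size, align);
  if (addr)
      memblock_reserve(addr, size);

  /* After: one allocation call whose failure cannot be ignored halfway */
  addr = memblock_phys_alloc_range(size, align, start, end);
  if (!addr)
      panic("memblock allocation failed\n");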
Link: https://lkml.kernel.org/r/20210816122622.30279-1-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Acked-by: Kirill A. Shutemov <kirill.shtuemov@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [ACPI]
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Nick Kossifidis <mick@ics.forth.gr> [riscv]
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each task can request its own LDT and force the kernel to allocate up to
64KB of memory per mm.
There are legitimate workloads with hundreds of processes and there can be
hundreds of workloads running on large machines. The unaccounted memory
can cause isolation issues between the workloads particularly on highly
utilized machines.
It makes sense to account for these objects to restrict the host's memory
consumption from inside the memcg-limited container.
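The accounting itself boils down to using the accounted GFP flags in the LDT
allocation paths, roughly (sketch):

  /* sketch: charge the LDT backing memory to the task's memcg */
  if (alloc_size > PAGE_SIZE)
      new_ldt->entries = __vmalloc(alloc_size,
                                   GFP_KERNEL_ACCOUNT | __GFP_ZERO);
  else
      new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);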
Link: https://lkml.kernel.org/r/38010594-50fe-c06d-7cb0-d1f77ca422f3@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yutian Yang <nglaive@gmail.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'dma-mapping-5.15' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- fix debugfs initialization order (Anthony Iliopoulos)
- use memory_intersects() directly (Kefeng Wang)
- allow to return specific errors from ->map_sg (Logan Gunthorpe,
Martin Oliveira)
- turn the dma_map_sg return value into an unsigned int (me)
- provide a common global coherent pool implementation (me)
* tag 'dma-mapping-5.15' of git://git.infradead.org/users/hch/dma-mapping: (31 commits)
hexagon: use the generic global coherent pool
dma-mapping: make the global coherent pool conditional
dma-mapping: add a dma_init_global_coherent helper
dma-mapping: simplify dma_init_coherent_memory
dma-mapping: allow using the global coherent pool for !ARM
ARM/nommu: use the generic dma-direct code for non-coherent devices
dma-direct: add support for dma_coherent_default_memory
dma-mapping: return an unsigned int from dma_map_sg{,_attrs}
dma-mapping: disallow .map_sg operations from returning zero on error
dma-mapping: return error code from dma_dummy_map_sg()
x86/amd_gart: don't set failed sg dma_address to DMA_MAPPING_ERROR
x86/amd_gart: return error code from gart_map_sg()
xen: swiotlb: return error code from xen_swiotlb_map_sg()
parisc: return error code from .map_sg() ops
sparc/iommu: don't set failed sg dma_address to DMA_MAPPING_ERROR
sparc/iommu: return error codes from .map_sg() ops
s390/pci: don't set failed sg dma_address to DMA_MAPPING_ERROR
s390/pci: return error code from s390_dma_map_sg()
powerpc/iommu: don't set failed sg dma_address to DMA_MAPPING_ERROR
powerpc/iommu: return error code from .map_sg() ops
...
Merge tag 'printk-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Optionally, provide an index of possible printk messages via
<debugfs>/printk/index/. It can be used when monitoring important
kernel messages on a farm of various hosts. The monitor has to be
updated when some messages have changed or are no longer available in
a newly deployed kernel.
- Add printk.console_no_auto_verbose boot parameter. It allows
generating a crash dump in a reasonable time frame even with slow
consoles.
- Remove printk_safe buffers. The messages are always stored directly
to the main logbuffer, even in NMI or recursive context. Also, it
allows serializing syslog operations with a mutex instead of a spin
lock.
- Misc clean up and build fixes.
* tag 'printk-for-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk/index: Fix -Wunused-function warning
lib/nmi_backtrace: Serialize even messages about idle CPUs
printk: Add printk.console_no_auto_verbose boot parameter
printk: Remove console_silent()
lib/test_scanf: Handle n_bits == 0 in random tests
printk: syslog: close window between wait and read
printk: convert @syslog_lock to mutex
printk: remove NMI tracking
printk: remove safe buffers
printk: track/limit recursion
lib/nmi_backtrace: explicitly serialize banner and regs
printk: Move the printk() kerneldoc comment to its new home
printk/index: Fix warning about missing prototypes
MIPS/asm/printk: Fix build failure caused by printk
printk: index: Add indexing support to dev_printk
printk: Userspace format indexing support
printk: Rework parse_prefix into printk_parse_prefix
printk: Straighten out log_flags into printk_info_flags
string_helpers: Escape double quotes in escape_special
printk/console: Check consistent sequence number when handling race in console_unlock()
Merge tag 'hyperv-next-signed-20210831' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- make Hyper-V code arch-agnostic (Michael Kelley)
- fix sched_clock behaviour on Hyper-V (Ani Sinha)
- fix a fault when Linux runs as the root partition on MSHV (Praveen
Kumar)
- fix VSS driver (Vitaly Kuznetsov)
- cleanup (Sonia Sharma)
* tag 'hyperv-next-signed-20210831' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hv_utils: Set the maximum packet size for VSS driver to the length of the receive buffer
Drivers: hv: Enable Hyper-V code to be built on ARM64
arm64: efi: Export screen_info
arm64: hyperv: Initialize hypervisor on boot
arm64: hyperv: Add panic handler
arm64: hyperv: Add Hyper-V hypercall and register access utilities
x86/hyperv: fix root partition faults when writing to VP assist page MSR
hv: hyperv.h: Remove unused inline functions
drivers: hv: Decouple Hyper-V clock/timer code from VMbus drivers
x86/hyperv: add comment describing TSC_INVARIANT_CONTROL MSR setting bit 0
Drivers: hv: Move Hyper-V misc functionality to arch-neutral code
Drivers: hv: Add arch independent default functions for some Hyper-V handlers
Drivers: hv: Make portions of Hyper-V init code be arch neutral
x86/hyperv: fix for unwanted manipulation of sched_clock when TSC marked unstable
asm-generic/hyperv: Add missing #include of nmi.h
Pull siginfo si_trapno updates from Eric Biederman:
"The full set of si_trapno changes was not appropriate as a fix for the
newly added SIGTRAP TRAP_PERF, and so I postponed the rest of the
related cleanups.
This is the rest of the cleanups for si_trapno that reduces it from
being a really weird arch special case that is expected to be always
present (but isn't) on the architectures that support it to being yet
another field in the _sigfault union of struct siginfo.
The changes have been reviewed and marinated in linux-next. With the
removal of this awkward special case new code (like SIGTRAP TRAP_PERF)
that works across architectures should be easier to write and
maintain"
* 'siginfo-si_trapno-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Rename SIL_PERF_EVENT SIL_FAULT_PERF_EVENT for consistency
signal: Verify the alignment and size of siginfo_t
signal: Remove the generic __ARCH_SI_TRAPNO support
signal/alpha: si_trapno is only used with SIGFPE and SIGTRAP TRAP_UNK
signal/sparc: si_trapno is only used with SIGILL ILL_ILLTRP
arm64: Add compile-time asserts for siginfo_t offsets
arm: Add compile-time asserts for siginfo_t offsets
sparc64: Add compile-time asserts for siginfo_t offsets
Merge tag 'drm-next-2021-08-31-1' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Highlights:
- i915 has seen a lot of refactoring and uAPI cleanups due to a
change in the upstream direction going forward.
This has all been audited with known userspace, but there may be
some pitfalls that were missed.
- i915 now uses common TTM to enable discrete memory on DG1/2 GPUs
- i915 enables Jasper and Elkhart Lake by default and has preliminary
XeHP/DG2 support
- amdgpu adds support for Cyan Skillfish
- lots of implicit fencing rules documented and fixed up in drivers
- msm now uses the core scheduler
- the irq midlayer has been removed for non-legacy drivers
- the sysfb code now works on more than x86.
Otherwise the usual smattering of stuff everywhere, panels, bridges,
refactorings.
Detailed summary:
core:
- extract i915 eDP backlight into core
- DP aux bus support
- drm_device.irq_enabled removed
- port drivers to native irq interfaces
- export gem shadow plane handling for vgem
- print proper driver name in framebuffer registration
- driver fixes for implicit fencing rules
- ARM fixed rate compression modifier added
- updated fb damage handling
- rmfb ioctl logging/docs
- drop drm_gem_object_put_locked
- define DRM_FORMAT_MAX_PLANES
- add gem fb vmap/vunmap helpers
- add lockdep_assert(once) helpers
- mark drm irq midlayer as legacy
- use offset adjusted bo mapping conversion
vgaarb:
- cleanups
fbdev:
- extend efifb handling to all arches
- div by 0 fixes for multiple drivers
udmabuf:
- add hugepage mapping support
dma-buf:
- non-dynamic exporter fixups
- document implicit fencing rules
amdgpu:
- Initial Cyan Skillfish support
- switch virtual DCE over to vkms based atomic
- VCN/JPEG power down fixes
- NAVI PCIE link handling fixes
- AMD HDMI freesync fixes
- Yellow Carp + Beige Goby fixes
- Clockgating/S0ix/SMU/EEPROM fixes
- embed hw fence in job
- rework dma-resv handling
- ensure eviction to system ram
amdkfd:
- uapi: SVM address range query added
- sysfs leak fix
- GPUVM TLB optimizations
- vmfault/migration counters
i915:
- Enable JSL and EHL by default
- preliminary XeHP/DG2 support
- remove all CNL support (never shipped)
- move to TTM for discrete memory support
- allow mixed object mmap handling
- GEM uAPI spring cleaning
- add I915_MMAP_OBJECT_FIXED
- reinstate ADL-P mmap ioctls
- drop a bunch of features unused by userspace
- disable and remove GPU relocations
- revert some i915 misfeatures
- major refactoring of GuC for Gen11+
- execbuffer object locking separate step
- reject caching/set-domain on discrete
- Enable pipe DMC loading on XE-LPD and ADL-P
- add PSF GV point support
- Refactor and fix DDI buffer translations
- Clean up FBC CFB allocation code
- Finish INTEL_GEN() and friends macro conversions
nouveau:
- add eDP backlight support
- implicit fence fix
msm:
- a680/7c3 support
- drm/scheduler conversion
panfrost:
- rework GPU reset
virtio:
- fix fencing for planes
ast:
- add detect support
bochs:
- move to tiny GPU driver
vc4:
- use hotplug irqs
- HDMI codec support
vmwgfx:
- use internal vmware device headers
ingenic:
- demidlayering irq
rcar-du:
- shutdown fixes
- convert to bridge connector helpers
zynqmp-dpsub:
- misc fixes
mgag200:
- convert PLL handling to atomic
mediatek:
- MT8133 AAL support
- gem mmap object support
- MT8167 support
etnaviv:
- NXP Layerscape LS1028A SoC support
- GEM mmap cleanups
tegra:
- new user API
exynos:
- missing unlock fix
- build warning fix
- use refcount_t"
* tag 'drm-next-2021-08-31-1' of git://anongit.freedesktop.org/drm/drm: (1318 commits)
drm/amd/display: Move AllowDRAMSelfRefreshOrDRAMClockChangeInVblank to bounding box
drm/amd/display: Remove duplicate dml init
drm/amd/display: Update bounding box states (v2)
drm/amd/display: Update number of DCN3 clock states
drm/amdgpu: disable GFX CGCG in aldebaran
drm/amdgpu: Clear RAS interrupt status on aldebaran
drm/amdgpu: Add support for RAS XGMI err query
drm/amdkfd: Account for SH/SE count when setting up cu masks.
drm/amdgpu: rename amdgpu_bo_get_preferred_pin_domain
drm/amdgpu: drop redundant cancel_delayed_work_sync call
drm/amdgpu: add missing cleanups for more ASICs on UVD/VCE suspend
drm/amdgpu: add missing cleanups for Polaris12 UVD/VCE on suspend
drm/amdkfd: map SVM range with correct access permission
drm/amdkfd: check access permisson to restore retry fault
drm/amdgpu: Update RAS XGMI Error Query
drm/amdgpu: Add driver infrastructure for MCA RAS
drm/amd/display: Add Logging for HDMI color depth information
drm/amd/amdgpu: consolidate PSP TA init shared buf functions
drm/amd/amdgpu: add name field back to ras_common_if
drm/amdgpu: Fix build with missing pm_suspend_target_state module export
...
After commit 342f43af70 ("iscsi_ibft: fix crash due to KASLR physical
memory remapping") x86_64_defconfig shows the following errors:
arch/x86/kernel/setup.c: In function ‘setup_arch’:
arch/x86/kernel/setup.c:916:13: error: implicit declaration of function ‘acpi_mps_check’ [-Werror=implicit-function-declaration]
916 | if (acpi_mps_check()) {
| ^~~~~~~~~~~~~~
arch/x86/kernel/setup.c:1110:9: error: implicit declaration of function ‘acpi_table_upgrade’ [-Werror=implicit-function-declaration]
1110 | acpi_table_upgrade();
| ^~~~~~~~~~~~~~~~~~
[... more acpi noise ...]
acpi.h was being implicitly included from iscsi_ibft.h in this
configuration so the removal of that header means these functions have
no definition or declaration.
In most other configurations, <linux/acpi.h> continued to be included
through at least <linux/tboot.h> if CONFIG_INTEL_TXT was enabled, and
there were probably other implicit include paths too.
Add acpi.h explicitly so there is no more error, and so that we don't
continue to depend on these unreliable implicit include paths.
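For reference, the fix itself is a single explicit include near the top of
arch/x86/kernel/setup.c, rather than relying on an implicit include path:

  #include <linux/acpi.h>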
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Cc: Maurizio Lombardi <mlombard@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DEFINE_SMP_CALL_CACHE_FUNCTION() was useful before the CPU hotplug rework
to ensure that the cache related functions are called on the upcoming CPU
because the notifier itself could run on any online CPU.
The hotplug state machine guarantees that the callbacks are invoked on the
upcoming CPU. So there is no need to have this SMP function call
obfuscation. That indirection was missed when the hotplug notifiers were
converted.
This also solves the problem of ARM64 init_cache_level() invoking ACPI
functions which take a semaphore in that context. That's invalid as SMP
function calls run with interrupts disabled. Running it just from the
callback in context of the CPU hotplug thread solves this.
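A minimal sketch of the pattern after the conversion, assuming a hypothetical
setup_cache_for_this_cpu() helper: the hotplug startup callback already runs
on the incoming CPU, so it can program the cache state directly instead of
bouncing through DEFINE_SMP_CALL_CACHE_FUNCTION()/smp_call_function_single().

  #include <linux/cpuhotplug.h>

  void setup_cache_for_this_cpu(void);    /* hypothetical per-CPU cache setup */

  static int mycache_cpu_online(unsigned int cpu)
  {
          /*
           * The hotplug core invokes this on 'cpu' itself, so the cache
           * related setup can run directly - no SMP function call
           * indirection, and not in interrupts-disabled context.
           */
          setup_cache_for_this_cpu();
          return 0;
  }

  static int __init mycache_init(void)
  {
          int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/mycache:online",
                                      mycache_cpu_online, NULL);

          return ret < 0 ? ret : 0;
  }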
Fixes: 8571890e15 ("arm64: Add support for ACPI based firmware tables")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/871r69ersb.ffs@tglx
Pull ibft updates from Konrad Rzeszutek Wilk:
"A fix for iBFT parsing code badly interfacing when KASLR is enabled"
* 'stable/for-linus-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft:
iscsi_ibft: fix warning in reserve_ibft_region()
iscsi_ibft: fix crash due to KASLR physical memory remapping
New drivers for:
- Aquacomputer D5 Next
- SB-RMI power module
Added chip support to existing drivers:
- Support for various Zen2 and Zen3 APUs and for Yellow Carp
(SMU v13) added to k10temp driver
- Support for Silicom n5010 PAC added to intel-m10-bmc driver
- Support for BPD-RS600 added to pmbus/bpa-rs600 driver
Other notable changes:
- In k10temp, do not display Tdie on Zen CPUs if there is no
difference between Tdie and Tctl
- Converted adt7470 and dell-smm drivers to use
devm_hwmon_device_register_with_info API
- Support for temperature/pwm tables added to axi-fan-control
driver
- Enabled fan control for Dell Precision 7510 in dell-smm driver
Various other minor improvements and fixes in several drivers.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEiHPvMQj9QTOCiqgVyx8mb86fmYEFAmEtTKoACgkQyx8mb86f
mYFBRA/+N2xANCT6pxEILm9RaOj7ezQQNH1dlVGIus15zu+5sFlqSmmFvWK32KSB
n7B/0poO323Hf5llRVDenelDLfZ2gMryRqvt12tGyqTm9dsOVL8DLqz50K/BEjfQ
cq5YMJZfAr2EtTc19Re3Mk5b5V87/U/a/BRWeG/4wIO7TCP38TqmfIUcxbTRTeiC
n1SUA1xJMKyxkUoJl0kXtMqFLDaz9FESLQc0ioe/yWqI5P22j9KJqNhJ01PPAOiS
4sfwtXwwR9b5nD3nX0AfcRCqJVAzwfqCrj66WKFLIcat0kSWXJVHhxz7go9McEqa
jV1Z4MCV1uuQ8m6eqYako+Fda+lpNQKE71Z89582aY9cB7xw1bxaHuebJdzH20CD
GtXQfDrDAQcRSnFlXCTz/wvAQN96TSu9gvgQyLYgAGyuv3EIEuDbQaixl8xzWcpB
W6Pg1V/FgBShK/LDj03A/oCszOBI5lL9ZgV1GVZjwjVc/UxmMpBlDiikfM2VUy3T
dPsjfCiXqWiZ4FcESXjLlXGsZG5AM7ZfMiKG8T1JkGm1UokWdGhALiF7B3o1Cfmw
QE/I5VM2gArgCTNzeFF11rP8lNMqoXk3KAm6KVxNIl6ltTWuH2dRkIhfDp/1/R32
ySbhy4FMvl9zjQFwUK6JG3tmrEN6DXlh+OFeAmW0/uf+wlinBlc=
=tQk6
-----END PGP SIGNATURE-----
Merge tag 'hwmon-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon updates from Guenter Roeck:
"New drivers for:
- Aquacomputer D5 Next
- SB-RMI power module
Added chip support to existing drivers:
- Support for various Zen2 and Zen3 APUs and for Yellow Carp (SMU
v13) added to k10temp driver
- Support for Silicom n5010 PAC added to intel-m10-bmc driver
- Support for BPD-RS600 added to pmbus/bpa-rs600 driver
Other notable changes:
- In k10temp, do not display Tdie on Zen CPUs if there is no
difference between Tdie and Tctl
- Converted adt7470 and dell-smm drivers to use
devm_hwmon_device_register_with_info API
- Support for temperature/pwm tables added to axi-fan-control driver
- Enabled fan control for Dell Precision 7510 in dell-smm driver
Various other minor improvements and fixes in several drivers"
* tag 'hwmon-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: (41 commits)
hwmon: add driver for Aquacomputer D5 Next
hwmon: (adt7470) Convert to devm_hwmon_device_register_with_info API
hwmon: (adt7470) Convert to use regmap
hwmon: (adt7470) Fix some style issues
hwmon: (k10temp) Add support for yellow carp
hwmon: (k10temp) Rework the temperature offset calculation
hwmon: (k10temp) Don't show Tdie for all Zen/Zen2/Zen3 CPU/APU
hwmon: (k10temp) Add additional missing Zen2 and Zen3 APUs
hwmon: remove amd_energy driver in Makefile
hwmon: (dell-smm) Rework SMM function debugging
hwmon: (dell-smm) Mark i8k_get_fan_nominal_speed as __init
hwmon: (dell-smm) Mark tables as __initconst
hwmon: (pmbus/bpa-rs600) Add workaround for incorrect Pin max
hwmon: (pmbus/bpa-rs600) Don't use rated limits as warn limits
hwmon: (axi-fan-control) Support temperature vs pwm points
hwmon: (axi-fan-control) Handle irqs in natural order
hwmon: (axi-fan-control) Make sure the clock is enabled
hwmon: (pmbus/ibm-cffps) Fix write bits for LED control
hwmon: (w83781d) Match on device tree compatibles
dt-bindings: hwmon: Add bindings for Winbond W83781D
...
- Limit the Dell Optiplex 990 quirk to early BIOS versions to avoid the
full 'power cycle'-like reboot which is required for the buggy BIOSes.
- Update documentation for the reboot=pci command line option and
document how DMI platform quirks can be overridden.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsn30THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQcDEACIFp9ERMipDoY/8+v8qFr5NIbVhcQQ
8/uRBl/GSt+NE5VS/BnDzsvRM7K98eIb/sFtZPqPD83bmm/jWDIPlKLVijh+HxjP
c+IcBC1NscR278b+ousV87Qx4uYnxgrbwi+BXgaa2lSyfGIlecLc6BOWNod3z4e9
2dPoui7tgEk0awqLoQkJxNnKhtXr+1fe4NoJU0WkjLZ0GZ65jluh9QAFjNMY5zrK
JDiyGpT7U9Yp5iAQ4UJ86ll9ZvGgGn+G/4RDSPRcZs8ui6DGBxQ4/ndTKyqLRiqf
QVC+9KaVexQqVegPyXQMDI72530i75tyIDN/DQWS9tg4kTKA4HRc9drUVYhWDGSe
5AlrVwWhQP/qR7WTjTyrxgaMtuirzkqgbTESdXtiycBGYJ1q30zkekqPhPjySOUL
slS1/hIgYZAiesaZnyMIyBKox60AZPhPOBlEdGAIZjyNBlNr1LrdMZBybF7JMb6+
J2MBi8HFUr1yu0lmh4940mhDajyPM32plgY20d6HF5P5FB2RI87jF4s4yJ790zl5
FosGcAtnxPEpxFtB2HOzxLuSJa73j6jj5hYS7rcnXAxN2TAMxGx7OVJJqkEswbHL
Od8RetxtWlaEKcqqw5186DIzUEJbGzbDbtGrEkXf7x4YKDE7MK8SFLHukq+bkHNt
HxS9QpGSv2JxhQ==
=Us12
-----END PGP SIGNATURE-----
Merge tag 'x86-misc-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Thomas Gleixner:
"A set of updates for the x86 reboot code:
- Limit the Dell Optiplex 990 quirk to early BIOS versions to avoid
the full 'power cycle'-like reboot which is required for the buggy
BIOSes.
- Update documentation for the reboot=pci command line option and
document how DMI platform quirks can be overridden"
* tag 'x86-misc-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/reboot: Limit Dell Optiplex 990 quirk to early BIOS versions
x86/reboot: Document how to override DMI platform quirks
x86/reboot: Document the "reboot=pci" option
which can be found on various ALi chipsets and is also available on older
Intel systems which expose a PIRQ router. While the Intel support is more
or less nostalgia, the ALi chips are still in use on popular embedded
boards used for routers.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsn2QTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoQT1EACIvzRbycwclASIV6rBK5FMcVa2VuXR
GVqrfERPCUHQxnLshUJxnLk0NvZcQrLHjYl/QMCHBFOeEh3XrzU7JkKDW0Q8Dnov
QGFRtandKDwY4TwnCPKVdz/HeWMxNRT7OF4d08Q3iKCN5l39RLxraMixSrFL8soO
wgGcRTjbTa6HaMlqacFN7DwwiHxbIGJNepi0yqLZBV2dQOnZPd+ujV1FRSNXkv9p
vFPfuazk/psiSXy3x/+YVPUw+6h8DRDkflc9+wvSR+1cVl8eyrjkLgLH43ihddEN
Dl1SG5vKyCOtvQm+TEYdB5qjb/Zd4BjlbvKPJ+94OTtsjIIwxzInizkeTXiLHXnl
SDHX9Sc8L4sYP5+tAew1WMj8K2/p6FzdHm+sBJHd2JFSsMpeErI7p0y0Nz58E7pG
0cRqeWlq7rbGFPq544A8cgx/LjPkZT4LgutGpJ6f3NTZeLfj09xbFRqxNOHqAx+h
fp+36RNb1/j70Yz+4r7lLeDOVswbK+YxPIZGdnNfINTHeGllthDI5vaUL0L2jZnI
CnnKjss2a1WkDC8gczr/3QYcQRKrKDHL0hn0nUh+9laAaTSwNv3oRrkUWvMqwaT8
qSMMm5Eb84B4fZLyvPIcAwyC++JU/cVCgWEP37EzhYcvp6tq8GmR1cdi1lo2/K4O
qhg1d7loNh0eCg==
=R+c1
-----END PGP SIGNATURE-----
Merge tag 'x86-irq-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PIRQ updates from Thomas Gleixner:
"A set of updates to support port 0x22/0x23 based PCI configuration
space which can be found on various ALi chipsets and is also available
on older Intel systems which expose a PIRQ router.
While the Intel support is more or less nostalgia, the ALi chips are
still in use on popular embedded boards used for routers"
* tag 'x86-irq-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix typo s/ECLR/ELCR/ for the PIC register
x86: Avoid magic number with ELCR register accesses
x86/PCI: Add support for the Intel 82426EX PIRQ router
x86/PCI: Add support for the Intel 82374EB/82374SB (ESC) PIRQ router
x86/PCI: Add support for the ALi M1487 (IBC) PIRQ router
x86: Add support for 0x22/0x23 port I/O configuration space
A stop gap for potential future speculation related hardware
vulnerabilities and a mechanism for truly security paranoid
applications.
It allows a task to request that the L1D cache is flushed when the kernel
switches to a different mm. This can be requested via prctl().
Changes vs. the previous versions:
- Get rid of the software flush fallback
- Make the handling consistent with other mitigations
- Kill the task when it ends up on a SMT enabled core which defeats the
purpose of L1D flushing obviously
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEsn0oTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoa5fD/47vHGtjAtDr/DaXR1C6F9AvVbKEl8p
oNHn8IukE6ts6G4dFH9wUvo/Ut0K3kxX54I+BATew0LTy6tsQeUYh/xjwXMupgNV
oKOc9waoqdFvju3ayLFWJmuACLdXpyrGC1j35Aji61zSbR/GdtZ4oDxbuN2YJDAT
BTcgKrBM5nQm94JNa083RQSCU5LJxbC7ETkIh6NR73RSPCjUC1Wpxy1sAQAa2MPD
8EzcJ/DjVGaHCI7adX10sz3xdUcyOz7qYz16HpoMGx+oSiq7pGEBtUiK97EYMcrB
s+ADFUjYmx/pbEWv2r4c9zxNh7ZV3aLBsWwi7bScHIsv8GjrsA/mYLWskuwOV6BB
22qZjfd0c4raiJwd+nmSx+D2Szv6lZ20gP+krtP2VNC6hUv7ft0VPLySiaFMmUHj
quooDZis/W5n+4C9Q8Rk9uUtKzzJOngqW+duftiixHiNQ/ECP/QCAHhZYck/NOkL
tZkNj6lJj9+2iR7mhbYROZ+wrYQzRvqNb2pJJQoi/wA0q7wPSKBi3m+51lPsht5W
tn94CpaDDZ4IB7Fe1NtcA0UpYJSWpDQGlau4qp92HMCCIcRFfQEm+m9x8axwcj7m
ECblHJYBPHuNcCHvPA8kHvr1nd6UUXrGPIo8TK8YhUUbK6pO0OjdNzZX496ia/2g
pLzaW2ENTPLbXg==
=27wH
-----END PGP SIGNATURE-----
Merge tag 'x86-cpu-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache flush updates from Thomas Gleixner:
"A reworked version of the opt-in L1D flush mechanism.
This is a stop gap for potential future speculation related hardware
vulnerabilities and a mechanism for truly security paranoid
applications.
It allows a task to request that the L1D cache is flushed when the
kernel switches to a different mm. This can be requested via prctl().
Changes vs the previous versions:
- Get rid of the software flush fallback
- Make the handling consistent with other mitigations
- Kill the task when it ends up on a SMT enabled core which defeats
the purpose of L1D flushing obviously"
* tag 'x86-cpu-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation: Add L1D flushing Documentation
x86, prctl: Hook L1D flushing in via prctl
x86/mm: Prepare for opt-in based L1D flush in switch_mm()
x86/process: Make room for TIF_SPEC_L1D_FLUSH
sched: Add task_work callback for paranoid L1D flush
x86/mm: Refactor cond_ibpb() to support other use cases
x86/smp: Add a per-cpu view of SMT state
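A minimal user-space sketch of the prctl() opt-in described in the pull
message above, assuming the PR_SPEC_L1D_FLUSH speculation control added by
this series (the fallback constant is shown only for older uapi headers):

  #include <stdio.h>
  #include <sys/prctl.h>
  #include <linux/prctl.h>

  #ifndef PR_SPEC_L1D_FLUSH
  #define PR_SPEC_L1D_FLUSH 2     /* fallback for older uapi headers */
  #endif

  int main(void)
  {
          /*
           * Ask the kernel to flush the L1D cache when it switches away
           * from this task's mm.  The kernel may reject the request (or
           * later kill the task) if the mitigation is unsupported or the
           * task ends up running on an SMT-enabled core.
           */
          if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH,
                    PR_SPEC_ENABLE, 0, 0))
                  perror("PR_SPEC_L1D_FLUSH");
          return 0;
  }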
- Add support for Intel Sapphire Rapids server CPU uncore events
- Allow the AMD uncore driver to be built as a module
- Misc cleanups and fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmEsrWgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iPbg/+LTi1ki2kswWF5Fmoo0IWZ1GiCkWD0rSm
SqCJR3u/MHd7QxsfPXFuhABt7mqaC50epfim0MNC1J11wRoNy6GzM/9nICdez9CK
c2c3jAwoEfWE7o5PTSjXolH0FFXVYwQ9WBRJTBCaZjuXUW7TOk7h9o8fXoF8SssU
VkP9z0YKanhN480v4k7/KR20ktY6GPFFu5cxhC0wyygZXIoG6ku+nmDjvN1ipo54
JQYQbCTz+dj/C0uZehbEoTZWx7cajAxlbq+Iyor3ND30YHLeRxWNEcq/Jzn9Szqa
LXnIi3mg4/mHc8mtZjDHXazpMxGYI02hkBACftPb2gizR4DOwtWlw1A2oHjbfmvM
oF29kcZmFU/Na2O5JTf0IpV4LLkt/OlrTsTcd5ROYWgmri3UV18MhaKz0R0vQNhI
8Xx6TJVA9fFHCIONg+4yxzDkYlxHJ9Tg8yFb06M6NWOnud03xYv4FGOzi0laoums
8XbYZUnMh8aVkXz05CYUadknu/ajMOSqAZZAstng3unazrutSCMkZ+Pzs4X+yqq5
Zz2Tb26oi6KepLD0eQNTmo1pRwuWC/IBqVF5aKH4e4qgyta/VxWob3Qd+I7ONaCl
6HpbaWz01Nw8U2wE4dQB0wKwIGlLE8bRQTS2QFuqaEHu0yZsQGt8zMwVWB+f93Gb
viglPyeA/y8=
=QrPT
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf event updates from Ingo Molnar:
- Add support for Intel Sapphire Rapids server CPU uncore events
- Allow the AMD uncore driver to be built as a module
- Misc cleanups and fixes
* tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
perf/x86/amd/ibs: Add bitfield definitions in new <asm/amd-ibs.h> header
perf/amd/uncore: Allow the driver to be built as a module
x86/cpu: Add get_llc_id() helper function
perf/amd/uncore: Clean up header use, use <linux/ include paths instead of <asm/
perf/amd/uncore: Simplify code, use free_percpu()'s built-in check for NULL
perf/hw_breakpoint: Replace deprecated CPU-hotplug functions
perf/x86/intel: Replace deprecated CPU-hotplug functions
perf/x86: Remove unused assignment to pointer 'e'
perf/x86/intel/uncore: Fix IIO cleanup mapping procedure for SNR/ICX
perf/x86/intel/uncore: Support IMC free-running counters on Sapphire Rapids server
perf/x86/intel/uncore: Support IIO free-running counters on Sapphire Rapids server
perf/x86/intel/uncore: Factor out snr_uncore_mmio_map()
perf/x86/intel/uncore: Add alias PMU name
perf/x86/intel/uncore: Add Sapphire Rapids server MDF support
perf/x86/intel/uncore: Add Sapphire Rapids server M3UPI support
perf/x86/intel/uncore: Add Sapphire Rapids server UPI support
perf/x86/intel/uncore: Add Sapphire Rapids server M2M support
perf/x86/intel/uncore: Add Sapphire Rapids server IMC support
perf/x86/intel/uncore: Add Sapphire Rapids server PCU support
perf/x86/intel/uncore: Add Sapphire Rapids server M2PCIe support
...
the filesystem bits of resctrl, the ultimate goal being to support ARM's
equivalent technology MPAM, with the same fs interface (James Morse)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmEsrC8ACgkQEsHwGGHe
VUowsA//TblmI1t1kz6JUAneOVmZILJKRKyekrUsMr3twPpUcH1zMS4yE+ohE06X
IZZYh3jQokirOxQAmntbh2/uLz5AntldRDzNCEsBZerLz1kW502xaLYMCWg5wPbs
LWqacvHmOmQtXpB5fxA/jNIHxQyfKc1z/wWjTMwpU6K0P0usflx1UmSaiW7Kol6y
KY8B1V1DRYsCmtGQ+0Ww4Fye6TZ/w9jwwFolerSVqXy0I8TZNFITUCfmkFeSbAIp
uQMBXq5SGHn+Q9AsLB2xBhLmqkIb5482eC096t/UJPdFWcBnYdMsOmuLSxz6IuVF
z5gMctnsoIwdhtV7YSpTPIKopGkZKuyN4EIEwv3LVn2J9FatoeWDZu7kUyWr9WaQ
rp0u+09gxUBe68h9bNQkvUfSqi5RVSsijKzQ1vs5PpsTgo4lsPjtb4AjHr9bys/e
2GSMuhaLv+h/kNnte1PXnO5+8+8Yt0N16rVPw8aJE543o2jOm21LUPb/TuAHEmi8
uKn1iq3Dt9gzep92WMFXHwapbwkpF8NXqrcP/ibN97cXx3pt8wPL3yIqOl7rRsGI
BxkGDHP33aBEqbBuMFo9wW8P7whdo9ZhqjfLNmL+6HIM+JXXa0dC6sd2aD6RDrx7
+C0zffDWNfXiUWqC7bFURqRBj3bJ7oufLye3eSPHIMsssYIZYgQ=
=1WGp
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
"A first round of changes towards splitting the arch-specific bits from
the filesystem bits of resctrl, the ultimate goal being to support
ARM's equivalent technology MPAM, with the same fs interface (James
Morse)"
* tag 'x86_cache_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
x86/resctrl: Make resctrl_arch_get_config() return its value
x86/resctrl: Merge the CDP resources
x86/resctrl: Expand resctrl_arch_update_domains()'s msr_param range
x86/resctrl: Remove rdt_cdp_peer_get()
x86/resctrl: Merge the ctrl_val arrays
x86/resctrl: Calculate the index from the configuration type
x86/resctrl: Apply offset correction when config is staged
x86/resctrl: Make ctrlval arrays the same size
x86/resctrl: Pass configuration type to resctrl_arch_get_config()
x86/resctrl: Add a helper to read a closid's configuration
x86/resctrl: Rename update_domains() to resctrl_arch_update_domains()
x86/resctrl: Allow different CODE/DATA configurations to be staged
x86/resctrl: Group staged configuration into a separate struct
x86/resctrl: Move the schemata names into struct resctrl_schema
x86/resctrl: Add a helper to read/set the CDP configuration
x86/resctrl: Swizzle rdt_resource and resctrl_schema in pseudo_lock_region
x86/resctrl: Pass the schema to resctrl filesystem functions
x86/resctrl: Add resctrl_arch_get_num_closid()
x86/resctrl: Store the effective num_closid in the schema
x86/resctrl: Walk the resctrl schema list instead of an arch list
...
is not up yet - delay that processing until everything is ready
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmEsp9MACgkQEsHwGGHe
VUp1ARAAtlYWtIQXUV+YiGGnueO3WZlbME/O/I6uz3KSIgZg59qaXPpscL7vCjS1
pFav1F3GKpplvkWAPBQf290yLYn934oLDanof1ENcAQxE46RD/CTr4V1J3xSzb31
+XEcwHoDFMdSHp219pinNWEDFGlYvb9/Q8AxCMFlUomRrPE/1tjo2l8gmZ7Wi1r5
uUwImmiYKnKoin3HpaY5n0NW+3madWhVfPhwv7Tsh0Hp0Qh5KRra+0OoLSGeKNDk
EDpKE/XKAjCNjcBLNIAWtLCHvPC1bWh2jC+Qmu8a9UGn4e5KmWw+GCUoOuBRuA6X
KOAoXtZXARsDDXadQKjNnQ4P3CvFcmNpCzjoFAarsflA3vKSHttDamk14bxE/mRx
AkNdw7oQYfOLAP2PVpPGgBEiU5WIoPwXdudDi6R8Tu107Z0mhiJIgbjPajJOvFpj
EnTO4k/A+Lk4uhPYYFhlhphDT/B8KdVlnEv0VJds0nPEE2uC3JiXxu7dHbBPYcXm
KtvJ5A+jDEMvFoPzoqwDBr7EzeHUUjyyI1Gym5EsjCNoH4WG8SyXYngCRrFOaIFe
e047UnJ4a7ibsFHAPrB9LtWLxvhPl13D4D+Ejb1GvnzDwSaAJAP9gVBGC0VpicId
NGEYAjsRsLGP2D3I4+Bhh1MN6mL/wlcmzO4izXXhMyc9vuTrXws=
=1ROV
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS update from Borislav Petkov:
"A single RAS change for 5.15:
- Do not start processing MCEs logged early because the decoding
chain is not up yet - delay that processing until everything is
ready"
* tag 'ras_core_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Defer processing of early errors
Yellow Carp matches the same behavior as Green Sardine and other Zen3
products, but has different CCD offsets.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20210827201527.24454-3-mario.limonciello@amd.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Factor out a helper function rather than export cpu_llc_id, which is
needed in order to be able to build the AMD uncore driver as a module.
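The helper is likely little more than a wrapper around the existing per-CPU
variable; a sketch:

  /* arch/x86/kernel/cpu/common.c (sketch) */
  u16 get_llc_id(unsigned int cpu)
  {
          return per_cpu(cpu_llc_id, cpu);
  }
  EXPORT_SYMBOL_GPL(get_llc_id);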
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20210817221048.88063-7-kim.phillips@amd.com
When a fatal machine check results in a system reset, Linux does not
clear the error(s) from machine check bank(s) - hardware preserves the
machine check banks across a warm reset.
During initialization of the kernel after the reboot, Linux reads, logs,
and clears all machine check banks.
But there is a problem. In:
5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver")
the call to mce_register_decode_chain() moved later in the boot
sequence. This means that /dev/mcelog doesn't see those early error
logs.
This was partially fixed by:
cd9c57cad3 ("x86/MCE: Dump MCE to dmesg if no consumers")
which made sure that the logs were not lost completely by printing
to the console. But parsing console logs is error prone. Users of
/dev/mcelog should expect to find any early errors logged to standard
places.
Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early
machine check initialization to indicate that any errors found should
just be queued to genpool. When mcheck_late_init() is called it will
call mce_schedule_work() to actually log and flush any errors queued in
the genpool.
[ Based on an original patch, commit message by and completely
productized by Tony Luck. ]
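Roughly, the new flag changes the early poll path like this (a sketch, not
the exact upstream hunk):

  /* machine_check_poll(): early boot caller passes MCP_QUEUE_LOG */
  if (flags & MCP_QUEUE_LOG)
          mce_gen_pool_add(&m);   /* queue only; decode chain not ready */
  else
          mce_log(&m);            /* normal logging path */

  /* mcheck_late_init(): once the notifiers are registered */
  mce_schedule_work();            /* log and flush the queued records */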
Fixes: 5de97c9f6d ("x86/mce: Factor out and deprecate the /dev/mcelog driver")
Reported-by: Sumanth Kamatala <skamatala@juniper.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210824003129.GA1642753@agluck-desk2.amr.corp.intel.com
The recent commit
064855a690 ("x86/resctrl: Fix default monitoring groups reporting")
caused a RHEL build failure with an uninitialized variable warning
treated as an error because it removed the default case snippet.
The RHEL Makefile uses '-Werror=maybe-uninitialized' to force possibly
uninitialized variable warnings to be treated as errors. This is also
reported by smatch via the 0day robot.
The error from the RHEL build is:
arch/x86/kernel/cpu/resctrl/monitor.c: In function ‘__mon_event_count’:
arch/x86/kernel/cpu/resctrl/monitor.c:261:12: error: ‘m’ may be used
uninitialized in this function [-Werror=maybe-uninitialized]
m->chunks += chunks;
^~
The upstream Makefile does not build using '-Werror=maybe-uninitialized'.
So, the problem is not seen there. Fix the problem by putting back the
default case snippet.
[ bp: note that there's nothing wrong with the code and other compilers
do not trigger this warning - this is being done just so the RHEL compiler
is happy. ]
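An illustrative reconstruction of the pattern being restored, simplified from
__mon_event_count() and wrapped in a hypothetical helper (not the exact hunk):

  static int mbm_add_chunks(struct rmid_read *rr, u32 rmid, u64 chunks)
  {
          struct mbm_state *m;

          switch (rr->evtid) {
          case QOS_L3_MBM_TOTAL_EVENT_ID:
                  m = &rr->d->mbm_total[rmid];
                  break;
          case QOS_L3_MBM_LOCAL_EVENT_ID:
                  m = &rr->d->mbm_local[rmid];
                  break;
          default:
                  /*
                   * Never reached for a valid event id, but it guarantees
                   * 'm' is never used uninitialized and keeps builds that
                   * use -Werror=maybe-uninitialized happy.
                   */
                  return -EINVAL;
          }
          m->chunks += chunks;
          return 0;
  }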
Fixes: 064855a690 ("x86/resctrl: Fix default monitoring groups reporting")
Reported-by: Terry Bowman <Terry.Bowman@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/162949631908.23903.17090272726012848523.stgit@bmoger-ubuntu
- Mask all MSI-X entries when enabling MSI-X otherwise stale unmasked
entries stay around e.g. when a crashkernel is booted.
- Enforce masking of a MSI-X table entry when updating it, which is
mandatory according to the specification
- Ensure that writes to MSI[-X] tables are flushed.
- Prevent invalid bits being set in the MSI mask register
- Properly serialize modifications to the mask cache and the mask register
for multi-MSI.
- Cure the violation of the affinity setting rules on X86 during interrupt
startup which can cause lost and stale interrupts. Move the initial
affinity setting ahead of actually enabling the interrupt.
- Ensure that MSI interrupts are completely torn down before freeing them
in the error handling case.
- Prevent an array out of bounds access in the irq timings code.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmEY5bcTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaMvD/9KeK2430f4h/x/fQhHmHHIJOv3kqmB
gRXX4RV+N/DfU9GSflbzPxY9l2SkydgpQjHeGnqpV7DYRIu84nVYAuWcWtPimHHy
JxapniLlQv2GS+SIy9f1mmChH6VUPS05brHxKSqAQZvQIoZqza8vF3umZlV7eYF4
uZFd86TCbDFsBxbsKmyV1FtQLo008EeEp8dtZ/1cZ9Fbp0M/mQkuu7aTNqY0qWwZ
rAoGyE4PjDR+yf87XjE5z7hMs2vfUjiGXg7Kbp30NPKGcRyasb+SlHVKcvZKJIji
Y0Bk/SOyqoj1Co3U+cEaWolB1MeGff4nP+Xx8xvyNklKxxs1+92Z7L1RElXIc0cL
kmUehUSf5JuJ83B6ucAYbmnXKNw1XB00PaMy7iSxsYekTXJx+t0b+Rt6o0R3inWB
xUWbIVmoL2uF1oOAb6mEc3wDNMBVkY33e9l2jD0PUPxKXZ730MVeojWJ8FGFiPOT
9+aCRLjZHV5slVQAgLnlpcrseJLuUei6HLVwRXxv19Bz5L+HuAXUxWL9h74SRuE9
14kH63aXSVDlcYyW7c3t8Lh6QjKAf7AIz0iG+u3n09IWyURd4agHuKOl5itileZB
BK9NuRrNgmr2nEKG461Suc6GojLBXc1ih3ak+MG+O4iaLxnhapTjW3Weqr+OVXr+
SrIjoxjpEk2ECA==
=yf3u
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of fixes for PCI/MSI and x86 interrupt startup:
- Mask all MSI-X entries when enabling MSI-X otherwise stale unmasked
entries stay around e.g. when a crashkernel is booted.
- Enforce masking of a MSI-X table entry when updating it, which is
mandatory according to the specification
- Ensure that writes to MSI[-X] tables are flushed.
- Prevent invalid bits being set in the MSI mask register
- Properly serialize modifications to the mask cache and the mask
register for multi-MSI.
- Cure the violation of the affinity setting rules on X86 during
interrupt startup which can cause lost and stale interrupts. Move
the initial affinity setting ahead of actually enabling the
interrupt.
- Ensure that MSI interrupts are completely torn down before freeing
them in the error handling case.
- Prevent an array out of bounds access in the irq timings code"
* tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
driver core: Add missing kernel doc for device::msi_lock
genirq/msi: Ensure deactivation on teardown
genirq/timings: Prevent potential array overflow in __irq_timings_store()
x86/msi: Force affinity setup before startup
x86/ioapic: Force affinity setup before startup
genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP
PCI/MSI: Protect msi_desc::masked for multi-MSI
PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown()
PCI/MSI: Correct misleading comments
PCI/MSI: Do not set invalid bits in MSI mask
PCI/MSI: Enforce MSI[X] entry updates to be visible
PCI/MSI: Enforce that MSI-X table entry is masked for update
PCI/MSI: Mask all unused MSI-X entries
PCI/MSI: Enable and mask MSI-X early
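A simplified sketch of the "mask while updating, then flush" rule from the
list above, using the per-entry layout from the PCI spec (offset/bit names as
used internally by the PCI core; treat it as an illustration, not the exact
kernel code):

  static void msix_write_entry(void __iomem *entry, u32 addr_lo, u32 addr_hi,
                               u32 data)
  {
          u32 ctrl = readl(entry + PCI_MSIX_ENTRY_VECTOR_CTRL);

          /* 1) Mask the entry before touching address/data. */
          writel(ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT,
                 entry + PCI_MSIX_ENTRY_VECTOR_CTRL);

          writel(addr_lo, entry + PCI_MSIX_ENTRY_LOWER_ADDR);
          writel(addr_hi, entry + PCI_MSIX_ENTRY_UPPER_ADDR);
          writel(data, entry + PCI_MSIX_ENTRY_DATA);

          /* 2) Restore the previous mask state and flush the posted writes. */
          writel(ctrl, entry + PCI_MSIX_ENTRY_VECTOR_CTRL);
          readl(entry + PCI_MSIX_ENTRY_VECTOR_CTRL);
  }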
Creating a new sub monitoring group in the root /sys/fs/resctrl leads to
getting the "Unavailable" value for mbm_total_bytes and mbm_local_bytes
on the entire filesystem.
Steps to reproduce:
1. mount -t resctrl resctrl /sys/fs/resctrl/
2. cd /sys/fs/resctrl/
3. cat mon_data/mon_L3_00/mbm_total_bytes
23189832
4. Create sub monitor group:
mkdir mon_groups/test1
5. cat mon_data/mon_L3_00/mbm_total_bytes
Unavailable
When a new monitoring group is created, a new RMID is assigned to the
new group. But the RMID is not active yet. When the events are read on
the new RMID, it is expected to report the status as "Unavailable".
When the user reads the events on the default monitoring group with
multiple subgroups, the events on all subgroups are consolidated
together. Currently, if any of the RMID reads report as "Unavailable",
then everything will be reported as "Unavailable".
Fix the issue by discarding the "Unavailable" reads and reporting all
the successful RMID reads. This is not a problem on Intel systems as
Intel reports 0 on Inactive RMIDs.
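A minimal sketch of the consolidation policy being described (function and
parameters are illustrative, not the in-tree code): sum whatever RMID reads
succeed and skip the "Unavailable" ones, instead of propagating "Unavailable"
for the whole group.

  /* vals[i]/ok[i]: counter value and validity of the i-th RMID read. */
  static int consolidate_rmid_reads(const u64 *vals, const bool *ok, int n,
                                    u64 *sum)
  {
          int i, found = 0;

          *sum = 0;
          for (i = 0; i < n; i++) {
                  if (!ok[i])
                          continue;       /* RMID not active yet: discard */
                  *sum += vals[i];
                  found++;
          }
          return found ? 0 : -EINVAL;     /* every read was Unavailable */
  }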
Fixes: d89b737901 ("x86/intel_rdt/cqm: Add mon_data")
Reported-by: Paweł Szulik <pawel.szulik@intel.com>
Signed-off-by: Babu Moger <Babu.Moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: stable@vger.kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=213311
Link: https://lkml.kernel.org/r/162793309296.9224.15871659871696482080.stgit@bmoger-ubuntu
When this platform was relatively new in November 2011, with early BIOS
revisions, a reboot quirk was added in commit 6be30bb7d7 ("x86/reboot:
Blacklist Dell OptiPlex 990 known to require PCI reboot")
However, this quirk (and several others) are open-ended to all BIOS
versions and left no automatic expiry if/when the system BIOS fixed the
issue, meaning that nobody is likely to come along and re-test.
What is really problematic with using PCI reboot as this quirk does, is
that it causes this platform to do a full power down, wait one second,
and then power back on. This is less than ideal if one is using it for
boot testing and/or bisecting kernels when legacy rotating hard disks
are installed.
It was only by chance that the quirk was noticed in dmesg - and when
disabled it turned out that it wasn't required anymore (BIOS A24), and a
default reboot would work fine without the "harshness" of power cycling the
machine (and disks) down and up like the PCI reboot does.
Doing a bit more research, it seems that the "newest" BIOS for which the
issue was reported[1] was version A06, however Dell[2] seemed to suggest
only up to and including version A05, with the A06 having a large number of
fixes[3] listed.
As is typical with a new platform, the initial BIOS updates come frequently
and then taper off (and in this case, with a revival for CPU CVEs); a
search for O990-A<ver>.exe reveals the following dates:
A02 16 Mar 2011
A03 11 May 2011
A06 14 Sep 2011
A07 24 Oct 2011
A10 08 Dec 2011
A14 06 Sep 2012
A16 15 Oct 2012
A18 30 Sep 2013
A19 23 Sep 2015
A20 02 Jun 2017
A23 07 Mar 2018
A24 21 Aug 2018
While it's overkill to flash and test each of the above, it would seem
likely that the issue was contained within A0x BIOS versions, given the
dates above and the dates of issue reports[4] from distros. So rather than
just throw out the quirk entirely, limit the scope to just those early BIOS
versions, in case people are still running systems from 2011 with the
original as-shipped early A0x BIOS versions.
[1] https://lore.kernel.org/lkml/1320373471-3942-1-git-send-email-trenn@suse.de/
[2] https://www.dell.com/support/kbdoc/en-ca/000131908/linux-based-operating-systems-stall-upon-reboot-on-optiplex-390-790-990-systems
[3] https://www.dell.com/support/home/en-ca/drivers/driversdetails?driverid=85j10
[4] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/768039
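The resulting quirk entry probably looks something like the following
reboot_dmi_table[] sketch, with the extra DMI_BIOS_VERSION match keyed on the
"A0" prefix limiting it to the early A0x BIOSes (the in-tree entry may differ
slightly):

  {       /* Handle reboot issue on Dell OptiPlex 990 with early A0x BIOS */
          .callback = set_pci_reboot,
          .ident = "Dell OptiPlex 990 BIOS A0x",
          .matches = {
                  DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
                  DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 990"),
                  DMI_MATCH(DMI_BIOS_VERSION, "A0"),
          },
  },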
Fixes: 6be30bb7d7 ("x86/reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210530162447.996461-4-paul.gortmaker@windriver.com
resctrl_arch_get_config() has no return, but does pass a single value
back via one of its arguments.
Return the value instead.
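In interface terms, the change is roughly (signatures sketched from the
surrounding series):

  /* Before: the single value comes back through an out-parameter. */
  void resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
                               u32 closid, enum resctrl_conf_type type,
                               u32 *value);

  /* After: just return it. */
  u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d,
                              u32 closid, enum resctrl_conf_type type);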
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210811163831.14917-1-james.morse@arm.com
resctrl uses struct rdt_resource to describe the available hardware
resources. The domains of the CDP aliases share a single ctrl_val[]
array. The only differences between the struct rdt_hw_resource aliases
is the name and conf_type.
The name from struct rdt_hw_resource is visible to user-space. To
support another architecture, as many user-visible details as possible
should be handled in the filesystem parts of the code that is common to all
architectures. The name and conf_type go together.
Remove conf_type and the CDP aliases. When CDP is supported and enabled,
schemata_list_create() can create two schemata using the single
resource, generating the CODE/DATA suffix to the schema name itself.
This allows the alloc_ctrlval_array() and complications around free()ing
the ctrl_val arrays to be removed.
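A hedged sketch of how schemata_list_create() can generate the suffixed names
itself (helper name and field layout are illustrative):

  static int schemata_list_add(struct rdt_resource *r,
                               enum resctrl_conf_type type)
  {
          struct resctrl_schema *s = kzalloc(sizeof(*s), GFP_KERNEL);
          const char *suffix = "";

          if (!s)
                  return -ENOMEM;

          if (type == CDP_CODE)
                  suffix = "CODE";
          else if (type == CDP_DATA)
                  suffix = "DATA";

          snprintf(s->name, sizeof(s->name), "%s%s", r->name, suffix);
          s->conf_type = type;
          s->res = r;
          list_add(&s->list, &resctrl_schema_all);
          return 0;
  }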
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-25-james.morse@arm.com
resctrl_arch_update_domains() specifies the one closid that has been
modified and needs copying to the hardware.
resctrl_arch_update_domains() takes a struct rdt_resource and a closid
as arguments, but copies all the staged configurations for that closid
into the ctrl_val[] array.
resctrl_arch_update_domains() is called once per schema, but once the
resources and domains are merged, the second call of an L2CODE/L2DATA
pair will find no staged configurations, as they were previously
applied. The msr_param of the first call only has one index, so it would
only have updated the hardware for the last staged configuration.
To avoid a second round of IPIs when changing L2CODE and L2DATA in one
go, expand the range of the msr_param if multiple staged configurations
are found.
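The range expansion amounts to widening the low/high indexes of the single
msr_param as each staged configuration type is found, along these lines
(fragment from inside the staging loop; a sketch, not the exact hunk):

  if (!msr_param.res) {
          msr_param.low = idx;
          msr_param.high = idx + 1;
          msr_param.res = r;
  } else {
          msr_param.low = min(msr_param.low, idx);
          msr_param.high = max(msr_param.high, idx + 1);
  }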
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-24-james.morse@arm.com
When CDP is enabled, rdt_cdp_peer_get() finds the alternative
CODE/DATA resource and returns the alternative domain. This is used
to determine if bitmaps overlap when there are aliased entries
in the two struct rdt_hw_resources.
Now that the ctrl_val[] used by the CODE/DATA resources is the same,
the search for an alternate resource/domain is not needed.
Replace rdt_cdp_peer_get() with resctrl_peer_type(), which returns
the alternative type. This can be passed to resctrl_arch_get_config()
with the same resource and domain.
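The replacement helper is essentially a mapping between the two CDP types
(sketch):

  static enum resctrl_conf_type resctrl_peer_type(enum resctrl_conf_type my_type)
  {
          switch (my_type) {
          case CDP_CODE:
                  return CDP_DATA;
          case CDP_DATA:
                  return CDP_CODE;
          default:
                  return CDP_NONE;
          }
  }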
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-23-james.morse@arm.com
Each struct rdt_hw_resource has its own ctrl_val[] array. When CDP is
enabled, two resources are in use, each with its own ctrl_val[] array
that holds half of the configuration used by hardware. One uses the odd
slots, the other the even. rdt_cdp_peer_get() is the helper to find the
alternate resource, its domain, and corresponding entry in the other
ctrl_val[] array.
Once the CDP resources are merged there will be one struct
rdt_hw_resource and one ctrl_val[] array for each hardware resource.
This will include changes to rdt_cdp_peer_get(), making it hard to
bisect any issue.
Merge the ctrl_val[] arrays for three CODE/DATA/NONE resources first.
Doing this before merging the resources temporarily complicates
allocating and freeing the ctrl_val arrays. Add a helper to allocate
the ctrl_val array, that returns the value on the L2 or L3 resource if
it already exists. This gets removed once the resources are merged, and
there really is only one ctrl_val[] array.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-22-james.morse@arm.com
resctrl uses cbm_idx() to map a closid to an index in the configuration
array. This is based on a multiplier and offset that are held in the
resource.
To merge the resources, the resctrl arch code needs to calculate the
index from something else, as there will only be one resource.
Decide based on the staged configuration type. This makes the static
mult and offset parameters redundant.
[ bp: Remove superfluous brackets in get_config_index() ]
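The index calculation likely ends up looking like this; the odd/even
assignment follows the hardware's CDP layout described above, so treat it as
a sketch:

  static u32 get_config_index(u32 closid, enum resctrl_conf_type type)
  {
          switch (type) {
          default:
          case CDP_NONE:
                  return closid;
          case CDP_CODE:
                  return closid * 2 + 1;
          case CDP_DATA:
                  return closid * 2;
          }
  }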
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-21-james.morse@arm.com
When resctrl comes to copy the CAT MSR values from the ctrl_val[] array
into hardware, it applies an offset adjustment based on the type of
the resource. CODE and DATA resources have their closid mapped into an
odd/even range. This mapping is based on a property of the resource.
This happens once the new control value has been written to the ctrl_val[]
array. Once the CDP resources are merged, there will only be a single
property that needs to cover both odd/even mappings to the single
ctrl_val[] array. The offset adjustment must be applied before the new
value is written to the array.
Move the logic from cat_wrmsr() to resctrl_arch_update_domains(). The
value provided to apply_config() is now an index in the array, not the
closid. The parameters provided via struct msr_param are now indexes
too. As resctrl's use of closid is a u32, struct msr_param's type is
changed to match.
With this, the CODE and DATA resources only use the odd or even
indexes in the array. This allows the temporary num_closid/2 fixes in
domain_setup_ctrlval() and reset_all_ctrls() to be removed.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-20-james.morse@arm.com
The CODE and DATA resources report a num_closid that is half the actual
size supported by the hardware. This behaviour is visible to user-space
when CDP is enabled.
The CODE and DATA resources have their own ctrlval arrays which are
half the size of the underlying hardware because num_closid was already
adjusted. One holds the odd configurations values, the other even.
Before the CDP resources can be merged, the 'half the closids' behaviour
needs to be implemented by schemata_list_create(), but this causes the
ctrl_val[] array to be full sized.
Remove the logic from the architecture specific rdt_get_cdp_config()
setup, and add it to schemata_list_create(). Functions that walk all the
configurations, such as domain_setup_ctrlval() and reset_all_ctrls(),
take num_closid directly from struct rdt_hw_resource and also have
to halve num_closid, as only the lower half of each array is in
use. domain_setup_ctrlval() and reset_all_ctrls() both copy struct
rdt_hw_resource's num_closid to a struct msr_param. Correct the value
here.
This is temporary as a subsequent patch will merge all three ctrl_val[]
arrays such that when CDP is in use, the CODE/DATA layout in the array
matches the hardware. reset_all_ctrls()'s loop over the whole of
ctrl_val[] is not touched as this is harmless, and will be required as
it is once the resources are merged.
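The 'half the closids' behaviour landing in schemata_list_create() is, in
essence, the fragment below (using helper names from elsewhere in the series;
a sketch only):

  s->num_closid = resctrl_arch_get_num_closid(r);
  if (resctrl_arch_get_cdp_enabled(r->rid))
          s->num_closid /= 2;     /* CDP: CODE and DATA split the closids */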
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-19-james.morse@arm.com
The ctrl_val[] array for a struct rdt_hw_resource only holds
configurations of one type. The type is implicit.
Once the CDP resources are merged, the ctrl_val[] array will hold all
the configurations for the hardware resource. When a particular type of
configuration is needed, it must be specified explicitly.
Pass the expected type from the schema into resctrl_arch_get_config().
Nothing uses this yet, but once a single ctrl_val[] array is used for
the three struct rdt_hw_resources that share hardware, the type will be
used to return the correct configuration value from the shared array.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-18-james.morse@arm.com
Functions like show_doms() reach into the architecture's private
structure to retrieve the configuration from the struct rdt_hw_resource.
The hardware configuration may look completely different to the
values resctrl gets from user-space. The staged configuration and
resctrl_arch_update_domains() allow the architecture to convert or
translate these values.
Resctrl shouldn't read or write the ctrl_val[] values directly. Add
a helper to read the current configuration. This will allow another
architecture to scale the bitmaps if necessary, and possibly use
controls that don't take the user-space control format at all.
Of the remaining functions that access ctrl_val[] directly,
apply_config() is part of the architecture-specific code, and is
called via resctrl_arch_update_domains(). reset_all_ctrls() will be an
architecture specific helper.
update_mba_bw() manipulates both ctrl_val[] and mbps_val[], as well as the
hardware. The mbps_val[] that matches the mba_sc state of the resource
is changed, but the other is left unchanged. Abstracting this is the
subject of later patches that affect set_mba_sc() too.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-17-james.morse@arm.com
update_domains() merges the staged configuration changes into the arch
code's configuration array. Rename it to make it clear it is part of the
arch code interface to resctrl.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-16-james.morse@arm.com
Before the CDP resources can be merged, struct rdt_domain will need an
array of struct resctrl_staged_config, one per type of configuration.
Use the type as an index to the array to ensure that a schema
configuration string can't specify the same domain twice. This will
allow two schemata to apply configuration changes to one resource.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-15-james.morse@arm.com
When configuration changes are made, the new value is written to struct
rdt_domain's new_ctrl field and the have_new_ctrl flag is set. Later
new_ctrl is copied to hardware by a call to update_domains().
Once the CDP resources are merged, there will be one new_ctrl field in
use by two struct resctrl_schema requiring a per-schema IPI to copy the
value to hardware.
Move new_ctrl and have_new_ctrl into a new struct resctrl_staged_config.
Before the CDP resources can be merged, struct rdt_domain will need an
array of these, one per type of configuration. Using the type as an
index to the array will ensure that a schema configuration string can't
specify the same domain twice.
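The new structure and its eventual use are straightforward (sketch;
CDP_NUM_TYPES covers the NONE/CODE/DATA configuration types):

  struct resctrl_staged_config {
          u32     new_ctrl;
          bool    have_new_ctrl;
  };

  /* in struct rdt_domain, indexed by enum resctrl_conf_type: */
  struct resctrl_staged_config    staged_config[CDP_NUM_TYPES];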
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-14-james.morse@arm.com
resctrl 'info' directories and schema parsing use the schema name.
This lives in the struct rdt_resource, and is specified by the
architecture code.
Once the CDP resources are merged, there will only be one resource (and
one name) in use by two schemata. To allow the CDP CODE/DATA property to
be the type of configuration the schema uses, the name should also be
per-schema.
Add a name field to struct resctrl_schema, and use this wherever
the schema name is exposed (or read from) user-space. Calculating
max_name_width for padding the schemata file also moves as this is
visible to user-space. As the names in struct rdt_resource already
include the CDP information, schemata_list_create() copies them.
schemata_list_create() includes the length of the CDP suffix when
calculating max_name_width in preparation for CDP resources being
merged.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-13-james.morse@arm.com
Whether CDP is enabled for a hardware resource like the L3 cache can be
found by inspecting the alloc_enabled flags of the L3CODE/L3DATA struct
rdt_hw_resources, even if they aren't in use.
Once these resources are merged, the flags can't be compared. Whether
CDP is enabled needs tracking explicitly. If another architecture is
emulating CDP the behaviour may not be per-resource. 'cdp_capable' needs
to be visible to resctrl, even if it's not in use, as this affects the
padding of the schemata table visible to user-space.
Add cdp_enabled to struct rdt_hw_resource and cdp_capable to struct
rdt_resource. Add resctrl_arch_set_cdp_enabled() to let resctrl enable
or disable CDP on a resource. resctrl_arch_get_cdp_enabled() lets it
read the current state.
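A hedged sketch of the arch-side interface on x86; the cdp_enable()/
cdp_disable() helpers stand in for the existing QOS_CFG MSR updates and are
illustrative only:

  bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
  {
          return rdt_resources_all[l].cdp_enabled;
  }

  int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable)
  {
          struct rdt_hw_resource *hw_res = &rdt_resources_all[l];

          if (!hw_res->r_resctrl.cdp_capable)
                  return -EINVAL;

          if (enable)
                  cdp_enable(l);          /* illustrative: program QOS_CFG */
          else
                  cdp_disable(l);

          hw_res->cdp_enabled = enable;
          return 0;
  }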
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-12-james.morse@arm.com
struct pseudo_lock_region points to the rdt_resource.
Once the resources are merged, this won't be unique. The resource name
is moving into the schema, so that the filesystem portions of resctrl can
generate it.
Swap pseudo_lock_region's rdt_resource pointer for a schema pointer.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-11-james.morse@arm.com
Once the CDP resources are merged, there will be two struct
resctrl_schema for one struct rdt_resource. CDP becomes a type of
configuration that belongs to the schema.
Helpers like rdtgroup_cbm_overlaps() need access to the schema to query
the configuration (or configurations) based on schema properties.
Change these functions to take a struct resctrl_schema instead of the
struct rdt_resource. All the modified functions are part of the filesystem code
that will move to /fs/resctrl once it is possible to support a second
architecture.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-10-james.morse@arm.com
To initialise struct resctrl_schema's num_closid, schemata_list_create()
reaches into the architecture's private structure to retrieve num_closid
from the struct rdt_hw_resource. The 'half the closids' behaviour should
be part of the filesystem parts of resctrl that are the same on any
architecture. struct resctrl_schema's num_closid should include any
correction for CDP.
Having two properties called num_closid is likely to be confusing when
they have different values.
Add a helper to read the resource's num_closid from the arch code.
This should return the number of closid that the resource supports,
regardless of whether CDP is in use. Once the CDP resources are merged,
schemata_list_create() can apply the correction itself.
Using a type with an obvious size for the arch helper means changing the
type of num_closid to u32, which matches the type already used by struct
rdtgroup.
reset_all_ctrls() does not use resctrl_arch_get_num_closid(), even
though it sets up a structure for modifying the hardware. This function
will be part of the architecture code, the maximum closid should be the
maximum value the hardware has, regardless of the way resctrl is using
it. All the uses of num_closid in core.c are naturally part of the
architecture specific code.
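The arch helper itself is probably a one-liner (sketch):

  u32 resctrl_arch_get_num_closid(struct rdt_resource *r)
  {
          return resctrl_to_arch_res(r)->num_closid;
  }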
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-9-james.morse@arm.com
Struct resctrl_schema holds properties that vary with the style of
configuration that resctrl applies to a resource. There are already
two values for the hardware's num_closid, depending on whether the
architecture presents the L3 or L3CODE/L3DATA resources.
As the way CDP changes the number of control groups that resctrl can
create is part of the user-space interface, it should be managed by the
filesystem parts of resctrl. This allows the architecture code to only
describe the value the hardware supports.
Add num_closid to resctrl_schema. This is the value seen by the
filesystem, which may be different to the maximum value described by the
arch code when CDP is enabled.
These functions operate on the num_closid value that is exposed to
user-space:
* rdtgroup_parse_resource()
* rdtgroup_schemata_show()
* rdt_num_closids_show()
* closid_init()
Change them to use the schema value instead. schemata_list_create() sets
this value, and reaches into the architecture-specific structure to get
the value. This will eventually be replaced with a helper.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-8-james.morse@arm.com
When parsing a schema configuration value from user-space, resctrl walks
the architecture's rdt_resources_all[] array to find a matching struct
rdt_resource.
Once the CDP resources are merged there will be one resource in use
by two schemata. Anything walking rdt_resources_all[] on behalf of a
user-space request should walk the list of struct resctrl_schema
instead.
Change the users of for_each_alloc_enabled_rdt_resource() to walk the
schema instead. Schemata were only created for alloc_enabled resources
so these two lists are currently equivalent.
schemata_list_create() and rdt_kill_sb() are ignored. The first
creates the schema list, and will eventually loop over the resource
indexes using an arch helper to retrieve the resource. rdt_kill_sb()
will eventually make use of an arch 'reset everything' helper.
After the filesystem code is moved, rdtgroup_pseudo_locked_in_hierarchy()
remains part of the x86 specific hooks to support pseudo lock. This
code walks each domain, and still does this after the separate resources
are merged.
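After the conversion, code acting on a user-space request iterates the schema
list rather than the arch resource array, along these lines (list head name
as used by the series; a sketch):

  struct resctrl_schema *s;

  list_for_each_entry(s, &resctrl_schema_all, list) {
          struct rdt_resource *r = s->res;

          /* ... act on the schema and/or its backing resource ... */
  }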
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-7-james.morse@arm.com
The names of resources are used for the schema name presented to
user-space. The name used is rooted in a structure provided by the
architecture code because the names are different when CDP is enabled.
x86 implements this by swapping between two sets of resource structures
based on their alloc_enabled flag. The type of configuration in-use is
encoded in the name (and cbm_idx_offset).
Once the CDP behaviour is moved into the parts of resctrl that will
move to /fs/, there will be two struct resctrl_schema for one struct
rdt_resource. The schema describes the type of configuration being
applied to the resource. The name of the schema should be generated
by resctrl, based on the type of configuration. To do this, struct
resctrl_schema needs to store the type of configuration in use for a
schema.
Create an enum resctrl_conf_type describing the options, and add it to
struct resctrl_schema. The underlying resources are still separate, as
cbm_idx_offset is still in use.
Temporarily label all the entries in rdt_resources_all[] and copy that
value to struct resctrl_schema. Copying the value ensures there is no
mismatch while the filesystem parts of resctrl are modified to use the
schema. Once the resources are merged, the filesystem code can assign
this value based on the schema being created.
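The enum as described (names taken from the series):

  enum resctrl_conf_type {
          CDP_NONE,
          CDP_CODE,
          CDP_DATA,
  };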
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-6-james.morse@arm.com
Many of resctrl's per-schema files return a value from struct
rdt_resource, which they take as their 'priv' pointer.
Moving properties that resctrl exposes to user-space into the core 'fs'
code (e.g. the name of the schema) means some of the functions that
back the filesystem need the schema struct (where the properties are
moved to), but they currently take struct rdt_resource. For example, once
the CDP resources are merged, struct rdt_resource no longer reflects all
the properties of the schema.
For the info dirs that represent a control, the information needed
will be accessed via struct resctrl_schema, as this is how the resource
is being used. For the monitors, it's still struct rdt_resource, as the
monitors aren't described by a schema.
This difference means the type of the private pointers varies between
control and monitor info dirs.
Change the 'priv' pointer to point to struct resctrl_schema for
the per-schema files that represent a control. The type can be
determined from the fflags field. If the flags are RF_MON_INFO, it's
a struct rdt_resource. If the flags are RF_CTRL_INFO, it's a struct
resctrl_schema. No entry in res_common_files[] has both flags.
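As an illustration (simplified, not the exact patch), a control info file's
callback can now dereference 'priv' as a schema, while a monitor info file
keeps treating it as a resource; the second function name below is
hypothetical:

  /* Backs an RF_CTRL_INFO file: 'priv' is the schema. */
  static int rdt_num_closids_show(struct kernfs_open_file *of,
                                  struct seq_file *seq, void *v)
  {
          struct resctrl_schema *s = of->kn->parent->priv;

          seq_printf(seq, "%u\n", s->num_closid);
          return 0;
  }

  /* Backs an RF_MON_INFO file: 'priv' is still the resource.
   * (Hypothetical example function for illustration.) */
  static int rdt_mon_example_show(struct kernfs_open_file *of,
                                  struct seq_file *seq, void *v)
  {
          struct rdt_resource *r = of->kn->parent->priv;

          seq_printf(seq, "%s\n", r->name);
          return 0;
  }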
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-5-james.morse@arm.com
Resctrl exposes schemata to user-space, which allow the control values
to be specified for a group of tasks.
User-visible properties of the interface (such as the schemata names
and how the values are parsed) are rooted in a struct provided by the
architecture code (struct rdt_hw_resource). Once a second architecture
uses resctrl, this would allow user-visible properties to diverge
between architectures.
These properties should come from the resctrl code that will be common
to all architectures. Resctrl has no per-schema structure, only struct
rdt_{hw_,}resource. Create a struct resctrl_schema to hold the
rdt_resource. Before a second architecture can be supported, this
structure will also need to hold the schema name visible to user-space
and the type of configuration values for resctrl.
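A minimal sketch under these assumptions (the list head and member names
are illustrative):

  struct resctrl_schema {
          struct list_head        list;   /* node on the global schema list */
          struct rdt_resource     *res;   /* the resource this schema configures */
  };

  static LIST_HEAD(resctrl_schema_all);

  static int schemata_list_create(void)
  {
          struct resctrl_schema *s;
          struct rdt_resource *r;

          for_each_alloc_enabled_rdt_resource(r) {
                  s = kzalloc(sizeof(*s), GFP_KERNEL);
                  if (!s)
                          return -ENOMEM;

                  s->res = r;
                  list_add(&s->list, &resctrl_schema_all);
          }

          return 0;
  }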
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-4-james.morse@arm.com
resctrl is the de facto Linux ABI for SoC resource partitioning features.
To support it on another architecture, it needs to be abstracted from
the features provided by Intel RDT and AMD PQoS, and moved to /fs/.
struct rdt_domain contains a mix of architecture private details and
properties of the filesystem interface user-space uses.
Continue by splitting struct rdt_domain into an architecture-private
'hw' struct, which contains the common resctrl structure that would be
used by any architecture. The hardware values in ctrl_val and mbps_val
need to be accessed via helpers to allow another architecture to convert
these into a different format if necessary. After this split, filesystem
code paths touching a 'hw' struct indicate where an abstraction is
needed.
Splitting this structure only moves types around, and should not lead
to any change in behaviour.
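The shape of the split, sketched (member names illustrative): the
arch-private struct embeds the common one, and a container_of() helper
converts back:

  struct rdt_hw_domain {
          struct rdt_domain       d_resctrl;      /* common, filesystem-visible part */
          u32                     *ctrl_val;      /* hardware control values */
          u32                     *mbps_val;      /* hardware bandwidth values */
  };

  static inline struct rdt_hw_domain *resctrl_to_arch_dom(struct rdt_domain *r)
  {
          return container_of(r, struct rdt_hw_domain, d_resctrl);
  }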
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-3-james.morse@arm.com
resctrl is the de facto Linux ABI for SoC resource partitioning features.
To support it on another architecture, it needs to be abstracted from
the features provided by Intel RDT and AMD PQoS, and moved to /fs/.
struct rdt_resource contains a mix of architecture private details
and properties of the filesystem interface user-space uses.
Start by splitting struct rdt_resource into an architecture-private
'hw' struct, which contains the common resctrl structure that would be
used by any architecture. The foreach helpers are most commonly used by
the filesystem code, and should return the common resctrl structure.
for_each_rdt_resource() is changed to walk the common structure in its
parent arch private structure.
Move as much of the structure as possible into the common structure
in the core code's header file. The x86 hardware accessors remain
part of the architecture private code, as do num_closid, mon_scale
and mbm_width.
mon_scale and mbm_width are used to detect overflow of the hardware
counters, and convert them from their native size to bytes. Any
cross-architecture abstraction should be in terms of bytes, making
these properties private.
The hardware's num_closid is kept in the private structure to force the
filesystem code to use a helper to access it. MPAM would return a single
value for the system, regardless of the resource. Using the helper
prevents this field from being confused with the version of num_closid
that is being exposed to user-space (added in a later patch).
After this split, filesystem code touching a 'hw' struct indicates
where an abstraction is needed.
Splitting this structure only moves types around, and should not lead
to any change in behaviour.
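Sketched in the same spirit (details illustrative): the arch-private
resource wraps the common part that the filesystem code sees, and the
foreach helper steps through the array of wrappers while handing out the
embedded common structure:

  struct rdt_hw_resource {
          struct rdt_resource     r_resctrl;      /* common part used by the fs code */
          u32                     num_closid;     /* hardware value, arch private */
          unsigned int            mon_scale;
          unsigned int            mbm_width;
  };

  extern struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES];

  static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res)
  {
          struct rdt_hw_resource *hw_res;

          hw_res = container_of(res, struct rdt_hw_resource, r_resctrl);
          hw_res++;
          return &hw_res->r_resctrl;
  }

  #define for_each_rdt_resource(r)                                          \
          for ((r) = &rdt_resources_all[0].r_resctrl;                       \
               (r) <= &rdt_resources_all[RDT_NUM_RESOURCES - 1].r_resctrl;  \
               (r) = resctrl_inc(r))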
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Babu Moger <babu.moger@amd.com>
Link: https://lkml.kernel.org/r/20210728170637.25610-2-james.morse@arm.com
The proper spelling for the acronym referring to the Edge/Level Control
Register is ELCR rather than ECLR. Adjust references accordingly. No
functional change.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107200251080.9461@angie.orcam.me.uk
Define PIC_ELCR1 and PIC_ELCR2 macros for accesses to the ELCR registers
implemented by many chipsets in their embedded 8259A PIC cores, avoiding
magic numbers that are difficult to handle, and complementing the macros
we already have for registers originally defined with discrete 8259A PIC
implementations. No functional change.
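A sketch of what such definitions look like; the ELCR register pair has
lived at these well-known I/O ports since the EISA days, though the helper
below is only an illustration:

  #define PIC_ELCR1       0x4d0   /* edge/level control, ISA IRQs 0-7  */
  #define PIC_ELCR2       0x4d1   /* edge/level control, ISA IRQs 8-15 */

  /* Example use: is ISA IRQ 'irq' configured as level-triggered? */
  static bool elcr_irq_is_level(unsigned int irq)
  {
          unsigned int port = irq < 8 ? PIC_ELCR1 : PIC_ELCR2;

          return inb(port) & (1 << (irq & 7));
  }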
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107200237300.9461@angie.orcam.me.uk
Define macros and accessors for the configuration space addressed
indirectly with an index register and a data register at the port I/O
locations of 0x22 and 0x23 respectively.
This space is defined by the Intel MultiProcessor Specification for the
IMCR register used to switch between the PIC and the APIC mode[1], by
Cyrix processors for their configuration[2][3], and also some chipsets.
Given the lack of atomicity with the indirect addressing, a spinlock is
required to protect accesses, although for Cyrix processors it is enough
if accesses are executed with interrupts locally disabled, because the
registers are local to the accessing CPU, and IMCR is only ever poked at
by the BSP, and early enough for interrupts not to have been configured
yet. Therefore the existing code does not have to change to use the new
spinlock, and it does not.
Put the spinlock in a library file then, so that it does not get pulled
in unnecessarily for configurations that do not refer to it.
Convert Cyrix accessors to wrappers so as to retain the brevity and
clarity of the `getCx86' and `setCx86' calls.
References:
[1] "MultiProcessor Specification", Version 1.4, Intel Corporation,
Order Number: 242016-006, May 1997, Section 3.6.2.1 "PIC Mode", pp.
3-7, 3-8
[2] "5x86 Microprocessor", Cyrix Corporation, Order Number: 94192-00,
July 1995, Section 2.3.2.4 "Configuration Registers", p. 2-23
[3] "6x86 Processor", Cyrix Corporation, Order Number: 94175-01, March
1996, Section 2.4.4 "6x86 Configuration Registers", p. 2-23
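A sketch of the index/data accessors and the spinlock described above; the
helper and lock names here are illustrative rather than the exact merged
ones:

  #define PC_CONF_INDEX   0x22
  #define PC_CONF_DATA    0x23

  static DEFINE_RAW_SPINLOCK(pc_conf_lock);

  static u8 pc_conf_get(u8 reg)
  {
          outb(reg, PC_CONF_INDEX);
          return inb(PC_CONF_DATA);
  }

  static void pc_conf_set(u8 reg, u8 data)
  {
          outb(reg, PC_CONF_INDEX);
          outb(data, PC_CONF_DATA);
  }

  /* Callers that cannot rely on interrupts being disabled take the lock: */
  static u8 pc_conf_get_locked(u8 reg)
  {
          unsigned long flags;
          u8 data;

          raw_spin_lock_irqsave(&pc_conf_lock, flags);
          data = pc_conf_get(reg);
          raw_spin_unlock_irqrestore(&pc_conf_lock, flags);

          return data;
  }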
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2107182353140.9461@angie.orcam.me.uk
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
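The substitution is mechanical; a minimal illustration:

  #include <linux/cpu.h>

  static void example_hotplug_protected_work(void)
  {
          /* Previously: get_online_cpus(); ... put_online_cpus(); */
          cpus_read_lock();
          /* work that must not race with CPU hotplug */
          cpus_read_unlock();
  }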
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-10-bigeasy@linutronix.de
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-9-bigeasy@linutronix.de
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-8-bigeasy@linutronix.de
The X86 MSI mechanism cannot handle interrupt affinity changes safely after
startup other than from an interrupt handler, unless interrupt remapping is
enabled. The startup sequence in the generic interrupt code violates that
assumption.
Mark the irq chips with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that
the default interrupt setting happens before the interrupt is started up
for the first time.
While the interrupt remapping MSI chip does not require this, there is no
point in treating it differently as this might spare an interrupt to a CPU
which is not in the default affinity mask.
For the non-remapping case, go to the direct write path when the interrupt
is not yet started, similar to the not-yet-activated case.
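In essence the fix is a flags update on the affected irq chips; an
abbreviated sketch (the real chip definitions carry more callbacks):

  static struct irq_chip pci_msi_controller = {
          .name                   = "PCI-MSI",
          .irq_set_affinity       = msi_set_affinity,
          /* ... other callbacks elided ... */
          .flags                  = IRQCHIP_SKIP_SET_WAKE |
                                    IRQCHIP_AFFINITY_PRE_STARTUP,
  };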
Fixes: 1840475676 ("genirq: Expose default irq affinity mask (take 3)")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.886722080@linutronix.de
The IO/APIC cannot handle interrupt affinity changes safely after startup
other than from an interrupt handler. The startup sequence in the generic
interrupt code violates that assumption.
Mark the irq chip with the new IRQCHIP_AFFINITY_PRE_STARTUP flag so that
the default interrupt setting happens before the interrupt is started up
for the first time.
Fixes: 1840475676 ("genirq: Expose default irq affinity mask (take 3)")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.832143400@linutronix.de
Setting the ->dma_address to DMA_MAPPING_ERROR is not part of
the ->map_sg calling convention, so remove it.
Link: https://lore.kernel.org/linux-mips/20210716063241.GC13345@lst.de/
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The .map_sg() op now expects an error code instead of zero on failure.
So make __dma_map_cont() return a valid errno (which is then propagated
to gart_map_sg() via dma_map_cont()) and return it in case of failure.
Also, return -EINVAL in case of invalid nents.
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Starting with commit a799c2bd29
("x86/setup: Consolidate early memory reservations")
memory reservations have been moved earlier during the boot process,
before the execution of the Kernel Address Space Layout Randomization code.
setup_arch() calls the iscsi_ibft's find_ibft_region() function
to find and reserve the memory dedicated to the iBFT and this function
also saves a virtual pointer to the iBFT table for later use.
The problem is that if KASLR is active, the physical memory gets
remapped somewhere else in the virtual address space and the pointer is
no longer valid; this causes a kernel panic when the iscsi driver tries
to dereference it.
iBFT detected.
BUG: unable to handle page fault for address: ffff888000099fd8
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
..snip..
Call Trace:
? ibft_create_kobject+0x1d2/0x1d2 [iscsi_ibft]
do_one_initcall+0x44/0x1d0
? kmem_cache_alloc_trace+0x119/0x220
do_init_module+0x5c/0x270
__do_sys_init_module+0x12e/0x1b0
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fix this bug by saving the address of the physical location
of the ibft; later the driver will use isa_bus_to_virt() to get
the correct virtual address.
N.B. On each reboot KASLR randomizes the virtual addresses so
assuming phys_to_virt before KASLR does its deed is incorrect.
Simplify the code by renaming find_ibft_region()
to reserve_ibft_region() and remove all the wrappers.
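A sketch of the fix under these assumptions (find_ibft_phys() is a
hypothetical stand-in for the scanning details):

  static phys_addr_t ibft_phys_addr;

  void __init reserve_ibft_region(void)
  {
          ibft_phys_addr = find_ibft_phys();      /* hypothetical scan helper */
          if (ibft_phys_addr)
                  memblock_reserve(ibft_phys_addr, PAGE_SIZE);
  }

  /* Later, in the iscsi_ibft driver, after KASLR has done its deed: */
  static struct acpi_table_ibft *ibft_addr;

  static int __init ibft_init(void)
  {
          if (!ibft_phys_addr)
                  return 0;

          ibft_addr = isa_bus_to_virt(ibft_phys_addr);
          /* parse ibft_addr and register the sysfs entries as before */
          return 0;
  }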
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org>
Use the existing PR_GET/SET_SPECULATION_CTRL API to expose the L1D flush
capability. For L1D flushing PR_SPEC_FORCE_DISABLE and
PR_SPEC_DISABLE_NOEXEC are not supported.
Enabling L1D flush does not check if the task is running on an SMT-enabled
core; rather, a check is done at runtime (at the time of flush). If the task
runs on an SMT sibling, then the task is sent a SIGBUS which is delivered
before the task returns to user space or to a guest.
This is better than the other alternatives of:
a. Ensuring strict affinity of the task (hard to enforce without further
changes in the scheduler)
b. Silently skipping flush for tasks that move to SMT enabled cores.
Hook up the core prctl and implement the x86 specific parts which in turn
makes it functional.
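From user space the opt-in might look like this (a sketch; the numeric
value of PR_SPEC_L1D_FLUSH is assumed here for headers predating the
feature):

  #include <stdio.h>
  #include <sys/prctl.h>
  #include <linux/prctl.h>

  #ifndef PR_SPEC_L1D_FLUSH
  #define PR_SPEC_L1D_FLUSH 2     /* assumed value for older headers */
  #endif

  int main(void)
  {
          if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH,
                    PR_SPEC_ENABLE, 0, 0))
                  perror("enabling L1D flush on context switch");

          /* Query the current state (PR_SPEC_ENABLE, PR_SPEC_DISABLE, ...). */
          printf("state: %ld\n",
                 (long)prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, 0, 0, 0));
          return 0;
  }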
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-5-sblbir@amazon.com
The goal of this is to allow tasks that want to protect sensitive
information, against e.g. the recently found snoop-assisted data sampling
vulnerabilities, to flush their L1D on being switched out. This protects
their data from being snooped or leaked via side channels after the task
has context switched out.
This could also be used to wipe L1D when an untrusted task is switched in,
but that's not a really well defined scenario while the opt-in variant is
clearly defined.
The mechanism is default disabled and can be enabled on the kernel command
line.
Prepare for the actual prctl based opt-in:
1) Provide the necessary setup functionality similar to the other
mitigations and enable the static branch when the command line option
is set and the CPU provides support for hardware assisted L1D
flushing. Software based L1D flush is not supported because it's CPU
model specific and not really well defined.
This does not come with a sysfs file like the other mitigations
because it is not bound to any specific vulnerability.
Support has to be queried via the prctl(2) interface.
2) Add TIF_SPEC_L1D_FLUSH next to TIF_SPEC_IB so the two bits can be
mangled into the mm pointer in one go, which allows reusing the
existing mechanism in switch_mm() for the conditional IBPB speculation
barrier efficiently.
3) Add the L1D flush specific functionality which flushes L1D when the
outgoing task opted in.
Also check whether the incoming task has requested L1D flush and, if so,
validate that it is not accidentally running on an SMT sibling, as this
makes the whole exercise moot because SMT siblings share L1D, which
opens tons of other attack vectors. If that happens schedule task work
which signals the incoming task on return to user/guest with SIGBUS as
this is part of the paranoid L1D flush contract.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-1-sblbir@amazon.com
A new field smt_active in cpuinfo_x86 identifies if the current core/cpu
is in SMT mode or not.
This is helpful when some of the system's cores have their threads offlined,
and can be used for cases where action is taken based on the state of SMT.
The upcoming support for paranoid L1D flush will make use of this information.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Balbir Singh <sblbir@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210108121056.21940-2-sblbir@amazon.com
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmD95yIeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGqp0H/j/xHL20EHaUJOaV
iJjnfGyjtnkLC5FCoV/q/v9sFuSW2p4W1nyF8/eIgVKObef94Mg4/xxaHQrWIM56
cbzK9aIcD9InAuImJ6lju4fqjNmFrt2x7mhfzjPKqmhfINfZ5CohpLFN5XdOwzYC
l+ZgmUUl7GLDAND2M6rtkc7AOk4qTyAySDvvPFELE/uNgV4EKaENSIWofHhEzW5v
Yk+4agawaFTfa6H9+uMVYZBOcEKwheQ0E2tcOJvHJT8Mwm8MFoC/B7fLY5zxIdN2
7A7r/7qbSQmSDSjOgwKS4ZOjom0xGSD+V+596SzET6jkbahR2HJ/mrFvmD7GNEoW
OWJPjzI=
=vzIM
-----END PGP SIGNATURE-----
Backmerge tag 'v5.14-rc3' into drm-next
Linux 5.14-rc3
Daniel said we should pull the nouveau fix from fixes in here, probably
a good plan.
Signed-off-by: Dave Airlie <airlied@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmD9LQsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUzRD/95KYmoyg2BUv1QIzPi1cHtBH6MANdE
OAgWU82TMCa23okQL38BcNQdc/lE2mtJJT3F+V6NEl72U2kS+Ujf5x94T+ITPshM
5/Kv66SKH9xOmIlNto55PQC6glKf8Y0n2sOWV6JTAIQ/mbQkYe/fIioh/rblftG7
lqItTkCycz+soh3A/BH1kimJ3Mj4EOVybQ14UxNAH8FdhI+5gs6IHv5IqODWVLTs
SB9cytUHUlnLHzVP/M0y1v8X+6hD9ajsz2boJPUESG9d/5KVU9yKC7brZrtCkNkH
iIaa7e2SB80CS92gsgTqEMEN+aLDey4fQH4FnxP5sRBx5yuln5hGpsHxtTuPMkAf
u9IMgoqedGDCEF1IZzLKjOP7UXovKyP2xfieISOt78cKwA4dChTaychAv4UeFlLo
1jnGHWTFy5UcIJfW86vBiF+6IS9bRFGw0t8KOOGl84ot3H8ZEzzn9Rmu/P1XR5ms
8mguz+4+bYR+dgKH8OVrAW104T1rlRYTqaLqfZIDDtmbjv9KLYUnw1NcL6ZmPrIZ
5nA4qU3e32pETIK6sJUo7OICVw3lFtokTu4P0Yts/oJZ7ggn7H+Gb33zg/yCjk9E
GWRfF0uw6i+fHJeq0ZeNkSgiUzgMpUXjwLqPnM7XvRiwKDd7lZWoXPwZ5Axjd80h
ZFBexWfxwZk4Gw==
=ho3b
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 jump label fix from Thomas Gleixner:
"A single fix for jump labels to prevent the compiler from agressive
un-inlining which results in a section mismatch"
* tag 'locking-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
jump_labels: Mark __jump_label_transform() as __always_inlined to work around aggressive compiler un-inlining
Update the static assertions about siginfo_t to also describe
its alignment and size.
While investigating if it was possible to add a 64bit field into
siginfo_t[1] it became apparent that the alignment of siginfo_t
is as much a part of the ABI as the size of the structure.
If the alignment changes, siginfo_t, when embedded in another structure,
can move to a different offset, which is not acceptable for an ABI
structure.
So document that fact and add static assertions to notify developers
if they change the alignment by accident.
[1] https://lkml.kernel.org/r/YJEZdhe6JGFNYlum@elver.google.com
Acked-by: Marco Elver <elver@google.com>
v1: https://lkml.kernel.org/r/20210505141101.11519-4-ebiederm@xmission.co
Link: https://lkml.kernel.org/r/875yxaxmyl.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
UAPI Changes:
- Remove sysfs stats for dma-buf attachments, as it causes a performance regression.
Previous merge is not in a rc kernel yet, so no userspace regression possible.
Cross-subsystem Changes:
- Sanitize user input in kyro's viewport ioctl.
- Use refcount_t in fb_info->count
- Assorted fixes to dma-buf.
- Extend x86 efifb handling to all archs.
- Fix neofb divide by 0.
- Document corpro,gm7123 bridge dt bindings.
Core Changes:
- Slightly rework drm master handling.
- Cleanup vgaarb handling.
- Assorted fixes.
Driver Changes:
- Add support for ws2401 panel.
- Assorted fixes to stm, ast, bochs.
- Demidlayer ingenic irq.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEuXvWqAysSYEJGuVH/lWMcqZwE8MFAmD5TGAACgkQ/lWMcqZw
E8PNgxAApjTYQSfjIBbOZnNraxW6w7/bPea35E9A47EdBQsNGnYftNsFjbrn/mCJ
D+0eRLjCMlg4FF1SHdh9cPJ35py+ygbDeupogboLITfU99eGBth3fM2Xdg9LPcBh
dbni/JLG9R7gIvSlqdJuweN21trfVrV/9FQEilG5DvQcl27Wx5g8VMRZke1EqGKX
7Id09Uq50ky18vhDjQRCveYhRqJAxV+XozBatzHyxpDVzjLQvRhlAAYdvrSMHZ5R
jreGzOfR8awc6Om+w7wx3Jn1oEGmXVZB/VqxEqGtMOr3lpARPucxrqfHsqpam3rv
yIoEKPrkG+k6fsU7Tbg59jNqe/PbCUW3AlpyuBxf55EbnVGgjLDbq4sRRMkehPfA
fhC31ujOXQQnAgaxyeQAaAJFKNFJzA8Cq5ZPfG+zztzuomHCiUVQBRowP65hJMzR
+ZlEDnhUD3STLz39zuO1reZR1ZoPIvKbsokHAA+ZrIwUd6U3D3ia8V51pq+lL5aS
TGDkyMN9jyZ+SO8Z7+2FnJAv9FAOPU/WCLU/fWW46jAvuezwMIwVcjfSqDU2XbZD
e7KgHpHhx3BGxI8TThHKlY7mf6IL2Bm7X1Cv1pdZs/eEn3Udh2ax942uTQZu/YOO
0AT1XchpvYCBNRw05bVI3OlJ+w3I8uV+h+11jHOKeY6cbwdHeKE=
=BUya
-----END PGP SIGNATURE-----
Merge tag 'drm-misc-next-2021-07-22' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for v5.15-rc1:
UAPI Changes:
- Remove sysfs stats for dma-buf attachments, as it causes a performance regression.
Previous merge is not in a rc kernel yet, so no userspace regression possible.
Cross-subsystem Changes:
- Sanitize user input in kyro's viewport ioctl.
- Use refcount_t in fb_info->count
- Assorted fixes to dma-buf.
- Extend x86 efifb handling to all archs.
- Fix neofb divide by 0.
- Document corpro,gm7123 bridge dt bindings.
Core Changes:
- Slightly rework drm master handling.
- Cleanup vgaarb handling.
- Assorted fixes.
Driver Changes:
- Add support for ws2401 panel.
- Assorted fixes to stm, ast, bochs.
- Demidlayer ingenic irq.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/2d0d2fe8-01fc-e216-c3fd-38db9e69944e@linux.intel.com
The x86 architecture has generic support to register a system framebuffer
platform device. It either registers a "simple-framebuffer" if the config
option CONFIG_X86_SYSFB is enabled, or a legacy VGA/VBE/EFI FB device.
But the code is generic enough to be reused by other architectures and can
be moved out of the arch/x86 directory.
This will also allow the simple{fb,drm} drivers to be supported on non-x86
EFI platforms, such as aarch64, where these drivers are only supported with DT.
Signed-off-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210625130947.1803678-2-javierm@redhat.com
We have a number of systems industry-wide that have a subset of their
functionality that works as follows:
1. Receive a message from local kmsg, serial console, or netconsole;
2. Apply a set of rules to classify the message;
3. Do something based on this classification (like scheduling a
remediation for the machine), rinse, and repeat.
As a couple of examples of places where we have this implemented inside
Facebook (although this isn't a Facebook-specific problem), we use it
in our netconsole processing (for alarm classification) and as part
of our machine health checking. We use these messages to determine
fairly important metrics around production health, and it's important
that we get them right.
While for some kinds of issues we have counters, tracepoints, or metrics
with a stable interface which can reliably indicate the issue, in order
to react to production issues quickly we need to work with the interface
which most kernel developers naturally use when developing: printk.
Most production issues come from unexpected phenomena, and as such
usually the code in question doesn't have easily usable tracepoints or
other counters available for the specific problem being mitigated. We
have a number of lines of monitoring defence against problems in
production (host metrics, process metrics, service metrics, etc), and
where it's not feasible to reliably monitor at another level, this kind
of pragmatic netconsole monitoring is essential.
As one would expect, monitoring using printk is rather brittle for a
number of reasons -- most notably that the message might disappear
entirely in a new version of the kernel, or that the message may change
in some way that the regex or other classification methods start to
silently fail.
One factor that makes this even harder is that, under normal operation,
many of these messages are never expected to be hit. For example, there
may be a rare hardware bug which one wants to detect if it was to ever
happen again, but its recurrence is not likely or anticipated. This
precludes using something like checking whether the printk in question
was printed somewhere fleetwide recently to determine whether the
message in question is still present or not, since we don't anticipate
that it should be printed anywhere, but still need to monitor for its
future presence in the long-term.
This class of issue has happened on a number of occasions, causing
unhealthy machines with hardware issues to remain in production for
longer than ideal. As a recent example, some monitoring around
blk_update_request fell out of date and caused semi-broken machines to
remain in production for longer than would be desirable.
Searching through the codebase to find the message is also extremely
fragile, because many of the messages are further constructed beyond
their callsite (eg. btrfs_printk and other module-specific wrappers,
each with their own functionality). Even if they aren't, guessing the
format and formulation of the underlying message based on the aesthetics
of the message emitted is not a recipe for success at scale, and our
previous issues with fleetwide machine health checking demonstrate as
much.
This provides a solution to the issue of silently changed or deleted
printks: we record pointers to all printk format strings known at
compile time into a new .printk_index section, both in vmlinux and
modules. At runtime, this can then be iterated by looking at
<debugfs>/printk/index/<module>, which emits the following format, both
readable by humans and able to be parsed by machines:
$ head -1 vmlinux; shuf -n 5 vmlinux
# <level[,flags]> filename:line function "format"
<5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
<4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
<6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
<6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
<6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
This mitigates the majority of cases where we have a highly-specific
printk which we want to match on, as we can now enumerate and check
whether the format changed or the printk callsite disappeared entirely
in userspace. This allows us to catch changes to printks we monitor
earlier and decide what to do about it before it becomes problematic.
There is no additional runtime cost for printk callers or printk itself,
and the assembly generated is exactly the same.
Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Jessica Yu <jeyu@kernel.org> # for module.{c,h}
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
Commit dce7cd6275 ("x86/hyperv: Allow guests to enable InvariantTSC")
added the support for HV_X64_MSR_TSC_INVARIANT_CONTROL. Setting bit 0
of this synthetic MSR will allow hyper-v guests to report invariant TSC
CPU feature through CPUID. This change adds that explanation to the code
as a comment and mentions where Intel's generic platform init code reads
this feature bit from CPUID. The comment will help developers understand how
the two parts of the initialization (Hyper-V specific and the non-Hyper-V
specific generic hw init) are related.
Signed-off-by: Ani Sinha <ani@anisinha.ca>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20210716133245.3272672-1-ani@anisinha.ca
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The check for whether hibernation is possible, and the enabling of
Hyper-V panic notification during kexec, are both architecture neutral.
Move the code from under arch/x86 and into drivers/hv/hv_common.c where
it can also be used for ARM64.
No functional change.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1626287687-2045-4-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Architecture independent Hyper-V code calls various arch-specific handlers
when needed. To aid in supporting multiple architectures, provide weak
defaults that can be overridden by arch-specific implementations where
appropriate. But when arch-specific overrides aren't needed or haven't
been implemented yet for a particular architecture, these stubs reduce
the amount of clutter under arch/.
No functional change.
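The pattern, sketched with one illustrative handler (the specific set of
handlers isn't enumerated here):

  /* drivers/hv/hv_common.c: weak stub, used when the architecture
   * provides no implementation of its own. */
  void __weak hv_setup_kexec_handler(void (*handler)(void))
  {
  }

  /* An architecture that needs real behaviour defines a strong version of
   * the same symbol, which the linker picks over the weak stub. */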
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1626287687-2045-3-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The code to allocate and initialize the hv_vp_index array is
architecture neutral. Similarly, the code to allocate and
populate the hypercall input and output arg pages is architecture
neutral. Move both sets of code out from arch/x86 and into
utility functions in drivers/hv/hv_common.c that can be shared
by Hyper-V initialization on ARM64.
No functional changes. However, the allocation of the hypercall
input and output arg pages is done differently so that the
size is always the Hyper-V page size, even if not the same as
the guest page size (such as with ARM64's 64K page size).
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1626287687-2045-2-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Marking TSC as unstable has a side effect of marking sched_clock as
unstable when TSC is still being used as the sched_clock. This is not
desirable. Hyper-V ultimately uses a paravirtualized clock source that
provides a stable scheduler clock even on systems without TscInvariant
CPU capability. Hence, mark_tsc_unstable() should be called _after_ the
scheduler clock has been changed to the paravirtualized clocksource. This
will prevent any unwanted manipulation of the sched_clock. Only TSC will
be correctly marked as unstable.
Signed-off-by: Ani Sinha <ani@anisinha.ca>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20210713030522.1714803-1-ani@anisinha.ca
Signed-off-by: Wei Liu <wei.liu@kernel.org>
In randconfig testing, certain UBSAN and CC Kconfig combinations
with GCC 10.3.0:
CONFIG_X86_32=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_UBSAN=y
# CONFIG_UBSAN_TRAP is not set
# CONFIG_UBSAN_BOUNDS is not set
CONFIG_UBSAN_SHIFT=y
# CONFIG_UBSAN_DIV_ZERO is not set
CONFIG_UBSAN_UNREACHABLE=y
CONFIG_UBSAN_BOOL=y
# CONFIG_UBSAN_ENUM is not set
# CONFIG_UBSAN_ALIGNMENT is not set
# CONFIG_UBSAN_SANITIZE_ALL is not set
... produce this build warning (and build error if
CONFIG_SECTION_MISMATCH_WARN_ONLY=y is set):
WARNING: modpost: vmlinux.o(.text+0x4c1cc): Section mismatch in reference from the function __jump_label_transform() to the function .init.text:text_poke_early()
The function __jump_label_transform() references
the function __init text_poke_early().
This is often because __jump_label_transform lacks a __init
annotation or the annotation of text_poke_early is wrong.
ERROR: modpost: Section mismatches detected.
The problem is that __jump_label_transform() gets uninlined by GCC,
despite there being only a single local scope user of the 'static inline'
function.
Mark the function __always_inline instead, to work around this compiler
bug/artifact.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Let's use the new printk formats to print the stacktrace entries when
printing a backtrace to the kernel logs. This will include any module's
build ID[1] in it so that offline/crash debugging can easily locate the
debuginfo for a module via something like debuginfod[2].
Link: https://lkml.kernel.org/r/20210511003845.2429846-8-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId [1]
Link: https://sourceware.org/elfutils/Debuginfod.html [2]
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Prevent sigaltstack out of bounds writes. The kernel unconditionally
writes the FPU state to the alternate stack without checking whether
the stack is large enough to accommodate it.
Check the alternate stack size before doing so and in case it's too
small force a SIGSEGV instead of silently corrupting user space data.
- MINSIGSTKSZ and SIGSTKSZ are constants in signal.h and have never been
updated despite the fact that the FPU state which is stored on the
signal stack has grown over time which causes trouble in the field
when AVX512 is available on a CPU. The kernel does not expose the
minimum requirements for the alternate stack size depending on the
available and enabled CPU features.
ARM already added an aux vector AT_MINSIGSTKSZ for the same reason.
Add it to x86 as well
- A major cleanup of the x86 FPU code. The recent discoveries of XSTATE
related issues unearthed quite some inconsistencies, duplicated code
and other issues.
The fine granular overhaul addresses this, makes the code more robust
and maintainable, which allows to integrate upcoming XSTATE related
features in sane ways.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmDlcpETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoeP5D/4i+AgYYeiMLgGb+NS7iaKPfoWo6LIz
y3qdTSA0DQaIYbYivWwRO/g0GYdDMXDWeZalFi7eGnVI8O3eOog+22Zrf/y0UINB
KJHdYd4ApWHhs401022y5hexrWQvnV8w1yQCuj/zLm6eC+AVhdwt2AY+IBoRrdUj
wqY97B/4rJNsBvvqTDn9EeDrJA2y0y0Suc7AhIp2BGMI+dpIdxys8RJDamXNWyDL
gJf0YRgUoiIn3AHKb+fgv60AoxfC175NSg/5/y/scFNXqVlW0Up4YCb7pqG9o2Ga
f3XvtWfbw1N5PmUYjFkALwEkzGUbM3v0RA3xLY2j2WlWm9fBPPy59dt+i/h/VKyA
GrA7i7lcIqX8dfVH6XkrReZBkRDSB6t9SZTvV54jAz5fcIZO2Rg++UFUvI/R6GKK
XCcxukYaArwo+IG62iqDszS3gfLGhcor/cviOeULRC5zMUIO4Jah+IhDnifmShtC
M5s9QzrwIRD/XMewGRQmvkiN4kBfE7jFoBQr1J9leCXJKrM+2JQmMzVInuubTQIq
SdlKOaAIn7xtekz+6XdFG9Gmhck0PCLMJMOLNvQkKWI3KqGLRZ+dAWKK0vsCizAx
0BA7ZeB9w9lFT+D8mQCX77JvW9+VNwyfwIOLIrJRHk3VqVpS5qvoiFTLGJJBdZx4
/TbbRZu7nXDN2w==
=Mq1m
-----END PGP SIGNATURE-----
Merge tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu updates from Thomas Gleixner:
"Fixes and improvements for FPU handling on x86:
- Prevent sigaltstack out of bounds writes.
The kernel unconditionally writes the FPU state to the alternate
stack without checking whether the stack is large enough to
accommodate it.
Check the alternate stack size before doing so and in case it's too
small force a SIGSEGV instead of silently corrupting user space
data.
- MINSIGSTKSZ and SIGSTKSZ are constants in signal.h and have never
been updated despite the fact that the FPU state which is stored on
the signal stack has grown over time which causes trouble in the
field when AVX512 is available on a CPU. The kernel does not expose
the minimum requirements for the alternate stack size depending on
the available and enabled CPU features.
ARM already added an aux vector AT_MINSIGSTKSZ for the same reason.
Add it to x86 as well.
- A major cleanup of the x86 FPU code. The recent discoveries of
XSTATE related issues unearthed quite some inconsistencies,
duplicated code and other issues.
The fine granular overhaul addresses this, makes the code more
robust and maintainable, which allows to integrate upcoming XSTATE
related features in sane ways"
* tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again
x86/fpu/signal: Let xrstor handle the features to init
x86/fpu/signal: Handle #PF in the direct restore path
x86/fpu: Return proper error codes from user access functions
x86/fpu/signal: Split out the direct restore code
x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing()
x86/fpu/signal: Sanitize the xstate check on sigframe
x86/fpu/signal: Remove the legacy alignment check
x86/fpu/signal: Move initial checks into fpu__restore_sig()
x86/fpu: Mark init_fpstate __ro_after_init
x86/pkru: Remove xstate fiddling from write_pkru()
x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate()
x86/fpu: Remove PKRU handling from switch_fpu_finish()
x86/fpu: Mask PKRU from kernel XRSTOR[S] operations
x86/fpu: Hook up PKRU into ptrace()
x86/fpu: Add PKRU storage outside of task XSAVE buffer
x86/fpu: Dont restore PKRU in fpregs_restore_userspace()
x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi()
x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs()
x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs()
...
- Added option for per CPU threads to the hwlat tracer
- Have hwlat tracer handle hotplug CPUs
- New tracer: osnoise, that detects latency caused by interrupts, softirqs
and scheduling of other tasks.
- Added timerlat tracer that creates a thread and measures in detail what
sources of latency it has for wake ups.
- Removed the "success" field of the sched_wakeup trace event.
This has been hardcoded as "1" since 2015, no tooling should be looking
at it now. If one exists, we can revert this commit, fix that tool and
try to remove it again in the future.
- tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
- New boot command line option "tp_printk_stop", as tp_printk causes trace
events to write to console. When user space starts, this can easily live
lock the system. Having a boot option to stop just after boot up is
useful to prevent that from happening.
- Have ftrace_dump_on_oops boot command line option take numbers that match
the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.
- Bootconfig clean ups, fixes and enhancements.
- New ktest script that tests bootconfig options.
- Add tracepoint_probe_register_may_exist() to register a tracepoint
without triggering a WARN*() if it already exists. BPF has a path from
user space that can do this. All other paths are considered a bug.
- Small clean ups and fixes
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYN8YPhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhxLAP9Mo5hHv7Hg6W7Ddv77rThm+qclsMR/
yW0P+eJpMm4+xAD8Cq03oE1DimPK+9WZBKU5rSqAkqG6CjgDRw6NlIszzQQ=
=WEPR
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
- Added option for per CPU threads to the hwlat tracer
- Have hwlat tracer handle hotplug CPUs
- New tracer: osnoise, that detects latency caused by interrupts,
softirqs and scheduling of other tasks.
- Added timerlat tracer that creates a thread and measures in detail
what sources of latency it has for wake ups.
- Removed the "success" field of the sched_wakeup trace event. This has
been hardcoded as "1" since 2015, no tooling should be looking at it
now. If one exists, we can revert this commit, fix that tool and try
to remove it again in the future.
- tgid mapping fixed to handle more than PID_MAX_DEFAULT pids/tgids.
- New boot command line option "tp_printk_stop", as tp_printk causes
trace events to write to console. When user space starts, this can
easily live lock the system. Having a boot option to stop just after
boot up is useful to prevent that from happening.
- Have ftrace_dump_on_oops boot command line option take numbers that
match the numbers shown in /proc/sys/kernel/ftrace_dump_on_oops.
- Bootconfig clean ups, fixes and enhancements.
- New ktest script that tests bootconfig options.
- Add tracepoint_probe_register_may_exist() to register a tracepoint
without triggering a WARN*() if it already exists. BPF has a path
from user space that can do this. All other paths are considered a
bug.
- Small clean ups and fixes
* tag 'trace-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (49 commits)
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
tracing: Simplify & fix saved_tgids logic
treewide: Add missing semicolons to __assign_str uses
tracing: Change variable type as bool for clean-up
trace/timerlat: Fix indentation on timerlat_main()
trace/osnoise: Make 'noise' variable s64 in run_osnoise()
tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
tracing: Fix spelling in osnoise tracer "interferences" -> "interference"
Documentation: Fix a typo on trace/osnoise-tracer
trace/osnoise: Fix return value on osnoise_init_hotplug_support
trace/osnoise: Make interval u64 on osnoise_main
trace/osnoise: Fix 'no previous prototype' warnings
tracing: Have osnoise_main() add a quiescent state for task rcu
seq_buf: Make trace_seq_putmem_hex() support data longer than 8
seq_buf: Fix overflow in seq_buf_putmem_hex()
trace/osnoise: Support hotplug operations
trace/hwlat: Support hotplug operations
trace/hwlat: Protect kdata->kthread with get/put_online_cpus
trace: Add timerlat tracer
trace: Add osnoise tracer
...
Merge more updates from Andrew Morton:
"190 patches.
Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
signals, exec, kcov, selftests, compress/decompress, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
ipc/util.c: use binary search for max_idx
ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
ipc: use kmalloc for msg_queue and shmid_kernel
ipc sem: use kvmalloc for sem_undo allocation
lib/decompressors: remove set but not used variabled 'level'
selftests/vm/pkeys: exercise x86 XSAVE init state
selftests/vm/pkeys: refill shadow register after implicit kernel write
selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
exec: remove checks in __register_bimfmt()
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
hfsplus: report create_date to kstat.btime
hfsplus: remove unnecessary oom message
nilfs2: remove redundant continue statement in a while-loop
kprobes: remove duplicated strong free_insn_page in x86 and s390
init: print out unknown kernel parameters
checkpatch: do not complain about positive return values starting with EPOLL
checkpatch: improve the indented label test
checkpatch: scripts/spdxcheck.py now requires python3
...
free_insn_page() in x86 and s390 is the same as the common weak function in
kernel/kprobes.c. In addition, the comment "Recover page to RW mode before
releasing it" in x86 makes little sense there, since resetting the mapping
is done by common code in vfree() of module_memfree(). So drop these two
duplicated strong functions and related comment, then mark the common one
in kernel/kprobes.c strong.
Link: https://lkml.kernel.org/r/20210608065736.32656-1-song.bao.hua@hisilicon.com
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Qi Liu <liuqi115@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.
There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain
At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
[akpm@linux-foundation.org: thread_info.h needs limits.h]
[andriy.shevchenko@linux.intel.com: ia64 fix]
Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com
Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Co-developed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc updates from Andrew Morton:
"191 patches.
Subsystems affected by this patch series: kthread, ia64, scripts,
ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
pagealloc, and memory-failure)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
mm,hwpoison: make get_hwpoison_page() call get_any_page()
mm,hwpoison: send SIGBUS with error virutal address
mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
docs: remove description of DISCONTIGMEM
arch, mm: remove stale mentions of DISCONIGMEM
mm: remove CONFIG_DISCONTIGMEM
m68k: remove support for DISCONTIGMEM
arc: remove support for DISCONTIGMEM
arc: update comment about HIGHMEM implementation
alpha: remove DISCONTIGMEM and NUMA
mm/page_alloc: move free_the_page
mm/page_alloc: fix counting of managed_pages
mm/page_alloc: improve memmap_pages dbg msg
mm: drop SECTION_SHIFT in code comments
mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
mm/page_alloc: scale the number of pages that are batch freed
...
- Update ACPICA code in the kernel to upstream revision 20210604
including the following changes:
* Add defines for the CXL Host Bridge Structure and add the
CFMWS structure definition to CEDT (Alison Schofield).
* iASL: Finish support for the IVRS ACPI table (Bob Moore).
* iASL: Add support for the SVKL table (Bob Moore).
* iASL: Add full support for RGRT ACPI table (Bob Moore).
* iASL: Add support for the BDAT ACPI table (Bob Moore).
* iASL: add disassembler support for PRMT (Erik Kaneda).
* Fix memory leak caused by _CID repair function (Erik Kaneda).
* Add support for PlatformRtMechanism OpRegion (Erik Kaneda).
* Add PRMT module header to facilitate parsing (Erik Kaneda).
* Add _PLD panel positions (Fabian Wüthrich).
* MADT: add Multiprocessor Wakeup Mailbox Structure and the
SVKL table headers (Kuppuswamy Sathyanarayanan).
* Use ACPI_FALLTHROUGH (Wei Ming Chen).
- Add preliminary support for the Platform Runtime Mechanism (PRM)
to allow the AML interpreter to call PRM functions (Erik Kaneda).
- Address some issues related to the handling of device dependencies
reported by _DEP in the ACPI device enumeration code and clean up
some related pieces of it (Rafael Wysocki).
- Improve the tracking of states of ACPI power resources (Rafael
Wysocki).
- Improve ACPI support for suspend-to-idle on AMD systems (Alex
Deucher, Mario Limonciello, Pratik Vishwakarma).
- Continue the unification and cleanup of message printing in the
ACPI code (Hanjun Guo, Heiner Kallweit).
- Fix possible buffer overrun issue with the description_show()
sysfs attribute method (Krzysztof Wilczyński).
- Improve the acpi_mask_gpe kernel command line parameter handling
and clean up the core ACPI code related to sysfs (Andy Shevchenko,
Baokun Li, Clayton Casciato).
- Postpone bringing devices in the general ACPI PM domain to D0
during resume from system-wide suspend until they are really
needed (Dmitry Torokhov).
- Make the ACPI processor driver fix up C-state latency if not
ordered (Mario Limonciello).
- Add support for identifying devices depending on the given one
that are not its direct descendants with the help of _DEP (Daniel
Scally).
- Extend the checks related to ACPI IRQ overrides on x86 in order to
avoid false-positives (Hui Wang).
- Add battery DPTF participant for Intel SoCs (Sumeet Pawnikar).
- Rearrange the ACPI fan driver and device power management code to
use a common list of device IDs (Rafael Wysocki).
- Fix clang CFI violation in the ACPI BGRT table parsing code and
clean it up (Nathan Chancellor).
- Add GPE-related quirks for some laptops to the EC driver (Chris
Chiu, Zhang Rui).
- Make the ACPI PPTT table parsing code populate the cache-id
value if present in the firmware (James Morse).
- Remove redundant clearing of context->ret.pointer from
acpi_run_osc() (Hans de Goede).
- Add missing acpi_put_table() in acpi_init_fpdt() (Jing Xiangfeng).
- Make ACPI APEI handle ARM Processor Error CPER records like
Memory Error ones to avoid user space task lockups (Xiaofei Tan).
- Stop warning about disabled ACPI in APEI (Jon Hunter).
- Fix fall-through warning for Clang in the SBSHC driver (Gustavo A.
R. Silva).
- Add custom DSDT file as Makefile prerequisite (Richard Fitzgerald).
- Initialize local variable to avoid garbage being returned (Colin
Ian King).
- Simplify assorted pieces of code, address assorted coding style
and documentation issues and comment typos (Baokun Li, Christophe
JAILLET, Clayton Casciato, Liu Shixin, Shaokun Zhang, Wei Yongjun,
Yang Li, Zhen Lei).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmDbajwSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRxkCkQAKRZbSiyrCRLHCV81ZcxAWkluxfT8ljv
B4C4kuKvYDvnEO2Bo40QByvQsgE176nexIxsO9BKEoSlrokX5tIBC1KbMmc7ZbtQ
SFU6gCs+FgsVWQoD0PTJKIRaGFfLl8GRr45Bb+5Ta2DUYIkgMlV/8jf7WWqYLvw3
QkCeU/7CiCUUPxE8i0dv5thEIsWLahkE/9FdCN9yVTxLX/9hqhepWdC62N6TYYy7
Ai8Yt4BCLOSg1ZG8oqHo4I8bzuXgIb6zBFZJwNtP+ISh218RE/zl+0siCF/x5WQ0
pUEDDiji7f6Puwk91IYn5ODlq8iTKO0mKJssIXFGn8lYyhZOdm9LpuTL0+x7zqrz
Nt9Lw/85Ibf11XHetT5O0OrMygChtB2en1G593gI95TJeOfvJ+/374hhROGTd0bL
rw0uOjc5g8MP2WQiGErNgyY0xAUkbKXSOXNOG0iTTKHKOGCBKhs5VWNL216j/wyD
nsJoSyF//xJfvTd3CHp8m0LYe0PM06lWUTIfrVLxQYE2fU13hJcAzIt+6+1Pwmk7
+gvGsVsf8kjjFAvHKT7EgM67JHecx6s6kh8MJ2DAqToAeuuCHFVHj8msIFBeZ28e
vQ62CmdcXax3VNTYV6qC633ZwvaJ99QUX2x18hJpx8P2Z43rBgRBNZl/s/8/NIq4
VVx6u54hGpWH
=S//q
-----END PGP SIGNATURE-----
Merge tag 'acpi-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI updates from Rafael Wysocki:
"These update the ACPICA code in the kernel to the 20210604 upstream
revision, add preliminary support for the Platform Runtime Mechanism
(PRM), address issues related to the handling of device dependencies
in the ACPI device enumeration code, improve the tracking of ACPI
power resource states, improve the ACPI support for suspend-to-idle on
AMD systems, continue the unification of message printing in the ACPI
code, address assorted issues and clean up the code in a number of
places.
Specifics:
- Update ACPICA code in the kernel to upstream revision 20210604
including the following changes:
- Add defines for the CXL Host Bridge Structure and add the
CFMWS structure definition to CEDT (Alison Schofield).
- iASL: Finish support for the IVRS ACPI table (Bob Moore).
- iASL: Add support for the SVKL table (Bob Moore).
- iASL: Add full support for RGRT ACPI table (Bob Moore).
- iASL: Add support for the BDAT ACPI table (Bob Moore).
- iASL: add disassembler support for PRMT (Erik Kaneda).
- Fix memory leak caused by _CID repair function (Erik Kaneda).
- Add support for PlatformRtMechanism OpRegion (Erik Kaneda).
- Add PRMT module header to facilitate parsing (Erik Kaneda).
- Add _PLD panel positions (Fabian Wüthrich).
- MADT: add Multiprocessor Wakeup Mailbox Structure and the SVKL
table headers (Kuppuswamy Sathyanarayanan).
- Use ACPI_FALLTHROUGH (Wei Ming Chen).
- Add preliminary support for the Platform Runtime Mechanism (PRM) to
allow the AML interpreter to call PRM functions (Erik Kaneda).
- Address some issues related to the handling of device dependencies
reported by _DEP in the ACPI device enumeration code and clean up
some related pieces of it (Rafael Wysocki).
- Improve the tracking of states of ACPI power resources (Rafael
Wysocki).
- Improve ACPI support for suspend-to-idle on AMD systems (Alex
Deucher, Mario Limonciello, Pratik Vishwakarma).
- Continue the unification and cleanup of message printing in the
ACPI code (Hanjun Guo, Heiner Kallweit).
- Fix possible buffer overrun issue with the description_show() sysfs
attribute method (Krzysztof Wilczyński).
- Improve the acpi_mask_gpe kernel command line parameter handling
and clean up the core ACPI code related to sysfs (Andy Shevchenko,
Baokun Li, Clayton Casciato).
- Postpone bringing devices in the general ACPI PM domain to D0
during resume from system-wide suspend until they are really needed
(Dmitry Torokhov).
- Make the ACPI processor driver fix up C-state latency if not
ordered (Mario Limonciello).
- Add support for identifying devices depending on the given one that
are not its direct descendants with the help of _DEP (Daniel
Scally).
- Extend the checks related to ACPI IRQ overrides on x86 in order to
avoid false-positives (Hui Wang).
- Add battery DPTF participant for Intel SoCs (Sumeet Pawnikar).
- Rearrange the ACPI fan driver and device power management code to
use a common list of device IDs (Rafael Wysocki).
- Fix clang CFI violation in the ACPI BGRT table parsing code and
clean it up (Nathan Chancellor).
- Add GPE-related quirks for some laptops to the EC driver (Chris
Chiu, Zhang Rui).
- Make the ACPI PPTT table parsing code populate the cache-id value
if present in the firmware (James Morse).
- Remove redundant clearing of context->ret.pointer from
acpi_run_osc() (Hans de Goede).
- Add missing acpi_put_table() in acpi_init_fpdt() (Jing Xiangfeng).
- Make ACPI APEI handle ARM Processor Error CPER records like Memory
Error ones to avoid user space task lockups (Xiaofei Tan).
- Stop warning about disabled ACPI in APEI (Jon Hunter).
- Fix fall-through warning for Clang in the SBSHC driver (Gustavo A.
R. Silva).
- Add custom DSDT file as Makefile prerequisite (Richard Fitzgerald).
- Initialize local variable to avoid garbage being returned (Colin
Ian King).
- Simplify assorted pieces of code, address assorted coding style and
documentation issues and comment typos (Baokun Li, Christophe
JAILLET, Clayton Casciato, Liu Shixin, Shaokun Zhang, Wei Yongjun,
Yang Li, Zhen Lei)"
* tag 'acpi-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (97 commits)
ACPI: PM: postpone bringing devices to D0 unless we need them
ACPI: tables: Add custom DSDT file as makefile prerequisite
ACPI: bgrt: Use sysfs_emit
ACPI: bgrt: Fix CFI violation
ACPI: EC: trust DSDT GPE for certain HP laptop
ACPI: scan: Simplify acpi_table_events_fn()
ACPI: PM: Adjust behavior for field problems on AMD systems
ACPI: PM: s2idle: Add support for new Microsoft UUID
ACPI: PM: s2idle: Add support for multiple func mask
ACPI: PM: s2idle: Refactor common code
ACPI: PM: s2idle: Use correct revision id
ACPI: sysfs: Remove tailing return statement in void function
ACPI: sysfs: Use __ATTR_RO() and __ATTR_RW() macros
ACPI: sysfs: Sort headers alphabetically
ACPI: sysfs: Refactor param_get_trace_state() to drop dead code
ACPI: sysfs: Unify pattern of memory allocations
ACPI: sysfs: Allow bitmap list to be supplied to acpi_mask_gpe
ACPI: sysfs: Make sparse happy about address space in use
ACPI: scan: Fix race related to dropping dependencies
ACPI: scan: Reorganize acpi_device_add()
...
- Consolidate the VECTOR defines and the usage sites.
- Cleanup GDT/IDT related code and replace open coded ASM with proper
native helper functions.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmDbLAUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTiXEACiuisDJ2fYFqU1dmYRbWIDtWbgsJ3k
CVABRjgCbGfviKaaJuMoHf5tbnXWWu7y8jd8Z+h9cwOlyQOzNBsZjplzPS0h8zME
KAekAkO2VGf5G7VdWLrfMvjIY/NDuAgxj+7w01LvnyWROePGRkbeP3iH41qo+auM
5Cj4lu333+rO4kzmdXzwJ7CHQXOa/OT0MrBL14saYFaM3qSSkCzeIXnE6/ZNapsE
zZYOCDF19MpPm6GZT1i4qRxirhw1TLNycsYavlOxZ/Hyp0BO0t2TiNRwZtdIVz+a
1sedm+pD9E+1qHQfB+P03P65OixxN0hArNlKgGou5LDMRF45pvfqQXEBbTsqHSxh
vWlL/tK7Z7U5dsK7ZA0HvlZYdrunWn/cNMqWb08WDyuPLxJ0QxJjsdOB2teVEus+
kNYsP0ZxRvPNHKtqVfTXGS8ksrNS/57lUz6UJmBA3UYhYg33UgPCfF/gQzTnpfSo
4TzhWIeLlCOId9FPxXpXa4NjjsqXvNEOPGrTx4BY8SYHYln4HoSyffRIZQ8xl0lA
Qfetod+Hajt+5JXGndb906kexY7i14ZOrkHEjkUtq0asNmbwJ+hVs2VaYcq/ghuS
BmhlnarYuWw9t11yD9Ln5stoVgRJ2KEX5T9fOCtCsJZyHo+Eta/p14ocU0eLQQdh
HbsRKB+pE+al2A==
=eAPe
-----END PGP SIGNATURE-----
Merge tag 'x86-irq-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 interrupt related updates from Thomas Gleixner:
- Consolidate the VECTOR defines and the usage sites.
- Cleanup GDT/IDT related code and replace open coded ASM with proper
native helper functions.
* tag 'x86-irq-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kexec: Set_[gi]dt() -> native_[gi]dt_invalidate() in machine_kexec_*.c
x86: Add native_[ig]dt_invalidate()
x86/idt: Remove address argument from idt_invalidate()
x86/irq: Add and use NR_EXTERNAL_VECTORS and NR_SYSTEM_VECTORS
x86/irq: Remove unused vectors defines
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmDbLo4THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoZFyD/4icyCNaeV2R8fufdQGWjPwZfpc8JiQ
pqEKWlIGaImG3NgbL953/or8pDZe3LCk+p0hJOwYKtPP0LGjgZvPp6glOofAzvC8
sM5RCsJoDOI7mrc23JRXy8z78C/9tmth5UFw1RlXXuiE4hVr2Gc31YpoyvJLQWn0
XcrkSx2J3Cn7WFpjZCZkeC+Wr34+AVXhAY9t8S3WMn2bPj8Bw5vkxmnR2zbZ0PQI
KZcbYI6r/dJv8ov2AXfkD+EJIe5dzjdRVSX5UZYXWIQMB/vMkt8HinHPm+hFuHWn
Swz7ldBznFDTasoEUVMpn2mObjIuEs0jOYIxlXHYEgl1elRmBbgzQhMY5UGnAUnU
na4RHgZ0WOygwXcZIYYrl7aDuSvt4BvlVz17wNQ4P85QsOcGINSH3c0At0JdEeIg
WPJuBIq02A9bHXg+fvVtZMCvnyTYe7DRVL+J7eVopGIka8b07nUcP5UB+nRJGjxI
uOzdA2oFtucWRAxqtQh8FKVYR9vrIeSMfKhqaIQmzlBgbAzSo1OPX23O8gwkLSab
bzjPb5XOw23w20Oqh7SkTTIMR2m633IZBqnd5gPL4nUZTmB40EEYhwH6vfopeCS+
q4+1tzHmTkAvrnjhN9QTr2bGGGhPeehiYVdQ8QwvB10nF3Lca47hopSoJa5fKIeC
nWb2ZXUN1YwUMQ==
=5Hb8
-----END PGP SIGNATURE-----
Merge tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Time and clocksource/clockevent related updates:
Core changes:
- Infrastructure to support per CPU "broadcast" devices for per CPU
clockevent devices which stop in deep idle states. This allows us
to utilize the more efficient architected timer on certain ARM SoCs
for normal operation instead of permanently using the slow-to-access
SoC-specific clockevent device.
- Print the name of the broadcast/wakeup device in /proc/timer_list
- Make the clocksource watchdog more robust against delays between
reading the current active clocksource and the watchdog
clocksource. Such delays can be caused by NMIs, SMIs and vCPU
preemption.
Handle this by reading the watchdog clocksource twice, i.e. before
and after reading the current active clocksource. In case the two
watchdog reads show an excessive time delta, the read sequence is
repeated up to 3 times (see the sketch below).
- Improve the debug output and add a test module for the watchdog
mechanism.
- Reimplementation of the venerable time64_to_tm() function with a
faster and significantly smaller version. Straight from the source,
i.e. the author of the related research paper contributed this!
Driver changes:
- No new drivers, not even new device tree bindings!
- Fixes, improvements and cleanups all over the place"
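A minimal C sketch of the double-read idea above; the names watchdog,
cycles_to_ns() and MAX_SKEW_NS are illustrative placeholders, not the
kernel's actual implementation::

  static bool watchdog_read_stable(struct clocksource *cs, u64 *cs_now, u64 *wd_now)
  {
          u64 wd_before;
          int tries;

          for (tries = 0; tries < 3; tries++) {
                  wd_before = watchdog->read(watchdog);
                  *cs_now = cs->read(cs);
                  *wd_now = watchdog->read(watchdog);

                  /*
                   * If the two watchdog reads are close together, no NMI,
                   * SMI or vCPU preemption disturbed the sequence and the
                   * clocksource reading can be trusted.
                   */
                  if (cycles_to_ns(watchdog, *wd_now - wd_before) < MAX_SKEW_NS)
                          return true;
          }

          /* Persistently delayed: skip the comparison rather than misjudge cs. */
          return false;
  }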
* tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
time/kunit: Add missing MODULE_LICENSE()
time: Improve performance of time64_to_tm()
clockevents: Use list_move() instead of list_del()/list_add()
clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
clocksource: Provide kernel module to test clocksource watchdog
clocksource: Reduce clocksource-skew threshold
clocksource: Limit number of CPUs checked for clock synchronization
clocksource: Check per-CPU clock synchronization when marked unstable
clocksource: Retry clock read if long delays detected
clockevents: Add missing parameter documentation
clocksource/drivers/timer-ti-dm: Drop unnecessary restore
clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
clocksource/drivers/arm_global_timer: Remove duplicated argument in arm_global_timer
clocksource/drivers/arm_global_timer: Make symbol 'gt_clk_rate_change_nb' static
arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
clocksource/drivers/arm_global_timer: Implement rate compensation whenever source clock changes
clocksource/drivers/ingenic: Rename unreasonable array names
clocksource/drivers/timer-ti-dm: Save and restore timer TIOCP_CFG
clocksource/drivers/mediatek: Ack and disable interrupts on suspend
clocksource/drivers/samsung_pwm: Constify source IO memory
...
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmDa/58THHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXufBB/sGVIp1OhrtRFXeLJGJtbPPJzbh1w+K
hlDxthpN8uW7W7jyQRShiNUVwGX1QUiKBUH5g9oVKYBm0/srup4TivT2PiXPp4mC
7rPM9jcOD0ei9W2Z1/fjQvyorz37pQzq9GMF37FGqPM2mVuCl16QhdF5EqYjipCU
k34MyvjUPudmYot3gdiURyxrUljzw3KLfVXeg1Tpk0mXLZvm6OnyRn9ywgNJb+Pz
wn2+Om3hZCtccF2CUUS1LbMPFF97xy/CdAObyDuJyyXfyY7JFZ/guWBkQpi0tejQ
/+yJFYYZSo1lXV8xxB0t8LR9Vf6OxZNexn+0El4IUO4TZuAJcr4Q6fYp
=yOAz
-----END PGP SIGNATURE-----
Merge tag 'hyperv-next-signed-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
"Just a few minor enhancement patches and bug fixes"
* tag 'hyperv-next-signed-20210629' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
PCI: hv: Add check for hyperv_initialized in init_hv_pci_drv()
Drivers: hv: Move Hyper-V extended capability check to arch neutral code
drivers: hv: Fix missing error code in vmbus_connect()
x86/hyperv: fix logical processor creation
hv_utils: Fix passing zero to 'PTR_ERR' warning
scsi: storvsc: Use blk_mq_unique_tag() to generate requestIDs
Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer
hv_balloon: Remove redundant assignment to region_start
Now an action-required MCE on an already hwpoisoned address surely sends a
SIGBUS to the current process, but the SIGBUS doesn't convey the error
virtual address. That's not optimal for hwpoison-aware applications.
To fix the issue, make memory_failure() call kill_accessing_process(),
which does a page table walk to find the error virtual address. It could
find multiple virtual addresses for the same error page, and it seems hard
to tell which virtual address is the correct one. But that's rare, and
sending an incorrect virtual address could be better than no address. So
let's report the first found virtual address for now.
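A hedged sketch of that page table walk, built on the generic
walk_page_range() machinery; the struct, callback and locking details
below are illustrative, not the actual kill_accessing_process() code::

  /* Find the first virtual address in 'current' that maps the bad pfn. */
  struct hwpoison_find {
          unsigned long pfn;      /* poisoned page frame number */
          unsigned long vaddr;    /* first matching virtual address */
  };

  static int hwpoison_pte_entry(pte_t *pte, unsigned long addr,
                                unsigned long next, struct mm_walk *walk)
  {
          struct hwpoison_find *hwp = walk->private;

          if (pte_present(*pte) && pte_pfn(*pte) == hwp->pfn) {
                  hwp->vaddr = addr;
                  return 1;       /* non-zero stops the walk: first hit wins */
          }
          return 0;
  }

  static const struct mm_walk_ops hwpoison_walk_ops = {
          .pte_entry = hwpoison_pte_entry,
  };

  /*
   * With mmap_read_lock(current->mm) held:
   *   walk_page_range(current->mm, 0, TASK_SIZE, &hwpoison_walk_ops, &hwp);
   * then deliver the SIGBUS with BUS_MCEERR_AR and hwp.vaddr as si_addr.
   */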
[naoya.horiguchi@nec.com: fix walk_page_range() return]
Link: https://lkml.kernel.org/r/20210603051055.GA244241@hori.linux.bs1.fc.nec.co.jp
Link: https://lkml.kernel.org/r/20210521030156.2612074-4-nao.horiguchi@gmail.com
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Aili Yao <yaoaili@kingsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Jue Wang <juew@google.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After removal of DISCINTIGMEM the NEED_MULTIPLE_NODES and NUMA
configuration options are equivalent.
Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.
Done with
$ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
$(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
$ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
$(git grep -wl NEED_MULTIPLE_NODES)
with manual tweaks afterwards.
[rppt@linux.ibm.com: fix arm boot crash]
Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com
Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use vma_lookup() to find the VMA at a specific address. As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.
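A minimal before/after sketch of that simplification, assuming the caller
holds the mmap lock as both helpers require; old_way()/new_way() are
illustrative wrappers, not kernel functions::

  #include <linux/mm.h>

  /* Before: find_vma() returns the first VMA with vm_end > addr, so the
   * caller must also check that addr is not in a hole below that VMA. */
  struct vm_area_struct *old_way(struct mm_struct *mm, unsigned long addr)
  {
          struct vm_area_struct *vma = find_vma(mm, addr);

          if (vma && vma->vm_start > addr)
                  return NULL;    /* addr falls in a hole below vma */
          return vma;
  }

  /* After: vma_lookup() returns a VMA only if it actually contains addr,
   * so the extra start-address check disappears. */
  struct vm_area_struct *new_way(struct mm_struct *mm, unsigned long addr)
  {
          return vma_lookup(mm, addr);
  }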
Link: https://lkml.kernel.org/r/20210521174745.2219620-10-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* acpi-prm:
ACPI: PRM: make symbol 'prm_module_list' static
ACPI: Add \_SB._OSC bit for PRM
ACPI: PRM: implement OperationRegion handler for the PlatformRtMechanism subtype
* acpi-sysfs:
ACPI: sysfs: Remove tailing return statement in void function
ACPI: sysfs: Use __ATTR_RO() and __ATTR_RW() macros
ACPI: sysfs: Sort headers alphabetically
ACPI: sysfs: Refactor param_get_trace_state() to drop dead code
ACPI: sysfs: Unify pattern of memory allocations
ACPI: sysfs: Allow bitmap list to be supplied to acpi_mask_gpe
ACPI: sysfs: Make sparse happy about address space in use
ACPI: sysfs: fix doc warnings in device_sysfs.c
ACPI: sysfs: Drop four redundant return statements
ACPI: sysfs: Fix a buffer overrun problem with description_show()
* acpi-x86:
x86/acpi: Switch to pr_xxx log functions
- Add MTE support in guests, complete with tag save/restore interface
- Reduce the impact of CMOs by moving them in the page-table code
- Allow device block mappings at stage-2
- Reduce the footprint of the vmemmap in protected mode
- Support the vGIC on dumb systems such as the Apple M1
- Add selftest infrastructure to support multiple configuration
and apply that to PMU/non-PMU setups
- Add selftests for the debug architecture
- The usual crop of PMU fixes
PPC:
- Support for the H_RPT_INVALIDATE hypercall
- Conversion of Book3S entry/exit to C
- Bug fixes
S390:
- new HW facilities for guests
- make inline assembly more robust with KASAN and co
x86:
- Allow userspace to handle emulation errors (unknown instructions)
- Lazy allocation of the rmap (host physical -> guest physical address)
- Support for virtualizing TSC scaling on VMX machines
- Optimizations to avoid shattering huge pages at the beginning of live migration
- Support for initializing the PDPTRs without loading them from memory
- Many TLB flushing cleanups
- Refuse to load if two-stage paging is available but NX is not (this has
been a requirement in practice for over a year)
- A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
CR0/CR4/EFER, using the MMU mode everywhere once it is computed
from the CPU registers
- Use PM notifier to notify the guest about host suspend or hibernate
- Support for passing arguments to Hyper-V hypercalls using XMM registers
- Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap on
AMD processors
- Hide Hyper-V hypercalls that are not included in the guest CPUID
- Fixes for live migration of virtual machines that use the Hyper-V
"enlightened VMCS" optimization of nested virtualization
- Bugfixes (not many)
Generic:
- Support for retrieving statistics without debugfs
- Cleanups for the KVM selftests API
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDV9UYUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOIRgf/XX8fKLh24RnTOs2ldIu2AfRGVrT4
QMrr8MxhmtukBAszk2xKvBt8/6gkUjdaIC3xqEnVjxaDaUvZaEtP7CQlF5JV45rn
iv1zyxUKucXrnIOr+gCioIT7qBlh207zV35ArKioP9Y83cWx9uAs22pfr6g+7RxO
h8bJZlJbSG6IGr3voANCIb9UyjU1V/l8iEHqRwhmr/A5rARPfD7g8lfMEQeGkzX6
+/UydX2fumB3tl8e2iMQj6vLVdSOsCkehvpHK+Z33EpkKhan7GwZ2sZ05WmXV/nY
QLAYfD10KegoNWl5Ay4GTp4hEAIYVrRJCLC+wnLdc0U8udbfCuTC31LK4w==
=NcRh
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This covers all architectures (except MIPS) so I don't expect any
other feature pull requests this merge window.
ARM:
- Add MTE support in guests, complete with tag save/restore interface
- Reduce the impact of CMOs by moving them in the page-table code
- Allow device block mappings at stage-2
- Reduce the footprint of the vmemmap in protected mode
- Support the vGIC on dumb systems such as the Apple M1
- Add selftest infrastructure to support multiple configuration and
apply that to PMU/non-PMU setups
- Add selftests for the debug architecture
- The usual crop of PMU fixes
PPC:
- Support for the H_RPT_INVALIDATE hypercall
- Conversion of Book3S entry/exit to C
- Bug fixes
S390:
- new HW facilities for guests
- make inline assembly more robust with KASAN and co
x86:
- Allow userspace to handle emulation errors (unknown instructions)
- Lazy allocation of the rmap (host physical -> guest physical
address)
- Support for virtualizing TSC scaling on VMX machines
- Optimizations to avoid shattering huge pages at the beginning of
live migration
- Support for initializing the PDPTRs without loading them from
memory
- Many TLB flushing cleanups
- Refuse to load if two-stage paging is available but NX is not (this
has been a requirement in practice for over a year)
- A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
CR0/CR4/EFER, using the MMU mode everywhere once it is computed
from the CPU registers
- Use PM notifier to notify the guest about host suspend or hibernate
- Support for passing arguments to Hyper-V hypercalls using XMM
registers
- Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap
on AMD processors
- Hide Hyper-V hypercalls that are not included in the guest CPUID
- Fixes for live migration of virtual machines that use the Hyper-V
"enlightened VMCS" optimization of nested virtualization
- Bugfixes (not many)
Generic:
- Support for retrieving statistics without debugfs
- Cleanups for the KVM selftests API"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits)
KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled
kvm: x86: disable the narrow guest module parameter on unload
selftests: kvm: Allows userspace to handle emulation errors.
kvm: x86: Allow userspace to handle emulation errors
KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on
KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic
KVM: x86: Enhance comments for MMU roles and nested transition trickiness
KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE
KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU
KVM: x86/mmu: Use MMU's role to determine PTTYPE
KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
KVM: x86/mmu: Add a helper to calculate root from role_regs
KVM: x86/mmu: Add helper to update paging metadata
KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
KVM: x86/mmu: Get nested MMU's root level from the MMU's role
...
- Add the "ratelimit:N" parameter to the split_lock_detect= boot option,
to rate-limit the generation of bus-lock exceptions. This is both
easier on system resources and kinder to offending applications than
the current policy of outright killing them.
- Document the split-lock detection feature and its parameters.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZfS4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hegw//RVafMIceiA0R4zUG8jsGA7SEUaQixfWX
YjSYbpbsQRLHBASu8sb9yT/O4Dy+WmJ2PdETeWNTqX3MMfL41bMMEjdzU/5kL4By
RsWWissxwsx7MRSFdChI74BVT45/DqTnRpbbW5XnYjKoYbXeYqmSIeP/j+Rn5ACQ
rszqIPM/yTK2/NkU9qDoJZitqCuzs925C8k/685prRHzM7gvbQi+6hjKxcQqYtCX
s2wMUGqAMtD+sadHXJAkmtfG7JzPOJYfdG/qeyB88EmT48N8KDjwTDfQZH3Cuox0
DGy7KwtVRiYumF6yaVXXXTCY0ChpPpmZhYA7VuBUIjmFq0EhLwGJ1D4ACL11IX1W
rmqjJ9rNhO+zVc+JLY8671HtyWm0bkUqKaEYhyqJHosI78pRWJIcfqySOAvuqT0N
h1JRko3F/gBGh5DB2zsVcI/odYBiBQk7hAz7SZmPRaXmpNb+epesLrdbI2juxpvO
r6Mt2f1dAWgH+lv+amJRZWWMewrf4bk9mmjGSssUmrSBbi1lxlO1B9it1I0jQn+M
9hELPj4rj82XLkWVggiM0l24FtAHhBeci+wRx1/NrWp8fSsdZ2FojyzXDOLJFfxF
NaQLMuqkWH71CeEWVAdYE69OBHWa2ctmZwMj4BM7RnmKk4tVR13qG5BEWcI4TCsS
TcswzOa1AVA=
=4DyL
-----END PGP SIGNATURE-----
Merge tag 'x86-splitlock-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 splitlock updates from Ingo Molnar:
- Add the "ratelimit:N" parameter to the split_lock_detect= boot
option, to rate-limit the generation of bus-lock exceptions.
This is both easier on system resources and kinder to offending
applications than the current policy of outright killing them.
- Document the split-lock detection feature and its parameters.
* tag 'x86-splitlock-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/x86: Add ratelimit in buslock.rst
Documentation/admin-guide: Add bus lock ratelimit
x86/bus_lock: Set rate limit for bus lock
Documentation/x86: Add buslock.rst
- Micro-optimize and standardize the do_syscall_64() calling convention
- Make syscall entry flags clearing more conservative
- Clean up syscall table handling
- Clean up & standardize assembly macros, in preparation of FRED
- Misc cleanups and fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZeG8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gHQw//fI9MAIVQbB6tVMH6GtFkQZIJLMt/bik5
AWelEXoBUbbLFGKpugC+oWGJjsvZ026f65hfQEswuqD4n0Xx8FFPRi51LP88lLya
XQV8nssJYUKYZAVA0EJd7NmnJchbnRc4KQmu6ekEQdP6+Nht8k7U9O2QetgQgcE5
IYhXctoYpr/FnBpV5PmVNAakOt0cZh6mXAtpzjHfdU8lUHZ13zPIpniSXCPd4vUB
u/a3x3l1fP+Gg8d1vpfGCBvNKRBEh5pJsjaObMlLM/qhHupsDi5Ji6y6pcJSgkcv
2nBtRGYDjYIQ0qXx6ILhNuqGFT76i/j2p8YfwMnH4NmYk908RlT0quu7fI8wBO9E
cKd3m9BG8wP67xbOrG/0ckdl3+y/1iW8kPY6SeO03Vvfm6ryqHdZs4oi4CmcX9lP
bFXi5AiYdHm0vqbwQG8P9LerWotgz4yFC9z7yC1KXJDXJxSwVxDFiXvyvxepRi6E
NZxe4RSnDp7sijEvZJa/2EA+rDVDIokfzTLgnRSMkaUuxwNsVjeNsV0b5727kiVC
DwVkxC7NZKG9UBr6WFs9hxRPE0g6xz3EJEBXaWpk2ggBmQxTfBRTjV0Pe3ii7dqQ
z7O3Gv8pojki3ttG4wExLepPHRxTBzjdsoV6/BHZpraYTP11bpQlgx/K7IYJZYa5
Tt9IZ4vNd10=
=mbmH
-----END PGP SIGNATURE-----
Merge tag 'x86-asm-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
- Micro-optimize and standardize the do_syscall_64() calling convention
- Make syscall entry flags clearing more conservative
- Clean up syscall table handling
- Clean up & standardize assembly macros, in preparation of FRED
- Misc cleanups and fixes
* tag 'x86-asm-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Make <asm/asm.h> valid on cross-builds as well
x86/regs: Syscall_get_nr() returns -1 for a non-system call
x86/entry: Split PUSH_AND_CLEAR_REGS into two submacros
x86/syscall: Maximize MSR_SYSCALL_MASK
x86/syscall: Unconditionally prototype {ia32,x32}_sys_call_table[]
x86/entry: Reverse arguments to do_syscall_64()
x86/entry: Unify definitions from <asm/calling.h> and <asm/ptrace-abi.h>
x86/asm: Use _ASM_BYTES() in <asm/nops.h>
x86/asm: Add _ASM_BYTES() macro for a .byte ... opcode sequence
x86/asm: Have the __ASM_FORM macros handle commas in arguments
- Clean up & simplify AP exception handling setup.
- Consolidate the disjoint IDT setup code living in
idt_setup_traps() and idt_setup_ist_traps() into
a single idt_setup_traps() initialization function
and call it before cpu_init().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZdu0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gWgA//ecu2FqCFT2gpZP7ABdJqAhtYA8f7rg/a
BMKNNfTha/L9Dot2PSqMq8oVi8NA++EbKjuWeFujAIoU/2vso+YrmHc84O35nOOF
u2Zgps7UK+ffKo11Yu1Vpb911ctJClAwL0CemsC30QDbpGVHPecLuxOIgY6E6BwC
qhNqNLp0K4bFRq0ya27O8RPiz9LjCzUILHHvWSAl5m5tWqovED8aXdjrDJcFXqwY
u9nuuRpUpQWqCldZP9X7+pdo4Z2HZjvIBjqHD/wl3VMjV6q+k+su6AjV9p1D8hoz
otY96i8MQjD/sgIa1H+tUc2ZusGzDls+EpYiGaPmqeXMitKEwOFpVDAaT8SelUms
bR4VQ9IYB1NG7Qbco3NQHMV1sWuvUJcLG6ILYFWXgH0hP1EDHFn/TvOn0rfJysbE
AmCpwmUo0b8Bj6nbKkVcXxoX1FdeqiM5+cPxHxGVgxVoR0Umz13EX4y4cBzSIRht
eYwT6H1CxR9a4TIr8cMBsN14QsnV3f6lv/RNfVdmZEJVVr0boRI90L2xMLBB9RkP
z03g7VvfMuSWnKyOFheP4ae9ul2qxAT380+g1oHQH0XIFtj9yIhzJHpoUCzhgCra
Ui2Z71Dhq0R1UNpPsPfc1XkQI9chiahn8gc1u2zvN4SzZa6DZH22VvGNK0ghoIxq
5WFho50hNIk=
=BPbv
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 exception handling updates from Ingo Molnar:
- Clean up & simplify AP exception handling setup.
- Consolidate the disjoint IDT setup code living in idt_setup_traps()
and idt_setup_ist_traps() into a single idt_setup_traps()
initialization function and call it before cpu_init().
* tag 'x86-apic-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/idt: Rework IDT setup for boot CPU
x86/cpu: Init AP exception handling from cpu_init_secondary()
- Changes to core scheduling facilities:
- Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
coordinated scheduling across SMT siblings. This is a much
requested feature for cloud computing platforms, to allow
the flexible utilization of SMT siblings, without exposing
untrusted domains to information leaks & side channels, plus
to ensure more deterministic computing performance on SMT
systems used by heterogeneous workloads.
There's new prctls to set core scheduling groups, which
allows more flexible management of workloads that can share
siblings.
- Fix task->state access anti-patterns that may result in missed
wakeups and rename it to ->__state in the process to catch new
abuses.
- Load-balancing changes:
- Tweak newidle_balance for fair-sched, to improve
'memcache'-like workloads.
- "Age" (decay) average idle time, to better track & improve workloads
such as 'tbench'.
- Fix & improve energy-aware (EAS) balancing logic & metrics.
- Fix & improve the uclamp metrics.
- Fix task migration (taskset) corner case on !CONFIG_CPUSET.
- Fix RT and deadline utilization tracking across policy changes
- Introduce a "burstable" CFS controller via cgroups, which allows
bursty CPU-bound workloads to borrow a bit against their future
quota to improve overall latencies & batching. Can be tweaked
via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
- Rework asymmetric topology/capacity detection & handling.
- Scheduler statistics & tooling:
- Disable delayacct by default, but add a sysctl to enable
it at runtime if tooling needs it. Use static keys and
other optimizations to make it more palatable.
- Use sched_clock() in delayacct, instead of ktime_get_ns().
- Misc cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZcPoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g3yw//WfhIqy7Psa9d/MBMjQDRGbTuO4+w22Dj
vmWFU44Q4KJxQHWeIgUlrK+dzvYWvNmflUs2CUUOiDVzxFTHMIyBtL4qCBUbx4Ns
vKAcB9wsWZge2o3WzZqpProRhdoRaSKw8egUr2q7rACVBkckY7eGP/OjWxXU8BdA
b7D0LPWwuIBFfN4pFYeCDLn32Dqr9s6Chyj+ZecabdG7EE6Gu+f1diVcxy7JE/mc
4WWL0D1RqdgpGrBEuMJIxPYekdrZiuy4jtEbztz5gbTBteN1cj3BLfqn0Pc/e6rO
Vyuc5mXCAmzRVi18z6g6bsVl+IA/nrbErENB2OHOhOYtqiZxqGTd4GPWZszMyY17
5AsEO5+5pcaBsy4gyp09qURggBu9zhJnMVmOI3rIHZkmkhwzc6uUJlyhDCTiFWOz
3ZF3LjbZEyCKodMD8qMHbs3axIBpIfZqjzkvSKyFnvfXEGVytVse7NUuWtQ36u92
GnURxVeYY1TDVXvE1Y8owNKMxknKQ6YRlypP7Dtbeo/qG6hShp0xmS7qDLDi0ybZ
ZlK+bDECiVoDf3nvJo+8v5M82IJ3CBt4UYldeRJsa1YCK/FsbK8tp91fkEfnXVue
+U6LPX0AmMpXacR5HaZfb3uBIKRw/QMdP/7RFtBPhpV6jqCrEmuqHnpPQiEVtxwO
UmG7bt94Trk=
=3VDr
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Changes to core scheduling facilities:
- Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
coordinated scheduling across SMT siblings. This is a much
requested feature for cloud computing platforms, to allow the
flexible utilization of SMT siblings, without exposing untrusted
domains to information leaks & side channels, plus to ensure more
deterministic computing performance on SMT systems used by
heterogeneous workloads.
There are new prctls to set core scheduling groups, which allows
more flexible management of workloads that can share siblings.
- Fix task->state access anti-patterns that may result in missed
wakeups and rename it to ->__state in the process to catch new
abuses.
- Load-balancing changes:
- Tweak newidle_balance for fair-sched, to improve 'memcache'-like
workloads.
- "Age" (decay) average idle time, to better track & improve
workloads such as 'tbench'.
- Fix & improve energy-aware (EAS) balancing logic & metrics.
- Fix & improve the uclamp metrics.
- Fix task migration (taskset) corner case on !CONFIG_CPUSET.
- Fix RT and deadline utilization tracking across policy changes
- Introduce a "burstable" CFS controller via cgroups, which allows
bursty CPU-bound workloads to borrow a bit against their future
quota to improve overall latencies & batching. Can be tweaked via
/sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us.
- Rework asymmetric topology/capacity detection & handling.
- Scheduler statistics & tooling:
- Disable delayacct by default, but add a sysctl to enable it at
runtime if tooling needs it. Use static keys and other
optimizations to make it more palatable.
- Use sched_clock() in delayacct, instead of ktime_get_ns().
- Misc cleanups and fixes.
* tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/doc: Update the CPU capacity asymmetry bits
sched/topology: Rework CPU capacity asymmetry detection
sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
psi: Fix race between psi_trigger_create/destroy
sched/fair: Introduce the burstable CFS controller
sched/uclamp: Fix uclamp_tg_restrict()
sched/rt: Fix Deadline utilization tracking during policy change
sched/rt: Fix RT utilization tracking during policy change
sched: Change task_struct::state
sched,arch: Remove unused TASK_STATE offsets
sched,timer: Use __set_current_state()
sched: Add get_current_state()
sched,perf,kvm: Fix preemption condition
sched: Introduce task_is_running()
sched: Unbreak wakeups
sched/fair: Age the average idle time
sched/cpufreq: Consider reduced CPU capacity in energy calculation
sched/fair: Take thermal pressure into account while estimating energy
thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
sched/fair: Return early from update_tg_cfs_load() if delta == 0
...
- Platform PMU driver updates:
- x86 Intel uncore driver updates for Skylake (SNR) and Icelake (ICX) servers
- Fix RDPMC support
- Fix [extended-]PEBS-via-PT support
- Fix Sapphire Rapids event constraints
- Fix :ppp support on Sapphire Rapids
- Fix fixed counter sanity check on Alder Lake & X86_FEATURE_HYBRID_CPU
- Other heterogeneous-PMU fixes
- Kprobes:
- Remove the unused and misguided kprobe::fault_handler callbacks.
- Warn about kprobes taking a page fault.
- Fix the 'nmissed' stat counter.
- Misc cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZaxMRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hPgw//f9SnGzFoP1uR5TBqM8j/QHulMewew/iD
dM5lh2emdmqHWYPBeRxUHgag38K2Golr3Y+NxLA3R+RMx+OZQe8Mz/wYvPQcBvsV
k1HHImU3GRMn4GM7GwxH3vPIottDUx3mNS2J6pzlw3kwRUVqrxUdj/0/pSY/4eJ7
ZT4uq4yLV83Jd3qioU7o7e/u6MrdNIIcAXRpVDdE9Mm1+kWXSVN7/h3Vsiz4tj5E
iS+UXEtSc1a2mnmekv63pYkJHHNUb6guD8jgI/wrm1KIFGjDRifM+3TV6R/kB96/
TfD2LhCcTShfSp8KI191pgV7/NQbB/PmLdSYmff3rTBiii4cqXuCygJCHInZ09z0
4fTSSqM6aHg7kfTQyOCp+DUQ+9vNVXWo8mxt9c6B8xA0GyCI3zhjQ4UIiSUWRpjs
Be5ZyF0kNNuPxYrKFnGnBf8+51DURpCz3sDdYRuK4KNkj1+4ZvJo/KzGTMUUIE4B
IDQG6wDP5Kb388eRDtKrG5X7IXg+L5F/kezin60j0QF5MwDgxirT217teN8H1lNn
YgWMjRK8Tw0flUJsbCxa51/nl93UtByB+fIRIc88MSeLxcI6/ORW+TxBBEqkYm5Z
6BLFtmHSuAqAXUuyZXSGLcW7XLJvIaDoHgvbDn6l4g7FMWHqPOIq6nJQY3L8ben2
e+fQrGh4noI=
=20Vc
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Platform PMU driver updates:
- x86 Intel uncore driver updates for Skylake (SNR) and Icelake (ICX) servers
- Fix RDPMC support
- Fix [extended-]PEBS-via-PT support
- Fix Sapphire Rapids event constraints
- Fix :ppp support on Sapphire Rapids
- Fix fixed counter sanity check on Alder Lake & X86_FEATURE_HYBRID_CPU
- Other heterogeneous-PMU fixes
- Kprobes:
- Remove the unused and misguided kprobe::fault_handler callbacks.
- Warn about kprobes taking a page fault.
- Fix the 'nmissed' stat counter.
- Misc cleanups and fixes.
* tag 'perf-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix task context PMU for Hetero
perf/x86/intel: Fix instructions:ppp support in Sapphire Rapids
perf/x86/intel: Add more events requires FRONTEND MSR on Sapphire Rapids
perf/x86/intel: Fix fixed counter check warning for some Alder Lake
perf/x86/intel: Fix PEBS-via-PT reload base value for Extended PEBS
perf/x86: Reset the dirty counter to prevent the leak for an RDPMC task
kprobes: Do not increment probe miss count in the fault handler
x86,kprobes: WARN if kprobes tries to handle a fault
kprobes: Remove kprobe::fault_handler
uprobes: Update uprobe_write_opcode() kernel-doc comment
perf/hw_breakpoint: Fix DocBook warnings in perf hw_breakpoint
perf/core: Fix DocBook warnings
perf/core: Make local function perf_pmu_snapshot_aux() static
perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on ICX
perf/x86/intel/uncore: Enable I/O stacks to IIO PMON mapping on SNR
perf/x86/intel/uncore: Generalize I/O stacks to PMON mapping procedure
perf/x86/intel/uncore: Drop unnecessary NULL checks after container_of()
An ELF format fix for a section flags mismatch bug that breaks
kernel tooling such as kpatch-build.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZYv4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1ipeBAAhJPS/kCQ17Y5zGyMB0/6yfCWIifODoS7
9J+6/mqKHPDdV07yzPtOXuTTmpKV4OHPi8Yj8kaXs5L5fOmQ1uAwITwZNF5hU0a5
CiFIsubUCJmglf9b6L9EH5pBEQ72Cq4u8zIhJ9LmZ4t625AHJAm2ikZgascc4U67
RvVoGr5sYTo0YEsc1IDM1wUtnUhXBNjS1VwkXNnCFFTXYHju47MeY1sPHq2hvkzO
iJGC9A+hxfM1eQt9/qC/2L/6F/XECN61gcR9Get8TkWeEGHmPG+FthmPLd4oO9Ho
03J4JfMbmXumWosAeilYBNUkfii/M5Em78Wpv/cB94iSt67rq7Eb+8gm4D5svmfN
l+utsPY/HYB+uWV0hy2cV/ORRiwcJnon54dEWL6912YkKz+OIb3DK/7l9ex5lW+D
r3o8NP0s6S+RgUkOFxz5VaYK1giu6fiaFysWdKeflvwlvY/64owMepQ1QfPBbeB7
3DTzvuYZ4Cb1x/vR6WBbFqGcuJKZ1CsZIBLCblveUs+G0wlu147K5E1qlXg/Wvq7
5Vzznc4fmRng8np5hxAw8ieLkatWg7szyryUV/4H2Ubs/jWGcH628ZYbapaCb7EM
Eson65xzbVfhnz16z8sN13XIF1lGe8sb0+qiFSclEfyDUnZDuhwMn6d9Ubqxrg5J
uTULEzmY/rI=
=MvPd
-----END PGP SIGNATURE-----
mergetag object d33b9035e1
type commit
tag objtool-core-2021-06-28
tagger Ingo Molnar <mingo@kernel.org> 1624859477 +0200
The biggest change in this cycle is the new code to handle
and rewrite variable sized jump labels - which results in
slightly tighter code generation in hot paths, through the
use of short(er) NOPs.
Also a number of cleanups and fixes, and a change to the
generic include/linux/compiler.h to handle a s390 GCC quirk.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZZGcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1goYg/7BxUIJXP0F5wbrMbAvJIDRgR/j3TA+ztk
uNU1yabBGluMxCqJ87HadJ+A5d010G+GRUn/birVr7w1UuwWv8HOda78dnyG7tme
xm78/1FlOnstuOTQxhK6rjbb2cp+QOmdsAQkq1TF4SOxArBQiwtjiOvytHjb5yNx
7LrlbtuZ7Dtc0qd2evkG4ma4QkGoDhBS1dRogrItc27ZLuFIQoNnEd2K2QNMgczw
a/Jx8fgNmdoJSq+vkBn9TnS/cJYUW/PAlPNtO3ac8yE857aDIVnjXFRzveAP/nTh
rwFD6aCGnJAqyqP7A8ElNjySos5O+ebYApxe7rEx0TNLbrc55qSP9lpdIO+vgytV
Xzy4O7z6o+lailQ4EoF8Qf+rlPeue0kLF23SsNbZY1uT0vjX1Uv70xgKbkuyPygp
GNXAy6dOXK0AfaZYL/Wa50yVnJnkYDjes/hHr+HEam5Oad566pqIyQNP8yWSPqaf
KHkL//1pb5C2RKwot4IYv/ftHfZB5QftoFq6bhGBc1GXUd/FiqivvGHPW/6g7rxi
ZIrXs+Fqm/5KP9mssNONfyz5XEvbcUTD1CbeqX9eyVbiYZbLp1oWSgtogiRW9ya+
HR7t0Dt/UFzFWbilb6EZff/Hdr1NZBZLdrfpvVDoMf5tR9J0BIOyjddTu89g/FIO
KcfJ5yyjJBU=
=+HAB
-----END PGP SIGNATURE-----
Merge tags 'objtool-urgent-2021-06-28' and 'objtool-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fix and updates from Ingo Molnar:
"An ELF format fix for a section flags mismatch bug that breaks kernel
tooling such as kpatch-build.
The biggest change in this cycle is the new code to handle and rewrite
variable sized jump labels - which results in slightly tighter code
generation in hot paths, through the use of short(er) NOPs.
Also a number of cleanups and fixes, and a change to the generic
include/linux/compiler.h to handle a s390 GCC quirk"
* tag 'objtool-urgent-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Don't make .altinstructions writable
* tag 'objtool-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Improve reloc hash size guestimate
instrumentation.h: Avoid using inline asm operand modifiers
compiler.h: Avoid using inline asm operand modifiers
kbuild: Fix objtool dependency for 'OBJECT_FILES_NON_STANDARD_<obj> := n'
objtool: Reflow handle_jump_alt()
jump_label/x86: Remove unused JUMP_LABEL_NOP_SIZE
jump_label, x86: Allow short NOPs
objtool: Provide stats for jump_labels
objtool: Rewrite jump_label instructions
objtool: Decode jump_entry::key addend
jump_label, x86: Emit short JMP
jump_label: Free jump_entry::key bit1 for build use
jump_label, x86: Add variable length patching support
jump_label, x86: Introduce jump_entry_size()
jump_label, x86: Improve error when we fail expected text
jump_label, x86: Factor out the __jump_table generation
jump_label, x86: Strip ASM jump_label support
x86, objtool: Dont exclude arch/x86/realmode/
objtool: Rewrite hashtable sizing
- Differentiate the type of exception the #VC handler raises depending
on code executed in the guest and handle the case where failure to
get the RIP would result in a #GP, as it should, instead of in a #PF
- Disable interrupts while the per-CPU GHCB is held
- Split the #VC handler depending on where the #VC exception has
happened and therefore provide for precise context tracking like the
rest of the exception handlers deal with noinstr regions now
- Add defines for the GHCB version 2 protocol so that further shared
development with KVM can happen without merge conflicts
- The usual small cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDZij8ACgkQEsHwGGHe
VUpwIQ/8CzFbGm2k2RdmO0H/VPwfF1HFSWpM9YFGSs++yOqfiyCFbyIcTcRbK4IO
+BUIRoHSgCWPb+5pJli1Wf0J/sIdYr9D4MDWt1oRQG6e/4NE2SL3EOnYJWW5VtOT
u1AVk01ooPOFDKIoh4OIZ7tCKAeNWBv+oe5dmP46spiEZbHHCzHIEaBuOQRzvX9C
jSKulDHjA4iaNl/BQMF7dJL1+aPWj2NXjSj86fhMAa+m5MspDXbIaM5wMZfPzc1k
Rj/m89JThp+mFwik46o/7g/5Q8SYtTE+Hqi1TX/65/dbyizLqbH5W3g0zwrD8TYf
B7kHguqkoE1j1avLwOYK1yJB8ZTjtf+OXjUAR4UPzxkG7Xhelu5Qb7RD/WCJ3YqO
KEFIFq+hsiAqvb6RkmX0aVecIJ49aqGX+onsMpLWq9pz2R4BRcH7jo81TIBcosg5
2Kfx2aPcMec7u7RMBHqwiaC4Adp7/vmHhukawfI8xCWLd7wEjvAMP3eeePxR+C0l
SSnn0O9COj8pctvq4eOGJAUXzPa4YtsaX+kILBs+hUdQXmQGVSxyTpakyhhUpGQ8
YyblbHybS8JeYdGqPVS/tn0Rc2DqOSQJetjmXAGhlkEkkGY8i1Ddwe0MaamJozol
g/wHNYcok/OQWglvVThv6EAY2pTSeWelmjUkZi1dnkYNH1VUxxE=
=iyX+
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Differentiate the type of exception the #VC handler raises depending
on code executed in the guest and handle the case where failure to
get the RIP would result in a #GP, as it should, instead of in a #PF
- Disable interrupts while the per-CPU GHCB is held
- Split the #VC handler depending on where the #VC exception has
happened and therefore provide for precise context tracking like the
rest of the exception handlers deal with noinstr regions now
- Add defines for the GHCB version 2 protocol so that further shared
development with KVM can happen without merge conflicts
- The usual small cleanups
* tag 'x86_sev_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Use "SEV: " prefix for messages from sev.c
x86/sev: Add defines for GHCB version 2 MSR protocol requests
x86/sev: Split up runtime #VC handler for correct state tracking
x86/sev: Make sure IRQs are disabled while GHCB is active
x86/sev: Propagate #GP if getting linear instruction address failed
x86/insn: Extend error reporting from insn_fetch_from_user[_inatomic]()
x86/insn-eval: Make 0 a valid RIP for insn_get_effective_ip()
x86/sev: Fix error message in runtime #VC handler
- New AMD models support
- Allow MONITOR/MWAIT to be used for C1 state entry on Hygon too
- Use the special RAPL CPUID bit to detect the functionality on AMD and
Hygon instead of doing family matching.
- Add support for new Intel microcode deprecating TSX on some models and
do not enable kernel workarounds for those CPUs when TSX transactions
always abort, as a result of that microcode update.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDZhzEACgkQEsHwGGHe
VUo5ow//eRwlb1OL/D3jzLT4nTYX8+XdufaJF1HBr1Cf3mdNkiEgyu2bvsXNTpN/
ZP7CFCHibgYeHJ7qTTkhoK1DCe4YHjj450oCgg7pv40Mv9E29Rpszie8y8e/ngkc
g9OiAeEd4A32v8bRMAOOX0UZN4afismXBW0k4iwOAguNFiZ/usrrVYTZpJe3wG65
/YM9FdDZ+Mt7BavJdVyGh03PpzoSMrKyEQ673CHhERQyy5oEublrDSmtt5hQJv1W
4tgNOWpw57Gi7Vs7UYd7VvBQKwQZKeQeHJWu1TXUB6pw0lKYvULH6m0dasvc6cGb
WtCBvbQU9MRP0LvdvYOdgmSgn400z7mEwlUWmAFJLIUlDsuRpZmVQ4C1/OUnOSdx
amb7I3bp1z6Rqjs9ADW5h87qDA+q5OmbIZeIDvuRypQOB3yEktAEdUvWb65b1Fgm
9CpzebxyaOUM9YRxDzDd2joZYKnfI3stF6UCrVXaZwYei+Jmzn5gc8ZOoOX9g6gO
eX/sLW2RWRx6XxilaWZijOHJTjokVUpEnD12aGtKO6ou5QbFTwldc2Metpua42cL
5p8wRxEYeKT/EE/GKy/qIEp624QaInSEmfyq8RFKU4em7GSaSUmoQF5151LfnoRY
ARHkEdz+T8s5RI5xSvUZLRMNYjig9tZas3blYfbJHnU7V2+bspQ=
=wW+k
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Borislav Petkov:
- New AMD models support
- Allow MONITOR/MWAIT to be used for C1 state entry on Hygon too
- Use the special RAPL CPUID bit to detect the functionality on AMD and
Hygon instead of doing family matching.
- Add support for new Intel microcode deprecating TSX on some models
and do not enable kernel workarounds for those CPUs when TSX
transactions always abort, as a result of that microcode update.
* tag 'x86_cpu_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsx: Clear CPUID bits when TSX always force aborts
x86/events/intel: Do not deploy TSX force abort workaround when TSX is deprecated
x86/msr: Define new bits in TSX_FORCE_ABORT MSR
perf/x86/rapl: Use CPUID bit on AMD and Hygon parts
x86/cstate: Allow ACPI C1 FFH MWAIT use on Hygon systems
x86/amd_nb: Add AMD family 19h model 50h PCI ids
x86/cpu: Fix core name for Sapphire Rapids
- Add the required information to the faked APEI-reported mem error so
that the kernel properly attempts to offline the corresponding page, as
it does for kernel-detected correctable errors.
- Fix a typo in AMD's error descriptions.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDZg1cACgkQEsHwGGHe
VUpW8BAAlnwit5Vg4UVocY7mwTi0GvP36Fz2u81kppMROpWgQhhmX35ZxoxgQoSC
0ojKnOJTgGpOdKknmK/vom4ysxNRZxjz0zat9n+cqcfqVwP14KzhjaX1FPXnEQfE
mPkn3v8fsML87glPTzmpELYSOZTpu6OYdiFZAzKL8Gp8aytyh4FamTV2eTxn5ClG
+dejrN0NFiSALarliNttPnpfC5JvQ0KUJFxapYaMd27ssqL/2XMvJmBSpGC+OaZg
lvvv7XuRrIPRZ7lU3Zipz7Rv5r8tTfPUMr33DcUuAZxpXW3zRpds153HktTYSqsv
pZHTTLZ73GbAFVlkjqP6wcAtW2ygKW3lxsPuBSR8aIj8yU7rrrkG4wm2XsvCtrXP
4KrTZLgqGHFQaXbp1BzJzrnLyb6dxZXkEaAnX/7ZygDz+L5aMlG/7XEk4c/R9YbS
bg6NO/Dh1E547cf+bN6/yYNwPjNaO1lGOMU9N2IwjCiHFERzTsFGyFNjqMSGa7Ul
34FZAB11aklqbj+0amu5IeMd8vM3unqTGnYEQCcyG09mdsa9/bjEvEgCirq5FXf3
szjUmGpdtAsxCRZ7SzhsDu1IMT0F2D8hwgJbFSLXmtpiq5WB/EHaYbiqg8F6V36J
bENGE3rLj3HkgWHsLpgEMX2OXh7Pzo3UqwwbtOuYEiwwhvh7CZk=
=1Azq
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:
- Add the required information to the faked APEI-reported mem error so
that the kernel properly attempts to offline the corresponding page,
as it does for kernel-detected correctable errors.
- Fix a typo in AMD's error descriptions.
* tag 'ras_core_for_v5.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/mce_amd: Fix typo "FIfo" -> "Fifo"
x86/mce: Include a MCi_MISC value in faked mce logs
x86/MCE/AMD, EDAC/mce_amd: Add new SMCA bank types
kernel test robot reported some osnoise functions with "no previous
prototype."
Fix these warnings by making local functions static, and by adding:
void osnoise_trace_irq_entry(int id);
void osnoise_trace_irq_exit(int id, const char *desc);
to include/linux/trace.h.
Link: https://lkml.kernel.org/r/e40d3cb4be8bde921f4b40fa6a095cf85ab807bd.1624872608.git.bristot@redhat.com
Fixes: bce29ac9ce ("trace: Add osnoise tracer")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In the context of high-performance computing (HPC), the Operating System
Noise (*osnoise*) refers to the interference experienced by an application
due to activities inside the operating system. In the context of Linux,
NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
system. Moreover, hardware-related jobs can also cause noise, for example,
via SMIs.
The osnoise tracer leverages the hwlat_detector by running a similar
loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
the sources of *osnoise* during its execution. Using the same approach
of hwlat, osnoise takes note of the entry and exit point of any
source of interference, increasing a per-cpu interference counter. The
osnoise tracer also saves an interference counter for each source of
interference. The interference counter for NMI, IRQs, SoftIRQs, and
threads is increased anytime the tool observes these interferences' entry
events. When a noise happens without any interference from the operating
system level, the hardware noise counter increases, pointing to a
hardware-related noise. In this way, osnoise can account for any
source of interference. At the end of the period, the osnoise tracer
prints the sum of all noise, the max single noise, the percentage of CPU
available for the thread, and the counters for the noise sources.
Usage
Write the ASCII text "osnoise" into the current_tracer file of the
tracing system (generally mounted at /sys/kernel/tracing).
For example::
[root@f32 ~]# cd /sys/kernel/tracing/
[root@f32 tracing]# echo osnoise > current_tracer
It is possible to follow the trace by reading the trace file::
[root@f32 tracing]# cat trace
# tracer: osnoise
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth MAX
# || / SINGLE Interference counters:
# |||| RUNTIME NOISE % OF CPU NOISE +-----------------------------+
# TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
# | | | |||| | | | | | | | | | |
<...>-859 [000] .... 81.637220: 1000000 190 99.98100 9 18 0 1007 18 1
<...>-860 [001] .... 81.638154: 1000000 656 99.93440 74 23 0 1006 16 3
<...>-861 [002] .... 81.638193: 1000000 5675 99.43250 202 6 0 1013 25 21
<...>-862 [003] .... 81.638242: 1000000 125 99.98750 45 1 0 1011 23 0
<...>-863 [004] .... 81.638260: 1000000 1721 99.82790 168 7 0 1002 49 41
<...>-864 [005] .... 81.638286: 1000000 263 99.97370 57 6 0 1006 26 2
<...>-865 [006] .... 81.638302: 1000000 109 99.98910 21 3 0 1006 18 1
<...>-866 [007] .... 81.638326: 1000000 7816 99.21840 107 8 0 1016 39 19
In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
tracer prints a message at the end of each period for each CPU that is
running an osnoise/CPU thread. The osnoise specific fields report:
- The RUNTIME IN US reports the amount of time in microseconds that
the osnoise thread kept looping reading the time.
- The NOISE IN US reports the sum of noise in microseconds observed
by the osnoise tracer during the associated runtime.
- The % OF CPU AVAILABLE reports the percentage of CPU available for
the osnoise thread during the runtime window.
- The MAX SINGLE NOISE IN US reports the maximum single noise observed
during the runtime window.
- The Interference counters display how many times each of the respective
interferences happened during the runtime window.
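Taking the first line of the example output above (CPU 0), the fields are
consistent with each other::

  % OF CPU AVAILABLE = 100 * (RUNTIME - NOISE) / RUNTIME
                     = 100 * (1000000 - 190) / 1000000
                     = 99.98100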
Note that the example above shows a high number of HW noise samples.
The reason is that this sample was taken on a virtual machine,
and the host interference is detected as a hardware interference.
Tracer options
The tracer has a set of options inside the osnoise directory, they are:
- osnoise/cpus: CPUs on which an osnoise thread will execute.
- osnoise/period_us: the period of the osnoise thread.
- osnoise/runtime_us: how long an osnoise thread will look for noise.
- osnoise/stop_tracing_us: stop the system tracing if a single noise
higher than the configured value happens. Writing 0 disables this
option.
- osnoise/stop_tracing_total_us: stop the system tracing if total noise
higher than the configured value happens. Writing 0 disables this
option.
- tracing_threshold: the minimum delta between two time() reads to be
considered as noise, in us. When set to 0, the default value will
be used, which is currently 5 us.
Additional Tracing
In addition to the tracer, a set of tracepoints were added to
facilitate the identification of the osnoise source.
- osnoise:sample_threshold: printed anytime a noise is higher than
the configurable tolerance_ns.
- osnoise:nmi_noise: noise from NMI, including the duration.
- osnoise:irq_noise: noise from an IRQ, including the duration.
- osnoise:softirq_noise: noise from a SoftIRQ, including the
duration.
- osnoise:thread_noise: noise from a thread, including the duration.
Note that all the values are *net values*. For example, if while osnoise
is running, another thread preempts the osnoise thread, it will start a
thread_noise duration at the start. Then, an IRQ takes place, preempting
the thread_noise, starting an irq_noise. When the IRQ ends its execution,
it will compute its duration, and this duration will be subtracted from
the thread_noise, in such a way as to avoid the double accounting of the
IRQ execution. This logic is valid for all sources of noise.
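A minimal C sketch of that net-value bookkeeping; the structure and
function names below are illustrative, not the tracer's real ones::

  struct preempting_thread {
          u64 entry_time;         /* when this thread preempted osnoise */
          u64 stolen;             /* IRQ/NMI time observed meanwhile    */
  };

  static void irq_exit_hook(struct preempting_thread *t, u64 irq_duration)
  {
          /* The IRQ is reported as irq_noise on its own ... */
          t->stolen += irq_duration;
  }

  static u64 thread_noise(struct preempting_thread *t, u64 exit_time)
  {
          /* ... so report only the net thread time, avoiding the double
           * accounting of the IRQ execution. */
          return (exit_time - t->entry_time) - t->stolen;
  }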
Here is one example of the usage of these tracepoints::
osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2
In this example, a noise sample of 8 microseconds was reported in the last
line, pointing to two interferences. Looking backward in the trace, the
two previous entries were about the migration thread running after a
timer IRQ execution. The first event is not part of the noise because
it took place one millisecond before.
It is worth noticing that the sum of the durations reported by the
tracepoints is smaller than the eight us reported by sample_threshold.
The reason lies in the overhead of the entry and exit code that runs
before and after any interference execution. This justifies the dual
approach: measuring thread and tracing.
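Putting numbers on the sample above (counting only the two interferences
inside the sample window)::

  irq_noise       2848 ns
  thread_noise    3068 ns
                  -------
  accounted       5916 ns   of the 8723 ns sample_threshold duration
  difference      2807 ns   roughly the entry/exit overhead around them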
Link: https://lkml.kernel.org/r/e649467042d60e7b62714c9c6751a56299d15119.1624372313.git.bristot@redhat.com
Cc: Phil Auld <pauld@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kate Carcia <kcarcia@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
Cc: Clark Willaims <williams@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
[
Made the following functions static:
trace_irqentry_callback()
trace_irqexit_callback()
trace_intel_irqentry_callback()
trace_intel_irqexit_callback()
Added to include/trace.h:
osnoise_arch_register()
osnoise_arch_unregister()
Fixed define logic for LATENCY_FS_NOTIFY
Reported-by: kernel test robot <lkp@intel.com>
]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Add MTE support in guests, complete with tag save/restore interface
- Reduce the impact of CMOs by moving them in the page-table code
- Allow device block mappings at stage-2
- Reduce the footprint of the vmemmap in protected mode
- Support the vGIC on dumb systems such as the Apple M1
- Add selftest infrastructure to support multiple configuration
and apply that to PMU/non-PMU setups
- Add selftests for the debug architecture
- The usual crop of PMU fixes
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmDV2bEPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDEr8P/ivwROx5NwGcHGmU5RfUCT3aFqhtVHHwD/lu
jPcgoO61kz9TelOu6QRaVuK+mVHxcq3iP4R8nPq/QCkUlEXTmK2xkyhXhGXSYpH4
6jM8+BbC3eG7iAxx6H0UM4JTl4Riwat6ZZtXpWEWs9TKqOHOQYFpMkxSttwVZ1CZ
SjbtFvXLEdzKn6PzUWnKdBNMV/mHsdAtohZit9oJOc4ttc8072XxETQ4TFQ+MSvA
j9zY9QPmWzgcZnotqRRu9sbTGO2vxtXuUtY3sjdD8+C9OgSe9qvpnNjymcmfwaMu
1fBkfh65oaO4ItJBdGOUOoEcFqwN5imPiI7CB/O+ZYkO9sBCuTUPSQwPkyiwXb9r
bUkTaQw2nZiNWsqR1x07fQ2sGYbMp5mnmgmqiV4MUWkLmFp9LZATCWYTTn24cBNS
6SjVP6/8S0r3EhLnYjH0Pn1we5PooU1EF6RlCAd3ewYoo+9fPnwjNYwIWH5i5wB7
+tnei44NACAw9cfbos+BYQQ/dY15OSFzLzIMomlabB7OpXOdDg3H6tJnPbFwWwXb
9nF8XdHqxeDVVVrDCAx1BSodSXm9xqgnQM2RDGTUnpVcAfqAr3MXX6VsyKQDzj8T
QXF9qOVCBAABv6BXAvSQ6mvMJZDUVbUPEPhf7kXzF46JsRd6A7wWoU/OnMGHQ/w7
wjvH8HVy
=fWBV
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for v5.14.
- Add MTE support in guests, complete with tag save/restore interface
- Reduce the impact of CMOs by moving them in the page-table code
- Allow device block mappings at stage-2
- Reduce the footprint of the vmemmap in protected mode
- Support the vGIC on dumb systems such as the Apple M1
- Add selftest infrastructure to support multiple configuration
and apply that to PMU/non-PMU setups
- Add selftests for the debug architecture
- The usual crop of PMU fixes
The change which made copy_xstate_to_uabi_buf() usable for
[x]fpregs_get() removed the zeroing of the header which means the
header, which is copied to user space later, contains random stack
content except for the xfeatures member.
Add the memset() back to zero it before usage.
Fixes: eb6f51723f ("x86/fpu: Make copy_xstate_to_kernel() usable for [x]fpregs_get()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/875yy3wb8h.ffs@nanos.tec.linutronix.de
Add description of undocumented parameters. Issues detected by
scripts/kernel-doc.
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20210618223206.29539-1-fmdefrancesco@gmail.com
There is no reason to do an extra XRSTOR from init_fpstate for feature
bits which have been cleared by user space in the FX magic xfeatures
storage.
Just clear them in the task's XSTATE header and do a full restore which
will put these cleared features into init state.
There is no real difference in performance because the current code
already does a full restore when the xfeatures bits are preserved as the
signal frame setup has stored them, which is the full UABI feature set.
[ bp: Use the negated mxcsr_feature_mask in the MXCSR check. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121457.804115017@linutronix.de
If *RSTOR raises an exception, then the slow path is taken. That's wrong:
if the reason was not #PF, then going through the slow path is a waste of
time because it will end up with the same conclusion that the data is
invalid.
Now that the wrapper around *RSTOR returns a negative error code, which is
the negated trap number, it's possible to differentiate.
If the *RSTOR raised #PF then handle it directly in the fast path and if it
was some other exception, e.g. #GP, then give up and do not try the fast
path.
This removes the legacy frame FRSTOR code from the slow path because FRSTOR
is not an ia32_fxstate frame and is therefore handled in the fast path.
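An illustrative sketch of that dispatch, assuming a wrapper that returns 0
on success or the negated trap number; xrstor_from_user() is a placeholder
name, not the exact kernel helper::

  static bool restore_from_user(void __user *buf, unsigned long size)
  {
          int ret;

  retry:
          ret = xrstor_from_user(buf);    /* 0 on success, -trapnr on trap */
          if (!ret)
                  return true;            /* fast path done */

          if (ret != -X86_TRAP_PF)
                  return false;           /* e.g. #GP: the data is invalid, give up */

          /* It was a page fault: fault the user buffer in and retry the fast path. */
          if (fault_in_pages_readable((const char __user *)buf, size))
                  return false;
          goto retry;
  }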
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121457.696022863@linutronix.de
Now that user_xfeatures is correctly set when xsave is enabled, remove
the duplicated initialization of components.
Rename the function while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121457.377341297@linutronix.de
Utilize the check for the extended state magic in the FX software reserved
bytes and set the parameters for restoring fx_only in the relevant members
of fx_sw_user.
This allows further cleanups on top because the data is consistent.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121457.277738268@linutronix.de
Checking for the XSTATE buffer being 64-byte aligned, and if not,
deciding just to restore the FXSR state is daft.
If user space provides an unaligned math frame and has the extended state
magic set in the FX software reserved bytes, then it really can keep the
pieces.
If the frame is unaligned and the FX software magic is not set, then
fx_only is already set and the restore will use fxrstor.
Remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121457.184149902@linutronix.de
__fpu__restore_sig() is convoluted and some of the basic checks can
trivially be done in the calling function as well as the final error
handling of clearing user state.
[ bp: Fixup typos. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121457.086336154@linutronix.de
PKRU for a task is stored in task->thread.pkru when the task is scheduled
out. For 'current' the authoritative source of PKRU is the hardware.
fpu_reset_fpstate() has two callers:
1) fpu__clear_user_states() for !FPU systems. For those PKRU is irrelevant
2) fpu_flush_thread() which is invoked from flush_thread(). flush_thread()
resets the hardware to the kernel restrictive default value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.802850233@linutronix.de
As the PKRU state is managed separately restoring it from the xstate
buffer would be counterproductive as it might either restore a stale
value or reinit the PKRU state to 0.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.606745195@linutronix.de
One nice thing about having PKRU be XSAVE-managed is that it gets naturally
exposed into the XSAVE-using ABIs. Now that XSAVE will not be used to
manage PKRU, these ABIs need to be manually enabled to deal with PKRU.
ptrace() uses copy_uabi_xstate_to_kernel() to collect the tracee's
XSTATE. As PKRU is not in the task's XSTATE buffer, use task->thread.pkru
when filling in the ptrace buffer.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.508770763@linutronix.de
PKRU is currently partly XSAVE-managed and partly not. It has space
in the task XSAVE buffer and is context-switched by XSAVE/XRSTOR.
However, it is switched more eagerly than FPU because there may be a
need for PKRU to be up-to-date for things like copy_to/from_user() since
PKRU affects user-permission memory accesses, not just accesses from
userspace itself.
This leaves PKRU in a very odd position. XSAVE brings very little value
to the table for how Linux uses PKRU except for signal related XSTATE
handling.
Prepare to move PKRU away from being XSAVE-managed. Allocate space in
the thread_struct for it and save/restore it in the context-switch path
separately from the XSAVE-managed features. task->thread_struct.pkru
is only valid when the task is scheduled out. For the current task the
authoritative source is the hardware, i.e. it has to be retrieved via
rdpkru().
Leave the XSAVE code in place for now to ensure bisectability.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.399107624@linutronix.de
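As an illustration of the scheme described above, a minimal sketch of the
save/load pair, assuming hypothetical helper names (pkru_save()/pkru_load());
the real hooks live in the kernel's context-switch and FPU code:
  /*
   * Sketch only: task->thread.pkru is valid for a scheduled-out task;
   * for the running task the hardware register is authoritative.
   */
  static inline void pkru_save(struct thread_struct *prev)
  {
          if (cpu_feature_enabled(X86_FEATURE_OSPKE))
                  prev->pkru = rdpkru();
  }

  static inline void pkru_load(struct thread_struct *next)
  {
          if (cpu_feature_enabled(X86_FEATURE_OSPKE))
                  wrpkru(next->pkru);
  }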
switch_to() and flush_thread() write the task's PKRU value eagerly so
the PKRU value of current is always valid in the hardware.
That means there is no point in restoring PKRU on exit to user or when
reactivating the task's FPU registers in the signal frame setup path.
This allows to remove all the xstate buffer updates with PKRU values once
the PKRU state is stored in thread struct while a task is scheduled out.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.303919033@linutronix.de
Rename it so it's clear that this is about user ABI features which can
differ from the feature set which the kernel saves and restores because the
kernel handles e.g. PKRU differently. But the user ABI (ptrace, signal
frame) expects it to be there.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.211585137@linutronix.de
copy_kernel_to_fpregs() restores all xfeatures but it is also the place
where the AMD FXSAVE_LEAK bug is handled.
That prevents fpregs_restore_userregs() from limiting the restored
features, which is required to untangle PKRU and XSTATE handling and also
for the upcoming supervisor state management.
Move the FXSAVE_LEAK quirk into __copy_kernel_to_fpregs() and deinline that
function which has become rather fat.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.114271278@linutronix.de
Rename it so that it becomes entirely clear what this function is
about. Its purpose is to restore the FPU registers to the state which was
saved in the task's FPU memory state either at context switch or by an
in-kernel FPU user.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121456.018867925@linutronix.de
fpu__clear() currently resets both register state and kernel XSAVE buffer
state. It has two modes: one for all state (supervisor and user) and
another for user state only. fpu__clear_all() uses the "all state"
(user_only=0) mode, while a number of signal paths use the user_only=1
mode.
Make fpu__clear() work only for user state (user_only=1) and remove the
"all state" (user_only=0) code. Rename it to match so it can be used by
the signal paths.
Replace the "all state" (user_only=0) fpu__clear() functionality. Use the
TIF_NEED_FPU_LOAD functionality instead of making any actual hardware
registers changes in this path.
Instead of invoking fpu__initialize(), just memcpy() init_fpstate into the
task's FPU state because that already has the correct format and, in the
case of PKRU, also contains the default PKRU value. Move the actual PKRU write out
into flush_thread() where it belongs and where it will end up anyway when
PKRU and XSTATE have been untangled.
For bisectability a workaround is required which stores the PKRU value in
the xstate memory until PKRU is untangled from XSTATE for context
switching and return to user.
[ Dave Hansen: Polished changelog ]
[ tglx: Fixed the PKRU fallout ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.922729522@linutronix.de
There is no point in using copy_init_pkru_to_fpregs() which in turn calls
write_pkru(). write_pkru() tries to fiddle with the task's xstate buffer
for nothing because the XRSTOR[S](init_fpstate) just cleared the xfeature
flag in the xstate header which makes get_xsave_addr() fail.
It's a useless exercise anyway because the reinitialization activates the
FPU, so before the task's xstate buffer can be used again an XRSTOR[S] must
happen, which in turn dumps the PKRU value.
Get rid of the now unused copy_init_pkru_to_fpregs().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.732508792@linutronix.de
In preparation of making the PKRU management more independent from XSTATES,
write the default PKRU value into the hardware right after enabling PKRU in
CR4. This ensures that switch_to() and copy_thread() have the correct
setting for init task and the per CPU idle threads right away.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.622983906@linutronix.de
X86_FEATURE_OSPKE is enabled first on the boot CPU and the feature flag is
set. Secondary CPUs have to enable CR4.PKE as well and set their per CPU
feature flag. That's ineffective because all call sites have checks for
boot_cpu_data.
Make it smarter and force the feature flag when PKU is enabled on the boot
CPU, which then allows using cpu_feature_enabled(X86_FEATURE_OSPKE) all
over the place. That either compiles the code out when PKEY support is
disabled in Kconfig or uses a static_cpu_has() for the feature check, which
makes a significant difference in hot paths, e.g. context switch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.305113644@linutronix.de
Both function names are a misnomer.
fpu__save() is actually about synchronizing the hardware register state
into the task's memory state so that either a coredump or a math exception
handler can inspect the state at the time the problem happens.
The function guarantees to preserve the register state, while "save" is a
common terminology for saving the current state so it can be modified and
restored later. This is clearly not the case here.
Rename it to fpu_sync_fpstate().
fpu__copy() is used to clone the current task's FPU state when duplicating
task_struct. While the register state is a copy, the rest of the FPU state
is not.
Name it accordingly and remove the really pointless @src argument along
with the warning which comes along with it.
Nothing can ever copy the FPU state of a non-current task. It's clearly
just a consequence of arch_dup_task_struct(), but it makes no sense to
proliferate that further.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.196727450@linutronix.de
write_pkru() was originally used just to write to the PKRU register. It
was mercifully short and sweet and was not out of place in pgtable.h with
some other pkey-related code.
But, later work included a requirement to also modify the task XSAVE
buffer when updating the register. This really is more related to the
XSAVE architecture than to paging.
Move the read/write_pkru() to asm/pkru.h. pgtable.h won't miss them.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121455.102647114@linutronix.de
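For reference, the raw accessors such a header wraps look roughly like this
(a sketch of the RDPKRU/WRPKRU instruction wrappers; the exact constraints
in the kernel may differ slightly):
  static inline u32 rdpkru(void)
  {
          u32 ecx = 0, edx, pkru;

          /* RDPKRU: reads PKRU into EAX, requires ECX = 0 */
          asm volatile(".byte 0x0f,0x01,0xee"
                       : "=a" (pkru), "=d" (edx)
                       : "c" (ecx));
          return pkru;
  }

  static inline void wrpkru(u32 pkru)
  {
          u32 ecx = 0, edx = 0;

          /* WRPKRU: writes EAX to PKRU, requires ECX = EDX = 0 */
          asm volatile(".byte 0x0f,0x01,0xef"
                       : : "a" (pkru), "c" (ecx), "d" (edx));
  }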
The copy functions for the independent features are horribly named and the
supervisor and independent part is just overengineered.
The point is that the supplied mask has either to be a subset of the
independent features or a subset of the task->fpu.xstate managed features.
Rewrite it so it checks for invalid overlaps of these areas in the caller
supplied feature mask. Rename it so it follows the new naming convention
for these operations. Mop up the function documentation.
This allows using that function for other purposes as well.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lkml.kernel.org/r/20210623121455.004880675@linutronix.de
The salient feature of "dynamic" XSTATEs is that they are not part of the
main task XSTATE buffer. The fact that they are dynamically allocated is
irrelevant and will become quite confusing when user math XSTATEs start
being dynamically allocated. Rename them to "independent" because they
are independent of the main XSTATE code.
This is just a search-and-replace with some whitespace updates to keep
things aligned.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/1eecb0e4f3e07828ebe5d737ec77dc3b708fad2d.1623388344.git.luto@kernel.org
Link: https://lkml.kernel.org/r/20210623121454.911450390@linutronix.de
The FNSAVE support requires conditionals in quite some call paths because
FNSAVE reinitializes the FPU hardware. If the save has to preserve the FPU
register state then the caller has to conditionally restore it from memory
when FNSAVE is in use.
This also requires a conditional in context switch because the restore
avoidance optimization cannot work with FNSAVE. As this only affects 20+
year old CPUs there is really no reason to keep this optimization
effective for FNSAVE. It's about time to stop optimizing for antiques.
Just unconditionally FRSTOR the save content to the registers and clean up
the conditionals all over the place.
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.617369268@linutronix.de
A copy is guaranteed to leave the source intact, which is not the case when
FNSAVE is used as that reinitializes the registers.
Save does not make such guarantees and it matches what this is about,
i.e. to save the state for a later restore.
Rename it to save_fpregs_to_fpstate().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.508853062@linutronix.de
copy_uabi_from_user_to_xstate() and copy_uabi_from_kernel_to_xstate() are
almost identical except for the copy function.
Unify them.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20210623121454.414215896@linutronix.de
Rename them to reflect that these functions deal with user space format
XSAVE buffers.
copy_kernel_to_xstate() -> copy_uabi_from_kernel_to_xstate()
copy_user_to_xstate() -> copy_sigframe_from_user_to_xstate()
Again a clear statement that these functions deal with user space ABI.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.318485015@linutronix.de
The functions for the fnsave/frstor operations are horribly named and a
permanent source of confusion.
Rename:
copy_kernel_to_fregs() to frstor()
copy_fregs_to_user() to fnsave_to_user_sigframe()
copy_user_to_fregs() to frstor_from_user_sigframe()
so it's clear what these are doing. All these functions are really low
level wrappers around the equally named instructions, so mapping to the
documentation is just natural.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.223594101@linutronix.de
The functions for the fxsave/fxrstor operations are horribly named and a
permanent source of confusion.
Rename:
copy_fxregs_to_kernel() to fxsave()
copy_kernel_to_fxregs() to fxrstor()
copy_fxregs_to_user() to fxsave_to_user_sigframe()
copy_user_to_fxregs() to fxrstor_from_user_sigframe()
so it's clear what these are doing. All these functions are really low
level wrappers around the equally named instructions, so mapping to the
documentation is just natural.
While at it, replace the static_cpu_has(X86_FEATURE_FXSR) with
use_fxsr() to be consistent with the rest of the code.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121454.017863494@linutronix.de
The functions for the xsave[s]/xrstor[s] operations are horribly named and
a permanent source of confusion.
Rename:
copy_xregs_to_user() to xsave_to_user_sigframe()
copy_user_to_xregs() to xrstor_from_user_sigframe()
so it's entirely clear what this is about. This is also a clear indicator
of the potentially different storage format because this is user ABI and
cannot use compacted format.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.924266705@linutronix.de
The functions for the xsave[s]/xrstor[s] operations are horribly named and
a permanent source of confusion.
Rename:
copy_xregs_to_kernel() to os_xsave()
copy_kernel_to_xregs() to os_xrstor()
These are truly low level wrappers around the actual instructions
XSAVE[OPT]/XRSTOR and XSAVES/XRSTORS with the twist that the selection
based on the available CPU features happens with an alternative to avoid
conditionals all over the place and to provide the best performance for hot
paths.
The os_ prefix tells that this is the OS selected mechanism.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.830239347@linutronix.de
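A rough sketch of the idea, using a plain runtime branch instead of the
kernel's alternatives patching; use_xsaves() and the untyped buffer are
placeholders here:
  static inline void os_xsave_sketch(void *xstate, unsigned long long mask)
  {
          unsigned int lmask = mask;
          unsigned int hmask = mask >> 32;

          /*
           * The kernel selects the variant via ALTERNATIVE at patch time.
           * The buffer must be 64-byte aligned for XSAVE*.
           */
          if (use_xsaves())       /* placeholder predicate */
                  asm volatile("xsaves %0"
                               : "+m" (*(char *)xstate)
                               : "a" (lmask), "d" (hmask)
                               : "memory");
          else
                  asm volatile("xsave %0"
                               : "+m" (*(char *)xstate)
                               : "a" (lmask), "d" (hmask)
                               : "memory");
  }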
If the fast path of restoring the FPU state on sigreturn fails or is not
taken, and the current task's FPU is active, then the FPU has to be
deactivated for the slow path to allow a safe update of the task's FPU
memory state.
With supervisor states enabled, this requires saving the supervisor state
in the memory state first. Supervisor states require XSAVES, so saving only
the supervisor state requires reshuffling the memory buffer because XSAVES
uses the compacted format and therefore stores the supervisor states at the
beginning of the memory state. That's just an overengineered optimization.
Get rid of it and save the full state for this case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.734561971@linutronix.de
The function does a sanity check with a WARN_ON_ONCE() but happily proceeds
when the pkey argument is out of range.
Clean it up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.635764326@linutronix.de
This function is pointlessly global and a complete misnomer because its
usage is related to both supervisor state checks and compacted format
checks. Remove it and just make the conditions check the XSAVES feature.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.425493349@linutronix.de
The only use case for fpu__write_begin() is the set() callback of regset,
so the function is pointlessly global.
Move it to the regset code and rename it to fpu_force_restore(), which
describes exactly what the function does.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.328652975@linutronix.de
The function can only be used from the regset get() callbacks safely. So
there is no reason to have it globally exposed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.234942936@linutronix.de
Use the new functionality of copy_xstate_to_uabi_buf() to retrieve the
FX state when XSAVE* is in use. This avoids overwriting the FPU state
buffer with fpstate_sanitize_xstate(), which is error prone and duplicated
code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121453.014441775@linutronix.de
Use the new functionality of copy_xstate_to_uabi_buf() to retrieve the
FX state when XSAVE* is in use. This avoids overwriting the FPU state
buffer with fpstate_sanitize_xstate() which is error prone and duplicated
code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.901736860@linutronix.de
When XSAVE with the init state optimization is used, a component's state
in the task's xsave buffer can be stale when the corresponding feature bit
is not set.
fpregs_get() and xfpregs_get() invoke fpstate_sanitize_xstate() to update
the task's xsave buffer before retrieving the FX or FP state. That's just
duplicated code as copy_xstate_to_kernel() already handles this correctly.
Add a copy mode argument to the function which allows restricting the state
copy to the FP and SSE features.
Also rename the function to copy_xstate_to_uabi_buf() so the name reflects
what it is doing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.805327286@linutronix.de
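The resulting interface shape is roughly the following; treat the exact
parameter list as an approximation:
  /* Sketch of the copy mode selector; comments show the typical callers. */
  enum xstate_copy_mode {
          XSTATE_COPY_FP,         /* FP state only, e.g. fpregs_get() */
          XSTATE_COPY_FX,         /* FP + SSE state, e.g. xfpregs_get() */
          XSTATE_COPY_XSAVE,      /* full UABI format, e.g. xstateregs_get() */
  };

  void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
                               enum xstate_copy_mode copy_mode);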
fpregs_set() has unnecessary complexity to support short or nonzero-offset
writes and to handle the case in which a copy from userspace overwrites
some of the target buffer and then fails. Support for partial writes is
useless -- just require that the write has offset 0 and the correct size,
and copy into a temporary kernel buffer to avoid clobbering the state if
the user access fails.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.710467587@linutronix.de
There is no benefit from accepting and silently changing an invalid MXCSR
value supplied via ptrace(). Instead, return -EINVAL on invalid input.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.613614842@linutronix.de
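The check boils down to rejecting reserved MXCSR bits instead of silently
masking them; a sketch with an illustrative helper name (mxcsr_feature_mask
is determined at boot via FXSAVE):
  /* Sketch: validate a user-supplied MXCSR value before using it. */
  static int validate_user_mxcsr(u32 mxcsr)
  {
          if (mxcsr & ~mxcsr_feature_mask)
                  return -EINVAL;         /* reserved bits set, reject */
          return 0;
  }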
xfpregs_set() was incomprehensible. Almost all of the complexity was due
to trying to support nonsensically sized writes or -EFAULT errors that
would have partially or completely overwritten the destination before
failing. Nonsensically sized input would only have been possible using
PTRACE_SETREGSET on REGSET_XFP. Fortunately, it appears (based on Debian
code search results) that no one uses that API at all, let alone with the
wrong sized buffer. Failed user access can be handled more cleanly by
first copying to kernel memory.
Just rewrite it to require sensible input.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.504234607@linutronix.de
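The resulting pattern is roughly: accept only a full-sized, zero-offset
write, stage the data in a temporary kernel buffer, validate it, and only
then touch the task's state. A sketch with illustrative function and
parameter names:
  static int xfpregs_set_sketch(struct fpu *fpu, unsigned int pos,
                                unsigned int count, const void __user *ubuf)
  {
          struct fxregs_state newstate;

          /* Nonsensically sized or offset writes are simply rejected. */
          if (pos != 0 || count != sizeof(newstate))
                  return -EINVAL;

          /* A failed user access leaves the target state untouched. */
          if (copy_from_user(&newstate, ubuf, sizeof(newstate)))
                  return -EFAULT;

          /* Validate before committing anything. */
          if (newstate.mxcsr & ~mxcsr_feature_mask)
                  return -EINVAL;

          fpu_force_restore(fpu);
          memcpy(&fpu->state.fxsave, &newstate, sizeof(newstate));
          return 0;
  }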
ptrace() has interfaces that let a ptracer inspect a ptracee's register state.
This includes XSAVE state. The ptrace() ABI includes a hardware-format XSAVE
buffer for both the SETREGS and GETREGS interfaces.
In the old days, the kernel buffer and the ptrace() ABI buffer were the
same boring non-compacted format. But, since the advent of supervisor
states and the compacted format, the kernel buffer has diverged from the
format presented in the ABI.
This leads to two paths in the kernel:
1. Effectively a verbatim copy_to_user() which just copies the kernel buffer
out to userspace. This is used when the kernel buffer is kept in the
non-compacted form which means that it shares a format with the ptrace
ABI.
2. A one-state-at-a-time path: copy_xstate_to_kernel(). This is theoretically
slower since it does a bunch of piecemeal copies.
Remove the verbatim copy case. Speed probably does not matter in this path,
and the vast majority of new hardware will use the one-state-at-a-time path
anyway. This ensures greater testing for the "slow" path.
This also makes enabling PKRU in this interface easier since a single path
can be patched instead of two.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.408457100@linutronix.de
Instead of masking out reserved bits, check them and reject the provided
state as invalid if not zero.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.308388343@linutronix.de
xstateregs_set() operates on a stopped task and tries to copy the provided
buffer into the task's fpu.state.xsave buffer.
Any error while copying or invalid state detected after copying results in
wiping the target task's FPU state completely including supervisor states.
That's just wrong. The caller supplied invalid data or has a problem with
unmapped memory, so there is absolutely no justification to corrupt the
target state.
Fix this with the following modifications:
1) If data has to be copied from userspace, allocate a buffer and copy from
user first.
2) Use copy_kernel_to_xstate() unconditionally so that header checking
works correctly.
3) Return on error without corrupting the target state.
This prevents corrupting states and lets the caller deal with the problem
it caused in the first place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.214903673@linutronix.de
If the count argument is larger than the xstate size, this will happily
copy beyond the end of xstate.
Fixes: 91c3dba7db ("x86/fpu/xstate: Fix PTRACE frames for XSAVES")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.120741557@linutronix.de
They are only used in fpstate_init() and there is no point in having them
in a header just to make reading the code harder.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121452.023118522@linutronix.de
This function is really not doing what the comment advertises:
"Find supported xfeatures based on cpu features and command-line input.
This must be called after fpu__init_parse_early_param() is called and
xfeatures_mask is enumerated."
fpu__init_parse_early_param() does not exist anymore and the function just
returns a constant.
Remove it and fix the caller and get rid of further references to
fpu__init_parse_early_param().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121451.816404717@linutronix.de
Nothing has to modify this after init.
But of course there is code which unconditionally masks
xfeatures_mask_all on CPU hotplug. This goes unnoticed during boot
hotplug because at that point the variable is still RW mapped.
This is broken in several ways:
1) Masking this in post init CPU hotplug means that any
modification of this state goes unnoticed until actual hotplug
happens.
2) If that ever happens then these bogus feature bits are already
populated all over the place and the system is in inconsistent state
vs. the compacted XSTATE offsets. If at all then this has to panic the
machine because the inconsistency cannot be undone anymore.
Make this a one-time paranoia check in xstate init code and disable
xsave when this happens.
Reported-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121451.712803952@linutronix.de
This cannot work and it's unclear how that ever made a difference.
init_fpstate.xsave.header.xfeatures is always 0 so get_xsave_addr() will
always return a NULL pointer, which will prevent storing the default PKRU
value in init_fpstate.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121451.451391598@linutronix.de
The gap handling in copy_xstate_to_kernel() is wrong when XSAVES is in
use.
Using init_fpstate for copying the init state of features which are
not set in the xstate header is only correct for the legacy area, but
not for the extended features area because when XSAVES is in use then
init_fpstate is in compacted form which means the xstate offsets which
are used to copy from init_fpstate are not valid.
Fortunately, this is not a real problem today because all extended
features in use have an all-zeros init state, but it is wrong
nevertheless and with a potentially dynamically sized init_fpstate this
would result in an access outside of the init_fpstate.
Fix this by keeping track of the last copied state in the target buffer and
explicitly zero it when there is a feature or alignment gap.
Use the compacted offset when accessing the extended feature space in
init_fpstate.
As this is not a functional issue on older kernels this is intentionally
not tagged for stable.
Fixes: b8be15d588 ("x86/fpu/xstate: Re-enable XSAVES")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210623121451.294282032@linutronix.de
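The fix amounts to remembering how far the destination buffer has been
filled and zeroing any hole before the next copied feature. A sketch of the
per-feature loop body (membuf helpers as in the regset code; copy_feature()
and the index variable are illustrative):
  /* Zero the gap between the last written byte and this feature's offset. */
  if (zerofrom < xstate_offsets[i])
          membuf_zero(&to, xstate_offsets[i] - zerofrom);

  copy_feature(&to, xsave, i);            /* illustrative helper */

  /* Remember the last byte actually written to the target buffer. */
  zerofrom = xstate_offsets[i] + xstate_sizes[i];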
Pick up dependent changes which either went mainline (x86/urgent is
based on -rc7 and that contains them) as urgent fixes and the current
x86/urgent branch which contains two more urgent fixes, so that the
bigger FPU rework can be based on top.
Signed-off-by: Borislav Petkov <bp@suse.de>
The source file has been renamed from sev-es.c to sev.c, but the
messages are still prefixed with "SEV-ES: ". Change that to "SEV: " to
make it consistent.
Fixes: e759959fe3 ("x86/sev-es: Rename sev-es.{ch} to sev.{ch}")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210622144825.27588-4-joro@8bytes.org
Backmerge tag 'v5.13-rc7' into drm-next
Backmerge Linux 5.13-rc7 to make some pulls from later bases apply,
and to bake in the conflicts so far.
Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in
a 500-millisecond WATCHDOG_INTERVAL. This requires that clocks be skewed
by more than 12.5% in order to be marked unstable. Except that a clock
that is skewed by that much is probably destroying unsuspecting software
right and left. And given that there are now checks for false-positive
skews due to delays between reading the two clocks, it should be possible
to greatly decrease WATCHDOG_THRESHOLD, at least for fine-grained clocks
such as TSC.
Therefore, add a new uncertainty_margin field to the clocksource structure
that contains the maximum uncertainty in nanoseconds for the corresponding
clock. This field may be initialized manually, as it is for
clocksource_tsc_early and clocksource_jiffies, which is copied to
refined_jiffies. If the field is not initialized manually, it will be
computed at clock-registration time as the period of the clock in question,
based on the scale and freq parameters to the
__clocksource_update_freq_scale() function. If either of those two parameters is zero, the
tens-of-milliseconds WATCHDOG_THRESHOLD is used as a cowardly alternative
to dividing by zero. No matter how the uncertainty_margin field is
calculated, it is bounded below by twice WATCHDOG_MAX_SKEW, that is, by 100
microseconds.
Note that manually initialized uncertainty_margin fields are not adjusted,
but there is a WARN_ON_ONCE() that triggers if any such field is less than
twice WATCHDOG_MAX_SKEW. This WARN_ON_ONCE() is intended to discourage
production use of the one-nanosecond uncertainty_margin values that are
used to test the clock-skew code itself.
The actual clock-skew check uses the sum of the uncertainty_margin fields
of the two clocksource structures being compared. Integer overflow is
avoided because the largest computed value of the uncertainty_margin
fields is one billion (10^9), and double that value fits into an
unsigned int. However, if someone manually specifies (say) UINT_MAX,
they will get what they deserve.
Note that the refined_jiffies uncertainty_margin field is initialized to
TICK_NSEC, which means that skew checks involving this clocksource will
be sufficiently forgiving. In a similar vein, the clocksource_tsc_early
uncertainty_margin field is initialized to 32*NSEC_PER_MSEC, which
replicates the current behavior and allows custom setting if needed
in order to address the rare skews detected for this clocksource in
current mainline.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-4-paulmck@kernel.org
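Schematically, the registration-time logic described above looks like this
(a sketch, not the literal kernel code):
  /* Default the margin to roughly one clock period, floored at 100us. */
  if (!cs->uncertainty_margin) {
          if (scale && freq)
                  cs->uncertainty_margin = NSEC_PER_SEC / (scale * freq);
          else
                  cs->uncertainty_margin = WATCHDOG_THRESHOLD;

          if (cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW)
                  cs->uncertainty_margin = 2 * WATCHDOG_MAX_SKEW;
  } else {
          /* Manually set margins are kept, but tiny test values are flagged. */
          WARN_ON_ONCE(cs->uncertainty_margin < 2 * WATCHDOG_MAX_SKEW);
  }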
Some sorts of per-CPU clock sources have a history of going out of
synchronization with each other. However, this problem has purportedly been
solved in the past ten years. Except that it is all too possible that the
problem has instead simply been made less likely, which might mean that
some of the occasional "Marking clocksource 'tsc' as unstable" messages
might be due to desynchronization. How would anyone know?
Therefore apply CPU-to-CPU synchronization checking to newly unstable
clocksources that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag.
Lists of desynchronized CPUs are printed, with the caveat that if it
is the reporting CPU that is itself desynchronized, it will appear that
all the other clocks are wrong. Just like in real life.
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Feng Tang <feng.tang@intel.com>
Link: https://lore.kernel.org/r/20210527190124.440372-2-paulmck@kernel.org
The XSAVE init code initializes all enabled and supported components with
XRSTOR(S) to init state. Then it XSAVEs the state of the components back
into init_fpstate which is used in several places to fill in the init state
of components.
This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because
those use the init optimization and skip writing state of components which
are in init state. So init_fpstate.xsave still contains all zeroes after
this operation.
There are two ways to solve that:
1) Use XSAVE unconditionally, but that requires reshuffling the buffer when
XSAVES is enabled because XSAVES uses the compacted format.
2) Save the components which are known to have a non-zero init state by other
means.
Looking deeper, #2 is the right thing to do because all components the
kernel supports have all-zeroes init state except the legacy features (FP,
SSE). Those cannot be hard coded because the states are not identical on all
CPUs, but they can be saved with FXSAVE which avoids all conditionals.
Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with
a BUILD_BUG_ON() which reminds developers to validate that a newly added
component has all zeroes init state. As a bonus remove the now unused
copy_xregs_to_kernel_booting() crutch.
The XSAVE and reshuffle method can still be implemented in the unlikely
case that components are added which have a non-zero init state and no
other means to save them. For now, FXSAVE is just simple and good enough.
[ bp: Fix a typo or two in the text. ]
Fixes: 6bad06b768 ("x86, xsave: Use xsaveopt in context-switch path when supported")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de
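In essence the change is a single FXSAVE into init_fpstate's legacy area;
a sketch of that step (the surrounding guard is illustrative):
  /*
   * Sketch: capture the CPU-specific legacy FP/SSE init state directly.
   * All other supported components are expected to have an all-zeroes
   * init state; a build-time assertion in the real code reminds anyone
   * adding a new component to validate that.
   */
  if (boot_cpu_has(X86_FEATURE_FXSR))
          fxsave(&init_fpstate.fxsave);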
sanitize_restored_user_xstate() preserves the supervisor states only
when the fx_only argument is zero, which allows unprivileged user space
to put supervisor states back into init state.
Preserve them unconditionally.
[ bp: Fix a typo or two in the text. ]
Fixes: 5d6b6a6f9b ("x86/fpu/xstate: Update sanitize_restored_xstate() for supervisor xstates")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20210618143444.438635017@linutronix.de
Split up the #VC handler code into a from-user and a from-kernel part.
This allows clean and correct state tracking, as the #VC handler needs
to enter NMI-state when raised from kernel mode and plain IRQ state when
raised from user-mode.
Fixes: 62441a1fb5 ("x86/sev-es: Correctly track IRQ states in runtime #VC handler")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210618115409.22735-3-joro@8bytes.org
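The split follows the generic entry-code pattern; a sketch with illustrative
handler names, using the existing irqentry helpers:
  /* Raised from user space: a normal interrupted-user entry. */
  static void vc_from_user_sketch(struct pt_regs *regs, unsigned long error_code)
  {
          irqentry_enter_from_user_mode(regs);
          /* handle the #VC exit reason; may fault and sleep safely here */
          irqentry_exit_to_user_mode(regs);
  }

  /* Raised from kernel mode: must be treated like an NMI. */
  static void vc_from_kernel_sketch(struct pt_regs *regs, unsigned long error_code)
  {
          irqentry_state_t irq_state = irqentry_nmi_enter(regs);

          /* handle the #VC exit reason under NMI-like constraints */
          irqentry_nmi_exit(regs, irq_state);
  }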
The #VC handler only cares about IRQs being disabled while the GHCB is
active, as it must not be interrupted by something which could cause
another #VC while it holds the GHCB (NMI is the exception, for which the
backup GHCB exists).
Make sure nothing interrupts the code path while the GHCB is active
by making sure that callers of __sev_{get,put}_ghcb() have disabled
interrupts upfront.
[ bp: Massage commit message. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210618115409.22735-2-joro@8bytes.org
Merge tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"A first set of urgent fixes to the FPU/XSTATE handling mess^W code.
(There's a lot more in the pipe):
- Prevent corruption of the XSTATE buffer in signal handling by
validating what is being copied from userspace first.
- Invalidate other task's preserved FPU registers on XRSTOR failure
  (#PF) because the latter can still modify some of them.
- Restore the proper PKRU value in case userspace modified it
- Reset FPU state when signal restoring fails
Other:
- Map EFI boot services data memory as encrypted in a SEV guest so
that the guest can access it and actually boot properly
- Two SGX correctness fixes: proper resources freeing and a NUMA fix"
* tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Avoid truncating memblocks for SGX memory
x86/sgx: Add missing xa_destroy() when virtual EPC is destroyed
x86/fpu: Reset state for all signal restore failures
x86/pkru: Write hardware init value to PKRU when xstate is init
x86/process: Check PF_KTHREAD and not current->mm for kernel threads
x86/fpu: Invalidate FPU state after a failed XRSTOR from a user buffer
x86/fpu: Prevent state corruption in __fpu__restore_sig()
x86/ioremap: Map EFI-reserved memory as encrypted for SEV
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
task_is_running(p).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
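The helper itself is little more than a wrapper around the comparison, most
likely along these lines:
  /* Sketch of the new predicate replacing open-coded state checks. */
  #define task_is_running(task)   (READ_ONCE((task)->state) == TASK_RUNNING)
With it, open-coded 'p->state == TASK_RUNNING' checks become
'task_is_running(p)'.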
This commit in sched/urgent moved the cfs_rq_is_decayed() function:
a7b359fc6a: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
and this fresh commit in sched/core modified it in the old location:
9e077b52d8: ("sched/pelt: Check that *_avg are null when *_sum are")
Merge the two variants.
Conflicts:
kernel/sched/fair.c
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Previously, to detect nested virtualization enlightenment support,
we were using the HV_X64_ENLIGHTENED_VMCS_RECOMMENDED feature bit of
the HYPERV_CPUID_ENLIGHTMENT_INFO.EAX CPUID leaf, as documented in the TLFS:
"Bit 14: Recommend a nested hypervisor using the enlightened VMCS
interface. Also indicates that additional nested enlightenments
may be available (see leaf 0x4000000A)".
Enlightened VMCS, however, is an Intel-only feature, so the above
detection method doesn't work for AMD. So, use the
HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS.EAX CPUID information ("The
maximum input value for hypervisor CPUID information.") and this
works for both AMD and Intel.
Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Message-Id: <43b25ff21cd2d9a51582033c9bdd895afefac056.1622730232.git.viremana@linux.microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
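Conceptually the new check is just a comparison of the advertised maximum
hypervisor CPUID leaf against the nested-features leaf; a sketch with an
illustrative variable name:
  /*
   * cpuid(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS).eax reports the maximum
   * hypervisor CPUID leaf; if it covers the nested-features leaf, nested
   * enlightenments can be probed on both Intel and AMD.
   */
  bool nested_available = hv_max_cpuid_leaf >= HYPERV_CPUID_NESTED_FEATURES;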
xa_destroy() needs to be called to destroy a virtual EPC's page array
before calling kfree() to free the virtual EPC. Currently it is not
called so add the missing xa_destroy().
Fixes: 540745ddbc ("x86/sgx: Introduce virtual EPC for use by KVM guests")
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Yang Zhong <yang.zhong@intel.com>
Link: https://lkml.kernel.org/r/20210615101639.291929-1-kai.huang@intel.com
As a result of TSX deprecation, some processors always abort TSX
transactions by default after a microcode update.
When the TSX feature cannot be used, it is better to hide it. Clear the
CPUID.RTM and CPUID.HLE bits when TSX transactions always abort.
[ bp: Massage commit message and comments. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Link: https://lkml.kernel.org/r/5209b3d72ffe5bd3cafdcc803f5b883f785329c3.1623704845.git-series.pawan.kumar.gupta@linux.intel.com
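The net effect is a boot-time clearing of the two feature bits when the CPU
reports that transactions always abort; a sketch with an illustrative
predicate name:
  /* Hide HLE/RTM when every TSX transaction aborts anyway. */
  if (tsx_always_aborts()) {              /* illustrative predicate */
          setup_clear_cpu_cap(X86_FEATURE_HLE);
          setup_clear_cpu_cap(X86_FEATURE_RTM);
  }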
When an instruction is fetched from user-space, segmentation needs to
be taken into account. This means that getting the linear address of an
instruction can fail. Hardware would raise a #GP exception in that case,
but the #VC exception handler would emulate it as a page-fault.
The insn_fetch_from_user*() functions now provide the relevant
information in case of a failure. Use that and propagate a #GP when the
linear address of an instruction to fetch could not be calculated.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210614135327.9921-7-joro@8bytes.org
The error reporting from the insn_fetch_from_user*() functions is not
very verbose. Extend it to include information on whether the linear
RIP could not be calculated or whether the memory access faulted.
This will be used in the SEV-ES code to propagate the correct
exception depending on what went wrong during instruction fetch.
[ bp: Massage comments. ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210614135327.9921-6-joro@8bytes.org
The runtime #VC handler is not "early" anymore. Fix the copy&paste error
and remove that word from the error message.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20210614135327.9921-2-joro@8bytes.org
Merge tag 'perf-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Misc fixes:
- Fix the NMI watchdog on ancient Intel CPUs
- Remove a misguided, NMI-unsafe KASAN callback from the NMI-safe
irq_work path used by perf.
- Fix uncore events on Ice Lake servers.
- Someone booted maxcpus=1 on an SNB-EP, and the uncore driver
emitted warnings and was probably buggy. Fix it.
- KCSAN found a genuine data race in the core perf code. Somewhat
ironically the bug was introduced through a recent race fix. :-/
In our defense, the new race window was much more narrow. Fix it"
* tag 'perf-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs
irq_work: Make irq_work_queue() NMI-safe again
perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server
perf/x86/intel/uncore: Fix a kernel WARNING triggered by maxcpus=1
perf: Fix data race between pin_count increment/decrement