linux-stable/arch/x86/kernel/fpu/xstate.c

// SPDX-License-Identifier: GPL-2.0-only
/*
 * xsave/xrstor support.
 *
 * Author: Suresh Siddha <suresh.b.siddha@intel.com>
 */
#include <linux/compat.h>
#include <linux/cpu.h>
#include <linux/mman.h>
#include <linux/pkeys.h>
#include <linux/seq_file.h>
#include <linux/proc_fs.h>
#include <asm/fpu/api.h>
#include <asm/fpu/internal.h>
#include <asm/fpu/signal.h>
#include <asm/fpu/regset.h>
#include <asm/fpu/xstate.h>
#include <asm/tlbflush.h>
#include <asm/cpufeature.h>

/*
 * Although we spell it out in here, the Processor Trace
 * xfeature is completely unused. We use other mechanisms
 * to save/restore PT state in Linux.
 */
static const char *xfeature_names[] =
{
	"x87 floating point registers",
	"SSE registers",
	"AVX registers",
	"MPX bounds registers",
	"MPX CSR",
	"AVX-512 opmask",
	"AVX-512 Hi256",
	"AVX-512 ZMM_Hi256",
	"Processor Trace (unused)",
	"Protection Keys User registers",
	"PASID state",
	"unknown xstate feature",
};

static short xsave_cpuid_features[] __initdata = {
	X86_FEATURE_FPU,
	X86_FEATURE_XMM,
	X86_FEATURE_AVX,
	X86_FEATURE_MPX,
	X86_FEATURE_MPX,
	X86_FEATURE_AVX512F,
	X86_FEATURE_AVX512F,
	X86_FEATURE_AVX512F,
	X86_FEATURE_INTEL_PT,
	X86_FEATURE_PKU,
	X86_FEATURE_ENQCMD,
};

/*
 * This represents the full set of bits that should ever be set in a kernel
 * XSAVE buffer, both supervisor and user xstates.
 */
u64 xfeatures_mask_all __ro_after_init;

static unsigned int xstate_offsets[XFEATURE_MAX] __ro_after_init =
	{ [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_sizes[XFEATURE_MAX] __ro_after_init =
	{ [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_comp_offsets[XFEATURE_MAX] __ro_after_init =
	{ [ 0 ... XFEATURE_MAX - 1] = -1};
static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] __ro_after_init =
	{ [ 0 ... XFEATURE_MAX - 1] = -1};

/*
 * The kernel's XSAVE area can be in standard or compacted format;
 * it is always in standard format for user mode. This is the user
 * mode standard format size used for signal and ptrace frames.
 */
unsigned int fpu_user_xstate_size __ro_after_init;

/*
 * Return whether the system supports a given xfeature.
 *
 * Also return the name of the (most advanced) feature that the caller requested:
 */
int cpu_has_xfeatures(u64 xfeatures_needed, const char **feature_name)
{
	u64 xfeatures_missing = xfeatures_needed & ~xfeatures_mask_all;

	if (unlikely(feature_name)) {
		long xfeature_idx, max_idx;
		u64 xfeatures_print;
		/*
		 * Use fls64() here so that we can print the most advanced
		 * feature that was requested but is missing. So if a driver
		 * asks about "XFEATURE_MASK_SSE | XFEATURE_MASK_YMM" we'll print the
		 * missing AVX feature - this is the most informative message
		 * to users:
		 */
		if (xfeatures_missing)
			xfeatures_print = xfeatures_missing;
		else
			xfeatures_print = xfeatures_needed;

		xfeature_idx = fls64(xfeatures_print)-1;
		max_idx = ARRAY_SIZE(xfeature_names)-1;
		xfeature_idx = min(xfeature_idx, max_idx);

		*feature_name = xfeature_names[xfeature_idx];
	}

	if (xfeatures_missing)
		return 0;

	return 1;
}
EXPORT_SYMBOL_GPL(cpu_has_xfeatures);

static bool xfeature_is_supervisor(int xfeature_nr)
{
	/*
	 * Extended State Enumeration Sub-leaves (EAX = 0DH, ECX = n, n > 1)
	 * return ECX[0] set (1) for a supervisor state, and cleared (0)
	 * for a user state.
	 */
	u32 eax, ebx, ecx, edx;

	cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx);
	return ecx & 1;
}

/*
 * Enable the extended processor state save/restore feature.
 * Called once per CPU onlining.
 */
void fpu__init_cpu_xstate(void)
{
	if (!boot_cpu_has(X86_FEATURE_XSAVE) || !xfeatures_mask_all)
		return;

	cr4_set_bits(X86_CR4_OSXSAVE);

	/*
	 * XCR_XFEATURE_ENABLED_MASK (aka XCR0) sets user features
	 * managed by XSAVE{C, OPT, S} and XRSTOR{S}. Only XSAVE user
	 * states can be set here.
	 */
	xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask_user());

	/*
	 * MSR_IA32_XSS sets supervisor states managed by XSAVES.
	 */
	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
				     xfeatures_mask_dynamic());
	}
}

static bool xfeature_enabled(enum xfeature xfeature)
{
	return xfeatures_mask_all & BIT_ULL(xfeature);
}

/*
 * Record the offsets and sizes of various xstates contained
 * in the XSAVE state memory layout.
 */
static void __init setup_xstate_features(void)
{
	u32 eax, ebx, ecx, edx, i;
	/* start at the beginning of the "extended state" */
	unsigned int last_good_offset = offsetof(struct xregs_state,
						 extended_state_area);
	/*
	 * The FP xstates and SSE xstates are legacy states. They are always
	 * at fixed offsets in the xsave area, in both compacted form
	 * and standard form.
	 */
	xstate_offsets[XFEATURE_FP]	= 0;
	xstate_sizes[XFEATURE_FP]	= offsetof(struct fxregs_state,
						   xmm_space);

	xstate_offsets[XFEATURE_SSE]	= xstate_sizes[XFEATURE_FP];
	xstate_sizes[XFEATURE_SSE]	= sizeof_field(struct fxregs_state,
						       xmm_space);

	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
		if (!xfeature_enabled(i))
			continue;

		cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);

		xstate_sizes[i] = eax;

		/*
		 * If an xfeature is supervisor state, the offset in EBX is
		 * invalid; leave it at -1.
		 */
		if (xfeature_is_supervisor(i))
			continue;

		xstate_offsets[i] = ebx;

		/*
		 * In our xstate size checks, we assume that the highest-numbered
		 * xstate feature has the highest offset in the buffer. Ensure
		 * it does.
		 */
		WARN_ONCE(last_good_offset > xstate_offsets[i],
			  "x86/fpu: misordered xstate at %d\n", last_good_offset);

		last_good_offset = xstate_offsets[i];
	}
}

static void __init print_xstate_feature(u64 xstate_mask)
{
	const char *feature_name;

	if (cpu_has_xfeatures(xstate_mask, &feature_name))
		pr_info("x86/fpu: Supporting XSAVE feature 0x%03Lx: '%s'\n", xstate_mask, feature_name);
}

/*
 * Print out all the supported xstate features:
 */
static void __init print_xstate_features(void)
{
	print_xstate_feature(XFEATURE_MASK_FP);
	print_xstate_feature(XFEATURE_MASK_SSE);
	print_xstate_feature(XFEATURE_MASK_YMM);
	print_xstate_feature(XFEATURE_MASK_BNDREGS);
	print_xstate_feature(XFEATURE_MASK_BNDCSR);
	print_xstate_feature(XFEATURE_MASK_OPMASK);
	print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
	print_xstate_feature(XFEATURE_MASK_PKRU);
	print_xstate_feature(XFEATURE_MASK_PASID);
}

/*
 * This check is important because it is easy to get XFEATURE_MASK_*
 * (bit masks) confused with XFEATURE_* (state component numbers).
 */
#define CHECK_XFEATURE(nr) do {		\
	WARN_ON(nr < FIRST_EXTENDED_XFEATURE);	\
	WARN_ON(nr >= XFEATURE_MAX);	\
} while (0)

/*
 * We could cache this like xstate_sizes[], but we only use
 * it here, so it would be a waste of space.
 */
static int xfeature_is_aligned(int xfeature_nr)
{
	u32 eax, ebx, ecx, edx;

	CHECK_XFEATURE(xfeature_nr);

	if (!xfeature_enabled(xfeature_nr)) {
		WARN_ONCE(1, "Checking alignment of disabled xfeature %d\n",
			  xfeature_nr);
		return 0;
	}

	cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx);
	/*
	 * ECX[1] indicates whether state component 'xfeature_nr'
	 * is aligned to a 64-byte boundary when the compacted format
	 * of the extended region of an XSAVE area is used:
	 */
	return !!(ecx & 2);
}

/*
 * This function sets up the compacted-format offsets of all extended
 * states in the xsave area. Without XSAVES, the kernel buffer uses the
 * standard format, so the standard offsets are reused.
 */
static void __init setup_xstate_comp_offsets(void)
{
	unsigned int next_offset;
	int i;

	/*
	 * The FP xstates and SSE xstates are legacy states. They are always
	 * at fixed offsets in the xsave area, in both compacted form
	 * and standard form.
	 */
	xstate_comp_offsets[XFEATURE_FP] = 0;
	xstate_comp_offsets[XFEATURE_SSE] = offsetof(struct fxregs_state,
						     xmm_space);

	if (!boot_cpu_has(X86_FEATURE_XSAVES)) {
		for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
			if (xfeature_enabled(i))
				xstate_comp_offsets[i] = xstate_offsets[i];
		}
		return;
	}

	next_offset = FXSAVE_SIZE + XSAVE_HDR_SIZE;

	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
		if (!xfeature_enabled(i))
			continue;

		if (xfeature_is_aligned(i))
			next_offset = ALIGN(next_offset, 64);

		xstate_comp_offsets[i] = next_offset;
		next_offset += xstate_sizes[i];
	}
}

/*
 * Set up offsets of a supervisor-state-only XSAVES buffer:
 *
 * The offsets stored in xstate_comp_offsets[] only work for one specific
 * value of the Requested Feature BitMap (RFBM). In cases where a different
 * RFBM value is used, a different set of offsets is required. This set of
 * offsets is for when RFBM=xfeatures_mask_supervisor().
 */
static void __init setup_supervisor_only_offsets(void)
{
	unsigned int next_offset;
	int i;

	next_offset = FXSAVE_SIZE + XSAVE_HDR_SIZE;

	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
		if (!xfeature_enabled(i) || !xfeature_is_supervisor(i))
			continue;

		if (xfeature_is_aligned(i))
			next_offset = ALIGN(next_offset, 64);

		xstate_supervisor_only_offsets[i] = next_offset;
		next_offset += xstate_sizes[i];
	}
}

/*
 * Print out xstate component offsets and sizes
 */
static void __init print_xstate_offset_size(void)
{
	int i;

	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
		if (!xfeature_enabled(i))
			continue;
		pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
			 i, xstate_comp_offsets[i], i, xstate_sizes[i]);
	}
}

/*
* All supported features either have an all-zeros init state or are
* handled individually in setup_init_fpu_buf(). This is an explicit
* feature list that deliberately does not use XFEATURE_MASK*SUPPORTED,
* so that newly added supported features are caught at build time and
* people actually look at the init state for the new feature.
*/
#define XFEATURES_INIT_FPSTATE_HANDLED \
(XFEATURE_MASK_FP | \
XFEATURE_MASK_SSE | \
XFEATURE_MASK_YMM | \
XFEATURE_MASK_OPMASK | \
XFEATURE_MASK_ZMM_Hi256 | \
XFEATURE_MASK_Hi16_ZMM | \
XFEATURE_MASK_PKRU | \
XFEATURE_MASK_BNDREGS | \
XFEATURE_MASK_BNDCSR | \
XFEATURE_MASK_PASID)
/*
* Set up the xstate image representing the init state.
*/
static void __init setup_init_fpu_buf(void)
{
static int on_boot_cpu __initdata = 1;
BUILD_BUG_ON((XFEATURE_MASK_USER_SUPPORTED |
XFEATURE_MASK_SUPERVISOR_SUPPORTED) !=
XFEATURES_INIT_FPSTATE_HANDLED);
WARN_ON_FPU(!on_boot_cpu);
on_boot_cpu = 0;
if (!boot_cpu_has(X86_FEATURE_XSAVE))
return;
setup_xstate_features();
print_xstate_features();
if (boot_cpu_has(X86_FEATURE_XSAVES))
init_fpstate.xsave.header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT |
xfeatures_mask_all;
/*
* Initialize the state of all features, with header.xfeatures being 0x0.
*/
os_xrstor_booting(&init_fpstate.xsave);
/*
* All components are now in init state. Read the state back so
* that init_fpstate contains all non-zero init state. This only
* works with XSAVE, but not with XSAVEOPT and XSAVES, because
* those use the init optimization, which skips writing data for
* components in init state.
*
* XSAVE could be used, but that would require reshuffling the
* data when XSAVES is available because XSAVES uses xstate
* compaction. Doing so is a pointless exercise because all
* components except the legacy ones (FP and SSE) have an
* all-zeros init state. Those can be saved with FXSAVE into the
* legacy area. Adding a new feature requires ensuring that its
* init state is all zeros or, if not, adding the necessary
* handling here.
*/
fxsave(&init_fpstate.fxsave);
}
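/*
 * Illustration only, not part of the kernel build: a minimal user-space
 * sketch of the "explicit handled list" idea behind the BUILD_BUG_ON()
 * in setup_init_fpu_buf(). All DEMO_* names and bit positions are
 * hypothetical. The point is that adding a bit to the supported mask
 * without also adding it to the handled mask breaks the build.
 */

```c
#include <stdint.h>

/* Hypothetical feature bits, chosen only for this sketch. */
#define DEMO_FEAT_A	(UINT64_C(1) << 0)
#define DEMO_FEAT_B	(UINT64_C(1) << 1)
#define DEMO_FEAT_C	(UINT64_C(1) << 2)

/* What the (hypothetical) system claims to support. */
#define DEMO_SUPPORTED		(DEMO_FEAT_A | DEMO_FEAT_B | DEMO_FEAT_C)

/* Explicit list of features whose init state somebody has looked at. */
#define DEMO_INIT_HANDLED	(DEMO_FEAT_A | DEMO_FEAT_B | DEMO_FEAT_C)

/* Fails at compile time as soon as the two lists drift apart. */
_Static_assert(DEMO_SUPPORTED == DEMO_INIT_HANDLED,
	       "new feature added without reviewing its init state");

/* Returns nonzero if the init state of 'feature' has been reviewed. */
static inline int demo_init_state_reviewed(uint64_t feature)
{
	return (DEMO_INIT_HANDLED & feature) != 0;
}
```

/*
 * Deriving DEMO_INIT_HANDLED from DEMO_SUPPORTED would make the
 * assertion vacuous, which is presumably why the list above is spelled
 * out by hand.
 */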
static int xfeature_uncompacted_offset(int xfeature_nr)
{
u32 eax, ebx, ecx, edx;
/*
* Only XSAVES supports supervisor states and it uses compacted
* format. Checking a supervisor state's uncompacted offset is
* an error.
*/
if (XFEATURE_MASK_SUPERVISOR_ALL & BIT_ULL(xfeature_nr)) {
WARN_ONCE(1, "No fixed offset for xstate %d\n", xfeature_nr);
return -1;
}
CHECK_XFEATURE(xfeature_nr);
cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx);
return ebx;
}
int xfeature_size(int xfeature_nr)
{
u32 eax, ebx, ecx, edx;
CHECK_XFEATURE(xfeature_nr);
cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx);
return eax;
}
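/*
 * Illustration only, not part of the kernel build: a small sketch of
 * how the registers returned by CPUID leaf 0xD, sub-leaf n (n >= 2) can
 * be interpreted. xfeature_size() above returns EAX (the component size
 * in bytes) and xfeature_uncompacted_offset() returns EBX (the offset
 * in the standard, uncompacted format). The struct and function names
 * below are hypothetical, and the decoder takes raw register values so
 * that it does not depend on the CPU it runs on.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for one decoded state component. */
struct demo_xcomp_info {
	uint32_t size;		/* EAX: size of the component in bytes */
	uint32_t offset;	/* EBX: offset in the standard format */
	bool supervisor;	/* ECX bit 0: managed via IA32_XSS */
	bool align64;		/* ECX bit 1: 64-byte aligned when compacted */
};

/* Decode raw CPUID.(EAX=0xD, ECX=n) register values. */
static struct demo_xcomp_info
demo_decode_xstate_leaf(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
	struct demo_xcomp_info info = {
		.size = eax,
		.offset = ebx,
		.supervisor = (ecx & 1u) != 0,
		.align64 = (ecx & 2u) != 0,
	};

	return info;
}
```

/*
 * Note that the kernel code above identifies supervisor states through
 * XFEATURE_MASK_SUPERVISOR_ALL rather than through ECX; the ECX decoding
 * here is only meant to show what the hardware reports.
 */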
/* Validate an xstate header supplied by userspace (ptrace or sigreturn) */
static int validate_user_xstate_header(const struct xstate_header *hdr)
{
/* No unknown or supervisor features may be set */
if (hdr->xfeatures & ~xfeatures_mask_user())
return -EINVAL;
/* Userspace must use the uncompacted format */
if (hdr->xcomp_bv)
return -EINVAL;
/*
* If 'reserved' is shrunk to add a new field, make sure to validate
* that new field here!
*/
BUILD_BUG_ON(sizeof(hdr->reserved) != 48);
/* No reserved bits may be set */
if (memchr_inv(hdr->reserved, 0, sizeof(hdr->reserved)))
return -EINVAL;
return 0;
}
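/*
 * Illustration only, not part of the kernel build: a user-space sketch
 * of the three checks performed by validate_user_xstate_header(). The
 * struct is a hypothetical stand-in for struct xstate_header, the
 * permitted user feature mask is passed in as a parameter instead of
 * coming from xfeatures_mask_user(), and a plain loop replaces
 * memchr_inv(), which is a kernel helper.
 */

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the 64-byte xstate header layout. */
struct demo_xstate_header {
	uint64_t xfeatures;
	uint64_t xcomp_bv;
	uint64_t reserved[6];
};

/* Same reminder as the BUILD_BUG_ON() above: 48 reserved bytes. */
_Static_assert(sizeof(((struct demo_xstate_header *)0)->reserved) == 48,
	       "validate any field carved out of 'reserved'");

/* Returns 0 if the header is acceptable, -EINVAL otherwise. */
static int demo_validate_header(const struct demo_xstate_header *hdr,
				uint64_t user_mask)
{
	size_t i;

	assert(hdr != NULL);

	/* No unknown or supervisor features may be set. */
	if (hdr->xfeatures & ~user_mask)
		return -EINVAL;

	/* Userspace must use the uncompacted format. */
	if (hdr->xcomp_bv)
		return -EINVAL;

	/* No reserved bits may be set. */
	for (i = 0; i < sizeof(hdr->reserved) / sizeof(hdr->reserved[0]); i++) {
		if (hdr->reserved[i])
			return -EINVAL;
	}

	return 0;
}
```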
static void __xstate_dump_leaves(void)
{
int i;
u32 eax, ebx, ecx, edx;
static int should_dump = 1;
if (!should_dump)
return;
should_dump = 0;
/*
* Dump out a few leaves past the ones that we support
* just in case there are some goodies up there
*/
for (i = 0; i < XFEATURE_MAX + 10; i++) {
cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
pr_warn("CPUID[%02x, %02x]: eax=%08x ebx=%08x ecx=%08x edx=%08x\n",
XSTATE_CPUID, i, eax, ebx, ecx, edx);
}
}
#define XSTATE_WARN_ON(x) do { \
if (WARN_ONCE(x, "XSAVE consistency problem, dumping leaves")) { \
__xstate_dump_leaves(); \
} \
} while (0)
#define XCHECK_SZ(sz, nr, nr_macro, __struct) do { \
if ((nr == nr_macro) && \
WARN_ONCE(sz != sizeof(__struct), \
"%s: struct is %zu bytes, cpu state %d bytes\n", \
__stringify(nr_macro), sizeof(__struct), sz)) { \
__xstate_dump_leaves(); \
} \
} while (0)
/*
* We have a C struct for each 'xstate'. We need to ensure
* that our software representation matches what the CPU
* tells us about the state's size.
*/
static void check_xstate_against_struct(int nr)
{
/*
* Ask the CPU for the size of the state.
*/
int sz = xfeature_size(nr);
/*
* Match each CPU state with the corresponding software
* structure.
*/
XCHECK_SZ(sz, nr, XFEATURE_YMM, struct ymmh_struct);
XCHECK_SZ(sz, nr, XFEATURE_BNDREGS, struct mpx_bndreg_state);
XCHECK_SZ(sz, nr, XFEATURE_BNDCSR, struct mpx_bndcsr_state);
XCHECK_SZ(sz, nr, XFEATURE_OPMASK, struct avx_512_opmask_state);
XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM, struct avx_512_hi16_state);
XCHECK_SZ(sz, nr, XFEATURE_PKRU, struct pkru_state);
XCHECK_SZ(sz, nr, XFEATURE_PASID, struct ia32_pasid_state);
/*
* Make *SURE* to add any feature numbers in below if
* there are "holes" in the xsave state component
* numbers.
*/
if ((nr < XFEATURE_YMM) ||
(nr >= XFEATURE_MAX) ||
	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_LBR))) {
		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
		XSTATE_WARN_ON(1);
	}
}
x86/fpu: Correct and check XSAVE xstate size calculations

Note: our xsaves support is currently broken and disabled. This patch does not fix it, but it is an incremental improvement. This might be useful to someone backporting the entire set of XSAVES patches at some point, but it should not be backported alone.

Ingo said he wanted something like this (bullets 2 and 3): http://lkml.kernel.org/r/20150808091508.GB32641@gmail.com

There are currently two xsave buffer formats: standard and compacted. The standard format is what 'XSAVE' and 'XSAVEOPT' produce, while 'XSAVES' and 'XSAVEC' produce a compacted-format buffer. (The kernel never uses XSAVEC.)

But, the XSAVES buffer *ALSO* contains "system state components" which are never saved by a plain XSAVE. So, XSAVES has two things that might make its buffer differently-sized from an XSAVE-produced one.

The current code assumes that an XSAVES buffer's size is simply the sum of the sizes of the (user) states which are supported. This seems to work in most cases, but it is not consistent with what the SDM says, and it breaks if we 'align' a component in the buffer. The calculation is also unnecessary work since the CPU *tells* us the size of the buffer directly.

This patch just reads the size of the buffer right out of the CPUID leaf instead of trying to derive it.

But, blindly trusting the CPU like this is dangerous. We add a verification pass in do_extra_xstate_size_checks() to ensure that the size we calculate matches what we see from the hardware. When it comes down to it, we trust but verify the CPU.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233130.234FE1EC@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-02 23:31:30 +00:00
/*
 * This essentially double-checks what the cpu told us about
 * how large the XSAVE buffer needs to be. We are recalculating
 * it to be safe.
 *
 * Dynamic XSAVE features allocate their own buffers and are not
 * covered by these checks. Only the size of the buffer for task->fpu
 * is checked here.
 */
static void do_extra_xstate_size_checks(void)
{
	int paranoid_xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
	int i;

	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
		if (!xfeature_enabled(i))
			continue;

		check_xstate_against_struct(i);
		/*
		 * Supervisor state components can be managed only by
		 * XSAVES.
		 */
		if (!cpu_feature_enabled(X86_FEATURE_XSAVES))
			XSTATE_WARN_ON(xfeature_is_supervisor(i));

		/* Align from the end of the previous feature */
		if (xfeature_is_aligned(i))
			paranoid_xstate_size = ALIGN(paranoid_xstate_size, 64);
		/*
		 * The offset of a given state in the non-compacted
		 * format is given to us in a CPUID leaf. We check
		 * them for being ordered (increasing offsets) in
		 * setup_xstate_features(). XSAVES uses compacted format.
		 */
		if (!cpu_feature_enabled(X86_FEATURE_XSAVES))
			paranoid_xstate_size = xfeature_uncompacted_offset(i);
		/*
		 * The compacted-format offset always depends on where
		 * the previous state ended.
		 */
		paranoid_xstate_size += xfeature_size(i);
	}
	XSTATE_WARN_ON(paranoid_xstate_size != fpu_kernel_xstate_size);
}
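
The recalculation performed by do_extra_xstate_size_checks() can be tried outside the kernel. The following is a minimal sketch, not kernel code: `struct demo_xfeature`, `demo_expected_size()` and the constants are hypothetical names introduced here, and the per-component offsets and sizes are supplied by the caller instead of being read from CPUID. It assumes the usual 512-byte legacy region followed by a 64-byte XSAVE header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_LEGACY_SIZE 512u /* legacy (FXSAVE) region */
#define DEMO_HEADER_SIZE 64u  /* XSAVE header */

/* Hypothetical description of one extended state component. */
struct demo_xfeature {
	bool enabled;            /* component is enabled */
	bool aligned;            /* 64-byte aligned in the compacted format */
	unsigned int std_offset; /* offset in the standard format */
	unsigned int size;       /* component size in bytes */
};

static unsigned int demo_align64(unsigned int v)
{
	return (v + 63u) & ~63u;
}

/*
 * Walk the components in increasing order and return the buffer size
 * implied by the chosen format.
 */
static unsigned int demo_expected_size(const struct demo_xfeature *f,
				       size_t n, bool compacted)
{
	unsigned int size = DEMO_LEGACY_SIZE + DEMO_HEADER_SIZE;
	size_t i;

	assert(f != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!f[i].enabled)
			continue;

		if (compacted) {
			/* Compacted: start where the previous one ended. */
			if (f[i].aligned)
				size = demo_align64(size);
		} else {
			/* Standard: every component has a fixed offset. */
			size = f[i].std_offset;
		}
		size += f[i].size;
	}
	return size;
}
```

In the standard format a disabled component still leaves a hole, because the next enabled component keeps its fixed offset. In the compacted format the hole disappears and only alignment padding remains.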

/*
 * Get total size of enabled xstates in XCR0 | IA32_XSS.
 *
 * Note the SDM's wording here. "sub-function 0" only enumerates
 * the size of the *user* states. If we use it to size a buffer
 * that we use 'XSAVES' on, we could potentially overflow the
 * buffer because 'XSAVES' saves system states too.
 */
static unsigned int __init get_xsaves_size(void)
{
	unsigned int eax, ebx, ecx, edx;
	/*
	 * - CPUID function 0DH, sub-function 1:
	 *    EBX enumerates the size (in bytes) required by
	 *    the XSAVES instruction for an XSAVE area
	 *    containing all the state components
	 *    corresponding to bits currently set in
	 *    XCR0 | IA32_XSS.
	 */
	cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
	return ebx;
}
x86/fpu/xstate: Fix an xstate size check warning with architectural LBRs An xstate size check warning is triggered on machines which support Architectural LBRs. XSAVE consistency problem, dumping leaves WARNING: CPU: 0 PID: 0 at arch/x86/kernel/fpu/xstate.c:649 fpu__init_system_xstate+0x4d4/0xd0e Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted intel-arch_lbr+ RIP: 0010:fpu__init_system_xstate+0x4d4/0xd0e The xstate size check routine, init_xstate_size(), compares the size retrieved from the hardware with the size of task->fpu, which is calculated by the software. The size from the hardware is the total size of the enabled xstates in XCR0 | IA32_XSS. Architectural LBR state is a dynamic supervisor feature, which sets the corresponding bit in the IA32_XSS at boot time. The size from the hardware includes the size of the Architectural LBR state. However, a dynamic supervisor feature doesn't allocate a buffer in the task->fpu. The size of task->fpu doesn't include the size of the Architectural LBR state. The mismatch will trigger the warning. Three options as below were considered to fix the issue: - Correct the size from the hardware by subtracting the size of the dynamic supervisor features. The purpose of the check is to compare the size CPU told with the size of the XSAVE buffer, which is calculated by the software. If the software mucks with the number from hardware, it removes the value of the check. This option is not a good option. - Prevent the hardware from counting the size of the dynamic supervisor feature by temporarily removing the corresponding bits in IA32_XSS. Two extra MSR writes are required to flip the IA32_XSS. The option is not pretty, but it is workable. The check is only called once at early boot time. The synchronization or context-switching doesn't need to be worried. This option is implemented here. - Remove the check entirely, because the check hasn't found any real problems. The option may be an alternative as option 2. 
This option is not implemented here. Add a new function, get_xsaves_size_no_dynamic(), which retrieves the total size without the dynamic supervisor features from the hardware. The size will be used to compare with the size of task->fpu. Fixes: f0dccc9da4c0 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR") Reported-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Link: https://lore.kernel.org/r/1595253051-75374-1-git-send-email-kan.liang@linux.intel.com
2020-07-20 13:50:51 +00:00
/*
 * Get the total size of the enabled xstates without the dynamic supervisor
 * features.
 */
static unsigned int __init get_xsaves_size_no_dynamic(void)
{
	u64 mask = xfeatures_mask_dynamic();
	unsigned int size;

	if (!mask)
		return get_xsaves_size();

	/* Disable dynamic features. */
	wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor());

	/*
	 * Ask the hardware what size is required of the buffer.
	 * This is the size required for the task->fpu buffer.
	 */
	size = get_xsaves_size();

	/* Re-enable dynamic features so XSAVES will work on them again. */
	wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() | mask);

	return size;
}
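
The mask, query, restore sequence used by get_xsaves_size_no_dynamic() can be illustrated with a small simulation. This is only a sketch under stated assumptions: `struct demo_cpu`, `demo_query_size()` and `demo_size_without()` are hypothetical stand-ins for the IA32_XSS MSR and the CPUID size query, and the simulated size is a plain sum that ignores alignment padding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the CPU state involved in the query. */
struct demo_cpu {
	uint64_t xss;               /* simulated IA32_XSS contents */
	unsigned int base_size;     /* size without supervisor states */
	unsigned int comp_size[64]; /* size added by each IA32_XSS bit */
};

/* Simulated size query: base size plus every component enabled in xss. */
static unsigned int demo_query_size(const struct demo_cpu *cpu)
{
	unsigned int size = cpu->base_size;
	int bit;

	for (bit = 0; bit < 64; bit++)
		if (cpu->xss & (UINT64_C(1) << bit))
			size += cpu->comp_size[bit];
	return size;
}

/* Size of the buffer when the dynamic features are left out. */
static unsigned int demo_size_without(struct demo_cpu *cpu,
				      uint64_t dynamic_mask)
{
	uint64_t saved = cpu->xss;
	unsigned int size;

	assert(cpu != 0);

	if (!dynamic_mask)
		return demo_query_size(cpu);

	/* Hide the dynamic features from the size query. */
	cpu->xss = saved & ~dynamic_mask;
	size = demo_query_size(cpu);

	/* Restore the register so the features remain usable. */
	cpu->xss = saved;
	return size;
}
```

The two register writes are the cost mentioned in the commit message for this approach. They are acceptable in the kernel because the check runs once during early boot, before any context switch can observe the temporary value.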

static unsigned int __init get_xsave_size(void)
{
	unsigned int eax, ebx, ecx, edx;
	/*
	 * - CPUID function 0DH, sub-function 0:
	 *    EBX enumerates the size (in bytes) required by
	 *    the XSAVE instruction for an XSAVE area
	 *    containing all the *user* state components
	 *    corresponding to bits currently set in XCR0.
	 */
	cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
	return ebx;
x86/fpu: Remove XSTATE_RESERVE

The original purpose of XSTATE_RESERVE was to carve out space to store all of the possible extended state components that get saved with the XSAVE instruction(s).

However, we are now almost entirely dynamically allocating the buffers we use for XSAVE by placing them at the end of the task_struct and then sizing them at boot. The one exception for that is the init_task.

The maximum extended state component size that we have today is on systems with space for AVX-512 and Memory Protection Keys: 2696 bytes. We have reserved a PAGE_SIZE buffer in the init_task via fpregs_state->__padding.

This check ensures that even if the component sizes or layout were changed (which we do not expect), we will still not overflow the init_task's buffer.

In the case that we detect we might overflow the buffer, we completely disable XSAVE support in the kernel and try to boot as if we had 'legacy x87 FPU' support in place. This is a crippled state without any of the XSAVE-enabled features (MPX, AVX, etc...). But, it at least lets us boot safely.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233125.D948D475@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-02 23:31:25 +00:00
}
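
The two CPUID sub-leaves read by get_xsave_size() and get_xsaves_size() can also be queried from user space. The sketch below assumes a GCC or Clang toolchain that provides `<cpuid.h>` with `__get_cpuid_count()`; `demo_xstate_size()` and `demo_print_xstate_sizes()` are hypothetical helper names, and on a non-x86 build or a CPU without leaf 0DH the helper simply returns 0.

```c
#include <assert.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define DEMO_HAVE_CPUID 1
#else
#define DEMO_HAVE_CPUID 0
#endif

#define DEMO_XSTATE_CPUID 0x0000000d

/*
 * Return EBX of CPUID leaf 0DH for the given sub-leaf, or 0 when the
 * leaf cannot be queried (non-x86 build, or no leaf 0DH on this CPU).
 */
static unsigned int demo_xstate_size(unsigned int subleaf)
{
#if DEMO_HAVE_CPUID
	unsigned int eax, ebx, ecx, edx;

	assert(subleaf <= 1);
	if (!__get_cpuid_count(DEMO_XSTATE_CPUID, subleaf,
			       &eax, &ebx, &ecx, &edx))
		return 0;
	return ebx;
#else
	(void)subleaf;
	return 0;
#endif
}

/* Print both sizes; the values depend on the machine and its OS setup. */
static void demo_print_xstate_sizes(void)
{
	printf("sub-leaf 0 (XCR0 only):       %u bytes\n",
	       demo_xstate_size(0));
	printf("sub-leaf 1 (XCR0 | IA32_XSS): %u bytes\n",
	       demo_xstate_size(1));
}
```

Sub-leaf 0 covers only the user states enabled in XCR0, which is why the kernel comment above warns against using it to size a buffer that XSAVES will write to.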

/*
 * Will the runtime-enumerated 'xstate_size' fit in the init
 * task's statically-allocated buffer?
 */
static bool is_supported_xstate_size(unsigned int test_xstate_size)
{
	if (test_xstate_size <= sizeof(union fpregs_state))
		return true;

	pr_warn("x86/fpu: xstate buffer too small (%zu < %d), disabling xsave\n",
			sizeof(union fpregs_state), test_xstate_size);
	return false;
}

static int __init init_xstate_size(void)
{
	/* Recompute the context size for enabled features: */
	unsigned int possible_xstate_size;
	unsigned int xsave_size;

	xsave_size = get_xsave_size();

	if (boot_cpu_has(X86_FEATURE_XSAVES))
x86/fpu/xstate: Fix an xstate size check warning with architectural LBRs An xstate size check warning is triggered on machines which support Architectural LBRs. XSAVE consistency problem, dumping leaves WARNING: CPU: 0 PID: 0 at arch/x86/kernel/fpu/xstate.c:649 fpu__init_system_xstate+0x4d4/0xd0e Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted intel-arch_lbr+ RIP: 0010:fpu__init_system_xstate+0x4d4/0xd0e The xstate size check routine, init_xstate_size(), compares the size retrieved from the hardware with the size of task->fpu, which is calculated by the software. The size from the hardware is the total size of the enabled xstates in XCR0 | IA32_XSS. Architectural LBR state is a dynamic supervisor feature, which sets the corresponding bit in the IA32_XSS at boot time. The size from the hardware includes the size of the Architectural LBR state. However, a dynamic supervisor feature doesn't allocate a buffer in the task->fpu. The size of task->fpu doesn't include the size of the Architectural LBR state. The mismatch will trigger the warning. Three options as below were considered to fix the issue: - Correct the size from the hardware by subtracting the size of the dynamic supervisor features. The purpose of the check is to compare the size CPU told with the size of the XSAVE buffer, which is calculated by the software. If the software mucks with the number from hardware, it removes the value of the check. This option is not a good option. - Prevent the hardware from counting the size of the dynamic supervisor feature by temporarily removing the corresponding bits in IA32_XSS. Two extra MSR writes are required to flip the IA32_XSS. The option is not pretty, but it is workable. The check is only called once at early boot time. The synchronization or context-switching doesn't need to be worried. This option is implemented here. - Remove the check entirely, because the check hasn't found any real problems. The option may be an alternative as option 2. 
This option is not implemented here. Add a new function, get_xsaves_size_no_dynamic(), which retrieves the total size without the dynamic supervisor features from the hardware. The size will be used to compare with the size of task->fpu. Fixes: f0dccc9da4c0 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR") Reported-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Link: https://lore.kernel.org/r/1595253051-75374-1-git-send-email-kan.liang@linux.intel.com
2020-07-20 13:50:51 +00:00
		possible_xstate_size = get_xsaves_size_no_dynamic();
	else
		possible_xstate_size = xsave_size;
	/* Ensure we have the space to store all enabled: */
	if (!is_supported_xstate_size(possible_xstate_size))
		return -EINVAL;

	/*
	 * The size is OK, we are definitely going to use xsave,
	 * make it known to the world that we need more space.
	 */
	fpu_kernel_xstate_size = possible_xstate_size;
x86/fpu: Correct and check XSAVE xstate size calculations Note: our xsaves support is currently broken and disabled. This patch does not fix it, but it is an incremental improvement. This might be useful to someone backporting the entire set of XSAVES patches at some point, but it should not be backported alone. Ingo said he wanted something like this (bullets 2 and 3): http://lkml.kernel.org/r/20150808091508.GB32641@gmail.com There are currently two xsave buffer formats: standard and compacted. The standard format is what 'XSAVE' and 'XSAVEOPT' produce while 'XSAVES' and 'XSAVEC' produce a compacted-format buffer. (The kernel never uses XSAVEC) But, the XSAVES buffer *ALSO* contains "system state components" which are never saved by a plain XSAVE. So, XSAVES has two things that might make its buffer differently-sized from an XSAVE-produced one. The current code assumes that an XSAVES buffer's size is simply the sum of the sizes of the (user) states which are supported. This seems to work in most cases, but it is not consistent with what the SDM says, and it breaks if we 'align' a component in the buffer. The calculation is also unnecessary work since the CPU *tells* us the size of the buffer directly. This patch just reads the size of the buffer right out of the CPUID leaf instead of trying to derive it. But, blindly trusting the CPU like this is dangerous. We add a verification pass in do_extra_xstate_size_checks() to ensure that the size we calculate matches with what we see from the hardware. When it comes down to it, we trust but verify the CPU. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233130.234FE1EC@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-02 23:31:30 +00:00
	do_extra_xstate_size_checks();

	/*
	 * User space is always in standard format.
	 */
	fpu_user_xstate_size = xsave_size;
	return 0;
}
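The size selection performed by init_xstate_size() can be sketched as a pure function, which may make the two outputs easier to see. This is only an illustration under stated assumptions: `struct xstate_sizes`, `pick_xstate_sizes()` and the `buf_capacity` parameter are hypothetical names that do not exist in the kernel, which stores the results in the globals fpu_kernel_xstate_size and fpu_user_xstate_size and takes the sizes from CPUID.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result holder; the kernel writes two globals instead. */
struct xstate_sizes {
	unsigned int kernel_size;	/* compacted size when XSAVES is used */
	unsigned int user_size;		/* always the standard-format size */
};

/*
 * Choose the kernel and user buffer sizes.
 *
 * xsave_size:   standard-format size reported by the CPU
 * xsaves_size:  compacted-format size, only meaningful if has_xsaves
 * buf_capacity: largest buffer the caller can provide (assumed input)
 *
 * Returns 0 on success, -EINVAL if the chosen size does not fit.
 */
static int pick_xstate_sizes(unsigned int xsave_size, unsigned int xsaves_size,
			     bool has_xsaves, size_t buf_capacity,
			     struct xstate_sizes *out)
{
	unsigned int possible = has_xsaves ? xsaves_size : xsave_size;

	/* Same test as is_supported_xstate_size(): the size must fit. */
	if (possible > buf_capacity)
		return -EINVAL;

	out->kernel_size = possible;
	/* User space always sees the standard format. */
	out->user_size = xsave_size;
	return 0;
}
```

On failure the sketch leaves `out` untouched, which matches the original: the globals are only written once the size check has passed.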
x86/fpu: Rename XSAVE macros There are two concepts that have some confusing naming: 1. Extended State Component numbers (currently called XFEATURE_BIT_*) 2. Extended State Component masks (currently called XSTATE_*) The numbers are (currently) from 0-9. State component 3 is the bounds registers for MPX, for instance. But when we want to enable "state component 3", we go set a bit in XCR0. The bit we set is 1<<3. We can check to see if a state component feature is enabled by looking at its bit. The current 'xfeature_bit's are at best xfeature bit _numbers_. Calling them bits is at best inconsistent with ending the enum list with 'XFEATURES_NR_MAX'. This patch renames the enum to be 'xfeature'. These also happen to be what the Intel documentation calls a "state component". We also want to differentiate these from the "XSTATE_*" macros. The "XSTATE_*" macros are a mask, and we rename them to match. These macros are reasonably widely used so this patch is a wee bit big, but this really is just a rename. The only non-mechanical part of this is the s/XSTATE_EXTEND_MASK/XFEATURE_MASK_EXTEND/ We need a better name for it, but that's another patch. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233126.38653250@viggo.jf.intel.com [ Ported to v4.3-rc1. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-02 23:31:26 +00:00
/*
 * We enabled the XSAVE hardware, but something went wrong and
 * we can not use it. Disable it.
 */
static void fpu__init_disable_system_xstate(void)
{
	xfeatures_mask_all = 0;
	cr4_clear_bits(X86_CR4_OSXSAVE);
	setup_clear_cpu_cap(X86_FEATURE_XSAVE);
}

/*
 * Enable and initialize the xsave feature.
 * Called once per system bootup.
 */
void __init fpu__init_system_xstate(void)
{
	unsigned int eax, ebx, ecx, edx;
	static int on_boot_cpu __initdata = 1;
	u64 xfeatures;
	int err;
	int i;

	WARN_ON_FPU(!on_boot_cpu);
	on_boot_cpu = 0;

	if (!boot_cpu_has(X86_FEATURE_FPU)) {
		pr_info("x86/fpu: No FPU detected\n");
		return;
	}

	if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
		pr_info("x86/fpu: x87 FPU will use %s\n",
			boot_cpu_has(X86_FEATURE_FXSR) ? "FXSAVE" : "FSAVE");
		return;
	}

	if (boot_cpu_data.cpuid_level < XSTATE_CPUID) {
		WARN_ON_FPU(1);
		return;
	}

	/*
	 * Find user xstates supported by the processor.
	 */
	cpuid_count(XSTATE_CPUID, 0, &eax, &ebx, &ecx, &edx);
	xfeatures_mask_all = eax + ((u64)edx << 32);

	/*
	 * Find supervisor xstates supported by the processor.
	 */
	cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
	xfeatures_mask_all |= ecx + ((u64)edx << 32);

	if ((xfeatures_mask_user() & XFEATURE_MASK_FPSSE) != XFEATURE_MASK_FPSSE) {
x86/fpu: Do not BUG_ON() in early FPU code I don't think it is really possible to have a system where CPUID enumerates support for XSAVE but that it does not have FP/SSE (they are "legacy" features and always present). But, I did manage to hit this case in qemu when I enabled its somewhat shaky XSAVE support. The bummer is that the FPU is set up before we parse the command-line or have *any* console support including earlyprintk. That turned what should have been an easy thing to debug into a bit more of an odyssey. So a BUG() here is worthless. All it does is guarantee that if/when we hit this case we have an empty console. So, remove the BUG() and try to limp along by disabling XSAVE and trying to continue. Add a comment on why we are doing this, and also add a common "out_disable" path for leaving fpu__init_system_xstate(). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160720194551.63BB2B58@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-20 19:45:51 +00:00
		/*
		 * This indicates that something really unexpected happened
		 * with the enumeration. Disable XSAVE and try to continue
		 * booting without it. This is too early to BUG().
		 */
		pr_err("x86/fpu: FP/SSE not present amongst the CPU's xstate features: 0x%llx.\n",
		       xfeatures_mask_all);
		goto out_disable;
	}

	/*
	 * Clear XSAVE features that are disabled in the normal CPUID.
	 */
	for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
		if (!boot_cpu_has(xsave_cpuid_features[i]))
			xfeatures_mask_all &= ~BIT_ULL(i);
	}

	xfeatures_mask_all &= XFEATURE_MASK_USER_SUPPORTED |
			      XFEATURE_MASK_SUPERVISOR_SUPPORTED;

	/* Store it for paranoia check at the end */
	xfeatures = xfeatures_mask_all;

	/* Enable xstate instructions to be able to continue with initialization: */
	fpu__init_cpu_xstate();
	err = init_xstate_size();
	if (err)
		goto out_disable;

	/*
	 * Update info used for ptrace frames; use standard-format size and no
	 * supervisor xstates:
	 */
	update_regset_xstate_info(fpu_user_xstate_size, xfeatures_mask_user());

	fpu__init_prepare_fx_sw_frame();
	setup_init_fpu_buf();
	setup_xstate_comp_offsets();
	setup_supervisor_only_offsets();

	/*
	 * Paranoia check whether something in the setup modified the
	 * xfeatures mask.
	 */
	if (xfeatures != xfeatures_mask_all) {
		pr_err("x86/fpu: xfeatures modified from 0x%016llx to 0x%016llx during init, disabling XSAVE\n",
		       xfeatures, xfeatures_mask_all);
		goto out_disable;
	}

	print_xstate_offset_size();
	pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
		xfeatures_mask_all,
		fpu_kernel_xstate_size,
		boot_cpu_has(X86_FEATURE_XSAVES) ? "compacted" : "standard");
	return;

out_disable:
	/* something went wrong, try to boot without any XSAVE support */
	fpu__init_disable_system_xstate();
}
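The mask handling in fpu__init_system_xstate() reduces to three small steps: join the two 32-bit CPUID halves, drop features whose regular CPUID flag is missing, and verify that FP and SSE survived. A minimal user-space sketch of those steps follows. All names here (`join_cpuid_halves()`, `filter_xfeatures()`, `has_fpsse()`, the `present` table) are hypothetical stand-ins for the kernel's cpuid_count(), xsave_cpuid_features[] and XFEATURE_MASK_FPSSE, and the sketch assumes the table has at most 64 entries.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* State components 0 (x87) and 1 (SSE) are the legacy features. */
#define MASK_FP		(1ULL << 0)
#define MASK_SSE	(1ULL << 1)
#define MASK_FPSSE	(MASK_FP | MASK_SSE)

/* Combine the low and high 32-bit register halves into one 64-bit mask. */
static uint64_t join_cpuid_halves(uint32_t lo, uint32_t hi)
{
	return (uint64_t)lo | ((uint64_t)hi << 32);
}

/*
 * Clear bit i when present[i] is false, then keep only the bits the
 * software supports. Assumes n <= 64.
 */
static uint64_t filter_xfeatures(uint64_t mask, const bool *present,
				 size_t n, uint64_t supported)
{
	for (size_t i = 0; i < n; i++) {
		if (!present[i])
			mask &= ~(1ULL << i);
	}
	return mask & supported;
}

/* Both legacy components must be enumerated for XSAVE to be usable. */
static bool has_fpsse(uint64_t mask)
{
	return (mask & MASK_FPSSE) == MASK_FPSSE;
}
```

The original adds the halves with `+` instead of `|`. The result is the same because the two halves occupy disjoint bit ranges.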
/*
 * Restore minimal FPU state after suspend:
 */
void fpu__resume_cpu(void)
{
	/*
	 * Restore XCR0 on xsave capable CPUs:
	 */
	if (boot_cpu_has(X86_FEATURE_XSAVE))
		xsetbv(XCR_XFEATURE_ENABLED_MASK, xfeatures_mask_user());

	/*
	 * Restore IA32_XSS. The same CPUID bit enumerates support
	 * of XSAVES and MSR_IA32_XSS.
	 */
x86/fpu/xstate: Support dynamic supervisor feature for LBR Last Branch Records (LBR) registers are used to log taken branches and other control flows. In perf with call stack mode, LBR information is used to reconstruct a call stack. To get the complete call stack, perf has to save/restore all LBR registers during a context switch. Due to the large number of the LBR registers, e.g., the current platform has 96 LBR registers, this process causes a high CPU overhead. To reduce the CPU overhead during a context switch, an LBR state component that contains all the LBR related registers is introduced in hardware. All LBR registers can be saved/restored together using one XSAVES/XRSTORS instruction. However, the kernel should not save/restore the LBR state component at each context switch, like other state components, because of the following unique features of LBR: - The LBR state component only contains valuable information when LBR is enabled in the perf subsystem, but for most of the time, LBR is disabled. - The size of the LBR state component is huge. For the current platform, it's 808 bytes. If the kernel saves/restores the LBR state at each context switch, for most of the time, it is just a waste of space and cycles. To efficiently support the LBR state component, it is desired to have: - only context-switch the LBR when the LBR feature is enabled in perf. - only allocate an LBR-specific XSAVE buffer on demand. (Besides the LBR state, a legacy region and an XSAVE header have to be included in the buffer as well. There is a total of (808+576) byte overhead for the LBR-specific XSAVE buffer. The overhead only happens when the perf is actively using LBRs. There is still a space-saving, on average, when it replaces the constant 808 bytes of overhead for every task, all the time on the systems that support architectural LBR.) - be able to use XSAVES/XRSTORS for accessing LBR at run time. However, the IA32_XSS should not be adjusted at run time. 
(The XCR0 | IA32_XSS are used to determine the requested-feature bitmap (RFBM) of XSAVES.) A solution, called dynamic supervisor feature, is introduced to address this issue, which - does not allocate a buffer in each task->fpu; - does not save/restore a state component at each context switch; - sets the bit corresponding to the dynamic supervisor feature in IA32_XSS at boot time, and avoids setting it at run time. - dynamically allocates a specific buffer for a state component on demand, e.g. only allocates LBR-specific XSAVE buffer when LBR is enabled in perf. (Note: The buffer has to include the LBR state component, a legacy region and a XSAVE header space.) (Implemented in a later patch) - saves/restores a state component on demand, e.g. manually invokes the XSAVES/XRSTORS instruction to save/restore the LBR state to/from the buffer when perf is active and a call stack is required. (Implemented in a later patch) A new mask XFEATURE_MASK_DYNAMIC and a helper xfeatures_mask_dynamic() are introduced to indicate the dynamic supervisor feature. For the systems which support the Architecture LBR, LBR is the only dynamic supervisor feature for now. For the previous systems, there is no dynamic supervisor feature available. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Link: https://lkml.kernel.org/r/1593780569-62993-21-git-send-email-kan.liang@linux.intel.com
2020-07-03 12:49:26 +00:00
	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
				     xfeatures_mask_dynamic());
	}
}

/*
 * Given an xstate feature nr, calculate where in the xsave
 * buffer the state is. Callers should ensure that the buffer
 * is valid.
 */
static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
{
	if (!xfeature_enabled(xfeature_nr)) {
		WARN_ON_FPU(1);
		return NULL;
	}

	return (void *)xsave + xstate_comp_offsets[xfeature_nr];
}
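__raw_xsave_addr() relies on a per-feature offset table that is filled in during boot. The sketch below shows one plausible way such a table could be built for the compacted format: extended components are packed in ascending order after the 576 bytes of legacy area plus XSAVE header, with optional 64-byte alignment. It is an illustration, not the kernel's setup_xstate_comp_offsets(). The function names and the `sizes`/`aligned` tables are hypothetical, a caller would have to fill them from CPUID leaf 0xD, and the code assumes n <= 64.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 512-byte legacy FXSAVE area followed by the 64-byte XSAVE header. */
#define XSAVE_LEGACY_AND_HDR	576u
/* Offset of the XMM registers inside the legacy area. */
#define LEGACY_SSE_OFFSET	160u

/*
 * Fill offsets[] for every feature number below n. Disabled features
 * get offset 0 and must not be dereferenced.
 */
static void compute_compacted_offsets(uint64_t enabled,
				      const unsigned int *sizes,
				      const bool *aligned,
				      unsigned int *offsets, size_t n)
{
	unsigned int next = XSAVE_LEGACY_AND_HDR;

	for (size_t nr = 0; nr < n; nr++) {
		offsets[nr] = 0;
		if (!(enabled & (1ULL << nr)))
			continue;
		if (nr == 0)
			continue;	/* x87 state sits at offset 0 */
		if (nr == 1) {
			offsets[nr] = LEGACY_SSE_OFFSET;
			continue;
		}
		/* Some components must start on a 64-byte boundary. */
		if (aligned[nr])
			next = (next + 63u) & ~63u;
		offsets[nr] = next;
		next += sizes[nr];
	}
}

/* Same contract as __raw_xsave_addr(): NULL for a disabled feature. */
static void *feature_addr(void *buf, uint64_t enabled,
			  const unsigned int *offsets, unsigned int nr)
{
	if (!(enabled & (1ULL << nr)))
		return NULL;
	return (char *)buf + offsets[nr];
}
```

Precomputing the table once keeps the lookup itself down to a bit test and an addition, which is the shape of the original helper.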
/*
 * Given the xsave area and a state inside, this function returns the
 * address of the state.
 *
 * This is the API that is called to get xstate address in either
 * standard format or compacted format of xsave area.
 *
 * Note that if there is no data for the field in the xsave buffer
 * this will return NULL.
 *
 * Inputs:
 *	xstate: the thread's storage area for all FPU data
 *	xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
 *	XFEATURE_SSE, etc.)
 * Output:
 *	address of the state in the xsave area, or NULL if the
 *	field is not present in the xsave buffer.
 */
void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
{
	/*
	 * Do we even *have* xsave state?
	 */
	if (!boot_cpu_has(X86_FEATURE_XSAVE))
		return NULL;

	/*
	 * We should not ever be requesting features that we
	 * have not enabled.
	 */
	WARN_ONCE(!(xfeatures_mask_all & BIT_ULL(xfeature_nr)),
		  "get of unsupported state");
	/*
	 * This assumes the last 'xsave*' instruction to
	 * have requested that 'xfeature_nr' be saved.
	 * If it did not, we might be seeing an old value
	 * of the field in the buffer.
	 *
	 * This can happen because the last 'xsave' did not
	 * request that this feature be saved (unlikely)
	 * or because the "init optimization" caused it
	 * to not be saved.
	 */
	if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr)))
		return NULL;

	return __raw_xsave_addr(xsave, xfeature_nr);
}
EXPORT_SYMBOL_GPL(get_xsave_addr);
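
/*
 * Illustrative sketch only, not kernel code: the "is this feature present
 * in the buffer?" check above can be modelled in plain C. The struct
 * demo_xsave, the fixed DEMO_SLOT_SIZE layout and demo_get_addr() are
 * hypothetical names invented for this example; the real offsets come from
 * CPUID and __raw_xsave_addr(), which this sketch does not reproduce.
 */
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SLOT_SIZE    16	/* hypothetical: every feature gets 16 bytes */
#define DEMO_MAX_FEATURES 16	/* hypothetical: number of modelled features */

/* Hypothetical stand-in for an xsave buffer; not the real hardware layout. */
struct demo_xsave {
	uint64_t xfeatures;	/* one bit per feature whose state was saved */
	unsigned char data[DEMO_SLOT_SIZE * DEMO_MAX_FEATURES];
};

/*
 * Return the address of feature 'nr' inside the buffer, or NULL when the
 * header bitmap says no data was saved for it.
 */
static void *demo_get_addr(struct demo_xsave *xs, int nr)
{
	assert(nr >= 0 && nr < DEMO_MAX_FEATURES);

	/* Same shape as the BIT_ULL(xfeature_nr) test in get_xsave_addr(). */
	if (!(xs->xfeatures & (UINT64_C(1) << nr)))
		return NULL;

	return xs->data + (size_t)nr * DEMO_SLOT_SIZE;
}
```
/*
 * As with get_xsave_addr(), a caller of this sketch has to handle a NULL
 * return, which means "no saved data for this feature" rather than an error.
 */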
#ifdef CONFIG_ARCH_HAS_PKEYS
/*
 * Modify the PKRU register to set the access rights for @pkey
 * to @init_val.
 */
int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
			      unsigned long init_val)
{
	u32 old_pkru, new_pkru_bits = 0;
	int pkey_shift;
	/*
	 * This check implies XSAVE support.  OSPKE only gets
	 * set if we enable XSAVE and we enable PKU in XCR0.
	 */
	if (!boot_cpu_has(X86_FEATURE_OSPKE))
		return -EINVAL;

	/*
	 * This code should only be called with valid 'pkey'
	 * values originating from in-kernel users.  Complain
	 * if a bad value is observed.
	 */
	if (WARN_ON_ONCE(pkey >= arch_max_pkey()))
		return -EINVAL;

	/* Set the bits we need in PKRU: */
	if (init_val & PKEY_DISABLE_ACCESS)
		new_pkru_bits |= PKRU_AD_BIT;
	if (init_val & PKEY_DISABLE_WRITE)
		new_pkru_bits |= PKRU_WD_BIT;

	/* Shift the bits into the correct place in PKRU for pkey: */
	pkey_shift = pkey * PKRU_BITS_PER_PKEY;
	new_pkru_bits <<= pkey_shift;
	/* Get old PKRU and mask off any old bits in place: */
	old_pkru = read_pkru();
	old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
/* Write old part along with new part: */
write_pkru(old_pkru | new_pkru_bits);
return 0;
}
#endif /* ! CONFIG_ARCH_HAS_PKEYS */
/*
* Weird legacy quirk: SSE and YMM states store information in the
* MXCSR and MXCSR_FLAGS fields of the FP area. That means if the FP
* area is marked as unused in the xfeatures header, we need to copy
* MXCSR and MXCSR_FLAGS if either SSE or YMM are in use.
*/
static inline bool xfeatures_mxcsr_quirk(u64 xfeatures)
{
if (!(xfeatures & (XFEATURE_MASK_SSE|XFEATURE_MASK_YMM)))
return false;
if (xfeatures & XFEATURE_MASK_FP)
return false;
return true;
}
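The quirk check above can be exercised in userspace with a small truth-table sketch. The `DEMO_MASK_*` macros and `demo_mxcsr_quirk()` are hypothetical stand-ins for the kernel's `XFEATURE_MASK_*` constants and `xfeatures_mxcsr_quirk()`; the bit positions assume the architectural XSAVE numbering (x87 = bit 0, SSE = bit 1, AVX/YMM = bit 2).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for XFEATURE_MASK_FP/SSE/YMM (XSAVE bits 0, 1, 2). */
#define DEMO_MASK_FP  (UINT64_C(1) << 0)
#define DEMO_MASK_SSE (UINT64_C(1) << 1)
#define DEMO_MASK_YMM (UINT64_C(1) << 2)

/*
 * MXCSR lives in the legacy FP area but belongs to SSE/YMM. It needs a
 * separate copy only when SSE or YMM is present and the FP area itself is
 * not, because copying the FP area would otherwise carry MXCSR along.
 */
static bool demo_mxcsr_quirk(uint64_t xfeatures)
{
	if (!(xfeatures & (DEMO_MASK_SSE | DEMO_MASK_YMM)))
		return false;	/* nobody uses MXCSR */
	if (xfeatures & DEMO_MASK_FP)
		return false;	/* FP area is copied anyway */
	return true;
}
```

With these definitions, `demo_mxcsr_quirk(DEMO_MASK_SSE)` is true while `demo_mxcsr_quirk(DEMO_MASK_FP | DEMO_MASK_SSE)` is false.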
static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
void *init_xstate, unsigned int size)
{
membuf_write(to, from_xstate ? xstate : init_xstate, size);
}
/**
* copy_xstate_to_uabi_buf - Copy kernel saved xstate to a UABI buffer
* @to: membuf descriptor
* @xsave: The kernel xstate buffer to copy from
* @copy_mode: The requested copy mode
*
* Converts from kernel XSAVE or XSAVES compacted format to UABI conforming
* format, i.e. from the kernel internal hardware dependent storage format
* to the format requested by @copy_mode. UABI XSTATE is always uncompacted!
*
* It supports partial copy but @to.pos always starts from zero.
*/
void copy_xstate_to_uabi_buf(struct membuf to, struct xregs_state *xsave,
enum xstate_copy_mode copy_mode)
{
const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
struct xregs_state *xinit = &init_fpstate.xsave;
struct xstate_header header;
unsigned int zerofrom;
int i;
header.xfeatures = xsave->header.xfeatures;
/* Mask out the feature bits depending on copy mode */
switch (copy_mode) {
case XSTATE_COPY_FP:
header.xfeatures &= XFEATURE_MASK_FP;
break;
case XSTATE_COPY_FX:
header.xfeatures &= XFEATURE_MASK_FP | XFEATURE_MASK_SSE;
break;
case XSTATE_COPY_XSAVE:
header.xfeatures &= xfeatures_mask_user();
break;
}
/* Copy FP state up to MXCSR */
copy_feature(header.xfeatures & XFEATURE_MASK_FP, &to, &xsave->i387,
&xinit->i387, off_mxcsr);
/* Copy MXCSR when SSE or YMM are set in the feature mask */
copy_feature(header.xfeatures & (XFEATURE_MASK_SSE | XFEATURE_MASK_YMM),
&to, &xsave->i387.mxcsr, &xinit->i387.mxcsr,
MXCSR_AND_FLAGS_SIZE);
/* Copy the remaining FP state */
copy_feature(header.xfeatures & XFEATURE_MASK_FP,
&to, &xsave->i387.st_space, &xinit->i387.st_space,
sizeof(xsave->i387.st_space));
/* Copy the SSE state - shared with YMM, but independently managed */
copy_feature(header.xfeatures & XFEATURE_MASK_SSE,
&to, &xsave->i387.xmm_space, &xinit->i387.xmm_space,
sizeof(xsave->i387.xmm_space));
if (copy_mode != XSTATE_COPY_XSAVE)
goto out;
/* Zero the padding area */
membuf_zero(&to, sizeof(xsave->i387.padding));
/* Copy xsave->i387.sw_reserved */
membuf_write(&to, xstate_fx_sw_bytes, sizeof(xsave->i387.sw_reserved));
/* Copy the user space relevant state of @xsave->header */
membuf_write(&to, &header, sizeof(header));
zerofrom = offsetof(struct xregs_state, extended_state_area);
for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
/*
* The ptrace buffer is in non-compacted XSAVE format.
* In non-compacted format disabled features still occupy
* state space, but there is no state to copy from in the
* compacted init_fpstate. The gap tracking will zero this
* later.
*/
if (!(xfeatures_mask_user() & BIT_ULL(i)))
continue;
/*
* If there was a feature or alignment gap, zero the space
* in the destination buffer.
*/
if (zerofrom < xstate_offsets[i])
membuf_zero(&to, xstate_offsets[i] - zerofrom);
copy_feature(header.xfeatures & BIT_ULL(i), &to,
__raw_xsave_addr(xsave, i),
__raw_xsave_addr(xinit, i),
xstate_sizes[i]);
/*
* Keep track of the last copied state in the non-compacted
* target buffer for gap zeroing.
*/
zerofrom = xstate_offsets[i] + xstate_sizes[i];
}
out:
if (to.left)
membuf_zero(&to, to.left);
}
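The gap tracking in the loop above (the `zerofrom` variable) can be illustrated outside the kernel. The following is a minimal sketch, not the kernel implementation: `struct demo_component` and `demo_fill_flat()` are hypothetical names, a plain byte array stands in for the `membuf`, and a NULL `src` models a component that is skipped, so its slot is zeroed as part of the next gap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor for one state component in the flat layout. */
struct demo_component {
	size_t offset;		/* fixed offset in the non-compacted buffer */
	size_t size;		/* size of the component in bytes */
	const void *src;	/* data to copy, or NULL if the component is skipped */
};

/*
 * Fill dst[start..len) from the components and zero every gap between them
 * as well as the tail. Assumes the components are sorted by offset, do not
 * overlap, fit within len, and that start does not exceed the first offset.
 */
static void demo_fill_flat(uint8_t *dst, size_t len, size_t start,
			   const struct demo_component *c, size_t n)
{
	size_t zerofrom = start;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!c[i].src)
			continue;	/* slot is zeroed by a later gap fill */

		/* Zero any feature or alignment gap before this component. */
		if (zerofrom < c[i].offset)
			memset(dst + zerofrom, 0, c[i].offset - zerofrom);

		memcpy(dst + c[i].offset, c[i].src, c[i].size);

		/* Remember where the last copied component ended. */
		zerofrom = c[i].offset + c[i].size;
	}

	/* Zero whatever remains after the last copied component. */
	if (zerofrom < len)
		memset(dst + zerofrom, 0, len - zerofrom);
}
```

Tracking only the end of the last copied component keeps the loop free of per-component "was this skipped?" bookkeeping: a skipped slot simply becomes part of the next gap.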
static inline bool mxcsr_valid(struct xstate_header *hdr, const u32 *mxcsr)
{
u64 mask = XFEATURE_MASK_FP | XFEATURE_MASK_SSE | XFEATURE_MASK_YMM;
/* Only check if it is in use */
if (hdr->xfeatures & mask) {
/* Reserved bits in MXCSR must be zero. */
if (*mxcsr & ~mxcsr_feature_mask)
return false;
}
return true;
}
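The reserved-bit test in mxcsr_valid() is a plain mask check and can be sketched in userspace as follows. `demo_mxcsr_valid()` and `DEMO_MXCSR_DEFAULT_MASK` are hypothetical names; the value 0x0000ffbf is assumed here as the feature mask (it is the architectural default when FXSAVE reports an MXCSR_MASK of zero), whereas the kernel uses whatever it stored in mxcsr_feature_mask at boot.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed feature mask for the demo; the kernel probes the real one. */
#define DEMO_MXCSR_DEFAULT_MASK UINT32_C(0x0000ffbf)

/*
 * MXCSR is only checked when a component that owns it is in use.
 * Any bit outside the feature mask is reserved and must be zero.
 */
static bool demo_mxcsr_valid(bool in_use, uint32_t mxcsr, uint32_t feature_mask)
{
	if (!in_use)
		return true;
	return (mxcsr & ~feature_mask) == 0;
}
```

For example, the power-on MXCSR value 0x1f80 passes with this mask, while 0x1f80 with bit 6 (DAZ) set fails because 0x0000ffbf does not include bit 6.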
/*
* Convert from a ptrace standard-format kernel buffer to kernel XSAVE[S] format
* and copy to the target thread. This is called from xstateregs_set().
*/
int copy_uabi_from_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
{
unsigned int offset, size;
int i;
struct xstate_header hdr;
offset = offsetof(struct xregs_state, header);
size = sizeof(hdr);
memcpy(&hdr, kbuf + offset, size);
if (validate_user_xstate_header(&hdr))
return -EINVAL;
if (!mxcsr_valid(&hdr, kbuf + offsetof(struct fxregs_state, mxcsr)))
return -EINVAL;
for (i = 0; i < XFEATURE_MAX; i++) {
u64 mask = ((u64)1 << i);
if (hdr.xfeatures & mask) {
void *dst = __raw_xsave_addr(xsave, i);
offset = xstate_offsets[i];
size = xstate_sizes[i];
memcpy(dst, kbuf + offset, size);
}
}
if (xfeatures_mxcsr_quirk(hdr.xfeatures)) {
offset = offsetof(struct fxregs_state, mxcsr);
size = MXCSR_AND_FLAGS_SIZE;
memcpy(&xsave->i387.mxcsr, kbuf + offset, size);
}
/*
* The state that came in from userspace was user-state only.
* Mask all the user states out of 'xfeatures':
*/
xsave->header.xfeatures &= XFEATURE_MASK_SUPERVISOR_ALL;
/*
* Add back in the features that came in from userspace:
*/
xsave->header.xfeatures |= hdr.xfeatures;
return 0;
}
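The final two statements of the function above amount to one bitwise expression: keep the supervisor bits already in the kernel buffer and replace all user bits with those from the incoming header. A minimal sketch, assuming the incoming header has already been validated to contain user bits only; `demo_merge_xfeatures()` is a hypothetical helper and the supervisor mask is passed in rather than taken from XFEATURE_MASK_SUPERVISOR_ALL.

```c
#include <stdint.h>

/*
 * current:         xfeatures currently recorded in the kernel buffer
 * user_hdr:        xfeatures from the (already validated) user header
 * supervisor_mask: all supervisor state bits, which user space never sets
 */
static uint64_t demo_merge_xfeatures(uint64_t current, uint64_t user_hdr,
				     uint64_t supervisor_mask)
{
	/* Drop every user bit, then add back what user space supplied. */
	return (current & supervisor_mask) | user_hdr;
}
```

With bit 8 as the only supervisor bit, a buffer holding 0x105 (FP, YMM, bit 8) merged with a user header of 0x3 (FP, SSE) yields 0x103: YMM is dropped because the user header did not carry it, and bit 8 survives.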
/*
* Convert from a sigreturn standard-format user-space buffer to kernel
* XSAVE[S] format and copy to the target thread. This is called from the
* sigreturn() and rt_sigreturn() system calls.
*/
int copy_sigframe_from_user_to_xstate(struct xregs_state *xsave,
const void __user *ubuf)
{
unsigned int offset, size;
int i;
struct xstate_header hdr;
offset = offsetof(struct xregs_state, header);
size = sizeof(hdr);
if (copy_from_user(&hdr, ubuf + offset, size))
return -EFAULT;
if (validate_user_xstate_header(&hdr))
return -EINVAL;
for (i = 0; i < XFEATURE_MAX; i++) {
u64 mask = ((u64)1 << i);
if (hdr.xfeatures & mask) {
void *dst = __raw_xsave_addr(xsave, i);
offset = xstate_offsets[i];
size = xstate_sizes[i];
if (copy_from_user(dst, ubuf + offset, size))
return -EFAULT;
}
}
if (xfeatures_mxcsr_quirk(hdr.xfeatures)) {
offset = offsetof(struct fxregs_state, mxcsr);
size = MXCSR_AND_FLAGS_SIZE;
if (copy_from_user(&xsave->i387.mxcsr, ubuf + offset, size))
return -EFAULT;
}
/*
* The state that came in from userspace was user-state only.
* Mask all the user states out of 'xfeatures':
*/
xsave->header.xfeatures &= XFEATURE_MASK_SUPERVISOR_ALL;
/*
* Add back in the features that came in from userspace:
*/
xsave->header.xfeatures |= hdr.xfeatures;
x86/mm/pkeys: Allow kernel to modify user pkey rights register The Protection Key Rights for User memory (PKRU) is a 32-bit user-accessible register. It contains two bits for each protection key: one to write-disable (WD) access to memory covered by the key and another to access-disable (AD). Userspace can read/write the register with the RDPKRU and WRPKRU instructions. But, the register is saved and restored with the XSAVE family of instructions, which means we have to treat it like a floating point register. The kernel needs to write to the register if it wants to implement execute-only memory or if it implements a system call to change PKRU. To do this, we need to create a 'pkru_state' buffer, read the old contents into it, modify it, and then tell the FPU code that there is modified data in there so it can (possibly) move the buffer back into the registers. This uses the fpu__xfeature_set_state() function that we defined in the previous patch. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210236.0BE13217@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-12 21:02:36 +00:00
return 0;
}
x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status AVX-512 components usage can result in turbo frequency drop. So it's useful to expose AVX-512 usage elapsed time as a heuristic hint for user space job schedulers to cluster the AVX-512 using tasks together. Examples: $ while [ 1 ]; do cat /proc/tid/arch_status | grep AVX512; sleep 1; done AVX512_elapsed_ms: 4 AVX512_elapsed_ms: 8 AVX512_elapsed_ms: 4 This means that 4 milliseconds have elapsed since the task's AVX512 usage was detected when the task was scheduled out. $ cat /proc/tid/arch_status | grep AVX512 AVX512_elapsed_ms: -1 '-1' indicates that no AVX512 usage was recorded for this task. The time exposed is not necessarily accurate when the arch_status file is read as the AVX512 usage is only evaluated when a task is scheduled out. Accurate usage information can be obtained with performance counters. [ tglx: Massaged changelog ] Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: peterz@infradead.org Cc: hpa@zytor.com Cc: ak@linux.intel.com Cc: tim.c.chen@linux.intel.com Cc: dave.hansen@intel.com Cc: arjan@linux.intel.com Cc: adobriyan@gmail.com Cc: aubrey.li@intel.com Cc: linux-api@vger.kernel.org Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linux API <linux-api@vger.kernel.org> Link: https://lkml.kernel.org/r/20190606012236.9391-2-aubrey.li@linux.intel.com
2019-06-06 01:22:35 +00:00
/**
* copy_dynamic_supervisor_to_kernel() - Save dynamic supervisor states to
* an xsave area
* @xstate: A pointer to an xsave area
* @mask: The dynamic supervisor features to save into the xsave area
*
* Only the dynamic supervisor states set in the mask are saved into the xsave
* area (see the comment at XFEATURE_MASK_DYNAMIC for the details of the
* dynamic supervisor feature). Besides the dynamic supervisor states, the
* legacy region and XSAVE header are also saved into the xsave area. The
* supervisor features in XFEATURE_MASK_SUPERVISOR_SUPPORTED and
* XFEATURE_MASK_SUPERVISOR_UNSUPPORTED are not saved.
*
* The xsave area must be 64-byte aligned.
*/
void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
{
u64 dynamic_mask = xfeatures_mask_dynamic() & mask;
u32 lmask, hmask;
int err;
if (WARN_ON_FPU(!boot_cpu_has(X86_FEATURE_XSAVES)))
return;
if (WARN_ON_FPU(!dynamic_mask))
return;
lmask = dynamic_mask;
hmask = dynamic_mask >> 32;
XSTATE_OP(XSAVES, xstate, lmask, hmask, err);
/* Should never fault when copying to a kernel buffer */
WARN_ON_FPU(err);
}
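XSAVES and XRSTORS take their 64-bit feature mask in EDX:EAX, which is why the function above splits `dynamic_mask` into `lmask` and `hmask`. The split itself is plain integer arithmetic and can be sketched as follows; `struct demo_mask_pair`, `demo_split_mask()` and `demo_join_mask()` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* The two 32-bit halves handed to the instruction (EAX = lo, EDX = hi). */
struct demo_mask_pair {
	uint32_t lo;
	uint32_t hi;
};

static struct demo_mask_pair demo_split_mask(uint64_t mask)
{
	struct demo_mask_pair p;

	p.lo = (uint32_t)mask;		/* truncation keeps bits 0..31 */
	p.hi = (uint32_t)(mask >> 32);	/* bits 32..63 */
	return p;
}

/* Inverse of demo_split_mask(), useful for checking the round trip. */
static uint64_t demo_join_mask(struct demo_mask_pair p)
{
	return ((uint64_t)p.hi << 32) | p.lo;
}
```

In the kernel the assignment `lmask = dynamic_mask` performs the same truncation implicitly, because `lmask` is declared as `u32`.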
/**
* copy_kernel_to_dynamic_supervisor() - Restore dynamic supervisor states from
* an xsave area
* @xstate: A pointer to an xsave area
* @mask: The dynamic supervisor features to restore from the xsave area
*
* Only the dynamic supervisor states set in the mask are restored from the
* xsave area (see the comment at XFEATURE_MASK_DYNAMIC for the details of the
* dynamic supervisor feature). Besides the dynamic supervisor states, the
* legacy region and XSAVE header are also restored from the xsave area. The
* supervisor features in XFEATURE_MASK_SUPERVISOR_SUPPORTED and
* XFEATURE_MASK_SUPERVISOR_UNSUPPORTED are not restored.
*
* The xsave area must be 64-byte aligned.
*/
void copy_kernel_to_dynamic_supervisor(struct xregs_state *xstate, u64 mask)
{
u64 dynamic_mask = xfeatures_mask_dynamic() & mask;
u32 lmask, hmask;
int err;
if (WARN_ON_FPU(!boot_cpu_has(X86_FEATURE_XSAVES)))
return;
if (WARN_ON_FPU(!dynamic_mask))
return;
lmask = dynamic_mask;
hmask = dynamic_mask >> 32;
XSTATE_OP(XRSTORS, xstate, lmask, hmask, err);
/* Should never fault when copying from a kernel buffer */
WARN_ON_FPU(err);
}
#ifdef CONFIG_PROC_PID_ARCH_STATUS
/*
* Report the time elapsed, in milliseconds, since the task's last
* recorded AVX512 use.
*/
static void avx512_status(struct seq_file *m, struct task_struct *task)
{
unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp);
long delta;
if (!timestamp) {
/*
* Report -1 if no AVX512 usage
*/
delta = -1;
} else {
delta = (long)(jiffies - timestamp);
/*
* Cap to LONG_MAX if time difference > LONG_MAX
*/
if (delta < 0)
delta = LONG_MAX;
delta = jiffies_to_msecs(delta);
}
seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
seq_putc(m, '\n');
}
/*
* Report architecture specific information
*/
int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
/*
* Report AVX512 state if both the processor and the build options support it.
*/
if (cpu_feature_enabled(X86_FEATURE_AVX512F))
avx512_status(m, task);
return 0;
}
#endif /* CONFIG_PROC_PID_ARCH_STATUS */
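The elapsed-time logic in avx512_status() relies on unsigned subtraction to survive a wrapping tick counter, and on a signed cast to detect implausibly large differences. A minimal userspace sketch of just that part follows; `demo_elapsed_ticks()` is a hypothetical helper that works on raw tick counts and leaves out the jiffies_to_msecs() conversion. It assumes the usual two's-complement result when an out-of-range `unsigned long` is converted to `long`, which is what mainstream compilers produce.

```c
#include <limits.h>

/*
 * now:       current tick counter (may have wrapped since 'timestamp')
 * timestamp: tick counter at the last recorded use, 0 meaning "never"
 *
 * Returns -1 if there was no recorded use, otherwise the number of elapsed
 * ticks, capped at LONG_MAX.
 */
static long demo_elapsed_ticks(unsigned long now, unsigned long timestamp)
{
	long delta;

	if (!timestamp)
		return -1;

	/* Unsigned subtraction is well defined across counter wraparound. */
	delta = (long)(now - timestamp);

	/* A negative value means the difference exceeded LONG_MAX. */
	if (delta < 0)
		delta = LONG_MAX;

	return delta;
}
```

Using 0 as the "never used" marker means a use recorded at exactly tick 0 is indistinguishable from no use at all; the sketch inherits that property from the code above.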