A previous commit changed the notification mode from true/false to an
int, allowing notify-no, notify-yes, or signal-notify. This was
backwards compatible in the sense that any existing true/false user
would translate to either 0 (no notification sent) or 1, the latter of
which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.
Clean this up properly, and define a proper enum for the notification
mode. Now we have:
- TWA_NONE. This is 0, same as before the original change, meaning no
notification requested.
- TWA_RESUME. This is 1, same as before the original change, meaning
that we use TIF_NOTIFY_RESUME.
- TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
notification.
Clean up all the callers, switching their 0/1/false/true to using the
appropriate TWA_* mode for notifications.
Fixes: e91b481623 ("task_work: teach task_work_add() to do signal_wake_up()")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Extend the recovery from MCE in kernel space also to processes which
encounter an MCE in kernel space but while copying from user memory
by sending them a SIGBUS on return to user space and unmapping the
faulty memory, by Tony Luck and Youquan Song.
- memcpy_mcsafe() rework by splitting the functionality into
copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
support for new hardware which can recover from a machine check
encountered during a fast string copy, makes that the default, and
lets the older hardware which does not support that advanced recovery
opt in to use the old, fragile, slow variant, by Dan Williams.
- New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
- Do not use MSR-tracing accessors in #MC context and flag any fault
while accessing MCA architectural MSRs as an architectural violation
with the hope that such hw/fw misdesigns are caught early during the
hw eval phase and they don't make it into production.
- Misc fixes, improvements and cleanups, as always.
* tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Allow for copy_mc_fragile symbol checksum to be generated
x86/mce: Decode a kernel instruction to determine if it is copying from user
x86/mce: Recover from poison found while copying from user space
x86/mce: Avoid tail copy when machine check terminated a copy from user
x86/mce: Add _ASM_EXTABLE_CPY for copy user access
x86/mce: Provide method to find out the type of an exception handler
x86/mce: Pass pointer to saved pt_regs to severity calculation routines
x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list
x86/mce: Add Skylake quirk for patrol scrub reported errors
RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE()
x86/mce: Annotate mce_rd/wrmsrl() with noinstr
x86/mce/dev-mcelog: Do not update kflags on AMD systems
x86/mce: Stop mce_reign() from re-computing severity for every CPU
x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR
x86/mce: Increase maximum number of banks to 64
x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()
x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap
RAS/CEC: Fix cec_init() prototype
All instructions copying data between kernel and user memory
are tagged with either _ASM_EXTABLE_UA or _ASM_EXTABLE_CPY
entries in the exception table. ex_fault_handler_type() returns
EX_HANDLER_UACCESS for both of these.
Recovery is only possible when the machine check was triggered
on a read from user memory. In this case the same strategy for
recovery applies as if the user had made the access in ring3. If
the fault was in kernel memory while copying to user there is no
current recovery plan.
For MOV and MOVZ instructions a full decode of the instruction
is done to find the source address. For MOVS instructions
the source address is in the %rsi register. The function
fault_in_kernel_space() determines whether the source address is
kernel or user; it is upgraded from "static" so it can be used here.
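A simplified sketch of that check (shape only; the real code uses the
kernel instruction decoder and handles more cases, and the helper name
here is illustrative):

  /* Sketch: does the faulting instruction read its source from user space? */
  static bool is_copy_from_user(struct pt_regs *regs)
  {
          unsigned long src;

          /*
           * MOV/MOVZ: fully decode the instruction at regs->ip and compute
           * the source address from its operands (omitted here).
           * MOVS: the source address is simply in %rsi.
           */
          src = regs->si;

          return !fault_in_kernel_space(src);
  }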
Co-developed-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-7-tony.luck@intel.com
Existing kernel code can only recover from a machine check on code that
is tagged in the exception table with a fault handling recovery path.
Add two new fields in the task structure to pass information from
the machine check handler to the "task_work" that is queued to run before
the task returns to user mode:
+ mce_vaddr: will be initialized to the user virtual address of the fault
in the case where the fault occurred in the kernel copying data from
a user address. This is so that kill_me_maybe() can provide that
information to the user SIGBUS handler.
+ mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe()
to determine if mce_vaddr is applicable to this error.
Add code to recover from a machine check while copying data from user
space to the kernel. Action for this case is the same as if the user
touched the poison directly; unmap the page and send a SIGBUS to the task.
Use a new helper function to share common code between the "fault
in user mode" case and the "fault while copying from user" case.
New code paths will be activated by the next patch which sets
MCE_IN_KERNEL_COPYIN.
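Roughly, the additions look like this (a sketch only; field placement,
types and the queuing details are abridged, and the helper body is an
assumption about the final shape):

  /* new members in struct task_struct */
  void __user             *mce_vaddr;   /* user vaddr behind the kernel-copy fault */
  __u64                    mce_kflags;  /* copy of struct mce.kflags */

  /* shared helper used by both recovery cases (shape only) */
  static void queue_task_work(struct mce *m, int kill_it)
  {
          if (m->kflags & MCE_IN_KERNEL_COPYIN)
                  current->mce_vaddr = (void __user *)m->addr; /* assumption */

          current->mce_kflags = m->kflags;
          current->mce_kill_me.func = kill_it ? kill_me_now : kill_me_maybe;
          task_work_add(current, &current->mce_kill_me, TWA_RESUME);
  }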
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
Avoid a proliferation of ex_has_*_handler() functions by having just
one function that returns the type of the handler (if any).
Drop the __visible attribute for this function. It is not called
from assembler so the attribute is not necessary.
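For reference, the shape of that interface (a sketch; enumerator names
as best recalled):

  /* one query instead of a family of ex_has_*_handler() helpers */
  enum handler_type {
          EX_HANDLER_NONE,
          EX_HANDLER_FAULT,
          EX_HANDLER_UACCESS,
          EX_HANDLER_OTHER,
  };

  enum handler_type ex_fault_handler_type(unsigned long ip);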
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-3-tony.luck@intel.com
New recovery features require additional information about processor
state when a machine check occurred. Pass pt_regs down to the routines
that need it.
No functional change.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-2-tony.luck@intel.com
In reaction to a proposal to introduce a memcpy_mcsafe_fast()
implementation Linus points out that memcpy_mcsafe() is poorly named
relative to communicating the scope of the interface. Specifically what
addresses are valid to pass as source, destination, and what faults /
exceptions are handled.
Of particular concern is that even though x86 might be able to handle
the semantics of copy_mc_to_user() with its common copy_user_generic()
implementation other archs likely need / want an explicit path for this
case:
On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > However now I see that copy_user_generic() works for the wrong reason.
> > It works because the exception on the source address due to poison
> > looks no different than a write fault on the user address to the
> > caller, it's still just a short copy. So it makes copy_to_user() work
> > for the wrong reason relative to the name.
>
> Right.
>
> And it won't work that way on other architectures. On x86, we have a
> generic function that can take faults on either side, and we use it
> for both cases (and for the "in_user" case too), but that's an
> artifact of the architecture oddity.
>
> In fact, it's probably wrong even on x86 - because it can hide bugs -
> but writing those things is painful enough that everybody prefers
> having just one function.
Replace a single top-level memcpy_mcsafe() with either
copy_mc_to_user(), or copy_mc_to_kernel().
Introduce an x86 copy_mc_fragile() name as the rename for the
low-level x86 implementation formerly named memcpy_mcsafe(). It is used
as the slow / careful backend that is supplanted by a fast
copy_mc_generic() in a follow-on patch.
One side-effect of this reorganization is that separating copy_mc_64.S
into its own file means that perf no longer needs to track dependencies
for its memcpy_64.S benchmarks.
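The resulting top-level interface, roughly (parameter spellings and
annotations hedged; both return the number of bytes not copied):

  /* destination is a user pointer */
  unsigned long __must_check
  copy_mc_to_user(void __user *dst, const void *src, unsigned int len);

  /* kernel-to-kernel copy (e.g. reading from pmem) */
  unsigned long __must_check
  copy_mc_to_kernel(void *dst, const void *src, unsigned int len);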
[ bp: Massage a bit. ]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
The recent fix for NMI vs. IRQ state tracking did not apply the cure
to the MCE handler.
Fixes: ba1f2b2eaa ("x86/entry: Fix NMI vs IRQ state tracking")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/87mu17ism2.fsf@nanos.tec.linutronix.de
Way back in v3.19 Intel and AMD shared the same machine check severity
grading code. So it made sense to add a case for AMD DEFERRED errors in
commit
e3480271f5 ("x86, mce, severity: Extend the the mce_severity mechanism to handle UCNA/DEFERRED error")
But later in v4.2 AMD switched to a separate grading function in
commit
bf80bbd7dc ("x86/mce: Add an AMD severities-grading function")
Belatedly drop the DEFERRED case from the Intel rule list.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200930021313.31810-3-tony.luck@intel.com
The patrol scrubber in Skylake and Cascade Lake systems can be configured
to report uncorrected errors using a special signature in the machine
check bank and to signal using CMCI instead of machine check.
Update the severity calculation mechanism to allow specifying the model,
minimum stepping and range of machine check bank numbers.
Add a new rule to detect the special signature (on model 0x55, stepping
>=4 in any of the memory controller banks).
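In other words, a severity rule can now additionally match on CPU model,
minimum stepping and a bank number range, roughly like this (field names
approximate, bank numbers an assumption):

  /* sketch: extra match criteria on a severity table entry */
  .cpu_model       = INTEL_FAM6_SKYLAKE_X,
  .cpu_minstepping = 4,
  .bank_lo         = 13,   /* assumption: memory controller banks on these parts */
  .bank_hi         = 18,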
[ bp: Rewrite it.
aegl: Productize it. ]
Suggested-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200930021313.31810-2-tony.luck@intel.com
mce_rdmsrl() and mce_wrmsrl() do get called from the #MC handler,
which is already marked "noinstr".
Commit
e2def7d49d ("x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR")
already got rid of the instrumentation in the MSR accessors, fix the
annotation now too, in order to get rid of:
vmlinux.o: warning: objtool: do_machine_check()+0x4a: call to mce_rdmsrl() leaves .noinstr.text section
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200915194020.28807-1-bp@alien8.de
The mcelog utility is not commonly used on AMD systems. Therefore,
errors logged only by the dev_mce_log() notifier will be missed. This
may occur if the EDAC modules are not loaded, in which case it's
preferable to have the default notifier print the error record.
However, the mce->kflags set by dev_mce_log() notifier makes the
default notifier skip over the errors assuming they are processed by
dev_mce_log().
Do not update kflags in the dev_mce_log() notifier on AMD systems.
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200903234531.162484-3-Smita.KoralahalliChannabasappa@amd.com
Back in commit:
20d51a426f ("x86/mce: Reuse one of the u16 padding fields in 'struct mce'")
a field was added to "struct mce" to save the computed error severity.
Make use of this in mce_reign() to avoid re-computing the severity
for every CPU.
In the case where the machine panics, one call to mce_severity() is
still needed in order to provide the correct message giving the reason
for the panic.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200908175519.14223-2-tony.luck@intel.com
If an exception needs to be handled while reading an MSR - which is in
most of the cases caused by a #GP on a non-existent MSR - then this
is most likely the incarnation of a BIOS or a hardware bug. Such a bug
violates the architectural guarantee that MCA banks are present with all
MSRs belonging to them.
The proper fix belongs in the hardware/firmware - not in the kernel.
Handling an #MC exception which is raised while an NMI is being handled
would cause the nasty NMI nesting issue because of IRET's shortcoming
of reenabling NMIs when executed. And the machine is in an #MC context
already, so <Deity> be at its side.
Tracing MSR accesses while in #MC is another no-no due to tracing being
inherently a bad idea in atomic context:
vmlinux.o: warning: objtool: do_machine_check()+0x4a: call to mce_rdmsrl() leaves .noinstr.text section
so remove all that "additional" functionality from mce_rdmsrl() and
provide it with a special exception handler which panics the machine
when that MSR is not accessible.
The exception handler prints a human-readable message explaining what
the panic reason is but, what is more, it panics while in the #GP
handler and the latter won't have executed an IRET, thus opening the
NMI nesting issue in the case where the #MC happened while handling
an NMI. (#MC itself won't be reenabled until MCG_STATUS has been
cleared.)
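The dedicated fixup amounts to something like this (message text and
exact signature approximate):

  /* sketch of the #GP fixup installed for the MCA MSR reads */
  static bool ex_handler_rdmsr_fault(const struct exception_table_entry *fixup,
                                     struct pt_regs *regs, int trapnr,
                                     unsigned long error_code,
                                     unsigned long fault_addr)
  {
          pr_emerg("MSR access error: RDMSR from 0x%x at rIP: 0x%lx (%pS)\n",
                   (unsigned int)regs->cx, regs->ip, (void *)regs->ip);

          show_stack_regs(regs);

          panic("MCA architectural violation!\n");
  }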
Suggested-by: Andy Lutomirski <luto@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
[ Add missing prototypes for ex_handler_* ]
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200906212130.GA28456@zn.tnic
A long time ago, Linux cleared IA32_MCG_STATUS at the very end of machine
check processing.
Then, some fancy recovery and IST manipulation was added in:
d4812e169d ("x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks")
and clearing IA32_MCG_STATUS was pulled earlier in the function.
The next change moved the actual recovery out of do_machine_check() and
just used task_work_add() to schedule it later (before returning to the
user):
5567d11c21 ("x86/mce: Send #MC singal from task work")
Most recently the fancy IST footwork was removed as no longer needed:
b052df3da8 ("x86/entry: Get rid of ist_begin/end_non_atomic()")
At this point there is no reason remaining to clear IA32_MCG_STATUS early.
It can move back to the very end of the function.
Also move sync_core(). The comments for this function say that it should
only be called when instructions have been changed/re-mapped. Recovery
for an instruction fetch may change the physical address. But that
doesn't happen until the scheduled work runs (which could be on another
CPU).
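So the tail of do_machine_check() ends up roughly as (sketch; sync_core()
presumably moves into the queued task work, which is an assumption):

  out:
          /* clearing MCG_STATUS (incl. MCIP) re-enables #MC delivery, so do
           * it only after all in-handler work is done */
          mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);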
[ bp: Massage commit message. ]
Reported-by: Gabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200824221237.5397-1-tony.luck@intel.com
The Extended Error Code Bitmap (xec_bitmap) for a Scalable MCA bank type
was intended to be used by the kernel to filter out invalid error codes
on a system. However, this is unnecessary after a few product releases
because the hardware will only report valid error codes. Thus, there's
no need for it with future systems.
Remove the xec_bitmap field and all references to it.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200720145353.43924-1-Yazen.Ghannam@amd.com
Merge tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 conversion to generic entry code from Thomas Gleixner:
"The conversion of X86 syscall, interrupt and exception entry/exit
handling to the generic code.
Pretty much a straight-forward 1:1 conversion plus the consolidation
of the KVM handling of pending work before entering guest mode"
* tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu()
x86/kvm: Use generic xfer to guest work function
x86/entry: Cleanup idtentry_enter/exit
x86/entry: Use generic interrupt entry/exit code
x86/entry: Cleanup idtentry_entry/exit_user
x86/entry: Use generic syscall exit functionality
x86/entry: Use generic syscall entry function
x86/ptrace: Provide pt_regs helper for entry/exit
x86/entry: Move user return notifier out of loop
x86/entry: Consolidate 32/64 bit syscall entry
x86/entry: Consolidate check_user_regs()
x86: Correct noinstr qualifiers
x86/idtentry: Remove stale comment
Merge tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Ingo Molnar:
"Boris is on vacation and he asked us to send you the pending RAS bits:
- Print the PPIN field on CPUs that fill them out
- Fix an MCE injection bug
- Simplify a kzalloc in dev_mcelog_init_device()"
* tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce, EDAC/mce_amd: Print PPIN in machine check records
x86/mce/dev-mcelog: Use struct_size() helper in kzalloc()
x86/mce/inject: Fix a wrong assignment of i_mce.status
Having sync_core() in processor.h is problematic since it is not possible
to check for hardware capabilities via the *cpu_has() family of macros.
The latter needs the definitions in processor.h.
It also looks more intuitive to relocate the function to sync_core.h.
This changeset does not make changes in functionality.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200727043132.15082-3-ricardo.neri-calderon@linux.intel.com
The noinstr qualifier is to be specified before the return type in the
same way inline is used.
These 2 cases were missed by previous patches.
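For example (hypothetical function, just to show the placement):

  /* before (inconsistent placement): */
  void noinstr handle_foo(struct pt_regs *regs);

  /* after: like 'inline', the qualifier precedes the return type: */
  noinstr void handle_foo(struct pt_regs *regs);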
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200723161405.852613-1-ira.weiny@intel.com
Merge tag 'x86-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A series of fixes for x86:
- Reset MXCSR in kernel_fpu_begin() to prevent using a stale user
space value.
- Prevent writing MSR_TEST_CTRL on CPUs which are not explicitly
whitelisted for split lock detection. Some CPUs which do not
support it crash even when the MSR is written to 0 which is the
default value.
- Fix the XEN PV fallout of the entry code rework
- Fix the 32bit fallout of the entry code rework
- Add more selftests to ensure that these entry problems don't come
back.
- Disable 16 bit segments on XEN PV. It's not supported because XEN
PV does not implement ESPFIX64"
* tag 'x86-urgent-2020-07-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ldt: Disable 16-bit segments on Xen PV
x86/entry/32: Fix #MC and #DB wiring on x86_32
x86/entry/xen: Route #DB correctly on Xen PV
x86/entry, selftests: Further improve user entry sanity checks
x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER
selftests/x86: Consolidate and fix get/set_eflags() helpers
selftests/x86/syscall_nt: Clear weird flags after each test
selftests/x86/syscall_nt: Add more flag combinations
x86/entry/64/compat: Fix Xen PV SYSENTER frame setup
x86/entry: Move SYSENTER's regs->sp and regs->flags fixups into C
x86/entry: Assert that syscalls are on the right stack
x86/split_lock: Don't write MSR_TEST_CTRL on CPUs that aren't whitelisted
x86/fpu: Reset MXCSR to default in kernel_fpu_begin()
DEFINE_IDTENTRY_MCE and DEFINE_IDTENTRY_DEBUG were wired up as non-RAW
on x86_32, but the code expected them to be RAW.
Get rid of all the macro indirection for them on 32-bit and just use
DECLARE_IDTENTRY_RAW and DEFINE_IDTENTRY_RAW directly.
Also add a warning to make sure that we only hit the _kernel paths
in kernel mode.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/9e90a7ee8e72fd757db6d92e1e5ff16339c1ecf9.1593795633.git.luto@kernel.org
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.
This code was detected with the help of Coccinelle, and audited and
fixed manually.
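I.e. the open-coded size computation becomes (member and variable names
as best recalled):

  /* before */
  mcelog = kzalloc(sizeof(*mcelog) + mce_log_len * sizeof(struct mce), GFP_KERNEL);

  /* after: struct_size() guards against type slips and overflow */
  mcelog = kzalloc(struct_size(mcelog, entry, mce_log_len), GFP_KERNEL);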
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200617211734.GA9636@embeddedor
The original code is a no-op, as i_mce.status is OR'ed with part of
itself; fix it.
Fixes: a1300e5052 ("x86/ras/mce_amd_inj: Trigger deferred and thresholding errors interrupts")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20200611023238.3830-1-zhenzhong.duan@gmail.com
The kbuild test robot reported this warning:
arch/x86/kernel/cpu/mce/dev-mcelog.c: In function 'dev_mcelog_init_device':
arch/x86/kernel/cpu/mce/dev-mcelog.c:346:2: warning: 'strncpy' output \
truncated before terminating nul copying 12 bytes from a string of the \
same length [-Wstringop-truncation]
This is accurate, but I don't care that the trailing NUL character isn't
copied. The string being copied is just a magic number signature so that
crash dump tools can be sure they are decoding the right blob of memory.
Use memcpy() instead of strncpy().
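I.e. (the signature is a fixed-size magic, so no terminating NUL is
needed; identifier names as best recalled):

  /* before: -Wstringop-truncation, and the NUL wouldn't fit anyway */
  strncpy(mcelog->signature, MCE_LOG_SIGNATURE, sizeof(mcelog->signature));

  /* after */
  memcpy(mcelog->signature, MCE_LOG_SIGNATURE, sizeof(mcelog->signature));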
Fixes: d8ecca4043 ("x86/mce/dev-mcelog: Dynamically allocate space for machine check records")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200527182808.27737-1-tony.luck@intel.com
An interesting thing happened when a guest Linux instance took a machine
check. The VMM unmapped the bad page from guest physical space and
passed the machine check to the guest.
Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:
do_memory_failure
set_mce_nospec
set_memory_uc
_set_memory_uc
change_page_attr_set_clr
cpa_flush
clflush_cache_range_opt
This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because it appeared that the
guest was accessing the bad page.
The fix is to check the scope of the poison by checking the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.
This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).
[ bp: Adjust to x86/entry changes. ]
Fixes: 284ce4011b ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reported-by: Jue Wang <juew@google.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jue Wang <juew@google.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200520163546.GA7977@agluck-desk2.amr.corp.intel.com
to fix up conflicts in arch/x86/kernel/cpu/mce/core.c so MCE-specific
follow-up patches can be applied without creating a horrible merge conflict
afterwards.
The typical pattern for trace_hardirqs_off_prepare() is:
  ENTRY
    lockdep_hardirqs_off(); // because hardware
    ... do entry magic
    instrumentation_begin();
    trace_hardirqs_off_prepare();
    ... do actual work
    trace_hardirqs_on_prepare();
    lockdep_hardirqs_on_prepare();
    instrumentation_end();
    ... do exit magic
    lockdep_hardirqs_on();
which shows that it's named wrong, rename it to
trace_hardirqs_off_finish(), as it concludes the hardirq_off transition.
Also, given that the above is the only correct order, make the traditional
all-in-one trace_hardirqs_off() follow suit.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200529213321.415774872@infradead.org
The last step to remove the irq tracing cruft from ASM. Ignore #DF as the
machine is going to die anyway.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200521202120.414043330@linutronix.de
Convert various system vectors to IDTENTRY_SYSVEC:
- Implement the C entry point with DEFINE_IDTENTRY_SYSVEC
- Emit the ASM stub with DECLARE_IDTENTRY_SYSVEC
- Remove the ASM idtentries in 64-bit
- Remove the BUILD_INTERRUPT entries in 32-bit
- Remove the old prototypes
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200521202119.464812973@linutronix.de
Switch all idtentry_enter/exit() users over to the new conditional RCU
handling scheme and make the user mode entries in #DB, #INT3 and #MCE use
the user mode idtentry functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20200521202117.382387286@linutronix.de
Mark the relevant functions noinstr, use the plain non-instrumented MSR
accessors. The only odd part is the instrumentation_begin()/end() pair around the
indirect machine_check_vector() call as objtool can't figure that out. The
possible invoked functions are annotated correctly.
Also use notrace variant of nmi_enter/exit(). If MCEs happen then hardware
latency tracing is the least of the worries.
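The objtool-appeasing part around the indirect call looks roughly like
(a sketch):

  /* objtool can't see through the indirect call, so bracket it explicitly */
  instrumentation_begin();
  machine_check_vector(regs);
  instrumentation_end();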
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135315.476734898@linutronix.de
The MCE entry point uses the same mechanism as the IST entry point for
now. For #DB split the inner workings and just keep the nmi_enter/exit()
magic in the IST variant. Fixup the ASM code to emit the proper
noist_##cfunc call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135315.177564104@linutronix.de
mce_check_crashing_cpu() is called right at the entry of the MCE
handler. It uses mce_rdmsr() and mce_wrmsr() which are wrappers around
rdmsr() and wrmsr() to handle the MCE error injection mechanism, which is
pointless in this context, i.e. when the MCE hits an offline CPU or the
system is already marked crashing.
The MSR access can also be traced, so use the untraceable variants. This
is also safe vs. XEN paravirt as these MSRs are not affected by XEN PV
modifications.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.426347351@linutronix.de
Convert #MC to IDTENTRY_MCE:
- Implement the C entry points with DEFINE_IDTENTRY_MCE
- Emit the ASM stub with DECLARE_IDTENTRY_MCE
- Remove the ASM idtentry in 64bit
- Remove the open coded ASM entry code in 32bit
- Fixup the XEN/PV code
- Remove the old prototypes
- Remove the error code from *machine_check_vector() as
it is always 0 and not used by any of the functions
it can point to. Fixup all the functions as well.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.334980426@linutronix.de
There is no reason to have nmi_enter/exit() in the actual MCE
handlers. Move it to the entry point. This also covers the initial
handler, which only prints and was not covered until now.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200505135314.243936614@linutronix.de
A few exceptions (like #DB and #BP) can happen at any location in the code;
this then means that tracers should treat events from these exceptions as
NMI-like. The interrupted context could be holding locks with interrupts
disabled for instance.
Similarly, #MC is an actual NMI-like exception.
All of them use ist_enter() which only concerns itself with RCU, but does
not do any of the other setup that NMIs need. This means things like:
printk()
raw_spin_lock_irq(&logbuf_lock);
<#DB/#BP/#MC>
printk()
raw_spin_lock_irq(&logbuf_lock);
are entirely possible (well, not really since printk tries hard to
play nice, but the concept stands).
So replace ist_enter() with nmi_enter(). Also observe that any nmi_enter()
caller must be both notrace and NOKPROBE, or in the noinstr text section.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134101.525508608@linutronix.de
Convert #MC over to using task_work_add(); it will run the same code
slightly later, on the return to user path of the same exception.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200505134100.957390899@linutronix.de
This is completely overengineered and definitely not an interface which
should be made available to anything other than this particular MCE case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200505134059.462640294@linutronix.de
A 32-bit version of mcelog issuing ioctls on /dev/mcelog causes errors
like the following:
MCE_GET_RECORD_LEN: Inappropriate ioctl for device
This is due to a missing compat_ioctl callback.
Assign compat_ptr_ioctl() to it: a generic implementation of the
.compat_ioctl file operation for ioctl functions that either ignore the
argument or pass a pointer to a compatible data type.
[ bp: Massage commit message. ]
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/1583303947-49858-1-git-send-email-zhe.he@windriver.com
The severity grading code returns IN_KERNEL_RECOV error context for
errors which have happened in kernel space but from which the kernel can
recover. Whether the recovery can happen is determined by the exception
table entry having ex_handler_fault() as its handler, declared at
build time using _ASM_EXTABLE_FAULT().
IN_KERNEL_RECOV is used in mce_severity_intel() to lookup the
corresponding error severity in the severities table.
However, the mapping back from error severity to whether the error is
IN_KERNEL_RECOV is ambiguous, and in the very paranoid case - which
might not be possible right now, but better safe than sorry later - an
exception fixup could be attempted for another MCE whose address is in
the exception table and which has the proper severity. Which would be
unfortunate, to say the least.
Therefore, mark such MCEs explicitly as MCE_IN_KERNEL_RECOV so that the
recovery attempt is done only for them.
Document the whole handling, while at it, as it is not trivial.
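The marking happens when the context is graded, roughly (a sketch of
error_context(); helper names as best recalled):

  static int error_context(struct mce *m)
  {
          if ((m->cs & 3) == 3)
                  return IN_USER;

          if (mc_recoverable(m->mcgstatus) &&
              ex_has_fault_handler((unsigned long)m->ip)) {
                  m->kflags |= MCE_IN_KERNEL_RECOV;   /* explicit marker */
                  return IN_KERNEL_RECOV;
          }

          return IN_KERNEL;
  }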
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200407163414.18058-10-bp@alien8.de
Sometimes, when logs are getting lost, it's nice to just
have everything dumped to the serial console.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-7-tony.luck@intel.com
Instead of keeping count of how many handlers are registered on the
MCE notifier chain and printing if below some magic value, look at
mce->kflags to see if anyone claims to have handled/logged this error.
[ bp: Do not print ->kflags in __print_mce(). ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-6-tony.luck@intel.com
If the handler took any action to log or deal with the error, set a bit
in mce->kflags so that the default handler on the end of the machine
check chain can see what has been done.
Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
skip over errors already processed by CEC.
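So a notifier that logged the record marks it instead of swallowing it,
e.g. (flag name per the respective handler; shown as a sketch):

  /* e.g. in the EDAC decoder notifier */
  mce->kflags |= MCE_HANDLED_EDAC;

  /* let later notifiers (and the default printer) still see the record */
  return NOTIFY_OK;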
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
The CEC code has its claws in a couple of routines in mce/core.c.
Convert it to just register itself on the normal MCE notifier chain.
[ bp: Make cec_add_elem() and cec_init() static. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20200214222720.13168-3-tony.luck@intel.com