linux-stable/Documentation
Daniel Sneddon f826d0412d x86/speculation: Add RSB VM Exit protections
commit 2b12993220 upstream.

tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as
documented for RET instructions after VM exits. Mitigate it with a new
one-entry RSB stuffing mechanism and a new LFENCE.

== Background ==

Indirect Branch Restricted Speculation (IBRS) was designed to help
mitigate Branch Target Injection and Speculative Store Bypass, i.e.
Spectre, attacks. IBRS prevents software run in less privileged modes
from affecting branch prediction in more privileged modes. IBRS requires
the MSR to be written on every privilege level change.

To overcome some of the performance issues of IBRS, Enhanced IBRS was
introduced.  eIBRS is an "always on" IBRS: it is turned on once instead
of writing the MSR on every privilege level change.
When eIBRS is enabled, more privileged modes should be protected from
less privileged modes, including protecting VMMs from guests.
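
As a rough illustration of the cost difference described above, the
following toy model counts MSR writes for the two schemes. It is a sketch
only: struct toy_cpu, toy_wrmsr() and count_spec_ctrl_writes() are
hypothetical names invented for this example, and the MSR is simulated
rather than written. The MSR index (IA32_SPEC_CTRL, 0x48) and the IBRS
bit position (bit 0) are the architectural values.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_SPEC_CTRL 0x48u        /* architectural MSR index */
#define SPEC_CTRL_IBRS     (1ULL << 0)  /* IBRS enable bit */

/* Hypothetical simulated CPU: remembers the MSR value and counts writes. */
struct toy_cpu {
	uint64_t spec_ctrl;
	unsigned int msr_writes;
};

/* Stand-in for WRMSR; only touches the simulated state. */
static void toy_wrmsr(struct toy_cpu *cpu, uint32_t msr, uint64_t val)
{
	if (msr == MSR_IA32_SPEC_CTRL) {
		cpu->spec_ctrl = val;
		cpu->msr_writes++;
	}
}

/*
 * Count SPEC_CTRL writes for a number of user->kernel->user round trips.
 * Legacy IBRS: set on kernel entry, clear on return to user.
 * eIBRS: set once, then left alone.
 */
static unsigned int count_spec_ctrl_writes(bool use_eibrs, unsigned int round_trips)
{
	struct toy_cpu cpu = { 0, 0 };
	unsigned int i;

	if (use_eibrs) {
		toy_wrmsr(&cpu, MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);
		return cpu.msr_writes;
	}

	for (i = 0; i < round_trips; i++) {
		toy_wrmsr(&cpu, MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS); /* kernel entry */
		toy_wrmsr(&cpu, MSR_IA32_SPEC_CTRL, 0);              /* kernel exit */
	}
	return cpu.msr_writes;
}
```

In this model the legacy scheme costs two writes per round trip, while
the eIBRS scheme costs a single write regardless of how many privilege
changes follow.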

== Problem ==

Here's a simplification of how guests are run on Linux's KVM:

void run_kvm_guest(void)
{
	// Prepare to run guest
	VMRESUME();
	// Clean up after guest runs
}

The execution flow for that would look something like this to the
processor:

1. Host-side: call run_kvm_guest()
2. Host-side: VMRESUME
3. Guest runs, does "CALL guest_function"
4. VM exit, host runs again
5. Host might make some "cleanup" function calls
6. Host-side: RET from run_kvm_guest()

Now, when back on the host, there are a couple of possible scenarios of
post-guest activity the host needs to do before executing host code:

* on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not
touched and Linux has to do a 32-entry stuffing.

* on eIBRS hardware, VM exit with IBRS enabled, or restoring the host
IBRS=1 shortly after VM exit, has a documented side effect of flushing
the RSB, except in this PBRSB situation, where software needs to stuff
the last RSB entry "by hand".

IOW, with eIBRS supported, host RET instructions should no longer be
influenced by guest behavior after the host retires a single CALL
instruction.

However, if the RET instructions are "unbalanced" with CALLs after a VM
exit as is the RET in #6, it might speculatively use the address for the
instruction after the CALL in #3 as an RSB prediction. This is a problem
since the (untrusted) guest controls this address.

Balanced CALL/RET instruction pairs such as in step #5 are not affected.
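
The balanced versus unbalanced distinction can be sketched with a toy
model. This is not how the hardware predictor is implemented; it only
encodes the rules stated in this changelog: a guest CALL may leave one
stale entry behind at VM exit, a host RET with no matching host CALL may
consume it, and the first host CALL ends the exposure. All names and
addresses below (struct toy_rsb, GUEST_TARGET, INT3_TRAP, ...) are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GUEST_TARGET  0xbadULL   /* hypothetical guest-controlled address */
#define INT3_TRAP     0x1337ULL  /* hypothetical address of the INT3 after the stuffing CALL */
#define NO_PREDICTION 0ULL
#define TOY_RSB_DEPTH 8

struct toy_rsb {
	uint64_t stale_guest_entry;          /* left by the guest's CALL, 0 if none */
	uint64_t host_stack[TOY_RSB_DEPTH];  /* entries pushed by host CALLs */
	int depth;
};

/* VM exit: host entries are gone, one guest entry may survive. */
static void vm_exit(struct toy_rsb *r, uint64_t guest_ret)
{
	r->depth = 0;
	r->stale_guest_entry = guest_ret;
}

/* Host CALL: pushes a host-controlled entry and ends the exposure. */
static void host_call(struct toy_rsb *r, uint64_t ret_addr)
{
	assert(r->depth < TOY_RSB_DEPTH);
	r->stale_guest_entry = 0;
	r->host_stack[r->depth++] = ret_addr;
}

/* Host RET: returns the address the toy predictor would speculate to. */
static uint64_t host_ret_prediction(struct toy_rsb *r)
{
	uint64_t pred;

	if (r->depth > 0)
		return r->host_stack[--r->depth];  /* balanced: host entry */

	pred = r->stale_guest_entry;               /* unbalanced: stale entry */
	r->stale_guest_entry = 0;
	return pred;
}
```

In this model, vm_exit() followed directly by host_ret_prediction()
yields GUEST_TARGET (the RET in #6), while a host_call() in between
yields the host-pushed address instead (the balanced pairs in #5).
Calling host_call(r, INT3_TRAP) right after vm_exit() models the
one-entry stuffing: the unbalanced RET is then steered to INT3_TRAP.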

== Solution ==

The PBRSB issue affects a wide variety of Intel processors which
support eIBRS. But not all of them need mitigation. Today,
X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates
PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e.,
eIBRS systems which enable legacy IBRS explicitly.

However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT
and most of them need a new mitigation.

Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE
which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT.

The lighter-weight mitigation performs a CALL instruction which is
immediately followed by a speculative execution barrier (INT3). This
steers speculative execution to the barrier -- just like a retpoline
-- which ensures that speculation can never reach an unbalanced RET.
Then, ensure this CALL is retired before continuing execution with an
LFENCE.

In other words, the window of exposure is opened at VM exit where RET
behavior is troublesome. While the window is open, force RSB predictions
sampling for RET targets to a dead end at the INT3. Close the window
with the LFENCE.

There is a subset of eIBRS systems which are not vulnerable to PBRSB.
Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB.
Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO.
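
The selection described in this section can be summarized as a small
decision function. This is a sketch derived only from the text above,
not the code added by the patch: struct toy_cpu_info, its fields and
pick_vmexit_rsb_mitigation() are hypothetical names, with each field
standing in for the flag or whitelist entry named in its comment.

```c
#include <stdbool.h>

enum vmexit_rsb_mitigation {
	VMEXIT_RSB_NONE,  /* nothing extra needed at VM exit */
	VMEXIT_RSB_FULL,  /* existing RSB filling (X86_FEATURE_RSB_VMEXIT) */
	VMEXIT_RSB_LITE,  /* CALL + INT3 + LFENCE (X86_FEATURE_RSB_VMEXIT_LITE) */
};

/* Hypothetical summary of the inputs named in the changelog. */
struct toy_cpu_info {
	bool rsb_vmexit;         /* X86_FEATURE_RSB_VMEXIT already set */
	bool eibrs;              /* X86_FEATURE_IBRS_ENHANCED */
	bool no_eibrs_pbrsb;     /* listed in cpu_vuln_whitelist[] as NO_EIBRS_PBRSB */
	bool arch_cap_pbrsb_no;  /* hardware reports ARCH_CAP_PBRSB_NO */
};

static enum vmexit_rsb_mitigation
pick_vmexit_rsb_mitigation(const struct toy_cpu_info *c)
{
	/* The full filling sequence already mitigates PBRSB. */
	if (c->rsb_vmexit)
		return VMEXIT_RSB_FULL;

	/* eIBRS parts need the lite sequence unless known unaffected. */
	if (c->eibrs && !c->no_eibrs_pbrsb && !c->arch_cap_pbrsb_no)
		return VMEXIT_RSB_LITE;

	return VMEXIT_RSB_NONE;
}
```

Checking rsb_vmexit first reflects the statement that systems setting
RSB_VMEXIT need no further mitigation, so the lighter sequence is only
considered when the full one is not already in use.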

  [ bp: Massage, incorporate review comments from Andrew Cooper. ]

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-11 13:22:05 +02:00
ABI 1st set of IIO fixes for the 5.19 cycle. 2022-06-20 09:49:52 +02:00
accounting delayacct: track delays from write-protect copy 2022-06-01 15:55:25 -07:00
admin-guide x86/speculation: Add RSB VM Exit protections 2022-08-11 13:22:05 +02:00
arc
arm docs: arm: tcm: Fix typo in description of TCM and MMU usage 2022-06-09 12:56:33 -06:00
arm64 arm64/sme: Fix SVE/SME typo in ABI documentation 2022-06-08 18:38:31 +01:00
block
bpf
cdrom
core-api doc: module: update file references 2022-07-01 14:50:01 -07:00
cpu-freq
crypto
dev-tools Yang Shi has improved the behaviour of khugepaged collapsing of readonly 2022-05-26 12:32:41 -07:00
devicetree dt-bindings: bluetooth: broadcom: Add BCM4349B1 DT binding 2022-08-11 13:22:04 +02:00
doc-guide
driver-api A NULL pointer dereference fix for vc4, and 3 patches to improve the 2022-07-01 09:27:55 +10:00
fault-injection
fb
features Documentation/features: Update the arch support status files 2022-06-09 09:35:57 -06:00
filesystems A folio locking fixup that Xiubo and David cooperated on, marked for 2022-07-15 10:27:28 -07:00
firmware-guide TTY / Serial driver changes for 5.19-rc1 2022-06-03 11:08:40 -07:00
firmware_class
fpga
gpu
hid
hwmon
i2c
ia64
iio
images docs: add SVG version of the Linux logo 2022-06-01 09:32:45 -06:00
infiniband
input documentation: Format button_dev as a pointer. 2022-06-01 09:34:28 -06:00
isdn
kbuild Documentation/llvm: Update Supported Arch table 2022-06-20 08:21:29 +09:00
kernel-hacking
leds
litmus-tests
livepatch doc: module: update file references 2022-07-01 14:50:01 -07:00
locking
loongarch docs/LoongArch: Fix notes rendering by using reST directives 2022-06-17 22:09:05 +08:00
m68k
maintainer
mhi
mips
misc-devices
netlabel
networking Documentation: fix sctp_wmem in ip-sysctl.rst 2022-07-24 21:41:58 +01:00
nios2
nvdimm
openrisc
parisc
PCI
pcmcia
peci
power
powerpc
process docs: netdev: add a cheat sheet for the rules 2022-07-04 10:06:50 +01:00
RCU
riscv Documentation: riscv: Add sv48 description to VM layout 2022-06-01 20:38:34 -07:00
s390
scheduler
scsi
security
sh
sound ASoC: doc: Capitalize RESET line name 2022-07-07 17:16:30 +01:00
sparc
sphinx
sphinx-static
spi
staging
target
timers
tools Updates to Real Time Linux Analysis tool for 5.19: 2022-05-29 10:48:58 -07:00
trace tracing/timerlat: Print stacktrace in the IRQ handler if needed 2022-05-26 21:13:00 -04:00
translations doc: module: update file references 2022-07-01 14:50:01 -07:00
usb docs: usb: fix literal block marker in usbmon verification example 2022-06-09 09:50:03 -06:00
userspace-api media: lirc: add missing exceptions for lirc uapi header file 2022-05-26 14:30:17 -07:00
virt KVM: stats: Fix value for KVM_STATS_UNIT_MAX for boolean stats 2022-07-19 08:54:11 -04:00
vm mm/memory-failure: disable unpoison once hw error happens 2022-06-16 19:11:32 -07:00
w1
watchdog
x86
xtensa
.gitignore
arch.rst Documentation: LoongArch: Add basic documentations 2022-06-03 20:09:27 +08:00
asm-annotations.rst
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py docs/conf.py: Cope with removal of language=None in Sphinx 5.0.0 2022-06-01 09:26:05 -06:00
docutils.conf
dontdiff
index.rst docs: Move the HTE documentation to driver-api/ 2022-06-09 10:02:47 -06:00
Kconfig
Makefile
memory-barriers.txt
SubmittingPatches