linux-stable

Commit Graph

Author	SHA1	Message	Date
Linus Torvalds	5e2cb28dd7	configfs-tsm for v6.7 - Introduce configfs-tsm as a shared ABI for confidential computing attestation reports - Convert sev-guest to additionally support configfs-tsm alongside its vendor specific ioctl() - Added signed attestation report retrieval to the tdx-guest driver forgoing a new vendor specific ioctl() - Misc. cleanups and a new __free() annotation for kvfree() -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCZUQhiQAKCRDfioYZHlFs Z2gMAQCJdtP0f2kH+pvf3oxAkA1OubKBqJqWOppeyrhTsNMpDQEA9ljXH9h7eRB/ 2NQ6USrU6jqcdu3gB5Tzq8J/ZZabMQU= =1Eiv -----END PGP SIGNATURE----- Merge tag 'tsm-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux Pull unified attestation reporting from Dan Williams: "In an ideal world there would be a cross-vendor standard attestation report format for confidential guests along with a common device definition to act as the transport. In the real world the situation ended up with multiple platform vendors inventing their own attestation report formats with the SEV-SNP implementation being a first mover to define a custom sev-guest character device and corresponding ioctl(). Later, this configfs-tsm proposal intercepted an attempt to add a tdx-guest character device and a corresponding new ioctl(). It also anticipated ARM and RISC-V showing up with more chardevs and more ioctls(). The proposal takes for granted that Linux tolerates the vendor report format differentiation until a standard arrives. From talking with folks involved, it sounds like that standardization work is unlikely to resolve anytime soon. It also takes the position that kernfs ABIs are easier to maintain than ioctl(). The result is a shared configfs mechanism to return per-vendor report-blobs with the option to later support a standard when that arrives. Part of the goal here also is to get the community into the "uncomfortable, but beneficial to the long term maintainability of the kernel" state of talking to each other about their differentiation and opportunities to collaborate. Think of this like the device-driver equivalent of the common memory-management infrastructure for confidential-computing being built up in KVM. As for establishing an "upstream path for cross-vendor confidential-computing device driver infrastructure" this is something I want to discuss at Plumbers. At present, the multiple vendor proposals for assigning devices to confidential computing VMs likely needs a new dedicated repository and maintainer team, but that is a discussion for v6.8. For now, Greg and Thomas have acked this approach and this is passing is AMD, Intel, and Google tests. Summary: - Introduce configfs-tsm as a shared ABI for confidential computing attestation reports - Convert sev-guest to additionally support configfs-tsm alongside its vendor specific ioctl() - Added signed attestation report retrieval to the tdx-guest driver forgoing a new vendor specific ioctl() - Misc cleanups and a new __free() annotation for kvfree()" * tag 'tsm-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux: virt: tdx-guest: Add Quote generation support using TSM_REPORTS virt: sevguest: Add TSM_REPORTS support for SNP_GET_EXT_REPORT mm/slab: Add __free() support for kvfree virt: sevguest: Prep for kernel internal get_ext_report() configfs-tsm: Introduce a shared ABI for attestation reports virt: coco: Add a coco/Makefile and coco/Kconfig virt: sevguest: Fix passing a stack buffer as a scatterlist target	2023-11-04 15:58:13 -10:00
Linus Torvalds	8999ad99f4	* Refactor and clean up TDX hypercall/module call infrastructure * Handle retrying/resuming page conversion hypercalls * Make sure to use the (shockingly) reliable TSC in TDX guests -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmVBlqMACgkQaDWVMHDJ krBrhBAArKay0MvzmdzS4IQs8JqkmuMEHI6WabYv2POPjJNXrn5MelLH972pLuX9 NJ3+yeOLmNMYwqu5qwLCxyeO5CtqEyT2lNumUrxAtHQG4+oS2RYJYUalxMuoGxt8 fAHxbItFg0TobBSUtwcnN2R2WdXwPuUW0Co+pJfLlZV4umVM7QANO1nf1g8YmlDD sVtpDaeKJRdylmwgWgAyGow0tDKd6oZB9j/vOHvZRrEQ+DMjEtG75fjwbjbu43Cl tI/fbxKjzAkOFcZ7PEPsQ8jE1h9DXU+JzTML9Nu/cPMalxMeBg3Dol/JOEbqgreI 4W8Lg7g071EkUcQDxpwfe4aS6rsfsbwUIV4gJVkg9ZhlT7RayWsFik2CfBpJ4IMQ TM8BxtCEGCz3cxVvg3mstX9rRA7eNlXOzcKE/8Y7cpSsp94bA9jtf2GgUSUoi9St y+fIEei8mgeHutdiFh8psrmR7hp6iX/ldMFqHtjNo6xatf2KjdVHhVSU13Jz544z 43ATNi1gZeHOgfwlAlIxLPDVDJidHuux3f6g2vfMkAqItyEqFauC1HA1pIDgckoY 9FpBPp9vNUToSPp6reB6z/PkEBIrG2XtQh82JLt2CnCb6aTUtnPds+psjtT4sSE/ a9SQvZLWWmpj+BlI2yrtfJzhy7SwhltgdjItQHidmCNEn0PYfTc= =FJ1Y -----END PGP SIGNATURE----- Merge tag 'x86_tdx_for_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 TDX updates from Dave Hansen: "The majority of this is a rework of the assembly and C wrappers that are used to talk to the TDX module and VMM. This is a nice cleanup in general but is also clearing the way for using this code when Linux is the TDX VMM. There are also some tidbits to make TDX guests play nicer with Hyper-V and to take advantage the hardware TSC. Summary: - Refactor and clean up TDX hypercall/module call infrastructure - Handle retrying/resuming page conversion hypercalls - Make sure to use the (shockingly) reliable TSC in TDX guests" [ TLA reminder: TDX is "Trust Domain Extensions", Intel's guest VM confidentiality technology ] * tag 'x86_tdx_for_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tdx: Mark TSC reliable x86/tdx: Fix __noreturn build warning around __tdx_hypercall_failed() x86/virt/tdx: Make TDX_MODULE_CALL handle SEAMCALL #UD and #GP x86/virt/tdx: Wire up basic SEAMCALL functions x86/tdx: Remove 'struct tdx_hypercall_args' x86/tdx: Reimplement __tdx_hypercall() using TDX_MODULE_CALL asm x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL x86/tdx: Extend TDX_MODULE_CALL to support more TDCALL/SEAMCALL leafs x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure x86/tdx: Rename __tdx_module_call() to __tdcall() x86/tdx: Make macros of TDCALLs consistent with the spec x86/tdx: Skip saving output regs when SEAMCALL fails with VMFailInvalid x86/tdx: Zero out the missing RSI in TDX_HYPERCALL macro x86/tdx: Retry partially-completed page conversion hypercalls	2023-11-01 10:28:32 -10:00
Kuppuswamy Sathyanarayanan	f4738f56d1	virt: tdx-guest: Add Quote generation support using TSM_REPORTS In TDX guest, the attestation process is used to verify the TDX guest trustworthiness to other entities before provisioning secrets to the guest. The first step in the attestation process is TDREPORT generation, which involves getting the guest measurement data in the format of TDREPORT, which is further used to validate the authenticity of the TDX guest. TDREPORT by design is integrity-protected and can only be verified on the local machine. To support remote verification of the TDREPORT in a SGX-based attestation, the TDREPORT needs to be sent to the SGX Quoting Enclave (QE) to convert it to a remotely verifiable Quote. SGX QE by design can only run outside of the TDX guest (i.e. in a host process or in a normal VM) and guest can use communication channels like vsock or TCP/IP to send the TDREPORT to the QE. But for security concerns, the TDX guest may not support these communication channels. To handle such cases, TDX defines a GetQuote hypercall which can be used by the guest to request the host VMM to communicate with the SGX QE. More details about GetQuote hypercall can be found in TDX Guest-Host Communication Interface (GHCI) for Intel TDX 1.0, section titled "TDG.VP.VMCALL<GetQuote>". Trusted Security Module (TSM) [1] exposes a common ABI for Confidential Computing Guest platforms to get the measurement data via ConfigFS. Extend the TSM framework and add support to allow an attestation agent to get the TDX Quote data (included usage example below). report=/sys/kernel/config/tsm/report/report0 mkdir $report dd if=/dev/urandom bs=64 count=1 > $report/inblob hexdump -C $report/outblob rmdir $report GetQuote TDVMCALL requires TD guest pass a 4K aligned shared buffer with TDREPORT data as input, which is further used by the VMM to copy the TD Quote result after successful Quote generation. To create the shared buffer, allocate a large enough memory and mark it shared using set_memory_decrypted() in tdx_guest_init(). This buffer will be re-used for GetQuote requests in the TDX TSM handler. Although this method reserves a fixed chunk of memory for GetQuote requests, such one time allocation can help avoid memory fragmentation related allocation failures later in the uptime of the guest. Since the Quote generation process is not time-critical or frequently used, the current version uses a polling model for Quote requests and it also does not support parallel GetQuote requests. Link: https://lore.kernel.org/lkml/169342399185.3934343.3035845348326944519.stgit@dwillia2-xfh.jf.intel.com/ [1] Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Erdem Aktas <erdemaktas@google.com> Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Tested-by: Peter Gonda <pgonda@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2023-10-19 18:12:00 -07:00
Kirill A. Shutemov	9ee4318c15	x86/tdx: Mark TSC reliable In x86 virtualization environments, including TDX, RDTSC instruction is handled without causing a VM exit, resulting in minimal overhead and jitters. On the other hand, other clock sources (such as HPET, ACPI timer, APIC, etc.) necessitate VM exits to implement, resulting in more fluctuating measurements compared to TSC. Thus, those clock sources are not effective for calibrating TSC. As a foundation, the host TSC is guaranteed to be invariant on any system which enumerates TDX support. TDX guests and the TDX module build on that foundation by enforcing: - Virtual TSC is monotonously incrementing for any single VCPU; - Virtual TSC values are consistent among all the TD’s VCPUs at the level supported by the CPU: + VMM is required to set the same TSC_ADJUST; + VMM must not modify from initial value of TSC_ADJUST before SEAMCALL; - The frequency is determined by TD configuration: + Virtual TSC frequency is specified by VMM on TDH.MNG.INIT; + Virtual TSC starts counting from 0 at TDH.MNG.INIT; The result is that a reliable TSC is a TDX architectural guarantee. Use the TSC as the only reliable clock source in TD guests, bypassing unstable calibration. This is similar to what the kernel already does in some VMWare and HyperV environments. [ dhansen: changelog tweaks ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Erdem Aktas <erdemaktas@google.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/20231006144549.2633-1-kirill.shutemov%40linux.intel.com	2023-10-06 10:00:04 -07:00
Justin Stitt	c9babd5d95	x86/tdx: Replace deprecated strncpy() with strtomem_pad() strncpy() works perfectly here in all cases, however, it is deprecated and as such we should prefer more robust and less ambiguous string APIs: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings Let's use strtomem_pad() as this matches the functionality of strncpy() and is _not_ deprecated. Signed-off-by: Justin Stitt <justinstitt@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://github.com/KSPP/linux/issues/90 Link: https://lore.kernel.org/r/20231003-strncpy-arch-x86-coco-tdx-tdx-c-v2-1-0bd21174a217@google.com	2023-10-04 09:34:07 +02:00
Kai Huang	8a8544bde8	x86/tdx: Remove 'struct tdx_hypercall_args' Now 'struct tdx_hypercall_args' is basically 'struct tdx_module_args' minus RCX. Although from __tdx_hypercall()'s perspective RCX isn't used as shared register thus not part of input/output registers, it's not worth to have a separate structure just due to one register. Remove the 'struct tdx_hypercall_args' and use 'struct tdx_module_args' instead in __tdx_hypercall() related code. This also saves the memory copy between the two structures within __tdx_hypercall(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/798dad5ce24e9d745cf0e16825b75ccc433ad065.1692096753.git.kai.huang%40intel.com	2023-09-12 16:30:14 -07:00
Kai Huang	90f5ecd37f	x86/tdx: Reimplement __tdx_hypercall() using TDX_MODULE_CALL asm Now the TDX_HYPERCALL asm is basically identical to the TDX_MODULE_CALL with both '\saved' and '\ret' enabled, with two minor things though: 1) The way to restore the structure pointer is different The TDX_HYPERCALL uses RCX as spare to restore the structure pointer, but the TDX_MODULE_CALL assumes no spare register can be used. In other words, TDX_MODULE_CALL already covers what TDX_HYPERCALL does. 2) TDX_MODULE_CALL only clears shared registers for TDH.VP.ENTER For this just need to make that code available for the non-host case. Thus, remove the TDX_HYPERCALL and reimplement the __tdx_hypercall() using the TDX_MODULE_CALL. Extend the TDX_MODULE_CALL to cover "clear shared registers" for TDG.VP.VMCALL. Introduce a new __tdcall_saved_ret() to replace the temporary __tdcall_hypercall(). The __tdcall_saved_ret() can also be used for those new TDCALLs which require more input/output registers than the basic TDCALLs do. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/e68a2473fb6f5bcd78b078cae7510e9d0753b3df.1692096753.git.kai.huang%40intel.com	2023-09-12 16:30:14 -07:00
Kai Huang	c641cfb5c1	x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL Now the 'struct tdx_hypercall_args' and 'struct tdx_module_args' are almost the same, and the TDX_HYPERCALL and TDX_MODULE_CALL asm macro share similar code pattern too. The __tdx_hypercall() and __tdcall() should be unified to use the same assembly code. As a preparation to unify them, simplify the TDX_HYPERCALL to make it more like the TDX_MODULE_CALL. The TDX_HYPERCALL takes the pointer of 'struct tdx_hypercall_args' as function call argument, and does below extra things comparing to the TDX_MODULE_CALL: 1) It sets RAX to 0 (TDG.VP.VMCALL leaf) internally; 2) It sets RCX to the (fixed) bitmap of shared registers internally; 3) It calls __tdx_hypercall_failed() internally (and panics) when the TDCALL instruction itself fails; 4) After TDCALL, it moves R10 to RAX to return the return code of the VMCALL leaf, regardless the '\ret' asm macro argument; Firstly, change the TDX_HYPERCALL to take the same function call arguments as the TDX_MODULE_CALL does: TDCALL leaf ID, and the pointer to 'struct tdx_module_args'. Then 1) and 2) can be moved to the caller: - TDG.VP.VMCALL leaf ID can be passed via the function call argument; - 'struct tdx_module_args' is 'struct tdx_hypercall_args' + RCX, thus the bitmap of shared registers can be passed via RCX in the structure. Secondly, to move 3) and 4) out of assembly, make the TDX_HYPERCALL always save output registers to the structure. The caller then can: - Call __tdx_hypercall_failed() when TDX_HYPERCALL returns error; - Return R10 in the structure as the return code of the VMCALL leaf; With above changes, change the asm function from __tdx_hypercall() to __tdcall_hypercall(), and reimplement __tdx_hypercall() as the C wrapper of it. This avoids having to add another wrapper of __tdx_hypercall() (_tdx_hypercall() is already taken). The __tdcall_hypercall() will be replaced with a __tdcall() variant using TDX_MODULE_CALL in a later commit as the final goal is to have one assembly to handle both TDCALL and TDVMCALL. Currently, the __tdx_hypercall() asm is in '.noinstr.text'. To keep this unchanged, annotate __tdx_hypercall(), which is a C function now, as 'noinstr'. Remove the __tdx_hypercall_ret() as __tdx_hypercall() already does so. Implement __tdx_hypercall() in tdx-shared.c so it can be shared with the compressed code. Opportunistically fix a checkpatch error complaining using space around parenthesis '(' and ')' while moving the bitmap of shared registers to <asm/shared/tdx.h>. [ dhansen: quash new calls of __tdx_hypercall_ret() that showed up ] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/0cbf25e7aee3256288045023a31f65f0cef90af4.1692096753.git.kai.huang%40intel.com	2023-09-12 16:28:13 -07:00
Kai Huang	12f34ed862	x86/tdx: Extend TDX_MODULE_CALL to support more TDCALL/SEAMCALL leafs The TDX guest live migration support (TDX 1.5) adds new TDCALL/SEAMCALL leaf functions. Those new TDCALLs/SEAMCALLs take additional registers for input (R10-R13) and output (R12-R13). TDG.SERVTD.RD is an example. Also, the current TDX_MODULE_CALL doesn't aim to handle TDH.VP.ENTER SEAMCALL, which monitors the TDG.VP.VMCALL in input/output registers when it returns in case of VMCALL from TDX guest. With those new TDCALLs/SEAMCALLs and the TDH.VP.ENTER covered, the TDX_MODULE_CALL macro basically needs to handle the same input/output registers as the TDX_HYPERCALL does. And as a result, they also share similar logic in the assembly, thus should be unified to use one common assembly. Extend the TDX_MODULE_CALL asm to support the new TDCALLs/SEAMCALLs and also the TDH.VP.ENTER SEAMCALL. Eventually it will be unified with the TDX_HYPERCALL. The new input/output registers fit with the "callee-saved" registers in the x86 calling convention. Add a new "saved" parameter to support those new TDCALLs/SEAMCALLs and TDH.VP.ENTER and keep the existing TDCALLs/SEAMCALLs minimally impacted. For TDH.VP.ENTER, after it returns the registers shared by the guest contain guest's values. Explicitly clear them to prevent speculative use of guest's values. Note most TDX live migration related SEAMCALLs may also clobber AVX* state ("AVX, AVX2 and AVX512 state: may be reset to the architectural INIT state" -- see TDH.EXPORT.MEM for example). And TDH.VP.ENTER also clobbers XMM0-XMM15 when the corresponding bit is set in RCX. Don't handle them in the TDX_MODULE_CALL macro but let the caller save and restore when needed. This is basically based on Peter's code. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/d4785de7c392f7c5684407f6c24a73b92148ec49.1692096753.git.kai.huang%40intel.com	2023-09-11 16:33:51 -07:00
Kai Huang	57a420bb81	x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure Currently, the TDX_MODULE_CALL asm macro, which handles both TDCALL and SEAMCALL, takes one parameter for each input register and an optional 'struct tdx_module_output' (a collection of output registers) as output. This is different from the TDX_HYPERCALL macro which uses a single 'struct tdx_hypercall_args' to carry all input/output registers. The newer TDX versions introduce more TDCALLs/SEAMCALLs which use more input/output registers. Also, the TDH.VP.ENTER (which isn't covered by the current TDX_MODULE_CALL macro) basically can use all registers that the TDX_HYPERCALL does. The current TDX_MODULE_CALL macro isn't extendible to cover those cases. Similar to the TDX_HYPERCALL macro, simplify the TDX_MODULE_CALL macro to use a single structure 'struct tdx_module_args' to carry all the input/output registers. Currently, R10/R11 are only used as output register but not as input by any TDCALL/SEAMCALL. Change to also use R10/R11 as input register to make input/output registers symmetric. Currently, the TDX_MODULE_CALL macro depends on the caller to pass a non-NULL 'struct tdx_module_output' to get additional output registers. Similar to the TDX_HYPERCALL macro, change the TDX_MODULE_CALL macro to take a new 'ret' macro argument to indicate whether to save the output registers to the 'struct tdx_module_args'. Also introduce a new __tdcall_ret() for that purpose, similar to the __tdx_hypercall_ret(). Note the tdcall(), which is a wrapper of __tdcall(), is called by three callers: tdx_parse_tdinfo(), tdx_get_ve_info() and tdx_early_init(). The former two need the additional output but the last one doesn't. For simplicity, make tdcall() always call __tdcall_ret() to avoid another "_ret()" wrapper. The last caller tdx_early_init() isn't performance critical anyway. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/483616c1762d85eb3a3c3035a7de061cfacf2f14.1692096753.git.kai.huang%40intel.com	2023-09-11 16:33:38 -07:00
Kai Huang	5efb96289e	x86/tdx: Rename __tdx_module_call() to __tdcall() __tdx_module_call() is only used by the TDX guest to issue TDCALL to the TDX module. Rename it to __tdcall() to match its behaviour, e.g., it cannot be used to make host-side SEAMCALL. Also rename tdx_module_call() which is a wrapper of __tdx_module_call() to tdcall(). No functional change intended. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/785d20d99fbcd0db8262c94da6423375422d8c75.1692096753.git.kai.huang%40intel.com	2023-09-11 16:33:32 -07:00
Kai Huang	f0024dbfc4	x86/tdx: Make macros of TDCALLs consistent with the spec The TDX spec names all TDCALLs with prefix "TDG". Currently, the kernel doesn't follow such convention for the macros of those TDCALLs but uses prefix "TDX_" for all of them. Although it's arguable whether the TDX spec names those TDCALLs properly, it's better for the kernel to follow the spec when naming those macros. Change all macros of TDCALLs to make them consistent with the spec. As a bonus, they get distinguished easily from the host-side SEAMCALLs, which all have prefix "TDH". No functional change intended. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/516dccd0bd8fb9a0b6af30d25bb2d971aa03d598.1692096753.git.kai.huang%40intel.com	2023-09-11 16:33:16 -07:00
Kai Huang	03a423d40c	x86/tdx: Skip saving output regs when SEAMCALL fails with VMFailInvalid If SEAMCALL fails with VMFailInvalid, the SEAM software (e.g., the TDX module) won't have chance to set any output register. Skip saving the output registers to the structure in this case. Also, as '.Lno_output_struct' is the very last symbol before RET, rename it to '.Lout' to make it short. Opportunistically make the asm directives unindented. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/704088f5b4d72c7e24084f7f15bd1ac5005b7213.1692096753.git.kai.huang%40intel.com	2023-09-11 16:32:23 -07:00
Kai Huang	5d092b6611	x86/tdx: Zero out the missing RSI in TDX_HYPERCALL macro In the TDX_HYPERCALL asm, after the TDCALL instruction returns from the untrusted VMM, the registers that the TDX guest shares to the VMM need to be cleared to avoid speculative execution of VMM-provided values. RSI is specified in the bitmap of those registers, but it is missing when zeroing out those registers in the current TDX_HYPERCALL. It was there when it was originally added in commit `752d13305c` ("x86/tdx: Expand __tdx_hypercall() to handle more arguments"), but was later removed in commit `1e70c68037` ("x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall()"), which was correct because %rsi is later restored in the "pop %rsi". However a later commit `7a3a401874` ("x86/tdx: Drop flags from __tdx_hypercall()") removed that "pop %rsi" but forgot to add the "xor %rsi, %rsi" back. Fix by adding it back. Fixes: `7a3a401874` ("x86/tdx: Drop flags from __tdx_hypercall()") Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/e7d1157074a0b45d34564d5f17f3e0ffee8115e9.1692096753.git.kai.huang%40intel.com	2023-09-11 16:31:52 -07:00
Dexuan Cui	019b383d11	x86/tdx: Retry partially-completed page conversion hypercalls TDX guest memory is private by default and the VMM may not access it. However, in cases where the guest needs to share data with the VMM, the guest and the VMM can coordinate to make memory shared between them. The guest side of this protocol includes the "MapGPA" hypercall. This call takes a guest physical address range. The hypercall spec (aka. the GHCI) says that the MapGPA call is allowed to return partial progress in mapping this range and indicate that fact with a special error code. A guest that sees such partial progress is expected to retry the operation for the portion of the address range that was not completed. Hyper-V does this partial completion dance when set_memory_decrypted() is called to "decrypt" swiotlb bounce buffers that can be up to 1GB in size. It is evidently the only VMM that does this, which is why nobody noticed this until now. [ dhansen: rewrite changelog ] Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20230811021246.821-2-decui%40microsoft.com	2023-09-11 16:19:33 -07:00
Linus Torvalds	12dc010071	- Some SEV and CC platform helpers cleanup and simplifications now that the usage patterns are becoming apparent -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSa9CoACgkQEsHwGGHe VUr1NhAAjmOq/T41u3FCSU6fZ8gXo5UkIUT13a6cx+6Omx9waJn5G0xdf/380vQN RPRTcc4cfQHCdnIeHgiz1YtCh1ljxXswOSbexHgWHjEcqadgxkZTlKaBEbqyEwLI lnRRsowfk7J/8RsYqtzuBvGaWNliiszWE8iayruI1IL+FoEDLlLx1GNYqusP5WIs 0KYm919Zozl8FEZjP47nH4bab1RcE+HGmLG7UEBmR0zHl4cc7iN3wpv2o/vDxVzR /KP8a2G7J/xjllGW+OP81dFCS7iklHpNuaxQS73fDIL7ll2VDqNemh4ivykCrplo 93twODBwKboKmZhnKc0M2axm5JGGx7IC3KTqEUHzb2Wo4bZCYnrj+9Utzxsa3FxB m0BSUcmBqzZCsHCbu62N66l1NlB32EnMO80/45NrgGi62YiGP8qQNhy3TiceUNle NHFkQmRZwLyW5YC1ntSK8fwSu4GrMG1MG/eRfMPDmmsYogiZUm2KIj7XKy3dKXR6 maqifh/raPk3rL+7cl9BQleCDLjnkHNxFxFa329P9K4wrtqn7Ley4izWYAUYhNrl VOjLs+thTwEmPPgpo/K1wAbR/PjLdmSQ/fQR6w7eUrzNVnm5ndXzhTUcIawycG3T haVry+xPMIRlLCIg+dkrwwNcTW1y/X3K4SmgnjLLF57lZFn9UJc= =Aj6p -----END PGP SIGNATURE----- Merge tag 'x86_sev_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Some SEV and CC platform helpers cleanup and simplifications now that the usage patterns are becoming apparent [ I'm sure I'm the only one that has gets confused by all the TLAs, but in case there are others: here SEV is AMD's "Secure Encrypted Virtualization" and CC is generic "Confidential Computing". There's also Intel SGX (Software Guard Extensions) and TDX (Trust Domain Extensions), along with all the vendor memory encryption extensions (SME, TSME, TME, and WTF). And then we have arm64 with RMA and CCA, and I probably forgot another dozen or so related acronyms - Linus ] * tag 'x86_sev_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/coco: Get rid of accessor functions x86/sev: Get rid of special sev_es_enable_key x86/coco: Mark cc_platform_has() and descendants noinstr	2023-06-27 13:26:30 -07:00
Linus Torvalds	5dfe7a7e52	- Fix a race window where load_unaligned_zeropad() could cause a fatal shutdown during TDX private<=>shared conversion - Annotate sites where VM "exit reasons" are reused as hypercall numbers. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmSZ50cACgkQaDWVMHDJ krAPEg//bAR0SzrjIir2eiQ7p3ktr4L7iae0odzFkW/XHam5ZJP+v9cCMLzY6zNO 44x9Z85jZ9w34GcZ4D7D0OmTHbcDpcxckPXSFco/dK4IyYeLzUImYXEYo41YJEx9 O4sSQMBqIyjMXej/oKBhgKHSWaV60XimvQvTvhpjXGD/45bt9sx4ZNVTi8+xVMbw jpktMsQcsjHctcIY2D2eUR61Ma/Vg9t6Qih51YMtbq6Nqcyhw2IKvDwIx3kQuW34 qSW7wsyn+RfHQDpwjPgDG/6OE815Pbtzlxz+y6tB9pN88IWkA1H5Jh2CQRlMBud2 2nVQRpqPgr9uOIeNnNI7FFd1LgTIc/v7lDPfpUH9KelOs7cGWvaRymkuhPSvWxRI tmjlMdFq8XcjrOPieA9WpxYKXinqj4wNXtnYGyaM+Ur/P3qWaj18PMCYMbeN6pJC eNYEJVk2Mt8GmiPL55aYG5+Z1F8sciLKbz8TFq5ya2z0EnSbyVvR+DReqd7zRzh6 Bmbmx9isAzN6wWNszNt7f8XSgRPV2Ri1tvb1vixk3JLxyx2iVCUL6KJ0cZOUNy0x nQqy7/zMtBsFGZ/Ca8f2kpVaGgxkUFy7n1rI4psXTGBOVlnJyMz3WSi9N8F/uJg0 Ca5W4493+txdyHSAmWBQAQuZp3RJOlhTkXe5dfjukv6Rnnw1MZU= =Hr3D -----END PGP SIGNATURE----- Merge tag 'x86_tdx_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 tdx updates from Dave Hansen: - Fix a race window where load_unaligned_zeropad() could cause a fatal shutdown during TDX private<=>shared conversion The race has never been observed in practice but might allow load_unaligned_zeropad() to catch a TDX page in the middle of its conversion process which would lead to a fatal and unrecoverable guest shutdown. - Annotate sites where VM "exit reasons" are reused as hypercall numbers. * tag 'x86_tdx_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Fix enc_status_change_finish_noop() x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad() x86/mm: Allow guest.enc_status_change_prepare() to fail x86/tdx: Wrap exit reason with hcall_func()	2023-06-26 16:32:47 -07:00
Linus Torvalds	2c96136a3f	- Add support for unaccepted memory as specified in the UEFI spec v2.9. The gist of it all is that Intel TDX and AMD SEV-SNP confidential computing guests define the notion of accepting memory before using it and thus preventing a whole set of attacks against such guests like memory replay and the like. There are a couple of strategies of how memory should be accepted - the current implementation does an on-demand way of accepting. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZ0f4ACgkQEsHwGGHe VUpasw//RKoNW9HSU1csY+XnG9uuaT6QKgji+gIEZWWIGPO9iibvbBj6P5WxJE8T fe7yb6CGa6d6thoU0v+mQGVVvCd7OjCFwPD5wAo4mXToD7Ig+4mI6jMkaKifqa2f N1Uuy8u/zQnGyWrP5Y//WH5bJYfsmds4UGwXI2nLvKlhE7MG90/ePjt7iqnnwZsy waLp6a0Q1VeOvnfRszFLHZw/SoER5RSJ4qeVqttkFNmPPEKMK1Kirrl2poR56OQJ nMr6LqVtD7erlSJ36VRXOKzLI443A4iIEIg/wBjIOU6L5ZEWJGNqtCDnIqFJ6+TM XatsejfRYkkMZH0qXtX9+M0u+HJHbZPCH5rEcA21P3Nbd7od/ANq91qCGoMjtUZ4 7pZohMG8M6IDvkLiOb8fQVkR5k/9Jbk8UvdN/8jdPx1ERxYMFO3BDvJpV2gzrW4B KYtFTPR7j2nY3eKfDpe3flanqYzKUBsKoTlLnlH7UHaiMZ2idwG8AQjlrhC/erCq /Lq1LXt4Mq46FyHABc+PSHytu0WWj1nBUftRt+lviY/Uv7TlkBldOTT7wm7itsfF HUCTfLWl0CJXKPq8rbbZhAG/exN6Ay6MO3E3OcNq8A72E5y4cXenuG3ic/0tUuOu FfjpiMk35qE2Qb4hnj1YtF3XINtd1MpKcuwzGSzEdv9s3J7hrS0= =FS95 -----END PGP SIGNATURE----- Merge tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 confidential computing update from Borislav Petkov: - Add support for unaccepted memory as specified in the UEFI spec v2.9. The gist of it all is that Intel TDX and AMD SEV-SNP confidential computing guests define the notion of accepting memory before using it and thus preventing a whole set of attacks against such guests like memory replay and the like. There are a couple of strategies of how memory should be accepted - the current implementation does an on-demand way of accepting. * tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: virt: sevguest: Add CONFIG_CRYPTO dependency x86/efi: Safely enable unaccepted memory in UEFI x86/sev: Add SNP-specific unaccepted memory support x86/sev: Use large PSC requests if applicable x86/sev: Allow for use of the early boot GHCB for PSC requests x86/sev: Put PSC struct on the stack in prep for unaccepted memory support x86/sev: Fix calculation of end address based on number of pages x86/tdx: Add unaccepted memory support x86/tdx: Refactor try_accept_one() x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory efi: Add unaccepted memory support x86/boot/compressed: Handle unaccepted memory efi/libstub: Implement support for unaccepted memory efi/x86: Get full memory map in allocate_e820() mm: Add support for unaccepted memory	2023-06-26 15:32:39 -07:00
Kirill A. Shutemov	195edce08b	x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad() tl;dr: There is a race in the TDX private<=>shared conversion code which could kill the TDX guest. Fix it by changing conversion ordering to eliminate the window. TDX hardware maintains metadata to track which pages are private and shared. Additionally, TDX guests use the guest x86 page tables to specify whether a given mapping is intended to be private or shared. Bad things happen when the intent and metadata do not match. So there are two thing in play: 1. "the page" -- the physical TDX page metadata 2. "the mapping" -- the guest-controlled x86 page table intent For instance, an unrecoverable exit to VMM occurs if a guest touches a private mapping that points to a shared physical page. In summary: * Private mapping => Private Page == OK (obviously) * Shared mapping => Shared Page == OK (obviously) * Private mapping => Shared Page == BIG BOOM! * Shared mapping => Private Page == OK-ish (It will read generate a recoverable #VE via handle_mmio()) Enter load_unaligned_zeropad(). It can touch memory that is adjacent but otherwise unrelated to the memory it needs to touch. It will cause one of those unrecoverable exits (aka. BIG BOOM) if it blunders into a shared mapping pointing to a private page. This is a problem when __set_memory_enc_pgtable() converts pages from shared to private. It first changes the mapping and second modifies the TDX page metadata. It's moving from: * Shared mapping => Shared Page == OK to: * Private mapping => Shared Page == BIG BOOM! This means that there is a window with a shared mapping pointing to a private page where load_unaligned_zeropad() can strike. Add a TDX handler for guest.enc_status_change_prepare(). This converts the page from shared to private before the page becomes private. This ensures that there is never a private mapping to a shared page. Leave a guest.enc_status_change_finish() in place but only use it for private=>shared conversions. This will delay updating the TDX metadata marking the page private until after the mapping matches the metadata. This also ensures that there is never a private mapping to a shared page. [ dhansen: rewrite changelog ] Fixes: `7dbde76316` ("x86/mm/cpa: Add support for TDX shared memory") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20230606095622.1939-3-kirill.shutemov%40linux.intel.com	2023-06-06 16:24:02 -07:00
Kirill A. Shutemov	75d090fd16	x86/tdx: Add unaccepted memory support Hookup TDX-specific code to accept memory. Accepting the memory is done with ACCEPT_PAGE module call on every page in the range. MAP_GPA hypercall is not required as the unaccepted memory is considered private already. Extract the part of tdx_enc_status_changed() that does memory acceptance in a new helper. Move the helper tdx-shared.c. It is going to be used by both main kernel and decompressor. [ bp: Fix the INTEL_TDX_GUEST=y, KVM_GUEST=n build. ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230606142637.5171-10-kirill.shutemov@linux.intel.com	2023-06-06 18:25:57 +02:00
Kirill A. Shutemov	c2b353ae24	x86/tdx: Refactor try_accept_one() Rework try_accept_one() to return accepted size instead of modifying 'start' inside the helper. It makes 'start' in-only argument and streamlines code on the caller side. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230606142637.5171-9-kirill.shutemov@linux.intel.com	2023-06-06 17:38:50 +02:00
Kirill A. Shutemov	ff40b5769a	x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub Memory acceptance requires a hypercall and one or multiple module calls. Make helpers for the calls available in boot stub. It has to accept memory where kernel image and initrd are placed. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230606142637.5171-8-kirill.shutemov@linux.intel.com	2023-06-06 17:31:23 +02:00
Thomas Gleixner	ff3cfcb0d4	x86/smpboot: Fix the parallel bringup decision The decision to allow parallel bringup of secondary CPUs checks CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use parallel bootup because accessing the local APIC is intercepted and raises a #VC or #VE, which cannot be handled at that point. The check works correctly, but only for AMD encrypted guests. TDX does not set that flag. As there is no real connection between CC attributes and the inability to support parallel bringup, replace this with a generic control flag in x86_cpuinit and let SEV-ES and TDX init code disable it. Fixes: `0c7ffa32db` ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it") Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/87ilc9gd2d.ffs@tglx	2023-05-31 16:49:34 +02:00
Nikolay Borisov	122333d6bd	x86/tdx: Wrap exit reason with hcall_func() TDX reuses VMEXIT "reasons" in its guest->host hypercall ABI. This is confusing because there might not be a VMEXIT involved at all. These instances are supposed to document situation and reduce confusion by wrapping VMEXIT reasons with hcall_func(). The decompression code does not follow this convention. Unify the TDX decompression code with the other TDX use of VMEXIT reasons. No functional changes. Signed-off-by: Nikolay Borisov <nik.borisov@suse.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20230505120332.1429957-1-nik.borisov%40suse.com	2023-05-23 07:01:45 -07:00
Borislav Petkov (AMD)	da86eb9611	x86/coco: Get rid of accessor functions cc_vendor is __ro_after_init and thus can be used directly. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230508121957.32341-1-bp@alien8.de	2023-05-09 12:53:16 +02:00
Borislav Petkov (AMD)	1eaf282e2c	x86/coco: Mark cc_platform_has() and descendants noinstr Those will be used in code regions where instrumentation is not allowed so mark them as such. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230328201712.25852-2-bp@alien8.de	2023-05-08 11:39:35 +02:00
Linus Torvalds	7b664cc38e	* Do conditional __tdx_hypercall() 'output' processing via an assembly macro argument rather than a runtime register. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRKvncACgkQaDWVMHDJ krDQsA/+OlDCexITqWgmd3rPN2ZUuyB+aV4MfoKppQuAuWD4I7vxD5qeqjRS2XTh E6SSzp43zEVhVo6Kv3UvPR/Tr9edUGn2KzIWmqd1bOwhgbEfd898gzbWuRmK6i8t qqweR1RMAL/COgPAlcrdpTLl2PCc9tLYpDnQ8WcAUqH4uoePpQyN3Za0J/dcKX7l 8XexOAaco4Wz3ylD9npPcLo9ytvohg+exJtCNldN1l2j5xXdA2fTqEJYaUMp/+Nd Z1TTQ43QcT7dRknFojxdYfAkCqBfr8ccBAwV1mriahKWY/3xl35BqSeJVlma1tkm UzkTY1CFwKYRk24C/oQK7OQMYnyJ7Q1RhSrd91lQWVjaTcI/3DPUKiKKdwFXDv4C FUYvuJkanPVk3PyCZRvltdNvsXsifzx0RKZWLZ+3TQ2jtaMEDOzPgChq7a6WfpkQ HQPuVoENHvyHdUycQhtELUsaJ3AdnOM87XiQDcbNNiaPiOLB9C8dhSWMKoPsMehO oAiUQ7lW6po0lcELVSKib2ASVpXhOmlAxdRyZ50mhjrbpcxfBBGD3+KdFqZ4Gs1c 8UyrQbjVq07Lx2fvdizvDpIcr4M7z0xBAhJeIegC6z86XpJq5uvin+vOLzFAfe16 WGy6FiZtVpXp4fyqUY7GgQNqhk1b8h6EHKd9d/zCSPuH8/wT/6g= =hvDm -----END PGP SIGNATURE----- Merge tag 'x86_tdx_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 tdx update from Dave Hansen: "The original tdx hypercall assembly code took two flags in %RSI to tweak its behavior at runtime. PeterZ recently axed one flag in commit `e80a48bade` ("x86/tdx: Remove TDX_HCALL_ISSUE_STI"). Kill the other flag too and tweak the 'output' mode with an assembly macro instead. This results in elimination of one push/pop pair and overall easier to read assembly. - Do conditional __tdx_hypercall() 'output' processing via an assembly macro argument rather than a runtime register" * tag 'x86_tdx_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tdx: Drop flags from __tdx_hypercall()	2023-04-28 09:36:09 -07:00
Linus Torvalds	de10553fce	x86 APIC updates: - Fix the incorrect handling of atomic offset updates in reserve_eilvt_offset() The check for the return value of atomic_cmpxchg() is not compared against the old value, it is compared against the new value, which makes it two round on success. Convert it to atomic_try_cmpxchg() which does the right thing. - Handle IO/APIC less systems correctly When IO/APIC is not advertised by ACPI then the computation of the lower bound for dynamically allocated interrupts like MSI goes wrong. This lower bound is used to exclude the IO/APIC legacy GSI space as that must stay reserved for the legacy interrupts. In case that the system, e.g. VM, does not advertise an IO/APIC the lower bound stays at 0. 0 is an invalid interrupt number except for the legacy timer interrupt on x86. The return value is unchecked in the core code, so it ends up to allocate interrupt number 0 which is subsequently considered to be invalid by the caller, e.g. the MSI allocation code. A similar problem was already cured for device tree based systems years ago, but that missed - or did not envision - the zero IO/APIC case. Consolidate the zero check and return the provided "from" argument to the core code call site, which is guaranteed to be greater than 0. - Simplify the X2APIC cluster CPU mask logic for CPU hotplug Per cluster CPU masks are required for X2APIC in cluster mode to determine the correct cluster for a target CPU when calculating the destination for IPIs These masks are established when CPUs are borught up. The first CPU in a cluster must allocate a new cluster CPU mask. As this happens during the early startup of a CPU, where memory allocations cannot be done, the mask has to be allocated by the control CPU. The current implementation allocates a clustermask just in case and if the to be brought up CPU is the first in a cluster the CPU takes over this allocation from a global pointer. This works nicely in the fully serialized CPU bringup scenario which is used today, but would fail completely for parallel bringup of CPUs. The cluster association of a CPU can be computed from the APIC ID which is enumerated by ACPI/MADT. So the cluster CPU masks can be preallocated and associated upfront and the upcoming CPUs just need to set their corresponding bit. Aside of preparing for parallel bringup this is a valuable simplification on its own. - Remove global variables which control the early startup of secondary CPUs on 64-bit The only information which is needed by a starting CPU is the Linux CPU number. The CPU number allows it to retrieve the rest of the required data from already existing per CPU storage. So instead of initial_stack, early_gdt_desciptor and initial_gs provide a new variable smpboot_control which contains the Linux CPU number for now. The starting CPU can retrieve and compute all required information for startup from there. Aside of being a cleanup, this is also preparing for parallel CPU bringup, where starting CPUs will look up their Linux CPU number via the APIC ID, when smpboot_control has the corresponding control bit set. - Make cc_vendor globally accesible Subsequent parallel bringup changes require access to cc_vendor because confidental computing platforms need special treatment in the early startup phase vs. CPUID and APCI ID readouts. The change makes cc_vendor global and provides stub accessors in case that CONFIG_ARCH_HAS_CC_PLATFORM is not set. This was merged from the x86/cc branch in anticipation of further parallel bringup commits which require access to cc_vendor. Due to late discoveries of fundamental issue with those patches these commits never happened. The merge commit is unfortunately in the middle of the APIC commits so unraveling it would have required a rebase or revert. As the parallel bringup seems to be well on its way for 6.5 this would be just pointless churn. As the commit does not contain any functional change it's not a risk to keep it. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRGuAwTHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoRzSEADEx1sVkd2yrLcTYdpjdKbbUaDJ6lR0 DXxIP3+ApGHmV9l9yIh+/5C2oEJsiUfFf1vdh6ajv5iXpksCKzcUzkW5g3w7nM36 CSpULpFjwvaq8TIo0o1PIhAbo/yIMMzJVDs8R0reCnWgGAWZoW/a9Ndcvcicd0an pQAlkw3FD5r92mcMlKPNWFoui1AkScGEV02zJ7884MAukmBZwD8Jd+gE6eQC9GKa 9hyJiB77st1URl+a0cPsPYvv8RLVuVcljWsh2edyvxgovIO56+BoEjbrgRSF6cqQ Bhzo//3KgbUJ1y+YqH01aKZzY0hRpbAi2Rew4RBKcBKwCGd2qltUQG0LFNxAtV83 RsC573wSCGSCGO5Xb1RVXih5is+9YqMqitJNWvEc15jjOA9nwoLc80axP11v42f9 Xl4iGHQTWVGdxT4H22NH7UCuRlGg38vAx+In2HGpN/e57q2ighESjiGuqQAQpLel pbOeJtQ/D2xXVKcCap4T/P/2x5ls7bsc76MWJBMcYC3pRgJ5M7ZHw7wTw0IAty4x xCfR1bsRVEAhrE9r/odgNipXjBJu+CdGBAupNEIiRyq1QiwUKtMTayasRGUlbYO6 vrieHKqoflzRVg2M9Bgm3oI28X27FzZHWAZJW2oJ2Wnn2jL5kuRJa1nEykqo8pEP j6rjnScRVvdpIw== =IQWG -----END PGP SIGNATURE----- Merge tag 'x86-apic-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 APIC updates from Thomas Gleixner: - Fix the incorrect handling of atomic offset updates in reserve_eilvt_offset() The check for the return value of atomic_cmpxchg() is not compared against the old value, it is compared against the new value, which makes it two round on success. Convert it to atomic_try_cmpxchg() which does the right thing. - Handle IO/APIC less systems correctly When IO/APIC is not advertised by ACPI then the computation of the lower bound for dynamically allocated interrupts like MSI goes wrong. This lower bound is used to exclude the IO/APIC legacy GSI space as that must stay reserved for the legacy interrupts. In case that the system, e.g. VM, does not advertise an IO/APIC the lower bound stays at 0. 0 is an invalid interrupt number except for the legacy timer interrupt on x86. The return value is unchecked in the core code, so it ends up to allocate interrupt number 0 which is subsequently considered to be invalid by the caller, e.g. the MSI allocation code. A similar problem was already cured for device tree based systems years ago, but that missed - or did not envision - the zero IO/APIC case. Consolidate the zero check and return the provided "from" argument to the core code call site, which is guaranteed to be greater than 0. - Simplify the X2APIC cluster CPU mask logic for CPU hotplug Per cluster CPU masks are required for X2APIC in cluster mode to determine the correct cluster for a target CPU when calculating the destination for IPIs These masks are established when CPUs are borught up. The first CPU in a cluster must allocate a new cluster CPU mask. As this happens during the early startup of a CPU, where memory allocations cannot be done, the mask has to be allocated by the control CPU. The current implementation allocates a clustermask just in case and if the to be brought up CPU is the first in a cluster the CPU takes over this allocation from a global pointer. This works nicely in the fully serialized CPU bringup scenario which is used today, but would fail completely for parallel bringup of CPUs. The cluster association of a CPU can be computed from the APIC ID which is enumerated by ACPI/MADT. So the cluster CPU masks can be preallocated and associated upfront and the upcoming CPUs just need to set their corresponding bit. Aside of preparing for parallel bringup this is a valuable simplification on its own. - Remove global variables which control the early startup of secondary CPUs on 64-bit The only information which is needed by a starting CPU is the Linux CPU number. The CPU number allows it to retrieve the rest of the required data from already existing per CPU storage. So instead of initial_stack, early_gdt_desciptor and initial_gs provide a new variable smpboot_control which contains the Linux CPU number for now. The starting CPU can retrieve and compute all required information for startup from there. Aside of being a cleanup, this is also preparing for parallel CPU bringup, where starting CPUs will look up their Linux CPU number via the APIC ID, when smpboot_control has the corresponding control bit set. - Make cc_vendor globally accesible Subsequent parallel bringup changes require access to cc_vendor because confidental computing platforms need special treatment in the early startup phase vs. CPUID and APCI ID readouts. The change makes cc_vendor global and provides stub accessors in case that CONFIG_ARCH_HAS_CC_PLATFORM is not set. This was merged from the x86/cc branch in anticipation of further parallel bringup commits which require access to cc_vendor. Due to late discoveries of fundamental issue with those patches these commits never happened. The merge commit is unfortunately in the middle of the APIC commits so unraveling it would have required a rebase or revert. As the parallel bringup seems to be well on its way for 6.5 this would be just pointless churn. As the commit does not contain any functional change it's not a risk to keep it. * tag 'x86-apic-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ioapic: Don't return 0 from arch_dynirq_lower_bound() x86/apic: Fix atomic update of offset in reserve_eilvt_offset() x86/coco: Export cc_vendor x86/smpboot: Reference count on smpboot_setup_warm_reset_vector() x86/smpboot: Remove initial_gs x86/smpboot: Remove early_gdt_descr on 64-bit x86/smpboot: Remove initial_stack on 64-bit x86/apic/x2apic: Allow CPU cluster_mask to be populated in parallel	2023-04-25 11:39:45 -07:00
Borislav Petkov (AMD)	3d91c53729	x86/coco: Export cc_vendor It will be used in different checks in future changes. Export it directly and provide accessor functions and stubs so this can be used in general code when CONFIG_ARCH_HAS_CC_PLATFORM is not set. No functional changes. [ tglx: Add accessor functions ] Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230318115634.9392-2-bp@alien8.de	2023-03-30 14:06:28 +02:00
Michael Kelley	812b0597fb	x86/hyperv: Change vTOM handling to use standard coco mechanisms Hyper-V guests on AMD SEV-SNP hardware have the option of using the "virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP architecture. With vTOM, shared vs. private memory accesses are controlled by splitting the guest physical address space into two halves. vTOM is the dividing line where the uppermost bit of the physical address space is set; e.g., with 47 bits of guest physical address space, vTOM is 0x400000000000 (bit 46 is set). Guest physical memory is accessible at two parallel physical addresses -- one below vTOM and one above vTOM. Accesses below vTOM are private (encrypted) while accesses above vTOM are shared (decrypted). In this sense, vTOM is like the GPA.SHARED bit in Intel TDX. Support for Hyper-V guests using vTOM was added to the Linux kernel in two patch sets[1][2]. This support treats the vTOM bit as part of the physical address. For accessing shared (decrypted) memory, these patch sets create a second kernel virtual mapping that maps to physical addresses above vTOM. A better approach is to treat the vTOM bit as a protection flag, not as part of the physical address. This new approach is like the approach for the GPA.SHARED bit in Intel TDX. Rather than creating a second kernel virtual mapping, the existing mapping is updated using recently added coco mechanisms. When memory is changed between private and shared using set_memory_decrypted() and set_memory_encrypted(), the PTEs for the existing kernel mapping are changed to add or remove the vTOM bit in the guest physical address, just as with TDX. The hypercalls to change the memory status on the host side are made using the existing callback mechanism. Everything just works, with a minor tweak to map the IO-APIC to use private accesses. To accomplish the switch in approach, the following must be done: * Update Hyper-V initialization to set the cc_mask based on vTOM and do other coco initialization. * Update physical_mask so the vTOM bit is no longer treated as part of the physical address * Remove CC_VENDOR_HYPERV and merge the associated vTOM functionality under CC_VENDOR_AMD. Update cc_mkenc() and cc_mkdec() to set/clear the vTOM bit as a protection flag. * Code already exists to make hypercalls to inform Hyper-V about pages changing between shared and private. Update this code to run as a callback from __set_memory_enc_pgtable(). * Remove the Hyper-V special case from __set_memory_enc_dec() * Remove the Hyper-V specific call to swiotlb_update_mem_attributes() since mem_encrypt_init() will now do it. * Add a Hyper-V specific implementation of the is_private_mmio() callback that returns true for the IO-APIC and vTPM MMIO addresses [1] https://lore.kernel.org/all/20211025122116.264793-1-ltykernel@gmail.com/ [2] https://lore.kernel.org/all/20211213071407.314309-1-ltykernel@gmail.com/ [ bp: Touchups. ] Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/1679838727-87310-7-git-send-email-mikelley@microsoft.com	2023-03-27 09:31:43 +02:00
Kirill A. Shutemov	7a3a401874	x86/tdx: Drop flags from __tdx_hypercall() After TDX_HCALL_ISSUE_STI got dropped, the only flag left is TDX_HCALL_HAS_OUTPUT. The flag indicates if the caller wants to see tdx_hypercall_args updated based on the hypercall output. Drop the flags and provide __tdx_hypercall_ret() that matches TDX_HCALL_HAS_OUTPUT semantics. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20230321003511.9469-1-kirill.shutemov%40linux.intel.com	2023-03-22 11:36:05 -07:00
Linus Torvalds	d8e473182a	- Fixup comment typo - Prevent unexpected #VE's from: - Hosts removing perfectly good guest mappings (SEPT_VE_DISABLE - Excessive #VE notifications (NOTIFY_ENABLES) which are delivered via a #VE. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmP1WzMACgkQaDWVMHDJ krAqig/+MzIYmUIkuYbluektxPdzI6zhY/Z+eD5DDH9OFZX5e0WQrmHpQbJ3i4Q6 LT5JQ+yAI2ox/mhPfyCeDXqdRiatJJExUDUepc0qsOEW9gTsJ+edYwUsJg8HII61 +TLz/BiMSF6xCUk46b4CqzhoeEk1dupFAG204uc4vGwSfXdysN3buAcciJc1rOTS 7G9hI9fdLSjEJ8yyFebSDMPxSmdnjJPrDK3LF/leGJEpAQ/eMU0entG4ZH3Uyh2s 3EnDpOdRjX56LAEixB4e5igXyS7wesCun4ytOnwndzW8p4gPIsypcJUEbVt84BfA HQaSWP35BFAn0JshJnFPmj4r4jV2EB8l630dVTOKdNSiIa3YjyB5nbzy+mMPFl4f 8vcrHEZ6boEcRhgz0zFG0RfnDsjdbqKgFBXdRt0vYB/CG+EfmYaPoDXsb/8A7dtc 8IQ9wLk2AqG0L8blZVS2kjFxNa/9lkDcMsAbfZmlORTQTF2WN2Jlbxri87vuBpRy 8sqMUhgvHoffd/SIiDzJJIBjOH5/RhXLKhGzXQHI1vpZdU6ps9KIvohiycgx1mUQ lXXQwN5OWSHdUXZ7TFBIGXy9n32Ak/k5GCzCJSqvsMJDDdbycGVB+YCaKX6QK30+ HAHrPy/FQ3FFvZWdsDMD5Pn4RkF4LYH/k4QZwqBFMs9+/Sdzwxc= =UpyL -----END PGP SIGNATURE----- Merge tag 'x86_tdx_for_6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull Intel Trust Domain Extensions (TDX) updates from Dave Hansen: "Other than a minor fixup, the content here is to ensure that TDX guests never see virtualization exceptions (#VE's) that might be induced by the untrusted VMM. This is a highly desirable property. Without it, #VE exception handling would fall somewhere between NMIs, machine checks and total insanity. With it, #VE handling remains pretty mundane. Summary: - Fixup comment typo - Prevent unexpected #VE's from: - Hosts removing perfectly good guest mappings (SEPT_VE_DISABLE) - Excessive #VE notifications (NOTIFY_ENABLES) which are delivered via a #VE" * tag 'x86_tdx_for_6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall() x86/tdx: Disable NOTIFY_ENABLES x86/tdx: Relax SEPT_VE_DISABLE check for debug TD x86/tdx: Use ReportFatalError to report missing SEPT_VE_DISABLE x86/tdx: Expand __tdx_hypercall() to handle more arguments x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments x86/tdx: Add more registers to struct tdx_hypercall_args x86/tdx: Fix typo in comment in __tdx_hypercall()	2023-02-25 09:11:30 -08:00
Kirill A. Shutemov	1e70c68037	x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall() If compiled with CONFIG_FRAME_POINTER=y, objtool is not happy that __tdx_hypercall() messes up RBP: objtool: __tdx_hypercall+0x7f: return with modified stack frame Rework the function to store TDX_HCALL_ flags on stack instead of RBP. [ dhansen: minor changelog tweaks ] Fixes: `c30c4b2555` ("x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/202301290255.buUBs99R-lkp@intel.com Link: https://lore.kernel.org/all/20230130135354.27674-1-kirill.shutemov%40linux.intel.com	2023-02-02 16:31:25 -08:00
Ingo Molnar	57a30218fa	Linux 6.2-rc6 -----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh SDc/Y/c= =zpaD -----END PGP SIGNATURE----- Merge tag 'v6.2-rc6' into sched/core, to pick up fixes Pick up fixes before merging another batch of cpuidle updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2023-01-31 15:01:20 +01:00
Kirill A. Shutemov	8de62af018	x86/tdx: Disable NOTIFY_ENABLES == Background == There is a class of side-channel attacks against SGX enclaves called "SGX Step"[1]. These attacks create lots of exceptions inside of enclaves. Basically, run an in-enclave instruction, cause an exception. Over and over. There is a concern that a VMM could attack a TDX guest in the same way by causing lots of #VE's. The TDX architecture includes new countermeasures for these attacks. It basically counts the number of exceptions and can send another special exception once the number of VMM-induced #VE's hits a critical threshold[2]. == Problem == But, these special exceptions are independent of any action that the guest takes. They can occur anywhere that the guest executes. This includes sensitive areas like the entry code. The (non-paranoid) #VE handler is incapable of handling exceptions in these areas. == Solution == Fortunately, the special exceptions can be disabled by the guest via write to NOTIFY_ENABLES TDCS field. NOTIFY_ENABLES is disabled by default, but might be enabled by a bootloader, firmware or an earlier kernel before the current kernel runs. Disable NOTIFY_ENABLES feature explicitly and unconditionally. Any NOTIFY_ENABLES-based #VE's that occur before this point will end up in the early #VE exception handler and die due to unexpected exit reason. [1] https://github.com/jovanbulck/sgx-step [2] https://intel.github.io/ccc-linux-guest-hardening-docs/security-spec.html#safety-against-ve-in-kernel-code Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Link: https://lore.kernel.org/all/20230126221159.8635-8-kirill.shutemov%40linux.intel.com	2023-01-27 09:46:05 -08:00
Kirill A. Shutemov	47e67cf317	x86/tdx: Relax SEPT_VE_DISABLE check for debug TD A "SEPT #VE" occurs when a TDX guest touches memory that is not properly mapped into the "secure EPT". This can be the result of hypervisor attacks or bugs, OR guest bugs. Most notably, buggy guests might touch unaccepted memory for lots of different memory safety bugs like buffer overflows. TDX guests do not want to continue in the face of hypervisor attacks or hypervisor bugs. They want to terminate as fast and safely as possible. SEPT_VE_DISABLE ensures that TDX guests can't continue in the face of these kinds of issues. But, that causes a problem. TDX guests that can't continue can't spit out oopses or other debugging info. In essence SEPT_VE_DISABLE=1 guests are not debuggable. Relax the SEPT_VE_DISABLE check to warning on debug TD and panic() in the #VE handler on EPT-violation on private memory. It will produce useful backtrace. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230126221159.8635-7-kirill.shutemov%40linux.intel.com	2023-01-27 09:46:05 -08:00
Kirill A. Shutemov	71acdcd7cd	x86/tdx: Use ReportFatalError to report missing SEPT_VE_DISABLE Linux TDX guests require that the SEPT_VE_DISABLE "attribute" be set. If it is not set, the kernel is theoretically required to handle exceptions anywhere that kernel memory is accessed, including places like NMI handlers and in the syscall entry gap. Rather than even try to handle these exceptions, the kernel refuses to run if SEPT_VE_DISABLE is unset. However, the SEPT_VE_DISABLE detection and refusal code happens very early in boot, even before earlyprintk runs. Calling panic() will effectively just hang the system. Instead, call a TDX-specific panic() function. This makes a very simple TDVMCALL which gets a short error string out to the hypervisor without any console infrastructure. Use TDG.VP.VMCALL<ReportFatalError> to report the error. The hypercall can encode message up to 64 bytes in eight registers. [ dhansen: tweak comment and remove while loop brackets. ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230126221159.8635-6-kirill.shutemov%40linux.intel.com	2023-01-27 09:45:55 -08:00
Kirill A. Shutemov	752d13305c	x86/tdx: Expand __tdx_hypercall() to handle more arguments So far __tdx_hypercall() only handles six arguments for VMCALL. Expanding it to six more register would allow to cover more use-cases like ReportFatalError() and Hyper-V hypercalls. With all preparations in place, the expansion is pretty straight forward. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230126221159.8635-5-kirill.shutemov%40linux.intel.com	2023-01-27 09:42:09 -08:00
Kirill A. Shutemov	c30c4b2555	x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments RDI is the first argument to __tdx_hypercall() that used to pass pointer to struct tdx_hypercall_args. RSI is the second argument that contains flags, such as TDX_HCALL_HAS_OUTPUT and TDX_HCALL_ISSUE_STI. RDI and RSI can also be used as arguments to TDVMCALL leafs. Move RDI to RAX and RSI to RBP to free up them for the hypercall arguments. RAX saved on stack during TDCALL as it returns status code in the register. RBP value has to be restored before returning from __tdx_hypercall() as it is callee-saved register. This is preparatory patch. No functional change. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230126221159.8635-4-kirill.shutemov%40linux.intel.com	2023-01-27 09:42:09 -08:00
Kirill A. Shutemov	3543f8830b	x86/tdx: Fix typo in comment in __tdx_hypercall() Comment in __tdx_hypercall() points that RAX==0 indicates TDVMCALL failure which is opposite of the truth: RAX==0 is success. Fix the comment. No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230126221159.8635-2-kirill.shutemov%40linux.intel.com	2023-01-27 09:42:09 -08:00
Peter Zijlstra	c3982c1a36	cpuidle, tdx: Make TDX code noinstr clean objtool found a few cases where this code called out into instrumented code: vmlinux.o: warning: objtool: __halt+0x2c: call to hcall_func.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: __halt+0x3f: call to __tdx_hypercall() leaves .noinstr.text section vmlinux.o: warning: objtool: __tdx_hypercall+0x66: call to __tdx_hypercall_failed() leaves .noinstr.text section Fix it by: - moving TDX tdcall assembly methods into .noinstr.text (they are already noistr-clean) - marking __tdx_hypercall_failed() as 'noinstr' - annotating hcall_func() as __always_inline Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195541.111485720@infradead.org	2023-01-13 11:48:16 +01:00
Peter Zijlstra	e80a48bade	x86/tdx: Remove TDX_HCALL_ISSUE_STI Now that arch_cpu_idle() is expected to return with IRQs disabled, avoid the useless STI/CLI dance. Per the specs this is supposed to work, but nobody has yet relied up this behaviour so broken implementations are possible. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195540.682137572@infradead.org	2023-01-13 11:48:15 +01:00
Peter Zijlstra	89b3098703	arch/idle: Change arch_cpu_idle() behavior: always exit with IRQs disabled Current arch_cpu_idle() is called with IRQs disabled, but will return with IRQs enabled. However, the very first thing the generic code does after calling arch_cpu_idle() is raw_local_irq_disable(). This means that architectures that can idle with IRQs disabled end up doing a pointless 'enable-disable' dance. Therefore, push this IRQ disabling into the idle function, meaning that those architectures can avoid the pointless IRQ state flipping. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Tony Lindgren <tony@atomide.com> Tested-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Acked-by: Mark Rutland <mark.rutland@arm.com> [arm64] Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Guo Ren <guoren@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20230112195540.618076436@infradead.org	2023-01-13 11:48:15 +01:00
Jason A. Donenfeld	72bb8f8cc0	x86/insn: Avoid namespace clash by separating instruction decoder MMIO type from MMIO trace type Both <linux/mmiotrace.h> and <asm/insn-eval.h> define various MMIO_ enum constants, whose namespace overlaps. Rename the <asm/insn-eval.h> ones to have a INSN_ prefix, so that the headers can be used from the same source file. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230101162910.710293-2-Jason@zx2c4.com	2023-01-03 18:46:06 +01:00
Kuppuswamy Sathyanarayanan	51acfe89af	x86/tdx: Add a wrapper to get TDREPORT0 from the TDX Module To support TDX attestation, the TDX guest driver exposes an IOCTL interface to allow userspace to get the TDREPORT0 (a.k.a. TDREPORT subtype 0) from the TDX module via TDG.MR.TDREPORT TDCALL. In order to get the TDREPORT0 in the TDX guest driver, instead of using a low level function like __tdx_module_call(), add a tdx_mcall_get_report0() wrapper function to handle it. This is a preparatory patch for adding attestation support. Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Wander Lairson Costa <wander@redhat.com> Link: https://lore.kernel.org/all/20221116223820.819090-2-sathyanarayanan.kuppuswamy%40linux.intel.com	2022-11-17 11:03:09 -08:00
Kirill A. Shutemov	373e715e31	x86/tdx: Panic on bad configs that #VE on "private" memory access All normal kernel memory is "TDX private memory". This includes everything from kernel stacks to kernel text. Handling exceptions on arbitrary accesses to kernel memory is essentially impossible because they can happen in horribly nasty places like kernel entry/exit. But, TDX hardware can theoretically _deliver_ a virtualization exception (#VE) on any access to private memory. But, it's not as bad as it sounds. TDX can be configured to never deliver these exceptions on private memory with a "TD attribute" called ATTR_SEPT_VE_DISABLE. The guest has no way to set this attribute, but it can check it. Ensure ATTR_SEPT_VE_DISABLE is set in early boot. panic() if it is unset. There is no sane way for Linux to run with this attribute clear so a panic() is appropriate. There's small window during boot before the check where kernel has an early #VE handler. But the handler is only for port I/O and will also panic() as soon as it sees any other #VE, such as a one generated by a private memory access. [ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo(). Add Kirill's tested-by because I made changes since he wrote this. ] Fixes: `9a22bf6deb` ("x86/traps: Add #VE support for TDX guest") Reported-by: ruogui.ygr@alibaba-inc.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221028141220.29217-3-kirill.shutemov%40linux.intel.com	2022-11-01 16:02:40 -07:00
Dave Hansen	a6dd6f3900	x86/tdx: Prepare for using "INFO" call for a second purpose The TDG.VP.INFO TDCALL provides the guest with various details about the TDX system that the guest needs to run. Only one field is currently used: 'gpa_width' which tells the guest which PTE bits mark pages shared or private. A second field is now needed: the guest "TD attributes" to tell if virtualization exceptions are configured in a way that can harm the guest. Make the naming and calling convention more generic and discrete from the mask-centric one. Thanks to Sathya for the inspiration here, but there's no code, comments or changelogs left from where he started. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org	2022-11-01 10:07:15 -07:00
Kirill A. Shutemov	1e7769653b	x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page load_unaligned_zeropad() can lead to unwanted loads across page boundaries. The unwanted loads are typically harmless. But, they might be made to totally unrelated or even unmapped memory. load_unaligned_zeropad() relies on exception fixup (#PF, #GP and now #VE) to recover from these unwanted loads. In TDX guests, the second page can be shared page and a VMM may configure it to trigger #VE. The kernel assumes that #VE on a shared page is an MMIO access and tries to decode instruction to handle it. In case of load_unaligned_zeropad() it may result in confusion as it is not MMIO access. Fix it by detecting split page MMIO accesses and failing them. load_unaligned_zeropad() will recover using exception fixups. The issue was discovered by analysis and reproduced artificially. It was not triggered during testing. [ dhansen: fix up changelogs and comments for grammar and clarity, plus incorporate Kirill's off-by-one fix] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220614120135.14812-4-kirill.shutemov@linux.intel.com	2022-06-17 15:37:33 -07:00
Kirill A. Shutemov	cdd85786f4	x86/tdx: Clarify RIP adjustments in #VE handler After successful #VE handling, tdx_handle_virt_exception() has to move RIP to the next instruction. The handler needs to know the length of the instruction. If the #VE happened due to instruction execution, the GET_VEINFO TDX module call provides info on the instruction in R10, including its length. For #VE due to EPT violation, the info in R10 is not populand and the kernel must decode the instruction manually to find out its length. Restructure the code to make it explicit that the instruction length depends on the type of #VE. Make individual #VE handlers return the instruction length on success or -errno on failure. [ dhansen: fix up changelog and comments ] Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220614120135.14812-3-kirill.shutemov@linux.intel.com	2022-06-15 11:05:16 -07:00
Kirill A. Shutemov	60428d8bc2	x86/tdx: Fix early #VE handling tdx_early_handle_ve() does not increment RIP after successfully handling the exception. That leads to infinite loop of exceptions. Move RIP when exceptions are successfully handled. [ dhansen: make problem statement more clear ] Fixes: `32e72854fa` ("x86/tdx: Port I/O: Add early boot support") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lkml.kernel.org/r/20220614120135.14812-2-kirill.shutemov@linux.intel.com	2022-06-15 10:52:59 -07:00

1 2

72 Commits