Commit Graph

57 Commits

Author SHA1 Message Date
Linus Torvalds 12dc010071 - Some SEV and CC platform helpers cleanup and simplifications now that
the usage patterns are becoming apparent
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSa9CoACgkQEsHwGGHe
 VUr1NhAAjmOq/T41u3FCSU6fZ8gXo5UkIUT13a6cx+6Omx9waJn5G0xdf/380vQN
 RPRTcc4cfQHCdnIeHgiz1YtCh1ljxXswOSbexHgWHjEcqadgxkZTlKaBEbqyEwLI
 lnRRsowfk7J/8RsYqtzuBvGaWNliiszWE8iayruI1IL+FoEDLlLx1GNYqusP5WIs
 0KYm919Zozl8FEZjP47nH4bab1RcE+HGmLG7UEBmR0zHl4cc7iN3wpv2o/vDxVzR
 /KP8a2G7J/xjllGW+OP81dFCS7iklHpNuaxQS73fDIL7ll2VDqNemh4ivykCrplo
 93twODBwKboKmZhnKc0M2axm5JGGx7IC3KTqEUHzb2Wo4bZCYnrj+9Utzxsa3FxB
 m0BSUcmBqzZCsHCbu62N66l1NlB32EnMO80/45NrgGi62YiGP8qQNhy3TiceUNle
 NHFkQmRZwLyW5YC1ntSK8fwSu4GrMG1MG/eRfMPDmmsYogiZUm2KIj7XKy3dKXR6
 maqifh/raPk3rL+7cl9BQleCDLjnkHNxFxFa329P9K4wrtqn7Ley4izWYAUYhNrl
 VOjLs+thTwEmPPgpo/K1wAbR/PjLdmSQ/fQR6w7eUrzNVnm5ndXzhTUcIawycG3T
 haVry+xPMIRlLCIg+dkrwwNcTW1y/X3K4SmgnjLLF57lZFn9UJc=
 =Aj6p
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV updates from Borislav Petkov:

 - Some SEV and CC platform helpers cleanup and simplifications now that
   the usage patterns are becoming apparent

[ I'm sure I'm the only one that has gets confused by all the TLAs, but
  in case there are others: here SEV is AMD's "Secure Encrypted
  Virtualization" and CC is generic "Confidential Computing".

  There's also Intel SGX (Software Guard Extensions) and TDX (Trust
  Domain Extensions), along with all the vendor memory encryption
  extensions (SME, TSME, TME, and WTF).

  And then we have arm64 with RMA and CCA, and I probably forgot another
  dozen or so related acronyms    - Linus ]

* tag 'x86_sev_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/coco: Get rid of accessor functions
  x86/sev: Get rid of special sev_es_enable_key
  x86/coco: Mark cc_platform_has() and descendants noinstr
2023-06-27 13:26:30 -07:00
Linus Torvalds 5dfe7a7e52 - Fix a race window where load_unaligned_zeropad() could cause
a fatal shutdown during TDX private<=>shared conversion
  - Annotate sites where VM "exit reasons" are reused as hypercall
    numbers.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmSZ50cACgkQaDWVMHDJ
 krAPEg//bAR0SzrjIir2eiQ7p3ktr4L7iae0odzFkW/XHam5ZJP+v9cCMLzY6zNO
 44x9Z85jZ9w34GcZ4D7D0OmTHbcDpcxckPXSFco/dK4IyYeLzUImYXEYo41YJEx9
 O4sSQMBqIyjMXej/oKBhgKHSWaV60XimvQvTvhpjXGD/45bt9sx4ZNVTi8+xVMbw
 jpktMsQcsjHctcIY2D2eUR61Ma/Vg9t6Qih51YMtbq6Nqcyhw2IKvDwIx3kQuW34
 qSW7wsyn+RfHQDpwjPgDG/6OE815Pbtzlxz+y6tB9pN88IWkA1H5Jh2CQRlMBud2
 2nVQRpqPgr9uOIeNnNI7FFd1LgTIc/v7lDPfpUH9KelOs7cGWvaRymkuhPSvWxRI
 tmjlMdFq8XcjrOPieA9WpxYKXinqj4wNXtnYGyaM+Ur/P3qWaj18PMCYMbeN6pJC
 eNYEJVk2Mt8GmiPL55aYG5+Z1F8sciLKbz8TFq5ya2z0EnSbyVvR+DReqd7zRzh6
 Bmbmx9isAzN6wWNszNt7f8XSgRPV2Ri1tvb1vixk3JLxyx2iVCUL6KJ0cZOUNy0x
 nQqy7/zMtBsFGZ/Ca8f2kpVaGgxkUFy7n1rI4psXTGBOVlnJyMz3WSi9N8F/uJg0
 Ca5W4493+txdyHSAmWBQAQuZp3RJOlhTkXe5dfjukv6Rnnw1MZU=
 =Hr3D
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 tdx updates from Dave Hansen:

 - Fix a race window where load_unaligned_zeropad() could cause a fatal
   shutdown during TDX private<=>shared conversion

   The race has never been observed in practice but might allow
   load_unaligned_zeropad() to catch a TDX page in the middle of its
   conversion process which would lead to a fatal and unrecoverable
   guest shutdown.

 - Annotate sites where VM "exit reasons" are reused as hypercall
   numbers.

* tag 'x86_tdx_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Fix enc_status_change_finish_noop()
  x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()
  x86/mm: Allow guest.enc_status_change_prepare() to fail
  x86/tdx: Wrap exit reason with hcall_func()
2023-06-26 16:32:47 -07:00
Linus Torvalds 2c96136a3f - Add support for unaccepted memory as specified in the UEFI spec v2.9.
The gist of it all is that Intel TDX and AMD SEV-SNP confidential
   computing guests define the notion of accepting memory before using it
   and thus preventing a whole set of attacks against such guests like
   memory replay and the like.
 
   There are a couple of strategies of how memory should be accepted
   - the current implementation does an on-demand way of accepting.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSZ0f4ACgkQEsHwGGHe
 VUpasw//RKoNW9HSU1csY+XnG9uuaT6QKgji+gIEZWWIGPO9iibvbBj6P5WxJE8T
 fe7yb6CGa6d6thoU0v+mQGVVvCd7OjCFwPD5wAo4mXToD7Ig+4mI6jMkaKifqa2f
 N1Uuy8u/zQnGyWrP5Y//WH5bJYfsmds4UGwXI2nLvKlhE7MG90/ePjt7iqnnwZsy
 waLp6a0Q1VeOvnfRszFLHZw/SoER5RSJ4qeVqttkFNmPPEKMK1Kirrl2poR56OQJ
 nMr6LqVtD7erlSJ36VRXOKzLI443A4iIEIg/wBjIOU6L5ZEWJGNqtCDnIqFJ6+TM
 XatsejfRYkkMZH0qXtX9+M0u+HJHbZPCH5rEcA21P3Nbd7od/ANq91qCGoMjtUZ4
 7pZohMG8M6IDvkLiOb8fQVkR5k/9Jbk8UvdN/8jdPx1ERxYMFO3BDvJpV2gzrW4B
 KYtFTPR7j2nY3eKfDpe3flanqYzKUBsKoTlLnlH7UHaiMZ2idwG8AQjlrhC/erCq
 /Lq1LXt4Mq46FyHABc+PSHytu0WWj1nBUftRt+lviY/Uv7TlkBldOTT7wm7itsfF
 HUCTfLWl0CJXKPq8rbbZhAG/exN6Ay6MO3E3OcNq8A72E5y4cXenuG3ic/0tUuOu
 FfjpiMk35qE2Qb4hnj1YtF3XINtd1MpKcuwzGSzEdv9s3J7hrS0=
 =FS95
 -----END PGP SIGNATURE-----

Merge tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 confidential computing update from Borislav Petkov:

 - Add support for unaccepted memory as specified in the UEFI spec v2.9.

   The gist of it all is that Intel TDX and AMD SEV-SNP confidential
   computing guests define the notion of accepting memory before using
   it and thus preventing a whole set of attacks against such guests
   like memory replay and the like.

   There are a couple of strategies of how memory should be accepted -
   the current implementation does an on-demand way of accepting.

* tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  virt: sevguest: Add CONFIG_CRYPTO dependency
  x86/efi: Safely enable unaccepted memory in UEFI
  x86/sev: Add SNP-specific unaccepted memory support
  x86/sev: Use large PSC requests if applicable
  x86/sev: Allow for use of the early boot GHCB for PSC requests
  x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
  x86/sev: Fix calculation of end address based on number of pages
  x86/tdx: Add unaccepted memory support
  x86/tdx: Refactor try_accept_one()
  x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
  efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
  efi: Add unaccepted memory support
  x86/boot/compressed: Handle unaccepted memory
  efi/libstub: Implement support for unaccepted memory
  efi/x86: Get full memory map in allocate_e820()
  mm: Add support for unaccepted memory
2023-06-26 15:32:39 -07:00
Kirill A. Shutemov 195edce08b x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()
tl;dr: There is a race in the TDX private<=>shared conversion code
       which could kill the TDX guest.  Fix it by changing conversion
       ordering to eliminate the window.

TDX hardware maintains metadata to track which pages are private and
shared. Additionally, TDX guests use the guest x86 page tables to
specify whether a given mapping is intended to be private or shared.
Bad things happen when the intent and metadata do not match.

So there are two thing in play:
 1. "the page" -- the physical TDX page metadata
 2. "the mapping" -- the guest-controlled x86 page table intent

For instance, an unrecoverable exit to VMM occurs if a guest touches a
private mapping that points to a shared physical page.

In summary:
	* Private mapping => Private Page == OK (obviously)
	* Shared mapping  => Shared Page  == OK (obviously)
	* Private mapping => Shared Page  == BIG BOOM!
	* Shared mapping  => Private Page == OK-ish
	  (It will read generate a recoverable #VE via handle_mmio())

Enter load_unaligned_zeropad(). It can touch memory that is adjacent but
otherwise unrelated to the memory it needs to touch. It will cause one
of those unrecoverable exits (aka. BIG BOOM) if it blunders into a
shared mapping pointing to a private page.

This is a problem when __set_memory_enc_pgtable() converts pages from
shared to private. It first changes the mapping and second modifies
the TDX page metadata.  It's moving from:

        * Shared mapping  => Shared Page  == OK
to:
        * Private mapping => Shared Page  == BIG BOOM!

This means that there is a window with a shared mapping pointing to a
private page where load_unaligned_zeropad() can strike.

Add a TDX handler for guest.enc_status_change_prepare(). This converts
the page from shared to private *before* the page becomes private.  This
ensures that there is never a private mapping to a shared page.

Leave a guest.enc_status_change_finish() in place but only use it for
private=>shared conversions.  This will delay updating the TDX metadata
marking the page private until *after* the mapping matches the metadata.
This also ensures that there is never a private mapping to a shared page.

[ dhansen: rewrite changelog ]

Fixes: 7dbde76316 ("x86/mm/cpa: Add support for TDX shared memory")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/all/20230606095622.1939-3-kirill.shutemov%40linux.intel.com
2023-06-06 16:24:02 -07:00
Kirill A. Shutemov 75d090fd16 x86/tdx: Add unaccepted memory support
Hookup TDX-specific code to accept memory.

Accepting the memory is done with ACCEPT_PAGE module call on every page
in the range. MAP_GPA hypercall is not required as the unaccepted memory
is considered private already.

Extract the part of tdx_enc_status_changed() that does memory acceptance
in a new helper. Move the helper tdx-shared.c. It is going to be used by
both main kernel and decompressor.

  [ bp: Fix the INTEL_TDX_GUEST=y, KVM_GUEST=n build. ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230606142637.5171-10-kirill.shutemov@linux.intel.com
2023-06-06 18:25:57 +02:00
Kirill A. Shutemov c2b353ae24 x86/tdx: Refactor try_accept_one()
Rework try_accept_one() to return accepted size instead of modifying
'start' inside the helper. It makes 'start' in-only argument and
streamlines code on the caller side.

Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20230606142637.5171-9-kirill.shutemov@linux.intel.com
2023-06-06 17:38:50 +02:00
Kirill A. Shutemov ff40b5769a x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
Memory acceptance requires a hypercall and one or multiple module calls.

Make helpers for the calls available in boot stub. It has to accept
memory where kernel image and initrd are placed.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20230606142637.5171-8-kirill.shutemov@linux.intel.com
2023-06-06 17:31:23 +02:00
Thomas Gleixner ff3cfcb0d4 x86/smpboot: Fix the parallel bringup decision
The decision to allow parallel bringup of secondary CPUs checks
CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use
parallel bootup because accessing the local APIC is intercepted and raises
a #VC or #VE, which cannot be handled at that point.

The check works correctly, but only for AMD encrypted guests. TDX does not
set that flag.

As there is no real connection between CC attributes and the inability to
support parallel bringup, replace this with a generic control flag in
x86_cpuinit and let SEV-ES and TDX init code disable it.

Fixes: 0c7ffa32db ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it")
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/r/87ilc9gd2d.ffs@tglx
2023-05-31 16:49:34 +02:00
Nikolay Borisov 122333d6bd x86/tdx: Wrap exit reason with hcall_func()
TDX reuses VMEXIT "reasons" in its guest->host hypercall ABI.  This is
confusing because there might not be a VMEXIT involved at *all*.
These instances are supposed to document situation and reduce confusion
by wrapping VMEXIT reasons with hcall_func().

The decompression code does not follow this convention.

Unify the TDX decompression code with the other TDX use of VMEXIT reasons.
No functional changes.

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lore.kernel.org/all/20230505120332.1429957-1-nik.borisov%40suse.com
2023-05-23 07:01:45 -07:00
Borislav Petkov (AMD) da86eb9611 x86/coco: Get rid of accessor functions
cc_vendor is __ro_after_init and thus can be used directly.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230508121957.32341-1-bp@alien8.de
2023-05-09 12:53:16 +02:00
Borislav Petkov (AMD) 1eaf282e2c x86/coco: Mark cc_platform_has() and descendants noinstr
Those will be used in code regions where instrumentation is not allowed
so mark them as such.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230328201712.25852-2-bp@alien8.de
2023-05-08 11:39:35 +02:00
Linus Torvalds 7b664cc38e * Do conditional __tdx_hypercall() 'output' processing via an
assembly macro argument rather than a runtime register.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmRKvncACgkQaDWVMHDJ
 krDQsA/+OlDCexITqWgmd3rPN2ZUuyB+aV4MfoKppQuAuWD4I7vxD5qeqjRS2XTh
 E6SSzp43zEVhVo6Kv3UvPR/Tr9edUGn2KzIWmqd1bOwhgbEfd898gzbWuRmK6i8t
 qqweR1RMAL/COgPAlcrdpTLl2PCc9tLYpDnQ8WcAUqH4uoePpQyN3Za0J/dcKX7l
 8XexOAaco4Wz3ylD9npPcLo9ytvohg+exJtCNldN1l2j5xXdA2fTqEJYaUMp/+Nd
 Z1TTQ43QcT7dRknFojxdYfAkCqBfr8ccBAwV1mriahKWY/3xl35BqSeJVlma1tkm
 UzkTY1CFwKYRk24C/oQK7OQMYnyJ7Q1RhSrd91lQWVjaTcI/3DPUKiKKdwFXDv4C
 FUYvuJkanPVk3PyCZRvltdNvsXsifzx0RKZWLZ+3TQ2jtaMEDOzPgChq7a6WfpkQ
 HQPuVoENHvyHdUycQhtELUsaJ3AdnOM87XiQDcbNNiaPiOLB9C8dhSWMKoPsMehO
 oAiUQ7lW6po0lcELVSKib2ASVpXhOmlAxdRyZ50mhjrbpcxfBBGD3+KdFqZ4Gs1c
 8UyrQbjVq07Lx2fvdizvDpIcr4M7z0xBAhJeIegC6z86XpJq5uvin+vOLzFAfe16
 WGy6FiZtVpXp4fyqUY7GgQNqhk1b8h6EHKd9d/zCSPuH8/wT/6g=
 =hvDm
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 tdx update from Dave Hansen:
 "The original tdx hypercall assembly code took two flags in %RSI to
  tweak its behavior at runtime. PeterZ recently axed one flag in commit
  e80a48bade ("x86/tdx: Remove TDX_HCALL_ISSUE_STI").

  Kill the other flag too and tweak the 'output' mode with an assembly
  macro instead. This results in elimination of one push/pop pair and
  overall easier to read assembly.

   - Do conditional __tdx_hypercall() 'output' processing via an
     assembly macro argument rather than a runtime register"

* tag 'x86_tdx_for_6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tdx: Drop flags from __tdx_hypercall()
2023-04-28 09:36:09 -07:00
Linus Torvalds de10553fce x86 APIC updates:
- Fix the incorrect handling of atomic offset updates in
    reserve_eilvt_offset()
 
    The check for the return value of atomic_cmpxchg() is not compared
    against the old value, it is compared against the new value, which
    makes it two round on success.
 
    Convert it to atomic_try_cmpxchg() which does the right thing.
 
  - Handle IO/APIC less systems correctly
 
    When IO/APIC is not advertised by ACPI then the computation of the lower
    bound for dynamically allocated interrupts like MSI goes wrong.
 
    This lower bound is used to exclude the IO/APIC legacy GSI space as that
    must stay reserved for the legacy interrupts.
 
    In case that the system, e.g. VM, does not advertise an IO/APIC the
    lower bound stays at 0.
 
    0 is an invalid interrupt number except for the legacy timer interrupt
    on x86. The return value is unchecked in the core code, so it ends up
    to allocate interrupt number 0 which is subsequently considered to be
    invalid by the caller, e.g. the MSI allocation code.
 
    A similar problem was already cured for device tree based systems years
    ago, but that missed - or did not envision - the zero IO/APIC case.
 
    Consolidate the zero check and return the provided "from" argument to the
    core code call site, which is guaranteed to be greater than 0.
 
  - Simplify the X2APIC cluster CPU mask logic for CPU hotplug
 
    Per cluster CPU masks are required for X2APIC in cluster mode to
    determine the correct cluster for a target CPU when calculating the
    destination for IPIs
 
    These masks are established when CPUs are borught up. The first CPU in a
    cluster must allocate a new cluster CPU mask. As this happens during the
    early startup of a CPU, where memory allocations cannot be done, the
    mask has to be allocated by the control CPU.
 
    The current implementation allocates a clustermask just in case and if
    the to be brought up CPU is the first in a cluster the CPU takes over
    this allocation from a global pointer.
 
    This works nicely in the fully serialized CPU bringup scenario which is
    used today, but would fail completely for parallel bringup of CPUs.
 
    The cluster association of a CPU can be computed from the APIC ID which
    is enumerated by ACPI/MADT.
 
    So the cluster CPU masks can be preallocated and associated upfront and
    the upcoming CPUs just need to set their corresponding bit.
 
    Aside of preparing for parallel bringup this is a valuable
    simplification on its own.
 
  - Remove global variables which control the early startup of secondary
    CPUs on 64-bit
 
    The only information which is needed by a starting CPU is the Linux CPU
    number. The CPU number allows it to retrieve the rest of the required
    data from already existing per CPU storage.
 
    So instead of initial_stack, early_gdt_desciptor and initial_gs provide
    a new variable smpboot_control which contains the Linux CPU number for
    now. The starting CPU can retrieve and compute all required information
    for startup from there.
 
    Aside of being a cleanup, this is also preparing for parallel CPU
    bringup, where starting CPUs will look up their Linux CPU number via the
    APIC ID, when smpboot_control has the corresponding control bit set.
 
  - Make cc_vendor globally accesible
 
    Subsequent parallel bringup changes require access to cc_vendor because
    confidental computing platforms need special treatment in the early
    startup phase vs. CPUID and APCI ID readouts.
 
    The change makes cc_vendor global and provides stub accessors in case
    that CONFIG_ARCH_HAS_CC_PLATFORM is not set.
 
    This was merged from the x86/cc branch in anticipation of further
    parallel bringup commits which require access to cc_vendor. Due to late
    discoveries of fundamental issue with those patches these commits never
    happened.
 
    The merge commit is unfortunately in the middle of the APIC commits so
    unraveling it would have required a rebase or revert. As the parallel
    bringup seems to be well on its way for 6.5 this would be just pointless
    churn. As the commit does not contain any functional change it's not a
    risk to keep it.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmRGuAwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRzSEADEx1sVkd2yrLcTYdpjdKbbUaDJ6lR0
 DXxIP3+ApGHmV9l9yIh+/5C2oEJsiUfFf1vdh6ajv5iXpksCKzcUzkW5g3w7nM36
 CSpULpFjwvaq8TIo0o1PIhAbo/yIMMzJVDs8R0reCnWgGAWZoW/a9Ndcvcicd0an
 pQAlkw3FD5r92mcMlKPNWFoui1AkScGEV02zJ7884MAukmBZwD8Jd+gE6eQC9GKa
 9hyJiB77st1URl+a0cPsPYvv8RLVuVcljWsh2edyvxgovIO56+BoEjbrgRSF6cqQ
 Bhzo//3KgbUJ1y+YqH01aKZzY0hRpbAi2Rew4RBKcBKwCGd2qltUQG0LFNxAtV83
 RsC573wSCGSCGO5Xb1RVXih5is+9YqMqitJNWvEc15jjOA9nwoLc80axP11v42f9
 Xl4iGHQTWVGdxT4H22NH7UCuRlGg38vAx+In2HGpN/e57q2ighESjiGuqQAQpLel
 pbOeJtQ/D2xXVKcCap4T/P/2x5ls7bsc76MWJBMcYC3pRgJ5M7ZHw7wTw0IAty4x
 xCfR1bsRVEAhrE9r/odgNipXjBJu+CdGBAupNEIiRyq1QiwUKtMTayasRGUlbYO6
 vrieHKqoflzRVg2M9Bgm3oI28X27FzZHWAZJW2oJ2Wnn2jL5kuRJa1nEykqo8pEP
 j6rjnScRVvdpIw==
 =IQWG
 -----END PGP SIGNATURE-----

Merge tag 'x86-apic-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 APIC updates from Thomas Gleixner:

 - Fix the incorrect handling of atomic offset updates in
   reserve_eilvt_offset()

   The check for the return value of atomic_cmpxchg() is not compared
   against the old value, it is compared against the new value, which
   makes it two round on success.

   Convert it to atomic_try_cmpxchg() which does the right thing.

 - Handle IO/APIC less systems correctly

   When IO/APIC is not advertised by ACPI then the computation of the
   lower bound for dynamically allocated interrupts like MSI goes wrong.

   This lower bound is used to exclude the IO/APIC legacy GSI space as
   that must stay reserved for the legacy interrupts.

   In case that the system, e.g. VM, does not advertise an IO/APIC the
   lower bound stays at 0.

   0 is an invalid interrupt number except for the legacy timer
   interrupt on x86. The return value is unchecked in the core code, so
   it ends up to allocate interrupt number 0 which is subsequently
   considered to be invalid by the caller, e.g. the MSI allocation code.

   A similar problem was already cured for device tree based systems
   years ago, but that missed - or did not envision - the zero IO/APIC
   case.

   Consolidate the zero check and return the provided "from" argument to
   the core code call site, which is guaranteed to be greater than 0.

 - Simplify the X2APIC cluster CPU mask logic for CPU hotplug

   Per cluster CPU masks are required for X2APIC in cluster mode to
   determine the correct cluster for a target CPU when calculating the
   destination for IPIs

   These masks are established when CPUs are borught up. The first CPU
   in a cluster must allocate a new cluster CPU mask. As this happens
   during the early startup of a CPU, where memory allocations cannot be
   done, the mask has to be allocated by the control CPU.

   The current implementation allocates a clustermask just in case and
   if the to be brought up CPU is the first in a cluster the CPU takes
   over this allocation from a global pointer.

   This works nicely in the fully serialized CPU bringup scenario which
   is used today, but would fail completely for parallel bringup of
   CPUs.

   The cluster association of a CPU can be computed from the APIC ID
   which is enumerated by ACPI/MADT.

   So the cluster CPU masks can be preallocated and associated upfront
   and the upcoming CPUs just need to set their corresponding bit.

   Aside of preparing for parallel bringup this is a valuable
   simplification on its own.

 - Remove global variables which control the early startup of secondary
   CPUs on 64-bit

   The only information which is needed by a starting CPU is the Linux
   CPU number. The CPU number allows it to retrieve the rest of the
   required data from already existing per CPU storage.

   So instead of initial_stack, early_gdt_desciptor and initial_gs
   provide a new variable smpboot_control which contains the Linux CPU
   number for now. The starting CPU can retrieve and compute all
   required information for startup from there.

   Aside of being a cleanup, this is also preparing for parallel CPU
   bringup, where starting CPUs will look up their Linux CPU number via
   the APIC ID, when smpboot_control has the corresponding control bit
   set.

 - Make cc_vendor globally accesible

   Subsequent parallel bringup changes require access to cc_vendor
   because confidental computing platforms need special treatment in the
   early startup phase vs. CPUID and APCI ID readouts.

   The change makes cc_vendor global and provides stub accessors in case
   that CONFIG_ARCH_HAS_CC_PLATFORM is not set.

   This was merged from the x86/cc branch in anticipation of further
   parallel bringup commits which require access to cc_vendor. Due to
   late discoveries of fundamental issue with those patches these
   commits never happened.

   The merge commit is unfortunately in the middle of the APIC commits
   so unraveling it would have required a rebase or revert. As the
   parallel bringup seems to be well on its way for 6.5 this would be
   just pointless churn. As the commit does not contain any functional
   change it's not a risk to keep it.

* tag 'x86-apic-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ioapic: Don't return 0 from arch_dynirq_lower_bound()
  x86/apic: Fix atomic update of offset in reserve_eilvt_offset()
  x86/coco: Export cc_vendor
  x86/smpboot: Reference count on smpboot_setup_warm_reset_vector()
  x86/smpboot: Remove initial_gs
  x86/smpboot: Remove early_gdt_descr on 64-bit
  x86/smpboot: Remove initial_stack on 64-bit
  x86/apic/x2apic: Allow CPU cluster_mask to be populated in parallel
2023-04-25 11:39:45 -07:00
Borislav Petkov (AMD) 3d91c53729 x86/coco: Export cc_vendor
It will be used in different checks in future changes. Export it directly
and provide accessor functions and stubs so this can be used in general
code when CONFIG_ARCH_HAS_CC_PLATFORM is not set.

No functional changes.

[ tglx: Add accessor functions ]

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230318115634.9392-2-bp@alien8.de
2023-03-30 14:06:28 +02:00
Michael Kelley 812b0597fb x86/hyperv: Change vTOM handling to use standard coco mechanisms
Hyper-V guests on AMD SEV-SNP hardware have the option of using the
"virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP
architecture. With vTOM, shared vs. private memory accesses are
controlled by splitting the guest physical address space into two
halves.

vTOM is the dividing line where the uppermost bit of the physical
address space is set; e.g., with 47 bits of guest physical address
space, vTOM is 0x400000000000 (bit 46 is set).  Guest physical memory is
accessible at two parallel physical addresses -- one below vTOM and one
above vTOM.  Accesses below vTOM are private (encrypted) while accesses
above vTOM are shared (decrypted). In this sense, vTOM is like the
GPA.SHARED bit in Intel TDX.

Support for Hyper-V guests using vTOM was added to the Linux kernel in
two patch sets[1][2]. This support treats the vTOM bit as part of
the physical address. For accessing shared (decrypted) memory, these
patch sets create a second kernel virtual mapping that maps to physical
addresses above vTOM.

A better approach is to treat the vTOM bit as a protection flag, not
as part of the physical address. This new approach is like the approach
for the GPA.SHARED bit in Intel TDX. Rather than creating a second kernel
virtual mapping, the existing mapping is updated using recently added
coco mechanisms.

When memory is changed between private and shared using
set_memory_decrypted() and set_memory_encrypted(), the PTEs for the
existing kernel mapping are changed to add or remove the vTOM bit in the
guest physical address, just as with TDX. The hypercalls to change the
memory status on the host side are made using the existing callback
mechanism. Everything just works, with a minor tweak to map the IO-APIC
to use private accesses.

To accomplish the switch in approach, the following must be done:

* Update Hyper-V initialization to set the cc_mask based on vTOM
  and do other coco initialization.

* Update physical_mask so the vTOM bit is no longer treated as part
  of the physical address

* Remove CC_VENDOR_HYPERV and merge the associated vTOM functionality
  under CC_VENDOR_AMD. Update cc_mkenc() and cc_mkdec() to set/clear
  the vTOM bit as a protection flag.

* Code already exists to make hypercalls to inform Hyper-V about pages
  changing between shared and private.  Update this code to run as a
  callback from __set_memory_enc_pgtable().

* Remove the Hyper-V special case from __set_memory_enc_dec()

* Remove the Hyper-V specific call to swiotlb_update_mem_attributes()
  since mem_encrypt_init() will now do it.

* Add a Hyper-V specific implementation of the is_private_mmio()
  callback that returns true for the IO-APIC and vTPM MMIO addresses

  [1] https://lore.kernel.org/all/20211025122116.264793-1-ltykernel@gmail.com/
  [2] https://lore.kernel.org/all/20211213071407.314309-1-ltykernel@gmail.com/

  [ bp: Touchups. ]

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/1679838727-87310-7-git-send-email-mikelley@microsoft.com
2023-03-27 09:31:43 +02:00
Kirill A. Shutemov 7a3a401874 x86/tdx: Drop flags from __tdx_hypercall()
After TDX_HCALL_ISSUE_STI got dropped, the only flag left is
TDX_HCALL_HAS_OUTPUT. The flag indicates if the caller wants to see
tdx_hypercall_args updated based on the hypercall output.

Drop the flags and provide __tdx_hypercall_ret() that matches
TDX_HCALL_HAS_OUTPUT semantics.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20230321003511.9469-1-kirill.shutemov%40linux.intel.com
2023-03-22 11:36:05 -07:00
Linus Torvalds d8e473182a - Fixup comment typo
- Prevent unexpected #VE's from:
   - Hosts removing perfectly good guest mappings (SEPT_VE_DISABLE
   - Excessive #VE notifications (NOTIFY_ENABLES) which are
     delivered via a #VE.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmP1WzMACgkQaDWVMHDJ
 krAqig/+MzIYmUIkuYbluektxPdzI6zhY/Z+eD5DDH9OFZX5e0WQrmHpQbJ3i4Q6
 LT5JQ+yAI2ox/mhPfyCeDXqdRiatJJExUDUepc0qsOEW9gTsJ+edYwUsJg8HII61
 +TLz/BiMSF6xCUk46b4CqzhoeEk1dupFAG204uc4vGwSfXdysN3buAcciJc1rOTS
 7G9hI9fdLSjEJ8yyFebSDMPxSmdnjJPrDK3LF/leGJEpAQ/eMU0entG4ZH3Uyh2s
 3EnDpOdRjX56LAEixB4e5igXyS7wesCun4ytOnwndzW8p4gPIsypcJUEbVt84BfA
 HQaSWP35BFAn0JshJnFPmj4r4jV2EB8l630dVTOKdNSiIa3YjyB5nbzy+mMPFl4f
 8vcrHEZ6boEcRhgz0zFG0RfnDsjdbqKgFBXdRt0vYB/CG+EfmYaPoDXsb/8A7dtc
 8IQ9wLk2AqG0L8blZVS2kjFxNa/9lkDcMsAbfZmlORTQTF2WN2Jlbxri87vuBpRy
 8sqMUhgvHoffd/SIiDzJJIBjOH5/RhXLKhGzXQHI1vpZdU6ps9KIvohiycgx1mUQ
 lXXQwN5OWSHdUXZ7TFBIGXy9n32Ak/k5GCzCJSqvsMJDDdbycGVB+YCaKX6QK30+
 HAHrPy/FQ3FFvZWdsDMD5Pn4RkF4LYH/k4QZwqBFMs9+/Sdzwxc=
 =UpyL
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull Intel Trust Domain Extensions (TDX) updates from Dave Hansen:
 "Other than a minor fixup, the content here is to ensure that TDX
  guests never see virtualization exceptions (#VE's) that might be
  induced by the untrusted VMM.

  This is a highly desirable property. Without it, #VE exception
  handling would fall somewhere between NMIs, machine checks and total
  insanity. With it, #VE handling remains pretty mundane.

  Summary:

   - Fixup comment typo

   - Prevent unexpected #VE's from:
      - Hosts removing perfectly good guest mappings (SEPT_VE_DISABLE)
      - Excessive #VE notifications (NOTIFY_ENABLES) which are delivered
        via a #VE"

* tag 'x86_tdx_for_6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall()
  x86/tdx: Disable NOTIFY_ENABLES
  x86/tdx: Relax SEPT_VE_DISABLE check for debug TD
  x86/tdx: Use ReportFatalError to report missing SEPT_VE_DISABLE
  x86/tdx: Expand __tdx_hypercall() to handle more arguments
  x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments
  x86/tdx: Add more registers to struct tdx_hypercall_args
  x86/tdx: Fix typo in comment in __tdx_hypercall()
2023-02-25 09:11:30 -08:00
Kirill A. Shutemov 1e70c68037 x86/tdx: Do not corrupt frame-pointer in __tdx_hypercall()
If compiled with CONFIG_FRAME_POINTER=y, objtool is not happy that
__tdx_hypercall() messes up RBP:

  objtool: __tdx_hypercall+0x7f: return with modified stack frame

Rework the function to store TDX_HCALL_ flags on stack instead of RBP.

[ dhansen: minor changelog tweaks ]

Fixes: c30c4b2555 ("x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/202301290255.buUBs99R-lkp@intel.com
Link: https://lore.kernel.org/all/20230130135354.27674-1-kirill.shutemov%40linux.intel.com
2023-02-02 16:31:25 -08:00
Ingo Molnar 57a30218fa Linux 6.2-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmPW7E8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGf7MIAI0JnHN9WvtEukSZ
 E6j6+cEGWxsvD6q0g3GPolaKOCw7hlv0pWcFJFcUAt0jebspMdxV2oUGJ8RYW7Lg
 nCcHvEVswGKLAQtQSWw52qotW6fUfMPsNYYB5l31sm1sKH4Cgss0W7l2HxO/1LvG
 TSeNHX53vNAZ8pVnFYEWCSXC9bzrmU/VALF2EV00cdICmfvjlgkELGXoLKJJWzUp
 s63fBHYGGURSgwIWOKStoO6HNo0j/F/wcSMx8leY8qDUtVKHj4v24EvSgxUSDBER
 ch3LiSQ6qf4sw/z7pqruKFthKOrlNmcc0phjiES0xwwGiNhLv0z3rAhc4OM2cgYh
 SDc/Y/c=
 =zpaD
 -----END PGP SIGNATURE-----

Merge tag 'v6.2-rc6' into sched/core, to pick up fixes

Pick up fixes before merging another batch of cpuidle updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-31 15:01:20 +01:00
Kirill A. Shutemov 8de62af018 x86/tdx: Disable NOTIFY_ENABLES
== Background ==

There is a class of side-channel attacks against SGX enclaves called
"SGX Step"[1]. These attacks create lots of exceptions inside of
enclaves. Basically, run an in-enclave instruction, cause an exception.
Over and over.

There is a concern that a VMM could attack a TDX guest in the same way
by causing lots of #VE's. The TDX architecture includes new
countermeasures for these attacks. It basically counts the number of
exceptions and can send another *special* exception once the number of
VMM-induced #VE's hits a critical threshold[2].

== Problem ==

But, these special exceptions are independent of any action that the
guest takes. They can occur anywhere that the guest executes. This
includes sensitive areas like the entry code. The (non-paranoid) #VE
handler is incapable of handling exceptions in these areas.

== Solution ==

Fortunately, the special exceptions can be disabled by the guest via
write to NOTIFY_ENABLES TDCS field. NOTIFY_ENABLES is disabled by
default, but might be enabled by a bootloader, firmware or an earlier
kernel before the current kernel runs.

Disable NOTIFY_ENABLES feature explicitly and unconditionally. Any
NOTIFY_ENABLES-based #VE's that occur before this point will end up
in the early #VE exception handler and die due to unexpected exit
reason.

[1] https://github.com/jovanbulck/sgx-step
[2] https://intel.github.io/ccc-linux-guest-hardening-docs/security-spec.html#safety-against-ve-in-kernel-code

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-8-kirill.shutemov%40linux.intel.com
2023-01-27 09:46:05 -08:00
Kirill A. Shutemov 47e67cf317 x86/tdx: Relax SEPT_VE_DISABLE check for debug TD
A "SEPT #VE" occurs when a TDX guest touches memory that is not properly
mapped into the "secure EPT".  This can be the result of hypervisor
attacks or bugs, *OR* guest bugs.  Most notably, buggy guests might
touch unaccepted memory for lots of different memory safety bugs like
buffer overflows.

TDX guests do not want to continue in the face of hypervisor attacks or
hypervisor bugs.  They want to terminate as fast and safely as possible.
SEPT_VE_DISABLE ensures that TDX guests *can't* continue in the face of
these kinds of issues.

But, that causes a problem.  TDX guests that can't continue can't spit
out oopses or other debugging info.  In essence SEPT_VE_DISABLE=1 guests
are not debuggable.

Relax the SEPT_VE_DISABLE check to warning on debug TD and panic() in
the #VE handler on EPT-violation on private memory. It will produce
useful backtrace.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-7-kirill.shutemov%40linux.intel.com
2023-01-27 09:46:05 -08:00
Kirill A. Shutemov 71acdcd7cd x86/tdx: Use ReportFatalError to report missing SEPT_VE_DISABLE
Linux TDX guests require that the SEPT_VE_DISABLE "attribute" be set.
If it is not set, the kernel is theoretically required to handle
exceptions anywhere that kernel memory is accessed, including places
like NMI handlers and in the syscall entry gap.

Rather than even try to handle these exceptions, the kernel refuses to
run if SEPT_VE_DISABLE is unset.

However, the SEPT_VE_DISABLE detection and refusal code happens very
early in boot, even before earlyprintk runs.  Calling panic() will
effectively just hang the system.

Instead, call a TDX-specific panic() function.  This makes a very simple
TDVMCALL which gets a short error string out to the hypervisor without
any console infrastructure.

Use TDG.VP.VMCALL<ReportFatalError> to report the error. The hypercall
can encode message up to 64 bytes in eight registers.

[ dhansen: tweak comment and remove while loop brackets. ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-6-kirill.shutemov%40linux.intel.com
2023-01-27 09:45:55 -08:00
Kirill A. Shutemov 752d13305c x86/tdx: Expand __tdx_hypercall() to handle more arguments
So far __tdx_hypercall() only handles six arguments for VMCALL.
Expanding it to six more register would allow to cover more use-cases
like ReportFatalError() and Hyper-V hypercalls.

With all preparations in place, the expansion is pretty straight
forward.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-5-kirill.shutemov%40linux.intel.com
2023-01-27 09:42:09 -08:00
Kirill A. Shutemov c30c4b2555 x86/tdx: Refactor __tdx_hypercall() to allow pass down more arguments
RDI is the first argument to __tdx_hypercall() that used to pass pointer
to struct tdx_hypercall_args. RSI is the second argument that contains
flags, such as TDX_HCALL_HAS_OUTPUT and TDX_HCALL_ISSUE_STI.

RDI and RSI can also be used as arguments to TDVMCALL leafs. Move RDI to
RAX and RSI to RBP to free up them for the hypercall arguments.

RAX saved on stack during TDCALL as it returns status code in the
register.

RBP value has to be restored before returning from __tdx_hypercall() as
it is callee-saved register.

This is preparatory patch. No functional change.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-4-kirill.shutemov%40linux.intel.com
2023-01-27 09:42:09 -08:00
Kirill A. Shutemov 3543f8830b x86/tdx: Fix typo in comment in __tdx_hypercall()
Comment in __tdx_hypercall() points that RAX==0 indicates TDVMCALL
failure which is opposite of the truth: RAX==0 is success.

Fix the comment. No functional changes.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230126221159.8635-2-kirill.shutemov%40linux.intel.com
2023-01-27 09:42:09 -08:00
Peter Zijlstra c3982c1a36 cpuidle, tdx: Make TDX code noinstr clean
objtool found a few cases where this code called out into instrumented
code:

  vmlinux.o: warning: objtool: __halt+0x2c: call to hcall_func.constprop.0() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __halt+0x3f: call to __tdx_hypercall() leaves .noinstr.text section
  vmlinux.o: warning: objtool: __tdx_hypercall+0x66: call to __tdx_hypercall_failed() leaves .noinstr.text section

Fix it by:

  - moving TDX tdcall assembly methods into .noinstr.text (they are already noistr-clean)
  - marking __tdx_hypercall_failed() as 'noinstr'
  - annotating hcall_func() as __always_inline

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195541.111485720@infradead.org
2023-01-13 11:48:16 +01:00
Peter Zijlstra e80a48bade x86/tdx: Remove TDX_HCALL_ISSUE_STI
Now that arch_cpu_idle() is expected to return with IRQs disabled,
avoid the useless STI/CLI dance.

Per the specs this is supposed to work, but nobody has yet relied up
this behaviour so broken implementations are possible.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.682137572@infradead.org
2023-01-13 11:48:15 +01:00
Peter Zijlstra 89b3098703 arch/idle: Change arch_cpu_idle() behavior: always exit with IRQs disabled
Current arch_cpu_idle() is called with IRQs disabled, but will return
with IRQs enabled.

However, the very first thing the generic code does after calling
arch_cpu_idle() is raw_local_irq_disable(). This means that
architectures that can idle with IRQs disabled end up doing a
pointless 'enable-disable' dance.

Therefore, push this IRQ disabling into the idle function, meaning
that those architectures can avoid the pointless IRQ state flipping.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Tony Lindgren <tony@atomide.com>
Tested-by: Ulf Hansson <ulf.hansson@linaro.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Mark Rutland <mark.rutland@arm.com> [arm64]
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20230112195540.618076436@infradead.org
2023-01-13 11:48:15 +01:00
Jason A. Donenfeld 72bb8f8cc0 x86/insn: Avoid namespace clash by separating instruction decoder MMIO type from MMIO trace type
Both <linux/mmiotrace.h> and <asm/insn-eval.h> define various MMIO_ enum constants,
whose namespace overlaps.

Rename the <asm/insn-eval.h> ones to have a INSN_ prefix, so that the headers can be
used from the same source file.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230101162910.710293-2-Jason@zx2c4.com
2023-01-03 18:46:06 +01:00
Kuppuswamy Sathyanarayanan 51acfe89af x86/tdx: Add a wrapper to get TDREPORT0 from the TDX Module
To support TDX attestation, the TDX guest driver exposes an IOCTL
interface to allow userspace to get the TDREPORT0 (a.k.a. TDREPORT
subtype 0) from the TDX module via TDG.MR.TDREPORT TDCALL.

In order to get the TDREPORT0 in the TDX guest driver, instead of using
a low level function like __tdx_module_call(), add a
tdx_mcall_get_report0() wrapper function to handle it.

This is a preparatory patch for adding attestation support.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Wander Lairson Costa <wander@redhat.com>
Link: https://lore.kernel.org/all/20221116223820.819090-2-sathyanarayanan.kuppuswamy%40linux.intel.com
2022-11-17 11:03:09 -08:00
Kirill A. Shutemov 373e715e31 x86/tdx: Panic on bad configs that #VE on "private" memory access
All normal kernel memory is "TDX private memory".  This includes
everything from kernel stacks to kernel text.  Handling
exceptions on arbitrary accesses to kernel memory is essentially
impossible because they can happen in horribly nasty places like
kernel entry/exit.  But, TDX hardware can theoretically _deliver_
a virtualization exception (#VE) on any access to private memory.

But, it's not as bad as it sounds.  TDX can be configured to never
deliver these exceptions on private memory with a "TD attribute"
called ATTR_SEPT_VE_DISABLE.  The guest has no way to *set* this
attribute, but it can check it.

Ensure ATTR_SEPT_VE_DISABLE is set in early boot.  panic() if it
is unset.  There is no sane way for Linux to run with this
attribute clear so a panic() is appropriate.

There's small window during boot before the check where kernel
has an early #VE handler. But the handler is only for port I/O
and will also panic() as soon as it sees any other #VE, such as
a one generated by a private memory access.

[ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo().
	   Add Kirill's tested-by because I made changes since
	   he wrote this. ]

Fixes: 9a22bf6deb ("x86/traps: Add #VE support for TDX guest")
Reported-by: ruogui.ygr@alibaba-inc.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20221028141220.29217-3-kirill.shutemov%40linux.intel.com
2022-11-01 16:02:40 -07:00
Dave Hansen a6dd6f3900 x86/tdx: Prepare for using "INFO" call for a second purpose
The TDG.VP.INFO TDCALL provides the guest with various details about
the TDX system that the guest needs to run.  Only one field is currently
used: 'gpa_width' which tells the guest which PTE bits mark pages shared
or private.

A second field is now needed: the guest "TD attributes" to tell if
virtualization exceptions are configured in a way that can harm the guest.

Make the naming and calling convention more generic and discrete from the
mask-centric one.

Thanks to Sathya for the inspiration here, but there's no code, comments
or changelogs left from where he started.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: stable@vger.kernel.org
2022-11-01 10:07:15 -07:00
Kirill A. Shutemov 1e7769653b x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
The unwanted loads are typically harmless. But, they might be made to
totally unrelated or even unmapped memory. load_unaligned_zeropad()
relies on exception fixup (#PF, #GP and now #VE) to recover from these
unwanted loads.

In TDX guests, the second page can be shared page and a VMM may configure
it to trigger #VE.

The kernel assumes that #VE on a shared page is an MMIO access and tries to
decode instruction to handle it. In case of load_unaligned_zeropad() it
may result in confusion as it is not MMIO access.

Fix it by detecting split page MMIO accesses and failing them.
load_unaligned_zeropad() will recover using exception fixups.

The issue was discovered by analysis and reproduced artificially. It was
not triggered during testing.

[ dhansen: fix up changelogs and comments for grammar and clarity,
	   plus incorporate Kirill's off-by-one fix]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220614120135.14812-4-kirill.shutemov@linux.intel.com
2022-06-17 15:37:33 -07:00
Kirill A. Shutemov cdd85786f4 x86/tdx: Clarify RIP adjustments in #VE handler
After successful #VE handling, tdx_handle_virt_exception() has to move
RIP to the next instruction. The handler needs to know the length of the
instruction.

If the #VE happened due to instruction execution, the GET_VEINFO TDX
module call provides info on the instruction in R10, including its length.

For #VE due to EPT violation, the info in R10 is not populand and the
kernel must decode the instruction manually to find out its length.

Restructure the code to make it explicit that the instruction length
depends on the type of #VE. Make individual #VE handlers return
the instruction length on success or -errno on failure.

[ dhansen: fix up changelog and comments ]

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220614120135.14812-3-kirill.shutemov@linux.intel.com
2022-06-15 11:05:16 -07:00
Kirill A. Shutemov 60428d8bc2 x86/tdx: Fix early #VE handling
tdx_early_handle_ve() does not increment RIP after successfully
handling the exception.  That leads to infinite loop of exceptions.

Move RIP when exceptions are successfully handled.

[ dhansen: make problem statement more clear ]

Fixes: 32e72854fa ("x86/tdx: Port I/O: Add early boot support")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lkml.kernel.org/r/20220614120135.14812-2-kirill.shutemov@linux.intel.com
2022-06-15 10:52:59 -07:00
Linus Torvalds 3a755ebcc2 Intel Trust Domain Extensions
This is the Intel version of a confidential computing solution called
 Trust Domain Extensions (TDX). This series adds support to run the
 kernel as part of a TDX guest. It provides similar guest protections to
 AMD's SEV-SNP like guest memory and register state encryption, memory
 integrity protection and a lot more.
 
 Design-wise, it differs from AMD's solution considerably: it uses
 a software module which runs in a special CPU mode called (Secure
 Arbitration Mode) SEAM. As the name suggests, this module serves as sort
 of an arbiter which the confidential guest calls for services it needs
 during its lifetime.
 
 Just like AMD's SNP set, this series reworks and streamlines certain
 parts of x86 arch code so that this feature can be properly accomodated.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLbisACgkQEsHwGGHe
 VUqZLg/7B55iygCwzz0W/KLcXL2cISatUpzGbFs1XTbE9DMz06BPkOsEjF2k8ckv
 kfZjgqhSx3GvUI80gK0Tn2M2DfIj3nKuNSXd1pfextP7AxEf68FFJsQz1Ju7bHpT
 pZaG+g8IK4+mnEHEKTCO9ANg/Zw8yqJLdtsCaCNE9SUGUfQ6m/ujTEfsambXDHNm
 khyCAgpIGSOt51/4apoR9ebyrNCaeVbDawpIPjTy+iyFRc/WyaLFV9CQ8klw4gbw
 r/90x2JYxvAf0/z/ifT9Wa+TnYiQ0d4VjFbfr0iJ4GcPn5L3EIoIKPE8vPGMpoSX
 fLSzoNmAOT3ja57ytUUQ3o0edoRUIPEdixOebf9qWvE/aj7W37YRzrlJ8Ej/x9Jy
 HcI4WZF6Dr1bh6FnI/xX2eVZRzLOL4j9gNyPCwIbvgr1NjDqQnxU7nhxVMmQhJrs
 IdiEcP5WYerLKfka/uF//QfWUg5mDBgFa1/3xK57Z3j0iKWmgjaPpR0SWlOKjj8G
 tr0gGN9ejikZTqXKGsHn8fv/R3bjXvbVD8z0IEcx+MIrRmZPnX2QBlg7UA1AXV5n
 HoVwPFdH1QAtjZq1MRcL4hTOjz3FkS68rg7ZH0f2GWJAzWmEGytBIhECRnN/PFFq
 VwRB4dCCt0bzqRxkiH5lzdgR+xqRe61juQQsMzg+Flv/trpXDqM=
 =ac9K
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull Intel TDX support from Borislav Petkov:
 "Intel Trust Domain Extensions (TDX) support.

  This is the Intel version of a confidential computing solution called
  Trust Domain Extensions (TDX). This series adds support to run the
  kernel as part of a TDX guest. It provides similar guest protections
  to AMD's SEV-SNP like guest memory and register state encryption,
  memory integrity protection and a lot more.

  Design-wise, it differs from AMD's solution considerably: it uses a
  software module which runs in a special CPU mode called (Secure
  Arbitration Mode) SEAM. As the name suggests, this module serves as
  sort of an arbiter which the confidential guest calls for services it
  needs during its lifetime.

  Just like AMD's SNP set, this series reworks and streamlines certain
  parts of x86 arch code so that this feature can be properly
  accomodated"

* tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  x86/tdx: Fix RETs in TDX asm
  x86/tdx: Annotate a noreturn function
  x86/mm: Fix spacing within memory encryption features message
  x86/kaslr: Fix build warning in KASLR code in boot stub
  Documentation/x86: Document TDX kernel architecture
  ACPICA: Avoid cache flush inside virtual machines
  x86/tdx/ioapic: Add shared bit for IOAPIC base address
  x86/mm: Make DMA memory shared for TD guest
  x86/mm/cpa: Add support for TDX shared memory
  x86/tdx: Make pages shared in ioremap()
  x86/topology: Disable CPU online/offline control for TDX guests
  x86/boot: Avoid #VE during boot for TDX platforms
  x86/boot: Set CR0.NE early and keep it set during the boot
  x86/acpi/x86/boot: Add multiprocessor wake-up support
  x86/boot: Add a trampoline for booting APs via firmware handoff
  x86/tdx: Wire up KVM hypercalls
  x86/tdx: Port I/O: Add early boot support
  x86/tdx: Port I/O: Add runtime hypercalls
  x86/boot: Port I/O: Add decompression-time support for TDX
  x86/boot: Port I/O: Allow to hook up alternative helpers
  ...
2022-05-23 17:51:12 -07:00
Peter Zijlstra c796f02162 x86/tdx: Fix RETs in TDX asm
Because build-testing is over-rated, fix a few trivial objtool complaints:

  vmlinux.o: warning: objtool: __tdx_module_call+0x3e: missing int3 after ret
  vmlinux.o: warning: objtool: __tdx_hypercall+0x6e: missing int3 after ret

Fixes: eb94f1b6a7 ("x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220520083839.GR2578@worktop.programming.kicks-ass.net
2022-05-20 12:53:22 +02:00
Borislav Petkov 5af14c29f7 x86/tdx: Annotate a noreturn function
objdump complains:

  vmlinux.o: warning: objtool: __tdx_hypercall()+0x74: unreachable instruction

because __tdx_hypercall_failed() won't return but panic the guest.
Annotate that that is ok and desired.

Fixes: eb94f1b6a7 ("x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions")
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220420115025.5448-1-bp@alien8.de
2022-04-21 12:54:08 +02:00
Kirill A. Shutemov 968b493173 x86/mm: Make DMA memory shared for TD guest
Intel TDX doesn't allow VMM to directly access guest private memory.
Any memory that is required for communication with the VMM must be
shared explicitly. The same rule applies for any DMA to and from the
TDX guest. All DMA pages have to be marked as shared pages. A generic way
to achieve this without any changes to device drivers is to use the
SWIOTLB framework.

The previous patch ("Add support for TDX shared memory") gave TDX guests
the _ability_ to make some pages shared, but did not make any pages
shared. This actually marks SWIOTLB buffers *as* shared.

Start returning true for cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) in
TDX guests.  This has several implications:

 - Allows the existing mem_encrypt_init() to be used for TDX which
   sets SWIOTLB buffers shared (aka. "decrypted").
 - Ensures that all DMA is routed via the SWIOTLB mechanism (see
   pci_swiotlb_detect())

Stop selecting DYNAMIC_PHYSICAL_MASK directly. It will get set
indirectly by selecting X86_MEM_ENCRYPT.

mem_encrypt_init() is currently under an AMD-specific #ifdef. Move it to
a generic area of the header.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-28-kirill.shutemov@linux.intel.com
2022-04-07 08:27:53 -07:00
Kirill A. Shutemov 7dbde76316 x86/mm/cpa: Add support for TDX shared memory
Intel TDX protects guest memory from VMM access. Any memory that is
required for communication with the VMM must be explicitly shared.

It is a two-step process: the guest sets the shared bit in the page
table entry and notifies VMM about the change. The notification happens
using MapGPA hypercall.

Conversion back to private memory requires clearing the shared bit,
notifying VMM with MapGPA hypercall following with accepting the memory
with AcceptPage hypercall.

Provide a TDX version of x86_platform.guest.* callbacks. It makes
__set_memory_enc_pgtable() work right in TDX guest.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-27-kirill.shutemov@linux.intel.com
2022-04-07 08:27:53 -07:00
Kuppuswamy Sathyanarayanan bae1a962ac x86/topology: Disable CPU online/offline control for TDX guests
Unlike regular VMs, TDX guests use the firmware hand-off wakeup method
to wake up the APs during the boot process. This wakeup model uses a
mailbox to communicate with firmware to bring up the APs. As per the
design, this mailbox can only be used once for the given AP, which means
after the APs are booted, the same mailbox cannot be used to
offline/online the given AP. More details about this requirement can be
found in Intel TDX Virtual Firmware Design Guide, sec titled "AP
initialization in OS" and in sec titled "Hotplug Device".

Since the architecture does not support any method of offlining the
CPUs, disable CPU hotplug support in the kernel.

Since this hotplug disable feature can be re-used by other VM guests,
add a new CC attribute CC_ATTR_HOTPLUG_DISABLED and use it to disable
the hotplug support.

Attempt to offline CPU will fail with -EOPNOTSUPP.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-25-kirill.shutemov@linux.intel.com
2022-04-07 08:27:53 -07:00
Kuppuswamy Sathyanarayanan cfb8ec7a31 x86/tdx: Wire up KVM hypercalls
KVM hypercalls use the VMCALL or VMMCALL instructions. Although the ABI
is similar, those instructions no longer function for TDX guests.

Make vendor-specific TDVMCALLs instead of VMCALL. This enables TDX
guests to run with KVM acting as the hypervisor.

Among other things, KVM hypercall is used to send IPIs.

Since the KVM driver can be built as a kernel module, export
tdx_kvm_hypercall() to make the symbols visible to kvm.ko.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-20-kirill.shutemov@linux.intel.com
2022-04-07 08:27:52 -07:00
Andi Kleen 32e72854fa x86/tdx: Port I/O: Add early boot support
TDX guests cannot do port I/O directly. The TDX module triggers a #VE
exception to let the guest kernel emulate port I/O by converting them
into TDCALLs to call the host.

But before IDT handlers are set up, port I/O cannot be emulated using
normal kernel #VE handlers. To support the #VE-based emulation during
this boot window, add a minimal early #VE handler support in early
exception handlers. This is similar to what AMD SEV does. This is
mainly to support earlyprintk's serial driver, as well as potentially
the VGA driver.

The early handler only supports I/O-related #VE exceptions. Unhandled or
failed exceptions will be handled via early_fixup_exceptions() (like
normal exception failures). At runtime I/O-related #VE exceptions (along
with other types) handled by virt_exception_kernel().

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-19-kirill.shutemov@linux.intel.com
2022-04-07 08:27:52 -07:00
Kuppuswamy Sathyanarayanan 0314994883 x86/tdx: Port I/O: Add runtime hypercalls
TDX hypervisors cannot emulate instructions directly. This includes
port I/O which is normally emulated in the hypervisor. All port I/O
instructions inside TDX trigger the #VE exception in the guest and
would be normally emulated there.

Use a hypercall to emulate port I/O. Extend the
tdx_handle_virt_exception() and add support to handle the #VE due to
port I/O instructions.

String I/O operations are not supported in TDX. Unroll them by declaring
CC_ATTR_GUEST_UNROLL_STRING_IO confidential computing attribute.

== Userspace Implications ==

The ioperm() facility allows userspace access to I/O instructions like
inb/outb.  Among other things, this allows writing userspace device
drivers.

This series has no special handling for ioperm(). Users will be able to
successfully request I/O permissions but will induce a #VE on their
first I/O instruction which leads SIGSEGV. If this is undesirable users
can enable kernel lockdown feature with 'lockdown=integrity' kernel
command line option. It makes ioperm() fail.

More robust handling of this situation (denying ioperm() in all TDX
guests) will be addressed in follow-on work.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-18-kirill.shutemov@linux.intel.com
2022-04-07 08:27:52 -07:00
Kirill A. Shutemov 31d58c4e55 x86/tdx: Handle in-kernel MMIO
In non-TDX VMs, MMIO is implemented by providing the guest a mapping
which will cause a VMEXIT on access and then the VMM emulating the
instruction that caused the VMEXIT. That's not possible for TDX VM.

To emulate an instruction an emulator needs two things:

  - R/W access to the register file to read/modify instruction arguments
    and see RIP of the faulted instruction.

  - Read access to memory where instruction is placed to see what to
    emulate. In this case it is guest kernel text.

Both of them are not available to VMM in TDX environment:

  - Register file is never exposed to VMM. When a TD exits to the module,
    it saves registers into the state-save area allocated for that TD.
    The module then scrubs these registers before returning execution
    control to the VMM, to help prevent leakage of TD state.

  - TDX does not allow guests to execute from shared memory. All executed
    instructions are in TD-private memory. Being private to the TD, VMMs
    have no way to access TD-private memory and no way to read the
    instruction to decode and emulate it.

In TDX the MMIO regions are instead configured by VMM to trigger a #VE
exception in the guest.

Add #VE handling that emulates the MMIO instruction inside the guest and
converts it into a controlled hypercall to the host.

This approach is bad for performance. But, it has (virtually) no impact
on the size of the kernel image and will work for a wide variety of
drivers. This allows TDX deployments to use arbitrary devices and device
drivers, including virtio. TDX customers have asked for the capability
to use random devices in their deployments.

In other words, even if all of the work was done to paravirtualize all
x86 MMIO users and virtio, this approach would still be needed. There
is essentially no way to get rid of this code.

This approach is functional for all in-kernel MMIO users current and
future and does so with a minimal amount of code and kernel image bloat.

MMIO addresses can be used with any CPU instruction that accesses
memory. Address only MMIO accesses done via io.h helpers, such as
'readl()' or 'writeq()'.

Any CPU instruction that accesses memory can also be used to access
MMIO.  However, by convention, MMIO access are typically performed via
io.h helpers such as 'readl()' or 'writeq()'.

The io.h helpers intentionally use a limited set of instructions when
accessing MMIO.  This known, limited set of instructions makes MMIO
instruction decoding and emulation feasible in KVM hosts and SEV guests
today.

MMIO accesses performed without the io.h helpers are at the mercy of the
compiler.  Compilers can and will generate a much more broad set of
instructions which can not practically be decoded and emulated.  TDX
guests will oops if they encounter one of these decoding failures.

This means that TDX guests *must* use the io.h helpers to access MMIO.

This requirement is not new.  Both KVM hosts and AMD SEV guests have the
same limitations on MMIO access.

=== Potential alternative approaches ===

== Paravirtualizing all MMIO ==

An alternative to letting MMIO induce a #VE exception is to avoid
the #VE in the first place. Similar to the port I/O case, it is
theoretically possible to paravirtualize MMIO accesses.

Like the exception-based approach offered here, a fully paravirtualized
approach would be limited to MMIO users that leverage common
infrastructure like the io.h macros.

However, any paravirtual approach would be patching approximately 120k
call sites. Any paravirtual approach would need to replace a bare memory
access instruction with (at least) a function call. With a conservative
overhead estimation of 5 bytes per call site (CALL instruction),
it leads to bloating code by 600k.

Many drivers will never be used in the TDX environment and the bloat
cannot be justified.

== Patching TDX drivers ==

Rather than touching the entire kernel, it might also be possible to
just go after drivers that use MMIO in TDX guests *and* are performance
critical to justify the effrort. Right now, that's limited only to virtio.

All virtio MMIO appears to be done through a single function, which
makes virtio eminently easy to patch.

This approach will be adopted in the future, removing the bulk of
MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-12-kirill.shutemov@linux.intel.com
2022-04-07 08:27:51 -07:00
Kirill A. Shutemov c141fa2c2b x86/tdx: Handle CPUID via #VE
In TDX guests, most CPUID leaf/sub-leaf combinations are virtualized
by the TDX module while some trigger #VE.

Implement the #VE handling for EXIT_REASON_CPUID by handing it through
the hypercall, which in turn lets the TDX module handle it by invoking
the host VMM.

More details on CPUID Virtualization can be found in the TDX module
specification, the section titled "CPUID Virtualization".

Note that VMM that handles the hypercall is not trusted. It can return
data that may steer the guest kernel in wrong direct. Only allow  VMM
to control range reserved for hypervisor communication.

Return all-zeros for any CPUID outside the hypervisor range. It matches
CPU behaviour for non-supported leaf.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-11-kirill.shutemov@linux.intel.com
2022-04-07 08:27:51 -07:00
Kirill A. Shutemov ae87f609cd x86/tdx: Add MSR support for TDX guests
Use hypercall to emulate MSR read/write for the TDX platform.

There are two viable approaches for doing MSRs in a TD guest:

1. Execute the RDMSR/WRMSR instructions like most VMs and bare metal
   do. Some will succeed, others will cause a #VE. All of those that
   cause a #VE will be handled with a TDCALL.
2. Use paravirt infrastructure.  The paravirt hook has to keep a list
   of which MSRs would cause a #VE and use a TDCALL.  All other MSRs
   execute RDMSR/WRMSR instructions directly.

The second option can be ruled out because the list of MSRs was
challenging to maintain. That leaves option #1 as the only viable
solution for the minimal TDX support.

Kernel relies on the exception fixup machinery to handle MSR access
errors. #VE handler uses the same exception fixup code as #GP. It
covers MSR accesses along with other types of fixups.

For performance-critical MSR writes (like TSC_DEADLINE), future patches
will replace the WRMSR/#VE sequence with the direct TDCALL.

RDMSR and WRMSR specification details can be found in
Guest-Host-Communication Interface (GHCI) for Intel Trust Domain
Extensions (Intel TDX) specification, sec titled "TDG.VP.
VMCALL<Instruction.RDMSR>" and "TDG.VP.VMCALL<Instruction.WRMSR>".

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-10-kirill.shutemov@linux.intel.com
2022-04-07 08:27:51 -07:00
Kirill A. Shutemov bfe6ed0c67 x86/tdx: Add HLT support for TDX guests
The HLT instruction is a privileged instruction, executing it stops
instruction execution and places the processor in a HALT state. It
is used in kernel for cases like reboot, idle loop and exception fixup
handlers. For the idle case, interrupts will be enabled (using STI)
before the HLT instruction (this is also called safe_halt()).

To support the HLT instruction in TDX guests, it needs to be emulated
using TDVMCALL (hypercall to VMM). More details about it can be found
in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication
Interface (GHCI) specification, section TDVMCALL[Instruction.HLT].

In TDX guests, executing HLT instruction will generate a #VE, which is
used to emulate the HLT instruction. But #VE based emulation will not
work for the safe_halt() flavor, because it requires STI instruction to
be executed just before the TDCALL. Since idle loop is the only user of
safe_halt() variant, handle it as a special case.

To avoid *safe_halt() call in the idle function, define the
tdx_guest_idle() and use it to override the "x86_idle" function pointer
for a valid TDX guest.

Alternative choices like PV ops have been considered for adding
safe_halt() support. But it was rejected because HLT paravirt calls
only exist under PARAVIRT_XXL, and enabling it in TDX guest just for
safe_halt() use case is not worth the cost.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-9-kirill.shutemov@linux.intel.com
2022-04-07 08:27:51 -07:00
Kirill A. Shutemov 9a22bf6deb x86/traps: Add #VE support for TDX guest
Virtualization Exceptions (#VE) are delivered to TDX guests due to
specific guest actions which may happen in either user space or the
kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to specific guest physical addresses

Syscall entry code has a critical window where the kernel stack is not
yet set up. Any exception in this window leads to hard to debug issues
and can be exploited for privilege escalation. Exceptions in the NMI
entry code also cause issues. Returning from the exception handler with
IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.

For these reasons, the kernel avoids #VEs during the syscall gap and
the NMI entry code. Entry code paths do not access TD-shared memory,
MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
that might generate #VE. VMM can remove memory from TD at any point,
but access to unaccepted (or missing) private memory leads to VM
termination, not to #VE.

Similarly to page faults and breakpoints, #VEs are allowed in NMI
handlers once the kernel is ready to deal with nested NMIs.

During #VE delivery, all interrupts, including NMIs, are blocked until
TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
the VE info.

TDGETVEINFO retrieves the #VE info from the TDX module, which also
clears the "#VE valid" flag.  This must be done before anything else as
any #VE that occurs while the valid flag is set escalates to #DF by TDX
module. It will result in an oops.

Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be
delivered until TDGETVEINFO is called.

For now, convert unhandled #VE's (everything, until later in this
series) so that they appear just like a #GP by calling the
ve_raise_fault() directly. The ve_raise_fault() function is similar
to #GP handler and is responsible for sending SIGSEGV to userspace
and CPU die and notifying debuggers and other die chain users.

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-8-kirill.shutemov@linux.intel.com
2022-04-07 08:27:51 -07:00
Kirill A. Shutemov 65fab5bc03 x86/tdx: Exclude shared bit from __PHYSICAL_MASK
In TDX guests, by default memory is protected from host access. If a
guest needs to communicate with the VMM (like the I/O use case), it uses
a single bit in the physical address to communicate the protected/shared
attribute of the given page.

In the x86 ARCH code, __PHYSICAL_MASK macro represents the width of the
physical address in the given architecture. It is used in creating
physical PAGE_MASK for address bits in the kernel. Since in TDX guest,
a single bit is used as metadata, it needs to be excluded from valid
physical address bits to avoid using incorrect addresses bits in the
kernel.

Enable DYNAMIC_PHYSICAL_MASK to support updating the __PHYSICAL_MASK.

Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20220405232939.73860-6-kirill.shutemov@linux.intel.com
2022-04-07 08:27:51 -07:00