Commit Graph

1422 Commits

Author SHA1 Message Date
Linus Torvalds d65e1a0f30 Merge tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Alexander Gordeev:

 - Store AP Query Configuration Information in a static buffer

 - Rework the AP initialization and add missing cleanups to the error
   path

 - Swap IRQ and AP bus/device registration to avoid race conditions

 - Export prot_virt_guest symbol

 - Introduce AP configuration changes notifier interface to facilitate
   modularization of the AP bus

 - Add CONFIG_AP kernel configuration option to allow modularization of
   the AP bus

 - Rework CONFIG_ZCRYPT_DEBUG kernel configuration option description
   and dependency and rename it to CONFIG_AP_DEBUG

 - Convert sprintf() and snprintf() to sysfs_emit() in CIO code

 - Adjust indentation of RELOCS command build step

 - Make crypto performance counters upward compatible

 - Convert make_page_secure() and gmap_make_secure() to use folio

 - Rework channel-utilization-block (CUB) handling in preparation of
   introducing additional CUBs

 - Use attribute groups to simplify registration, removal and extension
   of measurement-related channel-path sysfs attributes

 - Add a per-channel-path binary "ext_measurement" sysfs attribute that
   provides access to extended channel-path measurement data

 - Export measurement data for all channel-measurement-groups (CMG), not
   only for specific ones. This enables support of new CMG data
   formats in userspace without the need for kernel changes

 - Add a per-channel-path sysfs attribute "speed_bps" that provides the
   operating speed in bits per second or 0 if the operating speed is not
   available

 - The CIO tracepoint subchannel-type field "st" is incorrectly set to
   the value of subchannel-enabled SCHIB "ena" field. Fix that

 - Do not forcefully limit vmemmap starting address to MAX_PHYSMEM_BITS

 - Consider the maximum physical address available to a DCSS segment
   (512GB) when memory layout is set up

 - Simplify the virtual memory layout setup by reducing the size of
   identity mapping vs vmemmap overlap

 - Swap vmalloc and Lowcore/Real Memory Copy areas in virtual memory.
   This allows placing the kernel image next to kernel modules

 - Move everything KASLR related from <asm/setup.h> to <asm/page.h>

 - Put virtual memory layout information into a structure to improve
   code generation

 - Currently __kaslr_offset is the kernel offset in both physical and
   virtual memory spaces. Uncouple these offsets to allow uncoupling of
   the address spaces

 - Currently the identity mapping base address is implicit and is always
   set to zero. Make it explicit by putting it into the __identity_base
   persistent boot variable and use it in the proper context

 - Introduce .amode31 section start and end macros AMODE31_START and
   AMODE31_END

 - Introduce OS_INFO entries that do not reference any data in memory,
   but rather provide only values

 - Store virtual memory layout in OS_INFO. It is read out by
   makedumpfile, crash and other tools

 - Store virtual memory layout in VMCORE_INFO. It is read out by crash
   and other tools when /proc/kcore device is used

 - Create additional PT_LOAD ELF program header that covers kernel image
   only, so that vmcore tools could locate kernel text and data when
   virtual and physical memory spaces are uncoupled

 - Uncouple physical and virtual address spaces

 - Map kernel at fixed location when KASLR mode is disabled. The
   location is defined by CONFIG_KERNEL_IMAGE_BASE kernel configuration
   value.

 - Rework deployment of kernel image for both compressed and
   uncompressed variants as defined by CONFIG_KERNEL_UNCOMPRESSED kernel
   configuration value

 - Move .vmlinux.relocs section in front of the compressed kernel. The
   interim section rescue step is avoided as a result

 - Correct modules thunk offset calculation when branch target is more
   than 2GB away

 - Kernel modules contain their own set of expoline thunks. Now that the
   kernel modules area is less than 4GB away from kernel expoline
   thunks, make modules use kernel expolines. Also make EXPOLINE_EXTERN
   the default if the compiler supports it

 - userfaultfd can insert shared zeropages into processes running VMs,
   but that is not allowed for s390. Fall back to allocating a fresh
   zeroed anonymous folio and inserting that instead

 - Re-enable shared zeropages for non-PV and non-skeys KVM guests

 - Rename hex2bitmap() to ap_hex2bitmap() and export it for external use

 - Add ap_config sysfs attribute to provide the means for setting or
   displaying adapters, domains and control domains assigned to a
   vfio-ap mediated device in a single operation

 - Make vfio_ap_mdev_link_queue() ignore duplicate link requests

 - Add write support to ap_config sysfs attribute to allow atomic update
   of a vfio-ap mediated device state

 - Document ap_config sysfs attribute

 - Function os_info_old_init() is expected to be called only from a
   regular kdump kernel. Enable it to be called from a stand-alone dump
   kernel

 - Address gcc -Warray-bounds warning and fix array size in struct
   os_info

 - s390 does not support SMBIOS, so drop unneeded CONFIG_DMI checks

 - Use unwinder instead of __builtin_return_address() with ftrace to
   prevent returning of undefined values

 - Sections .hash and .gnu.hash are only created when the CONFIG_PIE_BUILD
   kernel option is enabled. Drop these for the case CONFIG_PIE_BUILD is
   disabled

 - Compile kernel with -fPIC and link with -no-pie to allow the kpatch
   feature to always succeed and drop the whole CONFIG_PIE_BUILD
   option-enabled code

 - Add missing virt_to_phys() converter for VSIE facility and crypto
   control blocks

* tag 's390-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
  Revert "s390: Relocate vmlinux ELF data to virtual address space"
  KVM: s390: vsie: Use virt_to_phys for crypto control block
  s390: Relocate vmlinux ELF data to virtual address space
  s390: Compile kernel with -fPIC and link with -no-pie
  s390: vmlinux.lds.S: Drop .hash and .gnu.hash for !CONFIG_PIE_BUILD
  s390/ftrace: Use unwinder instead of __builtin_return_address()
  s390/pci: Drop unneeded reference to CONFIG_DMI
  s390/os_info: Fix array size in struct os_info
  s390/os_info: Initialize old os_info in standalone dump kernel
  docs: Update s390 vfio-ap doc for ap_config sysfs attribute
  s390/vfio-ap: Add write support to sysfs attr ap_config
  s390/vfio-ap: Ignore duplicate link requests in vfio_ap_mdev_link_queue
  s390/vfio-ap: Add sysfs attr, ap_config, to export mdev state
  s390/ap: Externalize AP bus specific bitmap reading function
  s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
  mm/userfaultfd: Do not place zeropages when zeropages are disallowed
  s390/expoline: Make modules use kernel expolines
  s390/nospec: Correct modules thunk offset calculation
  s390/boot: Do not rescue .vmlinux.relocs section
  s390/boot: Rework deployment of the kernel image
  ...
2024-05-13 08:33:52 -07:00
Alexander Gordeev 22a49f6d30 Merge branch 'shared-zeropage' into features
David Hildenbrand says:

===================
This series fixes one issue with uffd + shared zeropages on s390x and
fixes that "ordinary" KVM guests can make use of shared zeropages again.

userfaultfd could currently end up mapping shared zeropages into processes
that forbid shared zeropages. This only applies to s390x, relevant for
handling PV guests and guests that use storage keys correctly. Fix it
by placing a zeroed folio instead of the shared zeropage during
UFFDIO_ZEROPAGE.

I stumbled over this issue while looking into a customer scenario that
is using:

(1) Memory ballooning for dynamic resizing. Start a VM with, say, 100 GiB
    and inflate the balloon during boot to 60 GiB. The VM has ~40 GiB
    available and additional memory can be "fake hotplugged" to the VM
    later on demand by deflating the balloon. Actual memory overcommit is
    not desired, so physical memory would only be moved between VMs.

(2) Live migration of VMs between sites to evacuate servers in case of
    emergency.

Without the shared zeropage, during (2), the VM would suddenly consume
100 GiB on the migration source and destination. On the migration source,
where we don't expect memory overcommit, we could easily end up crashing
the VM during migration.

Independent of that, memory handed back to the hypervisor using "free page
reporting" would end up consuming actual memory after the migration on the
destination, not getting freed up until reused+freed again.

While there might be ways to optimize parts of this in QEMU, we really
should just support the shared zeropage again for ordinary VMs.

We only expect legacy guests to make use of storage keys, so let's handle
zeropages again when enabling storage keys or when enabling PV. To not
break userfaultfd like we did in the past, don't zap the shared zeropages,
but instead trigger unsharing faults, just like we do for unsharing
KSM pages in break_ksm().

Unsharing faults will simply replace the shared zeropage by a zeroed
anonymous folio. We can already trigger the same fault path using GUP,
when trying to long-term pin a shared zeropage, but also when unmerging
KSM-placed zeropages, so this is nothing new.

Patch #1 tested on x86-64 by forcing mm_forbids_zeropage() to be 1, and
running the uffd selftests.

Patch #2 tested on s390x: the live migration scenario now works as
expected, and kvm-unit-tests that trigger usage of skeys work well, whereby
I can see detection and unsharing of shared zeropages.

Further (as broken in v2), I tested that the shared zeropage is no
longer populated after skeys are used -- that mm_forbids_zeropage() works
as expected:
  ./s390x-run s390x/skey.elf \
   -no-shutdown \
   -chardev socket,id=monitor,path=/var/tmp/mon,server,nowait \
   -mon chardev=monitor,mode=readline

  Then, in another shell:

  # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
  Rss:               31484 kB
  #  echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
  ...
  # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
  Rss:              160452 kB

  -> Reading guest memory does not populate the shared zeropage

  Doing the same with selftest.elf (no skeys)

  # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
  Rss:               30900 kB
  #  echo "dump-guest-memory tmp" | sudo nc -U /var/tmp/mon
  ...
  # cat /proc/`pgrep qemu`/smaps_rollup | grep Rss
  Rss:               30924 kB

  -> Reading guest memory does populate the shared zeropage
===================
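
The Rss comparison in the test above can also be scripted. A minimal
sketch, assuming the smaps_rollup text has already been read from
/proc/<pid>/smaps_rollup; the helper name rss_kib is made up for
illustration and is not part of the series:

```python
import re


def rss_kib(smaps_rollup_text: str) -> int:
    """Return the Rss value in kB from the text of /proc/<pid>/smaps_rollup."""
    # Hypothetical helper: matches a line such as "Rss:   31484 kB".
    match = re.search(r"^Rss:\s+(\d+)\s+kB\s*$", smaps_rollup_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Rss line found")
    return int(match.group(1))


# Values copied from the skey.elf and selftest.elf runs quoted above.
skey_growth = rss_kib("Rss:              160452 kB") - rss_kib("Rss:               31484 kB")
selftest_growth = rss_kib("Rss:               30924 kB") - rss_kib("Rss:               30900 kB")
print(skey_growth, selftest_growth)
```

With storage keys in use, dumping guest memory grows Rss by 128968 kB
because real pages get allocated; without them the growth is 24 kB, since
the reads are served by the shared zeropage.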

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-02 22:02:25 +02:00
Jean-Philippe Brucker 175f2f5bcd KVM: s390: Check kvm pointer when testing KVM_CAP_S390_HPAGE_1M
KVM allows issuing the KVM_CHECK_EXTENSION ioctl either on the /dev/kvm
fd or the VM fd. In the first case, kvm_vm_ioctl_check_extension() is
called with kvm==NULL. Ensure we don't dereference the pointer in that
case.
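
The shape of such a guard can be modelled outside the kernel. The
following is a Python sketch, not the kernel code: Vm, is_ucontrol,
hpage_enabled and check_hpage_1m are hypothetical names, and None stands
in for the NULL kvm pointer seen when the ioctl is issued on /dev/kvm:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vm:
    """Hypothetical stand-in for a VM handle; not a kernel structure."""
    is_ucontrol: bool


def check_hpage_1m(kvm: Optional[Vm], hpage_enabled: bool) -> int:
    """Report the capability without touching kvm when it is None."""
    # Only inspect the VM if the query came through a VM fd.
    if kvm is not None and kvm.is_ucontrol:
        return 0
    return 1 if hpage_enabled else 0
```

The point of the ordering is that the "is it None" test comes first, so
the per-VM property is never read for a query made on /dev/kvm.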

Fixes: 40ebdb8e59 ("KVM: s390: Make huge pages unavailable in ucontrol VMs")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Message-ID: <20240419160723.320910-2-jean-philippe@linaro.org>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2024-05-02 09:41:38 +02:00
Nina Schoetterl-Glausch cc4edb92f5 KVM: s390: vsie: Use virt_to_phys for crypto control block
The address of the crypto control block in the (shadow) SIE block is
absolute/physical.
Convert from virtual to physical when shadowing the guest's control
block during VSIE.

Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20240429171512.879215-1-nsg@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-05-01 11:48:21 +02:00
David Hildenbrand 06201e00ee s390/mm: Re-enable the shared zeropage for !PV and !skeys KVM guests
commit fa41ba0d08 ("s390/mm: avoid empty zero pages for KVM guests to
avoid postcopy hangs") introduced an undesired side effect when combined
with memory ballooning and VM migration: memory part of the inflated
memory balloon will consume memory.

Assume we have a 100GiB VM and inflated the balloon to 40GiB. Our VM
will consume ~60GiB of memory. If we now trigger a VM migration,
hypervisors like QEMU will read all VM memory. As s390x does not support
the shared zeropage, we'll end up allocating memory for all of the
previously-inflated part of the memory balloon: 40 GiB. So we might easily
(unexpectedly) crash the VM on the migration source.
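
The numbers in this example can be checked with a few lines of
arithmetic. A rough sketch in Python, assuming that reading a
never-populated page either maps the shared zeropage (no new allocation)
or allocates a fresh page; the function name is invented for
illustration:

```python
def footprint_after_read_gib(vm_size_gib: int, balloon_gib: int,
                             shared_zeropage: bool) -> int:
    """Memory backing the VM after the hypervisor has read every guest page."""
    in_use = vm_size_gib - balloon_gib  # memory the guest actually uses
    if shared_zeropage:
        # Reads of ballooned pages map the shared zeropage: nothing allocated.
        return in_use
    # Without the shared zeropage every ballooned page gets its own page.
    return in_use + balloon_gib
```

For a 100 GiB VM with a 40 GiB balloon this gives 60 GiB with the shared
zeropage and the full 100 GiB without it.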

Even worse, hypervisors like QEMU optimize for zeropage migration to not
consume memory on the migration destination: when migrating a
"page full of zeroes", on the migration destination they check whether the
target memory is already zero (by reading the destination memory) and avoid
writing to the memory to not allocate memory: however, s390x will also
allocate memory here, implying that also on the migration destination, we
will end up allocating all previously-inflated memory part of the memory
balloon.

This is especially bad if actual memory overcommit was not desired, when
memory ballooning is used for dynamic VM memory resizing, setting aside
some memory during boot that can be added later on demand. Alternatives
like virtio-mem that would avoid this issue are not yet available on
s390x.

There could be ways to optimize some cases in user space: before reading
memory in an anonymous private mapping on the migration source, check via
/proc/self/pagemap if anything is already populated. Similarly check on
the migration destination before reading. While that would avoid
populating tables full of shared zeropages on all architectures, it's
harder to get right and performant, and requires user space changes.
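
Such a check might look as follows. This is a sketch in Python rather
than QEMU code; the helper names are invented, and it relies on the
documented pagemap layout (one 64-bit entry per virtual page, bit 63 =
present, bit 62 = swapped):

```python
import os
import struct

PAGEMAP_ENTRY_BYTES = 8   # one 64-bit entry per virtual page
PM_PRESENT = 1 << 63      # page is present in RAM
PM_SWAPPED = 1 << 62      # page is in swap


def entry_is_populated(entry: int) -> bool:
    """True if the pagemap entry describes a page that exists somewhere."""
    return bool(entry & (PM_PRESENT | PM_SWAPPED))


def pagemap_offset(vaddr: int, page_size: int) -> int:
    """File offset of the pagemap entry that covers vaddr."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_BYTES


def is_populated(vaddr: int, pagemap_path: str = "/proc/self/pagemap") -> bool:
    """Look up vaddr in pagemap without touching the memory itself."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    fd = os.open(pagemap_path, os.O_RDONLY)
    try:
        raw = os.pread(fd, PAGEMAP_ENTRY_BYTES, pagemap_offset(vaddr, page_size))
    finally:
        os.close(fd)
    if len(raw) != PAGEMAP_ENTRY_BYTES:
        raise OSError("short read from pagemap")
    (entry,) = struct.unpack("=Q", raw)  # entries use native byte order
    return entry_is_populated(entry)
```

A caller would skip reading any page for which is_populated() returns
False. Doing one pread per page is the simple form; a real implementation
would batch lookups, which is part of why this is harder to make
performant.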

Further, with postcopy live migration we must place a page, so there,
"avoid touching memory to avoid allocating memory" is not really
possible. (Note that previously we would have falsely inserted
shared zeropages into processes using UFFDIO_ZEROPAGE where
mm_forbids_zeropage() would have actually forbidden it.)

PV is currently incompatible with memory ballooning, and in the common
case, KVM guests don't make use of storage keys. Instead of zapping
zeropages when enabling storage keys / PV, that turned out to be
problematic in the past, let's do exactly the same we do with KSM pages:
trigger unsharing faults to replace the shared zeropages by proper
anonymous folios.

What about added latency when enabling storage keys? Having a lot of
zeropages in applicable environments (PV, legacy guests, unittests) is
unexpected. Further, KSM could today already unshare the zeropages
and unmerging KSM pages when enabling storage keys would unshare the
KSM-placed zeropages in the same way, resulting in the same latency.

[ agordeev: Fixed sparse and checkpatch complaints and error handling ]

Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Fixes: fa41ba0d08 ("s390/mm: avoid empty zero pages for KVM guests to avoid postcopy hangs")
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20240411161441.910170-3-david@redhat.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-18 15:02:53 +02:00
Nina Schoetterl-Glausch 22fdd8ba61 KVM: s390: vsie: Use virt_to_phys for facility control block
In order for SIE to interpretively execute STFLE, it requires the real
or absolute address of a facility-list control block.
Before writing the location into the shadow SIE control block, convert
it from a virtual address.
We currently do not run into this bug because the lower 31 bits are the
same for virtual and physical addresses.

Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Link: https://lore.kernel.org/r/20240319164420.4053380-3-nsg@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-Id: <20240319164420.4053380-3-nsg@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2024-04-17 13:37:59 +02:00
Linus Torvalds 4f712ee0cb S390:
* Changes to FPU handling came in via the main s390 pull request
 
 * Only deliver to the guest the SCLP events that userspace has
   requested.
 
 * More virtual vs physical address fixes (only a cleanup since
   virtual and physical address spaces are currently the same).
 
 * Fix selftests undefined behavior.
 
 x86:
 
 * Fix a restriction that the guest can't program a PMU event whose
   encoding matches an architectural event that isn't included in the
   guest CPUID.  The enumeration of an architectural event only says
   that if a CPU supports an architectural event, then the event can be
   programmed *using the architectural encoding*.  The enumeration does
   NOT say anything about the encoding when the CPU doesn't report support
   for the event *in general*.  It might support it, and it might support it
   using the same encoding that made it into the architectural PMU spec.
 
 * Fix a variety of bugs in KVM's emulation of RDPMC (more details on
   individual commits) and add a selftest to verify KVM correctly emulates
   RDPMC, counter availability, and a variety of other PMC-related
   behaviors that depend on guest CPUID and therefore are easier to
   validate with selftests than with custom guests (aka kvm-unit-tests).
 
 * Zero out PMU state on AMD if the virtual PMU is disabled. It does not
   cause any bug, but it wastes time in various cases where KVM would check
   if a PMC event needs to be synthesized.
 
 * Optimize triggering of emulated events, with a nice ~10% performance
   improvement in VM-Exit microbenchmarks when a vPMU is exposed to the
   guest.
 
 * Tighten the check for "PMI in guest" to reduce false positives if an NMI
   arrives in the host while KVM is handling an IRQ VM-Exit.
 
 * Fix a bug where KVM would report stale/bogus exit qualification information
   when exiting to userspace with an internal error exit code.
 
 * Add a VMX flag in /proc/cpuinfo to report 5-level EPT support.
 
 * Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for
   read, e.g. to avoid serializing vCPUs when userspace deletes a memslot.
 
 * Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB).  KVM
   doesn't support yielding in the middle of processing a zap, and 1GiB
   granularity resulted in multi-millisecond lags that are quite impolite
   for CONFIG_PREEMPT kernels.
 
 * Allocate write-tracking metadata on-demand to avoid the memory overhead when
   a kernel is built with i915 virtualization support but the workloads use
   neither shadow paging nor i915 virtualization.
 
 * Explicitly initialize a variety of on-stack variables in the emulator that
   triggered KMSAN false positives.
 
 * Fix the debugregs ABI for 32-bit KVM.
 
 * Rework the "force immediate exit" code so that vendor code ultimately decides
   how and when to force the exit, which allowed some optimization for both
   Intel and AMD.
 
 * Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
   vCPU creation ultimately failed, causing extra unnecessary work.
 
 * Clean up the logic for checking if the currently loaded vCPU is in-kernel.
 
 * Harden against underflowing the active mmu_notifier invalidation
   count, so that "bad" invalidations (usually due to bugs elsewhere in the
   kernel) are detected earlier and are less likely to hang the kernel.
 
 x86 Xen emulation:
 
 * Overlay pages can now be cached based on host virtual address,
   instead of guest physical addresses.  This removes the need to
   reconfigure and invalidate the cache if the guest changes the
   gpa but the underlying host virtual address remains the same.
 
 * When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.
 
 * Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).
 
 * Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.
 
 RISC-V:
 
 * Support exception and interrupt handling in selftests
 
 * New self test for RISC-V architectural timer (Sstc extension)
 
 * New extension support (Ztso, Zacas)
 
 * Support userspace emulation of random number seed CSRs.
 
 ARM:
 
 * Infrastructure for building KVM's trap configuration based on the
   architectural features (or lack thereof) advertised in the VM's ID
   registers
 
 * Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
   x86's WC) at stage-2, improving the performance of interacting with
   assigned devices that can tolerate it
 
 * Conversion of KVM's representation of LPIs to an xarray, utilized to
   address some of the serialization on the LPI injection
   path
 
 * Support for _architectural_ VHE-only systems, advertised through the
   absence of FEAT_E2H0 in the CPU's ID register
 
 * Miscellaneous cleanups, fixes, and spelling corrections to KVM and
   selftests
 
 LoongArch:
 
 * Set reserved bits as zero in CPUCFG.
 
 * Start SW timer only when vcpu is blocking.
 
 * Do not restart SW timer when it is expired.
 
 * Remove unnecessary CSR register saving during enter guest.
 
 * Misc cleanups and fixes as usual.
 
 Generic:
 
 * Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always
   true on all architectures except MIPS (where Kconfig determines the
   availability depending on CPU capabilities).  It is replaced by
   an architecture-dependent symbol for MIPS, and by IS_ENABLED(CONFIG_KVM)
   everywhere else.
 
 * Factor common "select" statements in common code instead of requiring
   each architecture to specify it
 
 * Remove thoroughly obsolete APIs from the uapi headers.
 
 * Move architecture-dependent stuff to uapi/asm/kvm.h
 
 * Always flush the async page fault workqueue when a work item is being
   removed, especially during vCPU destruction, to ensure that there are no
   workers running in KVM code when all references to KVM-the-module are gone,
   i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded.
 
 * Grab a reference to the VM's mm_struct in the async #PF worker itself instead
   of gifting the worker a reference, so that there's no need to remember
   to *conditionally* clean up after the worker.
 
 Selftests:
 
 * Reduce boilerplate, especially when utilizing the selftest TAP
   infrastructure.
 
 * Add basic smoke tests for SEV and SEV-ES, along with a pile of library
   support for handling private/encrypted/protected memory.
 
 * Fix benign bugs where tests neglect to close() guest_memfd files.

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "S390:

   - Changes to FPU handling came in via the main s390 pull request

   - Only deliver to the guest the SCLP events that userspace has
     requested

   - More virtual vs physical address fixes (only a cleanup since
     virtual and physical address spaces are currently the same)

   - Fix selftests undefined behavior

  x86:

   - Fix a restriction that the guest can't program a PMU event whose
     encoding matches an architectural event that isn't included in the
     guest CPUID. The enumeration of an architectural event only says
     that if a CPU supports an architectural event, then the event can
     be programmed *using the architectural encoding*. The enumeration
     does NOT say anything about the encoding when the CPU doesn't
     report support for the event *in general*. It might support it, and
     it might support it using the same encoding that made it into the
     architectural PMU spec

   - Fix a variety of bugs in KVM's emulation of RDPMC (more details on
     individual commits) and add a selftest to verify KVM correctly
     emulates RDPMC, counter availability, and a variety of other
     PMC-related behaviors that depend on guest CPUID and therefore are
     easier to validate with selftests than with custom guests (aka
     kvm-unit-tests)

   - Zero out PMU state on AMD if the virtual PMU is disabled. It does
     not cause any bug, but it wastes time in various cases where KVM
     would check if a PMC event needs to be synthesized

   - Optimize triggering of emulated events, with a nice ~10%
     performance improvement in VM-Exit microbenchmarks when a vPMU is
     exposed to the guest

   - Tighten the check for "PMI in guest" to reduce false positives if
     an NMI arrives in the host while KVM is handling an IRQ VM-Exit

   - Fix a bug where KVM would report stale/bogus exit qualification
     information when exiting to userspace with an internal error exit
     code

   - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support

   - Rework TDP MMU root unload, free, and alloc to run with mmu_lock
     held for read, e.g. to avoid serializing vCPUs when userspace
     deletes a memslot

   - Tear down TDP MMU page tables at 4KiB granularity (used to be
     1GiB). KVM doesn't support yielding in the middle of processing a
     zap, and 1GiB granularity resulted in multi-millisecond lags that
     are quite impolite for CONFIG_PREEMPT kernels

   - Allocate write-tracking metadata on-demand to avoid the memory
     overhead when a kernel is built with i915 virtualization support
     but the workloads use neither shadow paging nor i915 virtualization

   - Explicitly initialize a variety of on-stack variables in the
     emulator that triggered KMSAN false positives

   - Fix the debugregs ABI for 32-bit KVM

   - Rework the "force immediate exit" code so that vendor code
     ultimately decides how and when to force the exit, which allowed
     some optimization for both Intel and AMD

   - Fix a long-standing bug where kvm_has_noapic_vcpu could be left
     elevated if vCPU creation ultimately failed, causing extra
     unnecessary work

   - Clean up the logic for checking if the currently loaded vCPU is
     in-kernel

   - Harden against underflowing the active mmu_notifier invalidation
     count, so that "bad" invalidations (usually due to bugs elsewhere
     in the kernel) are detected earlier and are less likely to hang the
     kernel

  x86 Xen emulation:

   - Overlay pages can now be cached based on host virtual address,
     instead of guest physical addresses. This removes the need to
     reconfigure and invalidate the cache if the guest changes the gpa
     but the underlying host virtual address remains the same

   - When possible, use a single host TSC value when computing the
     deadline for Xen timers in order to improve the accuracy of the
     timer emulation

   - Inject pending upcall events when the vCPU software-enables its
     APIC to fix a bug where an upcall can be lost (and to follow Xen's
     behavior)

   - Fall back to the slow path instead of warning if "fast" IRQ
     delivery of Xen events fails, e.g. if the guest has aliased xAPIC
     IDs

  RISC-V:

   - Support exception and interrupt handling in selftests

   - New self test for RISC-V architectural timer (Sstc extension)

   - New extension support (Ztso, Zacas)

   - Support userspace emulation of random number seed CSRs

  ARM:

   - Infrastructure for building KVM's trap configuration based on the
     architectural features (or lack thereof) advertised in the VM's ID
     registers

   - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
     x86's WC) at stage-2, improving the performance of interacting with
     assigned devices that can tolerate it

   - Conversion of KVM's representation of LPIs to an xarray, utilized
     to address some of the serialization on the LPI injection path

   - Support for _architectural_ VHE-only systems, advertised through
     the absence of FEAT_E2H0 in the CPU's ID register

   - Miscellaneous cleanups, fixes, and spelling corrections to KVM and
     selftests

  LoongArch:

   - Set reserved bits as zero in CPUCFG

   - Start SW timer only when vcpu is blocking

   - Do not restart SW timer when it is expired

   - Remove unnecessary CSR register saving during enter guest

   - Misc cleanups and fixes as usual

  Generic:

   - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically
     always true on all architectures except MIPS (where Kconfig
     determines the availability depending on CPU capabilities). It is
     replaced by an architecture-dependent symbol for MIPS, and by
     IS_ENABLED(CONFIG_KVM) everywhere else

   - Factor common "select" statements in common code instead of
     requiring each architecture to specify it

   - Remove thoroughly obsolete APIs from the uapi headers

   - Move architecture-dependent stuff to uapi/asm/kvm.h

   - Always flush the async page fault workqueue when a work item is
     being removed, especially during vCPU destruction, to ensure that
     there are no workers running in KVM code when all references to
     KVM-the-module are gone, i.e. to prevent a very unlikely
     use-after-free if kvm.ko is unloaded

   - Grab a reference to the VM's mm_struct in the async #PF worker
     itself instead of gifting the worker a reference, so that there's
     no need to remember to *conditionally* clean up after the worker

  Selftests:

   - Reduce boilerplate, especially when utilizing the selftest TAP
     infrastructure

   - Add basic smoke tests for SEV and SEV-ES, along with a pile of
     library support for handling private/encrypted/protected memory

   - Fix benign bugs where tests neglect to close() guest_memfd files"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits)
  selftests: kvm: remove meaningless assignments in Makefiles
  KVM: riscv: selftests: Add Zacas extension to get-reg-list test
  RISC-V: KVM: Allow Zacas extension for Guest/VM
  KVM: riscv: selftests: Add Ztso extension to get-reg-list test
  RISC-V: KVM: Allow Ztso extension for Guest/VM
  RISC-V: KVM: Forward SEED CSR access to user space
  KVM: riscv: selftests: Add sstc timer test
  KVM: riscv: selftests: Change vcpu_has_ext to a common function
  KVM: riscv: selftests: Add guest helper to get vcpu id
  KVM: riscv: selftests: Add exception handling support
  LoongArch: KVM: Remove unnecessary CSR register saving during enter guest
  LoongArch: KVM: Do not restart SW timer when it is expired
  LoongArch: KVM: Start SW timer only when vcpu is blocking
  LoongArch: KVM: Set reserved bits as zero in CPUCFG
  KVM: selftests: Explicitly close guest_memfd files in some gmem tests
  KVM: x86/xen: fix recursive deadlock in timer injection
  KVM: pfncache: simplify locking and make more self-contained
  KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
  KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
  KVM: x86/xen: improve accuracy of Xen timers
  ...
2024-03-15 13:03:13 -07:00
Paolo Bonzini 17193ced2d - Memop selftest rotate fix
- SCLP event bits over indication fix
 - Missing virt_to_phys for the CRYCB fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmXcfYoACgkQ41TmuOI4
 ufhPExAAxdcg3WjnTe/EYe+GnyjKo3nZs4y9dhZk9gf06qEYEawhg0ug5akzRZIH
 SDKeFqOXzl/ZRuL5hvfYBzxpy+IR3rWAYhBKUyxR6aJBl+RZKlf+Xn7l8iIKbNDq
 vAtLh9Hqza5IJiw/jtorw90TmiHDKvMlvft4UMG3t1IppyktUuuH0aujaVpeKtMR
 8qVyGsaTmNHip6Pi7w3WUnvYPkMNLoM7UIPhBAvWrJyYrLxao8pKEGWHaKwbMNHL
 Om4bjykfFCZ1Cs9aLZDLEasuD61Fpp41DnvImYm77yuDOdI4WalIlV7F5NbjQhhd
 IrQdsmlZc+N+HKcYvia6MnzAChTpo25pynvW7xXFQIfl/9VxcMFAfSLqLZMGMKFC
 IwzwI+BA3+bgw6zbN2z2uBShIom7Zzr689U8mbt5q7JborOH38qd5+IX6QFwUtTv
 IPHrgcULdWHWT5TRaIp61cB9YzCx2YU1QrMWEUVehldQqGEt8ANdZU5Ov0KG1BVl
 L9ULBIEnJ2ib1pGA7Xlxl2U0Lr2w/dg/p7EAdnOGes50GfEwEjtBzb7VO9Xfrz/Z
 j927hQO354Y8OYRFjKDjTceENynCiYsbNEhTHE6qFRIwAmeSVk4PT+vIXO6wZlZi
 Ee3LxsvVUnhYuC7sZbBUNhyiEjNn6GG3LxAtPeDoD+HvhqXI2Qg=
 =bwck
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-next-6.9-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

- Memop selftest rotate fix
- SCLP event bits over indication fix
- Missing virt_to_phys for the CRYCB fix
2024-03-14 14:47:56 -04:00
Linus Torvalds 691632f0e8 s390 updates for 6.9 merge window
- Various virtual vs physical address usage fixes
 
 - Fix error handling in Processor Activity Instrumentation device driver, and
   export number of counters with a sysfs file
 
 - Allow for multiple events when Processor Activity Instrumentation counters
   are monitored in system wide sampling
 
 - Change multiplier and shift values of the Time-of-Day clock source to improve
   steering precision
 
 - Remove a couple of unneeded GFP_DMA flags from allocations
 
 - Disable mmap alignment if randomize_va_space is also disabled, to avoid a too
   small heap
 
 - Various changes to allow s390 to be compiled with LLVM=1, since ld.lld and
   llvm-objcopy will have proper s390 support with clang 19
 
 - Add __uninitialized macro to Compiler Attributes. This is helpful with s390's
   FPU code where some users have up to 520 byte stack frames. Clearing such
   stack frames (if INIT_STACK_ALL_PATTERN or INIT_STACK_ALL_ZERO is enabled)
   before they are used contradicts the intention (performance improvement) of
   such code sections.
 
 - Convert switch_to() to an out-of-line function, and use the generic switch_to
   header file
 
 - Replace the usage of s390's debug feature with pr_debug() calls within the
   zcrypt device driver
 
 - Improve hotplug support of the Adjunct Processor device driver
 
 - Improve retry handling in the zcrypt device driver
 
 - Various changes to the in-kernel FPU code:
 
   - Make in-kernel FPU sections preemptible
 
   - Convert various larger inline assemblies and assembler files to C, mainly
     by using single instruction inline assemblies. This increases readability,
     but also makes it easier to add proper instrumentation hooks
 
   - Cleanup of the header files
 
 - Provide fast variants of csum_partial() and csum_partial_copy_nocheck() based
   on vector instructions
 
 - Introduce and use a lock to synchronize accesses to zpci device data
   structures to avoid inconsistent states caused by concurrent accesses
 
 - Compile the kernel without -fPIE. This addresses the following problems if
   the kernel is compiled with -fPIE:
 
   - It uses dynamic symbols (.dynsym), for which the linker refuses to allow
     more than 64k sections. This can break features which use
     '-ffunction-sections' and '-fdata-sections', including kpatch-build and
     function granular KASLR
 
   - It unnecessarily uses GOT relocations, adding an extra layer of indirection
     for many memory accesses
 
 - Fix shared_cpu_list for CPU private L2 caches, which incorrectly were
   reported as globally shared
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmXu3jEACgkQIg7DeRsp
 bsJC8A/9Gi9JSMKWpIDR4WE2MQGwP/PnYdEamtK6c9ewOjIR/UzRIyIM3J1pyV0L
 RwL8k7EBuv3f7shTcwfPzZWlnAwNwqr1UdcafjFNtHTig50YtdP5fBL33frKHBrm
 ATedlCjagojOuVbh1gB45WUgzjSSkPyn0vqwjjo4h6uEAQ35zMEWwCs5Hpajlkhi
 GCdJaiBLJcnhT96QGurQdke+MsrpGCzeBVBnA0qopQEWaQo8OdiAJ1uMD2WKbgPR
 817kNzvmE6nXnfd5JevYbaiLjK/HQUSw2dZUS6/fjuIrzTsZEUhSg4ECaprKXDg7
 5qiVVPNg4WbJAp0SsB+w7c4U99VxhbS7IVHXju18GrXw6SSAupdxIo7R7YiaT8vC
 YIXZ1uIQ4Vbts3w/UqWUczIl/ooQt2DdrWT5NDNA+84OlOM42rthzA3vznTWuPTb
 U21R7cZmN++hAUjR6s4aO2LfS7HQdnKL8nvJW2y99qSfrOXm+M973W2pDhYEVXQh
 ixQ/lxfQpbBT1yUGlquIErokCPB85VY6ZTdGu6Erziywf4CWGsT5CspyaQnX2KTJ
 s4CpFPnilrW3OnxmIkrM+pNJDun1nnkGA388Xq1NEKX8Oe65OMXEFNCb0kAHQ1ua
 vb6534Ib/iuPnxsGpz1sX9iRqtUd06aBovPcbwIvatHCSfkWws8=
 =KZ31
 -----END PGP SIGNATURE-----

Merge tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Heiko Carstens:

 - Various virtual vs physical address usage fixes

 - Fix error handling in Processor Activity Instrumentation device
   driver, and export number of counters with a sysfs file

 - Allow for multiple events when Processor Activity Instrumentation
   counters are monitored in system wide sampling

 - Change multiplier and shift values of the Time-of-Day clock source to
   improve steering precision

 - Remove a couple of unneeded GFP_DMA flags from allocations

 - Disable mmap alignment if randomize_va_space is also disabled, to
   avoid a too small heap

 - Various changes to allow s390 to be compiled with LLVM=1, since
   ld.lld and llvm-objcopy will have proper s390 support with clang 19

 - Add __uninitialized macro to Compiler Attributes. This is helpful
   with s390's FPU code where some users have up to 520 byte stack
   frames. Clearing such stack frames (if INIT_STACK_ALL_PATTERN or
   INIT_STACK_ALL_ZERO is enabled) before they are used contradicts the
   intention (performance improvement) of such code sections.

 - Convert switch_to() to an out-of-line function, and use the generic
   switch_to header file

 - Replace the usage of s390's debug feature with pr_debug() calls
   within the zcrypt device driver

 - Improve hotplug support of the Adjunct Processor device driver

 - Improve retry handling in the zcrypt device driver

 - Various changes to the in-kernel FPU code:

     - Make in-kernel FPU sections preemptible

     - Convert various larger inline assemblies and assembler files to
       C, mainly by using single instruction inline assemblies. This
       increases readability, but also makes it easier to add
       proper instrumentation hooks

     - Cleanup of the header files

 - Provide fast variants of csum_partial() and
   csum_partial_copy_nocheck() based on vector instructions

 - Introduce and use a lock to synchronize accesses to zpci device data
   structures to avoid inconsistent states caused by concurrent accesses

 - Compile the kernel without -fPIE. This addresses the following
   problems if the kernel is compiled with -fPIE:

     - It uses dynamic symbols (.dynsym), for which the linker refuses
       to allow more than 64k sections. This can break features which
       use '-ffunction-sections' and '-fdata-sections', including
       kpatch-build and function granular KASLR

     - It unnecessarily uses GOT relocations, adding an extra layer of
       indirection for many memory accesses

 - Fix shared_cpu_list for CPU private L2 caches, which incorrectly were
   reported as globally shared

* tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (117 commits)
  s390/tools: handle rela R_390_GOTPCDBL/R_390_GOTOFF64
  s390/cache: prevent rebuild of shared_cpu_list
  s390/crypto: remove retry loop with sleep from PAES pkey invocation
  s390/pkey: improve pkey retry behavior
  s390/zcrypt: improve zcrypt retry behavior
  s390/zcrypt: introduce retries on in-kernel send CPRB functions
  s390/ap: introduce mutex to lock the AP bus scan
  s390/ap: rework ap_scan_bus() to return true on config change
  s390/ap: clarify AP scan bus related functions and variables
  s390/ap: rearm APQNs bindings complete completion
  s390/configs: increase number of LOCKDEP_BITS
  s390/vfio-ap: handle hardware checkstop state on queue reset operation
  s390/pai: change sampling event assignment for PMU device driver
  s390/boot: fix minor comment style damages
  s390/boot: do not check for zero-termination relocation entry
  s390/boot: make type of __vmlinux_relocs_64_start|end consistent
  s390/boot: sanitize kaslr_adjust_relocs() function prototype
  s390/boot: simplify GOT handling
  s390: vmlinux.lds.S: fix .got.plt assertion
  s390/boot: workaround current 'llvm-objdump -t -j ...' behavior
  ...
2024-03-12 10:14:22 -07:00
Paolo Bonzini e9a2bba476 KVM Xen and pfncache changes for 6.9:
- Rip out the half-baked support for using gfn_to_pfn caches to manage pages
    that are "mapped" into guests via physical addresses.
 
  - Add support for using gfn_to_pfn caches with only a host virtual address,
    i.e. to bypass the "gfn" stage of the cache.  The primary use case is
    overlay pages, where the guest may change the gfn used to reference the
    overlay page, but the backing hva+pfn remains the same.
 
  - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
    of a gpa, so that userspace doesn't need to reconfigure and invalidate the
    cache/mapping if the guest changes the gpa (but userspace keeps the resolved
    hva the same).
 
  - When possible, use a single host TSC value when computing the deadline for
    Xen timers in order to improve the accuracy of the timer emulation.
 
  - Inject pending upcall events when the vCPU software-enables its APIC to fix
    a bug where an upcall can be lost (and to follow Xen's behavior).
 
  - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
    events fails, e.g. if the guest has aliased xAPIC IDs.
 
  - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
    refresh), and drop a now-redundant acquisition of xen_lock (that was
    protecting the shared_info cache) to fix a deadlock due to recursively
    acquiring xen_lock.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmXrblYACgkQOlYIJqCj
 N/3K4Q/+KZ8lrnNXvdHNCQdosA5DDXpqUcRzhlTUp82fncpdJ0LqrSMzMots2Eh9
 KC0jSPo8EkivF+Epug0+bpQBEaLXzTWhRcS1grePCDz2lBnxoHFSWjvaK2p14KlC
 LvxCJZjxyfLKHwKHpSndvO9hVFElCY3mvvE9KRcKeQAmrz1cz+DDMKelo1MuV8D+
 GfymhYc+UXpY41+6hQdznx+WoGoXKRameo3iGYuBoJjvKOyl4Wxkx9WSXIxxxuqG
 kHxjiWTR/jF1ITJl6PeMrFcGl3cuGKM/UfTOM6W2h6Wi3mhLpXveoVLnqR1kipIj
 btSzSVHL7C4WTPwOcyhwPzap+dJmm31c6N0uPScT7r9yhs+q5BDj26vcVcyPZUHo
 efIwmsnO2eQvuw+f8C6QqWCPaxvw46N0zxzwgc5uA3jvAC93y0l4v+xlAQsC0wzV
 0+BwU00cutH/3t3c/WPD5QcmRLH726VoFuTlaDufpoMU7gBVJ8rzjcusxR+5BKT+
 GJcAgZxZhEgvnzmTKd4Ec/mt+xZ2Erd+kV3MKCHvDPyj8jqy8FQ4DAWKGBR+h3WR
 rqAs2k8NPHyh3i1a3FL1opmxEGsRS+Cnc6Bi77cj9DxTr22JkgDJEuFR+Ues1z6/
 SpE889kt3w5zTo34+lNxNPlIKmO0ICwwhDL6pxJTWU7iWQnKypU=
 =GliW
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD

KVM Xen and pfncache changes for 6.9:

 - Rip out the half-baked support for using gfn_to_pfn caches to manage pages
   that are "mapped" into guests via physical addresses.

 - Add support for using gfn_to_pfn caches with only a host virtual address,
   i.e. to bypass the "gfn" stage of the cache.  The primary use case is
   overlay pages, where the guest may change the gfn used to reference the
   overlay page, but the backing hva+pfn remains the same.

 - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
   of a gpa, so that userspace doesn't need to reconfigure and invalidate the
   cache/mapping if the guest changes the gpa (but userspace keeps the resolved
   hva the same).

 - When possible, use a single host TSC value when computing the deadline for
   Xen timers in order to improve the accuracy of the timer emulation.

 - Inject pending upcall events when the vCPU software-enables its APIC to fix
   a bug where an upcall can be lost (and to follow Xen's behavior).

 - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
   events fails, e.g. if the guest has aliased xAPIC IDs.

 - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
   refresh), and drop a now-redundant acquisition of xen_lock (that was
   protecting the shared_info cache) to fix a deadlock due to recursively
   acquiring xen_lock.
2024-03-11 10:42:55 -04:00
Eric Farman 01be7f53df KVM: s390: fix access register usage in ioctls
The routine ar_translation() can be reached by both the instruction
intercept path (where the access registers had been loaded with the
guest register contents), and the MEM_OP ioctls (which hadn't).
Since this routine saves the current registers to vcpu->run,
this routine erroneously saves host registers into the guest space.

Introduce a boolean in the kvm_vcpu_arch struct to indicate whether
the registers contain guest contents. If they do (the instruction
intercept path), the save can be performed and the AR translation
is done just as it is today. If they don't (the MEM_OP path), the
AR can be read from vcpu->run without stashing the current contents.

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Link: https://lore.kernel.org/r/20240220211211.3102609-2-farman@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-22 16:06:56 +01:00
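To make the fix in 01be7f53df ("KVM: s390: fix access register usage in
ioctls") concrete, here is a minimal Python model of the idea, not the kernel
code. The names Vcpu, run_acrs, acrs_loaded and read_guest_ar are assumptions
chosen for illustration; the commit message only states that a boolean is
added to the kvm_vcpu_arch struct.

```python
from dataclasses import dataclass
from typing import List

NUM_ACRS = 16  # s390 has 16 access registers


@dataclass
class Vcpu:
    # Guest access registers as last stored in vcpu->run.
    run_acrs: List[int]
    # Hypothetical flag: True while the hardware registers hold guest contents.
    acrs_loaded: bool = False


def read_guest_ar(vcpu: Vcpu, hw_acrs: List[int], ar: int) -> int:
    """Return the guest value of access register `ar`."""
    if vcpu.acrs_loaded:
        # Instruction intercept path: the hardware registers hold guest
        # values, so syncing them into vcpu->run first is correct.
        vcpu.run_acrs[:] = hw_acrs
    # Both paths read from vcpu->run. On the MEM_OP path nothing was
    # overwritten, so host register contents never reach the guest copy.
    return vcpu.run_acrs[ar]


guest = list(range(NUM_ACRS))
host = [0x1000 + i for i in range(NUM_ACRS)]

# MEM_OP ioctl: host registers are live and must not leak into vcpu->run.
vcpu = Vcpu(run_acrs=guest.copy(), acrs_loaded=False)
value = read_guest_ar(vcpu, host, 3)
print(value, vcpu.run_acrs == guest)
```

In this model the pre-fix behavior corresponds to performing the sync
unconditionally, which is what overwrote the guest copy with host values on
the MEM_OP path.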
Eric Farman 85a19b3054 KVM: s390: only deliver the set service event bits
The SCLP driver code masks off the last two bits of the parameter [1]
to determine if a read is required, but doesn't care about the
contents of those bits. Meanwhile, the KVM code that delivers
event interrupts masks off those two bits but sends both to the
guest, even if only one was specified by userspace [2].

This works for the driver code, but it means any nuances of those
bits get lost. Use the event pending mask as an actual mask, and
only send the bit(s) that were specified in the pending interrupt.

[1] Linux: sclp_interrupt_handler() (drivers/s390/char/sclp.c:658)
[2] QEMU: service_interrupt() (hw/s390x/sclp.c:360..363)

Fixes: 0890ddea1a ("KVM: s390: protvirt: Add SCLP interrupt handling")
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20240205214300.1018522-1-farman@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-Id: <20240205214300.1018522-1-farman@linux.ibm.com>
2024-02-22 10:36:42 +01:00
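The masking change in 85a19b3054 ("KVM: s390: only deliver the set service
event bits") can be modelled in a few lines. This is a hedged Python sketch,
not the kernel code: the constant name EVENT_PENDING_MASK, its value (taken
from "the last two bits" in the commit message) and both function names are
assumptions for illustration.

```python
EVENT_PENDING_MASK = 0x3  # assumed: the "last two bits" of the parameter


def delivered_bits_old(pending: int) -> int:
    """Old behavior: if any event bit is pending, send both bits."""
    return EVENT_PENDING_MASK if pending & EVENT_PENDING_MASK else 0


def delivered_bits_new(pending: int) -> int:
    """New behavior: use the mask as a mask, send only the bits that are set."""
    return pending & EVENT_PENDING_MASK


for pending in (0x0, 0x1, 0x2, 0x3):
    print(pending, delivered_bits_old(pending), delivered_bits_new(pending))
```

The two functions differ only when exactly one of the two bits was specified
by userspace, which is the case where the nuance used to get lost.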
Janosch Frank 4a59932874 KVM: s390: introduce kvm_s390_fpu_(store|load)
It's a bit nicer than having multiple lines and will help if there's
another re-work since we'll only have to change one location.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-21 15:09:13 +01:00
Sean Christopherson 9e7325acb3 KVM: s390: Refactor kvm_is_error_gpa() into kvm_is_gpa_in_memslot()
Rename kvm_is_error_gpa() to kvm_is_gpa_in_memslot() and invert the
polarity accordingly in order to (a) free up kvm_is_error_gpa() to match
with kvm_is_error_{hva,page}(), and (b) to make it more obvious that the
helper is doing a memslot lookup, i.e. not simply checking for INVALID_GPA.

No functional change intended.

Link: https://lore.kernel.org/r/20240215152916.1158-9-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20 07:37:45 -08:00
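The polarity inversion in 9e7325acb3 ("KVM: s390: Refactor kvm_is_error_gpa()
into kvm_is_gpa_in_memslot()") is easy to get wrong at call sites, so here is
a small Python model of it. It is a sketch under stated assumptions, not the
kernel implementation: the memslot representation as (base_gfn, npages)
tuples, the 4 KiB PAGE_SHIFT, and the Python function names are all
hypothetical stand-ins.

```python
from typing import List, Tuple

PAGE_SHIFT = 12  # assumed 4 KiB pages

Memslot = Tuple[int, int]  # (base_gfn, npages), hypothetical representation


def gpa_in_memslot(memslots: List[Memslot], gpa: int) -> bool:
    """New polarity: True when the gpa is backed by some memslot."""
    gfn = gpa >> PAGE_SHIFT
    return any(base <= gfn < base + npages for base, npages in memslots)


def is_error_gpa_old(memslots: List[Memslot], gpa: int) -> bool:
    """Old polarity: True when the memslot lookup fails."""
    return not gpa_in_memslot(memslots, gpa)


slots = [(0x100, 16)]  # one slot covering gfns 0x100..0x10f

# A call site that used to test `if is_error_gpa_old(...)` now has to test
# `if not gpa_in_memslot(...)` to keep the same behavior.
for gpa in (0x100 << PAGE_SHIFT, 0x110 << PAGE_SHIFT):
    print(hex(gpa), is_error_gpa_old(slots, gpa), gpa_in_memslot(slots, gpa))
```

The new name also makes the cost visible: the helper walks memslots, it does
not merely compare against a sentinel value.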
Heiko Carstens 066c40918b s390/fpu: decrease stack usage for some cases
The kernel_fpu structure has a quite large size of 520 bytes. In order to
reduce stack footprint introduce several kernel fpu structures with
different and also smaller sizes. This way every kernel fpu user must use
the correct variant. A compile time check verifies that the correct variant
is used.

There are several users which use only 16 instead of all 32 vector
registers. For those users the new kernel_fpu_16 structure with a size of
only 266 bytes can be used.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-16 14:30:16 +01:00
Heiko Carstens ed3a0a011a s390/kvm: convert to regular kernel fpu user
KVM modifies the kernel fpu's regs pointer to its own area to implement its
custom version of preemptible kernel fpu context. With general support for
preemptible kernel fpu context there is no need for the extra complexity in
KVM code anymore.

Therefore convert KVM to a regular kernel fpu user. In particular this
means that all TIF_FPU checks can be removed, since the fpu register
context will never be changed by other kernel fpu users, and also the fpu
register context will be restored if a thread is preempted.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-16 14:30:16 +01:00
Heiko Carstens 87c5c70036 s390/fpu: rename save_fpu_regs() to save_user_fpu_regs(), etc
Rename save_fpu_regs(), load_fpu_regs(), and struct thread_struct's fpu
member to save_user_fpu_regs(), load_user_fpu_regs(), and ufpu. This way
the function and variable names reflect for which context they are supposed
to be used.

This large and trivial conversion is a prerequisite for making the kernel
fpu usage preemptible.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-16 14:30:15 +01:00
Heiko Carstens 419abc4d38 s390/fpu: convert FPU CIF flag to regular TIF flag
The FPU state, as represented by the CIF_FPU flag reflects the FPU state of
a task, not the CPU it is running on. Therefore convert the flag to a
regular TIF flag.

This removes the magic in switch_to() where a save_fpu_regs() call for the
currently (previous) running task sets the per-cpu CIF_FPU flag, which is
required to restore FPU register contents of the next task, when it returns
to user space.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-16 14:30:15 +01:00
Heiko Carstens fd2527f209 s390/fpu: move, rename, and merge header files
Move, rename, and merge the fpu and vx header files. This way fpu header
files have a consistent naming scheme (fpu*.h).

Also get rid of the fpu subdirectory and move header files to asm
directory, so that all fpu and vx header files can be found at the same
location.

Merge the internal.h header file into other header files. The internal
helpers are used at many locations, so those helper functions are really
not internal.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-16 14:30:14 +01:00
Alexander Gordeev 7b2411e793 KVM: s390: fix virtual vs physical address confusion
Fix virtual vs physical address confusion. This does not fix a bug
since virtual and physical address spaces are currently the same.

Suggested-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Anthony Krowiak <akrowiak@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2024-02-14 16:25:29 +01:00
Heiko Carstens 304103736b s390/acrs: cleanup access register handling
save_access_regs() and restore_access_regs() are only available by
including switch_to.h. This is done by a couple of C files, which have
nothing to do with switch_to(), but only need these functions.

Move both functions to a new header file and improve the implementation:

- Get rid of typedef

- Add memory access instrumentation support

- Use long displacement instructions lamy/stamy instead of lam/stam - all
  current users end up with better code because of this

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2024-02-12 15:03:33 +01:00
Paolo Bonzini f48212ee8e treewide: remove CONFIG_HAVE_KVM
It has no users anymore.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08 08:45:36 -05:00
Paolo Bonzini 3f5198c7f6 pqap instruction missing cc fix
vsie shadow creation race fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEwGNS88vfc9+v45Yq41TmuOI4ufgFAmWuk3oACgkQ41TmuOI4
 ufgUgA/+J3io5etMjY6/NMtUT0pJAb3Z4LtW8795+glGmtQ1fS4HC/rnr8eA6VEx
 OoN16Yh2WyeqzwRpPW5TVXOgxOswaB2rgdY3waBJtyoO/0MjfhSGXju15vjYfP6S
 QO5p2NNNkzIBIgsQfhkVUiWTAa3IswulHdz74OonGaKYLzf/IUrKJQg8OKYtfnzk
 uIeW2Om8weXT8VLTsm1koJ3EFadARcc4chLLj/QX7uVs4zQ2IT1zgmJhfzW3+/4g
 ATHXKkbuGT+dQEZJF8BcRX42gFCYynXlwMBGeb3mEtqXQE9gBB1/IsyrOgt/zsuO
 Nh5cpN00Xxb50tbmODeTj1WLhAXnbn5Qlg+ZOeOZuIp12X0OP8PHXFq99kqp3s2w
 lCFuO0OBS2atmUlm7ik3J/FW9KUq4pdaDOw1MJjwTCh+0cf4Vn5mtMLkcBfGkX6s
 c6Uc//RHkXV0BRFbqTQdk4cRTft3P5x2TKxW/kkmSBxOTn9OKgJPN03yu0XQ2DNJ
 7wLyxWlNMMK0PTLFZo5DUVuiKRUZBrw2sXa2pn7q4OMYbyK+7a46c/vu4xLBqusm
 NU75elRY9gBad3Kd3NR3EkEz5ltu0oo/wFy47biSDHv8a2jtb4UCQ5q3Ao+fu5XA
 eh1yyX4ZhNYtJeB5Qk3jkajZRjoj8Lw5JanFk53/wVe4ZWZDAvc=
 =8K58
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-master-6.8-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

pqap instruction missing cc fix
vsie shadow creation race fix
2024-01-26 12:57:12 -05:00
Linus Torvalds 09d1c6a80f Generic:
- Use memdup_array_user() to harden against overflow.
 
 - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
 
 - Clean up Kconfigs that all KVM architectures were selecting
 
 - New functionality around "guest_memfd", a new userspace API that
   creates an anonymous file and returns a file descriptor that refers
   to it.  guest_memfd files are bound to their owning virtual machine,
   cannot be mapped, read, or written by userspace, and cannot be resized.
   guest_memfd files do however support PUNCH_HOLE, which can be used to
   switch a memory area between guest_memfd and regular anonymous memory.
 
 - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
   per-page attributes for a given page of guest memory; right now the
   only attribute is whether the guest expects to access memory via
   guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
   TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees
   confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM).
 
 x86:
 
 - Support for "software-protected VMs" that can use the new guest_memfd
   and page attributes infrastructure.  This is mostly useful for testing,
   since there is no pKVM-like infrastructure to provide a meaningfully
   reduced TCB.
 
 - Fix a relatively benign off-by-one error when splitting huge pages during
   CLEAR_DIRTY_LOG.
 
 - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf
   TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
 
 - Use more generic lockdep assertions in paths that don't actually care
   about whether the caller is a reader or a writer.
 
 - Let Xen guests opt out of having PV clock reported as "based on a stable TSC",
   because some of them don't expect the "TSC stable" bit (added to the pvclock
   ABI by KVM, but never set by Xen) to be set.
 
 - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.
 
 - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always
   flushes on nested transitions, i.e. always satisfies flush requests.  This
   allows running bleeding edge versions of VMware Workstation on top of KVM.
 
 - Sanity check that the CPU supports flush-by-ASID when enabling SEV support.
 
 - On AMD machines with vNMI, always rely on hardware instead of intercepting
   IRET in some cases to detect unmasking of NMIs
 
 - Support for virtualizing Linear Address Masking (LAM)
 
 - Fix a variety of vPMU bugs where KVM fails to stop/reset counters and other state
   prior to refreshing the vPMU model.
 
 - Fix a double-overflow PMU bug by tracking emulated counter events using a
   dedicated field instead of snapshotting the "previous" counter.  If the
   hardware PMC count triggers overflow that is recognized in the same VM-Exit
   that KVM manually bumps an event count, KVM would pend PMIs for both the
   hardware-triggered overflow and for KVM-triggered overflow.
 
 - Turn off KVM_WERROR by default for all configs so that it's not
   inadvertently enabled by non-KVM developers, which can be problematic for
   subsystems that require no regressions for W=1 builds.
 
 - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
   "features".
 
 - Don't force a masterclock update when a vCPU synchronizes to the current TSC
   generation, as updating the masterclock can cause kvmclock's time to "jump"
   unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
 
 - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
   partly as a super minor optimization, but mostly to make KVM play nice with
   position independent executable builds.
 
 - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
   CONFIG_HYPERV as a minor optimization, and to self-document the code.
 
 - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
   at build time.
 
 ARM64:
 
 - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB
   base granule sizes. Branch shared with the arm64 tree.
 
 - Large Fine-Grained Trap rework, bringing some sanity to the
   feature, although there is more to come. This comes with
   a prefix branch shared with the arm64 tree.
 
 - Some additional Nested Virtualization groundwork, mostly
   introducing the NV2 VNCR support and retargeting the NV
   support to that version of the architecture.
 
 - A small set of vgic fixes and associated cleanups.
 
 LoongArch:
 
 - Optimization for memslot hugepage checking
 
 - Clean up and fix some HW/SW timer issues
 
 - Add LSX/LASX (128bit/256bit SIMD) support
 
 RISC-V:
 
 - KVM_GET_REG_LIST improvement for vector registers
 
 - Generate ISA extension reg_list using macros in get-reg-list selftest
 
 - Support for reporting steal time along with selftest
 
 s390:
 
 - Bugfixes
 
 Selftests:
 
 - Fix an annoying goof where the NX hugepage test prints out garbage
   instead of the magic token needed to run the test.
 
 - Fix build errors when a header is deleted/moved due to a missing flag
   in the Makefile.
 
 - Detect if KVM bugged/killed a selftest's VM and print out a helpful
   message instead of complaining that a random ioctl() failed.
 
 - Annotate the guest printf/assert helpers with __printf(), and fix the
   various bugs that were lurking due to lack of said annotation.
 
 There are two non-KVM patches buried in the middle of guest_memfd support:
 
   fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
   mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
 
 The first is small and mostly suggested-by Christian Brauner; the second
 a bit less so but it was written by an mm person (Vlastimil Babka).

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "Generic:

   - Use memdup_array_user() to harden against overflow.

   - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
     architectures.

   - Clean up Kconfigs that all KVM architectures were selecting

   - New functionality around "guest_memfd", a new userspace API that
     creates an anonymous file and returns a file descriptor that refers
     to it. guest_memfd files are bound to their owning virtual machine,
     cannot be mapped, read, or written by userspace, and cannot be
     resized. guest_memfd files do however support PUNCH_HOLE, which can
     be used to switch a memory area between guest_memfd and regular
     anonymous memory.

   - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
     per-page attributes for a given page of guest memory; right now the
     only attribute is whether the guest expects to access memory via
     guest_memfd or not, which in Confidential VMs backed by SEV-SNP,
     TDX or ARM64 pKVM is checked by firmware or hypervisor that
     guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
     the case of pKVM).

  x86:

   - Support for "software-protected VMs" that can use the new
     guest_memfd and page attributes infrastructure. This is mostly
     useful for testing, since there is no pKVM-like infrastructure to
     provide a meaningfully reduced TCB.

   - Fix a relatively benign off-by-one error when splitting huge pages
     during CLEAR_DIRTY_LOG.

   - Fix a bug where KVM could incorrectly test-and-clear dirty bits in
     non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
     a non-huge SPTE.

   - Use more generic lockdep assertions in paths that don't actually
     care about whether the caller is a reader or a writer.

   - Let Xen guests opt out of having PV clock reported as "based on a
     stable TSC", because some of them don't expect the "TSC stable" bit
     (added to the pvclock ABI by KVM, but never set by Xen) to be set.

   - Revert a bogus, made-up nested SVM consistency check for
     TLB_CONTROL.

   - Advertise flush-by-ASID support for nSVM unconditionally, as KVM
     always flushes on nested transitions, i.e. always satisfies flush
     requests. This allows running bleeding edge versions of VMware
     Workstation on top of KVM.

   - Sanity check that the CPU supports flush-by-ASID when enabling SEV
     support.

   - On AMD machines with vNMI, always rely on hardware instead of
     intercepting IRET in some cases to detect unmasking of NMIs

   - Support for virtualizing Linear Address Masking (LAM)

   - Fix a variety of vPMU bugs where KVM fails to stop/reset counters
     and other state prior to refreshing the vPMU model.

   - Fix a double-overflow PMU bug by tracking emulated counter events
     using a dedicated field instead of snapshotting the "previous"
     counter. If the hardware PMC count triggers overflow that is
     recognized in the same VM-Exit that KVM manually bumps an event
     count, KVM would pend PMIs for both the hardware-triggered overflow
     and for KVM-triggered overflow.

   - Turn off KVM_WERROR by default for all configs so that it's not
     inadvertently enabled by non-KVM developers, which can be
     problematic for subsystems that require no regressions for W=1
     builds.

   - Advertise all of the host-supported CPUID bits that enumerate
     IA32_SPEC_CTRL "features".

   - Don't force a masterclock update when a vCPU synchronizes to the
     current TSC generation, as updating the masterclock can cause
     kvmclock's time to "jump" unexpectedly, e.g. when userspace
     hotplugs a pre-created vCPU.

   - Use RIP-relative address to read kvm_rebooting in the VM-Enter
     fault paths, partly as a super minor optimization, but mostly to
     make KVM play nice with position independent executable builds.

   - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
     CONFIG_HYPERV as a minor optimization, and to self-document the
     code.

   - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
     "emulation" at build time.

  ARM64:

   - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
     granule sizes. Branch shared with the arm64 tree.

   - Large Fine-Grained Trap rework, bringing some sanity to the
     feature, although there is more to come. This comes with a prefix
     branch shared with the arm64 tree.

   - Some additional Nested Virtualization groundwork, mostly
     introducing the NV2 VNCR support and retargeting the NV support to
     that version of the architecture.

   - A small set of vgic fixes and associated cleanups.

  Loongarch:

   - Optimization for memslot hugepage checking

   - Cleanup and fix some HW/SW timer issues

   - Add LSX/LASX (128bit/256bit SIMD) support

  RISC-V:

   - KVM_GET_REG_LIST improvement for vector registers

   - Generate ISA extension reg_list using macros in get-reg-list
     selftest

   - Support for reporting steal time along with selftest

  s390:

   - Bugfixes

  Selftests:

   - Fix an annoying goof where the NX hugepage test prints out garbage
     instead of the magic token needed to run the test.

   - Fix build errors when a header is deleted/moved due to a missing
     flag in the Makefile.

   - Detect if KVM bugged/killed a selftest's VM and print out a helpful
     message instead of complaining that a random ioctl() failed.

   - Annotate the guest printf/assert helpers with __printf(), and fix
     the various bugs that were lurking due to lack of said annotation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
  x86/kvm: Do not try to disable kvmclock if it was not enabled
  KVM: x86: add missing "depends on KVM"
  KVM: fix direction of dependency on MMU notifiers
  KVM: introduce CONFIG_KVM_COMMON
  KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
  KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
  RISC-V: KVM: selftests: Add get-reg-list test for STA registers
  RISC-V: KVM: selftests: Add steal_time test support
  RISC-V: KVM: selftests: Add guest_sbi_probe_extension
  RISC-V: KVM: selftests: Move sbi_ecall to processor.c
  RISC-V: KVM: Implement SBI STA extension
  RISC-V: KVM: Add support for SBI STA registers
  RISC-V: KVM: Add support for SBI extension registers
  RISC-V: KVM: Add SBI STA info to vcpu_arch
  RISC-V: KVM: Add steal-update vcpu request
  RISC-V: KVM: Add SBI STA extension skeleton
  RISC-V: paravirt: Implement steal-time support
  RISC-V: Add SBI STA extension definitions
  RISC-V: paravirt: Add skeleton for pv-time support
  RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
  ...
2024-01-17 13:03:37 -08:00
Linus Torvalds de927f6c0b s390 updates for 6.8 merge window
- Add machine variable capacity information to /proc/sysinfo.
 
 - Limit the waste of page tables and always align vmalloc area size
   and base address on segment boundary.
 
 - Fix a memory leak when an attempt to register interruption sub class
   (ISC) for the adjunct-processor (AP) guest failed.
 
 - Replace response code AP_RESPONSE_INVALID_GISA with
   AP_RESPONSE_INVALID_ADDRESS, which the guest understands, in response
   to a failed interruption sub class (ISC) registration attempt.
 
 - Improve reaction to adjunct-processor (AP) AP_RESPONSE_OTHERWISE_CHANGED
   response code when enabling interrupts on behalf of a guest.
 
 - Fix incorrect sysfs 'status' attribute of adjunct-processor (AP) queue
   device bound to the vfio_ap device driver when the mediated device is
   attached to a guest, but the queue device is not passed through.
 
 - Rework struct ap_card to hold the whole adjunct-processor (AP) card
   hardware information. As a result, all the ugly bit checks are replaced
   by simple evaluations of the required bit fields.
 
 - Improve handling of some weird scenarios between service element (SE)
   host and SE guest with adjunct-processor (AP) pass-through support.
 
 - Change local_ctl_set_bit() and local_ctl_clear_bit() so they return the
   previous value of the to be changed control register. This is useful if
   a bit is only changed temporarily and the previous content needs to be
   restored.
 
 - The kernel starts with machine checks disabled and is expected to enable
   it once trap_init() is called. However the implementation allows machine
   checks early. Consistently enable it in trap_init() only.
 
 - local_mcck_disable() and local_mcck_enable() assume that machine checks
   are always enabled. Instead implement and use local_mcck_save() and
   local_mcck_restore() to disable machine checks and restore the previous
   state.
 
 - Modification of floating point control (FPC) register of a traced
   process using ptrace interface may lead to corruption of the FPC
   register of the tracing process. Fix this.
 
 - kvm_arch_vcpu_ioctl_set_fpu() allows setting the floating point control
   (FPC) register in vCPU, but may lead to corruption of the FPC register
   of the host process. Fix this.
 
 - Use READ_ONCE() to read a vCPU floating point register value from the
   memory mapped area. This guarantees that the value tested for validity
   is the same one that is used, regardless of code generation.
 
 - Get rid of test_fp_ctl(), since it is quite subtle to use it correctly.
   Instead copy a new floating point control register value into its save
   area and test the validity of the new value when loading it.
 
 - Remove superfluous save_fpu_regs() call.
 
 - Remove s390 support for ARCH_WANTS_DYNAMIC_TASK_STRUCT. All machines
   have provided the vector facility for many years, so there is no longer
   a need to make the task structure size dependent on the vector facility.
 
 - Remove the "novx" kernel command line option, as the vector code has
   run without any problems for many years.
 
 - Add the vector facility to the z13 architecture level set (ALS).
   All hypervisors have supported the vector facility for many years.
   This allows compile time optimizations of the kernel.
 
 - Get rid of MACHINE_HAS_VX and replace it with cpu_has_vx(). As a result,
   the compiled code will have fewer runtime checks and less code.
 
 - Convert pgste_get_lock() and pgste_set_unlock() ASM inlines to C.
 
 - Convert the struct subchannel spinlock from pointer to member.

Merge tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Alexander Gordeev:

 - Add machine variable capacity information to /proc/sysinfo.

 - Limit the waste of page tables and always align vmalloc area size and
   base address on segment boundary.

 - Fix a memory leak when an attempt to register interruption sub class
   (ISC) for the adjunct-processor (AP) guest failed.

 - Replace response code AP_RESPONSE_INVALID_GISA with
   AP_RESPONSE_INVALID_ADDRESS, which the guest understands, in response
   to a failed interruption sub class (ISC) registration attempt.

 - Improve reaction to adjunct-processor (AP)
   AP_RESPONSE_OTHERWISE_CHANGED response code when enabling interrupts
   on behalf of a guest.

 - Fix incorrect sysfs 'status' attribute of adjunct-processor (AP)
   queue device bound to the vfio_ap device driver when the mediated
   device is attached to a guest, but the queue device is not passed
   through.

 - Rework struct ap_card to hold the whole adjunct-processor (AP) card
   hardware information. As a result, all the ugly bit checks are replaced
   by simple evaluations of the required bit fields.

 - Improve handling of some weird scenarios between service element (SE)
   host and SE guest with adjunct-processor (AP) pass-through support.

 - Change local_ctl_set_bit() and local_ctl_clear_bit() so they return
   the previous value of the to be changed control register. This is
   useful if a bit is only changed temporarily and the previous content
   needs to be restored.

 - The kernel starts with machine checks disabled and is expected to
   enable it once trap_init() is called. However the implementation
   allows machine checks early. Consistently enable it in trap_init()
   only.

 - local_mcck_disable() and local_mcck_enable() assume that machine
   checks are always enabled. Instead implement and use
   local_mcck_save() and local_mcck_restore() to disable machine checks
   and restore the previous state.

 - Modification of floating point control (FPC) register of a traced
   process using ptrace interface may lead to corruption of the FPC
   register of the tracing process. Fix this.

 - kvm_arch_vcpu_ioctl_set_fpu() allows setting the floating point
   control (FPC) register in vCPU, but may lead to corruption of the FPC
   register of the host process. Fix this.

 - Use READ_ONCE() to read a vCPU floating point register value from the
   memory mapped area. This guarantees that the value tested for validity
   is the same one that is used, regardless of code generation.

 - Get rid of test_fp_ctl(), since it is quite subtle to use it
   correctly. Instead copy a new floating point control register value
   into its save area and test the validity of the new value when
   loading it.

 - Remove superfluous save_fpu_regs() call.

 - Remove s390 support for ARCH_WANTS_DYNAMIC_TASK_STRUCT. All machines
   have provided the vector facility for many years, so there is no
   longer a need to make the task structure size dependent on the vector
   facility.

 - Remove the "novx" kernel command line option, as the vector code has
   run without any problems for many years.

 - Add the vector facility to the z13 architecture level set (ALS). All
   hypervisors have supported the vector facility for many years. This
   allows compile time optimizations of the kernel.

 - Get rid of MACHINE_HAS_VX and replace it with cpu_has_vx(). As a
   result, the compiled code will have fewer runtime checks and less
   code.

 - Convert pgste_get_lock() and pgste_set_unlock() ASM inlines to C.

 - Convert the struct subchannel spinlock from pointer to member.

* tag 's390-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (24 commits)
  Revert "s390: update defconfigs"
  s390/cio: make sch->lock spinlock pointer a member
  s390: update defconfigs
  s390/mm: convert pgste locking functions to C
  s390/fpu: get rid of MACHINE_HAS_VX
  s390/als: add vector facility to z13 architecture level set
  s390/fpu: remove "novx" option
  s390/fpu: remove ARCH_WANTS_DYNAMIC_TASK_STRUCT support
  KVM: s390: remove superfluous save_fpu_regs() call
  s390/fpu: get rid of test_fp_ctl()
  KVM: s390: use READ_ONCE() to read fpc register value
  KVM: s390: fix setting of fpc register
  s390/ptrace: handle setting of fpc register correctly
  s390/nmi: implement and use local_mcck_save() / local_mcck_restore()
  s390/nmi: consistently enable machine checks in trap_init()
  s390/ctlreg: return old register contents when changing bits
  s390/ap: handle outband SE bind state change
  s390/ap: store TAPQ hwinfo in struct ap_card
  s390/vfio-ap: fix sysfs status attribute for AP queue devices
  s390/vfio-ap: improve reaction to response code 07 from PQAP(AQIC) command
  ...
2024-01-10 18:18:20 -08:00
Eric Farman 83303a4c77 KVM: s390: fix cc for successful PQAP
The various errors that are possible when processing a PQAP
instruction (the absence of a driver hook, an error FROM that
hook), all correctly set the PSW condition code to 3. But if
that processing works successfully, CC0 needs to be set to
convey that everything was fine.

Fix the check so that the guest can examine the condition code
to determine whether GPR1 has meaningful data.

Fixes: e5282de931 ("s390: ap: kvm: add PQAP interception for AQIC")
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Link: https://lore.kernel.org/r/20231201181657.1614645-1-farman@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-Id: <20231201181657.1614645-1-farman@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2024-01-08 18:05:44 +01:00
Paolo Bonzini fb872da8e7 Common KVM changes for 6.8:
- Use memdup_array_user() to harden against overflow.
 
  - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.

Merge tag 'kvm-x86-generic-6.8' of https://github.com/kvm-x86/linux into HEAD

Common KVM changes for 6.8:

 - Use memdup_array_user() to harden against overflow.

 - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
2024-01-08 08:09:57 -05:00
Paolo Bonzini caadf876bb KVM: introduce CONFIG_KVM_COMMON
CONFIG_HAVE_KVM is currently used by some architectures to either
enable the KVM config proper, or to enable host-side code that is
not part of the KVM module.  However, CONFIG_KVM's "select" statement
in virt/kvm/Kconfig corresponds to a third meaning, namely to
enable common Kconfigs required by all architectures that support
KVM.

These three meanings can be replaced respectively by an
architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by
a new Kconfig symbol that is in turn selected by the
architecture-specific "config KVM".

Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON.
Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by
architecture code, and it brings in all dependencies of common
KVM code.  In particular, INTERVAL_TREE was missing in loongarch
and riscv, so that is another thing that is fixed.

Fixes: 8132d887a7 ("KVM: remove CONFIG_HAVE_KVM_EVENTFD", 2023-12-08)
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/all/44907c6b-c5bd-4e4a-a921-e4d3825539d8@infradead.org/
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-08 08:09:38 -05:00
Paolo Bonzini 731859dde8 - uvdevice fixed additional data return length
- stfle (feature indication) vsie fixes and minor cleanup

Merge tag 'kvm-s390-next-6.8-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

- uvdevice fixed additional data return length
- stfle (feature indication) vsie fixes and minor cleanup
2024-01-02 13:18:30 -05:00
Paolo Bonzini 136292522e LoongArch KVM changes for v6.8
1. Optimization for memslot hugepage checking.
 2. Cleanup and fix some HW/SW timer issues.
 3. Add LSX/LASX (128bit/256bit SIMD) support.

Merge tag 'loongarch-kvm-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD

LoongArch KVM changes for v6.8

1. Optimization for memslot hugepage checking.
2. Cleanup and fix some HW/SW timer issues.
3. Add LSX/LASX (128bit/256bit SIMD) support.
2024-01-02 13:16:29 -05:00
Nina Schoetterl-Glausch 682dbf430d KVM: s390: vsie: Fix length of facility list shadowed
The length of the facility list accessed when interpretively executing
STFLE is the same as the host's facility list (in case of format-0).
The memory following the facility list doesn't need to be accessible.
The current VSIE implementation accesses a fixed length that exceeds the
guest/host facility list length and can therefore wrongly inject a
validity intercept.
Instead, find out the host facility list length by running STFLE and
copy only as much as necessary when shadowing.

Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20231219140854.1042599-3-nsg@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20231219140854.1042599-3-nsg@linux.ibm.com>
2023-12-23 10:41:09 +01:00
Nina Schoetterl-Glausch 2731d605d5 KVM: s390: vsie: Fix STFLE interpretive execution identification
STFLE can be interpretively executed.
This occurs when the facility list designation is unequal to zero.
Perform the check before applying the address mask instead of after.
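
A minimal sketch of the ordering issue, not the actual vsie code: the
mask value and the helper name below are assumptions made for
illustration only. A designation that is nonzero but whose address bits
are all zero must still count as "interpretive execution requested",
which only works if the zero check happens before masking.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask that keeps only the address bits of the designation. */
#define FAC_ADDR_MASK 0x7ffffff8u

/*
 * Returns true if STFLE would be interpretively executed for this
 * designation, and stores the masked facility list address.
 */
static bool stfle_interpreted(uint32_t fac_designation, uint32_t *fac_addr)
{
	/* Test the raw designation first ... */
	if (fac_designation == 0)
		return false;

	/* ... and only then reduce it to an address. */
	*fac_addr = fac_designation & FAC_ADDR_MASK;
	return true;
}
```

With the check placed after the mask, a designation such as 0x1 would
be reduced to zero first and wrongly treated as "not interpreted".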

Fixes: 66b630d5b7 ("KVM: s390: vsie: support STFLE interpretation")
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Nina Schoetterl-Glausch <nsg@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20231219140854.1042599-2-nsg@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20231219140854.1042599-2-nsg@linux.ibm.com>
2023-12-23 10:41:09 +01:00
Christian Borntraeger fe752331d4 KVM: s390: vsie: fix race during shadow creation
Right now it is possible to see gmap->private being zero in
kvm_s390_vsie_gmap_notifier resulting in a crash.  This is due to the
fact that we set gmap->private = kvm only after creation:

static int acquire_gmap_shadow(struct kvm_vcpu *vcpu,
                               struct vsie_page *vsie_page)
{
[...]
        gmap = gmap_shadow(vcpu->arch.gmap, asce, edat);
        if (IS_ERR(gmap))
                return PTR_ERR(gmap);
        gmap->private = vcpu->kvm;

Let children inherit the private field of the parent.
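
A simplified, single-threaded sketch of the idea, not the kernel's gmap
code: the struct and function names are hypothetical, and the real code
additionally needs locking. The point is only that the private field is
copied from the parent before the shadow becomes reachable through the
parent's list of children, so a notifier walking that list never sees a
zero private pointer.

```c
#include <stdlib.h>

/* Hypothetical, stripped-down stand-in for struct gmap. */
struct gmap_sketch {
	void *private;			/* owner, e.g. the kvm instance */
	struct gmap_sketch *parent;
	struct gmap_sketch *children;	/* head of the list of shadows */
	struct gmap_sketch *next_child;
};

static struct gmap_sketch *shadow_create(struct gmap_sketch *parent)
{
	struct gmap_sketch *sg = calloc(1, sizeof(*sg));

	if (!sg)
		return NULL;

	/* Fully initialize the shadow, including the inherited field ... */
	sg->private = parent->private;
	sg->parent = parent;

	/* ... and only then publish it on the parent's children list. */
	sg->next_child = parent->children;
	parent->children = sg;
	return sg;
}
```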

Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: a3508fbe9d ("KVM: s390: vsie: initial support for nested virtualization")
Cc: <stable@vger.kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20231220125317.4258-1-borntraeger@linux.ibm.com
2023-12-21 11:40:18 +01:00
Heiko Carstens 18564756ab s390/fpu: get rid of MACHINE_HAS_VX
Get rid of MACHINE_HAS_VX and replace it with cpu_has_vx() which is a
short readable wrapper for "test_facility(129)".

Facility bit 129 is set if the vector facility is present. test_facility()
also returns true for all bits which are set in the architecture level set
of the cpu that the kernel is compiled for. This means that
test_facility(129) is a compile time constant which returns true for z13
and later, since the vector facility bit is part of the z13 kernel ALS.

As a result, the compiled code will have fewer runtime checks and less code.
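
A rough userspace sketch of the mechanism described above, under stated
assumptions: BUILT_FOR_Z13_OR_LATER is a hypothetical stand-in for the
compile-time architecture level set, and the helper names are invented
for illustration. Facility bits are numbered starting from the most
significant bit of the first byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for "kernel compiled for z13 or later". */
#ifndef BUILT_FOR_Z13_OR_LATER
#define BUILT_FOR_Z13_OR_LATER 1
#endif

/* Test facility bit nr in a facility list of len bytes (MSB-first). */
static bool facility_bit_set(const uint8_t *list, size_t len, unsigned int nr)
{
	if ((nr >> 3) >= len)
		return false;
	return (list[nr >> 3] & (0x80u >> (nr & 7))) != 0;
}

/* Vector facility check: constant when the build target guarantees it. */
static bool has_vx(const uint8_t *list, size_t len)
{
	if (BUILT_FOR_Z13_OR_LATER)
		return true;	/* compiler can drop the runtime test */
	return facility_bit_set(list, len, 129);
}
```

When the build-time constant is nonzero the compiler can fold has_vx()
to true and discard the dependent runtime branches, which is the effect
the commit message describes.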

Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-12-11 14:33:07 +01:00
Heiko Carstens d7271ba401 KVM: s390: remove superfluous save_fpu_regs() call
The save_fpu_regs() call in kvm_arch_vcpu_ioctl_get_fpu() is pointless: it
will save the current user space fpu context to the thread's save area. But
the code is accessing only the vcpu's save area / mapped register area,
which in this case are not the same.

Therefore remove the confusing call.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-12-11 14:33:06 +01:00
Heiko Carstens 702644249d s390/fpu: get rid of test_fp_ctl()
It is quite subtle to use test_fp_ctl() correctly. Therefore remove it -
instead copy whatever new floating point control (fpc) register values are
supposed to be used into its save area.

Test the validity of the new value when loading it. If the new value is
invalid, load the fpc register with zero.

This seems to be the best way to approach this problem, even though it
changes behavior:

- sigreturn with an invalid fpc value on the stack will succeed, and
  continue with zero value, instead of returning with SIGSEGV

- ptraced processes will also use a zero value instead of letting the
  request fail with -EINVAL

However, all of this seems to be acceptable. After all, testing of the
value was only implemented to prevent user space from crashing the kernel.
It is not there to test values for validity; and the assumption is that
there is no existing user space which is doing this.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-12-11 14:33:06 +01:00
Heiko Carstens 3b2e00f167 KVM: s390: use READ_ONCE() to read fpc register value
Use READ_ONCE() to read a vcpu's floating point register value from
the memory mapped area. This guarantees that the value tested for
validity is the same one that is subsequently used. Without it the two
could differ, since user space can modify the area concurrently and the
compiler is free to generate code that reads the value multiple times.
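
The pattern can be sketched in userspace C as follows. This is an
illustration only: READ_ONCE_U32 is a simplified stand-in for the
kernel's READ_ONCE(), and fpc_is_valid() is a made-up validity rule,
not the real fpc check.

```c
#include <stdint.h>

/* Simplified stand-in for READ_ONCE(): force exactly one load. */
#define READ_ONCE_U32(x) (*(const volatile uint32_t *)&(x))

/* Hypothetical validity rule: only the low 8 bits may be set. */
static int fpc_is_valid(uint32_t fpc)
{
	return (fpc & ~0xffu) == 0;
}

/*
 * Take one snapshot of the shared value, then validate and use that
 * snapshot. A concurrent writer can no longer change the value between
 * the check and the use.
 */
static uint32_t load_fpc_checked(const uint32_t *shared_fpc)
{
	uint32_t fpc = READ_ONCE_U32(*shared_fpc);

	return fpc_is_valid(fpc) ? fpc : 0;
}
```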

Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-12-11 14:33:05 +01:00
Heiko Carstens b988b1bb00 KVM: s390: fix setting of fpc register
kvm_arch_vcpu_ioctl_set_fpu() allows setting the floating point control
(fpc) register of a guest cpu. The new value is tested for validity by
temporarily loading it into the fpc register.

This may lead to corruption of the fpc register of the host process:
if an interrupt happens while the value is temporarily loaded into the fpc
register, and within interrupt context floating point or vector registers
are used, the current fp/vx registers are saved with save_fpu_regs()
assuming they belong to user space and will be loaded into fp/vx registers
when returning to user space.

test_fp_ctl() restores the original user space / host process fpc register
value; however, it will be discarded when returning to user space.

As a result, the host process will incorrectly continue to run with the
value that was supposed to be used for a guest cpu.

Fix this by simply removing the test. There is another test right before
the SIE context is entered which handles invalid values.

This results in a change of behaviour: invalid values will now be accepted
instead of the ioctl failing with -EINVAL. This seems to be acceptable,
given that this interface is most likely not used anymore, and it is also
the same behaviour as implemented with the memory mapped interface
(replace invalid values with zero) - see sync_regs() in kvm-s390.c.

Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2023-12-11 14:33:05 +01:00
Paolo Bonzini c5b31cc237 KVM: remove CONFIG_HAVE_KVM_IRQFD
All platforms with a kernel irqchip have support for irqfd.  Unify the
two configuration items so that userspace can expect to use irqfd to
inject interrupts into the irqchip.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08 15:43:33 -05:00
Paolo Bonzini 8132d887a7 KVM: remove CONFIG_HAVE_KVM_EVENTFD
virt/kvm/eventfd.c is compiled unconditionally, meaning that the ioeventfds
member of struct kvm is accessed unconditionally.  CONFIG_HAVE_KVM_EVENTFD
therefore must be defined for KVM common code to compile successfully;
remove it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08 15:43:33 -05:00
Philipp Stanner 8c4976772d KVM: s390: Harden copying of userspace-array against overflow
guestdbg.c utilizes memdup_user() to copy a userspace array. This,
currently, does not check for an overflow.

Use the new wrapper memdup_array_user() to copy the array more safely.

Note, KVM explicitly checks the number of entries before duplicating the
array, i.e. adding the overflow check should be a glorified nop.
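
As a userspace analogue of what an overflow-checked array duplication
does (a sketch, not the kernel implementation): dup_array_checked() is
a hypothetical name, and __builtin_mul_overflow() is a GCC/Clang
builtin assumed to be available.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate an array of n elements of the given size. Returns NULL if
 * n * size would overflow size_t or if the allocation fails.
 */
static void *dup_array_checked(const void *src, size_t n, size_t size)
{
	size_t bytes;
	void *copy;

	/* Reject the request instead of allocating a truncated buffer. */
	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;

	copy = malloc(bytes ? bytes : 1);
	if (copy)
		memcpy(copy, src, bytes);
	return copy;
}
```

If the caller has already bounded n, as the commit message notes KVM
does, the multiplication cannot overflow and the extra check costs
essentially nothing.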

Suggested-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20231102181526.43279-3-pstanner@redhat.com
[sean: call out that KVM pre-checks the number of entries]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01 08:00:42 -08:00
Wei Wang 63912245c1 KVM: move KVM_CAP_DEVICE_CTRL to the generic check
KVM_CAP_DEVICE_CTRL allows userspace to check if the kvm_device
framework (e.g. KVM_CREATE_DEVICE) is supported by KVM. Move
KVM_CAP_DEVICE_CTRL to the generic check for two reasons:
1) it already supports arch agnostic usages (i.e. KVM_DEV_TYPE_VFIO).
For example, a userspace VFIO implementation may need to create
KVM_DEV_TYPE_VFIO on x86, riscv, or arm etc. It is simpler to have it
checked in the generic code than in each arch's code.
2) KVM_CREATE_DEVICE has been added to the generic code.

Link: https://lore.kernel.org/all/20221215115207.14784-1-wei.w.wang@intel.com
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Acked-by: Anup Patel <anup@brainfault.org> (riscv)
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lore.kernel.org/r/20230315101606.10636-1-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30 13:09:43 -08:00
Claudio Imbrenda 80aea01c48 KVM: s390: vsie: fix wrong VIR 37 when MSO is used
When the host invalidates a guest page, it will also check if the page
was used to map the prefix of any guest CPUs, in which case they are
stopped and marked as needing a prefix refresh. Upon starting the
affected CPUs again, their prefix pages are explicitly faulted in and
revalidated if they had been invalidated. A bit in the PGSTEs indicates
whether or not a page might contain a prefix. The bit is allowed to
overindicate. Pages above 2G are skipped, because they cannot be
prefixes, since KVM runs all guests with MSO = 0.

The same applies for nested guests (VSIE). When the host invalidates a
guest page that maps the prefix of the nested guest, it has to stop the
affected nested guest CPUs and mark them as needing a prefix refresh.
The same PGSTE bit used for the guest prefix is also used for the
nested guest. Pages above 2G are skipped like for normal guests, which
is the source of the bug.

The nested guest runs in the guest primary address space. The guest
could be running the nested guest using MSO != 0. If the MSO + prefix
for the nested guest is above 2G, the check for nested prefix will skip
it. This will cause the invalidation notifier to not stop the CPUs of
the nested guest and not mark them as needing refresh. When the nested
guest is run again, its prefix will not be refreshed, since it has not
been marked for refresh. This will cause a fatal validity intercept
with VIR code 37.

Fix this by removing the check for 2G for nested guests. Now all
invalidations of pages with the notify bit set will always scan the
existing VSIE shadow state descriptors.

This allows catching invalidations of nested guest prefix mappings even
when the prefix is above 2G in the guest virtual address space.
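
The effect of dropping the 2G check can be modelled with a small standalone
sketch. All names below are hypothetical and the functions only model the
decision described above (skip or scan the VSIE shadow state descriptors);
they are not the KVM notifier code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TWO_GB (1ULL << 31)

/* Address at which the nested guest's prefix lands in guest memory. */
static uint64_t nested_prefix_addr(uint64_t mso, uint64_t prefix)
{
	return mso + prefix;
}

/* Old behaviour: invalidations at or above 2G were never scanned. */
static bool needs_vsie_scan_old(uint64_t addr, bool notify_bit)
{
	if (addr >= TWO_GB)
		return false;
	return notify_bit;
}

/* Fixed behaviour: every page with the notify bit set is scanned. */
static bool needs_vsie_scan_fixed(uint64_t addr, bool notify_bit)
{
	(void)addr;
	return notify_bit;
}

/* Usage example: a nested guest run with MSO = 3G and prefix 0x2000. */
static void vsie_scan_demo(void)
{
	uint64_t addr = nested_prefix_addr(3ULL << 30, 0x2000);

	assert(!needs_vsie_scan_old(addr, true));  /* invalidation missed */
	assert(needs_vsie_scan_fixed(addr, true)); /* invalidation caught */
}
```

With MSO = 0 the prefix stays below 2G and both variants agree, which matches
the observation that the bug only shows up when MSO is used.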

Fixes: a3508fbe9d ("KVM: s390: vsie: initial support for nested virtualization")
Tested-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <20231102153549.53984-1-imbrenda@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2023-11-14 18:56:36 +01:00
Linus Torvalds e392ea4d4d s390 updates for the 6.7 merge window

Merge tag 's390-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 updates from Vasily Gorbik:

 - Get rid of private VM_FAULT flags

 - Add word-at-a-time implementation

 - Add DCACHE_WORD_ACCESS support

 - Cleanup control register handling

 - Disallow CPU hotplug of CPU 0 to simplify its handling complexity,
   following a similar restriction in x86

 - Optimize pai crypto map allocation

 - Update the list of crypto express EP11 coprocessor operation modes

 - Fixes and improvements for secure guests AP pass-through

 - Several fixes to address incorrect page marking for address
   translation with the "cmma no-dat" feature, preventing potential
   incorrect guest TLB flushes

 - Fix early IPI handling

 - Several virtual vs physical address confusion fixes

 - Various small fixes and improvements all over the code

* tag 's390-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (74 commits)
  s390/cio: replace deprecated strncpy with strscpy
  s390/sclp: replace deprecated strncpy with strtomem
  s390/cio: fix virtual vs physical address confusion
  s390/cio: export CMG value as decimal
  s390: delete the unused store_prefix() function
  s390/cmma: fix handling of swapper_pg_dir and invalid_pg_dir
  s390/cmma: fix detection of DAT pages
  s390/sclp: handle default case in sclp memory notifier
  s390/pai_crypto: remove per-cpu variable assignement in event initialization
  s390/pai: initialize event count once at initialization
  s390/pai_crypto: use PERF_ATTACH_TASK define for per task detection
  s390/mm: add missing arch_set_page_dat() call to gmap allocations
  s390/mm: add missing arch_set_page_dat() call to vmem_crst_alloc()
  s390/cmma: fix initial kernel address space page table walk
  s390/diag: add missing virt_to_phys() translation to diag224()
  s390/mm,fault: move VM_FAULT_ERROR handling to do_exception()
  s390/mm,fault: remove VM_FAULT_BADMAP and VM_FAULT_BADACCESS
  s390/mm,fault: remove VM_FAULT_SIGNAL
  s390/mm,fault: remove VM_FAULT_BADCONTEXT
  s390/mm,fault: simplify kfence fault handling
  ...
2023-11-03 10:17:22 -10:00
Paolo Bonzini 140139c5bd - nested page table management performance counters

Merge tag 'kvm-s390-next-6.7-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

- nested page table management performance counters
2023-10-31 10:10:15 -04:00
Heiko Carstens 44ae766353 s390/mm: move translation-exception identification structure to fault.h
Move translation-exception identification structure to new fault.h
header file, change it to a union, and change existing kvm code
accordingly. The new union will be used by subsequent patches.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-10-23 18:21:22 +02:00
Nico Boehr 70fea30195 KVM: s390: add tracepoint in gmap notifier
The gmap notifier is called for changes in table entries with the
notifier bit set. To diagnose performance issues, it can be useful to
see what causes certain changes in the gmap.

Hence, add a tracepoint in the gmap notifier.

Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20231009093304.2555344-3-nrb@linux.ibm.com
Message-Id: <20231009093304.2555344-3-nrb@linux.ibm.com>
2023-10-16 14:54:29 +02:00
Nico Boehr c3235e2dd6 KVM: s390: add stat counter for shadow gmap events
The shadow gmap tracks memory of nested guests (guest-3). In certain
scenarios, the shadow gmap needs to be rebuilt, which is a costly operation
since it involves a SIE exit into guest-1 for every entry in the respective
shadow level.

Add kvm stat counters when new shadow structures are created at various
levels. Also add a counter gmap_shadow_create when a completely fresh
shadow gmap is created as well as a counter gmap_shadow_reuse when an
existing gmap is being reused.

Note that when several levels are shadowed at once, counters on all
affected levels will be increased.

Also note that not all page table levels need to be present and an ASCE
can directly point to e.g. a segment table. In this case, a new segment
table will always be equivalent to a new shadow gmap and hence will be
counted as gmap_shadow_create and not as gmap_shadow_segment.
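
The counting rules above can be sketched as a simplified standalone model.
Only gmap_shadow_create and gmap_shadow_reuse are names taken from this
message; the enum, the per_level array and the helper functions are
hypothetical and do not reflect where the kernel actually increments the
counters.

```c
#include <assert.h>

enum table_level {
	LVL_REGION1, LVL_REGION2, LVL_REGION3, LVL_SEGMENT, LVL_PAGE, LVL_COUNT
};

struct shadow_stats {
	unsigned long gmap_shadow_create;
	unsigned long gmap_shadow_reuse;
	unsigned long per_level[LVL_COUNT]; /* hypothetical per-level counters */
};

/*
 * Account one newly shadowed table. A new table at the level the ASCE
 * points to counts as a fresh shadow gmap, not as a per-level event.
 */
static void account_new_table(struct shadow_stats *st,
			      enum table_level created,
			      enum table_level asce_top)
{
	assert(created >= asce_top && created < LVL_COUNT);

	if (created == asce_top)
		st->gmap_shadow_create++;
	else
		st->per_level[created]++;
}

/* Several levels shadowed at once: every affected level is counted. */
static void account_new_tables(struct shadow_stats *st,
			       enum table_level first, enum table_level last,
			       enum table_level asce_top)
{
	for (int lvl = first; lvl <= last; lvl++)
		account_new_table(st, (enum table_level)lvl, asce_top);
}
```

For an ASCE that points directly to a segment table, shadowing the segment
table and a page table in one go bumps gmap_shadow_create once and the page
table counter once, while the segment counter stays untouched.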

Signed-off-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20231009093304.2555344-2-nrb@linux.ibm.com
Message-Id: <20231009093304.2555344-2-nrb@linux.ibm.com>
2023-10-16 14:54:29 +02:00
Michael Mueller f87ef57235 KVM: s390: fix gisa destroy operation might lead to cpu stalls
A GISA cannot be destroyed as long as it is linked in the GIB alert list,
as this would break the alert list. Just waiting for its removal from
the list triggered by another vm is not sufficient, as it might be the
only vm. The cpu stall shown below might occur when GIB alerts
are delayed; it is fixed by calling process_gib_alert_list() instead of
waiting.

At this time the vcpus of the vm are already destroyed and thus
no vcpu can be kicked to enter the SIE again if for some reason an
interrupt is pending for that vm.

Additionally, the IAM restore value is set to 0x00. A non-zero value at
this point would be a bug introduced by incomplete device de-registration,
i.e. a missing kvm_s390_gisc_unregister() call.

Setting this value and the IAM in the GISA to 0x00 guarantees that late
interrupts don't bring the GISA back into the alert list.

CPU stall caused by kvm_s390_gisa_destroy():

 [ 4915.311372] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 14-.... } 24533 jiffies s: 5269 root: 0x1/.
 [ 4915.311390] rcu: blocking rcu_node structures (internal RCU debug): l=1:0-15:0x4000/.
 [ 4915.311394] Task dump for CPU 14:
 [ 4915.311395] task:qemu-system-s39 state:R  running task     stack:0     pid:217198 ppid:1      flags:0x00000045
 [ 4915.311399] Call Trace:
 [ 4915.311401]  [<0000038003a33a10>] 0x38003a33a10
 [ 4933.861321] rcu: INFO: rcu_sched self-detected stall on CPU
 [ 4933.861332] rcu: 	14-....: (42008 ticks this GP) idle=53f4/1/0x4000000000000000 softirq=61530/61530 fqs=14031
 [ 4933.861353] rcu: 	(t=42008 jiffies g=238109 q=100360 ncpus=18)
 [ 4933.861357] CPU: 14 PID: 217198 Comm: qemu-system-s39 Not tainted 6.5.0-20230816.rc6.git26.a9d17c5d8813.300.fc38.s390x #1
 [ 4933.861360] Hardware name: IBM 8561 T01 703 (LPAR)
 [ 4933.861361] Krnl PSW : 0704e00180000000 000003ff804bfc66 (kvm_s390_gisa_destroy+0x3e/0xe0 [kvm])
 [ 4933.861414]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
 [ 4933.861416] Krnl GPRS: 0000000000000000 00000372000000fc 00000002134f8000 000000000d5f5900
 [ 4933.861419]            00000002f5ea1d18 00000002f5ea1d18 0000000000000000 0000000000000000
 [ 4933.861420]            00000002134fa890 00000002134f8958 000000000d5f5900 00000002134f8000
 [ 4933.861422]            000003ffa06acf98 000003ffa06858b0 0000038003a33c20 0000038003a33bc8
 [ 4933.861430] Krnl Code: 000003ff804bfc58: ec66002b007e	cij	%r6,0,6,000003ff804bfcae
                           000003ff804bfc5e: b904003a		lgr	%r3,%r10
                          #000003ff804bfc62: a7f40005		brc	15,000003ff804bfc6c
                          >000003ff804bfc66: e330b7300204	lg	%r3,10032(%r11)
                           000003ff804bfc6c: 58003000		l	%r0,0(%r3)
                           000003ff804bfc70: ec03fffb6076	crj	%r0,%r3,6,000003ff804bfc66
                           000003ff804bfc76: e320b7600271	lay	%r2,10080(%r11)
                           000003ff804bfc7c: c0e5fffea339	brasl	%r14,000003ff804942ee
 [ 4933.861444] Call Trace:
 [ 4933.861445]  [<000003ff804bfc66>] kvm_s390_gisa_destroy+0x3e/0xe0 [kvm]
 [ 4933.861460] ([<00000002623523de>] free_unref_page+0xee/0x148)
 [ 4933.861507]  [<000003ff804aea98>] kvm_arch_destroy_vm+0x50/0x120 [kvm]
 [ 4933.861521]  [<000003ff8049d374>] kvm_destroy_vm+0x174/0x288 [kvm]
 [ 4933.861532]  [<000003ff8049d4fe>] kvm_vm_release+0x36/0x48 [kvm]
 [ 4933.861542]  [<00000002623cd04a>] __fput+0xea/0x2a8
 [ 4933.861547]  [<00000002620d5bf8>] task_work_run+0x88/0xf0
 [ 4933.861551]  [<00000002620b0aa6>] do_exit+0x2c6/0x528
 [ 4933.861556]  [<00000002620b0f00>] do_group_exit+0x40/0xb8
 [ 4933.861557]  [<00000002620b0fa6>] __s390x_sys_exit_group+0x2e/0x30
 [ 4933.861559]  [<0000000262d481f4>] __do_syscall+0x1d4/0x200
 [ 4933.861563]  [<0000000262d59028>] system_call+0x70/0x98
 [ 4933.861565] Last Breaking-Event-Address:
 [ 4933.861566]  [<0000038003a33b60>] 0x38003a33b60

Fixes: 9f30f62163 ("KVM: s390: add gib_alert_irq_handler()")
Signed-off-by: Michael Mueller <mimu@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Link: https://lore.kernel.org/r/20230901105823.3973928-1-mimu@linux.ibm.com
Message-ID: <20230901105823.3973928-1-mimu@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2023-09-25 08:31:47 +02:00
Heiko Carstens 99441a38c3 s390: use control register bit defines
Use control register bit defines instead of plain numbers where
possible.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2023-09-19 13:26:57 +02:00