Commit graph

768075 commits

Author SHA1 Message Date
Liran Alon
abfc52c612 KVM: nVMX: Separate logic allocating shadow vmcs to a function
No functionality change.
This is done as a preparation for VMCS shadowing virtualization.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:48 +02:00
Liran Alon
491a603845 KVM: VMX: Mark vmcs header as shadow in case alloc_vmcs_cpu() allocates shadow vmcs
No functionality change.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:47 +02:00
Liran Alon
32c7acf044 KVM: nVMX: Expose VMCS shadowing to L1 guest
Expose VMCS shadowing to L1 as a VMX capability of the virtual CPU,
whether or not VMCS shadowing is supported by the physical CPU.
(VMCS shadowing emulation)

Shadowed VMREADs and VMWRITEs from L2 are handled by L0, without a
VM-exit to L1.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:46 +02:00
Liran Alon
a7cde481b6 KVM: nVMX: Do not forward VMREAD/VMWRITE VMExits to L1 if required so by vmcs12 vmread/vmwrite bitmaps
This is done as a preparation for VMCS shadowing emulation.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:45 +02:00
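
The bitmap lookup that a7cde481b6 relies on can be sketched in plain C. This is a minimal illustration under stated assumptions, not the kernel code: the function name, the `shadowing_enabled` flag and the in-memory bitmap argument are hypothetical. It follows the architectural rule that bits 14:0 of the field encoding select one bit in a 4 KiB bitmap, and that a set bit (or an encoding with higher bits set) means the access exits to L1.

```c
#include <stdbool.h>
#include <stdint.h>

#define VMCS_BITMAP_BYTES 4096  /* one 4 KiB page holds 32768 field bits */

/*
 * Hypothetical helper: returns true when a VMREAD/VMWRITE of `field`
 * executed by L2 should be forwarded to L1 as a VM exit, and false
 * when L0 may handle it without involving L1.
 */
static bool vmcs_access_exits_to_l1(bool shadowing_enabled,
                                    const uint8_t *bitmap, uint64_t field)
{
    /* Without VMCS shadowing in vmcs12, every access exits to L1. */
    if (!shadowing_enabled)
        return true;

    /* Encodings with any bit above bit 14 set always cause an exit. */
    if (field >> 15)
        return true;

    /* Bit n of the bitmap (n = field[14:0]) selects exit vs. no exit. */
    return (bitmap[field / 8] >> (field % 8)) & 1;
}
```

A caller would pass the VMREAD bitmap for VMREAD exits and the VMWRITE bitmap for VMWRITE exits.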
Liran Alon
6d894f498f KVM: nVMX: vmread/vmwrite: Use shadow vmcs12 if running L2
This is done as a preparation to VMCS shadowing emulation.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:44 +02:00
Paolo Bonzini
9a78bdf31d KVM: selftests: add tests for shadow VMCS save/restore
This includes setting up the shadow VMCS and the secondary execution
controls in lib/vmx.c.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:44 +02:00
Paolo Bonzini
fa58a9fa74 KVM: nVMX: include shadow vmcs12 in nested state
The shadow vmcs12 cannot be flushed on KVM_GET_NESTED_STATE,
because at that point guest memory is assumed by userspace to
be immutable.  Capture the cache in vmx_get_nested_state, adding
another page at the end if there is an active shadow vmcs12.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:43 +02:00
Liran Alon
61ada7488f KVM: nVMX: Cache shadow vmcs12 on VMEntry and flush to memory on VMExit
This is done as a preparation to VMCS shadowing emulation.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:42 +02:00
Liran Alon
f145d90d97 KVM: nVMX: Verify VMCS shadowing VMCS link pointer
Intel SDM considers these checks to be part of
"Checks on Guest Non-Register State".

Note that it is legal for vmcs->vmcs_link_pointer to be -1ull
when VMCS shadowing is enabled. In this case, any VMREAD/VMWRITE to a
shadowed field sets the ALU flags for VMfailInvalid (i.e. CF=1).

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:41 +02:00
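
The shape of the link pointer check in f145d90d97 can be sketched as follows. This is a simplified illustration, not the patch itself: the function name, the result enum and the `phys_bits` parameter are assumptions, and the further check on the revision identifier and shadow bit stored in the referenced page is left out.

```c
#include <stdint.h>

#define PAGE_SIZE_4K   4096ULL
#define LINK_PTR_NONE  (~0ULL)   /* the -1ull "no shadow VMCS" value */

/* Hypothetical result codes for this sketch. */
enum link_check { LINK_OK, LINK_INVALID };

/*
 * Sketch of a VM-entry check on the VMCS link pointer.
 * `phys_bits` is the guest's physical address width (assumed input).
 */
static enum link_check check_vmcs_link_pointer(uint64_t link_ptr,
                                               unsigned phys_bits)
{
    /* -1ull is always legal, even with VMCS shadowing enabled. */
    if (link_ptr == LINK_PTR_NONE)
        return LINK_OK;

    /* Otherwise the pointer must be 4 KiB aligned... */
    if (link_ptr & (PAGE_SIZE_4K - 1))
        return LINK_INVALID;

    /* ...and must not set bits beyond the physical address width. */
    if (phys_bits < 64 && (link_ptr >> phys_bits))
        return LINK_INVALID;

    return LINK_OK;
}
```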
Liran Alon
a8a7c02bf7 KVM: nVMX: Verify VMCS shadowing controls
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:40 +02:00
Liran Alon
f792d2743e KVM: nVMX: Introduce nested_cpu_has_shadow_vmcs()
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:40 +02:00
Liran Alon
a6192d40d5 KVM: nVMX: Fail VMLAUNCH and VMRESUME on shadow VMCS
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:39 +02:00
Liran Alon
fa97d7dba7 KVM: nVMX: Allow VMPTRLD for shadow VMCS if vCPU supports VMCS shadowing
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:38 +02:00
Liran Alon
e253674227 KVM: VMX: Change vmcs12_{read,write}_any() to receive vmcs12 as parameter
No functionality change.
This is done as a preparation for VMCS shadowing emulation.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:37 +02:00
Liran Alon
392b2f25aa KVM: VMX: Create struct for VMCS header
No functionality change.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:37 +02:00
Paolo Bonzini
cb5476379f kvm: selftests: add test for nested state save/restore
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:36 +02:00
Jim Mattson
8fcc4b5923 kvm: nVMX: Introduce KVM_CAP_NESTED_STATE
For nested virtualization, L0 KVM manages a bit of state for L2 guests
that cannot be captured through the currently available IOCTLs. In
fact, the state captured through all of these IOCTLs is usually a mix of L1
and L2 state. It also depends on whether the L2 guest was running at
the moment when the process was interrupted to save its state.

With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
that is in VMX operation.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
[karahmed@ - rename structs and functions and make them ready for AMD and
             address previous comments.
           - handle nested.smm state.
           - rebase & a bit of refactoring.
           - Merge 7/8 and 8/8 into one patch. ]
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:58:30 +02:00
Paolo Bonzini
7f7f1ba33c KVM: x86: do not load vmcs12 pages while still in SMM
If the vCPU enters system management mode while running a nested guest,
RSM starts processing the vmentry while still in SMM.  In that case,
however, the pages pointed to by the vmcs12 might be incorrectly
loaded from SMRAM.  To avoid this, delay the handling of the pages
until just before the next vmentry.  This is done with a new request
and a new entry in kvm_x86_ops, which we will be able to reuse for
nested VMX state migration.

Extracted from a patch by Jim Mattson and KarimAllah Ahmed.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:57:58 +02:00
Paolo Bonzini
fa3899add1 kvm: selftests: add basic test for state save and restore
The test calls KVM_RUN repeatedly, and creates an entirely new VM with the
old memory and vCPU state on every exit to userspace.  The kvm_util API is
expanded with two functions that manage the lifetime of a kvm_vm struct:
the first closes the file descriptors and leaves the memory allocated,
and the second opens the file descriptors and reuses the memory from
the previous incarnation of the kvm_vm struct.

For now the test is very basic, as it does not test for example XSAVE or
vCPU events.  However, it will test nested virtualization state starting
with the next patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:32:04 +02:00
Paolo Bonzini
0a505fe6f2 kvm: selftests: ensure vcpu file is released
The selftests were not munmap-ing the kvm_run area from the vcpu file descriptor.
The result was that kvm_vcpu_release was not called and a reference was left in the
parent "struct kvm".  Ultimately this was visible in the upcoming state save/restore
test as an error when KVM attempted to create a duplicate debugfs entry.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:32:04 +02:00
Paolo Bonzini
87ccb7dbb2 kvm: selftests: actually use all of lib/vmx.c
The allocation of the VMXON and VMCS is currently done twice, in
lib/vmx.c and in vmx_tsc_adjust_test.c.  Reorganize the code to
provide a cleaner and easier to use API to the tests.  lib/vmx.c
now does the complete setup of the VMX data structures, but does not
create the VM or set CPUID.  This has to be done by the caller.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:32:03 +02:00
Paolo Bonzini
2305339ee7 kvm: selftests: create a GDT and TSS
The GDT and the TSS base were left to zero, and this has interesting effects
when the TSS descriptor is later read to set up a VMCS's TR_BASE.  Basically
it worked by chance, and this patch fixes it by setting up all the protected
mode data structures properly.

Because the GDT and TSS addresses are virtual, the page tables now always
exist at the time of vcpu setup.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:32:02 +02:00
Paolo Bonzini
44883f01fe KVM: x86: ensure all MSRs can always be KVM_GET/SET_MSR'd
Some of the MSRs returned by GET_MSR_INDEX_LIST currently cannot be sent back
to KVM_GET_MSR and/or KVM_SET_MSR; either they can never be sent back, or
they are only accepted under special conditions.  This makes the API a pain to
use.

To avoid this pain, this patch makes it so that the result of the get-list
ioctl can always be used for host-initiated get and set.  Since we don't have
a separate way to check for read-only MSRs, this means some Hyper-V MSRs are
ignored when written.  Arguably they should not even be in the result of
GET_MSR_INDEX_LIST, but I am leaving them there in case userspace is using the
outcome of GET_MSR_INDEX_LIST to derive the support for the corresponding
Hyper-V feature.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:32:01 +02:00
Sean Christopherson
cf81a7e580 KVM: vmx: remove save/restore of host BNDCFGS MSR
Linux does not support Memory Protection Extensions (MPX) in the
kernel itself, thus the BNDCFGS (Bound Config Supervisor) MSR will
always be zero in the KVM host, i.e. RDMSR in vmx_save_host_state()
is superfluous.  KVM unconditionally sets VM_EXIT_CLEAR_BNDCFGS,
i.e. BNDCFGS will always be zero after VMEXIT, thus manually loading
BNDCFGS is also superfluous.

And in the event the MPX kernel support is added (unlikely given
that MPX for userspace is in its death throes[1]), BNDCFGS will
likely be common across all CPUs[2], and at the least shouldn't
change on a regular basis, i.e. saving the MSR on every VMENTRY is
completely unnecessary.

WARN_ONCE in hardware_setup() if the host's BNDCFGS is non-zero to
document that KVM does not preserve BNDCFGS and to serve as a hint
as to how BNDCFGS likely should be handled if MPX is used in the
kernel, e.g. BNDCFGS should be saved once during KVM setup.

[1] https://lkml.org/lkml/2018/4/27/1046
[2] http://www.openwall.com/lists/kernel-hardening/2017/07/24/28

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:32:00 +02:00
KarimAllah Ahmed
86dafed50e KVM: Switch 'requests' to be 64-bit (explicitly)
Switch 'requests' to be explicitly 64-bit and update BUILD_BUG_ON check to
use the size of "requests" instead of the hard-coded '32'.

That gives us a bit more room again for arch-specific requests as we
already ran out of space for x86 due to the hard-coded check.

The only exception here is ARM32 as it is still 32-bits.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:31:59 +02:00
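
The idea in 86dafed50e, deriving the limit from the size of the field rather than a hard-coded 32, can be sketched in userspace C. The struct, macro and function names below are hypothetical, request number 40 is an arbitrary example, and the plain bit operations stand in for whatever atomic bit operations real kernel code would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vCPU request word. */
struct vcpu_sketch {
    uint64_t requests;
};

/* Number of request bits, derived from the field instead of hard-coded. */
#define REQUEST_BITS (8 * sizeof(((struct vcpu_sketch *)0)->requests))

/* An illustrative request number above the old limit of 32. */
#define REQ_EXAMPLE 40

/* Compile-time check in the spirit of the BUILD_BUG_ON in the commit. */
static_assert(REQ_EXAMPLE < REQUEST_BITS, "request number does not fit");

/* Mark a request as pending (non-atomic in this sketch). */
static void make_request(struct vcpu_sketch *v, unsigned req)
{
    v->requests |= UINT64_C(1) << req;
}

/* Return whether the request was pending, clearing it either way. */
static bool check_and_clear_request(struct vcpu_sketch *v, unsigned req)
{
    uint64_t mask = UINT64_C(1) << req;
    bool pending = (v->requests & mask) != 0;

    v->requests &= ~mask;
    return pending;
}
```

If `requests` were narrowed back to 32 bits, the `static_assert` would fail at build time instead of the shift silently misbehaving.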
Wei Huang
ca35906688 kvm: selftests: add cr4_cpuid_sync_test
KVM is supposed to update some guest VM's CPUID bits (e.g. OSXSAVE) when
CR4 is changed. A bug was found in KVM recently and it was fixed by
Commit c4d2188206 ("KVM: x86: Update cpuid properly when CR4.OSXAVE or
CR4.PKE is changed"). This patch adds a test to verify the synchronization
between guest VM's CR4 and CPUID bits.

Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-08-06 17:31:59 +02:00
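
The invariant that ca35906688 tests can be written as a small predicate. This is a sketch of the property only, not the selftest: the function name is hypothetical, while the bit positions are the architectural ones (CR4 bit 18 for OSXSAVE, CPUID leaf 1 ECX bit 27 for its CPUID mirror).

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_OSXSAVE      (UINT64_C(1) << 18)  /* CR4.OSXSAVE */
#define CPUID_1_ECX_OSXSAVE  (UINT32_C(1) << 27)  /* CPUID.01H:ECX.OSXSAVE */

/*
 * Hypothetical helper: true when the CPUID OSXSAVE bit seen by the guest
 * matches the current CR4.OSXSAVE setting.
 */
static bool cr4_cpuid_in_sync(uint64_t cr4, uint32_t cpuid_1_ecx)
{
    bool cr4_osxsave = (cr4 & X86_CR4_OSXSAVE) != 0;
    bool cpuid_osxsave = (cpuid_1_ecx & CPUID_1_ECX_OSXSAVE) != 0;

    return cr4_osxsave == cpuid_osxsave;
}
```

A guest-side test would read CR4 and execute CPUID leaf 1 after each CR4 change and require this predicate to hold.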
Paolo Bonzini
d2ce98ca0a Linux 4.18-rc6

Merge tag 'v4.18-rc6' into HEAD

Pull bug fixes into the KVM development tree to avoid nasty conflicts.
2018-08-06 17:31:36 +02:00
Paolo Bonzini
85eae57bbb KVM: s390: Features for 4.19
- initial version for host large page support. Must be enabled with
  module parameter hpage=1 and will conflict with the nested=1
  parameter.
- enable etoken facility for guests
- Fixes

Merge tag 'kvm-s390-next-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD

KVM: s390: Features for 4.19

- initial version for host large page support. Must be enabled with
  module parameter hpage=1 and will conflict with the nested=1
  parameter.
- enable etoken facility for guests
- Fixes
2018-08-02 13:57:29 +02:00
Paolo Bonzini
3a1174cd3e PPC KVM update for 4.19.
This update adds no new features; it just has some minor code cleanups
and bug fixes, including a fix to allow us to create KVM_MAX_VCPUS
vCPUs on POWER9 in all CPU threading modes.

Merge tag 'kvm-ppc-next-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD

PPC KVM update for 4.19.

This update adds no new features; it just has some minor code cleanups
and bug fixes, including a fix to allow us to create KVM_MAX_VCPUS
vCPUs on POWER9 in all CPU threading modes.
2018-08-02 13:57:26 +02:00
Janosch Frank
2375846193 KVM: s390: initial host large page support
- must be enabled via module parameter hpage=1
- cannot be used together with nested
- does support migration
- does support hugetlbfs
- no THP yet

Merge tag 'hlp_stage1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvms390/next

KVM: s390: initial host large page support

- must be enabled via module parameter hpage=1
- cannot be used together with nested
- does support migration
- does support hugetlbfs
- no THP yet
2018-07-30 23:20:48 +02:00
Janosch Frank
a449938297 KVM: s390: Add huge page enablement control
General KVM huge page support on s390 has to be enabled via the
kvm.hpage module parameter. Either nested or hpage can be enabled, as
we currently do not support vSIE for huge backed guests. Once the vSIE
support is added we will either drop the parameter or enable it as
default.

For a guest the feature has to be enabled through the new
KVM_CAP_S390_HPAGE_1M capability and the hpage module
parameter. Enabling it means that cmm can't be enabled for the vm and
disables pfmf and storage key interpretation.

This is due to the fact that in some cases, in upcoming patches, we
have to split huge pages in the guest mapping to be able to set more
granular memory protection on 4k pages. These split pages have fake
page tables that are not visible to the Linux memory management which
subsequently will not manage its PGSTEs, while the SIE will. Disabling
these features lets us manage PGSTE data in a consistent manner and
solve that problem.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
2018-07-30 23:13:38 +02:00
Janosch Frank
a9e00d8349 s390/mm: Add huge page gmap linking support
Let's allow huge pmd linking when enabled through the
KVM_CAP_S390_HPAGE_1M capability. Also we can now restrict gmap
invalidation and notification to the cases where the capability has
been activated and save some cycles when that's not the case.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
2018-07-30 23:13:38 +02:00
Dominik Dingel
7d735b9ae8 s390/mm: hugetlb pages within a gmap can not be freed
Guests backed by huge pages could theoretically free unused pages via
the diagnose 10 instruction. We currently don't allow that, so we
don't have to refault it once it's needed again.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2018-07-30 23:13:38 +02:00
Janosch Frank
57cb198cfd KVM: s390: Beautify skey enable check
Let's introduce an explicit check if skeys have already been enabled
for the vcpu, so we don't have to check the mm context if we don't have
the storage key facility.

This lets us check for enablement without having to take the mm
semaphore and thus speedup skey emulation.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-07-30 17:05:52 +02:00
Janosch Frank
bd096f6443 KVM: s390: Add skey emulation fault handling
When doing skey emulation for huge guests, we now need to fault in
pmds, as we don't have PGSTEs anymore to store them when we do not
have valid table entries.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2018-07-30 11:20:18 +01:00
Janosch Frank
637ff9efe5 s390/mm: Add huge pmd storage key handling
Storage keys for guests with huge page mappings have to be managed in
hardware. There are no PGSTEs for PMDs that we could use to retain the
guest's logical view of the key.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
2018-07-30 11:20:18 +01:00
Janosch Frank
3afdfca698 s390/mm: Clear skeys for newly mapped huge guest pmds
Similarly to the pte skey handling, where we set the storage key to
the default key for each newly mapped pte, we have to also do that for
huge pmds.

With the PG_arch_1 flag we keep track if the area has already been
cleared of its skeys.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2018-07-30 11:20:18 +01:00
Dominik Dingel
964c2c05c9 s390/mm: Clear huge page storage keys on enable_skey
When a guest starts using storage keys, we trap and set a default one
for its whole valid address space. With this patch we are now able to
do that for large pages.

To speed up the storage key insertion, we use
__storage_key_init_range, which in turn will use sske_frame to set
multiple storage keys with one instruction. As it has been previously
used for debugging we have to get rid of the default key check and make
it quiescing.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
[replaced page_set_storage_key loop with __storage_key_init_range]
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
2018-07-30 11:20:18 +01:00
Janosch Frank
0959e16867 s390/mm: Add huge page dirty sync support
To do dirty logging with huge pages, we protect huge pmds in the
gmap. When they are written to, we unprotect them and mark them dirty.

We introduce the function gmap_test_and_clear_dirty_pmd which handles
dirty sync for huge pages.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
2018-07-30 11:20:18 +01:00
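
The write-protect scheme described in 0959e16867 can be modelled with a toy state machine. This is only an illustration of the technique: the struct and function names are hypothetical, and a real gmap pmd encodes this state in hardware and software bits rather than two booleans.

```c
#include <stdbool.h>

/* Toy model of one huge gmap pmd; all names here are hypothetical. */
struct huge_pmd_sketch {
    bool write_protected;  /* writes trap while this is set */
    bool dirty;            /* written since the last sync */
};

/* Arm dirty tracking: write-protect so the first write traps. */
static void start_dirty_tracking(struct huge_pmd_sketch *pmd)
{
    pmd->write_protected = true;
    pmd->dirty = false;
}

/* Write fault path: unprotect and remember that the range was written. */
static void handle_guest_write(struct huge_pmd_sketch *pmd)
{
    if (pmd->write_protected) {
        pmd->write_protected = false;
        pmd->dirty = true;
    }
}

/* Sync path: report and clear the dirty state, then re-protect. */
static bool test_and_clear_dirty(struct huge_pmd_sketch *pmd)
{
    bool was_dirty = pmd->dirty;

    pmd->dirty = false;
    pmd->write_protected = true;
    return was_dirty;
}
```

Only the first write after each sync takes a fault; later writes to the same huge page run at full speed until the next sync re-protects it.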
Janosch Frank
6a3762778d s390/mm: Add gmap pmd invalidation and clearing
If the host invalidates a pmd, we also have to invalidate the
corresponding gmap pmds, as well as flush them from the TLB. This is
necessary, as we don't share the pmd tables between host and guest as
we do with ptes.

The clearing part of these three new functions sets a guest pmd entry
to _SEGMENT_ENTRY_EMPTY, so the guest will fault on it and we will
re-link it.

Flushing the gmap is not necessary in the host's lazy local and csp
cases. Both purge the TLB completely.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
2018-07-30 11:20:17 +01:00
Janosch Frank
7c4b13a7c0 s390/mm: Add gmap pmd notification bit setting
As with ptes, we also need invalidation notification for pmds, to
make sure the guest lowcore pages are always accessible and to allow the
later addition of shadowed pmds.

With PMDs we do not have PGSTEs or some other bits we could use in the
host PMD. Instead we pick one of the free bits in the gmap PMD. Every
time a host pmd will be invalidated, we will check if the respective
gmap PMD has the bit set and in that case fire up the notifier.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2018-07-30 11:20:17 +01:00
Janosch Frank
58b7e200d2 s390/mm: Add gmap pmd linking
Let's allow pmds to be linked into gmap for the upcoming s390 KVM huge
page support.

Before this patch we copied the full userspace pmd entry. This is not
correct, as it contains SW defined bits that might be interpreted
differently in the GMAP context. Now we only copy over all hardware
relevant information leaving out the software bits.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
2018-07-30 11:20:17 +01:00
Janosch Frank
2c46e974dd s390/mm: Abstract gmap notify bit setting
Currently we use the software PGSTE bits PGSTE_IN_BIT and
PGSTE_VSIE_BIT to notify before an invalidation occurs on a prefix
page or a VSIE page respectively. Both bits are pgste specific, but
are used when protecting a memory range.

Let's introduce abstract GMAP_NOTIFY_* bits that will be realized into
the respective bits when gmap DAT table entries are protected.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
2018-07-30 11:20:17 +01:00
Janosch Frank
5a045bb9c4 s390/mm: Make gmap_protect_range more modular
This patch reworks the gmap_protect_range logic and extracts the pte
handling into its own function. We also now walk to the pmd and make
it accessible in the function for later use. This way we can add huge
page handling logic more easily.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2018-07-30 11:20:17 +01:00
Paul Mackerras
b5c6f7607b KVM: PPC: Book3S HV: Read kvm->arch.emul_smt_mode under kvm->lock
Commit 1e175d2 ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full
VCPU ID space", 2018-07-25) added code that uses kvm->arch.emul_smt_mode
before any VCPUs are created.  However, userspace can change
kvm->arch.emul_smt_mode at any time up until the first VCPU is created.
Hence it is (theoretically) possible for the check in
kvmppc_core_vcpu_create_hv() to race with another userspace thread
changing kvm->arch.emul_smt_mode.

This fixes it by moving the test that uses kvm->arch.emul_smt_mode into
the block where kvm->lock is held.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-07-26 15:38:41 +10:00
Paul Mackerras
1ebe6b81eb KVM: PPC: Book3S HV: Allow creating max number of VCPUs on POWER9
Commit 1e175d2 ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full
VCPU ID space", 2018-07-25) allowed use of VCPU IDs up to
KVM_MAX_VCPU_ID on POWER9 in all guest SMT modes and guest emulated
hardware SMT modes.  However, with the current definition of
KVM_MAX_VCPU_ID, a guest SMT mode of 1 and an emulated SMT mode of 8,
it is only possible to create KVM_MAX_VCPUS / 2 VCPUS, because
threads_per_subcore is 4 on POWER9 CPUs.  (Using an emulated SMT mode
of 8 is useful when migrating VMs to or from POWER8 hosts.)

This increases KVM_MAX_VCPU_ID to 8 * KVM_MAX_VCPUS when HV KVM is
configured in, so that a full complement of KVM_MAX_VCPUS VCPUs can
be created on POWER9 in all guest SMT modes and emulated hardware
SMT modes.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-07-26 14:53:54 +10:00
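
The arithmetic behind 1ebe6b81eb can be checked with a short calculation. The sketch below assumes that VCPU IDs must be below the ID limit and that each virtual core occupies one stride of IDs. The function name is hypothetical, the value 2048 in the usage note is an arbitrary stand-in for KVM_MAX_VCPUS, and the old limit of 4 * KVM_MAX_VCPUS is inferred from the commit message rather than quoted from the code.

```c
/*
 * Hypothetical helper: how many VCPUs can be created when IDs must be
 * below `max_vcpu_id`, cores are spaced `emul_smt` IDs apart, and the
 * guest uses `guest_smt` threads per core.
 */
static unsigned creatable_vcpus(unsigned max_vcpus, unsigned max_vcpu_id,
                                unsigned emul_smt, unsigned guest_smt)
{
    unsigned cores = max_vcpu_id / emul_smt;  /* one core per stride */
    unsigned n = cores * guest_smt;

    /* The overall VCPU count limit still applies. */
    return n < max_vcpus ? n : max_vcpus;
}
```

With 2048 as the VCPU limit, `creatable_vcpus(2048, 4 * 2048, 8, 1)` gives 1024, half the limit, while `creatable_vcpus(2048, 8 * 2048, 8, 1)` gives the full 2048.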
Sam Bobroff
1e175d2e07 KVM: PPC: Book3S HV: Pack VCORE IDs to access full VCPU ID space
It is not currently possible to create the full number of possible
VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
threads per core than its core stride (or "VSMT mode"). This is
because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
even though the VCPU ID is less than KVM_MAX_VCPU_ID.

To address this, "pack" the VCORE ID and XIVE offsets by using
knowledge of the way the VCPU IDs will be used when there are fewer
guest threads per core than the core stride. The primary thread of
each core will always be used first. Then, if the guest uses more than
one thread per core, these secondary threads will sequentially follow
the primary in each core.

So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
VCPUs are being spaced apart, so at least half of each core is empty,
and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
into the second half of each core (4..7, in an 8-thread core).

Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
each core is being left empty, and we can map down into the second and
third quarters of each core (2, 3 and 5, 6 in an 8-thread core).

Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
threads are being used and 7/8 of the core is empty, allowing use of
the 1, 5, 3 and 7 thread slots.

(Strides less than 8 are handled similarly.)

This allows the VCORE ID or offset to be calculated quickly from the
VCPU ID or XIVE server numbers, without access to the VCPU structure.

[paulus@ozlabs.org - tidied up comment a little, changed some WARN_ONCE
 to pr_devel, wrapped line, fixed id check.]

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-07-26 13:23:52 +10:00
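
The packing scheme described in 1e175d2e07 can be sketched as a table lookup. This is a reconstruction from the commit message, not a copy of the patch: the function name, the error value and the limit of 1024 (standing in for KVM_MAX_VCPUS) are assumptions, and the offset table simply follows the order in which the message hands out thread slots (second half first, then quarters, then the odd slots 1, 5, 3 and 7).

```c
#include <stdint.h>

#define MAX_SMT_THREADS   8
#define SKETCH_MAX_VCPUS  1024  /* hypothetical stand-in for KVM_MAX_VCPUS */

/*
 * Map a VCPU ID onto a packed ID below SKETCH_MAX_VCPUS.
 * `stride` is the emulated SMT mode and must be 1, 2, 4 or 8.
 * Returns UINT32_MAX when the ID cannot be packed.
 */
static uint32_t pack_vcpu_id(uint32_t id, unsigned stride)
{
    /* Thread-slot offset used for each successive block of IDs. */
    static const unsigned block_offsets[MAX_SMT_THREADS] = {
        0, 4, 2, 6, 1, 5, 3, 7
    };
    unsigned block = (id / SKETCH_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
    uint32_t packed;

    if (block >= MAX_SMT_THREADS)
        return UINT32_MAX;  /* ID too large to pack */

    packed = (id % SKETCH_MAX_VCPUS) + block_offsets[block];
    if (packed >= SKETCH_MAX_VCPUS)
        return UINT32_MAX;  /* packing failed */

    return packed;
}
```

With a stride of 8, IDs below the limit are returned unchanged, ID 1024 lands in slot 4 of the first core, and ID 2048 lands in slot 2. The calculation needs only the ID and the stride, so no VCPU structure has to be consulted.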
Linus Torvalds
d72e90f33a Linux 4.18-rc6 2018-07-22 14:12:20 -07:00
Linus Torvalds
7441308421 NVMe fixes for 4.18-rc6:
- fix a regression in 4.18 that causes a memory leak on probe failure
  (Keith Busch)
- fix a deadlock in the passthrough ioctl code (Scott Bauer)
- don't enable AENs if not supported (Weiping Zhang)
- fix an old regression in metadata handling in the passthrough ioctl
  code (Roland Dreier)

Merge tag 'nvme-for-4.18' of git://git.infradead.org/nvme

Pull NVMe fixes from Christoph Hellwig:

 - fix a regression in 4.18 that causes a memory leak on probe failure
   (Keith Busch)

 - fix a deadlock in the passthrough ioctl code (Scott Bauer)

 - don't enable AENs if not supported (Weiping Zhang)

 - fix an old regression in metadata handling in the passthrough ioctl
   code (Roland Dreier)

* tag 'nvme-for-4.18' of git://git.infradead.org/nvme:
  nvme: fix handling of metadata_len for NVME_IOCTL_IO_CMD
  nvme: don't enable AEN if not supported
  nvme: ensure forward progress during Admin passthru
  nvme-pci: fix memory leak on probe failure
2018-07-22 13:21:45 -07:00
Linus Torvalds
165ea0d1c2 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Fix several places that screw up cleanups after failures halfway
  through opening a file (one open-coding filp_clone_open() and getting
  it wrong, two misusing alloc_file()). That part is -stable fodder from
  the 'work.open' branch.

  And Christoph's regression fix for uapi breakage in aio series;
  include/uapi/linux/aio_abi.h shouldn't be pulling in the kernel
  definition of sigset_t, the reason for doing so in the first place had
  been bogus - there's no need to expose struct __aio_sigset in
  aio_abi.h at all"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  aio: don't expose __aio_sigset in uapi
  ocxlflash_getfile(): fix double-iput() on alloc_file() failures
  cxl_getfile(): fix double-iput() on alloc_file() failures
  drm_mode_create_lease_ioctl(): fix open-coded filp_clone_open()
2018-07-22 12:04:51 -07:00