linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-20 09:31:09 +00:00

Author	SHA1	Message	Date
Avi Kivity	e4e517b4be	KVM: MMU: Do not unconditionally read PDPTE from guest memory Architecturally, PDPTEs are cached in the PDPTRs when CR3 is reloaded. On SVM, it is not possible to implement this, but on VMX this is possible and was indeed implemented until nested SVM changed this to unconditionally read PDPTEs dynamically. This has a noticeable impact when running PAE guests. Fix by changing the MMU to read PDPTRs from the cache, falling back to reading from memory for the nested MMU. Signed-off-by: Avi Kivity <avi@redhat.com> Tested-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:18:01 +03:00
Julia Lawall	cf3ace79c0	KVM: VMX: trivial: use BUG_ON Use BUG_ON(x) rather than if(x) BUG(); The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ identifier x; @@ -if (x) BUG(); +BUG_ON(x); @@ identifier x; @@ -if (!x) BUG(); +BUG_ON(!x); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:18:01 +03:00
Marcelo Tosatti	742bc67042	KVM: x86: report valid microcode update ID Windows Server 2008 SP2 checked build with smp > 1 BSOD's during boot due to lack of microcode update: * Assertion failed: The system BIOS on this machine does not properly support the processor. The system BIOS did not load any microcode update. A BIOS containing the latest microcode update is needed for system reliability. (CurrentUpdateRevision != 0) * Source File: d:\longhorn\base\hals\update\intelupd\update.c, line 440 Report a non-zero microcode update signature to make it happy. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:18:01 +03:00
Takuya Yoshikawa	1d2887e2d8	KVM: x86 emulator: Make x86_decode_insn() return proper macros Return EMULATION_OK/FAILED consistently. Also treat instruction fetch errors, not restricted to X86EMUL_UNHANDLEABLE, as EMULATION_FAILED; although this cannot happen in practice, the current logic will continue the emulation even if the decoder fails to fetch the instruction. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:18:01 +03:00
Takuya Yoshikawa	7d88bb4803	KVM: x86 emulator: Let compiler know insn_fetch() rarely fails Fetching the instruction which was to be executed by the guest cannot fail normally. So compiler should always predict that it will succeed. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:18:00 +03:00
Takuya Yoshikawa	e85a10852c	KVM: x86 emulator: Drop _size argument from insn_fetch() _type is enough to know the size. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:59 +03:00
Takuya Yoshikawa	807941b121	KVM: x86 emulator: Use ctxt->_eip directly in do_insn_fetch_byte() Instead of passing ctxt->_eip from insn_fetch() call sites, get it from ctxt in do_insn_fetch_byte(). This is done by replacing the argument _eip of insn_fetch() with _ctxt, which should be better than letting the macro use ctxt silently in its body. Though this changes the place where ctxt->_eip is incremented from insn_fetch() to do_insn_fetch_byte(), this does not have any real effect. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:59 +03:00
Sasha Levin	743eeb0b01	KVM: Intelligent device lookup on I/O bus Currently the method of dealing with an IO operation on a bus (PIO/MMIO) is to call the read or write callback for each device registered on the bus until we find a device which handles it. Since the number of devices on a bus can be significant due to ioeventfds and coalesced MMIO zones, this leads to a lot of overhead on each IO operation. Instead of registering devices, we now register ranges which point to a device. Lookup is done using an efficient bsearch instead of a linear search. A performance test was conducted by comparing exit counts per second with 200 ioeventfds created on one byte while the guest continuously accessed a different byte (triggering usermode exits). Before the patch the guest achieved 259k exits per second; after the patch it achieved 274k exits per second. Cc: Avi Kivity <avi@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:59 +03:00
Stefan Hajnoczi	0d460ffc09	KVM: Use __print_symbolic() for vmexit tracepoints The vmexit tracepoints format the exit_reason to make it human-readable. Since the exit_reason depends on the instruction set (vmx or svm), formatting is handled with ftrace_print_symbols_seq() by referring to the appropriate exit reason table. However, the ftrace_print_symbols_seq() function is not meant to be used directly in tracepoints since it does not export the formatting table which userspace tools like trace-cmd and perf use to format traces. In practice perf dies when formatting vmexit-related events and trace-cmd falls back to printing the numeric value (with extra formatting code in the kvm plugin to paper over this limitation). Other userspace consumers of vmexit-related tracepoints would be in similar trouble. To avoid significant changes to the kvm_exit tracepoint, this patch moves the vmx and svm exit reason tables into arch/x86/kvm/trace.h and selects the right table with __print_symbolic() depending on the instruction set. Note that __print_symbolic() is designed for exporting the formatting table to userspace and allows trace-cmd and perf to work. Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:59 +03:00
Stefan Hajnoczi	e097e5ffd6	KVM: Record instruction set in all vmexit tracepoints The kvm_exit tracepoint recently added the isa argument to aid decoding exit_reason. The semantics of exit_reason depend on the instruction set (vmx or svm) and the isa argument allows traces to be analyzed on other machines. Add the isa argument to kvm_nested_vmexit and kvm_nested_vmexit_inject so these tracepoints can also be self-describing. Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:58 +03:00
Mike Waychison	d1613ad5d0	KVM: Really fix HV_X64_MSR_APIC_ASSIST_PAGE Commit 0945d4b228 tried to fix the get_msr path for the HV_X64_MSR_APIC_ASSIST_PAGE msr, but was poorly tested. We should be returning 0 if the read succeeded, and passing the value back to the caller via the pdata out argument, not returning the value directly. Signed-off-by: Mike Waychison <mikew@google.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:58 +03:00
Mike Waychison	14fa67ee95	KVM: x86: get_msr support for HV_X64_MSR_APIC_ASSIST_PAGE "get" support for the HV_X64_MSR_APIC_ASSIST_PAGE msr was missing, even though it is explicitly enumerated as something the vmm should save in msrs_to_save and reported to userland via the KVM_GET_MSR_INDEX_LIST ioctl. Add "get" support for HV_X64_MSR_APIC_ASSIST_PAGE. We simply return the guest visible value of this register, which seems to be correct as a set on the register is validated for us already. Signed-off-by: Mike Waychison <mikew@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:17:58 +03:00
Sasha Levin	8c3ba334f8	KVM: x86: Raise the hard VCPU count limit The patch raises the hard limit of VCPU count to 254. This will allow developers to easily work on scalability and will allow users to test high VCPU setups easily without patching the kernel. To prevent possible issues with current setups, KVM_CAP_NR_VCPUS now returns the recommended VCPU limit (which is still 64) - this should be a safe value for everybody, while a new KVM_CAP_MAX_VCPUS returns the hard limit which is now 254. Cc: Avi Kivity <avi@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Suggested-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:17:57 +03:00
Xiao Guangrong	22388a3c8c	KVM: x86: cleanup the code of read/write emulation Use the abstracted read/write operations to remove the duplicated code Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:17 +03:00
Xiao Guangrong	77d197b2ca	KVM: x86: abstract the operation for read/write emulation The operations of read emulation and write emulation are very similar, so we can abstract them; a later patch uses this abstraction to clean up the duplicated code Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:17 +03:00
Xiao Guangrong	ca7d58f375	KVM: x86: fix broken read emulation spans a page boundary If the range spans a page boundary, the mmio access can break; fix it the same way write emulation handles it. Since we already have the guest physical address, use it to read guest data directly and avoid walking the guest page table again Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:17 +03:00
Avi Kivity	9be3be1f15	KVM: x86 emulator: fix Src2CL decode Src2CL decode (used for double width shifts) erroneously decodes only bit 3 of %rcx, instead of bits 7:0. Fix by decoding %cl in its entirety. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:14:58 +03:00
Zhao Jin	41bc3186b3	KVM: MMU: fix incorrect return of spte __update_clear_spte_slow should return original spte while the current code returns low half of original spte combined with high half of new spte. Signed-off-by: Zhao Jin <cronozhj@gmail.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:13:25 +03:00
Linus Torvalds	ab7e2dbf9b	Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: uses TASKSTATS, depends on NET KVM: fix TASK_DELAY_ACCT kconfig warning	2011-08-16 11:14:44 -07:00
Randy Dunlap	df3d8ae1f8	KVM: uses TASKSTATS, depends on NET CONFIG_TASKSTATS just had a change to use netlink, including a change to "depends on NET". Since "select" does not follow dependencies, KVM also needs to depend on NET to prevent build errors when CONFIG_NET is not enabled. Sample of the reported "undefined reference" build errors: taskstats.c:(.text+0x8f686): undefined reference to `nla_put' taskstats.c:(.text+0x8f721): undefined reference to `nla_reserve' taskstats.c:(.text+0x8f8fb): undefined reference to `init_net' taskstats.c:(.text+0x8f905): undefined reference to `netlink_unicast' taskstats.c:(.text+0x8f934): undefined reference to `kfree_skb' taskstats.c:(.text+0x8f9e9): undefined reference to `skb_clone' taskstats.c:(.text+0x90060): undefined reference to `__alloc_skb' taskstats.c:(.text+0x901e9): undefined reference to `skb_put' taskstats.c:(.init.text+0x4665): undefined reference to `genl_register_family' taskstats.c:(.init.text+0x4699): undefined reference to `genl_register_ops' taskstats.c:(.init.text+0x4710): undefined reference to `genl_unregister_ops' taskstats.c:(.init.text+0x471c): undefined reference to `genl_unregister_family' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-08-16 19:00:41 +03:00
Randy Dunlap	fd079facb3	KVM: fix TASK_DELAY_ACCT kconfig warning Fix kconfig dependency warning: warning: (KVM) selects TASK_DELAY_ACCT which has unmet direct dependencies (TASKSTATS) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-27 14:22:06 +03:00
Arun Sharma	60063497a9	atomic: use <linux/atomic.h> This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:47 -07:00
Linus Torvalds	d3ec4844d4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) fs: Merge split strings treewide: fix potentially dangerous trailing ';' in #defined values/expressions uwb: Fix misspelling of neighbourhood in comment net, netfilter: Remove redundant goto in ebt_ulog_packet trivial: don't touch files that are removed in the staging tree lib/vsprintf: replace link to Draft by final RFC number doc: Kconfig: `to be' -> `be' doc: Kconfig: Typo: square -> squared doc: Konfig: Documentation/power/{pm => apm-acpi}.txt drivers/net: static should be at beginning of declaration drivers/media: static should be at beginning of declaration drivers/i2c: static should be at beginning of declaration XTENSA: static should be at beginning of declaration SH: static should be at beginning of declaration MIPS: static should be at beginning of declaration ARM: static should be at beginning of declaration rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check Update my e-mail address PCIe ASPM: forcedly -> forcibly gma500: push through device driver tree ... Fix up trivial conflicts: - arch/arm/mach-ep93xx/dma-m2p.c (deleted) - drivers/gpio/gpio-ep93xx.c (renamed and context nearby) - drivers/net/r8169.c (just context changes)	2011-07-25 13:56:39 -07:00
Linus Torvalds	5fabc487c9	Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits) KVM: IOMMU: Disable device assignment without interrupt remapping KVM: MMU: trace mmio page fault KVM: MMU: mmio page fault support KVM: MMU: reorganize struct kvm_shadow_walk_iterator KVM: MMU: lockless walking shadow page table KVM: MMU: do not need atomicly to set/clear spte KVM: MMU: introduce the rules to modify shadow page table KVM: MMU: abstract some functions to handle fault pfn KVM: MMU: filter out the mmio pfn from the fault pfn KVM: MMU: remove bypass_guest_pf KVM: MMU: split kvm_mmu_free_page KVM: MMU: count used shadow pages on prepareing path KVM: MMU: rename 'pt_write' to 'emulate' KVM: MMU: cleanup for FNAME(fetch) KVM: MMU: optimize to handle dirty bit KVM: MMU: cache mmio info on page fault path KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code KVM: MMU: do not update slot bitmap if spte is nonpresent KVM: MMU: fix walking shadow page table KVM guest: KVM Steal time registration ...	2011-07-24 09:07:03 -07:00
Xiao Guangrong	4f0226482d	KVM: MMU: trace mmio page fault Add tracepoints to trace mmio page fault Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:41 +03:00
Xiao Guangrong	ce88decffd	KVM: MMU: mmio page fault support The idea is from Avi: | We could cache the result of a miss in an spte by using a reserved bit, and | checking the page fault error code (or seeing if we get an ept violation or | ept misconfiguration), so if we get repeated mmio on a page, we don't need to | search the slot list/tree. | (https://lkml.org/lkml/2011/2/22/221) When the page fault is caused by mmio, we cache the info in the shadow page table, and also set the reserved bits in the shadow page table, so if the mmio is caused again, we can quickly identify it and emulate it directly. Searching for the mmio gfn in memslots is heavy since we need to walk all memslots; this feature reduces that cost and also avoids walking the guest page table for soft mmu. [jan: fix operator precedence issue] Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:40 +03:00
Xiao Guangrong	dd3bfd59db	KVM: MMU: reorganize struct kvm_shadow_walk_iterator Reorganize it for good using the cache Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:39 +03:00
Xiao Guangrong	c2a2ac2b56	KVM: MMU: lockless walking shadow page table Use RCU to keep shadow page tables from being freed during a walk, so we can walk them safely; this should be fast and is needed by mmio page fault support Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:38 +03:00
Xiao Guangrong	603e0651cf	KVM: MMU: do not need atomicly to set/clear spte Now, the spte only goes from nonpresent to present or from present to nonpresent, so we can use a trick to set/clear the spte non-atomically, as the Linux kernel does Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:37 +03:00
Xiao Guangrong	1df9f2dc39	KVM: MMU: introduce the rules to modify shadow page table Introduce some interfaces to modify the spte as the Linux kernel does: - mmu_spte_clear_track_bits: sets the spte from present to nonpresent and tracks the stat bits (accessed/dirty) of the spte - mmu_spte_clear_no_track: the same as mmu_spte_clear_track_bits, but without tracking the stat bits - mmu_spte_set: sets the spte from nonpresent to present - mmu_spte_update: only updates the stat bits. Setting an spte from present to present is no longer allowed; later, we can drop the atomic operation for X86_32 hosts. This is preparatory work for reading sptes on X86_32 hosts outside the mmu lock Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:36 +03:00
Xiao Guangrong	d7c55201e6	KVM: MMU: abstract some functions to handle fault pfn Introduce handle_abnormal_pfn to handle fault pfn on page fault path, introduce mmu_invalid_pfn to handle fault pfn on prefetch path It is the preparing work for mmio page fault support Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:35 +03:00
Xiao Guangrong	fce92dce79	KVM: MMU: filter out the mmio pfn from the fault pfn If the page fault is caused by mmio, the gfn cannot be found in memslots, and 'bad_pfn' is returned on the gfn_to_hva path, so we can use 'bad_pfn' to identify the mmio page fault. To clarify the meaning of the mmio pfn, we return the fault page instead of the bad page when the gfn is not allowed to prefetch Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:34 +03:00
Xiao Guangrong	c37079586f	KVM: MMU: remove bypass_guest_pf The idea is from Avi: | Maybe it's time to kill off bypass_guest_pf=1. It's not as effective as | it used to be, since unsync pages always use shadow_trap_nonpresent_pte, | and since we convert between the two nonpresent_ptes during sync and unsync. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:33 +03:00
Xiao Guangrong	bd4c86eaa6	KVM: MMU: split kvm_mmu_free_page Split kvm_mmu_free_page to kvm_mmu_isolate_page and kvm_mmu_free_page One is used to remove the page from cache under mmu lock and the other is used to free page table out of mmu lock Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:32 +03:00
Xiao Guangrong	aa6bd187af	KVM: MMU: count used shadow pages on prepareing path Move the counting of used shadow pages from the committing path to the preparing path to reduce tlb flushes on some paths Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:31 +03:00
Xiao Guangrong	b90a0e6c81	KVM: MMU: rename 'pt_write' to 'emulate' If 'pt_write' is true, we need to emulate the fault. And in later patch, we need to emulate the fault even though it is not a pt_write event, so rename it to better fit the meaning Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:30 +03:00
Xiao Guangrong	b36c7a7c10	KVM: MMU: cleanup for FNAME(fetch) gw->pte_access is the final access permission, since it is unified with gw->pt_access when we walked guest page table: FNAME(walk_addr_generic): pte_access = pt_access & FNAME(gpte_access)(vcpu, pte, true); Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:29 +03:00
Xiao Guangrong	640d9b0dbe	KVM: MMU: optimize to handle dirty bit If the dirty bit is not set, we can make the pte access read-only to avoid handling the dirty bit everywhere Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:27 +03:00
Xiao Guangrong	bebb106a5a	KVM: MMU: cache mmio info on page fault path If the page fault is caused by mmio, we can cache the mmio info, later, we do not need to walk guest page table and quickly know it is a mmio fault while we emulate the mmio instruction Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:26 +03:00
Xiao Guangrong	af7cc7d1ee	KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code Introduce vcpu_mmio_gva_to_gpa to translate the gva to gpa, we can use it to cleanup the code between read emulation and write emulation Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:25 +03:00
Xiao Guangrong	ffb61bb3bc	KVM: MMU: do not update slot bitmap if spte is nonpresent Set slot bitmap only if the spte is present Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:24 +03:00
Xiao Guangrong	052331bea3	KVM: MMU: fix walking shadow page table Properly check the last mapping, and do not walk to the next level if last spte is met Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:23 +03:00
Ohad Ben-Cohen	e72542191c	virtio: expose for non-virtualization users too virtio has been so far used only in the context of virtualization, and the virtio Kconfig was sourced directly by the relevant arch Kconfigs when VIRTUALIZATION was selected. Now that we start using virtio for inter-processor communications, we need to source the virtio Kconfig outside of the virtualization scope too. Moreover, some architectures might use virtio for both virtualization and inter-processor communications, so directly sourcing virtio might yield unexpected results due to conflicting selections. The simple solution offered by this patch is to always source virtio's Kconfig in drivers/Kconfig, and remove it from the appropriate arch Kconfigs. Additionally, a virtio menu entry has been added so virtio drivers don't show up in the general drivers menu. This way anyone can use virtio, though it's arguably less accessible (and neat!) for virtualization users now. Note: some architectures (mips and sh) seem to have a VIRTUALIZATION menu merely for sourcing virtio's Kconfig, so that menu is removed too. Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-07-23 16:20:30 +09:30
Glauber Costa	c9aaa8957f	KVM: Steal time implementation To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM, while the vcpu had meaningful work to do - halt time does not count. This information is acquired through the run_delay field of delayacct/schedstats infrastructure, that counts time spent in a runqueue but not running. Steal time is per-cpu information, so the traditional MSR-based infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the memory area address containing information about steal time. This patch contains the hypervisor part of the steal time infrastructure, and can be backported independently of the guest portion. [avi, yongjie: export delayacct_on, to avoid build failures in some configs] Signed-off-by: Glauber Costa <glommer@redhat.com> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Rik van Riel <riel@redhat.com> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Yongjie Ren <yongjie.ren@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-14 12:59:14 +03:00
Takuya Yoshikawa	3c8c652ae4	KVM: MMU: Introduce is_last_gpte() to clean up walk_addr_generic() Suggested by Ingo and Avi. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:44 +03:00
Takuya Yoshikawa	92c1c1e85b	KVM: MMU: Rename the walk label in walk_addr_generic() The current name does not explain the meaning well. So give it a better name "retry_walk" to show that we are trying the walk again. This was suggested by Ingo Molnar. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:43 +03:00
Takuya Yoshikawa	134291bf3c	KVM: MMU: Clean up the error handling of walk_addr_generic() Avoid two step jump to the error handling part. This eliminates the use of the variables present and rsvd_fault. We also use the const type qualifier to show that write/user/fetch_fault do not change in the function. Both of these were suggested by Ingo Molnar. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:42 +03:00
Marcelo Tosatti	f8f7e5ee10	Revert "KVM: MMU: make kvm_mmu_reset_context() flush the guest TLB" This reverts commit bee931d31e588b8eb86b7edee32fac2d16930cd7. TLB flush should be done lazily during guest entry, in kvm_mmu_load(). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:41 +03:00
Avi Kivity	45bd07b9d5	KVM: MMU: make kvm_mmu_reset_context() flush the guest TLB kvm_set_cr0() and kvm_set_cr4(), and possible other functions, assume that kvm_mmu_reset_context() flushes the guest TLB. However, it does not. Fix by flushing the tlb (and syncing the new root as well). Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:27 +03:00
Avi Kivity	411c588dfb	KVM: MMU: Adjust shadow paging to work when SMEP=1 and CR0.WP=0 When CR0.WP=0, we sometimes map user pages as kernel pages (to allow the kernel to write to them). Unfortunately this also allows the kernel to fetch from these pages, even if CR4.SMEP is set. Adjust for this by also setting NX on the spte in these circumstances. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:26 +03:00
Yang, Wei	a01c8f9b4e	KVM: Enable ERMS feature support for KVM This patch exposes ERMS feature to KVM guests. The REP MOVSB/STOSB instruction can enhance fast strings attempts to move as much of the data with larger size load/stores as possible. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:25 +03:00
Yang, Wei	176f61da82	KVM: Expose RDWRGSFS bit to KVM guests This patch exposes RDWRGSFS bit to KVM guests. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:24 +03:00
Yang, Wei	74dc2b4ffe	KVM: Add RDWRGSFS support when setting CR4 This patch adds RDWRGSFS support when setting CR4. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:23 +03:00
Yang, Wei Y	4a00efdf0c	KVM: Enable DRNG feature support for KVM This patch exposes DRNG feature to KVM guests. The RDRAND instruction can provide software with sequences of random numbers generated from white noise. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:21 +03:00
Andre Przywara	02668b061d	KVM: fix XSAVE bit scanning (now properly) commit 123108f1c1aafd51d6a5c79cc04d7999dd88a930 tried to fix KVM's XSAVE valid feature scanning, but it was wrong. It was not considering the sparse nature of this bitfield, instead reading values from uninitialized members of the entries array. This patch now separates subleaf indices from KVM's array indices and fills the entry before querying its value. This fixes AVX support in KVM guests. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:20 +03:00
Yang, Wei Y	e57d4a356a	KVM: Add instruction fetch checking when walking guest page table This patch adds instruction fetch checking when walking guest page table, to implement SMEP when emulating instead of executing natively. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Shan, Haitao <haitao.shan@intel.com> Signed-off-by: Li, Xin <xin.li@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:15 +03:00
Yang, Wei Y	611c120f74	KVM: Mask function7 ebx against host capability word9 This patch masks CPUID leaf 7 ebx against host capability word9. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Shan, Haitao <haitao.shan@intel.com> Signed-off-by: Li, Xin <xin.li@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:14 +03:00
Yang, Wei Y	c68b734fba	KVM: Add SMEP support when setting CR4 This patch adds SMEP handling when setting CR4. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Shan, Haitao <haitao.shan@intel.com> Signed-off-by: Li, Xin <xin.li@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:13 +03:00
Nadav Har'El	509c75ea19	KVM: nVMX: Fix bug preventing more than two levels of nesting The nested VMX feature is supposed to fully emulate VMX for the guest. This (theoretically) not only allows it to run its own guests, but also to further emulate VMX for its own guests, and allow arbitrarily deep nesting. This patch fixes a bug (discovered by Kevin Tian) in handling a VMLAUNCH by L2, which prevented deeper nesting. Deeper nesting now works (I only actually tested L3), but is currently absurdly slow, to the point of being unusable. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:11 +03:00
Avi Kivity	9dac77fa40	KVM: x86 emulator: fold decode_cache into x86_emulate_ctxt This saves a lot of pointless casts between x86_emulate_ctxt and decode_cache. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:09 +03:00
Avi Kivity	36dd9bb5ce	KVM: x86 emulator: rename decode_cache::eip to _eip The name eip conflicts with a field of the same name in x86_emulate_ctxt, which we plan to fold decode_cache into. The name _eip is unfortunate, but what's really needed is a refactoring here, not a better name. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:09 +03:00
Jan Kiszka	2e4ce7f574	KVM: VMX: Silence warning on 32-bit hosts a is unused now on CONFIG_X86_32. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:08 +03:00
Takuya Yoshikawa	f411e6cdc2	KVM: x86 emulator: Use opcode::execute for CLI/STI(FA/FB) Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:07 +03:00
Takuya Yoshikawa	d06e03adcb	KVM: x86 emulator: Use opcode::execute for LOOP/JCXZ LOOP/LOOPcc : E0-E2 JCXZ/JECXZ/JRCXZ : E3 Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:06 +03:00
Takuya Yoshikawa	5c5df76b8b	KVM: x86 emulator: Clean up INT n/INTO/INT 3(CC/CD/CE) Call emulate_int() directly to avoid spaghetti goto's. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:04 +03:00
Takuya Yoshikawa	1bd5f469b2	KVM: x86 emulator: Use opcode::execute for MOV(8C/8E) Different functions for those which take segment register operands. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:03 +03:00
Takuya Yoshikawa	ebda02c2a5	KVM: x86 emulator: Use opcode::execute for RET(C3) Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:02 +03:00
Takuya Yoshikawa	e4f973ae91	KVM: x86 emulator: Use opcode::execute for XCHG(86/87) In addition, replace one "goto xchg" with an em_xchg() call. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:01 +03:00
Takuya Yoshikawa	9f21ca599c	KVM: x86 emulator: Use opcode::execute for TEST(84/85, A8/A9) Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:00 +03:00
Takuya Yoshikawa	db5b0762f3	KVM: x86 emulator: Use opcode::execute for some instructions Move the following functions to the opcode tables: RET (Far return) : CB IRET : CF JMP (Jump far) : EA SYSCALL : 0F 05 CLTS : 0F 06 SYSENTER : 0F 34 SYSEXIT : 0F 35 Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:15:59 +03:00
Takuya Yoshikawa	e01991e71a	KVM: x86 emulator: Rename emulate_xxx() to em_xxx() The next patch will change these to be called by opcode::execute. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:15:58 +03:00
Takuya Yoshikawa	9d74191ab1	KVM: x86 emulator: Use the pointers ctxt and c consistently We should use the local variables ctxt and c when the emulate_ctxt and decode appears many times. At least, we need to be consistent about how we use these in a function. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:15:57 +03:00
Nadav Har'El	2844d84905	KVM: nVMX: Miscellaneous small corrections Small corrections of KVM (spelling, etc.) not directly related to nested VMX. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:19 +03:00
Nadav Har'El	7b8050f570	KVM: nVMX: Add VMX to list of supported cpuid features If the "nested" module option is enabled, add the "VMX" CPU feature to the list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl. Qemu uses this ioctl, and intersects KVM's list with its own list of desired cpu features (depending on the -cpu option given to qemu) to determine the final list of features presented to the guest. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:19 +03:00
Nadav Har'El	7991825b85	KVM: nVMX: Additional TSC-offset handling In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to set vmcs12.tsc_offset, for this change to survive the next nested entry (see prepare_vmcs02()). Additionally, we also need to modify vmx_adjust_tsc_offset: The semantics of this function is that the TSC of all guests on this vcpu, L1 and possibly several L2s, need to be adjusted. To do this, we need to adjust vmcs01's tsc_offset (this offset will also apply to each L2s we enter). We can't set vmcs01 now, so we have to remember this adjustment and apply it when we later exit to L1. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:19 +03:00
Nadav Har'El	36cf24e01e	KVM: nVMX: Further fixes for lazy FPU loading KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and NM exceptions, even if we have a guest hypervisor (L1) who didn't want these traps. And of course, conversely: If L1 wanted to trap these events, we must let it, even if L0 is not interested in them. This patch fixes some existing KVM code (in update_exception_bitmap(), vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's and L1's needs. Note that handle_cr() was already fixed in the above patch, and that new code introduced in previous patches already handles CR0 correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()). Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:18 +03:00
Nadav Har'El	eeadf9e755	KVM: nVMX: Handling of CR0 and CR4 modifying instructions When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a previous patch). When L2 modifies bits that L1 doesn't care about, we let it think (via CR[04]_READ_SHADOW) that it did these modifications, while only changing (in GUEST_CR[04]) the bits that L0 doesn't shadow. This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may want to leave TS on, while pretending to allow the guest to change it. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:18 +03:00
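The shadowing rule in the entry above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code from the patch: the struct cr0_state type, the l2_write_cr0() helper and its parameter names are hypothetical, and the case where L1 shadows the changed bit (which exits to L1 instead) is left out.

```c
#include <assert.h>
#include <stdint.h>

#define X86_CR0_TS (1ull << 3)   /* CR0.TS, the bit used for lazy FPU loading */

/* Hypothetical pair of values L0 would write to GUEST_CR0 and CR0_READ_SHADOW. */
struct cr0_state {
    uint64_t guest_cr0;    /* value the hardware really runs L2 with */
    uint64_t read_shadow;  /* value L2 sees when it reads CR0 */
};

/*
 * L2 writes 'val' to CR0 and L1 does not care about the changed bits.
 * Bits set in 'l0_shadowed' keep their current hardware value; all other
 * bits take the value L2 asked for. The read shadow takes 'val' unchanged,
 * so L2 believes the whole write took effect.
 */
static struct cr0_state l2_write_cr0(uint64_t hw_cr0, uint64_t l0_shadowed,
                                     uint64_t val)
{
    struct cr0_state s;

    s.guest_cr0 = (hw_cr0 & l0_shadowed) | (val & ~l0_shadowed);
    s.read_shadow = val;

    /* Invariant: bits that L0 shadows never change behind L0's back. */
    assert((s.guest_cr0 & l0_shadowed) == (hw_cr0 & l0_shadowed));
    return s;
}
```

With hw_cr0 = PE|TS, l0_shadowed = TS and val = PE|MP, the sketch keeps TS set in guest_cr0 while the read shadow reports TS clear, which is the lazy FPU situation the entry describes.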
Nadav Har'El	66c78ae40c	KVM: nVMX: Correct handling of idt vectoring info This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested case. When a guest exits while delivering an interrupt or exception, we get this information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1, there's nothing we need to do, because L1 will see this field in vmcs12, and handle it itself. However, when L2 exits and L0 handles the exit itself and plans to return to L2, L0 must inject this event to L2. In the normal non-nested case, the idt_vectoring_info case is discovered after the exit, and the decision to inject (though not the injection itself) is made at that point. However, in the nested case a decision of whether to return to L2 or L1 also happens during the injection phase (see the previous patches), so in the nested case we can only decide what to do about the idt_vectoring_info right after the injection, i.e., in the beginning of vmx_vcpu_run, which is the first time we know for sure if we're staying in L2. Therefore, when we exit L2 (is_guest_mode(vcpu)), we disable the regular vmx_complete_interrupts() code which queues the idt_vectoring_info for injection on next entry - because such injection would not be appropriate if we will decide to exit to L1. Rather, we just save the idt_vectoring_info and related fields in vmcs12 (which is a convenient place to save these fields). On the next entry in vmx_vcpu_run (after the injection phase, potentially exiting to L1 to inject an event requested by user space), if we find ourselves in L1 we don't need to do anything with those values we saved (as explained above). But if we find that we're in L2, or rather still at L2 (it's not nested_run_pending, meaning that this is the first round of L2 running after L1 having just launched it), we need to inject the event saved in those fields - by writing the appropriate VMCS fields. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:18 +03:00
Nadav Har'El	0b6ac343fc	KVM: nVMX: Correct handling of exception injection Similar to the previous patch, but concerning injection of exceptions rather than external interrupts. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:17 +03:00
Nadav Har'El	b6f1250edb	KVM: nVMX: Correct handling of interrupt injection The code in this patch correctly emulates external-interrupt injection while a nested guest L2 is running. Because of this code's relative un-obviousness, I include here a longer-than-usual justification for what it does - much longer than the code itself ;-) To understand how to correctly emulate interrupt injection while L2 is running, let's look first at what we need to emulate: How would things look if the extra L0 hypervisor layer is removed, and instead of L0 injecting an interrupt, we had hardware delivering an interrupt? Now we have L1 running on bare metal with a guest L2, and the hardware generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what happens now is this: The processor exits from L2 to L1, with an external-interrupt exit reason but without an interrupt vector. L1 runs, with interrupts disabled, and it doesn't yet know what the interrupt was. Soon after, it enables interrupts and only at that moment, it gets the interrupt from the processor. When L1 is KVM, Linux handles this interrupt. Now we need exactly the same thing to happen when that L1->L2 system runs on top of L0, instead of real hardware. This is how we do this: When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with external-interrupt exit reason (with an invalid interrupt vector), and run L1. Just like in the bare metal case, it likely can't deliver the interrupt to L1 now because L1 is running with interrupts disabled, in which case it turns on the interrupt window when running L1 after the exit. L1 will soon enable interrupts, and at that point L0 will gain control again and inject the interrupt to L1. Finally, there is an extra complication in the code: when nested_run_pending, we cannot return to L1 now, and must launch L2. We need to remember the interrupt we wanted to inject (and not clear it now), and do it on the next exit. The above explanation shows that the relative strangeness of the nested interrupt injection code in this patch, and the extra interrupt-window exit incurred, are in fact necessary for accurate emulation, and are not just an unoptimized implementation. Let's revisit now the two assumptions made above: If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know does, by the way), things are simple: L0 may inject the interrupt directly to the L2 guest - using the normal code path that injects to any guest. We support this case in the code below. If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT, things look very different from the description above: L1 expects to see an exit from L2 with the interrupt vector already filled in the exit information, and does not expect to be interrupted again with this interrupt. The current code does not (yet) support this case, so we do not allow the VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:17 +03:00
Nadav Har'El	644d711aa0	KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit This patch contains the logic of whether an L2 exit should be handled by L0 and then L2 should be resumed, or whether L1 should be run to handle this exit (using the nested_vmx_vmexit() function of the previous patch). The basic idea is to let L1 handle the exit only if it actually asked to trap this sort of event. For example, when L2 exits on a change to CR0, we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any bit which changed; If it did, we exit to L1. But if it didn't it means that it is we (L0) that wished to trap this event, so we handle it ourselves. The next two patches add additional logic of what to do when an interrupt or exception is injected: Does L0 need to do it, should we exit to L1 to do it, or should we resume L2 and keep the exception to be injected later. We keep a new flag, "nested_run_pending", which can override the decision of which should run next, L1 or L2. nested_run_pending=1 means that we must run L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending is especially intended to avoid switching to L1 in the injection decision-point described above. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:16 +03:00
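The CR0 example in the entry above reduces to a single mask test. The following is a minimal sketch, assuming the decision depends only on which bits changed; the function name l1_wants_cr0_exit() and its parameters are hypothetical and are not taken from the patch, and nested_run_pending is ignored here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide who handles an L2 write to CR0 (illustrative only).
 *
 * l1_guest_host_mask: bits L1 asked to trap (its CR0_GUEST_HOST_MASK)
 * old_val:            the CR0 value L2 saw before the write
 * new_val:            the CR0 value L2 is trying to write
 *
 * Returns true when a changed bit is one L1 cares about, so the exit
 * should be reflected to L1; false means L0 handles it and resumes L2.
 */
static bool l1_wants_cr0_exit(uint64_t l1_guest_host_mask,
                              uint64_t old_val, uint64_t new_val)
{
    uint64_t changed = old_val ^ new_val;

    return (l1_guest_host_mask & changed) != 0;
}
```

For instance, if L1 traps only CR0.TS (bit 3), a write that toggles TS goes to L1, while a write that only sets CR0.MP stays in L0.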
Nadav Har'El	7c1779384a	KVM: nVMX: vmcs12 checks on nested entry This patch adds a bunch of tests of the validity of the vmcs12 fields, according to what the VMX spec and our implementation allows. If fields we cannot (or don't want to) honor are discovered, an entry failure is emulated. According to the spec, there are two types of entry failures: If the problem was in vmcs12's host state or control fields, the VMLAUNCH instruction simply fails. But if a problem is found in the guest state, the behavior is more similar to that of an exit. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:16 +03:00
Nadav Har'El	4704d0befb	KVM: nVMX: Exiting from L2 to L1 This patch implements nested_vmx_vmexit(), called when the nested L2 guest exits and we want to run its L1 parent and let it handle this exit. Note that this will not necessarily be called on every L2 exit. L0 may decide to handle a particular exit on its own, without L1's involvement; In that case, L0 will handle the exit, and resume running L2, without running L1 and without calling nested_vmx_vmexit(). The logic for deciding whether to handle a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(), will appear in a separate patch below. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:16 +03:00
Nadav Har'El	99e65e805d	KVM: nVMX: No need for handle_vmx_insn function any more Before nested VMX support, the exit handler for a guest executing a VMX instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmresume, vmwrite, vmon, vmoff), was handle_vmx_insn(). This handler simply threw a #UD exception. Now that all these exit reasons are properly handled (and emulate the respective VMX instruction), nothing calls this dummy handler and it can be removed. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:15 +03:00
Nadav Har'El	cd232ad02f	KVM: nVMX: Implement VMLAUNCH and VMRESUME Implement the VMLAUNCH and VMRESUME instructions, allowing a guest hypervisor to run its own guests. This patch does not include some of the necessary validity checks on vmcs12 fields before the entry. These will appear in a separate patch below. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:15 +03:00
Nadav Har'El	fe3ef05c75	KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12 This patch contains code to prepare the VMCS which can be used to actually run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our own guests). Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:14 +03:00
Nadav Har'El	bf8179a011	KVM: nVMX: Move control field setup to functions Move some of the control field setup to common functions. These functions will also be needed for running L2 guests - L0's desires (expressed in these functions) will be appropriately merged with L1's desires. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:14 +03:00
Nadav Har'El	a3a8ff8ebf	KVM: nVMX: Move host-state field setup to a function Move the setting of constant host-state fields (fields that do not change throughout the life of the guest) from vmx_vcpu_setup to a new common function vmx_set_constant_host_state(). This function will also be used to set the host state when running L2 guests. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:14 +03:00
Nadav Har'El	49f705c532	KVM: nVMX: Implement VMREAD and VMWRITE Implement the VMREAD and VMWRITE instructions. With these instructions, L1 can read and write to the VMCS it is holding. The values are read or written to the fields of the vmcs12 structure introduced in a previous patch. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:14 +03:00
Nadav Har'El	6a4d755060	KVM: nVMX: Implement VMPTRST This patch implements the VMPTRST instruction. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:13 +03:00
Nadav Har'El	63846663ea	KVM: nVMX: Implement VMPTRLD This patch implements the VMPTRLD instruction. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:12 +03:00
Nadav Har'El	27d6c86521	KVM: nVMX: Implement VMCLEAR This patch implements the VMCLEAR instruction. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:12 +03:00
Nadav Har'El	0140caea3b	KVM: nVMX: Success/failure of VMX instructions. VMX instructions specify success or failure by setting certain RFLAGS bits. This patch contains common functions to do this, and they will be used in the following patches which emulate the various VMX instructions. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:12 +03:00
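The RFLAGS convention referred to in the entry above is the one the Intel SDM defines as VMsucceed, VMfailInvalid and VMfailValid. A minimal standalone sketch follows; the helper names and the way the error number is returned through a pointer are assumptions made for illustration, not the functions added by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arithmetic flag bits in RFLAGS. */
#define FLAG_CF (1ull << 0)
#define FLAG_PF (1ull << 2)
#define FLAG_AF (1ull << 4)
#define FLAG_ZF (1ull << 6)
#define FLAG_SF (1ull << 7)
#define FLAG_OF (1ull << 11)
#define VMX_STATUS_FLAGS \
    (FLAG_CF | FLAG_PF | FLAG_AF | FLAG_ZF | FLAG_SF | FLAG_OF)

/* VMsucceed: all six status flags are cleared. */
static uint64_t vmx_succeed(uint64_t rflags)
{
    return rflags & ~VMX_STATUS_FLAGS;
}

/* VMfailInvalid: CF is set, the other five flags are cleared. */
static uint64_t vmx_fail_invalid(uint64_t rflags)
{
    return (rflags & ~VMX_STATUS_FLAGS) | FLAG_CF;
}

/*
 * VMfailValid: ZF is set, the other five flags are cleared, and the error
 * number is stored in the current VMCS's VM-instruction error field
 * (modelled here by a plain pointer).
 */
static uint64_t vmx_fail_valid(uint64_t rflags, uint32_t error,
                               uint32_t *vm_instruction_error)
{
    assert(vm_instruction_error != NULL);
    *vm_instruction_error = error;
    return (rflags & ~VMX_STATUS_FLAGS) | FLAG_ZF;
}
```

VMfailValid applies only when there is a current VMCS to hold the error number; without one, a failing instruction reports VMfailInvalid.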
Nadav Har'El	22bd035868	KVM: nVMX: Add VMCS fields to the vmcs12 In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the standard VMCS fields. Later patches will enable L1 to read and write these fields using VMREAD/VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02, a hardware VMCS for running L2. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:11 +03:00
Nadav Har'El	ff2f6fe961	KVM: nVMX: Introduce vmcs02: VMCS used to run L2 We saw in a previous patch that L1 controls its L2 guest with a vmcs12. L0 needs to create a real VMCS for running L2. We call that "vmcs02". A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02 fields. This patch only contains code for allocating vmcs02. In this version, prepare_vmcs02() sets all of vmcs02's fields each time we enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can be reused even when L1 runs multiple L2 guests. However, in future versions we'll probably want to add an optimization where vmcs02 fields that rarely change will not be set each time. For that, we may want to keep around several vmcs02s of L2 guests that have recently run, so that potentially we could run these L2s again more quickly because fewer vmwrites to vmcs02 will be needed. This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool, which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s. As explained above, in the current version we choose VMCS02_POOL_SIZE=1, i.e., one vmcs02 is allocated (and loaded onto the processor), and it is reused to enter any L2 guest. In the future, when prepare_vmcs02() is optimized not to set all fields every time, VMCS02_POOL_SIZE should be increased. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:11 +03:00
Nadav Har'El	064aea7747	KVM: nVMX: Decoding memory operands of VMX instructions This patch includes a utility function for decoding pointer operands of VMX instructions issued by L1 (a guest hypervisor). Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:11 +03:00
Nadav Har'El	b87a51ae28	KVM: nVMX: Implement reading and writing of VMX MSRs When the guest can use VMX instructions (when the "nested" module option is on), it should also be able to read and write VMX MSRs, e.g., to query about VMX capabilities. This patch adds this support. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:11 +03:00
Nadav Har'El	a9d30f33dd	KVM: nVMX: Introduce vmcs12: a VMCS structure for L1 An implementation of VMX needs to define a VMCS structure. This structure is kept in guest memory, but is opaque to the guest (who can only read or write it with VMX instructions). This patch starts to define the VMCS structure which our nested VMX implementation will present to L1. We call it "vmcs12", as it is the VMCS that L1 keeps for its L2 guest. We will add more content to this structure in later patches. This patch also adds the notion (as required by the VMX spec) of L1's "current VMCS", and finally includes utility functions for mapping the guest-allocated VMCSs in host memory. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:10 +03:00
Nadav Har'El	5e1746d620	KVM: nVMX: Allow setting the VMXE bit in CR4 This patch allows the guest to enable the VMXE bit in CR4, which is a prerequisite to running VMXON. Whether to allow setting the VMXE bit now depends on the architecture (svm or vmx), so its checking has moved to kvm_x86_ops->set_cr4(). This function now returns an int: If kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4() will also return 1, and this will cause kvm_set_cr4() to throw a #GP. Turning on the VMXE bit is allowed only when the nested VMX feature is enabled, and turning it off is forbidden after a vmxon. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:10 +03:00
Nadav Har'El	ec378aeef9	KVM: nVMX: Implement VMXON and VMXOFF This patch allows a guest to use the VMXON and VMXOFF instructions, and emulates them accordingly. Basically this amounts to checking some prerequisites, and then remembering whether the guest has enabled or disabled VMX operation. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:09 +03:00
Nadav Har'El	801d342432	KVM: nVMX: Add "nested" module option to kvm_intel This patch adds to kvm_intel a module option "nested". This option controls whether the guest can use VMX instructions, i.e., whether we allow nested virtualization. A similar, but separate, option already exists for the SVM module. This option currently defaults to 0, meaning that nested VMX must be explicitly enabled by giving nested=1. When nested VMX matures, the default should probably be changed to enable nested VMX by default - just like nested SVM is currently enabled by default. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:09 +03:00
Takuya Yoshikawa	b5c9ff731f	KVM: x86 emulator: Avoid clearing the whole decode_cache While tracing the emulator, we noticed that init_emulate_ctxt() sometimes took a bit longer than we expected. This patch mitigates the problem to some degree. By looking into the function, we soon notice that it clears the whole decode_cache whose size is about 2.5K bytes now. Furthermore, most of the bytes are taken for the two read_cache arrays, which are used only by a few instructions. Considering the fact that we are not assuming the cache arrays have been cleared when we store actual data, we do not need to clear the arrays: 2K bytes elimination. In addition, we can avoid clearing the fetch_cache and regs arrays. This patch changes the initialization not to clear the arrays. On our 64-bit host, init_emulate_ctxt() becomes 0.3 to 0.5us faster with this patch applied. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 11:45:09 +03:00
Takuya Yoshikawa	adf52235b4	KVM: x86 emulator: Clean up init_emulate_ctxt() Use a local pointer to the emulate_ctxt for simplicity. Then, arrange the hard-to-read mode selection lines neatly. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 11:45:08 +03:00
Jan Kiszka	d780592b99	KVM: Clean up error handling during VCPU creation So far kvm_arch_vcpu_setup is responsible for freeing the vcpu struct if it fails. Move this confusing responsibility back into the hands of kvm_vm_ioctl_create_vcpu. Only kvm_arch_vcpu_setup of x86 is affected, all other archs cannot fail. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 11:45:08 +03:00
Nadav Har'El	d462b81923	KVM: VMX: Keep list of loaded VMCSs, instead of vcpus In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it because (at least in theory) the processor might not have written all of its content back to memory. Since a patch from June 26, 2008, this is done using a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU. The problem is that with nested VMX, we no longer have the concept of a vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for L2s), and each of those may have been last loaded on a different cpu. So instead of linking the vcpus, we link the VMCSs, using a new structure loaded_vmcs. This structure contains the VMCS, and the information pertaining to its loading on a specific cpu (namely, the cpu number, and whether it was already launched on this cpu once). In nested we will also use the same structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the currently active VMCS. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 11:45:08 +03:00
Avi Kivity	24c82e576b	KVM: Sanitize cpuid Instead of blacklisting known-unsupported cpuid leaves, whitelist known-supported leaves. This is more conservative and prevents us from reporting features we don't support. Also whitelist a few more leaves while at it. Signed-off-by: Avi Kivity <avi@redhat.com> Acked-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:07 +03:00
Xiao Guangrong	bcdd9a93c5	KVM: MMU: cleanup for dropping parent pte Introduce drop_parent_pte to remove the rmap of parent pte and clear parent pte Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:07 +03:00
Xiao Guangrong	38e3b2b28c	KVM: MMU: cleanup for kvm_mmu_page_unlink_children Cleanup the same operation between kvm_mmu_page_unlink_children and mmu_pte_write_zap_pte Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:07 +03:00
Xiao Guangrong	67052b3508	KVM: MMU: remove the arithmetic of parent pte rmap Parent pte rmap and page rmap are very similar, so use the same arithmetic for them Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:07 +03:00
Xiao Guangrong	53c07b1878	KVM: MMU: abstract the operation of rmap Abstract the operation of rmap to spte_list, then we can use it for the reverse mapping of parent pte in the later patch Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:06 +03:00
Xiao Guangrong	1249b96e72	KVM: fix uninitialized warning Fix: warning: ‘cs_sel’ may be used uninitialized in this function warning: ‘ss_sel’ may be used uninitialized in this function Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:06 +03:00
Xiao Guangrong	8b0cedff04	KVM: use __copy_to_user/__clear_user to write guest page Simply use __copy_to_user/__clear_user to write guest page since we have already verified the user address when the memslot is set Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:03 +03:00
Xiao Guangrong	332b207d65	KVM: MMU: optimize pte write path if don't have protected sp Simply return from the kvm_mmu_pte_write path if no shadow page is write-protected, so that we avoid walking all shadow pages and holding mmu-lock. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:02 +03:00
Avi Kivity	96304217a7	KVM: VMX: always_inline VMREADs vmcs_readl() and friends are really short, but gcc thinks they are long because of the out-of-line exception handlers. Mark them always_inline to clear the misunderstanding. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:01 +03:00
Avi Kivity	5e520e6278	KVM: VMX: Move VMREAD cleanup to exception handler We clean up a failed VMREAD by clearing the output register. Do it in the exception handler instead of unconditionally. This is worthwhile since there are more than a hundred call sites. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:45:00 +03:00
Takuya Yoshikawa	7b105ca290	KVM: x86 emulator: Stop passing ctxt->ops as arg of emul functions Dereference it in the actual users. This not only cleans up the emulator but also makes it easy to convert the old emulation functions to the new em_xxx() form later. Note: Remove some inline keywords to let the compiler decide inlining. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:44:59 +03:00
Takuya Yoshikawa	ef5d75cc9a	KVM: x86 emulator: Stop passing ctxt->ops as arg of decode helpers Dereference it in the actual users: only do_insn_fetch_byte(). This is consistent with the way __linearize() dereferences it. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:44:57 +03:00
Takuya Yoshikawa	67cbc90db5	KVM: x86 emulator: Place insn_fetch helpers together The two macros need special care to use: Assume rc, ctxt, ops and done exist outside of them. Can goto outside. Considering the fact that these are used only in decode functions, moving these right after do_insn_fetch() seems to be a right thing to improve the readability. We also rename do_fetch_insn_byte() to do_insn_fetch_byte() to be consistent. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 11:44:56 +03:00
Jiri Kosina	b7e9c223be	Merge branch 'master' into for-next Sync with Linus' tree to be able to apply pending patches that are based on newer code already present upstream.	2011-07-11 14:15:55 +02:00
Avi Kivity	cb16c34876	KVM: x86 emulator: fix %rip-relative addressing with immediate source operand %rip-relative addressing is relative to the first byte of the next instruction, so we need to add %rip only after we've fetched any immediate bytes. Based on original patch by Li Xin <xin.li@intel.com>. Signed-off-by: Avi Kivity <avi@redhat.com> Acked-by: Li Xin <xin.li@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-06-29 10:09:25 +03:00
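The addressing rule in the entry above can be checked with a small calculation. This is a sketch only: rip_relative_ea() and its parameters are hypothetical names, and the emulator itself works on its decode state rather than on a precomputed instruction length.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective address of a %rip-relative operand (illustrative only).
 *
 * insn_start: address of the first byte of the instruction
 * insn_len:   full instruction length, including any immediate bytes
 *             that follow the displacement
 * disp32:     signed 32-bit displacement from the instruction encoding
 *
 * The base is the address of the next instruction, so the immediate
 * bytes must already be counted in insn_len.
 */
static uint64_t rip_relative_ea(uint64_t insn_start, unsigned int insn_len,
                                int32_t disp32)
{
    assert(insn_len >= 1 && insn_len <= 15);  /* x86 length limit */
    return insn_start + insn_len + (uint64_t)(int64_t)disp32;
}
```

As an example, cmpl $0x12345678, 0x10(%rip) encodes as opcode, ModRM, 4 displacement bytes and 4 immediate bytes, 10 bytes in total. Placed at 0x1000 its operand is at 0x101A; adding %rip before the immediate is fetched would give 0x1016 instead, which is the kind of error the entry describes.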
Vitaliy Ivanov	e44ba033c5	treewide: remove duplicate includes Many stupid corrections of duplicated includes based on the output of scripts/checkincludes.pl. Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2011-06-20 16:08:19 +02:00
Steve	a0a8eaba16	KVM: MMU: fix opposite condition in mapping_level_dirty_bitmap The condition is opposite, it always maps huge page for the dirty tracked page Reported-by: Steve <stefan.bosak@gmail.com> Signed-off-by: Steve <stefan.bosak@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-06-19 19:23:13 +03:00
Marcelo Tosatti	5233dd51ec	KVM: VMX: do not overwrite uptodate vcpu->arch.cr3 on KVM_SET_SREGS Only decache guest CR3 value if vcpu->arch.cr3 is stale. Fixes loadvm with live guest. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Tested-by: Markus Schade <markus.schade@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-06-19 19:23:13 +03:00
Borislav Petkov	b72336355b	KVM: MMU: Fix build warnings in walk_addr_generic() On 3.0-rc1 I get In file included from arch/x86/kvm/mmu.c:2856: arch/x86/kvm/paging_tmpl.h: In function ‘paging32_walk_addr_generic’: arch/x86/kvm/paging_tmpl.h:124: warning: ‘ptep_user’ may be used uninitialized in this function In file included from arch/x86/kvm/mmu.c:2852: arch/x86/kvm/paging_tmpl.h: In function ‘paging64_walk_addr_generic’: arch/x86/kvm/paging_tmpl.h:124: warning: ‘ptep_user’ may be used uninitialized in this function caused by `6e2ca7d180`. According to Takuya Yoshikawa, ptep_user won't be used uninitialized so shut up gcc. Cc: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Link: http://lkml.kernel.org/r/20110530094604.GC21833@liondog.tnic Signed-off-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-06-19 19:23:13 +03:00
Linus Torvalds	58a9a36b54	Merge branch 'kvm-updates/3.0' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/3.0' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: Initialize kvm before registering the mmu notifier KVM: x86: use proper port value when checking io instruction permission KVM: add missing void __user * cast to access_ok() call	2011-06-07 19:06:28 -07:00
Marcelo Tosatti	221192bdff	KVM: x86: use proper port value when checking io instruction permission Commit `f6511935f4` moved the permission check for io instructions to the ->check_perm callback. It failed to copy the port value from RDX register for string and "in,out ax,dx" instructions. Fix it by reading RDX register at decode stage when appropriate. Fixes FC8.32 installation. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-06-06 10:52:09 +03:00
Ying Han	1495f230fa	vmscan: change shrinker API by passing shrink_control struct Change each shrinker's API by consolidating the existing parameters into shrink_control struct. This will simplify any further features added w/o touching each file of shrinker. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: fix warning] [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API] [akpm@linux-foundation.org: fix xfs warning] [akpm@linux-foundation.org: update gfs2] Signed-off-by: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-25 08:39:26 -07:00
Takuya Yoshikawa	c8cfbb555e	KVM: MMU: Use ptep_user for cmpxchg_gpte() The address of the gpte was already calculated and stored in ptep_user before entering cmpxchg_gpte(). This patch makes cmpxchg_gpte() use that to make it clear that we are using the same address during walk_addr_generic(). Note that the unlikely annotations are used to show that the conditions are something unusual rather than for performance. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-05-22 08:48:14 -04:00
Takuya Yoshikawa	d2f62766d5	KVM: x86 emulator: Make jmp far emulation into a separate function We introduce em_jmp_far(). We also call this from em_grp45() to stop treating modrm_reg == 5 case separately in the group 5 emulation. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:48:06 -04:00
Takuya Yoshikawa	51187683cb	KVM: x86 emulator: Rename emulate_grpX() to em_grpX() The prototypes are changed appropriately. We also replace "goto grp45;" with a simple em_grp45() call. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:48:05 -04:00
Takuya Yoshikawa	3b9be3bf2e	KVM: x86 emulator: Remove unused arg from emulate_pop() The opt of emulate_grp1a() is also removed. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:48:03 -04:00
Takuya Yoshikawa	adddcecf92	KVM: x86 emulator: Remove unused arg from writeback() Remove inline at this chance. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:48:00 -04:00
Takuya Yoshikawa	509cf9fe11	KVM: x86 emulator: Remove unused arg from read_descriptor() Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:47:59 -04:00
Takuya Yoshikawa	c1ed6dea81	KVM: x86 emulator: Remove unused arg from seg_override() In addition, one comma at the end of a statement is replaced with a semicolon. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:47:57 -04:00
Takuya Yoshikawa	fa3d315a4c	KVM: Validate userspace_addr of memslot when registered This way, we can avoid checking the user space address many times when we read the guest memory. Although we can do the same for write if we check which slots are writable, we do not care right now: reading the guest memory happens more often than writing. [avi: change VERIFY_READ to VERIFY_WRITE] Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:47:56 -04:00
Takuya Yoshikawa	12cb814f3b	KVM: MMU: Clean up gpte reading with copy_from_user() When we optimized walk_addr_generic() by not using the generic guest memory reader, we replaced copy_from_user() with get_user(): commit e30d2a170506830d5eef5e9d7990c5aedf1b0a51 KVM: MMU: Optimize guest page table walk commit 15e2ac9a43d4d7d08088e404fddf2533a8e7d52e KVM: MMU: Fix 64-bit paging breakage on x86_32 But as Andi pointed out later, copy_from_user() does the same as get_user() as long as we give a constant size to it. So we use copy_from_user() to clean up the code. The only, noticeable, regression introduced by this is 64-bit gpte reading on x86_32 hosts needed for PAE guests. But this can be mitigated by implementing 8-byte get_user() for x86_32, if needed. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:47:54 -04:00
Avi Kivity	2fb92db1ec	KVM: VMX: Cache vmcs segment fields Since the emulator now checks segment limits and access rights, it generates a lot more accesses to the vmcs segment fields. Undo some of the performance hit by caching those fields in a read-only cache (the entire cache is invalidated on any write, or on guest exit). Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:47:45 -04:00
Avi Kivity	1aa366163b	KVM: x86 emulator: consolidate segment accessors Instead of separate accessors for the segment selector and cached descriptor, use one accessor for both. This simplifies the code somewhat. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:47:39 -04:00
Avi Kivity	0a434bb2bf	KVM: VMX: Avoid reading %rip unnecessarily when handling exceptions Avoids a VMREAD. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:40:01 -04:00
Joe Perches	ae8cc05955	KVM: SVM: Make dump_vmcb static, reduce text dump_vmcb isn't used outside this module, make it static. Shrink text and object by ~1% by standardizing formats. $ size arch/x86/kvm/svm.o* text data bss dec hex filename 52910 580 10072 63562 f84a arch/x86/kvm/svm.o.new 53563 580 10072 64215 fad7 arch/x86/kvm/svm.o.old Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:40:00 -04:00
Takuya Yoshikawa	8f74d8e168	KVM: MMU: Fix 64-bit paging breakage on x86_32 Fix regression introduced by commit e30d2a170506830d5eef5e9d7990c5aedf1b0a51 KVM: MMU: Optimize guest page table walk On x86_32, get_user() does not support 64-bit values and we fail to build KVM at the point of 64-bit paging. This patch fixes this by using get_user() twice for that condition. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Reported-by: Jan Kiszka <jan.kiszka@web.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:58 -04:00
BrillyWu@viatech.com.cn	4429d5dc11	KVM: Add CPUID support for VIA CPU The CPUIDs for Centaur are added, and then the features of PadLock hardware engine on VIA CPU, such as "ace", "ace_en" and so on, can be passed into the kvm guest. Signed-off-by: Brilly Wu <brillywu@viatech.com.cn> Signed-off-by: Kary Jin <karyjin@viatech.com.cn> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:55 -04:00
Gleb Natapov	2aab2c5b2b	KVM: call cache_all_regs() only once during instruction emulation Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:54 -04:00
Gleb Natapov	0004c7c257	KVM: Fix compound mmio mmio_index should be taken into account when copying data from userspace. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:52 -04:00
Gleb Natapov	4947e7cd0e	KVM: emulator: Propagate fault in far jump emulation Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:51 -04:00
Gleb Natapov	8d7d810255	KVM: mmio_fault_cr2 is not used Remove unused variable mmio_fault_cr2. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:49 -04:00
Avi Kivity	46561646ce	KVM: x86 emulator: consolidate group handling Move all groups into a single field and handle them in a single place. This saves bits when we add more group types (3 bits -> 7 groups types). Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:48 -04:00
Avi Kivity	781e0743af	KVM: MMU: Add unlikely() annotations to walk_addr_generic() walk_addr_generic() is a hot path and is also hard for the cpu to predict - some of the parameters (fetch_fault in particular) vary wildly from invocation to invocation. Add unlikely() annotations where appropriate; all walk failures are considered unlikely, as are cases where we have to mark the accessed or dirty bit, as they are slow paths both in kvm and on real processors. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:46 -04:00
Takuya Yoshikawa	62aaa2f05a	KVM: x86 emulator: Use opcode::execute for PUSHF/POPF (9C/9D) For this, em_pushf/popf() are introduced. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:45 -04:00
Takuya Yoshikawa	b96a7fad02	KVM: x86 emulator: Use opcode::execute for PUSHA/POPA (60/61) For this, emulate_pusha/popa() are converted to em_pusha/popa(). Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:43 -04:00
Takuya Yoshikawa	c54fe50469	KVM: x86 emulator: Use opcode::execute for POP reg (58-5F) In addition, the RET emulation is changed to call em_pop() to remove the pop_instruction label. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:41 -04:00
Takuya Yoshikawa	d67fc27ae2	KVM: x86 emulator: Use opcode::execute for Group 1, CMPS and SCAS The following instructions are changed to use opcode::execute. Group 1 (80-83) ADD (00-05), OR (08-0D), ADC (10-15), SBB (18-1D), AND (20-25), SUB (28-2D), XOR (30-35), CMP (38-3D) CMPS (A6-A7), SCAS (AE-AF) The last two do the same as CMP in the emulator, so em_cmp() is used. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:40 -04:00
Takuya Yoshikawa	6e2ca7d180	KVM: MMU: Optimize guest page table walk This patch optimizes the guest page table walk by using get_user() instead of copy_from_user(). With this patch applied, paging64_walk_addr_generic() has become about 0.5us to 1.0us faster on my Phenom II machine with NPT on. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:38 -04:00
Avi Kivity	40e19b519c	KVM: SVM: Get rid of x86_intercept_map::valid By reserving 0 as an invalid x86_intercept_stage, we no longer need to store a valid flag in x86_intercept_map. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:37 -04:00
Avi Kivity	5ef39c71d8	KVM: x86 emulator: Use opcode::execute for 0F 01 opcode Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:35 -04:00
Avi Kivity	68152d8812	KVM: x86 emulator: Don't force #UD for 0F 01 /5 While it isn't defined, no need to force a #UD. If it becomes defined in the future this can cause weird problems for the guest. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:34 -04:00
Avi Kivity	26d05cc740	KVM: x86 emulator: move 0F 01 sub-opcodes into their own functions Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:32 -04:00
Randy Dunlap	d42244499f	KVM: x86 emulator: fix const value warning on i386 in svm insn RAX check arch/x86/kvm/emulate.c:2598: warning: integer constant is too large for 'long' type Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:30 -04:00
Clemens Noss	cfb223753c	KVM: x86 emulator: avoid calling wbinvd() macro Commit 0b56652e33c72092956c651ab6ceb9f0ad081153 fails to build: CC [M] arch/x86/kvm/emulate.o arch/x86/kvm/emulate.c: In function 'x86_emulate_insn': arch/x86/kvm/emulate.c:4095:25: error: macro "wbinvd" passed 1 arguments, but takes just 0 arch/x86/kvm/emulate.c:4095:3: warning: statement with no effect make[2]: *** [arch/x86/kvm/emulate.o] Error 1 make[1]: *** [arch/x86/kvm] Error 2 make: *** [arch/x86] Error 2 Work around this for now. Signed-off-by: Clemens Noss <cnoss@gmx.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:29 -04:00
Roedel, Joerg	a78484c60e	KVM: MMU: Make cmpxchg_gpte aware of nesting too This patch makes the cmpxchg_gpte() function aware of the difference between l1-gfns and l2-gfns when nested virtualization is in use. This fixes a potential data-corruption problem in the l1-guest and makes the code work correctly (at least as correctly as the hardware which is emulated in this code) again. Cc: stable@kernel.org Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:26 -04:00
Avi Kivity	13db70eca6	KVM: x86 emulator: drop x86_emulate_ctxt::vcpu No longer used. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:24 -04:00
Avi Kivity	5197b808a7	KVM: Avoid using x86_emulate_ctxt.vcpu We can use container_of() instead. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:22 -04:00
Avi Kivity	bcaf5cc543	KVM: x86 emulator: add new ->wbinvd() callback Instead of calling kvm_emulate_wbinvd() directly. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:20 -04:00
Avi Kivity	d6aa10003b	KVM: x86 emulator: add ->fix_hypercall() callback Artificial, but needed to remove direct calls to KVM. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:18 -04:00
Avi Kivity	6c3287f7c5	KVM: x86 emulator: add new ->halt() callback Instead of reaching into vcpu internals. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:17 -04:00
Avi Kivity	3cb16fe78c	KVM: x86 emulator: make emulate_invlpg() an emulator callback Removing direct calls to KVM. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:15 -04:00
Avi Kivity	2d04a05bd7	KVM: x86 emulator: emulate CLTS internally Avoid using ctxt->vcpu; we can do everything with ->get_cr() and ->set_cr(). A side effect is that we no longer activate the fpu on emulated CLTS; but that should be very rare. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:14 -04:00
Avi Kivity	fd72c41922	KVM: x86 emulator: Replace calls to is_pae() and is_paging with ->get_cr() Avoid use of ctxt->vcpu. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:12 -04:00
Avi Kivity	c2ad2bb3ef	KVM: x86 emulator: drop use of is_long_mode() Requires ctxt->vcpu, which is to be abolished. Replace with open calls to get_msr(). Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:10 -04:00
Avi Kivity	1ac9d0cfb0	KVM: x86 emulator: add and use new callbacks set_idt(), set_gdt() Replacing direct calls to realmode_lgdt(), realmode_lidt(). Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:08 -04:00
Avi Kivity	fe870ab9ce	KVM: x86 emulator: avoid using ctxt->vcpu in check_perm() callbacks Unneeded for register access. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:07 -04:00
Avi Kivity	2953538ebb	KVM: x86 emulator: drop vcpu argument from intercept callback Making the emulator caller agnostic. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:05 -04:00
Avi Kivity	717746e382	KVM: x86 emulator: drop vcpu argument from cr/dr/cpl/msr callbacks Making the emulator caller agnostic. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:39:03 -04:00
Avi Kivity	4bff1e86ad	KVM: x86 emulator: drop vcpu argument from segment/gdt/idt callbacks Making the emulator caller agnostic. [Takuya Yoshikawa: fix typo leading to LDT failures] Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-22 08:35:20 -04:00
Avi Kivity	ca1d4a9e77	KVM: x86 emulator: drop vcpu argument from pio callbacks Making the emulator caller agnostic. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:11 -04:00
Avi Kivity	0f65dd70a4	KVM: x86 emulator: drop vcpu argument from memory read/write callbacks Making the emulator caller agnostic. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:10 -04:00
Avi Kivity	7295261cdd	KVM: x86 emulator: whitespace cleanups Clean up lines longer than 80 columns. No code changes. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:10 -04:00
Nelson Elhage	3d9b938eef	KVM: emulator: Use linearize() when fetching instructions Since segments need to be handled slightly differently when fetching instructions, we add a __linearize helper that accepts a new 'fetch' boolean. [avi: fix oops caused by wrong segmented_address initialization order] Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:10 -04:00
Joerg Roedel	7c4c0f4fd5	KVM: X86: Update last_guest_tsc in vcpu_put The last_guest_tsc is used in vcpu_load to adjust the tsc_offset since tsc-scaling is merged. So the last_guest_tsc needs to be updated in vcpu_put instead of the last_host_tsc. This is fixed with this patch. Reported-by: Jan Kiszka <jan.kiszka@web.de> Tested-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:10 -04:00
Joerg Roedel	977b2d03e4	KVM: SVM: Fix nested sel_cr0 intercept path with decode-assists This patch fixes a bug in the nested-svm path when decode-assists is available on the machine. After a selective-cr0 intercept is detected the rip is advanced unconditionally. This causes the l1-guest to continue running with an l2-rip. This bug was found with the sel_cr0 unit-test on decode-assists capable hardware. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:10 -04:00
Nelson Elhage	0521e4c0bc	KVM: x86 emulator: Handle wraparound in (cs_base + offset) when fetching insns Currently, setting a large (i.e. negative) base address for %cs does not work on a 64-bit host. The "JOS" teaching operating system, used by MIT and other universities, relies on such segments while bootstrapping its way to full virtual memory management. Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:09 -04:00
Duan Jiong	49704f2658	KVM: remove useless function declaration kvm_inject_pit_timer_irqs() Just remove useless function define kvm_inject_pit_timer_irqs() from file arch/x86/kvm/i8254.h Signed-off-by: Duan Jiong <djduanjiong@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:09 -04:00
Duan Jiong	1e015968df	KVM: remove useless function declarations from file arch/x86/kvm/irq.h Just remove useless function define kvm_pic_clear_isr_ack() and pit_has_pending_timer() Signed-off-by: Duan Jiong <djduanjiong@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:09 -04:00
Serge E. Hallyn	71f9833bb1	KVM: fix push of wrong eip when doing softint When doing a soft int, we need to bump eip before pushing it to the stack. Otherwise we'll do the int a second time. [apw@canonical.com: merged eip update as per Jan's recommendation.] Signed-off-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:09 -04:00
Takuya Yoshikawa	4487b3b48d	KVM: x86 emulator: Use em_push() instead of emulate_push() em_push() is a simple wrapper of emulate_push(). So this patch replaces emulate_push() with em_push() and removes the unnecessary former. In addition, the unused ops arguments are removed from emulate_pusha() and emulate_grp45(). Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:08 -04:00
Takuya Yoshikawa	4179bb02fd	KVM: x86 emulator: Make emulate_push() store the value directly PUSH emulation stores the value by calling writeback() after setting the dst operand appropriately in emulate_push(). This writeback() using dst is not needed at all because we know the target is the stack. So this patch makes emulate_push() call, newly introduced, segmented_write() directly. By this, many inlined writeback()'s are removed. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:08 -04:00
Takuya Yoshikawa	575e7c1417	KVM: x86 emulator: Disable writeback for CMP emulation This stops "CMP r/m, reg" from writing back the data into memory. Pointed out by Avi. The writeback suppression now covers CMP, CMPS, SCAS. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:08 -04:00
Jan Kiszka	be6d05cfdf	KVM: VMX: Ensure that vmx_create_vcpu always returns proper error In case certain allocations fail, vmx_create_vcpu may return 0 as error instead of a negative value encoded via ERR_PTR. This causes a NULL pointer dereferencing later on in kvm_vm_ioctl_vcpu_create. Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-05-11 07:57:08 -04:00
Gleb Natapov	7ae441eac5	KVM: emulator: do not needlesly sync registers from emulator ctxt to vcpu Currently we sync registers back and forth before/after exiting to userspace for IO, but during IO the device model shouldn't need to read/write the registers, so we can as well skip those sync points. The only exception is the broken vmware backdoor interface. The new code syncs register contents during IO only if registers are read from/written to by userspace in the middle of the IO operation, and this almost never happens in practice. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-05-11 07:57:08 -04:00
Avi Kivity	618ff15de1	KVM: x86 emulator: implement segment permission checks Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:08 -04:00
Avi Kivity	56697687da	KVM: x86 emulator: move desc_limit_scaled() For reuse later. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:07 -04:00
Avi Kivity	52fd8b445f	KVM: x86 emulator: move linearize() downwards So it can call emulate_gp() without forward declarations. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:07 -04:00
Avi Kivity	83b8795a29	KVM: x86 emulator: pass access size and read/write intent to linearize() Needed for segment read/write checks. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:07 -04:00
Avi Kivity	9fa088f4d2	KVM: x86 emulator: change address linearization to return an error code Preparing to add segment checks. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:07 -04:00
Avi Kivity	38503911b3	KVM: x86 emulator: move invlpg emulation into a function It's going to get more complicated soon. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:06 -04:00
Avi Kivity	3ca3ac4dae	KVM: x86 emulator: Add helpers for memory access using segmented addresses Will help later adding proper segment checks. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:06 -04:00
Joerg Roedel	e3e9ed3d2c	KVM: SVM: Fix fault-rip on vmsave/vmload emulation When the emulation of vmload or vmsave fails because the guest passed an unsupported physical address it gets an #GP with rip pointing to the instruction after vmsave/vmload. This is a bug and fixed by this patch. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:06 -04:00
Joerg Roedel	92a1f12d25	KVM: X86: Implement userspace interface to set virtual_tsc_khz This patch implements two new vm-ioctls to get and set the virtual_tsc_khz if the machine supports tsc-scaling. Setting the tsc-frequency is only possible before userspace creates any vcpu. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:06 -04:00
Joerg Roedel	857e40999e	KVM: X86: Delegate tsc-offset calculation to architecture code With TSC scaling in SVM the tsc-offset needs to be calculated differently. This patch propagates this calculation into the architecture specific modules so that this complexity can be handled there. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:05 -04:00
Joerg Roedel	4051b18801	KVM: X86: Implement call-back to propagate virtual_tsc_khz This patch implements a call-back into the architecture code to allow the propagation of changes to the virtual tsc_khz of the vcpu. On SVM it updates the tsc_ratio variable, on VMX it does nothing. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:05 -04:00
Joerg Roedel	8f6055cbaf	KVM: X86: Make tsc_delta calculation a function of guest tsc The calculation of the tsc_delta value to ensure a forward-going tsc for the guest is a function of the host-tsc. This works as long as the guest's tsc_khz is equal to the host's tsc_khz. With tsc-scaling hardware support this is no longer true and the tsc_delta needs to be calculated using guest_tsc values. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:04 -04:00
Joerg Roedel	1e993611d0	KVM: X86: Let kvm-clock report the right tsc frequency This patch changes the kvm_guest_time_update function to use TSC frequency the guest actually has for updating its clock. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:04 -04:00
Joerg Roedel	fbc0db76b7	KVM: SVM: Implement infrastructure for TSC_RATE_MSR This patch enhances the kvm_amd module with functions to support the TSC_RATE_MSR which can be used to set a given tsc frequency for the guest vcpu. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:04 -04:00
Avi Kivity	bfeed29d6d	KVM: x86 emulator: Drop EFER.SVME requirement from VMMCALL VMMCALL requires EFER.SVME to be enabled in the host, not in the guest, which is what check_svme() checks. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:04 -04:00
Avi Kivity	8b18bc3782	KVM: x86 emulator: Re-add VendorSpecific tag to VMMCALL insn VMMCALL needs the VendorSpecific tag so that #UD emulation (called if a guest running on AMD was migrated to an Intel host) is allowed to process the instruction. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:04 -04:00
Xiao Guangrong	7c5625227f	KVM: MMU: remove mmu_seq verification on pte update path The mmu_seq verification can be removed since we get the pfn in the protection of mmu_lock. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:03 -04:00
Gleb Natapov	a0c0ab2feb	KVM: x86 emulator: do not open code return values from the emulator Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:03 -04:00
Justin P. Mattock	0be839bfb4	KVM: Remove base_addresss in kvm_pit since it is unused The patch below removes unsigned long base_addresss; in i8254.h since it is unused. Signed-off-by: Justin P. Mattock <justinmattock@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:03 -04:00
Joerg Roedel	628afd2aeb	KVM: SVM: Remove nested sel_cr0_write handling code This patch removes all the old code which handled the nested selective cr0 write intercepts. This code was only in place as a work-around until the instruction emulator is capable of doing the same. This is the case with this patch-set and so the code can be removed. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:03 -04:00
Joerg Roedel	f6511935f4	KVM: SVM: Add checks for IO instructions This patch adds code to check for IOIO intercepts on instructions decoded by the KVM instruction emulator. [avi: fix build error due to missing #define D2bvIP] Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:03 -04:00
Joerg Roedel	bf608f88fa	KVM: SVM: Add intercept checks for one-byte instructions This patch adds intercept checks for emulated one-byte instructions to the KVM instruction emulation path. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:02 -04:00
Joerg Roedel	8061252ee0	KVM: SVM: Add intercept checks for remaining twobyte instructions This patch adds intercept checks for the remaining twobyte instructions to the KVM instruction emulator. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:02 -04:00
Joerg Roedel	d7eb820306	KVM: SVM: Add intercept checks for remaining group7 instructions This patch implements the emulator intercept checks for the RDTSCP, MONITOR, and MWAIT instructions. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:02 -04:00
Joerg Roedel	01de8b09e6	KVM: SVM: Add intercept checks for SVM instructions This patch adds the necessary code changes in the instruction emulator and the extensions to svm.c to implement intercept checks for the svm instructions. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:02 -04:00
Joerg Roedel	dee6bb70e4	KVM: SVM: Add intercept checks for descriptor table accesses This patch add intercept checks into the KVM instruction emulator to check for the 8 instructions that access the descriptor table addresses. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:02 -04:00
Joerg Roedel	3b88e41a41	KVM: SVM: Add intercept check for accessing dr registers This patch adds the intercept checks for instruction accessing the debug registers. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:01 -04:00
Joerg Roedel	cfec82cb7d	KVM: SVM: Add intercept check for emulated cr accesses This patch adds all necessary intercept checks for instructions that access the crX registers. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:01 -04:00
Joerg Roedel	8a76d7f25f	KVM: x86: Add x86 callback for intercept check This patch adds a callback into kvm_x86_ops so that svm and vmx code can do intercept checks on emulated instructions. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:01 -04:00
Joerg Roedel	8ea7d6aef8	KVM: x86 emulator: Add flag to check for protected mode instructions This patch adds an opcode flag to tag instructions which are only recognized in protected mode. The necessary check is added too. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:01 -04:00
Joerg Roedel	d09beabd7c	KVM: x86 emulator: Add check_perm callback This patch adds a check_perm callback for each opcode into the instruction emulator. This will be used to do all necessary permission checks on instructions before checking whether they are intercepted or not. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:01 -04:00
Joerg Roedel	775fde8648	KVM: x86 emulator: Don't write-back cpu-state on X86EMUL_INTERCEPTED This patch prevents the changed CPU state to be written back when the emulator detected that the instruction was intercepted by the guest. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:00 -04:00
Avi Kivity	3c6e276f22	KVM: x86 emulator: add SVM intercepts Add intercept codes for instructions defined by SVM as interceptable. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:00 -04:00
Avi Kivity	c4f035c60d	KVM: x86 emulator: add framework for instruction intercepts When running in guest mode, certain instructions can be intercepted by hardware. This also holds for nested guests running on emulated virtualization hardware, in particular instructions emulated by kvm itself. This patch adds a framework for intercepting instructions. If an instruction is marked for interception, and if we're running in guest mode, a callback is called to check whether an intercept is needed or not. The callback is called at three points in time: immediately after beginning execution, after checking privilege exceptions, and after checking memory exceptions. This suits the different interception points defined for different instructions and for the various virtualization instruction sets. In addition, a new X86EMUL_INTERCEPT is defined, which any callback or memory access may define, allowing the more complicated intercepts to be implemented in existing callbacks. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:57:00 -04:00
Avi Kivity	aa97bb4891	KVM: x86 emulator: implement movdqu instruction (f3 0f 6f, f3 0f 7f) Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:59 -04:00
Avi Kivity	1253791df9	KVM: x86 emulator: SSE support Add support for marking an instruction as SSE, switching registers used to the SSE register file. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:59 -04:00
Avi Kivity	0d7cdee83a	KVM: x86 emulator: Specialize decoding for insns with 66/f2/f3 prefixes Most SIMD instructions use the 66/f2/f3 prefixes to distinguish between different variants of the same instruction. Usually the encoding is quite regular, but in some cases (including non-SIMD instructions) the prefixes generate very different instructions. Examples include XCHG/PAUSE, MOVQ/MOVDQA/MOVDQU, and MOVBE/CRC32. Allow the emulator to handle these special cases by splitting such opcodes into groups, with different decode flags and execution functions for different prefixes. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:59 -04:00
Avi Kivity	5037f6f324	KVM: x86 emulator: define callbacks for using the guest fpu within the emulator Needed for emulating fpu instructions. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:58 -04:00
Avi Kivity	1d6b114f20	KVM: x86 emulator: do not munge rep prefix Currently we store a rep prefix as 1 or 2 depending on whether it is a REPE or REPNE. Since sse instructions depend on the prefix value, store it as the original opcode to simplify things further on. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:58 -04:00
Avi Kivity	cef4dea07f	KVM: 16-byte mmio support Since sse instructions can issue 16-byte mmios, we need to support them. We can't increase the kvm_run mmio buffer size to 16 bytes without breaking compatibility, so instead we break the large mmios into two smaller 8-byte ones. Since the bus is 64-bit we aren't breaking any atomicity guarantees. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:58 -04:00
Avi Kivity	5287f194bf	KVM: Split mmio completion into a function Make room for sse mmio completions. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:58 -04:00
Avi Kivity	70252a1053	KVM: extend in-kernel mmio to handle >8 byte transactions Needed for coalesced mmio using sse. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:58 -04:00
Gleb Natapov	1499e54af0	KVM: x86: better fix for race between nmi injection and enabling nmi window Fix race between nmi injection and enabling nmi window in a simpler way. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-05-11 07:56:57 -04:00
Marcelo Tosatti	c761e5868e	Revert "KVM: Fix race between nmi injection and enabling nmi window" This reverts commit `f86368493e`. Simpler fix to follow. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-05-11 07:56:57 -04:00
Glauber Costa	3291892450	KVM: expose async pf through our standard mechanism As Avi recently mentioned, the new standard mechanism for exposing features is KVM_GET_SUPPORTED_CPUID, not spamming CAPs. For some reason async pf missed that. So expose async_pf here. Signed-off-by: Glauber Costa <glommer@redhat.com> CC: Gleb Natapov <gleb@redhat.com> CC: Avi Kivity <avi@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:57 -04:00
Avi Kivity	654f06fc65	KVM: VMX: simplify NMI mask management Use vmx_set_nmi_mask() instead of open-coding management of the hardware bit and the software hint (nmi_known_unmasked). There's a slight change of behaviour when running without hardware virtual NMI support - we now clear the NMI mask if NMI delivery faulted in that case as well. This improves emulation accuracy. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:57 -04:00
Jan Kiszka	89a9fb78b5	KVM: SVM: Remove unused svm_features We use boot_cpu_has now. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:57 -04:00
Avi Kivity	8878647585	KVM: VMX: Use cached VM_EXIT_INTR_INFO in handle_exception vmx_complete_atomic_exit() cached it for us, so we can use it here. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:56 -04:00
Avi Kivity	c5ca8e572c	KVM: VMX: Don't VMREAD VM_EXIT_INTR_INFO unconditionally Only read it if we're going to use it later. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:56 -04:00
Avi Kivity	00eba012d5	KVM: VMX: Refactor vmx_complete_atomic_exit() Move the exit reason checks to the front of the function, for early exit in the common case. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:56 -04:00
Avi Kivity	f9902069c4	KVM: VMX: Qualify check for host NMI Check for the exit reason first; this allows us, later, to avoid a VMREAD for VM_EXIT_INTR_INFO_FIELD. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:56 -04:00
Avi Kivity	9d58b93192	KVM: VMX: Avoid vmx_recover_nmi_blocking() when unneeded When we haven't injected an interrupt, we don't need to recover the nmi blocking state (since the guest can't set it by itself). This allows us to avoid a VMREAD later on. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:56 -04:00
Avi Kivity	69c7302890	KVM: VMX: Cache cpl We may read the cpl quite often in the same vmexit (instruction privilege check, memory access checks for instruction and operands), so we gain a bit if we cache the value. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:54 -04:00
Avi Kivity	f4c63e5d5a	KVM: VMX: Optimize vmx_get_cpl() In long mode, vm86 mode is disallowed, so we need not check for it. Reading rflags.vm may require a VMREAD, so it is expensive. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:54 -04:00
Avi Kivity	6de12732c4	KVM: VMX: Optimize vmx_get_rflags() If called several times within the same exit, return cached results. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:54 -04:00
Avi Kivity	f6e7847589	KVM: Use kvm_get_rflags() and kvm_set_rflags() instead of the raw versions Some rflags bits are owned by the host, not guest, so we need to use kvm_get_rflags() to strip those bits away or kvm_set_rflags() to add them back. Signed-off-by: Avi Kivity <avi@redhat.com>	2011-05-11 07:56:54 -04:00
Andre Przywara	bd22f5cfcf	KVM: move and fix substitue search for missing CPUID entries If KVM cannot find an exact match for a requested CPUID leaf, the code will try to find the closest match instead of simply confessing its failure. The implementation was meant to satisfy the CPUID specification, but did not properly check for extended and standard leaves and also didn't account for the index subleaf. Besides, this rule only applies to CPUID intercepts, which is not the only user of the kvm_find_cpuid_entry() function. So fix this algorithm and call it from kvm_emulate_cpuid(). This fixes a crash of newer Linux kernels as KVM guests on AMD Bulldozer CPUs, where bogus values were returned in response to a CPUID intercept. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-04-06 13:15:56 +03:00
Andre Przywara	20800bc940	KVM: fix XSAVE bit scanning When KVM scans the 0xD CPUID leaf for propagating the XSAVE save area leaves, it assumes that the leaves are contiguous and stops at the first zero one. On AMD hardware there is a gap, though, as LWP uses leaf 62 to announce its state save area. So let's iterate through all 64 possible leaves and simply skip zero ones to also cover later features. Signed-off-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-04-06 13:15:55 +03:00
Linus Torvalds	f2e1fbb5f2	Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: Flush TLB if PGD entry is changed in i386 PAE mode x86, dumpstack: Correct stack dump info when frame pointer is available x86: Clean up csum-copy_64.S a bit x86: Fix common misspellings x86: Fix misspelling and align params x86: Use PentiumPro-optimized partial_csum() on VIA C7	2011-03-18 10:45:21 -07:00
Lucas De Marchi	0d2eb44f63	x86: Fix common misspellings They were generated by 'codespell' and then manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: trivial@kernel.org LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2011-03-18 10:39:30 +01:00
Gleb Natapov	776e58ea3d	KVM: unbreak userspace that does not sets tss address Commit 6440e5967bc broke old userspaces that do not set tss address before entering vcpu. Unbreak it by setting tss address to a safe value on the first vcpu entry. New userspaces should set tss address, so print warning in case it doesn't. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-03-17 13:08:35 -03:00


2183 commits