Commit graph

377 commits

Author SHA1 Message Date
Amos Kong
6ea34c9b78 kvm: exclude ioeventfd from counting kvm_io_range limit
We can easily reach the 1000 limit by start VM with a couple
hundred I/O devices (multifunction=on). The hardcode limit
already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000).

In userspace, we already have maximum file descriptor to
limit ioeventfd count. But kvm_io_bus devices also are used
for pit, pic, ioapic, coalesced_mmio. They couldn't be limited
by maximum file descriptor.

Currently only ioeventfds take too much kvm_io_bus devices,
so just exclude it from counting kvm_io_range limit.

Also fixed one indent issue in kvm_host.h

Signed-off-by: Amos Kong <akong@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-04 11:49:38 +03:00
Wei Yongjun
afc2f792cd KVM: add missing misc_deregister() on error in kvm_init()
Add the missing misc_deregister() before return from kvm_init()
in the debugfs init error handling case.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-12 12:06:24 +03:00
Linus Torvalds
c67723ebbb Merge tag 'kvm-3.10-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Gleb Natapov:
 "Most of the fixes are in the emulator since now we emulate more than
  we did before for correctness sake we see more bugs there, but there
  is also an OOPS fixed and corruption of xcr0 register."

* tag 'kvm-3.10-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: emulator: emulate SALC
  KVM: emulator: emulate XLAT
  KVM: emulator: emulate AAM
  KVM: VMX: fix halt emulation while emulating invalid guest sate
  KVM: Fix kvm_irqfd_init initialization
  KVM: x86: fix maintenance of guest/host xcr0 state
2013-05-10 09:08:21 -07:00
Linus Torvalds
daf799cca8 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS updates from Ralf Baechle:

 - More work on DT support for various platforms

 - Various fixes that were to late to make it straight into 3.9

 - Improved platform support, in particular the Netlogic XLR and
   BCM63xx, and the SEAD3 and Malta eval boards.

 - Support for several Ralink SOC families.

 - Complete support for the microMIPS ASE which basically reencodes the
   existing MIPS32/MIPS64 ISA to use non-constant size instructions.

 - Some fallout from LTO work which remove old cruft and will generally
   make the MIPS kernel easier to maintain and resistant to compiler
   optimization, even in absence of LTO.

 - KVM support.  While MIPS has announced hardware virtualization
   extensions this KVM extension uses trap and emulate mode for
   virtualization of MIPS32.  More KVM work to add support for VZ
   hardware virtualizaiton extensions and MIPS64 will probably already
   be merged for 3.11.

Most of this has been sitting in -next for a long time.  All defconfigs
have been build or run time tested except three for which fixes are being
sent by other maintainers.

Semantic conflict with kvm updates done as per Ralf

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (118 commits)
  MIPS: Add new GIC clockevent driver.
  MIPS: Formatting clean-ups for clocksources.
  MIPS: Refactor GIC clocksource code.
  MIPS: Move 'gic_frequency' to common location.
  MIPS: Move 'gic_present' to common location.
  MIPS: MIPS16e: Add unaligned access support.
  MIPS: MIPS16e: Support handling of delay slots.
  MIPS: MIPS16e: Add instruction formats.
  MIPS: microMIPS: Optimise 'strnlen' core library function.
  MIPS: microMIPS: Optimise 'strlen' core library function.
  MIPS: microMIPS: Optimise 'strncpy' core library function.
  MIPS: microMIPS: Optimise 'memset' core library function.
  MIPS: microMIPS: Add configuration option for microMIPS kernel.
  MIPS: microMIPS: Disable LL/SC and fix linker bug.
  MIPS: microMIPS: Add vdso support.
  MIPS: microMIPS: Add unaligned access support.
  MIPS: microMIPS: Support handling of delay slots.
  MIPS: microMIPS: Add support for exception handling.
  MIPS: microMIPS: Floating point support.
  MIPS: microMIPS: Fix macro naming in micro-assembler.
  ...
2013-05-10 07:48:05 -07:00
Ralf Baechle
5e0e61dd2c Merge branch 'next/kvm' into mips-for-linux-next 2013-05-09 17:56:40 +02:00
Sanjay Lal
2f4d9b5442 KVM/MIPS32: Do not call vcpu_load when injecting interrupts.
Signed-off-by: Sanjay Lal <sanjayl@kymasys.com>
Cc: kvm@vger.kernel.org
Cc: linux-mips@linux-mips.org
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2013-05-09 17:48:22 +02:00
Asias He
7dac16c379 KVM: Fix kvm_irqfd_init initialization
In commit a0f155e96 'KVM: Initialize irqfd from kvm_init()', when
kvm_init() is called the second time (e.g kvm-amd.ko and kvm-intel.ko),
kvm_arch_init() will fail with -EEXIST, then kvm_irqfd_exit() will be
called on the error handling path. This way, the kvm_irqfd system will
not be ready.

This patch fix the following:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff81c0721e>] _raw_spin_lock+0xe/0x30
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: vhost_net
CPU 6
Pid: 4257, comm: qemu-system-x86 Not tainted 3.9.0-rc3+ #757 Dell Inc. OptiPlex 790/0V5HMK
RIP: 0010:[<ffffffff81c0721e>]  [<ffffffff81c0721e>] _raw_spin_lock+0xe/0x30
RSP: 0018:ffff880221721cc8  EFLAGS: 00010046
RAX: 0000000000000100 RBX: ffff88022dcc003f RCX: ffff880221734950
RDX: ffff8802208f6ca8 RSI: 000000007fffffff RDI: 0000000000000000
RBP: ffff880221721cc8 R08: 0000000000000002 R09: 0000000000000002
R10: 00007f7fd01087e0 R11: 0000000000000246 R12: ffff8802208f6ca8
R13: 0000000000000080 R14: ffff880223e2a900 R15: 0000000000000000
FS:  00007f7fd38488e0(0000) GS:ffff88022dcc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000022309f000 CR4: 00000000000427e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 4257, threadinfo ffff880221720000, task ffff880222bd5640)
Stack:
 ffff880221721d08 ffffffff810ac5c5 ffff88022431dc00 0000000000000086
 0000000000000080 ffff880223e2a900 ffff8802208f6ca8 0000000000000000
 ffff880221721d48 ffffffff810ac8fe 0000000000000000 ffff880221734000
Call Trace:
 [<ffffffff810ac5c5>] __queue_work+0x45/0x2d0
 [<ffffffff810ac8fe>] queue_work_on+0x8e/0xa0
 [<ffffffff810ac949>] queue_work+0x19/0x20
 [<ffffffff81009b6b>] irqfd_deactivate+0x4b/0x60
 [<ffffffff8100a69d>] kvm_irqfd+0x39d/0x580
 [<ffffffff81007a27>] kvm_vm_ioctl+0x207/0x5b0
 [<ffffffff810c9545>] ? update_curr+0xf5/0x180
 [<ffffffff811b66e8>] do_vfs_ioctl+0x98/0x550
 [<ffffffff810c1f5e>] ? finish_task_switch+0x4e/0xe0
 [<ffffffff81c054aa>] ? __schedule+0x2ea/0x710
 [<ffffffff811b6bf7>] sys_ioctl+0x57/0x90
 [<ffffffff8140ae9e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [<ffffffff81c0f602>] system_call_fastpath+0x16/0x1b
Code: c1 ea 08 38 c2 74 0f 66 0f 1f 44 00 00 f3 90 0f b6 03 38 c2 75 f7 48 83 c4 08 5b c9 c3 55 48 89 e5 66 66 66 66 90 b8 00 01 00 00 <f0> 66 0f c1 07 89 c2 66 c1 ea 08 38 c2 74 0c 0f 1f 00 f3 90 0f
RIP  [<ffffffff81c0721e>] _raw_spin_lock+0xe/0x30
RSP <ffff880221721cc8>
CR2: 0000000000000000
---[ end trace 13fb1e4b6e5ab21f ]---

Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-08 13:15:35 +03:00
Linus Torvalds
01227a889e Merge tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Gleb Natapov:
 "Highlights of the updates are:

  general:
   - new emulated device API
   - legacy device assignment is now optional
   - irqfd interface is more generic and can be shared between arches

  x86:
   - VMCS shadow support and other nested VMX improvements
   - APIC virtualization and Posted Interrupt hardware support
   - Optimize mmio spte zapping

  ppc:
    - BookE: in-kernel MPIC emulation with irqfd support
    - Book3S: in-kernel XICS emulation (incomplete)
    - Book3S: HV: migration fixes
    - BookE: more debug support preparation
    - BookE: e6500 support

  ARM:
   - reworking of Hyp idmaps

  s390:
   - ioeventfd for virtio-ccw

  And many other bug fixes, cleanups and improvements"

* tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  kvm: Add compat_ioctl for device control API
  KVM: x86: Account for failing enable_irq_window for NMI window request
  KVM: PPC: Book3S: Add API for in-kernel XICS emulation
  kvm/ppc/mpic: fix missing unlock in set_base_addr()
  kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
  kvm/ppc/mpic: remove users
  kvm/ppc/mpic: fix mmio region lists when multiple guests used
  kvm/ppc/mpic: remove default routes from documentation
  kvm: KVM_CAP_IOMMU only available with device assignment
  ARM: KVM: iterate over all CPUs for CPU compatibility check
  KVM: ARM: Fix spelling in error message
  ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally
  KVM: ARM: Fix API documentation for ONE_REG encoding
  ARM: KVM: promote vfp_host pointer to generic host cpu context
  ARM: KVM: add architecture specific hook for capabilities
  ARM: KVM: perform HYP initilization for hotplugged CPUs
  ARM: KVM: switch to a dual-step HYP init code
  ARM: KVM: rework HYP page table freeing
  ARM: KVM: enforce maximum size for identity mapped code
  ARM: KVM: move to a KVM provided HYP idmap
  ...
2013-05-05 14:47:31 -07:00
Scott Wood
db6ae61581 kvm: Add compat_ioctl for device control API
This API shouldn't have 32/64-bit issues, but VFS assumes it does
unless told otherwise.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-05 12:14:15 +03:00
Paul Mackerras
5975a2e095 KVM: PPC: Book3S: Add API for in-kernel XICS emulation
This adds the API for userspace to instantiate an XICS device in a VM
and connect VCPUs to it.  The API consists of a new device type for
the KVM_CREATE_DEVICE ioctl, a new capability KVM_CAP_IRQ_XICS, which
functions similarly to KVM_CAP_IRQ_MPIC, and the KVM_IRQ_LINE ioctl,
which is used to assert and deassert interrupt inputs of the XICS.

The XICS device has one attribute group, KVM_DEV_XICS_GRP_SOURCES.
Each attribute within this group corresponds to the state of one
interrupt source.  The attribute number is the same as the interrupt
source number.

This does not support irq routing or irqfd yet.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-05-02 15:28:36 +02:00
Scott Wood
07f0a7bdec kvm: destroy emulated devices on VM exit
The hassle of getting refcounting right was greater than the hassle
of keeping a list of devices to destroy on VM exit.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:28 +02:00
Scott Wood
5df554ad5b kvm/ppc/mpic: in-kernel MPIC emulation
Hook the MPIC code up to the KVM interfaces, add locking, etc.

Signed-off-by: Scott Wood <scottwood@freescale.com>
[agraf: add stub function for kvmppc_mpic_set_epr, non-booke, 64bit]
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:23 +02:00
Scott Wood
852b6d57dc kvm: add device control API
Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without having
to create the device or get/set the attribute, without the need for
separately managing enumerated capabilities.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-04-26 20:27:20 +02:00
Alexander Graf
7df35f5496 KVM: Move irqfd resample cap handling to generic code
Now that we have most irqfd code completely platform agnostic, let's move
irqfd's resample capability return to generic code as well.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2013-04-26 20:27:19 +02:00
Alexander Graf
aa8d5944b8 KVM: Move irq routing to generic code
The IRQ routing set ioctl lives in the hacky device assignment code inside
of KVM today. This is definitely the wrong place for it. Move it to the much
more natural kvm_main.c.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2013-04-26 20:27:17 +02:00
Alexander Graf
a725d56a02 KVM: Introduce CONFIG_HAVE_KVM_IRQ_ROUTING
Quite a bit of code in KVM has been conditionalized on availability of
IOAPIC emulation. However, most of it is generically applicable to
platforms that don't have an IOPIC, but a different type of irq chip.

Make code that only relies on IRQ routing, not an APIC itself, on
CONFIG_HAVE_KVM_IRQ_ROUTING, so that we can reuse it later.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2013-04-26 20:27:14 +02:00
Yang Zhang
a20ed54d6e KVM: VMX: Add the deliver posted interrupt algorithm
Only deliver the posted interrupt when target vcpu is running
and there is no previous interrupt pending in pir.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang
3d81bc7e96 KVM: Call common update function when ioapic entry changed.
Both TMR and EOI exit bitmap need to be updated when ioapic changed
or vcpu's id/ldr/dfr changed. So use common function instead eoi exit
bitmap specific function.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang
aa2fbe6d44 KVM: Let ioapic know the irq line status
Userspace may deliver RTC interrupt without query the status. So we
want to track RTC EOI for this case.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-15 23:20:34 -03:00
Geoff Levand
e3ba45b804 KVM: Move kvm_spurious_fault to x86.c
The routine kvm_spurious_fault() is an x86 specific routine, so
move it from virt/kvm/kvm_main.c to arch/x86/kvm/x86.c.

Fixes this sparse warning when building on arm64:

  virt/kvm/kvm_main.c⚠️ symbol 'kvm_spurious_fault' was not declared. Should it be static?

Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-08 13:02:06 +03:00
Geoff Levand
39369f7a8b KVM: Make local routines static
The routines get_user_page_nowait(), kvm_io_bus_sort_cmp(), kvm_io_bus_insert_dev()
and kvm_io_bus_get_first_dev() are only referenced within kvm_main.c, so give them
static linkage.

Fixes sparse warnings like these:

  virt/kvm/kvm_main.c: warning: symbol 'get_user_page_nowait' was not declared. Should it be static?

Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-08 13:02:04 +03:00
Andrew Honig
8f964525a1 KVM: Allow cross page reads and writes from cached translations.
This patch adds support for kvm_gfn_to_hva_cache_init functions for
reads and writes that will cross a page.  If the range falls within
the same memslot, then this will be a fast operation.  If the range
is split between two memslots, then the slower kvm_read_guest and
kvm_write_guest are used.

Tested: Test against kvm_clock unit tests.

Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-07 13:05:35 +03:00
Raghavendra K T
7bc7ae25b1 kvm: Iterate over only vcpus that are preempted
This helps in filtering out the eligible candidates further and
thus potentially helps in quickly allowing preempted lockholders to run.
Note that if a vcpu was spinning during preemption we filter them
by checking whether they are preempted due to pause loop exit.

Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-11 11:37:22 +02:00
Raghavendra K T
3a08a8f9f0 kvm: Record the preemption status of vcpus using preempt notifiers
Note that we mark as preempted only when vcpu's task state was
Running during preemption.

Thanks Jiannan, Avi for preemption notifier ideas. Thanks Gleb, PeterZ
for their precious suggestions. Thanks Srikar for an idea on avoiding
rcu lock while checking task state that improved overcommit numbers.

Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-11 11:37:08 +02:00
Cornelia Huck
a0f155e964 KVM: Initialize irqfd from kvm_init().
Currently, eventfd introduces module_init/module_exit functions
to initialize/cleanup the irqfd workqueue. This only works, however,
if no other module_init/module_exit functions are built into the
same module.

Let's just move the initialization and cleanup to kvm_init and kvm_exit.
This way, it is also clearer where kvm startup may fail.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-05 19:12:16 -03:00
Takuya Yoshikawa
8482644aea KVM: set_memory_region: Refactor commit_memory_region()
This patch makes the parameter old a const pointer to the old memory
slot and adds a new parameter named change to know the change being
requested: the former is for removing extra copying and the latter is
for cleaning up the code.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Takuya Yoshikawa
7b6195a91d KVM: set_memory_region: Refactor prepare_memory_region()
This patch drops the parameter old, a copy of the old memory slot, and
adds a new parameter named change to know the change being requested.

This not only cleans up the code but also removes extra copying of the
memory slot structure.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Takuya Yoshikawa
74d0727cb7 KVM: set_memory_region: Make kvm_mr_change available to arch code
This will be used for cleaning up prepare/commit_memory_region() later.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Takuya Yoshikawa
47ae31e257 KVM: set_memory_region: Drop user_alloc from set_memory_region()
Except ia64's stale code, KVM_SET_MEMORY_REGION support, this is only
used for sanity checks in __kvm_set_memory_region() which can easily
be changed to use slot id instead.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Takuya Yoshikawa
462fce4606 KVM: set_memory_region: Drop user_alloc from prepare/commit_memory_region()
X86 does not use this any more.  The remaining user, s390's !user_alloc
check, can be simply removed since KVM_SET_MEMORY_REGION ioctl is no
longer supported.

Note: fixed powerpc's indentations with spaces to suppress checkpatch
errors.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Takuya Yoshikawa
7a905b1485 KVM: Remove user_alloc from struct kvm_memory_slot
This field was needed to differentiate memory slots created by the new
API, KVM_SET_USER_MEMORY_REGION, from those by the old equivalent,
KVM_SET_MEMORY_REGION, whose support was dropped long before:

  commit b74a07beed
  KVM: Remove kernel-allocated memory regions

Although we also have private memory slots to which KVM allocates
memory with vm_mmap(), !user_alloc slots in other words, the slot id
should be enough for differentiating them.

Note: corresponding function parameters will be removed later.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-11 11:52:00 +02:00
Takuya Yoshikawa
75d61fbcf5 KVM: set_memory_region: Disallow changing read-only attribute later
As Xiao pointed out, there are a few problems with it:
 - kvm_arch_commit_memory_region() write protects the memory slot only
   for GET_DIRTY_LOG when modifying the flags.
 - FNAME(sync_page) uses the old spte value to set a new one without
   checking KVM_MEM_READONLY flag.

Since we flush all shadow pages when creating a new slot, the simplest
fix is to disallow such problematic flag changes: this is safe because
no one is doing such things.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-04 22:56:47 -02:00
Takuya Yoshikawa
f64c039893 KVM: set_memory_region: Identify the requested change explicitly
KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify
what kind of change is being requested by checking the arguments.  The
current code does this checking at various points in code and each
condition being used there is not easy to understand at first glance.

This patch consolidates these checks and introduces an enum to name the
possible changes to clean up the code.

Although this does not introduce any functional changes, there is one
change which optimizes the code a bit: if we have nothing to change, the
new code returns 0 immediately.

Note that the return value for this case cannot be changed since QEMU
relies on it: we noticed this when we changed it to -EINVAL and got a
section mismatch error at the final stage of live migration.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-04 22:00:53 -02:00
Raghavendra K T
c45c528e89 kvm: Handle yield_to failure return code for potential undercommit case
yield_to returns -ESRCH, When source and target of yield_to
run queue length is one. When we see three successive failures of
yield_to we assume we are in potential undercommit case and abort
from PLE handler.
The assumption is backed by low probability of wrong decision
for even worst case scenarios such as average runqueue length
between 1 and 2.

More detail on rationale behind using three tries:
if p is the probability of finding rq length one on a particular cpu,
and if we do n tries, then probability of exiting ple handler is:

 p^(n+1) [ because we would have come across one source with rq length
1 and n target cpu rqs  with length 1 ]

so
num tries:         probability of aborting ple handler (1.5x overcommit)
 1                 1/4
 2                 1/8
 3                 1/16

We can increase this probability with more tries, but the problem is
the overhead.
Also, If we have tried three times that means we would have iterated
over 3 good eligible vcpus along with many non-eligible candidates. In
worst case if we iterate all the vcpus, we reduce 1x performance and
overcommit performance get hit.

note that we do not update last boosted vcpu in failure cases.
Thank Avi for raising question on aborting after first fail from yield_to.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 15:38:45 +02:00
Yang Zhang
c7c9c56ca2 x86, apicv: add virtual interrupt delivery support
Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
manually, which is fully taken care of by the hardware. This needs
some special awareness into existing interrupr injection path:

- for pending interrupt, instead of direct injection, we may need
  update architecture specific indicators before resuming to guest.

- A pending interrupt, which is masked by ISR, should be also
  considered in above update action, since hardware will decide
  when to inject it at right time. Current has_interrupt and
  get_interrupt only returns a valid vector from injection p.o.v.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:48:19 +02:00
Alex Williamson
261874b0d5 kvm: Force IOMMU remapping on memory slot read-only flag changes
Memory slot flags can be altered without changing other parameters of
the slot.  The read-only attribute is the only one the IOMMU cares
about, so generate an un-map, re-map when this occurs.  This also
avoid unnecessarily re-mapping the slot when no IOMMU visible changes
are made.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-27 12:41:30 +02:00
Takuya Yoshikawa
a843fac253 KVM: set_memory_region: Remove unnecessary variable memslot
One such variable, slot, is enough for holding a pointer temporarily.
We also remove another local variable named slot, which is limited in
a block, since it is confusing to have the same name in this function.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-17 14:27:59 +02:00
Takuya Yoshikawa
0a706beefb KVM: set_memory_region: Don't check for overlaps unless we create or move a slot
Don't need the check for deleting an existing slot or just modifiying
the flags.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-17 14:27:50 +02:00
Takuya Yoshikawa
0ea75e1d26 KVM: set_memory_region: Don't jump to out_free unnecessarily
This makes the separation between the sanity checks and the rest of the
code a bit clearer.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-17 14:27:43 +02:00
Takuya Yoshikawa
c972f3b125 KVM: Write protect the updated slot only when dirty logging is enabled
Calling kvm_mmu_slot_remove_write_access() for a deleted slot does
nothing but search for non-existent mmu pages which have mappings to
that deleted memory; this is safe but a waste of time.

Since we want to make the function rmap based in a later patch, in a
manner which makes it unsafe to be called for a deleted slot, we makes
the caller see if the slot is non-zero and being dirty logged.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:13:15 +02:00
Gleb Natapov
7ec4fb4496 KVM: move the code that installs new slots array to a separate function.
Move repetitive code sequence to a separate function.

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-24 17:49:30 +02:00
Alex Williamson
116c14c019 kvm: Fix memory slot generation updates
Previous patch "kvm: Minor memory slot optimization" (b7f69c555c)
overlooked the generation field of the memory slots.  Re-using the
original memory slots left us with with two slightly different memory
slots with the same generation.  To fix this, make update_memslots()
take a new parameter to specify the last generation.  This also makes
generation management more explicit to avoid such problems in the future.

Reported-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 10:17:38 +02:00
Alex Williamson
1e702d9af5 KVM: struct kvm_memory_slot.id -> short
We're currently offering a whopping 32 memory slots to user space, an
int is a bit excessive for storing this.  We would like to increase
our memslots, but SHRT_MAX should be more than enough.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:25:24 -02:00
Alex Williamson
f82a8cfe93 KVM: struct kvm_memory_slot.user_alloc -> bool
There's no need for this to be an int, it holds a boolean.
Move to the end of the struct for alignment.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:24:38 -02:00
Alex Williamson
bbacc0c111 KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS
It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM.  One is
the user accessible slots and the other is user + private.  Make this
more obvious.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:21:57 -02:00
Alex Williamson
b7f69c555c KVM: Minor memory slot optimization
If a slot is removed or moved in the guest physical address space, we
first allocate and install a new slot array with the invalidated
entry.  The old array is then freed.  We then proceed to allocate yet
another slot array to install the permanent replacement.  Re-use the
original array when this occurs and avoid the extra kfree/kmalloc.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:21:53 -02:00
Alex Williamson
e40f193f5b KVM: Fix iommu map/unmap to handle memory slot moves
The iommu integration into memory slots expects memory slots to be
added or removed and doesn't handle the move case.  We can unmap
slots from the iommu after we mark them invalid and map them before
installing the final memslot array.  Also re-order the kmemdup vs
map so we don't leave iommu mappings if we get ENOMEM.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:21:52 -02:00
Alex Williamson
9c695d42db KVM: Check userspace_addr when modifying a memory slot
The API documents that only flags and guest physical memory space can
be modified on an existing slot, but we don't enforce that the
userspace address cannot be modified.  Instead we just ignore it.
This means that a user may think they've successfully moved both the
guest and user addresses, when in fact only the guest address changed.
Check and error instead.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:21:51 -02:00
Alex Williamson
f0736cf055 KVM: Restrict non-existing slot state transitions
The API documentation states:

	When changing an existing slot, it may be moved in the guest
	physical memory space, or its flags may be modified.

An "existing slot" requires a non-zero npages (memory_size).  The only
transition we should therefore allow for a non-existing slot should be
to create the slot, which includes setting a non-zero memory_size.  We
currently allow calls to modify non-existing slots, which is pointless,
confusing, and possibly wrong.

With this we know that the invalidation path of __kvm_set_memory_region
is always for a delete or move and never for adding a zero size slot.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:21:50 -02:00
Alex Williamson
5419369ed6 KVM: Fix user memslot overlap check
Prior to memory slot sorting this loop compared all of the user memory
slots for overlap with new entries.  With memory slot sorting, we're
just checking some number of entries in the array that may or may not
be user slots.  Instead, walk all the slots with kvm_for_each_memslot,
which has the added benefit of terminating early when we hit the first
empty slot, and skip comparison to private slots.

Cc: stable@vger.kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-29 23:30:32 -02:00