Commit Graph

203 Commits

Author SHA1 Message Date
Michael S. Tsirkin 8160c4e455 vfio: fix ioctl error handling
Calling return copy_to_user(...) in an ioctl will not
do the right thing if there's a pagefault:
copy_to_user returns the number of bytes not copied
in this case.

Fix up vfio to do
	return copy_to_user(...)) ?
		-EFAULT : 0;

everywhere.

Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-28 07:38:52 -07:00
Alex Williamson 16ab8a5cbe vfio/noiommu: Don't use iommu_present() to track fake groups
Using iommu_present() to determine whether an IOMMU group is real or
fake has some problems.  First, apparently Power systems don't
register an IOMMU on the device bus, so the groups and containers get
marked as noiommu and then won't bind to their actual IOMMU driver.
Second, I expect we'll run into the same issue as we try to support
vGPUs through vfio, since they're likely to emulate this behavior of
creating an IOMMU group on a virtual device and then providing a vfio
IOMMU backend tailored to the sort of isolation they provide, which
won't necessarily be fully compatible with the IOMMU API.

The solution here is to use the existing iommudata interface to IOMMU
groups, which allows us to easily identify the fake groups we've
created for noiommu purposes.  The iommudata we set is purely
arbitrary since we're only comparing the address, so we use the
address of the noiommu switch itself.

Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <sshukla@mvista.com>
Fixes: 03a76b60f8 ("vfio: Include No-IOMMU mode")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-01-27 11:22:25 -07:00
Pierre Morel d4f50ee2f5 vfio/iommu_type1: make use of info.flags
The flags entry is there to tell the user that some
optional information is available.

Since we report the iova_pgsizes signal it to the user
by setting the flags to VFIO_IOMMU_INFO_PGSIZES.

Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-01-04 12:55:44 -07:00
Alex Williamson 03a76b60f8 vfio: Include No-IOMMU mode
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2015-12-21 15:28:11 -07:00
Dan Carpenter 967628827f VFIO: platform: reset: fix a warning message condition
This loop ends with count set to -1 and not zero so the warning message
isn't printed when it should be.  I've fixed this by change the postop
to a preop.

Fixes: 0990822c98 ('VFIO: platform: reset: AMD xgbe reset module')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-12-21 15:28:11 -07:00
Alex Williamson ae5515d663 Revert: "vfio: Include No-IOMMU mode"
Revert commit 033291eccb ("vfio: Include No-IOMMU mode") due to lack
of a user.  This was originally intended to fill a need for the DPDK
driver, but uptake has been slow so rather than support an unproven
kernel interface revert it and revisit when userspace catches up.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-12-04 08:38:42 -07:00
Dan Carpenter 049af1060b vfio: fix a warning message
The first argument to the WARN() macro has to be a condition.  I'm sort
of disappointed that this code doesn't generate a compiler warning.  I
guess -Wformat-extra-args doesn't work in the kernel.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-21 06:55:58 -07:00
Kees Cook 7200be7c81 vfio: platform: remove needless stack usage
request_module already takes format strings, so no need to duplicate
the effort.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-20 09:00:10 -07:00
Julia Lawall 7d10f4e079 vfio-pci: constify pci_error_handlers structures
This pci_error_handlers structure is never modified, like all the other
pci_error_handlers structures, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-19 16:53:07 -07:00
Krzysztof Kozlowski 7fe9414223 vfio: Drop owner assignment from platform_driver
platform_driver does not need to set an owner because
platform_driver_register() will set it.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-19 15:52:08 -07:00
Linus Torvalds 934f98d7e8 VFIO updates for v4.4-rc1
- Use kernel interfaces for VPD emulation (Alex Williamson)
  - Platform fix for releasing IRQs (Eric Auger)
  - Type1 IOMMU always advertises PAGE_SIZE support when smaller
    mapping sizes are available (Eric Auger)
  - Platform fixes for incorrectly using copies of structures rather
    than pointers to structures (James Morse)
  - Rework platform reset modules, fix leak, and add AMD xgbe reset
    module (Eric Auger)
  - Fix vfio_device_get_from_name() return value (Joerg Roedel)
  - No-IOMMU interface (Alex Williamson)
  - Fix potential out of bounds array access in PCI config handling
    (Dan Carpenter)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWRg8+AAoJECObm247sIsitkwP+wc6cRzBpeGxiufZl9Ci7JV9
 G0qNHBm8tAYDwm1uJATcyZC303ad2B3gaYSO6msTFUXTg3d9ZDtUOWhoTNPs/px1
 5dF38DREzqYHpyC9HT2Qj3i9G+ejvgg+SoyxBOTOIw/dmq3tGZhUz1Sj+wuvQTwr
 XXBKsWltZ+8wwXmQXmOWI4L3m7Xhs8NwAl5iLJ3UiltpW9zZzuPtoKnQCfMYUcmh
 hIJg52t0WPSLyn47UvecUQcqxaO+QYELa7UN84fnQAihk+ewMpzg5blTebayFdu3
 f2WC3ivxbamebw74LaRfQFjx4mT+DI0aXYtraC600PVe7gdVXB66QMNNpPhBwAy5
 wpfeFpTKU5gC+LHmrIMUS2/A4sdNfUBw44CS8+Lm2D6bQAblPv/C5xQV1rz9HADv
 f4/D3Y0TUKSYArewtBHTC0mnXdkZwetttBoy6/zQBl8vkelhoJ3GPcVa8FEZCIuT
 2MSS17I3ftJ1enfynicF+Wstn/H5lWcuRBdg5wTLHIuhFn6MiEVxfIuSEx9JfjIb
 NGZO7y5JiJ0b5QRCG0tFznwceU/cql/3oRqOGXqaf1cQ1Ag3JOIAUzxknFoJQUj1
 XYe+Im1eMaugjj39J3+m5EYNKT3nh/bBLD/V3iWYpgoZtQrmQQm5nu0JsQo88/JR
 je0BuJioCuPlO/Wj/KYw
 =bI62
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.4-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:
 - Use kernel interfaces for VPD emulation (Alex Williamson)
 - Platform fix for releasing IRQs (Eric Auger)
 - Type1 IOMMU always advertises PAGE_SIZE support when smaller mapping
   sizes are available (Eric Auger)
 - Platform fixes for incorrectly using copies of structures rather than
   pointers to structures (James Morse)
 - Rework platform reset modules, fix leak, and add AMD xgbe reset
   module (Eric Auger)
 - Fix vfio_device_get_from_name() return value (Joerg Roedel)
 - No-IOMMU interface (Alex Williamson)
 - Fix potential out of bounds array access in PCI config handling (Dan
   Carpenter)

* tag 'vfio-v4.4-rc1' of git://github.com/awilliam/linux-vfio:
  vfio/pci: make an array larger
  vfio: Include No-IOMMU mode
  vfio: Fix bug in vfio_device_get_from_name()
  VFIO: platform: reset: AMD xgbe reset module
  vfio: platform: reset: calxedaxgmac: fix ioaddr leak
  vfio: platform: add dev_info on device reset
  vfio: platform: use list of registered reset function
  vfio: platform: add compat in vfio_platform_device
  vfio: platform: reset: calxedaxgmac: add reset function registration
  vfio: platform: introduce module_vfio_reset_handler macro
  vfio: platform: add capability to register a reset function
  vfio: platform: introduce vfio-platform-base module
  vfio/platform: store mapped memory in region, instead of an on-stack copy
  vfio/type1: handle case where IOMMU does not support PAGE_SIZE size
  VFIO: platform: clear IRQ_NOAUTOEN when de-assigning the IRQ
  vfio/pci: Use kernel VPD access functions
  vfio: Whitelist PCI bridges
2015-11-13 17:05:32 -08:00
Dan Carpenter 222e684ca7 vfio/pci: make an array larger
Smatch complains about a possible out of bounds error:

	drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init()
	error: buffer overflow 'pci_cap_length' 20 <= 20

The problem is that pci_cap_length[] was defined as large enough to
hold "PCI_CAP_ID_AF + 1" elements.  The code in vfio_cap_init() assumes
it has PCI_CAP_ID_MAX + 1 elements.  Originally, PCI_CAP_ID_AF and
PCI_CAP_ID_MAX were the same but then we introduced PCI_CAP_ID_EA in
commit f80b0ba959 ("PCI: Add Enhanced Allocation register entries")
so now the array is too small.

Let's fix this by making the array size PCI_CAP_ID_MAX + 1.  And let's
make a similar change to pci_ext_cap_length[] for consistency.  Also
both these arrays can be made const.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-09 08:59:11 -07:00
Alex Williamson 033291eccb vfio: Include No-IOMMU mode
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2015-11-04 09:56:16 -07:00
Joerg Roedel e324fc82ea vfio: Fix bug in vfio_device_get_from_name()
The vfio_device_get_from_name() function might return a
non-NULL pointer, when called with a device name that is not
found in the list. This causes undefined behavior, in my
case calling an invalid function pointer later on:

 kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
 BUG: unable to handle kernel paging request at ffff8800cb3ddc08

[...]

 Call Trace:
  [<ffffffffa03bd733>] ? vfio_group_fops_unl_ioctl+0x253/0x410 [vfio]
  [<ffffffff811efc4d>] do_vfs_ioctl+0x2cd/0x4c0
  [<ffffffff811f9657>] ? __fget+0x77/0xb0
  [<ffffffff811efeb9>] SyS_ioctl+0x79/0x90
  [<ffffffff81001bb0>] ? syscall_return_slowpath+0x50/0x130
  [<ffffffff8167f776>] entry_SYSCALL_64_fastpath+0x16/0x75

Fix the issue by returning NULL when there is no device with
the requested name in the list.

Cc: stable@vger.kernel.org # v4.2+
Fixes: 4bc94d5dc9 ("vfio: Fix lockdep issue")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-04 09:27:39 -07:00
Eric Auger 0990822c98 VFIO: platform: reset: AMD xgbe reset module
This patch introduces a module that registers and implements a low-level
reset function for the AMD XGBE device.

it performs the following actions:
- reset the PHY
- disable auto-negotiation
- disable & clear auto-negotiation IRQ
- soft-reset the MAC

Those tiny pieces of code are inherited from the native xgbe driver.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:21 -07:00
Eric Auger daac3bbedb vfio: platform: reset: calxedaxgmac: fix ioaddr leak
In the current code the vfio_platform_region is copied on the stack.
As a consequence the ioaddr address is not iounmapped in the vfio
platform driver (vfio_platform_regions_cleanup). The patch uses the
pointer to the region instead.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:06 -07:00
Eric Auger 705e60bae3 vfio: platform: add dev_info on device reset
It might be helpful for the end-user to check the device reset
function was found by the vfio platform reset framework.

Lets store a pointer to the struct device in vfio_platform_device
and trace when the reset function is called or not found.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:05 -07:00
Eric Auger e9e0506ee6 vfio: platform: use list of registered reset function
Remove the static lookup table and use the dynamic list of registered
reset functions instead. Also load the reset module through its alias.
The reset struct module pointer is stored in vfio_platform_device.

We also remove the useless struct device pointer parameter in
vfio_platform_get_reset.

This patch fixes the issue related to the usage of __symbol_get, which
besides from being moot, prevented compilation with CONFIG_MODULES
disabled.

Also usage of MODULE_ALIAS makes possible to add a new reset module
without needing to update the framework. This was suggested by Arnd.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:03 -07:00
Eric Auger 0628c4dfd3 vfio: platform: add compat in vfio_platform_device
Let's retrieve the compatibility string on probe and store it
in the vfio_platform_device struct

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:02 -07:00
Eric Auger 680742f644 vfio: platform: reset: calxedaxgmac: add reset function registration
This patch adds the reset function registration/unregistration.
This is handled through the module_vfio_reset_handler macro. This
latter also defines a MODULE_ALIAS which simplifies the load from
vfio-platform.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:55:00 -07:00
Eric Auger 588646529f vfio: platform: introduce module_vfio_reset_handler macro
The module_vfio_reset_handler macro
- define a module alias
- implement module init/exit function which respectively registers
  and unregisters the reset function.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:54:59 -07:00
Eric Auger e086497d31 vfio: platform: add capability to register a reset function
In preparation for subsequent changes in reset function lookup,
lets introduce a dynamic list of reset combos (compat string,
reset module, reset function). The list can be populated/voided with
vfio_platform_register/unregister_reset. Those are not yet used in
this patch.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:54:57 -07:00
Eric Auger 32a2d71c4e vfio: platform: introduce vfio-platform-base module
To prepare for vfio platform reset rework let's build
vfio_platform_common.c and vfio_platform_irq.c in a separate
module from vfio-platform and vfio-amba. This makes possible
to have separate module inits and works around a race between
platform driver init and vfio reset module init: that way we
make sure symbols exported by base are available when vfio-platform
driver gets probed.

The open/release being implemented in the base module, the ref
count is applied to the parent module instead.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:54:56 -07:00
James Morse 1b4bb2eaa9 vfio/platform: store mapped memory in region, instead of an on-stack copy
vfio_platform_{read,write}_mmio() call ioremap_nocache() to map
a region of io memory, which they store in struct vfio_platform_region to
be eventually re-used, or unmapped by vfio_platform_regions_cleanup().

These functions receive a copy of their struct vfio_platform_region
argument on the stack - so these mapped areas are always allocated, and
always leaked.

Pass this argument as a pointer instead.

Fixes: 6e3f264560 "vfio/platform: read and write support for the device fd"
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:54:00 -07:00
Eric Auger 4644321fd3 vfio/type1: handle case where IOMMU does not support PAGE_SIZE size
Current vfio_pgsize_bitmap code hides the supported IOMMU page
sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
does not support PAGE_SIZE page, the alignment check on map/unmap
is done with larger page sizes, if any. This can fail although
mapping could be done with pages smaller than PAGE_SIZE.

This patch modifies vfio_pgsize_bitmap implementation so that,
in case the IOMMU supports page sizes smaller than PAGE_SIZE
we pretend PAGE_SIZE is supported and hide sub-PAGE_SIZE sizes.
That way the user will be able to map/unmap buffers whose size/
start address is aligned with PAGE_SIZE. Pinning code uses that
granularity while iommu driver can use the sub-PAGE_SIZE size
to map the buffer.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-11-03 12:53:53 -07:00
Eric Auger 1276ece32c VFIO: platform: clear IRQ_NOAUTOEN when de-assigning the IRQ
The vfio platform driver currently sets the IRQ_NOAUTOEN before
doing the request_irq to properly handle the user masking. However
it does not clear it when de-assigning the IRQ. This brings issues
when loading the native driver again which may not explicitly enable
the IRQ. This problem was observed with xgbe driver.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-10-27 15:02:53 -06:00
Alex Williamson 4e1a635552 vfio/pci: Use kernel VPD access functions
The PCI VPD capability operates on a set of window registers in PCI
config space.  Writing to the address register triggers either a read
or write, depending on the setting of the PCI_VPD_ADDR_F bit within
the address register.  The data register provides either the source
for writes or the target for reads.

This model is susceptible to being broken by concurrent access, for
which the kernel has adopted a set of access functions to serialize
these registers.  Additionally, commits like 932c435cab ("PCI: Add
dev_flags bit to access VPD through function 0") and 7aa6ca4d39
("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate
that VPD registers can be shared between functions on multifunction
devices creating dependencies between otherwise independent devices.

Fortunately it's quite easy to emulate the VPD registers, simply
storing copies of the address and data registers in memory and
triggering a VPD read or write on writes to the address register.
This allows vfio users to avoid seeing spurious register changes from
accesses on other devices and enables the use of shared quirks in the
host kernel.  We can theoretically still race with access through
sysfs, but the window of opportunity is much smaller.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Mark Rustad <mark.d.rustad@intel.com>
2015-10-27 14:53:05 -06:00
Alex Williamson 5f096b14d4 vfio: Whitelist PCI bridges
When determining whether a group is viable, we already allow devices
bound to pcieport.  Generalize this to include any PCI bridge device.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-10-27 14:53:04 -06:00
Feng Wu 6d7425f109 vfio: Register/unregister irq_bypass_producer
This patch adds the registration/unregistration of an
irq_bypass_producer for MSI/MSIx on vfio pci devices.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:50 +02:00
Alex Williamson 4bc94d5dc9 vfio: Fix lockdep issue
When we open a device file descriptor, we currently have the
following:

vfio_group_get_device_fd()
  mutex_lock(&group->device_lock);
    open()
    ...
    if (ret)
      release()

If we hit that error case, we call the backend driver release path,
which for vfio-pci looks like this:

vfio_pci_release()
  vfio_pci_disable()
    vfio_pci_try_bus_reset()
      vfio_pci_get_devs()
        vfio_device_get_from_dev()
          vfio_group_get_device()
            mutex_lock(&group->device_lock);

Whoops, we've stumbled back onto group.device_lock and created a
deadlock.  There's a low likelihood of ever seeing this play out, but
obviously it needs to be fixed.  To do that we can use a reference to
the vfio_device for vfio_group_get_device_fd() rather than holding the
lock.  There was a loop in this function, theoretically allowing
multiple devices with the same name, but in practice we don't expect
such a thing to happen and the code is already aborting from the loop
with break on any sort of error rather than continuing and only
parsing the first match anyway, so the loop was effectively unused
already.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fixes: 20f300175a ("vfio/pci: Fix racy vfio_device_get_from_dev() call")
Reported-by: Joerg Roedel <joro@8bytes.org>
Tested-by: Joerg Roedel <jroedel@suse.de>
2015-07-24 15:14:04 -06:00
Linus Torvalds b779157dd3 VFIO updates for v4.2
- Fix race with device reference versus driver release (Alex Williamson)
 - Add reset hooks and Calxeda xgmac reset for vfio-platform (Eric Auger)
 - Enable vfio-platform for ARM64 (Eric Auger)
 - Tag Baptiste Reynal as vfio-platform sub-maintainer (Alex Williamson)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVjaiUAAoJECObm247sIsiRJQP/RiqcYg9uDGbBt8Gi+Qi2BmG
 KrVqG/ty18CZgyis01GZa5O8bL2ViQFAwfpnIJ6l5v4IQAYUUPlfy6Gy9ZgotQir
 U5ot4ABT9KGqJyjkDixvfUmg5M1gj1euO/xC+KVMpnxN0tC1wbymtAahzd3/RXyq
 5cef/cran5bwnh2RtAN/rFkbjZBuqppVZJJV7uyOd69HXhktrZw5skQssL/61+vT
 znww9H6WT/Qk60haQrcD+SVqAaPv6Tx8bu1LVZ1Zc/rWVxJ2HisOp5dTxGeRnaSu
 PtMfOndgtYwe5fBaYyNjWluJLpBzY7b5UvHtsZ52hPXvlgkKSMrfCJxM4BQQSekW
 txmeV6LpQsm+Nq0+sIOFph8h/LzZaebvMsc5YnLamixGb/mAxbbr31yGjokTuS4F
 ndKUR+9/mFrdcqeBT4UD1Yt45YHl7bRZ9AWKJIC1TstrTW+1rEZ49r71PLea6y4f
 SJJ+eCvhYr/eebvrsfePqzeo1X+S8HNTgCXrSlqZdohS21qtnsh4g/CaOxI2UVt8
 zaZosxNQLzpNK/Z8s7MZpCMenWZsq2frcckROeEgNKdPntwkbVT4MG3LBrWvtuBU
 HGAKfWCGAjfIIN8ZBUSFtIqua/kXCcec99CcYYQOtv16UbmhadAPoUlMur3HI9T9
 S0vWtIKTsRa8ymyrGBDp
 =/gWA
 -----END PGP SIGNATURE-----

Merge tag 'vfio-v4.2-rc1' of git://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - fix race with device reference versus driver release (Alex Williamson)

 - add reset hooks and Calxeda xgmac reset for vfio-platform (Eric Auger)

 - enable vfio-platform for ARM64 (Eric Auger)

 - tag Baptiste Reynal as vfio-platform sub-maintainer (Alex Williamson)

* tag 'vfio-v4.2-rc1' of git://github.com/awilliam/linux-vfio:
  MAINTAINERS: Add vfio-platform sub-maintainer
  VFIO: platform: enable ARM64 build
  VFIO: platform: Calxeda xgmac reset module
  VFIO: platform: populate the reset function on probe
  VFIO: platform: add reset callback
  VFIO: platform: add reset struct and lookup table
  vfio/pci: Fix racy vfio_device_get_from_dev() call
2015-06-28 12:32:13 -07:00
Linus Torvalds 08d183e3c1 powerpc updates for 4.2
- Disable the 32-bit vdso when building LE, so we can build with a 64-bit only
    toolchain.
  - EEH fixes from Gavin & Richard.
  - Enable the sys_kcmp syscall from Laurent.
  - Sysfs control for fastsleep workaround from Shreyas.
  - Expose OPAL events as an irq chip by Alistair.
  - MSI ops moved to pci_controller_ops by Daniel.
  - Fix for kernel to userspace backtraces for perf from Anton.
  - Merge pseries and pseries_le defconfigs from Cyril.
  - CXL in-kernel API from Mikey.
  - OPAL prd driver from Jeremy.
  - Fix for DSCR handling & tests from Anshuman.
  - Powernv flash mtd driver from Cyril.
  - Dynamic DMA Window support on powernv from Alexey.
  - LLVM clang fixes & workarounds from Anton.
  - Reworked version of the patch to abort syscalls when transactional.
  - Fix the swap encoding to support 4TB, from Aneesh.
  - Various fixes as usual.
  - Freescale updates from Scott: Highlights include more 8xx optimizations, an
    e6500 hugetlb optimization, QMan device tree nodes, t1024/t1023 support, and
    various fixes and cleanup.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJViSZqAAoJEFHr6jzI4aWAA7kQAKq3+pejfo2rY7alpKJyeVao
 vlaIEaDNOTh+ctcmu3MFF9Jy6fai8gNZziRXU5JRmE5RW4GVBN4KZiqXRbkVjdBK
 uG9sCX7Y58VRsS2vnGBYLsamfTMgjaXeDvgunQHVLiechJnrDr0RHEK90F3LSi73
 Axp6l8XIG63a3zFZmkhzANMCme2lm5+MWmGlSjUUNi5F+viQUgJc5iiO8xrVUgM5
 RpNlV2NJSqFiU+gMQWJ226V85UIniouq4j+qtyUcu8/m9BberyolXVU0GPlPFdsx
 r/Qh9uCJyZaUdSB5hzomQZj50IsSz6J6nEuJTeGRoVZOmeI8Dnc2xU9fxQF5fC8H
 lUJw10WPoNOggQZTeSUKn7wTXw3i4p3KsWNUczaW68VJdhqZUVaSp0+I6mnDSqzs
 9iGC+VffLYNa1OHq7mGRFrgDdLBCHes31aZ3CxlQsmyNpAPCwMzsD4TUfVnvOG6E
 oJOeaQ4mZM9PvqxEYJfoIL+vgRxmQ8sdIBtNY4in+C7J6eFnZNFO9xmPnJZuVU31
 PGtx60kjFCOVMXvqn34WkRNbgqGWI91IK0KcRwFO2LXVio1uY77TWL52kNK2IMsp
 Az+VDDvqnT3+BoV1yz0P6SrXAkwTpvFk2y+IdmEiUUN7zZFL5ZSA2epej9AzHTAK
 WID2bc5yVtIL6p6x5ICH
 =d9Wh
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux

Pull powerpc updates from Michael Ellerman:

 - disable the 32-bit vdso when building LE, so we can build with a
   64-bit only toolchain.

 - EEH fixes from Gavin & Richard.

 - enable the sys_kcmp syscall from Laurent.

 - sysfs control for fastsleep workaround from Shreyas.

 - expose OPAL events as an irq chip by Alistair.

 - MSI ops moved to pci_controller_ops by Daniel.

 - fix for kernel to userspace backtraces for perf from Anton.

 - merge pseries and pseries_le defconfigs from Cyril.

 - CXL in-kernel API from Mikey.

 - OPAL prd driver from Jeremy.

 - fix for DSCR handling & tests from Anshuman.

 - Powernv flash mtd driver from Cyril.

 - dynamic DMA Window support on powernv from Alexey.

 - LLVM clang fixes & workarounds from Anton.

 - reworked version of the patch to abort syscalls when transactional.

 - fix the swap encoding to support 4TB, from Aneesh.

 - various fixes as usual.

 - Freescale updates from Scott: Highlights include more 8xx
   optimizations, an e6500 hugetlb optimization, QMan device tree nodes,
   t1024/t1023 support, and various fixes and cleanup.

* tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (180 commits)
  cxl: Fix typo in debug print
  cxl: Add CXL_KERNEL_API config option
  powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()
  powerpc/mm: Change the swap encoding in pte.
  powerpc/mm: PTE_RPN_MAX is not used, remove the same
  powerpc/tm: Abort syscalls in active transactions
  powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=off
  powerpc/include: Add opal-prd to installed uapi headers
  powerpc/powernv: fix construction of opal PRD messages
  powerpc/powernv: Increase opal-irqchip initcall priority
  powerpc: Make doorbell check preemption safe
  powerpc/powernv: pnv_init_idle_states() should only run on powernv
  macintosh/nvram: Remove as unused
  powerpc: Don't use gcc specific options on clang
  powerpc: Don't use -mno-strict-align on clang
  powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain supports it
  powerpc: Only use -mabi=altivec if toolchain supports it
  powerpc: Fix duplicate const clang warning in user access code
  vfio: powerpc/spapr: Support Dynamic DMA windows
  vfio: powerpc/spapr: Register memory and define IOMMU v2
  ...
2015-06-24 08:46:32 -07:00
Eric Auger e6bcd47f8f VFIO: platform: enable ARM64 build
This patch enables building VFIO platform and derivatives on ARM64.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:47 -06:00
Eric Auger 713cc334a6 VFIO: platform: Calxeda xgmac reset module
This patch introduces a module that registers and implements a basic
reset function for the Calxeda xgmac device. This latter basically disables
interrupts and stops DMA transfers.

The reset function code is inherited from the native calxeda xgmac driver.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:38 -06:00
Eric Auger 3eeb0d510c VFIO: platform: populate the reset function on probe
The reset function lookup happens on vfio-platform probe. The reset
module load is requested  and a reference to the function symbol is
hold. The reference is released on vfio-platform remove.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:30 -06:00
Eric Auger 813ae66008 VFIO: platform: add reset callback
A new reset callback is introduced. If this callback is populated,
the reset is invoked on device first open/last close or upon userspace
ioctl.  The modality is exposed on VFIO_DEVICE_GET_INFO.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-22 09:35:22 -06:00
Eric Auger 9f85d8f9fa VFIO: platform: add reset struct and lookup table
This patch introduces the vfio_platform_reset_combo struct that
stores all the information useful to handle the reset modality:
compat string, name of the reset function, name of the module that
implements the reset function. A lookup table of such structures
is added, currently void.

Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2015-06-17 15:39:09 -06:00
Alexey Kardashevskiy e633bc86a9 vfio: powerpc/spapr: Support Dynamic DMA windows
This adds create/remove window ioctls to create and remove DMA windows.
sPAPR defines a Dynamic DMA windows capability which allows
para-virtualized guests to create additional DMA windows on a PCI bus.
The existing linux kernels use this new window to map the entire guest
memory and switch to the direct DMA operations saving time on map/unmap
requests which would normally happen in a big amounts.

This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
Up to 2 windows are supported now by the hardware and by this driver.

This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
information such as a number of supported windows and maximum number
levels of TCE tables.

DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
as we still want to support v2 on platforms which cannot do DDW for
the sake of TCE acceleration in KVM (coming soon).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:55 +10:00
Alexey Kardashevskiy 2157e7b82f vfio: powerpc/spapr: Register memory and define IOMMU v2
The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.

This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).

As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
DDW API), this creates a default DMA window for IODA2 for consistency.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:55 +10:00
Alexey Kardashevskiy 46d3e1e162 vfio: powerpc/spapr: powerpc/powernv/ioda2: Use DMA windows API in ownership control
Before the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.

This introduces a new IOMMU ownership flavour when the user can not
just control the existing IOMMU table but remove/create tables on demand.
If an IOMMU implements take/release_ownership() callbacks, this lets
the user have full control over the IOMMU group. When the ownership
is taken, the platform code removes all the windows so the caller must
create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.

This changes IODA2's onwership handler to remove the existing table
rather than manipulating with the existing one. From now on,
iommu_take_ownership() and iommu_release_ownership() are only called
from the vfio_iommu_spapr_tce driver.

Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.

No change in userspace-visible behaviour is expected. Since it recreates
TCE tables on each ownership change, related kernel traces will appear
more often.

This adds a pnv_pci_ioda2_setup_default_config() which is called
when PE is being configured at boot time and when the ownership is
passed from VFIO to the platform code.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:54 +10:00
Alexey Kardashevskiy 4793d65d1a vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.

create_table() creates a TCE table with specific parameters.
it receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi-level table might be also specific to the PHB model (not
the case now though).
This callback calculated the DMA window offset on a PCI bus from @num
and stores it in a just created table.

set_window() sets the window at specified TVT index + @num on PHB.

unset_window() unsets the window from specified TVT.

This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.

create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.

This adds IOMMU capabilities to iommu_table_group such as default
32bit window parameters and others. This makes use of new values in
vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
advertise pagemasks to the userspace.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:52 +10:00
Alexey Kardashevskiy 05c6cfb9dc powerpc/iommu/powernv: Release replaced TCE
At the moment writing new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However PAPR specification allows
the guest to write new TCE value without clearing it first.

Another problem this patch is addressing is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are to protect
DMA page allocator rather than entries and since the host kernel does
not control what pages are in use, there is no point in pool locks and
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns replaced TCE and DMA direction so
the caller can release the pages afterwards. The exchange() receives
a physical address unlike set() which receives linear mapping address;
and returns a physical address as the clear() does.

This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.

This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().

This makes sure that TCE permission bits are not set in TCE passed to
IOMMU API as those are to be calculated by platform code from
DMA direction.

This moves SetPageDirty() to the IOMMU code to make it work for both
VFIO ioctl interface in in-kernel TCE acceleration (when it becomes
available later).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:49 +10:00
Alexey Kardashevskiy f87a88642e vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
which call in a loop iommu_take_ownership()/iommu_release_ownership()
for every table on the group. As there is just one now, no change in
behaviour is expected.

At the moment the iommu_table struct has a set_bypass() which enables/
disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
which calls this callback when external IOMMU users such as VFIO are
about to get over a PHB.

The set_bypass() callback is not really an iommu_table function but
IOMMU/PE function. This introduces a iommu_table_group_ops struct and
adds take_ownership()/release_ownership() callbacks to it which are
called when an external user takes/releases control over the IOMMU.

This replaces set_bypass() with ownership callbacks as it is not
necessarily just bypass enabling, it can be something else/more
so let's give it more generic name.

The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
The following patches will replace iommu_take_ownership/
iommu_release_ownership calls in IODA2 with full IOMMU table release/
create.

As we here and touching bypass control, this removes
pnv_pci_ioda2_setup_bypass_pe() as it does not do much
more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
initialization to pnv_pci_ioda2_setup_dma_pe.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:47 +10:00
Alexey Kardashevskiy 0eaf4defc7 powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
So far one TCE table could only be used by one IOMMU group. However
IODA2 hardware allows programming the same TCE table address to
multiple PE allowing sharing tables.

This replaces a single pointer to a group in a iommu_table struct
with a linked list of groups which provides the way of invalidating
TCE cache for every PE when an actual TCE table is updated. This adds
pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
helpers to manage the list. However without VFIO, it is still going
to be a single IOMMU group per iommu_table.

This changes iommu_add_device() to add a device to a first group
from the group list of a table as it is only called from the platform
init code or PCI bus notifier and at these moments there is only
one group per table.

This does not change TCE invalidation code to loop through all
attached groups in order to simplify this patch and because
it is not really needed in most cases. IODA2 is fixed in a later
patch.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:16:15 +10:00
Alexey Kardashevskiy b348aa6529 powerpc/spapr: vfio: Replace iommu_table with iommu_table_group
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
for TCE tables. Right now just one table is supported.

This defines iommu_table_group struct which stores pointers to
iommu_group and iommu_table(s). This replaces iommu_table with
iommu_table_group where iommu_table was used to identify a group:
- iommu_register_group();
- iommudata of generic iommu_group;

This removes @data from iommu_table as it_table_group provides
same access to pnv_ioda_pe.

For IODA, instead of embedding iommu_table, the new iommu_table_group
keeps pointers to those. The iommu_table structs are allocated
dynamically.

For P5IOC2, both iommu_table_group and iommu_table are embedded into
PE struct. As there is no EEH and SRIOV support for P5IOC2,
iommu_free_table() should not be called on iommu_table struct pointers
so we can keep it embedded in pnv_phb::p5ioc2.

For pSeries, this replaces multiple calls of kzalloc_node() with a new
iommu_pseries_alloc_group() helper and stores the table group struct
pointer into the pci_dn struct. For release, a iommu_table_free_group()
helper is added.

This moves iommu_table struct allocation from SR-IOV code to
the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
This change is here because those lines had to be changed anyway.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:57 +10:00
Alexey Kardashevskiy 22af48596e vfio: powerpc/spapr: Rework groups attaching
This is to make extended ownership and multiple groups support patches
simpler for review.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:56 +10:00
Alexey Kardashevskiy 649354b75d vfio: powerpc/spapr: Moving pinning/unpinning to helpers
This is a pretty mechanical patch to make next patches simpler.

New tce_iommu_unuse_page() helper does put_page() now but it might skip
that after the memory registering patch applied.

As we are here, this removes unnecessary checks for a value returned
by pfn_to_page() as it cannot possibly return NULL.

This moves tce_iommu_disable() later to let tce_iommu_clear() know if
the container has been enabled because if it has not been, then
put_page() must not be called on TCEs from the TCE table. This situation
is not yet possible but it will after KVM acceleration patchset is
applied.

This changes code to work with physical addresses rather than linear
mapping addresses for better code readability. Following patches will
add an xchg() callback for an IOMMU table which will accept/return
physical addresses (unlike current tce_build()) which will eliminate
redundant conversions.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:56 +10:00
Alexey Kardashevskiy 3c56e822f8 vfio: powerpc/spapr: Disable DMA mappings on disabled container
At the moment DMA map/unmap requests are handled irrespective to
the container's state. This allows the user space to pin memory which
it might not be allowed to pin.

This adds checks to MAP/UNMAP that the container is enabled, otherwise
-EPERM is returned.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:56 +10:00
Alexey Kardashevskiy 2d270df8f7 vfio: powerpc/spapr: Move locked_vm accounting to helpers
There moves locked pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).

This reworks debug messages to show the current value and the limit.

This stores the locked pages number in the container so when unlocking
the iommu table pointer won't be needed. This does not have an effect
now but it will with the multiple tables per container as then we will
allow attaching/detaching groups on fly and we may end up having
a container with no group attached but with the counter incremented.

While we are here, update the comment explaining why RLIMIT_MEMLOCK
might be required to be bigger than the guest RAM. This also prints
pid of the current process in pr_warn/pr_debug.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:55 +10:00
Alexey Kardashevskiy 00663d4ee0 vfio: powerpc/spapr: Use it_page_size
This makes use of the it_page_size from the iommu_table struct
as page size can differ.

This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code
as recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-06-11 15:14:55 +10:00