linux-stable/drivers/iommu/iommufd/io_pagetable.h

242 lines
8.2 KiB
C
Raw Normal View History

iommufd: PFN handling for iopt_pages The top of the data structure provides an IO Address Space (IOAS) that is similar to a VFIO container. The IOAS allows map/unmap of memory into ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables) and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to access the PFNs that those IOVA areas cover. The IO Address Space (IOAS) datastructure is composed of: - struct io_pagetable holding the IOVA map - struct iopt_areas representing populated portions of IOVA - struct iopt_pages representing the storage of PFNs - struct iommu_domain representing each IO page table in the system IOMMU - struct iopt_pages_access representing in-kernel accesses of PFNs (ie VFIO mdevs) - struct xarray pinned_pfns holding a list of pages pinned by in-kernel accesses This patch introduces the lowest part of the datastructure - the movement of PFNs in a tiered storage scheme: 1) iopt_pages::pinned_pfns xarray 2) Multiple iommu_domains 3) The origin of the PFNs, i.e. the userspace pointer PFN have to be copied between all combinations of tiers, depending on the configuration. The interface is an iterator called a 'pfn_reader' which determines which tier each PFN is stored and loads it into a list of PFNs held in a struct pfn_batch. Each step of the iterator will fill up the pfn_batch, then the caller can use the pfn_batch to send the PFNs to the required destination. Repeating this loop will read all the PFNs in an IOVA range. The pfn_reader and pfn_batch also keep track of the pinned page accounting. While PFNs are always stored and accessed as full PAGE_SIZE units the iommu_domain tier can store with a sub-page offset/length to support IOMMUs with a smaller IOPTE size than PAGE_SIZE. Link: https://lore.kernel.org/r/8-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:31 +00:00
/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
*
*/
#ifndef __IO_PAGETABLE_H
#define __IO_PAGETABLE_H
#include <linux/interval_tree.h>
#include <linux/mutex.h>
#include <linux/kref.h>
#include <linux/xarray.h>
#include "iommufd_private.h"
struct iommu_domain;
/*
* Each io_pagetable is composed of intervals of areas which cover regions of
* the iova that are backed by something. iova not covered by areas is not
* populated in the page table. Each area is fully populated with pages.
*
* iovas are in byte units, but must be iopt->iova_alignment aligned.
*
* pages can be NULL, this means some other thread is still working on setting
* up or tearing down the area. When observed under the write side of the
* domain_rwsem a NULL pages must mean the area is still being setup and no
* domains are filled.
*
* storage_domain points at an arbitrary iommu_domain that is holding the PFNs
* for this area. It is locked by the pages->mutex. This simplifies the locking
* as the pages code can rely on the storage_domain without having to get the
* iopt->domains_rwsem.
*
* The io_pagetable::iova_rwsem protects node
* The iopt_pages::mutex protects pages_node
* iopt and iommu_prot are immutable
iommufd: PFN handling for iopt_pages The top of the data structure provides an IO Address Space (IOAS) that is similar to a VFIO container. The IOAS allows map/unmap of memory into ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables) and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to access the PFNs that those IOVA areas cover. The IO Address Space (IOAS) datastructure is composed of: - struct io_pagetable holding the IOVA map - struct iopt_areas representing populated portions of IOVA - struct iopt_pages representing the storage of PFNs - struct iommu_domain representing each IO page table in the system IOMMU - struct iopt_pages_access representing in-kernel accesses of PFNs (ie VFIO mdevs) - struct xarray pinned_pfns holding a list of pages pinned by in-kernel accesses This patch introduces the lowest part of the datastructure - the movement of PFNs in a tiered storage scheme: 1) iopt_pages::pinned_pfns xarray 2) Multiple iommu_domains 3) The origin of the PFNs, i.e. the userspace pointer PFN have to be copied between all combinations of tiers, depending on the configuration. The interface is an iterator called a 'pfn_reader' which determines which tier each PFN is stored and loads it into a list of PFNs held in a struct pfn_batch. Each step of the iterator will fill up the pfn_batch, then the caller can use the pfn_batch to send the PFNs to the required destination. Repeating this loop will read all the PFNs in an IOVA range. The pfn_reader and pfn_batch also keep track of the pinned page accounting. While PFNs are always stored and accessed as full PAGE_SIZE units the iommu_domain tier can store with a sub-page offset/length to support IOMMUs with a smaller IOPTE size than PAGE_SIZE. Link: https://lore.kernel.org/r/8-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:31 +00:00
* The pages::mutex protects num_accesses
*/
struct iopt_area {
struct interval_tree_node node;
struct interval_tree_node pages_node;
struct io_pagetable *iopt;
struct iopt_pages *pages;
struct iommu_domain *storage_domain;
/* How many bytes into the first page the area starts */
unsigned int page_offset;
/* IOMMU_READ, IOMMU_WRITE, etc */
int iommu_prot;
iommufd: Data structure to provide IOVA to PFN mapping This is the remainder of the IOAS data structure. Provide an object called an io_pagetable that is composed of iopt_areas pointing at iopt_pages, along with a list of iommu_domains that mirror the IOVA to PFN map. At the top this is a simple interval tree of iopt_areas indicating the map of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based on the attached domains there is a minimum alignment for areas (which may be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't be mapped and an IOVA of allowed IOVA that can always be mappable. The concept of an 'access' refers to something like a VFIO mdev that is accessing the IOVA and using a 'struct page *' for CPU based access. Externally an API is provided that matches the requirements of the IOCTL interface for map/unmap and domain attachment. The API provides a 'copy' primitive to establish a new IOVA map in a different IOAS from an existing mapping by re-using the iopt_pages. This is the basic mechanism to provide single pinning. This is designed to support a pre-registration flow where userspace would setup an dummy IOAS with no domains, map in memory and then establish an access to pin all PFNs into the xarray. Copy can then be used to create new IOVA mappings in a different IOAS, with iommu_domains attached. Upon copy the PFNs will be read out of the xarray and mapped into the iommu_domains, avoiding any pin_user_pages() overheads. Link: https://lore.kernel.org/r/10-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:33 +00:00
bool prevent_access : 1;
iommufd: PFN handling for iopt_pages The top of the data structure provides an IO Address Space (IOAS) that is similar to a VFIO container. The IOAS allows map/unmap of memory into ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables) and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to access the PFNs that those IOVA areas cover. The IO Address Space (IOAS) datastructure is composed of: - struct io_pagetable holding the IOVA map - struct iopt_areas representing populated portions of IOVA - struct iopt_pages representing the storage of PFNs - struct iommu_domain representing each IO page table in the system IOMMU - struct iopt_pages_access representing in-kernel accesses of PFNs (ie VFIO mdevs) - struct xarray pinned_pfns holding a list of pages pinned by in-kernel accesses This patch introduces the lowest part of the datastructure - the movement of PFNs in a tiered storage scheme: 1) iopt_pages::pinned_pfns xarray 2) Multiple iommu_domains 3) The origin of the PFNs, i.e. the userspace pointer PFN have to be copied between all combinations of tiers, depending on the configuration. The interface is an iterator called a 'pfn_reader' which determines which tier each PFN is stored and loads it into a list of PFNs held in a struct pfn_batch. Each step of the iterator will fill up the pfn_batch, then the caller can use the pfn_batch to send the PFNs to the required destination. Repeating this loop will read all the PFNs in an IOVA range. The pfn_reader and pfn_batch also keep track of the pinned page accounting. While PFNs are always stored and accessed as full PAGE_SIZE units the iommu_domain tier can store with a sub-page offset/length to support IOMMUs with a smaller IOPTE size than PAGE_SIZE. Link: https://lore.kernel.org/r/8-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:31 +00:00
unsigned int num_accesses;
};
iommufd: Data structure to provide IOVA to PFN mapping This is the remainder of the IOAS data structure. Provide an object called an io_pagetable that is composed of iopt_areas pointing at iopt_pages, along with a list of iommu_domains that mirror the IOVA to PFN map. At the top this is a simple interval tree of iopt_areas indicating the map of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based on the attached domains there is a minimum alignment for areas (which may be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't be mapped and an IOVA of allowed IOVA that can always be mappable. The concept of an 'access' refers to something like a VFIO mdev that is accessing the IOVA and using a 'struct page *' for CPU based access. Externally an API is provided that matches the requirements of the IOCTL interface for map/unmap and domain attachment. The API provides a 'copy' primitive to establish a new IOVA map in a different IOAS from an existing mapping by re-using the iopt_pages. This is the basic mechanism to provide single pinning. This is designed to support a pre-registration flow where userspace would setup an dummy IOAS with no domains, map in memory and then establish an access to pin all PFNs into the xarray. Copy can then be used to create new IOVA mappings in a different IOAS, with iommu_domains attached. Upon copy the PFNs will be read out of the xarray and mapped into the iommu_domains, avoiding any pin_user_pages() overheads. Link: https://lore.kernel.org/r/10-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:33 +00:00
struct iopt_allowed {
struct interval_tree_node node;
};
struct iopt_reserved {
struct interval_tree_node node;
void *owner;
};
iommufd: Algorithms for PFN storage The iopt_pages which represents a logical linear list of full PFNs held in different storage tiers. Each area points to a slice of exactly one iopt_pages, and each iopt_pages can have multiple areas and accesses. The three storage tiers are managed to meet these objectives: - If no iommu_domain or in-kerenel access exists then minimal memory should be consumed by iomufd - If a page has been pinned then an iopt_pages will not pin it again - If an in-kernel access exists then the xarray must provide the backing storage to avoid allocations on domain removals - Otherwise any iommu_domain will be used for storage In a common configuration with only an iommu_domain the iopt_pages does not allocate significant memory itself. The external interface for pages has several logical operations: iopt_area_fill_domain() will load the PFNs from storage into a single domain. This is used when attaching a new domain to an existing IOAS. iopt_area_fill_domains() will load the PFNs from storage into multiple domains. This is used when creating a new IOVA map in an existing IOAS iopt_pages_add_access() creates an iopt_pages_access that tracks an in-kernel access of PFNs. This is some external driver that might be accessing the IOVA using the CPU, or programming PFNs with the DMA API. ie a VFIO mdev. iopt_pages_rw_access() directly perform a memcpy on the PFNs, without the overhead of iopt_pages_add_access() iopt_pages_fill_xarray() will load PFNs into the xarray and return a 'struct page *' array. It is used by iopt_pages_access's to extract PFNs for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it is known the xarray is already filled. As an iopt_pages can be referred to in slices by many areas and accesses it uses interval trees to keep track of which storage tiers currently hold the PFNs. On a page-by-page basis any request for a PFN will be satisfied from one of the storage tiers and the PFN copied to target domain/array. Unfill actions are similar, on a page by page basis domains are unmapped, xarray entries freed or struct pages fully put back. Significant complexity is required to fully optimize all of these data motions. The implementation calculates the largest consecutive range of same-storage indexes and operates in blocks. The accumulation of PFNs always generates the largest contiguous PFN range possible to optimize and this gathering can cross storage tier boundaries. For cases like 'fill domains' care is taken to avoid duplicated work and PFNs are read once and pushed into all domains. The map/unmap interaction with the iommu_domain always works in contiguous PFN blocks. The implementation does not require or benefit from any split/merge optimization in the iommu_domain driver. This design suggests several possible improvements in the IOMMU API that would greatly help performance, particularly a way for the driver to map and read the pfns lists instead of working with one driver call per page to read, and one driver call per contiguous range to store. Link: https://lore.kernel.org/r/9-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:32 +00:00
int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
struct iommu_domain *domain);
void iopt_area_unmap_domain(struct iopt_area *area,
struct iommu_domain *domain);
iommufd: PFN handling for iopt_pages The top of the data structure provides an IO Address Space (IOAS) that is similar to a VFIO container. The IOAS allows map/unmap of memory into ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables) and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to access the PFNs that those IOVA areas cover. The IO Address Space (IOAS) datastructure is composed of: - struct io_pagetable holding the IOVA map - struct iopt_areas representing populated portions of IOVA - struct iopt_pages representing the storage of PFNs - struct iommu_domain representing each IO page table in the system IOMMU - struct iopt_pages_access representing in-kernel accesses of PFNs (ie VFIO mdevs) - struct xarray pinned_pfns holding a list of pages pinned by in-kernel accesses This patch introduces the lowest part of the datastructure - the movement of PFNs in a tiered storage scheme: 1) iopt_pages::pinned_pfns xarray 2) Multiple iommu_domains 3) The origin of the PFNs, i.e. the userspace pointer PFN have to be copied between all combinations of tiers, depending on the configuration. The interface is an iterator called a 'pfn_reader' which determines which tier each PFN is stored and loads it into a list of PFNs held in a struct pfn_batch. Each step of the iterator will fill up the pfn_batch, then the caller can use the pfn_batch to send the PFNs to the required destination. Repeating this loop will read all the PFNs in an IOVA range. The pfn_reader and pfn_batch also keep track of the pinned page accounting. While PFNs are always stored and accessed as full PAGE_SIZE units the iommu_domain tier can store with a sub-page offset/length to support IOMMUs with a smaller IOPTE size than PAGE_SIZE. Link: https://lore.kernel.org/r/8-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:31 +00:00
static inline unsigned long iopt_area_index(struct iopt_area *area)
{
return area->pages_node.start;
}
static inline unsigned long iopt_area_last_index(struct iopt_area *area)
{
return area->pages_node.last;
}
static inline unsigned long iopt_area_iova(struct iopt_area *area)
{
return area->node.start;
}
static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
{
return area->node.last;
}
iommufd: Algorithms for PFN storage The iopt_pages which represents a logical linear list of full PFNs held in different storage tiers. Each area points to a slice of exactly one iopt_pages, and each iopt_pages can have multiple areas and accesses. The three storage tiers are managed to meet these objectives: - If no iommu_domain or in-kerenel access exists then minimal memory should be consumed by iomufd - If a page has been pinned then an iopt_pages will not pin it again - If an in-kernel access exists then the xarray must provide the backing storage to avoid allocations on domain removals - Otherwise any iommu_domain will be used for storage In a common configuration with only an iommu_domain the iopt_pages does not allocate significant memory itself. The external interface for pages has several logical operations: iopt_area_fill_domain() will load the PFNs from storage into a single domain. This is used when attaching a new domain to an existing IOAS. iopt_area_fill_domains() will load the PFNs from storage into multiple domains. This is used when creating a new IOVA map in an existing IOAS iopt_pages_add_access() creates an iopt_pages_access that tracks an in-kernel access of PFNs. This is some external driver that might be accessing the IOVA using the CPU, or programming PFNs with the DMA API. ie a VFIO mdev. iopt_pages_rw_access() directly perform a memcpy on the PFNs, without the overhead of iopt_pages_add_access() iopt_pages_fill_xarray() will load PFNs into the xarray and return a 'struct page *' array. It is used by iopt_pages_access's to extract PFNs for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it is known the xarray is already filled. As an iopt_pages can be referred to in slices by many areas and accesses it uses interval trees to keep track of which storage tiers currently hold the PFNs. On a page-by-page basis any request for a PFN will be satisfied from one of the storage tiers and the PFN copied to target domain/array. Unfill actions are similar, on a page by page basis domains are unmapped, xarray entries freed or struct pages fully put back. Significant complexity is required to fully optimize all of these data motions. The implementation calculates the largest consecutive range of same-storage indexes and operates in blocks. The accumulation of PFNs always generates the largest contiguous PFN range possible to optimize and this gathering can cross storage tier boundaries. For cases like 'fill domains' care is taken to avoid duplicated work and PFNs are read once and pushed into all domains. The map/unmap interaction with the iommu_domain always works in contiguous PFN blocks. The implementation does not require or benefit from any split/merge optimization in the iommu_domain driver. This design suggests several possible improvements in the IOMMU API that would greatly help performance, particularly a way for the driver to map and read the pfns lists instead of working with one driver call per page to read, and one driver call per contiguous range to store. Link: https://lore.kernel.org/r/9-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:32 +00:00
static inline size_t iopt_area_length(struct iopt_area *area)
{
return (area->node.last - area->node.start) + 1;
}
iommufd: Data structure to provide IOVA to PFN mapping This is the remainder of the IOAS data structure. Provide an object called an io_pagetable that is composed of iopt_areas pointing at iopt_pages, along with a list of iommu_domains that mirror the IOVA to PFN map. At the top this is a simple interval tree of iopt_areas indicating the map of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based on the attached domains there is a minimum alignment for areas (which may be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't be mapped and an IOVA of allowed IOVA that can always be mappable. The concept of an 'access' refers to something like a VFIO mdev that is accessing the IOVA and using a 'struct page *' for CPU based access. Externally an API is provided that matches the requirements of the IOCTL interface for map/unmap and domain attachment. The API provides a 'copy' primitive to establish a new IOVA map in a different IOAS from an existing mapping by re-using the iopt_pages. This is the basic mechanism to provide single pinning. This is designed to support a pre-registration flow where userspace would setup an dummy IOAS with no domains, map in memory and then establish an access to pin all PFNs into the xarray. Copy can then be used to create new IOVA mappings in a different IOAS, with iommu_domains attached. Upon copy the PFNs will be read out of the xarray and mapped into the iommu_domains, avoiding any pin_user_pages() overheads. Link: https://lore.kernel.org/r/10-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:33 +00:00
/*
* Number of bytes from the start of the iopt_pages that the iova begins.
* iopt_area_start_byte() / PAGE_SIZE encodes the starting page index
* iopt_area_start_byte() % PAGE_SIZE encodes the offset within that page
*/
static inline unsigned long iopt_area_start_byte(struct iopt_area *area,
unsigned long iova)
{
if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
WARN_ON(iova < iopt_area_iova(area) ||
iova > iopt_area_last_iova(area));
iommufd: Data structure to provide IOVA to PFN mapping This is the remainder of the IOAS data structure. Provide an object called an io_pagetable that is composed of iopt_areas pointing at iopt_pages, along with a list of iommu_domains that mirror the IOVA to PFN map. At the top this is a simple interval tree of iopt_areas indicating the map of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based on the attached domains there is a minimum alignment for areas (which may be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't be mapped and an IOVA of allowed IOVA that can always be mappable. The concept of an 'access' refers to something like a VFIO mdev that is accessing the IOVA and using a 'struct page *' for CPU based access. Externally an API is provided that matches the requirements of the IOCTL interface for map/unmap and domain attachment. The API provides a 'copy' primitive to establish a new IOVA map in a different IOAS from an existing mapping by re-using the iopt_pages. This is the basic mechanism to provide single pinning. This is designed to support a pre-registration flow where userspace would setup an dummy IOAS with no domains, map in memory and then establish an access to pin all PFNs into the xarray. Copy can then be used to create new IOVA mappings in a different IOAS, with iommu_domains attached. Upon copy the PFNs will be read out of the xarray and mapped into the iommu_domains, avoiding any pin_user_pages() overheads. Link: https://lore.kernel.org/r/10-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:33 +00:00
return (iova - iopt_area_iova(area)) + area->page_offset +
iopt_area_index(area) * PAGE_SIZE;
}
static inline unsigned long iopt_area_iova_to_index(struct iopt_area *area,
unsigned long iova)
{
return iopt_area_start_byte(area, iova) / PAGE_SIZE;
}
iommufd: Algorithms for PFN storage The iopt_pages which represents a logical linear list of full PFNs held in different storage tiers. Each area points to a slice of exactly one iopt_pages, and each iopt_pages can have multiple areas and accesses. The three storage tiers are managed to meet these objectives: - If no iommu_domain or in-kerenel access exists then minimal memory should be consumed by iomufd - If a page has been pinned then an iopt_pages will not pin it again - If an in-kernel access exists then the xarray must provide the backing storage to avoid allocations on domain removals - Otherwise any iommu_domain will be used for storage In a common configuration with only an iommu_domain the iopt_pages does not allocate significant memory itself. The external interface for pages has several logical operations: iopt_area_fill_domain() will load the PFNs from storage into a single domain. This is used when attaching a new domain to an existing IOAS. iopt_area_fill_domains() will load the PFNs from storage into multiple domains. This is used when creating a new IOVA map in an existing IOAS iopt_pages_add_access() creates an iopt_pages_access that tracks an in-kernel access of PFNs. This is some external driver that might be accessing the IOVA using the CPU, or programming PFNs with the DMA API. ie a VFIO mdev. iopt_pages_rw_access() directly perform a memcpy on the PFNs, without the overhead of iopt_pages_add_access() iopt_pages_fill_xarray() will load PFNs into the xarray and return a 'struct page *' array. It is used by iopt_pages_access's to extract PFNs for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it is known the xarray is already filled. As an iopt_pages can be referred to in slices by many areas and accesses it uses interval trees to keep track of which storage tiers currently hold the PFNs. On a page-by-page basis any request for a PFN will be satisfied from one of the storage tiers and the PFN copied to target domain/array. Unfill actions are similar, on a page by page basis domains are unmapped, xarray entries freed or struct pages fully put back. Significant complexity is required to fully optimize all of these data motions. The implementation calculates the largest consecutive range of same-storage indexes and operates in blocks. The accumulation of PFNs always generates the largest contiguous PFN range possible to optimize and this gathering can cross storage tier boundaries. For cases like 'fill domains' care is taken to avoid duplicated work and PFNs are read once and pushed into all domains. The map/unmap interaction with the iommu_domain always works in contiguous PFN blocks. The implementation does not require or benefit from any split/merge optimization in the iommu_domain driver. This design suggests several possible improvements in the IOMMU API that would greatly help performance, particularly a way for the driver to map and read the pfns lists instead of working with one driver call per page to read, and one driver call per contiguous range to store. Link: https://lore.kernel.org/r/9-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:32 +00:00
#define __make_iopt_iter(name) \
static inline struct iopt_##name *iopt_##name##_iter_first( \
struct io_pagetable *iopt, unsigned long start, \
unsigned long last) \
{ \
struct interval_tree_node *node; \
\
lockdep_assert_held(&iopt->iova_rwsem); \
node = interval_tree_iter_first(&iopt->name##_itree, start, \
last); \
if (!node) \
return NULL; \
return container_of(node, struct iopt_##name, node); \
} \
static inline struct iopt_##name *iopt_##name##_iter_next( \
struct iopt_##name *last_node, unsigned long start, \
unsigned long last) \
{ \
struct interval_tree_node *node; \
\
node = interval_tree_iter_next(&last_node->node, start, last); \
if (!node) \
return NULL; \
return container_of(node, struct iopt_##name, node); \
}
__make_iopt_iter(area)
iommufd: Data structure to provide IOVA to PFN mapping This is the remainder of the IOAS data structure. Provide an object called an io_pagetable that is composed of iopt_areas pointing at iopt_pages, along with a list of iommu_domains that mirror the IOVA to PFN map. At the top this is a simple interval tree of iopt_areas indicating the map of IOVA to iopt_pages. An xarray keeps track of a list of domains. Based on the attached domains there is a minimum alignment for areas (which may be smaller than PAGE_SIZE), an interval tree of reserved IOVA that can't be mapped and an IOVA of allowed IOVA that can always be mappable. The concept of an 'access' refers to something like a VFIO mdev that is accessing the IOVA and using a 'struct page *' for CPU based access. Externally an API is provided that matches the requirements of the IOCTL interface for map/unmap and domain attachment. The API provides a 'copy' primitive to establish a new IOVA map in a different IOAS from an existing mapping by re-using the iopt_pages. This is the basic mechanism to provide single pinning. This is designed to support a pre-registration flow where userspace would setup an dummy IOAS with no domains, map in memory and then establish an access to pin all PFNs into the xarray. Copy can then be used to create new IOVA mappings in a different IOAS, with iommu_domains attached. Upon copy the PFNs will be read out of the xarray and mapped into the iommu_domains, avoiding any pin_user_pages() overheads. Link: https://lore.kernel.org/r/10-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:33 +00:00
__make_iopt_iter(allowed)
__make_iopt_iter(reserved)
struct iopt_area_contig_iter {
unsigned long cur_iova;
unsigned long last_iova;
struct iopt_area *area;
};
struct iopt_area *iopt_area_contig_init(struct iopt_area_contig_iter *iter,
struct io_pagetable *iopt,
unsigned long iova,
unsigned long last_iova);
struct iopt_area *iopt_area_contig_next(struct iopt_area_contig_iter *iter);
static inline bool iopt_area_contig_done(struct iopt_area_contig_iter *iter)
{
return iter->area && iter->last_iova <= iopt_area_last_iova(iter->area);
}
/*
* Iterate over a contiguous list of areas that span the iova,last_iova range.
* The caller must check iopt_area_contig_done() after the loop to see if
* contiguous areas existed.
*/
#define iopt_for_each_contig_area(iter, area, iopt, iova, last_iova) \
for (area = iopt_area_contig_init(iter, iopt, iova, last_iova); area; \
area = iopt_area_contig_next(iter))
iommufd: Algorithms for PFN storage The iopt_pages which represents a logical linear list of full PFNs held in different storage tiers. Each area points to a slice of exactly one iopt_pages, and each iopt_pages can have multiple areas and accesses. The three storage tiers are managed to meet these objectives: - If no iommu_domain or in-kerenel access exists then minimal memory should be consumed by iomufd - If a page has been pinned then an iopt_pages will not pin it again - If an in-kernel access exists then the xarray must provide the backing storage to avoid allocations on domain removals - Otherwise any iommu_domain will be used for storage In a common configuration with only an iommu_domain the iopt_pages does not allocate significant memory itself. The external interface for pages has several logical operations: iopt_area_fill_domain() will load the PFNs from storage into a single domain. This is used when attaching a new domain to an existing IOAS. iopt_area_fill_domains() will load the PFNs from storage into multiple domains. This is used when creating a new IOVA map in an existing IOAS iopt_pages_add_access() creates an iopt_pages_access that tracks an in-kernel access of PFNs. This is some external driver that might be accessing the IOVA using the CPU, or programming PFNs with the DMA API. ie a VFIO mdev. iopt_pages_rw_access() directly perform a memcpy on the PFNs, without the overhead of iopt_pages_add_access() iopt_pages_fill_xarray() will load PFNs into the xarray and return a 'struct page *' array. It is used by iopt_pages_access's to extract PFNs for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it is known the xarray is already filled. As an iopt_pages can be referred to in slices by many areas and accesses it uses interval trees to keep track of which storage tiers currently hold the PFNs. On a page-by-page basis any request for a PFN will be satisfied from one of the storage tiers and the PFN copied to target domain/array. Unfill actions are similar, on a page by page basis domains are unmapped, xarray entries freed or struct pages fully put back. Significant complexity is required to fully optimize all of these data motions. The implementation calculates the largest consecutive range of same-storage indexes and operates in blocks. The accumulation of PFNs always generates the largest contiguous PFN range possible to optimize and this gathering can cross storage tier boundaries. For cases like 'fill domains' care is taken to avoid duplicated work and PFNs are read once and pushed into all domains. The map/unmap interaction with the iommu_domain always works in contiguous PFN blocks. The implementation does not require or benefit from any split/merge optimization in the iommu_domain driver. This design suggests several possible improvements in the IOMMU API that would greatly help performance, particularly a way for the driver to map and read the pfns lists instead of working with one driver call per page to read, and one driver call per contiguous range to store. Link: https://lore.kernel.org/r/9-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:32 +00:00
iommufd: PFN handling for iopt_pages The top of the data structure provides an IO Address Space (IOAS) that is similar to a VFIO container. The IOAS allows map/unmap of memory into ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables) and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to access the PFNs that those IOVA areas cover. The IO Address Space (IOAS) datastructure is composed of: - struct io_pagetable holding the IOVA map - struct iopt_areas representing populated portions of IOVA - struct iopt_pages representing the storage of PFNs - struct iommu_domain representing each IO page table in the system IOMMU - struct iopt_pages_access representing in-kernel accesses of PFNs (ie VFIO mdevs) - struct xarray pinned_pfns holding a list of pages pinned by in-kernel accesses This patch introduces the lowest part of the datastructure - the movement of PFNs in a tiered storage scheme: 1) iopt_pages::pinned_pfns xarray 2) Multiple iommu_domains 3) The origin of the PFNs, i.e. the userspace pointer PFN have to be copied between all combinations of tiers, depending on the configuration. The interface is an iterator called a 'pfn_reader' which determines which tier each PFN is stored and loads it into a list of PFNs held in a struct pfn_batch. Each step of the iterator will fill up the pfn_batch, then the caller can use the pfn_batch to send the PFNs to the required destination. Repeating this loop will read all the PFNs in an IOVA range. The pfn_reader and pfn_batch also keep track of the pinned page accounting. While PFNs are always stored and accessed as full PAGE_SIZE units the iommu_domain tier can store with a sub-page offset/length to support IOMMUs with a smaller IOPTE size than PAGE_SIZE. Link: https://lore.kernel.org/r/8-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:31 +00:00
enum {
IOPT_PAGES_ACCOUNT_NONE = 0,
IOPT_PAGES_ACCOUNT_USER = 1,
IOPT_PAGES_ACCOUNT_MM = 2,
};
/*
* This holds a pinned page list for multiple areas of IO address space. The
* pages always originate from a linear chunk of userspace VA. Multiple
* io_pagetable's, through their iopt_area's, can share a single iopt_pages
* which avoids multi-pinning and double accounting of page consumption.
*
* indexes in this structure are measured in PAGE_SIZE units, are 0 based from
* the start of the uptr and extend to npages. pages are pinned dynamically
* according to the intervals in the access_itree and domains_itree, npinned
* records the current number of pages pinned.
*/
struct iopt_pages {
struct kref kref;
struct mutex mutex;
size_t npages;
size_t npinned;
size_t last_npinned;
struct task_struct *source_task;
struct mm_struct *source_mm;
struct user_struct *source_user;
void __user *uptr;
bool writable:1;
u8 account_mode;
struct xarray pinned_pfns;
/* Of iopt_pages_access::node */
struct rb_root_cached access_itree;
/* Of iopt_area::pages_node */
struct rb_root_cached domains_itree;
};
iommufd: Algorithms for PFN storage The iopt_pages which represents a logical linear list of full PFNs held in different storage tiers. Each area points to a slice of exactly one iopt_pages, and each iopt_pages can have multiple areas and accesses. The three storage tiers are managed to meet these objectives: - If no iommu_domain or in-kerenel access exists then minimal memory should be consumed by iomufd - If a page has been pinned then an iopt_pages will not pin it again - If an in-kernel access exists then the xarray must provide the backing storage to avoid allocations on domain removals - Otherwise any iommu_domain will be used for storage In a common configuration with only an iommu_domain the iopt_pages does not allocate significant memory itself. The external interface for pages has several logical operations: iopt_area_fill_domain() will load the PFNs from storage into a single domain. This is used when attaching a new domain to an existing IOAS. iopt_area_fill_domains() will load the PFNs from storage into multiple domains. This is used when creating a new IOVA map in an existing IOAS iopt_pages_add_access() creates an iopt_pages_access that tracks an in-kernel access of PFNs. This is some external driver that might be accessing the IOVA using the CPU, or programming PFNs with the DMA API. ie a VFIO mdev. iopt_pages_rw_access() directly perform a memcpy on the PFNs, without the overhead of iopt_pages_add_access() iopt_pages_fill_xarray() will load PFNs into the xarray and return a 'struct page *' array. It is used by iopt_pages_access's to extract PFNs for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it is known the xarray is already filled. As an iopt_pages can be referred to in slices by many areas and accesses it uses interval trees to keep track of which storage tiers currently hold the PFNs. On a page-by-page basis any request for a PFN will be satisfied from one of the storage tiers and the PFN copied to target domain/array. Unfill actions are similar, on a page by page basis domains are unmapped, xarray entries freed or struct pages fully put back. Significant complexity is required to fully optimize all of these data motions. The implementation calculates the largest consecutive range of same-storage indexes and operates in blocks. The accumulation of PFNs always generates the largest contiguous PFN range possible to optimize and this gathering can cross storage tier boundaries. For cases like 'fill domains' care is taken to avoid duplicated work and PFNs are read once and pushed into all domains. The map/unmap interaction with the iommu_domain always works in contiguous PFN blocks. The implementation does not require or benefit from any split/merge optimization in the iommu_domain driver. This design suggests several possible improvements in the IOMMU API that would greatly help performance, particularly a way for the driver to map and read the pfns lists instead of working with one driver call per page to read, and one driver call per contiguous range to store. Link: https://lore.kernel.org/r/9-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:32 +00:00
struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
bool writable);
void iopt_release_pages(struct kref *kref);
static inline void iopt_put_pages(struct iopt_pages *pages)
{
kref_put(&pages->kref, iopt_release_pages);
}
void iopt_pages_fill_from_xarray(struct iopt_pages *pages, unsigned long start,
unsigned long last, struct page **out_pages);
int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start,
unsigned long last, struct page **out_pages);
void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
unsigned long last);
int iopt_area_add_access(struct iopt_area *area, unsigned long start,
unsigned long last, struct page **out_pages,
unsigned int flags);
void iopt_area_remove_access(struct iopt_area *area, unsigned long start,
unsigned long last);
int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
void *data, unsigned long length, unsigned int flags);
/*
* Each interval represents an active iopt_access_pages(), it acts as an
* interval lock that keeps the PFNs pinned and stored in the xarray.
*/
struct iopt_pages_access {
struct interval_tree_node node;
unsigned int users;
};
iommufd: PFN handling for iopt_pages The top of the data structure provides an IO Address Space (IOAS) that is similar to a VFIO container. The IOAS allows map/unmap of memory into ranges of IOVA called iopt_areas. Multiple IOMMU domains (IO page tables) and in-kernel accesses (like VFIO mdevs) can be attached to the IOAS to access the PFNs that those IOVA areas cover. The IO Address Space (IOAS) datastructure is composed of: - struct io_pagetable holding the IOVA map - struct iopt_areas representing populated portions of IOVA - struct iopt_pages representing the storage of PFNs - struct iommu_domain representing each IO page table in the system IOMMU - struct iopt_pages_access representing in-kernel accesses of PFNs (ie VFIO mdevs) - struct xarray pinned_pfns holding a list of pages pinned by in-kernel accesses This patch introduces the lowest part of the datastructure - the movement of PFNs in a tiered storage scheme: 1) iopt_pages::pinned_pfns xarray 2) Multiple iommu_domains 3) The origin of the PFNs, i.e. the userspace pointer PFN have to be copied between all combinations of tiers, depending on the configuration. The interface is an iterator called a 'pfn_reader' which determines which tier each PFN is stored and loads it into a list of PFNs held in a struct pfn_batch. Each step of the iterator will fill up the pfn_batch, then the caller can use the pfn_batch to send the PFNs to the required destination. Repeating this loop will read all the PFNs in an IOVA range. The pfn_reader and pfn_batch also keep track of the pinned page accounting. While PFNs are always stored and accessed as full PAGE_SIZE units the iommu_domain tier can store with a sub-page offset/length to support IOMMUs with a smaller IOPTE size than PAGE_SIZE. Link: https://lore.kernel.org/r/8-v6-a196d26f289e+11787-iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-11-29 20:29:31 +00:00
#endif