[ Upstream commit 9f7c689115 ]
The BTM support wants to be able to change the ASID of any smmu_domain.
When it goes to do this it holds the arm_smmu_asid_lock and iterates over
the target domain's devices list.
During attach of a S1 domain we must ensure that the devices list and
CD are in sync, otherwise we could miss CD updates or a parallel CD update
could push an out of date CD.
This is pretty complicated, and almost works today because
arm_smmu_detach_dev() removes the master from the linked list before
working on the CD entries, preventing parallel update of the CD.
However, it does have an issue where the CD can remain programed while the
domain appears to be unattached. arm_smmu_share_asid() will then not clear
any CD entriess and install its own CD entry with the same ASID
concurrently. This creates a small race window where the IOMMU can see two
ASIDs pointing to different translations.
CPU0 CPU1
arm_smmu_attach_dev()
arm_smmu_detach_dev()
spin_lock_irqsave(&smmu_domain->devices_lock, flags);
list_del(&master->domain_head);
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
arm_smmu_mmu_notifier_get()
arm_smmu_alloc_shared_cd()
arm_smmu_share_asid():
// Does nothing due to list_del above
arm_smmu_update_ctx_desc_devices()
arm_smmu_tlb_inv_asid()
arm_smmu_write_ctx_desc()
** Now the ASID is in two CDs
with different translation
arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, NULL);
Solve this by wrapping most of the attach flow in the
arm_smmu_asid_lock. This locks more than strictly needed to prepare for
the next patch which will reorganize the order of the linked list, STE and
CD changes.
Move arm_smmu_detach_dev() till after we have initialized the domain so
the lock can be held for less time.
Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v6-96275f25c39d+2d4-smmuv3_newapi_p1_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c404f55c26 ]
The SVA code checks that the PASID is valid for the device when assigning
the PASID to the MM, but the normal PAGING related path does not check it.
Devices that don't support PASID or PASID values too large for the device
should not invoke the driver callback. The drivers should rely on the
core code for this enforcement.
Fixes: 1660370455 ("iommu: Add attach/detach_dev_pasid iommu interfaces")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/0-v1-460705442b30+659-iommu_check_pasid_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit afc5aa46ed ]
The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
an NVME device can map a size between 300KB~512KB, which certainly failed
the swiotlb mappings, though the default pool of swiotlb has many slots:
systemd[1]: Started Journal Service.
=> nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
note: journal-offline[392] exited with irqs disabled
note: journal-offline[392] exited with preempt_count 1
Call trace:
[ 3.099918] swiotlb_tbl_map_single+0x214/0x240
[ 3.099921] iommu_dma_map_page+0x218/0x328
[ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
[ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
[ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
[ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
[ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
[ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
[ 3.103463] __submit_bio+0x44/0x220
[ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
[ 3.103470] submit_bio_noacct+0x180/0x6c8
[ 3.103471] submit_bio+0x34/0x130
[ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
[ 3.104766] mpage_submit_folio+0xa0/0x100
[ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
[ 3.104771] ext4_do_writepages+0x6a0/0xd78
[ 3.105615] ext4_writepages+0x80/0x118
[ 3.105616] do_writepages+0x90/0x1e8
[ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
[ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
[ 3.106656] file_write_and_wait_range+0x84/0x120
[ 3.106658] ext4_sync_file+0x7c/0x4c0
[ 3.106660] vfs_fsync_range+0x3c/0xa8
[ 3.106663] do_fsync+0x44/0xc0
Since untrusted devices might go down the swiotlb pathway with dma-iommu,
these devices should not map a size larger than swiotlb_max_mapping_size.
To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to
take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from
the iommu_dma_opt_mapping_size().
Fixes: 82612d66d5 ("iommu: Allow the dma-iommu api to use bounce buffers")
Link: https://lore.kernel.org/r/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
[will: Drop redundant is_swiotlb_active(dev) check]
Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 70bad345e6 ]
When the kernel is comiled with CONFIG_IRQ_REMAP=y but without
CONFIG_IOMMU_INTEL compilation fails since commit def054b01a with an
undefined reference to device_rbtree_find(). This patch makes sure that
intel specific code is only compiled with CONFIG_IOMMU_INTEL=y.
Signed-off-by: Bert Karwatzki <spasswolf@web.de>
Fixes: 80a9b50c0b ("iommu/vt-d: Improve ITE fault handling if target device isn't present")
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20240307194419.15801-1-spasswolf@web.de
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 81e921fd32 ]
In the kdump kernel, the IOMMU operates in deferred_attach mode. In this
mode, info->domain may not yet be assigned by the time the release_device
function is called. It leads to the following crash in the crash kernel:
BUG: kernel NULL pointer dereference, address: 000000000000003c
...
RIP: 0010:do_raw_spin_lock+0xa/0xa0
...
_raw_spin_lock_irqsave+0x1b/0x30
intel_iommu_release_device+0x96/0x170
iommu_deinit_device+0x39/0xf0
__iommu_group_remove_device+0xa0/0xd0
iommu_bus_notifier+0x55/0xb0
notifier_call_chain+0x5a/0xd0
blocking_notifier_call_chain+0x41/0x60
bus_notify+0x34/0x50
device_del+0x269/0x3d0
pci_remove_bus_device+0x77/0x100
p2sb_bar+0xae/0x1d0
...
i801_probe+0x423/0x740
Use the release_domain mechanism to fix it. The scalable mode context
entry which is not part of release domain should be cleared in
release_device().
Fixes: 586081d3f6 ("iommu/vt-d: Remove DEFER_DEVICE_DOMAIN_INFO")
Reported-by: Eric Badger <ebadger@purestorage.com>
Closes: https://lore.kernel.org/r/20240113181713.1817855-1-ebadger@purestorage.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240305013305.204605-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0061ffe289 ]
The current device_release callback for individual iommu drivers does the
following:
1) Silent IOMMU DMA translation: It detaches any existing domain from the
device and puts it into a blocking state (some drivers might use the
identity state).
2) Resource release: It releases resources allocated during the
device_probe callback and restores the device to its pre-probe state.
Step 1 is challenging for individual iommu drivers because each must check
if a domain is already attached to the device. Additionally, if a deferred
attach never occurred, the device_release should avoid modifying hardware
configuration regardless of the reason for its call.
To simplify this process, introduce a static release_domain within the
iommu_ops structure. It can be either a blocking or identity domain
depending on the iommu hardware. The iommu core will decide whether to
attach this domain before the device_release callback, eliminating the
need for repetitive code in various drivers.
Consequently, the device_release callback can focus solely on the opposite
operations of device_probe, including releasing all resources allocated
during that callback.
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240305013305.204605-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Stable-dep-of: 81e921fd32 ("iommu/vt-d: Fix NULL domain on device release")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit def054b01a ]
The existing I/O page fault handler currently locates the PCI device by
calling pci_get_domain_bus_and_slot(). This function searches the list
of all PCI devices until the desired device is found. To improve lookup
efficiency, replace it with device_rbtree_find() to search the device
within the probed device rbtree.
The I/O page fault is initiated by the device, which does not have any
synchronization mechanism with the software to ensure that the device
stays in the probed device tree. Theoretically, a device could be released
by the IOMMU subsystem after device_rbtree_find() and before
iopf_get_dev_fault_param(), which would cause a use-after-free problem.
Add a mutex to synchronize the I/O page fault reporting path and the IOMMU
release device path. This lock doesn't introduce any performance overhead,
as the conflict between I/O page fault reporting and device releasing is
very rare.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240220065939.121116-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Stable-dep-of: 81e921fd32 ("iommu/vt-d: Fix NULL domain on device release")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 80a9b50c0b ]
Because surprise removal could happen anytime, e.g. user could request safe
removal to EP(endpoint device) via sysfs and brings its link down to do
surprise removal cocurrently. such aggressive cases would cause ATS
invalidation request issued to non-existence target device, then deadly
loop to retry that request after ITE fault triggered in interrupt context.
this patch aims to optimize the ITE handling by checking the target device
presence state to avoid retrying the timeout request blindly, thus avoid
hard lockup or system hang.
Devices TLB should only be invalidated when devices are in the
iommu->device_rbtree (probed, not released) and present.
Fixes: 6ba6c3a4ca ("VT-d: add device IOTLB invalidation support")
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
Link: https://lore.kernel.org/r/20240301080727.3529832-4-haifeng.zhao@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1a75cc710b ]
Use a red-black tree(rbtree) to track devices probed by the driver's
probe_device callback. These devices need to be looked up quickly by
a source ID when the hardware reports a fault, either recoverable or
unrecoverable.
Fault reporting paths are critical. Searching a list in this scenario
is inefficient, with an algorithm complexity of O(n). An rbtree is a
self-balancing binary search tree, offering an average search time
complexity of O(log(n)). This significant performance improvement
makes rbtrees a better choice.
Furthermore, rbtrees are implemented on a per-iommu basis, eliminating
the need for global searches and further enhancing efficiency in
critical fault paths. The rbtree is protected by a spin lock with
interrupts disabled to ensure thread-safe access even within interrupt
contexts.
Co-developed-by: Huang Jiaqing <jiaqing.huang@intel.com>
Signed-off-by: Huang Jiaqing <jiaqing.huang@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240220065939.121116-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Stable-dep-of: 80a9b50c0b ("iommu/vt-d: Improve ITE fault handling if target device isn't present")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0feda94c86 ]
On many systems that have an AMD IOMMU the following sequence of
warnings is observed during bootup.
```
pci 0000:00:00.2 can't derive routing for PCI INT A
pci 0000:00:00.2: PCI INT A: not connected
```
This series of events happens because of the IOMMU initialization
sequence order and the lack of _PRT entries for the IOMMU.
During initialization the IOMMU driver first enables the PCI device
using pci_enable_device(). This will call acpi_pci_irq_enable()
which will check if the interrupt is declared in a PCI routing table
(_PRT) entry. According to the PCI spec [1] these routing entries
are only required under PCI root bridges:
The _PRT object is required under all PCI root bridges
The IOMMU is directly connected to the root complex, so there is no
parent bridge to look for a _PRT entry. The first warning is emitted
since no entry could be found in the hierarchy. The second warning is
then emitted because the interrupt hasn't yet been configured to any
value. The pin was configured in pci_read_irq() but the byte in
PCI_INTERRUPT_LINE return 0xff which means "Unknown".
After that sequence of events pci_enable_msi() is called and this
will allocate an interrupt.
That is both of these warnings are totally harmless because the IOMMU
uses MSI for interrupts. To avoid even trying to probe for a _PRT
entry mark the IOMMU as IRQ managed. This avoids both warnings.
Link: https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/06_Device_Configuration/Device_Configuration.html?highlight=_prt#prt-pci-routing-table [1]
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Fixes: cffe0a2b5a ("x86, irq: Keep balance of IOAPIC pin reference count")
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20240122233400.1802-1-mario.limonciello@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Four syzkaller found bugs:
- Corruption during error unwind in iommufd_access_change_ioas()
- Overlapping IDs in the test suite due to out of order destruction
- Missing locking for access->ioas in the test suite
- False failures in the test suite validation logic with huge pages
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZeH0gQAKCRCFwuHvBreF
YcC5AP91VuNxoExC6Zx9c8i1U2SR5IbvVAHT572qgdhaKoVzugEAtfKJA2ELheL8
s1F+TDOo8fR/aawIAPzK5GCuHc5oIwg=
=S44j
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd fixes from Jason Gunthorpe:
"Four syzkaller found bugs:
- Corruption during error unwind in iommufd_access_change_ioas()
- Overlapping IDs in the test suite due to out of order destruction
- Missing locking for access->ioas in the test suite
- False failures in the test suite validation logic with huge pages"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommufd/selftest: Don't check map/unmap pairing with HUGE_PAGES
iommufd: Fix protection fault in iommufd_test_syz_conv_iova
iommufd/selftest: Fix mock_dev_num bug
iommufd: Fix iopt_access_list_id overwrite bug
iommu_sva_bind_device will directly goto out in multi-device
case when found existing domain, ignoring list_add handle,
which causes the handle to fail to be shared.
Fixes: 65d4418c50 ("iommu/sva: Restore SVA handle sharing")
Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240227064821.128-1-zhangfei.gao@linaro.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Since MOCK_HUGE_PAGE_SIZE was introduced it allows the core code to invoke
mock with large page sizes. This confuses the validation logic that checks
that map/unmap are paired.
This is because the page size computed for map is based on the physical
address and in many cases will always be the base page size, however the
entire range generated by iommufd will be passed to map.
Randomly iommufd can see small groups of physically contiguous pages,
(say 8k unaligned and grouped together), but that group crosses a huge
page boundary. The map side will observe this as a contiguous run and mark
it accordingly, but there is a chance the unmap side will end up
terminating interior huge pages in the middle of that group and trigger a
validation failure. Meaning the validation only works if the core code
passes the iova/length directly from iommufd to mock.
syzkaller randomly hits this with failures like:
WARNING: CPU: 0 PID: 11568 at drivers/iommu/iommufd/selftest.c:461 mock_domain_unmap_pages+0x1c0/0x250
Modules linked in:
CPU: 0 PID: 11568 Comm: syz-executor.0 Not tainted 6.8.0-rc3+ #4
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:mock_domain_unmap_pages+0x1c0/0x250
Code: 2b e8 94 37 0f ff 48 d1 eb 31 ff 48 b8 00 00 00 00 00 00 20 00 48 21 c3 48 89 de e8 aa 32 0f ff 48 85 db 75 07 e8 70 37 0f ff <0f> 0b e8 69 37 0f ff 31 f6 31 ff e8 90 32 0f ff e8 5b 37 0f ff 4c
RSP: 0018:ffff88800e707490 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff822dfae6
RDX: ffff88800cf86400 RSI: ffffffff822dfaf0 RDI: 0000000000000007
RBP: ffff88800e7074d8 R08: 0000000000000000 R09: ffffed1001167c90
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000001500000
R13: 0000000000083000 R14: 0000000000000001 R15: 0000000000000800
FS: 0000555556048480(0000) GS:ffff88806d400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2dc23000 CR3: 0000000008cbb000 CR4: 0000000000350eb0
Call Trace:
<TASK>
__iommu_unmap+0x281/0x520
iommu_unmap+0xc9/0x180
iopt_area_unmap_domain_range+0x1b1/0x290
iopt_area_unpin_domain+0x590/0x800
__iopt_area_unfill_domain+0x22e/0x650
iopt_area_unfill_domain+0x47/0x60
iopt_unfill_domain+0x187/0x590
iopt_table_remove_domain+0x267/0x2d0
iommufd_hwpt_paging_destroy+0x1f1/0x370
iommufd_object_remove+0x2a3/0x490
iommufd_device_detach+0x23a/0x2c0
iommufd_selftest_destroy+0x7a/0xf0
iommufd_fops_release+0x1d3/0x340
__fput+0x272/0xb50
__fput_sync+0x4b/0x60
__x64_sys_close+0x8b/0x110
do_syscall_64+0x71/0x140
entry_SYSCALL_64_after_hwframe+0x46/0x4e
Do the simple thing and just disable the validation when the huge page
tests are being run.
Fixes: 7db521e23f ("iommufd/selftest: Hugepage mock domain support")
Link: https://lore.kernel.org/r/0-v1-1e17e60a5c8a+103fb-iommufd_mock_hugepg_jgg@nvidia.com
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Syzkaller reported the following bug:
general protection fault, probably for non-canonical address 0xdffffc0000000038: 0000 [#1] SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000001c0-0x00000000000001c7]
Call Trace:
lock_acquire
lock_acquire+0x1ce/0x4f0
down_read+0x93/0x4a0
iommufd_test_syz_conv_iova+0x56/0x1f0
iommufd_test_access_rw.isra.0+0x2ec/0x390
iommufd_test+0x1058/0x1e30
iommufd_fops_ioctl+0x381/0x510
vfs_ioctl
__do_sys_ioctl
__se_sys_ioctl
__x64_sys_ioctl+0x170/0x1e0
do_syscall_x64
do_syscall_64+0x71/0x140
This is because the new iommufd_access_change_ioas() sets access->ioas to
NULL during its process, so the lock might be gone in a concurrent racing
context.
Fix this by doing the same access->ioas sanity as iommufd_access_rw() and
iommufd_access_pin_pages() functions do.
Cc: stable@vger.kernel.org
Fixes: 9227da7816 ("iommufd: Add iommufd_access_change_ioas(_id) helpers")
Link: https://lore.kernel.org/r/3f1932acaf1dd494d404c04364d73ce8f57f3e5e.1708636627.git.nicolinc@nvidia.com
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Syzkaller reported the following bug:
sysfs: cannot create duplicate filename '/devices/iommufd_mock4'
Call Trace:
sysfs_warn_dup+0x71/0x90
sysfs_create_dir_ns+0x1ee/0x260
? sysfs_create_mount_point+0x80/0x80
? spin_bug+0x1d0/0x1d0
? do_raw_spin_unlock+0x54/0x220
kobject_add_internal+0x221/0x970
kobject_add+0x11c/0x1e0
? lockdep_hardirqs_on_prepare+0x273/0x3e0
? kset_create_and_add+0x160/0x160
? kobject_put+0x5d/0x390
? bus_get_dev_root+0x4a/0x60
? kobject_put+0x5d/0x390
device_add+0x1d5/0x1550
? __fw_devlink_link_to_consumers.isra.0+0x1f0/0x1f0
? __init_waitqueue_head+0xcb/0x150
iommufd_test+0x462/0x3b60
? lock_release+0x1fe/0x640
? __might_fault+0x117/0x170
? reacquire_held_locks+0x4b0/0x4b0
? iommufd_selftest_destroy+0xd0/0xd0
? __might_fault+0xbe/0x170
iommufd_fops_ioctl+0x256/0x350
? iommufd_option+0x180/0x180
? __lock_acquire+0x1755/0x45f0
__x64_sys_ioctl+0xa13/0x1640
The bug is triggered when Syzkaller created multiple mock devices but
didn't destroy them in the same sequence, messing up the mock_dev_num
counter. Replace the atomic with an mock_dev_ida.
Cc: stable@vger.kernel.org
Fixes: 23a1b46f15 ("iommufd/selftest: Make the mock iommu driver into a real driver")
Link: https://lore.kernel.org/r/5af41d5af6d5c013cc51de01427abb8141b3587e.1708636627.git.nicolinc@nvidia.com
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Syzkaller reported the following WARN_ON:
WARNING: CPU: 1 PID: 4738 at drivers/iommu/iommufd/io_pagetable.c:1360
Call Trace:
iommufd_access_change_ioas+0x2fe/0x4e0
iommufd_access_destroy_object+0x50/0xb0
iommufd_object_remove+0x2a3/0x490
iommufd_object_destroy_user
iommufd_access_destroy+0x71/0xb0
iommufd_test_staccess_release+0x89/0xd0
__fput+0x272/0xb50
__fput_sync+0x4b/0x60
__do_sys_close
__se_sys_close
__x64_sys_close+0x8b/0x110
do_syscall_x64
The mismatch between the access pointer in the list and the passed-in
pointer is resulting from an overwrite of access->iopt_access_list_id, in
iopt_add_access(). Called from iommufd_access_change_ioas() when
xa_alloc() succeeds but iopt_calculate_iova_alignment() fails.
Add a new_id in iopt_add_access() and only update iopt_access_list_id when
returning successfully.
Cc: stable@vger.kernel.org
Fixes: 9227da7816 ("iommufd: Add iommufd_access_change_ioas(_id) helpers")
Link: https://lore.kernel.org/r/2dda7acb25b8562ec5f1310de828ef5da9ef509c.1708636627.git.nicolinc@nvidia.com
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Including:
- Intel VT-d fixes for nested domain handling:
- Cache invalidation for changes in a parent domain
- Dirty tracking setting for parent and nested domains
- Fix a constant-out-of-range warning
- ARM SMMU fixes:
- Fix CD allocation from atomic context when using SVA with SMMUv3
- Revert the conversion of SMMUv2 to domain_alloc_paging(), as it
breaks the boot for Qualcomm MSM8996 devices
- Restore SVA handle sharing in core code as it turned out there are
still drivers relying on it
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmXaZ/QACgkQK/BELZcB
GuPTaQ/9GJTAmfIktgPWqAgUOUa6fWpEsmChwVgpMDxw4fyQsJrKZ/YQfX8Vy17S
TV9EyCFGALRdcE+VX41KtrvG5MWY5cHa4BlJ6/KaD3tilhViTsACN7JkGiWnlEIs
4t6YefoaMvgdaJ4nqysFlHtlaVKHObTi6toyyCCHIywOcei2YqX3mqTIiPzUyflj
dqx0HwBG5uz6q00JNbHVQHeMc8rIEvT61oMssQUMNt8KPvdNJl9OrZSRvXimHABU
Vh0nMKYLAqHp40IvoXScA9Aj/DWTwE2346/Xpd6hnZ/yJvBlm6YQWKpJtjJz3z2Y
ZnK+cmFPAaC0EE7dlEpN7hcwtEqumw1K+CJ4s8rfnNY5IdcY0DIRAVxCLsh2YH15
rcupp3iNJMR3JeVUYHKe+mEHcYSyC9SABw01aq9NdEu1LRfXjLRrYFVh3yovpshV
abXssThTFWQTfTvUs2Vt9zjVIinST3ogki+mxyfgqTuKYj9GY8P3eqYRFrdxGJnI
mtwu0ByRwVpdNUBUWsOk+3sSdvPBsgd/Fchr8Gmpl5W2cAjPOetYn0th3hksG7tn
9qyUDwTHDaQsXEONFwS44eP9KlccUUvvLhFHCBghDm5A6i/aDdnz6d1CdfPOdb4y
DwOQem4AiOmeTBGEZA6Z/ZSKsqZGmpZditBveki0ImauxrUZXMU=
=aVOB
-----END PGP SIGNATURE-----
Merge tag 'iommu-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Intel VT-d fixes for nested domain handling:
- Cache invalidation for changes in a parent domain
- Dirty tracking setting for parent and nested domains
- Fix a constant-out-of-range warning
- ARM SMMU fixes:
- Fix CD allocation from atomic context when using SVA with SMMUv3
- Revert the conversion of SMMUv2 to domain_alloc_paging(), as it
breaks the boot for Qualcomm MSM8996 devices
- Restore SVA handle sharing in core code as it turned out there are
still drivers relying on it
* tag 'iommu-fixes-v6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/sva: Restore SVA handle sharing
iommu/arm-smmu-v3: Do not use GFP_KERNEL under as spinlock
iommu/vt-d: Fix constant-out-of-range warning
iommu/vt-d: Set SSADE when attaching to a parent with dirty tracking
iommu/vt-d: Add missing dirty tracking set for parent domain
iommu/vt-d: Wrap the dirty tracking loop to be a helper
iommu/vt-d: Remove domain parameter for intel_pasid_setup_dirty_tracking()
iommu/vt-d: Add missing device iotlb flush for parent domain
iommu/vt-d: Update iotlb in nested domain attach
iommu/vt-d: Add missing iotlb flush for parent domain
iommu/vt-d: Add __iommu_flush_iotlb_psi()
iommu/vt-d: Track nested domains in parent
Revert "iommu/arm-smmu: Convert to domain_alloc_paging()"
Prior to commit 092edaddb6 ("iommu: Support mm PASID 1:n with sva
domains") the code allowed a SVA handle to be bound multiple times to the
same (mm, device) pair. This was alluded to in the kdoc comment, but we
had understood this to be more a remark about allowing multiple devices,
not a literal same-driver re-opening the same SVA.
It turns out uacce and idxd were both relying on the core code to handle
reference counting for same-device same-mm scenarios. As this looks hard
to resolve in the drivers bring it back to the core code.
The new design has changed the meaning of the domain->users refcount to
refer to the number of devices that are sharing that domain for the same
mm. This is part of the design to lift the SVA domain de-duplication out
of the drivers.
Return the old behavior by explicitly de-duplicating the struct iommu_sva
handle. The same (mm, device) will return the same handle pointer and the
core code will handle tracking this. The last unbind of the handle will
destroy it.
Fixes: 092edaddb6 ("iommu: Support mm PASID 1:n with sva domains")
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Closes: https://lore.kernel.org/all/20240221110658.529-1-zhangfei.gao@linaro.org/
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/0-v1-9455fc497a6f+3b4-iommu_sva_sharing_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
- Fix CD allocation from atomic context when using SVA with SMMUv3
- Revert the conversion of SMMUv2 to domain_alloc_paging(), as it
breaks the boot for Qualcomm MSM8996 devices
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmXYgxgQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNKXKCACg7WYRfX725aIb2kW3ARW/nczdOwiXi9Fm
dXgCgQTJHn3tPiRR0HhTlDSif0GI0uXN96fYb5GqkkDFLlvA276DQkoc3ckh97wD
YaVtc/W77vDUgHYoADdbXu4oEjuSbBb9j+PU4fzwZToRNSsGi7fWfIbrMiWg5rkG
MFJR2qtTrMcuay2/nE/iIu7VHrd0j4rgYz1NiJ9+o2Vi2IjIpq+qaMm1TxzNRmOs
v9GXzCJ08uP/o1gayGdSHz2YtMhCMfIFfZ/yRYHsof8b9JY7lZ6bY5V89Z160Sy9
NpZgfs2BIa/JPmmllF8YxBnbx7nP3aiC/s6Is5wD0zY/wjXHFNR0
=IwdZ
-----END PGP SIGNATURE-----
Merge tag 'arm-smmu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into iommu/fixes
Arm SMMU fixes for 6.8
- Fix CD allocation from atomic context when using SVA with SMMUv3
- Revert the conversion of SMMUv2 to domain_alloc_paging(), as it
breaks the boot for Qualcomm MSM8996 devices
If the SMMU is configured to use a two level CD table then
arm_smmu_write_ctx_desc() allocates a CD table leaf internally using
GFP_KERNEL. Due to recent changes this is being done under a spinlock to
iterate over the device list - thus it will trigger a sleeping while
atomic warning:
arm_smmu_sva_set_dev_pasid()
mutex_lock(&sva_lock);
__arm_smmu_sva_bind()
arm_smmu_mmu_notifier_get()
spin_lock_irqsave()
arm_smmu_write_ctx_desc()
arm_smmu_get_cd_ptr()
arm_smmu_alloc_cd_leaf_table()
dmam_alloc_coherent(GFP_KERNEL)
This is a 64K high order allocation and really should not be done
atomically.
At the moment the rework of the SVA to follow the new API is half
finished. Recently the CD table memory was moved from the domain to the
master, however we have the confusing situation where the SVA code is
wrongly using the RID domains device's list to track which CD tables the
SVA is installed in.
Remove the logic to replicate the CD across all the domain's masters
during attach. We know which master and which CD table the PASID should be
installed in.
Right now SVA only works when dma-iommu.c is in control of the RID
translation, which means we have a single iommu_domain shared across the
entire group and that iommu_domain is not shared outside the group.
Critically this means that the iommu_group->devices list and RID's
smmu_domain->devices list describe the same set of masters.
For PCI cases the core code also insists on singleton groups so there is
only one entry in the smmu_domain->devices list that is equal to the
master being passed in to arm_smmu_sva_set_dev_pasid().
Only non-PCI cases may have multi-device groups. However, the core code
will repeat the calls to arm_smmu_sva_set_dev_pasid() across the entire
iommu_group->devices list.
Instead of having arm_smmu_mmu_notifier_get() indirectly loop over all the
devices in the group via the RID's smmu_domain, rely on
__arm_smmu_sva_bind() to be called for each device in the group and
install the repeated CD entry that way.
This avoids taking the spinlock to access the devices list and permits the
arm_smmu_write_ctx_desc() to use a sleeping allocation. Leave the
arm_smmu_mm_release() as a confusing situation, this requires tracking
attached masters inside the SVA domain.
Removing the loop allows arm_smmu_write_ctx_desc() to be called outside
the spinlock and thus is safe to use GFP_KERNEL.
Move the clearing of the CD into arm_smmu_sva_remove_dev_pasid() so that
arm_smmu_mmu_notifier_get/put() remain paired functions.
Fixes: 24503148c5 ("iommu/arm-smmu-v3: Refactor write_ctx_desc")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/4e25d161-0cf8-4050-9aa3-dfa21cd63e56@moroto.mountain/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Michael Shavit <mshavit@google.com>
Link: https://lore.kernel.org/r/0-v3-11978fc67151+112-smmu_cd_atomic_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
Since the current design doesn't forward the data_type to the driver to
check unless there is a data_len/uptr for a driver specific struct we
should check and ensure that data_type is 0 if data_len is 0. Otherwise
any value is permitted.
Fixes: bd529dbb66 ("iommufd: Add a nested HW pagetable object")
Link: https://lore.kernel.org/r/0-v1-9b1ea6869554+110c60-iommufd_ck_data_type_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
On 32-bit builds, the vt-d driver causes a warning with clang:
drivers/iommu/intel/nested.c:112:13: error: result of comparison of constant 18446744073709551615 with expression of type 'unsigned long' is always false [-Werror,-Wtautological-constant-out-of-range-compare]
112 | if (npages == U64_MAX)
| ~~~~~~ ^ ~~~~~~~
Make the variable a 64-bit type, which matches both the caller and the
use anyway.
Fixes: f6f3721244 ("iommu/vt-d: Add iotlb flush for nested domain")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20240213095832.455245-1-arnd@kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Should set the SSADE (Second Stage Access/Dirty bit Enable) bit of the
pasid entry when attaching a device to a nested domain if its parent
has already enabled dirty tracking.
Fixes: 111bf85c68 ("iommu/vt-d: Add helper to setup pasid nested translation")
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20240208091414.28133-1-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Setting dirty tracking for a s2 domain requires to loop all the related
devices and set the dirty tracking enable bit in the PASID table entry.
This includes the devices that are attached to the nested domains of a
s2 domain if this s2 domain is used as parent. However, the existing dirty
tracking set only loops s2 domain's own devices. It will miss dirty page
logs in the parent domain.
Now, the parent domain tracks the nested domains, so it can loop the
nested domains and the devices attached to the nested domains to ensure
dirty tracking on the parent is set completely.
Fixes: b41e38e225 ("iommu/vt-d: Add nested domain allocation")
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240208082307.15759-9-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add device_set_dirty_tracking() to loop all the devices and set the dirty
tracking per the @enable parameter.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Link: https://lore.kernel.org/r/20240208082307.15759-8-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The only usage of input @domain is to get the domain id (DID) to flush
cache after setting dirty tracking. However, DID can be obtained from
the pasid entry. So no need to pass in domain. This can make this helper
cleaner when adding the missing dirty tracking for the parent domain,
which needs to use the DID of nested domain.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240208082307.15759-7-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
ATS-capable devices cache the result of nested translation. This result
relies on the mappings in s2 domain (a.k.a. parent). When there are
modifications in the s2 domain, the related nested translation caches on
the device should be flushed. This includes the devices that are attached
to the s1 domain. However, the existing code ignores this fact to only
loops its own devices.
As there is no easy way to identify the exact set of nested translations
affected by the change of s2 domain. So, this just flushes the entire
device iotlb on the device.
As above, driver loops the s2 domain's s1_domains list and loops the
devices list of each s1_domain to flush the entire device iotlb on the
devices.
Fixes: b41e38e225 ("iommu/vt-d: Add nested domain allocation")
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240208082307.15759-6-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Should call domain_update_iotlb() to update the has_iotlb_device flag
of the domain after attaching device to nested domain. Without it, this
flag is not set properly and would result in missing device TLB flush.
Fixes: 9838f2bb6b ("iommu/vt-d: Set the nested domain to a device")
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240208082307.15759-5-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
If a domain is used as the parent in nested translation its mappings might
be cached using DID of the nested domain. But the existing code ignores
this fact to only invalidate the iotlb entries tagged by the domain's own
DID.
Loop the s1_domains list, if any, to invalidate all iotlb entries related
to the target s2 address range. According to VT-d spec there is no need for
software to explicitly flush the affected s1 cache. It's implicitly done by
HW when s2 cache is invalidated.
Fixes: b41e38e225 ("iommu/vt-d: Add nested domain allocation")
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240208082307.15759-4-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add __iommu_flush_iotlb_psi() to do the psi iotlb flush with a DID input
rather than calculating it within the helper.
This is useful when flushing cache for parent domain which reuses DIDs of
its nested domains.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240208082307.15759-3-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Today the parent domain (s2_domain) is unaware of which DID's are
used by and which devices are attached to nested domains (s1_domain)
nested on it. This leads to a problem that some operations (flush
iotlb/devtlb and enable dirty tracking) on parent domain only apply to
DID's and devices directly tracked in the parent domain hence are
incomplete.
This tracks the nested domains in list in parent domain. With this,
operations on parent domain can loop the nested domains and refer to
the devices and iommu_array to ensure the operations on parent domain
take effect on all the affected devices and iommus.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240208082307.15759-2-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This reverts commit 9b3febc3a3 ("iommu/arm-smmu: Convert to
domain_alloc_paging()"). It breaks Qualcomm MSM8996 platform. Calling
arm_smmu_write_context_bank() from new codepath results in the platform
being reset because of the unclocked hardware access.
Fixes: 9b3febc3a3 ("iommu/arm-smmu: Convert to domain_alloc_paging()")
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20240213-iommu-revert-domain-alloc-v1-1-325ff55dece4@linaro.org
Signed-off-by: Will Deacon <will@kernel.org>
For small bitmaps that aren't PAGE_SIZE aligned *and* that are less than
512 pages in bitmap length, use an extra page to be able to cover the
entire range e.g. [1M..3G] which would be iterated more efficiently in a
single iteration, rather than two.
Fixes: b058ea3ab5 ("vfio/iova_bitmap: refactor iova_bitmap_set() to better handle page boundaries")
Link: https://lore.kernel.org/r/20240202133415.23819-10-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Tested-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add support to mock iommu hugepages of 1M (for a 2K mock io page size). To
avoid breaking test suite defaults, the way this is done is by explicitly
creating a iommu mock device which has hugepage support (i.e. through
MOCK_FLAGS_DEVICE_HUGE_IOVA).
The same scheme is maintained of mock base page index tracking in the
XArray, except that an extra bit is added to mark it as a hugepage. One
subpage containing the dirty bit, means that the whole hugepage is dirty
(similar to AMD IOMMU non-standard page sizes). For clearing, same thing
applies, and it must clear all dirty subpages.
This is in preparation for dirty tracking to mark mock hugepages as
dirty to exercise all the iova-bitmap fixes.
Link: https://lore.kernel.org/r/20240202133415.23819-8-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move the clearing of the dirty bit of the mock domain into
mock_domain_test_and_clear_dirty() helper, simplifying the caller
function.
Additionally, rework the mock_domain_read_and_clear_dirty() loop to
iterate over a potentially variable IO page size. No functional change
intended with the loop refactor.
This is in preparation for dirty tracking support for IOMMU hugepage mock
domains.
Link: https://lore.kernel.org/r/20240202133415.23819-7-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
IOVA bitmap is a zero-copy scheme of recording dirty bits that iterate the
different bitmap user pages at chunks of a maximum of
PAGE_SIZE/sizeof(struct page*) pages.
When the iterations are split up into 64G, the end of the range may be
broken up in a way that's aligned with a non base page PTE size. This
leads to only part of the huge page being recorded in the bitmap. Note
that in pratice this is only a problem for IOMMU dirty tracking i.e. when
the backing PTEs are in IOMMU hugepages and the bitmap is in base page
granularity. So far this not something that affects VF dirty trackers
(which reports and records at the same granularity).
To fix that, if there is a remainder of bits left to set in which the
current IOVA bitmap doesn't cover, make a copy of the bitmap structure and
iterate-and-set the rest of the bits remaining. Finally, when advancing
the iterator, skip all the bits that were set ahead.
Link: https://lore.kernel.org/r/20240202133415.23819-5-joao.m.martins@oracle.com
Reported-by: Avihai Horon <avihaih@nvidia.com>
Fixes: f35f22cc76 ("iommu/vt-d: Access/Dirty bit support for SS domains")
Fixes: 421a511a29 ("iommu/amd: Access/Dirty bit support in IOPTEs")
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Tested-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
iova_bitmap_mapped_length() don't deal correctly with the small bitmaps
(< 2M bitmaps) when the starting address isn't u64 aligned, leading to
skipping a tiny part of the IOVA range. This is materialized as not
marking data dirty that should otherwise have been.
Fix that by using a u8 * in the internal state of IOVA bitmap. Most of the
data structures use the type of the bitmap to adjust its indexes, thus
changing the type of the bitmap decreases the granularity of the bitmap
indexes.
Fixes: b058ea3ab5 ("vfio/iova_bitmap: refactor iova_bitmap_set() to better handle page boundaries")
Link: https://lore.kernel.org/r/20240202133415.23819-3-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Tested-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Dirty IOMMU hugepages reported on a base page page-size granularity can
lead to an attempt to set dirty pages in the bitmap beyond the limits that
are pinned.
Bounds check the page index of the array we are trying to access is within
the limits before we kmap() and return otherwise.
While it is also a defensive check, this is also in preparation to defer
setting bits (outside the mapped range) to the next iteration(s) when the
pages become available.
Fixes: b058ea3ab5 ("vfio/iova_bitmap: refactor iova_bitmap_set() to better handle page boundaries")
Link: https://lore.kernel.org/r/20240202133415.23819-2-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Tested-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The ops->default_domain flow used a 0 req_type to select the default
domain and this was enforced by iommu_group_alloc_default_domain().
When !CONFIG_IOMMU_DMA started forcing the old ARM32 drivers into IDENTITY
it also overroad the 0 req_type of the ops->default_domain drivers to
IDENTITY which ends up causing failures during device probe.
Make iommu_group_alloc_default_domain() accept a req_type that matches the
ops->default_domain and have iommu_group_alloc_default_domain() generate a
req_type that matches the default_domain.
This way the req_type always describes what kind of domain should be
attached and ops->default_domain overrides all other mechanisms to choose
the default domain.
Fixes: 2ad56efa80 ("powerpc/iommu: Setup a default domain and remove set_platform_dma_ops")
Fixes: 0f6a90436a ("iommu: Do not use IOMMU_DOMAIN_DMA if CONFIG_IOMMU_DMA is not enabled")
Reported-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Closes: https://lore.kernel.org/linux-iommu/20240123165829.630276-1-ovidiu.panait@windriver.com/
Reported-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Closes: https://lore.kernel.org/linux-iommu/170618452753.3805.4425669653666211728.stgit@ltcd48-lp2.aus.stglab.ibm.com/
Tested-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Tested-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-755bd21c4a64+525b8-iommu_def_dom_fix_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This brings the first of three planned user IO page table invalidation
operations:
- IOMMU_HWPT_INVALIDATE allows invalidating the IOTLB integrated into the
iommu itself. The Intel implementation will also generate an ATC
invalidation to flush the device IOTLB as it unambiguously knows the
device, but other HW will not.
It goes along with the prior PR to implement userspace IO page tables (aka
nested translation for VMs) to allow Intel to have full functionality for
simple cases. An Intel implementation of the operation is provided.
Fix a small bug in the selftest mock iommu driver probe.
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZaFiRQAKCRCFwuHvBreF
YbmgAP9Z0+cAUPKxUKaMRls8YR+gmaOCniSkqBlyrxcib+F/WAD2NPLcBPBRk2o7
GfXPIrovx96Btf8M40AFdiTEp7LABw==
=9POe
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"This brings the first of three planned user IO page table invalidation
operations:
- IOMMU_HWPT_INVALIDATE allows invalidating the IOTLB integrated into
the iommu itself. The Intel implementation will also generate an
ATC invalidation to flush the device IOTLB as it unambiguously
knows the device, but other HW will not.
It goes along with the prior PR to implement userspace IO page tables
(aka nested translation for VMs) to allow Intel to have full
functionality for simple cases. An Intel implementation of the
operation is provided.
Also fix a small bug in the selftest mock iommu driver probe"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommufd/selftest: Check the bus type during probe
iommu/vt-d: Add iotlb flush for nested domain
iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
iommufd/selftest: Add mock_domain_cache_invalidate_user support
iommu: Add iommu_copy_struct_from_user_array helper
iommufd: Add IOMMU_HWPT_INVALIDATE
iommu: Add cache_invalidate_user op
Including:
- Core changes:
- Fix race conditions in device probe path
- Retire IOMMU bus_ops
- Support for passing custom allocators to page table drivers
- Clean up Kconfig around IOMMU_SVA
- Support for sharing SVA domains with all devices bound to
a mm
- Firmware data parsing cleanup
- Tracing improvements for iommu-dma code
- Some smaller fixes and cleanups
- ARM-SMMU drivers:
- Device-tree binding updates:
- Add additional compatible strings for Qualcomm SoCs
- Document Adreno clocks for Qualcomm's SM8350 SoC
- SMMUv2:
- Implement support for the ->domain_alloc_paging() callback
- Ensure Secure context is restored following suspend of Qualcomm SMMU
implementation
- SMMUv3:
- Disable stalling mode for the "quiet" context descriptor
- Minor refactoring and driver cleanups
- Intel VT-d driver:
- Cleanup and refactoring
- AMD IOMMU driver:
- Improve IO TLB invalidation logic
- Small cleanups and improvements
- Rockchip IOMMU driver:
- DT binding update to add Rockchip RK3588
- Apple DART driver:
- Apple M1 USB4/Thunderbolt DART support
- Cleanups
- Virtio IOMMU driver:
- Add support for iotlb_sync_map
- Enable deferred IO TLB flushes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmWecQoACgkQK/BELZcB
GuN5ZxAAzC5QUKAzANx0puk7QhPpKKlbSvj6Q7iRgCLk00KJO1+VQh9v4ouCmXqF
kn3Ko8gddjhtrgwN0OQ54F39cLUrp1SBemy71K5YOR+vu8VKtwtmawZGeeRZ+k+B
Eohw58oaXTiR1maYvoLixLYczLrjklqyJOQ1vZ0GxFGxDqrFByAryHDgG/3OCpJx
C9e6PsLbbfhfqA8Kv97iKcBqniGbXxAMuodqSUG0buQ3oZgfpIP6Bt3EgUzFGPGk
3BTlYxowS/gkjUWd3fgjQFIFLTA01u9FhpA2Jb0a4v67pUCR64YxHN7rBQ6ZChtG
kB9laQfU9re79RsHhqQzr0JT9x/eyq7pzGzjp5TV5TPW6IW+sqjMIPhzd9P08Ef7
BclkCVobx0jSAHOhnnG4QJiKANr2Y2oM3HfsAJccMMY45RRhUKmVqM7jxMPfGn3A
i+inlee73xTjZXJse1EWG1fmKKMLvX9LDEp4DyOfn9CqVT+7hpZvzPjfbGr937Rm
JlwXhF3rQXEpOCagEsbt1vOf+V0e9QiCLf1Y2KpkIkDbE5wwSD/2qLm3tFhJG3oF
fkW+J14Cid0pj+hY0afGe0kOUOIYlimu0nFmSf0pzMH+UktZdKogSfyb1gSDsy+S
rsZRGPFhMJ832ExqhlDfxqBebqh+jsfKynlskui6Td5C9ZULaHA=
=q751
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core changes:
- Fix race conditions in device probe path
- Retire IOMMU bus_ops
- Support for passing custom allocators to page table drivers
- Clean up Kconfig around IOMMU_SVA
- Support for sharing SVA domains with all devices bound to a mm
- Firmware data parsing cleanup
- Tracing improvements for iommu-dma code
- Some smaller fixes and cleanups
ARM-SMMU drivers:
- Device-tree binding updates:
- Add additional compatible strings for Qualcomm SoCs
- Document Adreno clocks for Qualcomm's SM8350 SoC
- SMMUv2:
- Implement support for the ->domain_alloc_paging() callback
- Ensure Secure context is restored following suspend of Qualcomm
SMMU implementation
- SMMUv3:
- Disable stalling mode for the "quiet" context descriptor
- Minor refactoring and driver cleanups
Intel VT-d driver:
- Cleanup and refactoring
AMD IOMMU driver:
- Improve IO TLB invalidation logic
- Small cleanups and improvements
Rockchip IOMMU driver:
- DT binding update to add Rockchip RK3588
Apple DART driver:
- Apple M1 USB4/Thunderbolt DART support
- Cleanups
Virtio IOMMU driver:
- Add support for iotlb_sync_map
- Enable deferred IO TLB flushes"
* tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (66 commits)
iommu: Don't reserve 0-length IOVA region
iommu/vt-d: Move inline helpers to header files
iommu/vt-d: Remove unused vcmd interfaces
iommu/vt-d: Remove unused parameter of intel_pasid_setup_pass_through()
iommu/vt-d: Refactor device_to_iommu() to retrieve iommu directly
iommu/sva: Fix memory leak in iommu_sva_bind_device()
dt-bindings: iommu: rockchip: Add Rockchip RK3588
iommu/dma: Trace bounce buffer usage when mapping buffers
iommu/arm-smmu: Convert to domain_alloc_paging()
iommu/arm-smmu: Pass arm_smmu_domain to internal functions
iommu/arm-smmu: Implement IOMMU_DOMAIN_BLOCKED
iommu/arm-smmu: Convert to a global static identity domain
iommu/arm-smmu: Reorganize arm_smmu_domain_add_master()
iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Master cannot be NULL in arm_smmu_write_strtab_ent()
iommu/arm-smmu-v3: Add a type for the STE
iommu/arm-smmu-v3: disable stall for quiet_cd
iommu/qcom: restore IOMMU state if needed
iommu/arm-smmu-qcom: Add QCM2290 MDSS compatible
iommu/arm-smmu-qcom: Add missing GMU entry to match table
...
This relied on the probe function only being invoked by the bus type mock
was registered on. The removal of the bus ops broke this assumption and
the probe could be called on non-mock bus types like PCI.
Check the bus type directly in probe.
Fixes: 17de3f5fdd ("iommu: Retire bus ops")
Link: https://lore.kernel.org/r/0-v1-82d59f7eab8c+40c-iommufd_mock_bus_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This implements the .cache_invalidate_user() callback to support iotlb
flush for nested domain.
Link: https://lore.kernel.org/r/20240111041015.47920-9-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Allow to test whether IOTLB has been invalidated or not.
Link: https://lore.kernel.org/r/20240111041015.47920-6-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add mock_domain_cache_invalidate_user() data structure to support user
space selftest program to cover user cache invalidation pathway.
Link: https://lore.kernel.org/r/20240111041015.47920-5-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In nested translation, the stage-1 page table is user-managed but cached
by the IOMMU hardware, so an update on present page table entries in the
stage-1 page table should be followed with a cache invalidation.
Add an IOMMU_HWPT_INVALIDATE ioctl to support such a cache invalidation.
It takes hwpt_id to specify the iommu_domain, and a multi-entry array to
support multiple invalidation data in one ioctl.
enum iommu_hwpt_invalidate_data_type is defined to tag the data type of
the entries in the multi-entry array.
Link: https://lore.kernel.org/r/20240111041015.47920-3-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
are included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the
series
"maple_tree: add mt_free_one() and mt_attr() helpers"
"Some cleanups of maple tree"
- In the series "mm: use memmap_on_memory semantics for dax/kmem"
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few
fixes) in the patch series
"Add folio_zero_tail() and folio_fill_tail()"
"Make folio_start_writeback return void"
"Fix fault handler's handling of poisoned tail pages"
"Convert aops->error_remove_page to ->error_remove_folio"
"Finish two folio conversions"
"More swap folio conversions"
- Kefeng Wang has also contributed folio-related work in the series
"mm: cleanup and use more folio in page fault"
- Jim Cromie has improved the kmemleak reporting output in the
series "tweak kmemleak report format".
- In the series "stackdepot: allow evicting stack traces" Andrey
Konovalov to permits clients (in this case KASAN) to cause
eviction of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series "mm:
page_alloc: fixes for high atomic reserve caluculations".
- Dmitry Rokosov has added to the samples/ dorectory some sample
code for a userspace memcg event listener application. See the
series "samples: introduce cgroup events listeners".
- Some mapletree maintanance work from Liam Howlett in the series
"maple_tree: iterator state changes".
- Nhat Pham has improved zswap's approach to writeback in the
series "workload-specific and memory pressure-driven zswap
writeback".
- DAMON/DAMOS feature and maintenance work from SeongJae Park in
the series
"mm/damon: let users feed and tame/auto-tune DAMOS"
"selftests/damon: add Python-written DAMON functionality tests"
"mm/damon: misc updates for 6.8"
- Yosry Ahmed has improved memcg's stats flushing in the series
"mm: memcg: subtree stats flushing and thresholds".
- In the series "Multi-size THP for anonymous memory" Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series "More buffer_head
cleanups".
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
"userfaultfd move option". UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a "KSM Advisor", in the series
"mm/ksm: Add ksm advisor". This is a governor which tunes KSM's
scanning aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory
use in the series "mm/zswap: dstmem reuse optimizations and
cleanups".
- Matthew Wilcox has performed some maintenance work on the
writeback code, both code and within filesystems. The series is
"Clean up the writeback paths".
- Andrey Konovalov has optimized KASAN's handling of alloc and
free stack traces for secondary-level allocators, in the series
"kasan: save mempool stack traces".
- Andrey also performed some KASAN maintenance work in the series
"kasan: assorted clean-ups".
- David Hildenbrand has gone to town on the rmap code. Cleanups,
more pte batching, folio conversions and more. See the series
"mm/rmap: interface overhaul".
- Kinsey Ho has contributed some maintenance work on the MGLRU
code in the series "mm/mglru: Kconfig cleanup".
- Matthew Wilcox has contributed lruvec page accounting code
cleanups in the series "Remove some lruvec page accounting
functions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
=0NHs
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the series
'maple_tree: add mt_free_one() and mt_attr() helpers'
'Some cleanups of maple tree'
- In the series 'mm: use memmap_on_memory semantics for dax/kmem'
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few fixes)
in the patch series
'Add folio_zero_tail() and folio_fill_tail()'
'Make folio_start_writeback return void'
'Fix fault handler's handling of poisoned tail pages'
'Convert aops->error_remove_page to ->error_remove_folio'
'Finish two folio conversions'
'More swap folio conversions'
- Kefeng Wang has also contributed folio-related work in the series
'mm: cleanup and use more folio in page fault'
- Jim Cromie has improved the kmemleak reporting output in the series
'tweak kmemleak report format'.
- In the series 'stackdepot: allow evicting stack traces' Andrey
Konovalov to permits clients (in this case KASAN) to cause eviction
of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series 'mm:
page_alloc: fixes for high atomic reserve caluculations'.
- Dmitry Rokosov has added to the samples/ dorectory some sample code
for a userspace memcg event listener application. See the series
'samples: introduce cgroup events listeners'.
- Some mapletree maintanance work from Liam Howlett in the series
'maple_tree: iterator state changes'.
- Nhat Pham has improved zswap's approach to writeback in the series
'workload-specific and memory pressure-driven zswap writeback'.
- DAMON/DAMOS feature and maintenance work from SeongJae Park in the
series
'mm/damon: let users feed and tame/auto-tune DAMOS'
'selftests/damon: add Python-written DAMON functionality tests'
'mm/damon: misc updates for 6.8'
- Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
memcg: subtree stats flushing and thresholds'.
- In the series 'Multi-size THP for anonymous memory' Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series 'More buffer_head
cleanups'.
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
Add ksm advisor'. This is a governor which tunes KSM's scanning
aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory use
in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.
- Matthew Wilcox has performed some maintenance work on the writeback
code, both code and within filesystems. The series is 'Clean up the
writeback paths'.
- Andrey Konovalov has optimized KASAN's handling of alloc and free
stack traces for secondary-level allocators, in the series 'kasan:
save mempool stack traces'.
- Andrey also performed some KASAN maintenance work in the series
'kasan: assorted clean-ups'.
- David Hildenbrand has gone to town on the rmap code. Cleanups, more
pte batching, folio conversions and more. See the series 'mm/rmap:
interface overhaul'.
- Kinsey Ho has contributed some maintenance work on the MGLRU code
in the series 'mm/mglru: Kconfig cleanup'.
- Matthew Wilcox has contributed lruvec page accounting code cleanups
in the series 'Remove some lruvec page accounting functions'"
* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
mm, treewide: introduce NR_PAGE_ORDERS
selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
selftests/mm: skip test if application doesn't has root privileges
selftests/mm: conform test to TAP format output
selftests: mm: hugepage-mmap: conform to TAP format output
selftests/mm: gup_test: conform test to TAP format output
mm/selftests: hugepage-mremap: conform test to TAP format output
mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
mm/memcontrol: remove __mod_lruvec_page_state()
mm/khugepaged: use a folio more in collapse_file()
slub: use a folio in __kmalloc_large_node
slub: use folio APIs in free_large_kmalloc()
slub: use alloc_pages_node() in alloc_slab_page()
mm: remove inc/dec lruvec page state functions
mm: ratelimit stat flush from workingset shrinker
kasan: stop leaking stack trace handles
mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
mm/mglru: add dummy pmd_dirty()
...
commit 23baf831a3 ("mm, treewide: redefine MAX_ORDER sanely") has
changed the definition of MAX_ORDER to be inclusive. This has caused
issues with code that was not yet upstream and depended on the previous
definition.
To draw attention to the altered meaning of the define, rename MAX_ORDER
to MAX_PAGE_ORDER.
Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>