linux-stable/mm
David Hildenbrand 980cde3ac4 mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast
commit 5805192c7b upstream.

In contrast to most other GUP code, GUP-fast common page table walking
code like gup_pte_range() also handles hugetlb pages.  But in contrast to
other hugetlb page table walking code, it does not look at the hugetlb PTE
abstraction whereby we have only a single logical hugetlb PTE per hugetlb
page, even when using multiple cont-PTEs underneath -- which is for
example what huge_ptep_get() abstracts.

So when we have a hugetlb page that is mapped via cont-PTEs, GUP-fast
might stumble over a PTE that does not map the head page of a hugetlb page
-- not the first "head" PTE of such a cont mapping.

Logically, the whole hugetlb page is mapped (entire_mapcount == 1), but we
might end up calling gup_must_unshare() with a tail page of a hugetlb
page.

We only maintain a single PageAnonExclusive flag per hugetlb page (as
hugetlb pages cannot get partially COW-shared), stored for the head page.
That flag is clear for all tail pages.

So when gup_must_unshare() ends up calling PageAnonExclusive() with a tail
page of a hugetlb page:

1) With CONFIG_DEBUG_VM_PGFLAGS

Stumbles over the:

	VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page);

For example, when executing the COW selftests with 64k hugetlb pages on
arm64:

  [   61.082187] page:00000000829819ff refcount:3 mapcount:1 mapping:0000000000000000 index:0x1 pfn:0x11ee11
  [   61.082842] head:0000000080f79bf7 order:4 entire_mapcount:1 nr_pages_mapped:0 pincount:2
  [   61.083384] anon flags: 0x17ffff80003000e(referenced|uptodate|dirty|head|mappedtodisk|node=0|zone=2|lastcpupid=0xfffff)
  [   61.084101] page_type: 0xffffffff()
  [   61.084332] raw: 017ffff800000000 fffffc00037b8401 0000000000000402 0000000200000000
  [   61.084840] raw: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000
  [   61.085359] head: 017ffff80003000e ffffd9e95b09b788 ffffd9e95b09b788 ffff0007ff63cf71
  [   61.085885] head: 0000000000000000 0000000000000002 00000003ffffffff 0000000000000000
  [   61.086415] page dumped because: VM_BUG_ON_PAGE(PageHuge(page) && !PageHead(page))
  [   61.086914] ------------[ cut here ]------------
  [   61.087220] kernel BUG at include/linux/page-flags.h:990!
  [   61.087591] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
  [   61.087999] Modules linked in: ...
  [   61.089404] CPU: 0 PID: 4612 Comm: cow Kdump: loaded Not tainted 6.5.0-rc4+ #3
  [   61.089917] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  [   61.090409] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [   61.090897] pc : gup_must_unshare.part.0+0x64/0x98
  [   61.091242] lr : gup_must_unshare.part.0+0x64/0x98
  [   61.091592] sp : ffff8000825eb940
  [   61.091826] x29: ffff8000825eb940 x28: 0000000000000000 x27: fffffc00037b8440
  [   61.092329] x26: 0400000000000001 x25: 0000000000080101 x24: 0000000000080000
  [   61.092835] x23: 0000000000080100 x22: ffff0000cffb9588 x21: ffff0000c8ec6b58
  [   61.093341] x20: 0000ffffad6b1000 x19: fffffc00037b8440 x18: ffffffffffffffff
  [   61.093850] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548
  [   61.094358] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021
  [   61.094858] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffffd9e958b7a1c0
  [   61.095359] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 00000000002bffa8
  [   61.095873] x5 : ffff0008bb19e708 x4 : 0000000000000000 x3 : 0000000000000000
  [   61.096380] x2 : 0000000000000000 x1 : ffff0000cf6636c0 x0 : 0000000000000046
  [   61.096894] Call trace:
  [   61.097080]  gup_must_unshare.part.0+0x64/0x98
  [   61.097392]  gup_pte_range+0x3a8/0x3f0
  [   61.097662]  gup_pgd_range+0x1ec/0x280
  [   61.097942]  lockless_pages_from_mm+0x64/0x1a0
  [   61.098258]  internal_get_user_pages_fast+0xe4/0x1d0
  [   61.098612]  pin_user_pages_fast+0x58/0x78
  [   61.098917]  pin_longterm_test_start+0xf4/0x2b8
  [   61.099243]  gup_test_ioctl+0x170/0x3b0
  [   61.099528]  __arm64_sys_ioctl+0xa8/0xf0
  [   61.099822]  invoke_syscall.constprop.0+0x7c/0xd0
  [   61.100160]  el0_svc_common.constprop.0+0xe8/0x100
  [   61.100500]  do_el0_svc+0x38/0xa0
  [   61.100736]  el0_svc+0x3c/0x198
  [   61.100971]  el0t_64_sync_handler+0x134/0x150
  [   61.101280]  el0t_64_sync+0x17c/0x180
  [   61.101543] Code: aa1303e0 f00074c1 912b0021 97fffeb2 (d4210000)

2) Without CONFIG_DEBUG_VM_PGFLAGS

Always detects "not exclusive" for passed tail pages and refuses to PIN
the tail pages R/O, as gup_must_unshare() == true.  GUP-fast will fallback
to ordinary GUP.  As ordinary GUP properly considers the logical hugetlb
PTE abstraction in hugetlb_follow_page_mask(), pinning the page will
succeed when looking at the PageAnonExclusive on the head page only.

So the only real effect of this is that with cont-PTE hugetlb pages, we'll
always fallback from GUP-fast to ordinary GUP when not working on the head
page, which ends up checking the head page and do the right thing.

Consequently, the cow selftests pass with cont-PTE hugetlb pages as well
without CONFIG_DEBUG_VM_PGFLAGS.

Note that this only applies to anon hugetlb pages that are mapped using
cont-PTEs: for example 64k hugetlb pages on a 4k arm64 kernel.

... and only when R/O-pinning (FOLL_PIN) such pages that are mapped into
the page table R/O using GUP-fast.

On production kernels (and even most debug kernels, that don't set
CONFIG_DEBUG_VM_PGFLAGS) this patch should theoretically not be required
to be backported.  But of course, it does not hurt.

Link: https://lkml.kernel.org/r/20230805101256.87306-1-david@redhat.com
Fixes: a7f2266041 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-30 14:52:36 +02:00
..
damon mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
kasan kasan,kmsan: remove __GFP_KSWAPD_RECLAIM usage from kasan/kmsan 2023-08-11 12:14:24 +02:00
kfence mm: kfence: fix false positives on big endian 2023-05-17 15:24:33 -07:00
kmsan kasan,kmsan: remove __GFP_KSWAPD_RECLAIM usage from kasan/kmsan 2023-08-11 12:14:24 +02:00
backing-dev.c - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
balloon_compaction.c
bootmem_info.c
cma.c mm: move most of core MM initialization to mm/mm_init.c 2023-04-05 19:42:52 -07:00
cma.h
cma_debug.c
cma_sysfs.c mm: cma: make kobj_type structure constant 2023-03-28 16:20:06 -07:00
compaction.c - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
debug.c mm/debug: use %pGt to display page_type in dump_page() 2023-03-28 16:20:09 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
dmapool.c dmapool: link blocks across pages 2023-05-06 10:33:38 -07:00
dmapool_test.c dmapool: add alloc/free performance test 2023-04-05 19:42:38 -07:00
early_ioremap.c
fadvise.c mm: support POSIX_FADV_NOREUSE 2023-01-18 17:12:57 -08:00
failslab.c mm: fix unexpected changes to {failslab|fail_page_alloc}.attr 2022-11-22 18:50:44 -08:00
filemap.c Revert "page cache: fix page_cache_next/prev_miss off by one" 2023-08-11 12:14:23 +02:00
folio-compat.c - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
frontswap.c
gup.c mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT 2023-08-30 14:52:36 +02:00
gup_test.c mm/gup_test: fix ioctl fail for compat task 2023-06-12 11:31:51 -07:00
gup_test.h mm/gup_test: start/stop/read functionality for PIN LONGTERM test 2022-11-08 17:37:15 -08:00
highmem.c highmem: fix kmap_to_page() for kmap_local_page() addresses 2022-10-12 18:51:51 -07:00
hmm.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
huge_memory.c mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT 2023-08-30 14:52:36 +02:00
hugetlb.c hugetlb: do not clear hugetlb dtor until allocating vmemmap 2023-08-16 18:32:20 +02:00
hugetlb_cgroup.c mm/hugetlb: increase use of folios in alloc_huge_page() 2023-02-13 15:54:27 -08:00
hugetlb_vmemmap.c mm, page_alloc: use check_pages_enabled static key to check tail pages 2023-04-18 16:29:54 -07:00
hugetlb_vmemmap.h
hwpoison-inject.c mm/hwpoison: add __init/__exit annotations to module init/exit funcs 2022-10-03 14:03:05 -07:00
init-mm.c IOMMU Updates for Linux 6.4 2023-04-30 13:00:38 -07:00
internal.h mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast 2023-08-30 14:52:36 +02:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig mm: introduce new 'lock_mm_and_find_vma()' page fault helper 2023-07-01 13:12:38 +02:00
Kconfig.debug mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM 2023-05-29 16:14:28 +01:00
khugepaged.c mm/khugepaged: fix regression in collapse_file() 2023-07-01 13:12:40 +02:00
kmemleak.c lib/stackdepot, mm: rename stack_depot_want_early_init 2023-02-16 20:43:49 -08:00
ksm.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
list_lru.c
maccess.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
madvise.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
Makefile - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
mapping_dirty_helpers.c mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export 2023-02-02 22:32:54 -08:00
memblock.c mm: avoid passing 0 to __ffs() 2023-04-18 16:29:42 -07:00
memcontrol.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
memfd.c memfd: check for non-NULL file_seals in memfd_create() syscall 2023-06-19 13:19:31 -07:00
memory-failure.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
memory-tiers.c memory tier: release the new_memtier in find_create_memory_tier() 2023-02-09 16:51:40 -08:00
memory.c mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock 2023-08-11 12:14:05 +02:00
memory_hotplug.c mm: avoid passing 0 to __ffs() 2023-04-18 16:29:42 -07:00
mempolicy.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
mempool.c mempool: do not use ksize() for poisoning 2022-11-30 15:58:41 -08:00
memremap.c mm/memremap.c: fix outdated comment in devm_memremap_pages 2023-02-09 16:51:46 -08:00
memtest.c mm/memtest: add results of early memtest to /proc/meminfo 2023-04-05 19:42:55 -07:00
migrate.c Add support for new Linear Address Masking CPU feature. This is similar 2023-04-28 09:43:49 -07:00
migrate_device.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
mincore.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
mlock.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
mm_init.c mm/vmemmap/devdax: fix kernel crash when probing devdax devices 2023-04-18 16:30:09 -07:00
mm_slot.h mm: introduce common struct mm_slot 2022-10-03 14:02:43 -07:00
mmap.c mm: lock VMA in dup_anon_vma() before setting ->anon_vma 2023-08-03 10:26:14 +02:00
mmap_lock.c
mmu_gather.c mm: prefer xxx_page() alloc/free functions for order-0 pages 2023-03-28 16:20:16 -07:00
mmu_notifier.c mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export 2023-02-02 22:32:54 -08:00
mmzone.c
mprotect.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
mremap.c mm/mremap: write-lock VMA while remapping it to a new address range 2023-04-05 20:02:58 -07:00
msync.c
nommu.c xtensa: fix lock_mm_and_find_vma in case VMA not found 2023-07-05 18:30:30 +01:00
oom_kill.c mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export 2023-02-02 22:32:54 -08:00
page-writeback.c writeback: account the number of pages written back 2023-07-19 16:36:50 +02:00
page_alloc.c - Some DAMON cleanups from Kefeng Wang 2023-05-04 13:09:43 -07:00
page_counter.c
page_ext.c mm/page_ext: init page_ext early if there are no deferred struct pages 2023-02-02 22:33:22 -08:00
page_idle.c mm: page_idle: convert page idle to use a folio 2023-01-18 17:12:52 -08:00
page_io.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
page_isolation.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_owner.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_poison.c
page_reporting.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_reporting.h
page_table_check.c mm: page_table_check: Ensure user pages are not slab pages 2023-05-29 16:14:28 +01:00
page_vma_mapped.c mm/hugetlb: introduce hugetlb_walk() 2023-01-18 17:12:39 -08:00
pagewalk.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
percpu-internal.h mm: percpu: fix incorrect size in pcpu_obj_full_size() 2023-02-16 20:43:55 -08:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c mm: memcontrol: rename memcg_kmem_enabled() 2023-02-16 20:43:56 -08:00
pgalloc-track.h
pgtable-generic.c mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault() 2023-03-28 16:20:12 -07:00
process_vm_access.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
ptdump.c
readahead.c readahead: convert readahead_expand() to use a folio 2023-02-02 22:33:21 -08:00
rmap.c mm,unmap: avoid flushing TLB in batch if PTE is inaccessible 2023-04-27 13:42:16 -07:00
rodata_test.c mm/rodata_test: use PAGE_ALIGNED() helper 2022-10-03 14:03:05 -07:00
secretmem.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
shmem.c shmem: fix smaps BUG sleeping while atomic 2023-08-30 14:52:35 +02:00
shrinker_debug.c Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless" 2023-06-19 13:19:34 -07:00
shuffle.c mm/shuffle: convert module_param_call to module_param_cb 2022-10-03 14:03:07 -07:00
shuffle.h mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
slab.c mm: vmscan: refactor updating current->reclaim_state 2023-04-18 16:30:10 -07:00
slab.h kasan, slub: fix HW_TAGS zeroing with slub_debug 2023-07-23 13:53:54 +02:00
slab_common.c mm/slab: document kfree() as allowed for kmem_cache_alloc() objects 2023-03-29 10:35:41 +02:00
slub.c - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
sparse-vmemmap.c mm/vmemmap/devdax: fix kernel crash when probing devdax devices 2023-04-18 16:30:09 -07:00
sparse.c sparse: remove unnecessary 0 values from rc 2023-04-21 14:52:05 -07:00
swap.c mm: swap: fix performance regression on sparsetruncate-tiny 2023-04-16 10:41:24 -07:00
swap.h mm: remove the __swap_writepage return value 2023-02-02 22:33:33 -08:00
swap_cgroup.c mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled 2022-10-03 14:03:36 -07:00
swap_slots.c mm/swap: convert put_swap_page() to put_swap_folio() 2022-10-03 14:02:46 -07:00
swap_state.c mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
swapfile.c sync mm-stable with mm-hotfixes-stable to pick up depended-upon upstream changes 2023-04-16 12:31:58 -07:00
truncate.c mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
usercopy.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
userfaultfd.c userfaultfd: use helper function range_in_vma() 2023-04-21 14:52:02 -07:00
util.c mm: uninline kstrdup() 2023-04-08 13:45:37 -07:00
vmalloc.c mm/vmalloc: do not output a spurious warning when huge vmalloc() fails 2023-06-19 13:19:31 -07:00
vmpressure.c
vmscan.c mm: enable page walking API to lock vmas during the walk 2023-08-30 14:52:36 +02:00
vmstat.c mm: introduce per-VMA lock statistics 2023-04-05 20:03:01 -07:00
workingset.c mm: workingset: update description of the source file 2023-04-18 16:30:11 -07:00
z3fold.c mm: remove PageMovable export 2023-01-18 17:12:57 -08:00
zbud.c zpool: clean out dead code 2022-12-11 18:12:10 -08:00
zpool.c zpool: remove MODULE_LICENSE in non-modules 2023-04-13 13:13:54 -07:00
zsmalloc.c zsmalloc: fix races between modifications of fullness and isolated 2023-08-16 18:32:19 +02:00
zswap.c zswap: do not shrink if cgroup may not zswap 2023-06-12 11:31:52 -07:00