linux-stable/mm
Aneesh Kumar K.V 97113eb39f mm/mremap: hold the rmap lock in write mode when moving page table entries.
To avoid a race between rmap walk and mremap, mremap does
take_rmap_locks().  The lock was taken to ensure that rmap walk don't miss
a page table entry due to PTE moves via move_pagetables().  The kernel
does further optimization of this lock such that if we are going to find
the newly added vma after the old vma, the rmap lock is not taken.  This
is because rmap walk would find the vmas in the same order and if we don't
find the page table attached to older vma we would find it with the new
vma which we would iterate later.

As explained in commit eb66ae0308 ("mremap: properly flush TLB before
releasing the page") mremap is special in that it doesn't take ownership
of the page.  The optimized version for PUD/PMD aligned mremap also
doesn't hold the ptl lock.  This can result in stale TLB entries as show
below.

This patch updates the rmap locking requirement in mremap to handle the race condition
explained below with optimized mremap::

Optmized PMD move

    CPU 1                           CPU 2                                   CPU 3

    mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one

    mmap_write_lock_killable()

                                    addr = old_addr
                                    lock(pte_ptl)
    lock(pmd_ptl)
    pmd = *old_pmd
    pmd_clear(old_pmd)
    flush_tlb_range(old_addr)

    *new_pmd = pmd
                                                                            *new_addr = 10; and fills
                                                                            TLB with new addr
                                                                            and old pfn

    unlock(pmd_ptl)
                                    ptep_clear_flush()
                                    old pfn is free.
                                                                            Stale TLB entry

Optimized PUD move also suffers from a similar race.  Both the above race
condition can be fixed if we force mremap path to take rmap lock.

Link: https://lkml.kernel.org/r/20210616045239.370802-7-aneesh.kumar@linux.ibm.com
Fixes: 2c91bd4a4e ("mm: speed up mremap by 20x on large regions")
Fixes: c49dd34018 ("mm: speedup mremap on 1GB or larger regions")
Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@mail.gmail.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-08 11:48:23 -07:00
..
kasan Merge branch 'akpm' (patches from Andrew) 2021-06-29 17:29:11 -07:00
kfence kfence: unconditionally use unbound work queue 2021-07-01 11:06:03 -07:00
Kconfig mm: introduce memfd_secret system call to create "secret" memory areas 2021-07-08 11:48:21 -07:00
Kconfig.debug mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO 2020-12-15 12:13:46 -08:00
Makefile mm: introduce memfd_secret system call to create "secret" memory areas 2021-07-08 11:48:21 -07:00
backing-dev.c writeback, cgroup: release dying cgwbs by switching attached inodes 2021-06-29 10:53:48 -07:00
balloon_compaction.c mm: fix typos in comments 2021-05-07 00:26:35 -07:00
bootmem_info.c mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c 2021-06-30 20:47:25 -07:00
cleancache.c
cma.c mm: use proper type for cma_[alloc|release] 2021-05-05 11:27:24 -07:00
cma.h mm: cma: support sysfs 2021-05-05 11:27:24 -07:00
cma_debug.c mm/cma: change cma mutex to irq safe spinlock 2021-05-05 11:27:21 -07:00
cma_sysfs.c mm: cma: support sysfs 2021-05-05 11:27:24 -07:00
compaction.c Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
debug.c mm/debug: factor PagePoisoned out of __dump_page 2021-06-29 10:53:53 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm/swapops: rework swap entry manipulation code 2021-07-01 11:06:03 -07:00
dmapool.c mm/dmapool: use DEVICE_ATTR_RO macro 2021-06-29 10:53:52 -07:00
early_ioremap.c mm/early_ioremap.c: use __func__ instead of function name 2021-02-26 09:41:02 -08:00
fadvise.c mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED 2020-10-13 18:38:29 -07:00
failslab.c
filemap.c Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:30:04 -07:00
frontswap.c mm/mempool: minor coding style tweaks 2021-05-05 11:27:27 -07:00
gup.c mm: introduce memfd_secret system call to create "secret" memory areas 2021-07-08 11:48:21 -07:00
gup_test.c selftests/vm: gup_test: test faulting in kernel, and verify pinnable pages 2021-05-05 11:27:26 -07:00
gup_test.h selftests/vm: gup_test: fix test flag 2021-05-05 11:27:26 -07:00
highmem.c mm: fix typos in comments 2021-05-07 00:26:35 -07:00
hmm.c mm: device exclusive memory access 2021-07-01 11:06:03 -07:00
huge_memory.c mm/rmap: split migration into its own function 2021-07-01 11:06:03 -07:00
hugetlb.c mm/swapops: rework swap entry manipulation code 2021-07-01 11:06:03 -07:00
hugetlb_cgroup.c hugetlb: make free_huge_page irq safe 2021-05-05 11:27:22 -07:00
hugetlb_vmemmap.c mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON 2021-06-30 20:47:26 -07:00
hugetlb_vmemmap.h mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate 2021-06-30 20:47:25 -07:00
hwpoison-inject.c mm,hwpoison-inject: don't pin for hwpoison_filter 2020-10-16 11:11:16 -07:00
init-mm.c mm: add setup_initial_init_mm() helper 2021-07-08 11:48:21 -07:00
internal.h mmap: make mlock_future_check() global 2021-07-08 11:48:20 -07:00
interval_tree.c mm/interval_tree: add comments to improve code readability 2021-04-30 11:20:38 -07:00
io-mapping.c mm: add a io_mapping_map_user helper 2021-04-30 11:20:39 -07:00
ioremap.c mm/ioremap: fix iomap_max_page_shift 2021-05-14 19:41:32 -07:00
khugepaged.c mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs 2021-06-30 20:47:29 -07:00
kmemleak.c mm/kmemleak: fix possible wrong memory scanning period 2021-06-29 10:53:47 -07:00
ksm.c mm/ksm: use vma_lookup() in find_mergeable_vma() 2021-06-29 10:53:52 -07:00
list_lru.c mm: vmscan: consolidate shrinker_maps handling code 2021-05-05 11:27:23 -07:00
maccess.c uaccess: add force_uaccess_{begin,end} helpers 2020-08-12 10:57:59 -07:00
madvise.c mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables 2021-06-30 20:47:30 -07:00
mapping_dirty_helpers.c mm/mapping_dirty_helpers: remove double Note in kerneldoc 2021-07-01 11:06:02 -07:00
memblock.c memblock, arm: fix crashes caused by holes in the memory map 2021-07-04 12:23:05 -07:00
memcontrol.c Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
memfd.c Reimplement RLIMIT_MEMLOCK on top of ucounts 2021-04-30 14:14:02 -05:00
memory-failure.c mm: fix spelling mistakes 2021-07-01 11:06:02 -07:00
memory.c mm: device exclusive memory access 2021-07-01 11:06:03 -07:00
memory_hotplug.c mm/memory_hotplug: fix kerneldoc comment for __remove_memory 2021-07-01 11:06:02 -07:00
mempolicy.c mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies 2021-06-30 20:47:29 -07:00
mempool.c kasan: use separate (un)poison implementation for integrated init 2021-06-04 19:32:21 +01:00
memremap.c mm/memremap.c: fix improper SPDX comment style 2021-04-30 11:20:37 -07:00
memtest.c
migrate.c mm: rename migrate_pgmap_owner 2021-07-01 11:06:03 -07:00
mincore.c inode: make init and permission helpers idmapped mount aware 2021-01-24 14:27:16 +01:00
mlock.c mm: introduce memfd_secret system call to create "secret" memory areas 2021-07-08 11:48:21 -07:00
mm_init.c include/linux/page-flags-layout.h: cleanups 2021-04-30 11:20:42 -07:00
mmap.c mmap: make mlock_future_check() global 2021-07-08 11:48:20 -07:00
mmap_lock.c mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations 2021-07-01 11:06:03 -07:00
mmu_gather.c mm: eliminate "expecting prototype" kernel-doc warnings 2021-04-16 16:10:36 -07:00
mmu_notifier.c mm/mmu_notifiers: ensure range_end() is paired with range_start() 2021-03-25 09:22:55 -07:00
mmzone.c mm/lru: replace pgdat lru_lock with lruvec lock 2020-12-15 14:48:04 -08:00
mprotect.c mm: device exclusive memory access 2021-07-01 11:06:03 -07:00
mremap.c mm/mremap: hold the rmap lock in write mode when moving page table entries. 2021-07-08 11:48:23 -07:00
msync.c mm/msync: exit early when the flags is an MS_ASYNC and start < vm_start 2021-04-30 11:20:37 -07:00
nommu.c mm/nommu: unexport do_munmap() 2021-06-30 20:47:30 -07:00
oom_kill.c Merge branch 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu 2021-07-04 12:58:33 -07:00
page-writeback.c for-5.14/block-2021-06-29 2021-06-30 12:12:56 -07:00
page_alloc.c Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
page_counter.c mm: page_counter: mitigate consequences of a page_counter underflow 2021-04-30 11:20:38 -07:00
page_ext.c mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM 2021-06-29 10:53:55 -07:00
page_idle.c mm: page_idle_get_page() does not need lru_lock 2020-12-15 14:48:03 -08:00
page_io.c swap: fix swapfile read/write offset 2021-03-02 17:25:46 -07:00
page_isolation.c mm/page_isolation: do not isolate the max order page 2020-12-15 12:13:45 -08:00
page_owner.c mm/page_owner: constify dump_page_owner 2021-06-29 10:53:53 -07:00
page_poison.c mm: page_poison: print page info when corruption is caught 2021-04-30 11:20:36 -07:00
page_reporting.c mm/page_reporting: allow driver to specify reporting order 2021-06-29 10:53:47 -07:00
page_reporting.h mm/page_reporting: export reporting order as module parameter 2021-06-29 10:53:47 -07:00
page_vma_mapped.c mm: device exclusive memory access 2021-07-01 11:06:03 -07:00
pagewalk.c mm: pagewalk: fix walk for hugepage tables 2021-06-29 10:53:49 -07:00
percpu-internal.h Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu 2021-07-01 17:17:24 -07:00
percpu-km.c percpu: rework memcg accounting 2021-06-05 20:43:15 +00:00
percpu-stats.c percpu: rework memcg accounting 2021-06-05 20:43:15 +00:00
percpu-vm.c Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu 2021-07-01 17:17:24 -07:00
percpu.c Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu 2021-07-01 17:17:24 -07:00
pgalloc-track.h mm: fix typos in comments 2021-05-07 00:26:35 -07:00
pgtable-generic.c mm/thp: fix __split_huge_pmd_locked() on shmem migration entry 2021-06-16 09:24:42 -07:00
process_vm_access.c mm/process_vm_access.c: remove duplicate include 2021-05-05 11:27:27 -07:00
ptdump.c mm: ptdump: fix build failure 2021-04-16 16:10:37 -07:00
readahead.c mm: Implement readahead_control pageset expansion 2021-04-23 10:14:29 +01:00
rmap.c mm: device exclusive memory access 2021-07-01 11:06:03 -07:00
rodata_test.c mm/rodata_test.c: fix missing function declaration 2020-08-21 09:52:53 -07:00
secretmem.c PM: hibernate: disable when there are active secretmem users 2021-07-08 11:48:21 -07:00
shmem.c Merge branch 'akpm' (patches from Andrew) 2021-07-02 12:08:10 -07:00
shuffle.c mm: eliminate "expecting prototype" kernel-doc warnings 2021-04-16 16:10:36 -07:00
shuffle.h mm/shuffle: fix section mismatch warning 2021-05-22 15:09:07 -10:00
slab.c mm: fix typos in comments 2021-05-07 00:26:35 -07:00
slab.h Merge branch 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu 2021-07-04 12:58:33 -07:00
slab_common.c Merge branch 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu 2021-07-04 12:58:33 -07:00
slob.c mm: Don't build mm_dump_obj() on CONFIG_PRINTK=n kernels 2021-03-08 14:18:46 -08:00
slub.c mm/slub: use stackdepot to save stack trace in objects 2021-07-08 11:48:20 -07:00
sparse-vmemmap.c mm: sparsemem: split the huge PMD mapping of vmemmap pages 2021-06-30 20:47:26 -07:00
sparse.c mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c 2021-06-30 20:47:25 -07:00
swap.c mm: fix typos and grammar error in comments 2021-07-01 11:06:02 -07:00
swap_cgroup.c mm: memcontrol: make swap tracking an integral part of memory control 2020-06-03 20:09:48 -07:00
swap_slots.c mm/swap_slots.c: delete meaningless forward declarations 2021-06-29 10:53:49 -07:00
swap_state.c swap: check mapping_empty() for swap cache before being freed 2021-06-29 10:53:49 -07:00
swapfile.c mm: fix spelling mistakes 2021-07-01 11:06:02 -07:00
truncate.c mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() 2021-06-16 09:24:42 -07:00
usercopy.c mm/usercopy.c: delete duplicated word 2020-08-12 10:57:58 -07:00
userfaultfd.c userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte() 2021-06-30 20:47:27 -07:00
util.c Merge branch 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu 2021-07-04 12:58:33 -07:00
vmacache.c kernel: better document the use_mm/unuse_mm API contract 2020-06-10 19:14:18 -07:00
vmalloc.c mm/vmalloc: include header for prototype of set_iounmap_nonlazy 2021-07-01 11:06:02 -07:00
vmpressure.c
vmscan.c mm/vmscan: remove kerneldoc-like comment from isolate_lru_pages 2021-07-01 11:06:02 -07:00
vmstat.c mm/vmstat: inline NUMA event counter updates 2021-06-29 10:53:54 -07:00
workingset.c mm: workingset: define macro WORKINGSET_SHIFT 2021-06-30 20:47:28 -07:00
z3fold.c mm/z3fold: add kerneldoc fields for z3fold_pool 2021-07-01 11:06:03 -07:00
zbud.c mm/zbud: add kerneldoc fields for zbud_pool 2021-07-01 11:06:03 -07:00
zpool.c mm: fix typos in comments 2021-05-07 00:26:35 -07:00
zsmalloc.c mm/zsmalloc.c: improve readability for async_free_zspage() 2021-07-01 11:06:02 -07:00
zswap.c mm/zswap.c: fix two bugs in zswap_writeback_entry() 2021-06-30 20:47:31 -07:00