linux-stable/mm
Dave Hansen 5a28fc94c9 x86/mpx, mm/core: Fix recursive munmap() corruption
This is a bit of a mess, to put it mildly.  But, it's a bug
that only seems to have showed up in 4.20 but wasn't noticed
until now, because nobody uses MPX.

MPX has the arch_unmap() hook inside of munmap() because MPX
uses bounds tables that protect other areas of memory.  When
memory is unmapped, there is also a need to unmap the MPX
bounds tables.  Barring this, unused bounds tables can eat 80%
of the address space.

But, the recursive do_munmap() that gets called vi arch_unmap()
wreaks havoc with __do_munmap()'s state.  It can result in
freeing populated page tables, accessing bogus VMA state,
double-freed VMAs and more.

See the "long story" further below for the gory details.

To fix this, call arch_unmap() before __do_unmap() has a chance
to do anything meaningful.  Also, remove the 'vma' argument
and force the MPX code to do its own, independent VMA lookup.

== UML / unicore32 impact ==

Remove unused 'vma' argument to arch_unmap().  No functional
change.

I compile tested this on UML but not unicore32.

== powerpc impact ==

powerpc uses arch_unmap() well to watch for munmap() on the
VDSO and zeroes out 'current->mm->context.vdso_base'.  Moving
arch_unmap() makes this happen earlier in __do_munmap().  But,
'vdso_base' seems to only be used in perf and in the signal
delivery that happens near the return to userspace.  I can not
find any likely impact to powerpc, other than the zeroing
happening a little earlier.

powerpc does not use the 'vma' argument and is unaffected by
its removal.

I compile-tested a 64-bit powerpc defconfig.

== x86 impact ==

For the common success case this is functionally identical to
what was there before.  For the munmap() failure case, it's
possible that some MPX tables will be zapped for memory that
continues to be in use.  But, this is an extraordinarily
unlikely scenario and the harm would be that MPX provides no
protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

	ptr = mmap();
	// use ptr
	ret = munmap(ptr);
	if (ret)
		// oh, there was an error, I'll
		// keep using ptr.

Because if you're doing munmap(), you are *done* with the
memory.  There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as
well as the existing mpx selftests/.

The long story:

munmap() has a couple of pieces:

 1. Find the affected VMA(s)
 2. Split the start/end one(s) if neceesary
 3. Pull the VMAs out of the rbtree
 4. Actually zap the memory via unmap_region(), including
    freeing page tables (or queueing them to be freed).
 5. Fix up some of the accounting (like fput()) and actually
    free the VMA itself.

This specific ordering was actually introduced by:

  dd2283f260 ("mm: mmap: zap pages with read mmap_sem in munmap")

during the 4.20 merge window.  The previous __do_munmap() code
was actually safe because the only thing after arch_unmap() was
remove_vma_list().  arch_unmap() could not see 'vma' in the
rbtree because it was detached, so it is not even capable of
doing operations unsafe for remove_vma_list()'s use of 'vma'.

Richard Biener reported a test that shows this in dmesg:

  [1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
  [1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via
arch_unmap().  It was freeing page tables that has not been
properly zapped.

But, the problem was bigger than this.  For one, arch_unmap()
can free VMAs.  But, the calling __do_munmap() has variables
that *point* to VMAs and obviously can't handle them just
getting freed while the pointer is still in use.

I tried a couple of things here.  First, I tried to fix the page
table freeing problem in isolation, but I then found the VMA
issue.  I also tried having the MPX code return a flag if it
modified the rbtree which would force __do_munmap() to re-walk
to restart.  That spiralled out of control in complexity pretty
fast.

Just moving arch_unmap() and accepting that the bonkers failure
case might eat some bounds tables seems like the simplest viable
fix.

This was also reported in the following kernel bugzilla entry:

  https://bugzilla.kernel.org/show_bug.cgi?id=203123

There are some reports that this commit triggered this bug:

  dd2283f260 ("mm: mmap: zap pages with read mmap_sem in munmap")

While that commit certainly made the issues easier to hit, I believe
the fundamental issue has been with us as long as MPX itself, thus
the Fixes: tag below is for one of the original MPX commits.

[ mingo: Minor edits to the changelog and the patch. ]

Reported-by: Richard Biener <rguenther@suse.de>
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-um@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: stable@vger.kernel.org
Fixes: dd2283f260 ("mm: mmap: zap pages with read mmap_sem in munmap")
Link: http://lkml.kernel.org/r/20190419194747.5E1AD6DC@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-05-09 10:37:17 +02:00
..
kasan arm64 updates for 5.2 2019-05-06 17:54:22 -07:00
backing-dev.c writeback: synchronize sync(2) against cgroup writeback membership switches 2019-01-22 14:39:38 -07:00
balloon_compaction.c
cleancache.c mm: use octal not symbolic permissions 2018-06-15 07:55:25 +09:00
cma.c memblock: emphasize that memblock_alloc_range() returns a physical address 2019-03-12 10:04:01 -07:00
cma.h
cma_debug.c mm/cma_debug.c: remove static scoped cma_debugfs_root 2019-03-05 21:07:20 -08:00
compaction.c mm/compaction.c: abort search if isolation fails 2019-04-04 11:56:15 +01:00
debug.c mm/debug.c: fix __dump_page when mapping->host is not set 2019-03-29 10:01:37 -07:00
debug_page_ref.c
dmapool.c docs/core-api/mm: fix return value descriptions in mm/ 2019-03-05 21:07:20 -08:00
early_ioremap.c mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep 2017-12-11 14:54:44 +01:00
fadvise.c vfs: implement readahead(2) using POSIX_FADV_WILLNEED 2018-08-30 20:01:32 +02:00
failslab.c mm: no need to check return value of debugfs_create functions 2019-03-05 21:07:17 -08:00
filemap.c filemap: add a comment about FAULT_FLAG_RETRY_NOWAIT behavior 2019-03-15 11:26:07 -07:00
frame_vector.c mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()' 2017-12-14 16:00:48 -08:00
frontswap.c mm: use octal not symbolic permissions 2018-06-15 07:55:25 +09:00
gup.c Merge branch 'page-refs' (page ref overflow) 2019-04-14 15:09:40 -07:00
gup_benchmark.c mm: no need to check return value of debugfs_create functions 2019-03-05 21:07:17 -08:00
highmem.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
hmm.c mm/hmm: convert to use vm_fault_t 2019-03-12 10:04:00 -07:00
huge_memory.c Merge branch 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-06 11:36:58 -07:00
hugetlb.c Merge branch 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-06 11:36:58 -07:00
hugetlb_cgroup.c mm: rename page_counter's count/limit into usage/max 2018-06-07 17:34:35 -07:00
hwpoison-inject.c mm/memory_failure: Remove unused trapno from memory_failure 2018-01-23 12:17:42 -06:00
init-mm.c mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids 2018-07-17 09:35:30 +02:00
internal.h mm, compaction: capture a page under direct compaction 2019-03-05 21:07:17 -08:00
interval_tree.c mm/interval_tree.c: use vma_pages() helper 2018-01-31 17:18:37 -08:00
Kconfig ksm: replace jhash2 with xxhash 2018-12-28 12:11:46 -08:00
Kconfig.debug mm/page_owner: move config option to mm/Kconfig.debug 2019-03-05 21:07:18 -08:00
khugepaged.c mm: memcontrol: expose THP events on a per-memcg basis 2019-03-05 21:07:19 -08:00
kmemleak-test.c
kmemleak.c Merge branch 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-06 13:11:48 -07:00
ksm.c mm: ksm: do not block on page lock when searching stable tree 2019-03-05 21:07:19 -08:00
list_lru.c numa: make "nr_node_ids" unsigned int 2019-03-05 21:07:19 -08:00
maccess.c Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" 2019-02-25 09:10:51 -08:00
madvise.c asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE 2019-04-03 10:32:40 +02:00
Makefile mm: remove nobootmem 2018-10-31 08:54:16 -07:00
memblock.c mm: memblock: update comments and kernel-doc 2019-03-12 10:04:02 -07:00
memcontrol.c mm: writeback: use exact memcg dirty counts 2019-04-05 16:02:31 -10:00
memfd.c mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd 2019-03-05 21:07:19 -08:00
memory-failure.c mm: hwpoison: fix thp split handing in soft_offline_in_use_page() 2019-03-05 21:07:13 -08:00
memory.c asm-generic/tlb: Remove tlb_flush_mmu_free() 2019-04-03 10:33:01 +02:00
memory_hotplug.c mm/memory_hotplug.c: drop memory device reference after find_memory_block() 2019-04-26 09:18:04 -07:00
mempolicy.c mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified 2019-03-29 10:01:37 -07:00
mempool.c docs/core-api/mm: fix return value descriptions in mm/ 2019-03-05 21:07:20 -08:00
memtest.c
migrate.c mm/migrate.c: add missing flush_dcache_page for non-mapped page migrate 2019-03-29 10:01:37 -07:00
mincore.c Revert "Change mincore() to count "mapped" pages rather than "cached" pages" 2019-01-24 09:04:37 +13:00
mlock.c mm: remove zone_lru_lock() function, access ->lru_lock directly 2019-03-05 21:07:21 -08:00
mm_init.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
mmap.c x86/mpx, mm/core: Fix recursive munmap() corruption 2019-05-09 10:37:17 +02:00
mmu_context.c
mmu_gather.c asm-generic/tlb: Remove tlb_table_flush() 2019-04-03 10:33:02 +02:00
mmu_notifier.c mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 2018-12-28 12:11:50 -08:00
mmzone.c
mprotect.c mm: update ptep_modify_prot_commit to take old pte value as arg 2019-03-05 21:07:18 -08:00
mremap.c mm,mremap: bail out earlier in mremap_to under map pressure 2019-03-05 21:07:21 -08:00
msync.c
nommu.c mm/gup: cache dev_pagemap while pinning pages 2018-10-26 16:38:15 -07:00
oom_kill.c mm,oom: don't kill global init via memory.oom.group 2019-03-05 21:07:19 -08:00
page-writeback.c docs/core-api/mm: fix return value descriptions in mm/ 2019-03-05 21:07:20 -08:00
page_alloc.c mm/hibernation: Make hibernation handle unmapped pages 2019-04-30 12:37:57 +02:00
page_counter.c memcg: introduce memory.min 2018-06-07 17:34:36 -07:00
page_ext.c memblock: drop memblock_alloc_*_nopanic() variants 2019-03-12 10:04:02 -07:00
page_idle.c mm: remove zone_lru_lock() function, access ->lru_lock directly 2019-03-05 21:07:21 -08:00
page_io.c mm/page_io.c: fix polled swap page in 2019-01-04 13:13:48 -08:00
page_isolation.c mm/page_isolation.c: fix a wrong flag in set_migratetype_isolate() 2019-03-29 10:01:37 -07:00
page_owner.c mm/page_owner: Simplify stack trace handling 2019-04-29 12:37:50 +02:00
page_poison.c page_poison: play nicely with KASAN 2019-03-05 21:07:13 -08:00
page_vma_mapped.c mm/rmap: map_pte() was not handling private ZONE_DEVICE page properly 2018-10-31 08:54:11 -07:00
pagewalk.c mm: kernel-doc: add missing parameter descriptions 2018-04-05 21:36:27 -07:00
percpu-internal.h
percpu-km.c percpu: km: no need to consider pcpu_group_offsets[0] 2019-02-26 13:47:58 -08:00
percpu-stats.c treewide: Use array_size() in vmalloc() 2018-06-12 16:19:22 -07:00
percpu-vm.c percpu: allow select gfp to be passed to underlying allocators 2018-02-18 05:33:01 -08:00
percpu.c percpu: stop printing kernel addresses 2019-03-18 10:36:36 -07:00
pgtable-generic.c x86/mm: Page size aware flush_tlb_mm_range() 2018-10-09 16:51:11 +02:00
process_vm_access.c mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors 2018-02-06 18:32:48 -08:00
quicklist.c
readahead.c docs/core-api/mm: fix return value descriptions in mm/ 2019-03-05 21:07:20 -08:00
rmap.c mm: remove zone_lru_lock() function, access ->lru_lock directly 2019-03-05 21:07:21 -08:00
rodata_test.c
shmem.c mm: swapoff: shmem_unuse() stop eviction without igrab() 2019-04-19 09:46:04 -07:00
slab.c Merge branch 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-05-06 15:56:41 -07:00
slab.h mm: add support for kmem caches in DMA32 zone 2019-03-29 10:01:37 -07:00
slab_common.c mm: add support for kmem caches in DMA32 zone 2019-03-29 10:01:37 -07:00
slob.c slab: __GFP_ZERO is incompatible with a constructor 2018-06-07 17:34:34 -07:00
slub.c mm/slub: Simplify stack trace retrieval 2019-04-29 12:37:48 +02:00
sparse-vmemmap.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
sparse.c mm/hotplug: fix offline undo_isolate_page_range() 2019-03-29 10:01:37 -07:00
swap.c mm: remove zone_lru_lock() function, access ->lru_lock directly 2019-03-05 21:07:21 -08:00
swap_cgroup.c
swap_slots.c mm, swap, get_swap_pages: use entry_size instead of cluster in parameter 2018-08-22 10:52:44 -07:00
swap_state.c mm: swap: add comment for swap_vma_readahead 2019-03-05 21:07:16 -08:00
swapfile.c mm: swapoff: shmem_unuse() stop eviction without igrab() 2019-04-19 09:46:04 -07:00
truncate.c docs/core-api/mm: fix return value descriptions in mm/ 2019-03-05 21:07:20 -08:00
usercopy.c mm/usercopy.c: no check page span for stack objects 2019-01-08 17:15:11 -08:00
userfaultfd.c hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization" 2019-01-08 17:15:11 -08:00
util.c mm/util.c: fix strndup_user() comment 2019-04-05 16:02:31 -10:00
vmacache.c mm: get rid of vmacache_flush_all() entirely 2018-09-13 15:18:04 -10:00
vmalloc.c mm/vmalloc: Add flag for freeing of special permsissions 2019-04-30 12:37:58 +02:00
vmpressure.c mm/vmpressure.c: convert to use match_string() helper 2018-06-07 17:34:36 -07:00
vmscan.c mm: fix inactive list balancing between NUMA nodes and cgroups 2019-04-19 09:46:05 -07:00
vmstat.c mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n 2019-04-19 09:46:04 -07:00
workingset.c mm/workingset: remove unused @mapping argument in workingset_eviction() 2019-03-05 21:07:21 -08:00
z3fold.c z3fold: fix possible reclaim races 2018-11-18 10:15:09 -08:00
zbud.c mm: docs: fix parameter names mismatch 2018-02-06 18:32:48 -08:00
zpool.c mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc 2018-02-21 15:35:43 -08:00
zsmalloc.c mm/zsmalloc.c: fix fall-through annotation 2018-10-26 16:26:35 -07:00
zswap.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00