linux-stable/mm
Hugh Dickins 7512102cf6 memcg: fix GPF when cgroup removal races with last exit
When moving tasks from old memcg (with move_charge_at_immigrate on new
memcg), followed by removal of old memcg, hit General Protection Fault in
mem_cgroup_lru_del_list() (called from release_pages called from
free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
exit_mmap from mmput from exit_mm from do_exit).

Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
still trying to update its stats, and take page off lru before freeing.

A task, or a charge, or a page on lru: each secures a memcg against
removal.  In this case, the last task has been moved out of the old memcg,
and it is exiting: anonymous pages are uncharged one by one from the
memcg, as they are zapped from its pagetables, so the charge gets down to
0; but the pages themselves are queued in an mmu_gather for freeing.

Most of those pages will be on lru (and force_empty is careful to
lru_add_drain_all, to add pages from pagevec to lru first), but not
necessarily all: perhaps some have been isolated for page reclaim, perhaps
some isolated for other reasons.  So, force_empty may find no task, no
charge and no page on lru, and let the removal proceed.

There would still be no problem if these pages were immediately freed; but
typically (and the put_page_testzero protocol demands it) they have to be
added back to lru before they are found freeable, then removed from lru
and freed.  We don't see the issue when adding, because the
mem_cgroup_iter() loops keep their own reference to the memcg being
scanned; but when it comes to mem_cgroup_lru_del_list().

I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
38c5d72f3e ("memcg: simplify LRU handling by new rule") mercifully
removed those convolutions, but left this General Protection Fault.

But it's surprisingly easy to restore the old behaviour: just check
PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
to add), and reset pc to root_mem_cgroup if page is uncharged.  A risky
change?  just going back to how it worked before; testing, and an audit of
uses of pc->mem_cgroup, show no problem.

And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
longer has any purpose, and we can safely revert 4e5f01c2b9 ("memcg:
clear pc->mem_cgroup if necessary").

Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
is not strictly necessary: the lru_lock there, with RCU before memcg
structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
without that; but it seems cleaner to rely on one dependency less.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:43 -08:00
..
backing-dev.c backing-dev: fix wakeup timer races with bdi_unregister() 2012-02-01 16:52:49 +08:00
bootmem.c mm: bootmem: try harder to free pages in bulk 2012-01-10 16:30:45 -08:00
bounce.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
cleancache.c mm: cleancache core ops functions and config 2011-05-26 10:01:36 -06:00
compaction.c mm: compaction: check for overlapping nodes during isolation for migration 2012-02-08 19:03:51 -08:00
debug-pagealloc.c mm, x86: Remove debug_pagealloc_enabled 2011-12-06 09:24:07 +01:00
dmapool.c mm: fix implicit stat.h usage in dmapool.c 2011-10-31 09:20:12 -04:00
fadvise.c fadvise: only initiate writeback for specified range with FADV_DONTNEED 2012-01-10 16:30:43 -08:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap.c readahead: fix pipeline break caused by block plug 2012-02-03 16:16:41 -08:00
filemap_xip.c mm/filemap_xip.c: fix race condition in xip_file_fault() 2012-02-03 16:16:41 -08:00
fremap.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
highmem.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
huge_memory.c mm: fix UP THP spin_is_locked BUGs 2012-02-08 19:03:51 -08:00
hugetlb.c mm/hugetlb.c: undo change to page mapcount in fault handler 2012-01-23 08:38:48 -08:00
hwpoison-inject.c Fix common misspellings 2011-03-31 11:26:23 -03:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm: thp: tail page refcounting fix 2011-11-02 16:06:57 -07:00
Kconfig Merge branch 'master' into x86/memblock 2011-11-28 09:46:22 -08:00
Kconfig.debug mm: more intensive memory corruption debugging 2012-01-10 16:30:42 -08:00
kmemcheck.c
kmemleak-test.c
kmemleak.c kmemleak: Disable early logging when kmemleak is off by default 2012-01-20 16:57:05 +00:00
ksm.c memcg: fix GPF when cgroup removal races with last exit 2012-03-05 15:49:43 -08:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c fs: kill i_alloc_sem 2011-07-20 20:47:46 -04:00
Makefile Cross Memory Attach 2011-10-31 17:30:44 -07:00
memblock.c memblock: Fix size aligning of memblock_alloc_base_nid() 2012-03-01 10:53:18 +01:00
memcontrol.c memcg: fix GPF when cgroup removal races with last exit 2012-03-05 15:49:43 -08:00
memory-failure.c mm: compaction: introduce sync-light migration for use by compaction 2012-01-12 20:13:09 -08:00
memory.c mm: fix rss count leakage during migration 2012-01-23 08:38:49 -08:00
memory_hotplug.c mm: compaction: introduce sync-light migration for use by compaction 2012-01-12 20:13:09 -08:00
mempolicy.c mm: compaction: introduce sync-light migration for use by compaction 2012-01-12 20:13:09 -08:00
mempool.c mempool: fix first round failure behavior 2012-01-10 16:30:45 -08:00
migrate.c memcg: fix GPF when cgroup removal races with last exit 2012-03-05 15:49:43 -08:00
mincore.c mm: clarify the radix_tree exceptional cases 2011-08-03 14:25:24 -10:00
mlock.c Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux 2011-11-06 19:44:47 -08:00
mm_init.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
mmap.c mm: simplify find_vma_prev() 2012-01-10 16:30:44 -08:00
mmu_context.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
mmu_notifier.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
mmzone.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
mprotect.c
mremap.c mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma() 2012-01-10 16:30:44 -08:00
msync.c
nobootmem.c Merge branch 'master' into x86/memblock 2011-11-28 09:46:22 -08:00
nommu.c NOMMU: Don't need to clear vm_mm when deleting a VMA 2012-02-24 08:59:04 -08:00
oom_kill.c mm: unify remaining mem_cont, mem, etc. variable names to memcg 2012-01-12 20:13:06 -08:00
page-writeback.c Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux 2012-01-10 16:59:59 -08:00
page_alloc.c vfs: fix panic in __d_lookup() with high dentry hashtable counts 2012-02-13 20:45:38 -05:00
page_cgroup.c page_cgroup: drop multi CONFIG_MEMORY_HOTPLUG 2012-01-12 20:13:08 -08:00
page_io.c
page_isolation.c
pagewalk.c pagewalk: fix code comment for THP 2011-07-25 20:57:09 -07:00
percpu-km.c
percpu-vm.c percpu: fix chunk range calculation 2011-11-22 08:09:46 -08:00
percpu.c Kmemleak patches 2012-01-14 18:11:11 -08:00
pgtable-generic.c
prio_tree.c sanitize <linux/prefetch.h> usage 2011-05-20 12:50:29 -07:00
process_vm_access.c Fix race in process_vm_rw_core 2012-02-02 12:55:17 -08:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
rmap.c mm: unify remaining mem_cont, mem, etc. variable names to memcg 2012-01-12 20:13:06 -08:00
shmem.c SHM_UNLOCK: fix Unevictable pages stranded after swap 2012-01-23 08:38:48 -08:00
slab.c Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux 2012-01-11 18:52:23 -08:00
slob.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
slub.c mm,x86,um: move CMPXCHG_DOUBLE config option 2012-01-12 20:13:03 -08:00
sparse-vmemmap.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
sparse.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
swap.c memcg: fix GPF when cgroup removal races with last exit 2012-03-05 15:49:43 -08:00
swap_state.c memcg: fix GPF when cgroup removal races with last exit 2012-03-05 15:49:43 -08:00
swapfile.c mm: unify remaining mem_cont, mem, etc. variable names to memcg 2012-01-12 20:13:06 -08:00
thrash.c mm/thrash.c: quiet sparse noise 2011-10-31 17:30:50 -07:00
truncate.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
util.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
vmalloc.c mm/vmalloc.c: eliminate extra loop in pcpu_get_vm_areas error path 2012-01-12 20:13:10 -08:00
vmscan.c SHM_UNLOCK: fix Unevictable pages stranded after swap 2012-01-23 08:38:48 -08:00
vmstat.c mm,x86,um: move CMPXCHG_LOCAL config option 2012-01-12 20:13:03 -08:00