linux-stable/mm
akpm@linux-foundation.org ac39cf8cb8 memcg: fix mis-accounting of file mapped racy with migration
FILE_MAPPED per memcg of migrated file cache is not properly updated,
because our hook in page_add_file_rmap() can't know to which memcg
FILE_MAPPED should be counted.

Basically, this patch is for fixing the bug but includes some big changes
to fix up other messes.

Now, at migrating mapped file, events happen in following sequence.

 1. allocate a new page.
 2. get memcg of an old page.
 3. charge ageinst a new page before migration. But at this point,
    no changes to new page's page_cgroup, no commit for the charge.
    (IOW, PCG_USED bit is not set.)
 4. page migration replaces radix-tree, old-page and new-page.
 5. page migration remaps the new page if the old page was mapped.
 6. Here, the new page is unlocked.
 7. memcg commits the charge for newpage, Mark the new page's page_cgroup
    as PCG_USED.

Because "commit" happens after page-remap, we can count FILE_MAPPED
at "5", because we should avoid to trust page_cgroup->mem_cgroup.
if PCG_USED bit is unset.
(Note: memcg's LRU removal code does that but LRU-isolation logic is used
 for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
 not on LRU or page_cgroup->mem_cgroup is NULL.)

We can lose file_mapped accounting information at 5 because FILE_MAPPED
is updated only when mapcount changes 0->1. So we should catch it.

BTW, historically, above implemntation comes from migration-failure
of anonymous page. Because we charge both of old page and new page
with mapcount=0, we can't catch
  - the page is really freed before remap.
  - migration fails but it's freed before remap
or .....corner cases.

New migration sequence with memcg is:

 1. allocate a new page.
 2. mark PageCgroupMigration to the old page.
 3. charge against a new page onto the old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
 4. page migration replaces radix-tree, page table, etc...
 5. At remapping, new page's page_cgroup is now makrked as "USED"
    We can catch 0->1 event and FILE_MAPPED will be properly updated.

    And we can catch SWAPOUT event after unlock this and freeing this
    page by unmap() can be caught.

 7. Clear PageCgroupMigration of the old page.

So, FILE_MAPPED will be correctly updated.

Then, for what MIGRATION flag is ?
  Without it, at migration failure, we may have to charge old page again
  because it may be fully unmapped. "charge" means that we have to dive into
  memory reclaim or something complated. So, it's better to avoid
  charge it again. Before this patch, __commit_charge() was working for
  both of the old/new page and fixed up all. But this technique has some
  racy condtion around FILE_MAPPED and SWAPOUT etc...
  Now, the kernel use MIGRATION flag and don't uncharge old page until
  the end of migration.

I hope this change will make memcg's page migration much simpler.  This
page migration has caused several troubles.  Worth to add a flag for
simplification.

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27 09:12:44 -07:00
..
Kconfig mm: allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove 2010-05-25 08:06:59 -07:00
Kconfig.debug trivial: improve help text for mm debug config options 2009-09-21 15:14:57 +02:00
Makefile mm: compaction: memory compaction core 2010-05-25 08:06:59 -07:00
backing-dev.c writeback: fixups for !dirty_writeback_centisecs 2010-05-21 20:00:35 +02:00
bootmem.c Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip 2010-04-07 11:02:23 -07:00
bounce.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
compaction.c mm: compaction: add a tunable that decides when memory should be compacted and when it should be reclaimed 2010-05-25 08:06:59 -07:00
debug-pagealloc.c
dmapool.c dmapools: protect page_list walk in show_pools() 2009-06-30 18:56:00 -07:00
fadvise.c readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM 2010-03-06 11:26:25 -08:00
failslab.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
filemap.c do_generic_file_read: clear page errors when issuing a fresh read of the page 2010-05-26 10:20:27 -07:00
filemap_xip.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
fremap.c mm: clean up mm_counter 2010-03-06 11:26:23 -08:00
highmem.c highmem: remove unneeded #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT for debug_kmap_atomic() 2010-05-25 08:07:01 -07:00
hugetlb.c cpuset,mm: fix no node to alloc memory when changing cpuset's mems 2010-05-25 08:06:57 -07:00
hwpoison-inject.c HWPOISON: Don't do early filtering if filter is disabled 2009-12-16 12:20:01 +01:00
init-mm.c
internal.h HWPOISON: add an interface to switch off/on all the page filters 2009-12-16 12:19:59 +01:00
kmemcheck.c kmemcheck: Fix build errors due to missing slab.h 2010-03-30 22:02:32 +09:00
kmemleak-test.c
kmemleak.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
ksm.c mm: migration: share the anon_vma ref counts between KSM and page migration 2010-05-25 08:06:58 -07:00
maccess.c maccess,probe_kernel: Allow arch specific override probe_kernel_(read|write) 2010-01-07 11:58:36 -06:00
madvise.c HWPOISON: Add a madvise() injector for soft page offlining 2009-12-16 12:20:00 +01:00
memcontrol.c memcg: fix mis-accounting of file mapped racy with migration 2010-05-27 09:12:44 -07:00
memory-failure.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
memory.c mm: document follow_page() 2010-05-25 08:07:00 -07:00
memory_hotplug.c mem-hotplug: fix potential race while building zonelist for new populated zone 2010-05-25 08:07:02 -07:00
mempolicy.c mempolicy: ERR_PTR dereference in mpol_shared_policy_init() 2010-05-26 08:19:23 -07:00
mempool.c mm: remove broken 'kzalloc' mempool 2009-09-22 07:17:35 -07:00
migrate.c memcg: fix mis-accounting of file mapped racy with migration 2010-05-27 09:12:44 -07:00
mincore.c mincore: do nested page table walks 2010-05-25 08:06:58 -07:00
mlock.c x86, perf, bts, mm: Delete the never used BTS-ptrace code 2010-03-26 11:33:55 +01:00
mm_init.c
mmap.c mmap: check ->vm_ops before dereferencing 2010-04-27 08:26:51 -07:00
mmu_context.c exit: fix oops in sync_mm_rss 2010-03-24 16:31:21 -07:00
mmu_notifier.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
mmzone.c
mprotect.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
mremap.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
msync.c sanitize vfs_fsync calling conventions 2010-05-21 18:31:21 -04:00
nommu.c nommu: allow private mappings of read-only devices 2010-05-26 08:19:23 -07:00
oom_kill.c memcg: make oom killer a no-op when no killable task can be found 2010-05-27 09:12:43 -07:00
page-writeback.c writeback: fix mixed up arguments to bdi_start_writeback() 2010-05-21 20:01:54 +02:00
page_alloc.c mem-hotplug: fix potential race while building zonelist for new populated zone 2010-05-25 08:07:02 -07:00
page_cgroup.c memcg: avoid use cmpxchg in swap cgroup maintainance 2010-03-17 18:43:47 -07:00
page_io.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
page_isolation.c
pagewalk.c pagemap: fix pfn calculation for hugepage 2010-04-07 08:38:04 -07:00
percpu-km.c percpu: implement kernel memory based chunk allocation 2010-05-01 08:30:50 +02:00
percpu-vm.c percpu: move vmalloc based chunk management into percpu-vm.c 2010-05-01 08:30:50 +02:00
percpu.c percpu: implement kernel memory based chunk allocation 2010-05-01 08:30:50 +02:00
percpu_up.c percpu: don't implicitly include slab.h from percpu.h 2010-03-30 22:02:32 +09:00
prio_tree.c
quicklist.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
readahead.c readahead.c: fix comment 2010-05-25 08:07:00 -07:00
rmap.c mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks 2010-05-25 08:06:59 -07:00
shmem.c memcg: move charge of file pages 2010-05-27 09:12:43 -07:00
slab.c cpuset,mm: fix no node to alloc memory when changing cpuset's mems 2010-05-25 08:06:57 -07:00
slob.c mm: Move ARCH_SLAB_MINALIGN and ARCH_KMALLOC_MINALIGN to <linux/slob_def.h> 2010-05-19 22:03:13 +03:00
slub.c cpuset,mm: fix no node to alloc memory when changing cpuset's mems 2010-05-25 08:06:57 -07:00
sparse-vmemmap.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
sparse.c sparsemem: on no vmemmap path put mem_map on node high too 2010-05-25 08:06:56 -07:00
swap.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
swap_state.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
swapfile.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6 2010-05-21 15:26:46 -07:00
thrash.c
truncate.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
util.c slab: Generify kernel pointer validation 2010-04-09 10:09:50 -07:00
vmalloc.c mm: purge fragmented percpu vmap blocks 2010-02-02 12:50:47 -08:00
vmscan.c vmscan: remove isolate_pages callback scan control 2010-05-25 08:07:00 -07:00
vmstat.c mm: compaction: direct compact when a high-order allocation fails 2010-05-25 08:06:59 -07:00