linux-stable/mm
Yosry Ahmed f82e6bf9bb mm: memcg: use rstat for non-hierarchical stats
Currently, memcg uses rstat to maintain aggregated hierarchical stats. 
Counters are maintained for hierarchical stats at each memcg.  Rstat
tracks which cgroups have updates on which cpus to keep those counters
fresh on the read-side.

Non-hierarchical stats are currently not covered by rstat.  Their per-cpu
counters are summed up on every read, which is expensive.  The original
implementation did the same.  At some point before rstat, non-hierarchical
aggregated counters were introduced by commit a983b5ebee ("mm:
memcontrol: fix excessive complexity in memory.stat reporting").  However,
those counters were updated on the performance critical write-side, which
caused regressions, so they were later removed by commit 815744d751
("mm: memcontrol: don't batch updates of local VM stats and events").  See
[1] for more detailed history.

Kernel versions in between a983b5ebee & 815744d751 (a year and a half)
enjoyed cheap reads of non-hierarchical stats, specifically on cgroup v1. 
When moving to more recent kernels, a performance regression for reading
non-hierarchical stats is observed.

Now that we have rstat, we know exactly which percpu counters have updates
for each stat.  We can maintain non-hierarchical counters again, making
reads much more efficient, without affecting the performance critical
write-side.  Hence, add non-hierarchical (i.e local) counters for the
stats, and extend rstat flushing to keep those up-to-date.

A caveat is that we now need a stats flush before reading
local/non-hierarchical stats through {memcg/lruvec}_page_state_local() or
memcg_events_local(), where we previously only needed a flush to read
hierarchical stats.  Most contexts reading non-hierarchical stats are
already doing a flush, add a flush to the only missing context in
count_shadow_nodes().

With this patch, reading memory.stat from 1000 memcgs is 3x faster on a
machine with 256 cpus on cgroup v1:

 # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
 # time cat /sys/fs/cgroup/memory/cg*/memory.stat > /dev/null
 real	 0m0.125s
 user	 0m0.005s
 sys	 0m0.120s

After:
 real	 0m0.032s
 user	 0m0.005s
 sys	 0m0.027s

To make sure there are no regressions on cgroup v2, I ran an artificial
reclaim/refault stress test [2] that creates (NR_CPUS * 2) cgroups,
assigns them limits, runs a worker process in each cgroup that allocates
tmpfs memory equal to quadruple the limit (to invoke reclaim
continuously), and then reads back the entire file (to invoke refaults). 
All workers are run in parallel, and zram is used as a swapping backend. 
Both reclaim and refault have conditional stats flushing.  I ran this on a
machine with 112 cpus, once on mm-unstable, and once on mm-unstable with
this patch reverted.

(1) A few runs without this patch:

 # time ./stress_reclaim_refault.sh
 real 0m9.949s
 user 0m0.496s
 sys 14m44.974s

 # time ./stress_reclaim_refault.sh
 real 0m10.049s
 user 0m0.486s
 sys 14m55.791s

 # time ./stress_reclaim_refault.sh
 real 0m9.984s
 user 0m0.481s
 sys 14m53.841s

(2) A few runs with this patch:

 # time ./stress_reclaim_refault.sh
 real 0m9.885s
 user 0m0.486s
 sys 14m48.753s

 # time ./stress_reclaim_refault.sh
 real 0m9.903s
 user 0m0.495s
 sys 14m48.339s

 # time ./stress_reclaim_refault.sh
 real 0m9.861s
 user 0m0.507s
 sys 14m49.317s

No regressions are observed with this patch. There is actually a very
slight improvement. If I have to guess, maybe it's because we avoid
the percpu loop in count_shadow_nodes() when calling
lruvec_page_state_local(), but I could not prove this using perf, it's
probably in the noise.

[1] https://lore.kernel.org/lkml/20230725201811.GA1231514@cmpxchg.org/
[2] https://lore.kernel.org/lkml/CAJD7tkb17x=qwoO37uxyYXLEUVp15BQKR+Xfh7Sg9Hx-wTQ_=w@mail.gmail.com/

Link: https://lkml.kernel.org/r/20230803185046.1385770-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230726153223.821757-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:20:18 -07:00
..
damon merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-21 14:26:20 -07:00
kasan kasan, slub: fix HW_TAGS zeroing with slub_debug 2023-07-08 09:29:32 -07:00
kfence mm: kfence: allocate kfence_metadata at runtime 2023-08-18 10:12:39 -07:00
kmsan mm: kmsan: use helper macros PAGE_ALIGN and PAGE_ALIGN_DOWN 2023-08-21 13:37:29 -07:00
Kconfig mm/memory_hotplug: simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig 2023-08-21 13:37:48 -07:00
Kconfig.debug mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM 2023-05-29 16:14:28 +01:00
Makefile mm: kill frontswap 2023-08-21 13:37:26 -07:00
backing-dev.c writeback: remove redundant checks for root memcg 2023-08-21 13:37:48 -07:00
balloon_compaction.c
bootmem_info.c
cma.c mm: cma: print cma name as well in cma_alloc debug 2023-08-18 10:12:12 -07:00
cma.h
cma_debug.c
cma_sysfs.c mm: cma: make kobj_type structure constant 2023-03-28 16:20:06 -07:00
compaction.c merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-21 14:26:20 -07:00
debug.c mm: update validate_mm() to use vma iterator 2023-06-09 16:25:31 -07:00
debug_page_alloc.c mm: page_alloc: split out DEBUG_PAGEALLOC 2023-06-09 16:25:23 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm: change pudp_huge_get_and_clear_full take vm_area_struct as arg 2023-08-18 10:12:53 -07:00
dmapool.c dmapool: create/destroy cleanup 2023-06-09 16:25:17 -07:00
dmapool_test.c dmapool: add alloc/free performance test 2023-04-05 19:42:38 -07:00
early_ioremap.c mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup() 2023-06-09 16:25:56 -07:00
fadvise.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
fail_page_alloc.c mm: page_alloc: split out FAIL_PAGE_ALLOC 2023-06-09 16:25:23 -07:00
failslab.c
filemap.c mm: handle swap page faults under per-VMA lock 2023-08-24 16:20:17 -07:00
folio-compat.c - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
gup.c mm/gup: don't implicitly set FOLL_HONOR_NUMA_FAULT 2023-08-21 14:28:41 -07:00
gup_test.c Merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes. 2023-06-23 16:58:19 -07:00
gup_test.h
highmem.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hmm.c mm: enable page walking API to lock vmas during the walk 2023-08-21 13:07:20 -07:00
huge_memory.c merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-24 15:25:56 -07:00
hugetlb.c hugetlb: clear flags in tail pages that will be freed individually 2023-08-24 16:20:15 -07:00
hugetlb_cgroup.c
hugetlb_vmemmap.c mm: hugetlb_vmemmap: fix a race between vmemmap pmd split 2023-08-18 10:12:14 -07:00
hugetlb_vmemmap.h
hwpoison-inject.c
init-mm.c mm: move dummy_vm_ops out of a header 2023-08-21 13:37:46 -07:00
internal.h mm: free up a word in the first tail page 2023-08-21 14:28:45 -07:00
interval_tree.c
io-mapping.c
ioremap.c mm: ioremap: remove unneeded ioremap_allowed and iounmap_allowed 2023-08-18 10:12:36 -07:00
khugepaged.c mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd 2023-08-24 16:20:15 -07:00
kmemleak.c Rename kmemleak_initialized to kmemleak_late_initialized 2023-08-21 13:38:02 -07:00
ksm.c merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-21 14:26:20 -07:00
list_lru.c
maccess.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
madvise.c swap: remove remnants of polling from read_swap_cache_async 2023-08-24 16:20:16 -07:00
mapping_dirty_helpers.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
memblock.c mm: disable kernelcore=mirror when no mirror memory 2023-08-21 13:37:43 -07:00
memcontrol.c mm: memcg: use rstat for non-hierarchical stats 2023-08-24 16:20:18 -07:00
memfd.c memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
memory-failure.c mm: memory-failure: fix potential page refcnt leak in memory_failure() 2023-08-24 16:20:16 -07:00
memory-tiers.c memory tier: use helper macro __ATTR_RW() 2023-08-18 10:12:38 -07:00
memory.c mm: handle userfaults under VMA lock 2023-08-24 16:20:17 -07:00
memory_hotplug.c mm/memory_hotplug: embed vmem_altmap details in memory block 2023-08-21 13:37:49 -07:00
mempolicy.c mm: convert prep_transhuge_page() to folio_prep_large_rmappable() 2023-08-21 14:28:43 -07:00
mempool.c
memremap.c
memtest.c mm: memtest: convert to memtest_report_meminfo() 2023-08-21 13:37:47 -07:00
migrate.c migrate: use folio_set_bh() instead of set_bh_page() 2023-08-18 10:12:30 -07:00
migrate_device.c merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-21 14:26:20 -07:00
mincore.c mm: enable page walking API to lock vmas during the walk 2023-08-21 13:07:20 -07:00
mlock.c merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-21 14:26:20 -07:00
mm_init.c mm/mm_init: use helper macro BITS_PER_LONG and BITS_PER_BYTE 2023-08-21 13:37:47 -07:00
mm_slot.h
mmap.c mm: move vma locking out of vma_prepare and dup_anon_vma 2023-08-21 13:37:46 -07:00
mmap_lock.c
mmu_gather.c mm: prefer xxx_page() alloc/free functions for order-0 pages 2023-03-28 16:20:16 -07:00
mmu_notifier.c mmu_notifiers: rename invalidate_range notifier 2023-08-18 10:12:41 -07:00
mmzone.c
mprotect.c merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-21 14:26:20 -07:00
mremap.c mm/huge pud: use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE 2023-08-18 10:12:54 -07:00
msync.c
nommu.c mm/nommu.c: use helper macro K() 2023-08-21 13:37:44 -07:00
oom_kill.c mm: remove redundant K() macro definition 2023-08-21 13:37:44 -07:00
page-writeback.c writeback: account the number of pages written back 2023-07-08 09:29:30 -07:00
page_alloc.c mm: add large_rmappable page flag 2023-08-21 14:28:44 -07:00
page_counter.c
page_ext.c mm/page_ext: move functions around for minor cleanups to page_ext 2023-08-18 10:12:31 -07:00
page_idle.c
page_io.c zswap: make zswap_load() take a folio 2023-08-21 13:37:27 -07:00
page_isolation.c mm/hugetlb: get rid of page_hstate() 2023-08-18 10:12:39 -07:00
page_owner.c mm/page_ext: use page_ext_data helper in page_owner 2023-08-21 13:37:27 -07:00
page_poison.c mm/page_poison: remove unused page_ext.h from page_poison 2023-08-21 13:37:30 -07:00
page_reporting.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_reporting.h
page_table_check.c mm/page_ext: use page_ext_data helper in page_table_check 2023-08-21 13:37:27 -07:00
page_vma_mapped.c mm: correct stale comment of function check_pte 2023-08-18 10:12:13 -07:00
pagewalk.c mm: enable page walking API to lock vmas during the walk 2023-08-21 13:07:20 -07:00
percpu-internal.h percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing 2023-06-19 16:19:29 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c
pgalloc-track.h
pgtable-generic.c mm/pgtable: notes on pte_offset_map[_lock]() 2023-08-18 10:12:25 -07:00
process_vm_access.c mm/gup: remove unused vmas parameter from pin_user_pages_remote() 2023-06-09 16:25:25 -07:00
ptdump.c mm: ptdump should use ptep_get_lockless() 2023-06-19 16:19:24 -07:00
readahead.c mm: remove unnecessary pagevec includes 2023-06-23 16:59:31 -07:00
rmap.c mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end() 2023-08-18 10:12:41 -07:00
rodata_test.c
secretmem.c mm/secretmem: use a folio in secretmem_fault() 2023-08-21 13:38:02 -07:00
shmem.c merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-24 15:25:56 -07:00
show_mem.c mm,thp: no space after colon in Mem-Info fields 2023-08-21 13:38:01 -07:00
shrinker_debug.c Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless" 2023-06-19 13:19:34 -07:00
shuffle.c
shuffle.h mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
slab.c slab updates for 6.5 2023-06-29 16:34:12 -07:00
slab.h kasan, slub: fix HW_TAGS zeroing with slub_debug 2023-07-08 09:29:32 -07:00
slab_common.c slab updates for 6.5 2023-06-29 16:34:12 -07:00
slub.c slab updates for 6.5 2023-06-29 16:34:12 -07:00
sparse-vmemmap.c mm/vmemmap: allow architectures to override how vmemmap optimization works 2023-08-18 10:12:53 -07:00
sparse.c mm/sparse: remove redundant judgments from macro for_each_present_section_nr 2023-08-18 10:12:14 -07:00
swap.c mm: remove references to pagevec 2023-06-23 16:59:30 -07:00
swap.h swap: remove remnants of polling from read_swap_cache_async 2023-08-24 16:20:16 -07:00
swap_cgroup.c
swap_slots.c
swap_state.c swap: remove remnants of polling from read_swap_cache_async 2023-08-24 16:20:16 -07:00
swapfile.c merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-21 14:26:20 -07:00
truncate.c mm: merge folio_has_private()/filemap_release_folio() call pairs 2023-08-18 10:12:12 -07:00
usercopy.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
userfaultfd.c mm: userfaultfd: support UFFDIO_POISON for hugetlbfs 2023-08-18 10:12:17 -07:00
util.c mm: remove page_rmapping() 2023-08-18 10:12:01 -07:00
vmalloc.c mm: add a call to flush_cache_vmap() in vmap_pfn() 2023-08-21 13:07:21 -07:00
vmpressure.c
vmscan.c merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-21 14:26:20 -07:00
vmstat.c mm/vmstat: remove unused page_ext.h from vmstat 2023-08-21 13:37:30 -07:00
workingset.c mm: memcg: use rstat for non-hierarchical stats 2023-08-24 16:20:18 -07:00
z3fold.c mm/z3fold: remove obsolete comment for struct z3fold_pool 2023-08-21 13:37:51 -07:00
zbud.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zpool.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zsmalloc.c merge mm-hotfixes-stable into mm-stable to pick up depended-upon changes 2023-08-21 14:26:20 -07:00
zswap.c mm: zswap: update comment for struct zswap_entry 2023-08-21 13:37:47 -07:00