linux-stable/mm
Yu Ma 3a6358c0db percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing
When running UnixBench/Execl throughput case, false sharing is observed
due to frequent read on base_addr and write on free_bytes, chunk_md.

UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do some short jobs.  It will do system call on execl
frequently, and execl will call mm_init to initialize mm_struct of the
process.  mm_init will call __percpu_counter_init for percpu_counters
initialization.  Then pcpu_alloc is called to read the base_addr of
pcpu_chunk for memory allocation.  Inside pcpu_alloc, it will call
pcpu_alloc_area to allocate memory from a specified chunk.  This function
will update "free_bytes" and "chunk_md" to record the rest free bytes and
other meta data for this chunk.  Correspondingly, pcpu_free_area will also
update these 2 members when free memory.

Call trace from perf is as below:
+   57.15%  0.01%  execl   [kernel.kallsyms] [k] __percpu_counter_init
+   57.13%  0.91%  execl   [kernel.kallsyms] [k] pcpu_alloc
-   55.27% 54.51%  execl   [kernel.kallsyms] [k] osq_lock
   - 53.54% 0x654278696e552f34
        main
        __execve
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_execve
        do_execveat_common.isra.47
        alloc_bprm
        mm_init
        __percpu_counter_init
        pcpu_alloc
      - __mutex_lock.isra.17

In current pcpu_chunk layout, `base_addr' is in the same cache line with
`free_bytes' and `chunk_md', and `base_addr' is at the last 8 bytes.  This
patch moves `bound_map' up to `base_addr', to let `base_addr' locate in a
new cacheline.

With this change, on Intel Sapphire Rapids 112C/224T platform, based on
v6.4-rc4, the 160 parallel score improves by 24%.

The pcpu_chunk struct is a backing data structure per chunk, so the
additional memory should not be dramatic.  A chunk covers ballpark
between 64kb and 512kb memory depending on some config and boot time
stuff, so I believe the additional memory used here is nominal at best.

Working the #s on my desktop:
Percpu:            58624 kB
28 cores -> ~2.1MB of percpu memory.
At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB
of overhead?

I believe we can do a little better to avoid eating that full padding,
so likely less than that.

[dennis@kernel.org: changelog details]
Link: https://lkml.kernel.org/r/20230610030730.110074-1-yu.ma@intel.com
Signed-off-by: Yu Ma <yu.ma@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:29 -07:00
..
damon mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
kasan mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
kfence mm: kfence: fix false positives on big endian 2023-05-17 15:24:33 -07:00
kmsan printk: export console trace point for kcsan/kasan/kfence/kmsan 2023-04-18 16:30:11 -07:00
Kconfig mm: zswap: support exclusive loads 2023-06-19 16:19:05 -07:00
Kconfig.debug mm: change per-VMA lock statistics to be disabled by default 2023-05-02 17:23:28 -07:00
Makefile mm: page_alloc: split out DEBUG_PAGEALLOC 2023-06-09 16:25:23 -07:00
backing-dev.c - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
balloon_compaction.c
bootmem_info.c
cma.c mm: move most of core MM initialization to mm/mm_init.c 2023-04-05 19:42:52 -07:00
cma.h
cma_debug.c
cma_sysfs.c mm: cma: make kobj_type structure constant 2023-03-28 16:20:06 -07:00
compaction.c mm: compaction: mark kcompactd_run() and kcompactd_stop() __meminit 2023-06-19 16:19:28 -07:00
debug.c mm: update validate_mm() to use vma iterator 2023-06-09 16:25:31 -07:00
debug_page_alloc.c mm: page_alloc: split out DEBUG_PAGEALLOC 2023-06-09 16:25:23 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable,page_table_check: warn pte map fails 2023-06-19 16:19:15 -07:00
dmapool.c dmapool: create/destroy cleanup 2023-06-09 16:25:17 -07:00
dmapool_test.c dmapool: add alloc/free performance test 2023-04-05 19:42:38 -07:00
early_ioremap.c mm/early_ioremap.c: improve the execution efficiency of early_ioremap_setup() 2023-06-09 16:25:56 -07:00
fadvise.c mm: support POSIX_FADV_NOREUSE 2023-01-18 17:12:57 -08:00
fail_page_alloc.c mm: page_alloc: split out FAIL_PAGE_ALLOC 2023-06-09 16:25:23 -07:00
failslab.c mm: fix unexpected changes to {failslab|fail_page_alloc}.attr 2022-11-22 18:50:44 -08:00
filemap.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
folio-compat.c - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
frontswap.c mm: zswap: support exclusive loads 2023-06-19 16:19:05 -07:00
gup.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
gup_test.c mm/gup: remove vmas parameter from pin_user_pages() 2023-06-09 16:25:26 -07:00
gup_test.h mm/gup_test: start/stop/read functionality for PIN LONGTERM test 2022-11-08 17:37:15 -08:00
highmem.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hmm.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
huge_memory.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hugetlb.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hugetlb_cgroup.c mm/hugetlb: increase use of folios in alloc_huge_page() 2023-02-13 15:54:27 -08:00
hugetlb_vmemmap.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
hugetlb_vmemmap.h
hwpoison-inject.c
init-mm.c IOMMU Updates for Linux 6.4 2023-04-30 13:00:38 -07:00
internal.h mm/folio: replace set_compound_order with folio_set_order 2023-06-19 16:19:27 -07:00
interval_tree.c
io-mapping.c
ioremap.c
khugepaged.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
kmemleak.c lib/stackdepot, mm: rename stack_depot_want_early_init 2023-02-16 20:43:49 -08:00
ksm.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
list_lru.c
maccess.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
madvise.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
mapping_dirty_helpers.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
memblock.c mm/memory_hotplug: remove reset_node_managed_pages() in hotadd_init_pgdat() 2023-06-19 16:19:05 -07:00
memcontrol.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
memfd.c memfd: pass argument of memfd_fcntl as int 2023-04-18 16:30:11 -07:00
memory-failure.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
memory-tiers.c memory tier: remove unneeded !IS_ENABLED(CONFIG_MIGRATION) check 2023-06-19 16:19:29 -07:00
memory.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
memory_hotplug.c mm/mm_init.c: remove reset_node_present_pages() 2023-06-19 16:19:05 -07:00
mempolicy.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
mempool.c mempool: do not use ksize() for poisoning 2022-11-30 15:58:41 -08:00
memremap.c mm/memremap.c: fix outdated comment in devm_memremap_pages 2023-02-09 16:51:46 -08:00
memtest.c mm/memtest: add results of early memtest to /proc/meminfo 2023-04-05 19:42:55 -07:00
migrate.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
migrate_device.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
mincore.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
mlock.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
mm_init.c mm/mm_init.c: remove reset_node_present_pages() 2023-06-19 16:19:05 -07:00
mm_slot.h
mmap.c userfaultfd: fix regression in userfaultfd_unmap_prep() 2023-06-19 16:19:28 -07:00
mmap_lock.c
mmu_gather.c mm: prefer xxx_page() alloc/free functions for order-0 pages 2023-03-28 16:20:16 -07:00
mmu_notifier.c mm/mmu_notifier: remove unused mmu_notifier_range_update_to_read_only export 2023-02-02 22:32:54 -08:00
mmzone.c
mprotect.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
mremap.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
msync.c
nommu.c mm: vmalloc: convert vread() to vread_iter() 2023-04-05 19:42:57 -07:00
oom_kill.c mm, oom: do not check 0 mask in out_of_memory() 2023-06-09 16:25:20 -07:00
page-writeback.c mm,jfs: move write_one_page/folio_write_one to jfs 2023-03-28 16:20:14 -07:00
page_alloc.c mm: page_alloc: remove unneeded header files 2023-06-09 16:25:55 -07:00
page_counter.c
page_ext.c mm/page_ext: init page_ext early if there are no deferred struct pages 2023-02-02 22:33:22 -08:00
page_idle.c mm: page_idle: convert page idle to use a folio 2023-01-18 17:12:52 -08:00
page_io.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
page_isolation.c mm: page_isolation: write proper kerneldoc 2023-06-19 16:18:59 -07:00
page_owner.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_poison.c
page_reporting.c mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
page_reporting.h
page_table_check.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
page_vma_mapped.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
pagewalk.c mm/pagewalk: walk_pte_range() allow for pte_offset_map() 2023-06-19 16:19:14 -07:00
percpu-internal.h percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false sharing 2023-06-19 16:19:29 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c mm: memcontrol: rename memcg_kmem_enabled() 2023-02-16 20:43:56 -08:00
pgalloc-track.h
pgtable-generic.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
process_vm_access.c mm/gup: remove unused vmas parameter from pin_user_pages_remote() 2023-06-09 16:25:25 -07:00
ptdump.c mm: ptdump should use ptep_get_lockless() 2023-06-19 16:19:24 -07:00
readahead.c readahead: convert readahead_expand() to use a folio 2023-02-02 22:33:21 -08:00
rmap.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
rodata_test.c mm/rodata_test: use PAGE_ALIGNED() helper 2022-10-03 14:03:05 -07:00
secretmem.c mm/mlock: rename mlock_future_check() to mlock_future_ok() 2023-06-09 16:25:38 -07:00
shmem.c shmem: use ramfs_kill_sb() for kill_sb method of ramfs-based tmpfs 2023-06-19 16:19:04 -07:00
show_mem.c mm: page_alloc: collect mem statistic into show_mem.c 2023-06-09 16:25:22 -07:00
shrinker_debug.c mm: shrinkers: fix race condition on debugfs cleanup 2023-05-17 15:24:33 -07:00
shuffle.c mm/shuffle: convert module_param_call to module_param_cb 2022-10-03 14:03:07 -07:00
shuffle.h mm, treewide: redefine MAX_ORDER sanely 2023-04-05 19:42:46 -07:00
slab.c mm/slab: simplify create_kmalloc_cache() args and make it static 2023-06-19 16:19:20 -07:00
slab.h mm/slab: simplify create_kmalloc_cache() args and make it static 2023-06-19 16:19:20 -07:00
slab_common.c mm: slab: reduce the kmalloc() minimum alignment if DMA bouncing possible 2023-06-19 16:19:23 -07:00
slub.c - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of 2023-04-27 19:42:02 -07:00
sparse-vmemmap.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
sparse.c mm/sparse: remove unused parameters in sparse_remove_section() 2023-06-19 16:19:04 -07:00
swap.c mm: swap: fix performance regression on sparsetruncate-tiny 2023-04-16 10:41:24 -07:00
swap.h mm: remove the __swap_writepage return value 2023-02-02 22:33:33 -08:00
swap_cgroup.c mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled 2022-10-03 14:03:36 -07:00
swap_slots.c
swap_state.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
swapfile.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
truncate.c mm: return an ERR_PTR from __filemap_get_folio 2023-04-05 19:42:42 -07:00
usercopy.c mm: Fix copy_from_user_nofault(). 2023-04-12 17:36:23 -07:00
userfaultfd.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
util.c mm: uninline kstrdup() 2023-04-08 13:45:37 -07:00
vmalloc.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
vmpressure.c
vmscan.c mm: ptep_get() conversion 2023-06-19 16:19:25 -07:00
vmstat.c vmstat: skip periodic vmstat update for isolated CPUs 2023-06-19 16:19:11 -07:00
workingset.c Multi-gen LRU: fix workingset accounting 2023-06-09 16:25:46 -07:00
z3fold.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zbud.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zpool.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zsmalloc.c mm: zswap: remove shrink from zpool interface 2023-06-19 16:19:27 -07:00
zswap.c mm: zswap: remove zswap_header 2023-06-19 16:19:27 -07:00