linux-stable

Yang Shi 5db4f15c4f mm: memory: add orig_pmd to struct vm_fault	2021-06-30 20:47:30 -07:00

Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3.

When the THP NUMA fault support was added THP migration was not supported
yet. So the ad hoc THP migration was implemented in NUMA fault handling.
Since v4.14 THP migration has been supported so it doesn't make too much
sense to still keep another THP migration implementation rather than using
the generic migration code.

It is definitely a maintenance burden to keep two THP migration
implementations for different code paths and it is more error prone. Using
the generic THP migration implementation allows us to remove the duplicate
code and some hacks needed by the old ad hoc implementation.

A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both THP
and NUMA balancing. Most of them support THP migration except for S390.
Zi Yan tried to add THP migration support for S390 before but it was not
accepted due to the design of S390 PMD. For the discussion, please see:
https://lkml.org/lkml/2018/4/27/953.

Per the discussion with Gerald Schaefer in v1 it is acceptable to skip huge
PMD for S390 for now.

I saw there were some hacks about gup from git history, but I didn't figure
out if they have been removed or not since I just found FOLL_NUMA code in
the current gup implementation and they seem useful.

Patch #1 ~ #2 are preparation patches.
Patch #3 is the real meat.
Patch #4 ~ #6 keep consistent counters and behaviors with before.
Patch #7 skips changing huge PMD to prot_none if THP migration is not supported.

Test
----
Did some tests to measure the latency of do_huge_pmd_numa_page. The test VM
has 80 vcpus and 64G memory. The test would create 2 processes to consume
128G memory together which would incur memory pressure to cause THP splits.
And it also creates 80 processes to hog cpu, and the memory consumer
processes are bound to different nodes periodically in order to increase
NUMA faults.

The below test script is used:

    echo 3 > /proc/sys/vm/drop_caches

    # Run stress-ng for 24 hours
    ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
    PID=$!

    ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &

    # Wait for vm stressors forked
    sleep 5

    PID_1=`pgrep -P $PID | awk 'NR == 1'`
    PID_2=`pgrep -P $PID | awk 'NR == 2'`

    JOB1=`pgrep -P $PID_1`
    JOB2=`pgrep -P $PID_2`

    # Bind load jobs to different nodes periodically to force generate
    # cross node memory access
    while [ -d "/proc/$PID" ]
    do
            taskset -apc 8 $JOB1
            taskset -apc 8 $JOB2
            sleep 300
            taskset -apc 58 $JOB1
            taskset -apc 58 $JOB2
            sleep 300
    done

With the above test the histogram of latency of do_huge_pmd_numa_page is as
shown below. Since the number of do_huge_pmd_numa_page calls varies
drastically between runs (probably due to the scheduler), I converted the
raw numbers to percentages.

                        patched      base
    @us[stress-ng]:
    [0]                   3.57%     0.16%
    [1]                  55.68%    18.36%
    [2, 4)               10.46%    40.44%
    [4, 8)                7.26%    17.82%
    [8, 16)              21.12%    13.41%
    [16, 32)              1.06%     4.27%
    [32, 64)              0.56%     4.07%
    [64, 128)             0.16%     0.35%
    [128, 256)          < 0.1%    < 0.1%
    [256, 512)          < 0.1%    < 0.1%
    [512, 1K)           < 0.1%    < 0.1%
    [1K, 2K)            < 0.1%    < 0.1%
    [2K, 4K)            < 0.1%    < 0.1%
    [4K, 8K)            < 0.1%    < 0.1%
    [8K, 16K)           < 0.1%    < 0.1%
    [16K, 32K)          < 0.1%    < 0.1%
    [32K, 64K)          < 0.1%    < 0.1%

Per the result, the patched kernel is even slightly better than the base
kernel. I think this is because the lock contention against THP split is
less than in the base kernel due to the refactor.

To exclude the effect of THP split, I also did the test w/o memory pressure.
No obvious regression is spotted. The below is the test result w/o memory
pressure.

                        patched      base
    @us[stress-ng]:
    [0]                   7.97%     18.4%
    [1]                  69.63%    58.24%
    [2, 4)                4.18%     2.63%
    [4, 8)                0.22%     0.17%
    [8, 16)               1.03%     0.92%
    [16, 32)              0.14%   < 0.1%
    [32, 64)            < 0.1%    < 0.1%
    [64, 128)           < 0.1%    < 0.1%
    [128, 256)          < 0.1%    < 0.1%
    [256, 512)            0.45%     1.19%
    [512, 1K)            15.45%    17.27%
    [1K, 2K)            < 0.1%    < 0.1%
    [2K, 4K)            < 0.1%    < 0.1%
    [4K, 8K)            < 0.1%    < 0.1%
    [8K, 16K)             0.86%     0.88%
    [16K, 32K)          < 0.1%      0.15%
    [32K, 64K)          < 0.1%    < 0.1%
    [64K, 128K)         < 0.1%    < 0.1%
    [128K, 256K)        < 0.1%    < 0.1%

The series also survived a series of tests that exercise NUMA balancing
migrations by Mel.

This patch (of 7):

Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge
page fault could be removed, just like its PTE counterpart does.

Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Documentation	docs: proc.rst: meminfo: briefly describe gaps in memory accounting	2021-06-30 20:47:28 -07:00
LICENSES	LICENSES: Add the CC-BY-4.0 license	2020-12-08 10:33:27 -07:00
arch	arm64/mm: drop HAVE_ARCH_PFN_VALID	2021-06-30 20:47:29 -07:00
block	block-5.13-2021-05-22	2021-05-22 07:40:34 -10:00
certs	Kbuild updates for v5.13 (2nd)	2021-05-08 10:00:11 -07:00
crypto	async_xor: check src_offs is not NULL before updating it	2021-06-10 19:40:14 -07:00
drivers	virtio-mem: use page_offline_(start|end) when setting PageOffline()	2021-06-30 20:47:28 -07:00
fs	mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs	2021-06-30 20:47:29 -07:00
include	mm: memory: add orig_pmd to struct vm_fault	2021-06-30 20:47:30 -07:00
init	pid: take a reference when initializing `cad_pid`	2021-06-05 08:58:11 -07:00
ipc	ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry	2021-05-22 15:09:07 -10:00
kernel	mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM	2021-06-29 10:53:55 -07:00
lib	mm/page_alloc: convert per-cpu list protection to local_lock	2021-06-29 10:53:54 -07:00
mm	mm: memory: add orig_pmd to struct vm_fault	2021-06-30 20:47:30 -07:00
net	net/ipv4/tcp: use vma_lookup() in tcp_zerocopy_receive()	2021-06-29 10:53:51 -07:00
samples	VFIO fixes for v5.13-rc5	2021-06-03 11:52:24 -07:00
scripts	kbuild: skip per-CPU BTF generation for pahole v1.18-v1.21	2021-06-29 10:53:54 -07:00
security	trusted-keys: match tpm_get_ops on all return paths	2021-05-12 22:36:37 +03:00
sound	ASoC: rt5645: Avoid upgrading static warnings to errors	2021-06-24 12:22:27 +02:00
tools	userfaultfd/selftests: exercise minor fault handling shmem support	2021-06-30 20:47:28 -07:00
usr	.gitignore: prefix local generated files with a slash	2021-05-02 00:43:35 +09:00
virt	virt/kvm: use vma_lookup() instead of find_vma_intersection()	2021-06-29 10:53:51 -07:00
.clang-format	clang-format: Update with the latest for_each macro list	2021-05-12 23:32:39 +02:00
.cocciconfig	…
.get_maintainer.ignore	Opt out of scripts/get_maintainer.pl	2019-05-16 10:53:40 -07:00
.gitattributes	.gitattributes: use 'dts' diff driver for dts files	2019-12-04 19:44:11 -08:00
.gitignore	.gitignore: ignore only top-level modules.builtin	2021-05-02 00:43:35 +09:00
.mailmap	mailmap: add Marek's other e-mail address and identity without diacritics	2021-06-24 19:40:54 -07:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	MAINTAINERS: move Murali Karicheri to credits	2021-04-29 15:47:30 -07:00
Kbuild	kbuild: rename hostprogs-y/always to hostprogs/always-y	2020-02-04 01:53:07 +09:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	mm/zbud: don't export any zbud API	2021-06-30 20:47:29 -07:00
Makefile	Linux 5.13	2021-06-27 15:21:11 -07:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.