linux-stable

Gregory Price fa3bea4e1f mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

When a system has multiple NUMA nodes and it becomes bandwidth hungry, using the current MPOL_INTERLEAVE could be a wise option.

However, if those NUMA nodes consist of different types of memory such as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based interleave policy does not optimally distribute data to make use of their different bandwidth characteristics.

Instead, interleave is more effective when the allocation policy follows each NUMA node's bandwidth weight rather than a simple 1:1 distribution.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, enabling weighted interleave between NUMA nodes. Weighted interleave allows for proportional distribution of memory across multiple NUMA nodes, preferably apportioned to match the bandwidth of each node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight distribution is (2:1).

Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/

For now, the default value of all nodes will be `1`, which matches the behavior of standard 1:1 round-robin interleave. An extension will be added in the future to allow default values to be registered at kernel and device bringup time.

The policy allocates a number of pages equal to the set weights. For example, if the weights are (2,1), then 2 pages will be allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) and mbind(2).

Some high level notes about the pieces of weighted interleave:

current->il_prev:
    Tracks the node previously allocated from.

current->il_weight:
    The active weight of the current node (current->il_prev).
    When this reaches 0, current->il_prev is set to the next node
    and current->il_weight is set to the next weight.

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node. When the weight reaches 0, switch
    to the next node. Operates only on task->mempolicy.

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Operates on VMA policies.

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round"). Calculates the number of
    pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight will be allocated first.

    Operates only on the task->mempolicy.

One piece of complexity is the interaction between a recent refactor, which split the logic to acquire the "ilx" (interleave index) of an allocation, and the actual application of the interleave. If a call to alloc_pages_mpol() were made with a weighted-interleave policy and ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA policy - violating the description above.

An inspection of all callers of alloc_pages_mpol() shows that all external callers set ilx to `0`, an index value, or will call get_vma_policy() to acquire the ilx.

For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces the `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy requirements (task/vma respectively).

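The two single-page paths described above (the task-level counter and the index-based VMA lookup) can be illustrated with a small user-space sketch. This is not the kernel code: `struct task_state`, `struct policy`, `MAX_NODES`, `task_next_node()` and `index_to_node()` are hypothetical names chosen for this example, the nodemask is a plain bitmask, and a weight of 0 is assumed to mean the default of 1.

```c
#include <assert.h>

#define MAX_NODES 8 /* hypothetical small limit for this sketch */

/* Per-task state, loosely modelled on current->il_prev / il_weight. */
struct task_state {
    int il_prev;            /* node previously allocated from */
    unsigned int il_weight; /* pages left on il_prev before moving on */
};

/* Simplified policy: a bitmask of nodes plus a per-node weight table. */
struct policy {
    unsigned int nodemask;            /* bit n set => node n is in the policy */
    unsigned char weights[MAX_NODES]; /* 0 is treated as the default weight 1 */
};

static unsigned int node_weight(const struct policy *pol, int node)
{
    assert(node >= 0 && node < MAX_NODES);
    return pol->weights[node] ? pol->weights[node] : 1;
}

/* Next node in the mask after 'node', wrapping around; -1 if mask is empty. */
static int next_node_in(const struct policy *pol, int node)
{
    for (int i = 1; i <= MAX_NODES; i++) {
        int cand = (node + i) % MAX_NODES;
        if (pol->nodemask & (1u << cand))
            return cand;
    }
    return -1;
}

/* Task path: consume one unit of the current weight, advance when it hits 0. */
static int task_next_node(struct task_state *t, const struct policy *pol)
{
    int node = t->il_prev;

    if (t->il_weight == 0 || !(pol->nodemask & (1u << node))) {
        node = next_node_in(pol, node);
        if (node < 0)
            return -1;
        t->il_prev = node;
        t->il_weight = node_weight(pol, node);
    }
    t->il_weight--;
    return node;
}

/* VMA path: map an interleave index onto a node using cumulative weights. */
static int index_to_node(const struct policy *pol, unsigned long ilx)
{
    unsigned int total = 0;

    for (int n = 0; n < MAX_NODES; n++)
        if (pol->nodemask & (1u << n))
            total += node_weight(pol, n);
    if (total == 0)
        return -1;

    unsigned int target = (unsigned int)(ilx % total);
    for (int n = 0; n < MAX_NODES; n++) {
        if (!(pol->nodemask & (1u << n)))
            continue;
        unsigned int w = node_weight(pol, n);
        if (target < w)
            return n;
        target -= w;
    }
    return -1;
}
```

With nodes 0 and 1 in the mask and weights (2,1), starting from `il_prev = MAX_NODES - 1` and `il_weight = 0`, repeated calls to `task_next_node()` yield nodes 0, 0, 1, 0, 0, 1, which matches the "2 pages on node0 for every 1 page on node1" behaviour in the commit message. `index_to_node()` is stateless, which is why it suits VMA policies where the index comes from the page offset.
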
Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com
Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2024-02-22 10:24:46 -08:00
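The bulk path named in the commit message (bulk_array_weighted_interleave) splits a request into whole "interleave rounds" plus a partial round. The following user-space sketch shows only that arithmetic, under stated assumptions: `bulk_weighted_split()`, `MAX_NODES` and the bitmask nodemask are hypothetical names for illustration, a weight of 0 is treated as 1, and the sketch only counts pages per node rather than allocating them or updating the task state afterwards.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 8 /* hypothetical small limit for this sketch */

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/*
 * Compute how many of nr_pages land on each node.
 *   prev_node / carry: the node last allocated from and how much of its
 *                      weight is still unused (il_prev / il_weight analogue)
 *   out[n]:            receives the page count for node n
 */
static void bulk_weighted_split(unsigned int nodemask,
                                const unsigned char weights[MAX_NODES],
                                int prev_node, unsigned int carry,
                                unsigned int nr_pages,
                                unsigned int out[MAX_NODES])
{
    unsigned int total = 0, rem = nr_pages;

    assert(prev_node >= 0 && prev_node < MAX_NODES);
    memset(out, 0, MAX_NODES * sizeof(out[0]));

    for (int n = 0; n < MAX_NODES; n++)
        if (nodemask & (1u << n))
            total += weights[n] ? weights[n] : 1;
    if (total == 0)
        return;

    /* Step 1: use up the weight already scheduled on the current node. */
    if (carry && (nodemask & (1u << prev_node))) {
        unsigned int take = min_u(carry, rem);
        out[prev_node] += take;
        rem -= take;
    }

    /* Step 2: whole rounds, plus a partial round ("delta"). */
    unsigned int rounds = rem / total;
    unsigned int delta = rem % total;

    /* Step 3: walk the nodes starting after prev_node, wrapping around. */
    for (int i = 1; i <= MAX_NODES; i++) {
        int n = (prev_node + i) % MAX_NODES;
        if (!(nodemask & (1u << n)))
            continue;
        unsigned int w = weights[n] ? weights[n] : 1;
        unsigned int extra = min_u(delta, w);
        out[n] += rounds * w + extra;
        delta -= extra;
    }
}
```

For weights (2,1) on nodes 0 and 1, a fresh 7-page request gives two full rounds (6 pages) and a 1-page partial round, so node 0 receives 5 pages and node 1 receives 2. If one unit of node 0's weight is still pending, a 5-page request gives node 0 three pages and node 1 two, the same totals that five consecutive single-page allocations would produce.
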
..
ABI	mm/mempolicy: implement the sysfs-based weighted_interleave interface	2024-02-22 10:24:46 -08:00
PCI	…
RAS	…
RCU	…
accel	…
accounting	…
admin-guide	mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving	2024-02-22 10:24:46 -08:00
arch	arm64: Subscribe Microsoft Azure Cobalt 100 to ARM Neoverse N2 errata	2024-02-15 11:47:22 +00:00
block	…
bpf	…
cdrom	…
core-api	…
cpu-freq	…
crypto	…
dev-tools	…
devicetree	sound fixes for 6.8-rc5	2024-02-16 09:02:19 -08:00
doc-guide	…
driver-api	…
fault-injection	…
fb	…
features	…
filesystems	…
firmware-guide	…
firmware_class	…
fpga	…
gpu	…
hid	…
hwmon	…
i2c	…
iio	…
images	…
infiniband	…
input	…
isdn	…
kbuild	docs: kconfig: Fix grammar and formatting	2024-02-15 06:55:47 +09:00
kernel-hacking	…
leds	…
litmus-tests	…
livepatch	…
locking	…
maintainer	…
mhi	…
misc-devices	…
mm	…
netlabel	…
netlink	dpll: fix possible deadlock during netlink dump operation	2024-02-08 18:29:21 -08:00
networking	net-device: move lstats in net_device_read_txrx	2024-02-12 09:51:26 +00:00
nvdimm	…
nvme	…
pcmcia	…
peci	…
power	…
process	Documentation: Document the Linux Kernel CVE process	2024-02-17 14:46:39 +01:00
rust	…
scheduler	…
scsi	…
security	…
sound	…
sphinx	docs: kernel_feat.py: fix build error for missing files	2024-02-08 11:05:35 -07:00
sphinx-static	…
spi	…
staging	…
target	…
tee	…
timers	…
tools	…
trace	…
translations	Docs/translations/damon/usage: update for monitor_on renaming	2024-02-22 10:24:46 -08:00
usb	…
userspace-api	…
virt	…
w1	…
watchdog	…
wmi	…
.gitignore	…
Changes	…
CodingStyle	…
Kconfig	…
Makefile	…
SubmittingPatches	…
atomic_bitops.txt	…
atomic_t.txt	…
conf.py	…
docutils.conf	…
dontdiff	…
index.rst	…
memory-barriers.txt	…
subsystem-apis.rst	…