linux-stable/kernel
Mel Gorman e496132ebe sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Commit 7d2b5dd0bc ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have on instance
per LLC the results and without binding, the results are

                            5.17.0-rc0             5.17.0-rc0
                               vanilla       sched-numaimb-v6
MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch;

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v5
Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed. Even in cases where
the average performance is neutral, the results are more stable.

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v6
Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)

Similarly, for embarassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.

                              vanilla       sched-numaimb-v6
Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
2022-02-11 23:30:08 +01:00
..
bpf bpf: Fix ringbuf memory type confusion when passing to helpers 2022-01-19 01:21:46 +01:00
cgroup Merge branch 'for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2022-01-11 09:14:37 -08:00
configs configs: introduce debug.config for CI-like setup 2022-01-20 08:52:55 +02:00
debug kdb: Adopt scheduler's task classification 2021-11-03 17:21:37 +00:00
dma hyperv-next for 5.17 2022-01-16 15:53:00 +02:00
entry entry: Snapshot thread flags 2021-12-01 00:06:43 +01:00
events Peter Zijlstra says: 2022-01-12 16:26:58 -08:00
futex Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2022-01-17 05:49:30 +02:00
gcov gcov: Remove compiler version check 2021-12-02 17:25:21 +09:00
irq proc: remove PDE_DATA() completely 2022-01-22 08:33:37 +02:00
kcsan KCSAN updates for v5.17 2022-01-11 09:51:26 -08:00
livepatch Livepatching changes for 5.17 2022-01-16 10:08:13 +02:00
locking locking/rwlocks: introduce write_lock_nested 2022-01-22 08:33:37 +02:00
power PM: hibernate: Allow ACPI hardware signature to be honoured 2021-12-08 16:06:10 +01:00
printk printk: fix build warning when CONFIG_PRINTK=n 2022-01-22 08:33:36 +02:00
rcu Merge branch 'akpm' (patches from Andrew) 2022-01-15 20:37:06 +02:00
sched sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-11 23:30:08 +01:00
time bitmap patches for 5.17-rc1 2022-01-23 06:20:44 +02:00
trace ftrace: Fix s390 breakage from sorting mcount tables 2022-01-23 08:07:02 +02:00
.gitignore
acct.c kernel: remove spurious blkdev.h includes 2021-10-18 06:17:01 -06:00
async.c
audit.c audit/stable-5.17 PR 20220110 2022-01-11 13:08:21 -08:00
audit.h audit/stable-5.16 PR 20211101 2021-11-01 21:17:39 -07:00
audit_fsnotify.c fsnotify: clarify contract for create event hooks 2021-10-27 12:32:34 +02:00
audit_tree.c audit: use struct_size() helper in kmalloc() 2021-12-14 17:39:42 -05:00
audit_watch.c \n 2021-11-06 16:43:20 -07:00
auditfilter.c audit/stable-5.17 PR 20220110 2022-01-11 13:08:21 -08:00
auditsc.c lsm: security_task_getsecid_subj() -> security_current_getsecid_subj() 2021-11-22 17:52:47 -05:00
backtracetest.c
bounds.c
capability.c
cfi.c cfi: Use rcu_read_{un}lock_sched_notrace 2021-08-11 13:11:12 -07:00
compat.c arch: remove compat_alloc_user_space 2021-09-08 15:32:35 -07:00
configs.c
context_tracking.c
cpu.c sched/scs: Reset task stack state in bringup_cpu() 2021-11-24 12:20:27 +01:00
cpu_pm.c PM: cpu: Make notifier chain use a raw_spinlock_t 2021-08-16 18:55:32 +02:00
crash_core.c kernel/crash_core: suppress unknown crashkernel parameter warning 2021-12-25 12:20:55 -08:00
crash_dump.c
cred.c ucounts: In set_cred_ucounts assume new->ucounts is non-NULL 2021-10-20 10:45:34 -05:00
delayacct.c delayacct: track delays from memory compact 2022-01-20 08:52:55 +02:00
dma.c
exec_domain.c
exit.c exit: Fix the exit_code for wait_task_zombie 2022-01-08 12:43:57 -06:00
extable.c extable: use is_kernel_text() helper 2021-11-09 10:02:51 -08:00
fail_function.c
fork.c Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2022-01-17 05:49:30 +02:00
freezer.c
gen_kheaders.sh
groups.c
hung_task.c hung_task: move hung_task sysctl interface to hung_task.c 2022-01-22 08:33:34 +02:00
iomem.c
irq_work.c irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT 2021-10-15 11:25:18 +02:00
jump_label.c
kallsyms.c Livepatching changes for 5.17 2022-01-16 10:08:13 +02:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking/rwlock: Provide RT variant 2021-08-17 17:50:51 +02:00
Kconfig.preempt preempt: Restore preemption model selection configs 2021-11-11 13:09:33 +01:00
kcov.c kcov: replace local_irq_save() with a local_lock_t 2021-11-09 10:02:52 -08:00
kexec.c kexec: avoid compat_alloc_user_space 2021-09-08 15:32:34 -07:00
kexec_core.c exit: Move oops specific logic from do_exit into make_task_dead 2021-12-13 12:04:45 -06:00
kexec_elf.c
kexec_file.c memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED 2021-11-06 13:30:42 -07:00
kexec_internal.h
kheaders.c
kmod.c
kprobes.c kprobe: move sysctl_kprobes_optimization to kprobes.c 2022-01-22 08:33:36 +02:00
ksysfs.c
kthread.c Merge branch 'akpm' (patches from Andrew) 2022-01-20 10:41:01 +02:00
latencytop.c
Makefile module: add in-kernel support for decompressing 2022-01-11 18:45:02 -08:00
module-internal.h module: add in-kernel support for decompressing 2022-01-11 18:45:02 -08:00
module.c Merge branch 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux 2022-01-17 07:32:51 +02:00
module_decompress.c kernel: Fix spelling mistake "compresser" -> "compressor" 2022-01-13 07:17:47 -08:00
module_signature.c
module_signing.c
notifier.c notifier: Return an error when a callback has already been registered 2021-12-29 10:37:33 +01:00
nsproxy.c memcg: enable accounting for new namesapces and struct nsproxy 2021-09-03 09:58:12 -07:00
padata.c padata: Remove repeated verbose license text 2021-08-27 16:30:18 +08:00
panic.c panic: remove oops_id 2022-01-20 08:52:55 +02:00
params.c kobject: remove kset from struct kset_uevent_ops callbacks 2021-12-28 11:26:18 +01:00
pid.c pid: add pidfd_get_task() helper 2021-10-14 13:29:18 +02:00
pid_namespace.c memcg: enable accounting for new namesapces and struct nsproxy 2021-09-03 09:58:12 -07:00
profile.c exit: Remove profile_handoff_task 2022-01-08 12:43:57 -06:00
ptrace.c ptrace: Remove second setting of PT_SEIZED in ptrace_attach 2022-01-08 12:43:57 -06:00
range.c
reboot.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2021-11-12 11:53:16 -08:00
regset.c
relay.c
resource.c proc: remove PDE_DATA() completely 2022-01-22 08:33:37 +02:00
resource_kunit.c
rseq.c rseq: Remove broken uapi field layout on 32-bit little endian 2022-02-02 13:11:34 +01:00
scftorture.c scftorture: Always log error message 2021-12-07 16:36:17 -08:00
scs.c scs: Release kasan vmalloc poison in scs_free process 2021-09-30 09:37:27 +01:00
seccomp.c Merge branch 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2021-09-01 14:52:05 -07:00
signal.c Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2022-01-17 05:49:30 +02:00
smp.c sched: Improve wake_up_all_idle_cpus() take #2 2021-10-22 15:32:46 +02:00
smpboot.c smpboot: Replace deprecated CPU-hotplug functions. 2021-08-10 14:57:42 +02:00
smpboot.h
softirq.c timers/nohz: Last resort update jiffies on nohz_full IRQ entry 2021-12-02 15:07:22 +01:00
stackleak.c stackleak: move stack_erasing sysctl to stackleak.c 2022-01-22 08:33:35 +02:00
stacktrace.c stacktrace: move filter_irq_stacks() to kernel/stacktrace.c 2021-11-06 13:30:43 -07:00
static_call.c
stop_machine.c
sys.c Merge branch 'akpm' (patches from Andrew) 2022-01-20 10:41:01 +02:00
sys_ni.c mm/mempolicy: wire up syscall set_mempolicy_home_node 2022-01-15 16:30:30 +02:00
sysctl-test.c
sysctl.c sched: move autogroup sysctls into its own file 2022-02-02 13:11:37 +01:00
task_work.c
taskstats.c
torture.c locktorture,rcutorture,torture: Always log error message 2021-12-07 16:36:17 -08:00
tracepoint.c tracepoint: Fix kerneldoc comments 2021-08-16 11:39:51 -04:00
tsacct.c taskstats: Cleanup the use of task->exit_code 2022-01-08 12:43:57 -06:00
ucount.c ucounts: Fix rlimit max values check 2021-12-09 15:37:18 -06:00
uid16.c
uid16.h
umh.c
up.c
user-return-notifier.c
user.c fs/epoll: use a per-cpu counter for user's watches count 2021-09-08 11:50:27 -07:00
user_namespace.c memcg: enable accounting for new namesapces and struct nsproxy 2021-09-03 09:58:12 -07:00
usermode_driver.c
utsname.c
utsname_sysctl.c
watch_queue.c
watchdog.c watchdog: move watchdog sysctl interface to watchdog.c 2022-01-22 08:33:34 +02:00
watchdog_hld.c
workqueue.c Merge branch 'workqueue/for-5.16-fixes' into workqueue/for-5.17 2022-01-10 07:54:04 -10:00
workqueue_internal.h workqueue: Assign a color to barrier work items 2021-08-17 07:49:10 -10:00