linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-11-01 17:08:10 +00:00

Odin Ugedal 0258bdfaff sched/fair: Fix unfairness caused by missing load decay

This fixes an issue where old load on a cfs_rq is not properly decayed, resulting in strange behavior where fairness can decrease drastically. Real workloads with equally weighted control groups have ended up getting a respective 99% and 1%(!!) of cpu time.

When an idle task is attached to a cfs_rq by attaching a pid to a cgroup, the old load of the task is attached to the new cfs_rq and sched_entity by attach_entity_cfs_rq. If the task is then moved to another cpu (and therefore another cfs_rq) before being enqueued/woken up, the load will be moved to cfs_rq->removed from the sched_entity. Such a move will happen when enforcing a cpuset on the task (e.g. via a cgroup) that forces it to move.

The load will however not be removed from the task_group itself, making it look like there is a constant load on that cfs_rq. This causes the vruntime of tasks on other sibling cfs_rq's to increase faster than it is supposed to, causing severe fairness issues. If no other task is started on the given cfs_rq (and due to the cpuset, none would be), this load would never be properly unloaded. With this patch the load will be properly removed inside update_blocked_averages. The same problem applies to tasks that are moved to the fair scheduling class and then moved to another cpu, and this patch fixes that case as well. For fork, the entity is queued right away, so fork is not affected.

This applies to cases where the new process is the first in the cfs_rq, an issue introduced by `3d30544f02` ("sched/fair: Apply more PELT fixes"), and to cases where there has previously been load on the cgroup but the cgroup was removed from the leaflist due to having null PELT load, introduced in `039ae8bcf7` ("sched/fair: Fix O(nr_cgroups) in the load balancing path").

For a simple cgroup hierarchy (as seen below) with two equally weighted groups, which in theory should get 50/50 of cpu time each, it often leads to a load of 60/40 or 70/30.

    parent/
      cg-1/
        cpu.weight: 100
        cpuset.cpus: 1
      cg-2/
        cpu.weight: 100
        cpuset.cpus: 1

If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2 equally weighted, they should still get a 50/50 balance of cpu time. This however sometimes results in a balance of 10/90 or 1/99(!!) between the task groups.

    $ ps u -C stress
    USER    PID  %CPU %MEM  VSZ  RSS TTY     STAT START  TIME COMMAND
    root  18568   1.1  0.0 3684  100 pts/12  R+   13:36  0:00 stress --cpu 1
    root  18580  99.3  0.0 3684  100 pts/12  R+   13:36  0:09 stress --cpu 1

    parent/
      cg-1/
        cpu.weight: 100
        sub-group/
          cpu.weight: 1
          cpuset.cpus: 1
      cg-2/
        cpu.weight: 100
        sub-group/
          cpu.weight: 10000
          cpuset.cpus: 1

This can be reproduced by attaching an idle process to a cgroup and moving it to a given cpuset before it wakes up. The issue is evident in many (if not most) container runtimes, and has been reproduced with both crun and runc (and therefore docker and all its "derivatives"), and with both cgroup v1 and v2.

Fixes: `3d30544f02` ("sched/fair: Apply more PELT fixes")
Fixes: `039ae8bcf7` ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Signed-off-by: Odin Ugedal <odin@uged.al>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210501141950.23622-2-odin@uged.al

2021-05-06 15:33:27 +02:00
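
The 50/50 expectation stated in the commit message can be checked with a little arithmetic. The sketch below is an idealized model, not kernel code: it assumes every leaf group always has one runnable task on the same CPU, and that CPU time is split among sibling groups in proportion to their `cpu.weight`. The function name `expected_shares` and the tuple-based tree layout are made up for this illustration.

```python
from fractions import Fraction


def expected_shares(tree, budget=Fraction(1)):
    """Split `budget` among sibling groups in proportion to cpu.weight.

    `tree` maps a group name to (weight, children); a leaf group has an
    empty `children` dict. Returns {path: share} for the leaf groups.
    Hypothetical helper: an idealized model, not how the scheduler is coded.
    """
    total = sum(weight for weight, _ in tree.values())
    shares = {}
    for name, (weight, children) in tree.items():
        share = budget * weight / total
        if children:
            # A group's share is divided again among its own children.
            for sub, value in expected_shares(children, share).items():
                shares[f"{name}/{sub}"] = value
        else:
            shares[name] = share
    return shares


# The two hierarchies from the commit message.
flat = {"cg-1": (100, {}), "cg-2": (100, {})}
deep = {
    "cg-1": (100, {"sub-group": (1, {})}),
    "cg-2": (100, {"sub-group": (10000, {})}),
}

for path, share in {**expected_shares(flat), **expected_shares(deep)}.items():
    print(f"{path}: {float(share):.0%}")
```

Under these assumptions both hierarchies come out at one half per group: each `sub-group` is the only child of its parent, so its weight of 1 or 10000 has nothing to compete against. The 1/99 split in the `ps` output is therefore a symptom of the stale load described in the commit message, not a consequence of the configured weights.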
autogroup.c
autogroup.h
clock.c	sched: Fix various typos	2021-03-22 00:11:52 +01:00
completion.c	completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()	2020-03-23 18:40:25 +01:00
core.c	sched: Fix out-of-bound access in uclamp	2021-05-06 15:33:26 +02:00
cpuacct.c	sched: Fix various typos	2021-03-22 00:11:52 +01:00
cpudeadline.c	sched,rt: Use the full cpumask for balancing	2020-11-10 18:39:00 +01:00
cpudeadline.h
cpufreq.c	cpufreq: Avoid leaving stale IRQ work items during CPU offline	2019-12-12 17:59:43 +01:00
cpufreq_schedutil.c	Scheduler updates for this cycle are:	2021-04-28 13:33:57 -07:00
cpupri.c	sched: Fix various typos	2021-03-22 00:11:52 +01:00
cpupri.h	sched/cpupri: Add CPUPRI_HIGHER	2020-10-29 11:00:30 +01:00
cputime.c	Scheduler updates for this cycle are:	2021-04-28 13:33:57 -07:00
deadline.c	sched: Fix various typos	2021-03-22 00:11:52 +01:00
debug.c	sched/debug: Fix cgroup_path[] serialization	2021-04-21 13:55:42 +02:00
fair.c	sched/fair: Fix unfairness caused by missing load decay	2021-05-06 15:33:27 +02:00
features.h	sched: Warn on long periods of pending need_resched	2021-04-21 13:55:41 +02:00
idle.c	sched: Fix various typos	2021-03-22 00:11:52 +01:00
isolation.c	isolcpus: Affine unbound kernel threads to housekeeping cpus	2020-06-15 14:10:03 +02:00
loadavg.c	sched: Fix various typos	2021-03-22 00:11:52 +01:00
Makefile
membarrier.c	sched/membarrier: fix missing local execution of ipi_sync_rq_state()	2021-03-06 12:40:21 +01:00
pelt.c	sched: Fix various typos	2021-03-22 00:11:52 +01:00
pelt.h	sched: Fix various typos	2021-03-22 00:11:52 +01:00
psi.c	psi: Fix psi state corruption when schedule() races with cgroup move	2021-05-06 15:33:26 +02:00
rt.c	sched: Fix various typos	2021-03-22 00:11:52 +01:00
sched-pelt.h
sched.h	sched: Warn on long periods of pending need_resched	2021-04-21 13:55:41 +02:00
smp.h	sched/headers: Split out open-coded prototypes into kernel/sched/smp.h	2020-05-28 11:03:20 +02:00
stats.c	sched: Fix various typos	2021-03-22 00:11:52 +01:00
stats.h	psi: Optimize task switch inside shared cgroups	2021-03-06 12:40:23 +01:00
stop_task.c	sched: Remove select_task_rq()'s sd_flag parameter	2020-11-10 18:39:06 +01:00
swait.c	sched/swait: Prepare usage in completions	2020-03-21 16:00:23 +01:00
topology.c	sched/debug: Rename the sched_debug parameter to sched_verbose	2021-04-17 13:22:44 +02:00
wait.c	sched/wait: Add add_wait_queue_priority()	2020-11-15 09:49:09 -05:00
wait_bit.c	sched/wait: fix ___wait_var_event(exclusive)	2019-12-17 13:32:50 +01:00