linux-stable/kernel
Tejun Heo 05f7658210 cgroup: Use separate src/dst nodes when preloading css_sets for migration
commit 07fd5b6cdf upstream.

Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.

Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:

 #1> mkdir -p /sys/fs/cgroup/misc/a/b
 #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
 #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
 #4> PID=$!
 #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
 #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs

the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.

After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:

 1. cgroup_migrate_add_src() is called on B for the leader.

 2. cgroup_migrate_add_src() is called on A for the other threads.

 3. cgroup_migrate_prepare_dst() is called. It scans the src list.

 4. It notices that B wants to migrate to A, so it tries to A to the dst
    list but realizes that its ->mg_preload_node is already busy.

 5. and then it notices A wants to migrate to A as it's an identity
    migration, it culls it by list_del_init()'ing its ->mg_preload_node and
    putting references accordingly.

 6. The rest of migration takes place with B on the src list but nothing on
    the dst list.

This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.

This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.

This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de9851 ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-21 20:42:43 +02:00
..
bpf bpf: Add kconfig knob for disabling unpriv bpf by default 2022-02-16 12:44:50 +01:00
cgroup cgroup: Use separate src/dst nodes when preloading css_sets for migration 2022-07-21 20:42:43 +02:00
configs
debug kdb: Make memory allocations more robust 2021-03-03 18:22:36 +01:00
events perf: Fix sys_perf_event_open() race against self 2022-05-25 08:41:19 +02:00
gcov gcov: add support for GCC 10.1 2020-09-23 10:46:32 +02:00
irq random: remove unused irq_flags argument from add_interrupt_randomness() 2022-06-25 11:46:30 +02:00
livepatch livepatch: Nullify obj->mod in klp_module_coming()'s error path 2019-10-07 18:55:09 +02:00
locking locking/lockdep: Avoid RCU-induced noinstr fail 2021-11-26 11:40:26 +01:00
power PM: suspend: fix return value of __setup handler 2022-04-20 09:08:13 +02:00
printk printk: fix return value of printk.devkmsg __setup handler 2022-04-20 09:08:15 +02:00
rcu rcu: Fix missed wakeup of exp_wq waiters 2021-09-26 13:37:28 +02:00
sched sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numa 2022-04-20 09:08:13 +02:00
time timekeeping: Add raw clock fallback for random_get_entropy() 2022-06-25 11:46:38 +02:00
trace tracing: Avoid adding tracer option before update_tracer_options 2022-06-14 16:53:58 +02:00
.gitignore
acct.c
async.c Revert "module, async: async_synchronize_full() on module init iff async is used" 2022-02-23 11:57:33 +01:00
audit.c audit: improve audit queue handling when "audit=1" on cmdline 2022-02-08 18:16:28 +01:00
audit.h audit: fix a net reference leak in audit_list_rules_send() 2020-06-20 10:25:10 +02:00
audit_fsnotify.c
audit_tree.c
audit_watch.c audit: CONFIG_CHANGE don't log internal bookkeeping as an event 2020-10-01 13:12:33 +02:00
auditfilter.c audit: fix a net reference leak in audit_list_rules_send() 2020-06-20 10:25:10 +02:00
auditsc.c audit: print empty EXECVE args 2019-12-01 09:14:03 +01:00
backtracetest.c
bounds.c kbuild: fix kernel/bounds.c 'W=1' warning 2018-11-13 11:15:08 -08:00
capability.c
compat.c make 'user_access_begin()' do 'access_ok()' 2020-06-20 10:24:58 +02:00
configs.c
context_tracking.c
cpu.c random: clear fast pool, crng, and batches in cpuhp bring up 2022-06-25 11:46:35 +02:00
cpu_pm.c kernel/cpu_pm: Fix uninitted local in cpu_pm 2020-06-20 10:25:19 +02:00
crash_core.c
crash_dump.c
cred.c memcg: account security cred as well to kmemcg 2020-01-09 10:17:54 +01:00
delayacct.c delayacct: Use raw_spinlocks 2018-08-03 07:50:38 +02:00
dma.c
exec_domain.c
exit.c don't dump the threads that had been already exiting when zapped. 2020-11-18 18:27:58 +01:00
extable.c
fork.c mm/hugetlb: initialize hugetlb_usage in mm_init 2021-09-22 11:45:32 +02:00
freezer.c
futex.c mm, futex: fix shared futex pgoff on shmem huge page 2021-07-11 12:48:12 +02:00
groups.c
hung_task.c kernel: hung_task.c: disable on suspend 2019-04-20 09:15:05 +02:00
irq_work.c
jump_label.c sched/core: Fix cpu.max vs. cpuhotplug deadlock 2018-12-05 19:41:17 +01:00
kallsyms.c kallsyms: Don't let kallsyms_lookup_size_offset() fail on retrieving the first symbol 2019-09-21 07:15:38 +02:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c kcov: ensure irq code sees a valid area 2018-08-03 07:50:22 +02:00
kexec.c
kexec_core.c kexec: Allocate decrypted control pages for kdump if SME is enabled 2019-11-24 08:23:15 +01:00
kexec_file.c kexec_file: drop weak attribute from arch_kexec_apply_relocations[_add] 2022-07-02 16:18:11 +02:00
kexec_internal.h
kmod.c kmod: make request_module() return an error when autoloading is disabled 2020-04-24 08:00:44 +02:00
kprobes.c kprobes: Limit max data_size of the kretprobe instances 2021-12-08 08:46:54 +01:00
ksysfs.c
kthread.c kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() 2021-07-11 12:48:13 +02:00
latencytop.c
Makefile elfcore: fix building with clang 2021-02-10 09:12:08 +01:00
memremap.c mm, devm_memremap_pages: kill mapping "System RAM" support 2019-01-13 10:01:02 +01:00
module-internal.h
module.c Revert "module, async: async_synchronize_full() on module init iff async is used" 2022-02-23 11:57:33 +01:00
module_signing.c
notifier.c x86/mm: split vmalloc_sync_all() 2020-04-02 16:34:20 +02:00
nsproxy.c
padata.c padata: purge get_cpu and reorder_via_wq from padata_do_serial 2020-05-27 16:43:05 +02:00
panic.c panic: ensure preemption is disabled during panic() 2019-10-17 13:43:19 -07:00
params.c
pid.c
pid_namespace.c memcg: enable accounting for pids in nested pid namespaces 2021-09-22 11:45:32 +02:00
profile.c profiling: fix shift-out-of-bounds bugs 2021-09-26 13:37:28 +02:00
ptrace.c ptrace: Reimplement PTRACE_KILL by always sending SIGKILL 2022-06-14 16:53:43 +02:00
range.c
reboot.c reboot: fix overflow parsing reboot cpu number 2020-11-18 18:28:02 +01:00
relay.c kernel/relay.c: fix memleak on destroy relay channel 2020-08-26 10:29:54 +02:00
resource.c resource: fix integer overflow at reallocation 2018-04-24 09:36:22 +02:00
seccomp.c seccomp: Invalidate seccomp mode to catch death failures 2022-02-16 12:44:52 +01:00
signal.c signal: Remove the bogus sigkill_pending in ptrace_stop 2021-11-26 11:40:24 +01:00
smp.c smp: Fix offline cpu check in flush_smp_call_function_queue() 2022-04-20 09:08:33 +02:00
smpboot.c kthread: Extract KTHREAD_IS_PER_CPU 2021-02-07 14:47:41 +01:00
smpboot.h
softirq.c Mark HI and TASKLET softirq synchronous 2018-08-15 18:12:47 +02:00
stacktrace.c
stop_machine.c stop_machine: Atomically queue and wake stopper threads 2018-09-05 09:26:36 +02:00
sys.c prctl: allow to setup brk for et_dyn executables 2021-09-26 13:37:28 +02:00
sys_ni.c
sysctl.c x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting 2022-03-11 10:13:28 +01:00
sysctl_binary.c
task_work.c
taskstats.c taskstats: fix data-race 2020-01-09 10:17:53 +01:00
test_kprobes.c
torture.c
tracepoint.c tracepoint: Do not fail unregistering a probe due to memory failure 2021-03-03 18:22:47 +01:00
tsacct.c taskstats: Cleanup the use of task->exit_code 2022-02-23 11:57:34 +01:00
ucount.c
uid16.c
umh.c usermodehelper: reset umask to default before executing user process 2020-10-14 09:51:10 +02:00
up.c smp: Fix smp_call_function_single_async prototype 2021-05-22 10:57:35 +02:00
user-return-notifier.c
user.c
user_namespace.c userns: move user access out of the mutex 2018-09-09 19:56:00 +02:00
utsname.c
utsname_sysctl.c sys: don't hold uts_sem while accessing userspace memory 2018-09-09 19:56:00 +02:00
watchdog.c watchdog/softlockup: Enforce that timestamp is valid on boot 2020-02-28 16:36:05 +01:00
watchdog_hld.c watchdog: Mark watchdog touch functions as notrace 2018-09-05 09:26:42 +02:00
workqueue.c workqueue: fix UAF in pwq_unbound_release_workfn() 2021-08-04 12:22:14 +02:00
workqueue_internal.h