From 1a11533fbd71792e8c5d36f6763fbce8df0d231d Mon Sep 17 00:00:00 2001
From: Tejun Heo
Date: Wed, 12 Feb 2014 19:06:19 -0500
Subject: [PATCH 1/2] Revert "cgroup: use an ordered workqueue for cgroup destruction"

This reverts commit ab3f5faa6255a0eb4f832675507d9e295ca7e9ba.

Explanation from Hugh:

  It's because more thorough testing, by others here, found that it
  wasn't always solving the problem: so I asked Tejun privately to
  hold off from sending it in, until we'd worked out why not.

  Most of our testing being on a v3.11-based kernel, it was perfectly
  possible that the problem was merely our own, e.g. missing Tejun's
  8a2b75384444 ("workqueue: fix ordered workqueues in NUMA setups").

  But that turned out not to be enough to fix it either. Then Filipe
  pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
  before we ever get to put the offline on to the workqueue: by the
  time we get to the workqueue, the ordering has already been lost.

  So, thanks for the Acks, but I'm afraid that this ordered workqueue
  solution is just not good enough: we should simply forget that patch
  and provide a different answer.

Signed-off-by: Tejun Heo
Cc: Hugh Dickins
---
 kernel/cgroup.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 52719ce55dd3..68d87103b493 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4844,16 +4844,12 @@ static int __init cgroup_wq_init(void)
 	/*
 	 * There isn't much point in executing destruction path in
 	 * parallel.  Good chunk is serialized with cgroup_mutex anyway.
-	 *
-	 * XXX: Must be ordered to make sure parent is offlined after
-	 * children.  The ordering requirement is for memcg where a
-	 * parent's offline may wait for a child's leading to deadlock.  In
-	 * the long term, this should be fixed from memcg side.
+	 * Use 1 for @max_active.
 	 *
 	 * We would prefer to do this in cgroup_init() above, but that
 	 * is called before init_workqueues(): so leave this until after.
 	 */
-	cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
+	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
 	BUG_ON(!cgroup_destroy_wq);
 
 	/*

From 532de3fc72adc2a6525c4d53c07bf81e1732083d Mon Sep 17 00:00:00 2001
From: Tejun Heo
Date: Thu, 13 Feb 2014 13:29:31 -0500
Subject: [PATCH 2/2] cgroup: update cgroup_enable_task_cg_lists() to grab siglock

Currently, there's nothing preventing cgroup_enable_task_cg_lists()
from missing a freshly set PF_EXITING and racing against cgroup_exit().
Depending on the timing, cgroup_exit() may finish with the task still
linked on a css_set, leading to list corruption. Fix it by grabbing
siglock in cgroup_enable_task_cg_lists() so that PF_EXITING is
guaranteed to be visible.

This whole on-demand cg_list optimization is extremely fragile and has
ample possibility to lead to bugs which can cause things like a
once-a-year oops during boot. I'm wondering whether the better approach
would be just adding "cgroup_disable=all" handling which disables the
whole cgroup rather than tempting fate with this on-demand craziness.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Cc: stable@vger.kernel.org
---
 kernel/cgroup.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 68d87103b493..105f273b6f86 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2905,9 +2905,14 @@ static void cgroup_enable_task_cg_lists(void)
 		 * We should check if the process is exiting, otherwise
 		 * it will race with cgroup_exit() in that the list
 		 * entry won't be deleted though the process has exited.
+		 * Do it while holding siglock so that we don't end up
+		 * racing against cgroup_exit().
 		 */
+		spin_lock_irq(&p->sighand->siglock);
 		if (!(p->flags & PF_EXITING) && list_empty(&p->cg_list))
 			list_add(&p->cg_list, &task_css_set(p)->tasks);
+		spin_unlock_irq(&p->sighand->siglock);
+
 		task_unlock(p);
 	} while_each_thread(g, p);
 	read_unlock(&tasklist_lock);
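
A note on the pattern applied by patch 2, with a minimal userspace sketch
(plain C and pthreads) of the idea: the exit path sets its "exiting" flag
while holding a per-task lock, so the path that links tasks onto a list must
test that flag under the same lock before linking. Every name in the sketch
(struct task, task_exit(), enable_lists(), task_list) is invented for
illustration; the real code uses p->sighand->siglock, PF_EXITING,
cgroup_exit() and the css_set task list instead.

/*
 * Illustration only -- not kernel code.  struct task, task_exit() and
 * enable_lists() are made-up stand-ins for task_struct/PF_EXITING,
 * cgroup_exit() and cgroup_enable_task_cg_lists().
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct task {
	pthread_mutex_t lock;	/* stands in for p->sighand->siglock */
	bool exiting;		/* stands in for PF_EXITING */
	struct task *next;	/* stands in for the p->cg_list linkage */
};

static struct task *task_list;	/* stands in for the css_set task list */

/*
 * Exit path: mark the task under the lock; the real cgroup_exit() would
 * also unlink the task here if it had already been added.
 */
static void task_exit(struct task *t)
{
	pthread_mutex_lock(&t->lock);
	t->exiting = true;
	pthread_mutex_unlock(&t->lock);
}

/*
 * Enable path: link the task only if it is not exiting, and make that
 * check under the same lock the exit path takes.  Checking the flag
 * without the lock is the bug being fixed: the flag may be set, and the
 * exit path may already be past its unlink, by the time we link.
 */
static void enable_lists(struct task *t)
{
	pthread_mutex_lock(&t->lock);
	if (!t->exiting && t->next == NULL) {
		t->next = task_list;
		task_list = t;
	}
	pthread_mutex_unlock(&t->lock);
}

int main(void)
{
	struct task t = { .lock = PTHREAD_MUTEX_INITIALIZER };

	task_exit(&t);		/* the task starts exiting first ... */
	enable_lists(&t);	/* ... so it must not get linked */
	printf("linked: %s\n", task_list == &t ? "yes (bug)" : "no (ok)");
	return 0;
}

Run as in main(), the enable path sees the flag and skips the link. Without
the common lock, the enable path could miss the flag and link a task whose
exit path has already run past its unlink, which is the list corruption the
changelog above describes.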