mm: memcontrol: don't throttle dying tasks on memory.high

While investigating hosts with high cgroup memory pressures, Tejun
found culprit zombie tasks that had were holding on to a lot of
memory, had SIGKILL pending, but were stuck in memory.high reclaim.

In the past, we used to always force-charge allocations from tasks
that were exiting in order to accelerate them dying and freeing up
their rss. This changed for memory.max in a4ebf1b6ca ("memcg:
prohibit unconditional exceeding the limit of dying tasks"); it noted
that this can cause (userspace inducable) containment failures, so it
added a mandatory reclaim and OOM kill cycle before forcing charges.
At the time, memory.high enforcement was handled in the userspace
return path, which isn't reached by dying tasks, and so memory.high
was still never enforced by dying tasks.

When c9afe31ec4 ("memcg: synchronously enforce memory.high for large
overcharges") added synchronous reclaim for memory.high, it added
unconditional memory.high enforcement for dying tasks as well. The
callstack shows that this path is where the zombie is stuck in.

We need to accelerate dying tasks getting past memory.high, but we
cannot do it quite the same way as we do for memory.max: memory.max is
enforced strictly, and tasks aren't allowed to move past it without
FIRST reclaiming and OOM killing if necessary. This ensures very small
levels of excess. With memory.high, though, enforcement happens lazily
after the charge, and OOM killing is never triggered. A lot of
concurrent threads could have pushed, or could actively be pushing,
the cgroup into excess. The dying task will enter reclaim on every
allocation attempt, with little hope of restoring balance.

To fix this, skip synchronous memory.high enforcement on dying tasks
altogether again. Update memory.high path documentation while at it.

[hannes@cmpxchg.org: also handle tasks are being killed during the reclaim]
  Link: https://lkml.kernel.org/r/20240111192807.GA424308@cmpxchg.org
Link: https://lkml.kernel.org/r/20240111132902.389862-1-hannes@cmpxchg.org
Fixes: c9afe31ec4 ("memcg: synchronously enforce memory.high for large overcharges")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit is contained in:
Johannes Weiner 2024-01-11 08:29:02 -05:00 committed by Andrew Morton
parent c4608d1bf7
commit 63fd327016
1 changed files with 25 additions and 4 deletions

View File

@ -2623,8 +2623,9 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
}
/*
* Scheduled by try_charge() to be executed from the userland return path
* and reclaims memory over the high limit.
* Reclaims memory over the high limit. Called directly from
* try_charge() (context permitting), as well as from the userland
* return path where reclaim is always able to block.
*/
void mem_cgroup_handle_over_high(gfp_t gfp_mask)
{
@ -2643,6 +2644,17 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
current->memcg_nr_pages_over_high = 0;
retry_reclaim:
/*
* Bail if the task is already exiting. Unlike memory.max,
* memory.high enforcement isn't as strict, and there is no
* OOM killer involved, which means the excess could already
* be much bigger (and still growing) than it could for
* memory.max; the dying task could get stuck in fruitless
* reclaim for a long time, which isn't desirable.
*/
if (task_is_dying())
goto out;
/*
* The allocating task should reclaim at least the batch size, but for
* subsequent retries we only want to do what's necessary to prevent oom
@ -2693,6 +2705,9 @@ retry_reclaim:
}
/*
* Reclaim didn't manage to push usage below the limit, slow
* this allocating task down.
*
* If we exit early, we're guaranteed to die (since
* schedule_timeout_killable sets TASK_KILLABLE). This means we don't
* need to account for any ill-begotten jiffies to pay them off later.
@ -2887,11 +2902,17 @@ done_restock:
}
} while ((memcg = parent_mem_cgroup(memcg)));
/*
* Reclaim is set up above to be called from the userland
* return path. But also attempt synchronous reclaim to avoid
* excessive overrun while the task is still inside the
* kernel. If this is successful, the return path will see it
* when it rechecks the overage and simply bail out.
*/
if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
!(current->flags & PF_MEMALLOC) &&
gfpflags_allow_blocking(gfp_mask)) {
gfpflags_allow_blocking(gfp_mask))
mem_cgroup_handle_over_high(gfp_mask);
}
return 0;
}