Commit graph

12765 commits

Author SHA1 Message Date
Sebastian Andrzej Siewior
e58dd0de5e bdi: use refcount_t for reference counting instead atomic_t
refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter.  This permits avoiding
accidental refcounter overflows that might lead to use-after-free
situations.

Link: http://lkml.kernel.org/r/20180703200141.28415-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Dennis Zhou (Facebook)
7e8a6304d5 /proc/meminfo: add percpu populated pages count
Currently, percpu memory only exposes allocation and utilization
information via debugfs.  This more or less is only really useful for
understanding the fragmentation and allocation information at a per-chunk
level with a few global counters.  This is also gated behind a config.
BPF and cgroup, for example, have seen an increase in use causing
increased use of percpu memory.  Let's make it easier for someone to
identify how much memory is being used.

This patch adds the "Percpu" stat to meminfo to more easily look up how
much percpu memory is in use.  This number includes the cost for all
allocated backing pages and not just insight at the per a unit, per chunk
level.  Metadata is excluded.  I think excluding metadata is fair because
the backing memory scales with the numbere of cpus and can quickly
outweigh the metadata.  It also makes this calculation light.

Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Roman Gushchin
3d8b38eb81 mm, oom: introduce memory.oom.group
For some workloads an intervention from the OOM killer can be painful.
Killing a random task can bring the workload into an inconsistent state.

Historically, there are two common solutions for this
problem:
1) enabling panic_on_oom,
2) using a userspace daemon to monitor OOMs and kill
   all outstanding processes.

Both approaches have their downsides: rebooting on each OOM is an obvious
waste of capacity, and handling all in userspace is tricky and requires a
userspace agent, which will monitor all cgroups for OOMs.

In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
the necessity of enabling panic_on_oom.  Also, it can simplify the cgroup
management for userspace applications.

This commit introduces a new knob for cgroup v2 memory controller:
memory.oom.group.  The knob determines whether the cgroup should be
treated as an indivisible workload by the OOM killer.  If set, all tasks
belonging to the cgroup or to its descendants (if the memory cgroup is not
a leaf cgroup) are killed together or not at all.

To determine which cgroup has to be killed, we do traverse the cgroup
hierarchy from the victim task's cgroup up to the OOMing cgroup (or root)
and looking for the highest-level cgroup with memory.oom.group set.

Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
an exception and are never killed.

This patch doesn't change the OOM victim selection algorithm.

Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Roman Gushchin
5989ad7b5e mm, oom: refactor oom_kill_process()
Patch series "introduce memory.oom.group", v2.

This is a tiny implementation of cgroup-aware OOM killer, which adds an
ability to kill a cgroup as a single unit and so guarantee the integrity
of the workload.

Although it has only a limited functionality in comparison to what now
resides in the mm tree (it doesn't change the victim task selection
algorithm, doesn't look at memory stas on cgroup level, etc), it's also
much simpler and more straightforward.  So, hopefully, we can avoid having
long debates here, as we had with the full implementation.

As it doesn't prevent any futher development, and implements an useful and
complete feature, it looks as a sane way forward.

This patch (of 2):

oom_kill_process() consists of two logical parts: the first one is
responsible for considering task's children as a potential victim and
printing the debug information.  The second half is responsible for
sending SIGKILL to all tasks sharing the mm struct with the given victim.

This commit splits oom_kill_process() with an intention to re-use the the
second half: __oom_kill_process().

The cgroup-aware OOM killer will kill multiple tasks belonging to the
victim cgroup.  We don't need to print the debug information for the each
task, as well as play with task selection (considering task's children),
so we can't use the existing oom_kill_process().

Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com
Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Oscar Salvador
03e85f9d5f mm/page_alloc: Introduce free_area_init_core_hotplug
Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core().  But there is
some code that we do not really need to run when we are coming from such
path.

free_area_init_core() performs the following actions:

1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
   when creating hash tables.
3) Account number of managed_pages per zone, substracting dma_reserved and
   memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zone

When called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.

Action #2 is pointless as the zones do not have any pages since either the
node was freed, or we are re-using it, eitherway all zones belonging to
this node should have 0 pages.  For the same reason, action #3 results
always in manages_pages being 0.

Action #5 and #6 are performed later on when onlining the pages:
 online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
 online_pages()->move_pfn_range_to_zone()->memmap_init_zone()

This patch does two things:

First, moves the node/zone initializtion to their own function, so it
allows us to create a small version of free_area_init_core, where we only
perform:

1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure data

These two functions are: pgdat_init_internals() and zone_init_internals().

The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():

Currently, we call free_area_init_node() from the memhotplug path.  In
there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.

Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point to calculate
this now.

Actually, we re-set these values to 0 later on with the calls to:

reset_node_managed_pages()
reset_node_present_pages()

The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:

online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()

Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
__paginginit with __init, so their code gets freed up.

[osalvador@techadventures.net: fix section usage]
  Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
  Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Oscar Salvador
0188dc98ad mm/page_alloc: inline function to handle CONFIG_DEFERRED_STRUCT_PAGE_INIT
Let us move the code between CONFIG_DEFERRED_STRUCT_PAGE_INIT to an inline
function.  Not having an ifdef in the function makes the code more
readable.

Link: http://lkml.kernel.org/r/20180730101757.28058-4-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Pavel Tatashin
7cc2a9596d mm: remove __paginginit
__paginginit is the same thing as __meminit except for platforms without
sparsemem, there it is defined as __init.

Remove __paginginit and use __meminit.  Use __ref in one single function
that merges __meminit and __init sections: setup_usemap().

Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Pavel Tatashin
c1093b746c mm: access zone->node via zone_to_nid() and zone_set_nid()
zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
have inline functions to access this field in order to avoid ifdef's in c
files.

Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Oscar Salvador
ace1db3976 mm/page_alloc.c: move ifdefery out of free_area_init_core
Patch series "Refactor free_area_init_core and add
free_area_init_core_hotplug", v6.

This patchset does three things:

 1) Clean up/refactor free_area_init_core/free_area_init_node
    by moving the ifdefery out of the functions.
 2) Move the pgdat/zone initialization in free_area_init_core to its
    own function.
 3) Introduce free_area_init_core_hotplug, a small subset of
    free_area_init_core, which is only called from memhotlug code path. In this
    way, we have:

    free_area_init_core: called during early initialization
    free_area_init_core_hotplug: called whenever a new node is allocated/re-used (memhotplug path)

This patch (of 5):

Moving the #ifdefs out of the function makes it easier to follow.

Link: http://lkml.kernel.org/r/20180730101757.28058-2-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Shakeel Butt
8de7ecc648 memcg: reduce memcg tree traversals for stats collection
Currently cgroup-v1's memcg_stat_show traverses the memcg tree ~17 times
to collect the stats while cgroup-v2's memory_stat_show traverses the
memcg tree thrice.  On a large machine, a couple thousand memcgs is very
normal and if the churn is high and memcgs stick around during to several
reasons, tens of thousands of nodes in memcg tree can exist.  This patch
has refactored and shared the stat collection code between cgroup-v1 and
cgroup-v2 and has reduced the tree traversal to just one.

I ran a simple benchmark which reads the root_mem_cgroup's stat file
1000 times in the presense of 2500 memcgs on cgroup-v1. The results are:

Without the patch:
$ time ./read-root-stat-1000-times

real    0m1.663s
user    0m0.000s
sys     0m1.660s

With the patch:
$ time ./read-root-stat-1000-times

real    0m0.468s
user    0m0.000s
sys     0m0.467s

Link: http://lkml.kernel.org/r/20180724224635.143944-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Bruce Merry <bmerry@ska.ac.za>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Jiang Biao
1c4c3b99c0 mm: fix page_freeze_refs and page_unfreeze_refs in comments
page_freeze_refs/page_unfreeze_refs have already been relplaced by
page_ref_freeze/page_ref_unfreeze , but they are not modified in the
comments.

Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Kees Cook
8c9a134cae mm: clarify CONFIG_PAGE_POISONING and usage
The Kconfig text for CONFIG_PAGE_POISONING doesn't mention that it has to
be enabled explicitly.  This updates the documentation for that and adds a
note about CONFIG_PAGE_POISONING to the "page_poison" command line docs.
While here, change description of CONFIG_PAGE_POISONING_ZERO too, as it's
not "random" data, but rather the fixed debugging value that would be used
when not zeroing.  Additionally removes a stray "bool" in the Kconfig.

Link: http://lkml.kernel.org/r/20180725223832.GA43733@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Andrew Morton
a670468f5e mm: zero out the vma in vma_init()
Rather than in vm_area_alloc().  To ensure that the various oddball
stack-based vmas are in a good state.  Some of the callers were zeroing
them out, others were not.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Mike Rapoport
a3bf6ce366 mm/mempool.c: add missing parameter description
The kernel-doc for mempool_init function is missing the description of the
pool parameter.  Add it.

Link: http://lkml.kernel.org/r/1532336274-26228-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Michal Hocko
431f42fdfd mm/oom_kill.c: clean up oom_reap_task_mm()
Andrew has noticed some inconsistencies in oom_reap_task_mm.  Notably

 - Undocumented return value.

 - comment "failed to reap part..." is misleading - sounds like it's
   referring to something which happened in the past, is in fact
   referring to something which might happen in the future.

 - fails to call trace_finish_task_reaping() in one case

 - code duplication.

 - Increases mmap_sem hold time a little by moving
   trace_finish_task_reaping() inside the locked region.  So sue me ;)

 - Sharing the finish: path means that the trace event won't
   distinguish between the two sources of finishing.

Add a short explanation for the return value and fix the rest by
reorganizing the function a bit to have unified function exit paths.

Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Rodrigo Freire
c3b78b11ef mm, oom: describe task memory unit, larger PID pad
The default page memory unit of OOM task dump events might not be
intuitive and potentially misleading for the non-initiated when debugging
OOM events: These are pages and not kBs.  Add a small printk prior to the
task dump informing that the memory units are actually memory _pages_.

Also extends PID field to align on up to 7 characters.
Reference https://lkml.org/lkml/2018/7/3/1201

Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com
Signed-off-by: Rodrigo Freire <rfreire@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Michal Hocko
af5679fbc6 mm, oom: remove oom_lock from oom_reaper
oom_reaper used to rely on the oom_lock since e2fe14564d ("oom_reaper:
close race with exiting task").  We do not really need the lock anymore
though.  2129258024 ("mm: oom: let oom_reap_task and exit_mmap run
concurrently") has removed serialization with the exit path based on the
mm reference count and so we do not really rely on the oom_lock anymore.

Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
to prevent from races when the page allocator didn't manage to get the
freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
on and move on to another victim.  Although this is possible in principle
let's wait for it to actually happen in real life before we make the
locking more complex again.

Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
oom_reap_task_mm).  The reaper serializes with exit_mmap by mmap_sem +
MMF_OOM_SKIP flag.  There is no synchronization with out_of_memory path
now.

[mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did]
  Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Michal Hocko
93065ac753 mm, oom: distinguish blockable mode for mmu notifiers
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep.  That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.

We can do much better though.  Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.  Moreover
majority of notifiers only care about a portion of the address space and
there is absolutely zero reason to fail when we are unmapping an unrelated
range.  Many notifiers do really block and wait for HW which is harder to
handle and we have to bail out though.

This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false.  This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.

I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that.  The first
part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode.  A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom.  This can be done e.g.  after the test faults in all
the mmu notifier managed memory and set the hard limit to something really
small.  Then we are looking for a proper process tear down.

[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Huang Ying
c2343d2761 mm/swapfile.c: put_swap_page: share more between huge/normal code path
In this patch, locking related code is shared between huge/normal code
path in put_swap_page() to reduce code duplication. The `free_entries == 0`
case is merged into the more general `free_entries != SWAPFILE_CLUSTER`
case, because the new locking method makes it easy.

The added lines is same as the removed lines.  But the code size is
increased when CONFIG_TRANSPARENT_HUGEPAGE=n.

		text	   data	    bss	    dec	    hex	filename
base:	       24123	   2004	    340	  26467	   6763	mm/swapfile.o
unified:       24485	   2004	    340	  26829	   68cd	mm/swapfile.o

Dig on step deeper with `size -A mm/swapfile.o` for base and unified
kernel and compare the result, yields,

  -.text                                17723      0
  +.text                                17835      0
  -.orc_unwind_ip                        1380      0
  +.orc_unwind_ip                        1480      0
  -.orc_unwind                           2070      0
  +.orc_unwind                           2220      0
  -Total                                26686
  +Total                                27048

The total difference is the same.  The text segment difference is much
smaller: 112.  More difference comes from the ORC unwinder segments:
(1480 + 2220) - (1380 + 2070) = 250.  If the frame pointer unwinder is
used, this costs nothing.

Link: http://lkml.kernel.org/r/20180720071845.17920-9-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Huang Ying
b32d5f32b9 mm/swapfile.c: add __swap_entry_free_locked()
The part of __swap_entry_free() with lock held is separated into a new
function __swap_entry_free_locked().  Because we want to reuse that
piece of code in some other places.

Just mechanical code refactoring, there is no any functional change in
this function.

Link: http://lkml.kernel.org/r/20180720071845.17920-8-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Huang Ying
5d5e8f1954 mm, swap, get_swap_pages: use entry_size instead of cluster in parameter
As suggested by Matthew Wilcox, it is better to use "int entry_size"
instead of "bool cluster" as parameter to specify whether to operate for
huge or normal swap entries.  Because this improve the flexibility to
support other swap entry size.  And Dave Hansen thinks that this
improves code readability too.

So in this patch, the "bool cluster" parameter of get_swap_pages() is
replaced by "int entry_size".

And nr_swap_entries() trick is used to reduce the binary size when
!CONFIG_TRANSPARENT_HUGE_PAGE.

       text	   data	    bss	    dec	    hex	filename
base  24215	   2028	    340	  26583	   67d7	mm/swapfile.o
head  24123	   2004	    340	  26467	   6763	mm/swapfile.o

Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Huang Ying
a448f2d07f mm/swapfile.c: unify normal/huge code path in put_swap_page()
In this patch, the normal/huge code path in put_swap_page() and several
helper functions are unified to avoid duplicated code, bugs, etc.  and
make it easier to review the code.

The removed lines are more than added lines.  And the binary size is
kept exactly same when CONFIG_TRANSPARENT_HUGEPAGE=n.

Link: http://lkml.kernel.org/r/20180720071845.17920-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Huang Ying
33ee011e56 mm/swapfile.c: unify normal/huge code path in swap_page_trans_huge_swapped()
As suggested by Dave, we should unify the code path for normal and huge
swap support if possible to avoid duplicated code, bugs, etc.  and make
it easier to review code.

In this patch, the normal/huge code path in
swap_page_trans_huge_swapped() is unified, the added and removed lines
are same.  And the binary size is kept almost same when
CONFIG_TRANSPARENT_HUGEPAGE=n.

		 text	   data	    bss	    dec	    hex	filename
base:		24179	   2028	    340	  26547	   67b3	mm/swapfile.o
unified:	24215	   2028	    340	  26583	   67d7	mm/swapfile.o

Link: http://lkml.kernel.org/r/20180720071845.17920-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Huang Ying
afa4711ef1 mm/swapfile.c: use swap_count() in swap_page_trans_huge_swapped()
In swap_page_trans_huge_swapped(), to identify whether there's any page
table mapping for a 4k sized swap entry, "si->swap_map[i] !=
SWAP_HAS_CACHE" is used.  This works correctly now, because all users of
the function will only call it after checking SWAP_HAS_CACHE.  But as
pointed out by Daniel, it is better to use "swap_count(map[i])" here,
because it works for "map[i] == 0" case too.

And this makes the implementation more consistent between normal and
huge swap entry.

Link: http://lkml.kernel.org/r/20180720071845.17920-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Huang Ying
fe5266d5d5 mm/swapfile.c: replace some #ifdef with IS_ENABLED()
In mm/swapfile.c, THP (Transparent Huge Page) swap specific code is
enclosed by #ifdef CONFIG_THP_SWAP/#endif to avoid code dilating when
THP isn't enabled.  But #ifdef/#endif in .c file hurt the code
readability, so Dave suggested to use IS_ENABLED(CONFIG_THP_SWAP)
instead and let compiler to do the dirty job for us.  This has potential
to remove some duplicated code too.  From output of `size`,

		text	   data	    bss	    dec	    hex	filename
THP=y:         26269	   2076	    340	  28685	   700d	mm/swapfile.o
ifdef/endif:   24115	   2028	    340	  26483	   6773	mm/swapfile.o
IS_ENABLED:    24179	   2028	    340	  26547	   67b3	mm/swapfile.o

IS_ENABLED() based solution works quite well, almost as good as that of
#ifdef/#endif.  And from the diffstat, the removed lines are more than
added lines.

One #ifdef for split_swap_cluster() is kept.  Because it is a public
function with a stub implementation for CONFIG_THP_SWAP=n in swap.h.

Link: http://lkml.kernel.org/r/20180720071845.17920-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Huang Ying
59d98bf3c2 mm: swap: add comments to lock_cluster_or_swap_info()
Patch series "swap: THP optimizing refactoring", v4.

Now the THP (Transparent Huge Page) swap optimizing is implemented in the
way like below,

  #ifdef CONFIG_THP_SWAP
  huge_function(...)
  {
  }
  #else
  normal_function(...)
  {
  }
  #endif

  general_function(...)
  {
  	if (huge)
  		return thp_function(...);
	else
  		return normal_function(...);
  }

As pointed out by Dave Hansen, this will,

1. Create a new, wholly untested code path for huge page
2. Create two places to patch bugs
3. Are not reusing code when possible

This patchset is to address these problems via merging huge/normal code
path/functions if possible.

One concern is that this may cause code size to dilate when
!CONFIG_TRANSPARENT_HUGEPAGE.  The data shows that most refactoring will
only cause quite slight code size increase.

This patch (of 8):

To improve code readability.

Link: http://lkml.kernel.org/r/20180720071845.17920-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Kirill Tkhai
8df4a44cc4 mm: check shrinker is memcg-aware in register_shrinker_prepared()
There is a sad BUG introduced in patch adding SHRINKER_REGISTERING.
shrinker_idr business is only for memcg-aware shrinkers.  Only such type
of shrinkers have id and they must be finaly installed via idr_replace()
in this function.  For !memcg-aware shrinkers we never initialize
shrinker->id field.

But there are all types of shrinkers passed to idr_replace(), and every
!memcg-aware shrinker with random ID (most probably, its id is 0)
replaces memcg-aware shrinker pointed by the ID in IDR.

This patch fixes the problem.

Link: http://lkml.kernel.org/r/8ff8a793-8211-713a-4ed9-d6e52390c2fc@virtuozzo.com
Fixes: 7e010df53c "mm: use special value SHRINKER_REGISTERING instead of list_empty() check"
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reported-by: <syzbot+d5f648a1bfe15678786b@syzkaller.appspotmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <syzkaller-bugs@googlegroups.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Linus Torvalds
0214f46b3a Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull core signal handling updates from Eric Biederman:
 "It was observed that a periodic timer in combination with a
  sufficiently expensive fork could prevent fork from every completing.
  This contains the changes to remove the need for that restart.

  This set of changes is split into several parts:

   - The first part makes PIDTYPE_TGID a proper pid type instead
     something only for very special cases. The part starts using
     PIDTYPE_TGID enough so that in __send_signal where signals are
     actually delivered we know if the signal is being sent to a a group
     of processes or just a single process.

   - With that prep work out of the way the logic in fork is modified so
     that fork logically makes signals received while it is running
     appear to be received after the fork completes"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
  signal: Don't send signals to tasks that don't exist
  signal: Don't restart fork when signals come in.
  fork: Have new threads join on-going signal group stops
  fork: Skip setting TIF_SIGPENDING in ptrace_init_task
  signal: Add calculate_sigpending()
  fork: Unconditionally exit if a fatal signal is pending
  fork: Move and describe why the code examines PIDNS_ADDING
  signal: Push pid type down into complete_signal.
  signal: Push pid type down into __send_signal
  signal: Push pid type down into send_signal
  signal: Pass pid type into do_send_sig_info
  signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
  signal: Pass pid type into group_send_sig_info
  signal: Pass pid and pid type into send_sigqueue
  posix-timers: Noralize good_sigevent
  signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  pid: Implement PIDTYPE_TGID
  pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  kvm: Don't open code task_pid in kvm_vcpu_ioctl
  pids: Compute task_tgid using signal->leader_pid
  ...
2018-08-21 13:47:29 -07:00
Colin Ian King
1e92641929 mm/hmm.c: remove unused variables align_start and align_end
Variables align_start and align_end are being assigned but are never
used hence they are redundant and can be removed.

Cleans up clang warnings:
  warning: variable 'align_start' set but not used [-Wunused-but-set-variable]
  warning: variable 'align_size' set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/20180714161124.3923-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:33 -07:00
David Rientjes
ddbf369c0a mm, vmacache: hash addresses based on pmd
When perf profiling a wide variety of different workloads, it was found
that vmacache_find() had higher than expected cost: up to 0.08% of cpu
utilization in some cases.  This was found to rival other core VM
functions such as alloc_pages_vma() with thp enabled and default
mempolicy, and the conditionals in __get_vma_policy().

VMACACHE_HASH() determines which of the four per-task_struct slots a vma
is cached for a particular address.  This currently depends on the pfn,
so pfn 5212 occupies a different vmacache slot than its neighboring pfn
5213.

vmacache_find() iterates through all four of current's vmacache slots
when looking up an address.  Hashing based on pfn, an address has
~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
about 25%, *if* the vma is cached.

This patch hashes an address by its pmd instead of pte to optimize for
workloads with good spatial locality.  This results in a higher
probability of vmas being cached in the first slot that is checked:
normally ~70% on the same workloads instead of 25%.

[rientjes@google.com: various updates]
  Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807231532290.109445@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807091749150.114630@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Sebastian Andrzej Siewior
6b51e88199 mm/list_lru: introduce list_lru_shrink_walk_irq()
Provide list_lru_shrink_walk_irq() and let it behave like
list_lru_walk_one() except that it locks the spinlock with
spin_lock_irq().  This is used by scan_shadow_nodes() because its lock
nests within the i_pages lock which is acquired with IRQ.  This change
allows to use proper locking promitives instead hand crafted
lock_irq_disable() plus spin_lock().

There is no EXPORT_SYMBOL provided because the current user is in-kernel
only.

Add list_lru_shrink_walk_irq() which acquires the spinlock with the
proper locking primitives.

Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Sebastian Andrzej Siewior
6e018968f8 mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
__list_lru_walk_one() is invoked with struct list_lru *lru, int nid as
the first two argument.  Those two are only used to retrieve struct
list_lru_node.  Since this is already done by the caller of the function
for the locking, we can pass struct list_lru_node* directly and avoid
the dance around it.

Link: http://lkml.kernel.org/r/20180716111921.5365-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Sebastian Andrzej Siewior
6cfe57a96b mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
Move the locking inside __list_lru_walk_one() to its caller.  This is a
preparation step in order to introduce list_lru_walk_one_irq() which
does spin_lock_irq() instead of spin_lock() for the locking.

Link: http://lkml.kernel.org/r/20180716111921.5365-3-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Sebastian Andrzej Siewior
87a5ffc163 mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
Patch series "mm/list_lru: Add list_lru_shrink_walk_irq() and a user".

This series removes the local_irq_disable() around
list_lru_shrink_walk() (as used by mm/workingset) by adding
list_lru_shrink_walk_irq().

Vladimir Davydov preferred this over `irq' argument which I added to
struct list_lru.

The initial post (of this series) received a Reviewed-by tag by Vladimir
Davydov which I added to each patch of the series.  The series applies
on top of akpm's tree which has Kirill's shrink_slab series and does not
clash with it (akpm asked me to wait a week or so and repost it then).

I tested the code paths by triggering the OOM-killer via memory over
commit and lockdep did not complain (nor did I see any warnings).

This patch (of 4):

list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the
memcg_idx parameter.  The same can be achieved by list_lru_walk_one() and
passing NULL as memcg argument which then gets converted into -1.  This is
a preparation step when the spin_lock() function is lifted to the caller
of __list_lru_walk_one().  Invoke list_lru_walk_one() instead
__list_lru_walk_one() when possible.

Link: http://lkml.kernel.org/r/20180716111921.5365-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Huang Ying
14fef28414 mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
CONFIG_THP_SWAP should depend on CONFIG_SWAP, because it's unreasonable
to optimize swapping for THP (Transparent Huge Page) without basic
swapping support.

In original code, when CONFIG_SWAP=n and CONFIG_THP_SWAP=y,
split_swap_cluster() will not be built because it is in swapfile.c, but
it will be called in huge_memory.c.  This doesn't trigger a build error
in practice because the call site is enclosed by PageSwapCache(), which
is defined to be constant 0 when CONFIG_SWAP=n.  But this is fragile and
should be fixed.

The comments are fixed too to reflect the latest progress.

Link: http://lkml.kernel.org/r/20180713021228.439-1-ying.huang@intel.com
Fixes: 38d8b4e6bd ("mm, THP, swap: delay splitting THP during swap out")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin
2a3cb8baef mm/sparse: delete old sparse_init and enable new one
Rename new_sparse_init() to sparse_init() which enables it.  Delete old
sparse_init() and all the code that became obsolete with.

[pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
  Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin
85c77f7913 mm/sparse: add new sparse_init_nid() and sparse_init()
sparse_init() requires to temporary allocate two large buffers: usemap_map
and map_map.  Baoquan He has identified that these buffers are so large
that Linux is not bootable on small memory machines, such as a kdump boot.
The buffers are especially large when CONFIG_X86_5LEVEL is set, as they
are scaled to the maximum physical memory size.

Baoquan provided a fix, which reduces these sizes of these buffers, but it
is much better to get rid of them entirely.

Add a new way to initialize sparse memory: sparse_init_nid(), which only
operates within one memory node, and thus allocates memory either in large
contiguous block or allocates section by section.  This eliminates the
need for use of temporary buffers.

For simplified bisecting and review temporarly call sparse_init()
new_sparse_init(), the new interface is going to be enabled as well as old
code removed in the next patch.

Link: http://lkml.kernel.org/r/20180712203730.8703-5-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin
afda57bc13 mm/sparse: move buffer init/fini to the common place
Now that both variants of sparse memory use the same buffers to populate
memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the
common place.

Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin
e131c06b14 mm/sparse: use the new sparse buffer functions in non-vmemmap
non-vmemmap sparse also allocated large contiguous chunk of memory, and if
fails falls back to smaller allocations.  Use the same functions to
allocate buffer as the vmemmap-sparse

Link: http://lkml.kernel.org/r/20180712203730.8703-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Pavel Tatashin
35fd1eb1e8 mm/sparse: abstract sparse buffer allocations
Patch series "sparse_init rewrite", v6.

In sparse_init() we allocate two large buffers to temporary hold usemap
and memmap for the whole machine.  However, we can avoid doing that if
we changed sparse_init() to operated on per-node bases instead of doing
it on the whole machine beforehand.

As shown by Baoquan
  http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com

The buffers are large enough to cause machine stop to boot on small
memory systems.

Another benefit of these changes is that they also obsolete
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.

This patch (of 5):

When struct pages are allocated for sparse-vmemmap VA layout, we first try
to allocate one large buffer, and than if that fails allocate struct pages
for each section as we go.

The code that allocates buffer is uses global variables and is spread
across several call sites.

Cleanup the code by introducing three functions to handle the global
buffer:

sparse_buffer_init()	initialize the buffer
sparse_buffer_fini()	free the remaining part of the buffer
sparse_buffer_alloc()	alloc from the buffer, and if buffer is empty
return NULL

Define these functions in sparse.c instead of sparse-vmemmap.c because
later we will use them for non-vmemmap sparse allocations as well.

[akpm@linux-foundation.org: use PTR_ALIGN()]
[akpm@linux-foundation.org: s/BUG_ON/WARN_ON/]
Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Cannon Matthews
330d6e489a mm/hugetlb.c: don't zero 1GiB bootmem pages
When using 1GiB pages during early boot, use the new
memblock_virt_alloc_try_nid_raw() to allocate memory without zeroing it.
Zeroing out hundreds or thousands of GiB in a single core memset() call
is very slow, and can make early boot last upwards of 20-30 minutes on
multi TiB machines.

The memory does not need to be zero'd as the hugetlb pages are always
zero'd on page fault.

Tested: Booted with ~3800 1G pages, and it booted successfully in
roughly the same amount of time as with 0, as opposed to the 25+ minutes
it would take before.

Link: http://lkml.kernel.org/r/20180711213313.92481-1-cannonmatthews@google.com
Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Aaron Lu
d8a759b570 mm, page_alloc: double zone's batchsize
To improve page allocator's performance for order-0 pages, each CPU has
a Per-CPU-Pageset(PCP) per zone.  Whenever an order-0 page is needed,
PCP will be checked first before asking pages from Buddy.  When PCP is
used up, a batch of pages will be fetched from Buddy to improve
performance and the size of batch can affect performance.

zone's batch size gets doubled last time by commit ba56e91c9401("mm:
page_alloc: increase size of per-cpu-pages") over ten years ago.  Since
then, CPU has envolved a lot and CPU's cache sizes also increased.

Dave Hansen is concerned the current batch size doesn't fit well with
modern hardware and suggested me to do two things: first, use a page
allocator intensive benchmark, e.g.  will-it-scale/page_fault1 to find
out how performance changes with different batch sizes on various
machines and then choose a new default batch size; second, see how this
new batch size work with other workloads.

In the first test, we saw performance gains on high-core-count systems
and little to no effect on older systems with more modest core counts.
In this phase's test data, two candidates: 63 and 127 are chosen.

In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
and more will-it-scale sub-tests are tested to see how these two
candidates work with these workloads and decides a new default according
to their results.

Most test results are flat.  will-it-scale/page_fault2 process mode has
10%-18% performance increase on 4-sockets Skylake and Broadwell.
vm-scalability/lru-file-mmap-read has 17%-47% performance increase for
4-sockets servers while for 2-sockets servers, it caused 3%-8% performance
drop.  Further analysis showed that, with a larger pcp->batch and thus
larger pcp->high(the relationship of pcp->high=6 * pcp->batch is
maintained in this patch), zone lock contention shifted to LRU add side
lock contention and that caused performance drop.  This performance drop
might be mitigated by others' work on optimizing LRU lock.

Another downside of increasing pcp->batch is, when PCP is used up and need
to fetch a batch of pages from Buddy, since batch is increased, that time
can be longer than before.  My understanding is, this doesn't affect
slowpath where direct reclaim and compaction dominates.  For fastpath,
throughput is a win(according to will-it-scale/page_fault1) but worst
latency can be larger now.

Overall, I think double the batch size from 31 to 63 is relatively safe
and provide good performance boost for high-core-count systems.

The two phase's test results are listed below(all tests are done with THP
disabled).

Phase one(will-it-scale/page_fault1) test results:

Skylake-EX: increased batch size has a good effect on zone->lock
contention, though LRU contention will rise at the same time and
limited the final performance increase.

batch   score     change   zone_contention   lru_contention   total_contention
 31   15345900    +0.00%       64%                 8%           72%
 53   17903847   +16.67%       32%                38%           70%
 63   17992886   +17.25%       24%                45%           69%
 73   18022825   +17.44%       10%                61%           71%
119   18023401   +17.45%        4%                66%           70%
127   18029012   +17.48%        3%                66%           69%
137   18036075   +17.53%        4%                66%           70%
165   18035964   +17.53%        2%                67%           69%
188   18101105   +17.95%        2%                67%           69%
223   18130951   +18.15%        2%                67%           69%
255   18118898   +18.07%        2%                67%           69%
267   18101559   +17.96%        2%                67%           69%
299   18160468   +18.34%        2%                68%           70%
320   18139845   +18.21%        2%                67%           69%
393   18160869   +18.34%        2%                68%           70%
424   18170999   +18.41%        2%                68%           70%
458   18144868   +18.24%        2%                68%           70%
467   18142366   +18.22%        2%                68%           70%
498   18154549   +18.30%        1%                68%           69%
511   18134525   +18.17%        1%                69%           70%

Broadwell-EX: similar pattern as Skylake-EX.

batch   score     change   zone_contention   lru_contention   total_contention
 31   16703983    +0.00%       67%                 7%           74%
 53   18195393    +8.93%       43%                28%           71%
 63   18288885    +9.49%       38%                33%           71%
 73   18344329    +9.82%       35%                37%           72%
119   18535529   +10.96%       24%                46%           70%
127   18513596   +10.83%       23%                48%           71%
137   18514327   +10.84%       23%                48%           71%
165   18511840   +10.82%       22%                49%           71%
188   18593478   +11.31%       17%                53%           70%
223   18601667   +11.36%       17%                52%           69%
255   18774825   +12.40%       12%                58%           70%
267   18754781   +12.28%        9%                60%           69%
299   18892265   +13.10%        7%                63%           70%
320   18873812   +12.99%        8%                62%           70%
393   18891174   +13.09%        6%                64%           70%
424   18975108   +13.60%        6%                64%           70%
458   18932364   +13.34%        8%                62%           70%
467   18960891   +13.51%        5%                65%           70%
498   18944526   +13.41%        5%                64%           69%
511   18960839   +13.51%        5%                64%           69%

Skylake-EP: although increased batch reduced zone->lock contention, but
the effect is not as good as EX: zone->lock contention is still as high as
20% with a very high batch value instead of 1% on Skylake-EX or 5% on
Broadwell-EX.  Also, total_contention actually decreased with a higher
batch but that doesn't translate to performance increase.

batch   score    change   zone_contention   lru_contention   total_contention
 31   9554867    +0.00%       66%                 3%           69%
 53   9855486    +3.15%       63%                 3%           66%
 63   9980145    +4.45%       62%                 4%           66%
 73   10092774   +5.63%       62%                 5%           67%
119   10310061   +7.90%       45%                19%           64%
127   10342019   +8.24%       42%                19%           61%
137   10358182   +8.41%       42%                21%           63%
165   10397060   +8.81%       37%                24%           61%
188   10341808   +8.24%       34%                26%           60%
223   10349135   +8.31%       31%                27%           58%
255   10327189   +8.08%       28%                29%           57%
267   10344204   +8.26%       27%                29%           56%
299   10325043   +8.06%       25%                30%           55%
320   10310325   +7.91%       25%                31%           56%
393   10293274   +7.73%       21%                31%           52%
424   10311099   +7.91%       21%                32%           53%
458   10321375   +8.02%       21%                32%           53%
467   10303881   +7.84%       21%                32%           53%
498   10332462   +8.14%       20%                33%           53%
511   10325016   +8.06%       20%                32%           52%

Broadwell-EP: zone->lock and lru lock had an agreement to make sure
performance doesn't increase and they successfully managed to keep total
contention at 70%.

batch   score    change   zone_contention   lru_contention   total_contention
 31   10121178   +0.00%       19%                50%           69%
 53   10142366   +0.21%        6%                63%           69%
 63   10117984   -0.03%       11%                58%           69%
 73   10123330   +0.02%        7%                63%           70%
119   10108791   -0.12%        2%                67%           69%
127   10166074   +0.44%        3%                66%           69%
137   10141574   +0.20%        3%                66%           69%
165   10154499   +0.33%        2%                68%           70%
188   10124921   +0.04%        2%                67%           69%
223   10137399   +0.16%        2%                67%           69%
255   10143289   +0.22%        0%                68%           68%
267   10123535   +0.02%        1%                68%           69%
299   10140952   +0.20%        0%                68%           68%
320   10163170   +0.41%        0%                68%           68%
393   10000633   -1.19%        0%                69%           69%
424   10087998   -0.33%        0%                69%           69%
458   10187116   +0.65%        0%                69%           69%
467   10146790   +0.25%        0%                69%           69%
498   10197958   +0.76%        0%                69%           69%
511   10152326   +0.31%        0%                69%           69%

Haswell-EP: similar to Broadwell-EP.

batch   score   change   zone_contention   lru_contention   total_contention
 31   10442205   +0.00%       14%                48%           62%
 53   10442255   +0.00%        5%                57%           62%
 63   10452059   +0.09%        6%                57%           63%
 73   10482349   +0.38%        5%                59%           64%
119   10454644   +0.12%        3%                60%           63%
127   10431514   -0.10%        3%                59%           62%
137   10423785   -0.18%        3%                60%           63%
165   10481216   +0.37%        2%                61%           63%
188   10448755   +0.06%        2%                61%           63%
223   10467144   +0.24%        2%                61%           63%
255   10480215   +0.36%        2%                61%           63%
267   10484279   +0.40%        2%                61%           63%
299   10466450   +0.23%        2%                61%           63%
320   10452578   +0.10%        2%                61%           63%
393   10499678   +0.55%        1%                62%           63%
424   10481454   +0.38%        1%                62%           63%
458   10473562   +0.30%        1%                62%           63%
467   10484269   +0.40%        0%                62%           62%
498   10505599   +0.61%        0%                62%           62%
511   10483395   +0.39%        0%                62%           62%

Westmere-EP: contention is pretty small so not interesting.  Note too high
a batch value could hurt performance.

batch   score   change   zone_contention   lru_contention   total_contention
 31   4831523   +0.00%        2%                 3%            5%
 53   4834086   +0.05%        2%                 4%            6%
 63   4834262   +0.06%        2%                 3%            5%
 73   4832851   +0.03%        2%                 4%            6%
119   4830534   -0.02%        1%                 3%            4%
127   4827461   -0.08%        1%                 4%            5%
137   4827459   -0.08%        1%                 3%            4%
165   4820534   -0.23%        0%                 4%            4%
188   4817947   -0.28%        0%                 3%            3%
223   4809671   -0.45%        0%                 3%            3%
255   4802463   -0.60%        0%                 4%            4%
267   4801634   -0.62%        0%                 3%            3%
299   4798047   -0.69%        0%                 3%            3%
320   4793084   -0.80%        0%                 3%            3%
393   4785877   -0.94%        0%                 3%            3%
424   4782911   -1.01%        0%                 3%            3%
458   4779346   -1.08%        0%                 3%            3%
467   4780306   -1.06%        0%                 3%            3%
498   4780589   -1.05%        0%                 3%            3%
511   4773724   -1.20%        0%                 3%            3%

Skylake-Desktop: similar to Westmere-EP, nothing interesting.

batch   score   change   zone_contention   lru_contention   total_contention
 31   3906608   +0.00%        2%                 3%            5%
 53   3940164   +0.86%        2%                 3%            5%
 63   3937289   +0.79%        2%                 3%            5%
 73   3940201   +0.86%        2%                 3%            5%
119   3933240   +0.68%        2%                 3%            5%
127   3930514   +0.61%        2%                 4%            6%
137   3938639   +0.82%        0%                 3%            3%
165   3908755   +0.05%        0%                 3%            3%
188   3905621   -0.03%        0%                 3%            3%
223   3903015   -0.09%        0%                 4%            4%
255   3889480   -0.44%        0%                 3%            3%
267   3891669   -0.38%        0%                 4%            4%
299   3898728   -0.20%        0%                 4%            4%
320   3894547   -0.31%        0%                 4%            4%
393   3875137   -0.81%        0%                 4%            4%
424   3874521   -0.82%        0%                 3%            3%
458   3880432   -0.67%        0%                 4%            4%
467   3888715   -0.46%        0%                 3%            3%
498   3888633   -0.46%        0%                 4%            4%
511   3875305   -0.80%        0%                 5%            5%

Haswell-Desktop: zone->lock is pretty low as other desktops, though lru
contention is higher than other desktops.

batch   score   change   zone_contention   lru_contention   total_contention
 31   3511158   +0.00%        2%                 5%            7%
 53   3555445   +1.26%        2%                 6%            8%
 63   3561082   +1.42%        2%                 6%            8%
 73   3547218   +1.03%        2%                 6%            8%
119   3571319   +1.71%        1%                 7%            8%
127   3549375   +1.09%        0%                 6%            6%
137   3560233   +1.40%        0%                 6%            6%
165   3555176   +1.25%        2%                 6%            8%
188   3551501   +1.15%        0%                 8%            8%
223   3531462   +0.58%        0%                 7%            7%
255   3570400   +1.69%        0%                 7%            7%
267   3532235   +0.60%        1%                 8%            9%
299   3562326   +1.46%        0%                 6%            6%
320   3553569   +1.21%        0%                 8%            8%
393   3539519   +0.81%        0%                 7%            7%
424   3549271   +1.09%        0%                 8%            8%
458   3528885   +0.50%        0%                 8%            8%
467   3526554   +0.44%        0%                 7%            7%
498   3525302   +0.40%        0%                 9%            9%
511   3527556   +0.47%        0%                 8%            8%

Sandybridge-Desktop: the 0% contention isn't accurate but caused by
dropped fractional part. Since multiple contention path's contentions
are all under 1% here, with some arithmetic operations like add, the
final deviation could be as large as 3%.

batch   score   change   zone_contention   lru_contention   total_contention
 31   1744495   +0.00%        0%                 0%            0%
 53   1755341   +0.62%        0%                 0%            0%
 63   1758469   +0.80%        0%                 0%            0%
 73   1759626   +0.87%        0%                 0%            0%
119   1770417   +1.49%        0%                 0%            0%
127   1768252   +1.36%        0%                 0%            0%
137   1767848   +1.34%        0%                 0%            0%
165   1765088   +1.18%        0%                 0%            0%
188   1766918   +1.29%        0%                 0%            0%
223   1767866   +1.34%        0%                 0%            0%
255   1768074   +1.35%        0%                 0%            0%
267   1763187   +1.07%        0%                 0%            0%
299   1765620   +1.21%        0%                 0%            0%
320   1767603   +1.32%        0%                 0%            0%
393   1764612   +1.15%        0%                 0%            0%
424   1758476   +0.80%        0%                 0%            0%
458   1758593   +0.81%        0%                 0%            0%
467   1757915   +0.77%        0%                 0%            0%
498   1753363   +0.51%        0%                 0%            0%
511   1755548   +0.63%        0%                 0%            0%

Phase two test results:
Note: all percent change is against base(batch=31).

ebizzy.throughput (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
lkp-sb02          98937          98937    +0.0%       99030 +0.1%

kbuild.buildtime (less is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%

netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1%
lk-bdw-ep2       365764        365933  +0.0%        365951 +0.1%
lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
lkp-sb02          29990         30137  +0.5%         30103 +0.4%

oltp.transactions (higer is better)

machine         batch=31      batch=63             batch=127
lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%

pigz.throughput (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%

will-it-scale.malloc1.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
lkp-sb02          589286       589284 -0.0%         588101 -0.2%

will-it-scale.malloc1.threads (higher is better)
machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%

will-it-scale.malloc2.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%

will-it-scale.malloc2.threads (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%

will-it-scale.page_fault2.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
lkp-sb02          842616        837844  -0.6%         833921  -1.0%

will-it-scale.page_fault2.threads

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
lkp-sb02          750626        748615    -0.3%      746621    -0.5%

will-it-scale.page_fault3.processes (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%

will-it-scale.page_fault3.threads (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%

will-it-scale.read1.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
lkp-skl-d01     10969487      10972650     +0.0%    no data
lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%

will-it-scale.read1.threads (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%

will-it-scale.write1.processes (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%

will-it-scale.write1.threads (higer is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
lkp-skl-d01       8346174       8235695     -1.3%     no data
lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%

vm-scalability.anon-r-rand.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
lkp-sb02          570572        567854     -0.5%     568449     -0.4%

vm-scalability.anon-r-rand-mt.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
lkp-sb02          567137        567346     +0.0%     566144     -0.2%

vm-scalability.lru-file-mmap-read.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%

vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)

machine         batch=31      batch=63             batch=127
lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
lkp-sb02          258939        258787  -0.1%        258199  -0.3%

Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Michal Hocko
a195d3f5b7 mm/oom_kill.c: document oom_lock
Add comments describing oom_lock's scope.

Requested-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Mike Kravetz
40d18ebffb mm/hugetlb: remove gigantic page support for HIGHMEM
This reverts ee8f248d26 ("hugetlb: add phys addr to struct
huge_bootmem_page").

At one time powerpc used this field and supporting code.  However that
was removed with commit 79cc38ded1 ("powerpc/mm/hugetlb: Add support
for reserving gigantic huge pages via kernel command line").

There are no users of this field and supporting code, so remove it.

Link: http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Michal Hocko
9bfe5ded05 mm, oom: remove sleep from under oom_lock
Tetsuo has pointed out that since 27ae357fa8 ("mm, oom: fix concurrent
munlock and oom reaper unmap, v3") we have a strong synchronization
between the oom_killer and victim's exiting because both have to take
the oom_lock.  Therefore the original heuristic to sleep for a short
time in out_of_memory doesn't serve the original purpose.

Moreover Tetsuo has noticed that the short sleep can be more harmful
than actually useful.  Hammering the system with many processes can lead
to a starvation when the task holding the oom_lock can block for a long
time (minutes) and block any further progress because the oom_reaper
depends on the oom_lock as well.

Drop the short sleep from out_of_memory when we hold the lock.  Keep the
sleep when the trylock fails to throttle the concurrent OOM paths a bit.
This should be solved in a more reasonable way (e.g.  sleep proportional
to the time spent in the active reclaiming etc.) but this is much more
complex thing to achieve.  This is a quick fixup to remove a stale code.

Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Marek Szyprowski
6518202970 mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so
convert gfp_mask parameter to boolean no_warn parameter.

This will help to avoid giving false feeling that this function supports
standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer,
what has already been an issue: see commit dd65a941f6 ("arm64:
dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").

Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michał Nazarewicz <mina86@mina86.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Rik van Riel
50c150f262 Revert "mm: always flush VMA ranges affected by zap_page_range"
There was a bug in Linux that could cause madvise (and mprotect?) system
calls to return to userspace without the TLB having been flushed for all
the pages involved.

This could happen when multiple threads of a process made simultaneous
madvise and/or mprotect calls.

This was noticed in the summer of 2017, at which time two solutions
were created:

  56236a5955 ("mm: refactor TLB gathering API")
  99baac21e4 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
and
  4647706ebe ("mm: always flush VMA ranges affected by zap_page_range")

We need only one of these solutions, and the former appears to be a
little more efficient than the latter, so revert that one.

This reverts 4647706ebe ("mm: always flush VMA ranges affected by
zap_page_range")

Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Baoquan He
c98aff6493 mm/sparse: optimize memmap allocation during sparse_init()
In sparse_init(), two temporary pointer arrays, usemap_map and map_map
are allocated with the size of NR_MEM_SECTIONS.  They are used to store
each memory section's usemap and mem map if marked as present.  With the
help of these two arrays, continuous memory chunk is allocated for
usemap and memmap for memory sections on one node.  This avoids too many
memory fragmentations.  Like below diagram, '1' indicates the present
memory section, '0' means absent one.  The number 'n' could be much
smaller than NR_MEM_SECTIONS on most of systems.

  |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
  -------------------------------------------------------
   0 1 2 3         4 5         i   i+1     n-1   n

If we fail to populate the page tables to map one section's memmap, its
->section_mem_map will be cleared finally to indicate that it's not
present.  After use, these two arrays will be released at the end of
sparse_init().

In 4-level paging mode, each array costs 4M which can be ignorable.
While in 5-level paging, they costs 256M each, 512M altogether.  Kdump
kernel Usually only reserves very few memory, e.g 256M.  So, even thouth
they are temporarily allocated, still not acceptable.

In fact, there's no need to allocate them with the size of
NR_MEM_SECTIONS.  Since the ->section_mem_map clearing has been deferred
to the last, the number of present memory sections are kept the same
during sparse_init() until we finally clear out the memory section's
->section_mem_map if its usemap or memmap is not correctly handled.
Thus in the middle whenever for_each_present_section_nr() loop is taken,
the i-th present memory section is always the same one.

Here only allocate usemap_map and map_map with the size of
'nr_present_sections'.  For the i-th present memory section, install its
usemap and memmap to usemap_map[i] and mam_map[i] during allocation.
Then in the last for_each_present_section_nr() loop which clears the
failed memory section's ->section_mem_map, fetch usemap and memmap from
usemap_map[] and map_map[] array and set them into mem_section[]
accordingly.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oscar Salvador <osalvador@techadventures.net>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Baoquan He
9258631b33 mm/sparse.c: add a new parameter 'data_unit_size' for alloc_usemap_and_memmap
It's used to pass the size of map data unit into
alloc_usemap_and_memmap, and is preparation for next patch.

Link: http://lkml.kernel.org/r/20180228032657.32385-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Baoquan He
07a34a8c36 mm/sparsemem.c: defer the ms->section_mem_map clearing
In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, system
will allocate one continuous memory chunk for mem maps on one node and
populate the relevant page tables to map memory section one by one.  If
fail to populate for a certain mem section, print warning and its
->section_mem_map will be cleared to cancel the marking of being
present.  Like this, the number of mem sections marked as present could
become less during sparse_init() execution.

Here just defer the ms->section_mem_map clearing if failed to populate
its page tables until the last for_each_present_section_nr() loop.  This
is in preparation for later optimizing the mem map allocation.

[akpm@linux-foundation.org: remove now-unused local `ms', per Oscar]
Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Baoquan He
f2fc10e0b3 mm/sparse.c: add a static variable nr_present_sections
Patch series "mm/sparse: Optimize memmap allocation during
sparse_init()", v6.

In sparse_init(), two temporary pointer arrays, usemap_map and map_map
are allocated with the size of NR_MEM_SECTIONS.  They are used to store
each memory section's usemap and mem map if marked as present.  In
5-level paging mode, this will cost 512M memory though they will be
released at the end of sparse_init().  System with few memory, like
kdump kernel which usually only has about 256M, will fail to boot
because of allocation failure if CONFIG_X86_5LEVEL=y.

In this patchset, optimize the memmap allocation code to only use
usemap_map and map_map with the size of nr_present_sections.  This makes
kdump kernel boot up with normal crashkernel='' setting when
CONFIG_X86_5LEVEL=y.

This patch (of 5):

nr_present_sections is used to record how many memory sections are
marked as present during system boot up, and will be used in the later
patch.

Link: http://lkml.kernel.org/r/20180228032657.32385-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
7e010df53c mm: use special value SHRINKER_REGISTERING instead of list_empty() check
The patch introduces a special value SHRINKER_REGISTERING to use instead
of list_empty() to differ a registering shrinker from unregistered
shrinker.  Why we need that at all?

Shrinker registration is split in two parts.  The first one is
prealloc_shrinker(), which allocates shrinker memory and reserves ID in
shrinker_idr.  This function can fail.  The second is
register_shrinker_prepared(), and it finalizes the registration.  This
function actually makes shrinker available to be used from
shrink_slab(), and it can't fail.

One shrinker may be based on more then one LRU lists.  So, we never
clear the bit in memcg shrinker maps, when (one of) corresponding LRU
list becomes empty, since other LRU lists may be not empty.  See
superblock shrinker for example: it is based on two LRU lists:
s_inode_lru and s_dentry_lru.  We do not want to clear shrinker bit,
when there are no inodes in s_inode_lru, as s_dentry_lru may contain
dentries.

Instead of that, we use special algorithm to detect shrinkers having no
elements at all its LRU lists, and this is made in shrink_slab_memcg().
See the comment in this function for the details.

Also, in shrink_slab_memcg() we clear shrinker bit in the map, when we
meet unregistered shrinker (bit is set, while there is no a shrinker in
IDR).  Otherwise, we would have done that at the moment of shrinker
unregistration for all memcgs (and this looks worse, since iteration
over all memcg may take much time).  Also this would have imposed
restrictions on shrinker unregistration order for its users: they would
have had to guarantee, there are no new elements after
unregister_shrinker() (otherwise, a new added element would have set a
bit).

So, if we meet a set bit in map and no shrinker in IDR when we're
iterating over the map in shrink_slab_memcg(), this means the
corresponding shrinker is unregistered, and we must clear the bit.

Another case is shrinker registration.  We want two things there:

1) do_shrink_slab() can be called only for completely registered
   shrinkers;

2) shrinker internal lists may be populated in any order with
   register_shrinker_prepared() (let's talk on the example with sb).  Both
   of:

  a)list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu0]
    memcg_set_shrinker_bit();                               [cpu0]
    ...
    register_shrinker_prepared();                           [cpu1]

  and

  b)register_shrinker_prepared();                           [cpu0]
    ...
    list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu1]
    memcg_set_shrinker_bit();                               [cpu1]

   are legitimate.  We don't want to impose restriction here and to
   force people to use only (b) variant.  We don't want to force people to
   care, there is no elements in LRU lists before the shrinker is
   completely registered.  Internal users of LRU lists and shrinker code
   are two different subsystems, and they have to be closed in themselves
   each other.

In (a) case we have the bit set before shrinker is completely
registered.  We don't want do_shrink_slab() is called at this moment, so
we have to detect such the registering shrinkers.

Before this patch list_empty() (shrinker is not linked to the list)
check was used for that.  So, in (a) there could be a bit set, but we
don't call do_shrink_slab() unless shrinker is linked to the list.  It's
just an indicator, I just overloaded linking to the list.

This was not the best solution, since it's better not to touch the
shrinker memory from shrink_slab_memcg() before it's completely
registered (this also will be useful in the future to make shrink_slab()
completely lockless).

So, this patch introduces better way to detect registering shrinker,
which allows not to dereference shrinker memory.  It's just a ~0UL
value, which we insert into the IDR during ID allocation.  After
shrinker is ready to be used, we insert actual shrinker pointer in the
IDR, and it becomes available to shrink_slab_memcg().

We can't use NULL instead of this new value for this purpose as:
shrink_slab_memcg() already uses NULL to detect unregistered shrinkers,
and we don't want the function sees NULL and clears the bit, otherwise
(a) won't work.

This is the only thing the patch makes: the better way to detect
registering shrinker.  Nothing else this patch makes.

Also this gives a better assembler, but it's minor side of the patch:

Before:
  callq  <idr_find>
  mov    %rax,%r15
  test   %rax,%rax
  je     <shrink_slab_memcg+0x1d5>
  mov    0x20(%rax),%rax
  lea    0x20(%r15),%rdx
  cmp    %rax,%rdx
  je     <shrink_slab_memcg+0xbd>
  mov    0x8(%rsp),%edx
  mov    %r15,%rsi
  lea    0x10(%rsp),%rdi
  callq  <do_shrink_slab>

After:
  callq  <idr_find>
  mov    %rax,%r15
  lea    -0x1(%rax),%rax
  cmp    $0xfffffffffffffffd,%rax
  ja     <shrink_slab_memcg+0x1cd>
  mov    0x8(%rsp),%edx
  mov    %r15,%rsi
  lea    0x10(%rsp),%rdi
  callq  ffffffff810cefd0 <do_shrink_slab>

[ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()]
  Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com
Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
ac7fb3ad27 mm/vmscan.c: move check for SHRINKER_NUMA_AWARE to do_shrink_slab()
In case of shrink_slab_memcg() we do not zero nid, when shrinker is not
numa-aware.  This is not a real problem, since currently all memcg-aware
shrinkers are numa-aware too (we have two: super_block shrinker and
workingset shrinker), but something may change in the future.

Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
f90280d6b7 mm/vmscan.c: clear shrinker bit if there are no objects related to memcg
To avoid further unneed calls of do_shrink_slab() for shrinkers, which
already do not have any charged objects in a memcg, their bits have to
be cleared.

This patch introduces a lockless mechanism to do that without races
without parallel list lru add.  After do_shrink_slab() returns
SHRINK_EMPTY the first time, we clear the bit and call it once again.
Then we restore the bit, if the new return value is different.

Note, that single smp_mb__after_atomic() in shrink_slab_memcg() covers
two situations:

1)list_lru_add()     shrink_slab_memcg
    list_add_tail()    for_each_set_bit() <--- read bit
                         do_shrink_slab() <--- missed list update (no barrier)
    <MB>                 <MB>
    set_bit()            do_shrink_slab() <--- seen list update

This situation, when the first do_shrink_slab() sees set bit, but it
doesn't see list update (i.e., race with the first element queueing), is
rare.  So we don't add <MB> before the first call of do_shrink_slab()
instead of this to do not slow down generic case.  Also, it's need the
second call as seen in below in (2).

2)list_lru_add()      shrink_slab_memcg()
    list_add_tail()     ...
    set_bit()           ...
  ...                   for_each_set_bit()
  do_shrink_slab()        do_shrink_slab()
    clear_bit()           ...
  ...                     ...
  list_lru_add()          ...
    list_add_tail()       clear_bit()
    <MB>                  <MB>
    set_bit()             do_shrink_slab()

The barriers guarantee that the second do_shrink_slab() in the right
side task sees list update if really cleared the bit.  This case is
drawn in the code comment.

[Results/performance of the patchset]

After the whole patchset applied the below test shows signify increase
of performance:

  $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
  $mkdir /sys/fs/cgroup/memory/ct
  $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
      $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i;
			    echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
			    mkdir -p s/$i; mount -t tmpfs $i s/$i;
			    touch s/$i/file; done

Then, 5 sequential calls of drop caches:

  $time echo 3 > /proc/sys/vm/drop_caches

1)Before:
  0.00user 13.78system 0:13.78elapsed 99%CPU
  0.00user 5.59system 0:05.60elapsed 99%CPU
  0.00user 5.48system 0:05.48elapsed 99%CPU
  0.00user 8.35system 0:08.35elapsed 99%CPU
  0.00user 8.34system 0:08.35elapsed 99%CPU

2)After
  0.00user 1.10system 0:01.10elapsed 99%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU

The results show the performance increases at least in 548 times.

Shakeel Butt tested this patchset with fork-bomb on his configuration:

 > I created 255 memcgs, 255 ext4 mounts and made each memcg create a
 > file containing few KiBs on corresponding mount. Then in a separate
 > memcg of 200 MiB limit ran a fork-bomb.
 >
 > I ran the "perf record -ag -- sleep 60" and below are the results:
 >
 > Without the patch series:
 > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005
 > +  36.40%            fb.sh  [kernel.kallsyms]    [k] shrink_slab
 > +  18.97%            fb.sh  [kernel.kallsyms]    [k] list_lru_count_one
 > +   6.75%            fb.sh  [kernel.kallsyms]    [k] super_cache_count
 > +   0.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
 > +   0.44%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
 > +   0.27%            fb.sh  [kernel.kallsyms]    [k] up_read
 > +   0.21%            fb.sh  [kernel.kallsyms]    [k] osq_lock
 > +   0.13%            fb.sh  [kernel.kallsyms]    [k] shmem_unused_huge_count
 > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
 > +   0.08%            fb.sh  [kernel.kallsyms]    [k] shrink_node
 >
 > With the patch series:
 > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946
 > +  47.49%            fb.sh  [kernel.kallsyms]    [k] down_read_trylock
 > +  30.72%            fb.sh  [kernel.kallsyms]    [k] up_read
 > +   9.51%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_iter
 > +   1.69%            fb.sh  [kernel.kallsyms]    [k] shrink_node_memcg
 > +   1.35%            fb.sh  [kernel.kallsyms]    [k] mem_cgroup_protected
 > +   1.05%            fb.sh  [kernel.kallsyms]    [k] queued_spin_lock_slowpath
 > +   0.85%            fb.sh  [kernel.kallsyms]    [k] _raw_spin_lock
 > +   0.78%            fb.sh  [kernel.kallsyms]    [k] lruvec_lru_size
 > +   0.57%            fb.sh  [kernel.kallsyms]    [k] shrink_node
 > +   0.54%            fb.sh  [kernel.kallsyms]    [k] queue_work_on
 > +   0.46%            fb.sh  [kernel.kallsyms]    [k] shrink_slab_memcg

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
9b996468cf mm: add SHRINK_EMPTY shrinker methods return value
We need to distinguish the situations when shrinker has very small
amount of objects (see vfs_pressure_ratio() called from
super_cache_count()), and when it has no objects at all.  Currently, in
the both of these cases, shrinker::count_objects() returns 0.

The patch introduces new SHRINK_EMPTY return value, which will be used
for "no objects at all" case.  It's is a refactoring mostly, as
SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
patch, and all the magic will happen in further.

Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Vladimir Davydov
aeed1d325d mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
The patch makes shrink_slab() be called for root_mem_cgroup in the same
way as it's called for the rest of cgroups.  This simplifies the logic
and improves the readability.

[ktkhai@virtuozzo.com: wrote changelog]
Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomain
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
b0dedc49a2 mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()
Using the preparations made in previous patches, in case of memcg
shrink, we may avoid shrinkers, which are not set in memcg's shrinkers
bitmap.  To do that, we separate iterations over memcg-aware and
!memcg-aware shrinkers, and memcg-aware shrinkers are chosen via
for_each_set_bit() from the bitmap.  In case of big nodes, having many
isolated environments, this gives significant performance growth.  See
next patches for the details.

Note that the patch does not respect to empty memcg shrinkers, since we
never clear the bitmap bits after we set it once.  Their shrinkers will
be called again, with no shrinked objects as result.  This functionality
is provided by next patches.

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
fae91d6d8b mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance
Introduce set_shrinker_bit() function to set shrinker-related bit in
memcg shrinker bitmap, and set the bit after the first item is added and
in case of reparenting destroyed memcg's items.

This will allow next patch to make shrinkers be called only, in case of
they have charged objects at the moment, and to improve shrink_slab()
performance.

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
dfd2f10ccf mm/memcontrol.c: export mem_cgroup_is_root()
This will be used in next patch.

Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
3b82c4dcc2 mm/list_lru.c: pass lru argument to memcg_drain_list_lru_node()
This is just refactoring to allow next patches to have lru pointer in
memcg_drain_list_lru_node().

Link: http://lkml.kernel.org/r/153063063164.1818.55009531386089350.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
9bec5c35bf mm/list_lru: pass dst_memcg argument to memcg_drain_list_lru_node()
This is just refactoring to allow the next patches to have dst_memcg
pointer in memcg_drain_list_lru_node().

Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
44bd4a4759 mm/list_lru.c: add memcg argument to list_lru_from_kmem()
This is just refactoring to allow the next patches to have memcg pointer
in list_lru_from_kmem().

Link: http://lkml.kernel.org/r/153063060664.1818.9541345386733498582.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
c92e8e10ca fs: propagate shrinker::id to list_lru
Add list_lru::shrinker_id field and populate it by registered shrinker
id.

This will be used to set correct bit in memcg shrinkers map by lru code
in next patches, after there appeared the first related to memcg element
in list_lru.

Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
39887653aa mm/workingset.c: refactor workingset_init()
Use prealloc_shrinker()/register_shrinker_prepared() instead of
register_shrinker().  This will be used in next patch.

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112550112.4097.16606173020912323761.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063057666.1818.17625951186610808734.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Kirill Tkhai
0a4465d340 mm, memcg: assign memcg-aware shrinkers bitmap to memcg
Imagine a big node with many cpus, memory cgroups and containers.  Let
we have 200 containers, every container has 10 mounts, and 10 cgroups.
All container tasks don't touch foreign containers mounts.  If there is
intensive pages write, and global reclaim happens, a writing task has to
iterate over all memcgs to shrink slab, before it's able to go to
shrink_page_list().

Iteration over all the memcg slabs is very expensive: the task has to
visit 200 * 10 = 2000 shrinkers for every memcg, and since there are
2000 memcgs, the total calls are 2000 * 2000 = 4000000.

So, the shrinker makes 4 million do_shrink_slab() calls just to try to
isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcg via
shrink_page_list().  I've observed a node spending almost 100% in
kernel, making useless iteration over already shrinked slab.

This patch adds bitmap of memcg-aware shrinkers to memcg.  The size of
the bitmap depends on bitmap_nr_ids, and during memcg life it's
maintained to be enough to fit bitmap_nr_ids shrinkers.  Every bit in
the map is related to corresponding shrinker id.

Next patches will maintain set bit only for really charged memcg.  This
will allow shrink_slab() to increase its performance in significant way.
See the last patch for the numbers.

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain
[ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()]
  Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com
Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Kirill Tkhai
b05706f100 mm/memcontrol.c: move up for_each_mem_cgroup{, _tree} defines
Next patch requires these defines are above their current position, so
here they are moved to declarations.

Link: http://lkml.kernel.org/r/153063055665.1818.5200425793649695598.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Kirill Tkhai
b4c2b231c3 mm: assign id to every memcg-aware shrinker
Introduce shrinker::id number, which is used to enumerate memcg-aware
shrinkers.  The number start from 0, and the code tries to maintain it
as small as possible.

This will be used to represent a memcg-aware shrinkers in memcg
shrinkers map.

Since all memcg-aware shrinkers are based on list_lru, which is
per-memcg in case of !CONFIG_MEMCG_KMEM only, the new functionality will
be under this config option.

[ktkhai@virtuozzo.com: v9]
  Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Kirill Tkhai
84c07d11aa mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB
Introduce new config option, which is used to replace repeating
CONFIG_MEMCG && !CONFIG_SLOB pattern.  Next patches add a little more
memcg+kmem related code, so let's keep the defines more clearly.

Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Kirill Tkhai
e0295238e5 mm/list_lru.c: combine code under the same define
Patch series "Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))", v8.

This patcheset solves the problem with slow shrink_slab() occuring on
the machines having many shrinkers and memory cgroups (i.e., with many
containers).  The problem is complexity of shrink_slab() is O(n^2) and
it grows too fast with the growth of containers numbers.

Let us have 200 containers, and every container has 10 mounts and 10
cgroups.  All container tasks are isolated, and they don't touch foreign
containers mounts.

In case of global reclaim, a task has to iterate all over the memcgs and
to call all the memcg-aware shrinkers for all of them.  This means, the
task has to visit 200 * 10 = 2000 shrinkers for every memcg, and since
there are 2000 memcgs, the total calls of do_shrink_slab() are 2000 *
2000 = 4000000.

4 million calls are not a number operations, which can takes 1 cpu
cycle.  E.g., super_cache_count() accesses at least two lists, and makes
arifmetical calculations.  Even, if there are no charged objects, we do
these calculations, and replaces cpu caches by read memory.  I observed
nodes spending almost 100% time in kernel, in case of intensive writing
and global reclaim.  The writer consumes pages fast, but it's need to
shrink_slab() before the reclaimer reached shrink pages function (and
frees SWAP_CLUSTER_MAX pages).  Even if there is no writing, the
iterations just waste the time, and slows reclaim down.

Let's see the small test below:

  $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy
  $mkdir /sys/fs/cgroup/memory/ct
  $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes
  $for i in `seq 0 4000`;
          do mkdir /sys/fs/cgroup/memory/ct/$i;
          echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs;
          mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file;
  done

Then, let's see drop caches time (5 sequential calls):

  $time echo 3 > /proc/sys/vm/drop_caches

  0.00user 13.78system 0:13.78elapsed 99%CPU
  0.00user 5.59system 0:05.60elapsed 99%CPU
  0.00user 5.48system 0:05.48elapsed 99%CPU
  0.00user 8.35system 0:08.35elapsed 99%CPU
  0.00user 8.34system 0:08.35elapsed 99%CPU

The last four calls don't actually shrink anything.  So, the iterations
over slab shrinkers take 5.48 seconds.  Not so good for scalability.

The patchset solves the problem by making shrink_slab() of O(n)
complexity.  There are following functional actions:

1) Assign id to every registered memcg-aware shrinker.

2) Maintain per-memcgroup bitmap of memcg-aware shrinkers, and set a
   shrinker-related bit after the first element is added to lru list
   (also, when removed child memcg elements are reparanted).

3) Split memcg-aware shrinkers and !memcg-aware shrinkers, and call a
   shrinker if its bit is set in memcg's shrinker bitmap.  (Also, there is
   a functionality to clear the bit, after last element is shrinked).

This gives significant performance increase.  The result after patchset
is applied:

  $time echo 3 > /proc/sys/vm/drop_caches

  0.00user 1.10system 0:01.10elapsed 99%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU
  0.00user 0.00system 0:00.01elapsed 64%CPU
  0.00user 0.01system 0:00.01elapsed 82%CPU

The results show the performance increases at least in 548 times.

So, the patchset makes shrink_slab() of less complexity and improves the
performance in such types of load I pointed.  This will give a profit in
case of !global reclaim case, since there also will be less
do_shrink_slab() calls.

This patch (of 17):

These two pairs of blocks of code are under the same #ifdef #else
#endif.

Link: http://lkml.kernel.org/r/153063052519.1818.9393587113056959488.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Roman Gushchin <guro@fb.com>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Waiman Long <longman@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Mike Rapoport
a36aab890c mm/memblock.c: replace u64 with phys_addr_t where appropriate
Most functions in memblock already use phys_addr_t to represent a
physical address with __memblock_free_late() being an exception.

This patch replaces u64 with phys_addr_t in __memblock_free_late() and
switches several format strings from %llx to %pa to avoid casting from
phys_addr_t to u64.

Link: http://lkml.kernel.org/r/1530637506-1256-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Oscar Salvador
4e40987f12 mm/sparse.c: make sparse_init_one_section void and remove check
sparse_init_one_section() is being called from two sites: sparse_init()
and sparse_add_one_section().  The former calls it from a
for_each_present_section_nr() loop, and the latter marks the section as
present before calling it.  This means that when
sparse_init_one_section() gets called, we already know that the section
is present.  So there is no point to double check that in the function.

This removes the check and makes the function void.

[ross.zwisler@linux.intel.com: fix error path in sparse_add_one_section]
  Link: http://lkml.kernel.org/r/20180706190658.6873-1-ross.zwisler@linux.intel.com
[ross.zwisler@linux.intel.com: simplification suggested by Oscar]
  Link: http://lkml.kernel.org/r/20180706223358.742-1-ross.zwisler@linux.intel.com
Link: http://lkml.kernel.org/r/20180702154325.12196-1-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Michal Hocko
29ef680ae7 memcg, oom: move out_of_memory back to the charge path
Commit 3812c8c8f3 ("mm: memcg: do not trap chargers with full
callstack on OOM") has changed the ENOMEM semantic of memcg charges.
Rather than invoking the oom killer from the charging context it delays
the oom killer to the page fault path (pagefault_out_of_memory).  This
in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
the corresponding memcg hits the hard limit and the memcg is is OOM.
This is behavior is inconsistent with !memcg case where the oom killer
is invoked from the allocation context and the allocator keeps retrying
until it succeeds.

The difference in the behavior is user visible.  mmap(MAP_POPULATE)
might result in not fully populated ranges while the mmap return code
doesn't tell that to the userspace.  Random syscalls might fail with
ENOMEM etc.

The primary motivation of the different memcg oom semantic was the
deadlock avoidance.  Things have changed since then, though.  We have an
async oom teardown by the oom reaper now and so we do not have to rely
on the victim to tear down its memory anymore.  Therefore we can return
to the original semantic as long as the memcg oom killer is not handed
over to the users space.

There is still one thing to be careful about here though.  If the oom
killer is not able to make any forward progress - e.g.  because there is
no eligible task to kill - then we have to bail out of the charge path
to prevent from same class of deadlocks.  We have basically two options
here.  Either we fail the charge with ENOMEM or force the charge and
allow overcharge.  The first option has been considered more harmful
than useful because rare inconsistencies in the ENOMEM behavior is hard
to test for and error prone.  Basically the same reason why the page
allocator doesn't fail allocations under such conditions.  The later
might allow runaways but those should be really unlikely unless somebody
misconfigures the system.  E.g.  allowing to migrate tasks away from the
memcg to a different unlimited memcg with move_charge_at_immigrate
disabled.

Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Mike Rapoport
d39f8fb4b7 mm: make DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM
The deferred memory initialization relies on section definitions, e.g
PAGES_PER_SECTION, that are only available when CONFIG_SPARSEMEM=y on
most architectures.

Initially DEFERRED_STRUCT_PAGE_INIT depended on explicit
ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT configuration option, but since
the commit 2e3ca40f03 ("mm: relax deferred struct page
requirements") this requirement was relaxed and now it is possible to
enable DEFERRED_STRUCT_PAGE_INIT on architectures that support
DISCONTINGMEM and NO_BOOTMEM which causes build failures.

For instance, setting SMP=y and DEFERRED_STRUCT_PAGE_INIT=y on arc
causes the following build failure:

    CC      mm/page_alloc.o
  mm/page_alloc.c: In function 'update_defer_init':
  mm/page_alloc.c:321:14: error: 'PAGES_PER_SECTION'
  undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
        (pfn & (PAGES_PER_SECTION - 1)) == 0) {
                ^~~~~~~~~~~~~~~~~
                USEC_PER_SEC
  mm/page_alloc.c:321:14: note: each undeclared identifier is reported only once for each function it appears in
  In file included from include/linux/cache.h:5:0,
                   from include/linux/printk.h:9,
                   from include/linux/kernel.h:14,
                   from include/asm-generic/bug.h:18,
                   from arch/arc/include/asm/bug.h:32,
                   from include/linux/bug.h:5,
                   from include/linux/mmdebug.h:5,
                   from include/linux/mm.h:9,
                   from mm/page_alloc.c:18:
  mm/page_alloc.c: In function 'deferred_grow_zone':
  mm/page_alloc.c:1624:52: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
    unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
                                                      ^
  include/uapi/linux/kernel.h:11:47: note: in definition of macro '__ALIGN_KERNEL_MASK'
   #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
                                                 ^~~~
  include/linux/kernel.h:58:22: note: in expansion of macro '__ALIGN_KERNEL'
   #define ALIGN(x, a)  __ALIGN_KERNEL((x), (a))
                        ^~~~~~~~~~~~~~
  mm/page_alloc.c:1624:34: note: in expansion of macro 'ALIGN'
    unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
                                    ^~~~~
  In file included from include/asm-generic/bug.h:18:0,
                   from arch/arc/include/asm/bug.h:32,
                   from include/linux/bug.h:5,
                   from include/linux/mmdebug.h:5,
                   from include/linux/mm.h:9,
                   from mm/page_alloc.c:18:
  mm/page_alloc.c: In function 'free_area_init_node':
  mm/page_alloc.c:6379:50: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                                    ^
  include/linux/kernel.h:812:22: note: in definition of macro '__typecheck'
     (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                        ^
  include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
    __builtin_choose_expr(__safe_cmp(x, y), \
                          ^~~~~~~~~~
  include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
   #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
                             ^~~~~~~~~~~~~
  mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                               ^~~~~
  include/linux/kernel.h:836:2: error: first argument to '__builtin_choose_expr' not a constant
    __builtin_choose_expr(__safe_cmp(x, y), \
    ^
  include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
   #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
                             ^~~~~~~~~~~~~
  mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t'
    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                               ^~~~~
  scripts/Makefile.build:317: recipe for target 'mm/page_alloc.o' failed

Let's make the DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM
as the systems that support DISCONTIGMEM do not seem to have that huge
amounts of memory that would make DEFERRED_STRUCT_PAGE_INIT relevant.

Link: http://lkml.kernel.org/r/1530279308-24988-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Andrey Ryabinin
0207df4fa1 kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN
KASAN learns about hotadded memory via the memory hotplug notifier.
devm_memremap_pages() intentionally skips calling memory hotplug
notifiers.  So KASAN doesn't know anything about new memory added by
devm_memremap_pages().  This causes a crash when KASAN tries to access
non-existent shadow memory:

 BUG: unable to handle kernel paging request at ffffed0078000000
 RIP: 0010:check_memory_region+0x82/0x1e0
 Call Trace:
  memcpy+0x1f/0x50
  pmem_do_bvec+0x163/0x720
  pmem_make_request+0x305/0xac0
  generic_make_request+0x54f/0xcf0
  submit_bio+0x9c/0x370
  submit_bh_wbc+0x4c7/0x700
  block_read_full_page+0x5ef/0x870
  do_read_cache_page+0x2b8/0xb30
  read_dev_sector+0xbd/0x3f0
  read_lba.isra.0+0x277/0x670
  efi_partition+0x41a/0x18f0
  check_partition+0x30d/0x5e9
  rescan_partitions+0x18c/0x840
  __blkdev_get+0x859/0x1060
  blkdev_get+0x23f/0x810
  __device_add_disk+0x9c8/0xde0
  pmem_attach_disk+0x9a8/0xf50
  nvdimm_bus_probe+0xf3/0x3c0
  driver_probe_device+0x493/0xbd0
  bus_for_each_drv+0x118/0x1b0
  __device_attach+0x1cd/0x2b0
  bus_probe_device+0x1ac/0x260
  device_add+0x90d/0x1380
  nd_async_device_register+0xe/0x50
  async_run_entry_fn+0xc3/0x5d0
  process_one_work+0xa0a/0x1810
  worker_thread+0x87/0xe80
  kthread+0x2d7/0x390
  ret_from_fork+0x3a/0x50

Add kasan_add_zero_shadow()/kasan_remove_zero_shadow() - post mm_init()
interface to map/unmap kasan_zero_page at requested virtual addresses.
And use it to add/remove the shadow memory for hotplugged/unplugged
device memory.

Link: http://lkml.kernel.org/r/20180629164932.740-1-aryabinin@virtuozzo.com
Fixes: 41e94a8513 ("add devm_memremap_pages")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Song Liu
50f8b92f21 mm: thp: pass correct vm_flags to hugepage_vma_check()
khugepaged_enter_vma_merge() passes a stale vma->vm_flags to
hugepage_vma_check().  The argument vm_flags contains the latest value.
Therefore, it is necessary to pass this vm_flags into
hugepage_vma_check().

With this bug, madvise(MADV_HUGEPAGE) for mmap files in shmem fails to
put memory in huge pages.  Here is an example of failed madvise():

   /* mount /dev/shm with huge=advise:
    *     mount -o remount,huge=advise /dev/shm */
   /* create file /dev/shm/huge */
   #define HUGE_FILE "/dev/shm/huge"

   fd = open(HUGE_FILE, O_RDONLY);
   ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
   ret = madvise(ptr, FILE_SIZE, MADV_HUGEPAGE);

madvise() will return 0, but this memory region is never put in huge
page (check from /proc/meminfo: ShmemHugePages).

Link: http://lkml.kernel.org/r/20180629181752.792831-1-songliubraving@fb.com
Fixes: 02b75dc8160d ("mm: thp: register mm for khugepaged when merging vma for shmem")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Andrey Ryabinin
a718e28f53 mm/fadvise.c: fix signed overflow UBSAN complaint
Signed integer overflow is undefined according to the C standard.  The
overflow in ksys_fadvise64_64() is deliberate, but since it is signed
overflow, UBSAN complains:

	UBSAN: Undefined behaviour in mm/fadvise.c:76:10
	signed integer overflow:
	4 + 9223372036854775805 cannot be represented in type 'long long int'

Use unsigned types to do math.  Unsigned overflow is defined so UBSAN
will not complain about it.  This patch doesn't change generated code.

[akpm@linux-foundation.org: add comment explaining the casts]
Link: http://lkml.kernel.org/r/20180629184453.7614-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: <icytxw@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Colin Ian King
31f21da181 mm/swap_slots.c: make swap_slots_cache_mutex and swap_slots_cache_enable_mutex static
The mutexes swap_slots_cache_mutex and swap_slots_cache_enable_mutex are
local to the source and do not need to be in global scope, so make them
static.

Cleans up sparse warnings:
  symbol 'swap_slots_cache_mutex' was not declared. Should it be static?
  symbol 'swap_slots_cache_enable_mutex' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20180624182536.4937-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Colin Ian King
4d0a5402f5 mm/zsmalloc.c: make several functions and a struct static
The functions zs_page_isolate, zs_page_migrate, zs_page_putback,
lock_zspage, trylock_zspage and structure zsmalloc_aops are local to
source and do not need to be in global scope, so make them static.

Cleans up sparse warnings:
  symbol 'zs_page_isolate' was not declared. Should it be static?
  symbol 'zs_page_migrate' was not declared. Should it be static?
  symbol 'zs_page_putback' was not declared. Should it be static?
  symbol 'zsmalloc_aops' was not declared. Should it be static?
  symbol 'lock_zspage' was not declared. Should it be static?
  symbol 'trylock_zspage' was not declared. Should it be static?

[arnd@arndb.de: hide unused lock_zspage]
  Link: http://lkml.kernel.org/r/20180706130924.3891230-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20180624213322.13776-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Greg Thelen
dcfe4df3d5 mm/page-writeback.c: update stale account_page_redirty() comment
Commit 93f78d8828 ("writeback: move backing_dev_info->bdi_stat[] into
bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in
account_page_redirty().  Update comment to track that change.

  BDI_DIRTIED => WB_DIRTIED
  BDI_WRITTEN => WB_WRITTEN

Link: http://lkml.kernel.org/r/20180625171526.173483-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Shakeel Butt
f745c6f5fe fs, mm: account buffer_head to kmemcg
The buffer_head can consume a significant amount of system memory and is
directly related to the amount of page cache.  In our production
environment we have observed that a lot of machines are spending a
significant amount of memory as buffer_head and can not be left as
system memory overhead.

Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the
allocation.  The buffer_heads can be allocated in a memcg different from
the memcg of the page for which buffer_heads are being allocated.  One
concrete example is memory reclaim.  The reclaim can trigger I/O of
pages of any memcg on the system.  So, the right way to charge
buffer_head is to extract the memcg from the page for which buffer_heads
are being allocated and then use targeted memcg charging API.

[shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging]
  Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Shakeel Butt
d46eb14b73 fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8.

The Linux kernel's memory cgroup allows limiting the memory usage of the
jobs running on the system to provide isolation between the jobs.  All
the kernel memory allocated in the context of the job and marked with
__GFP_ACCOUNT will also be included in the memory usage and be limited
by the job's limit.

The kernel memory can only be charged to the memcg of the process in
whose context kernel memory was allocated.  However there are cases
where the allocated kernel memory should be charged to the memcg
different from the current processes's memcg.  This patch series
contains two such concrete use-cases i.e.  fsnotify and buffer_head.

The fsnotify event objects can consume a lot of system memory for large
or unlimited queues if there is either no or slow listener.  The events
are allocated in the context of the event producer.  However they should
be charged to the event consumer.  Similarly the buffer_head objects can
be allocated in a memcg different from the memcg of the page for which
buffer_head objects are being allocated.

To solve this issue, this patch series introduces mechanism to charge
kernel memory to a given memcg.  In case of fsnotify events, the memcg
of the consumer can be used for charging and for buffer_head, the memcg
of the page can be charged.  For directed charging, the caller can use
the scope API memalloc_[un]use_memcg() to specify the memcg to charge
for all the __GFP_ACCOUNT allocations within the scope.

This patch (of 2):

A lot of memory can be consumed by the events generated for the huge or
unlimited queues if there is either no or slow listener.  This can cause
system level memory pressure or OOMs.  So, it's better to account the
fsnotify kmem caches to the memcg of the listener.

However the listener can be in a different memcg than the memcg of the
producer and these allocations happen in the context of the event
producer.  This patch introduces remote memcg charging API which the
producer can use to charge the allocations to the memcg of the listener.

There are seven fsnotify kmem caches and among them allocations from
dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
inotify_inode_mark_cachep happens in the context of syscall from the
listener.  So, SLAB_ACCOUNT is enough for these caches.

The objects from fsnotify_mark_connector_cachep are not accounted as
they are small compared to the notification mark or events and it is
unclear whom to account connector to since it is shared by all events
attached to the inode.

The allocations from the event caches happen in the context of the event
producer.  For such caches we will need to remote charge the allocations
to the listener's memcg.  Thus we save the memcg reference in the
fsnotify_group structure of the listener.

This patch has also moved the members of fsnotify_group to keep the size
same, at least for 64 bit build, even with additional member by filling
the holes.

[shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
  Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Luis R. Rodriguez
1a9b4b3d75 mm: provide a fallback for PAGE_KERNEL_EXEC for architectures
Some architectures just don't have PAGE_KERNEL_EXEC.  The mm/nommu.c and
mm/vmalloc.c code have been using PAGE_KERNEL as a fallback for years.
Move this fallback to asm-generic.

Link: http://lkml.kernel.org/r/20180510185507.2439-3-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Oscar Salvador
4fbce63391 mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()
link_mem_sections() and walk_memory_range() share most of the code, so
we can use convert link_mem_sections() into a dummy function that calls
walk_memory_range() with a callback to register_mem_sect_under_node().

This patch converts register_mem_sect_under_node() in order to match a
walk_memory_range's callback, getting rid of the check_nid argument and
checking instead if the system is still boothing, since we only have to
check for the nid if the system is in such state.

Link: http://lkml.kernel.org/r/20180622111839.10071-4-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Oscar Salvador
d5b6f6a361 mm/memory_hotplug.c: call register_mem_sect_under_node()
When hotplugging memory, it is possible that two calls are being made to
register_mem_sect_under_node().

One comes from __add_section()->hotplug_memory_register() and the other
from add_memory_resource()->link_mem_sections() if we had to register a
new node.

In case we had to register a new node, hotplug_memory_register() will
only handle/allocate the memory_block's since
register_mem_sect_under_node() will return right away because the node
it is not online yet.

I think it is better if we leave hotplug_memory_register() to
handle/allocate only memory_block's and make link_mem_sections() to call
register_mem_sect_under_node().

So this patch removes the call to register_mem_sect_under_node() from
hotplug_memory_register(), and moves the call to link_mem_sections() out
of the condition, so it will always be called.  In this way we only have
one place where the memory sections are registered.

Link: http://lkml.kernel.org/r/20180622111839.10071-3-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Oscar Salvador
b9ff036082 mm/memory_hotplug.c: make add_memory_resource use __try_online_node
This is a small cleanup for the memhotplug code.  A lot more could be
done, but it is better to start somewhere.  I tried to unify/remove
duplicated code.

The following is what this patchset does:

1) add_memory_resource() has code to allocate a node in case it was
   offline.  Since try_online_node has some code for that as well, I just
   made add_memory_resource() to use that so we can remove duplicated
   code..  This is better explained in patch 1/4.

2) register_mem_sect_under_node() will be called only from
   link_mem_sections()

3) Make register_mem_sect_under_node() a callback of
   walk_memory_range()

4) Drop unnecessary checks from register_mem_sect_under_node()

I have done some tests and I could not see anything broken because of
this patchset.

add_memory_resource() contains code to allocate a new node in case it is
necessary.  Since try_online_node() also has some code for this purpose,
let us make use of that and remove duplicate code.

This introduces __try_online_node(), which is called by
add_memory_resource() and try_online_node().  __try_online_node() has
two new parameters, start_addr of the node, and if the node should be
onlined and registered right away.  This is always wanted if we are
calling from do_cpu_up(), but not when we are calling from memhotplug
code.  Nothing changes from the point of view of the users of
try_online_node(), since try_online_node passes start_addr=0 and
online_node=true to __try_online_node().

Link: http://lkml.kernel.org/r/20180622111839.10071-2-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Andrew Morton
930eaac5ee mm/list_lru.c: fold __list_lru_count_one() into its caller
__list_lru_count_one() has a single callsite.

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Sebastian Andrzej Siewior
6ca342d020 mm: workingset: make shadow_lru_isolate() use locking suffix
shadow_lru_isolate() disables interrupts and acquires a lock.  It could
use spin_lock_irq() instead.  It also uses local_irq_enable() while it
could use spin_unlock_irq()/xa_unlock_irq().

Use proper suffix for lock/unlock in order to enable/disable interrupts
during release/acquire of a lock.

Link: http://lkml.kernel.org/r/20180622151221.28167-3-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Sebastian Andrzej Siewior
ae1e16da14 mm: workingset: remove local_irq_disable() from count_shadow_nodes()
Patch series "mm: use irq locking suffix instead local_irq_disable()".

A small series which avoids using local_irq_disable()/local_irq_enable()
but instead does spin_lock_irq()/spin_unlock_irq() so it is within the
context of the lock which it belongs to.  Patch #1 is a cleanup where
local_irq_.*() remained after the lock was removed.

This patch (of 2):

In 0c7c1bed7e ("mm: make counting of list_lru_one::nr_items lockless")
the

	spin_lock(&nlru->lock);

statement was replaced with

	rcu_read_lock();

in __list_lru_count_one().  The comment in count_shadow_nodes() says
that the local_irq_disable() is required because the lock must be
acquired with disabled interrupts and (spin_lock()) does not do so.
Since the lock is replaced with rcu_read_lock() the local_irq_disable()
is no longer needed.  The code path is

  list_lru_shrink_count()
    -> list_lru_count_one()
      -> __list_lru_count_one()
        -> rcu_read_lock()
        -> list_lru_from_memcg_idx()
        -> rcu_read_unlock()

Remove the local_irq_disable() statement.

Link: http://lkml.kernel.org/r/20180622151221.28167-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Michal Hocko
9ea9a68064 mm: drop VM_BUG_ON from __get_free_pages
There is no real reason to blow up just because the caller doesn't know
that __get_free_pages cannot return highmem pages.  Simply fix that up
silently.  Even if we have some confused users such a fixup will not be
harmful.

[akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jiankang Chen <chenjiankang1@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Huang Ying
974e6d66b6 mm, hugetlbfs: pass fault address to cow handler
This is to take better advantage of the general huge page copying
optimization.  Where, the target subpage will be copied last to avoid
the cache lines of target subpage to be evicted when copying other
subpages.  This works better if the address of the target subpage is
available when copying huge page.  So hugetlbfs page fault handlers are
changed to pass that information to hugetlb_cow().  This will benefit
workloads which don't access the begin of the hugetlbfs huge page after
the page fault under heavy cache contention.

Link: http://lkml.kernel.org/r/20180524005851.4079-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Punit Agrawal <punit.agrawal@arm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Huang Ying
5b7a1d4060 mm, hugetlbfs: rename address to haddr in hugetlb_cow()
To take better advantage of general huge page copying optimization, the
target subpage address will be passed to hugetlb_cow(), then
copy_user_huge_page().  So we will use both target subpage address and
huge page size aligned address in hugetlb_cow().  To distinguish between
them, "haddr" is used for huge page size aligned address to be
consistent with Transparent Huge Page naming convention.

Now, only huge page size aligned address is used in hugetlb_cow(), so
the "address" is renamed to "haddr" in hugetlb_cow() in this patch.
Next patch will use target subpage address in hugetlb_cow() too.

The patch is just code cleanup without any functionality changes.

Link: http://lkml.kernel.org/r/20180524005851.4079-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Punit Agrawal <punit.agrawal@arm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Huang Ying
c9f4cd7138 mm, huge page: copy target sub-page last when copy huge page
Huge page helps to reduce TLB miss rate, but it has higher cache
footprint, sometimes this may cause some issue.  For example, when
copying huge page on x86_64 platform, the cache footprint is 4M.  But on
a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
(last level cache).  That is, in average, there are 2.5M LLC for each
core and 1.25M LLC for each thread.

If the cache contention is heavy when copying the huge page, and we copy
the huge page from the begin to the end, it is possible that the begin
of huge page is evicted from the cache after we finishing copying the
end of the huge page.  And it is possible for the application to access
the begin of the huge page after copying the huge page.

In c79b57e462 ("mm: hugetlb: clear target sub-page last when clearing
huge page"), to keep the cache lines of the target subpage hot, the
order to clear the subpages in the huge page in clear_huge_page() is
changed to clearing the subpage which is furthest from the target
subpage firstly, and the target subpage last.  The similar order
changing helps huge page copying too.  That is implemented in this
patch.  Because we have put the order algorithm into a separate
function, the implementation is quite simple.

The patch is a generic optimization which should benefit quite some
workloads, not for a specific use case.  To demonstrate the performance
benefit of the patch, we tested it with vm-scalability run on
transparent huge page.

With this patch, the throughput increases ~16.6% in vm-scalability
anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
system (36 cores, 72 threads).  The test case set
/sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big
anonymous memory area and populate it, then forked 36 child processes,
each writes to the anonymous memory area from the begin to the end, so
cause copy on write.  For each child process, other child processes
could be seen as other workloads which generate heavy cache pressure.
At the same time, the IPC (instruction per cycle) increased from 0.63 to
0.78, and the time spent in user space is reduced ~7.2%.

Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Huang Ying
c6ddfb6c58 mm, clear_huge_page: move order algorithm into a separate function
Patch series "mm, huge page: Copy target sub-page last when copy huge
page", v2.

Huge page helps to reduce TLB miss rate, but it has higher cache
footprint, sometimes this may cause some issue.  For example, when
copying huge page on x86_64 platform, the cache footprint is 4M.  But on
a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
(last level cache).  That is, in average, there are 2.5M LLC for each
core and 1.25M LLC for each thread.

If the cache contention is heavy when copying the huge page, and we copy
the huge page from the begin to the end, it is possible that the begin
of huge page is evicted from the cache after we finishing copying the
end of the huge page.  And it is possible for the application to access
the begin of the huge page after copying the huge page.

In c79b57e462 ("mm: hugetlb: clear target sub-page last when clearing
huge page"), to keep the cache lines of the target subpage hot, the
order to clear the subpages in the huge page in clear_huge_page() is
changed to clearing the subpage which is furthest from the target
subpage firstly, and the target subpage last.  The similar order
changing helps huge page copying too.  That is implemented in this
patchset.

The patchset is a generic optimization which should benefit quite some
workloads, not for a specific use case.  To demonstrate the performance
benefit of the patchset, we have tested it with vm-scalability run on
transparent huge page.

With this patchset, the throughput increases ~16.6% in vm-scalability
anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
system (36 cores, 72 threads).  The test case set
/sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big
anonymous memory area and populate it, then forked 36 child processes,
each writes to the anonymous memory area from the begin to the end, so
cause copy on write.  For each child process, other child processes
could be seen as other workloads which generate heavy cache pressure.
At the same time, the IPC (instruction per cycle) increased from 0.63 to
0.78, and the time spent in user space is reduced ~7.2%.

This patch (of 4):

In c79b57e462 ("mm: hugetlb: clear target sub-page last when clearing
huge page"), to keep the cache lines of the target subpage hot, the
order to clear the subpages in the huge page in clear_huge_page() is
changed to clearing the subpage which is furthest from the target
subpage firstly, and the target subpage last.  This optimization could
be applied to copying huge page too with the same order algorithm.  To
avoid code duplication and reduce maintenance overhead, in this patch,
the order algorithm is moved out of clear_huge_page() into a separate
function: process_huge_page().  So that we can use it for copying huge
page too.

This will change the direct calls to clear_user_highpage() into the
indirect calls.  But with the proper inline support of the compilers,
the indirect call will be optimized to be the direct call.  Our tests
show no performance change with the patch.

This patch is a code cleanup without functionality change.

Link: http://lkml.kernel.org/r/20180524005851.4079-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Yang Shi
87aa752906 mm: thp: inc counter for collapsed shmem THP
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed is used
to record the counter of collapsed THP, but it just gets inc'ed in
anonymous THP collapse path, do this for shmem THP collapse too.

Link: http://lkml.kernel.org/r/1529622949-75504-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Yang Shi
c2231020ea mm: thp: register mm for khugepaged when merging vma for shmem
When merging anonymous page vma, if the size of the vma can fit in at
least one hugepage, the mm will be registered for khugepaged for
collapsing THP in the future.

But it skips shmem vmas.  Do so for shmem also, but not for file-private
mappings when merging a vma in order to increase the odds of collapsing
a hugepage via khugepaged.

hugepage_vma_check() sounds like a good fit to do the check.  And move
the definition of it before khugepaged_enter_vma_merge() to avoid a
build error.

Link: http://lkml.kernel.org/r/1529697791-6950-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Jia-Ju Bai
8cded8668e mm/mempool.c: remove unused argument in kasan_unpoison_element() and remove_element()
The argument "gfp_t flags" is not used in kasan_unpoison_element() and
remove_element(), so remove it.

Link: http://lkml.kernel.org/r/20180621070332.16633-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Greg Thelen
bb451fdf3d mm/vmscan.c: condense scan_control
Use smaller scan_control fields for order, priority, and reclaim_idx.
Convert fields from int => s8.  All easily fit within a byte:

 - allocation order range: 0..MAX_ORDER(64?)
 - priority range:         0..12(DEF_PRIORITY)
 - reclaim_idx range:      0..6(__MAX_NR_ZONES)

Since 6538b8ea88 ("x86_64: expand kernel stack to 16K") x86_64 stack
overflows are not an issue.  But it's inefficient to use ints.

Use s8 (signed byte) rather than u8 to allow for loops like:
	do {
		...
	} while (--sc.priority >= 0);

Add BUILD_BUG_ON to verify that s8 is capable of storing max values.

This reduces sizeof(struct scan_control):
 - 96 => 80 bytes (x86_64)
 - 68 => 56 bytes (i386)

scan_control structure field order is changed to utilize padding.  After
this patch there is 1 bit of scan_control padding.

akpm: makes my vmscan.o's .text 572 bytes smaller as well.

Link: http://lkml.kernel.org/r/20180530061212.84915-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Kirill A. Shutemov
10ed634152 mm/page_ext.c: constify lookup_page_ext() argument
lookup_page_ext() finds 'struct page_ext' for a given page.  It requires
only read access to the given struct page.

Current implemnentation takes 'struct page *' as an argument.  It makes
compiler complain when 'const struct page *' passed.

Change the argument to 'const struct page *'.

Link: http://lkml.kernel.org/r/20180531135457.20167-3-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Arnd Bergmann
46c9a946d7 shmem: use monotonic time for i_generation
get_seconds() is deprecated because it will lead to a 32-bit overflow in
2038 or 2106.  We don't need the i_generation to be strictly monotonic
anyway, and other file systems like ext4 and xfs just use prandom_u32(),
so let's use the same one here.

If this is considered too slow, we could also use ktime_get_seconds() or
ktime_get_real_seconds() to keep the previous behavior.  Both of these
return a time64_t and are not deprecated, but only return a unique value
once per second, and are predictable.

Link: http://lkml.kernel.org/r/20180620082556.581543-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Vlastimil Babka
d6a24df006 mm, page_alloc: actually ignore mempolicies for high priority allocations
__alloc_pages_slowpath() has for a long time contained code to ignore
node restrictions from memory policies for high priority allocations.
The current code that resets the zonelist iterator however does
effectively nothing after commit 7810e6781e ("mm, page_alloc: do not
break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
Even before that commit, mempolicy restrictions were still not ignored,
as they are passed in ac->nodemask which is untouched by the code.

We can either remove the code, or make it work as intended.  Since
ac->nodemask can be set from task's mempolicy via alloc_pages_current()
and thus also alloc_pages(), it may indeed affect kernel allocations,
and it makes sense to ignore it to allow progress for high priority
allocations.

Thus, this patch resets ac->nodemask to NULL in such cases.  This
assumes all callers can handle it (i.e.  there are no guarantees as in
the case of __GFP_THISNODE) which seems to be the case.  The same
assumption is already present in check_retry_cpuset() for some time.

The expected effect is that high priority kernel allocations in the
context of userspace tasks (e.g.  OOM victims) restricted by mempolicies
will have higher chance to succeed if they are restricted to nodes with
depleted memory, while there are other nodes with free memory left.

It's not a new intention, but for the first time the code will match the
intention, AFAICS.  It was intended by commit 183f6371aa ("mm: ignore
mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
really worked, as mempolicy restriction was already encoded in nodemask,
not zonelist, at that time.

So originally that was for ALLOC_NO_WATERMARK only.  Then it was
adjusted by e46e7b77c9 ("mm, page_alloc: recalculate the preferred
zoneref if the context can ignore memory policies") and cd04ae1e2d
("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
current state.  So even GFP_ATOMIC would now ignore mempolicies after
the initial attempts fail - if the code worked as people thought it
does.

Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00