Commit graph

942 commits

Author SHA1 Message Date
Johannes Weiner
4c6355b25e mm: memcontrol: charge swapin pages on instantiation
Right now, users that are otherwise memory controlled can easily escape
their containment and allocate significant amounts of memory that they're
not being charged for.  That's because swap readahead pages are not being
charged until somebody actually faults them into their page table.  This
can be exploited with MADV_WILLNEED, which triggers arbitrary readahead
allocations without charging the pages.

There are additional problems with the delayed charging of swap pages:

1. To implement refault/workingset detection for anonymous pages, we
   need to have a target LRU available at swapin time, but the LRU is not
   determinable until the page has been charged.

2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
   stable when the page is isolated from the LRU; otherwise, the locks
   change under us.  But swapcache gets charged after it's already on the
   LRU, and even if we cannot isolate it ourselves (since charging is not
   exactly optional).

The previous patch ensured we always maintain cgroup ownership records for
swap pages.  This patch moves the swapcache charging point from the fault
handler to swapin time to fix all of the above problems.

v2: simplify swapin error checking (Joonsoo)

[hughd@google.com: fix livelock in __read_swap_cache_async()]
  Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005212246080.8458@eggly.anvils
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-17-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Johannes Weiner
9d82c69438 mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API
With the page->mapping requirement gone from memcg, we can charge anon and
file-thp pages in one single step, right after they're allocated.

This removes two out of three API calls - especially the tricky commit
step that needed to happen at just the right time between when the page is
"set up" and when it's "published" - somewhat vague and fluid concepts
that varied by page type.  All we need is a freshly allocated page and a
memcg context to charge.

v2: prevent double charges on pre-allocated hugepages in khugepaged

[hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
  Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:48 -07:00
Johannes Weiner
be5d0a74c6 mm: memcontrol: switch to native NR_ANON_MAPPED counter
Memcg maintains a private MEMCG_RSS counter.  This divergence from the
generic VM accounting means unnecessary code overhead, and creates a
dependency for memcg that page->mapping is set up at the time of charging,
so that page types can be told apart.

Convert the generic accounting sites to mod_lruvec_page_state and friends
to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED.  We use
lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
same way we do for NR_FILE_MAPPED.

With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
counter, this patch finally eliminates the need to have page->mapping set
up at charge time.  However, we need to have page->mem_cgroup set up by
the time rmap runs and does the accounting, so switch the commit and the
rmap callbacks around.

v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Johannes Weiner
3fba69a56e mm: memcontrol: drop @compound parameter from memcg charging API
The memcg charging API carries a boolean @compound parameter that tells
whether the page we're dealing with is a hugepage.
mem_cgroup_commit_charge() has another boolean @lrucare that indicates
whether the page needs LRU locking or not while charging.  The majority of
callsites know those parameters at compile time, which results in a lot of
naked "false, false" argument lists.  This makes for cryptic code and is a
breeding ground for subtle mistakes.

Thankfully, the huge page state can be inferred from the page itself and
doesn't need to be passed along.  This is safe because charging completes
before the page is published and somebody may split it.

Simplify the callsites by removing @compound, and let memcg infer the
state by using hpage_nr_pages() unconditionally.  That function does
PageTransHuge() to identify huge pages, which also helpfully asserts that
nobody passes in tail pages by accident.

The following patches will introduce a new charging API, best not to carry
over unnecessary weight.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 20:09:47 -07:00
Linus Torvalds
8226f11318 MIPS updates for v5.8:
- added support for MIPSr5 and P5600 cores
 - converted Loongson PCI driver into a PCI host driver using the generic
   PCI framework
 - added emulation of CPUCFG command for Loogonson64 cpus
 - removed of LASAT, PMC MSP71xx and NEC MARKEINS/EMMA
 - ioremap cleanup
 - fix for a race between two threads faulting the same page
 - various cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQJOBAABCAA4FiEEbt46xwy6kEcDOXoUeZbBVTGwZHAFAl7WK54aHHRzYm9nZW5k
 QGFscGhhLmZyYW5rZW4uZGUACgkQeZbBVTGwZHAbjA/9EEFeqNg9UNUH6/TS18QV
 qkxKp0+LC4Jk+SduzLyYsYy6l/dSaKYl8m9jyJsWjM6BvBZTcMJJOnzIPRafI0s+
 MK8GCSZunAkm25DsDvfobQUkbQ/UHjY/fuRpNslbDcsYqIKv90hUMd21ccXY6KC5
 RY+aMlpjgksg1X8JJ7k1Rs05sXyUPqpESteyqehF1b/+Iyv7H2L3v5EvQwvPDs6f
 TyVgNJU2B3RCU6/uAcWmHdVLxXd+Y8fM0vC8DCO0pg0rGf4be0FbZztHmeq6r2wy
 g7wsO7acKWGzulFQD5ftVSQ6i8KHIDNDePmDMtU5oFcXkzUDdGvd3j3Gst19/nve
 ZftNmQHOY1JqGUOhdq1fDG/4M3Vc5bvh3W6eMG22TuMLEWsOF8teY8uUa/vxOb+B
 2NsJ9q6ylRS7RDWWOrApJWfFYPvhr5wlLxT+azWNa9y3bjV8vDLjNdU0mRLA1nsu
 yLzYMwIhtWfZhkJZ+xJVSmQ6LjAHDN5TF/LEx/9itLg5t9wrEosFPAtOv8V15hy4
 KBNvvWeoy7RRmBTNuKh7r9Ui4jw7GgxL4D1OwzCsF//GAiGyuuh0zMuUE8EXA6K5
 MpdGt+bSOcLl8ILTtGir8e4MXLawDH8n94f8QWLb9FcOvU4KHUjRKU7EQ6dyD5dk
 a7xskGLXWdVO3IJ/Xvxcaeo=
 =eAtN
 -----END PGP SIGNATURE-----

Merge tag 'mips_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS updates from Thomas Bogendoerfer:

 - added support for MIPSr5 and P5600 cores

 - converted Loongson PCI driver into a PCI host driver using the
   generic PCI framework

 - added emulation of CPUCFG command for Loogonson64 cpus

 - removed of LASAT, PMC MSP71xx and NEC MARKEINS/EMMA

 - ioremap cleanup

 - fix for a race between two threads faulting the same page

 - various cleanups and fixes

* tag 'mips_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (143 commits)
  MIPS: ralink: drop ralink_clk_init for mt7621
  MIPS: ralink: bootrom: mark a function as __init to save some memory
  MIPS: Loongson64: Reorder CPUCFG model match arms
  MIPS: Expose Loongson CPUCFG availability via HWCAP
  MIPS: Loongson64: Guard against future cores without CPUCFG
  MIPS: Fix build warning about "PTR_STR" redefinition
  MIPS: Loongson64: Remove not used pci.c
  MIPS: Loongson64: Define PCI_IOBASE
  MIPS: CPU_LOONGSON2EF need software to maintain cache consistency
  MIPS: DTS: Fix build errors used with various configs
  MIPS: Loongson64: select NO_EXCEPT_FILL
  MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe()
  MIPS: mm: add page valid judgement in function pte_modify
  mm/memory.c: Add memory read privilege on page fault handling
  mm/memory.c: Update local TLB if PTE entry exists
  MIPS: Do not flush tlb page when updating PTE entry
  MIPS: ingenic: Default to a generic board
  MIPS: ingenic: Add support for GCW Zero prototype
  MIPS: ingenic: DTS: Add memory info of GCW Zero
  MIPS: Loongson64: Switch to generic PCI driver
  ...
2020-06-03 13:32:21 -07:00
chenqiwu
6972f55c41 mm/memory: remove unnecessary pte_devmap case in copy_one_pte()
Since commit 25b2995a35 ("mm: remove MEMORY_DEVICE_PUBLIC support"),
the assignment to 'page' for pte_devmap case has been unnecessary.
Let's remove it.

[willy@infradead.org: changelog]
Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/1587349685-31712-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:10 -07:00
Bibo Mao
44bf431b47 mm/memory.c: Add memory read privilege on page fault handling
Here add pte_sw_mkyoung function to make page readable on MIPS
platform during page fault handling. This patch improves page
fault latency about 10% on my MIPS machine with lmbench
lat_pagefault case.

It is noop function on other arches, there is no negative
influence on those architectures.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-05-27 13:06:40 +02:00
Bibo Mao
7df6769743 mm/memory.c: Update local TLB if PTE entry exists
If two threads concurrently fault at the same page, the thread that
won the race updates the PTE and its local TLB. For now, the other
thread gives up, simply does nothing, and continues.

It could happen that this second thread triggers another fault, whereby
it only updates its local TLB while handling the fault. Instead of
triggering another fault, let's directly update the local TLB of the
second thread. Function update_mmu_tlb is used here to update local
TLB on the second thread, and it is defined as empty on other arches.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
2020-05-27 13:06:09 +02:00
Arjun Roy
8cd3984d81 mm/memory.c: add vm_insert_pages()
Add the ability to insert multiple pages at once to a user VM with lower
PTE spinlock operations.

The intention of this patch-set is to reduce atomic ops for tcp zerocopy
receives, which normally hits the same spinlock multiple times
consecutively.

[akpm@linux-foundation.org: pte_alloc() no longer takes the `addr' argument]
[arjunroy@google.com: add missing page_count() check to vm_insert_pages()]
  Link: http://lkml.kernel.org/r/20200214005929.104481-1-arjunroy.kdev@gmail.com
[arjunroy@google.com: vm_insert_pages() checks if pte_index defined]
  Link: http://lkml.kernel.org/r/20200228054714.204424-2-arjunroy.kdev@gmail.com
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200128025958.43490-2-arjunroy.kdev@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-10 15:36:21 -07:00
Arjun Roy
8efd6f5b17 mm/memory.c: refactor insert_page to prepare for batched-lock insert
Add helper methods for vm_insert_page()/insert_page() to prepare for
vm_insert_pages(), which batch-inserts pages to reduce spinlock
operations when inserting multiple consecutive pages into the user page
table.

The intention of this patch-set is to reduce atomic ops for tcp zerocopy
receives, which normally hits the same spinlock multiple times
consecutively.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Miller <davem@davemloft.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200128025958.43490-1-arjunroy.kdev@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-10 15:36:21 -07:00
chenqiwu
552657b7b3 mm: fix ambiguous comments for better code readability
The parameter of remap_pfn_range() @pfn passed from the caller is actually
a page-frame number converted by corresponding physical address of kernel
memory, the original comment is ambiguous that may mislead the users.

Meanwhile, there is an ambiguous typo "VMM" in the comment of
vm_area_struct.  So fixing them will make the code more readable.

Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:41 -07:00
Peter Xu
f45ec5ff16 userfaultfd: wp: support swap and page migration
For either swap and page migration, we all use the bit 2 of the entry to
identify whether this entry is uffd write-protected.  It plays a similar
role as the existing soft dirty bit in swap entries but only for keeping
the uffd-wp tracking for a specific PTE/PMD.

Something special here is that when we want to recover the uffd-wp bit
from a swap/migration entry to the PTE bit we'll also need to take care of
the _PAGE_RW bit and make sure it's cleared, otherwise even with the
_PAGE_UFFD_WP bit we can't trap it at all.

In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
That can lead to data mismatch if the page that we are going to write
protect is swapped out when sending the UFFDIO_WRITEPROTECT.  This patch
also applies/removes the uffd-wp bit even for the swap entries.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:39 -07:00
Peter Xu
b569a17607 userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork
UFFD_EVENT_FORK support for uffd-wp should be already there, except that
we should clean the uffd-wp bit if uffd fork event is not enabled.  Detect
that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked
by VM_UFFD_WP.  Do this for both small PTEs and huge PMDs.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:39 -07:00
Peter Xu
292924b260 userfaultfd: wp: apply _PAGE_UFFD_WP bit
Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
change_protection() when used with uffd-wp and make sure the two new flags
are exclusively used.  Then,

  - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

  - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

And use this new interface in mwriteprotect_range() to replace the old
MM_CP_DIRTY_ACCT.

Do this change for both PTEs and huge PMDs.  Then we can start to identify
which PTE/PMD is write protected by general (e.g., COW or soft dirty
tracking), and which is for userfaultfd-wp.

Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
into _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we
can be even more strict when detecting uffd-wp page faults in either
do_wp_page() or wp_huge_pmd().

After we're with _PAGE_UFFD_WP, a special case is when a page is both
protected by the general COW logic and also userfault-wp.  Here the
userfault-wp will have higher priority and will be handled first.  Only
after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle
the general COW.  These are the steps on what will happen with such a
page:

  1. CPU accesses write protected shared page (so both protected by
     general COW and uffd-wp), blocked by uffd-wp first because in
     do_wp_page we'll handle uffd-wp first, so it has higher priority
     than general COW.

  2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
     to remove the uffd-wp bit upon the PTE/PMD.  However here we
     still keep the write bit cleared.  Notify the blocked CPU.

  3. The blocked CPU resumes the page fault process with a fault
     retry, during retry it'll notice it was not with the uffd-wp bit
     this time but it is still write protected by general COW, then
     it'll go though the COW path in the fault handler, copy the page,
     apply write bit where necessary, and retry again.

  4. The CPU will be able to access this page with write bit set.

Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:39 -07:00
Andrea Arcangeli
529b930b87 userfaultfd: wp: hook userfault handler to write protection fault
There are several cases write protection fault happens.  It could be a
write to zero page, swaped page or userfault write protected page.  When
the fault happens, there is no way to know if userfault write protect the
page before.  Here we just blindly issue a userfault notification for vma
with VM_UFFD_WP regardless if app write protects it yet.  Application
should be ready to handle such wp fault.

In the swapin case, always swapin as readonly.  This will cause false
positive userfaults.  We need to decide later if to eliminate them with a
flag like soft-dirty in the swap entry (see _PAGE_SWP_SOFT_DIRTY).

hugetlbfs wouldn't need to worry about swapouts but and tmpfs would be
handled by a swap entry bit like anonymous memory.

The main problem with no easy solution to eliminate the false positives,
will be if/when userfaultfd is extended to real filesystem pagecache.
When the pagecache is freed by reclaim we can't leave the radix tree
pinned if the inode and in turn the radix tree is reclaimed as well.

The estimation is that full accuracy and lack of false positives could be
easily provided only to anonymous memory (as long as there's no fork or as
long as MADV_DONTFORK is used on the userfaultfd anonymous range) tmpfs
and hugetlbfs, it's most certainly worth to achieve it but in a later
incremental patch.

[peterx@redhat.com: don't conditionally drop FAULT_FLAG_WRITE in do_swap_page]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/20200220163112.11409-3-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:39 -07:00
Matthew Wilcox (Oracle)
396bcc5299 mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE
Commit e496cf3d78 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
notes that it should be reverted when the PowerPC problem was fixed.  The
commit fixing the PowerPC problem (953c66c2b2) did not revert the
commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
CONFIG_TRANSPARENT_HUGEPAGE.  Checking with Kirill and Aneesh, this was an
oversight, so remove the Kconfig symbol and undo the work of commit
e496cf3d78.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:38 -07:00
Anshuman Khandual
3122e80efc mm/vma: make vma_is_accessible() available for general use
Lets move vma_is_accessible() helper to include/linux/mm.h which makes it
available for general use.  While here, this replaces all remaining open
encodings for VMA access check with vma_is_accessible().

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Guo Ren <guoren@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paulburton@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:37 -07:00
Linus Torvalds
ea9448b254 drm: add support for hugepages to TTM
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJehnToAAoJEAx081l5xIa+bYEP/3IW+bip83OSR/Ay/29qmeBh
 FMZjz9G+jClVArea+8dlbmGohpQfkLuBiDBE1Ujxl9iqsm3STdIdbv9bHccqs2g8
 mtptkZ5qKwuOi7NhcNG5E5vy60bEAbZ9/QtXok5nckega2sdP7cr+uzZgp/Zc/Vo
 v9H8Wk6/l/MUF8agIXmgChpXII17lIyYbtbH5NV+PpsZMhAaAg2g4Z4vBP5Ue+Nc
 myNcdzKLF3nq++gBfIZ4gzAAnnqN2eYFvkSdvRSdn9HuXcur1tQHjMwC/DJuk8h7
 5dsaplrRLceMEqn6d61oWBJclPefXlkazvHzqNA9Zwr98yVev5h7tiT3BKNVTbKW
 iPoXCt55fJosvXAsJxW4UgXZy7kMGZdZ8GmSlwmZsA0kJRvOuuvWChvu/ugwnIeR
 DUWb5sa0Bn9aoczJ4Qq61O7CqtvhOf6NK24Jcc/HSk/iDbZ2tEnCPEXeCm0GibQ5
 PAFLfE1fZUcEeZlOp+zbZ6ni6XbLL9LX2Dkum/3zEvhf1rdF+0692ZM4o9VwedAX
 2TpE4kywhbYxhUq3MbyRzP3knu7pJYb0KCOfyg6Rqn/vCo17+PksRF+6XvzUVlzr
 VtRYU87TVP5FqIw+e3yela2alP/oo4kEe37n536TcRgFtU7vItcCA5vLuDSOivjX
 08B6Hy4QK2M0yKFuuAT5
 =KO6E
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2020-04-03-1' of git://anongit.freedesktop.org/drm/drm

Pull drm hugepage support from Dave Airlie:
 "This adds support for hugepages to TTM and has been tested with the
  vmwgfx drivers, though I expect other drivers to start using it"

* tag 'drm-next-2020-04-03-1' of git://anongit.freedesktop.org/drm/drm:
  drm/vmwgfx: Hook up the helpers to align buffer objects
  drm/vmwgfx: Introduce a huge page aligning TTM range manager
  drm: Add a drm_get_unmapped_area() helper
  drm/vmwgfx: Support huge page faults
  drm/ttm, drm/vmwgfx: Support huge TTM pagefaults
  mm: Add vmf_insert_pfn_xxx_prot() for huge page-table entries
  mm: Split huge pages on write-notify or COW
  mm: Introduce vma_is_special_huge
  fs: Constify vma argument to vma_is_dax
2020-04-04 11:58:55 -07:00
Wang Wenhu
abd69b9e00 mm/memory.c: clarify a confusing comment for vm_iomap_memory
The param "start" actually referes to the physical memory start, which is
to be mapped into virtual area vma.  And it is the field vma->vm_start
which stands for the start of the area.

Most of the time, we do not read through whole implementation of a
function but only the definition and essential comments.  Accurate
comments are definitely the base stone.

Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200318052206.105104-1-wenhu.wang@vivo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:30 -07:00
WANG Wenhu
86a76331d9 mm: clarify a confusing comment for remap_pfn_range()
It really made me scratch my head.  Replace the comment with an accurate
and consistent description.

The parameter pfn actually refers to the page frame number which is
right-shifted by PAGE_SHIFT from the physical address.

Signed-off-by: WANG Wenhu <wenhu.wang@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200310073955.43415-1-wenhu.wang@vivo.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:30 -07:00
Thomas Hellstrom (VMware)
327e9fd489 mm: Split huge pages on write-notify or COW
The functions wp_huge_pmd() and wp_huge_pud() currently relies on the
huge_fault() callback to split huge page table entries if needed.
However for module users that requires export of the split_huge_xxx()
functionality which may be undesired. Instead split pre-existing huge
page-table entries on VM_FAULT_FALLBACK return.

We currently only do COW and write-notify on the PTE level, so if the
huge_fault() handler returns VM_FAULT_FALLBACK on wp faults,
split the huge pages and page-table entries. Also do this for huge PUDs
if there is no huge_fault() handler and the vma is not anonymous, similar
to how it's done for PMDs.

Note that fs/dax.c still does the splitting in the huge_fault() handler,
but as huge_fault() A follow-up patch can remove the dax.c split_huge_pmd()
if needed.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Hellstrom (VMware) <thomas_os@shipmail.org>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
2020-03-24 18:47:47 +01:00
Kirill A. Shutemov
c3e5ea6ee5 mm: avoid data corruption on CoW fault into PFN-mapped VMA
Jeff Moyer has reported that one of xfstests triggers a warning when run
on DAX-enabled filesystem:

	WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50
	...
	wp_page_copy+0x98c/0xd50 (unreliable)
	do_wp_page+0xd8/0xad0
	__handle_mm_fault+0x748/0x1b90
	handle_mm_fault+0x120/0x1f0
	__do_page_fault+0x240/0xd70
	do_page_fault+0x38/0xd0
	handle_page_fault+0x10/0x30

The warning happens on failed __copy_from_user_inatomic() which tries to
copy data into a CoW page.

This happens because of race between MADV_DONTNEED and CoW page fault:

	CPU0					CPU1
 handle_mm_fault()
   do_wp_page()
     wp_page_copy()
       do_wp_page()
					madvise(MADV_DONTNEED)
					  zap_page_range()
					    zap_pte_range()
					      ptep_get_and_clear_full()
					      <TLB flush>
	 __copy_from_user_inatomic()
	 sees empty PTE and fails
	 WARN_ON_ONCE(1)
	 clear_page()

The solution is to re-try __copy_from_user_inatomic() under PTL after
checking that PTE is matches the orig_pte.

The second copy attempt can still fail, like due to non-readable PTE, but
there's nothing reasonable we can do about, except clearing the CoW page.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Justin He <Justin.He@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-06 07:06:09 -06:00
Linus Torvalds
9717c1cea1 drm ttm/mm changes for 5.6-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJeOKqlAAoJEAx081l5xIa+NKoP/1hhIF6iV6QwNxCvNukhiuNc
 VhS9LPllEQ3G86QE+CfPWmFmh1IBnbJi3CwigC1UX4nA2XoN2Eccm8f5BmitG+Dm
 YZNZBk8e1z9E68gqpdjA2qr3hddB8ZWcWW8SNqL0PID1UMdXlVr5QM+2Tw7flTQp
 YsajCwssMVZWIgWya+n+A//Qdu5z95KC0ycLjIrT1g+Hl1dxUUixqcbEWppE2wrF
 mJqTBJsfrPo2Njb6PFUKvUmzAZPJGWxLp4cGYJhHgqNrEtwJmmwhWpUDd+BpfxHn
 jq+ir0ALAph+quWdtouArrf1ibS+uipQl+7uaIW5J963czxxNRr7eNl5KP9gaHod
 zmtplXlw0ruNgCrDhRIJ4B4SB5M2nAZk7Y/xeHK+GwaCOGVt4rCaBWGBz/lorm7u
 9phhygqYKdoIJc3JmXcKcXZHgLWuGgobWmYFlA/vNloXMeY5C5LPgDmAKIW40nut
 pKg9iGH6fJoN0HUcECPIbv1bterZy7YKK6OWl9TRHfMnabLJXNCej8hcEnf6KQ2l
 spDO//L8tICBftrcZdJFunzoFFlTavF18XBqm9ZdGNfNo9BIipwFJQEQDahVKHSM
 6Du0kmuVoB02QLPXEUpA6+W5rFw7M5Qzi47pUnzM/VFJFkFx/eSsfs119aXJRXtc
 jIgI1vHIPFCX8Zg4SN/O
 =uiav
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2020-02-04' of git://anongit.freedesktop.org/drm/drm

Pull drm ttm/mm updates from Dave Airlie:
 "Thomas Hellstrom has some more changes to the TTM layer that needed a
  patch to the mm subsystem.

  This adds a new mm API vmf_insert_mixed_prot to avoid an ugly hack
  that has limitations in the TTM layer"

* tag 'drm-next-2020-02-04' of git://anongit.freedesktop.org/drm/drm:
  mm, drm/ttm: Fix vm page protection handling
  mm: Add a vmf_insert_mixed_prot() function
2020-02-04 07:21:04 +00:00
Dave Airlie
b45f1b3b58 Merge branch 'ttm-prot-fix' of git://people.freedesktop.org/~thomash/linux into drm-next
A small fix for the long-standing ttm vm page protection hack.

Sent as a separate PR as it touches mm, has all acks in place.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Hellström (VMware) <thellstrom@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200116102411.3056-1-thomas_os@shipmail.org
2020-01-31 16:58:35 +10:00
Thomas Hellstrom
5379e4dd32 mm, drm/ttm: Fix vm page protection handling
TTM graphics buffer objects may, transparently to user-space,  move
between IO and system memory. When that happens, all PTEs pointing to the
old location are zapped before the move and then faulted in again if
needed. When that happens, the page protection caching mode- and
encryption bits may change and be different from those of
struct vm_area_struct::vm_page_prot.

We were using an ugly hack to set the page protection correctly.
Fix that and instead export and use vmf_insert_mixed_prot() or use
vmf_insert_pfn_prot().
Also get the default page protection from
struct vm_area_struct::vm_page_prot rather than using vm_get_page_prot().
This way we catch modifications done by the vm system for drivers that
want write-notification.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
2020-01-16 10:32:41 +01:00
Thomas Hellstrom
574c5b3d0e mm: Add a vmf_insert_mixed_prot() function
The TTM module today uses a hack to be able to set a different page
protection than struct vm_area_struct::vm_page_prot. To be able to do
this properly, add the needed vm functionality as vmf_insert_mixed_prot().

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Christian König" <christian.koenig@amd.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
2020-01-16 10:32:33 +01:00
Ingo Molnar
1e5f8a3085 Linux 5.5-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4AEiYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGR3sH/ixrBBYUVyjRPOxS
 ce4iVoTqphGSoAzq/3FA1YZZOPQ/Ep0NXL4L2fTGxmoiqIiuy8JPp07/NKbHQjj1
 Rt6PGm6cw2pMJHaK9gRdlTH/6OyXkp06OkH1uHqKYrhPnpCWDnj+i2SHAX21Hr1y
 oBQh4/XKvoCMCV96J2zxRsLvw8OkQFE0ouWWfj6LbpXIsmWZ++s0OuaO1cVdP/oG
 j+j2Voi3B3vZNQtGgJa5W7YoZN5Qk4ZIj9bMPg7bmKRd3wNB228AiJH2w68JWD/I
 jCA+JcITilxC9ud96uJ6k7SMS2ufjQlnP0z6Lzd0El1yGtHYRcPOZBgfOoPU2Euf
 33WGSyI=
 =iEwx
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc3' into sched/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25 10:41:37 +01:00
Daniel Axtens
be1db4753e mm/memory.c: add apply_to_existing_page_range() helper
apply_to_page_range() takes an address range, and if any parts of it are
not covered by the existing page table hierarchy, it allocates memory to
fill them in.

In some use cases, this is not what we want - we want to be able to
operate exclusively on PTEs that are already in the tables.

Add apply_to_existing_page_range() for this.  Adjust the walker
functions for apply_to_page_range to take 'create', which switches them
between the old and new modes.

This will be used in KASAN vmalloc.

[akpm@linux-foundation.org: reduce code duplication]
[akpm@linux-foundation.org: s/apply_to_existing_pages/apply_to_existing_page_range/]
[akpm@linux-foundation.org: initialize __apply_to_page_range::err]
Link: http://lkml.kernel.org/r/20191205140407.1874-1-dja@axtens.net
Signed-off-by: Daniel Axtens <dja@axtens.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Qian Cai <cai@lca.pw>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-17 20:59:59 -08:00
Thomas Gleixner
923717cbab sched/rt, mm: Use CONFIG_PREEMPTION
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.

Switch the pte_unmap_same() and SLUB code over to use CONFIG_PREEMPTION.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Chistoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20191015191821.11479-26-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-08 14:37:36 +01:00
Mike Rapoport
f949286c66 mm: remove __ARCH_HAS_4LEVEL_HACK and include/asm-generic/4level-fixup.h
There are no architectures that use include/asm-generic/4level-fixup.h
therefore it can be removed along with __ARCH_HAS_4LEVEL_HACK define.

Link: http://lkml.kernel.org/r/1572938135-31886-14-git-send-email-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Anatoly Pugachev <matorola@gmail.com>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Peter Rosin <peda@axentia.se>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Sam Creasey <sammy@sammy.net>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:15 -08:00
Yu Zhao
3cde287bb4 mm/memory.c: replace is_zero_pfn with is_huge_zero_pmd for thp
For hugely mapped thp, we use is_huge_zero_pmd() to check if it's zero
page or not.

We do fill ptes with my_zero_pfn() when we split zero thp pmd, but this
is not what we have in vm_normal_page_pmd() -- pmd_trans_huge_lock()
makes sure of it.

This is a trivial fix for /proc/pid/numa_maps, and AFAIK nobody
complains about it.

Gerald Schaefer asked:
: Maybe the description could also mention the symptom of this bug?
: I would assume that it affects anon/dirty accounting in gather_pte_stats(),
: for huge mappings, if zero page mappings are not correctly recognized.

I came across this while I was looking at the code, so I'm not aware of
any symptom.

Link: http://lkml.kernel.org/r/20191108192629.201556-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04 19:44:11 -08:00
Wei Yang
f4f5329d45 mm: fix typos in comments when calling __SetPageUptodate()
There are several places emphasise the effect of __SetPageUptodate(),
while the comment seems to have a typo in two places.

Link: http://lkml.kernel.org/r/20190926023705.7226-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 12:59:10 -08:00
Thomas Hellstrom
625110b5e9 mm/memory.c: fix a huge pud insertion race during faulting
A huge pud page can theoretically be faulted in racing with pmd_alloc()
in __handle_mm_fault().  That will lead to pmd_alloc() returning an
invalid pmd pointer.

Fix this by adding a pud_trans_unstable() function similar to
pmd_trans_unstable() and check whether the pud is really stable before
using the pmd pointer.

Race:
  Thread 1:             Thread 2:                 Comment
  create_huge_pud()                               Fallback - not taken.
                        create_huge_pud()         Taken.
  pmd_alloc()                                     Returns an invalid pointer.

This will result in user-visible huge page data corruption.

Note that this was caught during a code audit rather than a real
experienced problem.  It looks to me like the only implementation that
currently creates huge pud pagetable entries is dev_dax_huge_fault()
which doesn't appear to care much about private (COW) mappings or
write-tracking which is, I believe, a prerequisite for create_huge_pud()
falling back on thread 1, but not in thread 2.

Link: http://lkml.kernel.org/r/20191115115808.21181-2-thomas_os@shipmail.org
Fixes: a00cc7d9dd ("mm, x86: add support for PUD-sized transparent hugepages")
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:19 -08:00
Joel Fernandes (Google)
e4dcad204d rss_stat: add support to detect RSS updates of external mm
When a process updates the RSS of a different process, the rss_stat
tracepoint appears in the context of the process doing the update.  This
can confuse userspace that the RSS of process doing the update is
updated, while in reality a different process's RSS was updated.

This issue happens in reclaim paths such as with direct reclaim or
background reclaim.

This patch adds more information to the tracepoint about whether the mm
being updated belongs to the current process's context (curr field).  We
also include a hash of the mm pointer so that the process who the mm
belongs to can be uniquely identified (mm_id field).

Also vsprintf.c is refactored a bit to allow reuse of hashing code.

[akpm@linux-foundation.org: remove unused local `str']
[joelaf@google.com: inline call to ptr_to_hashval]
  Link: http://lore.kernel.org/r/20191113153816.14b95acd@gandalf.local.home
  Link: http://lkml.kernel.org/r/20191114164622.GC233237@google.com
Link: http://lkml.kernel.org/r/20191106024452.81923-1-joel@joelfernandes.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reported-by: Ioannis Ilkos <ilkos@google.com>
Acked-by: Petr Mladek <pmladek@suse.com>	[lib/vsprintf.c]
Cc: Tim Murray <timmurray@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Carmen Jackson <carmenjackson@google.com>
Cc: Mayank Gupta <mayankgupta@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:18 -08:00
Joel Fernandes (Google)
b3d1411b67 mm: emit tracepoint when RSS changes
Useful to track how RSS is changing per TGID to detect spikes in RSS and
memory hogs.  Several Android teams have been using this patch in
various kernel trees for half a year now.  Many reported to me it is
really useful so I'm posting it upstream.

Initial patch developed by Tim Murray.  Changes I made from original
patch: o Prevent any additional space consumed by mm_struct.

Regarding the fact that the RSS may change too often thus flooding the
traces - note that, there is some "hysterisis" with this already.  That
is - We update the counter only if we receive 64 page faults due to
SPLIT_RSS_ACCOUNTING.  However, during zapping or copying of pte range,
the RSS is updated immediately which can become noisy/flooding.  In a
previous discussion, we agreed that BPF or ftrace can be used to rate
limit the signal if this becomes an issue.

Also note that I added wrappers to trace_rss_stat to prevent compiler
errors where linux/mm.h is included from tracing code, causing errors
such as:

    CC      kernel/trace/power-traces.o
  In file included from ./include/trace/define_trace.h:102,
                   from ./include/trace/events/kmem.h:342,
                   from ./include/linux/mm.h:31,
                   from ./include/linux/ring_buffer.h:5,
                   from ./include/linux/trace_events.h:6,
                   from ./include/trace/events/power.h:12,
                   from kernel/trace/power-traces.c:15:
  ./include/trace/trace_events.h:113:22: error: field `ent' has incomplete type
     struct trace_entry ent;    \

Link: http://lore.kernel.org/r/20190903200905.198642-1-joel@joelfernandes.org
Link: http://lkml.kernel.org/r/20191001172817.234886-1-joel@joelfernandes.org
Co-developed-by: Tim Murray <timmurray@google.com>
Signed-off-by: Tim Murray <timmurray@google.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Carmen Jackson <carmenjackson@google.com>
Cc: Mayank Gupta <mayankgupta@google.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:18 -08:00
Johannes Weiner
89b15332af mm: drop mmap_sem before calling balance_dirty_pages() in write fault
One of our services is observing hanging ps/top/etc under heavy write
IO, and the task states show this is an mmap_sem priority inversion:

A write fault is holding the mmap_sem in read-mode and waiting for
(heavily cgroup-limited) IO in balance_dirty_pages():

    balance_dirty_pages+0x724/0x905
    balance_dirty_pages_ratelimited+0x254/0x390
    fault_dirty_shared_page.isra.96+0x4a/0x90
    do_wp_page+0x33e/0x400
    __handle_mm_fault+0x6f0/0xfa0
    handle_mm_fault+0xe4/0x200
    __do_page_fault+0x22b/0x4a0
    page_fault+0x45/0x50

Somebody tries to change the address space, contending for the mmap_sem in
write-mode:

    call_rwsem_down_write_failed_killable+0x13/0x20
    do_mprotect_pkey+0xa8/0x330
    SyS_mprotect+0xf/0x20
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

The waiting writer locks out all subsequent readers to avoid lock
starvation, and several threads can be seen hanging like this:

    call_rwsem_down_read_failed+0x14/0x30
    proc_pid_cmdline_read+0xa0/0x480
    __vfs_read+0x23/0x140
    vfs_read+0x87/0x130
    SyS_read+0x42/0x90
    do_syscall_64+0x5b/0x100
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

To fix this, do what we do for cache read faults already: drop the
mmap_sem before calling into anything IO bound, in this case the
balance_dirty_pages() function, and return VM_FAULT_RETRY.

Link: http://lkml.kernel.org/r/20190924194238.GA29030@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 06:29:18 -08:00
Jia He
83d116c530 mm: fix double page fault on arm64 if PTE_AF is cleared
When we tested pmdk unit test [1] vmmalloc_fork TEST3 on arm64 guest, there
will be a double page fault in __copy_from_user_inatomic of cow_user_page.

To reproduce the bug, the cmd is as follows after you deployed everything:
make -C src/test/vmmalloc_fork/ TEST_TIME=60m check

Below call trace is from arm64 do_page_fault for debugging purpose:
[  110.016195] Call trace:
[  110.016826]  do_page_fault+0x5a4/0x690
[  110.017812]  do_mem_abort+0x50/0xb0
[  110.018726]  el1_da+0x20/0xc4
[  110.019492]  __arch_copy_from_user+0x180/0x280
[  110.020646]  do_wp_page+0xb0/0x860
[  110.021517]  __handle_mm_fault+0x994/0x1338
[  110.022606]  handle_mm_fault+0xe8/0x180
[  110.023584]  do_page_fault+0x240/0x690
[  110.024535]  do_mem_abort+0x50/0xb0
[  110.025423]  el0_da+0x20/0x24

The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
[ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
               pmd=000000023d4b3003, pte=360000298607bd3

As told by Catalin: "On arm64 without hardware Access Flag, copying from
user will fail because the pte is old and cannot be marked young. So we
always end up with zeroed page after fork() + CoW for pfn mappings. we
don't always have a hardware-managed access flag on arm64."

This patch fixes it by calling pte_mkyoung. Also, the parameter is
changed because vmf should be passed to cow_user_page()

Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error
in case there can be some obscure use-case (by Kirill).

[1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork

Signed-off-by: Jia He <justin.he@arm.com>
Reported-by: Yibo Cai <Yibo.Cai@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-10-18 11:11:29 +01:00
Kefeng Wang
6aa9b8b2c6 mm: do not hash address in print_bad_pte()
Using %px to show the actual address in print_bad_pte()
to help us to debug issue.

Link: http://lkml.kernel.org/r/20190831011816.141002-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:09 -07:00
Minchan Kim
7b167b6810 mm: release the spinlock on zap_pte_range
In our testing (camera recording), Miguel and Wei found
unmap_page_range() takes above 6ms with preemption disabled easily.
When I see that, the reason is it holds page table spinlock during
entire 512 page operation in a PMD.  6.2ms is never trivial for user
experince if RT task couldn't run in the time because it could make
frame drop or glitch audio problem.

I had a time to benchmark it via adding some trace_printk hooks between
pte_offset_map_lock and pte_unmap_unlock in zap_pte_range.  The testing
device is 2018 premium mobile device.

I can get 2ms delay rather easily to release 2M(ie, 512 pages) when the
task runs on little core even though it doesn't have any IPI and LRU
lock contention.  It's already too heavy.

If I remove activate_page, 35-40% overhead of zap_pte_range is gone so
most of overhead(about 0.7ms) comes from activate_page via
mark_page_accessed.  Thus, if there are LRU contention, that 0.7ms could
accumulate up to several ms.

So this patch adds a check for need_resched() in the loop, and a
preemption point if necessary.

Link: http://lkml.kernel.org/r/20190731061440.GC155569@google.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Miguel de Dios <migueldedios@google.com>
Reported-by: Wei Wang <wvw@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:08 -07:00
Wei Yang
9da99f20ec mm: remove redundant assignment of entry
Since ptent will not be changed after previous assignment of entry, it is
not necessary to do the assignment again.

Link: http://lkml.kernel.org/r/20190708082740.21111-1-richardw.yang@linux.intel.com
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:08 -07:00
Darrick J. Wong
dc617f29db vfs: don't allow writes to swap files
Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-08-20 07:55:16 -07:00
Yang Shi
43675e6fbb mm: thp: make transhuge_vma_suitable available for anonymous THP
transhuge_vma_suitable() was only available for shmem THP, but anonymous
THP has the same check except pgoff check.  And, it will be used for THP
eligible check in the later patch, so make it available for all kind of
THPs.  This also helps reduce code duplication slightly.

Since anonymous THP doesn't have to check pgoff, so make pgoff check
shmem vma only.

And regroup some functions in include/linux/mm.h to solve compile issue
since transhuge_vma_suitable() needs call vma_is_anonymous() which was
defined after huge_mm.h is included.

[akpm@linux-foundation.org: fix typo]
[yang.shi@linux.alibaba.com: v4]
  Link: http://lkml.kernel.org/r/1563400758-124759-2-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1560401041-32207-2-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18 17:08:06 -07:00
Linus Torvalds
fec88ab0af HMM patches for 5.3
Improvements and bug fixes for the hmm interface in the kernel:
 
 - Improve clarity, locking and APIs related to the 'hmm mirror' feature
   merged last cycle. In linux-next we now see AMDGPU and nouveau to be
   using this API.
 
 - Remove old or transitional hmm APIs. These are hold overs from the past
   with no users, or APIs that existed only to manage cross tree conflicts.
   There are still a few more of these cleanups that didn't make the merge
   window cut off.
 
 - Improve some core mm APIs:
   * export alloc_pages_vma() for driver use
   * refactor into devm_request_free_mem_region() to manage
     DEVICE_PRIVATE resource reservations
   * refactor duplicative driver code into the core dev_pagemap
     struct
 
 - Remove hmm wrappers of improved core mm APIs, instead have drivers use
   the simplified API directly
 
 - Remove DEVICE_PUBLIC
 
 - Simplify the kconfig flow for the hmm users and core code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl0k1zkACgkQOG33FX4g
 mxrO+w//QF/yI/9Hh30RWEBq8W107cODkDlaT0Z/7cVEXfGetZzIUpqzxnJofRfQ
 xTw1XmYkc9WpJe/mTTuFZFewNQwWuMM6X0Xi25fV438/Y64EclevlcJTeD49TIH1
 CIMsz8bX7CnCEq5sz+UypLg9LPnaD9L/JLyuSbyjqjms/o+yzqa7ji7p/DSINuhZ
 Qva9OZL1ZSEDJfNGi8uGpYBqryHoBAonIL12R9sCF5pbJEnHfWrH7C06q7AWOAjQ
 4vjN/p3F4L9l/v2IQ26Kn/S0AhmN7n3GT//0K66e2gJPfXa8fxRKGuFn/Kd79EGL
 YPASn5iu3cM23up1XkbMNtzacL8yiIeTOcMdqw26OaOClojy/9OJduv5AChe6qL/
 VUQIAn1zvPsJTyC5U7mhmkrGuTpP6ivHpxtcaUp+Ovvi1cyK40nLCmSNvLnbN5ES
 bxbb0SjE4uupDG5qU6Yct/hFp6uVMSxMqXZOb9Xy8ZBkbMsJyVOLj71G1/rVIfPU
 hO1AChX5CRG1eJoMo6oBIpiwmSvcOaPp3dqIOQZvwMOqrO869LR8qv7RXyh/g9gi
 FAEKnwLl4GK3YtEO4Kt/1YI5DXYjSFUbfgAs0SPsRKS6hK2+RgRk2M/B/5dAX0/d
 lgOf9WPODPwiSXBYLtJB8qHVDX0DIY8faOyTx6BYIKClUtgbBI8=
 =wKvp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull HMM updates from Jason Gunthorpe:
 "Improvements and bug fixes for the hmm interface in the kernel:

   - Improve clarity, locking and APIs related to the 'hmm mirror'
     feature merged last cycle. In linux-next we now see AMDGPU and
     nouveau to be using this API.

   - Remove old or transitional hmm APIs. These are hold overs from the
     past with no users, or APIs that existed only to manage cross tree
     conflicts. There are still a few more of these cleanups that didn't
     make the merge window cut off.

   - Improve some core mm APIs:
       - export alloc_pages_vma() for driver use
       - refactor into devm_request_free_mem_region() to manage
         DEVICE_PRIVATE resource reservations
       - refactor duplicative driver code into the core dev_pagemap
         struct

   - Remove hmm wrappers of improved core mm APIs, instead have drivers
     use the simplified API directly

   - Remove DEVICE_PUBLIC

   - Simplify the kconfig flow for the hmm users and core code"

* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
  mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
  mm: remove the HMM config option
  mm: sort out the DEVICE_PRIVATE Kconfig mess
  mm: simplify ZONE_DEVICE page private data
  mm: remove hmm_devmem_add
  mm: remove hmm_vma_alloc_locked_page
  nouveau: use devm_memremap_pages directly
  nouveau: use alloc_page_vma directly
  PCI/P2PDMA: use the dev_pagemap internal refcount
  device-dax: use the dev_pagemap internal refcount
  memremap: provide an optional internal refcount in struct dev_pagemap
  memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
  memremap: remove the data field in struct dev_pagemap
  memremap: add a migrate_to_ram method to struct dev_pagemap_ops
  memremap: lift the devmap_enable manipulation into devm_memremap_pages
  memremap: pass a struct dev_pagemap to ->kill and ->cleanup
  memremap: move dev_pagemap callbacks into a separate structure
  memremap: validate the pagemap type passed to devm_memremap_pages
  mm: factor out a devm_request_free_mem_region helper
  mm: export alloc_pages_vma
  ...
2019-07-14 19:42:11 -07:00
Konstantin Khlebnikov
1e426fe282 mm: use down_read_killable for locking mmap_sem in access_remote_vm
This function is used by ptrace and proc files like /proc/pid/cmdline and
/proc/pid/environ.

Access_remote_vm never returns error codes, all errors are ignored and
only size of successfully read data is returned.  So, if current task was
killed we'll simply return 0 (bytes read).

Mmap_sem could be locked for a long time or forever if something goes
wrong.  Using a killable lock permits cleanup of stuck tasks and
simplifies investigation.

Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:47 -07:00
Miguel Ojeda
96756fcb83 mm/memory.c: fail when offset == num in first check of __vm_map_pages()
If the caller asks us for offset == num, we should already fail in the
first check, i.e.  the one testing for offsets beyond the object.

At the moment, we are failing on the second test anyway, since count
cannot be 0.  Still, to agree with the comment of the first test, we
should first test it there.

Link: http://lkml.kernel.org/r/20190528193004.GA7744@gmail.com
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:46 -07:00
Anshuman Khandual
8b1e0f81fb mm/pgtable: drop pgtable_t variable from pte_fn_t functions
Drop the pgtable_t variable from all implementation for pte_fn_t as none
of them use it.  apply_to_pte_range() should stop computing it as well.
Should help us save some cycles.

Link: http://lkml.kernel.org/r/1556803126-26596-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Matthew Wilcox <willy@infradead.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <jglisse@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:46 -07:00
Huang Ying
eb085574a7 mm, swap: fix race between swapoff and some swap operations
When swapin is performed, after getting the swap entry information from
the page table, system will swap in the swap entry, without any lock held
to prevent the swap device from being swapoff.  This may cause the race
like below,

CPU 1				CPU 2
-----				-----
				do_swap_page
				  swapin_readahead
				    __read_swap_cache_async
swapoff				      swapcache_prepare
  p->swap_map = NULL		        __swap_duplicate
					  p->swap_map[?] /* !!! NULL pointer access */

Because swapoff is usually done when system shutdown only, the race may
not hit many people in practice.  But it is still a race need to be fixed.

To fix the race, get_swap_device() is added to check whether the specified
swap entry is valid in its swap device.  If so, it will keep the swap
entry valid via preventing the swap device from being swapoff, until
put_swap_device() is called.

Because swapoff() is very rare code path, to make the normal path runs as
fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of
reference count is used to implement get/put_swap_device().  >From
get_swap_device() to put_swap_device(), RCU reader side is locked, so
synchronize_rcu() in swapoff() will wait until put_swap_device() is
called.

In addition to swap_map, cluster_info, etc.  data structure in the struct
swap_info_struct, the swap cache radix tree will be freed after swapoff,
so this patch fixes the race between swap cache looking up and swapoff
too.

Races between some other swap cache usages and swapoff are fixed too via
calling synchronize_rcu() between clearing PageSwapCache() and freeing
swap cache data structure.

Another possible method to fix this is to use preempt_off() +
stop_machine() to prevent the swap device from being swapoff when its data
structure is being accessed.  The overhead in hot-path of both methods is
similar.  The advantages of RCU based method are,

1. stop_machine() may disturb the normal execution code path on other
   CPUs.

2. File cache uses RCU to protect its radix tree.  If the similar
   mechanism is used for swap cache too, it is easier to share code
   between them.

3. RCU is used to protect swap cache in total_swapcache_pages() and
   exit_swap_address_space() already.  The two mechanisms can be
   merged to simplify the logic.

Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
Fixes: 235b621767 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Not-nacked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:43 -07:00
Miklos Szeredi
465fc3a9b3 mm/memory.c: trivial clean up in insert_page()
Make the success case use the same cleanup path as the failure case.

Link: http://lkml.kernel.org/r/20190523134024.GC24093@localhost.localdomain
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:42 -07:00
Christoph Hellwig
897e6365cd memremap: add a migrate_to_ram method to struct dev_pagemap_ops
This replaces the hacky ->fault callback, which is currently directly
called from common code through a hmm specific data structure as an
exercise in layering violations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:44 -03:00
Christoph Hellwig
25b2995a35 mm: remove MEMORY_DEVICE_PUBLIC support
The code hasn't been used since it was added to the tree, and doesn't
appear to actually be usable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-07-02 14:32:43 -03:00
Thomas Gleixner
457c899653 treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
   initial scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Souptick Joarder
a667d7456f mm: introduce new vm_map_pages() and vm_map_pages_zero() API
Patch series "mm: Use vm_map_pages() and vm_map_pages_zero() API", v5.

This patch (of 5):

Previouly drivers have their own way of mapping range of kernel
pages/memory into user vma and this was done by invoking vm_insert_page()
within a loop.

As this pattern is common across different drivers, it can be generalized
by creating new functions and using them across the drivers.

vm_map_pages() is the API which can be used to map kernel memory/pages in
drivers which have considered vm_pgoff

vm_map_pages_zero() is the API which can be used to map a range of kernel
memory/pages in drivers which have not considered vm_pgoff.  vm_pgoff is
passed as default 0 for those drivers.

We _could_ then at a later "fix" these drivers which are using
vm_map_pages_zero() to behave according to the normal vm_pgoff offsetting
simply by removing the _zero suffix on the function name and if that
causes regressions, it gives us an easy way to revert.

Tested on Rockchip hardware and display is working, including talking to
Lima via prime.

Link: http://lkml.kernel.org/r/751cb8a0f4c3e67e95c58a3b072937617f338eea.1552921225.git.jrdr.linux@gmail.com
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Suggested-by: Russell King <linux@armlinux.org.uk>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Heiko Stuebner <heiko@sntech.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Sandy Huang <hjc@rock-chips.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Pawel Osciak <pawel@osciak.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:50 -07:00
Jérôme Glisse
7269f99993 mm/mmu_notifier: use correct mmu_notifier events for each invalidation
This updates each existing invalidation to use the correct mmu notifier
event that represent what is happening to the CPU page table.  See the
patch which introduced the events to see the rational behind this.

Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Jérôme Glisse
6f4f13e8d9 mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).

Users of mmu notifier API track changes to the CPU page table and take
specific action for them.  While current API only provide range of virtual
address affected by the change, not why the changes is happening.

This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma).  Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.

The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens.  A latter patch do convert mm call path to use a more appropriate
events for each call.

This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:

%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }

@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)

@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%

Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place

Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Linus Torvalds
0968621917 Printk changes for 5.2
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAlzP8nQACgkQUqAMR0iA
 lPK79A/+NkRouqA9ihAZhUbgW0DHzOAFvUJSBgX11HQAZbGjngakuoyYFvwUx0T0
 m80SUTCysxQrWl+xLdccPZ9ZrhP2KFQrEBEdeYHZ6ymcYcl83+3bOIBS7VwdZAbO
 EzB8u/58uU/sI6ABL4lF7ZF/+R+U4CXveEUoVUF04bxdPOxZkRX4PT8u3DzCc+RK
 r4yhwQUXGcKrHa2GrRL3GXKsDxcnRdFef/nzq4RFSZsi0bpskzEj34WrvctV6j+k
 FH/R3kEcZrtKIMPOCoDMMWq07yNqK/QKj0MJlGoAlwfK4INgcrSXLOx+pAmr6BNq
 uMKpkxCFhnkZVKgA/GbKEGzFf+ZGz9+2trSFka9LD2Ig6DIstwXqpAgiUK8JFQYj
 lq1mTaJZD3DfF2vnGHGeAfBFG3XETv+mIT/ow6BcZi3NyNSVIaqa5GAR+lMc6xkR
 waNkcMDkzLFuP1r0p7ZizXOksk9dFkMP3M6KqJomRtApwbSNmtt+O2jvyLPvB3+w
 wRyN9WT7IJZYo4v0rrD5Bl6BjV15ZeCPRSFZRYofX+vhcqJQsFX1M9DeoNqokh55
 Cri8f6MxGzBVjE1G70y2/cAFFvKEKJud0NUIMEuIbcy+xNrEAWPF8JhiwpKKnU10
 c0u674iqHJ2HeVsYWZF0zqzqQ6E1Idhg/PrXfuVuhAaL5jIOnYY=
 =WZfC
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Allow state reset of printk_once() calls.

 - Prevent crashes when dereferencing invalid pointers in vsprintf().
   Only the first byte is checked for simplicity.

 - Make vsprintf warnings consistent and inlined.

 - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf
   modifiers.

 - Some clean up of vsprintf and test_printf code.

* tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  lib/vsprintf: Make function pointer_string static
  vsprintf: Limit the length of inlined error messages
  vsprintf: Avoid confusion between invalid address and value
  vsprintf: Prevent crash when dereferencing invalid pointers
  vsprintf: Consolidate handling of unknown pointer specifiers
  vsprintf: Factor out %pO handler as kobject_string()
  vsprintf: Factor out %pV handler as va_format()
  vsprintf: Factor out %p[iI] handler as ip_addr_string()
  vsprintf: Do not check address of well-known strings
  vsprintf: Consistent %pK handling for kptr_restrict == 0
  vsprintf: Shuffle restricted_pointer()
  printk: Tie printk_once / printk_deferred_once into .data.once for reset
  treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
  lib/test_printf: Switch to bitmap_zalloc()
2019-05-07 09:18:12 -07:00
Sakari Ailus
d75f773c86 treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
%pF and %pf are functionally equivalent to %pS and %ps conversion
specifiers. The former are deprecated, therefore switch the current users
to use the preferred variant.

The changes have been produced by the following command:

	git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
	while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

And verifying the result.

Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: drbd-dev@lists.linbit.com
Cc: linux-block@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: linux-pci@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-btrfs@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: linux-mm@kvack.org
Cc: ceph-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-04-09 14:19:06 +02:00
Peter Zijlstra
fa0aafb8ac asm-generic/tlb: Remove tlb_flush_mmu_free()
As the comment notes; it is a potentially dangerous operation. Just
use tlb_flush_mmu(), that will skip the (double) TLB invalidate if
it really isn't needed anyway.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:33:01 +02:00
Peter Zijlstra
ed6a79352c asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
Move the mmu_gather::page_size things into the generic code instead of
PowerPC specific bits.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-03 10:32:40 +02:00
Jan Kara
cae85cb8ad mm/memory.c: fix modifying of page protection by insert_pfn()
Aneesh has reported that PPC triggers the following warning when
excercising DAX code:

  IP set_pte_at+0x3c/0x190
  LR insert_pfn+0x208/0x280
  Call Trace:
     insert_pfn+0x68/0x280
     dax_iomap_pte_fault.isra.7+0x734/0xa40
     __xfs_filemap_fault+0x280/0x2d0
     do_wp_page+0x48c/0xa40
     __handle_mm_fault+0x8d0/0x1fd0
     handle_mm_fault+0x140/0x250
     __do_page_fault+0x300/0xd60
     handle_page_fault+0x18

Now that is WARN_ON in set_pte_at which is

        VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));

The problem is that on some architectures set_pte_at() cannot cope with
a situation where there is already some (different) valid entry present.

Use ptep_set_access_flags() instead to modify the pfn which is built to
deal with modifying existing PTE.

Link: http://lkml.kernel.org/r/20190311084537.16029-1-jack@suse.cz
Fixes: b2770da642 "mm: add vm_insert_mixed_mkwrite()"
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Chandan Rajendra <chandan@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-29 10:01:37 -07:00
Jan Stancek
fc8efd2ddf mm/memory.c: do_fault: avoid usage of stale vm_area_struct
LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
This is a stress test, where one thread mmaps/writes/munmaps memory area
and other thread is trying to read from it:

  CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
  Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
  Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
  Call Trace:
  ([<0000000000000000>]           (null))
   [<00000000001adae4>] lock_acquire+0xec/0x258
   [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
   [<000000000012a780>] page_table_free+0x48/0x1a8
   [<00000000002f6e54>] do_fault+0xdc/0x670
   [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
   [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
   [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
   [<000000000080e5ee>] pgm_check_handler+0x19e/0x200

page_table_free() is called with NULL mm parameter, but because "0" is a
valid address on s390 (see S390_lowcore), it keeps going until it
eventually crashes in lockdep's lock_acquire.  This crash is
reproducible at least since 4.14.

Problem is that "vmf->vma" used in do_fault() can become stale.  Because
mmap_sem may be released, other threads can come in, call munmap() and
cause "vma" be returned to kmem cache, and get zeroed/re-initialized and
re-used:

handle_mm_fault                           |
  __handle_mm_fault                       |
    do_fault                              |
      vma = vmf->vma                      |
      do_read_fault                       |
        __do_fault                        |
          vma->vm_ops->fault(vmf);        |
            mmap_sem is released          |
                                          |
                                          | do_munmap()
                                          |   remove_vma_list()
                                          |     remove_vma()
                                          |       vm_area_free()
                                          |         # vma is released
                                          | ...
                                          | # same vma is allocated
                                          | # from kmem cache
                                          | do_mmap()
                                          |   vm_area_alloc()
                                          |     memset(vma, 0, ...)
                                          |
      pte_free(vma->vm_mm, ...);          |
        page_table_free                   |
          spin_lock_bh(&mm->context.lock);|
            <crash>                       |

Cache mm_struct to avoid using potentially stale "vma".

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c

Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:21 -08:00
Mike Rapoport
a862f68a8b docs/core-api/mm: fix return value descriptions in mm/
Many kernel-doc comments in mm/ have the return value descriptions
either misformatted or omitted at all which makes kernel-doc script
unhappy:

$ make V=1 htmldocs
...
./mm/util.c:36: info: Scanning doc for kstrdup
./mm/util.c:41: warning: No description found for return value of 'kstrdup'
./mm/util.c:57: info: Scanning doc for kstrdup_const
./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
./mm/util.c:75: info: Scanning doc for kstrndup
./mm/util.c:83: warning: No description found for return value of 'kstrndup'
...

Fixing the formatting and adding the missing return value descriptions
eliminates ~100 such warnings.

Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:20 -08:00
Aneesh Kumar K.V
04a8645304 mm: update ptep_modify_prot_commit to take old pte value as arg
Architectures like ppc64 require to do a conditional tlb flush based on
the old and new value of pte.  Enable that by passing old pte value as
the arg.

Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Aneesh Kumar K.V
0cbe3e26ab mm: update ptep_modify_prot_start/commit to take vm_area_struct as arg
Patch series "NestMMU pte upgrade workaround for mprotect", v5.

We can upgrade pte access (R -> RW transition) via mprotect.  We need to
make sure we follow the recommended pte update sequence as outlined in
commit bd5050e38a ("powerpc/mm/radix: Change pte relax sequence to
handle nest MMU hang") for such updates.  This patch series does that.

This patch (of 5):

Some architectures may want to call flush_tlb_range from these helpers.

Link: http://lkml.kernel.org/r/20190116085035.29729-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:18 -08:00
Greg Kroah-Hartman
d9f7979c92 mm: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Laura Abbott <labbott@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Matthew Wilcox
0ee930e6ca mm/memory.c: prevent mapping typed pages to userspace
Pages which use page_type must never be mapped to userspace as it would
destroy their page type.  Add an explicit check for this instead of
assuming that kernel drivers always get this right.

Link: http://lkml.kernel.org/r/20190129053830.3749-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Matthew Wilcox
2d432cb709 mm: prevent mapping slab pages to userspace
It's never appropriate to map a page allocated by SLAB into userspace.
A buggy device driver might try this, or an attacker might be able to
find a way to make it happen.

Christoph said:

: Let's just fail the code.  Currently this may work with SLUB.  But SLAB
: and SLOB overlay fields with mapcount.  So you would have a corrupted page
: struct if you mapped a slab page to user space.

Link: http://lkml.kernel.org/r/20190125173827.2658-1-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:17 -08:00
Kirill Tkhai
52d1e606ee mm: reuse only-pte-mapped KSM page in do_wp_page()
Add an optimization for KSM pages almost in the same way that we have
for ordinary anonymous pages.  If there is a write fault in a page,
which is mapped to an only pte, and it is not related to swap cache; the
page may be reused without copying its content.

[ Note that we do not consider PageSwapCache() pages at least for now,
  since we don't want to complicate __get_ksm_page(), which has nice
  optimization based on this (for the migration case). Currenly it is
  spinning on PageSwapCache() pages, waiting for when they have
  unfreezed counters (i.e., for the migration finish). But we don't want
  to make it also spinning on swap cache pages, which we try to reuse,
  since there is not a very high probability to reuse them. So, for now
  we do not consider PageSwapCache() pages at all. ]

So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
page_stable_node(), to skip a page, which KSM is currently trying to
link to stable tree.  Then we do page_ref_freeze() to prohibit KSM to
merge one more page into the page, we are reusing.  After that, nobody
can refer to the reusing page: KSM skips !PageSwapCache() pages with
zero refcount; and the protection against of all other participants is
the same as for reused ordinary anon pages pte lock, page lock and
mmap_sem.

[akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Anshuman Khandual
98fa15f34c mm: replace all open encodings for NUMA_NO_NODE
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code.  Please let me know if this
might have missed some instances.  This might also have replaced some
false positives.  I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where invalid node number is
encoded as -1.  Even though implicitly understood it is always better to
have macros in there.  Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE.  This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Matthew Wilcox
1ed7293ac4 mm/memory.c: initialise mmu_notifier_range correctly
One of the paths in follow_pte_pmd() initialised the mmu_notifier_range
incorrectly.

Link: http://lkml.kernel.org/r/20190103002126.GM6310@bombadil.infradead.org
Fixes: ac46d4f3c4 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Michal Hocko
63f3655f95 mm, memcg: fix reclaim deadlock with writeback
Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
ext4 writeback

  task1:
    wait_on_page_bit+0x82/0xa0
    shrink_page_list+0x907/0x960
    shrink_inactive_list+0x2c7/0x680
    shrink_node_memcg+0x404/0x830
    shrink_node+0xd8/0x300
    do_try_to_free_pages+0x10d/0x330
    try_to_free_mem_cgroup_pages+0xd5/0x1b0
    try_charge+0x14d/0x720
    memcg_kmem_charge_memcg+0x3c/0xa0
    memcg_kmem_charge+0x7e/0xd0
    __alloc_pages_nodemask+0x178/0x260
    alloc_pages_current+0x95/0x140
    pte_alloc_one+0x17/0x40
    __pte_alloc+0x1e/0x110
    alloc_set_pte+0x5fe/0xc20
    do_fault+0x103/0x970
    handle_mm_fault+0x61e/0xd10
    __do_page_fault+0x252/0x4d0
    do_page_fault+0x30/0x80
    page_fault+0x28/0x30

  task2:
    __lock_page+0x86/0xa0
    mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
    ext4_writepages+0x479/0xd60
    do_writepages+0x1e/0x30
    __writeback_single_inode+0x45/0x320
    writeback_sb_inodes+0x272/0x600
    __writeback_inodes_wb+0x92/0xc0
    wb_writeback+0x268/0x300
    wb_workfn+0xb4/0x390
    process_one_work+0x189/0x420
    worker_thread+0x4e/0x4b0
    kthread+0xe6/0x100
    ret_from_fork+0x41/0x50

He adds
 "task1 is waiting for the PageWriteback bit of the page that task2 has
  collected in mpd->io_submit->io_bio, and tasks2 is waiting for the
  LOCKED bit the page which tasks1 has locked"

More precisely task1 is handling a page fault and it has a page locked
while it charges a new page table to a memcg.  That in turn hits a
memory limit reclaim and the memcg reclaim for legacy controller is
waiting on the writeback but that is never going to finish because the
writeback itself is waiting for the page locked in the #PF path.  So
this is essentially ABBA deadlock:

                                        lock_page(A)
                                        SetPageWriteback(A)
                                        unlock_page(A)
  lock_page(B)
                                        lock_page(B)
  pte_alloc_pne
    shrink_page_list
      wait_on_page_writeback(A)
                                        SetPageWriteback(B)
                                        unlock_page(B)

                                        # flush A, B to clear the writeback

This accumulating of more pages to flush is used by several filesystems
to generate a more optimal IO patterns.

Waiting for the writeback in legacy memcg controller is a workaround for
pre-mature OOM killer invocations because there is no dirty IO
throttling available for the controller.  There is no easy way around
that unfortunately.  Therefore fix this specific issue by pre-allocating
the page table outside of the page lock.  We have that handy
infrastructure for that already so simply reuse the fault-around pattern
which already does this.

There are probably other hidden __GFP_ACCOUNT | GFP_KERNEL allocations
from under a fs page locked but they should be really rare.  I am not
aware of a better solution unfortunately.

[akpm@linux-foundation.org: fix mm/memory.c:__do_fault()]
[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: enhance comment, per Johannes]
  Link: http://lkml.kernel.org/r/20181214084948.GA5624@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181213092221.27270-1-mhocko@kernel.org
Fixes: c3b94f44fc ("memcg: further prevent OOM with too many dirty pages")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Liu Bo <bo.liu@linux.alibaba.com>
Debugged-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
Joel Fernandes (Google)
4cf5892495 mm: treewide: remove unused address argument from pte_alloc functions
Patch series "Add support for fast mremap".

This series speeds up the mremap(2) syscall by copying page tables at
the PMD level even for non-THP systems.  There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle architecture related in the future that may make the scheme not
work.  Also we find that there is no point in passing the 'address' to
pte_alloc since its unused.  This patch therefore removes this argument
tree-wide resulting in a nice negative diff as well.  Also ensuring
along the way that the enabled architectures do not do anything funky
with the 'address' argument that goes unnoticed by the optimization.

Build and boot tested on x86-64.  Build tested on arm64.  The config
enablement patch for arm64 will be posted in the future after more
testing.

The changes were obtained by applying the following Coccinelle script.
(thanks Julia for answering all Coccinelle questions!).
Following fix ups were done manually:
* Removal of address argument from  pte_fragment_alloc
* Removal of pte_alloc_one_fast definitions from m68k and microblaze.

// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.

virtual patch

@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@

 fn(...
- , T2 E2
 )
 { ... }

@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)

@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)

@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

 fn(...
-,  E2
 )

@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@

(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)

Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:47 -08:00
Michal Hocko
7635d9cbe8 mm, thp, proc: report THP eligibility for each vma
Userspace falls short when trying to find out whether a specific memory
range is eligible for THP.  There are usecases that would like to know
that
http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
: This is used to identify heap mappings that should be able to fault thp
: but do not, and they normally point to a low-on-memory or fragmentation
: issue.

The only way to deduce this now is to query for hg resp.  nh flags and
confronting the state with the global setting.  Except that there is also
PR_SET_THP_DISABLE that might change the picture.  So the final logic is
not trivial.  Moreover the eligibility of the vma depends on the type of
VMA as well.  In the past we have supported only anononymous memory VMAs
but things have changed and shmem based vmas are supported as well these
days and the query logic gets even more complicated because the
eligibility depends on the mount option and another global configuration
knob.

Simplify the current state and report the THP eligibility in
/proc/<pid>/smaps for each existing vma.  Reuse
transparent_hugepage_enabled for this purpose.  The original
implementation of this function assumes that the caller knows that the vma
itself is supported for THP so make the core checks into
__transparent_hugepage_enabled and use it for existing callers.
__show_smap just use the new transparent_hugepage_enabled which also
checks the vma support status (please note that this one has to be out of
line due to include dependency issues).

[mhocko@kernel.org: fix oops with NULL ->f_mapping]
  Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Oppenheimer <bepvte@gmail.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Jérôme Glisse
ac46d4f3c4 mm/mmu_notifier: use structure for invalidate_range_start/end calls v2
To avoid having to change many call sites everytime we want to add a
parameter use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end cakks.  No functional changes with this patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
From: Jérôme Glisse <jglisse@redhat.com>
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Jan Kara
f2c57d91b0 mm: Fix warning in insert_pfn()
In DAX mode a write pagefault can race with write(2) in the following
way:

CPU0                            CPU1
                                write fault for mapped zero page (hole)
dax_iomap_rw()
  iomap_apply()
    xfs_file_iomap_begin()
      - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - invalidates radix tree entries in given range
                                dax_iomap_pte_fault()
                                  grab_mapping_entry()
                                    - no entry found, creates empty
                                  ...
                                  xfs_file_iomap_begin()
                                    - finds already allocated block
                                  ...
                                  vmf_insert_mixed_mkwrite()
                                    - WARNs and does nothing because there
                                      is still zero page mapped in PTE
        unmap_mapping_pages()

This race results in WARN_ON from insert_pfn() and is occasionally
triggered by fstest generic/344. Note that the race is otherwise
harmless as before write(2) on CPU0 is finished, we will invalidate page
tables properly and thus user of mmap will see modified data from
write(2) from that point on. So just restrict the warning only to the
case when the PFN in PTE is not zero page.

Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:17 -07:00
Aneesh Kumar K.V
ff09d7ec97 mm/memory.c: recheck page table entry with page table lock held
We clear the pte temporarily during read/modify/write update of the pte.
If we take a page fault while the pte is cleared, the application can get
SIGBUS.  One such case is with remap_pfn_range without a backing
vm_ops->fault callback.  do_fault will return SIGBUS in that case.

cpu 0		 				cpu1
mprotect()
ptep_modify_prot_start()/pte cleared.
.
.						page fault.
.
.
prep_modify_prot_commit()

Fix this by taking page table lock and rechecking for pte_none.

[aneesh.kumar@linux.ibm.com: fix crash observed with syzkaller run]
  Link: http://lkml.kernel.org/r/87va6bwlfg.fsf@linux.ibm.com
Link: http://lkml.kernel.org/r/20180926031858.9692-1-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:35 -07:00
Matthew Wilcox
9b5a8e00d4 mm: convert insert_pfn() to vm_fault_t
All callers convert its errno into a vm_fault_t, so convert it to return a
vm_fault_t directly.

Link: http://lkml.kernel.org/r/20180828145728.11873-11-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:20 -07:00
Matthew Wilcox
79f3aa5ba9 mm: convert __vm_insert_mixed() to vm_fault_t
Both of its callers currently convert its errno return into a vm_fault_t,
so move the conversion into __vm_insert_mixed().

Link: http://lkml.kernel.org/r/20180828145728.11873-10-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:20 -07:00
Matthew Wilcox
6d958546ff mm: inline vm_insert_pfn_prot() into caller
vm_insert_pfn_prot() is only called from vmf_insert_pfn_prot(), so inline
it and convert some of the errnos into vm_fault codes earlier.

Link: http://lkml.kernel.org/r/20180828145728.11873-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:20 -07:00
Matthew Wilcox
ae2b01f370 mm: remove vm_insert_pfn()
All callers are now converted to vmf_insert_pfn() so convert
vmf_insert_pfn() from being a compatibility wrapper around vm_insert_pfn()
to being a compatibility wrapper around vmf_insert_pfn_prot().

Link: http://lkml.kernel.org/r/20180828145728.11873-8-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:20 -07:00
Matthew Wilcox
bc12e6ad96 mm: make vm_insert_pfn_prot() static
Now this is no longer used outside mm/memory.c, make it static.

Link: http://lkml.kernel.org/r/20180828145728.11873-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Matthew Wilcox
f5e6d1d5f8 mm: introduce vmf_insert_pfn_prot()
Like vm_insert_pfn_prot(), but returns a vm_fault_t instead of an errno.
Also unexport vm_insert_pfn_prot as it has no modular users.

Link: http://lkml.kernel.org/r/20180828145728.11873-4-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Matthew Wilcox
5d74763745 mm: remove vm_insert_mixed()
All callers are now converted to vmf_insert_mixed() so convert
vmf_insert_mixed() from being a compatibility wrapper into the real
function.

Link: http://lkml.kernel.org/r/20180828145728.11873-3-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Peter Zijlstra
196d9d8bb7 mm/memory: Move mmu_gather and TLB invalidation code into its own file
In preparation for maintaining the mmu_gather code as its own entity,
move the implementation out of memory.c and into its own file.

Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-09-07 15:19:25 +01:00
Will Deacon
a6d60245d6 asm-generic/tlb: Track which levels of the page tables have been cleared
It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather than
iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for invalidation.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-09-04 11:08:27 +01:00
Linus Torvalds
1b2de5d039 mm/cow: don't bother write protecting already write-protected pages
This is not normally noticeable, but repeated forks are unnecessarily
expensive because they repeatedly dirty the parent page tables during
the page table copy operation.

It's trivial to just avoid write protecting the page table entry if it
was already not writable.

This patch was inspired by

    https://bugzilla.kernel.org/show_bug.cgi?id=200447

which points to an ancient "waste time re-doing fork" issue in the
presence of lots of signals.

That bug was fixed by Eric Biederman's signal handling series
culminating in commit c3ad2c3b02 ("signal: Don't restart fork when
signals come in"), but the unnecessary work for repeated forks is still
work just fixing, particularly since the fix is trivial.

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-25 13:15:03 -07:00
Linus Torvalds
33e17876ea Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - the rest of MM

 - various misc fixes and tweaks

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  mm: Change return type int to vm_fault_t for fault handlers
  lib/fonts: convert comments to utf-8
  s390: ebcdic: convert comments to UTF-8
  treewide: convert ISO_8859-1 text comments to utf-8
  drivers/gpu/drm/gma500/: change return type to vm_fault_t
  docs/core-api: mm-api: add section about GFP flags
  docs/mm: make GFP flags descriptions usable as kernel-doc
  docs/core-api: split memory management API to a separate file
  docs/core-api: move *{str,mem}dup* to "String Manipulation"
  docs/core-api: kill trailing whitespace in kernel-api.rst
  mm/util: add kernel-doc for kvfree
  mm/util: make strndup_user description a kernel-doc comment
  fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
  treewide: correct "differenciate" and "instanciate" typos
  fs/afs: use new return type vm_fault_t
  drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
  mm: soft-offline: close the race against page allocation
  mm: fix race on soft-offlining free huge pages
  namei: allow restricted O_CREAT of FIFOs and regular files
  hfs: prevent crash on exit from failed search
  ...
2018-08-23 19:20:12 -07:00
Souptick Joarder
2b74030354 mm: Change return type int to vm_fault_t for fault handlers
Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")

The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type.  As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.

The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.

vmf_error() is the newly introduce inline function in 4.17-rc6.

[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:44 -07:00
Nicholas Piggin
fd1102f0aa mm: mmu_notifier fix for tlb_end_vma
The generic tlb_end_vma does not call invalidate_range mmu notifier, and
it resets resets the mmu_gather range, which means the notifier won't be
called on part of the range in case of an unmap that spans multiple
vmas.

ARM64 seems to be the only arch I could see that has notifiers and uses
the generic tlb_end_vma.  I have not actually tested it.

[ Catalin and Will point out that ARM64 currently only uses the
  notifiers for KVM, which doesn't use the ->invalidate_range()
  callback right now, so it's a bug, but one that happens to
  not affect them.  So not necessary for stable.  - Linus ]

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 11:55:58 -07:00
Peter Zijlstra
d86564a2f0 mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
Jann reported that x86 was missing required TLB invalidates when he
hit the !*batch slow path in tlb_remove_table().

This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
invalidates, the PowerPC-hash where this code originated and the
Sparc-hash where this was subsequently used did not need that. ARM
which later used this put an explicit TLB invalidate in their
__p*_free_tlb() functions, and PowerPC-radix followed that example.

But when we hooked up x86 we failed to consider this. Fix this by
(optionally) hooking tlb_remove_table() into the TLB invalidate code.

NOTE: s390 was also needing something like this and might now
      be able to use the generic code again.

[ Modified to be on top of Nick's cleanups, which simplified this patch
  now that tlb_flush_mmu_tlbonly() really only flushes the TLB - Linus ]

Fixes: 9e52fc2b50 ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 11:55:58 -07:00
Peter Zijlstra
a6f572084f mm/tlb: Remove tlb_remove_table() non-concurrent condition
Will noted that only checking mm_users is incorrect; we should also
check mm_count in order to cover CPUs that have a lazy reference to
this mm (and could do speculative TLB operations).

If removing this turns out to be a performance issue, we can
re-instate a more complete check, but in tlb_table_flush() eliding the
call_rcu_sched().

Fixes: 2672391169 ("mm, powerpc: move the RCU page-table freeing into generic code")
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 11:55:58 -07:00
Nicholas Piggin
db7ddef301 mm: move tlb_table_flush to tlb_flush_mmu_free
There is no need to call this from tlb_flush_mmu_tlbonly, it logically
belongs with tlb_flush_mmu_free.  This makes future fixes simpler.

[ This was originally done to allow code consolidation for the
  mmu_notifier fix, but it also ends up helping simplify the
  HAVE_RCU_TABLE_INVALIDATE fix.    - Linus ]

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 11:53:24 -07:00
Peter Zijlstra
52a288c736 x86/mm/tlb: Revert the recent lazy TLB patches
Revert commits:

  95b0e6357d x86/mm/tlb: Always use lazy TLB mode
  64482aafe5 x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
  ac03158969 x86/mm/tlb: Make lazy TLB mode lazier
  61d0beb579 x86/mm/tlb: Restructure switch_mm_irqs_off()
  2ff6ddf19c x86/mm/tlb: Leave lazy TLB mode at page table free time

In order to simplify the TLB invalidate fixes for x86 and unify the
parts that need backporting.  We'll try again later.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 18:22:04 -07:00
Rik van Riel
50c150f262 Revert "mm: always flush VMA ranges affected by zap_page_range"
There was a bug in Linux that could cause madvise (and mprotect?) system
calls to return to userspace without the TLB having been flushed for all
the pages involved.

This could happen when multiple threads of a process made simultaneous
madvise and/or mprotect calls.

This was noticed in the summer of 2017, at which time two solutions
were created:

  56236a5955 ("mm: refactor TLB gathering API")
  99baac21e4 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
and
  4647706ebe ("mm: always flush VMA ranges affected by zap_page_range")

We need only one of these solutions, and the former appears to be a
little more efficient than the latter, so revert that one.

This reverts 4647706ebe ("mm: always flush VMA ranges affected by
zap_page_range")

Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Michal Hocko
29ef680ae7 memcg, oom: move out_of_memory back to the charge path
Commit 3812c8c8f3 ("mm: memcg: do not trap chargers with full
callstack on OOM") has changed the ENOMEM semantic of memcg charges.
Rather than invoking the oom killer from the charging context it delays
the oom killer to the page fault path (pagefault_out_of_memory).  This
in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
the corresponding memcg hits the hard limit and the memcg is is OOM.
This is behavior is inconsistent with !memcg case where the oom killer
is invoked from the allocation context and the allocator keeps retrying
until it succeeds.

The difference in the behavior is user visible.  mmap(MAP_POPULATE)
might result in not fully populated ranges while the mmap return code
doesn't tell that to the userspace.  Random syscalls might fail with
ENOMEM etc.

The primary motivation of the different memcg oom semantic was the
deadlock avoidance.  Things have changed since then, though.  We have an
async oom teardown by the oom reaper now and so we do not have to rely
on the victim to tear down its memory anymore.  Therefore we can return
to the original semantic as long as the memcg oom killer is not handed
over to the users space.

There is still one thing to be careful about here though.  If the oom
killer is not able to make any forward progress - e.g.  because there is
no eligible task to kill - then we have to bail out of the charge path
to prevent from same class of deadlocks.  We have basically two options
here.  Either we fail the charge with ENOMEM or force the charge and
allow overcharge.  The first option has been considered more harmful
than useful because rare inconsistencies in the ENOMEM behavior is hard
to test for and error prone.  Basically the same reason why the page
allocator doesn't fail allocations under such conditions.  The later
might allow runaways but those should be really unlikely unless somebody
misconfigures the system.  E.g.  allowing to migrate tasks away from the
memcg to a different unlimited memcg with move_charge_at_immigrate
disabled.

Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:30 -07:00
Huang Ying
c9f4cd7138 mm, huge page: copy target sub-page last when copy huge page
Huge page helps to reduce TLB miss rate, but it has higher cache
footprint, sometimes this may cause some issue.  For example, when
copying huge page on x86_64 platform, the cache footprint is 4M.  But on
a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
(last level cache).  That is, in average, there are 2.5M LLC for each
core and 1.25M LLC for each thread.

If the cache contention is heavy when copying the huge page, and we copy
the huge page from the begin to the end, it is possible that the begin
of huge page is evicted from the cache after we finishing copying the
end of the huge page.  And it is possible for the application to access
the begin of the huge page after copying the huge page.

In c79b57e462 ("mm: hugetlb: clear target sub-page last when clearing
huge page"), to keep the cache lines of the target subpage hot, the
order to clear the subpages in the huge page in clear_huge_page() is
changed to clearing the subpage which is furthest from the target
subpage firstly, and the target subpage last.  The similar order
changing helps huge page copying too.  That is implemented in this
patch.  Because we have put the order algorithm into a separate
function, the implementation is quite simple.

The patch is a generic optimization which should benefit quite some
workloads, not for a specific use case.  To demonstrate the performance
benefit of the patch, we tested it with vm-scalability run on
transparent huge page.

With this patch, the throughput increases ~16.6% in vm-scalability
anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
system (36 cores, 72 threads).  The test case set
/sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big
anonymous memory area and populate it, then forked 36 child processes,
each writes to the anonymous memory area from the begin to the end, so
cause copy on write.  For each child process, other child processes
could be seen as other workloads which generate heavy cache pressure.
At the same time, the IPC (instruction per cycle) increased from 0.63 to
0.78, and the time spent in user space is reduced ~7.2%.

Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Huang Ying
c6ddfb6c58 mm, clear_huge_page: move order algorithm into a separate function
Patch series "mm, huge page: Copy target sub-page last when copy huge
page", v2.

Huge page helps to reduce TLB miss rate, but it has higher cache
footprint, sometimes this may cause some issue.  For example, when
copying huge page on x86_64 platform, the cache footprint is 4M.  But on
a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
(last level cache).  That is, in average, there are 2.5M LLC for each
core and 1.25M LLC for each thread.

If the cache contention is heavy when copying the huge page, and we copy
the huge page from the begin to the end, it is possible that the begin
of huge page is evicted from the cache after we finishing copying the
end of the huge page.  And it is possible for the application to access
the begin of the huge page after copying the huge page.

In c79b57e462 ("mm: hugetlb: clear target sub-page last when clearing
huge page"), to keep the cache lines of the target subpage hot, the
order to clear the subpages in the huge page in clear_huge_page() is
changed to clearing the subpage which is furthest from the target
subpage firstly, and the target subpage last.  The similar order
changing helps huge page copying too.  That is implemented in this
patchset.

The patchset is a generic optimization which should benefit quite some
workloads, not for a specific use case.  To demonstrate the performance
benefit of the patchset, we have tested it with vm-scalability run on
transparent huge page.

With this patchset, the throughput increases ~16.6% in vm-scalability
anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
system (36 cores, 72 threads).  The test case set
/sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big
anonymous memory area and populate it, then forked 36 child processes,
each writes to the anonymous memory area from the begin to the end, so
cause copy on write.  For each child process, other child processes
could be seen as other workloads which generate heavy cache pressure.
At the same time, the IPC (instruction per cycle) increased from 0.63 to
0.78, and the time spent in user space is reduced ~7.2%.

This patch (of 4):

In c79b57e462 ("mm: hugetlb: clear target sub-page last when clearing
huge page"), to keep the cache lines of the target subpage hot, the
order to clear the subpages in the huge page in clear_huge_page() is
changed to clearing the subpage which is furthest from the target
subpage firstly, and the target subpage last.  This optimization could
be applied to copying huge page too with the same order algorithm.  To
avoid code duplication and reduce maintenance overhead, in this patch,
the order algorithm is moved out of clear_huge_page() into a separate
function: process_huge_page().  So that we can use it for copying huge
page too.

This will change the direct calls to clear_user_highpage() into the
indirect calls.  But with the proper inline support of the compilers,
the indirect call will be optimized to be the direct call.  Our tests
show no performance change with the patch.

This patch is a code cleanup without functionality change.

Link: http://lkml.kernel.org/r/20180524005851.4079-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:29 -07:00
Yang Shi
fadae29530 thp: use mm_file_counter to determine update which rss counter
Since commit eca56ff906 ("mm, shmem: add internal shmem resident
memory accounting"), MM_SHMEMPAGES is added to separate the shmem
accounting from regular files.  So, all shmem pages should be accounted
to MM_SHMEMPAGES instead of MM_FILEPAGES.

And, normal 4K shmem pages have been accounted to MM_SHMEMPAGES, so
shmem thp pages should be not treated differently.  Account them to
MM_SHMEMPAGES via mm_counter_file() since shmem pages are swap backed to
keep consistent with normal 4K shmem pages.

This will not change the rss counter of processes since shmem pages are
still a part of it.

The /proc/pid/status and /proc/pid/statm counters will however be more
accurate wrt shmem usage, as originally intended.  And as eca56ff906
("mm, shmem: add internal shmem resident memory accounting") mentioned,
oom also could report more accurate "shmem-rss".

Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Dave Jiang
e1fb4a0864 dax: remove VM_MIXEDMAP for fsdax and device dax
This patch is reworked from an earlier patch that Dan has posted:
https://patchwork.kernel.org/patch/10131727/

VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map.  The get_user_pages_fast() path, since it does not resolve the vma,
is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
use that as a VM_MIXEDMAP replacement in some locations.  In the cases
where there is no pte to consult we fallback to using vma_is_dax() to
detect the VM_MIXEDMAP special case.

Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX we can stop setting VM_MIXEDMAP.  This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.

DAX should also now be supported with madvise_behavior(), vma_merge(),
and copy_page_range().

This patch has been tested against ndctl unit test.  It has also been
tested against xfstests commit: 625515d using fake pmem created by
memmap and no additional issues have been observed.

Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Linus Torvalds
73ba2fb33c for-4.19/block-20180812
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAltwvasQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv65EACTq5gSLnJBI6ZPr1RAHruVDnjfzO2Veitl
 tUtjm0XfWmnEiwQ3dYvnyhy99xbyaG3900d9BClCTlH6xaUdSiQkDpcKG/R2F36J
 5mZitYukQcpFAQJWF8YKsTTE7JPl4VglCIDqYiC4+C3rOSVi8lrKn2qp4J4MMCFn
 thRg3jCcq7c5s9Eigsop1pXWQSasubkXfk55Krcp4oybKYpYRKXXf74Mj14QAbwJ
 QHN3VisyAUWoBRg7UQZo1Npe2oPk6bbnJypnjf8M0M2EnlvddEkIlHob91sodka8
 6p4APOEu5cbyXOBCAQsw/koff14mb8aEadqeQA68WvXfIdX9ZjfxCX0OoC3sBEXk
 yqJhZ0C980AM13zIBD8ejv4uasGcPca8W+47mE5P8sRiI++5kBsFWDZPCtUBna0X
 2Kh24NsmEya9XRR5vsB84dsIPQ3tLMkxg/IgQRVDaSnfJz0c/+zm54xDyKRaFT4l
 5iERk2WSkm9+8jNfVmWG0edrv6nRAXjpGwFfOCPh6/LCSCi4xQRULYN7sVzsX8ZK
 FRjt24HftBI8mJbh4BtweJvg+ppVe1gAk3IO3HvxAQhv29Hz+uvFYe9kL+3N8LJA
 Qosr9n9O4+wKYizJcDnw+5iPqCHfAwOm9th4pyedR+R7SmNcP3yNC8AbbheNBiF5
 Zolos5H+JA==
 =b9ib
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "First pull request for this merge window, there will also be a
  followup request with some stragglers.

  This pull request contains:

   - Fix for a thundering heard issue in the wbt block code (Anchal
     Agarwal)

   - A few NVMe pull requests:
      * Improved tracepoints (Keith)
      * Larger inline data support for RDMA (Steve Wise)
      * RDMA setup/teardown fixes (Sagi)
      * Effects log suppor for NVMe target (Chaitanya Kulkarni)
      * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
      * TP4004 (ANA) support (Christoph)
      * Various NVMe fixes

   - Block io-latency controller support. Much needed support for
     properly containing block devices. (Josef)

   - Series improving how we handle sense information on the stack
     (Kees)

   - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

   - Zoned device support for null_blk (Matias)

   - AIX partition fixes (Mauricio Faria de Oliveira)

   - DIF checksum code made generic (Max Gurtovoy)

   - Add support for discard in iostats (Michael Callahan / Tejun)

   - Set of updates for BFQ (Paolo)

   - Removal of async write support for bsg (Christoph)

   - Bio page dirtying and clone fixups (Christoph)

   - Set of bcache fix/changes (via Coly)

   - Series improving blk-mq queue setup/teardown speed (Ming)

   - Series improving merging performance on blk-mq (Ming)

   - Lots of other fixes and cleanups from a slew of folks"

* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
  blkcg: Make blkg_root_lookup() work for queues in bypass mode
  bcache: fix error setting writeback_rate through sysfs interface
  null_blk: add lock drop/acquire annotation
  Blk-throttle: reduce tail io latency when iops limit is enforced
  block: paride: pd: mark expected switch fall-throughs
  block: Ensure that a request queue is dissociated from the cgroup controller
  block: Introduce blk_exit_queue()
  blkcg: Introduce blkg_root_lookup()
  block: Remove two superfluous #include directives
  blk-mq: count the hctx as active before allocating tag
  block: bvec_nr_vecs() returns value for wrong slab
  bcache: trivial - remove tailing backslash in macro BTREE_FLAG
  bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
  bcache: set max writeback rate when I/O request is idle
  bcache: add code comments for bset.c
  bcache: fix mistaken comments in request.c
  bcache: fix mistaken code comments in bcache.h
  bcache: add a comment in super.c
  bcache: avoid unncessary cache prefetch bch_btree_node_get()
  bcache: display rate debug parameters to 0 when writeback is not running
  ...
2018-08-14 10:23:25 -07:00
Linus Torvalds
958f338e96 Merge branch 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Merge L1 Terminal Fault fixes from Thomas Gleixner:
 "L1TF, aka L1 Terminal Fault, is yet another speculative hardware
  engineering trainwreck. It's a hardware vulnerability which allows
  unprivileged speculative access to data which is available in the
  Level 1 Data Cache when the page table entry controlling the virtual
  address, which is used for the access, has the Present bit cleared or
  other reserved bits set.

  If an instruction accesses a virtual address for which the relevant
  page table entry (PTE) has the Present bit cleared or other reserved
  bits set, then speculative execution ignores the invalid PTE and loads
  the referenced data if it is present in the Level 1 Data Cache, as if
  the page referenced by the address bits in the PTE was still present
  and accessible.

  While this is a purely speculative mechanism and the instruction will
  raise a page fault when it is retired eventually, the pure act of
  loading the data and making it available to other speculative
  instructions opens up the opportunity for side channel attacks to
  unprivileged malicious code, similar to the Meltdown attack.

  While Meltdown breaks the user space to kernel space protection, L1TF
  allows to attack any physical memory address in the system and the
  attack works across all protection domains. It allows an attack of SGX
  and also works from inside virtual machines because the speculation
  bypasses the extended page table (EPT) protection mechanism.

  The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646

  The mitigations provided by this pull request include:

   - Host side protection by inverting the upper address bits of a non
     present page table entry so the entry points to uncacheable memory.

   - Hypervisor protection by flushing L1 Data Cache on VMENTER.

   - SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
     by offlining the sibling CPU threads. The knobs are available on
     the kernel command line and at runtime via sysfs

   - Control knobs for the hypervisor mitigation, related to L1D flush
     and SMT control. The knobs are available on the kernel command line
     and at runtime via sysfs

   - Extensive documentation about L1TF including various degrees of
     mitigations.

  Thanks to all people who have contributed to this in various ways -
  patches, review, testing, backporting - and the fruitful, sometimes
  heated, but at the end constructive discussions.

  There is work in progress to provide other forms of mitigations, which
  might be less horrible performance wise for a particular kind of
  workloads, but this is not yet ready for consumption due to their
  complexity and limitations"

* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  x86/microcode: Allow late microcode loading with SMT disabled
  tools headers: Synchronise x86 cpufeatures.h for L1TF additions
  x86/mm/kmmio: Make the tracer robust against L1TF
  x86/mm/pat: Make set_memory_np() L1TF safe
  x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
  x86/speculation/l1tf: Invert all not present mappings
  cpu/hotplug: Fix SMT supported evaluation
  KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
  x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
  x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
  Documentation/l1tf: Remove Yonah processors from not vulnerable list
  x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
  x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
  x86: Don't include linux/irq.h from asm/hardirq.h
  x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
  x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
  x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
  x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
  x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
  cpu/hotplug: detect SMT disabled by BIOS
  ...
2018-08-14 09:46:06 -07:00