linux-stable/tools
David Hildenbrand 78fbe906cc mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages
The basic question we would like to have a reliable and efficient answer
to is: is this anonymous page exclusive to a single process or might it be
shared?  We need that information for ordinary/single pages, hugetlb
pages, and possibly each subpage of a THP.

Introduce a way to mark an anonymous page as exclusive, with the ultimate
goal of teaching our COW logic to not do "wrong COWs", whereby GUP pins
lose consistency with the pages mapped into the page table, resulting in
reported memory corruptions.

Most pageflags already have semantics for anonymous pages, however,
PG_mappedtodisk should never apply to pages in the swapcache, so let's
reuse that flag.

As PG_has_hwpoisoned also uses that flag on the second tail page of a
compound page, convert it to PG_error instead, which is marked as
PF_NO_TAIL, so never used for tail pages.

Use custom page flag modification functions such that we can do additional
sanity checks.  The semantics we'll put into some kernel doc in the future
are:

"
  PG_anon_exclusive is *usually* only expressive in combination with a
  page table entry. Depending on the page table entry type it might
  store the following information:

       Is what's mapped via this page table entry exclusive to the
       single process and can be mapped writable without further
       checks? If not, it might be shared and we might have to COW.

  For now, we only expect PTE-mapped THPs to make use of
  PG_anon_exclusive in subpages. For other anonymous compound
  folios (i.e., hugetlb), only the head page is logically mapped and
  holds this information.

  For example, an exclusive, PMD-mapped THP only has PG_anon_exclusive
  set on the head page. When replacing the PMD by a page table full
  of PTEs, PG_anon_exclusive, if set on the head page, will be set on
  all tail pages accordingly. Note that converting from a PTE-mapping
  to a PMD mapping using the same compound page is currently not
  possible and consequently doesn't require care.

  If GUP wants to take a reliable pin (FOLL_PIN) on an anonymous page,
  it should only pin if the relevant PG_anon_exclusive is set. In that
  case, the pin will be fully reliable and stay consistent with the pages
  mapped into the page table, as the bit cannot get cleared (e.g., by
  fork(), KSM) while the page is pinned. For anonymous pages that
  are mapped R/W, PG_anon_exclusive can be assumed to always be set
  because such pages cannot possibly be shared.

  The page table lock protecting the page table entry is the primary
  synchronization mechanism for PG_anon_exclusive; GUP-fast that does
  not take the PT lock needs special care when trying to clear the
  flag.

  Page table entry types and PG_anon_exclusive:
  * Present: PG_anon_exclusive applies.
  * Swap: the information is lost. PG_anon_exclusive was cleared.
  * Migration: the entry holds this information instead.
               PG_anon_exclusive was cleared.
  * Device private: PG_anon_exclusive applies.
  * Device exclusive: PG_anon_exclusive applies.
  * HW Poison: PG_anon_exclusive is stale and not changed.

  If the page may be pinned (FOLL_PIN), clearing PG_anon_exclusive is
  not allowed and the flag will stick around until the page is freed
  and folio->mapping is cleared.
"

We won't be clearing PG_anon_exclusive on destructive unmapping (i.e.,
zapping) of page table entries, page freeing code will handle that when
also invalidate page->mapping to not indicate PageAnon() anymore.  Letting
information about exclusivity stick around will be an important property
when adding sanity checks to unpinning code.

Note that we properly clear the flag in free_pages_prepare() via
PAGE_FLAGS_CHECK_AT_PREP for each individual subpage of a compound page,
so there is no need to manually clear the flag.

Link: https://lkml.kernel.org/r/20220428083441.37290-12-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-09 18:20:44 -07:00
..
accounting
arch x86/tsx: Disable TSX development mode at boot 2022-04-11 09:58:40 +02:00
bootconfig
bpf Networking fixes for 5.18-rc2, including fixes from bpf and netfilter 2022-04-07 19:01:47 -10:00
build tools build: Filter out options and warnings not supported by clang 2022-04-09 12:34:16 -03:00
cgroup tools/cgroup/slabinfo: update to work with struct slab 2022-02-21 11:34:49 +01:00
counter kbuild: replace $(if A,A,B) with $(or A,B) 2022-02-15 12:25:56 +09:00
debugging
edid
firewire
firmware
gpio kbuild: replace $(if A,A,B) with $(or A,B) 2022-02-15 12:25:56 +09:00
hv kbuild: replace $(if A,A,B) with $(or A,B) 2022-02-15 12:25:56 +09:00
iio Kbuild updates for v5.18 2022-03-31 11:59:03 -07:00
include tools: Add kmem_cache_alloc_lru() 2022-04-22 14:24:28 -04:00
io_uring
kvm/kvm_stat
laptop
leds
lib perf tools: Fix segfault accessing sample_id xyarray 2022-04-13 22:23:02 -03:00
memory-model tools/memory-model: Explain syntactic and semantic dependencies 2022-02-01 17:32:30 -08:00
objtool objtool: Fix SLS validation for kcov tail-call replacement 2022-04-05 10:24:40 +02:00
pci kbuild: replace $(if A,A,B) with $(or A,B) 2022-02-15 12:25:56 +09:00
pcmcia
perf perf test: Fix error message for test case 71 on s390, where it is not supported 2022-04-22 18:39:34 -03:00
power tools/power/x86/intel-speed-select: fix build failure when using -Wl,--as-needed 2022-04-13 13:49:48 +02:00
rcu
scripts Kbuild updates for v5.18 2022-03-31 11:59:03 -07:00
spi kbuild: replace $(if A,A,B) with $(or A,B) 2022-02-15 12:25:56 +09:00
testing selftests: cgroup: add a selftest for memory.reclaim 2022-04-29 14:37:00 -07:00
thermal/tmon
time
tracing Kbuild updates for v5.18 2022-03-31 11:59:03 -07:00
usb kbuild: replace $(if A,A,B) with $(or A,B) 2022-02-15 12:25:56 +09:00
virtio tools/virtio: compile with -pthread 2022-03-28 16:52:59 -04:00
vm mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() pages 2022-05-09 18:20:44 -07:00
wmi
Makefile