License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:07:57 +00:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2006-09-27 08:50:01 +00:00
|
|
|
#ifndef _LINUX_MM_TYPES_H
|
|
|
|
#define _LINUX_MM_TYPES_H
|
|
|
|
|
2017-02-03 23:12:19 +00:00
|
|
|
#include <linux/mm_types_task.h>
|
|
|
|
|
2007-10-17 06:30:12 +00:00
|
|
|
#include <linux/auxvec.h>
|
2006-09-27 08:50:01 +00:00
|
|
|
#include <linux/list.h>
|
|
|
|
#include <linux/spinlock.h>
|
2007-10-16 08:24:43 +00:00
|
|
|
#include <linux/rbtree.h>
|
|
|
|
#include <linux/rwsem.h>
|
|
|
|
#include <linux/completion.h>
|
mmu-notifiers: core
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM where the a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page to be freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it avoids the code that teardown the
mappings of the secondary MMU, to implement a logic like tlb_gather in
zap_page_range that would require many IPI to flush other cpu tlbs, for
each fixed number of spte unmapped.
To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establishing an updated
spte or secondary-tlb-mapping on the copied page. Or it will setup a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.
This allows to map any page pointed by any pte (and in turn visible in the
primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence needing of
schedule in XPMEM code to send the invalidate to the remote node, while no
need to schedule in kvm/gru as it's an immediate event like invalidating
primary-mmu pte).
At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track if the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter incraese in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows to compile a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_reigster
is used when a driver startup, a failure can be gracefully handled. Here
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver startups other allocations are required anyway and
-ENOMEM failure paths exists already.
struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int err;
if (!kvm)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+ if (err) {
+ kfree(kvm);
+ return ERR_PTR(err);
+ }
+
return kvm;
}
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
kernel to compile after these changes on non-x86 archs (x86 didn't need
them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-28 22:46:29 +00:00
|
|
|
#include <linux/cpumask.h>
|
2012-03-30 18:26:31 +00:00
|
|
|
#include <linux/uprobes.h>
|
2013-02-23 00:34:30 +00:00
|
|
|
#include <linux/page-flags-layout.h>
|
2016-05-20 23:57:21 +00:00
|
|
|
#include <linux/workqueue.h>
|
2017-02-03 23:12:19 +00:00
|
|
|
|
2007-10-16 08:24:43 +00:00
|
|
|
#include <asm/mmu.h>
|
2006-09-27 08:50:01 +00:00
|
|
|
|
2007-10-17 06:30:12 +00:00
|
|
|
#ifndef AT_VECTOR_SIZE_ARCH
|
|
|
|
#define AT_VECTOR_SIZE_ARCH 0
|
|
|
|
#endif
|
|
|
|
#define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
|
|
|
|
|
2018-04-05 23:25:23 +00:00
|
|
|
|
2006-09-27 08:50:01 +00:00
|
|
|
struct address_space;
|
2014-12-10 23:44:52 +00:00
|
|
|
struct mem_cgroup;
|
2006-09-27 08:50:01 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Each physical page in the system has a struct page associated with
|
|
|
|
* it to keep track of whatever it is we are using the page for at the
|
|
|
|
* moment. Note that we have no way to track which tasks are using
|
|
|
|
* a page, though if it is a pagecache page, rmap structures can tell us
|
2018-06-08 00:08:53 +00:00
|
|
|
* who is mapping it.
|
2018-02-01 00:19:06 +00:00
|
|
|
*
|
2018-06-08 00:08:53 +00:00
|
|
|
* If you allocate the page using alloc_pages(), you can use some of the
|
|
|
|
* space in struct page for your own purposes. The five words in the main
|
|
|
|
* union are available, except for bit 0 of the first word which must be
|
|
|
|
* kept clear. Many users use this word to store a pointer to an object
|
|
|
|
* which is guaranteed to be aligned. If you use the same storage as
|
|
|
|
* page->mapping, you must restore it to NULL before freeing the page.
|
2018-02-01 00:19:06 +00:00
|
|
|
*
|
2018-06-08 00:08:53 +00:00
|
|
|
* If your page will not be mapped to userspace, you can also use the four
|
|
|
|
* bytes in the mapcount union, but you must call page_mapcount_reset()
|
|
|
|
* before freeing it.
|
|
|
|
*
|
|
|
|
* If you want to use the refcount field, it must be used in such a way
|
|
|
|
* that other CPUs temporarily incrementing and then decrementing the
|
|
|
|
* refcount does not cause problems. On receiving the page from
|
|
|
|
* alloc_pages(), the refcount will be positive.
|
|
|
|
*
|
|
|
|
* If you allocate pages of order > 0, you can use some of the fields
|
|
|
|
* in each subpage, but you may need to restore some of their values
|
|
|
|
* afterwards.
|
2011-06-01 17:25:48 +00:00
|
|
|
*
|
2018-02-01 00:18:51 +00:00
|
|
|
* SLUB uses cmpxchg_double() to atomically update its freelist and
|
|
|
|
* counters. That requires that freelist & counters be adjacent and
|
|
|
|
* double-word aligned. We align all struct pages to double-word
|
|
|
|
* boundaries, and ensure that 'freelist' is aligned within the
|
|
|
|
* struct.
|
2006-09-27 08:50:01 +00:00
|
|
|
*/
|
2018-02-01 00:18:44 +00:00
|
|
|
#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
|
|
|
|
#define _struct_page_alignment __aligned(2 * sizeof(unsigned long))
|
|
|
|
#else
|
2018-02-01 00:18:58 +00:00
|
|
|
#define _struct_page_alignment
|
2018-06-08 00:08:31 +00:00
|
|
|
#endif
|
2018-02-01 00:18:44 +00:00
|
|
|
|
2006-09-27 08:50:01 +00:00
|
|
|
struct page {
|
|
|
|
unsigned long flags; /* Atomic flags, some possibly
|
|
|
|
* updated asynchronously */
|
2018-06-08 00:08:46 +00:00
|
|
|
/*
|
2018-06-08 00:08:50 +00:00
|
|
|
* Five words (20/40 bytes) are available in this union.
|
|
|
|
* WARNING: bit 0 of the first word is used for PageTail(). That
|
|
|
|
* means the other users of this union MUST NOT use the bit to
|
2018-06-08 00:08:46 +00:00
|
|
|
* avoid collision and false-positive PageTail().
|
|
|
|
*/
|
2013-10-24 01:07:49 +00:00
|
|
|
union {
|
2018-06-08 00:08:39 +00:00
|
|
|
struct { /* Page cache and anonymous pages */
|
2018-06-08 00:08:50 +00:00
|
|
|
/**
|
|
|
|
* @lru: Pageout list, eg. active_list protected by
|
2019-03-05 23:49:39 +00:00
|
|
|
* pgdat->lru_lock. Sometimes used as a generic list
|
2018-06-08 00:08:50 +00:00
|
|
|
* by the page owner.
|
|
|
|
*/
|
|
|
|
struct list_head lru;
|
2018-06-08 00:08:39 +00:00
|
|
|
/* See page-flags.h for PAGE_MAPPING_FLAGS */
|
|
|
|
struct address_space *mapping;
|
|
|
|
pgoff_t index; /* Our offset within mapping. */
|
|
|
|
/**
|
|
|
|
* @private: Mapping-private opaque data.
|
|
|
|
* Usually used for buffer_heads if PagePrivate.
|
|
|
|
* Used for swp_entry_t if PageSwapCache.
|
|
|
|
* Indicates order in the buddy system if PageBuddy.
|
|
|
|
*/
|
|
|
|
unsigned long private;
|
|
|
|
};
|
2019-02-13 01:55:40 +00:00
|
|
|
struct { /* page_pool used by netstack */
|
|
|
|
/**
|
|
|
|
* @dma_addr: might require a 64-bit value even on
|
|
|
|
* 32-bit architectures.
|
|
|
|
*/
|
|
|
|
dma_addr_t dma_addr;
|
|
|
|
};
|
2018-06-08 00:08:39 +00:00
|
|
|
struct { /* slab, slob and slub */
|
2018-06-08 00:08:50 +00:00
|
|
|
union {
|
2019-05-14 00:16:19 +00:00
|
|
|
struct list_head slab_list;
|
2018-06-08 00:08:50 +00:00
|
|
|
struct { /* Partial pages */
|
|
|
|
struct page *next;
|
|
|
|
#ifdef CONFIG_64BIT
|
|
|
|
int pages; /* Nr of pages left */
|
|
|
|
int pobjects; /* Approximate count */
|
|
|
|
#else
|
|
|
|
short int pages;
|
|
|
|
short int pobjects;
|
|
|
|
#endif
|
|
|
|
};
|
|
|
|
};
|
2018-06-08 00:08:39 +00:00
|
|
|
struct kmem_cache *slab_cache; /* not slob */
|
|
|
|
/* Double-word boundary */
|
|
|
|
void *freelist; /* first free object */
|
|
|
|
union {
|
|
|
|
void *s_mem; /* slab: first object */
|
|
|
|
unsigned long counters; /* SLUB */
|
|
|
|
struct { /* SLUB */
|
|
|
|
unsigned inuse:16;
|
|
|
|
unsigned objects:15;
|
|
|
|
unsigned frozen:1;
|
|
|
|
};
|
|
|
|
};
|
|
|
|
};
|
2018-06-08 00:08:50 +00:00
|
|
|
struct { /* Tail pages of compound page */
|
|
|
|
unsigned long compound_head; /* Bit zero is set */
|
|
|
|
|
|
|
|
/* First tail page only */
|
|
|
|
unsigned char compound_dtor;
|
|
|
|
unsigned char compound_order;
|
|
|
|
atomic_t compound_mapcount;
|
|
|
|
};
|
|
|
|
struct { /* Second tail page of compound page */
|
|
|
|
unsigned long _compound_pad_1; /* compound_head */
|
mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
scheme tends to overflow too easily, each tail page increments the head
page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number
of huge pages that can be pinned.
This patch removes that limitation, by using an exact form of pin counting
for compound pages of order > 1. The "order > 1" is required because this
approach uses the 3rd struct page in the compound page, and order 1
compound pages only have two pages, so that won't work there.
A new struct page field, hpage_pinned_refcount, has been added, replacing
a padding field in the union (so no new space is used).
This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_dma_pinned() comment block. That is
because these compound pages have extra space for tracking things, so they
get exact pin counts instead of overloading page->_refcount.
Documentation/core-api/pin_user_pages.rst is updated accordingly.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 04:05:33 +00:00
|
|
|
atomic_t hpage_pinned_refcount;
|
2019-09-23 22:38:15 +00:00
|
|
|
/* For both global and memcg */
|
2018-06-08 00:08:50 +00:00
|
|
|
struct list_head deferred_list;
|
|
|
|
};
|
2018-06-08 00:08:39 +00:00
|
|
|
struct { /* Page table pages */
|
2018-06-08 00:08:50 +00:00
|
|
|
unsigned long _pt_pad_1; /* compound_head */
|
|
|
|
pgtable_t pmd_huge_pte; /* protected by page->ptl */
|
2018-06-08 00:08:39 +00:00
|
|
|
unsigned long _pt_pad_2; /* mapping */
|
2018-07-27 11:48:17 +00:00
|
|
|
union {
|
|
|
|
struct mm_struct *pt_mm; /* x86 pgds only */
|
|
|
|
atomic_t pt_frag_refcount; /* powerpc */
|
|
|
|
};
|
2018-06-08 00:08:31 +00:00
|
|
|
#if ALLOC_SPLIT_PTLOCKS
|
2018-06-08 00:08:39 +00:00
|
|
|
spinlock_t *ptl;
|
2018-06-08 00:08:31 +00:00
|
|
|
#else
|
2018-06-08 00:08:39 +00:00
|
|
|
spinlock_t ptl;
|
2018-06-08 00:08:31 +00:00
|
|
|
#endif
|
|
|
|
};
|
2018-06-08 00:09:01 +00:00
|
|
|
struct { /* ZONE_DEVICE pages */
|
|
|
|
/** @pgmap: Points to the hosting device page map. */
|
|
|
|
struct dev_pagemap *pgmap;
|
2019-06-26 12:27:21 +00:00
|
|
|
void *zone_device_data;
|
2019-08-13 22:37:04 +00:00
|
|
|
/*
|
|
|
|
* ZONE_DEVICE private pages are counted as being
|
|
|
|
* mapped so the next 3 words hold the mapping, index,
|
|
|
|
* and private fields from the source anonymous or
|
|
|
|
* page cache page while the page is migrated to device
|
|
|
|
* private memory.
|
|
|
|
* ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
|
|
|
|
* use the mapping, index, and private fields when
|
|
|
|
* pmem backed DAX files are mapped.
|
|
|
|
*/
|
2018-06-08 00:09:01 +00:00
|
|
|
};
|
2018-06-08 00:08:50 +00:00
|
|
|
|
|
|
|
/** @rcu_head: You can use this to free a page by RCU. */
|
|
|
|
struct rcu_head rcu_head;
|
2018-06-08 00:08:31 +00:00
|
|
|
};
|
|
|
|
|
2018-06-08 00:08:35 +00:00
|
|
|
union { /* This union is 4 bytes in size. */
|
|
|
|
/*
|
|
|
|
* If the page can be mapped to userspace, encodes the number
|
|
|
|
* of times this page is referenced by a page table.
|
|
|
|
*/
|
|
|
|
atomic_t _mapcount;
|
|
|
|
|
2018-06-08 00:08:18 +00:00
|
|
|
/*
|
|
|
|
* If the page is neither PageSlab nor mappable to userspace,
|
|
|
|
* the value stored here may help determine what this page
|
|
|
|
* is used for. See page-flags.h for a list of page types
|
|
|
|
* which are currently stored here.
|
|
|
|
*/
|
|
|
|
unsigned int page_type;
|
|
|
|
|
2018-02-01 00:18:47 +00:00
|
|
|
unsigned int active; /* SLAB */
|
|
|
|
int units; /* SLOB */
|
2007-05-06 21:49:36 +00:00
|
|
|
};
|
2011-06-01 17:25:48 +00:00
|
|
|
|
2018-06-08 00:08:35 +00:00
|
|
|
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
|
|
|
|
atomic_t _refcount;
|
|
|
|
|
2014-12-10 23:44:52 +00:00
|
|
|
#ifdef CONFIG_MEMCG
|
|
|
|
struct mem_cgroup *mem_cgroup;
|
|
|
|
#endif
|
|
|
|
|
2006-09-27 08:50:01 +00:00
|
|
|
/*
|
|
|
|
* On machines where all RAM is mapped into kernel address space,
|
|
|
|
* we can simply calculate the virtual address. On machines with
|
|
|
|
* highmem some memory is mapped into kernel virtual memory
|
|
|
|
* dynamically, so we need a place to store that address.
|
|
|
|
* Note that this field could be 16 bits on x86 ... ;)
|
|
|
|
*
|
|
|
|
* Architectures with slow multiplication can define
|
|
|
|
* WANT_PAGE_VIRTUAL in asm/page.h
|
|
|
|
*/
|
|
|
|
#if defined(WANT_PAGE_VIRTUAL)
|
|
|
|
void *virtual; /* Kernel virtual address (NULL if
|
|
|
|
not kmapped, ie. highmem) */
|
|
|
|
#endif /* WANT_PAGE_VIRTUAL */
|
2008-04-03 22:51:41 +00:00
|
|
|
|
2013-10-07 10:29:20 +00:00
|
|
|
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
|
|
|
|
int _last_cpupid;
|
2012-11-12 09:06:20 +00:00
|
|
|
#endif
|
2018-02-01 00:18:44 +00:00
|
|
|
} _struct_page_alignment;
|
2006-09-27 08:50:01 +00:00
|
|
|
|
2019-11-06 05:16:30 +00:00
|
|
|
static inline atomic_t *compound_mapcount_ptr(struct page *page)
|
|
|
|
{
|
|
|
|
return &page[1].compound_mapcount;
|
|
|
|
}
|
|
|
|
|
mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
scheme tends to overflow too easily, each tail page increments the head
page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number
of huge pages that can be pinned.
This patch removes that limitation, by using an exact form of pin counting
for compound pages of order > 1. The "order > 1" is required because this
approach uses the 3rd struct page in the compound page, and order 1
compound pages only have two pages, so that won't work there.
A new struct page field, hpage_pinned_refcount, has been added, replacing
a padding field in the union (so no new space is used).
This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_dma_pinned() comment block. That is
because these compound pages have extra space for tracking things, so they
get exact pin counts instead of overloading page->_refcount.
Documentation/core-api/pin_user_pages.rst is updated accordingly.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 04:05:33 +00:00
|
|
|
static inline atomic_t *compound_pincount_ptr(struct page *page)
|
|
|
|
{
|
|
|
|
return &page[2].hpage_pinned_refcount;
|
|
|
|
}
|
|
|
|
|
2018-12-14 22:16:53 +00:00
|
|
|
/*
|
|
|
|
* Used for sizing the vmemmap region on some architectures
|
|
|
|
*/
|
|
|
|
#define STRUCT_PAGE_MAX_SHIFT (order_base_2(sizeof(struct page)))
|
|
|
|
|
2015-05-07 04:11:57 +00:00
|
|
|
#define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK)
|
|
|
|
#define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE)
|
|
|
|
|
2019-05-14 22:41:32 +00:00
|
|
|
#define page_private(page) ((page)->private)
|
2020-06-02 04:48:09 +00:00
|
|
|
|
|
|
|
static inline void set_page_private(struct page *page, unsigned long private)
|
|
|
|
{
|
|
|
|
page->private = private;
|
|
|
|
}
|
2019-05-14 22:41:32 +00:00
|
|
|
|
2015-05-07 04:11:57 +00:00
|
|
|
struct page_frag_cache {
|
|
|
|
void * va;
|
|
|
|
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
|
|
|
|
__u16 offset;
|
|
|
|
__u16 size;
|
|
|
|
#else
|
|
|
|
__u32 offset;
|
|
|
|
#endif
|
|
|
|
/* we maintain a pagecount bias, so that we dont dirty cache line
|
2016-05-20 00:10:49 +00:00
|
|
|
* containing page->_refcount every time we allocate a fragment.
|
2015-05-07 04:11:57 +00:00
|
|
|
*/
|
|
|
|
unsigned int pagecnt_bias;
|
|
|
|
bool pfmemalloc;
|
|
|
|
};
|
|
|
|
|
2015-09-08 22:02:15 +00:00
|
|
|
typedef unsigned long vm_flags_t;
|
2011-05-26 10:16:19 +00:00
|
|
|
|
2009-01-08 12:04:47 +00:00
|
|
|
/*
|
|
|
|
* A region containing a mapping of a non-memory backed file under NOMMU
|
|
|
|
* conditions. These are held in a global tree and are pinned by the VMAs that
|
|
|
|
* map parts of them.
|
|
|
|
*/
|
|
|
|
struct vm_region {
|
|
|
|
struct rb_node vm_rb; /* link in global region tree */
|
2011-05-26 10:16:19 +00:00
|
|
|
vm_flags_t vm_flags; /* VMA vm_flags */
|
2009-01-08 12:04:47 +00:00
|
|
|
unsigned long vm_start; /* start address of region */
|
|
|
|
unsigned long vm_end; /* region initialised to here */
|
2009-01-08 12:04:47 +00:00
|
|
|
unsigned long vm_top; /* region allocated to here */
|
2009-01-08 12:04:47 +00:00
|
|
|
unsigned long vm_pgoff; /* the offset in vm_file corresponding to vm_start */
|
|
|
|
struct file *vm_file; /* the backing file or NULL */
|
|
|
|
|
2010-01-16 01:01:33 +00:00
|
|
|
int vm_usage; /* region usage count (access under nommu_region_sem) */
|
NOMMU: Avoiding duplicate icache flushes of shared maps
When working with FDPIC, there are many shared mappings of read-only
code regions between applications (the C library, applet packages like
busybox, etc.), but the current do_mmap_pgoff() function will issue an
icache flush whenever a VMA is added to an MM instead of only doing it
when the map is initially created.
The flush can instead be done when a region is first mmapped PROT_EXEC.
Note that we may not rely on the first mapping of a region being
executable - it's possible for it to be PROT_READ only, so we have to
remember whether we've flushed the region or not, and then flush the
entire region when a bit of it is made executable.
However, this also affects the brk area. That will no longer be
executable. We can mprotect() it to PROT_EXEC on MPU-mode kernels, but
for NOMMU mode kernels, when it increases the brk allocation, making
sys_brk() flush the extra from the icache should suffice. The brk area
probably isn't used by NOMMU programs since the brk area can only use up
the leavings from the stack allocation, where the stack allocation is
larger than requested.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-06 17:23:23 +00:00
|
|
|
bool vm_icache_flushed : 1; /* true if the icache has been flushed for
|
|
|
|
* this region */
|
2009-01-08 12:04:47 +00:00
|
|
|
};
|
|
|
|
|
2015-09-04 22:46:14 +00:00
|
|
|
#ifdef CONFIG_USERFAULTFD
|
|
|
|
#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
|
|
|
|
struct vm_userfaultfd_ctx {
|
|
|
|
struct userfaultfd_ctx *ctx;
|
|
|
|
};
|
|
|
|
#else /* CONFIG_USERFAULTFD */
|
|
|
|
#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
|
|
|
|
struct vm_userfaultfd_ctx {};
|
|
|
|
#endif /* CONFIG_USERFAULTFD */
|
|
|
|
|
2007-10-16 08:24:43 +00:00
|
|
|
/*
|
2020-04-07 03:08:33 +00:00
|
|
|
* This struct describes a virtual memory area. There is one of these
|
|
|
|
* per VM-area/task. A VM area is any part of the process virtual memory
|
2007-10-16 08:24:43 +00:00
|
|
|
* space that has a special rule for the page-fault handlers (ie a shared
|
|
|
|
* library, the executable area etc).
|
|
|
|
*/
|
|
|
|
struct vm_area_struct {
|
2012-12-12 00:01:44 +00:00
|
|
|
/* The first cache line has the info for VMA tree walking. */
|
|
|
|
|
2007-10-16 08:24:43 +00:00
|
|
|
unsigned long vm_start; /* Our start address within vm_mm. */
|
|
|
|
unsigned long vm_end; /* The first byte after our end address
|
|
|
|
within vm_mm. */
|
|
|
|
|
|
|
|
/* linked list of VM areas per task, sorted by address */
|
2010-08-20 23:24:55 +00:00
|
|
|
struct vm_area_struct *vm_next, *vm_prev;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
|
|
|
struct rb_node vm_rb;
|
|
|
|
|
2012-12-12 00:01:38 +00:00
|
|
|
/*
|
|
|
|
* Largest free memory gap in bytes to the left of this VMA.
|
|
|
|
* Either between this VMA and vma->vm_prev, or between one of the
|
|
|
|
* VMAs below us in the VMA rbtree and its ->vm_prev. This helps
|
|
|
|
* get_unmapped_area find a free area of the right size.
|
|
|
|
*/
|
|
|
|
unsigned long rb_subtree_gap;
|
|
|
|
|
2012-12-12 00:01:44 +00:00
|
|
|
/* Second cache line starts here. */
|
|
|
|
|
|
|
|
struct mm_struct *vm_mm; /* The address space we belong to. */
|
2019-11-22 08:25:12 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Access permissions of this VMA.
|
|
|
|
* See vmf_insert_mixed_prot() for discussion.
|
|
|
|
*/
|
|
|
|
pgprot_t vm_page_prot;
|
2012-12-12 00:01:44 +00:00
|
|
|
unsigned long vm_flags; /* Flags, see mm.h. */
|
|
|
|
|
2007-10-16 08:24:43 +00:00
|
|
|
/*
|
|
|
|
* For areas with an address space and backing store,
|
2015-02-10 22:09:59 +00:00
|
|
|
* linkage into the address_space->i_mmap interval tree.
|
2007-10-16 08:24:43 +00:00
|
|
|
*/
|
2015-02-10 22:10:02 +00:00
|
|
|
struct {
|
|
|
|
struct rb_node rb;
|
|
|
|
unsigned long rb_subtree_last;
|
2007-10-16 08:24:43 +00:00
|
|
|
} shared;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
|
|
|
|
* list, after a COW of one of the file pages. A MAP_SHARED vma
|
|
|
|
* can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
|
|
|
|
* or brk vma (with NULL file) can only be in an anon_vma list.
|
|
|
|
*/
|
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05 21:42:07 +00:00
|
|
|
struct list_head anon_vma_chain; /* Serialized by mmap_sem &
|
|
|
|
* page_table_lock */
|
2007-10-16 08:24:43 +00:00
|
|
|
struct anon_vma *anon_vma; /* Serialized by page_table_lock */
|
|
|
|
|
|
|
|
/* Function pointers to deal with this struct. */
|
2009-09-27 18:29:37 +00:00
|
|
|
const struct vm_operations_struct *vm_ops;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
|
|
|
/* Information about our backing store: */
|
|
|
|
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
|
2016-04-01 12:29:48 +00:00
|
|
|
units */
|
2007-10-16 08:24:43 +00:00
|
|
|
struct file * vm_file; /* File we map to (can be NULL). */
|
|
|
|
void * vm_private_data; /* was vm_pte (shared mem) */
|
|
|
|
|
2019-07-12 03:54:43 +00:00
|
|
|
#ifdef CONFIG_SWAP
|
mm, swap: VMA based swap readahead
The swap readahead is an important mechanism to reduce the swap in
latency. Although pure sequential memory access pattern isn't very
popular for anonymous memory, the space locality is still considered
valid.
In the original swap readahead implementation, the consecutive blocks in
swap device are readahead based on the global space locality estimation.
But the consecutive blocks in swap device just reflect the order of page
reclaiming, don't necessarily reflect the access pattern in virtual
memory. And the different tasks in the system may have different access
patterns, which makes the global space locality estimation incorrect.
In this patch, when page fault occurs, the virtual pages near the fault
address will be readahead instead of the swap slots near the fault swap
slot in swap device. This avoid to readahead the unrelated swap slots.
At the same time, the swap readahead is changed to work on per-VMA from
globally. So that the different access patterns of the different VMAs
could be distinguished, and the different readahead policy could be
applied accordingly. The original core readahead detection and scaling
algorithm is reused, because it is an effect algorithm to detect the
space locality.
The test and result is as follow,
Common test condition
=====================
Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
NVMe disk
Micro-benchmark with combined access pattern
============================================
vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory writing until 300
seconds. The first round writing will trigger swap out, the following
rounds will trigger sequential swap in and out.
At the same time, run vm-scalability random swap test case in
background, 8 processes to eat 30G virtual memory space, repeat the
random memory write until 300 seconds. This will trigger random swap-in
in the background.
This is a combined workload with sequential and random memory accessing
at the same time. The result (for sequential workload) is as follow,
Base Optimized
---- ---------
throughput 345413 KB/s 414029 KB/s (+19.9%)
latency.average 97.14 us 61.06 us (-37.1%)
latency.50th 2 us 1 us
latency.60th 2 us 1 us
latency.70th 98 us 2 us
latency.80th 160 us 2 us
latency.90th 260 us 217 us
latency.95th 346 us 369 us
latency.99th 1.34 ms 1.09 ms
ra_hit% 52.69% 99.98%
The original swap readahead algorithm is confused by the background
random access workload, so readahead hit rate is lower. The VMA-base
readahead algorithm works much better.
Linpack
=======
The test memory size is bigger than RAM to trigger swapping.
Base Optimized
---- ---------
elapsed_time 393.49 s 329.88 s (-16.2%)
ra_hit% 86.21% 98.82%
The score of base and optimized kernel hasn't visible changes. But the
elapsed time reduced and readahead hit rate improved, so the optimized
kernel runs better for startup and tear down stages. And the absolute
value of readahead hit rate is high, shows that the space locality is
still valid in some practical workloads.
Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 23:24:36 +00:00
|
|
|
atomic_long_t swap_readahead_info;
|
2019-07-12 03:54:43 +00:00
|
|
|
#endif
|
2007-10-16 08:24:43 +00:00
|
|
|
#ifndef CONFIG_MMU
|
2009-01-08 12:04:47 +00:00
|
|
|
struct vm_region *vm_region; /* NOMMU mapping region */
|
2007-10-16 08:24:43 +00:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
|
|
|
|
#endif
|
2015-09-04 22:46:14 +00:00
|
|
|
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
|
2016-10-28 08:22:25 +00:00
|
|
|
} __randomize_layout;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2008-07-25 08:47:44 +00:00
|
|
|
struct core_thread {
|
|
|
|
struct task_struct *task;
|
|
|
|
struct core_thread *next;
|
|
|
|
};
|
|
|
|
|
2008-07-25 08:47:41 +00:00
|
|
|
struct core_state {
|
2008-07-25 08:47:42 +00:00
|
|
|
atomic_t nr_threads;
|
2008-07-25 08:47:44 +00:00
|
|
|
struct core_thread dumper;
|
2008-07-25 08:47:41 +00:00
|
|
|
struct completion startup;
|
|
|
|
};
|
|
|
|
|
aio: convert the ioctx list to table lookup v3
On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
> On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
> > When using a large number of threads performing AIO operations the
> > IOCTX list may get a significant number of entries which will cause
> > significant overhead. For example, when running this fio script:
> >
> > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=512; thread; loops=100
> >
> > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
> > 30% CPU time spent by lookup_ioctx:
> >
> > 32.51% [guest.kernel] [g] lookup_ioctx
> > 9.19% [guest.kernel] [g] __lock_acquire.isra.28
> > 4.40% [guest.kernel] [g] lock_release
> > 4.19% [guest.kernel] [g] sched_clock_local
> > 3.86% [guest.kernel] [g] local_clock
> > 3.68% [guest.kernel] [g] native_sched_clock
> > 3.08% [guest.kernel] [g] sched_clock_cpu
> > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11
> > 2.60% [guest.kernel] [g] memcpy
> > 2.33% [guest.kernel] [g] lock_acquired
> > 2.25% [guest.kernel] [g] lock_acquire
> > 1.84% [guest.kernel] [g] do_io_submit
> >
> > This patchs converts the ioctx list to a radix tree. For a performance
> > comparison the above FIO script was run on a 2 sockets 8 core
> > machine. This are the results (average and %rsd of 10 runs) for the
> > original list based implementation and for the radix tree based
> > implementation:
> >
> > cores 1 2 4 8 16 32
> > list 109376 ms 69119 ms 35682 ms 22671 ms 19724 ms 16408 ms
> > %rsd 0.69% 1.15% 1.17% 1.21% 1.71% 1.43%
> > radix 73651 ms 41748 ms 23028 ms 16766 ms 15232 ms 13787 ms
> > %rsd 1.19% 0.98% 0.69% 1.13% 0.72% 0.75%
> > % of radix
> > relative 66.12% 65.59% 66.63% 72.31% 77.26% 83.66%
> > to list
> >
> > To consider the impact of the patch on the typical case of having
> > only one ctx per process the following FIO script was run:
> >
> > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=1; thread; loops=100
> >
> > on the same system and the results are the following:
> >
> > list 58892 ms
> > %rsd 0.91%
> > radix 59404 ms
> > %rsd 0.81%
> > % of radix
> > relative 100.87%
> > to list
>
> So, I was just doing some benchmarking/profiling to get ready to send
> out the aio patches I've got for 3.11 - and it looks like your patch is
> causing a ~1.5% throughput regression in my testing :/
... <snip>
I've got an alternate approach for fixing this wart in lookup_ioctx()...
Instead of using an rbtree, just use the reserved id in the ring buffer
header to index an array pointing the ioctx. It's not finished yet, and
it needs to be tidied up, but is most of the way there.
-ben
--
"Thought is the essence of where you are now."
--
kmo> And, a rework of Ben's code, but this was entirely his idea
kmo> -Kent
bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
free memory.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 16:54:40 +00:00
|
|
|
struct kioctx_table;
|
2007-10-16 08:24:43 +00:00
|
|
|
struct mm_struct {
|
2018-07-16 19:03:31 +00:00
|
|
|
struct {
|
|
|
|
struct vm_area_struct *mmap; /* list of VMAs */
|
|
|
|
struct rb_root mm_rb;
|
2018-09-13 09:57:48 +00:00
|
|
|
u64 vmacache_seqnum; /* per-thread vmacache */
|
2010-01-16 01:01:35 +00:00
|
|
|
#ifdef CONFIG_MMU
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long (*get_unmapped_area) (struct file *filp,
|
2007-10-16 08:24:43 +00:00
|
|
|
unsigned long addr, unsigned long len,
|
|
|
|
unsigned long pgoff, unsigned long flags);
|
2010-01-16 01:01:35 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long mmap_base; /* base of mmap area */
|
|
|
|
unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
|
2017-03-06 14:17:19 +00:00
|
|
|
#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
|
2018-07-16 19:03:31 +00:00
|
|
|
/* Base adresses for compatible mmap() */
|
|
|
|
unsigned long mmap_compat_base;
|
|
|
|
unsigned long mmap_compat_legacy_base;
|
2017-03-06 14:17:19 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long task_size; /* size of task vm space */
|
|
|
|
unsigned long highest_vm_end; /* highest vma end address */
|
|
|
|
pgd_t * pgd;
|
|
|
|
|
2019-09-19 17:37:02 +00:00
|
|
|
#ifdef CONFIG_MEMBARRIER
|
|
|
|
/**
|
|
|
|
* @membarrier_state: Flags controlling membarrier behavior.
|
|
|
|
*
|
|
|
|
* This field is close to @pgd to hopefully fit in the same
|
|
|
|
* cache-line, which needs to be touched by switch_mm().
|
|
|
|
*/
|
|
|
|
atomic_t membarrier_state;
|
|
|
|
#endif
|
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
/**
|
|
|
|
* @mm_users: The number of users including userspace.
|
|
|
|
*
|
|
|
|
* Use mmget()/mmget_not_zero()/mmput() to modify. When this
|
|
|
|
* drops to 0 (i.e. when the task exits and there are no other
|
|
|
|
* temporary reference holders), we also release a reference on
|
|
|
|
* @mm_count (which may then free the &struct mm_struct if
|
|
|
|
* @mm_count also drops to 0).
|
|
|
|
*/
|
|
|
|
atomic_t mm_users;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* @mm_count: The number of references to &struct mm_struct
|
|
|
|
* (@mm_users count as 1).
|
|
|
|
*
|
|
|
|
* Use mmgrab()/mmdrop() to modify. When this drops to 0, the
|
|
|
|
* &struct mm_struct is freed.
|
|
|
|
*/
|
|
|
|
atomic_t mm_count;
|
2017-02-27 22:30:16 +00:00
|
|
|
|
2017-11-16 01:35:37 +00:00
|
|
|
#ifdef CONFIG_MMU
|
2018-07-16 19:03:31 +00:00
|
|
|
atomic_long_t pgtables_bytes; /* PTE page table pages */
|
2015-04-14 22:46:21 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
int map_count; /* number of VMAs */
|
2011-03-22 23:32:50 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
spinlock_t page_table_lock; /* Protects page tables and some
|
|
|
|
* counters
|
|
|
|
*/
|
|
|
|
struct rw_semaphore mmap_sem;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
struct list_head mmlist; /* List of maybe swapped mm's. These
|
|
|
|
* are globally strung together off
|
|
|
|
* init_mm.mmlist, and are protected
|
|
|
|
* by mmlist_lock
|
|
|
|
*/
|
2007-10-16 08:24:43 +00:00
|
|
|
|
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long hiwater_rss; /* High-watermark of RSS usage */
|
|
|
|
unsigned long hiwater_vm; /* High-water virtual memory usage */
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long total_vm; /* Total pages mapped */
|
|
|
|
unsigned long locked_vm; /* Pages that have PG_mlocked set */
|
2019-02-06 17:59:15 +00:00
|
|
|
atomic64_t pinned_vm; /* Refcount permanently increased */
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
|
|
|
|
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
|
|
|
|
unsigned long stack_vm; /* VM_STACK */
|
|
|
|
unsigned long def_flags;
|
mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.
And, the mmap_sem contention may cause unexpected issue like below:
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
proc_pid_cmdline_read+0xd9/0x4e0
__vfs_read+0x37/0x150
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xc5
Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.
So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.
This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs acquire mmap_sem. The mmap_sem
scalability issue will be solved in the future.
[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-08 00:05:28 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
spinlock_t arg_lock; /* protect the below fields */
|
|
|
|
unsigned long start_code, end_code, start_data, end_data;
|
|
|
|
unsigned long start_brk, brk, start_stack;
|
|
|
|
unsigned long arg_start, arg_end, env_start, env_end;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
/*
|
|
|
|
* Special counters, in some configurations protected by the
|
|
|
|
* page_table_lock, in other configurations by being atomic.
|
|
|
|
*/
|
|
|
|
struct mm_rss_stat rss_stat;
|
2009-09-23 22:57:41 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
struct linux_binfmt *binfmt;
|
2011-05-29 18:32:28 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
/* Architecture-specific MM context */
|
|
|
|
mm_context_t context;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long flags; /* Must use atomic bitops to access */
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
struct core_state *core_state; /* coredumping support */
|
2019-09-19 17:37:02 +00:00
|
|
|
|
2009-09-23 22:57:32 +00:00
|
|
|
#ifdef CONFIG_AIO
|
2018-07-16 19:03:31 +00:00
|
|
|
spinlock_t ioctx_lock;
|
|
|
|
struct kioctx_table __rcu *ioctx_table;
|
2009-09-23 22:57:32 +00:00
|
|
|
#endif
|
2014-06-04 23:07:34 +00:00
|
|
|
#ifdef CONFIG_MEMCG
|
2018-07-16 19:03:31 +00:00
|
|
|
/*
|
|
|
|
* "owner" points to a task that is regarded as the canonical
|
|
|
|
* user/owner of this mm. All of the following must be true in
|
|
|
|
* order for it to be changed:
|
|
|
|
*
|
|
|
|
* current == mm->owner
|
|
|
|
* current->mm != mm
|
|
|
|
* new_owner->mm == mm
|
|
|
|
* new_owner->alloc_lock is held
|
|
|
|
*/
|
|
|
|
struct task_struct __rcu *owner;
|
2008-02-07 08:13:51 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
struct user_namespace *user_ns;
|
2008-04-29 08:01:36 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
/* store ref to file /proc/<pid>/exe symlink points to */
|
|
|
|
struct file __rcu *exe_file;
|
mmu-notifiers: core
With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In the GRU case there's no
actual secondary pte and there's only a secondary tlb, because the GRU
secondary MMU has no knowledge of sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case for KVM, where a secondary tlb miss will
walk sptes in hardware and will refill the secondary tlb transparently
to software if the corresponding spte is present). In the same way that
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte,
because they're treated as part of the guest working set. Furthermore a
spte unmap event can immediately lead to a page being freed when the pin
is released (requiring the same complex and relatively slow tlb_gather
SMP-safe logic we have in zap_page_range, which can be avoided completely
if the spte unmap event doesn't require an unpin of the page previously
mapped in the secondary MMU).
The mmu notifiers allow KVM/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it spares the code that tears down
the mappings of the secondary MMU from implementing logic like tlb_gather
in zap_page_range, which would require many IPIs to flush other cpu tlbs
for each fixed number of sptes unmapped.
To give an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect), the secondary MMU mappings will
be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, re-establishing an updated
spte or secondary-tlb-mapping on the copied page. Or it will set up a
readonly spte or readonly tlb mapping if it's a guest read, if it calls
get_user_pages with write=0. This is just an example.
This makes it possible to map any page pointed to by any pte (and in turn
visible in the primary CPU MMU) into a secondary MMU (be it a pure tlb
like GRU, or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer in kvm), or a remote DMA in software like XPMEM
(hence the need to schedule in XPMEM code to send the invalidate to the
remote node, while there is no need to schedule in kvm/gru as it's an
immediate event, like invalidating a primary-mmu pte).
At least for KVM, without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track of whether the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter incremented in range_begin and
decremented in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference while the VM is in the middle of
range_begin/end, as any page returned by get_user_pages in that critical
section could later be freed immediately without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites, the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already makes it possible to compile a KVM
external module against a kernel with mmu notifiers enabled, and from the
next pull from kvm.git we'll start using them. And GRU/XPMEM will also be
able to continue development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_register
is used when a driver starts up, a failure can be gracefully handled. Here
is an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver starts up, other allocations are required anyway and
-ENOMEM failure paths exist already.
struct kvm *kvm_arch_create_vm(void)
{
	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;

	if (!kvm)
		return ERR_PTR(-ENOMEM);
	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
	return kvm;
}
mmu_notifier_unregister returns void and is reliable.
The patch also adds a few needed but missing includes that would prevent
the kernel from compiling after these changes on non-x86 archs (x86 didn't
need them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
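To make the range_begin/end contract above concrete, here is a minimal, hedged sketch of a secondary-MMU driver hooking in. struct my_mmu and my_drop_sptes() are hypothetical driver internals, and the callback signature follows the later range-based API (const struct mmu_notifier_range *) rather than the one this patch originally introduced.

#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct my_mmu {					/* hypothetical driver context */
	struct mmu_notifier mn;
};

/* Hypothetical teardown of sptes/secondary-tlb entries for [start, end);
 * no new secondary mappings may be established until the matching
 * invalidate_range_end. */
void my_drop_sptes(struct my_mmu *m, unsigned long start, unsigned long end);

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	struct my_mmu *m = container_of(mn, struct my_mmu, mn);

	/* The primary MMU is about to unmap or free [start, end). */
	my_drop_sptes(m, range->start, range->end);
	return 0;
}

static const struct mmu_notifier_ops my_mmu_ops = {
	.invalidate_range_start	= my_invalidate_range_start,
};

static int my_mmu_attach(struct my_mmu *m)	/* e.g. at device open */
{
	m->mn.ops = &my_mmu_ops;
	return mmu_notifier_register(&m->mn, current->mm); /* may be -EINTR */
}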
#ifdef CONFIG_MMU_NOTIFIER
	struct mmu_notifier_subscriptions *notifier_subscriptions;
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_NUMA_BALANCING
	/*
	 * numa_next_scan is the next time that the PTEs will be marked
	 * pte_numa. NUMA hinting faults will gather statistics and
	 * migrate pages to new nodes if necessary.
	 */
	unsigned long numa_next_scan;

	/* Restart point for scanning and setting pte_numa */
	unsigned long numa_scan_offset;
mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.
That method has various (obvious) disadvantages:
- it samples the working set at dissimilar rates,
giving some tasks a sampling quality advantage
over others.
- it creates performance problems for tasks with very
large working sets
- it over-samples processes with large address spaces but
which only very rarely execute
Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it at a constant rate (proportional to CPU
cycles executed). If the offset reaches the last mapped
address of the mm, it starts over at the first address.
The per-task nature of the working set sampling functionality in this tree
allows such constant rate, per task, execution-weight proportional sampling
of the working set, with an adaptive sampling interval/frequency that
goes from once per 100ms up to just once per 8 seconds. The current
sampling volume is 256 MB per interval.
As tasks mature and converge their working set, the
sampling rate slows down to just a trickle, 256 MB per 8
seconds of CPU time executed.
This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.
[ In AutoNUMA speak, this patch deals with the effective sampling
rate of the 'hinting page fault'. AutoNUMA's scanning is
currently rate-limited, but it is also fundamentally
single-threaded, executing in the knuma_scand kernel thread,
so the limit in AutoNUMA is global and does not scale up with
the number of CPUs, nor does it scan tasks in an execution
proportional manner.
So the idea of rate-limiting the scanning was first implemented
in the AutoNUMA tree via a global rate limit. This patch goes
beyond that by implementing an execution rate proportional
working set sampling rate that is not implemented via a single
global scanning daemon. ]
[ Dan Carpenter pointed out a possible NULL pointer dereference in the
first version of this patch. ]
Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
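A minimal sketch of that rotating-offset scan, assuming a hypothetical scan_chunk() helper that marks a range for NUMA hinting faults; the real implementation is task_numa_work(), which additionally handles VMA boundaries, locking, and the adaptive interval.

#define NUMA_SCAN_CHUNK	(256UL << 20)		/* 256 MB per interval */

/* Hypothetical: mark [start, end) pte_numa, clamped to existing VMAs. */
void scan_chunk(struct mm_struct *mm, unsigned long start, unsigned long end);

static void numa_scan_one_interval(struct mm_struct *mm)
{
	unsigned long start = mm->numa_scan_offset;

	if (start >= TASK_SIZE)			/* wrap past the last address */
		start = 0;

	scan_chunk(mm, start, start + NUMA_SCAN_CHUNK);
	mm->numa_scan_offset = start + NUMA_SCAN_CHUNK;
}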
	/* numa_scan_seq prevents two threads setting pte_numa */
	int numa_scan_seq;
mm: fix TLB flush race between migration, and change_protection_range
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
	CPU A			CPU B			CPU C
						load TLB entry
	make entry PTE/PMD_NUMA
				fault on entry
						read/write old page
				start migrating page
				change PTE/PMD to new page
						read/write old page [*]
	flush TLB
						reload TLB from new entry
						read/write new page
						lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory
may still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
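The gist of the fix on x86 is to make the accessibility check consider a pending flush. Roughly, as a sketch of the arch's pte_accessible() helper, written against the atomic mm_tlb_flush_pending() accessor that appears later in this file rather than the original patch's raw flag:

static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
{
	if (pte_flags(a) & _PAGE_PRESENT)
		return true;

	/* A PROT_NONE/NUMA pte may still be live in remote TLBs while a
	 * flush for this mm is pending, so treat it as accessible. */
	if ((pte_flags(a) & _PAGE_PROTNONE) && mm_tlb_flush_pending(mm))
		return true;

	return false;
}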
#endif
	/*
	 * An operation with batched TLB flushing is going on. Anything
	 * that can move process memory needs to flush the TLB when
	 * moving a PROT_NONE or PROT_NUMA mapped page.
	 */
	atomic_t tlb_flush_pending;
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
	/* See flush_tlb_batched_pending() */
	bool tlb_flush_batched;
#endif
	struct uprobes_state uprobes_state;
#ifdef CONFIG_HUGETLB_PAGE
	atomic_long_t hugetlb_usage;
#endif
	struct work_struct async_put_work;
} __randomize_layout;

	/*
	 * The mm_cpumask needs to be at the end of mm_struct, because it
	 * is dynamically sized based on nr_cpu_ids.
	 */
	unsigned long cpu_bitmap[];
};
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2017-02-02 11:27:56 +00:00
|
|
|
extern struct mm_struct init_mm;
|
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
/* Pointer magic because the dynamic array size confuses some compilers. */
|
2011-05-29 18:32:28 +00:00
|
|
|
static inline void mm_init_cpumask(struct mm_struct *mm)
|
|
|
|
{
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long cpu_bitmap = (unsigned long)mm;
|
|
|
|
|
|
|
|
cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
|
|
|
|
cpumask_clear((struct cpumask *)cpu_bitmap);
|
2011-05-29 18:32:28 +00:00
|
|
|
}
|
|
|
|
|
2009-03-12 20:35:44 +00:00
|
|
|
/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
|
2011-05-25 00:12:15 +00:00
|
|
|
static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
|
|
|
|
{
|
2018-07-16 19:03:31 +00:00
|
|
|
return (struct cpumask *)&mm->cpu_bitmap;
|
2011-05-25 00:12:15 +00:00
|
|
|
}
|
2009-03-12 20:35:44 +00:00
|
|
|
|
2017-08-10 22:24:05 +00:00
|
|
|
struct mmu_gather;
|
|
|
|
extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
|
|
|
|
unsigned long start, unsigned long end);
|
|
|
|
extern void tlb_finish_mmu(struct mmu_gather *tlb,
|
|
|
|
unsigned long start, unsigned long end);
|
|
|
|
|
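The tlb_gather_mmu()/tlb_finish_mmu() pair declared above brackets page-table teardown; a hedged sketch of the calling pattern (real callers such as unmap_region() in mm/mmap.c batch much more work in between):

static void teardown_range(struct mm_struct *mm,
			   unsigned long start, unsigned long end)
{
	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, start, end);
	/* ... zap PTEs, accumulating pages and flush work into &tlb ... */
	tlb_finish_mmu(&tlb, start, end);	/* flush TLBs, free pages */
}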
mm: migrate: prevent racy access to tlb_flush_pending
Patch series "fixes of TLB batching races", v6.
It turns out that Linux TLB batching mechanism suffers from various
races. Races that are caused due to batching during reclamation were
recently handled by Mel and this patch-set deals with others. The more
fundamental issue is that concurrent updates of the page-tables allow
for TLB flushes to be batched on one core, while another core changes
the page-tables. This other core may assume a PTE change does not
require a flush based on the updated PTE value, while it is unaware that
TLB flushes are still pending.
This behavior affects KSM (which may result in memory corruption) and
MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A
proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
Memory corruption in KSM is harder to produce in practice, but was
observed by hacking the kernel and adding a delay before flushing and
replacing the KSM page.
Finally, there is also one memory barrier missing, which may affect
architectures with weak memory model.
This patch (of 7):
Setting and clearing mm->tlb_flush_pending can be performed by multiple
threads, since mmap_sem may only be acquired for read in
task_numa_work(). If this happens, tlb_flush_pending might be cleared
while one of the threads still changes PTEs and batches TLB flushes.
This can lead to the same race between migration and
change_protection_range() that led to the introduction of
tlb_flush_pending. The result of this race was data corruption, which
means that this patch also addresses a theoretically possible data
corruption.
An actual data corruption was not observed, yet the race was was
confirmed by adding assertion to check tlb_flush_pending is not set by
two threads, adding artificial latency in change_protection_range() and
using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
Fixes: 20841405940e ("mm: fix TLB flush race between migration, and
change_protection_range")
Signed-off-by: Nadav Amit <namit@vmware.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
static inline void init_tlb_flush_pending(struct mm_struct *mm)
{
atomic_set(&mm->tlb_flush_pending, 0);
}
static inline void inc_tlb_flush_pending(struct mm_struct *mm)
{
atomic_inc(&mm->tlb_flush_pending);
mm, locking: Rework {set,clear,mm}_tlb_flush_pending()
Commit:
af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
can solve the same problem without this barrier.
If instead we mandate that mm_tlb_flush_pending() is used while
holding the PTL we're guaranteed to observe prior
set_tlb_flush_pending() instances.
For this to work we need to rework migrate_misplaced_transhuge_page()
a little and move the test up into do_huge_pmd_numa_page().
NOTE: this relies on flush_tlb_range() to guarantee:
(1) it ensures that prior page table updates are visible to the
page table walker and
(2) it ensures that subsequent memory accesses are only made
visible after the invalidation has completed
This is required for architectures that implement TRANSPARENT_HUGEPAGE
(arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
mm_tlb_flush_pending() in their page-table operations (arm, arm64,
x86).
This appears true for:
- arm (DSB ISB before and after),
- arm64 (DSB ISHST before, and DSB ISH after),
- powerpc (PTESYNC before and after),
- s390 and x86, where TLB invalidates are serializing instructions
But I failed to understand the situation for:
- arc, mips, sparc
Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
inside the PTL. It still needs to guarantee the PTL unlock happens
_after_ the invalidate completes.
Vineet, Ralf and Dave, could you guys please have a look?
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
	/*
	 * The only time this value is relevant is when there are indeed pages
	 * to flush. And we'll only flush pages after changing them, which
	 * requires the PTL.
	 *
	 * So the ordering here is:
	 *
	 *	atomic_inc(&mm->tlb_flush_pending);
	 *	spin_lock(&ptl);
	 *	...
	 *	set_pte_at();
	 *	spin_unlock(&ptl);
	 *
	 *				spin_lock(&ptl)
	 *				mm_tlb_flush_pending();
	 *				....
	 *				spin_unlock(&ptl);
	 *
	 *	flush_tlb_range();
	 *	atomic_dec(&mm->tlb_flush_pending);
	 *
	 * Where the increment is constrained by the PTL unlock, it thus
	 * ensures that the increment is visible if the PTE modification is
	 * visible. After all, if there is no PTE modification, nobody cares
	 * about TLB flushes either.
	 *
	 * This very much relies on users (mm_tlb_flush_pending() and
	 * mm_tlb_flush_nested()) only caring about _specific_ PTEs (and
	 * therefore specific PTLs), because with SPLIT_PTE_PTLOCKS and RCpc
	 * locks (PPC) the unlock of one doesn't order against the lock of
	 * another PTL.
	 *
	 * The decrement is ordered by the flush_tlb_range(), such that
	 * mm_tlb_flush_pending() will not return false unless all flushes have
	 * completed.
	 */
}
static inline void dec_tlb_flush_pending(struct mm_struct *mm)
{
	/*
	 * See inc_tlb_flush_pending().
	 *
	 * This cannot be smp_mb__before_atomic() because smp_mb() simply does
	 * not order against TLB invalidate completion, which is what we need.
	 *
	 * Therefore we must rely on tlb_flush_*() to guarantee order.
	 */
atomic_dec(&mm->tlb_flush_pending);
}

static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
{
	/*
	 * Must be called after having acquired the PTL; orders against that
	 * PTL's release and therefore ensures that if we observe the modified
	 * PTE we must also observe the increment from inc_tlb_flush_pending().
	 *
	 * That is, it only guarantees to return true if there is a flush
	 * pending for _this_ PTL.
	 */
	return atomic_read(&mm->tlb_flush_pending);
}

static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
{
	/*
	 * Similar to mm_tlb_flush_pending(), we must have acquired the PTL
	 * for which there is a TLB flush pending in order to guarantee
	 * we've seen both that PTE modification and the increment.
	 *
	 * (There is no requirement on actually still holding the PTL;
	 * that is irrelevant.)
	 */
	return atomic_read(&mm->tlb_flush_pending) > 1;
}
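Putting the accessors together, a hedged sketch of the increment/flush/decrement protocol as a writer such as change_protection_range() follows it (the PTE loop itself is elided):

static void change_range(struct vm_area_struct *vma,
			 unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;

	inc_tlb_flush_pending(mm);	/* visible before the first PTL unlock */

	/* ... for each pte: spin_lock(ptl); rewrite pte; spin_unlock(ptl); ... */

	flush_tlb_range(vma, start, end); /* orders the decrement below */
	dec_tlb_flush_pending(mm);
}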

struct vm_fault;
/**
 * typedef vm_fault_t - Return type for page fault handlers.
 *
 * Page fault handlers return a bitmask of %VM_FAULT values.
 */
typedef __bitwise unsigned int vm_fault_t;

/**
 * enum vm_fault_reason - Page fault handlers return a bitmask of
 * these values to tell the core VM what happened when handling the
 * fault. Used to decide whether a process gets delivered SIGBUS or
 * just gets major/minor fault counters bumped up.
 *
 * @VM_FAULT_OOM:		Out Of Memory
 * @VM_FAULT_SIGBUS:		Bad access
 * @VM_FAULT_MAJOR:		Page read from storage
 * @VM_FAULT_WRITE:		Special case for get_user_pages
 * @VM_FAULT_HWPOISON:		Hit poisoned small page
 * @VM_FAULT_HWPOISON_LARGE:	Hit poisoned large page. Index encoded
 *				in upper bits
 * @VM_FAULT_SIGSEGV:		segmentation fault
 * @VM_FAULT_NOPAGE:		->fault installed the pte, not return page
 * @VM_FAULT_LOCKED:		->fault locked the returned page
 * @VM_FAULT_RETRY:		->fault blocked, must retry
 * @VM_FAULT_FALLBACK:		huge page fault failed, fall back to small
 * @VM_FAULT_DONE_COW:		->fault has fully handled COW
 * @VM_FAULT_NEEDDSYNC:		->fault did not modify page tables and needs
 *				fsync() to complete (for synchronous page faults
 *				in DAX)
 * @VM_FAULT_HINDEX_MASK:	mask HINDEX value
 */
enum vm_fault_reason {
	VM_FAULT_OOM		= (__force vm_fault_t)0x000001,
	VM_FAULT_SIGBUS		= (__force vm_fault_t)0x000002,
	VM_FAULT_MAJOR		= (__force vm_fault_t)0x000004,
	VM_FAULT_WRITE		= (__force vm_fault_t)0x000008,
	VM_FAULT_HWPOISON	= (__force vm_fault_t)0x000010,
	VM_FAULT_HWPOISON_LARGE	= (__force vm_fault_t)0x000020,
	VM_FAULT_SIGSEGV	= (__force vm_fault_t)0x000040,
	VM_FAULT_NOPAGE		= (__force vm_fault_t)0x000100,
	VM_FAULT_LOCKED		= (__force vm_fault_t)0x000200,
	VM_FAULT_RETRY		= (__force vm_fault_t)0x000400,
	VM_FAULT_FALLBACK	= (__force vm_fault_t)0x000800,
	VM_FAULT_DONE_COW	= (__force vm_fault_t)0x001000,
	VM_FAULT_NEEDDSYNC	= (__force vm_fault_t)0x002000,
	VM_FAULT_HINDEX_MASK	= (__force vm_fault_t)0x0f0000,
};

/* Encode hstate index for a hwpoisoned large page */
#define VM_FAULT_SET_HINDEX(x) ((__force vm_fault_t)((x) << 16))
#define VM_FAULT_GET_HINDEX(x) (((__force unsigned int)(x) >> 16) & 0xf)

#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS |	\
			VM_FAULT_SIGSEGV | VM_FAULT_HWPOISON |	\
			VM_FAULT_HWPOISON_LARGE | VM_FAULT_FALLBACK)

#define VM_FAULT_RESULT_TRACE \
	{ VM_FAULT_OOM,			"OOM" },	\
	{ VM_FAULT_SIGBUS,		"SIGBUS" },	\
	{ VM_FAULT_MAJOR,		"MAJOR" },	\
	{ VM_FAULT_WRITE,		"WRITE" },	\
	{ VM_FAULT_HWPOISON,		"HWPOISON" },	\
	{ VM_FAULT_HWPOISON_LARGE,	"HWPOISON_LARGE" },	\
	{ VM_FAULT_SIGSEGV,		"SIGSEGV" },	\
	{ VM_FAULT_NOPAGE,		"NOPAGE" },	\
	{ VM_FAULT_LOCKED,		"LOCKED" },	\
	{ VM_FAULT_RETRY,		"RETRY" },	\
	{ VM_FAULT_FALLBACK,		"FALLBACK" },	\
	{ VM_FAULT_DONE_COW,		"DONE_COW" },	\
	{ VM_FAULT_NEEDDSYNC,		"NEEDDSYNC" }
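For illustration, a hedged sketch of how a ->fault handler composes these bits and how a caller tests for failure; my_read_page() is a hypothetical helper:

/* Hypothetical: look up or read in the backing page; may block on I/O. */
struct page *my_read_page(struct file *file, pgoff_t pgoff);

static vm_fault_t my_fault(struct vm_fault *vmf)
{
	struct page *page = my_read_page(vmf->vma->vm_file, vmf->pgoff);

	if (!page)
		return VM_FAULT_SIGBUS;		/* bad access */

	vmf->page = page;
	/* Came from storage and is returned locked. */
	return VM_FAULT_MAJOR | VM_FAULT_LOCKED;
}

A caller then checks the aggregate result with, e.g., if (ret & VM_FAULT_ERROR).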

struct vm_special_mapping {
	const char *name;	/* The name, e.g. "[vdso]". */

	/*
	 * If .fault is not provided, this points to a
	 * NULL-terminated array of pages that back the special mapping.
	 *
	 * This must not be NULL unless .fault is provided.
	 */
	struct page **pages;

	/*
	 * If non-NULL, then this is called to resolve page faults
	 * on the special mapping. If used, .pages is not checked.
	 */
	vm_fault_t (*fault)(const struct vm_special_mapping *sm,
				struct vm_area_struct *vma,
				struct vm_fault *vmf);

	int (*mremap)(const struct vm_special_mapping *sm,
				struct vm_area_struct *new_vma);
};
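A hedged sketch of wiring up such a mapping for a vdso-like page; the "[mydrv]" naming and page array are illustrative, while _install_special_mapping() is the usual entry point (see its users under arch/ and in mm/mmap.c):

static struct page *mydrv_pages[2];	/* [0] = backing page, [1] = NULL */

static const struct vm_special_mapping mydrv_spec = {
	.name	= "[mydrv]",		/* shows up in /proc/<pid>/maps */
	.pages	= mydrv_pages,
};

static int mydrv_map(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	vma = _install_special_mapping(mm, addr, PAGE_SIZE,
				       VM_READ | VM_MAYREAD, &mydrv_spec);
	return PTR_ERR_OR_ZERO(vma);
}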

enum tlb_flush_reason {
	TLB_FLUSH_ON_TASK_SWITCH,
	TLB_REMOTE_SHOOTDOWN,
	TLB_LOCAL_SHOOTDOWN,
	TLB_LOCAL_MM_SHOOTDOWN,
	TLB_REMOTE_SEND_IPI,
	NR_TLB_FLUSH_REASONS,
};

/*
 * A swap entry has to fit into a "unsigned long", as the entry is hidden
 * in the "index" field of the swapper address space.
 */
typedef struct {
	unsigned long val;
} swp_entry_t;
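The encode/decode helpers for this opaque value live in <linux/swapops.h>; a small sketch round-tripping a (type, offset) pair through the unsigned long encoding:

#include <linux/swapops.h>

static bool swp_roundtrip(unsigned int type, pgoff_t offset)
{
	swp_entry_t entry = swp_entry(type, offset);	/* pack */

	/* Unpack: both halves must survive the "unsigned long" encoding. */
	return swp_type(entry) == type && swp_offset(entry) == offset;
}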

#endif /* _LINUX_MM_TYPES_H */