License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to a license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX license identifier should be applied
to a file was done in a spreadsheet of side-by-side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files, created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.
The 4.13 kernel was the starting point of the analysis, with 60,537 files
assessed. Kate Stewart did a file-by-file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
should be applied to each file. She confirmed any determination that was
not immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source.
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, the file was
considered to have no license information in it, and the top-level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note"; otherwise it was "GPL-2.0". Results of that were:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL-family license was found in the file, or if it had no licensing
in it (per the prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later.
In total, Kate, Philippe, and Thomas logged over 70 hours of manual
review on the spreadsheet to determine the SPDX license identifiers to
apply to the source files, in some cases with confirmation by lawyers
working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based in part on an older version of FOSSology, so
the two are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifiers in the
files he inspected. For the non-uapi files, Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to
have copy/paste license identifier errors, and they have been fixed to
reflect the correct identifier.
Additionally, Philippe spent 10 hours doing a detailed manual inspection
and review of the 12,461 files patched in the initial patch version
earlier this week, with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally, Greg ran the script using the .csv files to
generate the patches.
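As a toy illustration of the header-versus-source distinction the script had to make (this is our own sketch under assumed conventions, not the actual csv-driven script): headers keep the classic block comment so they can still be included from assembly, while .c files take a C++-style line comment.

```c
#include <string.h>

/* Pick the SPDX comment style for a file. Illustrative only: the real
 * script parsed .csv files and handled more file types. */
const char *spdx_line(const char *filename)
{
	size_t n = strlen(filename);

	/* headers may be pulled into assembly, so use a block comment */
	if (n >= 2 && strcmp(filename + n - 2, ".h") == 0)
		return "/* SPDX-License-Identifier: GPL-2.0 */";
	/* plain C source can take a line comment */
	return "// SPDX-License-Identifier: GPL-2.0";
}
```

The block-comment form for headers is exactly what appears at the top of this file.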
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:07:57 +00:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2006-09-27 08:50:01 +00:00
|
|
|
#ifndef _LINUX_MM_TYPES_H
|
|
|
|
#define _LINUX_MM_TYPES_H
|
|
|
|
|
2017-02-03 23:12:19 +00:00
|
|
|
#include <linux/mm_types_task.h>
|
|
|
|
|
2007-10-17 06:30:12 +00:00
|
|
|
#include <linux/auxvec.h>
|
2022-01-14 22:06:03 +00:00
|
|
|
#include <linux/kref.h>
|
2006-09-27 08:50:01 +00:00
|
|
|
#include <linux/list.h>
|
|
|
|
#include <linux/spinlock.h>
|
2007-10-16 08:24:43 +00:00
|
|
|
#include <linux/rbtree.h>
|
2022-09-06 19:48:45 +00:00
|
|
|
#include <linux/maple_tree.h>
|
2007-10-16 08:24:43 +00:00
|
|
|
#include <linux/rwsem.h>
|
|
|
|
#include <linux/completion.h>
|
mmu-notifiers: core
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case for KVM, where a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore, a spte unmap
event can immediately lead to a page being freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it spares the secondary MMU
teardown code from implementing tlb_gather-like logic as in
zap_page_range, which would require many IPIs to flush other cpu tlbs
for each fixed number of sptes unmapped.
To give an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect), the secondary MMU mappings will
be invalidated, and the next secondary-mmu page fault will call
get_user_pages and trigger a do_wp_page if it called get_user_pages with
write=1, re-establishing an updated spte or secondary-tlb-mapping on the
copied page. Or it will set up a readonly spte or readonly tlb mapping
if it's a guest read, i.e. if it calls get_user_pages with write=0.
This is just an example.
This allows mapping any page pointed to by any pte (and in turn visible
in the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU,
or a full MMU with both sptes and a secondary tlb like the
shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM
(hence the need to schedule in XPMEM code to send the invalidate to the
remote node, while there is no need to schedule in kvm/gru as it's an
immediate event like invalidating a primary-mmu pte).
At least for KVM, it's impossible to swap guests reliably without this
patch. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track of whether the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter increased in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled
if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows compiling a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
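The range_begin/end counter bookkeeping described in dependency 1) can be sketched schematically (plain C11 atomics, illustrative names, not the kernel's implementation; real code must also recheck under the appropriate lock):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Count of invalidate_range_begin/end critical sections in flight. */
static atomic_int mmu_notifier_range_count;

void range_begin(void)
{
	atomic_fetch_add(&mmu_notifier_range_count, 1);
}

void range_end(void)
{
	atomic_fetch_sub(&mmu_notifier_range_count, 1);
}

/* A secondary MMU page fault may only map a spte / secondary tlb
 * entry while no range invalidation is in progress, since any page
 * returned by get_user_pages inside the critical section could be
 * freed without a further ->invalidate_page notification. */
bool may_establish_spte(void)
{
	return atomic_load(&mmu_notifier_range_count) == 0;
}
```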
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_register
is used at driver startup, a failure can be gracefully handled. Here is
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver starts up, other allocations are required anyway
and -ENOMEM failure paths exist already.
struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int err;
if (!kvm)
return ERR_PTR(-ENOMEM);
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+ if (err) {
+ kfree(kvm);
+ return ERR_PTR(err);
+ }
+
return kvm;
}
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
the kernel from compiling after these changes on non-x86 archs (x86
didn't need them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-28 22:46:29 +00:00
|
|
|
#include <linux/cpumask.h>
|
2012-03-30 18:26:31 +00:00
|
|
|
#include <linux/uprobes.h>
|
2021-09-28 12:24:32 +00:00
|
|
|
#include <linux/rcupdate.h>
|
2013-02-23 00:34:30 +00:00
|
|
|
#include <linux/page-flags-layout.h>
|
2016-05-20 23:57:21 +00:00
|
|
|
#include <linux/workqueue.h>
|
2020-12-15 03:05:44 +00:00
|
|
|
#include <linux/seqlock.h>
|
2017-02-03 23:12:19 +00:00
|
|
|
|
2007-10-16 08:24:43 +00:00
|
|
|
#include <asm/mmu.h>
|
2006-09-27 08:50:01 +00:00
|
|
|
|
2007-10-17 06:30:12 +00:00
|
|
|
#ifndef AT_VECTOR_SIZE_ARCH
|
|
|
|
#define AT_VECTOR_SIZE_ARCH 0
|
|
|
|
#endif
|
|
|
|
#define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
|
|
|
|
|
2021-03-13 05:07:15 +00:00
|
|
|
#define INIT_PASID 0
|
2018-04-05 23:25:23 +00:00
|
|
|
|
2006-09-27 08:50:01 +00:00
|
|
|
struct address_space;
|
2014-12-10 23:44:52 +00:00
|
|
|
struct mem_cgroup;
|
2006-09-27 08:50:01 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Each physical page in the system has a struct page associated with
|
|
|
|
* it to keep track of whatever it is we are using the page for at the
|
|
|
|
* moment. Note that we have no way to track which tasks are using
|
|
|
|
* a page, though if it is a pagecache page, rmap structures can tell us
|
2018-06-08 00:08:53 +00:00
|
|
|
* who is mapping it.
|
2018-02-01 00:19:06 +00:00
|
|
|
*
|
2018-06-08 00:08:53 +00:00
|
|
|
* If you allocate the page using alloc_pages(), you can use some of the
|
|
|
|
* space in struct page for your own purposes. The five words in the main
|
|
|
|
* union are available, except for bit 0 of the first word which must be
|
|
|
|
* kept clear. Many users use this word to store a pointer to an object
|
|
|
|
* which is guaranteed to be aligned. If you use the same storage as
|
|
|
|
* page->mapping, you must restore it to NULL before freeing the page.
|
2018-02-01 00:19:06 +00:00
|
|
|
*
|
2018-06-08 00:08:53 +00:00
|
|
|
* If your page will not be mapped to userspace, you can also use the four
|
|
|
|
* bytes in the mapcount union, but you must call page_mapcount_reset()
|
|
|
|
* before freeing it.
|
|
|
|
*
|
|
|
|
* If you want to use the refcount field, it must be used in such a way
|
|
|
|
* that other CPUs temporarily incrementing and then decrementing the
|
|
|
|
* refcount does not cause problems. On receiving the page from
|
|
|
|
* alloc_pages(), the refcount will be positive.
|
|
|
|
*
|
|
|
|
* If you allocate pages of order > 0, you can use some of the fields
|
|
|
|
* in each subpage, but you may need to restore some of their values
|
|
|
|
* afterwards.
|
2011-06-01 17:25:48 +00:00
|
|
|
*
|
2021-10-04 13:45:51 +00:00
|
|
|
* SLUB uses cmpxchg_double() to atomically update its freelist and counters.
|
|
|
|
* That requires that freelist & counters in struct slab be adjacent and
|
|
|
|
* double-word aligned. Because struct slab currently just reinterprets the
|
|
|
|
* bits of struct page, we align all struct pages to double-word boundaries,
|
|
|
|
* and ensure that 'freelist' is aligned within struct slab.
|
2006-09-27 08:50:01 +00:00
|
|
|
*/
|
2018-02-01 00:18:44 +00:00
|
|
|
#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
|
|
|
|
#define _struct_page_alignment __aligned(2 * sizeof(unsigned long))
|
|
|
|
#else
|
2018-02-01 00:18:58 +00:00
|
|
|
#define _struct_page_alignment
|
2018-06-08 00:08:31 +00:00
|
|
|
#endif
|
2018-02-01 00:18:44 +00:00
|
|
|
|
2006-09-27 08:50:01 +00:00
|
|
|
struct page {
|
|
|
|
unsigned long flags; /* Atomic flags, some possibly
|
|
|
|
* updated asynchronously */
|
2018-06-08 00:08:46 +00:00
|
|
|
/*
|
2018-06-08 00:08:50 +00:00
|
|
|
* Five words (20/40 bytes) are available in this union.
|
|
|
|
* WARNING: bit 0 of the first word is used for PageTail(). That
|
|
|
|
* means the other users of this union MUST NOT use the bit to
|
2018-06-08 00:08:46 +00:00
|
|
|
* avoid collision and false-positive PageTail().
|
|
|
|
*/
|
2013-10-24 01:07:49 +00:00
|
|
|
union {
|
2018-06-08 00:08:39 +00:00
|
|
|
struct { /* Page cache and anonymous pages */
|
2018-06-08 00:08:50 +00:00
|
|
|
/**
|
|
|
|
* @lru: Pageout list, eg. active_list protected by
|
2020-12-15 22:21:31 +00:00
|
|
|
* lruvec->lru_lock. Sometimes used as a generic list
|
2018-06-08 00:08:50 +00:00
|
|
|
* by the page owner.
|
|
|
|
*/
|
2022-02-15 02:29:54 +00:00
|
|
|
union {
|
|
|
|
struct list_head lru;
|
2022-06-24 12:54:17 +00:00
|
|
|
|
2022-02-15 02:29:54 +00:00
|
|
|
/* Or, for the Unevictable "LRU list" slot */
|
|
|
|
struct {
|
|
|
|
/* Always even, to negate PageTail */
|
|
|
|
void *__filler;
|
|
|
|
/* Count page's or folio's mlocks */
|
|
|
|
unsigned int mlock_count;
|
|
|
|
};
|
2022-06-24 12:54:17 +00:00
|
|
|
|
|
|
|
/* Or, free page */
|
|
|
|
struct list_head buddy_list;
|
|
|
|
struct list_head pcp_list;
|
2022-02-15 02:29:54 +00:00
|
|
|
};
|
2018-06-08 00:08:39 +00:00
|
|
|
/* See page-flags.h for PAGE_MAPPING_FLAGS */
|
|
|
|
struct address_space *mapping;
|
|
|
|
pgoff_t index; /* Our offset within mapping. */
|
|
|
|
/**
|
|
|
|
* @private: Mapping-private opaque data.
|
|
|
|
* Usually used for buffer_heads if PagePrivate.
|
|
|
|
* Used for swp_entry_t if PageSwapCache.
|
|
|
|
* Indicates order in the buddy system if PageBuddy.
|
|
|
|
*/
|
|
|
|
unsigned long private;
|
|
|
|
};
|
2019-02-13 01:55:40 +00:00
|
|
|
struct { /* page_pool used by netstack */
|
2021-06-07 19:02:36 +00:00
|
|
|
/**
|
|
|
|
* @pp_magic: magic value to avoid recycling non
|
|
|
|
* page_pool allocated pages.
|
|
|
|
*/
|
|
|
|
unsigned long pp_magic;
|
|
|
|
struct page_pool *pp;
|
|
|
|
unsigned long _pp_mapping_pad;
|
2021-08-06 02:46:20 +00:00
|
|
|
unsigned long dma_addr;
|
2021-11-17 07:56:52 +00:00
|
|
|
union {
|
|
|
|
/**
|
|
|
|
* dma_addr_upper: might require a 64-bit
|
|
|
|
* value on 32-bit architectures.
|
|
|
|
*/
|
|
|
|
unsigned long dma_addr_upper;
|
|
|
|
/**
|
|
|
|
* For frag page support, not supported in
|
|
|
|
* 32-bit architectures with 64-bit DMA.
|
|
|
|
*/
|
|
|
|
atomic_long_t pp_frag_count;
|
|
|
|
};
|
2019-02-13 01:55:40 +00:00
|
|
|
};
|
2018-06-08 00:08:50 +00:00
|
|
|
struct { /* Tail pages of compound page */
|
|
|
|
unsigned long compound_head; /* Bit zero is set */
|
|
|
|
|
|
|
|
/* First tail page only */
|
|
|
|
unsigned char compound_dtor;
|
|
|
|
unsigned char compound_order;
|
|
|
|
atomic_t compound_mapcount;
|
2022-01-06 21:46:43 +00:00
|
|
|
atomic_t compound_pincount;
|
|
|
|
#ifdef CONFIG_64BIT
|
2020-08-15 00:30:23 +00:00
|
|
|
unsigned int compound_nr; /* 1 << compound_order */
|
2022-09-22 15:42:04 +00:00
|
|
|
unsigned long _private_1;
|
2022-01-06 21:46:43 +00:00
|
|
|
#endif
|
2018-06-08 00:08:50 +00:00
|
|
|
};
|
|
|
|
struct { /* Second tail page of compound page */
|
|
|
|
unsigned long _compound_pad_1; /* compound_head */
|
2022-01-06 21:46:43 +00:00
|
|
|
unsigned long _compound_pad_2;
|
2019-09-23 22:38:15 +00:00
|
|
|
/* For both global and memcg */
|
2018-06-08 00:08:50 +00:00
|
|
|
struct list_head deferred_list;
|
|
|
|
};
|
2018-06-08 00:08:39 +00:00
|
|
|
struct { /* Page table pages */
|
2018-06-08 00:08:50 +00:00
|
|
|
unsigned long _pt_pad_1; /* compound_head */
|
|
|
|
pgtable_t pmd_huge_pte; /* protected by page->ptl */
|
2018-06-08 00:08:39 +00:00
|
|
|
unsigned long _pt_pad_2; /* mapping */
|
2018-07-27 11:48:17 +00:00
|
|
|
union {
|
|
|
|
struct mm_struct *pt_mm; /* x86 pgds only */
|
|
|
|
atomic_t pt_frag_refcount; /* powerpc */
|
|
|
|
};
|
2018-06-08 00:08:31 +00:00
|
|
|
#if ALLOC_SPLIT_PTLOCKS
|
2018-06-08 00:08:39 +00:00
|
|
|
spinlock_t *ptl;
|
2018-06-08 00:08:31 +00:00
|
|
|
#else
|
2018-06-08 00:08:39 +00:00
|
|
|
spinlock_t ptl;
|
2018-06-08 00:08:31 +00:00
|
|
|
#endif
|
|
|
|
};
|
2018-06-08 00:09:01 +00:00
|
|
|
struct { /* ZONE_DEVICE pages */
|
|
|
|
/** @pgmap: Points to the hosting device page map. */
|
|
|
|
struct dev_pagemap *pgmap;
|
2019-06-26 12:27:21 +00:00
|
|
|
void *zone_device_data;
|
2019-08-13 22:37:04 +00:00
|
|
|
/*
|
|
|
|
* ZONE_DEVICE private pages are counted as being
|
|
|
|
* mapped so the next 3 words hold the mapping, index,
|
|
|
|
* and private fields from the source anonymous or
|
|
|
|
* page cache page while the page is migrated to device
|
|
|
|
* private memory.
|
|
|
|
* ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
|
|
|
|
* use the mapping, index, and private fields when
|
|
|
|
* pmem backed DAX files are mapped.
|
|
|
|
*/
|
2018-06-08 00:09:01 +00:00
|
|
|
};
|
2018-06-08 00:08:50 +00:00
|
|
|
|
|
|
|
/** @rcu_head: You can use this to free a page by RCU. */
|
|
|
|
struct rcu_head rcu_head;
|
2018-06-08 00:08:31 +00:00
|
|
|
};
|
|
|
|
|
2018-06-08 00:08:35 +00:00
|
|
|
union { /* This union is 4 bytes in size. */
|
|
|
|
/*
|
|
|
|
* If the page can be mapped to userspace, encodes the number
|
|
|
|
* of times this page is referenced by a page table.
|
|
|
|
*/
|
|
|
|
atomic_t _mapcount;
|
|
|
|
|
2018-06-08 00:08:18 +00:00
|
|
|
/*
|
|
|
|
* If the page is neither PageSlab nor mappable to userspace,
|
|
|
|
* the value stored here may help determine what this page
|
|
|
|
* is used for. See page-flags.h for a list of page types
|
|
|
|
* which are currently stored here.
|
|
|
|
*/
|
|
|
|
unsigned int page_type;
|
2007-05-06 21:49:36 +00:00
|
|
|
};
|
2011-06-01 17:25:48 +00:00
|
|
|
|
2018-06-08 00:08:35 +00:00
|
|
|
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
|
|
|
|
atomic_t _refcount;
|
|
|
|
|
2014-12-10 23:44:52 +00:00
|
|
|
#ifdef CONFIG_MEMCG
|
mm: memcontrol: Use helpers to read page's memcg data
Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace. The underlying reason is simple: the
PageKmemcg flag is defined as a page type (like buddy, offline, etc), so
it takes a bit from the page->_mapcount counter. Pages with a type set
can't be mapped to userspace.
But in general the kmemcg flag has nothing to do with mapping to
userspace. It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.
Some bpf maps map their vmalloc-based memory to userspace, and that
memory can't be accounted because of this implementation detail.
This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer. Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions. As a
result the code became more robust, with fewer open-coded bit tricks.
This patch (of 4):
Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.
It creates an obstacle to reusing some bits of the pointer for storing
additional information. In fact, we already do this for slab pages,
where the last bit indicates that the pointer has an attached vector of
objcg pointers instead of a regular memcg pointer.
This commit uses two existing helpers and introduces a new one,
converting all read sides to calls of these helpers:
struct mem_cgroup *page_memcg(struct page *page);
struct mem_cgroup *page_memcg_rcu(struct page *page);
struct mem_cgroup *page_memcg_check(struct page *page);
page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector. It does
check the lowest bit, and if set, returns NULL. page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.
To make sure nobody uses a direct access, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
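The low-bit tagging that page_memcg_check() relies on can be sketched as follows (illustrative, not the kernel code; the flag name mirrors the kernel's MEMCG_DATA_OBJCGS, but treat the details as assumptions):

```c
struct mem_cgroup;

/* Bit 0 of memcg_data marks an objcg vector rather than a plain
 * memcg pointer; both allocations are word-aligned, so the bit is
 * otherwise always clear. */
#define MEMCG_DATA_OBJCGS 0x1UL

/* Sketch of page_memcg_check(): return NULL when the low bit says
 * this is an objcg vector, else reinterpret as a memcg pointer. */
struct mem_cgroup *page_memcg_check_sketch(unsigned long memcg_data)
{
	if (memcg_data & MEMCG_DATA_OBJCGS)
		return 0;	/* objcg vector, not a memcg pointer */
	return (struct mem_cgroup *)memcg_data;
}
```

This is why pointer alignment matters: the tag bit is only free because a real memcg pointer can never have it set.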
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
2020-12-01 21:58:27 +00:00
|
|
|
unsigned long memcg_data;
|
2014-12-10 23:44:52 +00:00
|
|
|
#endif
|
|
|
|
|
2006-09-27 08:50:01 +00:00
|
|
|
/*
|
|
|
|
* On machines where all RAM is mapped into kernel address space,
|
|
|
|
* we can simply calculate the virtual address. On machines with
|
|
|
|
* highmem some memory is mapped into kernel virtual memory
|
|
|
|
* dynamically, so we need a place to store that address.
|
|
|
|
* Note that this field could be 16 bits on x86 ... ;)
|
|
|
|
*
|
|
|
|
* Architectures with slow multiplication can define
|
|
|
|
* WANT_PAGE_VIRTUAL in asm/page.h
|
|
|
|
*/
|
|
|
|
#if defined(WANT_PAGE_VIRTUAL)
|
|
|
|
void *virtual; /* Kernel virtual address (NULL if
|
|
|
|
not kmapped, ie. highmem) */
|
|
|
|
#endif /* WANT_PAGE_VIRTUAL */
|
2008-04-03 22:51:41 +00:00
|
|
|
|
2022-09-15 15:03:45 +00:00
|
|
|
#ifdef CONFIG_KMSAN
|
|
|
|
/*
|
|
|
|
* KMSAN metadata for this page:
|
|
|
|
* - shadow page: every bit indicates whether the corresponding
|
|
|
|
* bit of the original page is initialized (0) or not (1);
|
|
|
|
* - origin page: every 4 bytes contain an id of the stack trace
|
|
|
|
* where the uninitialized value was created.
|
|
|
|
*/
|
|
|
|
struct page *kmsan_shadow;
|
|
|
|
struct page *kmsan_origin;
|
|
|
|
#endif
|
|
|
|
|
2013-10-07 10:29:20 +00:00
|
|
|
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
|
|
|
|
int _last_cpupid;
|
2012-11-12 09:06:20 +00:00
|
|
|
#endif
|
2018-02-01 00:18:44 +00:00
|
|
|
} _struct_page_alignment;
|
2006-09-27 08:50:01 +00:00
|
|
|
|
2020-12-07 03:22:48 +00:00
|
|
|
/**
|
|
|
|
* struct folio - Represents a contiguous set of bytes.
|
|
|
|
* @flags: Identical to the page flags.
|
|
|
|
* @lru: Least Recently Used list; tracks how recently this folio was used.
|
2022-06-09 13:13:57 +00:00
|
|
|
* @mlock_count: Number of times this folio has been pinned by mlock().
|
2020-12-07 03:22:48 +00:00
|
|
|
* @mapping: The file this page belongs to, or refers to the anon_vma for
|
|
|
|
* anonymous memory.
|
|
|
|
* @index: Offset within the file, in units of pages. For anonymous memory,
|
|
|
|
* this is the index from the beginning of the mmap.
|
|
|
|
* @private: Filesystem per-folio data (see folio_attach_private()).
|
|
|
|
* Used for swp_entry_t if folio_test_swapcache().
|
|
|
|
* @_mapcount: Do not access this member directly. Use folio_mapcount() to
|
|
|
|
* find out how many times this folio is mapped by userspace.
|
|
|
|
* @_refcount: Do not access this member directly. Use folio_ref_count()
|
|
|
|
* to find how many references there are to this folio.
|
|
|
|
* @memcg_data: Memory Control Group data.
|
2022-09-02 19:45:58 +00:00
|
|
|
* @_flags_1: For large folios, additional page flags.
|
|
|
|
* @__head: Points to the folio. Do not use.
|
|
|
|
* @_folio_dtor: Which destructor to use for this folio.
|
|
|
|
* @_folio_order: Do not use directly, call folio_order().
|
|
|
|
* @_total_mapcount: Do not use directly, call folio_entire_mapcount().
|
|
|
|
* @_pincount: Do not use directly, call folio_maybe_dma_pinned().
|
|
|
|
* @_folio_nr_pages: Do not use directly, call folio_nr_pages().
|
2022-09-22 15:42:04 +00:00
|
|
|
* @_private_1: Do not use directly, call folio_get_private_1().
|
2020-12-07 03:22:48 +00:00
|
|
|
*
|
|
|
|
* A folio is a physically, virtually and logically contiguous set
|
|
|
|
* of bytes. It is a power-of-two in size, and it is aligned to that
|
|
|
|
* same power-of-two. It is at least as large as %PAGE_SIZE. If it is
|
|
|
|
* in the page cache, it is at a file offset which is a multiple of that
|
|
|
|
* power-of-two. It may be mapped into userspace at an address which is
|
|
|
|
* at an arbitrary page offset, but its kernel virtual address is aligned
|
|
|
|
* to its size.
|
|
|
|
*/
|
|
|
|
struct folio {
|
|
|
|
/* private: don't document the anon union */
|
|
|
|
union {
|
|
|
|
struct {
|
|
|
|
/* public: */
|
|
|
|
unsigned long flags;
|
2022-02-15 02:29:54 +00:00
|
|
|
union {
|
|
|
|
struct list_head lru;
|
2022-06-09 13:13:57 +00:00
|
|
|
/* private: avoid cluttering the output */
|
2022-02-15 02:29:54 +00:00
|
|
|
struct {
|
|
|
|
void *__filler;
|
2022-06-09 13:13:57 +00:00
|
|
|
/* public: */
|
2022-02-15 02:29:54 +00:00
|
|
|
unsigned int mlock_count;
|
2022-06-09 13:13:57 +00:00
|
|
|
/* private: */
|
2022-02-15 02:29:54 +00:00
|
|
|
};
|
2022-06-09 13:13:57 +00:00
|
|
|
/* public: */
|
2022-02-15 02:29:54 +00:00
|
|
|
};
|
2020-12-07 03:22:48 +00:00
|
|
|
struct address_space *mapping;
|
|
|
|
pgoff_t index;
|
|
|
|
void *private;
|
|
|
|
atomic_t _mapcount;
|
|
|
|
atomic_t _refcount;
|
|
|
|
#ifdef CONFIG_MEMCG
|
|
|
|
unsigned long memcg_data;
|
|
|
|
#endif
|
|
|
|
/* private: the union with struct page is transitional */
|
|
|
|
};
|
|
|
|
struct page page;
|
|
|
|
};
|
2022-09-02 19:45:58 +00:00
|
|
|
unsigned long _flags_1;
|
|
|
|
unsigned long __head;
|
|
|
|
unsigned char _folio_dtor;
|
|
|
|
unsigned char _folio_order;
|
|
|
|
atomic_t _total_mapcount;
|
|
|
|
atomic_t _pincount;
|
|
|
|
#ifdef CONFIG_64BIT
|
|
|
|
unsigned int _folio_nr_pages;
|
|
|
|
#endif
|
2022-09-22 15:42:04 +00:00
|
|
|
unsigned long _private_1;
|
2020-12-07 03:22:48 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
#define FOLIO_MATCH(pg, fl)						\
	static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
FOLIO_MATCH(flags, flags);
FOLIO_MATCH(lru, lru);
FOLIO_MATCH(mapping, mapping);
FOLIO_MATCH(compound_head, lru);
FOLIO_MATCH(index, index);
FOLIO_MATCH(private, private);
FOLIO_MATCH(_mapcount, _mapcount);
FOLIO_MATCH(_refcount, _refcount);
#ifdef CONFIG_MEMCG
FOLIO_MATCH(memcg_data, memcg_data);
#endif
#undef FOLIO_MATCH
#define FOLIO_MATCH(pg, fl)						\
	static_assert(offsetof(struct folio, fl) ==			\
		      offsetof(struct page, pg) + sizeof(struct page))
FOLIO_MATCH(flags, _flags_1);
FOLIO_MATCH(compound_head, __head);
FOLIO_MATCH(compound_dtor, _folio_dtor);
FOLIO_MATCH(compound_order, _folio_order);
FOLIO_MATCH(compound_mapcount, _total_mapcount);
FOLIO_MATCH(compound_pincount, _pincount);
#ifdef CONFIG_64BIT
FOLIO_MATCH(compound_nr, _folio_nr_pages);
FOLIO_MATCH(_private_1, _private_1);
#endif
#undef FOLIO_MATCH

static inline atomic_t *folio_mapcount_ptr(struct folio *folio)
{
	struct page *tail = &folio->page + 1;
	return &tail->compound_mapcount;
}

static inline atomic_t *compound_mapcount_ptr(struct page *page)
{
	return &page[1].compound_mapcount;
}

static inline atomic_t *compound_pincount_ptr(struct page *page)
{
	return &page[1].compound_pincount;
}

/*
 * Used for sizing the vmemmap region on some architectures
 */
#define STRUCT_PAGE_MAX_SHIFT	(order_base_2(sizeof(struct page)))

#define PAGE_FRAG_CACHE_MAX_SIZE	__ALIGN_MASK(32768, ~PAGE_MASK)
#define PAGE_FRAG_CACHE_MAX_ORDER	get_order(PAGE_FRAG_CACHE_MAX_SIZE)

/*
 * page_private can be used on tail pages. However, PagePrivate is only
 * checked by the VM on the head page. So page_private on the tail pages
 * should be used for data that's ancillary to the head page (eg attaching
 * buffer heads to tail pages after attaching buffer heads to the head page)
 */
#define page_private(page)		((page)->private)

static inline void set_page_private(struct page *page, unsigned long private)
{
	page->private = private;
}

static inline void *folio_get_private(struct folio *folio)
{
	return folio->private;
}

static inline void folio_set_private_1(struct folio *folio, unsigned long private)
{
	folio->_private_1 = private;
}

static inline unsigned long folio_get_private_1(struct folio *folio)
{
	return folio->_private_1;
}

struct page_frag_cache {
	void *va;
#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
	__u16 offset;
	__u16 size;
#else
	__u32 offset;
#endif
	/* we maintain a pagecount bias, so that we don't dirty the cache line
	 * containing page->_refcount every time we allocate a fragment.
	 */
	unsigned int pagecnt_bias;
	bool pfmemalloc;
};

typedef unsigned long vm_flags_t;

/*
 * A region containing a mapping of a non-memory backed file under NOMMU
 * conditions. These are held in a global tree and are pinned by the VMAs that
 * map parts of them.
 */
struct vm_region {
	struct rb_node	vm_rb;		/* link in global region tree */
	vm_flags_t	vm_flags;	/* VMA vm_flags */
	unsigned long	vm_start;	/* start address of region */
	unsigned long	vm_end;		/* region initialised to here */
	unsigned long	vm_top;		/* region allocated to here */
	unsigned long	vm_pgoff;	/* the offset in vm_file corresponding to vm_start */
	struct file	*vm_file;	/* the backing file or NULL */

	int		vm_usage;	/* region usage count (access under nommu_region_sem) */
	bool		vm_icache_flushed : 1; /* true if the icache has been flushed for
						* this region */
};

#ifdef CONFIG_USERFAULTFD
#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
struct vm_userfaultfd_ctx {
	struct userfaultfd_ctx *ctx;
};
#else /* CONFIG_USERFAULTFD */
#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
struct vm_userfaultfd_ctx {};
#endif /* CONFIG_USERFAULTFD */

struct anon_vma_name {
	struct kref kref;
	/* The name needs to be at the end because it is dynamically sized. */
	char name[];
};
|
|
|
|
|
2007-10-16 08:24:43 +00:00
|
|
|
/*
|
2020-04-07 03:08:33 +00:00
|
|
|
* This struct describes a virtual memory area. There is one of these
|
|
|
|
* per VM-area/task. A VM area is any part of the process virtual memory
|
2007-10-16 08:24:43 +00:00
|
|
|
* space that has a special rule for the page-fault handlers (ie a shared
|
|
|
|
* library, the executable area etc).
|
|
|
|
*/
|
|
|
|
struct vm_area_struct {
|
2012-12-12 00:01:44 +00:00
|
|
|
/* The first cache line has the info for VMA tree walking. */
|
|
|
|
|
2007-10-16 08:24:43 +00:00
|
|
|
unsigned long vm_start; /* Our start address within vm_mm. */
|
|
|
|
unsigned long vm_end; /* The first byte after our end address
|
|
|
|
within vm_mm. */
|
|
|
|
|
2012-12-12 00:01:44 +00:00
|
|
|
struct mm_struct *vm_mm; /* The address space we belong to. */
|
2019-11-22 08:25:12 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Access permissions of this VMA.
|
|
|
|
* See vmf_insert_mixed_prot() for discussion.
|
|
|
|
*/
|
|
|
|
pgprot_t vm_page_prot;
|
2012-12-12 00:01:44 +00:00
|
|
|
unsigned long vm_flags; /* Flags, see mm.h. */
|
|
|
|
|
2007-10-16 08:24:43 +00:00
|
|
|
/*
|
|
|
|
* For areas with an address space and backing store,
|
2015-02-10 22:09:59 +00:00
|
|
|
* linkage into the address_space->i_mmap interval tree.
|
mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-14 22:05:59 +00:00
|
|
|
*
|
|
|
|
* For private anonymous mappings, a pointer to a null terminated string
|
|
|
|
* containing the name given to the vma, or NULL if unnamed.
|
2007-10-16 08:24:43 +00:00
|
|
|
*/
|
mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null terminated string. Anonymous vmas with the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested to copy the name into kernel memory space, perform
validity checks [3] and store as a string referenced from
vm_area_struct.
One big concern is about fork() performance which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with
worst-case scenario of forking a process with 64k vmas having longest
possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ascii characters (including
space), except '[',']','\','$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-14 22:05:59 +00:00
|
|
|
|
|
|
|
union {
|
|
|
|
struct {
|
|
|
|
struct rb_node rb;
|
|
|
|
unsigned long rb_subtree_last;
|
|
|
|
} shared;
|
2022-03-05 04:28:51 +00:00
|
|
|
/*
|
|
|
|
* Serialized by mmap_sem. Never use directly because it is
|
|
|
|
* valid only when vm_file is NULL. Use anon_vma_name instead.
|
|
|
|
*/
|
2022-01-14 22:06:03 +00:00
|
|
|
struct anon_vma_name *anon_name;
|
mm: add a field to store names for private anonymous memory
In many userspace applications, and especially in VM based applications
like Android uses heavily, there are multiple different allocators in
use. At a minimum there is libc malloc and the stack, and in many cases
there are libc malloc, the stack, direct syscalls to mmap anonymous
memory, and multiple VM heaps (one for small objects, one for big
objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.
On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).
If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.
Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.
This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
Setting the name to NULL clears it. The name length limit is 80 bytes
including NUL-terminator and is checked to contain only printable ascii
characters (including space), except '[',']','\','$' and '`'.
Ascii strings are being used to have a descriptive identifiers for vmas,
which can be understood by the users reading /proc/pid/maps or
/proc/pid/smaps. Names can be standardized for a given system and they
can include some variable parts such as the name of the allocator or a
library, tid of the thread using it, etc.
The name is stored in a pointer in the shared union in vm_area_struct
that points to a null-terminated string. Anonymous vmas that have the same
name (equivalent strings) and are otherwise mergeable will be merged.
The name pointers are not shared between vmas even if they contain the
same name. The name pointer is stored in a union with fields that are
only used on file-backed mappings, so it does not increase memory usage.
A new CONFIG_ANON_VMA_NAME kernel configuration option is introduced to enable this
feature. It keeps the feature disabled by default to prevent any
additional memory overhead and to avoid confusing procfs parsers on
systems which are not ready to support named anonymous vmas.
The patch is based on the original patch developed by Colin Cross, more
specifically on its latest version [1] posted upstream by Sumit Semwal.
It used a userspace pointer to store vma names. In that design, name
pointers could be shared between vmas. However, during the last
upstreaming attempt, Kees Cook raised concerns [2] about this approach
and suggested copying the name into kernel memory, performing validity
checks [3], and storing it as a string referenced from
vm_area_struct.
One big concern is fork() performance, which would need to strdup
anonymous vma names. Dave Hansen suggested experimenting with the
worst-case scenario of forking a process with 64k vmas that have the
longest possible names [4]. I ran this experiment on an ARM64 Android device
and recorded a worst-case regression of almost 40% when forking such a
process.
This regression is addressed in the followup patch which replaces the
pointer to a name with a refcounted structure that allows sharing the
name pointer between vmas of the same name. Instead of duplicating the
string during fork() or when splitting a vma it increments the refcount.
[1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
[2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
[3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
[4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
Changes for prctl(2) manual page (in the options section):
PR_SET_VMA
Sets an attribute specified in arg2 for virtual memory areas
starting from the address specified in arg3 and spanning the
size specified in arg4. arg5 specifies the value of the attribute
to be set. Note that assigning an attribute to a virtual memory
area might prevent it from being merged with adjacent virtual
memory areas due to the difference in that attribute's value.
Currently, arg2 must be one of:
PR_SET_VMA_ANON_NAME
Set a name for anonymous virtual memory areas. arg5 should
be a pointer to a null-terminated string containing the
name. The name length including null byte cannot exceed
80 bytes. If arg5 is NULL, the name of the appropriate
anonymous virtual memory areas will be reset. The name
can contain only printable ASCII characters (including
space), except '[', ']', '\', '$' and '`'.
This feature is available only if the kernel is built with
the CONFIG_ANON_VMA_NAME option enabled.
[surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
[surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
work here was done by Colin Cross, therefore, with his permission, keeping
him as the author]
Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
Signed-off-by: Colin Cross <ccross@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Glauber <jan.glauber@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Landley <rob@landley.net>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Shaohua Li <shli@fusionio.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-14 22:05:59 +00:00
|
|
|
};
|
2007-10-16 08:24:43 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
|
|
|
|
* list, after a COW of one of the file pages. A MAP_SHARED vma
|
|
|
|
* can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
|
|
|
|
* or brk vma (with NULL file) can only be in an anon_vma list.
|
|
|
|
*/
|
2020-06-09 04:33:54 +00:00
|
|
|
struct list_head anon_vma_chain; /* Serialized by mmap_lock &
|
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor of 10 less process-intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05 21:42:07 +00:00
|
|
|
* page_table_lock */
|
2007-10-16 08:24:43 +00:00
|
|
|
struct anon_vma *anon_vma; /* Serialized by page_table_lock */
|
|
|
|
|
|
|
|
/* Function pointers to deal with this struct. */
|
2009-09-27 18:29:37 +00:00
|
|
|
const struct vm_operations_struct *vm_ops;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
|
|
|
/* Information about our backing store: */
|
|
|
|
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
|
2016-04-01 12:29:48 +00:00
|
|
|
units */
|
2007-10-16 08:24:43 +00:00
|
|
|
struct file * vm_file; /* File we map to (can be NULL). */
|
|
|
|
void * vm_private_data; /* was vm_pte (shared mem) */
|
|
|
|
|
2019-07-12 03:54:43 +00:00
|
|
|
#ifdef CONFIG_SWAP
|
mm, swap: VMA based swap readahead
Swap readahead is an important mechanism for reducing swap-in
latency. Although a purely sequential memory access pattern isn't very
popular for anonymous memory, the space locality is still considered
valid.
In the original swap readahead implementation, the consecutive blocks in
swap device are readahead based on the global space locality estimation.
But the consecutive blocks in swap device just reflect the order of page
reclaiming, and don't necessarily reflect the access pattern in virtual
memory. And the different tasks in the system may have different access
patterns, which makes the global space locality estimation incorrect.
In this patch, when a page fault occurs, the virtual pages near the fault
address will be read ahead instead of the swap slots near the faulting swap
slot in the swap device. This avoids reading ahead unrelated swap slots.
At the same time, swap readahead is changed to work per-VMA instead of
globally, so that the different access patterns of different VMAs
can be distinguished and different readahead policies can be
applied accordingly. The original core readahead detection and scaling
algorithm is reused, because it is an effective algorithm for detecting
space locality.
The tests and results are as follows.
Common test condition
=====================
Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
NVMe disk
Micro-benchmark with combined access pattern
============================================
vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory writing until 300
seconds. The first round writing will trigger swap out, the following
rounds will trigger sequential swap in and out.
At the same time, run vm-scalability random swap test case in
background, 8 processes to eat 30G virtual memory space, repeat the
random memory write until 300 seconds. This will trigger random swap-in
in the background.
This is a combined workload with sequential and random memory accessing
at the same time. The result (for the sequential workload) is as follows:
Base Optimized
---- ---------
throughput 345413 KB/s 414029 KB/s (+19.9%)
latency.average 97.14 us 61.06 us (-37.1%)
latency.50th 2 us 1 us
latency.60th 2 us 1 us
latency.70th 98 us 2 us
latency.80th 160 us 2 us
latency.90th 260 us 217 us
latency.95th 346 us 369 us
latency.99th 1.34 ms 1.09 ms
ra_hit% 52.69% 99.98%
The original swap readahead algorithm is confused by the background
random access workload, so its readahead hit rate is lower. The VMA-based
readahead algorithm works much better.
Linpack
=======
The test memory size is bigger than RAM to trigger swapping.
Base Optimized
---- ---------
elapsed_time 393.49 s 329.88 s (-16.2%)
ra_hit% 86.21% 98.82%
The scores of the base and optimized kernels show no visible change, but
the elapsed time is reduced and the readahead hit rate is improved, so the
optimized kernel runs better during the startup and teardown stages. The
absolute readahead hit rate is also high, which shows that space locality
is still valid in some practical workloads.
Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 23:24:36 +00:00
|
|
|
atomic_long_t swap_readahead_info;
|
2019-07-12 03:54:43 +00:00
|
|
|
#endif
|
2007-10-16 08:24:43 +00:00
|
|
|
#ifndef CONFIG_MMU
|
2009-01-08 12:04:47 +00:00
|
|
|
struct vm_region *vm_region; /* NOMMU mapping region */
|
2007-10-16 08:24:43 +00:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
|
|
|
|
#endif
|
2015-09-04 22:46:14 +00:00
|
|
|
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
|
2016-10-28 08:22:25 +00:00
|
|
|
} __randomize_layout;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
aio: convert the ioctx list to table lookup v3
On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
> On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
> > When using a large number of threads performing AIO operations the
> > IOCTX list may get a significant number of entries which will cause
> > significant overhead. For example, when running this fio script:
> >
> > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=512; thread; loops=100
> >
> > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
> > 30% CPU time spent by lookup_ioctx:
> >
> > 32.51% [guest.kernel] [g] lookup_ioctx
> > 9.19% [guest.kernel] [g] __lock_acquire.isra.28
> > 4.40% [guest.kernel] [g] lock_release
> > 4.19% [guest.kernel] [g] sched_clock_local
> > 3.86% [guest.kernel] [g] local_clock
> > 3.68% [guest.kernel] [g] native_sched_clock
> > 3.08% [guest.kernel] [g] sched_clock_cpu
> > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11
> > 2.60% [guest.kernel] [g] memcpy
> > 2.33% [guest.kernel] [g] lock_acquired
> > 2.25% [guest.kernel] [g] lock_acquire
> > 1.84% [guest.kernel] [g] do_io_submit
> >
> > This patch converts the ioctx list to a radix tree. For a performance
> > comparison the above FIO script was run on a 2 sockets 8 core
> > machine. These are the results (average and %rsd of 10 runs) for the
> > original list based implementation and for the radix tree based
> > implementation:
> >
> > cores 1 2 4 8 16 32
> > list 109376 ms 69119 ms 35682 ms 22671 ms 19724 ms 16408 ms
> > %rsd 0.69% 1.15% 1.17% 1.21% 1.71% 1.43%
> > radix 73651 ms 41748 ms 23028 ms 16766 ms 15232 ms 13787 ms
> > %rsd 1.19% 0.98% 0.69% 1.13% 0.72% 0.75%
> > % of radix
> > relative 66.12% 65.59% 66.63% 72.31% 77.26% 83.66%
> > to list
> >
> > To consider the impact of the patch on the typical case of having
> > only one ctx per process the following FIO script was run:
> >
> > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=1; thread; loops=100
> >
> > on the same system and the results are the following:
> >
> > list 58892 ms
> > %rsd 0.91%
> > radix 59404 ms
> > %rsd 0.81%
> > % of radix
> > relative 100.87%
> > to list
>
> So, I was just doing some benchmarking/profiling to get ready to send
> out the aio patches I've got for 3.11 - and it looks like your patch is
> causing a ~1.5% throughput regression in my testing :/
... <snip>
I've got an alternate approach for fixing this wart in lookup_ioctx()...
Instead of using an rbtree, just use the reserved id in the ring buffer
header to index an array pointing to the ioctx. It's not finished yet, and
it needs to be tidied up, but is most of the way there.
-ben
--
"Thought is the essence of where you are now."
--
kmo> And, a rework of Ben's code, but this was entirely his idea
kmo> -Kent
bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
free memory.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 16:54:40 +00:00
|
|
|
struct kioctx_table;
|
2007-10-16 08:24:43 +00:00
|
|
|
struct mm_struct {
|
2018-07-16 19:03:31 +00:00
|
|
|
struct {
|
2022-09-06 19:48:45 +00:00
|
|
|
struct maple_tree mm_mt;
|
2010-01-16 01:01:35 +00:00
|
|
|
#ifdef CONFIG_MMU
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long (*get_unmapped_area) (struct file *filp,
|
2007-10-16 08:24:43 +00:00
|
|
|
unsigned long addr, unsigned long len,
|
|
|
|
unsigned long pgoff, unsigned long flags);
|
2010-01-16 01:01:35 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long mmap_base; /* base of mmap area */
|
|
|
|
unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
|
2017-03-06 14:17:19 +00:00
|
|
|
#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
|
2021-07-01 01:53:17 +00:00
|
|
|
/* Base addresses for compatible mmap() */
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long mmap_compat_base;
|
|
|
|
unsigned long mmap_compat_legacy_base;
|
2017-03-06 14:17:19 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long task_size; /* size of task vm space */
|
|
|
|
pgd_t * pgd;
|
|
|
|
|
2019-09-19 17:37:02 +00:00
|
|
|
#ifdef CONFIG_MEMBARRIER
|
|
|
|
/**
|
|
|
|
* @membarrier_state: Flags controlling membarrier behavior.
|
|
|
|
*
|
|
|
|
* This field is close to @pgd to hopefully fit in the same
|
|
|
|
* cache-line, which needs to be touched by switch_mm().
|
|
|
|
*/
|
|
|
|
atomic_t membarrier_state;
|
|
|
|
#endif
|
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
/**
|
|
|
|
* @mm_users: The number of users including userspace.
|
|
|
|
*
|
|
|
|
* Use mmget()/mmget_not_zero()/mmput() to modify. When this
|
|
|
|
* drops to 0 (i.e. when the task exits and there are no other
|
|
|
|
* temporary reference holders), we also release a reference on
|
|
|
|
* @mm_count (which may then free the &struct mm_struct if
|
|
|
|
* @mm_count also drops to 0).
|
|
|
|
*/
|
|
|
|
atomic_t mm_users;
|
|
|
|
|
|
|
|
/**
|
|
|
|
* @mm_count: The number of references to &struct mm_struct
|
|
|
|
* (@mm_users count as 1).
|
|
|
|
*
|
|
|
|
* Use mmgrab()/mmdrop() to modify. When this drops to 0, the
|
|
|
|
* &struct mm_struct is freed.
|
|
|
|
*/
|
|
|
|
atomic_t mm_count;
|
2017-02-27 22:30:16 +00:00
|
|
|
|
2017-11-16 01:35:37 +00:00
|
|
|
#ifdef CONFIG_MMU
|
2018-07-16 19:03:31 +00:00
|
|
|
atomic_long_t pgtables_bytes; /* PTE page table pages */
|
2015-04-14 22:46:21 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
int map_count; /* number of VMAs */
|
2011-03-22 23:32:50 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
spinlock_t page_table_lock; /* Protects page tables and some
|
|
|
|
* counters
|
|
|
|
*/
|
mm: relocate 'write_protect_seq' in struct mm_struct
0day robot reported a 9.2% regression for will-it-scale mmap1 test
case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
racing with COW during fork").
Further debugging shows the regression is due to that commit changing the
offset of the hot 'mmap_lock' field inside 'struct mm_struct', and thus
its cache alignment.
From the perf data, the contention for 'mmap_lock' is very severe and
takes around 95% of CPU cycles. It is a rw_semaphore:
struct rw_semaphore {
atomic_long_t count; /* 8 bytes */
atomic_long_t owner; /* 8 bytes */
struct optimistic_spin_queue osq; /* spinner MCS lock */
...
Before commit 57efa1fe5957 added 'write_protect_seq', the structure
happened to have a near-optimal cache alignment layout, as Linus explained:
"and before the addition of the 'write_protect_seq' field, the
mmap_sem was at offset 120 in 'struct mm_struct'.
Which meant that count and owner were in two different cachelines,
and then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind
of layout you want.
Because first the rwsem_write_trylock() will do a cmpxchg on the
first cacheline (for the optimistic fast-path), and then in the
case of contention, rwsem_down_write_slowpath() will just access
the second cacheline.
Which is probably just optimal for a load that spends a lot of
time contended - new waiters touch that first cacheline, and then
they queue themselves up on the second cacheline."
After the commit, the rw_semaphore is at offset 128, which means the
'count' and 'owner' fields are now in the same cacheline, and causes
more cache bouncing.
Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will
affect its offset:
CONFIG_MMU
CONFIG_MEMBARRIER
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
The layout above is on 64 bits system with 0day's default kernel config
(similar to RHEL-8.3's config), in which all these 3 options are 'y'.
And the layout can vary with different kernel configs.
Re-laying out a structure is usually a double-edged sword: it can help
one case but hurt others. In this case, since the newly added
'write_protect_seq' is a 4-byte-long seqcount_t (when
CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4-byte hole in
'mm_struct' does not change other fields' alignment, while undoing the
regression.
Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ [1]
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-11 01:54:42 +00:00
|
|
|
/*
|
|
|
|
* With some kernel config, the current mmap_lock's offset
|
|
|
|
* inside 'mm_struct' is at 0x120, which is very optimal, as
|
|
|
|
* its two hot fields 'count' and 'owner' sit in 2 different
|
|
|
|
* cachelines, and when mmap_lock is highly contended, both
|
|
|
|
* of the 2 fields will be accessed frequently, current layout
|
|
|
|
* will help to reduce cache bouncing.
|
|
|
|
*
|
|
|
|
* So please be careful with adding new fields before
|
|
|
|
* mmap_lock, which can easily push the 2 fields into one
|
|
|
|
* cacheline.
|
|
|
|
*/
|
2020-06-09 04:33:47 +00:00
|
|
|
struct rw_semaphore mmap_lock;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
struct list_head mmlist; /* List of maybe swapped mm's. These
|
|
|
|
* are globally strung together off
|
|
|
|
* init_mm.mmlist, and are protected
|
|
|
|
* by mmlist_lock
|
|
|
|
*/
|
2007-10-16 08:24:43 +00:00
|
|
|
|
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long hiwater_rss; /* High-watermark of RSS usage */
|
|
|
|
unsigned long hiwater_vm; /* High-water virtual memory usage */
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long total_vm; /* Total pages mapped */
|
|
|
|
unsigned long locked_vm; /* Pages that have PG_mlocked set */
|
2019-02-06 17:59:15 +00:00
|
|
|
atomic64_t pinned_vm; /* Refcount permanently increased */
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
|
|
|
|
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
|
|
|
|
unsigned long stack_vm; /* VM_STACK */
|
|
|
|
unsigned long def_flags;
|
mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
mmap_sem is on a hot path of the kernel and is very contended, but it is
also abused. It is used to protect arg_start|end and env_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but this doesn't make
sense: those proc files just expect to read 4 values atomically, the values
are not related to VM, and they could be set to arbitrary values by C/R.
And, the mmap_sem contention may cause unexpected issue like below:
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
proc_pid_cmdline_read+0xd9/0x4e0
__vfs_read+0x37/0x150
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xc5
Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.
So, introduce a new spinlock in mm_struct to protect concurrent access
to arg_start|end, env_start|end and others, and replace the mmap_sem
write lock with a read lock to protect against the race between prctl
and sys_brk which might break check_data_rlimit(). This makes prctl
more friendly to other VM operations.
This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs to acquire mmap_sem. The mmap_sem
scalability issue will be solved in the future.
[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-08 00:05:28 +00:00
|
|
|
|
mm: relocate 'write_protect_seq' in struct mm_struct
2021-06-11 01:54:42 +00:00
|
|
|
/**
|
|
|
|
* @write_protect_seq: Locked when any thread is write
|
|
|
|
* protecting pages mapped by this mm to enforce a later COW,
|
|
|
|
* for instance during page table copying for fork().
|
|
|
|
*/
|
|
|
|
seqcount_t write_protect_seq;
|
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
spinlock_t arg_lock; /* protect the below fields */
|
mm: relocate 'write_protect_seq' in struct mm_struct
2021-06-11 01:54:42 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long start_code, end_code, start_data, end_data;
|
|
|
|
unsigned long start_brk, brk, start_stack;
|
|
|
|
unsigned long arg_start, arg_end, env_start, env_end;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
/*
|
|
|
|
* Special counters, in some configurations protected by the
|
|
|
|
* page_table_lock, in other configurations by being atomic.
|
|
|
|
*/
|
|
|
|
struct mm_rss_stat rss_stat;
|
2009-09-23 22:57:41 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
struct linux_binfmt *binfmt;
|
2011-05-29 18:32:28 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
/* Architecture-specific MM context */
|
|
|
|
mm_context_t context;
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long flags; /* Must use atomic bitops to access */
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2009-09-23 22:57:32 +00:00
|
|
|
#ifdef CONFIG_AIO
|
2018-07-16 19:03:31 +00:00
|
|
|
spinlock_t ioctx_lock;
|
|
|
|
struct kioctx_table __rcu *ioctx_table;
|
2009-09-23 22:57:32 +00:00
|
|
|
#endif
|
2014-06-04 23:07:34 +00:00
|
|
|
#ifdef CONFIG_MEMCG
|
2018-07-16 19:03:31 +00:00
|
|
|
/*
|
|
|
|
* "owner" points to a task that is regarded as the canonical
|
|
|
|
* user/owner of this mm. All of the following must be true in
|
|
|
|
* order for it to be changed:
|
|
|
|
*
|
|
|
|
* current == mm->owner
|
|
|
|
* current->mm != mm
|
|
|
|
* new_owner->mm == mm
|
|
|
|
* new_owner->alloc_lock is held
|
|
|
|
*/
|
|
|
|
struct task_struct __rcu *owner;
|
2008-02-07 08:13:51 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
struct user_namespace *user_ns;
|
2008-04-29 08:01:36 +00:00
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
/* store ref to file /proc/<pid>/exe symlink points to */
|
|
|
|
struct file __rcu *exe_file;
|
mmu-notifiers: core
With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case with KVM, where a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.
Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page being freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).
The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it avoids requiring the code that
tears down the mappings of the secondary MMU to implement logic like
tlb_gather in zap_page_range, which would require many IPIs to flush
other cpu tlbs for each fixed number of sptes unmapped.
To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establish an updated
spte or secondary-tlb-mapping on the copied page. Or it will set up a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.
This allows mapping any page pointed to by any pte (and in turn visible in
the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence the need to
schedule in XPMEM code to send the invalidate to the remote node, while
there is no need to schedule in kvm/gru as it's an immediate event like
invalidating a primary-mmu pte).
At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.
Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track if the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter increased in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.
2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows compiling a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.
The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_register
is used at driver startup, a failure can be gracefully handled. Here is
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver starts up, other allocations are required anyway and
-ENOMEM failure paths exist already.
struct kvm *kvm_arch_create_vm(void)
{
	struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+	int err;

	if (!kvm)
		return ERR_PTR(-ENOMEM);

	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+	kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+	err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+	if (err) {
+		kfree(kvm);
+		return ERR_PTR(err);
+	}
+
	return kvm;
}
mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
the kernel from compiling after these changes on non-x86 archs (x86 didn't need
them by luck).
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Chris Wright <chrisw@redhat.com>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Izik Eidus <izike@qumranet.com>
Cc: Anthony Liguori <aliguori@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-28 22:46:29 +00:00
|
|
|
#ifdef CONFIG_MMU_NOTIFIER
|
2019-12-18 17:40:35 +00:00
|
|
|
struct mmu_notifier_subscriptions *notifier_subscriptions;
|
2011-01-13 23:46:45 +00:00
|
|
|
#endif
|
2013-11-14 22:31:07 +00:00
|
|
|
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
|
2018-07-16 19:03:31 +00:00
|
|
|
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
|
2012-10-25 12:16:43 +00:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_NUMA_BALANCING
|
2018-07-16 19:03:31 +00:00
|
|
|
/*
|
2022-08-25 16:46:59 +00:00
|
|
|
* numa_next_scan is the next time that PTEs will be remapped
|
|
|
|
* PROT_NONE to trigger NUMA hinting faults; such faults gather
|
|
|
|
* statistics and migrate pages to new nodes if necessary.
|
2018-07-16 19:03:31 +00:00
|
|
|
*/
|
|
|
|
unsigned long numa_next_scan;
|
2012-10-25 12:16:43 +00:00
|
|
|
|
2022-08-25 16:46:59 +00:00
|
|
|
/* Restart point for scanning and remapping PTEs. */
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long numa_scan_offset;
|
mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.
That method has various (obvious) disadvantages:
- it samples the working set at dissimilar rates,
giving some tasks a sampling quality advantage
over others.
- creates performance problems for tasks with very
large working sets
- over-samples processes with large address spaces but
which only very rarely execute
Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it by a constant rate (in a CPU cycles execution
proportional manner). If the offset reaches the last mapped
address of the mm, it starts over at the first address.
The per-task nature of the working set sampling functionality in this tree
allows such constant rate, per task, execution-weight proportional sampling
of the working set, with an adaptive sampling interval/frequency that
goes from once per 100ms up to just once per 8 seconds. The current
sampling volume is 256 MB per interval.
As tasks mature and converge their working set, so does the
sampling rate slow down to just a trickle, 256 MB per 8
seconds of CPU time executed.
This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.
[ In AutoNUMA speak, this patch deals with the effective sampling
rate of the 'hinting page fault'. AutoNUMA's scanning is
currently rate-limited, but it is also fundamentally
single-threaded, executing in the knuma_scand kernel thread,
so the limit in AutoNUMA is global and does not scale up with
the number of CPUs, nor does it scan tasks in an execution
proportional manner.
So the idea of rate-limiting the scanning was first implemented
in the AutoNUMA tree via a global rate limit. This patch goes
beyond that by implementing an execution rate proportional
working set sampling rate that is not implemented via a single
global scanning daemon. ]
[ Dan Carpenter pointed out a possible NULL pointer dereference in the
first version of this patch. ]
Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
2012-10-25 12:16:45 +00:00
|
|
|
|
2022-08-25 16:46:59 +00:00
|
|
|
/* numa_scan_seq prevents two threads remapping PTEs. */
|
2018-07-16 19:03:31 +00:00
|
|
|
int numa_scan_seq;
|
mm: fix TLB flush race between migration, and change_protection_range
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A			CPU B			CPU C

						load TLB entry
make entry PTE/PMD_NUMA
						fault on entry
			read/write old page
			start migrating page
			change PTE/PMD to new page
						read/write old page [*]
flush TLB
						reload TLB from new entry
						read/write new page
lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making sure that
pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory may
still be accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-19 01:08:44 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
/*
|
|
|
|
* An operation with batched TLB flushing is going on. Anything
|
|
|
|
* that can move process memory needs to flush the TLB when
|
2022-08-25 16:46:59 +00:00
|
|
|
* moving a PROT_NONE mapped page.
|
2018-07-16 19:03:31 +00:00
|
|
|
*/
|
|
|
|
atomic_t tlb_flush_pending;
|
2017-08-02 20:31:52 +00:00
|
|
|
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
|
2018-07-16 19:03:31 +00:00
|
|
|
/* See flush_tlb_batched_pending() */
|
2022-01-14 22:09:16 +00:00
|
|
|
atomic_t tlb_flush_batched;
|
2011-05-29 18:32:28 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
struct uprobes_state uprobes_state;
|
2021-09-28 12:24:32 +00:00
|
|
|
#ifdef CONFIG_PREEMPT_RT
|
|
|
|
struct rcu_head delayed_drop;
|
|
|
|
#endif
|
2015-11-06 02:47:14 +00:00
|
|
|
#ifdef CONFIG_HUGETLB_PAGE
|
2018-07-16 19:03:31 +00:00
|
|
|
atomic_long_t hugetlb_usage;
|
2015-11-06 02:47:14 +00:00
|
|
|
#endif
|
2018-07-16 19:03:31 +00:00
|
|
|
struct work_struct async_put_work;
|
2020-09-15 16:30:11 +00:00
|
|
|
|
2022-02-07 23:02:45 +00:00
|
|
|
#ifdef CONFIG_IOMMU_SVA
|
2020-09-15 16:30:11 +00:00
|
|
|
u32 pasid;
|
2022-04-29 06:16:16 +00:00
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_KSM
|
|
|
|
/*
|
|
|
|
* Represent how many pages of this process are involved in KSM
|
|
|
|
* merging.
|
|
|
|
*/
|
|
|
|
unsigned long ksm_merging_pages;
|
ksm: count allocated ksm rmap_items for each process
Patch series "ksm: count allocated rmap_items and update documentation",
v5.
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information.
To help determine how much benefit the ksm-policy (like madvise) they are
using brings, we add a new interface /proc/<pid>/ksm_stat for each process.
The value "ksm_rmap_items" in it indicates the total allocated ksm
rmap_items of this process.
The detailed description can be seen in the following patches' commit
message.
This patch (of 2):
KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information. Some of these pages may be merged,
but some may not be able to be merged even after being checked several
times, which is unprofitable memory consumption.
The information about whether KSM save memory or consume memory in
system-wide range can be determined by the comprehensive calculation of
pages_sharing, pages_shared, pages_unshared and pages_volatile. A simple
approximate calculation:
profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);
where all_rmap_items equals to the sum of pages_sharing, pages_shared,
pages_unshared and pages_volatile.
But we cannot calculate this kind of ksm profit at single-process scope
because the number of a process's ksm rmap_items is not available.
For user applications, if this kind of information could be obtained, it
helps users know how much benefit the ksm-policy (like madvise) they
are using brings, and then optimize their app code. For example, one
application madvise 1000 pages as MERGEABLE, while only a few pages are
really merged, then it's not cost-efficient.
So we add a new interface /proc/<pid>/ksm_stat for each process, in which
only the value of ksm_rmap_items is shown for now, so that more values can
be added in the future.
So similarly, we can calculate the ksm profit approximately for a single
process by:
profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items *
sizeof(rmap_item);
where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and
ksm_rmap_items is shown in /proc/<pid>/ksm_stat.
Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn
Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-30 14:38:38 +00:00
|
|
|
/*
|
|
|
|
* Represent how many pages are checked for ksm merging
|
|
|
|
* including merged and not merged.
|
|
|
|
*/
|
|
|
|
unsigned long ksm_rmap_items;
|
2020-09-15 16:30:11 +00:00
|
|
|
#endif
|
mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages. A kill switch will be
added in the next patch to disable this behavior. When disabled, the
aging relies on the rmap only.
NB: this behavior has nothing in common with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated. Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.
When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently. Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim. Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[8, 10]%
Ops/sec KB/sec
patch1-7: 1147696.57 44640.29
patch1-8: 1245274.91 48435.66
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
48.16% lzo1x_1_do_compress (real work)
8.20% page_vma_mapped_walk (overhead)
7.06% _raw_spin_unlock_irq
2.92% ptep_clear_flush
2.53% __zram_bvec_write
2.11% do_raw_spin_lock
2.02% memmove
1.93% lru_gen_look_around
1.56% free_unref_page_list
1.40% memset
patch1-8
49.44% lzo1x_1_do_compress (real work)
6.19% page_vma_mapped_walk (overhead)
5.97% _raw_spin_unlock_irq
3.13% get_pfn_folio
2.85% ptep_clear_flush
2.42% __zram_bvec_write
2.08% do_raw_spin_lock
1.92% memmove
1.44% alloc_zspage
1.36% memset
Configurations:
no change
Thanks to the following developers for their efforts [3].
kernel test robot <lkp@intel.com>
[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-18 08:00:05 +00:00
|
|
|
#ifdef CONFIG_LRU_GEN
|
|
|
|
struct {
|
|
|
|
/* this mm_struct is on lru_gen_mm_list */
|
|
|
|
struct list_head list;
|
|
|
|
/*
|
|
|
|
* Set when switching to this mm_struct, as a hint of
|
|
|
|
* whether it has been used since the last time per-node
|
|
|
|
* page table walkers cleared the corresponding bits.
|
|
|
|
*/
|
|
|
|
unsigned long bitmap;
|
|
|
|
#ifdef CONFIG_MEMCG
|
|
|
|
/* points to the memcg of "owner" above */
|
|
|
|
struct mem_cgroup *memcg;
|
|
|
|
#endif
|
|
|
|
} lru_gen;
|
|
|
|
#endif /* CONFIG_LRU_GEN */
|
2018-07-16 19:03:31 +00:00
|
|
|
} __randomize_layout;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The mm_cpumask needs to be at the end of mm_struct, because it
|
|
|
|
* is dynamically sized based on nr_cpu_ids.
|
|
|
|
*/
|
|
|
|
unsigned long cpu_bitmap[];
|
|
|
|
};
|
2007-10-16 08:24:43 +00:00
|
|
|
|
2022-09-06 19:48:45 +00:00
|
|
|
#define MM_MT_FLAGS (MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN)
|
2017-02-02 11:27:56 +00:00
|
|
|
extern struct mm_struct init_mm;
|
|
|
|
|
2018-07-16 19:03:31 +00:00
|
|
|
/* Pointer magic because the dynamic array size confuses some compilers. */
|
2011-05-29 18:32:28 +00:00
|
|
|
static inline void mm_init_cpumask(struct mm_struct *mm)
|
|
|
|
{
|
2018-07-16 19:03:31 +00:00
|
|
|
unsigned long cpu_bitmap = (unsigned long)mm;
|
|
|
|
|
|
|
|
cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
|
|
|
|
cpumask_clear((struct cpumask *)cpu_bitmap);
|
2011-05-29 18:32:28 +00:00
|
|
|
}
|
|
|
|
|
2009-03-12 20:35:44 +00:00
|
|
|
/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
|
2011-05-25 00:12:15 +00:00
|
|
|
static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
|
|
|
|
{
|
2018-07-16 19:03:31 +00:00
|
|
|
return (struct cpumask *)&mm->cpu_bitmap;
|
2011-05-25 00:12:15 +00:00
|
|
|
}
|
2009-03-12 20:35:44 +00:00
|
|
|
|
mm: multi-gen LRU: support page table walks
2022-09-18 08:00:05 +00:00
#ifdef CONFIG_LRU_GEN

struct lru_gen_mm_list {
	/* mm_struct list for page table walkers */
	struct list_head fifo;
	/* protects the list above */
	spinlock_t lock;
};

void lru_gen_add_mm(struct mm_struct *mm);
void lru_gen_del_mm(struct mm_struct *mm);
#ifdef CONFIG_MEMCG
void lru_gen_migrate_mm(struct mm_struct *mm);
#endif

static inline void lru_gen_init_mm(struct mm_struct *mm)
{
	INIT_LIST_HEAD(&mm->lru_gen.list);
	mm->lru_gen.bitmap = 0;
#ifdef CONFIG_MEMCG
	mm->lru_gen.memcg = NULL;
#endif
}

static inline void lru_gen_use_mm(struct mm_struct *mm)
{
	/*
	 * When the bitmap is set, page reclaim knows this mm_struct has been
	 * used since the last time it cleared the bitmap. So it might be worth
	 * walking the page tables of this mm_struct to clear the accessed bit.
	 */
	WRITE_ONCE(mm->lru_gen.bitmap, -1);
}

#else /* !CONFIG_LRU_GEN */

static inline void lru_gen_add_mm(struct mm_struct *mm)
{
}

static inline void lru_gen_del_mm(struct mm_struct *mm)
{
}

#ifdef CONFIG_MEMCG
static inline void lru_gen_migrate_mm(struct mm_struct *mm)
{
}
#endif

static inline void lru_gen_init_mm(struct mm_struct *mm)
{
}

static inline void lru_gen_use_mm(struct mm_struct *mm)
{
}

#endif /* CONFIG_LRU_GEN */
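The `lru_gen_use_mm()` comment above describes a classic hint pattern: users cheaply set a flag, and a periodic scanner clears it and only does expensive work for entries whose flag was set since its last visit. A minimal user-space sketch of that pattern (all names here are illustrative stand-ins, not kernel APIs; the kernel additionally uses WRITE_ONCE/READ_ONCE for lockless access):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for mm->lru_gen.bitmap: one hint word per task. */
struct task_hint {
	uint64_t bitmap;	/* set by users, cleared by the scanner */
	int walked;		/* how often the scanner did the expensive walk */
};

/* Analogue of lru_gen_use_mm(): mark the task as recently used. */
static void hint_use(struct task_hint *t)
{
	t->bitmap = (uint64_t)-1;
}

/* Analogue of the reclaim side: do the expensive walk only when hinted. */
static void scanner_visit(struct task_hint *t)
{
	if (t->bitmap) {
		t->walked++;	/* page-table walk would go here */
		t->bitmap = 0;
	}
}
```

The point of the pattern is that an unused mm costs the scanner only one load per pass.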
struct vma_iterator {
	struct ma_state mas;
};

#define VMA_ITERATOR(name, __mm, __addr)				\
	struct vma_iterator name = {					\
		.mas = {						\
			.tree = &(__mm)->mm_mt,				\
			.index = __addr,				\
			.node = MAS_START,				\
		},							\
	}

static inline void vma_iter_init(struct vma_iterator *vmi,
		struct mm_struct *mm, unsigned long addr)
{
	vmi->mas.tree = &mm->mm_mt;
	vmi->mas.index = addr;
	vmi->mas.node = MAS_START;
}
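`struct vma_iterator` is just an embedded maple-tree cursor (`ma_state`) plus an init helper that re-seeds it at a starting address. The shape of that pattern — an iterator struct wrapping cursor state, with an init function and a stepping function — can be sketched with an ordinary array-backed cursor (all names below are illustrative, not the maple-tree API):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for ma_state: a cursor into some indexed store. */
struct cursor {
	const int *array;
	size_t len;
	size_t index;
};

/* Mirrors the shape of struct vma_iterator: it just embeds the cursor. */
struct iter {
	struct cursor cur;
};

/* Analogue of vma_iter_init(): (re)seed the embedded cursor. */
static void iter_init(struct iter *it, const int *array, size_t len,
		      size_t start)
{
	it->cur.array = array;
	it->cur.len = len;
	it->cur.index = start;
}

/* Analogue of stepping the iterator; returns NULL when exhausted. */
static const int *iter_next(struct iter *it)
{
	if (it->cur.index >= it->cur.len)
		return NULL;
	return &it->cur.array[it->cur.index++];
}
```

Re-running `iter_init()` restarts iteration without reallocating anything, which is exactly what `vma_iter_init()` buys callers.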
struct mmu_gather;
extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_finish_mmu(struct mmu_gather *tlb);

struct vm_fault;
/**
 * typedef vm_fault_t - Return type for page fault handlers.
 *
 * Page fault handlers return a bitmask of %VM_FAULT values.
 */
typedef __bitwise unsigned int vm_fault_t;

/**
 * enum vm_fault_reason - Page fault handlers return a bitmask of
 * these values to tell the core VM what happened when handling the
 * fault. Used to decide whether a process gets delivered SIGBUS or
 * just gets major/minor fault counters bumped up.
 *
 * @VM_FAULT_OOM:		Out Of Memory
 * @VM_FAULT_SIGBUS:		Bad access
 * @VM_FAULT_MAJOR:		Page read from storage
 * @VM_FAULT_WRITE:		Special case for get_user_pages
 * @VM_FAULT_HWPOISON:		Hit poisoned small page
 * @VM_FAULT_HWPOISON_LARGE:	Hit poisoned large page. Index encoded
 *				in upper bits
 * @VM_FAULT_SIGSEGV:		segmentation fault
 * @VM_FAULT_NOPAGE:		->fault installed the pte, not return page
 * @VM_FAULT_LOCKED:		->fault locked the returned page
 * @VM_FAULT_RETRY:		->fault blocked, must retry
 * @VM_FAULT_FALLBACK:		huge page fault failed, fall back to small
 * @VM_FAULT_DONE_COW:		->fault has fully handled COW
 * @VM_FAULT_NEEDDSYNC:		->fault did not modify page tables and needs
 *				fsync() to complete (for synchronous page faults
 *				in DAX)
mm: avoid unnecessary page fault retires on shared memory types
I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page. It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
Then after that throttling we return VM_FAULT_RETRY.
We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.
However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.
It's not only slowing down page faults for shared file-backed mappings, but it
also adds more mmap lock contention, which is in most cases not needed at all.
To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.
To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock. It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.
This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying a 400MB mmap()ed shmem file; these are the times
it needs:
Before: 650.980 ms (+-1.94%)
After: 569.396 ms (+-1.38%)
I believe it could help more than that.
We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.
Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.
Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm part]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Will Deacon <will@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Chris Zankel <chris@zankel.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Helge Deller <deller@gmx.de>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-30 18:34:50 +00:00
 * @VM_FAULT_COMPLETED:		->fault completed, meanwhile mmap lock released
 * @VM_FAULT_HINDEX_MASK:	mask HINDEX value
 *
 */
enum vm_fault_reason {
	VM_FAULT_OOM            = (__force vm_fault_t)0x000001,
	VM_FAULT_SIGBUS         = (__force vm_fault_t)0x000002,
	VM_FAULT_MAJOR          = (__force vm_fault_t)0x000004,
	VM_FAULT_WRITE          = (__force vm_fault_t)0x000008,
	VM_FAULT_HWPOISON       = (__force vm_fault_t)0x000010,
	VM_FAULT_HWPOISON_LARGE = (__force vm_fault_t)0x000020,
	VM_FAULT_SIGSEGV        = (__force vm_fault_t)0x000040,
	VM_FAULT_NOPAGE         = (__force vm_fault_t)0x000100,
	VM_FAULT_LOCKED         = (__force vm_fault_t)0x000200,
	VM_FAULT_RETRY          = (__force vm_fault_t)0x000400,
	VM_FAULT_FALLBACK       = (__force vm_fault_t)0x000800,
	VM_FAULT_DONE_COW       = (__force vm_fault_t)0x001000,
	VM_FAULT_NEEDDSYNC      = (__force vm_fault_t)0x002000,
	VM_FAULT_COMPLETED      = (__force vm_fault_t)0x004000,
	VM_FAULT_HINDEX_MASK    = (__force vm_fault_t)0x0f0000,
};

/* Encode hstate index for a hwpoisoned large page */
#define VM_FAULT_SET_HINDEX(x) ((__force vm_fault_t)((x) << 16))
#define VM_FAULT_GET_HINDEX(x) (((__force unsigned int)(x) >> 16) & 0xf)

#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS |	\
			VM_FAULT_SIGSEGV | VM_FAULT_HWPOISON |	\
			VM_FAULT_HWPOISON_LARGE | VM_FAULT_FALLBACK)

#define VM_FAULT_RESULT_TRACE \
	{ VM_FAULT_OOM,			"OOM" },	\
	{ VM_FAULT_SIGBUS,		"SIGBUS" },	\
	{ VM_FAULT_MAJOR,		"MAJOR" },	\
	{ VM_FAULT_WRITE,		"WRITE" },	\
	{ VM_FAULT_HWPOISON,		"HWPOISON" },	\
	{ VM_FAULT_HWPOISON_LARGE,	"HWPOISON_LARGE" },	\
	{ VM_FAULT_SIGSEGV,		"SIGSEGV" },	\
	{ VM_FAULT_NOPAGE,		"NOPAGE" },	\
	{ VM_FAULT_LOCKED,		"LOCKED" },	\
	{ VM_FAULT_RETRY,		"RETRY" },	\
	{ VM_FAULT_FALLBACK,		"FALLBACK" },	\
	{ VM_FAULT_DONE_COW,		"DONE_COW" },	\
	{ VM_FAULT_NEEDDSYNC,		"NEEDDSYNC" }
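The two HINDEX macros above simply stash a small hstate index in bits 16-19 of the fault code, alongside the flag bits. Their bit logic can be checked in isolation with plain unsigned arithmetic (this sketch drops the kernel's `__force`/`__bitwise` annotations, which exist only for sparse type-checking, and uses shortened macro names):

```c
#include <assert.h>

/* User-space mirror of the VM_FAULT_SET/GET_HINDEX bit manipulation. */
#define SET_HINDEX(x)	((unsigned int)((x) << 16))
#define GET_HINDEX(x)	((((unsigned int)(x)) >> 16) & 0xf)

/* Same bit values as the kernel enum above. */
#define HWPOISON_LARGE	0x000020u
#define HINDEX_MASK	0x0f0000u
```

The mask confines the index to 4 bits, so up to 16 hstates can be reported this way without colliding with any `VM_FAULT_*` flag.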
struct vm_special_mapping {
	const char *name;	/* The name, e.g. "[vdso]". */

	/*
	 * If .fault is not provided, this points to a
	 * NULL-terminated array of pages that back the special mapping.
	 *
	 * This must not be NULL unless .fault is provided.
	 */
	struct page **pages;

	/*
	 * If non-NULL, then this is called to resolve page faults
	 * on the special mapping.  If used, .pages is not checked.
	 */
	vm_fault_t (*fault)(const struct vm_special_mapping *sm,
				struct vm_area_struct *vma,
				struct vm_fault *vmf);

	int (*mremap)(const struct vm_special_mapping *sm,
		     struct vm_area_struct *new_vma);
};
enum tlb_flush_reason {
	TLB_FLUSH_ON_TASK_SWITCH,
	TLB_REMOTE_SHOOTDOWN,
	TLB_LOCAL_SHOOTDOWN,
	TLB_LOCAL_MM_SHOOTDOWN,
	TLB_REMOTE_SEND_IPI,
	NR_TLB_FLUSH_REASONS,
};
/*
 * A swap entry has to fit into a "unsigned long", as the entry is hidden
 * in the "index" field of the swapper address space.
 */
typedef struct {
	unsigned long val;
} swp_entry_t;
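A swap entry works by packing a swap-device type and a page offset into that single `unsigned long`, so it can hide in the page-cache index field as the comment says. A user-space sketch of such packing (the field split below is purely illustrative; the kernel's real encode/decode helpers live in `linux/swapops.h` and use different, architecture-dependent widths):

```c
#include <assert.h>

/* Mirrors the shape of swp_entry_t: one opaque word. */
typedef struct {
	unsigned long val;
} entry_t;

/* Illustrative split: TYPE_BITS high bits of type, the rest is offset. */
#define WORD_BITS	(sizeof(unsigned long) * 8)
#define TYPE_BITS	5

static entry_t make_entry(unsigned long type, unsigned long offset)
{
	entry_t e = { (type << (WORD_BITS - TYPE_BITS)) | offset };
	return e;
}

static unsigned long entry_type(entry_t e)
{
	return e.val >> (WORD_BITS - TYPE_BITS);
}

static unsigned long entry_offset(entry_t e)
{
	return e.val & ((1UL << (WORD_BITS - TYPE_BITS)) - 1);
}
```

Because the whole entry is one word, it can be stored anywhere a page-cache index fits and compared or copied atomically.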
/**
 * enum fault_flag - Fault flag definitions.
 * @FAULT_FLAG_WRITE:		Fault was a write fault.
 * @FAULT_FLAG_MKWRITE:		Fault was mkwrite of existing PTE.
 * @FAULT_FLAG_ALLOW_RETRY:	Allow to retry the fault if blocked.
 * @FAULT_FLAG_RETRY_NOWAIT:	Don't drop mmap_lock and wait when retrying.
 * @FAULT_FLAG_KILLABLE:	The fault task is in SIGKILL killable region.
 * @FAULT_FLAG_TRIED:		The fault has been tried once.
 * @FAULT_FLAG_USER:		The fault originated in userspace.
 * @FAULT_FLAG_REMOTE:		The fault is not for current task/mm.
 * @FAULT_FLAG_INSTRUCTION:	The fault was during an instruction fetch.
 * @FAULT_FLAG_INTERRUPTIBLE:	The fault can be interrupted by non-fatal signals.
mm: support GUP-triggered unsharing of anonymous pages
Whenever GUP currently ends up taking a R/O pin on an anonymous page that
might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
on the page table entry will end up replacing the mapped anonymous page
due to COW, resulting in the GUP pin no longer being consistent with the
page actually mapped into the page table.
The possible ways to deal with this situation are:
(1) Ignore and pin -- what we do right now.
(2) Fail to pin -- which would be rather surprising to callers and
could break user space.
(3) Trigger unsharing and pin the now exclusive page -- reliable R/O
pins.
We want to implement 3) because it provides the clearest semantics and
allows for checking in unpin_user_pages() and friends for possible BUGs:
when trying to unpin a page that's no longer exclusive, clearly something
went very wrong and might result in memory corruptions that might be hard
to debug. So we better have a nice way to spot such issues.
To implement 3), we need a way for GUP to trigger unsharing:
FAULT_FLAG_UNSHARE. FAULT_FLAG_UNSHARE is only applicable to R/O mapped
anonymous pages and resembles COW logic during a write fault. However, in
contrast to a write fault, GUP-triggered unsharing will, for example,
still maintain the write protection.
Let's implement FAULT_FLAG_UNSHARE by hooking into the existing write
fault handlers for all applicable anonymous page types: ordinary pages,
THP and hugetlb.
* If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that has been
marked exclusive in the meantime by someone else, there is nothing to do.
* If FAULT_FLAG_UNSHARE finds a R/O-mapped anonymous page that's not
marked exclusive, it will try detecting if the process is the exclusive
owner. If exclusive, it can be set exclusive similar to reuse logic
during write faults via page_move_anon_rmap() and there is nothing
else to do; otherwise, we either have to copy and map a fresh,
anonymous exclusive page R/O (ordinary pages, hugetlb), or split the
THP.
This commit is heavily based on patches by Andrea.
Link: https://lkml.kernel.org/r/20220428083441.37290-16-david@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oded Gabbay <oded.gabbay@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-10 01:20:45 +00:00
 * @FAULT_FLAG_UNSHARE:		The fault is an unsharing request to unshare
 *				(and mark exclusive) a possibly shared anonymous
 *				page that is mapped R/O.
 * @FAULT_FLAG_ORIG_PTE_VALID:	whether the fault has vmf->orig_pte cached.
 *				We should only access orig_pte if this flag set.
 *
 * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
 * whether we would allow page faults to retry by specifying these two
 * fault flags correctly.  Currently there can be three legal combinations:
 *
 * (a) ALLOW_RETRY and !TRIED:  this means the page fault allows retry, and
 *                              this is the first try
 *
 * (b) ALLOW_RETRY and TRIED:   this means the page fault allows retry, and
 *                              we've already tried at least once
 *
 * (c) !ALLOW_RETRY and !TRIED: this means the page fault does not allow retry
 *
 * The unlisted combination (!ALLOW_RETRY && TRIED) is illegal and should never
 * be used.  Note that page faults can be allowed to retry for multiple times,
 * in which case we'll have an initial fault with flags (a) then later on
 * continuous faults with flags (b).  We should always try to detect pending
 * signals before a retry to make sure the continuous page faults can still be
 * interrupted if necessary.
 *
 * The combination FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE is illegal.
 * FAULT_FLAG_UNSHARE is ignored and treated like an ordinary read fault when
 * no existing R/O-mapped anonymous page is encountered.
 */
enum fault_flag {
	FAULT_FLAG_WRITE =		1 << 0,
	FAULT_FLAG_MKWRITE =		1 << 1,
	FAULT_FLAG_ALLOW_RETRY =	1 << 2,
	FAULT_FLAG_RETRY_NOWAIT =	1 << 3,
	FAULT_FLAG_KILLABLE =		1 << 4,
	FAULT_FLAG_TRIED =		1 << 5,
	FAULT_FLAG_USER =		1 << 6,
	FAULT_FLAG_REMOTE =		1 << 7,
	FAULT_FLAG_INSTRUCTION =	1 << 8,
	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
	FAULT_FLAG_UNSHARE =		1 << 10,
	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
};
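The ALLOW_RETRY/TRIED rules documented above — combinations (a), (b), (c) legal, `!ALLOW_RETRY && TRIED` illegal — reduce to a pure bitmask predicate. That predicate can be sketched in user space with the same bit positions (shortened names; this is an illustrative check, not a kernel API):

```c
#include <assert.h>
#include <stdbool.h>

#define F_ALLOW_RETRY	(1u << 2)	/* same bit as FAULT_FLAG_ALLOW_RETRY */
#define F_TRIED		(1u << 5)	/* same bit as FAULT_FLAG_TRIED */

/* A combination is legal unless TRIED is set without ALLOW_RETRY. */
static bool retry_flags_legal(unsigned int flags)
{
	return (flags & F_ALLOW_RETRY) || !(flags & F_TRIED);
}
```

Encoding the rule as one predicate makes the three legal cases and the one illegal case easy to assert at the top of a fault path.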
typedef unsigned int __bitwise zap_flags_t;

#endif /* _LINUX_MM_TYPES_H */