linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-09-21 10:01:00 +00:00

Author	SHA1	Message	Date
Greg Kroah-Hartman	61d3d5108e	mm: remove PageMovable export The only in-kernel users that need PageMovable() to be exported are z3fold and zsmalloc and they are only using it for dubious debugging functionality. So remove those usages and the export so that no driver code accidentally thinks that they are allowed to use this symbol. Link: https://lkml.kernel.org/r/20230106135900.3763622-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:57 -08:00
Sidhartha Kumar	02d65d6fb1	mm: introduce folio_is_pfmemalloc Add a folio equivalent for page_is_pfmemalloc. This removes two instances of page_is_pfmemalloc(folio_page(folio, 0)) so the folio can be used directly. Link: https://lkml.kernel.org/r/20230106215251.599222-1-sidhartha.kumar@oracle.com Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:57 -08:00
Yu Zhao	17e810229c	mm: support POSIX_FADV_NOREUSE This patch adds POSIX_FADV_NOREUSE to vma_has_recency() so that the LRU algorithm can ignore access to mapped files marked by this flag. The advantages of POSIX_FADV_NOREUSE are: 1. Unlike MADV_SEQUENTIAL and MADV_RANDOM, it does not alter the default readahead behavior. 2. Unlike MADV_SEQUENTIAL and MADV_RANDOM, it does not split VMAs and therefore does not take mmap_lock. 3. Unlike MADV_COLD, setting it has a negligible cost, regardless of how many pages it affects. Its limitations are: 1. Like POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL, it currently does not support range. IOW, its scope is the entire file. 2. It currently does not ignore access through file descriptors. Specifically, for the active/inactive LRU, given a file page shared by two users and one of them having set POSIX_FADV_NOREUSE on the file, this page will be activated upon the second user accessing it. This corner case can be covered by checking POSIX_FADV_NOREUSE before calling folio_mark_accessed() on the read path. But it is considered not worth the effort. There have been a few attempts to support POSIX_FADV_NOREUSE, e.g., [1]. This time the goal is to fill a niche: a few desktop applications, e.g., large file transferring and video encoding/decoding, want fast file streaming with mmap() rather than direct IO. Among those applications, an SVT-AV1 regression was reported when running with MGLRU [2]. The following test can reproduce that regression. 
kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo) kb=$((kb - 8*1024*1024)) modprobe brd rd_nr=1 rd_size=$kb dd if=/dev/zero of=/dev/ram0 bs=1M mkfs.ext4 /dev/ram0 mount /dev/ram0 /mnt/ swapoff -a fallocate -l 8G /mnt/swapfile mkswap /mnt/swapfile swapon /mnt/swapfile wget http://ultravideo.cs.tut.fi/video/Bosphorus_3840x2160_120fps_420_8bit_YUV_Y4M.7z 7z e -o/mnt/ Bosphorus_3840x2160_120fps_420_8bit_YUV_Y4M.7z SvtAv1EncApp --preset 12 -w 3840 -h 2160 \ -i /mnt/Bosphorus_3840x2160.y4m For MGLRU, the following change showed a [9-11]% increase in FPS, which makes it on par with the active/inactive LRU. patch Source/App/EncApp/EbAppMain.c <<EOF 31a32 > #include <fcntl.h> 35d35 < #include <fcntl.h> /* _O_BINARY */ 117a118 > posix_fadvise(config->mmap.fd, 0, 0, POSIX_FADV_NOREUSE); EOF [1] https://lore.kernel.org/r/1308923350-7932-1-git-send-email-andrea@betterlinux.com/ [2] https://openbenchmarking.org/result/2209259-PTS-MGLRU8GB57 Link: https://lkml.kernel.org/r/20221230215252.2628425-2-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Righi <andrea.righi@canonical.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michael Larabel <Michael@MichaelLarabel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:57 -08:00
Yu Zhao	8788f67814	mm: add vma_has_recency() Add vma_has_recency() to indicate whether a VMA may exhibit temporal locality that the LRU algorithm relies on. This function returns false for VMAs marked by VM_SEQ_READ or VM_RAND_READ. While the former flag indicates linear access, i.e., a special case of spatial locality, both flags indicate a lack of temporal locality, i.e., the reuse of an area within a relatively small duration. "Recency" is chosen over "locality" to avoid confusion between temporal and spatial localities. Before this patch, the active/inactive LRU only ignored the accessed bit from VMAs marked by VM_SEQ_READ. After this patch, the active/inactive LRU and MGLRU share the same logic: they both ignore the accessed bit if vma_has_recency() returns false. For the active/inactive LRU, the following fio test showed a [6, 8]% increase in IOPS when randomly accessing mapped files under memory pressure. kb=$(awk '/MemTotal/ { print $2 }' /proc/meminfo) kb=$((kb - 8*1024*1024)) modprobe brd rd_nr=1 rd_size=$kb dd if=/dev/zero of=/dev/ram0 bs=1M mkfs.ext4 /dev/ram0 mount /dev/ram0 /mnt/ swapoff -a fio --name=test --directory=/mnt/ --ioengine=mmap --numjobs=8 \ --size=8G --rw=randrw --time_based --runtime=10m \ --group_reporting The discussion that led to this patch is here [1]. Additional test results are available in that thread. [1] https://lore.kernel.org/r/Y31s%2FK8T85jh05wH@google.com/ Link: https://lkml.kernel.org/r/20221230215252.2628425-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Righi <andrea.righi@canonical.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michael Larabel <Michael@MichaelLarabel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:57 -08:00
David Hildenbrand	997931ce02	drivers/misc/open-dice: don't touch VM_MAYSHARE A MAP_SHARED mapping always has VM_MAYSHARE set, and writable (VM_MAYWRITE) MAP_SHARED mappings have VM_SHARED set as well. To identify a MAP_SHARED mapping, it's sufficient to look at VM_MAYSHARE. We cannot have VM_MAYSHARE|VM_WRITE mappings without having VM_SHARED set. Consequently, current code will never actually end up clearing VM_MAYSHARE and that code is confusing, because nobody is supposed to mess with VM_MAYSHARE. Let's clean it up and restructure the code. No functional change intended. Link: https://lkml.kernel.org/r/20230102160856.500584-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicolas Pitre <nico@fluxnic.net> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:57 -08:00
David Hildenbrand	b6b7a8faf0	mm/nommu: don't use VM_MAYSHARE for MAP_PRIVATE mappings Let's stop using VM_MAYSHARE for MAP_PRIVATE mappings and use VM_MAYOVERLAY instead. Rewrite determine_vm_flags() to make the whole logic easier to digest, and to cleanly separate MAP_PRIVATE vs. MAP_SHARED. No functional change intended. Link: https://lkml.kernel.org/r/20230102160856.500584-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicolas Pitre <nico@fluxnic.net> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:57 -08:00
David Hildenbrand	fc4f4be9b5	mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping() Patch series "mm/nommu: don't use VM_MAYSHARE for MAP_PRIVATE mappings". Trying to reduce the confusion around VM_SHARED and VM_MAYSHARE first requires !CONFIG_MMU to stop using VM_MAYSHARE for MAP_PRIVATE mappings. CONFIG_MMU only sets VM_MAYSHARE for MAP_SHARED mappings. This paves the way for further VM_MAYSHARE and VM_SHARED cleanups: for example, renaming VM_MAYSHARE to VM_MAP_SHARED to make it clearer what it actually means. Let's first get the weird case out of the way and not use VM_MAYSHARE in MAP_PRIVATE mappings, using a new VM_MAYOVERLAY flag instead. This patch (of 3): We want to stop using VM_MAYSHARE in private mappings to pave the way for clarifying the semantics of VM_MAYSHARE vs. VM_SHARED and reduce the confusion. While CONFIG_MMU uses VM_MAYSHARE to represent MAP_SHARED, !CONFIG_MMU also sets VM_MAYSHARE for selected R/O private file mappings that are an effective overlay of a file mapping. Let's factor out all relevant VM_MAYSHARE checks in !CONFIG_MMU code into is_nommu_shared_mapping() first. Note that whenever VM_SHARED is set, VM_MAYSHARE must be set as well (unless there is a serious BUG). So there is no need to test for VM_SHARED manually. No functional change intended. Link: https://lkml.kernel.org/r/20230102160856.500584-1-david@redhat.com Link: https://lkml.kernel.org/r/20230102160856.500584-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicolas Pitre <nico@fluxnic.net> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:56 -08:00
Lorenzo Stoakes	da0618c146	selftest/vm: add mremap expand merge offset test Add a test to assert that we can mremap() and expand a mapping starting from an offset within an existing mapping. We unmap the last page in a 3 page mapping to ensure that the remap should always succeed, before remapping from the 2nd page. This is additionally a regression test for the issue solved in "mm, mremap: fix mremap() expanding vma with addr inside vma" and confirmed to fail prior to the change and pass after it. Finally, this patch updates the existing mremap expand merge test to check error conditions and reduce code duplication between the two tests. [lstoakes@gmail.com: increment num_expand_tests so test doesn't complain about unexpected tests being run] Link: https://lkml.kernel.org/r/8ff3ba3cadc0b6c1b2688ae5c851bf73aa062d57.1673701836.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/02b117a8ffd52acc01dc66c2fb39754f08d92c0e.1672675824.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jakub Matěna <matenajakub@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:56 -08:00
Sergey Senozhatsky	df32de1433	zram: correctly handle all next_arg() cases When supplied buffer does not have assignment sign next_arg() sets `val` pointer to NULL, so we cannot dereference it. Add a NULL pointer test to handle `param` case, in addition to `*val` test, which handles cases when param has no value assigned to it: `param=`. Link: https://lkml.kernel.org/r/20230103030119.1496358-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:56 -08:00
Jan Kara	4b89a37d54	fs: don't allocate blocks beyond EOF from __mpage_writepage When __mpage_writepage() is called for a page beyond EOF, it will go and allocate all blocks underlying the page. This is not only unnecessary but this way blocks can get leaked (e.g. if a page beyond EOF is marked dirty but in the end write fails and i_size is not extended). Link: https://lkml.kernel.org/r/20230103104430.27749-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:56 -08:00
SeongJae Park	6c364edc19	Docs/admin-guide/mm/numaperf: increase depth of subsections Each section of numaperf.rst has zero depth, and is therefore exposed in the index of admin-guide/mm. Especially the 'See Also' section in the index makes the document look weird. Hide the sections from the index by giving the document a title and increasing the depth of each section. [sj@kernel.org: change title to fix duplicate label warning] Link: https://lkml.kernel.org/r/20230106194927.152663-1-sj@kernel.org Link: https://lkml.kernel.org/r/20230103180754.129637-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:56 -08:00
SeongJae Park	baa489fabd	selftests/vm: rename selftests/vm to selftests/mm Rename selftests/vm to selftests/mm to be more consistent with the code, documentation, and tools directories, and to avoid confusion with virtual machines. [sj@kernel.org: convert missing vm->mm changes] Link: https://lkml.kernel.org/r/20230107230643.252273-1-sj@kernel.org Link: https://lkml.kernel.org/r/20230103180754.129637-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:56 -08:00
SeongJae Park	799fb82aa1	tools/vm: rename tools/vm to tools/mm Rename tools/vm to tools/mm for being more consistent with the code and documentation directories, and won't be confused with virtual machines. Link: https://lkml.kernel.org/r/20230103180754.129637-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:55 -08:00
SeongJae Park	060deca404	MAINTAINERS/MEMORY MANAGEMENT: add tools/vm/ as managed files 'tools/vm/' directory should be a part of memory management subsystem, but MAINTAINERS file doesn't mark the directory so. Add one more 'F:' entry for the directory to 'MEMORY MANAGEMENT' section. Link: https://lkml.kernel.org/r/20230103180754.129637-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:55 -08:00
SeongJae Park	1839862099	MAINTAINERS: add types to akpm/mm git trees entries Patch series "mm: trivial fixups". This patchset is for trivial fixups of mm stuff on MAINTAINERS, tools/ selftests, and docs. This patch (of 5): Each SCM tree entry of MAINTAINERS file should have both type and location, but akpm/mm git tree entries of 'MEMORY MANAGEMENT' and 'VMALLOC' sections of the file don't have the type. Add the type. Link: https://lkml.kernel.org/r/20230103180754.129637-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:55 -08:00
Mike Kravetz	e9adcfecf5	mm: remove zap_page_range and create zap_vma_pages zap_page_range was originally designed to unmap pages within an address range that could span multiple vmas. While working on [1], it was discovered that all callers of zap_page_range pass a range entirely within a single vma. In addition, the mmu notification call within zap_page range does not correctly handle ranges that span multiple vmas. When crossing a vma boundary, a new mmu_notifier_range_init/end call pair with the new vma should be made. Instead of fixing zap_page_range, do the following: - Create a new routine zap_vma_pages() that will remove all pages within the passed vma. Most users of zap_page_range pass the entire vma and can use this new routine. - For callers of zap_page_range not passing the entire vma, instead call zap_page_range_single(). - Remove zap_page_range. [1] https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.kravetz@oracle.com/ Link: https://lkml.kernel.org/r/20230104002732.232573-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Peter Xu <peterx@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390] Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:55 -08:00
Feng Tang	bbc61844b4	mm/kasan: simplify and refine kasan_cache code struct 'kasan_cache' has a member 'is_kmalloc' indicating whether its host kmem_cache is a kmalloc cache. With the newly introduced is_kmalloc_cache() helper, 'is_kmalloc' and its related function can be replaced and removed. Also 'kasan_cache' is only needed by KASAN generic mode, and not by SW/HW tag modes, so refine its protection macro accordingly, as suggested by Andrey Konovalov. Link: https://lkml.kernel.org/r/20230104060605.930910-2-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:55 -08:00
Feng Tang	bb94429096	mm/slab: add is_kmalloc_cache() helper function commit `6edf2576a6` ("mm/slub: enable debugging memory wasting of kmalloc") introduces 'SLAB_KMALLOC' bit specifying whether a kmem_cache is a kmalloc cache for slab/slub (slob doesn't have dedicated kmalloc caches). Add a helper inline function for other components like kasan to simplify code. Link: https://lkml.kernel.org/r/20230104060605.930910-1-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:54 -08:00
David Hildenbrand	dee2ad1205	selftests/vm: cow: add COW tests for collapsing of PTE-mapped anon THP Currently, anonymous PTE-mapped THPs cannot be collapsed in-place: collapsing (e.g., via MADV_COLLAPSE) implies allocating a fresh THP and mapping that new THP via a PMD: as it's a fresh anon THP, it will get the exclusive flag set on the head page and everybody is happy. However, if the kernel would ever support in-place collapse of anonymous THPs (replacing a page table mapping each sub-page of a THP via PTEs with a single PMD mapping the complete THP), exclusivity information stored for each sub-page would have to be collapsed accordingly: (1) All PTEs map !exclusive anon sub-pages: the in-place collapsed THP must not have the exclusive flag set on the head page mapped by the PMD. This is the easiest case to handle ("simply don't set any exclusive flags"). (2) All PTEs map exclusive anon sub-pages: when collapsing, we have to clear the exclusive flag from all tail pages and only leave the exclusive flag set for the head page. Otherwise, fork() after collapse would not clear the exclusive flags from the tail pages and we'd be in trouble once PTE-mapping the shared THP when writing to shared tail pages that still have the exclusive flag set. This would effectively revert what the PTE-mapping code does when propagating the exclusive flag to all sub-pages. (3) PTEs map a mixture of exclusive and !exclusive anon sub-pages (can happen e.g., due to MADV_DONTFORK before fork()). We must not collapse the THP in-place, otherwise bad things may happen: the exclusive flags of sub-pages would get ignored and the exclusive flag of the head page would get used instead. Now that we have MADV_COLLAPSE in place to trigger collapsing a THP, let's add some test cases that would bail out early, if we'd voluntarily/accidentally unlock in-place collapse for anon THPs and forget about taking proper care of exclusive flags. 
Running the test on a kernel with MADV_COLLAPSE support: # [INFO] Anonymous THP tests # [RUN] Basic COW after fork() when collapsing before fork() ok 169 No leak from parent into child # [RUN] Basic COW after fork() when collapsing after fork() (fully shared) ok 170 # SKIP MADV_COLLAPSE failed: Invalid argument # [RUN] Basic COW after fork() when collapsing after fork() (lower shared) ok 171 No leak from parent into child # [RUN] Basic COW after fork() when collapsing after fork() (upper shared) ok 172 No leak from parent into child For now, MADV_COLLAPSE always seems to fail if all PTEs map shared sub-pages. Link: https://lkml.kernel.org/r/20230104144905.460075-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:54 -08:00
Fabio M. De Francesco	1f8549fce5	mm: fix spelling mistake in highmem.h Substitute "higmem" with "highmem" in highmem.h. Link: https://lkml.kernel.org/r/20230105121305.30714-1-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:54 -08:00
Fabio M. De Francesco	9eefefd835	mm: remove an ambiguous sentence from kmap_local_folio() kdocs In the kdocs of kmap_local_folio() there is an ambiguous sentence which suggests using this API "only when really necessary". On the contrary, since kmap() and kmap_atomic() are deprecated, both kmap_local_folio() and kmap_local_page() must be preferred to the previous ones. Therefore, remove the above-mentioned sentence exactly how it has previously been done for the kmap_local_page() kdocs in commit `72f1c55adf` ("highmem: delete a sentence from kmap_local_page() kdocs"). Link: https://lkml.kernel.org/r/20230105120424.30055-1-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:54 -08:00
Liam Howlett	541e06b772	maple_tree: remove GFP_ZERO from kmem_cache_alloc() and kmem_cache_alloc_bulk() Preallocations are common in the VMA code to avoid allocating under certain locking conditions. The preallocations must also cover the worst-case scenario. Removing the GFP_ZERO flag from the kmem_cache_alloc() (and bulk variant) calls will reduce the amount of time spent zeroing memory that may not be used. Only zero out the necessary area to keep track of the allocations in the maple state. Zero the entire node prior to using it in the tree. This required internal changes to node counting on allocation, so the test code is also updated. This restores some micro-benchmark performance: up to +9% in mmtests mmap1 by my testing; +10% to +20% in mmap, mmapaddr, mmapmany tests reported by Red Hat. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2149636 Link: https://lkml.kernel.org/r/20230105160427.2988454-1-Liam.Howlett@oracle.com Signed-off-by: Liam Howlett <Liam.Howlett@oracle.com> Reported-by: Jirka Hladky <jhladky@redhat.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:54 -08:00
Mike Rapoport (IBM)	fc5744881e	mm/page_alloc: invert logic for early page initialisation checks Rename early_page_uninitialised() to early_page_initialised() and invert its logic to make the code more readable. Link: https://lkml.kernel.org/r/20230104191805.2535864-1-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:54 -08:00
Johannes Weiner	f78dfc7b77	workingset: fix confusion around eviction vs refault container Refault decisions are made based on the lruvec where the page was evicted, as that determined its LRU order while it was alive. Stats and workingset aging must then occur on the lruvec of the new page, as that's the node and cgroup that experience the refault and that's the lruvec whose nonresident info ages out by a new resident page. Those lruvecs could be different when a page is shared between cgroups, or the refaulting page is allocated on a different node. There are currently two mix-ups: 1. When swap is available, the resident anon set must be considered when comparing the refault distance. The comparison is made against the right anon set, but the check for swap is not. When pages get evicted from a cgroup with swap, and refault in one without, this can incorrectly consider a hot refault as cold - and vice versa. Fix that by using the eviction cgroup for the swap check. 2. The stats and workingset age are updated against the wrong lruvec altogether: the right cgroup but the wrong NUMA node. When a page refaults on a different NUMA node, this will have confusing stats and distort the workingset age on a different lruvec - again possibly resulting in hot/cold misclassifications down the line. Fix the swap check and the refault pgdat to address both concerns. This was found during code review. It hasn't caused notable issues in production, suggesting that those refault-migrations are relatively rare in practice. Link: https://lkml.kernel.org/r/20230104222944.2380117-1-nphamcs@gmail.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Co-developed-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:53 -08:00
Peter Xu	d1751118c8	mm/uffd: detect pgtable allocation failures Before this patch, when a pgtable allocation failed during change_protection(), the syscall ignored the error. For shmem, an error was dumped into the host dmesg. Two issues with that: (1) Doing a trace dump when an allocation fails is far from graceful. (2) The user should be notified of any such error, so the user can trap it and decide what to do next, either by retrying, or stopping the process properly, or anything else. For userfault users, this will change the API of UFFDIO_WRITEPROTECT when a pgtable allocation failure happens. It should not normally break anyone, though. If it breaks, then in good ways. One man-page update will be on the way to introduce the new -ENOMEM for UFFDIO_WRITEPROTECT. Not marking stable so we keep the old behavior on the 5.19-till-now kernels. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20230104225207.1066932-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: James Houghton <jthoughton@google.com> Acked-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:53 -08:00
Peter Xu	a79390f5d6	mm/mprotect: use long for page accountings and retval Switch to use type "long" for page accountings and retval across the whole procedure of change_protection(). The change should have shrunk the possible maximum page number to half of the previous value (ULONG_MAX / 2), but it shouldn't overflow on any system either because the maximum possible pages touched by change protection should be ULONG_MAX / PAGE_SIZE. Two reasons to switch from "unsigned long" to "long": 1. It suits count_vm_numa_events() better, whose 2nd parameter takes a long type. 2. It paves the way for returning negative (error) values in the future. Currently the only caller that consumes this retval is change_prot_numa(), where the unsigned long was converted to an int. While at it, touch up the numa code to also take a long, so it'll avoid any possible overflow too during the int-size conversion. Link: https://lkml.kernel.org/r/20230104225207.1066932-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:53 -08:00
Kefeng Wang	6b7cea90c8	mm/damon/vaddr: convert hugetlb related functions to use a folio Convert damon_hugetlb_mkold() and damon_young_hugetlb_entry() to use a folio. Link: https://lkml.kernel.org/r/20221230070849.63358-9-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:53 -08:00
Kefeng Wang	7824debb3d	mm/damon: remove unneeded damon_get_page() After all damon_get_page() callers are converted to damon_get_folio(), remove unneeded wrapper damon_get_page(). Link: https://lkml.kernel.org/r/20221230070849.63358-8-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:53 -08:00
Kefeng Wang	dc1b78665b	mm/damon/vaddr: convert damon_young_pmd_entry() to use a folio With damon_get_folio(), let's convert damon_young_pmd_entry() to use a folio. Link: https://lkml.kernel.org/r/20221230070849.63358-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:53 -08:00
Kefeng Wang	07bb1fbaa2	mm/damon/paddr: convert damon_pa_() to use a folio With damon_get_folio(), let's convert all the damon_pa_() to use a folio. Link: https://lkml.kernel.org/r/20221230070849.63358-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:52 -08:00
Kefeng Wang	70e314c9ab	mm/damon: convert damon_ptep/pmdp_mkold() to use a folio With damon_get_folio(), let's convert damon_ptep_mkold() and damon_pmdp_mkold() to use a folio. Link: https://lkml.kernel.org/r/20221230070849.63358-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:52 -08:00
Kefeng Wang	5e012bba01	mm/damon: introduce damon_get_folio() Introduce damon_get_folio(), and the temporary wrapper function damon_get_page(), which help us to convert damon related functions to use folios, and it will be dropped once the conversion is completed. Link: https://lkml.kernel.org/r/20221230070849.63358-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:52 -08:00
Kefeng Wang	5acc17fd35	mm: page_idle: convert page idle to use a folio Firstly, make page_idle_get_page() return a folio, also rename it to page_idle_get_folio(), then, use it to convert page_idle_bitmap_read() and page_idle_bitmap_write() functions. Link: https://lkml.kernel.org/r/20221230070849.63358-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:52 -08:00
Matthew Wilcox	becacb04fd	mm: memcg: add folio_memcg_check() Patch series "mm: convert page_idle/damon to use folios", v4. This patch (of 8): Convert page_memcg_check() into folio_memcg_check() and add a page_memcg_check() wrapper. The behaviour of page_memcg_check() is unchanged; tail pages always had a NULL ->memcg_data. Link: https://lkml.kernel.org/r/20221230070849.63358-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20221230070849.63358-2-wangkefeng.wang@huawei.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:52 -08:00
JeongHyeon Lee	071acb3084	zram: fix typos in comments - The double `range` is duplicated in comment, remove one. - change `syfs` to `sysfs` Link: https://lkml.kernel.org/r/20221223040331.4194-1-jhs2.lee@samsung.com Signed-off-by: JeongHyeon Lee <jhs2.lee@samsung.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:52 -08:00
Kefeng Wang	630e7c5ee3	mm: huge_memory: convert split_huge_pages_all() to use a folio Straightforwardly convert split_huge_pages_all() to use a folio. Link: https://lkml.kernel.org/r/20221229122503.149083-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:51 -08:00
Christoph Hellwig	c2ca7a59a4	mm: remove generic_writepages Now that all external callers are gone, just fold it into do_writepages. Link: https://lkml.kernel.org/r/20221229161031.391878-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:51 -08:00
Christoph Hellwig	17c30ee6f2	ocfs2: use filemap_fdatawrite_wbc instead of generic_writepages filemap_fdatawrite_wbc is a fairly thin wrapper around do_writepages, and the big difference there is support for cgroup writeback, which is not supported by ocfs2, and the potential to use ->writepages instead of ->writepage, which ocfs2 does not currently implement but eventually should. Link: https://lkml.kernel.org/r/20221229161031.391878-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:51 -08:00
Christoph Hellwig	cff61bbc71	jbd2,ocfs2: move jbd2_journal_submit_inode_data_buffers to ocfs2 jbd2_journal_submit_inode_data_buffers is only used by ocfs2, so move it there to prepare for removing generic_writepages. Link: https://lkml.kernel.org/r/20221229161031.391878-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:51 -08:00
Christoph Hellwig	25a89826f2	ntfs3: remove ->writepage ->writepage is a very inefficient method to write back data, and only used through write_cache_pages or as a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation. Link: https://lkml.kernel.org/r/20221229161031.391878-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:51 -08:00
Christoph Hellwig	d4428bad14	ntfs3: stop using generic_writepages Open code the resident inode handling in ntfs_writepages by directly using write_cache_pages to prepare removing the ->writepage handler in ntfs3. Link: https://lkml.kernel.org/r/20221229161031.391878-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:51 -08:00
Christoph Hellwig	5b68de6703	fs: remove an outdated comment on mpage_writepages Patch series "remove generic_writepages" This series removes generic_writepages by open coding the current functionality in the three remaining callers. Besides removing some code, the main benefit is that one of the few remaining ->writepage callers from outside the core page cache code goes away. This patch (of 6): mpage_writepages doesn't do any of the page locking itself, so remove an outdated comment on the locking pattern there. Link: https://lkml.kernel.org/r/20221229161031.391878-1-hch@lst.de Link: https://lkml.kernel.org/r/20221229161031.391878-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:50 -08:00
Yin Fengwei	81e506bec9	mm/thp: check and bail out if page in deferred queue already Kernel build regression with LLVM was reported here: https://lore.kernel.org/all/Y1GCYXGtEVZbcv%2F5@dev-arch.thelio-3990X/ with commit `f35b5d7d67` ("mm: align larger anonymous mappings on THP boundaries"). And the commit `f35b5d7d67` was reverted. It turned out the regression is related to ld.lld calling madvise(MADV_DONTNEED) with a len parameter that is not PMD_SIZE aligned. trace-bpfcc captured: 531607 531732 ld.lld do_madvise.part.0 start: 0x7feca9000000, len: 0x7fb000, behavior: 0x4 531607 531793 ld.lld do_madvise.part.0 start: 0x7fec86a00000, len: 0x7fb000, behavior: 0x4 If the underlying physical page is a THP, madvise(MADV_DONTNEED) can significantly raise split_queue_lock contention. perf showed the following data: 14.85% 0.00% ld.lld [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe 11.52% entry_SYSCALL_64_after_hwframe do_syscall_64 __x64_sys_madvise do_madvise.part.0 zap_page_range unmap_single_vma unmap_page_range page_remove_rmap deferred_split_huge_page __lock_text_start native_queued_spin_lock_slowpath If a THP can't be removed from rmap as a whole, it is removed partially by removing its sub-pages from rmap. Even if the THP head page has already been added to the deferred queue, split_queue_lock is still acquired to check whether the THP head page is in the queue. Thus, contention on split_queue_lock rises. Before acquiring split_queue_lock, check and bail out early if the THP head page is in the queue already. The check without holding split_queue_lock could race with deferred_split_scan, but it doesn't impact the correctness here. 
Test result of building kernel with ld.lld: commit `7b5a0b664e` (parent commit of `f35b5d7d67`): time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all 6:07.99 real, 26367.77 user, 5063.35 sys commit `f35b5d7d67`: time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all 7:22.15 real, 26235.03 user, 12504.55 sys commit `f35b5d7d67` with the fixing patch: time -f "\t%E real,\t%U user,\t%S sys" make LD=ld.lld -skj96 allmodconfig all 6:08.49 real, 26520.15 user, 5047.91 sys Link: https://lkml.kernel.org/r/20221223135207.2275317-1-fengwei.yin@intel.com Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:50 -08:00
SeongJae Park	01b5022f0a	mm/page_reporting: replace rcu_access_pointer() with rcu_dereference_protected() Page reporting fetches pr_dev_info using rcu_access_pointer(), which is for safely fetching a pointer that will not be dereferenced but could be concurrently updated. The code indeed does not dereference pr_dev_info after fetching it using rcu_access_pointer(), but it fetches the pointer while concurrent updates to the pointer are avoided by holding the update side lock, page_reporting_mutex. In that case, rcu_dereference_protected() should be used instead because it provides better readability and performance in some cases, as rcu_dereference_protected() avoids use of READ_ONCE(). Replace the rcu_access_pointer() calls with rcu_dereference_protected(). Link: https://lkml.kernel.org/r/20221228175942.149491-1-sj@kernel.org Fixes: `36e66c554b` ("mm: introduce Reported pages") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:50 -08:00
Kele Huang	3783e1721b	mm: fix comment of page table counter Commit `af5b0f6a09` ("mm: consolidate page table accounting") consolidates page table accounting to a single counter in struct mm_struct {} as mm->pgtables_bytes. So the meaning of this counter should be the size of all page tables now. Link: https://lkml.kernel.org/r/20221224060233.417827-1-kele.huang@columbia.edu Signed-off-by: Kele Huang <kele.huang@columbia.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Colin Cross <ccross@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:50 -08:00
David Hildenbrand	1ef488edd6	mm/mprotect: drop pgprot_t parameter from change_protection() Being able to provide a custom protection opens the door for inconsistencies and BUGs: for example, accidentally allowing for more permissions than desired by other mechanisms (e.g., softdirty tracking). vma->vm_page_prot should be the single source of truth. Only PROT_NUMA is special: there is no way we can erroneously allow for more permissions when removing all permissions. Special-case using the MM_CP_PROT_NUMA flag. [david@redhat.com: PAGE_NONE might not be defined without CONFIG_NUMA_BALANCING] Link: https://lkml.kernel.org/r/5084ff1c-ebb3-f918-6a60-bacabf550a88@redhat.com Link: https://lkml.kernel.org/r/20221223155616.297723-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:50 -08:00
David Hildenbrand	931298e103	mm/userfaultfd: rely on vma->vm_page_prot in uffd_wp_range() Patch series "mm: uffd-wp + change_protection() cleanups". Cleanup page protection handling in uffd-wp when calling change_protection() and improve unprotecting uffd-wp in private mappings, trying to set PTEs writable again if possible just like we do during mprotect() when upgrading write permissions. Make the change_protection() interface harder to get wrong :) I consider both patches primarily cleanups, although patch #1 fixes a corner case with uffd-wp and softdirty tracking for shmem. @Peter, please let me know if we should flag patch #1 as pure cleanup -- I have no idea how important softdirty tracking on shmem is. This patch (of 2): uffd_wp_range() currently calculates page protection manually using vm_get_page_prot(). This will ignore any other reason for active writenotify: one mechanism applicable to shmem is softdirty tracking. For example, the following sequence 1) Write to mapped shmem page 2) Clear softdirty 3) Register uffd-wp covering the mapped page 4) Unregister uffd-wp covering the mapped page 5) Write to page again will not set the modified page softdirty, because uffd_wp_range() will ignore that writenotify is required for softdirty tracking and simply map the page writable again using change_protection(). Similarly, instead of unregistering, protecting followed by un-protecting the page using uffd-wp would result in the same situation. Now that we enable writenotify whenever enabling uffd-wp on a VMA, vma->vm_page_prot will already properly reflect our requirements: the default is to write-protect all PTEs. However, for shared mappings we would now not remap the PTEs writable if possible when unprotecting, just like for private mappings (COW). To compensate, set MM_CP_TRY_CHANGE_WRITABLE just like mprotect() does to try mapping individual PTEs writable. 
For private mappings, this change implies that we will now always try setting PTEs writable when un-protecting, just like when upgrading write permissions using mprotect(), which is an improvement. For shared mappings, we will only set PTEs writable if can_change_pte_writable()/can_change_pmd_writable() indicates that it's ok. For ordinary shmem, this will be the case when PTEs are dirty, which should usually be the case -- otherwise we could special-case shmem in can_change_pte_writable()/can_change_pmd_writable() easily, because shmem itself doesn't require writenotify. Note that hugetlb does not yet implement MM_CP_TRY_CHANGE_WRITABLE, so we won't try setting PTEs writable when unprotecting or when unregistering uffd-wp. This can be added later on top by implementing MM_CP_TRY_CHANGE_WRITABLE. While commit `ffd0579396` ("userfaultfd: wp: support write protection for userfault vma range") introduced that code, it should only be applicable to uffd-wp on shared mappings -- shmem (hugetlb does not support softdirty tracking). I don't think this corner case justifies cc'ing stable. Let's just handle it correctly and prepare for change_protection() cleanups. [david@redhat.com: no need for additional harmless checks if we're wr-protecting either way] Link: https://lkml.kernel.org/r/71412742-a71f-9c74-865f-773ad83db7a5@redhat.com Link: https://lkml.kernel.org/r/20221223155616.297723-1-david@redhat.com Link: https://lkml.kernel.org/r/20221223155616.297723-2-david@redhat.com Fixes: `b1f9e87686` ("mm/uffd: enable write protection for shmem & hugetlbfs") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:50 -08:00
Xu Panda	a9af8e6bb3	selftests/vm: ksm_functional_tests: fix a typo in comment Fix a typo of "comaring" which should be "comparing". Link: https://lkml.kernel.org/r/202212231050245952617@zte.com.cn Signed-off-by: Xu Panda <xu.panda@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:49 -08:00
Yu Zhao	f386e93140	mm: multi-gen LRU: simplify arch_has_hw_pte_young() check Scanning page tables when hardware does not set the accessed bit has no real use cases. Link: https://lkml.kernel.org/r/20221222041905.2431096-9-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:49 -08:00
Yu Zhao	e9d4e1ee78	mm: multi-gen LRU: clarify scan_control flags Among the flags in scan_control: 1. sc->may_swap, which indicates swap constraint due to memsw.max, is supported as usual. 2. sc->proactive, which indicates reclaim by memory.reclaim, may not opportunistically skip the aging path, since it is considered less latency sensitive. 3. !(sc->gfp_mask & __GFP_IO), which indicates IO constraint, lowers swappiness to prioritize file LRU, since clean file folios are more likely to exist. 4. sc->may_writepage and sc->may_unmap, which indicates opportunistic reclaim, are rejected, since unmapped clean folios are already prioritized. Scanning for more of them is likely futile and can cause high reclaim latency when there is a large number of memcgs. The rest are handled by the existing code. Link: https://lkml.kernel.org/r/20221222041905.2431096-8-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Larabel <Michael@MichaelLarabel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-01-18 17:12:49 -08:00
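
Note on 81e506bec9 ("mm/thp: check and bail out if page in deferred queue already"): the idea described in that entry, testing queue membership before taking the contended lock and then re-testing under the lock, can be sketched in user space roughly as follows. This is an illustration only, not the kernel code: `struct item`, `struct queue`, `queue_add_once()` and the `lock_acquisitions` counter are hypothetical names, and a pthread mutex plus an atomic flag stand in for split_queue_lock and the kernel's list membership check. Build with `-pthread`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a THP head page tracked on a deferred queue. */
struct item {
    atomic_bool queued;   /* true once the item is linked into the queue */
    struct item *next;
};

/* Hypothetical stand-in for the deferred split queue and its lock. */
struct queue {
    pthread_mutex_t lock;
    struct item *head;
    size_t len;
    size_t lock_acquisitions;   /* only here to make the effect visible */
};

/*
 * Add the item to the queue at most once.
 * Returns true if this call queued it, false if it was already queued.
 */
static bool queue_add_once(struct queue *q, struct item *it)
{
    bool added = false;

    assert(q != NULL && it != NULL);

    /*
     * Unlocked fast path: if the item is already queued there is
     * nothing to do, so skip the contended lock entirely. The result
     * may be stale, which is why the check is repeated under the lock.
     */
    if (atomic_load_explicit(&it->queued, memory_order_relaxed))
        return false;

    pthread_mutex_lock(&q->lock);
    q->lock_acquisitions++;
    if (!atomic_load_explicit(&it->queued, memory_order_relaxed)) {
        it->next = q->head;
        q->head = it;
        q->len++;
        atomic_store_explicit(&it->queued, true, memory_order_relaxed);
        added = true;
    }
    pthread_mutex_unlock(&q->lock);

    return added;
}
```

With a queue initialised as `struct queue q = { .lock = PTHREAD_MUTEX_INITIALIZER };`, calling `queue_add_once()` repeatedly on the same item takes the lock only on the first call; later calls return from the unlocked check. That mirrors the reasoning in the commit message, where repeated per-sub-page calls for a THP that is already queued no longer touch the lock.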

1154375 commits