linux-stable/include/linux/gfp.h

376 lines
12 KiB
C
Raw Normal View History

License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:07:57 +00:00
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __LINUX_GFP_H
#define __LINUX_GFP_H
#include <linux/gfp_types.h>
#include <linux/mmzone.h>
#include <linux/topology.h>
struct vm_area_struct;
/* Convert GFP flags to their corresponding migrate type */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
#define GFP_MOVABLE_SHIFT 3
Categorize GFP flags The function of GFP_LEVEL_MASK seems to be unclear. In order to clear up the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP flags: GFP_RECLAIM_MASK Flags used to control page allocator reclaim behavior. GFP_CONSTRAINT_MASK Flags used to limit where allocations can occur. GFP_SLAB_BUG_MASK Flags that the slab allocator BUG()s on. These replace the uses of GFP_LEVEL mask in the slab allocators and in vmalloc.c. The use of the flags not included in these sets may occur as a result of a slab allocation standing in for a page allocation when constructing scatter gather lists. Extraneous flags are cleared and not passed through to the page allocator. __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will now be ignored if passed to a slab allocator. Change the allocation of allocator meta data in SLAB and vmalloc to not pass through flags listed in GFP_CONSTRAINT_MASK. SLAB already removes the __GFP_THISNODE flag for such allocations. Generalize that to also cover vmalloc. The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL. The impact of allocator metadata placement on access latency to the cachelines of the object itself is minimal since metadata is only referenced on alloc and free. The attempt is still made to place the meta data optimally but we consistently allow fallback both in SLAB and vmalloc (SLUB does not need to allocate metadata like that). Allocator metadata may serve multiple in kernel users and thus should not be subject to the limitations arising from a single allocation context. [akpm@linux-foundation.org: fix fallback_alloc()] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 08:25:41 +00:00
static inline int gfp_migratetype(const gfp_t gfp_flags)
Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo. The information is collected only on request so there is no runtime overhead. The statistics are in three parts: The first part prints information on the size of blocks that pages are being grouped on and looks like Page block order: 10 Pages per block: 1024 The second part is a more detailed version of /proc/buddyinfo and looks like Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Reclaimable 1 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Reserve 0 4 4 0 0 0 0 1 0 1 0 Node 0, zone Normal, type Unmovable 111 8 4 4 2 3 1 0 0 0 0 Node 0, zone Normal, type Reclaimable 293 89 8 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Movable 1 6 13 9 7 6 3 0 0 0 0 Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 4 The third part looks like Number of blocks type Unmovable Reclaimable Movable Reserve Node 0, zone DMA 0 1 2 1 Node 0, zone Normal 3 17 94 4 To walk the zones within a node with interrupts disabled, walk_zones_in_node() is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and /proc/pagetypeinfo to reduce code duplication. It seems specific to what vmstat.c requires but could be broken out as a general utility function in mmzone.c if there were other other potential users. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 08:26:02 +00:00
{
VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo. The information is collected only on request so there is no runtime overhead. The statistics are in three parts: The first part prints information on the size of blocks that pages are being grouped on and looks like Page block order: 10 Pages per block: 1024 The second part is a more detailed version of /proc/buddyinfo and looks like Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Reclaimable 1 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Reserve 0 4 4 0 0 0 0 1 0 1 0 Node 0, zone Normal, type Unmovable 111 8 4 4 2 3 1 0 0 0 0 Node 0, zone Normal, type Reclaimable 293 89 8 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Movable 1 6 13 9 7 6 3 0 0 0 0 Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 4 The third part looks like Number of blocks type Unmovable Reclaimable Movable Reserve Node 0, zone DMA 0 1 2 1 Node 0, zone Normal 3 17 94 4 To walk the zones within a node with interrupts disabled, walk_zones_in_node() is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and /proc/pagetypeinfo to reduce code duplication. It seems specific to what vmstat.c requires but could be broken out as a general utility function in mmzone.c if there were other other potential users. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 08:26:02 +00:00
if (unlikely(page_group_by_mobility_disabled))
return MIGRATE_UNMOVABLE;
/* Group based on mobility */
return (__force unsigned long)(gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo. The information is collected only on request so there is no runtime overhead. The statistics are in three parts: The first part prints information on the size of blocks that pages are being grouped on and looks like Page block order: 10 Pages per block: 1024 The second part is a more detailed version of /proc/buddyinfo and looks like Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 0, zone DMA, type Unmovable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Reclaimable 1 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Reserve 0 4 4 0 0 0 0 1 0 1 0 Node 0, zone Normal, type Unmovable 111 8 4 4 2 3 1 0 0 0 0 Node 0, zone Normal, type Reclaimable 293 89 8 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Movable 1 6 13 9 7 6 3 0 0 0 0 Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 4 The third part looks like Number of blocks type Unmovable Reclaimable Movable Reserve Node 0, zone DMA 0 1 2 1 Node 0, zone Normal 3 17 94 4 To walk the zones within a node with interrupts disabled, walk_zones_in_node() is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and /proc/pagetypeinfo to reduce code duplication. It seems specific to what vmstat.c requires but could be broken out as a general utility function in mmzone.c if there were other other potential users. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 08:26:02 +00:00
}
#undef GFP_MOVABLE_MASK
#undef GFP_MOVABLE_SHIFT
[PATCH] x86_64: Add 4GB DMA32 zone Add a new 4GB GFP_DMA32 zone between the GFP_DMA and GFP_NORMAL zones. As a bit of historical background: when the x86-64 port was originally designed we had some discussion if we should use a 16MB DMA zone like i386 or a 4GB DMA zone like IA64 or both. Both was ruled out at this point because it was in early 2.4 when VM is still quite shakey and had bad troubles even dealing with one DMA zone. We settled on the 16MB DMA zone mainly because we worried about older soundcards and the floppy. But this has always caused problems since then because device drivers had trouble getting enough DMA able memory. These days the VM works much better and the wide use of NUMA has proven it can deal with many zones successfully. So this patch adds both zones. This helps drivers who need a lot of memory below 4GB because their hardware is not accessing more (graphic drivers - proprietary and free ones, video frame buffer drivers, sound drivers etc.). Previously they could only use IOMMU+16MB GFP_DMA, which was not enough memory. Another common problem is that hardware who has full memory addressing for >4GB misses it for some control structures in memory (like transmit rings or other metadata). They tended to allocate memory in the 16MB GFP_DMA or the IOMMU/swiotlb then using pci_alloc_consistent, but that can tie up a lot of precious 16MB GFPDMA/IOMMU/swiotlb memory (even on AMD systems the IOMMU tends to be quite small) especially if you have many devices. With the new zone pci_alloc_consistent can just put this stuff into memory below 4GB which works better. One argument was still if the zone should be 4GB or 2GB. The main motivation for 2GB would be an unnamed not so unpopular hardware raid controller (mostly found in older machines from a particular four letter company) who has a strange 2GB restriction in firmware. But that one works ok with swiotlb/IOMMU anyways, so it doesn't really need GFP_DMA32. I chose 4GB to be compatible with IA64 and because it seems to be the most common restriction. The new zone is so far added only for x86-64. For other architectures who don't set up this new zone nothing changes. Architectures can set a compatibility define in Kconfig CONFIG_DMA_IS_DMA32 that will define GFP_DMA32 as GFP_DMA. Otherwise it's a nop because on 32bit architectures it's normally not needed because GFP_NORMAL (=0) is DMA able enough. One problem is still that GFP_DMA means different things on different architectures. e.g. some drivers used to have #ifdef ia64 use GFP_DMA (trusting it to be 4GB) #elif __x86_64__ (use other hacks like the swiotlb because 16MB is not enough) ... . This was quite ugly and is now obsolete. These should be now converted to use GFP_DMA32 unconditionally. I haven't done this yet. Or best only use pci_alloc_consistent/dma_alloc_coherent which will use GFP_DMA32 transparently. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-05 16:25:53 +00:00
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07 00:28:21 +00:00
static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
{
return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-07 00:28:21 +00:00
}
net: fix sk_page_frag() recursion from memory reclaim sk_page_frag() optimizes skb_frag allocations by using per-task skb_frag cache when it knows it's the only user. The condition is determined by seeing whether the socket allocation mask allows blocking - if the allocation may block, it obviously owns the task's context and ergo exclusively owns current->task_frag. Unfortunately, this misses recursion through memory reclaim path. Please take a look at the following backtrace. [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10 ... tcp_sendmsg+0x27/0x40 sock_sendmsg+0x30/0x40 sock_xmit.isra.24+0xa1/0x170 [nbd] nbd_send_cmd+0x1d2/0x690 [nbd] nbd_queue_rq+0x1b5/0x3b0 [nbd] __blk_mq_try_issue_directly+0x108/0x1b0 blk_mq_request_issue_directly+0xbd/0xe0 blk_mq_try_issue_list_directly+0x41/0xb0 blk_mq_sched_insert_requests+0xa2/0xe0 blk_mq_flush_plug_list+0x205/0x2a0 blk_flush_plug_list+0xc3/0xf0 [1] blk_finish_plug+0x21/0x2e _xfs_buf_ioapply+0x313/0x460 __xfs_buf_submit+0x67/0x220 xfs_buf_read_map+0x113/0x1a0 xfs_trans_read_buf_map+0xbf/0x330 xfs_btree_read_buf_block.constprop.42+0x95/0xd0 xfs_btree_lookup_get_block+0x95/0x170 xfs_btree_lookup+0xcc/0x470 xfs_bmap_del_extent_real+0x254/0x9a0 __xfs_bunmapi+0x45c/0xab0 xfs_bunmapi+0x15/0x30 xfs_itruncate_extents_flags+0xca/0x250 xfs_free_eofblocks+0x181/0x1e0 xfs_fs_destroy_inode+0xa8/0x1b0 destroy_inode+0x38/0x70 dispose_list+0x35/0x50 prune_icache_sb+0x52/0x70 super_cache_scan+0x120/0x1a0 do_shrink_slab+0x120/0x290 shrink_slab+0x216/0x2b0 shrink_node+0x1b6/0x4a0 do_try_to_free_pages+0xc6/0x370 try_to_free_mem_cgroup_pages+0xe3/0x1e0 try_charge+0x29e/0x790 mem_cgroup_charge_skmem+0x6a/0x100 __sk_mem_raise_allocated+0x18e/0x390 __sk_mem_schedule+0x2a/0x40 [0] tcp_sendmsg_locked+0x8eb/0xe10 tcp_sendmsg+0x27/0x40 sock_sendmsg+0x30/0x40 ___sys_sendmsg+0x26d/0x2b0 __sys_sendmsg+0x57/0xa0 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 In [0], tcp_send_msg_locked() was using current->page_frag when it called sk_wmem_schedule(). It already calculated how many bytes can be fit into current->page_frag. Due to memory pressure, sk_wmem_schedule() called into memory reclaim path which called into xfs and then IO issue path. Because the filesystem in question is backed by nbd, the control goes back into the tcp layer - back into tcp_sendmsg_locked(). nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes sense - it's in the process of freeing memory and wants to be able to, e.g., drop clean pages to make forward progress. However, this confused sk_page_frag() called from [2]. Because it only tests whether the allocation allows blocking which it does, it now thinks current->page_frag can be used again although it already was being used in [0]. After [2] used current->page_frag, the offset would be increased by the used amount. When the control returns to [0], current->page_frag's offset is increased and the previously calculated number of bytes now may overrun the end of allocated memory leading to silent memory corruptions. Fix it by adding gfpflags_normal_context() which tests sleepable && !reclaim and use it to determine whether to use current->task_frag. v2: Eric didn't like gfp flags being tested twice. Introduce a new helper gfpflags_normal_context() and combine the two tests. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24 20:50:27 +00:00
/**
* gfpflags_normal_context - is gfp_flags a normal sleepable context?
* @gfp_flags: gfp_flags to test
*
* Test whether @gfp_flags indicates that the allocation is from the
* %current context and allowed to sleep.
*
* An allocation being allowed to block doesn't mean it owns the %current
* context. When direct reclaim path tries to allocate memory, the
* allocation context is nested inside whatever %current was doing at the
* time of the original allocation. The nested allocation may be allowed
* to block but modifying anything %current owns can corrupt the outer
* context's expectations.
*
* %true result from this function indicates that the allocation context
* can sleep and use anything that's associated with %current.
*/
static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
{
return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
__GFP_DIRECT_RECLAIM;
}
#ifdef CONFIG_HIGHMEM
#define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
#else
#define OPT_ZONE_HIGHMEM ZONE_NORMAL
#endif
#ifdef CONFIG_ZONE_DMA
#define OPT_ZONE_DMA ZONE_DMA
#else
#define OPT_ZONE_DMA ZONE_NORMAL
#endif
[PATCH] mempolicies: fix policy_zone check There is a check in zonelist_policy that compares pieces of the bitmap obtained from a gfp mask via GFP_ZONETYPES with a zone number in function zonelist_policy(). The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM. The policy_zone is a zone number with the possible values of ZONE_DMA, ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains of values. For some reason seemed to work before the zone reduction patchset (It definitely works on SGI boxes since we just have one zone and the check cannot fail). With the zone reduction patchset this check definitely fails on systems with two zones if the system actually has memory in both zones. This is because ZONE_NORMAL is selected using no __GFP flag at all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA is set. __GFP_DMA is 0x01. So gfp_zone(gfpmask) == 1. policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are populated. For ZONE_NORMAL gfp_zone(<no _GFP_DMA>) yields 0 which is < policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory allocations! Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied to DMA allocations! What we realy want in that place is to establish the highest allowable zone for a given gfp_mask. If the highest zone is higher or equal to the policy_zone then memory policies need to be applied. We have such a highest_zone() function in page_alloc.c. So move the highest_zone() function from mm/page_alloc.c into include/linux/gfp.h. On the way we simplify the function and use the new zone_type that was also introduced with the zone reduction patchset plus we also specify the right type for the gfp flags parameter. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 06:31:17 +00:00
#ifdef CONFIG_ZONE_DMA32
#define OPT_ZONE_DMA32 ZONE_DMA32
#else
#define OPT_ZONE_DMA32 ZONE_NORMAL
[PATCH] mempolicies: fix policy_zone check There is a check in zonelist_policy that compares pieces of the bitmap obtained from a gfp mask via GFP_ZONETYPES with a zone number in function zonelist_policy(). The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM. The policy_zone is a zone number with the possible values of ZONE_DMA, ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains of values. For some reason seemed to work before the zone reduction patchset (It definitely works on SGI boxes since we just have one zone and the check cannot fail). With the zone reduction patchset this check definitely fails on systems with two zones if the system actually has memory in both zones. This is because ZONE_NORMAL is selected using no __GFP flag at all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA is set. __GFP_DMA is 0x01. So gfp_zone(gfpmask) == 1. policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are populated. For ZONE_NORMAL gfp_zone(<no _GFP_DMA>) yields 0 which is < policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory allocations! Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied to DMA allocations! What we realy want in that place is to establish the highest allowable zone for a given gfp_mask. If the highest zone is higher or equal to the policy_zone then memory policies need to be applied. We have such a highest_zone() function in page_alloc.c. So move the highest_zone() function from mm/page_alloc.c into include/linux/gfp.h. On the way we simplify the function and use the new zone_type that was also introduced with the zone reduction patchset plus we also specify the right type for the gfp flags parameter. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 06:31:17 +00:00
#endif
/*
* GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
* zone to use given the lowest 4 bits of gfp_t. Entries are GFP_ZONES_SHIFT
* bits long and there are 16 of them to cover all possible combinations of
* __GFP_DMA, __GFP_DMA32, __GFP_MOVABLE and __GFP_HIGHMEM.
*
* The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA.
* But GFP_MOVABLE is not only a zone specifier but also an allocation
* policy. Therefore __GFP_MOVABLE plus another zone selector is valid.
* Only 1 bit of the lowest 3 bits (DMA,DMA32,HIGHMEM) can be set to "1".
*
* bit result
* =================
* 0x0 => NORMAL
* 0x1 => DMA or NORMAL
* 0x2 => HIGHMEM or NORMAL
* 0x3 => BAD (DMA+HIGHMEM)
* 0x4 => DMA32 or NORMAL
* 0x5 => BAD (DMA+DMA32)
* 0x6 => BAD (HIGHMEM+DMA32)
* 0x7 => BAD (HIGHMEM+DMA32+DMA)
* 0x8 => NORMAL (MOVABLE+0)
* 0x9 => DMA or NORMAL (MOVABLE+DMA)
* 0xa => MOVABLE (Movable is valid only if HIGHMEM is set too)
* 0xb => BAD (MOVABLE+HIGHMEM+DMA)
* 0xc => DMA32 or NORMAL (MOVABLE+DMA32)
* 0xd => BAD (MOVABLE+DMA32+DMA)
* 0xe => BAD (MOVABLE+DMA32+HIGHMEM)
* 0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
*
2016-03-17 21:19:41 +00:00
* GFP_ZONES_SHIFT must be <= 2 on 32 bit platforms.
*/
2016-03-17 21:19:41 +00:00
#if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES-1) <= 4
/* ZONE_DEVICE is not a valid GFP zone specifier */
#define GFP_ZONES_SHIFT 2
#else
#define GFP_ZONES_SHIFT ZONES_SHIFT
#endif
#if 16 * GFP_ZONES_SHIFT > BITS_PER_LONG
#error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
#endif
#define GFP_ZONE_TABLE ( \
2016-03-17 21:19:41 +00:00
(ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \
| (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \
| (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
| (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
| (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
| (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\
| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\
)
/*
* GFP_ZONE_BAD is a bitmap for all combinations of __GFP_DMA, __GFP_DMA32
* __GFP_HIGHMEM and __GFP_MOVABLE that are not permitted. One flag per
* entry starting with bit 0. Bit is set if the combination is not
* allowed.
*/
#define GFP_ZONE_BAD ( \
1 << (___GFP_DMA | ___GFP_HIGHMEM) \
| 1 << (___GFP_DMA | ___GFP_DMA32) \
| 1 << (___GFP_DMA32 | ___GFP_HIGHMEM) \
| 1 << (___GFP_DMA | ___GFP_DMA32 | ___GFP_HIGHMEM) \
| 1 << (___GFP_MOVABLE | ___GFP_HIGHMEM | ___GFP_DMA) \
| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA) \
| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_HIGHMEM) \
| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA | ___GFP_HIGHMEM) \
)
static inline enum zone_type gfp_zone(gfp_t flags)
{
enum zone_type z;
int bit = (__force int) (flags & GFP_ZONEMASK);
2016-03-17 21:19:41 +00:00
z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
((1 << GFP_ZONES_SHIFT) - 1);
VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
return z;
[PATCH] mempolicies: fix policy_zone check There is a check in zonelist_policy that compares pieces of the bitmap obtained from a gfp mask via GFP_ZONETYPES with a zone number in function zonelist_policy(). The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM. The policy_zone is a zone number with the possible values of ZONE_DMA, ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains of values. For some reason seemed to work before the zone reduction patchset (It definitely works on SGI boxes since we just have one zone and the check cannot fail). With the zone reduction patchset this check definitely fails on systems with two zones if the system actually has memory in both zones. This is because ZONE_NORMAL is selected using no __GFP flag at all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA is set. __GFP_DMA is 0x01. So gfp_zone(gfpmask) == 1. policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are populated. For ZONE_NORMAL gfp_zone(<no _GFP_DMA>) yields 0 which is < policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory allocations! Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied to DMA allocations! What we realy want in that place is to establish the highest allowable zone for a given gfp_mask. If the highest zone is higher or equal to the policy_zone then memory policies need to be applied. We have such a highest_zone() function in page_alloc.c. So move the highest_zone() function from mm/page_alloc.c into include/linux/gfp.h. On the way we simplify the function and use the new zone_type that was also introduced with the zone reduction patchset plus we also specify the right type for the gfp flags parameter. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 06:31:17 +00:00
}
/*
* There is only one page-allocator function, and two main namespaces to
* it. The alloc_page*() variants return 'struct page *' and as such
* can allocate highmem pages, the *get*page*() variants return
* virtual kernel addresses to the allocated page(s).
*/
static inline int gfp_zonelist(gfp_t flags)
{
#ifdef CONFIG_NUMA
if (unlikely(flags & __GFP_THISNODE))
return ZONELIST_NOFALLBACK;
#endif
return ZONELIST_FALLBACK;
}
/*
* We get the zone list from the current node and the gfp_mask.
* This zone list contains a maximum of MAX_NUMNODES*MAX_NR_ZONES zones.
* There are two zonelists per node, one for all zones with memory and
* one containing just zones from the node the zonelist belongs to.
*
* For the case of non-NUMA systems the NODE_DATA() gets optimized to
* &contig_page_data at compile-time.
*/
static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}
#ifndef HAVE_ARCH_FREE_PAGE
static inline void arch_free_page(struct page *page, int order) { }
#endif
#ifndef HAVE_ARCH_ALLOC_PAGE
static inline void arch_alloc_page(struct page *page, int order) { }
#endif
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
struct folio *__folio_alloc(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
mm/page_alloc: add a bulk page allocator This patch adds a new page allocator interface via alloc_pages_bulk, and __alloc_pages_bulk_nodemask. A caller requests a number of pages to be allocated and added to a list. The API is not guaranteed to return the requested number of pages and may fail if the preferred allocation zone has limited free memory, the cpuset changes during the allocation or page debugging decides to fail an allocation. It's up to the caller to request more pages in batch if necessary. Note that this implementation is not very efficient and could be improved but it would require refactoring. The intent is to make it available early to determine what semantics are required by different callers. Once the full semantics are nailed down, it can be refactored. [mgorman@techsingularity.net: fix alloc_pages_bulk() return type, per Matthew] Link: https://lkml.kernel.org/r/20210325123713.GQ3697@techsingularity.net [mgorman@techsingularity.net: fix uninit var warning] Link: https://lkml.kernel.org/r/20210330114847.GX3697@techsingularity.net [mgorman@techsingularity.net: fix comment, per Vlastimil] Link: https://lkml.kernel.org/r/20210412110255.GV3697@techsingularity.net Link: https://lkml.kernel.org/r/20210325114228.27719-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Tested-by: Colin Ian King <colin.king@canonical.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 06:01:45 +00:00
unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list,
struct page **page_array);
mm/page_alloc: add a bulk page allocator This patch adds a new page allocator interface via alloc_pages_bulk, and __alloc_pages_bulk_nodemask. A caller requests a number of pages to be allocated and added to a list. The API is not guaranteed to return the requested number of pages and may fail if the preferred allocation zone has limited free memory, the cpuset changes during the allocation or page debugging decides to fail an allocation. It's up to the caller to request more pages in batch if necessary. Note that this implementation is not very efficient and could be improved but it would require refactoring. The intent is to make it available early to determine what semantics are required by different callers. Once the full semantics are nailed down, it can be refactored. [mgorman@techsingularity.net: fix alloc_pages_bulk() return type, per Matthew] Link: https://lkml.kernel.org/r/20210325123713.GQ3697@techsingularity.net [mgorman@techsingularity.net: fix uninit var warning] Link: https://lkml.kernel.org/r/20210330114847.GX3697@techsingularity.net [mgorman@techsingularity.net: fix comment, per Vlastimil] Link: https://lkml.kernel.org/r/20210412110255.GV3697@techsingularity.net Link: https://lkml.kernel.org/r/20210325114228.27719-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Tested-by: Colin Ian King <colin.king@canonical.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 06:01:45 +00:00
mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation Commit ffb29b1c255a ("mm/vmalloc: fix numa spreading for large hash tables") can cause significant performance regressions in some situations as Andrew mentioned in [1]. The main situation is vmalloc, vmalloc will allocate pages with NUMA_NO_NODE by default, that will result in alloc page one by one; In order to solve this, __alloc_pages_bulk and mempolicy should be considered at the same time. 1) If node is specified in memory allocation request, it will alloc all pages by __alloc_pages_bulk. 2) If interleaving allocate memory, it will cauculate how many pages should be allocated in each node, and use __alloc_pages_bulk to alloc pages in each node. [1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60 [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: make two functions static] [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Link: https://lkml.kernel.org/r/20211021080744.874701-3-chenwandun@huawei.com Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-05 20:39:53 +00:00
unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
unsigned long nr_pages,
struct page **page_array);
mm/page_alloc: add a bulk page allocator This patch adds a new page allocator interface via alloc_pages_bulk, and __alloc_pages_bulk_nodemask. A caller requests a number of pages to be allocated and added to a list. The API is not guaranteed to return the requested number of pages and may fail if the preferred allocation zone has limited free memory, the cpuset changes during the allocation or page debugging decides to fail an allocation. It's up to the caller to request more pages in batch if necessary. Note that this implementation is not very efficient and could be improved but it would require refactoring. The intent is to make it available early to determine what semantics are required by different callers. Once the full semantics are nailed down, it can be refactored. [mgorman@techsingularity.net: fix alloc_pages_bulk() return type, per Matthew] Link: https://lkml.kernel.org/r/20210325123713.GQ3697@techsingularity.net [mgorman@techsingularity.net: fix uninit var warning] Link: https://lkml.kernel.org/r/20210330114847.GX3697@techsingularity.net [mgorman@techsingularity.net: fix comment, per Vlastimil] Link: https://lkml.kernel.org/r/20210412110255.GV3697@techsingularity.net Link: https://lkml.kernel.org/r/20210325114228.27719-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Tested-by: Colin Ian King <colin.king@canonical.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 06:01:45 +00:00
/* Bulk allocate order-0 pages */
static inline unsigned long
alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
mm/page_alloc: add a bulk page allocator This patch adds a new page allocator interface via alloc_pages_bulk, and __alloc_pages_bulk_nodemask. A caller requests a number of pages to be allocated and added to a list. The API is not guaranteed to return the requested number of pages and may fail if the preferred allocation zone has limited free memory, the cpuset changes during the allocation or page debugging decides to fail an allocation. It's up to the caller to request more pages in batch if necessary. Note that this implementation is not very efficient and could be improved but it would require refactoring. The intent is to make it available early to determine what semantics are required by different callers. Once the full semantics are nailed down, it can be refactored. [mgorman@techsingularity.net: fix alloc_pages_bulk() return type, per Matthew] Link: https://lkml.kernel.org/r/20210325123713.GQ3697@techsingularity.net [mgorman@techsingularity.net: fix uninit var warning] Link: https://lkml.kernel.org/r/20210330114847.GX3697@techsingularity.net [mgorman@techsingularity.net: fix comment, per Vlastimil] Link: https://lkml.kernel.org/r/20210412110255.GV3697@techsingularity.net Link: https://lkml.kernel.org/r/20210325114228.27719-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Tested-by: Colin Ian King <colin.king@canonical.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 06:01:45 +00:00
{
return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
}
static inline unsigned long
alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
{
return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
mm/page_alloc: add a bulk page allocator This patch adds a new page allocator interface via alloc_pages_bulk, and __alloc_pages_bulk_nodemask. A caller requests a number of pages to be allocated and added to a list. The API is not guaranteed to return the requested number of pages and may fail if the preferred allocation zone has limited free memory, the cpuset changes during the allocation or page debugging decides to fail an allocation. It's up to the caller to request more pages in batch if necessary. Note that this implementation is not very efficient and could be improved but it would require refactoring. The intent is to make it available early to determine what semantics are required by different callers. Once the full semantics are nailed down, it can be refactored. [mgorman@techsingularity.net: fix alloc_pages_bulk() return type, per Matthew] Link: https://lkml.kernel.org/r/20210325123713.GQ3697@techsingularity.net [mgorman@techsingularity.net: fix uninit var warning] Link: https://lkml.kernel.org/r/20210330114847.GX3697@techsingularity.net [mgorman@techsingularity.net: fix comment, per Vlastimil] Link: https://lkml.kernel.org/r/20210412110255.GV3697@techsingularity.net Link: https://lkml.kernel.org/r/20210325114228.27719-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Alexander Lobakin <alobakin@pm.me> Tested-by: Colin Ian King <colin.king@canonical.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Miller <davem@davemloft.net> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-30 06:01:45 +00:00
}
static inline unsigned long
alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
}
mm: rename alloc_pages_exact_node() to __alloc_pages_node() alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Robin Holt <robinmholt@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 22:03:50 +00:00
/*
* Allocate pages, preferring the node given as nid. The node must be valid and
* online. For more general interface, see alloc_pages_node().
*/
static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
mm: do not warn on offline nodes unless the specific node is explicitly requested Oscar has noticed that we splat WARNING: CPU: 0 PID: 64 at ./include/linux/gfp.h:467 vmemmap_alloc_block+0x4e/0xc9 [...] CPU: 0 PID: 64 Comm: kworker/u4:1 Tainted: G W E 4.17.0-rc5-next-20180517-1-default+ #66 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 Workqueue: kacpi_hotplug acpi_hotplug_work_fn Call Trace: vmemmap_populate+0xf2/0x2ae sparse_mem_map_populate+0x28/0x35 sparse_add_one_section+0x4c/0x187 __add_pages+0xe7/0x1a0 add_pages+0x16/0x70 add_memory_resource+0xa3/0x1d0 add_memory+0xe4/0x110 acpi_memory_device_add+0x134/0x2e0 acpi_bus_attach+0xd9/0x190 acpi_bus_scan+0x37/0x70 acpi_device_hotplug+0x389/0x4e0 acpi_hotplug_work_fn+0x1a/0x30 process_one_work+0x146/0x340 worker_thread+0x47/0x3e0 kthread+0xf5/0x130 ret_from_fork+0x35/0x40 when adding memory to a node that is currently offline. The VM_WARN_ON is just too loud without a good reason. In this particular case we are doing alloc_pages_node(node, GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN, order) so we do not insist on allocating from the given node (it is more a hint) so we can fall back to any other populated node and moreover we explicitly ask to not warn for the allocation failure. Soften the warning only to cases when somebody asks for the given node explicitly by __GFP_THISNODE. Link: http://lkml.kernel.org/r/20180523125555.30039-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Oscar Salvador <osalvador@techadventures.net> Tested-by: Oscar Salvador <osalvador@techadventures.net> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25 21:47:46 +00:00
VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
return __alloc_pages(gfp_mask, order, nid, NULL);
}
static inline
struct folio *__folio_alloc_node(gfp_t gfp, unsigned int order, int nid)
{
VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
VM_WARN_ON((gfp & __GFP_THISNODE) && !node_online(nid));
return __folio_alloc(gfp, order, nid, NULL);
}
mm: rename alloc_pages_exact_node() to __alloc_pages_node() alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Robin Holt <robinmholt@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 22:03:50 +00:00
/*
* Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
* prefer the current CPU's closest node. Otherwise node must be valid and
* online.
mm: rename alloc_pages_exact_node() to __alloc_pages_node() alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Robin Holt <robinmholt@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 22:03:50 +00:00
*/
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
if (nid == NUMA_NO_NODE)
nid = numa_mem_id();
return __alloc_pages_node(nid, gfp_mask, order);
}
#ifdef CONFIG_NUMA
struct page *alloc_pages(gfp_t gfp, unsigned int order);
struct folio *folio_alloc(gfp_t gfp, unsigned order);
struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
unsigned long addr, bool hugepage);
#else
static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
return alloc_pages_node(numa_node_id(), gfp_mask, order);
}
static inline struct folio *folio_alloc(gfp_t gfp, unsigned int order)
{
return __folio_alloc_node(gfp, order, numa_node_id());
}
#define vma_alloc_folio(gfp, order, vma, addr, hugepage) \
folio_alloc(gfp, order)
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
static inline struct page *alloc_page_vma(gfp_t gfp,
struct vm_area_struct *vma, unsigned long addr)
{
struct folio *folio = vma_alloc_folio(gfp, 0, vma, addr, false);
return &folio->page;
}
extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);
mm/page_alloc: add __alloc_size attributes for better bounds checking As already done in GrapheneOS, add the __alloc_size attribute for appropriate page allocator interfaces, to provide additional hinting for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler optimizations. Link: https://lkml.kernel.org/r/20210930222704.2631604-8-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Daniel Micay <danielmicay@gmail.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jing Xiangfeng <jingxiangfeng@huawei.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: kernel test robot <lkp@intel.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-05 20:36:38 +00:00
void *alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
void free_pages_exact(void *virt, size_t size);
__meminit void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(2);
#define __get_free_page(gfp_mask) \
__get_free_pages((gfp_mask), 0)
#define __get_dma_pages(gfp_mask, order) \
__get_free_pages((gfp_mask) | GFP_DMA, (order))
extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
struct page_frag_cache;
extern void __page_frag_cache_drain(struct page *page, unsigned int count);
extern void *page_frag_alloc_align(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask,
unsigned int align_mask);
static inline void *page_frag_alloc(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask)
{
return page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
}
extern void page_frag_free(void *addr);
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr), 0)
void page_alloc_init(void);
Move remote node draining out of slab allocators Currently the slab allocators contain callbacks into the page allocator to perform the draining of pagesets on remote nodes. This requires SLUB to have a whole subsystem in order to be compatible with SLAB. Moving node draining out of the slab allocators avoids a section of code in SLUB. Move the node draining so that is is done when the vm statistics are updated. At that point we are already touching all the cachelines with the pagesets of a processor. Add a expire counter there. If we have to update per zone or global vm statistics then assume that the pageset will require subsequent draining. The expire counter will be decremented on each vm stats update pass until it reaches zero. Then we will drain one batch from the pageset. The draining will cause vm counter updates which will then cause another expiration until the pcp is empty. So we will drain a batch every 3 seconds. Note that remote node draining is a somewhat esoteric feature that is required on large NUMA systems because otherwise significant portions of system memory can become trapped in pcp queues. The number of pcp is determined by the number of processors and nodes in a system. A system with 4 processors and 2 nodes has 8 pcps which is okay. But a system with 1024 processors and 512 nodes has 512k pcps with a high potential for large amount of memory being caught in them. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 09:35:14 +00:00
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);
void page_alloc_init_late(void);
mm: avoid livelock on !__GFP_FS allocations Colin Cross reported; Under the following conditions, __alloc_pages_slowpath can loop forever: gfp_mask & __GFP_WAIT is true gfp_mask & __GFP_FS is false reclaim and compaction make no progress order <= PAGE_ALLOC_COSTLY_ORDER These conditions happen very often during suspend and resume, when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL allocations into __GFP_WAIT. The oom killer is not run because gfp_mask & __GFP_FS is false, but should_alloc_retry will always return true when order is less than PAGE_ALLOC_COSTLY_ORDER. In his fix, he avoided retrying the allocation if reclaim made no progress and __GFP_FS was not set. The problem is that this would result in GFP_NOIO allocations failing that previously succeeded which would be very unfortunate. The big difference between GFP_NOIO and suspend converting GFP_KERNEL to behave like GFP_NOIO is that normally flushers will be cleaning pages and kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay. The same does not necessarily apply during suspend as the storage device may be suspended. This patch special cases the suspend case to fail the page allocation if reclaim cannot make progress and adds some documentation on how gfp_allowed_mask is currently used. Failing allocations like this may cause suspend to abort but that is better than a livelock. [mgorman@suse.de: Rework fix to be suspend specific] [rientjes@google.com: Move suspended device check to should_alloc_retry] Reported-by: Colin Cross <ccross@android.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10 23:07:15 +00:00
/*
* gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
* GFP flags are used before interrupts are enabled. Once interrupts are
* enabled, it is set to __GFP_BITS_MASK while the system is running. During
* hibernation, it is used by PM to avoid I/O during memory allocation while
* devices are suspended.
*/
extern gfp_t gfp_allowed_mask;
/* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */
bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
extern void pm_restrict_gfp_mask(void);
extern void pm_restore_gfp_mask(void);
mm,thp,shmem: limit shmem THP alloc gfp_mask Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6. The allocation flags of anonymous transparent huge pages can be controlled through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can help the system from getting bogged down in the page reclaim and compaction code when many THPs are getting allocated simultaneously. However, the gfp_mask for shmem THP allocations were not limited by those configuration settings, and some workloads ended up with all CPUs stuck on the LRU lock in the page reclaim code, trying to allocate dozens of THPs simultaneously. This patch applies the same configurated limitation of THPs to shmem hugepage allocations, to prevent that from happening. This way a THP defrag setting of "never" or "defer+madvise" will result in quick allocation failures without direct reclaim when no 2MB free pages are available. With this patch applied, THP allocations for tmpfs will be a little more aggressive than today for files mmapped with MADV_HUGEPAGE, and a little less aggressive for files that are not mmapped or mapped without that flag. This patch (of 4): The allocation flags of anonymous transparent huge pages can be controlled through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can help the system from getting bogged down in the page reclaim and compaction code when many THPs are getting allocated simultaneously. However, the gfp_mask for shmem THP allocations were not limited by those configuration settings, and some workloads ended up with all CPUs stuck on the LRU lock in the page reclaim code, trying to allocate dozens of THPs simultaneously. This patch applies the same configurated limitation of THPs to shmem hugepage allocations, to prevent that from happening. Controlling the gfp_mask of THP allocations through the knobs in sysfs allows users to determine the balance between how aggressively the system tries to allocate THPs at fault time, and how much the application may end up stalling attempting those allocations. This way a THP defrag setting of "never" or "defer+madvise" will result in quick allocation failures without direct reclaim when no 2MB free pages are available. With this patch applied, THP allocations for tmpfs will be a little more aggressive than today for files mmapped with MADV_HUGEPAGE, and a little less aggressive for files that are not mmapped or mapped without that flag. Link: https://lkml.kernel.org/r/20201124194925.623931-1-riel@surriel.com Link: https://lkml.kernel.org/r/20201124194925.623931-2-riel@surriel.com Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Xu Yu <xuyu@linux.alibaba.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-26 01:16:18 +00:00
extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma);
mm: avoid livelock on !__GFP_FS allocations Colin Cross reported; Under the following conditions, __alloc_pages_slowpath can loop forever: gfp_mask & __GFP_WAIT is true gfp_mask & __GFP_FS is false reclaim and compaction make no progress order <= PAGE_ALLOC_COSTLY_ORDER These conditions happen very often during suspend and resume, when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL allocations into __GFP_WAIT. The oom killer is not run because gfp_mask & __GFP_FS is false, but should_alloc_retry will always return true when order is less than PAGE_ALLOC_COSTLY_ORDER. In his fix, he avoided retrying the allocation if reclaim made no progress and __GFP_FS was not set. The problem is that this would result in GFP_NOIO allocations failing that previously succeeded which would be very unfortunate. The big difference between GFP_NOIO and suspend converting GFP_KERNEL to behave like GFP_NOIO is that normally flushers will be cleaning pages and kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay. The same does not necessarily apply during suspend as the storage device may be suspended. This patch special cases the suspend case to fail the page allocation if reclaim cannot make progress and adds some documentation on how gfp_allowed_mask is currently used. Failing allocations like this may cause suspend to abort but that is better than a livelock. [mgorman@suse.de: Rework fix to be suspend specific] [rientjes@google.com: Move suspended device check to should_alloc_retry] Reported-by: Colin Cross <ccross@android.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-10 23:07:15 +00:00
#ifdef CONFIG_PM_SLEEP
extern bool pm_suspended_storage(void);
#else
static inline bool pm_suspended_storage(void)
{
return false;
}
#endif /* CONFIG_PM_SLEEP */
#ifdef CONFIG_CONTIG_ALLOC
/* The below functions must be run on a range from a single zone. */
extern int alloc_contig_range(unsigned long start, unsigned long end,
unsigned migratetype, gfp_t gfp_mask);
mm/page_alloc: add alloc_contig_pages() HugeTLB helper alloc_gigantic_page() implements fairly generic allocation method where it scans over various zones looking for a large contiguous pfn range before trying to allocate it with alloc_contig_range(). Other than deriving the requested order from 'struct hstate', there is nothing HugeTLB specific in there. This can be made available for general use to allocate contiguous memory which could not have been allocated through the buddy allocator. alloc_gigantic_page() has been split carving out actual allocation method which is then made available via new alloc_contig_pages() helper wrapped under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been replaced with more generic term 'contig'. Allocated pages here should be freed with free_contig_range() or by calling __free_page() on each allocated page. Link: http://lkml.kernel.org/r/1571300646-32240-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-01 01:55:06 +00:00
extern struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
int nid, nodemask_t *nodemask);
#endif
void free_contig_range(unsigned long pfn, unsigned long nr_pages);
#ifdef CONFIG_CMA
/* CMA stuff */
extern void init_cma_reserved_pageblock(struct page *page);
#endif
#endif /* __LINUX_GFP_H */