The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
the TID stays the same. However, two places in __slab_alloc() currently
don't update the TID when deactivating the CPU slab.
If multiple operations race the right way, this could lead to an object
getting lost; or, in an even more unlikely situation, it could even lead to
an object being freed onto the wrong slab's freelist, messing up the
`inuse` counter and eventually causing a page to be freed to the page
allocator while it still contains slab objects.
(I haven't actually tested these cases though, this is just based on
looking at the code. Writing testcases for this stuff seems like it'd be
a pain...)
The race leading to state inconsistency is (all operations on the same CPU
and kmem_cache):
- task A: begin do_slab_free():
- read TID
- read pcpu freelist (==NULL)
- check `slab == c->slab` (true)
- [PREEMPT A->B]
- task B: begin slab_alloc_node():
- fastpath fails (`c->freelist` is NULL)
- enter __slab_alloc()
- slub_get_cpu_ptr() (disables preemption)
- enter ___slab_alloc()
- take local_lock_irqsave()
- read c->freelist as NULL
- get_freelist() returns NULL
- write `c->slab = NULL`
- drop local_unlock_irqrestore()
- goto new_slab
- slub_percpu_partial() is NULL
- get_partial() returns NULL
- slub_put_cpu_ptr() (enables preemption)
- [PREEMPT B->A]
- task A: finish do_slab_free():
- this_cpu_cmpxchg_double() succeeds()
- [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]
From there, the object on c->freelist will get lost if task B is allowed to
continue from here: It will proceed to the retry_load_slab label,
set c->slab, then jump to load_freelist, which clobbers c->freelist.
But if we instead continue as follows, we get worse corruption:
- task A: run __slab_free() on object from other struct slab:
- CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
- task A: run slab_alloc_node() with NUMA node constraint:
- fastpath fails (c->slab is NULL)
- call __slab_alloc()
- slub_get_cpu_ptr() (disables preemption)
- enter ___slab_alloc()
- c->slab is NULL: goto new_slab
- slub_percpu_partial() is non-NULL
- set c->slab to slub_percpu_partial(c)
- [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
from slab-2]
- goto redo
- node_match() fails
- goto deactivate_slab
- existing c->freelist is passed into deactivate_slab()
- inuse count of slab-1 is decremented to account for object from
slab-2
At this point, the inuse count of slab-1 is 1 lower than it should be.
This means that if we free all allocated objects in slab-1 except for one,
SLUB will think that slab-1 is completely unused, and may free its page,
leading to use-after-free.
Fixes: c17dda40a6 ("slub: Separate out kmem_cache_cpu processing from deactivate_slab")
Fixes: 03e404af26 ("slub: fast release on full slab")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220608182205.2945720-1-jannh@google.com
The set_track() invocation in free_debug_processing() is invoked with
acquired slab_lock(). The lock disables interrupts on PREEMPT_RT and
this forbids to allocate memory which is done in stack_depot_save().
Split set_track() into two parts: set_track_prepare() which allocate
memory and set_track_update() which only performs the assignment of the
trace data structure. Use set_track_prepare() before disabling
interrupts.
[ vbabka@suse.cz: make set_track() call set_track_update() instead of
open-coded assignments ]
Fixes: 5cf909c553 ("mm/slub: use stackdepot to save stack trace in objects")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/Yp9sqoUi4fVa5ExF@linutronix.de
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmKLUYoACgkQ4CHKc/GJ
qRCMFwf/Tm1cf2JLUANrT58rjkrrj15EtKhnJdm5/yvmsWKps7WKPP4jeUHe+NTO
NovAGt67lG1l6LMLczZkWckOkWlyYjC42CPDLdxRUkk+zQRb3nRA8Nbt6VTNBOfQ
0wTLOqXgsNXdSPSVUsKGL8kIAHNQTMX+7TjO6s7CXy/5Qag6r1iZX2HZxASOHxLa
yYzaJ9pJRZBAMGnzV6L6v0J8KPnjYO0fB68S1qYQTbhoRxchtFF+0AIr1JydGgBI
9RFUowTrSpJkZtcSjabopvZz4JfCRDP+eAxkyw13feji7MG1FMX74HgDdw+HhzTv
R2/6iA5WcsmzcXopsfMx8lUP/KIfPw==
=gnSc
-----END PGP SIGNATURE-----
Merge tag 'slab-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- Conversion of slub_debug stack traces to stackdepot, allowing more
useful debugfs-based inspection for e.g. memory leak debugging.
Allocation and free debugfs info now includes full traces and is
sorted by the unique trace frequency.
The stackdepot conversion was already attempted last year but
reverted by ae14c63a9f. The memory overhead (while not actually
enabled on boot) has been meanwhile solved by making the large
stackdepot allocation dynamic. The xfstest issues haven't been
reproduced on current kernel locally nor in -next, so the slab cache
layout changes that originally made that bug manifest were probably
not the root cause.
- Refactoring of dma-kmalloc caches creation.
- Trivial cleanups such as removal of unused parameters, fixes and
clarifications of comments.
- Hyeonggon Yoo joins as a reviewer.
* tag 'slab-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
MAINTAINERS: add myself as reviewer for slab
mm/slub: remove unused kmem_cache_order_objects max
mm: slab: fix comment for __assume_kmalloc_alignment
mm: slab: fix comment for ARCH_KMALLOC_MINALIGN
mm/slub: remove unneeded return value of slab_pad_check
mm/slab_common: move dma-kmalloc caches creation into new_kmalloc_cache()
mm/slub: remove meaningless node check in ___slab_alloc()
mm/slub: remove duplicate flag in allocate_slab()
mm/slub: remove unused parameter in setup_object*()
mm/slab.c: fix comments
slab, documentation: add description of debugfs files for SLUB caches
mm/slub: sort debugfs output by frequency of stack traces
mm/slub: distinguish and print stack traces in debugfs files
mm/slub: use stackdepot to save stack trace in objects
mm/slub: move struct track init out of set_track()
lib/stackdepot: allow requesting early initialization dynamically
mm/slub, kunit: Make slub_kunit unaffected by user specified flags
mm/slab: remove some unused functions
max field holds the largest slab order that was ever used for a slab cache.
But it's unused now. Remove it.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220429090545.33413-1-linmiaohe@huawei.com
The return value of slab_pad_check is never used. So we can make it return
void now.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220419120352.37825-1-linmiaohe@huawei.com
Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been
producing garbage data due to the object not actually being maintained
by SLAB or SLUB.
Fix this by implementing __kfence_obj_info() that copies relevant
information to struct kmem_obj_info when the object was allocated by
KFENCE; this is called by a common kmem_obj_info(), which also calls the
slab/slub/slob specific variant now called __kmem_obj_info().
For completeness, kmem_dump_obj() now displays if the object was
allocated by KFENCE.
Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com
Fixes: b89fb5ef0c ("mm, kfence: insert KFENCE hooks for SLUB")
Fixes: d3fb45f370 ("mm, kfence: insert KFENCE hooks for SLAB")
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz> [slab]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In allocate_slab(), __GFP_NOFAIL flag is removed twice when trying
higher-order allocation. Remove it.
Signed-off-by: Jiyoup Kim <lakroforce@gmail.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220409150538.1264-1-lakroforce@gmail.com
setup_object_debug() and setup_object() has unused parameter, "struct
slab *slab". Remove it.
By the commit 3ec0974210 ("SLUB: Simplify debug code"),
setup_object_debug() were introduced to refactor previous code blocks
in the setup_object(). Previous code used SlabDebug() to init_object()
and init_tracking(). As the SlabDebug() takes "struct page *page" as
argument, the setup_object_debug() checks flag of "struct kmem_cache *s"
which doesn't require "struct page *page".
As the struct page were changed into struct slab by commit bb192ed9aa
("mm/slub: Convert most struct page to struct slab by spatch"), but it's
still unused parameter.
Suggested-by: Ohhoon Kwon <ohkwon1043@gmail.com>
Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220411072534.3372768-1-jsyoo5b@gmail.com
Sort the output of debugfs alloc_traces and free_traces by the frequency
of allocation/freeing stack traces. Most frequently used stack traces
will be printed first, e.g. for easier memory leak debugging.
Signed-off-by: Oliver Glitta <glittao@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Aggregate objects in slub cache by unique stack trace in addition to
caller address when producing contents of debugfs files alloc_traces and
free_traces in debugfs. Also add the stack traces to the debugfs output.
This makes it much more useful to e.g. debug memory leaks.
Signed-off-by: Oliver Glitta <glittao@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Many stack traces are similar so there are many similar arrays.
Stackdepot saves each unique stack only once.
Replace field addrs in struct track with depot_stack_handle_t handle. Use
stackdepot to save stack trace.
The benefits are smaller memory overhead and possibility to aggregate
per-cache statistics in the following patch using the stackdepot handle
instead of matching stacks manually.
[ vbabka@suse.cz: rebase to 5.17-rc1 and adjust accordingly ]
This was initially merged as commit 788691464c and reverted by commit
ae14c63a9f due to several issues, that should now be fixed.
The problem of unconditional memory overhead by stackdepot has been
addressed by commit 2dba5eb1c7 ("lib/stackdepot: allow optional init
and stack_table allocation by kvmalloc()"), so the dependency on
stackdepot will result in extra memory usage only when a slab cache
tracking is actually enabled, and not for all CONFIG_SLUB_DEBUG builds.
The build failures on some architectures were also addressed, and the
reported issue with xfs/433 test did not reproduce on 5.17-rc1 with this
patch.
Signed-off-by: Oliver Glitta <glittao@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
set_track() either zeroes out the struct track or fills it, depending on
the addr parameter. This is unnecessary as there's only one place that
calls it for the initialization - init_tracking(). We can simply do the
zeroing there, with a single memset() that covers both TRACK_ALLOC and
TRACK_FREE as they are adjacent.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
slub_kunit does not expect other debugging flags to be set when running
tests. When SLAB_RED_ZONE flag is set globally, test fails because the
flag affects number of errors reported.
To make slub_kunit unaffected by user specified debugging flags,
introduce SLAB_NO_USER_FLAGS to ignore them. With this flag, only flags
specified in the code are used and others are ignored.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/Yk0sY9yoJhFEXWOg@hyeyoo
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmI4yWsACgkQ4CHKc/GJ
qRAUhQf8CE5dBIblFU6Nfbiv/GHSu2AoHw8DRseeC7uSLGUzP9h2l1c0zMBzLIrf
ujKQGU/8n2AgUZBwxd59vbi0WHLD7K9qr3Yj6hH0QnTmHiv89GXEnYXrHHAJ506q
OGzT7NUiz5pEHrFyE5yqKQ2axy+Q+JrAfyzHQPEdzNtl0DJurVuHATYz5b9Pk6zW
27PrCIFjQyIkqw2s9oBCYZevAZkiAoxXntzhXicrqksZs4htArzn7LDTulpccnhy
6hwrCSg91E1Tza8oTNOCNvXt/YUGa6Q1RBEeUDmLOLwHSuOIfTPb5zW25vCvpyMK
008OYHcw7tspWnEfAEWqIDwMFWcjDg==
=2g9n
-----END PGP SIGNATURE-----
Merge tag 'slab-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- A few non-trivial SLUB code cleanups, most notably a refactoring of
deactivate_slab().
- A bunch of trivial changes, such as removal of unused parameters,
making stuff static, and employing helper functions.
* tag 'slab-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm: slub: Delete useless parameter of alloc_slab_page()
mm: slab: Delete unused SLAB_DEACTIVATED flag
mm/slub: remove forced_order parameter in calculate_sizes
mm/slub: refactor deactivate_slab()
mm/slub: limit number of node partial slabs only in cache creation
mm/slub: use helper macro __ATTR_XX_MODE for SLAB_ATTR(_RO)
mm/slab_common: use helper function is_power_of_2()
mm/slob: make kmem_cache_boot static
We currently allocate scope for every memcg to be able to tracked on
every superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.
These huge memcg counts come from container hosts where memcgs are
confined to just a small subset of the total number of superblocks that
instantiated at any given point in time.
For these systems with huge container counts, list_lru does not need the
capability of tracking every memcg on every superblock. What it comes
down to is that adding the memcg to the list_lru at the first insert.
So introduce kmem_cache_alloc_lru to allocate objects and its list_lru.
In the later patch, we will convert all inode and dentry allocation from
kmem_cache_alloc to kmem_cache_alloc_lru.
Link: https://lkml.kernel.org/r/20220228122126.37293-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The parameter @s is useless for alloc_slab_page(). It was added in 2014
by commit 5dfb417509 ("sl[au]b: charge slabs to kmemcg explicitly"). The
need for it was removed in 2020 by commit 1f3147b49d ("mm: slub: call
account_slab_page() after slab page initialization"). Let's delete it.
[willy@infradead.org: Added detailed history of @s]
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220310140701.87908-3-sxwjean@me.com
Simplify deactivate_slab() by unlocking n->list_lock and retrying
cmpxchg_double() when cmpxchg_double() fails, and perform
add_{partial,full} only when it succeed.
Releasing and taking n->list_lock again here is not harmful as SLUB
avoids deactivating slabs as much as possible.
[ vbabka@suse.cz: perform add_{partial,full} when cmpxchg_double()
succeed.
count deactivating full slabs even if debugging flag is not set. ]
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220307074057.902222-3-42.hyeyoo@gmail.com
SLUB sets number of minimum partial slabs for node (min_partial)
using set_min_partial(). SLUB holds at least min_partial slabs even if
they're empty to avoid excessive use of page allocator.
set_min_partial() limits value of min_partial limits value of
min_partial MIN_PARTIAL and MAX_PARTIAL. As set_min_partial() can be
called by min_partial_store() too, Only limit value of min_partial
in kmem_cache_open() so that it can be changed to value that a user wants.
[ rientjes@google.com: Fold set_min_partial() into its callers ]
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220307074057.902222-2-42.hyeyoo@gmail.com
This allows more concise code, and VERIFY_OCTAL_PERMISSIONS() can help
validate any future change.
Signed-off-by: Lianjie Zhang <zhanglianjie@uniontech.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://lore.kernel.org/r/20220306073818.15089-1-zhanglianjie@uniontech.com
The fields 'next' and 'slabs' are only used when CONFIG_SLUB_CPU_PARTIAL
is enabled. We can put their definition to #ifdef to prevent accidental
use when disabled.
Currenlty show_slab_objects() and slabs_cpu_partial_show() contain code
accessing the slabs field that's effectively dead with
CONFIG_SLUB_CPU_PARTIAL=n through the wrappers slub_percpu_partial() and
slub_percpu_partial_read_once(), but to prevent a compile error, we need
to hide all this code behind #ifdef.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
KASAN accesses some slab related struct page fields so we need to
convert it to struct slab. Some places are a bit simplified thanks to
kasan_addr_to_slab() encapsulating the PageSlab flag check through
virt_to_slab(). When resolving object address to either a real slab or
a large kmalloc, use struct folio as the intermediate type for testing
the slab flag to avoid unnecessary implicit compound_head().
[ vbabka@suse.cz: use struct folio, adjust to differences in previous
patches ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Tested-by: Hyeongogn Yoo <42.hyeyoo@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <kasan-dev@googlegroups.com>
Update comments mentioning pages to mention slabs where appropriate.
Also some goto labels.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
The majority of conversion from struct page to struct slab in SLUB
internals can be delegated to a coccinelle semantic patch. This includes
renaming of variables with 'page' in name to 'slab', and similar.
Big thanks to Julia Lawall and Luis Chamberlain for help with
coccinelle.
// Options: --include-headers --no-includes --smpl-spacing include/linux/slub_def.h mm/slub.c
// Note: needs coccinelle 1.1.1 to avoid breaking whitespace, and ocaml for the
// embedded script
// build list of functions to exclude from applying the next rule
@initialize:ocaml@
@@
let ok_function p =
not (List.mem (List.hd p).current_element ["nearest_obj";"obj_to_index";"objs_per_slab_page";"__slab_lock";"__slab_unlock";"free_nonslab_page";"kmalloc_large_node"])
// convert the type from struct page to struct page in all functions except the
// list from previous rule
// this also affects struct kmem_cache_cpu, but that's ok
@@
position p : script:ocaml() { ok_function p };
@@
- struct page@p
+ struct slab
// in struct kmem_cache_cpu, change the name from page to slab
// the type was already converted by the previous rule
@@
@@
struct kmem_cache_cpu {
...
-struct slab *page;
+struct slab *slab;
...
}
// there are many places that use c->page which is now c->slab after the
// previous rule
@@
struct kmem_cache_cpu *c;
@@
-c->page
+c->slab
@@
@@
struct kmem_cache {
...
- unsigned int cpu_partial_pages;
+ unsigned int cpu_partial_slabs;
...
}
@@
struct kmem_cache *s;
@@
- s->cpu_partial_pages
+ s->cpu_partial_slabs
@@
@@
static void
- setup_page_debug(
+ setup_slab_debug(
...)
{...}
@@
@@
- setup_page_debug(
+ setup_slab_debug(
...);
// for all functions (with exceptions), change any "struct slab *page"
// parameter to "struct slab *slab" in the signature, and generally all
// occurences of "page" to "slab" in the body - with some special cases.
@@
identifier fn !~ "free_nonslab_page|obj_to_index|objs_per_slab_page|nearest_obj";
@@
fn(...,
- struct slab *page
+ struct slab *slab
,...)
{
<...
- page
+ slab
...>
}
// similar to previous but the param is called partial_page
@@
identifier fn;
@@
fn(...,
- struct slab *partial_page
+ struct slab *partial_slab
,...)
{
<...
- partial_page
+ partial_slab
...>
}
// similar to previous but for functions that take pointer to struct page ptr
@@
identifier fn;
@@
fn(...,
- struct slab **ret_page
+ struct slab **ret_slab
,...)
{
<...
- ret_page
+ ret_slab
...>
}
// functions converted by previous rules that were temporarily called using
// slab_page(E) so we want to remove the wrapper now that they accept struct
// slab ptr directly
@@
identifier fn =~ "slab_free|do_slab_free";
expression E;
@@
fn(...,
- slab_page(E)
+ E
,...)
// similar to previous but for another pattern
@@
identifier fn =~ "slab_pad_check|check_object";
@@
fn(...,
- folio_page(folio, 0)
+ slab
,...)
// functions that were returning struct page ptr and now will return struct
// slab ptr, including slab_page() wrapper removal
@@
identifier fn =~ "allocate_slab|new_slab";
expression E;
@@
static
-struct slab *
+struct slab *
fn(...)
{
<...
- slab_page(E)
+ E
...>
}
// rename any former struct page * declarations
@@
@@
struct slab *
(
- page
+ slab
|
- partial_page
+ partial_slab
|
- oldpage
+ oldslab
)
;
// this has to be separate from previous rule as page and page2 appear at the
// same line
@@
@@
struct slab *
-page2
+slab2
;
// similar but with initial assignment
@@
expression E;
@@
struct slab *
(
- page
+ slab
|
- flush_page
+ flush_slab
|
- discard_page
+ slab_to_discard
|
- page_to_unfreeze
+ slab_to_unfreeze
)
= E;
// convert most of struct page to struct slab usage inside functions (with
// exceptions), including specific variable renames
@@
identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node";
expression E;
@@
fn(...)
{
<...
(
- int pages;
+ int slabs;
|
- int pages = E;
+ int slabs = E;
|
- page
+ slab
|
- flush_page
+ flush_slab
|
- partial_page
+ partial_slab
|
- oldpage->pages
+ oldslab->slabs
|
- oldpage
+ oldslab
|
- unsigned int nr_pages;
+ unsigned int nr_slabs;
|
- nr_pages
+ nr_slabs
|
- unsigned int partial_pages = E;
+ unsigned int partial_slabs = E;
|
- partial_pages
+ partial_slabs
)
...>
}
// this has to be split out from the previous rule so that lines containing
// multiple matching changes will be fully converted
@@
identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node";
@@
fn(...)
{
<...
(
- slab->pages
+ slab->slabs
|
- pages
+ slabs
|
- page2
+ slab2
|
- discard_page
+ slab_to_discard
|
- page_to_unfreeze
+ slab_to_unfreeze
)
...>
}
// after we simply changed all occurences of page to slab, some usages need
// adjustment for slab-specific functions, or use slab_page() wrapper
@@
identifier fn !~ "nearest_obj|obj_to_index|objs_per_slab_page|__slab_(un)*lock|__free_slab|free_nonslab_page|kmalloc_large_node";
@@
fn(...)
{
<...
(
- page_slab(slab)
+ slab
|
- kasan_poison_slab(slab)
+ kasan_poison_slab(slab_page(slab))
|
- page_address(slab)
+ slab_address(slab)
|
- page_size(slab)
+ slab_size(slab)
|
- PageSlab(slab)
+ folio_test_slab(slab_folio(slab))
|
- page_to_nid(slab)
+ slab_nid(slab)
|
- compound_order(slab)
+ slab_order(slab)
)
...>
}
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Preparatory for mass conversion. Use the new slab_test_pfmemalloc()
helper. As it doesn't do VM_BUG_ON(!PageSlab()) we no longer need the
pfmemalloc_match_unsafe() variant.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
__free_slab() is on the boundary of distinguishing struct slab and
struct page so start with struct slab but convert to folio for working
with flags and folio_page() to call functions that require struct page.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Preparatory, callers convert back to struct page for now.
Also move setting page flags to alloc_slab_page() where we still operate
on a struct page. This means the page->slab_cache pointer is now set
later than the PageSlab flag, which could theoretically confuse some pfn
walker assuming PageSlab means there would be a valid cache pointer. But
as the code had no barriers and used __set_bit() anyway, it could have
happened already, so there shouldn't be such a walker.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Improve the type safety and prepare for further conversion. For flags
access, convert to folio internally.
[ vbabka@suse.cz: access flags via folio_flags() ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
These functions operate on the PG_locked page flag, but make them accept
struct slab to encapsulate this implementation detail.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Convert kfree(), kmem_cache_free() and ___cache_free() to resolve object
addresses to struct slab, using folio as intermediate step where needed.
Keep passing the result as struct page for now in preparation for mass
conversion of internal functions.
[ vbabka@suse.cz: Use folio as intermediate step when checking for
large kmalloc pages, and when freeing them - rename
free_nonslab_page() to free_large_kmalloc() that takes struct folio ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
This gives us a little bit of extra typesafety as we know that nobody
called virt_to_page() instead of virt_to_head_page().
[ vbabka@suse.cz: Use folio as intermediate step when filtering out
large kmalloc pages ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
Ensure that we're not seeing a tail page inside __check_heap_object() by
converting to a slab instead of a page. Take the opportunity to mark
the slab as const since we're not modifying it. Also move the
declaration of __check_heap_object() to mm/slab.h so it's not available
to the wider kernel.
[ vbabka@suse.cz: in check_heap_object() only convert to struct slab for
actual PageSlab pages; use folio as intermediate step instead of page ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
All three implementations of slab support kmem_obj_info() which reports
details of an object allocated from the slab allocator. By using the
slab type instead of the page type, we make it obvious that this can
only be called for slabs.
[ vbabka@suse.cz: also convert the related kmem_valid_obj() to folios ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
In SLUB, use folios, and struct slab to access slab_cache field.
In SLOB, use folios to properly resolve pointers beyond
PAGE_SIZE offset of the object.
[ vbabka@suse.cz: use folios, and only convert folio_test_slab() == true
folios to struct slab ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Convert the parameter of these functions to struct slab instead of
struct page and drop _page from the names. For now their callers just
convert page to slab.
[ vbabka@suse.cz: replace existing functions instead of calling them ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Make struct slab independent of struct page. It still uses the
underlying memory in struct page for storing slab-specific data, but
slab and slub can now be weaned off using struct page directly. Some of
the wrapper functions (slab_address() and slab_order()) still need to
cast to struct folio, but this is a significant disentanglement.
[ vbabka@suse.cz: Rebase on folios, use folio instead of page where
possible.
Do not duplicate flags field in struct slab, instead make the related
accessors go through slab_folio(). For testing pfmemalloc use the
folio_*_active flag accessors directly so the PageSlabPfmemalloc
wrappers can be removed later.
Make folio_slab() expect only folio_test_slab() == true folios and
virt_to_slab() return NULL when folio_test_slab() == false.
Move struct slab to mm/slab.h.
Don't represent with struct slab pages that are not true slab pages,
but just a compound page obtained directly rom page allocator (with
large kmalloc() for SLUB and SLOB). ]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
There are no callers outside of mm/slub.c anymore.
Move freelist_corrupted() that calls object_err() to avoid a need for
forward declaration.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <guro@fb.com>
On big-endian s390, the alloc/free_traces attributes produce endless
output, because of always 0 idx in slab_debugfs_show().
idx is de-referenced from *v, which points to a loff_t value, with
unsigned int idx = *(unsigned int *)v;
This will only give the upper 32 bits on big-endian, which remain 0.
Instead of only fixing this de-reference, during discussion it seemed
more appropriate to change the seq_ops so that they use an explicit
iterator in private loc_track struct.
This patch adds idx to loc_track, which will also fix the endianness
bug.
Link: https://lore.kernel.org/r/20211117193932.4049412-1-gerald.schaefer@linux.ibm.com
Link: https://lkml.kernel.org/r/20211126171848.17534-1-gerald.schaefer@linux.ibm.com
Fixes: 64dd68497b ("mm: slub: move sysfs slab alloc/free interfaces to debugfs")
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reported-by: Steffen Maier <maier@linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After the memory is freed, it can be immediately allocated by other
CPUs, before the "free" trace report has been emitted. This causes
inaccurate traces.
For example, if the following sequence of events occurs:
CPU 0 CPU 1
(1) alloc xxxxxx
(2) free xxxxxx
(3) alloc xxxxxx
(4) free xxxxxx
Then they will be inaccurately reported via tracing, so that they appear
to have happened in this order:
CPU 0 CPU 1
(1) alloc xxxxxx
(2) alloc xxxxxx
(3) free xxxxxx
(4) free xxxxxx
This makes it look like CPU 1 somehow managed to allocate memory that
CPU 0 still had allocated for itself.
In order to avoid this, emit the "free xxxxxx" tracing report just
before the actual call to free the memory, instead of just after it.
Link: https://lkml.kernel.org/r/374eb75d-7404-8721-4e1e-65b0e5b17279@huawei.com
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge misc updates from Andrew Morton:
"257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and
mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache,
gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc,
pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools,
memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm,
vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram,
cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits)
mm/damon: remove return value from before_terminate callback
mm/damon: fix a few spelling mistakes in comments and a pr_debug message
mm/damon: simplify stop mechanism
Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
Docs/admin-guide/mm/damon/start: simplify the content
Docs/admin-guide/mm/damon/start: fix a wrong link
Docs/admin-guide/mm/damon/start: fix wrong example commands
mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
mm/damon: remove unnecessary variable initialization
Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
selftests/damon: support watermarks
mm/damon/dbgfs: support watermarks
mm/damon/schemes: activate schemes based on a watermarks mechanism
tools/selftests/damon: update for regions prioritization of schemes
mm/damon/dbgfs: support prioritization weights
mm/damon/vaddr,paddr: support pageout prioritization
mm/damon/schemes: prioritize regions within the quotas
mm/damon/selftests: support schemes quotas
mm/damon/dbgfs: support quotas of schemes
...
This has served its purpose and is no longer used. All usercopy
violations appear to have been handled by now, any remaining instances
(or new bugs) will cause copies to be rejected.
This isn't a direct revert of commit 2d891fbc3b ("usercopy: Allow
strict enforcement of whitelists"); since usercopy_fallback is
effectively 0, the fallback handling is removed too.
This also removes the usercopy_fallback module parameter on slab_common.
Link: https://github.com/KSPP/linux/issues/153
Link: https://lkml.kernel.org/r/20210921061149.1091163-1-steve@sk2.org
Signed-off-by: Stephen Kitt <steve@sk2.org>
Suggested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Stanley <joel@jms.id.au> [defconfig change]
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E . Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0ad9500e16 ("slub: prefetch next freelist pointer in
slab_alloc()") introduced prefetch_freepointer() because when other
cpu(s) freed objects into a page that current cpu owns, the freelist
link is hot on cpu(s) which freed objects and possibly very cold on
current cpu.
But if freelist link chain is hot on cpu(s) which freed objects, it's
better to invalidate that chain because they're not going to access
again within a short time.
So use prefetchw instead of prefetch. On supported architectures like
x86 and arm, it invalidates other copied instances of a cache line when
prefetching it.
Before:
Time: 91.677
Performance counter stats for 'hackbench -g 100 -l 10000':
1462938.07 msec cpu-clock # 15.908 CPUs utilized
18072550 context-switches # 12.354 K/sec
1018814 cpu-migrations # 696.416 /sec
104558 page-faults # 71.471 /sec
1580035699271 cycles # 1.080 GHz (54.51%)
2003670016013 instructions # 1.27 insn per cycle (54.31%)
5702204863 branch-misses (54.28%)
643368500985 cache-references # 439.778 M/sec (54.26%)
18475582235 cache-misses # 2.872 % of all cache refs (54.28%)
642206796636 L1-dcache-loads # 438.984 M/sec (46.87%)
18215813147 L1-dcache-load-misses # 2.84% of all L1-dcache accesses (46.83%)
653842996501 dTLB-loads # 446.938 M/sec (46.63%)
3227179675 dTLB-load-misses # 0.49% of all dTLB cache accesses (46.85%)
537531951350 iTLB-loads # 367.433 M/sec (54.33%)
114750630 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.37%)
630135543177 L1-icache-loads # 430.733 M/sec (46.80%)
22923237620 L1-icache-load-misses # 3.64% of all L1-icache accesses (46.76%)
91.964452802 seconds time elapsed
43.416742000 seconds user
1422.441123000 seconds sys
After:
Time: 90.220
Performance counter stats for 'hackbench -g 100 -l 10000':
1437418.48 msec cpu-clock # 15.880 CPUs utilized
17694068 context-switches # 12.310 K/sec
958257 cpu-migrations # 666.651 /sec
100604 page-faults # 69.989 /sec
1583259429428 cycles # 1.101 GHz (54.57%)
2004002484935 instructions # 1.27 insn per cycle (54.37%)
5594202389 branch-misses (54.36%)
643113574524 cache-references # 447.409 M/sec (54.39%)
18233791870 cache-misses # 2.835 % of all cache refs (54.37%)
640205852062 L1-dcache-loads # 445.386 M/sec (46.75%)
17968160377 L1-dcache-load-misses # 2.81% of all L1-dcache accesses (46.79%)
651747432274 dTLB-loads # 453.415 M/sec (46.59%)
3127124271 dTLB-load-misses # 0.48% of all dTLB cache accesses (46.75%)
535395273064 iTLB-loads # 372.470 M/sec (54.38%)
113500056 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.35%)
628871845924 L1-icache-loads # 437.501 M/sec (46.80%)
22585641203 L1-icache-load-misses # 3.59% of all L1-icache accesses (46.79%)
90.514819303 seconds time elapsed
43.877656000 seconds user
1397.176001000 seconds sys
Link: https://lkml.org/lkml/2021/10/8/598=20
Link: https://lkml.kernel.org/r/20211011144331.70084-1-42.hyeyoo@gmail.com
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The defaults are determined based on object size and can go up to 30 for
objects smaller than 256 bytes. Before the previous patch changed the
accounting, this could have made cpu partial list contain up to 30
pages. After that patch, only up to 2 pages with default allocation
order.
Very short lists limit the usefulness of the whole concept of cpu
partial lists, so this patch aims at a more reasonable default under the
new accounting. The defaults are quadrupled, except for object size >=
PAGE_SIZE where it's doubled. This makes the lists grow up to 10 pages
in practice.
A quick test of booting a kernel under virtme with 4GB RAM and 8 vcpus
shows the following slab memory usage after boot:
Before previous patch (using page->pobjects):
Slab: 36732 kB
SReclaimable: 14836 kB
SUnreclaim: 21896 kB
After previous patch (using page->pages):
Slab: 34720 kB
SReclaimable: 13716 kB
SUnreclaim: 21004 kB
After this patch (using page->pages, higher defaults):
Slab: 35252 kB
SReclaimable: 13944 kB
SUnreclaim: 21308 kB
In the same setup, I also ran 5 times:
hackbench -l 16000 -g 16
Differences in time were in the noise, we can compare slub stats as
given by slabinfo -r skbuff_head_cache (the other cache heavily used by
hackbench, kmalloc-cg-512 looks similar). Negligible stats left out for
brevity.
Before previous patch (using page->pobjects):
Objects: 1408, Memory Total: 401408 Used : 304128
Slab Perf Counter Alloc Free %Al %Fr
--------------------------------------------------
Fastpath 469952498 5946606 91 1
Slowpath 42053573 506059465 8 98
Page Alloc 41093 41044 0 0
Add partial 18 21229327 0 4
Remove partial 20039522 36051 3 0
Cpu partial list 4686640 24767229 0 4
RemoteObj/SlabFrozen 16 124027841 0 24
Total 512006071 512006071
Flushes 18
Slab Deactivation Occurrences %
-------------------------------------------------
Slab empty 4993 0%
Deactivation bypass 24767229 99%
Refilled from foreign frees 21972674 88%
After previous patch (using page->pages):
Objects: 480, Memory Total: 131072 Used : 103680
Slab Perf Counter Alloc Free %Al %Fr
--------------------------------------------------
Fastpath 473016294 5405653 92 1
Slowpath 38989777 506600418 7 98
Page Alloc 32717 32701 0 0
Add partial 3 22749164 0 4
Remove partial 11371127 32474 2 0
Cpu partial list 11686226 23090059 2 4
RemoteObj/SlabFrozen 2 67541803 0 13
Total 512006071 512006071
Flushes 3
Slab Deactivation Occurrences %
-------------------------------------------------
Slab empty 227 0%
Deactivation bypass 23090059 99%
Refilled from foreign frees 27585695 119%
After this patch (using page->pages, higher defaults):
Objects: 896, Memory Total: 229376 Used : 193536
Slab Perf Counter Alloc Free %Al %Fr
--------------------------------------------------
Fastpath 473799295 4980278 92 0
Slowpath 38206776 507025793 7 99
Page Alloc 32295 32267 0 0
Add partial 11 23291143 0 4
Remove partial 5815764 31278 1 0
Cpu partial list 18119280 23967320 3 4
RemoteObj/SlabFrozen 10 76974794 0 15
Total 512006071 512006071
Flushes 11
Slab Deactivation Occurrences %
-------------------------------------------------
Slab empty 989 0%
Deactivation bypass 23967320 99%
Refilled from foreign frees 32358473 135%
As expected, memory usage dropped significantly with change of
accounting, increasing the defaults increased it, but not as much. The
number of page allocation/frees dropped significantly with the new
accounting, but didn't increase with the higher defaults.
Interestingly, the number of fasthpath allocations increased, as well as
allocations from the cpu partial list, even though it's shorter.
Link: https://lkml.kernel.org/r/20211012134651.11258-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of
partial slabs that can be promoted to cpu slab when the previous one is
depleted, without accessing the shared partial list. A slab can be
added to this list by 1) refill of an empty list from get_partial_node()
- once we really have to access the shared partial list, we acquire
multiple slabs to amortize the cost of locking, and 2) first free to a
previously full slab - instead of putting the slab on a shared partial
list, we can more cheaply freeze it and put it on the per-cpu list.
To control how large a percpu partial list can grow for a kmem cache,
set_cpu_partial() calculates a target number of free objects on each
cpu's percpu partial list, and this can be also set by the sysfs file
cpu_partial.
However, the tracking of actual number of objects is imprecise, in order
to limit overhead from cpu X freeing an objects to a slab on percpu
partial list of cpu Y. Basically, the percpu partial slabs form a
single linked list, and when we add a new slab to the list with current
head "oldpage", we set in the struct page of the slab we're adding:
page->pages = oldpage->pages + 1; // this is precise
page->pobjects = oldpage->pobjects + (page->objects - page->inuse);
page->next = oldpage;
Thus the real number of free objects in the slab (objects - inuse) is
only determined at the moment of adding the slab to the percpu partial
list, and further freeing doesn't update the pobjects counter nor
propagate it to the current list head. As Jann reports [1], this can
easily lead to large inaccuracies, where the target number of objects
(up to 30 by default) can translate to the same number of (empty) slab
pages on the list. In case 2) above, we put a slab with 1 free object
on the list, thus only increase page->pobjects by 1, even if there are
subsequent frees on the same slab. Jann has noticed this in practice
and so did we [2] when investigating significant increase of kmemcg
usage after switching from SLAB to SLUB.
While this is no longer a problem in kmemcg context thanks to the
accounting rewrite in 5.9, the memory waste is still not ideal and it's
questionable whether it makes sense to perform free object count based
control when object counts can easily become so much inaccurate. So
this patch converts the accounting to be based on number of pages only
(which is precise) and removes the page->pobjects field completely.
This is also ultimately simpler.
To retain the existing set_cpu_partial() heuristic, first calculate the
target number of objects as previously, but then convert it to target
number of pages by assuming the pages will be half-filled on average.
This assumption might obviously also be inaccurate in practice, but
cannot degrade to actual number of pages being equal to the target
number of objects.
We could also skip the intermediate step with target number of objects
and rewrite the heuristic in terms of pages. However we still have the
sysfs file cpu_partial which uses number of objects and could break
existing users if it suddenly becomes number of pages, so this patch
doesn't do that.
In practice, after this patch the heuristics limit the size of percpu
partial list up to 2 pages. In case of a reported regression (which
would mean some workload has benefited from the previous imprecise
object based counting), we can tune the heuristics to get a better
compromise within the new scheme, while still avoid the unexpectedly
long percpu partial lists.
[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/
[2] https://lore.kernel.org/all/2f0f46e8-2535-410a-1859-e9cfa4e57c18@suse.cz/
==========
Evaluation
==========
Mel was kind enough to run v1 through mmtests machinery for netperf
(localhost) and hackbench and, for most significant results see below.
So there are some apparent regressions, especially with hackbench, which
I think ultimately boils down to having shorter percpu partial lists on
average and some benchmarks benefiting from longer ones. Monitoring
slab usage also indicated less memory usage by slab. Based on that, the
following patch will bump the defaults to allow longer percpu partial
lists than after this patch.
However the goal is certainly not such that we would limit the percpu
partial lists to 30 pages just because previously a specific alloc/free
pattern could lead to the limit of 30 objects translate to a limit to 30
pages - that would make little sense. This is a correctness patch, and
if a workload benefits from larger lists, the sysfs tuning knobs are
still there to allow that.
Netperf
2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads per socket), 384GB RAM
TCP-RR:
hmean before 127045.79 after 121092.94 (-4.69%, worse)
stddev before 2634.37 after 1254.08
UDP-RR:
hmean before 166985.45 after 160668.94 ( -3.78%, worse)
stddev before 4059.69 after 1943.63
2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads per socket), 512GB RAM
TCP-RR:
hmean before 84173.25 after 76914.72 ( -8.62%, worse)
UDP-RR:
hmean before 93571.12 after 96428.69 ( 3.05%, better)
stddev before 23118.54 after 16828.14
2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads per socket), 64GB RAM
TCP-RR:
hmean before 49984.92 after 48922.27 ( -2.13%, worse)
stddev before 6248.15 after 4740.51
UDP-RR:
hmean before 61854.31 after 68761.81 ( 11.17%, better)
stddev before 4093.54 after 5898.91
other machines - within 2%
Hackbench
(results before and after the patch, negative % means worse)
2-socket AMD EPYC 7713 (64 cores, 128 threads per core), 256GB RAM
hackbench-process-sockets
Amean 1 0.5380 0.5583 ( -3.78%)
Amean 4 0.7510 0.8150 ( -8.52%)
Amean 7 0.7930 0.9533 ( -20.22%)
Amean 12 0.7853 1.1313 ( -44.06%)
Amean 21 1.1520 1.4993 ( -30.15%)
Amean 30 1.6223 1.9237 ( -18.57%)
Amean 48 2.6767 2.9903 ( -11.72%)
Amean 79 4.0257 5.1150 ( -27.06%)
Amean 110 5.5193 7.4720 ( -35.38%)
Amean 141 7.2207 9.9840 ( -38.27%)
Amean 172 8.4770 12.1963 ( -43.88%)
Amean 203 9.6473 14.3137 ( -48.37%)
Amean 234 11.3960 18.7917 ( -64.90%)
Amean 265 13.9627 22.4607 ( -60.86%)
Amean 296 14.9163 26.0483 ( -74.63%)
hackbench-thread-sockets
Amean 1 0.5597 0.5877 ( -5.00%)
Amean 4 0.7913 0.8960 ( -13.23%)
Amean 7 0.8190 1.0017 ( -22.30%)
Amean 12 0.9560 1.1727 ( -22.66%)
Amean 21 1.7587 1.5660 ( 10.96%)
Amean 30 2.4477 1.9807 ( 19.08%)
Amean 48 3.4573 3.0630 ( 11.41%)
Amean 79 4.7903 5.1733 ( -8.00%)
Amean 110 6.1370 7.4220 ( -20.94%)
Amean 141 7.5777 9.2617 ( -22.22%)
Amean 172 9.2280 11.0907 ( -20.18%)
Amean 203 10.2793 13.3470 ( -29.84%)
Amean 234 11.2410 17.1070 ( -52.18%)
Amean 265 12.5970 23.3323 ( -85.22%)
Amean 296 17.1540 24.2857 ( -41.57%)
2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads
per socket), 384GB RAM
hackbench-process-sockets
Amean 1 0.5760 0.4793 ( 16.78%)
Amean 4 0.9430 0.9707 ( -2.93%)
Amean 7 1.5517 1.8843 ( -21.44%)
Amean 12 2.4903 2.7267 ( -9.49%)
Amean 21 3.9560 4.2877 ( -8.38%)
Amean 30 5.4613 5.8343 ( -6.83%)
Amean 48 8.5337 9.2937 ( -8.91%)
Amean 79 14.0670 15.2630 ( -8.50%)
Amean 110 19.2253 21.2467 ( -10.51%)
Amean 141 23.7557 25.8550 ( -8.84%)
Amean 172 28.4407 29.7603 ( -4.64%)
Amean 203 33.3407 33.9927 ( -1.96%)
Amean 234 38.3633 39.1150 ( -1.96%)
Amean 265 43.4420 43.8470 ( -0.93%)
Amean 296 48.3680 48.9300 ( -1.16%)
hackbench-thread-sockets
Amean 1 0.6080 0.6493 ( -6.80%)
Amean 4 1.0000 1.0513 ( -5.13%)
Amean 7 1.6607 2.0260 ( -22.00%)
Amean 12 2.7637 2.9273 ( -5.92%)
Amean 21 5.0613 4.5153 ( 10.79%)
Amean 30 6.3340 6.1140 ( 3.47%)
Amean 48 9.0567 9.5577 ( -5.53%)
Amean 79 14.5657 15.7983 ( -8.46%)
Amean 110 19.6213 21.6333 ( -10.25%)
Amean 141 24.1563 26.2697 ( -8.75%)
Amean 172 28.9687 30.2187 ( -4.32%)
Amean 203 33.9763 34.6970 ( -2.12%)
Amean 234 38.8647 39.3207 ( -1.17%)
Amean 265 44.0813 44.1507 ( -0.16%)
Amean 296 49.2040 49.4330 ( -0.47%)
2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads
per socket), 512GB RAM
hackbench-process-sockets
Amean 1 0.5027 0.5017 ( 0.20%)
Amean 4 1.1053 1.2033 ( -8.87%)
Amean 7 1.8760 2.1820 ( -16.31%)
Amean 12 2.9053 3.1810 ( -9.49%)
Amean 21 4.6777 4.9920 ( -6.72%)
Amean 30 6.5180 6.7827 ( -4.06%)
Amean 48 10.0710 10.5227 ( -4.48%)
Amean 79 16.4250 17.5053 ( -6.58%)
Amean 110 22.6203 24.4617 ( -8.14%)
Amean 141 28.0967 31.0363 ( -10.46%)
Amean 172 34.4030 36.9233 ( -7.33%)
Amean 203 40.5933 43.0850 ( -6.14%)
Amean 234 46.6477 48.7220 ( -4.45%)
Amean 265 53.0530 53.9597 ( -1.71%)
Amean 296 59.2760 59.9213 ( -1.09%)
hackbench-thread-sockets
Amean 1 0.5363 0.5330 ( 0.62%)
Amean 4 1.1647 1.2157 ( -4.38%)
Amean 7 1.9237 2.2833 ( -18.70%)
Amean 12 2.9943 3.3110 ( -10.58%)
Amean 21 4.9987 5.1880 ( -3.79%)
Amean 30 6.7583 7.0043 ( -3.64%)
Amean 48 10.4547 10.8353 ( -3.64%)
Amean 79 16.6707 17.6790 ( -6.05%)
Amean 110 22.8207 24.4403 ( -7.10%)
Amean 141 28.7090 31.0533 ( -8.17%)
Amean 172 34.9387 36.8260 ( -5.40%)
Amean 203 41.1567 43.0450 ( -4.59%)
Amean 234 47.3790 48.5307 ( -2.43%)
Amean 265 53.9543 54.6987 ( -1.38%)
Amean 296 60.0820 60.2163 ( -0.22%)
1-socket Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 cores, 8 threads),
32 GB RAM
hackbench-process-sockets
Amean 1 1.4760 1.5773 ( -6.87%)
Amean 3 3.9370 4.0910 ( -3.91%)
Amean 5 6.6797 6.9357 ( -3.83%)
Amean 7 9.3367 9.7150 ( -4.05%)
Amean 12 15.7627 16.1400 ( -2.39%)
Amean 18 23.5360 23.6890 ( -0.65%)
Amean 24 31.0663 31.3137 ( -0.80%)
Amean 30 38.7283 39.0037 ( -0.71%)
Amean 32 41.3417 41.6097 ( -0.65%)
hackbench-thread-sockets
Amean 1 1.5250 1.6043 ( -5.20%)
Amean 3 4.0897 4.2603 ( -4.17%)
Amean 5 6.7760 7.0933 ( -4.68%)
Amean 7 9.4817 9.9157 ( -4.58%)
Amean 12 15.9610 16.3937 ( -2.71%)
Amean 18 23.9543 24.3417 ( -1.62%)
Amean 24 31.4400 31.7217 ( -0.90%)
Amean 30 39.2457 39.5467 ( -0.77%)
Amean 32 41.8267 42.1230 ( -0.71%)
2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads
per socket), 64GB RAM
hackbench-process-sockets
Amean 1 1.0347 1.0880 ( -5.15%)
Amean 4 1.7267 1.8527 ( -7.30%)
Amean 7 2.6707 2.8110 ( -5.25%)
Amean 12 4.1617 4.3383 ( -4.25%)
Amean 21 7.0070 7.2600 ( -3.61%)
Amean 30 9.9187 10.2397 ( -3.24%)
Amean 48 15.6710 16.3923 ( -4.60%)
Amean 79 24.7743 26.1247 ( -5.45%)
Amean 110 34.3000 35.9307 ( -4.75%)
Amean 141 44.2043 44.8010 ( -1.35%)
Amean 172 54.2430 54.7260 ( -0.89%)
Amean 192 60.6557 60.9777 ( -0.53%)
hackbench-thread-sockets
Amean 1 1.0610 1.1353 ( -7.01%)
Amean 4 1.7543 1.9140 ( -9.10%)
Amean 7 2.7840 2.9573 ( -6.23%)
Amean 12 4.3813 4.4937 ( -2.56%)
Amean 21 7.3460 7.5350 ( -2.57%)
Amean 30 10.2313 10.5190 ( -2.81%)
Amean 48 15.9700 16.5940 ( -3.91%)
Amean 79 25.3973 26.6637 ( -4.99%)
Amean 110 35.1087 36.4797 ( -3.91%)
Amean 141 45.8220 46.3053 ( -1.05%)
Amean 172 55.4917 55.7320 ( -0.43%)
Amean 192 62.7490 62.5410 ( 0.33%)
Link: https://lkml.kernel.org/r/20211012134651.11258-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Jann Horn <jannh@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After commit f227f0faf6 ("slub: fix unreclaimable slab stat for bulk
free"), the check for free nonslab page is replaced by VM_BUG_ON_PAGE,
which only check with CONFIG_DEBUG_VM enabled, but this config may
impact performance, so it only for debug.
Commit 0937502af7 ("slub: Add check for kfree() of non slab objects.")
add the ability, which should be needed in any configs to catch the
invalid free, they even could be potential issue, eg, memory corruption,
use after free and double free, so replace VM_BUG_ON_PAGE to
WARN_ON_ONCE, add object address printing to help use to debug the
issue.
Link: https://lkml.kernel.org/r/20210930070214.61499-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rienjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmGBCBkACgkQUqAMR0iA
lPLMdg/6Ag9V5Q6DPvbYe0WK8wfrrRL39Eic+K6wrYBVK/8rvMUy4Oee5tyOqCz7
z9GM+SivWRtEdEy8X/HzoawMQEuy3jLcaFoCNxHcScmc6R5Sd8otxPU5Lo8aZPLN
Pulni9EprysI2zhLqq5m6o/F9pMOY0y8uKbD1mgIHEV9yoLan+CZ+vahf/eFwYQu
NtYlMoK2KbS2mChGOZuLsthhyNxcCNFWWNwpBBQz7iJ9ZvnKCZ3EwG7Nx34Rx7ZE
TYZ2iga3TTONsoCk0IClbA6zRIowgumKQl9aY9Oci1MXdIEug42i0GEl+p4iCkrH
VhLyPsvJG6xyE6aCg/p2SB1vPasY+pp94VfTjFfmMulYdUHK7ipfZCR3ddxayR4B
PEsITibo/hHYEVerMMSyVXttiPS7qFhIyZkNuX/xpCMLz8RSFjgU5QhR848A4scM
r+qv1p7xkdBRvH3jlStrpLRnGtqOucvbNQgyvQiinm0yunpJN8FZgEsHnP60E5+j
DLpQF/bK2h7PhE2Wy8/iINi49/dZiIldZ1gZV4BxjuJ5zwSLdiuR9aP51RK4IRhV
qraLwU6yNv0k4v6sjXV78inQQ2vkqy/MBYMe3zqnpYbJB2DZYCbeRE62whrdEd4W
wxHxiY7r9dR6gtJB52kGepbryd3JIMdI49oFRjvGi2shaXG1AZ0=
=t12m
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
- Extend %pGp print format to print hex value of the page flags
- Use kvmalloc instead of kmalloc to allocate devkmsg buffers
- Misc cleanup and warning fixes
* tag 'printk-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
vsprintf: Update %pGp documentation about that it prints hex value
lib/vsprintf.c: Amend static asserts for format specifier flags
vsprintf: Make %pGp print the hex value
test_printf: Append strings more efficiently
test_printf: Remove custom appending of '|'
test_printf: Remove separate page_flags variable
test_printf: Make pft array const
ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK
printk: use gnu_printf format attribute for printk_sprint()
printk: avoid -Wsometimes-uninitialized warning
printk: use kvmalloc instead of kmalloc for devkmsg_user