From 78eb00cc574d3dbf8e6bed804948a89e8110a064 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Thu, 15 Oct 2009 02:20:22 -0700 Subject: [PATCH 1/6] slub: allow stats to be cleared When collecting slub stats for particular workloads, it's necessary to collect each statistic for all caches before the job is even started because the counters are usually greater than zero just from boot and initialization. This allows a statistic to be cleared on each cpu by writing '0' to its sysfs file. This creates a baseline for statistics of interest before the workload is started. Setting a statistic to a particular value is not supported, so all values written to these files other than '0' returns -EINVAL. Cc: Christoph Lameter Signed-off-by: David Rientjes Signed-off-by: Pekka Enberg --- Documentation/ABI/testing/sysfs-kernel-slab | 109 +++++++++++--------- mm/slub.c | 18 +++- 2 files changed, 75 insertions(+), 52 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab index 6dcf75e594fb..8b093f8222d3 100644 --- a/Documentation/ABI/testing/sysfs-kernel-slab +++ b/Documentation/ABI/testing/sysfs-kernel-slab @@ -45,8 +45,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The alloc_fastpath file is read-only and specifies how many - objects have been allocated using the fast path. + The alloc_fastpath file shows how many objects have been + allocated using the fast path. It can be written to clear the + current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/alloc_from_partial @@ -55,9 +56,10 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The alloc_from_partial file is read-only and specifies how - many times a cpu slab has been full and it has been refilled - by using a slab from the list of partially used slabs. + The alloc_from_partial file shows how many times a cpu slab has + been full and it has been refilled by using a slab from the list + of partially used slabs. It can be written to clear the current + count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/alloc_refill @@ -66,9 +68,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The alloc_refill file is read-only and specifies how many - times the per-cpu freelist was empty but there were objects - available as the result of remote cpu frees. + The alloc_refill file shows how many times the per-cpu freelist + was empty but there were objects available as the result of + remote cpu frees. It can be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/alloc_slab @@ -77,8 +79,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The alloc_slab file is read-only and specifies how many times - a new slab had to be allocated from the page allocator. + The alloc_slab file is shows how many times a new slab had to + be allocated from the page allocator. It can be written to + clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/alloc_slowpath @@ -87,9 +90,10 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The alloc_slowpath file is read-only and specifies how many - objects have been allocated using the slow path because of a - refill or allocation from a partial or new slab. + The alloc_slowpath file shows how many objects have been + allocated using the slow path because of a refill or + allocation from a partial or new slab. It can be written to + clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/cache_dma @@ -117,10 +121,11 @@ KernelVersion: 2.6.31 Contact: Pekka Enberg , Christoph Lameter Description: - The file cpuslab_flush is read-only and specifies how many - times a cache's cpu slabs have been flushed as the result of - destroying or shrinking a cache, a cpu going offline, or as - the result of forcing an allocation from a certain node. + The file cpuslab_flush shows how many times a cache's cpu slabs + have been flushed as the result of destroying or shrinking a + cache, a cpu going offline, or as the result of forcing an + allocation from a certain node. It can be written to clear the + current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/ctor @@ -139,8 +144,8 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The file deactivate_empty is read-only and specifies how many - times an empty cpu slab was deactivated. + The deactivate_empty file shows how many times an empty cpu slab + was deactivated. It can be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/deactivate_full @@ -149,8 +154,8 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The file deactivate_full is read-only and specifies how many - times a full cpu slab was deactivated. + The deactivate_full file shows how many times a full cpu slab + was deactivated. It can be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/deactivate_remote_frees @@ -159,9 +164,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The file deactivate_remote_frees is read-only and specifies how - many times a cpu slab has been deactivated and contained free - objects that were freed remotely. + The deactivate_remote_frees file shows how many times a cpu slab + has been deactivated and contained free objects that were freed + remotely. It can be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/deactivate_to_head @@ -170,9 +175,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The file deactivate_to_head is read-only and specifies how - many times a partial cpu slab was deactivated and added to the - head of its node's partial list. + The deactivate_to_head file shows how many times a partial cpu + slab was deactivated and added to the head of its node's partial + list. It can be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/deactivate_to_tail @@ -181,9 +186,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The file deactivate_to_tail is read-only and specifies how - many times a partial cpu slab was deactivated and added to the - tail of its node's partial list. + The deactivate_to_tail file shows how many times a partial cpu + slab was deactivated and added to the tail of its node's partial + list. It can be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/destroy_by_rcu @@ -201,9 +206,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The file free_add_partial is read-only and specifies how many - times an object has been freed in a full slab so that it had to - added to its node's partial list. + The free_add_partial file shows how many times an object has + been freed in a full slab so that it had to added to its node's + partial list. It can be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/free_calls @@ -222,9 +227,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The free_fastpath file is read-only and specifies how many - objects have been freed using the fast path because it was an - object from the cpu slab. + The free_fastpath file shows how many objects have been freed + using the fast path because it was an object from the cpu slab. + It can be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/free_frozen @@ -233,9 +238,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The free_frozen file is read-only and specifies how many - objects have been freed to a frozen slab (i.e. a remote cpu - slab). + The free_frozen file shows how many objects have been freed to + a frozen slab (i.e. a remote cpu slab). It can be written to + clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/free_remove_partial @@ -244,9 +249,10 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The file free_remove_partial is read-only and specifies how - many times an object has been freed to a now-empty slab so - that it had to be removed from its node's partial list. + The free_remove_partial file shows how many times an object has + been freed to a now-empty slab so that it had to be removed from + its node's partial list. It can be written to clear the current + count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/free_slab @@ -255,8 +261,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The free_slab file is read-only and specifies how many times an - empty slab has been freed back to the page allocator. + The free_slab file shows how many times an empty slab has been + freed back to the page allocator. It can be written to clear + the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/free_slowpath @@ -265,9 +272,9 @@ KernelVersion: 2.6.25 Contact: Pekka Enberg , Christoph Lameter Description: - The free_slowpath file is read-only and specifies how many - objects have been freed using the slow path (i.e. to a full or - partial slab). + The free_slowpath file shows how many objects have been freed + using the slow path (i.e. to a full or partial slab). It can + be written to clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/hwcache_align @@ -346,10 +353,10 @@ KernelVersion: 2.6.26 Contact: Pekka Enberg , Christoph Lameter Description: - The file order_fallback is read-only and specifies how many - times an allocation of a new slab has not been possible at the - cache's order and instead fallen back to its minimum possible - order. + The order_fallback file shows how many times an allocation of a + new slab has not been possible at the cache's order and instead + fallen back to its minimum possible order. It can be written to + clear the current count. Available when CONFIG_SLUB_STATS is enabled. What: /sys/kernel/slab/cache/partial diff --git a/mm/slub.c b/mm/slub.c index 4996fc719552..ac0ca4c0d054 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4371,12 +4371,28 @@ static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si) return len + sprintf(buf + len, "\n"); } +static void clear_stat(struct kmem_cache *s, enum stat_item si) +{ + int cpu; + + for_each_online_cpu(cpu) + get_cpu_slab(s, cpu)->stat[si] = 0; +} + #define STAT_ATTR(si, text) \ static ssize_t text##_show(struct kmem_cache *s, char *buf) \ { \ return show_stat(s, buf, si); \ } \ -SLAB_ATTR_RO(text); \ +static ssize_t text##_store(struct kmem_cache *s, \ + const char *buf, size_t length) \ +{ \ + if (buf[0] != '0') \ + return -EINVAL; \ + clear_stat(s, si); \ + return length; \ +} \ +SLAB_ATTR(text); \ STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath); STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath); From 74e2134ff892ee4ea4fbd52637060b71e540faf1 Mon Sep 17 00:00:00 2001 From: Pekka Enberg Date: Wed, 25 Nov 2009 20:14:48 +0200 Subject: [PATCH 2/6] SLUB: Fix __GFP_ZERO unlikely() annotation The unlikely() annotation in slab_alloc() covers too much of the expression. It's actually very likely that the object is not NULL so use unlikely() only for the __GFP_ZERO expression like SLAB does. The patch reduces kernel text by 29 bytes on x86-64: text data bss dec hex filename 24185 8560 176 32921 8099 mm/slub.o.orig 24156 8560 176 32892 807c mm/slub.o Acked-by: Christoph Lameter Signed-off-by: Pekka Enberg --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slub.c b/mm/slub.c index 4996fc719552..0956396faed1 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1735,7 +1735,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, } local_irq_restore(flags); - if (unlikely((gfpflags & __GFP_ZERO) && object)) + if (unlikely(gfpflags & __GFP_ZERO) && object) memset(object, 0, objsize); kmemcheck_slab_alloc(s, gfpflags, object, c->objsize); From ce79ddc8e2376a9a93c7d42daf89bfcbb9187e62 Mon Sep 17 00:00:00 2001 From: Pekka Enberg Date: Mon, 23 Nov 2009 22:01:15 +0200 Subject: [PATCH 3/6] SLAB: Fix lockdep annotations for CPU hotplug As reported by Paul McKenney: I am seeing some lockdep complaints in rcutorture runs that include frequent CPU-hotplug operations. The tests are otherwise successful. My first thought was to send a patch that gave each array_cache structure's ->lock field its own struct lock_class_key, but you already have a init_lock_keys() that seems to be intended to deal with this. ------------------------------------------------------------------------ ============================================= [ INFO: possible recursive locking detected ] 2.6.32-rc4-autokern1 #1 --------------------------------------------- syslogd/2908 is trying to acquire lock: (&nc->lock){..-...}, at: [] .kmem_cache_free+0x118/0x2d4 but task is already holding lock: (&nc->lock){..-...}, at: [] .kfree+0x1f0/0x324 other info that might help us debug this: 3 locks held by syslogd/2908: #0: (&u->readlock){+.+.+.}, at: [] .unix_dgram_recvmsg+0x70/0x338 #1: (&nc->lock){..-...}, at: [] .kfree+0x1f0/0x324 #2: (&parent->list_lock){-.-...}, at: [] .__drain_alien_cache+0x50/0xb8 stack backtrace: Call Trace: [c0000000e8ccafc0] [c0000000000101e4] .show_stack+0x70/0x184 (unreliable) [c0000000e8ccb070] [c0000000000afebc] .validate_chain+0x6ec/0xf58 [c0000000e8ccb180] [c0000000000b0ff0] .__lock_acquire+0x8c8/0x974 [c0000000e8ccb280] [c0000000000b2290] .lock_acquire+0x140/0x18c [c0000000e8ccb350] [c000000000468df0] ._spin_lock+0x48/0x70 [c0000000e8ccb3e0] [c0000000001407f4] .kmem_cache_free+0x118/0x2d4 [c0000000e8ccb4a0] [c000000000140b90] .free_block+0x130/0x1a8 [c0000000e8ccb540] [c000000000140f94] .__drain_alien_cache+0x80/0xb8 [c0000000e8ccb5e0] [c0000000001411e0] .kfree+0x214/0x324 [c0000000e8ccb6a0] [c0000000003ca860] .skb_release_data+0xe8/0x104 [c0000000e8ccb730] [c0000000003ca2ec] .__kfree_skb+0x20/0xd4 [c0000000e8ccb7b0] [c0000000003cf2c8] .skb_free_datagram+0x1c/0x5c [c0000000e8ccb830] [c00000000045597c] .unix_dgram_recvmsg+0x2f4/0x338 [c0000000e8ccb920] [c0000000003c0f14] .sock_recvmsg+0xf4/0x13c [c0000000e8ccbb30] [c0000000003c28ec] .SyS_recvfrom+0xb4/0x130 [c0000000e8ccbcb0] [c0000000003bfb78] .sys_recv+0x18/0x2c [c0000000e8ccbd20] [c0000000003ed388] .compat_sys_recv+0x14/0x28 [c0000000e8ccbd90] [c0000000003ee1bc] .compat_sys_socketcall+0x178/0x220 [c0000000e8ccbe30] [c0000000000085d4] syscall_exit+0x0/0x40 This patch fixes the issue by setting up lockdep annotations during CPU hotplug. Reported-by: Paul E. McKenney Tested-by: Paul E. McKenney Cc: Peter Zijlstra Cc: Christoph Lameter Signed-off-by: Pekka Enberg --- mm/slab.c | 136 ++++++++++++++++++++++++++++++------------------------ 1 file changed, 75 insertions(+), 61 deletions(-) diff --git a/mm/slab.c b/mm/slab.c index 7dfa481c96ba..84de47e350dd 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -604,67 +604,6 @@ static struct kmem_cache cache_cache = { #define BAD_ALIEN_MAGIC 0x01020304ul -#ifdef CONFIG_LOCKDEP - -/* - * Slab sometimes uses the kmalloc slabs to store the slab headers - * for other slabs "off slab". - * The locking for this is tricky in that it nests within the locks - * of all other slabs in a few places; to deal with this special - * locking we put on-slab caches into a separate lock-class. - * - * We set lock class for alien array caches which are up during init. - * The lock annotation will be lost if all cpus of a node goes down and - * then comes back up during hotplug - */ -static struct lock_class_key on_slab_l3_key; -static struct lock_class_key on_slab_alc_key; - -static inline void init_lock_keys(void) - -{ - int q; - struct cache_sizes *s = malloc_sizes; - - while (s->cs_size != ULONG_MAX) { - for_each_node(q) { - struct array_cache **alc; - int r; - struct kmem_list3 *l3 = s->cs_cachep->nodelists[q]; - if (!l3 || OFF_SLAB(s->cs_cachep)) - continue; - lockdep_set_class(&l3->list_lock, &on_slab_l3_key); - alc = l3->alien; - /* - * FIXME: This check for BAD_ALIEN_MAGIC - * should go away when common slab code is taught to - * work even without alien caches. - * Currently, non NUMA code returns BAD_ALIEN_MAGIC - * for alloc_alien_cache, - */ - if (!alc || (unsigned long)alc == BAD_ALIEN_MAGIC) - continue; - for_each_node(r) { - if (alc[r]) - lockdep_set_class(&alc[r]->lock, - &on_slab_alc_key); - } - } - s++; - } -} -#else -static inline void init_lock_keys(void) -{ -} -#endif - -/* - * Guard access to the cache-chain. - */ -static DEFINE_MUTEX(cache_chain_mutex); -static struct list_head cache_chain; - /* * chicken and egg problem: delay the per-cpu array allocation * until the general caches are up. @@ -685,6 +624,79 @@ int slab_is_available(void) return g_cpucache_up >= EARLY; } +#ifdef CONFIG_LOCKDEP + +/* + * Slab sometimes uses the kmalloc slabs to store the slab headers + * for other slabs "off slab". + * The locking for this is tricky in that it nests within the locks + * of all other slabs in a few places; to deal with this special + * locking we put on-slab caches into a separate lock-class. + * + * We set lock class for alien array caches which are up during init. + * The lock annotation will be lost if all cpus of a node goes down and + * then comes back up during hotplug + */ +static struct lock_class_key on_slab_l3_key; +static struct lock_class_key on_slab_alc_key; + +static void init_node_lock_keys(int q) +{ + struct cache_sizes *s = malloc_sizes; + + if (g_cpucache_up != FULL) + return; + + for (s = malloc_sizes; s->cs_size != ULONG_MAX; s++) { + struct array_cache **alc; + struct kmem_list3 *l3; + int r; + + l3 = s->cs_cachep->nodelists[q]; + if (!l3 || OFF_SLAB(s->cs_cachep)) + return; + lockdep_set_class(&l3->list_lock, &on_slab_l3_key); + alc = l3->alien; + /* + * FIXME: This check for BAD_ALIEN_MAGIC + * should go away when common slab code is taught to + * work even without alien caches. + * Currently, non NUMA code returns BAD_ALIEN_MAGIC + * for alloc_alien_cache, + */ + if (!alc || (unsigned long)alc == BAD_ALIEN_MAGIC) + return; + for_each_node(r) { + if (alc[r]) + lockdep_set_class(&alc[r]->lock, + &on_slab_alc_key); + } + } +} + +static inline void init_lock_keys(void) +{ + int node; + + for_each_node(node) + init_node_lock_keys(node); +} +#else +static void init_node_lock_keys(int q) +{ +} + +static inline void init_lock_keys(void) +{ +} +#endif + +/* + * Guard access to the cache-chain. + */ +static DEFINE_MUTEX(cache_chain_mutex); +static struct list_head cache_chain; + static DEFINE_PER_CPU(struct delayed_work, reap_work); static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep) @@ -1254,6 +1266,8 @@ static int __cpuinit cpuup_prepare(long cpu) kfree(shared); free_alien_cache(alien); } + init_node_lock_keys(node); + return 0; bad: cpuup_canceled(cpu); From 8e15b79cf4bd20c6afb4663d98a39cd004eee672 Mon Sep 17 00:00:00 2001 From: Tim Blechmann Date: Mon, 30 Nov 2009 18:59:34 +0100 Subject: [PATCH 4/6] SLAB: Fix unlikely() annotation in __cache_alloc_node() Branch profiling on my nehalem machine showed 99% incorrect branch hints: 28459 7678524 99 __cache_alloc_node slab.c 3551 Discussion on lkml [1] led to the solution to remove this hint. [1] http://patchwork.kernel.org/patch/63517/ Signed-off-by: Tim Blechmann Signed-off-by: Pekka Enberg --- mm/slab.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slab.c b/mm/slab.c index 84de47e350dd..a07540e5843b 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3320,7 +3320,7 @@ __cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, cache_alloc_debugcheck_before(cachep, flags); local_irq_save(save_flags); - if (unlikely(nodeid == -1)) + if (nodeid == -1) nodeid = numa_node_id(); if (unlikely(!cachep->nodelists[nodeid])) { From f3d8b53a3abbfd0b74fa5dfaa690870d9619fad9 Mon Sep 17 00:00:00 2001 From: "J. R. Okajima" Date: Wed, 2 Dec 2009 16:55:49 +0900 Subject: [PATCH 5/6] slab, kmemleak: stop calling kmemleak_erase() unconditionally When the gotten object is NULL (probably due to ENOMEM), kmemleak_erase() is unnecessary here, It just sets NULL to where already is NULL. Add a condition. Acked-by: Catalin Marinas Signed-off-by: J. R. Okajima Signed-off-by: Pekka Enberg --- mm/slab.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/slab.c b/mm/slab.c index 7dfa481c96ba..4e61449d7946 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3109,7 +3109,8 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags) * per-CPU caches is leaked, we need to make sure kmemleak doesn't * treat the array pointers as a reference to the object. */ - kmemleak_erase(&ac->entry[ac->avail]); + if (objp) + kmemleak_erase(&ac->entry[ac->avail]); return objp; } From ddbf2e8366f2a7fa3419be418cfd83a914d2527f Mon Sep 17 00:00:00 2001 From: "J. R. Okajima" Date: Wed, 2 Dec 2009 16:55:50 +0900 Subject: [PATCH 6/6] slab, kmemleak: pass the correct pointer to kmemleak_erase() In ____cache_alloc(), the variable 'ac' may be changed after cache_alloc_refill() and the following kmemleak_erase() may get an incorrect pointer. Update 'ac' after cache_alloc_refill() unconditionally. See the following URL for the discussion of this patch: http://marc.info/?l=linux-kernel&m=125873373124187&w=2 Acked-by: Catalin Marinas Signed-off-by: J. R. Okajima Signed-off-by: Pekka Enberg --- mm/slab.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/slab.c b/mm/slab.c index 4e61449d7946..66e90477a4bb 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3103,6 +3103,11 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags) } else { STATS_INC_ALLOCMISS(cachep); objp = cache_alloc_refill(cachep, flags); + /* + * the 'ac' may be updated by cache_alloc_refill(), + * and kmemleak_erase() requires its correct value. + */ + ac = cpu_cache_get(cachep); } /* * To avoid a false negative, if an object that is in one of the