linux-stable/mm
David Hildenbrand d98d053efa mm/memory_hotplug: fix try_offline_node()
commit 2c91f8fc6c upstream.

-- snip --

Only contextual issues:
- Unrelated check_and_unmap_cpu_on_node() changes are missing.
- Unrelated walk_memory_blocks() has not been moved/refactored yet.

-- snip --

try_offline_node() is pretty much broken right now:

 - The node span is updated when onlining memory, not when adding it. We
   ignore memory that was never onlined. Bad.

 - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
   trigger a kernel panic. Bad for memory that is offline but also bad
   for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
   first PFN of a section might contain garbage.

 - Sections belonging to mixed nodes are not properly considered.

As memory blocks might belong to multiple nodes, we would have to walk
all pageblocks (or at least subsections) within present sections.
However, we don't have a way to identify whether a memmap that is not
online was initialized (relevant for ZONE_DEVICE).  This makes things
more complicated.

Luckily, we can piggyback on the node span and the nid stored in memory
blocks.  Currently, the node span is grown when calling
move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
removing memory, before calling try_offline_node().  Sysfs links are
created via link_mem_sections(), e.g., during boot or when adding
memory.

If the node still spans memory or if any memory block belongs to the
nid, we don't set the node offline.  As memory blocks that span multiple
nodes cannot get offlined, the nid stored in memory blocks is reliable
enough (for such online memory blocks, the node still spans the memory).
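
The rule above can be sketched as a small userspace model.  This is not
the kernel code: struct sketch_node, struct sketch_memory_block and
sketch_node_can_go_offline() are hypothetical names, and the flat array
of blocks stands in for whatever the kernel uses to track memory blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the kernel's real structures differ. */
struct sketch_node {
	int nid;
	unsigned long spanned_pages;	/* non-zero while the node spans memory */
};

struct sketch_memory_block {
	int nid;			/* nid recorded when the block was added */
};

/*
 * Return true only if the node spans no memory and no memory block
 * (onlined or not) is recorded as belonging to it.
 */
static bool sketch_node_can_go_offline(const struct sketch_node *node,
				       const struct sketch_memory_block *blocks,
				       size_t nr_blocks)
{
	size_t i;

	assert(node != NULL);
	assert(blocks != NULL || nr_blocks == 0);

	/* The span covers memory that is still online on this node. */
	if (node->spanned_pages)
		return false;

	/* The stored nid covers memory that was added but never onlined. */
	for (i = 0; i < nr_blocks; i++) {
		if (blocks[i].nid == node->nid)
			return false;
	}
	return true;
}
```

The two checks are complementary in this model: the span only reflects
onlined memory, while the per-block nid also catches blocks that were
added to the node but never onlined.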

Introduce for_each_memory_block() to efficiently walk all memory blocks.
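
A callback-style walker of this kind might look roughly as follows.  This
is a minimal userspace sketch, not the helper added by the patch: the
sketch_* names and the array-backed storage are assumptions, and stopping
the walk on the first non-zero callback return value is a design choice
of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory block; only the nid matters here. */
struct sketch_block {
	int nid;
};

typedef int (*sketch_walk_fn)(struct sketch_block *blk, void *arg);

/*
 * Call fn() for every block.  Stop early and return fn()'s value as soon
 * as it returns non-zero; return 0 if all blocks were visited.
 */
static int sketch_for_each_block(struct sketch_block *blocks, size_t nr_blocks,
				 void *arg, sketch_walk_fn fn)
{
	int ret = 0;
	size_t i;

	assert(fn != NULL);

	for (i = 0; i < nr_blocks && !ret; i++)
		ret = fn(&blocks[i], arg);
	return ret;
}

/* Example callback: report -EEXIST if a block belongs to the given nid. */
static int sketch_block_matches_nid(struct sketch_block *blk, void *arg)
{
	int nid = *(int *)arg;

	return blk->nid == nid ? -EEXIST : 0;
}
```

With this shape, a caller that only needs to know whether any block
belongs to a nid can pass the nid through arg and treat a non-zero
return value as "keep the node online".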

Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
when removing ZONE_DEVICE memory to fix similar issues (access of
garbage memmaps) - until we have a reliable way to identify whether
these memmaps were properly initialized.  This implies that, later on,
once a node had ZONE_DEVICE memory, we won't be able to set that node
offline - which should be acceptable.

Since commit f1dd2cd13c ("mm, memory_hotplug: do not associate
hotadded memory to zones until online") memory that is added is not
associated with a zone/node (memmap not initialized).  The introducing
commit 60a5a19e74 ("memory-hotplug: remove sysfs file of node")
already missed that we could have multiple nodes for a section and that
the zone/node span is updated when onlining pages, not when adding them.

I tested this by hotplugging two DIMMs to a memory-less and cpu-less
NUMA node.  The node is properly onlined when adding the DIMMs.  When
removing the DIMMs, the node is properly offlined.

Masayoshi Mizuma reported:

: Without this patch, memory hotplug fails as panic:
:
:  BUG: kernel NULL pointer dereference, address: 0000000000000000
:  ...
:  Call Trace:
:   remove_memory_block_devices+0x81/0xc0
:   try_remove_memory+0xb4/0x130
:   __remove_memory+0xa/0x20
:   acpi_memory_device_remove+0x84/0x100
:   acpi_bus_trim+0x57/0x90
:   acpi_bus_trim+0x2e/0x90
:   acpi_device_hotplug+0x2b2/0x4d0
:   acpi_hotplug_work_fn+0x1a/0x30
:   process_one_work+0x171/0x380
:   worker_thread+0x49/0x3f0
:   kthread+0xf8/0x130
:   ret_from_fork+0x35/0x40

[david@redhat.com: v3]
  Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
Fixes: 60a5a19e74 ("memory-hotplug: remove sysfs file of node")
Fixes: f1dd2cd13c ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Jani Nikula <jani.nikula@intel.com>
Cc: Nayna Jain <nayna@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-29 16:43:27 +01:00
kasan kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN 2018-08-17 16:20:30 -07:00
backing-dev.c writeback: synchronize sync(2) against cgroup writeback membership switches 2019-03-05 17:58:50 +01:00
balloon_compaction.c virtio_balloon: fix deadlock on OOM 2017-11-14 23:57:38 +02:00
bootmem.c docs/mm: bootmem: add overview documentation 2018-08-02 12:17:27 -06:00
cleancache.c mm: use octal not symbolic permissions 2018-06-15 07:55:25 +09:00
cma.c mm/cma.c: fail if fixed declaration can't be honored 2019-08-06 19:06:51 +02:00
cma.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
cma_debug.c mm/cma_debug.c: fix the break condition in cma_maxchunk_get() 2019-06-15 11:54:01 +02:00
compaction.c mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone 2019-10-05 13:10:13 +02:00
debug.c mm: get rid of vmacache_flush_all() entirely 2018-09-13 15:18:04 -10:00
debug_page_ref.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
dmapool.c mm: use octal not symbolic permissions 2018-06-15 07:55:25 +09:00
early_ioremap.c mm/early_ioremap: Fix boot hang with earlyprintk=efi,keep 2017-12-11 14:54:44 +01:00
fadvise.c vfs: implement readahead(2) using POSIX_FADV_WILLNEED 2018-08-30 20:01:32 +02:00
failslab.c mm: use octal not symbolic permissions 2018-06-15 07:55:25 +09:00
filemap.c mm/filemap.c: don't initiate writeback if mapping has no dirty pages 2019-11-12 19:21:20 +01:00
frame_vector.c mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()' 2017-12-14 16:00:48 -08:00
frontswap.c mm: use octal not symbolic permissions 2018-06-15 07:55:25 +09:00
gup.c mm/gup.c: remove some BUG_ONs from get_gate_page() 2019-07-31 07:27:08 +02:00
gup_benchmark.c mm/gup_benchmark.c: prevent integer overflow in ioctl 2019-12-01 09:17:07 +01:00
highmem.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
hmm.c mm, memory_hotplug: add nid parameter to arch_remove_memory 2020-01-29 16:43:25 +01:00
huge_memory.c mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment 2020-01-23 08:21:32 +01:00
hugetlb.c hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() 2019-10-29 09:19:59 +01:00
hugetlb_cgroup.c mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() 2019-11-20 18:45:20 +01:00
hwpoison-inject.c mm/memory_failure: Remove unused trapno from memory_failure 2018-01-23 12:17:42 -06:00
init-mm.c mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids 2018-07-17 09:35:30 +02:00
internal.h vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n 2019-12-05 09:20:57 +01:00
interval_tree.c mm/interval_tree.c: use vma_pages() helper 2018-01-31 17:18:37 -08:00
Kconfig mm/hmm: select mmu notifier when selecting HMM 2019-06-15 11:54:00 +02:00
Kconfig.debug mm: clarify CONFIG_PAGE_POISONING and usage 2018-08-22 10:52:44 -07:00
khugepaged.c coredump: fix race condition between collapse_huge_page() and core dumping 2019-06-22 08:15:21 +02:00
kmemleak-test.c
kmemleak.c Revert "kmemleak: allow to coexist with fault injection" 2019-08-25 10:47:58 +02:00
ksm.c mm/ksm.c: don't WARN if page is still mapped in remove_stable_node() 2019-12-01 09:16:11 +01:00
list_lru.c mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node 2019-06-19 08:17:59 +02:00
maccess.c mm: docs: fix parameter names mismatch 2018-02-06 18:32:48 -08:00
madvise.c mm: madvise(MADV_DODUMP): allow hugetlbfs pages 2018-10-05 16:32:05 -07:00
Makefile vfs: implement readahead(2) using POSIX_FADV_WILLNEED 2018-08-30 20:01:32 +02:00
memblock.c mm/page_alloc.c: deduplicate __memblock_free_early() and memblock_free() 2019-12-05 09:20:58 +01:00
memcontrol.c mm: handle no memcg case in memcg_kmem_charge() properly 2019-12-01 09:17:14 +01:00
memfd.c memfd: Use radix_tree_deref_slot_protected to avoid the warning. 2019-11-20 18:47:53 +01:00
memory-failure.c mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once 2019-10-29 09:19:59 +01:00
memory.c mm, thp, proc: report THP eligibility for each vma 2019-12-17 20:35:45 +01:00
memory_hotplug.c mm/memory_hotplug: fix try_offline_node() 2020-01-29 16:43:27 +01:00
mempolicy.c mm: mempolicy: fix the wrong return value and potential pages leak of mbind 2019-11-20 18:45:19 +01:00
mempool.c mm/mempool.c: add missing parameter description 2018-08-22 10:52:44 -07:00
memtest.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
migrate.c mm: move_pages: return valid node id in status if the page is already on the target node 2020-01-09 10:19:00 +01:00
mincore.c mm/mincore.c: make mincore() more conservative 2019-05-22 07:37:40 +02:00
mlock.c mm/mlock.c: change count_mm_mlocked_page_nr return type 2019-07-10 09:53:40 +02:00
mm_init.c mm: access zone->node via zone_to_nid() and zone_set_nid() 2018-08-22 10:52:45 -07:00
mmap.c arm64: Revert support for execute-only user mappings 2020-01-09 10:19:03 +01:00
mmu_context.c
mmu_notifier.c mm/mmu_notifier: use hlist_add_head_rcu() 2019-07-31 07:27:08 +02:00
mmzone.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
mprotect.c x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings 2018-06-20 19:10:01 +02:00
mremap.c mremap: properly flush TLB before releasing the page 2018-10-18 11:30:52 +02:00
msync.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
nobootmem.c mm/memblock: add a name for memblock flags enumeration 2018-08-02 12:17:27 -06:00
nommu.c mm: use down_read_killable for locking mmap_sem in access_remote_vm 2019-07-31 07:27:09 +02:00
oom_kill.c memcg, oom: don't require __GFP_FS when invoking memcg OOM killer 2019-10-05 13:10:07 +02:00
page-writeback.c mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio() 2020-01-23 08:21:31 +01:00
page_alloc.c mm/page_alloc.c: use a single function to free page 2019-12-05 09:20:58 +01:00
page_counter.c memcg: introduce memory.min 2018-06-07 17:34:36 -07:00
page_ext.c mm/page_ext.c: fix an imbalance with kmemleak 2019-04-05 22:32:58 +02:00
page_idle.c mm/page_idle.c: fix oops because end_pfn is larger than max_pfn 2019-07-03 13:14:45 +02:00
page_io.c mm/page_io.c: do not free shared swap slots 2019-12-01 09:17:35 +01:00
page_isolation.c mm, migrate: remove reason argument from new_page_t 2018-04-11 10:28:32 -07:00
page_owner.c mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo 2019-10-29 09:19:58 +01:00
page_poison.c page_poison: play nicely with KASAN 2019-04-05 22:32:59 +02:00
page_vma_mapped.c mm/rmap: map_pte() was not handling private ZONE_DEVICE page properly 2018-11-13 11:08:46 -08:00
pagewalk.c mm: kernel-doc: add missing parameter descriptions 2018-04-05 21:36:27 -07:00
percpu-internal.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
percpu-km.c percpu: convert spin_lock_irq to spin_lock_irqsave. 2019-02-12 19:47:12 +01:00
percpu-stats.c treewide: Use array_size() in vmalloc() 2018-06-12 16:19:22 -07:00
percpu-vm.c percpu: allow select gfp to be passed to underlying allocators 2018-02-18 05:33:01 -08:00
percpu.c percpu: do not search past bitmap when allocating an area 2019-06-15 11:54:11 +02:00
pgtable-generic.c mm: do not lose dirty and accessed bits in pmdp_invalidate() 2018-01-31 17:18:38 -08:00
process_vm_access.c mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors 2018-02-06 18:32:48 -08:00
quicklist.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
readahead.c vfs: implement readahead(2) using POSIX_FADV_WILLNEED 2018-08-30 20:01:32 +02:00
rmap.c mm/hmm: fix bad subpage pointer in try_to_unmap_one 2019-08-25 10:47:43 +02:00
rodata_test.c
shmem.c mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment 2020-01-23 08:21:30 +01:00
slab.c mm/slab.c: fix an infinite loop in leaks_show() 2019-06-15 11:54:01 +02:00
slab.h mm: add support for kmem caches in DMA32 zone 2019-04-03 06:26:28 +02:00
slab_common.c mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid 2020-01-23 08:21:30 +01:00
slob.c slab: __GFP_ZERO is incompatible with a constructor 2018-06-07 17:34:34 -07:00
slub.c mm/slub: fix a deadlock in show_slab_objects() 2019-10-29 09:19:58 +01:00
sparse-vmemmap.c mm/sparse: delete old sparse_init and enable new one 2018-08-17 16:20:32 -07:00
sparse.c mm/memory_hotplug: remove "zone" parameter from sparse_remove_one_section 2020-01-29 16:43:26 +01:00
swap.c mm/swap: fix release_pages() when releasing devmap pages 2019-07-31 07:27:03 +02:00
swap_cgroup.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
swap_slots.c mm, swap, get_swap_pages: use entry_size instead of cluster in parameter 2018-08-22 10:52:44 -07:00
swap_state.c treewide: kvzalloc() -> kvcalloc() 2018-06-12 16:19:22 -07:00
swapfile.c mm, swap: bounds check swap_info array accesses to avoid NULL derefs 2019-04-05 22:32:58 +02:00
truncate.c mm: cleancache: fix corruption on missed inode invalidation 2018-12-05 19:32:13 +01:00
usercopy.c usercopy: Avoid HIGHMEM pfn warning 2019-10-11 18:20:58 +02:00
userfaultfd.c hugetlb: use same fault hash key for shared and private mappings 2019-05-22 07:37:40 +02:00
util.c mm: page_mapped: don't assume compound page is huge or THP 2019-01-16 22:04:36 +01:00
vmacache.c mm: get rid of vmacache_flush_all() entirely 2018-09-13 15:18:04 -10:00
vmalloc.c mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy() 2019-08-16 10:12:40 +02:00
vmpressure.c mm/vmpressure.c: fix a signedness bug in vmpressure_register_event() 2019-10-17 13:45:19 -07:00
vmscan.c mm: vmscan: check if mem cgroup is disabled or not before calling memcg slab shrinker 2019-08-06 19:06:54 +02:00
vmstat.c mm/vmstat.c: fix NUMA statistics updates 2019-12-13 08:51:27 +01:00
workingset.c mm/list_lru: introduce list_lru_shrink_walk_irq() 2018-08-17 16:20:32 -07:00
z3fold.c z3fold: fix possible reclaim races 2018-12-01 09:37:33 +01:00
zbud.c mm: docs: fix parameter names mismatch 2018-02-06 18:32:48 -08:00
zpool.c mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc 2018-02-21 15:35:43 -08:00
zsmalloc.c mm/zsmalloc.c: fix the migrated zspage statistics. 2020-01-09 10:19:00 +01:00
zswap.c zswap: re-check zswap_is_full() after do zswap_shrink() 2018-07-26 19:38:03 -07:00