linux-stable/mm
Miaohe Lin 5ef7ba2799 mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled
commit 1983184c22 upstream.

When I did hard offline test with hugetlb pages, below deadlock occurs:

======================================================
WARNING: possible circular locking dependency detected
6.8.0-11409-gf6cef5f8c37f #1 Not tainted
------------------------------------------------------
bash/46904 is trying to acquire lock:
ffffffffabe68910 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_slow_dec+0x16/0x60

but task is already holding lock:
ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (pcp_batch_high_lock){+.+.}-{3:3}:
       __mutex_lock+0x6c/0x770
       page_alloc_cpu_online+0x3c/0x70
       cpuhp_invoke_callback+0x397/0x5f0
       __cpuhp_invoke_callback_range+0x71/0xe0
       _cpu_up+0xeb/0x210
       cpu_up+0x91/0xe0
       cpuhp_bringup_mask+0x49/0xb0
       bringup_nonboot_cpus+0xb7/0xe0
       smp_init+0x25/0xa0
       kernel_init_freeable+0x15f/0x3e0
       kernel_init+0x15/0x1b0
       ret_from_fork+0x2f/0x50
       ret_from_fork_asm+0x1a/0x30

-> #0 (cpu_hotplug_lock){++++}-{0:0}:
       __lock_acquire+0x1298/0x1cd0
       lock_acquire+0xc0/0x2b0
       cpus_read_lock+0x2a/0xc0
       static_key_slow_dec+0x16/0x60
       __hugetlb_vmemmap_restore_folio+0x1b9/0x200
       dissolve_free_huge_page+0x211/0x260
       __page_handle_poison+0x45/0xc0
       memory_failure+0x65e/0xc70
       hard_offline_page_store+0x55/0xa0
       kernfs_fop_write_iter+0x12c/0x1d0
       vfs_write+0x387/0x550
       ksys_write+0x64/0xe0
       do_syscall_64+0xca/0x1e0
       entry_SYSCALL_64_after_hwframe+0x6d/0x75

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(pcp_batch_high_lock);
                               lock(cpu_hotplug_lock);
                               lock(pcp_batch_high_lock);
  rlock(cpu_hotplug_lock);

 *** DEADLOCK ***

5 locks held by bash/46904:
 #0: ffff98f6c3bb23f0 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
 #1: ffff98f6c328e488 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0xf8/0x1d0
 #2: ffff98ef83b31890 (kn->active#113){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x100/0x1d0
 #3: ffffffffabf9db48 (mf_mutex){+.+.}-{3:3}, at: memory_failure+0x44/0xc70
 #4: ffffffffabf92ea8 (pcp_batch_high_lock){+.+.}-{3:3}, at: zone_pcp_disable+0x16/0x40

stack backtrace:
CPU: 10 PID: 46904 Comm: bash Kdump: loaded Not tainted 6.8.0-11409-gf6cef5f8c37f #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x68/0xa0
 check_noncircular+0x129/0x140
 __lock_acquire+0x1298/0x1cd0
 lock_acquire+0xc0/0x2b0
 cpus_read_lock+0x2a/0xc0
 static_key_slow_dec+0x16/0x60
 __hugetlb_vmemmap_restore_folio+0x1b9/0x200
 dissolve_free_huge_page+0x211/0x260
 __page_handle_poison+0x45/0xc0
 memory_failure+0x65e/0xc70
 hard_offline_page_store+0x55/0xa0
 kernfs_fop_write_iter+0x12c/0x1d0
 vfs_write+0x387/0x550
 ksys_write+0x64/0xe0
 do_syscall_64+0xca/0x1e0
 entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7fc862314887
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff19311268 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007fc862314887
RDX: 000000000000000c RSI: 000056405645fe10 RDI: 0000000000000001
RBP: 000056405645fe10 R08: 00007fc8623d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007fc86241b780 R14: 00007fc862417600 R15: 00007fc862416a00

In short, below scene breaks the lock dependency chain:

 memory_failure
  __page_handle_poison
   zone_pcp_disable -- lock(pcp_batch_high_lock)
   dissolve_free_huge_page
    __hugetlb_vmemmap_restore_folio
     static_key_slow_dec
      cpus_read_lock -- rlock(cpu_hotplug_lock)

Fix this by calling drain_all_pages() instead.

This issue won't occur until commit a6b40850c4 ("mm: hugetlb: replace
hugetlb_free_vmemmap_enabled with a static_key").  As it introduced
rlock(cpu_hotplug_lock) in dissolve_free_huge_page() code path while
lock(pcp_batch_high_lock) is already in the __page_handle_poison().

[linmiaohe@huawei.com: extend comment per Oscar]
[akpm@linux-foundation.org: reflow block comment]
Link: https://lkml.kernel.org/r/20240407085456.2798193-1-linmiaohe@huawei.com
Fixes: a6b40850c4 ("mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-27 17:07:16 +02:00
..
damon mm/damon/reclaim: fix quota stauts loss due to online tunings 2024-03-01 13:26:39 +01:00
kasan kasan/test: avoid gcc warning for intentional overflow 2024-04-03 15:19:27 +02:00
kfence mm,kfence: decouple kfence from page granularity mapping judgement 2023-12-03 07:32:08 +01:00
kmsan mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush() 2023-04-26 14:28:41 +02:00
Kconfig mm: introduce new 'lock_mm_and_find_vma()' page fault helper 2023-07-01 13:16:24 +02:00
Kconfig.debug mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM 2023-06-14 11:15:29 +02:00
Makefile
backing-dev.c writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs 2023-04-26 14:28:39 +02:00
balloon_compaction.c
bootmem_info.c
cma.c mm/cma: use nth_page() in place of direct struct page manipulation 2023-11-28 17:07:14 +00:00
cma.h
cma_debug.c
cma_sysfs.c
compaction.c mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations 2024-04-03 15:19:42 +02:00
debug.c
debug_page_ref.c
debug_vm_pgtable.c
dmapool.c
early_ioremap.c
fadvise.c
failslab.c
filemap.c mm: merge folio_has_private()/filemap_release_folio() call pairs 2024-01-10 17:10:31 +01:00
folio-compat.c
frontswap.c
gup.c mm: always expand the stack with the mmap write lock held 2023-07-01 13:16:25 +02:00
gup_test.c
gup_test.h
highmem.c
hmm.c
huge_memory.c mm: huge_memory: don't force huge page alignment on 32 bit 2024-03-06 14:45:06 +00:00
hugetlb.c hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write 2023-12-13 18:39:20 +01:00
hugetlb_cgroup.c
hugetlb_vmemmap.c mm: hugetlb_vmemmap: fix a race between vmemmap pmd split 2023-09-19 12:27:56 +02:00
hugetlb_vmemmap.h
hwpoison-inject.c
init-mm.c
internal.h mm, netfs, fscache: stop read optimisation when folio removed from pagecache 2024-01-10 17:10:31 +01:00
interval_tree.c
io-mapping.c
ioremap.c
khugepaged.c mm: merge folio_has_private()/filemap_release_folio() call pairs 2024-01-10 17:10:31 +01:00
kmemleak.c
ksm.c mm/ksm: fix race with VMA iteration and mm_struct teardown 2023-03-30 12:49:29 +02:00
list_lru.c
maccess.c mm: Fix copy_from_user_nofault(). 2023-06-28 11:12:17 +02:00
madvise.c madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check 2023-08-30 16:11:11 +02:00
mapping_dirty_helpers.c
memblock.c x86/numa: Fix the address overlap check in numa_fill_memblks() 2024-03-01 13:26:36 +01:00
memcontrol.c mm: memcontrol: clarify swapaccount=0 deprecation warning 2024-03-01 13:26:32 +01:00
memfd.c memfd: check for non-NULL file_seals in memfd_create() syscall 2023-06-28 11:12:27 +02:00
memory-failure.c mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled 2024-04-27 17:07:16 +02:00
memory-tiers.c memory tier: release the new_memtier in find_create_memory_tier() 2023-03-10 09:34:27 +01:00
memory.c x86/mm/pat: fix VM_PAT handling in COW mappings 2024-04-10 16:28:33 +02:00
memory_hotplug.c mm/memory_hotplug: fix error handling in add_memory_resource() 2024-01-10 17:10:33 +01:00
mempolicy.c mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer 2023-11-08 14:11:02 +01:00
mempool.c
memremap.c
memtest.c memtest: use {READ,WRITE}_ONCE in memory scanning 2024-04-03 15:19:36 +02:00
migrate.c mm/migrate: set swap entry values of THP tail pages properly. 2024-04-03 15:19:47 +02:00
migrate_device.c
mincore.c mm: teach mincore_hugetlb about pte markers 2023-03-22 13:34:03 +01:00
mlock.c
mm_init.c
mm_slot.h
mmap.c mmap: fix error paths with dup_anon_vma() 2023-11-08 14:11:03 +01:00
mmap_lock.c
mmu_gather.c
mmu_notifier.c
mmzone.c
mprotect.c
mremap.c mm, mremap: fix mremap() expanding for vma's with vm_ops->close() 2023-02-09 11:28:22 +01:00
msync.c
nommu.c xtensa: fix lock_mm_and_find_vma in case VMA not found 2023-07-05 18:27:37 +01:00
oom_kill.c
page-writeback.c mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again 2024-02-23 09:12:32 +01:00
page_alloc.c mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations 2024-04-03 15:19:42 +02:00
page_counter.c
page_ext.c
page_idle.c
page_io.c use less confusing names for iov_iter direction initializers 2023-02-09 11:28:04 +01:00
page_isolation.c
page_owner.c
page_poison.c
page_reporting.c
page_reporting.h
page_table_check.c mm: page_table_check: Ensure user pages are not slab pages 2023-06-14 11:15:29 +02:00
page_vma_mapped.c
pagewalk.c
percpu-internal.h
percpu-km.c
percpu-stats.c
percpu-vm.c
percpu.c
pgalloc-track.h
pgtable-generic.c
process_vm_access.c use less confusing names for iov_iter direction initializers 2023-02-09 11:28:04 +01:00
ptdump.c
readahead.c readahead: avoid multiple marked readahead pages 2024-03-15 10:48:19 -04:00
rmap.c mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON 2023-03-10 09:34:25 +01:00
rodata_test.c
secretmem.c
shmem.c mm/shmem: fix race in shmem_undo_range w/THP 2023-12-20 17:00:26 +01:00
shrinker_debug.c mm: shrinkers: fix deadlock in shrinker debugfs 2023-02-22 12:59:46 +01:00
shuffle.c
shuffle.h
slab.c mm/slab: Fix undefined init_cache_node_node() for NUMA and !SMP 2023-03-30 12:49:23 +02:00
slab.h
slab_common.c mm/slab_common: fix slab_caches list corruption after kmem_cache_destroy() 2023-10-06 14:57:03 +02:00
slob.c
slub.c
sparse-vmemmap.c
sparse.c mm/sparsemem: fix race in accessing memory_section->usage 2024-01-31 16:17:02 -08:00
swap.c
swap.h mm/swap: fix race when skipping swapcache 2024-03-01 13:26:32 +01:00
swap_cgroup.c
swap_slots.c
swap_state.c
swapfile.c mm: swap: fix race between free_swap_and_cache() and swapoff() 2024-04-03 15:19:32 +02:00
truncate.c mm: merge folio_has_private()/filemap_release_folio() call pairs 2024-01-10 17:10:31 +01:00
usercopy.c mm: Fix copy_from_user_nofault(). 2023-06-28 11:12:17 +02:00
userfaultfd.c userfaultfd: fix mmap_changing checking in mfill_atomic_hugetlb 2024-02-23 09:12:51 +01:00
util.c rcu: dump vmalloc memory info safely 2023-09-13 09:42:59 +02:00
vmalloc.c mm/vmalloc: add a safer version of find_vm_area() for debug 2023-09-13 09:43:00 +02:00
vmpressure.c net-memcg: Fix scope of sockmem pressure indicators 2023-09-13 09:42:33 +02:00
vmscan.c mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations 2024-04-03 15:19:42 +02:00
vmstat.c
workingset.c mm/mglru: fix underprotected page cache 2023-12-20 17:00:26 +01:00
z3fold.c
zbud.c
zpool.c
zsmalloc.c zsmalloc: allow only one active pool compaction context 2023-08-23 17:52:40 +02:00
zswap.c mm: zswap: fix missing folio cleanup in writeback race path 2024-03-01 13:26:39 +01:00