Go to file
Mike Kravetz 32c877191e hugetlb: do not clear hugetlb dtor until allocating vmemmap
Patch series "Fix hugetlb free path race with memory errors".

In the discussion of Jiaqi Yan's series "Improve hugetlbfs read on
HWPOISON hugepages" the race window was discovered. 
https://lore.kernel.org/linux-mm/20230616233447.GB7371@monkey/

Freeing a hugetlb page back to low level memory allocators is performed
in two steps.
1) Under hugetlb lock, remove page from hugetlb lists and clear destructor
2) Outside lock, allocate vmemmap if necessary and call low level free
Between these two steps, the hugetlb page will appear as a normal
compound page.  However, vmemmap for tail pages could be missing.
If a memory error occurs at this time, we could try to update page
flags non-existant page structs.

A much more detailed description is in the first patch.

The first patch addresses the race window.  However, it adds a
hugetlb_lock lock/unlock cycle to every vmemmap optimized hugetlb page
free operation.  This could lead to slowdowns if one is freeing a large
number of hugetlb pages.

The second path optimizes the update_and_free_pages_bulk routine to only
take the lock once in bulk operations.

The second patch is technically not a bug fix, but includes a Fixes tag
and Cc stable to avoid a performance regression.  It can be combined with
the first, but was done separately make reviewing easier.


This patch (of 2):

Freeing a hugetlb page and releasing base pages back to the underlying
allocator such as buddy or cma is performed in two steps:
- remove_hugetlb_folio() is called to remove the folio from hugetlb
  lists, get a ref on the page and remove hugetlb destructor.  This
  all must be done under the hugetlb lock.  After this call, the page
  can be treated as a normal compound page or a collection of base
  size pages.
- update_and_free_hugetlb_folio() is called to allocate vmemmap if
  needed and the free routine of the underlying allocator is called
  on the resulting page.  We can not hold the hugetlb lock here.

One issue with this scheme is that a memory error could occur between
these two steps.  In this case, the memory error handling code treats
the old hugetlb page as a normal compound page or collection of base
pages.  It will then try to SetPageHWPoison(page) on the page with an
error.  If the page with error is a tail page without vmemmap, a write
error will occur when trying to set the flag.

Address this issue by modifying remove_hugetlb_folio() and
update_and_free_hugetlb_folio() such that the hugetlb destructor is not
cleared until after allocating vmemmap.  Since clearing the destructor
requires holding the hugetlb lock, the clearing is done in
remove_hugetlb_folio() if the vmemmap is present.  This saves a
lock/unlock cycle.  Otherwise, destructor is cleared in
update_and_free_hugetlb_folio() after allocating vmemmap.

Note that this will leave hugetlb pages in a state where they are marked
free (by hugetlb specific page flag) and have a ref count.  This is not
a normal state.  The only code that would notice is the memory error
code, and it is set up to retry in such a case.

A subsequent patch will create a routine to do bulk processing of
vmemmap allocation.  This will eliminate a lock/unlock cycle for each
hugetlb page in the case where we are freeing a large number of pages.

Link: https://lkml.kernel.org/r/20230711220942.43706-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20230711220942.43706-2-mike.kravetz@oracle.com
Fixes: ad2fa3717b ("mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-04 13:03:41 -07:00
Documentation TTY/Serial fixes for 6.5-rc4 2023-07-30 11:51:36 -07:00
LICENSES LICENSES: Add the copyleft-next-0.3.1 license 2022-11-08 15:44:01 +01:00
arch x86: 2023-07-30 11:19:08 -07:00
block block-6.5-2023-07-21 2023-07-22 11:05:15 -07:00
certs KEYS: Add missing function documentation 2023-04-24 16:15:52 +03:00
crypto crypto: algif_hash - Fix race between MORE and non-MORE sends 2023-07-08 22:48:42 +10:00
drivers spi: Fixes for v6.5 2023-07-30 12:54:31 -07:00
fs four small client fixes 2023-07-29 20:49:13 -07:00
include four small client fixes 2023-07-29 20:49:13 -07:00
init Kbuild updates for v6.5 2023-07-01 09:24:31 -07:00
io_uring io_uring-6.5-2023-07-28 2023-07-28 10:19:44 -07:00
ipc Merge branch 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:20:07 -08:00
kernel Probe fixes for 6.5-rc3: 2023-07-30 11:27:22 -07:00
lib crypto, cifs: fix error handling in extract_iter_to_sg() 2023-08-04 13:03:40 -07:00
mm hugetlb: do not clear hugetlb dtor until allocating vmemmap 2023-08-04 13:03:41 -07:00
net A patch to reduce the potential for erroneous RBD exclusive lock 2023-07-28 10:47:24 -07:00
rust rust: error: `impl Debug` for `Error` with `errname()` integration 2023-06-13 01:24:42 +02:00
samples arm64: ftrace: Add direct call trampoline samples support 2023-07-10 17:51:54 -04:00
scripts x86: 2023-07-30 11:19:08 -07:00
security security: keys: perform capable check only on privileged operations 2023-07-28 18:07:41 +00:00
sound ASoC: Fixes for v6.5 2023-07-27 14:54:23 +02:00
tools radix tree test suite: fix incorrect allocation size for pthreads 2023-08-04 13:03:40 -07:00
usr initramfs: Encode dependency on KBUILD_BUILD_TIMESTAMP 2023-06-06 17:54:49 +09:00
virt KVM: Grab a reference to KVM for VM and vCPU stats file descriptors 2023-07-29 11:05:28 -04:00
.clang-format iommu: Add for_each_group_device() 2023-05-23 08:15:51 +02:00
.cocciconfig
.get_maintainer.ignore
.gitattributes .gitattributes: set diff driver for Rust source code files 2023-05-31 17:48:25 +02:00
.gitignore Revert ".gitignore: ignore *.cover and *.mbx" 2023-07-04 15:05:12 -07:00
.mailmap mailmap: update remaining active codeaurora.org email addresses 2023-07-27 13:07:05 -07:00
.rustfmt.toml
COPYING
CREDITS - Address -Wmissing-prototype warnings 2023-06-26 16:43:54 -07:00
Kbuild Kbuild updates for v6.1 2022-10-10 12:00:45 -07:00
Kconfig
MAINTAINERS USB fixes for 6.5-rc4 2023-07-30 11:57:51 -07:00
Makefile Linux 6.5-rc4 2023-07-30 13:23:47 -07:00
README

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.