hugetlb: add demote hugetlb page sysfs interfaces

Patch series "hugetlb: add demote/split page functionality", v4.

The concurrent use of multiple hugetlb page sizes on a single system is
becoming more common.  One of the reasons is better TLB support for
gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
being used to back VMs in hosting environments.

When using hugetlb pages to back VMs, it is often desirable to
preallocate hugetlb pools.  This avoids the delay and uncertainty of
allocating hugetlb pages at VM startup.  In addition, preallocating huge
pages minimizes the issue of memory fragmentation that increases the
longer the system is up and running.

In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes.  Over
time, the preallocated pool of smaller hugetlb pages may become depleted
while larger hugetlb pages still remain.  In such situations, it is
desirable to convert larger hugetlb pages to smaller hugetlb pages.

Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages.  For example, to convert 50 GB pages on x86:

  gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
  m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
  echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
  echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages

On an idle system this operation is fairly reliable and results are as
expected.  The number of 2MB pages is increased as expected and the time
of the operation is a second or two.

However, when there is activity on the system the following issues
arise:

1) This process can take quite some time, especially if allocation of
   the smaller pages is not immediate and requires migration/compaction.

2) There is no guarantee that the total size of smaller pages allocated
   will match the size of the larger page which was freed. This is
   because the area freed by the larger page could quickly be
   fragmented.

In a test environment with a load that continually fills the page cache
with clean pages, results such as the following can be observed:

  Unexpected number of 2MB pages allocated: Expected 25600, have 19944
  real    0m42.092s
  user    0m0.008s
  sys     0m41.467s

To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting of a hugetlb page to
pages of a smaller size.  This avoids freeing pages to buddy and then
trying to allocate from buddy.

Page demotion is controlled via sysfs files that reside in the per-hugetlb
page size and per node directories.

 - demote_size
        Target page size for demotion, a smaller huge page size. File
        can be written to chose a smaller huge page size if multiple are
        available.

 - demote
        Writable number of hugetlb pages to be demoted

To demote 50 GB huge pages, one would:

  cat .../hugepages-1048576kB/free_hugepages   /* optional, verify free pages */
  cat .../hugepages-1048576kB/demote_size      /* optional, verify target size */
  echo 50 > .../hugepages-1048576kB/demote

Only hugetlb pages which are free at the time of the request can be
demoted.  Demotion does not add to the complexity of surplus pages and
honors reserved huge pages.  Therefore, when a value is written to the
sysfs demote file, that value is only the maximum number of pages which
will be demoted.  It is possible fewer will actually be demoted.  The
recently introduced per-hstate mutex is used to synchronize demote
operations with other operations that modify hugetlb pools.

Real world use cases
--------------------
The above scenario describes a real world use case where hugetlb pages
are used to back VMs on x86.  Both issues of long allocation times and
not necessarily getting the expected number of smaller huge pages after
a free and allocate cycle have been experienced.  The occurrence of
these issues is dependent on other activity within the host and can not
be predicted.

This patch (of 5):

Two new sysfs files are added to demote hugtlb pages.  These files are
both per-hugetlb page size and per node.  Files are:

  demote_size - The size in Kb that pages are demoted to. (read-write)
  demote - The number of huge pages to demote. (write-only)

By default, demote_size is the next smallest huge page size.  Valid huge
page sizes less than huge page size may be written to this file.  When
huge pages are demoted, they are demoted to this size.

Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.

NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size.  For example, on x86 1GB huge
pages will have demote interfaces.  2MB huge pages will not have demote
interfaces.

This patch does not provide full demote functionality.  It only provides
the sysfs interfaces.

It also provides documentation for the new interfaces.

[mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex]
  Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com

Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: David Rientjes <rientjes@google.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This commit is contained in:
Mike Kravetz 2021-11-05 13:41:20 -07:00 committed by Linus Torvalds
parent 73c5476348
commit 79dfc69552
3 changed files with 183 additions and 3 deletions

View File

@ -234,8 +234,12 @@ will exist, of the form::
hugepages-${size}kB
Inside each of these directories, the same set of files will exist::
Inside each of these directories, the set of files contained in ``/proc``
will exist. In addition, two additional interfaces for demoting huge
pages may exist::
demote
demote_size
nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages
@ -243,7 +247,29 @@ Inside each of these directories, the same set of files will exist::
resv_hugepages
surplus_hugepages
which function as described above for the default huge page-sized case.
The demote interfaces provide the ability to split a huge page into
smaller huge pages. For example, the x86 architecture supports both
1GB and 2MB huge pages sizes. A 1GB huge page can be split into 512
2MB huge pages. Demote interfaces are not available for the smallest
huge page size. The demote interfaces are:
demote_size
is the size of demoted pages. When a page is demoted a corresponding
number of huge pages of demote_size will be created. By default,
demote_size is set to the next smaller huge page size. If there are
multiple smaller huge page sizes, demote_size can be set to any of
these smaller sizes. Only huge page sizes less than the current huge
pages size are allowed.
demote
is used to demote a number of huge pages. A user with root privileges
can write to this file. It may not be possible to demote the
requested number of huge pages. To determine how many pages were
actually demoted, compare the value of nr_hugepages before and after
writing to the demote interface. demote is a write only interface.
The interfaces which are the same as in ``/proc`` (all except demote and
demote_size) function as described above for the default huge page-sized case.
.. _mem_policy_and_hp_alloc:

View File

@ -586,6 +586,7 @@ struct hstate {
int next_nid_to_alloc;
int next_nid_to_free;
unsigned int order;
unsigned int demote_order;
unsigned long mask;
unsigned long max_huge_pages;
unsigned long nr_huge_pages;

View File

@ -2986,7 +2986,7 @@ free:
static void __init hugetlb_init_hstates(void)
{
struct hstate *h;
struct hstate *h, *h2;
for_each_hstate(h) {
if (minimum_order > huge_page_order(h))
@ -2995,6 +2995,22 @@ static void __init hugetlb_init_hstates(void)
/* oversize hugepages were init'ed in early boot */
if (!hstate_is_gigantic(h))
hugetlb_hstate_alloc_pages(h);
/*
* Set demote order for each hstate. Note that
* h->demote_order is initially 0.
* - We can not demote gigantic pages if runtime freeing
* is not supported, so skip this.
*/
if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
continue;
for_each_hstate(h2) {
if (h2 == h)
continue;
if (h2->order < h->order &&
h2->order > h->demote_order)
h->demote_order = h2->order;
}
}
VM_BUG_ON(minimum_order == UINT_MAX);
}
@ -3235,9 +3251,31 @@ out:
return 0;
}
static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
__must_hold(&hugetlb_lock)
{
int rc = 0;
lockdep_assert_held(&hugetlb_lock);
/* We should never get here if no demote order */
if (!h->demote_order) {
pr_warn("HugeTLB: NULL demote order passed to demote_pool_huge_page.\n");
return -EINVAL; /* internal error */
}
/*
* TODO - demote fucntionality will be added in subsequent patch
*/
return rc;
}
#define HSTATE_ATTR_RO(_name) \
static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
#define HSTATE_ATTR_WO(_name) \
static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
#define HSTATE_ATTR(_name) \
static struct kobj_attribute _name##_attr = \
__ATTR(_name, 0644, _name##_show, _name##_store)
@ -3433,6 +3471,105 @@ static ssize_t surplus_hugepages_show(struct kobject *kobj,
}
HSTATE_ATTR_RO(surplus_hugepages);
static ssize_t demote_store(struct kobject *kobj,
struct kobj_attribute *attr, const char *buf, size_t len)
{
unsigned long nr_demote;
unsigned long nr_available;
nodemask_t nodes_allowed, *n_mask;
struct hstate *h;
int err = 0;
int nid;
err = kstrtoul(buf, 10, &nr_demote);
if (err)
return err;
h = kobj_to_hstate(kobj, &nid);
if (nid != NUMA_NO_NODE) {
init_nodemask_of_node(&nodes_allowed, nid);
n_mask = &nodes_allowed;
} else {
n_mask = &node_states[N_MEMORY];
}
/* Synchronize with other sysfs operations modifying huge pages */
mutex_lock(&h->resize_lock);
spin_lock_irq(&hugetlb_lock);
while (nr_demote) {
/*
* Check for available pages to demote each time thorough the
* loop as demote_pool_huge_page will drop hugetlb_lock.
*
* NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
* but will when full demote functionality is added in a later
* patch.
*/
if (nid != NUMA_NO_NODE)
nr_available = h->free_huge_pages_node[nid];
else
nr_available = h->free_huge_pages;
nr_available -= h->resv_huge_pages;
if (!nr_available)
break;
err = demote_pool_huge_page(h, n_mask);
if (err)
break;
nr_demote--;
}
spin_unlock_irq(&hugetlb_lock);
mutex_unlock(&h->resize_lock);
if (err)
return err;
return len;
}
HSTATE_ATTR_WO(demote);
static ssize_t demote_size_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
int nid;
struct hstate *h = kobj_to_hstate(kobj, &nid);
unsigned long demote_size = (PAGE_SIZE << h->demote_order) / SZ_1K;
return sysfs_emit(buf, "%lukB\n", demote_size);
}
static ssize_t demote_size_store(struct kobject *kobj,
struct kobj_attribute *attr,
const char *buf, size_t count)
{
struct hstate *h, *demote_hstate;
unsigned long demote_size;
unsigned int demote_order;
int nid;
demote_size = (unsigned long)memparse(buf, NULL);
demote_hstate = size_to_hstate(demote_size);
if (!demote_hstate)
return -EINVAL;
demote_order = demote_hstate->order;
/* demote order must be smaller than hstate order */
h = kobj_to_hstate(kobj, &nid);
if (demote_order >= h->order)
return -EINVAL;
/* resize_lock synchronizes access to demote size and writes */
mutex_lock(&h->resize_lock);
h->demote_order = demote_order;
mutex_unlock(&h->resize_lock);
return count;
}
HSTATE_ATTR(demote_size);
static struct attribute *hstate_attrs[] = {
&nr_hugepages_attr.attr,
&nr_overcommit_hugepages_attr.attr,
@ -3449,6 +3586,16 @@ static const struct attribute_group hstate_attr_group = {
.attrs = hstate_attrs,
};
static struct attribute *hstate_demote_attrs[] = {
&demote_size_attr.attr,
&demote_attr.attr,
NULL,
};
static const struct attribute_group hstate_demote_attr_group = {
.attrs = hstate_demote_attrs,
};
static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
struct kobject **hstate_kobjs,
const struct attribute_group *hstate_attr_group)
@ -3466,6 +3613,12 @@ static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
hstate_kobjs[hi] = NULL;
}
if (h->demote_order) {
if (sysfs_create_group(hstate_kobjs[hi],
&hstate_demote_attr_group))
pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
}
return retval;
}