Commit Graph

18700 Commits

Author SHA1 Message Date
Linus Torvalds b1701d5e29 - hugetlb_vmemmap cleanups from Muchun Song
- hardware poisoning support for 1GB hugepages, from Naoya Horiguchi
 
 - highmem documentation fixups from Fabio De Francesco
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYvMFaQAKCRDdBJ7gKXxA
 jrM7AQCsaVsrRJgVMNqjDvLAsWFEm48s6/smQT4UZ9n/rPQUsAEA0iRpOfbjcxwK
 npAml1mp5uP2AkimTVLB6Bokt6N+wAg=
 =9Wta
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull remaining MM updates from Andrew Morton:
 "Three patch series - two that perform cleanups and one feature:

   - hugetlb_vmemmap cleanups from Muchun Song

   - hardware poisoning support for 1GB hugepages, from Naoya Horiguchi

   - highmem documentation fixups from Fabio De Francesco"

* tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits)
  Documentation/mm: add details about kmap_local_page() and preemption
  highmem: delete a sentence from kmap_local_page() kdocs
  Documentation/mm: rrefer kmap_local_page() and avoid kmap()
  Documentation/mm: avoid invalid use of addresses from kmap_local_page()
  Documentation/mm: don't kmap*() pages which can't come from HIGHMEM
  highmem: specify that kmap_local_page() is callable from interrupts
  highmem: remove unneeded spaces in kmap_local_page() kdocs
  mm, hwpoison: enable memory error handling on 1GB hugepage
  mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
  mm, hwpoison: make __page_handle_poison returns int
  mm, hwpoison: set PG_hwpoison for busy hugetlb pages
  mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
  mm, hwpoison, hugetlb: support saving mechanism of raw error pages
  mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
  mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
  mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE
  mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst
  mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
  mm: hugetlb_vmemmap: replace early_param() with core_param()
  mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
  ...
2022-08-10 11:18:00 -07:00
Linus Torvalds c235698355 cxl for 6.0
- Introduce a 'struct cxl_region' object with support for provisioning
   and assembling persistent memory regions.
 
 - Introduce alloc_free_mem_region() to accompany the existing
   request_free_mem_region() as a method to allocate physical memory
   capacity out of an existing resource.
 
 - Export insert_resource_expand_to_fit() for the CXL subsystem to
   late-publish CXL platform windows in iomem_resource.
 
 - Add a polled mode PCI DOE (Data Object Exchange) driver service and
   use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute
   Table).
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYvLYmAAKCRDfioYZHlFs
 Z0pbAQC/3j+WriWpU7CdhrnZI1Wqn+x5IIklF0Lc4/f6LwGZtAEAsSbLpItzvwqx
 M/rcLaeLpwYlgvS1JjdsuQ2VQ7KOtAs=
 =ehNT
 -----END PGP SIGNATURE-----

Merge tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl updates from Dan Williams:
 "Compute Express Link (CXL) updates for 6.0:

   - Introduce a 'struct cxl_region' object with support for
     provisioning and assembling persistent memory regions.

   - Introduce alloc_free_mem_region() to accompany the existing
     request_free_mem_region() as a method to allocate physical memory
     capacity out of an existing resource.

   - Export insert_resource_expand_to_fit() for the CXL subsystem to
     late-publish CXL platform windows in iomem_resource.

   - Add a polled mode PCI DOE (Data Object Exchange) driver service and
     use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute
     Table)"

* tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (74 commits)
  cxl/hdm: Fix skip allocations vs multiple pmem allocations
  cxl/region: Disallow region granularity != window granularity
  cxl/region: Fix x1 interleave to greater than x1 interleave routing
  cxl/region: Move HPA setup to cxl_region_attach()
  cxl/region: Fix decoder interleave programming
  Documentation: cxl: remove dangling kernel-doc reference
  cxl/region: describe targets and nr_targets members of cxl_region_params
  cxl/regions: add padding for cxl_rr_ep_add nested lists
  cxl/region: Fix IS_ERR() vs NULL check
  cxl/region: Fix region reference target accounting
  cxl/region: Fix region commit uninitialized variable warning
  cxl/region: Fix port setup uninitialized variable warnings
  cxl/region: Stop initializing interleave granularity
  cxl/hdm: Fix DPA reservation vs cxl_endpoint_decoder lifetime
  cxl/acpi: Minimize granularity for x1 interleaves
  cxl/region: Delete 'region' attribute from root decoders
  cxl/acpi: Autoload driver for 'cxl_acpi' test devices
  cxl/region: decrement ->nr_targets on error in cxl_region_attach()
  cxl/region: prevent underflow in ways_to_cxl()
  cxl/region: uninitialized variable in alloc_hpa()
  ...
2022-08-10 11:07:26 -07:00
Linus Torvalds b8dcef877a memblock updates for v5.20
* An optimization in memblock_add_range() to reduce array traversals
 * Improvements to the memblock test suite
 -----BEGIN PGP SIGNATURE-----
 
 iQFMBAABCAA2FiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmLyCbgYHG1pa2UucmFw
 b3BvcnRAZ21haWwuY29tAAoJEDkDhibLDv2RBxMH/1uIcfERl3Cbw25zluWSVn4O
 mrnr+JPqUkyeVLQDEGzk/VWIM1WT11s7fFpoTpIwu3dq/fVoD3HZlZQkWS0ANFDL
 V3xf6Xz17R5ZNoZmacczhNaBqkJSi+dcvoAevjyBHPpKEaCLC/rNrISpDdCD0Lz0
 5fgv2F4sISBUVc6FVIFB+9zKC/neI9ewemCABSFTIa5mmQvZZwX1Tj5BrxIsESwN
 DwX5u1Q65SoFBbAk6F5+aoClJ7wMGz8OlZoFw106HTvxq8sNne27KXW9mKugBzJr
 yAZ/TWrjXigNAr8dcXQEZuxagFSB1PQ4aNgU8phiAwE7/5z3j1KLa65hDRzc9t4=
 =JMiG
 -----END PGP SIGNATURE-----

Merge tag 'memblock-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock updates from Mike Rapoport:

 - An optimization in memblock_add_range() to reduce array traversals

 - Improvements to the memblock test suite

* tag 'memblock-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock test: Modify the obsolete description in README
  memblock tests: fix compilation errors
  memblock tests: change build options to run-time options
  memblock tests: remove completed TODO items
  memblock tests: set memblock_debug to enable memblock_dbg() messages
  memblock tests: add verbose output to memblock tests
  memblock tests: Makefile: add arguments to control verbosity
  memblock: avoid some repeat when add new range
2022-08-09 09:48:30 -07:00
Linus Torvalds f30adc0d33 iov_iter stuff, part 2, rebased
* more new_sync_{read,write}() speedups - ITER_UBUF introduction
 * ITER_PIPE cleanups
 * unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
   switching them to advancing semantics
 * making ITER_PIPE take high-order pages without splitting them
 * handling copy_page_from_iter() for high-order pages properly
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYvHI8QAKCRBZ7Krx/gZQ
 62CQAPsGlbebqBeAT2pMulaGDxfLAsgz5Yf4BEaMLhPtRqFOQgD+KrZQId7Sd8O0
 3IWucpTb2c4jvLlXhGMS+XWnusQH+AQ=
 =pBux
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull more iov_iter updates from Al Viro:

 - more new_sync_{read,write}() speedups - ITER_UBUF introduction

 - ITER_PIPE cleanups

 - unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
   switching them to advancing semantics

 - making ITER_PIPE take high-order pages without splitting them

 - handling copy_page_from_iter() for high-order pages properly

* tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
  fix copy_page_from_iter() for compound destinations
  hugetlbfs: copy_page_to_iter() can deal with compound pages
  copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
  expand those iov_iter_advance()...
  pipe_get_pages(): switch to append_pipe()
  get rid of non-advancing variants
  ceph: switch the last caller of iov_iter_get_pages_alloc()
  9p: convert to advancing variant of iov_iter_get_pages_alloc()
  af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
  iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
  block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
  iov_iter: saner helper for page array allocation
  fold __pipe_get_pages() into pipe_get_pages()
  ITER_XARRAY: don't open-code DIV_ROUND_UP()
  unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
  unify xarray_get_pages() and xarray_get_pages_alloc()
  unify pipe_get_pages() and pipe_get_pages_alloc()
  iov_iter_get_pages(): sanity-check arguments
  iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
  ...
2022-08-08 20:04:35 -07:00
Al Viro fcb14cb1bd new iov_iter flavour - ITER_UBUF
Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.

We are going to expose the things like ->write_iter() et.al. to those
in subsequent commits.

New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.

DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08 22:37:15 -04:00
Naoya Horiguchi 6f4614886b mm, hwpoison: enable memory error handling on 1GB hugepage
Now error handling code is prepared, so remove the blocking code and
enable memory error handling on 1GB hugepage.

Link: https://lkml.kernel.org/r/20220714042420.1847125-9-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:44 -07:00
Naoya Horiguchi ceaf8fbea7 mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
Currently if memory_failure() (modified to remove blocking code with
subsequent patch) is called on a page in some 1GB hugepage, memory error
handling fails and the raw error page gets into leaked state.  The impact
is small in production systems (just leaked single 4kB page), but this
limits the testability because unpoison doesn't work for it.  We can no
longer create 1GB hugepage on the 1GB physical address range with such
leaked pages, that's not useful when testing on small systems.

When a hwpoison page in a 1GB hugepage is handled, it's caught by the
PageHWPoison check in free_pages_prepare() because the 1GB hugepage is
broken down into raw error pages before coming to this point:

        if (unlikely(PageHWPoison(page)) && !order) {
                ...
                return false;
        }

Then, the page is not sent to buddy and the page refcount is left 0.

Originally this check is supposed to work when the error page is freed
from page_handle_poison() (that is called from soft-offline), but now we
are opening another path to call it, so the callers of
__page_handle_poison() need to handle the case by considering the return
value 0 as success.  Then page refcount for hwpoison is properly
incremented so unpoison works.

Link: https://lkml.kernel.org/r/20220714042420.1847125-8-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:44 -07:00
Naoya Horiguchi 7453bf621c mm, hwpoison: make __page_handle_poison returns int
__page_handle_poison() returns bool that shows whether
take_page_off_buddy() has passed or not now.  But we will want to
distinguish another case of "dissolve has passed but taking off failed" by
its return value.  So change the type of the return value.  No functional
change.

Link: https://lkml.kernel.org/r/20220714042420.1847125-7-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:44 -07:00
Naoya Horiguchi 38f6d29397 mm, hwpoison: set PG_hwpoison for busy hugetlb pages
If memory_failure() fails to grab page refcount on a hugetlb page because
it's busy, it returns without setting PG_hwpoison on it.  This not only
loses a chance of error containment, but breaks the rule that
action_result() should be called only when memory_failure() do any of
handling work (even if that's just setting PG_hwpoison).  This
inconsistency could harm code maintainability.

So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case.

Link: https://lkml.kernel.org/r/20220714042420.1847125-6-naoya.horiguchi@linux.dev
Fixes: 405ce05123 ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:44 -07:00
Naoya Horiguchi ac5fcde0a9 mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage
Raw error info list needs to be removed when hwpoisoned hugetlb is
unpoisoned.  And unpoison handler needs to know how many errors there are
in the target hugepage.  So add them.

HPageVmemmapOptimized(hpage) and HPageRawHwpUnreliable(hpage)) sometimes
can't be unpoisoned, so skip them.

Link: https://lkml.kernel.org/r/20220714042420.1847125-5-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:44 -07:00
Naoya Horiguchi 161df60e9e mm, hwpoison, hugetlb: support saving mechanism of raw error pages
When handling memory error on a hugetlb page, the error handler tries to
dissolve and turn it into 4kB pages.  If it's successfully dissolved,
PageHWPoison flag is moved to the raw error page, so that's all right. 
However, dissolve sometimes fails, then the error page is left as
hwpoisoned hugepage.  It's useful if we can retry to dissolve it to save
healthy pages, but that's not possible now because the information about
where the raw error pages is lost.

Use the private field of a few tail pages to keep that information.  The
code path of shrinking hugepage pool uses this info to try delayed
dissolve.  In order to remember multiple errors in a hugepage, a
singly-linked list originated from SUBPAGE_INDEX_HWPOISON-th tail page is
constructed.  Only simple operations (adding an entry or clearing all) are
required and the list is assumed not to be very long, so this simple data
structure should be enough.

If we failed to save raw error info, the hwpoison hugepage has errors on
unknown subpage, then this new saving mechanism does not work any more, so
disable saving new raw error info and freeing hwpoison hugepages.

Link: https://lkml.kernel.org/r/20220714042420.1847125-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:44 -07:00
Naoya Horiguchi 3a194f3f8a mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry
follow_pud_mask() does not support non-present pud entry now.  As long as
I tested on x86_64 server, follow_pud_mask() still simply returns
no_page_table() for non-present_pud_entry() due to pud_bad(), so no severe
user-visible effect should happen.  But generally we should call
follow_huge_pud() for non-present pud entry for 1GB hugetlb page.

Update pud_huge() and follow_huge_pud() to handle non-present pud entries.
The changes are similar to previous works for pud entries commit
e66f17ff71 ("mm/hugetlb: take page table lock in follow_huge_pmd()") and
commit cbef8478be ("mm/hugetlb: pmd_huge() returns true for non-present
hugepage").

Link: https://lkml.kernel.org/r/20220714042420.1847125-3-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:43 -07:00
Naoya Horiguchi c0531714d6 mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages()
Patch series "mm, hwpoison: enable 1GB hugepage support", v7.


This patch (of 8):

I found a weird state of 1GB hugepage pool, caused by the following
procedure:

  - run a process reserving all free 1GB hugepages,
  - shrink free 1GB hugepage pool to zero (i.e. writing 0 to
    /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages), then
  - kill the reserving process.

, then all the hugepages are free *and* surplus at the same time.

  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  3
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
  3
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/resv_hugepages
  0
  $ cat /sys/kernel/mm/hugepages/hugepages-1048576kB/surplus_hugepages
  3

This state is resolved by reserving and allocating the pages then freeing
them again, so this seems not to result in serious problem.  But it's a
little surprising (shrinking pool suddenly fails).

This behavior is caused by hstate_is_gigantic() check in
return_unused_surplus_pages().  This was introduced so long ago in 2008 by
commit aa888a7497 ("hugetlb: support larger than MAX_ORDER"), and at
that time the gigantic pages were not supposed to be allocated/freed at
run-time.  Now kernel can support runtime allocation/free, so let's check
gigantic_page_runtime_supported() together.

Link: https://lkml.kernel.org/r/20220714042420.1847125-1-naoya.horiguchi@linux.dev
Link: https://lkml.kernel.org/r/20220714042420.1847125-2-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:43 -07:00
Muchun Song e38f055d6d mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE
There is already a macro PTRS_PER_PTE to represent the number of page
table entries, just use it.

Link: https://lkml.kernel.org/r/20220628092235.91270-9-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:43 -07:00
Muchun Song 6213834c10 mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability
There is a discussion about the name of hugetlb_vmemmap_alloc/free in
thread [1].  The suggestion suggested by David is rename "alloc/free" to
"optimize/restore" to make functionalities clearer to users, "optimize"
means the function will optimize vmemmap pages, while "restore" means
restoring its vmemmap pages discared before.  This commit does this.

Another discussion is the confusion RESERVE_VMEMMAP_NR isn't used
explicitly for vmemmap_addr but implicitly for vmemmap_end in
hugetlb_vmemmap_alloc/free.  David suggested we can compute what
hugetlb_vmemmap_init() does now at runtime.  We do not need to worry for
the overhead of computing at runtime since the calculation is simple
enough and those functions are not in a hot path.  This commit has the
following improvements:

  1) The function suffixed name ("optimize/restore") is more expressive.
  2) The logic becomes less weird in hugetlb_vmemmap_optimize/restore().
  3) The hugetlb_vmemmap_init() does not need to be exported anymore.
  4) A ->optimize_vmemmap_pages field in struct hstate is killed.
  5) There is only one place where checks is_power_of_2(sizeof(struct
     page)) instead of two places.
  6) Add more comments for hugetlb_vmemmap_optimize/restore().
  7) For external users, hugetlb_optimize_vmemmap_pages() is used for
     detecting if the HugeTLB's vmemmap pages is optimizable originally.
     In this commit, it is killed and we introduce a new helper
     hugetlb_vmemmap_optimizable() to replace it.  The name is more
     expressive.

Link: https://lore.kernel.org/all/20220404074652.68024-2-songmuchun@bytedance.com/ [1]
Link: https://lkml.kernel.org/r/20220628092235.91270-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:43 -07:00
Muchun Song 30152245c6 mm: hugetlb_vmemmap: replace early_param() with core_param()
After the following commit:

  78f39084b4 ("mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl")

There is no order requirement between the parameter of
"hugetlb_free_vmemmap" and "hugepages" since we have removed the check of
whether HVO is enabled from hugetlb_vmemmap_init().  Therefore we can
safely replace early_param() with core_param() to simplify the code.

Link: https://lkml.kernel.org/r/20220628092235.91270-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:42 -07:00
Muchun Song 998a299788 mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c
When I first introduced vmemmap manipulation functions related to HugeTLB,
I thought those functions may be reused by other modules (e.g.  using
similar approach to optimize vmemmap pages, unfortunately, the DAX used
the same approach but does not use those functions).  After two years, we
didn't see any other users.  So move those functions to hugetlb_vmemmap.c.
Code movement without any functional change.

Link: https://lkml.kernel.org/r/20220628092235.91270-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:42 -07:00
Muchun Song dff033818a mm: hugetlb_vmemmap: introduce the name HVO
It it inconvenient to mention the feature of optimizing vmemmap pages
associated with HugeTLB pages when communicating with others since there
is no specific or abbreviated name for it when it is first introduced. 
Let us give it a name HVO (HugeTLB Vmemmap Optimization) from now.

This commit also updates the document about "hugetlb_free_vmemmap" by the
way discussed in thread [1].

Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
Link: https://lkml.kernel.org/r/20220628092235.91270-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:42 -07:00
Muchun Song cf5472e561 mm: hugetlb_vmemmap: optimize vmemmap_optimize_mode handling
We hold an another reference to hugetlb_optimize_vmemmap_key when making
vmemmap_optimize_mode on, because we use static_key to tell memory_hotplug
that memory_hotplug.memmap_on_memory should be overridden.  However, this
rule has gone when we have introduced PageVmemmapSelfHosted.  Therefore,
we could simplify vmemmap_optimize_mode handling by not holding an another
reference to hugetlb_optimize_vmemmap_key.  This also means that we not
incur the extra page_fixed_fake_head checks if there are no vmemmap
optinmized hugetlb pages after this change.

Link: https://lkml.kernel.org/r/20220628092235.91270-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08 18:06:42 -07:00
Linus Torvalds 6614a3c316 - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
 
 - Some kmemleak fixes from Patrick Wang and Waiman Long
 
 - DAMON updates from SeongJae Park
 
 - memcg debug/visibility work from Roman Gushchin
 
 - vmalloc speedup from Uladzislau Rezki
 
 - more folio conversion work from Matthew Wilcox
 
 - enhancements for coherent device memory mapping from Alex Sierra
 
 - addition of shared pages tracking and CoW support for fsdax, from
   Shiyang Ruan
 
 - hugetlb optimizations from Mike Kravetz
 
 - Mel Gorman has contributed some pagealloc changes to improve latency
   and realtime behaviour.
 
 - mprotect soft-dirty checking has been improved by Peter Xu
 
 - Many other singleton patches all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
 jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
 SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
 =w/UH
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Most of the MM queue. A few things are still pending.

  Liam's maple tree rework didn't make it. This has resulted in a few
  other minor patch series being held over for next time.

  Multi-gen LRU still isn't merged as we were waiting for mapletree to
  stabilize. The current plan is to merge MGLRU into -mm soon and to
  later reintroduce mapletree, with a view to hopefully getting both
  into 6.1-rc1.

  Summary:

   - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
     Lin, Yang Shi, Anshuman Khandual and Mike Rapoport

   - Some kmemleak fixes from Patrick Wang and Waiman Long

   - DAMON updates from SeongJae Park

   - memcg debug/visibility work from Roman Gushchin

   - vmalloc speedup from Uladzislau Rezki

   - more folio conversion work from Matthew Wilcox

   - enhancements for coherent device memory mapping from Alex Sierra

   - addition of shared pages tracking and CoW support for fsdax, from
     Shiyang Ruan

   - hugetlb optimizations from Mike Kravetz

   - Mel Gorman has contributed some pagealloc changes to improve
     latency and realtime behaviour.

   - mprotect soft-dirty checking has been improved by Peter Xu

   - Many other singleton patches all over the place"

 [ XFS merge from hell as per Darrick Wong in

   https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]

* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
  tools/testing/selftests/vm/hmm-tests.c: fix build
  mm: Kconfig: fix typo
  mm: memory-failure: convert to pr_fmt()
  mm: use is_zone_movable_page() helper
  hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
  hugetlbfs: cleanup some comments in inode.c
  hugetlbfs: remove unneeded header file
  hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
  hugetlbfs: use helper macro SZ_1{K,M}
  mm: cleanup is_highmem()
  mm/hmm: add a test for cross device private faults
  selftests: add soft-dirty into run_vmtests.sh
  selftests: soft-dirty: add test for mprotect
  mm/mprotect: fix soft-dirty check in can_change_pte_writable()
  mm: memcontrol: fix potential oom_lock recursion deadlock
  mm/gup.c: fix formatting in check_and_migrate_movable_page()
  xfs: fail dax mount if reflink is enabled on a partition
  mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
  userfaultfd: don't fail on unrecognized features
  hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
  ...
2022-08-05 16:32:45 -07:00
Linus Torvalds 79b7e67bb9 This pull request contains the following changes for UML:
- KASAN support for x86_64
 - noreboot command line option, just like qemu's -no-reboot
 - Various fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEdgfidid8lnn52cLTZvlZhesYu8EFAmLteZYACgkQZvlZhesY
 u8F8bRAA2806QUzysg3Nj1AKPiTOj47TuluGu4SXytB0usQgYK/n3Fxr36ULJAOJ
 3qZWf2fsAkBLvgX9Sw2QFGfulrpfKnLeTdBXSEbWYWhZ0ZoaEJztKmtfH02kRDOW
 POedQT5FXMDVjGQdLC7Ycp+WyjaUwrccZ+KRkGWmlr7vNFlxcTlEqBb13mgLdjkY
 ep8X+SgmAcdvWBd/os+nNn9Al6TbFd4XQCok82DtNrv0ggwXnVPov/ArvZvvn2Oj
 F028X77180rbrGV+ZnDkV1KSv/ccT5EFebJkfEEcYVjre8o0QoPQmh2tFqXN0d83
 2WpIOb1+mQL0VClpC4hKbScpIB5tw8vIHsUT+ifloIgY/puhezx6aWm0TKSA+aTM
 WitJl1Nf4uNu1rqkBkn9o3VK8CYokTALQIRexHCzvZ70CSxmFbR7EVRSTf7Rr690
 Oq7StHagfuTJpddh0wQwaMorIH4s0/bpPoA6m4OhwlppnCpY0Hfl3+AKluNRUtH6
 lPeQwfxhd/LKqYW0COElEnReDLzer82kUx/keVyxVINqxpm6YTHVtOgtMCEuVNXg
 GbS8PFCW2mIP8Is6HJavZYCzG8vnz3wZ9GENujanwLemiIJfINDauybu+nNsE5pO
 7v12vWeZ0x2HGM/cFxODrpp4xAkdq8BBLap8/aXB8uJFagmYyhs=
 =f3Bh
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml

Pull UML updates from Richard Weinberger:

 - KASAN support for x86_64

 - noreboot command line option, just like qemu's -no-reboot

 - Various fixes and cleanups

* tag 'for-linus-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
  um: include sys/types.h for size_t
  um: Replace to_phys() and to_virt() with less generic function names
  um: Add missing apply_returns()
  um: add "noreboot" command line option for PANIC_TIMEOUT=-1 setups
  um: include linux/stddef.h for __always_inline
  UML: add support for KASAN under x86_64
  mm: Add PAGE_ALIGN_DOWN macro
  um: random: Don't initialise hwrng struct with zero
  um: remove unused mm_copy_segments
  um: remove unused variable
  um: Remove straying parenthesis
  um: x86: print RIP with symbol
  arch: um: Fix build for statically linked UML w/ constructors
  x86/um: Kconfig: Fix indentation
  um/drivers: Kconfig: Fix indentation
  um: Kconfig: Fix indentation
2022-08-05 14:03:11 -07:00
Linus Torvalds 3bd6e5854b asm-generic: updates for 6.0
There are three independent sets of changes:
 
  - Sai Prakash Ranjan adds tracing support to the asm-generic
    version of the MMIO accessors, which is intended to help
    understand problems with device drivers and has been part
    of Qualcomm's vendor kernels for many years.
 
  - A patch from Sebastian Siewior to rework the handling of
    IRQ stacks in softirqs across architectures, which is
    needed for enabling PREEMPT_RT.
 
  - The last patch to remove the CONFIG_VIRT_TO_BUS option and
    some of the code behind that, after the last users of this
    old interface made it in through the netdev, scsi, media and
    staging trees.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmLqPPEACgkQmmx57+YA
 GNlUbQ/+NpIsiA0JUrCGtySt8KrLHdA2dH9lJOR5/iuxfphscPFfWtpcPvcXQWmt
 a8u7wyI8SHW1ku4U0Y5sO0dBSldDnoIqJ5t4X5d7YNU9yVtEtucqQhZf+GkrPlVD
 1HkRu05B7y0k2BMn7BLhSvkpafs3f1lNGXjs8oFBdOF1/zwp/GjcrfCK7KFzqjwU
 dYrX0SOFlKFd4BZC75VfK+XcKg4LtwIOmJraRRl7alz2Q5Oop2hgjgZxXDPf//vn
 SPOhXJN/97i1FUpY2TkfHVH1NxbPfjCV4pUnjmLG0Y4NSy9UQ/ZcXHcywIdeuhfa
 0LySOIsAqBeccpYYYdg2ubiMDZOXkBfANu/sB9o/EhoHfB4svrbPRDhBIQZMFXJr
 MJYu+IYce2rvydA/nydo4q++pxR8v1ES1ZIo8bDux+q1CI/zbpQV+f98kPVRA0M7
 ajc+5GTIqNIsvHzzadq7eYxcj5Bi8Li2JA9sVkAQ+6iq1TVyeYayMc9eYwONlmqw
 MD+PFYc651pKtXZCfkLXPIKSwS0uPqBndAibuVhpZ0hxWaCBBdKvY9mrWcPxt0kA
 tMR8lrosbbrV2K48BFdWTOHvCs2FhHQxPGVPZ/iWuxTA0hHZ9tUlaEkSX+VM57IU
 KCYQLdWzT8J9vrgqSbgYKlb6pSPz6FIjTfut6NZMmshIbavHV/Q=
 =aTR0
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull asm-generic updates from Arnd Bergmann:
 "There are three independent sets of changes:

   - Sai Prakash Ranjan adds tracing support to the asm-generic version
     of the MMIO accessors, which is intended to help understand
     problems with device drivers and has been part of Qualcomm's vendor
     kernels for many years

   - A patch from Sebastian Siewior to rework the handling of IRQ stacks
     in softirqs across architectures, which is needed for enabling
     PREEMPT_RT

   - The last patch to remove the CONFIG_VIRT_TO_BUS option and some of
     the code behind that, after the last users of this old interface
     made it in through the netdev, scsi, media and staging trees"

* tag 'asm-generic-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  uapi: asm-generic: fcntl: Fix typo 'the the' in comment
  arch/*/: remove CONFIG_VIRT_TO_BUS
  soc: qcom: geni: Disable MMIO tracing for GENI SE
  serial: qcom_geni_serial: Disable MMIO tracing for geni serial
  asm-generic/io: Add logging support for MMIO accessors
  KVM: arm64: Add a flag to disable MMIO trace for nVHE KVM
  lib: Add register read/write tracing support
  drm/meson: Fix overflow implicit truncation warnings
  irqchip/tegra: Fix overflow implicit truncation warnings
  coresight: etm4x: Use asm-generic IO memory barriers
  arm64: io: Use asm-generic high level MMIO accessors
  arch/*: Disable softirq stacks on PREEMPT_RT.
2022-08-05 10:07:23 -07:00
Linus Torvalds 353767e4aa for-5.20-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLnyNUACgkQxWXV+ddt
 WDt9vA/9HcF+v5EkknyW07tatTap/Hm/ZB86Z5OZi6ikwIEcHsWhp3rUICejm88e
 GecDPIluDtCtyD6x4stuqkwOm22aDP5q2T9H6+gyw92ozyb436OV1Z8IrmftzXKY
 EpZO70PHZT+E6E/WYvyoTmmoCrjib7YlqCWZZhSLUFpsqqlOInmHEH49PW6KvM4r
 acUZ/RxHurKdmI3kNY6ECbAQl6CASvtTdYcVCx8fT2zN0azoLIQxpYa7n/9ca1R6
 8WnYilCbLbNGtcUXvO2M3tMZ4/5kvxrwQsUn93ccCJYuiN0ASiDXbLZ2g4LZ+n56
 JGu+y5v5oBwjpVf+46cuvnENP5BQ61594WPseiVjrqODWnPjN28XkcVC0XmPsiiZ
 lszeHO2cuIrIFoCah8ELMl8usu8+qxfXmPxIXtPu9rEyKsDtOjxVYc8SMXqLp0qQ
 qYtBoFm0JcZHqtZRpB+dhQ37/xXtH4ljUi/mI6x8iALVujeR273URs7yO9zgIdeW
 uZoFtbwpHFLUk+TL7Ku82/zOXp3fCwtDpNmlYbxeMbea/be3ShjncM4+mYzvHYri
 dYON2LFrq+mnRDqtIXTCaAYwX7zU8Y18Ev9QwlNll8dKlKwS89+jpqLoa+eVYy3c
 /HitHFza70KxmOj4dvDVZlzDpPvl7kW1UBkmskg4u3jnNWzedkM=
 =sS1q
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This brings some long awaited changes, the send protocol bump,
  otherwise lots of small improvements and fixes. The main core part is
  reworking bio handling, cleaning up the submission and endio and
  improving error handling.

  There are some changes outside of btrfs adding helpers or updating
  API, listed at the end of the changelog.

  Features:

   - sysfs:
      - export chunk size, in debug mode add tunable for setting its size
      - show zoned among features (was only in debug mode)
      - show commit stats (number, last/max/total duration)

   - send protocol updated to 2
      - new commands:
         - ability write larger data chunks than 64K
         - send raw compressed extents (uses the encoded data ioctls),
           ie. no decompression on send side, no compression needed on
           receive side if supported
         - send 'otime' (inode creation time) among other timestamps
         - send file attributes (a.k.a file flags and xflags)
      - this is first version bump, backward compatibility on send and
        receive side is provided
      - there are still some known and wanted commands that will be
        implemented in the near future, another version bump will be
        needed, however we want to minimize that to avoid causing
        usability issues

   - print checksum type and implementation at mount time

   - don't print some messages at mount (mentioned as people asked about
     it), we want to print messages namely for new features so let's
     make some space for that
      - big metadata - this has been supported for a long time and is
        not a feature that's worth mentioning
      - skinny metadata - same reason, set by default by mkfs

  Performance improvements:

   - reduced amount of reserved metadata for delayed items
      - when inserted items can be batched into one leaf
      - when deleting batched directory index items
      - when deleting delayed items used for deletion
      - overall improved count of files/sec, decreased subvolume lock
        contention

   - metadata item access bounds checker micro-optimized, with a few
     percent of improved runtime for metadata-heavy operations

   - increase direct io limit for read to 256 sectors, improved
     throughput by 3x on sample workload

  Notable fixes:

   - raid56
      - reduce parity writes, skip sectors of stripe when there are no
        data updates
      - restore reading from on-disk data instead of using stripe cache,
        this reduces chances to damage correct data due to RMW cycle

   - refuse to replay log with unknown incompat read-only feature bit
     set

   - zoned
      - fix page locking when COW fails in the middle of allocation
      - improved tracking of active zones, ZNS drives may limit the
        number and there are ENOSPC errors due to that limit and not
        actual lack of space
      - adjust maximum extent size for zone append so it does not cause
        late ENOSPC due to underreservation

   - mirror reading error messages show the mirror number

   - don't fallback to buffered IO for NOWAIT direct IO writes, we don't
     have the NOWAIT semantics for buffered io yet

   - send, fix sending link commands for existing file paths when there
     are deleted and created hardlinks for same files

   - repair all mirrors for profiles with more than 1 copy (raid1c34)

   - fix repair of compressed extents, unify where error detection and
     repair happen

  Core changes:

   - bio completion cleanups
      - don't double defer compression bios
      - simplify endio workqueues
      - add more data to btrfs_bio to avoid allocation for read requests
      - rework bio error handling so it's same what block layer does,
        the submission works and errors are consumed in endio
      - when asynchronous bio offload fails fall back to synchronous
        checksum calculation to avoid errors under writeback or memory
        pressure

   - new trace points
      - raid56 events
      - ordered extent operations

   - super block log_root_transid deprecated (never used)

   - mixed_backref and big_metadata sysfs feature files removed, they've
     been default for sufficiently long time, there are no known users
     and mixed_backref could be confused with mixed_groups

  Non-btrfs changes, API updates:

   - minor highmem API update to cover const arguments

   - switch all kmap/kmap_atomic to kmap_local

   - remove redundant flush_dcache_page()

   - address_space_operations::writepage callback removed

   - add bdev_max_segments() helper"

* tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits)
  btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read
  btrfs: fix repair of compressed extents
  btrfs: remove the start argument to check_data_csum and export
  btrfs: pass a btrfs_bio to btrfs_repair_one_sector
  btrfs: simplify the pending I/O counting in struct compressed_bio
  btrfs: repair all known bad mirrors
  btrfs: merge btrfs_dev_stat_print_on_error with its only caller
  btrfs: join running log transaction when logging new name
  btrfs: simplify error handling in btrfs_lookup_dentry
  btrfs: send: always use the rbtree based inode ref management infrastructure
  btrfs: send: fix sending link commands for existing file paths
  btrfs: send: introduce recorded_ref_alloc and recorded_ref_free
  btrfs: zoned: wait until zone is finished when allocation didn't progress
  btrfs: zoned: write out partially allocated region
  btrfs: zoned: activate necessary block group
  btrfs: zoned: activate metadata block group on flush_space
  btrfs: zoned: disable metadata overcommit for zoned
  btrfs: zoned: introduce space_info->active_total_bytes
  btrfs: zoned: finish least available block group on data bg allocation
  btrfs: let can_allocate_chunk return error
  ...
2022-08-03 14:54:52 -07:00
Linus Torvalds 97a77ab14f EFI updates for v5.20
- Enable mirrored memory for arm64
 - Fix up several abuses of the efivar API
 - Refactor the efivar API in preparation for moving the 'business logic'
   part of it into efivarfs
 - Enable ACPI PRM on arm64
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmLhuDIACgkQw08iOZLZ
 jyS9IQv/Wc2nhjN50S3gfrL+68/el/hGdP/J0FK5BOOjNosG2t1ZNYZtSthXqpPH
 hRrTU2m6PpQUalRpFDyLiHkJvdBFQe4VmvrzBa3TIBIzyflLQPJzkWrqThPchV+B
 qi4lmCtTDNIEJmayewqx1wWA+QmUiyI5zJ8wrZp84LTctBPL75seVv0SB20nqai0
 3/I73omB2RLVGpCpeWvb++vePXL8euFW3FEwCTM8hRboICjORTyIZPy8Y5os+3xT
 UgrIgVDOtn1Xwd4tK0qVwjOVA51east4Fcn3yGOrL40t+3SFm2jdpAJOO3UvyNPl
 vkbtjvXsIjt3/oxreKxXHLbamKyueWIfZRyCLsrg6wrr96oypPk6ID4iDCQoen/X
 Zf0VjM2vmvSd4YgnEIblOfSBxVg48cHJA4iVHVxFodNTrVnzGGFYPTmNKmJqo+Xn
 JeUILM7jlR4h/t0+cTTK3Busu24annTuuz5L5rjf4bUm6pPf4crb1yJaFWtGhlpa
 er233D6O
 =zI0R
 -----END PGP SIGNATURE-----

Merge tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI updates from Ard Biesheuvel:

 - Enable mirrored memory for arm64

 - Fix up several abuses of the efivar API

 - Refactor the efivar API in preparation for moving the 'business
   logic' part of it into efivarfs

 - Enable ACPI PRM on arm64

* tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits)
  ACPI: Move PRM config option under the main ACPI config
  ACPI: Enable Platform Runtime Mechanism(PRM) support on ARM64
  ACPI: PRM: Change handler_addr type to void pointer
  efi: Simplify arch_efi_call_virt() macro
  drivers: fix typo in firmware/efi/memmap.c
  efi: vars: Drop __efivar_entry_iter() helper which is no longer used
  efi: vars: Use locking version to iterate over efivars linked lists
  efi: pstore: Omit efivars caching EFI varstore access layer
  efi: vars: Add thin wrapper around EFI get/set variable interface
  efi: vars: Don't drop lock in the middle of efivar_init()
  pstore: Add priv field to pstore_record for backend specific use
  Input: applespi - avoid efivars API and invoke EFI services directly
  selftests/kexec: remove broken EFI_VARS secure boot fallback check
  brcmfmac: Switch to appropriate helper to load EFI variable contents
  iwlwifi: Switch to proper EFI variable store interface
  media: atomisp_gmin_platform: stop abusing efivar API
  efi: efibc: avoid efivar API for setting variables
  efi: avoid efivars layer when loading SSDTs from variables
  efi: Correct comment on efi_memmap_alloc
  memblock: Disable mirror feature if kernelcore is not specified
  ...
2022-08-03 14:38:02 -07:00
Linus Torvalds f00654007f Folio changes for 6.0
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
    when running xfstests
 
  - Convert more of mpage to use folios
 
  - Remove add_to_page_cache() and add_to_page_cache_locked()
 
  - Convert find_get_pages_range() to filemap_get_folios()
 
  - Improvements to the read_cache_page() family of functions
 
  - Remove a few unnecessary checks of PageError
 
  - Some straightforward filesystem conversions to use folios
 
  - Split PageMovable users out from address_space_operations into their
    own movable_operations
 
  - Convert aops->migratepage to aops->migrate_folio
 
  - Remove nobh support (Christoph Hellwig)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmLpViQACgkQDpNsjXcp
 gj5pBgf/f3+K7Hi3qw7aYQCYJQ7IA/bLyE/DLWI59kuiao6wDSve40B9YH9X++Ha
 mRLp55bkQS+bwS2xa4jlqrIDJzAfNoWlXaXZHUXGL1C/52ChTF6jaH2cvO9PVlDS
 7fLv1hy2LwiIdzpKJkUW7T+kcQGj3QLKqtQ4x8zD0LGMg055yvt/qndHSUi41nWT
 /58+6W8Sk4vvRgkpeChFzF1lGLy00+FGT8y5V2kM9uRliFQ7XPCwqB2a3e5jbW6z
 C1NXQmRnopCrnOT1TFIhK3DyX6MDIWV5qcikNAmCKFb9fQFPmjDLPt9iSoMGjw2M
 Z+UVhJCaU3ISccd0DG5Ra/vzs9/O9Q==
 =DgUi
 -----END PGP SIGNATURE-----

Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache

Pull folio updates from Matthew Wilcox:

 - Fix an accounting bug that made NR_FILE_DIRTY grow without limit
   when running xfstests

 - Convert more of mpage to use folios

 - Remove add_to_page_cache() and add_to_page_cache_locked()

 - Convert find_get_pages_range() to filemap_get_folios()

 - Improvements to the read_cache_page() family of functions

 - Remove a few unnecessary checks of PageError

 - Some straightforward filesystem conversions to use folios

 - Split PageMovable users out from address_space_operations into
   their own movable_operations

 - Convert aops->migratepage to aops->migrate_folio

 - Remove nobh support (Christoph Hellwig)

* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
  fs: remove the NULL get_block case in mpage_writepages
  fs: don't call ->writepage from __mpage_writepage
  fs: remove the nobh helpers
  jfs: stop using the nobh helper
  ext2: remove nobh support
  ntfs3: refactor ntfs_writepages
  mm/folio-compat: Remove migration compatibility functions
  fs: Remove aops->migratepage()
  secretmem: Convert to migrate_folio
  hugetlb: Convert to migrate_folio
  aio: Convert to migrate_folio
  f2fs: Convert to filemap_migrate_folio()
  ubifs: Convert to filemap_migrate_folio()
  btrfs: Convert btrfs_migratepage to migrate_folio
  mm/migrate: Add filemap_migrate_folio()
  mm/migrate: Convert migrate_page() to migrate_folio()
  nfs: Convert to migrate_folio
  btrfs: Convert btree_migratepage to migrate_folio
  mm/migrate: Convert expected_page_refs() to folio_expected_refs()
  mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
  ...
2022-08-03 10:35:43 -07:00
Linus Torvalds 79802ada87 selinux/stable-6.0 PR 20220801
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmLoEeIUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNSOhAAwWwRcmcHnk+k2agT9QjKrLo26NCO
 MQLE89o4y2ChEFHxC7F7SKoQRxtfYa323p1vmlGzKrlB+IZ6oqERVp4QNQQbXsfn
 n9VvVpxjRNHAetcRhCM9ZOchWjUdw6AMaJ8e3fdRNRESadAUUFDxifw1wpjgG9+i
 LmtDbfZ7vLs2grTf9OZy3JIl1VF3lVRUTI7ZBQggfJncMa+LXNWdVNmEe3yfyboA
 1MwpSao7K2si0hBGAQo/UGQz4b19Tm4xMg8bSy7oTsP5Lae5ciPkeI3qazvs9usp
 WScZYhQ8NugqLbDbjs7dm6QCpj4x3dUs6ei48LKe3GF2mcGesFfOPo9sNHao4kKv
 C9t0f9qw+EhGvnNL7uQIDDf8OuTjuLWDvZSrMLID/IJKFF5NJ3y+XzaS9aPM3VEY
 qyOsX+cEzheXGhD6xE1sCo+AyPUDYqNDMIKBj2wlIGCKlzDGa8RT6VsQuvgf3c3K
 43CnRCQeWDWOHCq3MnRe/fmYtW+JB7tsXiKAq4OJADacwPP36bsP3bqU8AlWYwDt
 tnuMa+LKusHnMEQpMPI8FW8qGdxwGSen+mymfLFIMgtwNGkV7WGRJ6Lbyn0SaR6v
 HyXgZASIOQRnamK3yZCDpxo0K81IVxPWJIjHyg53znqT5TCpXccPyV4HwbJKI/KG
 8PtHrXOdPOGCZ2g=
 =WWq1
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20220801' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:
 "A relatively small set of patches for SELinux this time, eight patches
  in total with really only one significant change.

  The highlights are:

   - Add support for proper labeling of memfd_secret anonymous inodes.

     This will allow LSMs that implement the anonymous inode hooks to
     apply security policy to memfd_secret() fds.

   - Various small improvements to memory management: fixed leaks, freed
     memory when needed, boundary checks.

   - Hardened the selinux_audit_data struct with __randomize_layout.

   - A minor documentation tweak to fix a formatting/style issue"

* tag 'selinux-pr-20220801' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: selinux_add_opt() callers free memory
  selinux: Add boundary check in put_entry()
  selinux: fix memleak in security_read_state_kernel()
  docs: selinux: add '=' signs to kernel boot options
  mm: create security context for memfd_secret inodes
  selinux: fix typos in comments
  selinux: drop unnecessary NULL check
  selinux: add __randomize_layout to selinux_audit_data
2022-08-02 14:51:47 -07:00
Linus Torvalds 6991a564f5 hardening updates for v5.20-rc1
- Fix Sparse warnings with randomizd kstack (GONG, Ruiqi)
 
 - Replace uintptr_t with unsigned long in usercopy (Jason A. Donenfeld)
 
 - Fix Clang -Wforward warning in LKDTM (Justin Stitt)
 
 - Fix comment to correctly refer to STRICT_DEVMEM (Lukas Bulwahn)
 
 - Introduce dm-verity binding logic to LoadPin LSM (Matthias Kaehlcke)
 
 - Clean up warnings and overflow and KASAN tests (Kees Cook)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmLoEN4WHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJr19D/0fUvHaOui3+ePqKL1CEN2WOYxK
 Ed/HA0kM7VZnuakS2OoWbHYKurt9wImBkw0EuryNEP4nCBHy5OIyDOmWF7DjWntG
 9agKLW5rRgbKe9STbGZpJ92WWosOcJkgkDVES1/NjWt7ujLiefzcZE85hj2Dt1aQ
 6nF2LlkdGdtsa07hP5CR5bynQxAAxg1R1pLiJCgZRYn1SEFYtjcnBjUMrPUFJAi2
 TNy6ijeG473Oj6V/JiIY88u41KG1fed22SymNj6aQVIjGpH7atn6/ooG076ydAyt
 QEibSyQP/CwkSbyiqVFOq4v4a+hKEB5j5F+iKZBrCnFWNvt8D3tizBYgm1NymNEZ
 VBZdg+UhcoVDwiMNzSaAGvt15Qv0INNkQm9PJoeUGSdXz0Yjf4ghOIeaQc3jm6Of
 tElawmPXxVwRZfNpf5tyPaZFphAPK5EAl35S5mdWinKbAO7Jpz9xqvoyZz9/kygR
 Kd4qyRPrl0YM8SBKFuYt5rFaYfw9wqF7ox7cMmwR+pbEHt7UDqDvkX2fBbpCyXza
 5nJ9PDyvB5SqonIF57RiImXCLKXR6UMJgQvtDGsf+n4hpxL40Nga4pqWY82aQEhj
 SRdQlkFYhI/izIq1kMJNa8IoOONlJzV6i87D8iOW32bI3/SrykRUTvV3ohpMri7V
 UFkXzz8pfqfJ4k4zVA==
 =FRS3
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:

 - Fix Sparse warnings with randomizd kstack (GONG, Ruiqi)

 - Replace uintptr_t with unsigned long in usercopy (Jason A. Donenfeld)

 - Fix Clang -Wforward warning in LKDTM (Justin Stitt)

 - Fix comment to correctly refer to STRICT_DEVMEM (Lukas Bulwahn)

 - Introduce dm-verity binding logic to LoadPin LSM (Matthias Kaehlcke)

 - Clean up warnings and overflow and KASAN tests (Kees Cook)

* tag 'hardening-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  dm: verity-loadpin: Drop use of dm_table_get_num_targets()
  kasan: test: Silence GCC 12 warnings
  drivers: lkdtm: fix clang -Wformat warning
  x86: mm: refer to the intended config STRICT_DEVMEM in a comment
  dm: verity-loadpin: Use CONFIG_SECURITY_LOADPIN_VERITY for conditional compilation
  LoadPin: Enable loading from trusted dm-verity devices
  dm: Add verity helpers for LoadPin
  stack: Declare {randomize_,}kstack_offset to fix Sparse warnings
  lib: overflow: Do not define 64-bit tests on 32-bit
  MAINTAINERS: Add a general "kernel hardening" section
  usercopy: use unsigned long instead of uintptr_t
2022-08-02 14:38:59 -07:00
Linus Torvalds 98e2474640 for-5.20/io_uring-buffered-writes-2022-07-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLkm7UQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpldTEADTg/96R+eq78UZBNZmdifY9/qwQD+kzNiK
 ACDoYZFSbWUjMeOWqRxbYr6mXBKHnGHyTGlraTTpLDzhpB1xwoWfgOK9uOYXW/Ik
 eWfgTujPW/8v/l/z86khE+GH9b/maGCRqNZgS6uLVLzhxG6oCkoYTyOh1iHaF1VM
 Rma4nbJ8GSEDtiXNDl0Bznnyks/pzwoz/9slwzZ7PxtFwZsBxKuxgMUR5HIXdRp7
 5iUoFJhZrGWyi/dbQZUsK/9VYVVnKkcBCz2pb4GEmC+3dS/vlPEoeWUpPHInNyd1
 9NB9v8c+KFmQaWnCxuxcdHvCfmRRQrX8Pr8/OBNZKO6McYrKWKA+lurp4EGClE3m
 cZdK+P/9FS/Eeua8hum9UnbPAqsJPqLTbpbrySeBdd4iFA6u7rRqDX2+nz3PNe9U
 1b7V1bWBIEY/Rsw/PKo59oIeV0auD8v9OCHJ0lF2pv6dRln2/W0y1Qfd1DI18xFG
 +9bBnQzhF7R0O8UP5ApVayQCYrd906YsSVUOqAiLmUs/BoOgRq6g/0BqSOVVKE2u
 5iq8zTsVMkxY0ZpExwZST/700JwkPIV4SVPEYRC6QssFTcylvlisIek6XYSS9HX4
 Z6gzMwJW1H47bEfG4JolTI8uBjp0hQLCPX0O0XFLVnbHQwN0kjIBmv3axAwJO2NV
 qrrHXjf09w==
 =hV7G
 -----END PGP SIGNATURE-----

Merge tag 'for-5.20/io_uring-buffered-writes-2022-07-29' of git://git.kernel.dk/linux-block

Pull io_uring buffered writes support from Jens Axboe:
 "This contains support for buffered writes, specifically for XFS. btrfs
  is in progress, will be coming in the next release.

  io_uring does support buffered writes on any file type, but since the
  buffered write path just always -EAGAIN (or -EOPNOTSUPP) any attempt
  to do so if IOCB_NOWAIT is set, any buffered write will effectively be
  handled by io-wq offload. This isn't very efficient, and we even have
  specific code in io-wq to serialize buffered writes to the same inode
  to avoid further inefficiencies with thread offload.

  This is particularly sad since most buffered writes don't block, they
  simply copy data to a page and dirty it. With this pull request, we
  can handle buffered writes a lot more effiently.

  If balance_dirty_pages() needs to block, we back off on writes as
  indicated.

  This improves buffered write support by 2-3x.

  Jan Kara helped with the mm bits for this, and Stefan handled the
  fs/iomap/xfs/io_uring parts of it"

* tag 'for-5.20/io_uring-buffered-writes-2022-07-29' of git://git.kernel.dk/linux-block:
  mm: honor FGP_NOWAIT for page cache page allocation
  xfs: Add async buffered write support
  xfs: Specify lockmode when calling xfs_ilock_for_iomap()
  io_uring: Add tracepoint for short writes
  io_uring: fix issue with io_write() not always undoing sb_start_write()
  io_uring: Add support for async buffered writes
  fs: Add async write file modification handling.
  fs: Split off inode_needs_update_time and __file_update_time
  fs: add __remove_file_privs() with flags parameter
  fs: add a FMODE_BUF_WASYNC flags for f_mode
  iomap: Return -EAGAIN from iomap_write_iter()
  iomap: Add async buffered write support
  iomap: Add flags parameter to iomap_page_create()
  mm: Add balance_dirty_pages_ratelimited_flags() function
  mm: Move updates of dirty_exceeded into one place
  mm: Move starting of background writeback into the main balancing loop
2022-08-02 13:27:23 -07:00
Matthew Wilcox (Oracle) 9800562f2a mm/folio-compat: Remove migration compatibility functions
migrate_page_move_mapping(), migrate_page_copy() and migrate_page_states()
are all now unused after converting all the filesystems from
aops->migratepage() to aops->migrate_folio().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle) 9d0ddc0cb5 fs: Remove aops->migratepage()
With all users converted to migrate_folio(), remove this operation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle) 5409548df3 secretmem: Convert to migrate_folio
This is little more than changing the types over; there's no real work
being done in this function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle) b890ec2a2c hugetlb: Convert to migrate_folio
This involves converting migrate_huge_page_move_mapping().  We also need a
folio variant of hugetlb_set_page_subpool(), but that's for a later patch.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle) 2ec810d596 mm/migrate: Add filemap_migrate_folio()
There is nothing iomap-specific about iomap_migratepage(), and it fits
a pattern used by several other filesystems, so move it to mm/migrate.c,
convert it to be filemap_migrate_folio() and convert the iomap filesystems
to use it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle) 541846502f mm/migrate: Convert migrate_page() to migrate_folio()
Convert all callers to pass a folio.  Most have the folio
already available.  Switch all users from aops->migratepage to
aops->migrate_folio.  Also turn the documentation into kerneldoc.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
2022-08-02 12:34:04 -04:00
Matthew Wilcox (Oracle) 108ca83581 mm/migrate: Convert expected_page_refs() to folio_expected_refs()
Now that both callers have a folio, convert this function to
take a folio & rename it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle) 67235182a4 mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
Use a folio throughout __buffer_migrate_folio(), add kernel-doc for
buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their
declarations to buffer.h and switch all filesystems that have wired
them up.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle) 2be7fa10c0 mm/migrate: Convert writeout() to take a folio
Use a folio throughout this function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle) 8faa8ef5dd mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()
Use a folio throughout.  migrate_page() will be converted to
migrate_folio() later.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle) 5490da4f06 fs: Add aops->migrate_folio
Provide a folio-based replacement for aops->migratepage.  Update the
documentation to document migrate_folio instead of migratepage.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle) 68f2736a85 mm: Convert all PageMovable users to movable_operations
These drivers are rather uncomfortably hammered into the
address_space_operations hole.  They aren't filesystems and don't behave
like filesystems.  They just need their own movable_operations structure,
which we can point to directly from page->mapping.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-08-02 12:34:03 -04:00
Matthew Wilcox (Oracle) 81218f80a7 secretmem: Remove isolate_page
The isolate_page operation is never called for filesystems, only
for device drivers which call SetPageMovable.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
2022-08-02 12:34:03 -04:00
Linus Torvalds 0dd1cabe8a slab updates for 5.20/6.0
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmLnqTQACgkQ4CHKc/GJ
 qRBnBwgAohP0MXszRnhGEKKTmLBtsEyPrV0OBEIlz3MnDYBYfnDLd5JdSMMA+1jp
 sT80QWYKPMr10WKWX5vPjhIYRIfgWchEYND/93DnJYC6Fdap/D0hDd6tIQEKnxpN
 YeGZHck6orj9L2HfazJo7qpt//Th5mM8WRTN9OIiFdKPYOvlm7DT51wukVLnK9fA
 WoWrx3CsyIh6unvAC6AMOVFt7ZJOfD6muMQsGmkcpp1sJLeM1Ofoe8l+h5oSrFZQ
 CrdV4XXrprVi7JhqvSX4alRnF5vmOAVKVXhBLZ3A/3uTou2Bhic6n68chyb/x2RE
 FhwmsXS+v7jsOI0PV4gNzwT+sp+01w==
 =y2kQ
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-5.20_or_6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:

 - An addition of 'accounted' flag to slab allocation tracepoints to
   indicate memcg_kmem accounting, by Vasily

 - An optimization of memcg handling in freeing paths, by Muchun

 - Various smaller fixes and cleanups

* tag 'slab-for-5.20_or_6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab_common: move generic bulk alloc/free functions to SLOB
  mm/sl[au]b: use own bulk free function when bulk alloc failed
  mm: slab: optimize memcg_slab_free_hook()
  mm/tracing: add 'accounted' entry into output of allocation tracepoints
  tools/vm/slabinfo: Handle files in debugfs
  mm/slub: Simplify __kmem_cache_alias()
  mm, slab: fix bad alignments
2022-08-01 11:46:58 -07:00
Linus Torvalds 0cec3f24a7 arm64 updates for 5.20
- Remove unused generic cpuidle support (replaced by PSCI version)
 
 - Fix documentation describing the kernel virtual address space
 
 - Handling of some new CPU errata in Arm implementations
 
 - Rework of our exception table code in preparation for handling
   machine checks (i.e. RAS errors) more gracefully
 
 - Switch over to the generic implementation of ioremap()
 
 - Fix lockdep tracking in NMI context
 
 - Instrument our memory barrier macros for KCSAN
 
 - Rework of the kPTI G->nG page-table repainting so that the MMU remains
   enabled and the boot time is no longer slowed to a crawl for systems
   which require the late remapping
 
 - Enable support for direct swapping of 2MiB transparent huge-pages on
   systems without MTE
 
 - Fix handling of MTE tags with allocating new pages with HW KASAN
 
 - Expose the SMIDR register to userspace via sysfs
 
 - Continued rework of the stack unwinder, particularly improving the
   behaviour under KASAN
 
 - More repainting of our system register definitions to match the
   architectural terminology
 
 - Improvements to the layout of the vDSO objects
 
 - Support for allocating additional bits of HWCAP2 and exposing
   FEAT_EBF16 to userspace on CPUs that support it
 
 - Considerable rework and optimisation of our early boot code to reduce
   the need for cache maintenance and avoid jumping in and out of the
   kernel when handling relocation under KASLR
 
 - Support for disabling SVE and SME support on the kernel command-line
 
 - Support for the Hisilicon HNS3 PMU
 
 - Miscellanous cleanups, trivial updates and minor fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmLeccUQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNCysB/4ml92RJLhVwRAofbtFfVgVz3JLTSsvob9x
 Z7FhNDxfM/G32wKtOHU9tHkGJ+PMVWOPajukzxkMhxmilfTyHBbiisNWVRjKQxj4
 wrd07DNXPIv3bi8SWzS1y2y8ZqujZWjNJlX8SUCzEoxCVtuNKwrh96kU1jUjrkFZ
 kBo4E4wBWK/qW29nClGSCgIHRQNJaB/jvITlQhkqIb0pwNf3sAUzW7QoF1iTZWhs
 UswcLh/zC4q79k9poegdCt8chV5OBDLtLPnMxkyQFvsLYRp3qhyCSQQY/BxvO5JS
 jT9QR6d+1ewET9BFhqHlIIuOTYBCk3xn/PR9AucUl+ZBQd2tO4B1
 =LVH0
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "Highlights include a major rework of our kPTI page-table rewriting
  code (which makes it both more maintainable and considerably faster in
  the cases where it is required) as well as significant changes to our
  early boot code to reduce the need for data cache maintenance and
  greatly simplify the KASLR relocation dance.

  Summary:

   - Remove unused generic cpuidle support (replaced by PSCI version)

   - Fix documentation describing the kernel virtual address space

   - Handling of some new CPU errata in Arm implementations

   - Rework of our exception table code in preparation for handling
     machine checks (i.e. RAS errors) more gracefully

   - Switch over to the generic implementation of ioremap()

   - Fix lockdep tracking in NMI context

   - Instrument our memory barrier macros for KCSAN

   - Rework of the kPTI G->nG page-table repainting so that the MMU
     remains enabled and the boot time is no longer slowed to a crawl
     for systems which require the late remapping

   - Enable support for direct swapping of 2MiB transparent huge-pages
     on systems without MTE

   - Fix handling of MTE tags with allocating new pages with HW KASAN

   - Expose the SMIDR register to userspace via sysfs

   - Continued rework of the stack unwinder, particularly improving the
     behaviour under KASAN

   - More repainting of our system register definitions to match the
     architectural terminology

   - Improvements to the layout of the vDSO objects

   - Support for allocating additional bits of HWCAP2 and exposing
     FEAT_EBF16 to userspace on CPUs that support it

   - Considerable rework and optimisation of our early boot code to
     reduce the need for cache maintenance and avoid jumping in and out
     of the kernel when handling relocation under KASLR

   - Support for disabling SVE and SME support on the kernel
     command-line

   - Support for the Hisilicon HNS3 PMU

   - Miscellanous cleanups, trivial updates and minor fixes"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (136 commits)
  arm64: Delay initialisation of cpuinfo_arm64::reg_{zcr,smcr}
  arm64: fix KASAN_INLINE
  arm64/hwcap: Support FEAT_EBF16
  arm64/cpufeature: Store elf_hwcaps as a bitmap rather than unsigned long
  arm64/hwcap: Document allocation of upper bits of AT_HWCAP
  arm64: enable THP_SWAP for arm64
  arm64/mm: use GENMASK_ULL for TTBR_BADDR_MASK_52
  arm64: errata: Remove AES hwcap for COMPAT tasks
  arm64: numa: Don't check node against MAX_NUMNODES
  drivers/perf: arm_spe: Fix consistency of SYS_PMSCR_EL1.CX
  perf: RISC-V: Add of_node_put() when breaking out of for_each_of_cpu_node()
  docs: perf: Include hns3-pmu.rst in toctree to fix 'htmldocs' WARNING
  arm64: kasan: Revert "arm64: mte: reset the page tag in page->flags"
  mm: kasan: Skip page unpoisoning only if __GFP_SKIP_KASAN_UNPOISON
  mm: kasan: Skip unpoisoning of user pages
  mm: kasan: Ensure the tags are visible before the tag in page->flags
  drivers/perf: hisi: add driver for HNS3 PMU
  drivers/perf: hisi: Add description for HNS3 PMU driver
  drivers/perf: riscv_pmu_sbi: perf format
  perf/arm-cci: Use the bitmap API to allocate bitmaps
  ...
2022-08-01 10:37:00 -07:00
Linus Torvalds 620725263f Two hotfixes, both cc:stable.
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuSoqQAKCRDdBJ7gKXxA
 jhXLAP9LwWGqHPDTJEdJByCQY00DM5hmVT6qycAVjySkXTIGSwD/XpAS/kPuiaMW
 Q+MmXG4F0DQFYyBhalA1AfyytgFUXAs=
 =8KZO
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Two hotfixes, both cc:stable"

* tag 'mm-hotfixes-stable-2022-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/hmm: fault non-owner device private entries
  page_alloc: fix invalid watermark check on a negative value
2022-07-29 21:02:35 -07:00
Sophia Gabriella 1a44131d4f mm: Kconfig: fix typo
Fixes a typo in the help section for ZSWAP.

Link: https://lkml.kernel.org/r/Message-ID:
Signed-off-by: Sophia Gabriella <sophia.gabriellla@outlook.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:20 -07:00
Kefeng Wang 96f96763de mm: memory-failure: convert to pr_fmt()
Use pr_fmt to prefix all pr_<level> output, but unpoison_memory() and
soft_offline_page() are used by error injection, which have own prefixes
like "Unpoison:" and "soft offline:", meanwhile, soft_offline_page() could
be used by memory hotremove, so reset pr_fmt before unpoison_pr_info
definition to keep the original output for them.

[wangkefeng.wang@huawei.com: v3]
  Link: https://lkml.kernel.org/r/20220729031919.72331-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20220726081046.10742-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:20 -07:00
Kefeng Wang 07252dfea2 mm: use is_zone_movable_page() helper
Use is_zone_movable_page() helper to simplify code.

Link: https://lkml.kernel.org/r/20220726131135.146912-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:20 -07:00
Peter Xu 76aefad628 mm/mprotect: fix soft-dirty check in can_change_pte_writable()
Patch series "mm/mprotect: Fix soft-dirty checks", v4.


This patch (of 3):

The check wanted to make sure when soft-dirty tracking is enabled we won't
grant write bit by accident, as a page fault is needed for dirty tracking.
The intention is correct but we didn't check it right because
VM_SOFTDIRTY set actually means soft-dirty tracking disabled.  Fix it.

There's another thing tricky about soft-dirty is that, we can't check the
vma flag !(vma_flags & VM_SOFTDIRTY) directly but only check it after we
checked CONFIG_MEM_SOFT_DIRTY because otherwise VM_SOFTDIRTY will be
defined as zero, and !(vma_flags & VM_SOFTDIRTY) will constantly return
true.  To avoid misuse, introduce a helper for checking whether vma has
soft-dirty tracking enabled.

We can easily verify this with any exclusive anonymous page, like program
below:

=======8<======
  #include <stdio.h>
  #include <unistd.h>
  #include <stdlib.h>
  #include <assert.h>
  #include <inttypes.h>
  #include <stdint.h>
  #include <sys/types.h>
  #include <sys/mman.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <stdbool.h>

  #define BIT_ULL(nr)                   (1ULL << (nr))
  #define PM_SOFT_DIRTY                 BIT_ULL(55)

  unsigned int psize;
  char *page;

  uint64_t pagemap_read_vaddr(int fd, void *vaddr)
  {
      uint64_t value;
      int ret;

      ret = pread(fd, &value, sizeof(uint64_t),
                  ((uint64_t)vaddr >> 12) * sizeof(uint64_t));
      assert(ret == sizeof(uint64_t));

      return value;
  }

  void clear_refs_write(void)
  {
      int fd = open("/proc/self/clear_refs", O_RDWR);

      assert(fd >= 0);
      write(fd, "4", 2);
      close(fd);
  }

  #define  check_soft_dirty(str, expect)  do {                            \
          bool dirty = pagemap_read_vaddr(fd, page) & PM_SOFT_DIRTY;      \
          if (dirty != expect) {                                          \
              printf("ERROR: %s, soft-dirty=%d (expect: %d)
", str, dirty, expect); \
              exit(-1);                                                   \
          }                                                               \
  } while (0)

  int main(void)
  {
      int fd = open("/proc/self/pagemap", O_RDONLY);

      assert(fd >= 0);
      psize = getpagesize();
      page = mmap(NULL, psize, PROT_READ|PROT_WRITE,
                  MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      assert(page != MAP_FAILED);

      *page = 1;
      check_soft_dirty("Just faulted in page", 1);
      clear_refs_write();
      check_soft_dirty("Clear_refs written", 0);
      mprotect(page, psize, PROT_READ);
      check_soft_dirty("Marked RO", 0);
      mprotect(page, psize, PROT_READ|PROT_WRITE);
      check_soft_dirty("Marked RW", 0);
      *page = 2;
      check_soft_dirty("Wrote page again", 1);

      munmap(page, psize);
      close(fd);
      printf("Test passed.
");

      return 0;
  }
=======8<======

Here we attach a Fixes to commit 64fe24a3e0 only for easy tracking, as
this patch won't apply to a tree before that point.  However the commit
wasn't the source of problem, but instead 64e455079e.  It's just that
after 64fe24a3e0 anonymous memory will also suffer from this problem
with mprotect().

Link: https://lkml.kernel.org/r/20220725142048.30450-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20220725142048.30450-2-peterx@redhat.com
Fixes: 64e455079e ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
Fixes: 64fe24a3e0 ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:18 -07:00
Tetsuo Handa 68aaee147e mm: memcontrol: fix potential oom_lock recursion deadlock
syzbot is reporting GFP_KERNEL allocation with oom_lock held when
reporting memcg OOM [1].  If this allocation triggers the global OOM
situation then the system can livelock because the GFP_KERNEL
allocation with oom_lock held cannot trigger the global OOM killer
because __alloc_pages_may_oom() fails to hold oom_lock.

Fix this problem by removing the allocation from memory_stat_format()
completely, and pass static buffer when calling from memcg OOM path.

Note that the caller holding filesystem lock was the trigger for syzbot
to report this locking dependency.  Doing GFP_KERNEL allocation with
filesystem lock held can deadlock the system even without involving OOM
situation.

Link: https://syzkaller.appspot.com/bug?extid=2d2aeadc6ce1e1f11d45 [1]
Link: https://lkml.kernel.org/r/86afb39f-8c65-bec2-6cfc-c5e3cd600c0b@I-love.SAKURA.ne.jp
Fixes: c8713d0b23 ("mm: memcontrol: dump memory.stat during cgroup OOM")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+2d2aeadc6ce1e1f11d45@syzkaller.appspotmail.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:18 -07:00
Alistair Popple 65974cb910 mm/gup.c: fix formatting in check_and_migrate_movable_page()
Commit b05a79d437 ("mm/gup: migrate device coherent pages when pinning
instead of failing") added a badly formatted if statement. Fix it.

Link: https://lkml.kernel.org/r/20220721020552.1397598-2-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29 18:07:18 -07:00