Commit graph

3746 commits

Author SHA1 Message Date
Ard Biesheuvel
046545314c x86/boot: Move mem_encrypt= parsing to the decompressor
commit cd0d9d92c8 upstream.

The early SME/SEV code parses the command line very early, in order to
decide whether or not memory encryption should be enabled, which needs
to occur even before the initial page tables are created.

This is problematic for a number of reasons:
- this early code runs from the 1:1 mapping provided by the decompressor
  or firmware, which uses a different translation than the one assumed by
  the linker, and so the code needs to be built in a special way;
- parsing external input while the entire kernel image is still mapped
  writable is a bad idea in general, and really does not belong in
  security minded code;
- the current code ignores the built-in command line entirely (although
  this appears to be the case for the entire decompressor)

Given that the decompressor/EFI stub is an intrinsic part of the x86
bootable kernel image, move the command line parsing there and out of
the core kernel. This removes the need to build lib/cmdline.o in a
special way, or to use RIP-relative LEA instructions in inline asm
blocks.

This involves a new xloadflag in the setup header to indicate
that mem_encrypt=on appeared on the kernel command line.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-17-ardb+git@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-10 16:36:07 +02:00
Ard Biesheuvel
af90ced752 x86/sme: Move early SME kernel encryption handling into .head.text
commit 48204aba80 upstream.

The .head.text section is the initial primary entrypoint of the core
kernel, and is entered with the CPU executing from a 1:1 mapping of
memory. Such code must never access global variables using absolute
references, as these are based on the kernel virtual mapping which is
not active yet at this point.

Given that the SME startup code is also called from this early execution
context, move it into .head.text as well. This will allow more thorough
build time checks in the future to ensure that early startup code only
uses RIP-relative references to global variables.

Also replace some occurrences of __pa_symbol() [which relies on the
compiler generating an absolute reference, which is not guaranteed] and
an open coded RIP-relative access with RIP_REL_REF().

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240227151907.387873-18-ardb+git@google.com
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-10 16:36:07 +02:00
David Hildenbrand
51b7841f3f x86/mm/pat: fix VM_PAT handling in COW mappings
commit 04c35ab3bd upstream.

PAT handling won't do the right thing in COW mappings: the first PTE (or,
in fact, all PTEs) can be replaced during write faults to point at anon
folios.  Reliably recovering the correct PFN and cachemode using
follow_phys() from PTEs will not work in COW mappings.

Using follow_phys(), we might just get the address+protection of the anon
folio (which is very wrong), or fail on swap/nonswap entries, failing
follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and
track_pfn_copy(), not properly calling free_pfn_range().

In free_pfn_range(), we either wouldn't call memtype_free() or would call
it with the wrong range, possibly leaking memory.

To fix that, let's update follow_phys() to refuse returning anon folios,
and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings
if we run into that.

We will now properly handle untrack_pfn() with COW mappings, where we
don't need the cachemode.  We'll have to fail fork()->track_pfn_copy() if
the first page was replaced by an anon folio, though: we'd have to store
the cachemode in the VMA to make this work, likely growing the VMA size.

For now, lets keep it simple and let track_pfn_copy() just fail in that
case: it would have failed in the past with swap/nonswap entries already,
and it would have done the wrong thing with anon folios.

Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():

<--- C reproducer --->
 #include <stdio.h>
 #include <sys/mman.h>
 #include <unistd.h>
 #include <liburing.h>

 int main(void)
 {
         struct io_uring_params p = {};
         int ring_fd;
         size_t size;
         char *map;

         ring_fd = io_uring_setup(1, &p);
         if (ring_fd < 0) {
                 perror("io_uring_setup");
                 return 1;
         }
         size = p.sq_off.array + p.sq_entries * sizeof(unsigned);

         /* Map the submission queue ring MAP_PRIVATE */
         map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                    ring_fd, IORING_OFF_SQ_RING);
         if (map == MAP_FAILED) {
                 perror("mmap");
                 return 1;
         }

         /* We have at least one page. Let's COW it. */
         *map = 0;
         pause();
         return 0;
 }
<--- C reproducer --->

On a system with 16 GiB RAM and swap configured:
 # ./iouring &
 # memhog 16G
 # killall iouring
[  301.552930] ------------[ cut here ]------------
[  301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
[  301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
[  301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
[  301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
[  301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
[  301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
[  301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
[  301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
[  301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
[  301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
[  301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
[  301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
[  301.564186] FS:  0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
[  301.564773] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
[  301.565725] PKRU: 55555554
[  301.565944] Call Trace:
[  301.566148]  <TASK>
[  301.566325]  ? untrack_pfn+0xf4/0x100
[  301.566618]  ? __warn+0x81/0x130
[  301.566876]  ? untrack_pfn+0xf4/0x100
[  301.567163]  ? report_bug+0x171/0x1a0
[  301.567466]  ? handle_bug+0x3c/0x80
[  301.567743]  ? exc_invalid_op+0x17/0x70
[  301.568038]  ? asm_exc_invalid_op+0x1a/0x20
[  301.568363]  ? untrack_pfn+0xf4/0x100
[  301.568660]  ? untrack_pfn+0x65/0x100
[  301.568947]  unmap_single_vma+0xa6/0xe0
[  301.569247]  unmap_vmas+0xb5/0x190
[  301.569532]  exit_mmap+0xec/0x340
[  301.569801]  __mmput+0x3e/0x130
[  301.570051]  do_exit+0x305/0xaf0
...

Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
Fixes: b1a86e15dc ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines")
Fixes: 5899329b19 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3")
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-10 16:36:03 +02:00
Ingo Molnar
fbc0a833c0 Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."
[ Upstream commit c567f2948f ]

This reverts commit d794734c9b.

While the original change tries to fix a bug, it also unintentionally broke
existing systems, see the regressions reported at:

  https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph.com/

Since d794734c9b was also marked for -stable, let's back it out before
causing more damage.

Note that due to another upstream change the revert was not 100% automatic:

  0a845e0f63 mm/treewide: replace pud_large() with pud_leaf()

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Russ Anderson <rja@hpe.com>
Cc: Steve Wahl <steve.wahl@hpe.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph.com/
Fixes: d794734c9b ("x86/mm/ident_map: Use gbpages only where full GB page should be mapped.")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-10 16:35:46 +02:00
Peter Xu
907835e6de mm/treewide: replace pud_large() with pud_leaf()
[ Upstream commit 0a845e0f63 ]

pud_large() is always defined as pud_leaf().  Merge their usages.  Chose
pud_leaf() because pud_leaf() is a global API, while pud_large() is not.

Link: https://lkml.kernel.org/r/20240305043750.93762-9-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: c567f2948f ("Revert "x86/mm/ident_map: Use gbpages only where full GB page should be mapped."")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-10 16:35:46 +02:00
Kevin Loughlin
4338e40da8 x86/sev: Skip ROM range scans and validation for SEV-SNP guests
commit 0f4a1e8098 upstream.

SEV-SNP requires encrypted memory to be validated before access.
Because the ROM memory range is not part of the e820 table, it is not
pre-validated by the BIOS. Therefore, if a SEV-SNP guest kernel wishes
to access this range, the guest must first validate the range.

The current SEV-SNP code does indeed scan the ROM range during early
boot and thus attempts to validate the ROM range in probe_roms().
However, this behavior is neither sufficient nor necessary for the
following reasons:

* With regards to sufficiency, if EFI_CONFIG_TABLES are not enabled and
  CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK is set, the kernel will
  attempt to access the memory at SMBIOS_ENTRY_POINT_SCAN_START (which
  falls in the ROM range) prior to validation.

  For example, Project Oak Stage 0 provides a minimal guest firmware
  that currently meets these configuration conditions, meaning guests
  booting atop Oak Stage 0 firmware encounter a problematic call chain
  during dmi_setup() -> dmi_scan_machine() that results in a crash
  during boot if SEV-SNP is enabled.

* With regards to necessity, SEV-SNP guests generally read garbage
  (which changes across boots) from the ROM range, meaning these scans
  are unnecessary. The guest reads garbage because the legacy ROM range
  is unencrypted data but is accessed via an encrypted PMD during early
  boot (where the PMD is marked as encrypted due to potentially mapping
  actually-encrypted data in other PMD-contained ranges).

In one exceptional case, EISA probing treats the ROM range as
unencrypted data, which is inconsistent with other probing.

Continuing to allow SEV-SNP guests to use garbage and to inconsistently
classify ROM range encryption status can trigger undesirable behavior.
For instance, if garbage bytes appear to be a valid signature, memory
may be unnecessarily reserved for the ROM range. Future code or other
use cases may result in more problematic (arbitrary) behavior that
should be avoided.

While one solution would be to overhaul the early PMD mapping to always
treat the ROM region of the PMD as unencrypted, SEV-SNP guests do not
currently rely on data from the ROM region during early boot (and even
if they did, they would be mostly relying on garbage data anyways).

As a simpler solution, skip the ROM range scans (and the otherwise-
necessary range validation) during SEV-SNP guest early boot. The
potential SEV-SNP guest crash due to lack of ROM range validation is
thus avoided by simply not accessing the ROM range.

In most cases, skip the scans by overriding problematic x86_init
functions during sme_early_init() to SNP-safe variants, which can be
likened to x86_init overrides done for other platforms (ex: Xen); such
overrides also avoid the spread of cc_platform_has() checks throughout
the tree.

In the exceptional EISA case, still use cc_platform_has() for the
simplest change, given (1) checks for guest type (ex: Xen domain status)
are already performed here, and (2) these checks occur in a subsys
initcall instead of an x86_init function.

  [ bp: Massage commit message, remove "we"s. ]

Fixes: 9704c07bf9 ("x86/kernel: Validate ROM memory before accessing when SEV-SNP is active")
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20240313121546.2964854-1-kevinloughlin@google.com
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-03 15:29:03 +02:00
Ard Biesheuvel
0982fd6bf0 x86/sev: Fix position dependent variable references in startup code
commit 1c811d403a upstream.

The early startup code executes from a 1:1 mapping of memory, which
differs from the mapping that the code was linked and/or relocated to
run at. The latter mapping is not active yet at this point, and so
symbol references that rely on it will fault.

Given that the core kernel is built without -fPIC, symbol references are
typically emitted as absolute, and so any such references occuring in
the early startup code will therefore crash the kernel.

While an attempt was made to work around this for the early SEV/SME
startup code, by forcing RIP-relative addressing for certain global
SEV/SME variables via inline assembly (see snp_cpuid_get_table() for
example), RIP-relative addressing must be pervasively enforced for
SEV/SME global variables when accessed prior to page table fixups.

__startup_64() already handles this issue for select non-SEV/SME global
variables using fixup_pointer(), which adjusts the pointer relative to a
`physaddr` argument. To avoid having to pass around this `physaddr`
argument across all functions needing to apply pointer fixups, introduce
a macro RIP_RELATIVE_REF() which generates a RIP-relative reference to
a given global variable. It is used where necessary to force
RIP-relative accesses to global variables.

For backporting purposes, this patch makes no attempt at cleaning up
other occurrences of this pattern, involving either inline asm or
fixup_pointer(). Those will be addressed later.

  [ bp: Call it "rip_rel_ref" everywhere like other code shortens
    "rIP-relative reference" and make the asm wrapper __always_inline. ]

Co-developed-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/all/20240130220845.1978329-1-kevinloughlin@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-03 15:28:50 +02:00
Borislav Petkov (AMD)
ecd16da39d x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
commit 2995674833 upstream.

It was meant well at the time but nothing's using it so get rid of it.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20240202163510.GDZb0Zvj8qOndvFOiZ@fat_crate.local
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-03 15:28:50 +02:00
Ard Biesheuvel
835ae8a07a x86/sme: Fix memory encryption setting if enabled by default and not overridden
[ Upstream commit e814b59e6c ]

Commit

  cbebd68f59 ("x86/mm: Fix use of uninitialized buffer in sme_enable()")

'fixed' an issue in sme_enable() detected by static analysis, and broke
the common case in the process.

cmdline_find_option() will return < 0 on an error, or when the command
line argument does not appear at all. In this particular case, the
latter is not an error condition, and so the early exit is wrong.

Instead, without mem_encrypt= on the command line, the compile time
default should be honoured, which could be to enable memory encryption,
and this is currently broken.

Fix it by setting sme_me_mask to a preliminary value based on the
compile time default, and only omitting the command line argument test
when cmdline_find_option() returns an error.

  [ bp: Drop active_by_default while at it. ]

Fixes: cbebd68f59 ("x86/mm: Fix use of uninitialized buffer in sme_enable()")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240126163918.2908990-2-ardb+git@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26 18:19:19 -04:00
Hou Tao
57f78c46f0 x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
[ Upstream commit 32019c659e ]

When trying to use copy_from_kernel_nofault() to read vsyscall page
through a bpf program, the following oops was reported:

  BUG: unable to handle page fault for address: ffffffffff600000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
  ......
  Call Trace:
   <TASK>
   ? copy_from_kernel_nofault+0x6f/0x110
   bpf_probe_read_kernel+0x1d/0x50
   bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
   trace_call_bpf+0xc5/0x1c0
   perf_call_bpf_enter.isra.0+0x69/0xb0
   perf_syscall_enter+0x13e/0x200
   syscall_trace_enter+0x188/0x1c0
   do_syscall_64+0xb5/0xe0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   </TASK>
  ......
  ---[ end trace 0000000000000000 ]---

The oops is triggered when:

1) A bpf program uses bpf_probe_read_kernel() to read from the vsyscall
page and invokes copy_from_kernel_nofault() which in turn calls
__get_user_asm().

2) Because the vsyscall page address is not readable from kernel space,
a page fault exception is triggered accordingly.

3) handle_page_fault() considers the vsyscall page address as a user
space address instead of a kernel space address. This results in the
fix-up setup by bpf not being applied and a page_fault_oops() is invoked
due to SMAP.

Considering handle_page_fault() has already considered the vsyscall page
address as a userspace address, fix the problem by disallowing vsyscall
page read for copy_from_kernel_nofault().

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: syzbot+72aa0161922eba61b50e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/CAG48ez06TZft=ATH1qh2c5mpS5BT8UakwNkzi6nvK5_djC-4Nw@mail.gmail.com
Reported-by: xingwei lee <xrivendell7@gmail.com>
Closes: https://lore.kernel.org/bpf/CABOYnLynjBoFZOf3Z4BhaZkc5hx_kHfsjiW+UWLoB=w33LvScw@mail.gmail.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240202103935.3154011-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26 18:19:12 -04:00
Hou Tao
aa18a46ef8 x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h
[ Upstream commit ee0e39a63b ]

Move is_vsyscall_vaddr() into asm/vsyscall.h to make it available for
copy_from_kernel_nofault_allowed() in arch/x86/mm/maccess.c.

Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20240202103935.3154011-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-26 18:19:11 -04:00
Alison Schofield
3a0060d2ba x86/numa: Fix the sort compare func used in numa_fill_memblks()
[ Upstream commit b626070ffc ]

The compare function used to sort memblks into starting address
order fails when the result of its u64 address subtraction gets
truncated to an int upon return.

The impact of the bad sort is that memblks will be filled out
incorrectly. Depending on the set of memblks, a user may see no
errors at all but still have a bad fill, or see messages reporting
a node overlap that leads to numa init failure:

[] node 0 [mem: ] overlaps with node 1 [mem: ]
[] No NUMA configuration found

Replace with a comparison that can only result in: 1, 0, -1.

Fixes: 8f012db27c ("x86/numa: Introduce numa_fill_memblks()")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/99dcb3ae87e04995e9f293f6158dc8fa0749a487.1705085543.git.alison.schofield@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01 13:35:06 +01:00
Alison Schofield
b5bf39cd08 x86/numa: Fix the address overlap check in numa_fill_memblks()
[ Upstream commit 9b99c17f75 ]

numa_fill_memblks() fills in the gaps in numa_meminfo memblks over a
physical address range. To do so, it first creates a list of existing
memblks that overlap that address range. The issue is that it is off
by one when comparing to the end of the address range, so memblks
that do not overlap are selected.

The impact of selecting a memblk that does not actually overlap is
that an existing memblk may be filled when the expected action is to
do nothing and return NUMA_NO_MEMBLK to the caller. The caller can
then add a new NUMA node and memblk.

Replace the broken open-coded search for address overlap with the
memblock helper memblock_addrs_overlap(). Update the kernel doc
and in code comments.

Suggested by: "Huang, Ying" <ying.huang@intel.com>

Fixes: 8f012db27c ("x86/numa: Introduce numa_fill_memblks()")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/10a3e6109c34c21a8dd4c513cf63df63481a2b07.1705085543.git.alison.schofield@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-03-01 13:35:06 +01:00
Steve Wahl
1dc09ca858 x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
commit d794734c9b upstream.

When ident_pud_init() uses only gbpages to create identity maps, large
ranges of addresses not actually requested can be included in the
resulting table; a 4K request will map a full GB.  On UV systems, this
ends up including regions that will cause hardware to halt the system
if accessed (these are marked "reserved" by BIOS).  Even processor
speculation into these regions is enough to trigger the system halt.

Only use gbpages when map creation requests include the full GB page
of space.  Fall back to using smaller 2M pages when only portions of a
GB page are included in the request.

No attempt is made to coalesce mapping requests. If a request requires
a map entry at the 2M (pmd) level, subsequent mapping requests within
the same 1G region will also be at the pmd level, even if adjacent or
overlapping such requests could have been combined to map a full
gbpage.  Existing usage starts with larger regions and then adds
smaller regions, so this should not have any great consequence.

[ dhansen: fix up comment formatting, simplifty changelog ]

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-23 09:25:10 +01:00
Kirill A. Shutemov
34c686e5be x86/coco: Disable 32-bit emulation by default on TDX and SEV
[ upstream commit b82a8dbd3d ]

The INT 0x80 instruction is used for 32-bit x86 Linux syscalls. The
kernel expects to receive a software interrupt as a result of the INT
0x80 instruction. However, an external interrupt on the same vector
triggers the same handler.

The kernel interprets an external interrupt on vector 0x80 as a 32-bit
system call that came from userspace.

A VMM can inject external interrupts on any arbitrary vector at any
time.  This remains true even for TDX and SEV guests where the VMM is
untrusted.

Put together, this allows an untrusted VMM to trigger int80 syscall
handling at any given point. The content of the guest register file at
that moment defines what syscall is triggered and its arguments. It
opens the guest OS to manipulation from the VMM side.

Disable 32-bit emulation by default for TDX and SEV. User can override
it with the ia32_emulation=y command line option.

[ dhansen: reword the changelog ]

Reported-by: Supraja Sridhara <supraja.sridhara@inf.ethz.ch>
Reported-by: Benedict Schlüter <benedict.schlueter@inf.ethz.ch>
Reported-by: Mark Kuhne <mark.kuhne@inf.ethz.ch>
Reported-by: Andrin Bertschi <andrin.bertschi@inf.ethz.ch>
Reported-by: Shweta Shinde <shweta.shinde@inf.ethz.ch>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@vger.kernel.org> # 6.0+: 1da5c9b x86: Introduce ia32_enabled()
Cc: <stable@vger.kernel.org> # 6.0+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-13 18:45:02 +01:00
Mike Rapoport (IBM)
0f32b0b2bd x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size
[ Upstream commit a1e2b8b368 ]

Qi Zheng reported crashes in a production environment and provided a
simplified example as a reproducer:

 |  For example, if we use Qemu to start a two NUMA node kernel,
 |  one of the nodes has 2M memory (less than NODE_MIN_SIZE),
 |  and the other node has 2G, then we will encounter the
 |  following panic:
 |
 |    BUG: kernel NULL pointer dereference, address: 0000000000000000
 |    <...>
 |    RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
 |    <...>
 |    Call Trace:
 |      <TASK>
 |      deactivate_slab()
 |      bootstrap()
 |      kmem_cache_init()
 |      start_kernel()
 |      secondary_startup_64_no_verify()

The crashes happen because of inconsistency between the nodemask that
has nodes with less than 4MB as memoryless, and the actual memory fed
into the core mm.

The commit:

  9391a3f9c7 ("[PATCH] x86_64: Clear more state when ignoring empty node in SRAT parsing")

... that introduced minimal size of a NUMA node does not explain why
a node size cannot be less than 4MB and what boot failures this
restriction might fix.

Fixes have been submitted to the core MM code to tighten up the
memory topologies it accepts and to not crash on weird input:

  mm: page_alloc: skip memoryless nodes entirely
  mm: memory_hotplug: drop memoryless node from fallback lists

Andrew has accepted them into the -mm tree, but there are no
stable SHA1's yet.

This patch drops the limitation for minimal node size on x86:

  - which works around the crash without the fixes to the core MM.
  - makes x86 topologies less weird,
  - removes an arbitrary and undocumented limitation on NUMA topologies.

[ mingo: Improved changelog clarity. ]

Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/ZS+2qqjEO5/867br@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28 17:19:36 +00:00
Adam Dunlap
95846cd88c x86/sev-es: Allow copy_from_kernel_nofault() in earlier boot
[ Upstream commit f79936545f ]

Previously, if copy_from_kernel_nofault() was called before
boot_cpu_data.x86_virt_bits was set up, then it would trigger undefined
behavior due to a shift by 64.

This ended up causing boot failures in the latest version of ubuntu2204
in the gcp project when using SEV-SNP.

Specifically, this function is called during an early #VC handler which
is triggered by a CPUID to check if NX is implemented.

Fixes: 1aa9aa8ee5 ("x86/sev-es: Setup GHCB-based boot #VC handler")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Adam Dunlap <acdunlap@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jacob Xu <jacobhxu@google.com>
Link: https://lore.kernel.org/r/20230912002703.3924521-2-acdunlap@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20 11:58:53 +01:00
Alison Schofield
eb38c5edf2 x86/numa: Introduce numa_fill_memblks()
[ Upstream commit 8f012db27c ]

numa_fill_memblks() fills in the gaps in numa_meminfo memblks
over an physical address range.

The ACPI driver will use numa_fill_memblks() to implement a new Linux
policy that prescribes extending proximity domains in a portion of a
CFMWS window to the entire window.

Dan Williams offered this explanation of the policy:
A CFWMS is an ACPI data structure that indicates *potential* locations
where CXL memory can be placed. It is the playground where the CXL
driver has free reign to establish regions. That space can be populated
by BIOS created regions, or driver created regions, after hotplug or
other reconfiguration.

When BIOS creates a region in a CXL Window it additionally describes
that subset of the Window range in the other typical ACPI tables SRAT,
SLIT, and HMAT. The rationale for BIOS not pre-describing the entire
CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
future. I.e. there is nothing stopping higher or lower performance
devices being placed in the same Window. Compare that to ACPI memory
hotplug that just onlines additional capacity in the proximity domain
with little freedom for dynamic performance differentiation.

That leaves the OS with a choice, should unpopulated window capacity
match the proximity domain of an existing region, or should it allocate
a new one? This patch takes the simple position of minimizing proximity
domain proliferation by reusing any proximity domain intersection for
the entire Window. If the Window has no intersections then allocate a
new proximity domain. Note that SRAT, SLIT and HMAT information can be
enumerated dynamically in a standard way from device provided data.
Think of CXL as the end of ACPI needing to describe memory attributes,
CXL offers a standard discovery model for performance attributes, but
Linux still needs to interoperate with the old regime.

Reported-by: Derick Marks <derick.w.marks@intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Derick Marks <derick.w.marks@intel.com>
Link: https://lore.kernel.org/all/ef078a6f056ca974e5af85997013c0fda9e3326d.1689018477.git.alison.schofield%40intel.com
Stable-dep-of: 8f10046799 ("ACPI/NUMA: Apply SRAT proximity domain to entire CFMWS window")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20 11:58:53 +01:00
Linus Torvalds
df57721f9a Add x86 shadow stack support
Convert IBT selftest to asm to fix objtool warning
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTv1QQACgkQaDWVMHDJ
 krAUwhAAn6TOwHJK8BSkHeiQhON1nrlP3c5cv0AyZ2NP8RYDrZrSZvhpYBJ6wgKC
 Cx5CGq5nn9twYsYS3KsktLKDfR3lRdsQ7K9qtyFtYiaeaVKo+7gEKl/K+klwai8/
 gninQWHk0zmSCja8Vi77q52WOMkQKapT8+vaON9EVDO8dVEi+CvhAIfPwMafuiwO
 Rk4X86SzoZu9FP79LcCg9XyGC/XbM2OG9eNUTSCKT40qTTKm5y4gix687NvAlaHR
 ko5MTsdl0Wfp6Qk0ohT74LnoA2c1g/FluvZIM33ci/2rFpkf9Hw7ip3lUXqn6CPx
 rKiZ+pVRc0xikVWkraMfIGMJfUd2rhelp8OyoozD7DB7UZw40Q4RW4N5tgq9Fhe9
 MQs3p1v9N8xHdRKl365UcOczUxNAmv4u0nV5gY/4FMC6VjldCl2V9fmqYXyzFS4/
 Ogg4FSd7c2JyGFKPs+5uXyi+RY2qOX4+nzHOoKD7SY616IYqtgKoz5usxETLwZ6s
 VtJOmJL0h//z0A7tBliB0zd+SQ5UQQBDC2XouQH2fNX2isJMn0UDmWJGjaHgK6Hh
 8jVp6LNqf+CEQS387UxckOyj7fu438hDky1Ggaw4YqowEOhQeqLVO4++x+HITrbp
 AupXfbJw9h9cMN63Yc0gVxXQ9IMZ+M7UxLtZ3Cd8/PVztNy/clA=
 =3UUm
 -----END PGP SIGNATURE-----

Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 shadow stack support from Dave Hansen:
 "This is the long awaited x86 shadow stack support, part of Intel's
  Control-flow Enforcement Technology (CET).

  CET consists of two related security features: shadow stacks and
  indirect branch tracking. This series implements just the shadow stack
  part of this feature, and just for userspace.

  The main use case for shadow stack is providing protection against
  return oriented programming attacks. It works by maintaining a
  secondary (shadow) stack using a special memory type that has
  protections against modification. When executing a CALL instruction,
  the processor pushes the return address to both the normal stack and
  to the special permission shadow stack. Upon RET, the processor pops
  the shadow stack copy and compares it to the normal stack copy.

  For more information, refer to the links below for the earlier
  versions of this patch set"

Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/

* tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
  x86/shstk: Change order of __user in type
  x86/ibt: Convert IBT selftest to asm
  x86/shstk: Don't retry vm_munmap() on -EINTR
  x86/kbuild: Fix Documentation/ reference
  x86/shstk: Move arch detail comment out of core mm
  x86/shstk: Add ARCH_SHSTK_STATUS
  x86/shstk: Add ARCH_SHSTK_UNLOCK
  x86: Add PTRACE interface for shadow stack
  selftests/x86: Add shadow stack test
  x86/cpufeatures: Enable CET CR4 bit for shadow stack
  x86/shstk: Wire in shadow stack interface
  x86: Expose thread features in /proc/$PID/status
  x86/shstk: Support WRSS for userspace
  x86/shstk: Introduce map_shadow_stack syscall
  x86/shstk: Check that signal frame is shadow stack mem
  x86/shstk: Check that SSP is aligned on sigreturn
  x86/shstk: Handle signals for shadow stack
  x86/shstk: Introduce routines modifying shstk
  x86/shstk: Handle thread shadow stack
  x86/shstk: Add user-mode shadow stack support
  ...
2023-08-31 12:20:12 -07:00
Linus Torvalds
1687d8aca5 * Rework apic callbacks, getting rid of unnecessary ones and
coalescing lots of silly duplicates.
  * Use static_calls() instead of indirect calls for apic->foo()
  * Tons of cleanups an crap removal along the way
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTvfO8ACgkQaDWVMHDJ
 krAP2A//ccii/LuvtTnNEIMMR5w2rwTdHv91ancgFkC8pOeNk37Z8sSLq8tKuLFA
 vgjBIysVIqunuRcNCJ+eqwIIxYfU+UGCWHppzLwO+DY3Q7o9EoTL0BgytdAqxpQQ
 ntEVarqWq25QYXKFoAqbUTJ1UXa42/8HfiXAX/jvP+ACXfilkGPZre6ASxlXeOhm
 XbgPuNQPmXi2WYQH9GCQEsz2Nh80hKap8upK2WbQzzJ3lXsm+xA//4klab0HCYwl
 Uc302uVZozyXRMKbAlwmgasTFOLiV8KKriJ0oHoktBpWgkpdR9uv/RDeSaFR3DAl
 aFmecD4k/Hqezg4yVl+4YpEn2KjxiwARCm4PMW5AV7lpWBPBHAOOai65yJlAi9U6
 bP8pM0+aIx9xg7oWfsTnQ7RkIJ+GZ0w+KZ9LXFM59iu3eV1pAJE3UVyUehe/J1q9
 n8OcH0UeHRlAb8HckqVm1AC7IPvfHw4OAPtUq7z3NFDwbq6i651Tu7f+i2bj31cX
 77Ames+fx6WjxUjyFbJwaK44E7Qez3waztdBfn91qw+m0b+gnKE3ieDNpJTqmm5b
 mKulV7KJwwS6cdqY3+Kr+pIlN+uuGAv7wGzVLcaEAXucDsVn/YAMJHY2+v97xv+n
 J9N+yeaYtmSXVlDsJ6dndMrTQMmcasK1CVXKxs+VYq5Lgf+A68w=
 =eoKm
 -----END PGP SIGNATURE-----

Merge tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 apic updates from Dave Hansen:
 "This includes a very thorough rework of the 'struct apic' handlers.
  Quite a variety of them popped up over the years, especially in the
  32-bit days when odd apics were much more in vogue.

  The end result speaks for itself, which is a removal of a ton of code
  and static calls to replace indirect calls.

  If there's any breakage here, it's likely to be around the 32-bit
  museum pieces that get light to no testing these days.

  Summary:

   - Rework apic callbacks, getting rid of unnecessary ones and
     coalescing lots of silly duplicates.

   - Use static_calls() instead of indirect calls for apic->foo()

   - Tons of cleanups an crap removal along the way"

* tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  x86/apic: Turn on static calls
  x86/apic: Provide static call infrastructure for APIC callbacks
  x86/apic: Wrap IPI calls into helper functions
  x86/apic: Mark all hotpath APIC callback wrappers __always_inline
  x86/xen/apic: Mark apic __ro_after_init
  x86/apic: Convert other overrides to apic_update_callback()
  x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb()
  x86/apic: Provide apic_update_callback()
  x86/xen/apic: Use standard apic driver mechanism for Xen PV
  x86/apic: Provide common init infrastructure
  x86/apic: Wrap apic->native_eoi() into a helper
  x86/apic: Nuke ack_APIC_irq()
  x86/apic: Remove pointless arguments from [native_]eoi_write()
  x86/apic/noop: Tidy up the code
  x86/apic: Remove pointless NULL initializations
  x86/apic: Sanitize APIC ID range validation
  x86/apic: Prepare x2APIC for using apic::max_apic_id
  x86/apic: Simplify X2APIC ID validation
  x86/apic: Add max_apic_id member
  x86/apic: Wrap APIC ID validation into an inline
  ...
2023-08-30 10:44:46 -07:00
Linus Torvalds
9855922705 - Remove unnecessary "INVPCID single" feature tracking
- Include PAT in page protection modify mask
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTuUrwACgkQaDWVMHDJ
 krD7jRAAt37pNfAJLd+pJtBAtsZYlmPq1aYHSuLPQFaFebYgN8j4ekMwRNRBbgQF
 6dWQXYSRMFnmJzbxBcHTkzYzR1Noh1o8U0SKUp3CNFfA3gGAq1mpoOKc7l1KjVGN
 x6x0+5aroT0DCtxqid0iBY38IJb/qmJ63NLGT0oJm8NZ9CTwd1UaN6KXWz1mawgk
 BvIY1zgMLibB2aYi2Eib2JlhQ6DWHSJMAZpMMEdPay/lr6ONlQZ3Sckjvz5huskQ
 ikGIvzF3L6BFDsxIYjE4uoFoImDcs4Q3gIGoqsn/Ig79mCnttgoRQ7HVFmUrVKq1
 nxa+o+uqWNJjRSwbHUKX1ReyiFF5Re+7csODEnIzHr761YXWTcm94sR8jb4bCMqV
 QiWkzt5wcdzpAZC72gcRLqL2K3uMwm2rpxhw7az/LgDzNcdkWqFbFurGGN/3Ro6e
 RM9FvTIi+a40cSjc+zCNDSSwb90Oe8ZINFb9g0ta++5mFQXG7bsydwnWVq5pRY0V
 5qNtWNvusW01c5GmOf0iJY7M84jegf4dzPNZcQd6XblWf5XyR+YnjCLU8g1Y3s8y
 H3BC8xHvgIb2Ln/XX4V6er7Ey+SS3XGumeqRn6gi3foa4DNODzbsVuIVpZAZoqyn
 hY4eGmwVS+OS7B+wOE44Z3hqMq4K0eXo+PsXCov4HFAbuMCtsrA=
 =TKMq
 -----END PGP SIGNATURE-----

Merge tag 'x86_mm_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Dave Hansen:
 "A pair of small x86/mm updates. The INVPCID one is purely a cleanup.
  The PAT one fixes a real issue, albeit a relatively obscure one
  (graphics device passthrough under Xen). The fix also makes the code
  much more readable.

  Summary:

   - Remove unnecessary "INVPCID single" feature tracking

   - Include PAT in page protection modify mask"

* tag 'x86_mm_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Remove "INVPCID single" feature tracking
  x86/mm: Fix PAT bit missing from page protection modify mask
2023-08-30 09:54:00 -07:00
Linus Torvalds
b96a3e9142 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list")
- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
   reduces the special-case code for handling hugetlb pages in GUP.  It
   also speeds up GUP handling of transparent hugepages.
 
 - Peng Zhang provides some maple tree speedups ("Optimize the fast path
   of mas_store()").
 
 - Sergey Senozhatsky has improved te performance of zsmalloc during
   compaction (zsmalloc: small compaction improvements").
 
 - Domenico Cerasuolo has developed additional selftest code for zswap
   ("selftests: cgroup: add zswap test program").
 
 - xu xin has doe some work on KSM's handling of zero pages.  These
   changes are mainly to enable the user to better understand the
   effectiveness of KSM's treatment of zero pages ("ksm: support tracking
   KSM-placed zero-pages").
 
 - Jeff Xu has fixes the behaviour of memfd's
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").
 
 - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
   Stop read optimisation when folio removed from pagecache").
 
 - Axel Rasmussen has given userfaultfd the ability to simulate memory
   poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").
 
 - Miaohe Lin has contributed some routine maintenance work on the
   memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
   check").
 
 - Peng Zhang has contributed some maintenance work on the maple tree
   code ("Improve the validation for maple tree and some cleanup").
 
 - Hugh Dickins has optimized the collapsing of shmem or file pages into
   THPs ("mm: free retracted page table by RCU").
 
 - Jiaqi Yan has a patch series which permits us to use the healthy
   subpages within a hardware poisoned huge page for general purposes
   ("Improve hugetlbfs read on HWPOISON hugepages").
 
 - Kemeng Shi has done some maintenance work on the pagetable-check code
   ("Remove unused parameters in page_table_check").
 
 - More folioification work from Matthew Wilcox ("More filesystem folio
   conversions for 6.6"), ("Followup folio conversions for zswap").  And
   from ZhangPeng ("Convert several functions in page_io.c to use a
   folio").
 
 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").
 
 - Baoquan He has converted some architectures to use the GENERIC_IOREMAP
   ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take
   GENERIC_IOREMAP way").
 
 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
   batched/deferred tlb shootdown during page reclamation/migration").
 
 - Better maple tree lockdep checking from Liam Howlett ("More strict
   maple tree lockdep").  Liam also developed some efficiency improvements
   ("Reduce preallocations for maple tree").
 
 - Cleanup and optimization to the secondary IOMMU TLB invalidation, from
   Alistair Popple ("Invalidate secondary IOMMU TLB on permission
   upgrade").
 
 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
   for arm64").
 
 - Kemeng Shi provides some maintenance work on the compaction code ("Two
   minor cleanups for compaction").
 
 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most
   file-backed faults under the VMA lock").
 
 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
   on ppc64, under some circumstances ("Add support for DAX vmemmap
   optimization for ppc64").
 
 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
   data in page_ext"), ("minor cleanups to page_ext header").
 
 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
   cleanups").
 
 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").
 
 - VMA handling cleanups from Kefeng Wang ("mm: convert to
   vma_is_initial_heap/stack()").
 
 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
   implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
   address ranges and DAMON monitoring targets").
 
 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").
 
 - Liam Howlett has improved the maple tree node replacement code
   ("maple_tree: Change replacement strategy").
 
 - ZhangPeng has a general code cleanup - use the K() macro more widely
   ("cleanup with helper macro K()").
 
 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap
   on memory feature on ppc64").
 
 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
   in page_alloc"), ("Two minor cleanups for get pageblock migratetype").
 
 - Vishal Moola introduces a memory descriptor for page table tracking,
   "struct ptdesc" ("Split ptdesc from struct page").
 
 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
   for vm.memfd_noexec").
 
 - MM include file rationalization from Hugh Dickins ("arch: include
   asm/cacheflush.h in asm/hugetlb.h").
 
 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
   output").
 
 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
   object_cache instead of kmemleak_initialized").
 
 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
   and _folio_order").
 
 - A VMA locking scalability improvement from Suren Baghdasaryan
   ("Per-VMA lock support for swap and userfaults").
 
 - pagetable handling cleanups from Matthew Wilcox ("New page table range
   API").
 
 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
   using page->private on tail pages for THP_SWAP + cleanups").
 
 - Cleanups and speedups to the hugetlb fault handling from Matthew
   Wilcox ("Change calling convention for ->huge_fault").
 
 - Matthew Wilcox has also done some maintenance work on the MM subsystem
   documentation ("Improve mm documentation").
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA
 jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6
 Nfio7afS1ncD+hPYT8947UnLxTgn+ww=
 =Afws
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in
   add_to_avail_list")

 - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
   reduces the special-case code for handling hugetlb pages in GUP. It
   also speeds up GUP handling of transparent hugepages.

 - Peng Zhang provides some maple tree speedups ("Optimize the fast path
   of mas_store()").

 - Sergey Senozhatsky has improved te performance of zsmalloc during
   compaction (zsmalloc: small compaction improvements").

 - Domenico Cerasuolo has developed additional selftest code for zswap
   ("selftests: cgroup: add zswap test program").

 - xu xin has doe some work on KSM's handling of zero pages. These
   changes are mainly to enable the user to better understand the
   effectiveness of KSM's treatment of zero pages ("ksm: support
   tracking KSM-placed zero-pages").

 - Jeff Xu has fixes the behaviour of memfd's
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
   MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

 - David Howells has fixed an fscache optimization ("mm, netfs, fscache:
   Stop read optimisation when folio removed from pagecache").

 - Axel Rasmussen has given userfaultfd the ability to simulate memory
   poisoning ("add UFFDIO_POISON to simulate memory poisoning with
   UFFD").

 - Miaohe Lin has contributed some routine maintenance work on the
   memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
   check").

 - Peng Zhang has contributed some maintenance work on the maple tree
   code ("Improve the validation for maple tree and some cleanup").

 - Hugh Dickins has optimized the collapsing of shmem or file pages into
   THPs ("mm: free retracted page table by RCU").

 - Jiaqi Yan has a patch series which permits us to use the healthy
   subpages within a hardware poisoned huge page for general purposes
   ("Improve hugetlbfs read on HWPOISON hugepages").

 - Kemeng Shi has done some maintenance work on the pagetable-check code
   ("Remove unused parameters in page_table_check").

 - More folioification work from Matthew Wilcox ("More filesystem folio
   conversions for 6.6"), ("Followup folio conversions for zswap"). And
   from ZhangPeng ("Convert several functions in page_io.c to use a
   folio").

 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

 - Baoquan He has converted some architectures to use the
   GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
   architectures to take GENERIC_IOREMAP way").

 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
   batched/deferred tlb shootdown during page reclamation/migration").

 - Better maple tree lockdep checking from Liam Howlett ("More strict
   maple tree lockdep"). Liam also developed some efficiency
   improvements ("Reduce preallocations for maple tree").

 - Cleanup and optimization to the secondary IOMMU TLB invalidation,
   from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
   upgrade").

 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
   for arm64").

 - Kemeng Shi provides some maintenance work on the compaction code
   ("Two minor cleanups for compaction").

 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
   most file-backed faults under the VMA lock").

 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX
   on ppc64, under some circumstances ("Add support for DAX vmemmap
   optimization for ppc64").

 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
   data in page_ext"), ("minor cleanups to page_ext header").

 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three
   cleanups").

 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

 - VMA handling cleanups from Kefeng Wang ("mm: convert to
   vma_is_initial_heap/stack()").

 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
   implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
   address ranges and DAMON monitoring targets").

 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

 - Liam Howlett has improved the maple tree node replacement code
   ("maple_tree: Change replacement strategy").

 - ZhangPeng has a general code cleanup - use the K() macro more widely
   ("cleanup with helper macro K()").

 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
   memmap on memory feature on ppc64").

 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
   in page_alloc"), ("Two minor cleanups for get pageblock
   migratetype").

 - Vishal Moola introduces a memory descriptor for page table tracking,
   "struct ptdesc" ("Split ptdesc from struct page").

 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
   for vm.memfd_noexec").

 - MM include file rationalization from Hugh Dickins ("arch: include
   asm/cacheflush.h in asm/hugetlb.h").

 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
   output").

 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
   object_cache instead of kmemleak_initialized").

 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
   and _folio_order").

 - A VMA locking scalability improvement from Suren Baghdasaryan
   ("Per-VMA lock support for swap and userfaults").

 - pagetable handling cleanups from Matthew Wilcox ("New page table
   range API").

 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
   using page->private on tail pages for THP_SWAP + cleanups").

 - Cleanups and speedups to the hugetlb fault handling from Matthew
   Wilcox ("Change calling convention for ->huge_fault").

 - Matthew Wilcox has also done some maintenance work on the MM
   subsystem documentation ("Improve mm documentation").

* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
  maple_tree: shrink struct maple_tree
  maple_tree: clean up mas_wr_append()
  secretmem: convert page_is_secretmem() to folio_is_secretmem()
  nios2: fix flush_dcache_page() for usage from irq context
  hugetlb: add documentation for vma_kernel_pagesize()
  mm: add orphaned kernel-doc to the rst files.
  mm: fix clean_record_shared_mapping_range kernel-doc
  mm: fix get_mctgt_type() kernel-doc
  mm: fix kernel-doc warning from tlb_flush_rmaps()
  mm: remove enum page_entry_size
  mm: allow ->huge_fault() to be called without the mmap_lock held
  mm: move PMD_ORDER to pgtable.h
  mm: remove checks for pte_index
  memcg: remove duplication detection for mem_cgroup_uncharge_swap
  mm/huge_memory: work on folio->swap instead of page->private when splitting folio
  mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
  mm/swap: use dedicated entry for swap in folio
  mm/swap: stop using page->private on tail pages for THP_SWAP
  selftests/mm: fix WARNING comparing pointer to 0
  selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
  ...
2023-08-29 14:25:26 -07:00
Linus Torvalds
1a7c611546 Perf events changes for v6.6:
- AMD IBS improvements
 - Intel PMU driver updates
 - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events
 - Micro-optimize software events and the ring-buffer code
 - Misc cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtBscRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hHoQ/+IBQ8Xi/rcdd40n8OqEB/VBWVuSjNT3uN
 3pHHcTl2Pio9CxBeat42NekNijlRILCKJrZ3Lt3JWBmWyWv5l3KFabelj+lDF2xa
 TVCjTnQNe1+HvrODYnF4ECIs5vaoMVjcJ9jg8+VDgAcOQr1nZs4m5TVAd6TLqPpV
 urBEQVULkkzk7ZRhfrugKhw+wrpWFefgGCx0RV8ijZB7TLMHc2wE+Q/sTxKdKceL
 wNaJaDgV33pZh0aImwR9pKUE532hF1FiBdLuehkh61PZa1L82jzAX1xjw2s1hSa4
 eIWemPHJIYfivRlENbJsDWc4N8gk6ijVHwrxGcr4Axu+NN+zPtQ3ddhaGMAyKdTo
 qUKXH3MZSMIl++jI5Fkc6xM+XLvY1rML62epSzMwu/cc7Z5MeyWdQcri0N9YFuO7
 wUUNnFpU00lwQBLbyyUQ3Zi8E0QV7NuPW4axTkmntiIjMpLagaEvVSf6nf8qLpbE
 WTT16s707t19hUZNazNZ7ONmhly4ALbHFQEH65J2KoYn99fYqy9z68Hwk+xnmykw
 bc3qvfhpw0MImQQ+DqHiBwb4n4UuvY2WlkkZI3FfNeSG63DaM2mZikfpElpXYjn6
 9iOIXvx21Wiq/n0cbLhidI2q/ZzFCzYLCk6ikZ320wb+rhvd7EoSlZil6QSzn3pH
 Qdk+NEZgWQY=
 =ZT6+
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event updates from Ingo Molnar:

 - AMD IBS improvements

 - Intel PMU driver updates

 - Extend core perf facilities & the ARM PMU driver to better handle ARM big.LITTLE events

 - Micro-optimize software events and the ring-buffer code

 - Misc cleanups & fixes

* tag 'perf-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/uncore: Remove unnecessary ?: operator around pcibios_err_to_errno() call
  perf/x86/intel: Add Crestmont PMU
  x86/cpu: Update Hybrids
  x86/cpu: Fix Crestmont uarch
  x86/cpu: Fix Gracemont uarch
  perf: Remove unused extern declaration arch_perf_get_page_size()
  perf: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  arm_pmu: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  perf/x86: Remove unused PERF_PMU_CAP_HETEROGENEOUS_CPUS capability
  arm_pmu: Add PERF_PMU_CAP_EXTENDED_HW_TYPE capability
  perf/x86/ibs: Set mem_lvl_num, mem_remote and mem_hops for data_src
  perf/mem: Add PERF_MEM_LVLNUM_NA to PERF_MEM_NA
  perf/mem: Introduce PERF_MEM_LVLNUM_UNC
  perf/ring_buffer: Use local_try_cmpxchg in __perf_output_begin
  locking/arch: Avoid variable shadowing in local_try_cmpxchg()
  perf/core: Use local64_try_cmpxchg in perf_swevent_set_period
  perf/x86: Use local64_try_cmpxchg
  perf/amd: Prevent grouping of IBS events
2023-08-28 16:35:01 -07:00
Linus Torvalds
42a7f6e3ff - The first, cleanup part of the microcode loader reorg tglx has been
working on. This part makes the loader core code as it is practically
   enabled on pretty much every baremetal machine so there's no need to
   have the Kconfig items. In addition, there are cleanups which prepare
   for future feature enablement.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmTskfcACgkQEsHwGGHe
 VUo/hBAAiqVqdc0WASHgVYoO9mD1M2T3oHFz8ceX2pGKf/raZ5UJyit7ybWEZEWG
 rWXFORRlqOKoKImQm6JhyJsu29Xmi9sTb1WNwEyT8YMdhx8v57hOch3alX7sm2BF
 9eOl/77hxt7Pt8keN6gY5w5cydEgBvi8bVe8sfU3hJMwieAMH0q5syRx7fDizcVd
 qoTicHRjfj5Q8BL5NXtdPEMENYPyV89DVjnUM1HVPpCkoHxmOujewgjs4gY7PsGp
 qAGB1+IG3aqOWHM9SDIJp5U9tNX2huhqRTcZsNEe8qHTXCv8F8zRzK0J8giM1wed
 5aAGC4AfMh/gjryXeMj1nnwoAf5TQw4dK+y+BYXIykdQDV1up+HdDtjrBmJ5Kslf
 n/8uGLdLLqMQVEE2hT3r1Ft1RqVf3UwWOxzc+KASjKCj0F5djt+F2JDNGoN0sMD9
 JDj3Dtvo2e1+aZlcvXWSmMCVMT0By1mbFqirEXT3i1sYwHDx23s+KwY71CdT8gx8
 nbEWsaADPRynNbTAI5Z5TFq0Cohs9AUNuotRHYCc0Et5NBlzoN8yAKNW39twUDEt
 a/Knq1Vnybrp18pE/rDphm+p/K261OuEXfFFVTASSlvgMnVM0UAZZZka7A0DmN+g
 mvZ2A9hByFk6sELm3QeNrOdex8djeichyY7+EQ13K25wMd/YsX0=
 =QXDh
 -----END PGP SIGNATURE-----

Merge tag 'x86_microcode_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 microcode loading updates from Borislav Petkov:
 "The first, cleanup part of the microcode loader reorg tglx has been
  working on. The other part wasn't fully ready in time so it will
  follow on later.

  This part makes the loader core code as it is practically enabled on
  pretty much every baremetal machine so there's no need to have the
  Kconfig items.

  In addition, there are cleanups which prepare for future feature
  enablement"

* tag 'x86_microcode_for_v6.6_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode: Remove remaining references to CONFIG_MICROCODE_AMD
  x86/microcode/intel: Remove pointless mutex
  x86/microcode/intel: Remove debug code
  x86/microcode: Move core specific defines to local header
  x86/microcode/intel: Rename get_datasize() since its used externally
  x86/microcode: Make reload_early_microcode() static
  x86/microcode: Include vendor headers into microcode.h
  x86/microcode/intel: Move microcode functions out of cpu/intel.c
  x86/microcode: Hide the config knob
  x86/mm: Remove unused microcode.h include
  x86/microcode: Remove microcode_mutex
  x86/microcode/AMD: Rip out static buffers
2023-08-28 15:55:20 -07:00
Steve Rutherford
ac3f9c9f1b x86/sev: Make enc_dec_hypercall() accept a size instead of npages
enc_dec_hypercall() accepted a page count instead of a size, which
forced its callers to round up. As a result, non-page aligned
vaddrs caused pages to be spuriously marked as decrypted via the
encryption status hypercall, which in turn caused consistent
corruption of pages during live migration. Live migration requires
accurate encryption status information to avoid migrating pages
from the wrong perspective.

Fixes: 064ce6c550 ("mm: x86: Invoke hypercall when page encryption status is changed")
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Tested-by: Ben Hillier <bhillier@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230824223731.2055016-1-srutherford@google.com
2023-08-25 13:33:48 +02:00
Suren Baghdasaryan
4089eef0e6 mm: drop per-VMA lock when returning VM_FAULT_RETRY or VM_FAULT_COMPLETED
handle_mm_fault returning VM_FAULT_RETRY or VM_FAULT_COMPLETED means
mmap_lock has been released.  However with per-VMA locks behavior is
different and the caller should still release it.  To make the rules
consistent for the caller, drop the per-VMA lock when returning
VM_FAULT_RETRY or VM_FAULT_COMPLETED.  Currently the only path returning
VM_FAULT_RETRY under per-VMA locks is do_swap_page and no path returns
VM_FAULT_COMPLETED for now.

[willy@infradead.org: fix riscv]
  Link: https://lkml.kernel.org/r/CAJuCfpE6GWEx1rPBmNpUfoD5o-gNFz9-UFywzCE2PbEGBiVz7g@mail.gmail.com
Link: https://lkml.kernel.org/r/20230630211957.1341547-4-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-24 16:20:17 -07:00
Vishal Moola (Oracle)
f92c494f42 x86: convert various functions to use ptdescs
In order to split struct ptdesc from struct page, convert various
functions to use ptdescs.

Some of the functions use the *get*page*() helper functions.  Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Link: https://lkml.kernel.org/r/20230807230513.102486-14-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Guo Ren <guoren@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21 13:37:54 -07:00
Matthew Wilcox (Oracle)
284e059204 mm: remove CONFIG_PER_VMA_LOCK ifdefs
Patch series "Handle most file-backed faults under the VMA lock", v3.

This patchset adds the ability to handle page faults on parts of files
which are already in the page cache without taking the mmap lock.


This patch (of 10):

Provide lock_vma_under_rcu() when CONFIG_PER_VMA_LOCK is not defined to
eliminate ifdefs in the users.

Link: https://lkml.kernel.org/r/20230724185410.1124082-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230724185410.1124082-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:50 -07:00
Alistair Popple
1af5a81099 mmu_notifiers: rename invalidate_range notifier
There are two main use cases for mmu notifiers.  One is by KVM which uses
mmu_notifier_invalidate_range_start()/end() to manage a software TLB.

The other is to manage hardware TLBs which need to use the
invalidate_range() callback because HW can establish new TLB entries at
any time.  Hence using start/end() can lead to memory corruption as these
callbacks happen too soon/late during page unmap.

mmu notifier users should therefore either use the start()/end() callbacks
or the invalidate_range() callbacks.  To make this usage clearer rename
the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
update documention.

Link: https://lkml.kernel.org/r/6f77248cd25545c8020a54b4e567e8b72be4dca1.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:41 -07:00
Alistair Popple
6bbd42e2df mmu_notifiers: call invalidate_range() when invalidating TLBs
The invalidate_range() is going to become an architecture specific mmu
notifier used to keep the TLB of secondary MMUs such as an IOMMU in sync
with the CPU page tables.  Currently it is called from separate code paths
to the main CPU TLB invalidations.  This can lead to a secondary TLB not
getting invalidated when required and makes it hard to reason about when
exactly the secondary TLB is invalidated.

To fix this move the notifier call to the architecture specific TLB
maintenance functions for architectures that have secondary MMUs requiring
explicit software invalidations.

This fixes a SMMU bug on ARM64.  On ARM64 PTE permission upgrades require
a TLB invalidation.  This invalidation is done by the architecture
specific ptep_set_access_flags() which calls flush_tlb_page() if required.
However this doesn't call the notifier resulting in infinite faults being
generated by devices using the SMMU if it has previously cached a
read-only PTE in it's TLB.

Moving the invalidations into the TLB invalidation functions ensures all
invalidations happen at the same time as the CPU invalidation.  The
architecture specific flush_tlb_all() routines do not call the notifier as
none of the IOMMUs require this.

Link: https://lkml.kernel.org/r/0287ae32d91393a582897d6c4db6f7456b1001f2.1690292440.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Tested-by: SeongJae Park <sj@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Andrew Donnellan <ajd@linux.ibm.com>
Cc: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18 10:12:41 -07:00
Thomas Gleixner
e5a8192942 x86/mm: Remove unused microcode.h include
No usage for anything in that header.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230812195727.600549655@linutronix.de
2023-08-13 09:45:35 +02:00
Peter Zijlstra
882cdb06b6 x86/cpu: Fix Gracemont uarch
Alderlake N is an E-core only product using Gracemont
micro-architecture. It fits the pre-existing naming scheme perfectly
fine, adhere to it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20230807150405.686834933@infradead.org
2023-08-09 21:51:06 +02:00
Thomas Gleixner
9132d720eb x86/apic: Wrap APIC ID validation into an inline
Prepare for removing the callback and making this as simple comparison to
an upper limit, which is the obvious solution to do for limit checks...

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Juergen Gross <jgross@suse.com> # Xen PV (dom0 and unpriv. guest)
2023-08-09 11:58:30 -07:00
Dave Hansen
54e3d9434e x86/mm: Remove "INVPCID single" feature tracking
From: Dave Hansen <dave.hansen@linux.intel.com>

tl;dr: Replace a synthetic X86_FEATURE with a hardware X86_FEATURE
       and check of existing per-cpu state.

== Background ==

There are three features in play here:
 1. Good old Page Table Isolation (PTI)
 2. Process Context IDentifiers (PCIDs) which allow entries from
    multiple address spaces to be in the TLB at once.
 3. Support for the "Invalidate PCID" (INVPCID) instruction,
    specifically the "individual address" mode (aka. mode 0).

When all *three* of these are in place, INVPCID can and should be used
to flush out individual addresses in the PTI user address space.

But there's a wrinkle or two: First, this INVPCID mode is dependent on
CR4.PCIDE.  Even if X86_FEATURE_INVPCID==1, the instruction may #GP
without setting up CR4.  Second, TLB flushing is done very early, even
before CR4 is fully set up.  That means even if PTI, PCID and INVPCID
are supported, there is *still* a window where INVPCID can #GP.

== Problem ==

The current code seems to work, but mostly by chance and there are a
bunch of ways it can go wrong.  It's also somewhat hard to follow
since X86_FEATURE_INVPCID_SINGLE is set far away from its lone user.

== Solution ==

Make "INVPCID single" more robust and easier to follow by placing all
the logic in one place.  Remove X86_FEATURE_INVPCID_SINGLE.

Make two explicit checks before using INVPCID:
 1. Check that the system supports INVPCID itself (boot_cpu_has())
 2. Then check the CR4.PCIDE shadow to ensures that the CPU
    can safely use INVPCID for individual address invalidation.

The CR4 check *always* works and is not affected by any X86_FEATURE_*
twiddling or inconsistencies between the boot and secondary CPUs.

This has been tested on non-Meltdown hardware by using pti=on and
then flipping PCID and INVPCID support with qemu.

== Aside ==

How does this code even work today?  By chance, I think.  First, PTI
is initialized around the same time that the boot CPU sets
CR4.PCIDE=1.  There are currently no TLB invalidations when PTI=1 but
CR4.PCIDE=0.  That means that the X86_FEATURE_INVPCID_SINGLE check is
never even reached.

this_cpu_has() is also very nasty to use in this context because the
boot CPU reaches here before cpu_data(0) has been initialized.  It
happens to work for X86_FEATURE_INVPCID_SINGLE since it's a
software-defined feature but it would fall over for a hardware-
derived X86_FEATURE.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20230718170630.7922E235%40davehans-spike.ostc.intel.com
2023-08-03 10:34:05 -07:00
Rick Edgecombe
b497e52ddb x86/mm: Teach pte_mkwrite() about stack memory
If a VMA has the VM_SHADOW_STACK flag, it is shadow stack memory. So
when it is made writable with pte_mkwrite(), it should create shadow
stack memory, not conventionally writable memory. Now that all the places
where shadow stack memory might be created pass a VMA into pte_mkwrite(),
it can know when it should do this.

So make pte_mkwrite() create shadow stack memory when the VMA has the
VM_SHADOW_STACK flag. Do the same thing for pmd_mkwrite().

This requires referencing VM_SHADOW_STACK in these functions, which are
currently defined in pgtable.h, however mm.h (where VM_SHADOW_STACK is
located) can't be pulled in without causing problems for files that
reference pgtable.h. So also move pte/pmd_mkwrite() into pgtable.c, where
they can safely reference VM_SHADOW_STACK.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Deepak Gupta <debug@rivosinc.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-22-rick.p.edgecombe%40intel.com
2023-07-11 14:12:19 -07:00
Rick Edgecombe
e5136e8765 mm: Warn on shadow stack memory in wrong vma
The x86 Control-flow Enforcement Technology (CET) feature includes a new
type of memory called shadow stack. This shadow stack memory has some
unusual properties, which requires some core mm changes to function
properly.

One sharp edge is that PTEs that are both Write=0 and Dirty=1 are
treated as shadow by the CPU, but this combination used to be created by
the kernel on x86. Previous patches have changed the kernel to now avoid
creating these PTEs unless they are for shadow stack memory. In case any
missed corners of the kernel are still creating PTEs like this for
non-shadow stack memory, and to catch any re-introductions of the logic,
warn if any shadow stack PTEs (Write=0, Dirty=1) are found in non-shadow
stack VMAs when they are being zapped. This won't catch transient cases
but should have decent coverage.

In order to check if a PTE is shadow stack in core mm code, add two arch
breakouts arch_check_zapped_pte/pmd(). This will allow shadow stack
specific code to be kept in arch/x86.

Only do the check if shadow stack is supported by the CPU and configured
because in rare cases older CPUs may write Dirty=1 to a Write=0 CPU on
older CPUs. This check is handled in pte_shstk()/pmd_shstk().

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-18-rick.p.edgecombe%40intel.com
2023-07-11 14:12:19 -07:00
Rick Edgecombe
fd5439e0c9 x86/mm: Check shadow stack page fault errors
The CPU performs "shadow stack accesses" when it expects to encounter
shadow stack mappings. These accesses can be implicit (via CALL/RET
instructions) or explicit (instructions like WRSS).

Shadow stack accesses to shadow-stack mappings can result in faults in
normal, valid operation just like regular accesses to regular mappings.
Shadow stacks need some of the same features like delayed allocation, swap
and copy-on-write. The kernel needs to use faults to implement those
features.

The architecture has concepts of both shadow stack reads and shadow stack
writes. Any shadow stack access to non-shadow stack memory will generate
a fault with the shadow stack error code bit set.

This means that, unlike normal write protection, the fault handler needs
to create a type of memory that can be written to (with instructions that
generate shadow stack writes), even to fulfill a read access. So in the
case of COW memory, the COW needs to take place even with a shadow stack
read. Otherwise the page will be left (shadow stack) writable in
userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
for shadow stack accesses, even if the access was a shadow stack read.

For the purpose of making this clearer, consider the following example.
If a process has a shadow stack, and forks, the shadow stack PTEs will
become read-only due to COW. If the CPU in one process performs a shadow
stack read access to the shadow stack, for example executing a RET and
causing the CPU to read the shadow stack copy of the return address, then
in order for the fault to be resolved the PTE will need to be set with
shadow stack permissions. But then the memory would be changeable from
userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger
COW, otherwise the shared page would be changeable from both processes.

Shadow stack accesses can also result in errors, such as when a shadow
stack overflows, or if a shadow stack access occurs to a non-shadow-stack
mapping. Also, generate the errors for invalid shadow stack accesses.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-16-rick.p.edgecombe%40intel.com
2023-07-11 14:12:19 -07:00
Rick Edgecombe
f788b71768 x86/mm: Remove _PAGE_DIRTY from kernel RO pages
New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
shadow stack pages.

In normal cases, it can be helpful to create Write=1 PTEs as also Dirty=1
if HW dirty tracking is not needed, because if the Dirty bit is not already
set the CPU has to set Dirty=1 when the memory gets written to. This
creates additional work for the CPU. So traditional wisdom was to simply
set the Dirty bit whenever you didn't care about it. However, it was never
really very helpful for read-only kernel memory.

When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some instructions can write to
such supervisor memory. The kernel does not set IA32_S_CET.SH_STK_EN, so
avoiding kernel Write=0,Dirty=1 memory is not strictly needed for any
functional reason. But having Write=0,Dirty=1 kernel memory doesn't have
any functional benefit either, so to reduce ambiguity between shadow stack
and regular Write=0 pages, remove Dirty=1 from any kernel Write=0 PTEs.

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-14-rick.p.edgecombe%40intel.com
2023-07-11 14:12:19 -07:00
Rick Edgecombe
bb3aadf7d4 x86/mm: Start actually marking _PAGE_SAVED_DIRTY
The recently introduced _PAGE_SAVED_DIRTY should be used instead of the
HW Dirty bit whenever a PTE is Write=0, in order to not inadvertently
create shadow stack PTEs. Update pte_mk*() helpers to do this, and apply
the same changes to pmd and pud. Since there is no x86 version of
pte_mkwrite() to hold this arch specific logic, create one. Add it to
x86/mm/pgtable.c instead of x86/asm/include/pgtable.h as future patches
will require it to live in pgtable.c and it will make the diff easier
for reviewers.

Since CPUs without shadow stack support could create Write=0,Dirty=1
PTEs, only return true for pte_shstk() if the CPU also supports shadow
stack. This will prevent these HW creates PTEs as showing as true for
pte_write().

For pte_modify() this is a bit trickier. It takes a "raw" pgprot_t which
was not necessarily created with any of the existing PTE bit helpers.
That means that it can return a pte_t with Write=0,Dirty=1, a shadow
stack PTE, when it did not intend to create one.

Modify it to also move _PAGE_DIRTY to _PAGE_SAVED_DIRTY. To avoid
creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid:
1. Marking Write=0 PTEs Dirty=1
2. Marking Dirty=1 PTEs Write=0

The first case cannot happen as the existing behavior of pte_modify() is to
filter out any Dirty bit passed in newprot. Handle the second case by
shifting _PAGE_DIRTY=1 to _PAGE_SAVED_DIRTY=1 if the PTE was write
protected by the pte_modify() call. Apply the same changes to pmd_modify().

Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/all/20230613001108.3040476-13-rick.p.edgecombe%40intel.com
2023-07-11 14:12:19 -07:00
Linus Torvalds
9471f1f2f5 Merge branch 'expand-stack'
This modifies our user mode stack expansion code to always take the
mmap_lock for writing before modifying the VM layout.

It's actually something we always technically should have done, but
because we didn't strictly need it, we were being lazy ("opportunistic"
sounds so much better, doesn't it?) about things, and had this hack in
place where we would extend the stack vma in-place without doing the
proper locking.

And it worked fine.  We just needed to change vm_start (or, in the case
of grow-up stacks, vm_end) and together with some special ad-hoc locking
using the anon_vma lock and the mm->page_table_lock, it all was fairly
straightforward.

That is, it was all fine until Ruihan Li pointed out that now that the
vma layout uses the maple tree code, we *really* don't just change
vm_start and vm_end any more, and the locking really is broken.  Oops.

It's not actually all _that_ horrible to fix this once and for all, and
do proper locking, but it's a bit painful.  We have basically three
different cases of stack expansion, and they all work just a bit
differently:

 - the common and obvious case is the page fault handling. It's actually
   fairly simple and straightforward, except for the fact that we have
   something like 24 different versions of it, and you end up in a maze
   of twisty little passages, all alike.

 - the simplest case is the execve() code that creates a new stack.
   There are no real locking concerns because it's all in a private new
   VM that hasn't been exposed to anybody, but lockdep still can end up
   unhappy if you get it wrong.

 - and finally, we have GUP and page pinning, which shouldn't really be
   expanding the stack in the first place, but in addition to execve()
   we also use it for ptrace(). And debuggers do want to possibly access
   memory under the stack pointer and thus need to be able to expand the
   stack as a special case.

None of these cases are exactly complicated, but the page fault case in
particular is just repeated slightly differently many many times.  And
ia64 in particular has a fairly complicated situation where you can have
both a regular grow-down stack _and_ a special grow-up stack for the
register backing store.

So to make this slightly more manageable, the bulk of this series is to
first create a helper function for the most common page fault case, and
convert all the straightforward architectures to it.

Thus the new 'lock_mm_and_find_vma()' helper function, which ends up
being used by x86, arm, powerpc, mips, riscv, alpha, arc, csky, hexagon,
loongarch, nios2, sh, sparc32, and xtensa.  So we not only convert more
than half the architectures, we now have more shared code and avoid some
of those twisty little passages.

And largely due to this common helper function, the full diffstat of
this series ends up deleting more lines than it adds.

That still leaves eight architectures (ia64, m68k, microblaze, openrisc,
parisc, s390, sparc64 and um) that end up doing 'expand_stack()'
manually because they are doing something slightly different from the
normal pattern.  Along with the couple of special cases in execve() and
GUP.

So there's a couple of patches that first create 'locked' helper
versions of the stack expansion functions, so that there's a obvious
path forward in the conversion.  The execve() case is then actually
pretty simple, and is a nice cleanup from our old "grow-up stackls are
special, because at execve time even they grow down".

The #ifdef CONFIG_STACK_GROWSUP in that code just goes away, because
it's just more straightforward to write out the stack expansion there
manually, instead od having get_user_pages_remote() do it for us in some
situations but not others and have to worry about locking rules for GUP.

And the final step is then to just convert the remaining odd cases to a
new world order where 'expand_stack()' is called with the mmap_lock held
for reading, but where it might drop it and upgrade it to a write, only
to return with it held for reading (in the success case) or with it
completely dropped (in the failure case).

In the process, we remove all the stack expansion from GUP (where
dropping the lock wouldn't be ok without special rules anyway), and add
it in manually to __access_remote_vm() for ptrace().

Thanks to Adrian Glaubitz and Frank Scheiner who tested the ia64 cases.
Everything else here felt pretty straightforward, but the ia64 rules for
stack expansion are really quite odd and very different from everything
else.  Also thanks to Vegard Nossum who caught me getting one of those
odd conditions entirely the wrong way around.

Anyway, I think I want to actually move all the stack expansion code to
a whole new file of its own, rather than have it split up between
mm/mmap.c and mm/memory.c, but since this will have to be backported to
the initial maple tree vma introduction anyway, I tried to keep the
patches _fairly_ minimal.

Also, while I don't think it's valid to expand the stack from GUP, the
final patch in here is a "warn if some crazy GUP user wants to try to
expand the stack" patch.  That one will be reverted before the final
release, but it's left to catch any odd cases during the merge window
and release candidates.

Reported-by: Ruihan Li <lrh2000@pku.edu.cn>

* branch 'expand-stack':
  gup: add warning if some caller would seem to want stack expansion
  mm: always expand the stack with the mmap write lock held
  execve: expand new process stack manually ahead of time
  mm: make find_extend_vma() fail if write lock not held
  powerpc/mm: convert coprocessor fault to lock_mm_and_find_vma()
  mm/fault: convert remaining simple cases to lock_mm_and_find_vma()
  arm/mm: Convert to using lock_mm_and_find_vma()
  riscv/mm: Convert to using lock_mm_and_find_vma()
  mips/mm: Convert to using lock_mm_and_find_vma()
  powerpc/mm: Convert to using lock_mm_and_find_vma()
  arm64/mm: Convert to using lock_mm_and_find_vma()
  mm: make the page fault mmap locking killable
  mm: introduce new 'lock_mm_and_find_vma()' page fault helper
2023-06-28 20:35:21 -07:00
Linus Torvalds
6e17c6de3d - Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing.
 
 - Nhat Pham adds the new cachestat() syscall.  It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability.
 
 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning.
 
 - Lorenzo Stoakes has done some maintanance work on the get_user_pages()
   interface.
 
 - Liam Howlett continues with cleanups and maintenance work to the maple
   tree code.  Peng Zhang also does some work on maple tree.
 
 - Johannes Weiner has done some cleanup work on the compaction code.
 
 - David Hildenbrand has contributed additional selftests for
   get_user_pages().
 
 - Thomas Gleixner has contributed some maintenance and optimization work
   for the vmalloc code.
 
 - Baolin Wang has provided some compaction cleanups,
 
 - SeongJae Park continues maintenance work on the DAMON code.
 
 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting.
 
 - Christoph Hellwig has some cleanups for the filemap/directio code.
 
 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the provided
   APIs rather than open-coding accesses.
 
 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings.
 
 - John Hubbard has a series of fixes to the MM selftesting code.
 
 - ZhangPeng continues the folio conversion campaign.
 
 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock.
 
 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
   128 to 8.
 
 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management.
 
 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code.
 
 - Vishal Moola also has done some folio conversion work.
 
 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
 joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
 J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
 =B7yQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...
2023-06-28 10:28:11 -07:00
Linus Torvalds
4d6751815b Changes in this cycle were:
- Remove Xen-PV leftovers from init_32.c
 
 - Fix __swp_entry_to_pte() warning splat for Xen PV guests,
   triggered on CONFIG_DEBUG_VM_PGTABLE=y
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmSaylcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iowg//VoTsoZSI0RKE5Y5RrsaZ5ipT7HiWpaMr
 72qLh+ce/7jckmaVDtw9wEvfe++K4CnhnX1OVUj5FgJMk7wfZvAotIyiO7eIuG3X
 YHtbgRHJn+UrPXZzF8+iuGzW3hD+jJzpD/WusoSpIiSb2XclxWQ8OtmhqBkyvyRb
 hYbpZxse6d/Vx/g8HSNR/sDBtsHzodgaa2ebbrBN4UNCQcwYW/hKiNnoZDtAx596
 Rw6aEhQGq8PKikX9pRTwHKx2DAky0fuQaGiAt4k//3PBDnAQiGFlZT1qU0v0hAlS
 F9Z4qwsiuvK6wE//gwNAiohDGvqSOdOi4eeJNP08gu5DjF1i1SU6An0mYudRdhwm
 9wOVGQll/M2Q3flJMnoiWHuH65R+OFfufitF1PpzyQht2CUUZblEYVGXsYbD54Js
 MGnOWw4NyNrVWgUkzNLT/yUvKR4GwK9gONZ9tfexorrXljF4A0OQr2pllRGgLLiy
 g3ZwW+FgL7Tl8mCRUFUGg/kOBRWuhA6oDfTS2jqtjAsDFouLDKF9M4aaG2NJzmTO
 UUpwoP0KSNqU7WZNlA2h/Gh4Zp4ZDKuXEqdI4WDmZmpfMmkFtTKbsL9ixPlV4jxk
 RtYjdIqzJwScyQPiUkZJdhugHmOURYI0pHNLnXU2AT8mMvHyZVhMxouJJ1xKocp6
 TcNhew+2azg=
 =FhTY
 -----END PGP SIGNATURE-----

Merge tag 'x86-mm-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm updates from Ingo Molnar:

 - Remove Xen-PV leftovers from init_32.c

 - Fix __swp_entry_to_pte() warning splat for Xen PV guests, triggered
   on CONFIG_DEBUG_VM_PGTABLE=y

* tag 'x86-mm-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Remove Xen-PV leftovers from init_32.c
  x86/mm: Fix __swp_entry_to_pte() for Xen PV guests
2023-06-27 14:47:17 -07:00
Linus Torvalds
12dc010071 - Some SEV and CC platform helpers cleanup and simplifications now that
the usage patterns are becoming apparent
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSa9CoACgkQEsHwGGHe
 VUr1NhAAjmOq/T41u3FCSU6fZ8gXo5UkIUT13a6cx+6Omx9waJn5G0xdf/380vQN
 RPRTcc4cfQHCdnIeHgiz1YtCh1ljxXswOSbexHgWHjEcqadgxkZTlKaBEbqyEwLI
 lnRRsowfk7J/8RsYqtzuBvGaWNliiszWE8iayruI1IL+FoEDLlLx1GNYqusP5WIs
 0KYm919Zozl8FEZjP47nH4bab1RcE+HGmLG7UEBmR0zHl4cc7iN3wpv2o/vDxVzR
 /KP8a2G7J/xjllGW+OP81dFCS7iklHpNuaxQS73fDIL7ll2VDqNemh4ivykCrplo
 93twODBwKboKmZhnKc0M2axm5JGGx7IC3KTqEUHzb2Wo4bZCYnrj+9Utzxsa3FxB
 m0BSUcmBqzZCsHCbu62N66l1NlB32EnMO80/45NrgGi62YiGP8qQNhy3TiceUNle
 NHFkQmRZwLyW5YC1ntSK8fwSu4GrMG1MG/eRfMPDmmsYogiZUm2KIj7XKy3dKXR6
 maqifh/raPk3rL+7cl9BQleCDLjnkHNxFxFa329P9K4wrtqn7Ley4izWYAUYhNrl
 VOjLs+thTwEmPPgpo/K1wAbR/PjLdmSQ/fQR6w7eUrzNVnm5ndXzhTUcIawycG3T
 haVry+xPMIRlLCIg+dkrwwNcTW1y/X3K4SmgnjLLF57lZFn9UJc=
 =Aj6p
 -----END PGP SIGNATURE-----

Merge tag 'x86_sev_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SEV updates from Borislav Petkov:

 - Some SEV and CC platform helpers cleanup and simplifications now that
   the usage patterns are becoming apparent

[ I'm sure I'm the only one that has gets confused by all the TLAs, but
  in case there are others: here SEV is AMD's "Secure Encrypted
  Virtualization" and CC is generic "Confidential Computing".

  There's also Intel SGX (Software Guard Extensions) and TDX (Trust
  Domain Extensions), along with all the vendor memory encryption
  extensions (SME, TSME, TME, and WTF).

  And then we have arm64 with RMA and CCA, and I probably forgot another
  dozen or so related acronyms    - Linus ]

* tag 'x86_sev_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/coco: Get rid of accessor functions
  x86/sev: Get rid of special sev_es_enable_key
  x86/coco: Mark cc_platform_has() and descendants noinstr
2023-06-27 13:26:30 -07:00
Linus Torvalds
dc43fc753b - A serious scrubbing of the MTRR code including adding a new map
mechanism in order to look up the memory type of a region easily. Also
   address memory range lookup issues like returning an invalid memory
   type. Furthermore, this handles the decoupling of PAT from MTRR more
   naturally. All work by Juergen Gross
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmSazOIACgkQEsHwGGHe
 VUqltQ/8D1oA4LrgnbFO25J/27U/MwKo7ZI3hN6/OkH2FfdBgqeOlOV4TnndDL88
 l/UrzOfWJQxpVLTO3SLMtDla0VrT24B4HZ4hvzDEdJZ8f1DLZ+gLN7sKOMjoIcO9
 fvBZ5+/gFtVxSquwZmWvM0qKiCkKxmznJfpOx1/lt9UtKyKmpPSMVdrpqOeufL7k
 xxWqGRh2s104ZKwfMOj4dgvCVK9ZUlsPqqiARzkqc0bCg7SeIyPea/S2eljhTl15
 BTOA/wW/lcVQ9yWmDD8inzxrZI4EHEohEaNMfof3AqFyYCOU4RzvE9tpAFEK3GXp
 NilxYkZ+JbEljq2QiEt0Ll8XEVKedi7YC1oN3ciiy9RS6+rWSPIvuMFV9tgPRjr1
 AbWYmDoiLz+5ePI+0fckStRRntWKiao+hOaXb5RbEcg+85hkDHZZC7b0tCAUvnh7
 OwuQfbzAqipn2G1hg+LThHDSjI4qHfHJlpeuPcsAxWef1diJbe15StdVWm+ttRE0
 MTXSn3J9qT9MoY5y6m4KSybp0c1nSFlCK/ZkNvzwWHmkAG6M7wuFmBn3pVzEaCew
 fneGZcX9Ija4MY8Ygajp8GI1aQ4mBNif+uVE7UUY17hH9qAf8vI8Joqs+4L35u8h
 SZl/IqJO9ziEmVLdy9ajgm1xW04AFE1RYRfa6aH6K6tRaIoh8bE=
 =Dmx5
 -----END PGP SIGNATURE-----

Merge tag 'x86_mtrr_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mtrr updates from Borislav Petkov:
 "A serious scrubbing of the MTRR code including adding a new map
  mechanism in order to look up the memory type of a region easily.

  Also address memory range lookup issues like returning an invalid
  memory type. Furthermore, this handles the decoupling of PAT from MTRR
  more naturally.

  All work by Juergen Gross"

* tag 'x86_mtrr_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/xen: Set default memory type for PV guests to WB
  x86/mtrr: Unify debugging printing
  x86/mtrr: Remove unused code
  x86/mm: Only check uniform after calling mtrr_type_lookup()
  x86/mtrr: Don't let mtrr_type_lookup() return MTRR_TYPE_INVALID
  x86/mtrr: Use new cache_map in mtrr_type_lookup()
  x86/mtrr: Add mtrr=debug command line option
  x86/mtrr: Construct a memory map with cache modes
  x86/mtrr: Add get_effective_type() service function
  x86/mtrr: Allocate mtrr_value array dynamically
  x86/mtrr: Move 32-bit code from mtrr.c to legacy.c
  x86/mtrr: Have only one set_mtrr() variant
  x86/mtrr: Replace vendor tests in MTRR code
  x86/xen: Set MTRR state when running as Xen PV initial domain
  x86/hyperv: Set MTRR state when running as SEV-SNP Hyper-V guest
  x86/mtrr: Support setting MTRR state for software defined MTRRs
  x86/mtrr: Replace size_or_mask and size_and_mask with a much easier concept
  x86/mtrr: Remove physical address size calculation
2023-06-27 13:11:32 -07:00
Linus Torvalds
19300488c9 - Address -Wmissing-prototype warnings
- Remove repeated 'the' in comments
  - Remove unused current_untag_mask()
  - Document urgent tip branch timing
  - Clean up MSR kernel-doc notation
  - Clean up paravirt_ops doc
  - Update Srivatsa S. Bhat's maintained areas
  - Remove unused extern declaration acpi_copy_wakeup_routine()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmSZ6xIACgkQaDWVMHDJ
 krB9aQ/+NjB4CiWLbrnOYj9QYG6p1GE7lfu2dzIDdmcNuiai8htopXys54Igy3Rq
 BbIoW4E0SGK5E2OD7nLe4fBA/LpsYZTwDhGUu3SiovxLOoC5qkF0Q+6aVypPJE5o
 q7kn0Eo9IDL1dO0EbJptFDJRjk3K5caEoyXJRelarjIfPRbDEhUFaybVRykMZN9I
 4AOxrlb9WFggT4gUE4+N0kWyEqdgI9/aguavmasaG4lBHZ5JAHNQPNIa8bkVSAPL
 wULAzsrGp96V3tVxdjDCzD9aumk4xlJq7gk+v7mfx013dg7Cjs074Xoi2Y+TmaC7
 fdIZiGPJIkNToW+nENVO7BYtACSQhXeVTGxLQO/HNTDc//ZWiIUoJT2U4qu/6e6F
 aAIGoLwv68H4BghS2qx6Gz+BTIfl35mcPUb75MQhu+D84QZoZWrdamCYhsvHeZzc
 uC3nojrb6PBOth9nJsRae+j1zpRe/DT2LvHSWPJgK6EygOAi05ZfYUll/6sb0vze
 IXkUrVV1BvDDVpY9/HnE8RpDCDolP0/ezK9zsw48arZtkc+Qmw2WlD/2D98E+pSb
 MJPelbVmpzWTaoR4jDzXJCXkWe7CQJ5uPQj5azAE9l7YvnxgCQP5xnm5sLU9eyLu
 RsOwRzss0+3z44x5rJi9nSxQJ0LHfTAzW8/ZmNSZGHzi0ClszK0=
 =N82i
 -----END PGP SIGNATURE-----

Merge tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Dave Hansen:
 "As usual, these are all over the map. The biggest cluster is work from
  Arnd to eliminate -Wmissing-prototype warnings:

   - Address -Wmissing-prototype warnings

   - Remove repeated 'the' in comments

   - Remove unused current_untag_mask()

   - Document urgent tip branch timing

   - Clean up MSR kernel-doc notation

   - Clean up paravirt_ops doc

   - Update Srivatsa S. Bhat's maintained areas

   - Remove unused extern declaration acpi_copy_wakeup_routine()"

* tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86/acpi: Remove unused extern declaration acpi_copy_wakeup_routine()
  Documentation: virt: Clean up paravirt_ops doc
  x86/mm: Remove unused current_untag_mask()
  x86/mm: Remove repeated word in comments
  x86/lib/msr: Clean up kernel-doc notation
  x86/platform: Avoid missing-prototype warnings for OLPC
  x86/mm: Add early_memremap_pgprot_adjust() prototype
  x86/usercopy: Include arch_wb_cache_pmem() declaration
  x86/vdso: Include vdso/processor.h
  x86/mce: Add copy_mc_fragile_handle_tail() prototype
  x86/fbdev: Include asm/fb.h as needed
  x86/hibernate: Declare global functions in suspend.h
  x86/entry: Add do_SYSENTER_32() prototype
  x86/quirks: Include linux/pnp.h for arch_pnpbios_disabled()
  x86/mm: Include asm/numa.h for set_highmem_pages_init()
  x86: Avoid missing-prototype warnings for doublefault code
  x86/fpu: Include asm/fpu/regset.h
  x86: Add dummy prototype for mk_early_pgtbl_32()
  x86/pci: Mark local functions as 'static'
  x86/ftrace: Move prepare_ftrace_return prototype to header
  ...
2023-06-26 16:43:54 -07:00
Linus Torvalds
5dfe7a7e52 - Fix a race window where load_unaligned_zeropad() could cause
a fatal shutdown during TDX private<=>shared conversion
  - Annotate sites where VM "exit reasons" are reused as hypercall
    numbers.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmSZ50cACgkQaDWVMHDJ
 krAPEg//bAR0SzrjIir2eiQ7p3ktr4L7iae0odzFkW/XHam5ZJP+v9cCMLzY6zNO
 44x9Z85jZ9w34GcZ4D7D0OmTHbcDpcxckPXSFco/dK4IyYeLzUImYXEYo41YJEx9
 O4sSQMBqIyjMXej/oKBhgKHSWaV60XimvQvTvhpjXGD/45bt9sx4ZNVTi8+xVMbw
 jpktMsQcsjHctcIY2D2eUR61Ma/Vg9t6Qih51YMtbq6Nqcyhw2IKvDwIx3kQuW34
 qSW7wsyn+RfHQDpwjPgDG/6OE815Pbtzlxz+y6tB9pN88IWkA1H5Jh2CQRlMBud2
 2nVQRpqPgr9uOIeNnNI7FFd1LgTIc/v7lDPfpUH9KelOs7cGWvaRymkuhPSvWxRI
 tmjlMdFq8XcjrOPieA9WpxYKXinqj4wNXtnYGyaM+Ur/P3qWaj18PMCYMbeN6pJC
 eNYEJVk2Mt8GmiPL55aYG5+Z1F8sciLKbz8TFq5ya2z0EnSbyVvR+DReqd7zRzh6
 Bmbmx9isAzN6wWNszNt7f8XSgRPV2Ri1tvb1vixk3JLxyx2iVCUL6KJ0cZOUNy0x
 nQqy7/zMtBsFGZ/Ca8f2kpVaGgxkUFy7n1rI4psXTGBOVlnJyMz3WSi9N8F/uJg0
 Ca5W4493+txdyHSAmWBQAQuZp3RJOlhTkXe5dfjukv6Rnnw1MZU=
 =Hr3D
 -----END PGP SIGNATURE-----

Merge tag 'x86_tdx_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 tdx updates from Dave Hansen:

 - Fix a race window where load_unaligned_zeropad() could cause a fatal
   shutdown during TDX private<=>shared conversion

   The race has never been observed in practice but might allow
   load_unaligned_zeropad() to catch a TDX page in the middle of its
   conversion process which would lead to a fatal and unrecoverable
   guest shutdown.

 - Annotate sites where VM "exit reasons" are reused as hypercall
   numbers.

* tag 'x86_tdx_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Fix enc_status_change_finish_noop()
  x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()
  x86/mm: Allow guest.enc_status_change_prepare() to fail
  x86/tdx: Wrap exit reason with hcall_func()
2023-06-26 16:32:47 -07:00
Linus Torvalds
9244724fbf A large update for SMP management:
- Parallel CPU bringup
 
     The reason why people are interested in parallel bringup is to shorten
     the (kexec) reboot time of cloud servers to reduce the downtime of the
     VM tenants.
 
     The current fully serialized bringup does the following per AP:
 
       1) Prepare callbacks (allocate, intialize, create threads)
       2) Kick the AP alive (e.g. INIT/SIPI on x86)
       3) Wait for the AP to report alive state
       4) Let the AP continue through the atomic bringup
       5) Let the AP run the threaded bringup to full online state
 
     There are two significant delays:
 
       #3 The time for an AP to report alive state in start_secondary() on
          x86 has been measured in the range between 350us and 3.5ms
          depending on vendor and CPU type, BIOS microcode size etc.
 
       #4 The atomic bringup does the microcode update. This has been
          measured to take up to ~8ms on the primary threads depending on
          the microcode patch size to apply.
 
     On a two socket SKL server with 56 cores (112 threads) the boot CPU
     spends on current mainline about 800ms busy waiting for the APs to come
     up and apply microcode. That's more than 80% of the actual onlining
     procedure.
 
     This can be reduced significantly by splitting the bringup mechanism
     into two parts:
 
       1) Run the prepare callbacks and kick the AP alive for each AP which
       	 needs to be brought up.
 
 	 The APs wake up, do their firmware initialization and run the low
       	 level kernel startup code including microcode loading in parallel
       	 up to the first synchronization point. (#1 and #2 above)
 
       2) Run the rest of the bringup code strictly serialized per CPU
       	 (#3 - #5 above) as it's done today.
 
 	 Parallelizing that stage of the CPU bringup might be possible in
 	 theory, but it's questionable whether required surgery would be
 	 justified for a pretty small gain.
 
     If the system is large enough the first AP is already waiting at the
     first synchronization point when the boot CPU finished the wake-up of
     the last AP. That reduces the AP bringup time on that SKL from ~800ms
     to ~80ms, i.e. by a factor ~10x.
 
     The actual gain varies wildly depending on the system, CPU, microcode
     patch size and other factors. There are some opportunities to reduce
     the overhead further, but that needs some deep surgery in the x86 CPU
     bringup code.
 
     For now this is only enabled on x86, but the core functionality
     obviously works for all SMP capable architectures.
 
   - Enhancements for SMP function call tracing so it is possible to locate
     the scheduling and the actual execution points. That allows to measure
     IPI delivery time precisely.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmSZb/YTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoRoOD/9vAiGI3IhGyZcX/RjXxauSHf8Pmqll
 05jUubFi5Vi3tKI1ubMOsnMmJTw2yy5xDyS/iGj7AcbRLq9uQd3iMtsXXHNBzo/X
 FNxnuWTXYUj0vcOYJ+j4puBumFzzpRCprqccMInH0kUnSWzbnaQCeelicZORAf+w
 zUYrswK4HpBXHDOnvPw6Z7MYQe+zyDQSwjSftstLyROzu+lCEw/9KUaysY2epShJ
 wHClxS2XqMnpY4rJ/CmJAlRhD0Plb89zXyo6k9YZYVDWoAcmBZy6vaTO4qoR171L
 37ApqrgsksMkjFycCMnmrFIlkeb7bkrYDQ5y+xqC3JPTlYDKOYmITV5fZ83HD77o
 K7FAhl/CgkPq2Ec+d82GFLVBKR1rijbwHf7a0nhfUy0yMeaJCxGp4uQ45uQ09asi
 a/VG2T38EgxVdseC92HRhcdd3pipwCb5wqjCH/XdhdlQrk9NfeIeP+TxF4QhADhg
 dApp3ifhHSnuEul7+HNUkC6U+Zc8UeDPdu5lvxSTp2ooQ0JwaGgC5PJq3nI9RUi2
 Vv826NHOknEjFInOQcwvp6SJPfcuSTF75Yx6xKz8EZ3HHxpvlolxZLq+3ohSfOKn
 2efOuZO5bEu4S/G2tRDYcy+CBvNVSrtZmCVqSOS039c8quBWQV7cj0334cjzf+5T
 TRiSzvssbYYmaw==
 =Y8if
 -----END PGP SIGNATURE-----

Merge tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP updates from Thomas Gleixner:
 "A large update for SMP management:

   - Parallel CPU bringup

     The reason why people are interested in parallel bringup is to
     shorten the (kexec) reboot time of cloud servers to reduce the
     downtime of the VM tenants.

     The current fully serialized bringup does the following per AP:

       1) Prepare callbacks (allocate, intialize, create threads)
       2) Kick the AP alive (e.g. INIT/SIPI on x86)
       3) Wait for the AP to report alive state
       4) Let the AP continue through the atomic bringup
       5) Let the AP run the threaded bringup to full online state

     There are two significant delays:

       #3 The time for an AP to report alive state in start_secondary()
          on x86 has been measured in the range between 350us and 3.5ms
          depending on vendor and CPU type, BIOS microcode size etc.

       #4 The atomic bringup does the microcode update. This has been
          measured to take up to ~8ms on the primary threads depending
          on the microcode patch size to apply.

     On a two socket SKL server with 56 cores (112 threads) the boot CPU
     spends on current mainline about 800ms busy waiting for the APs to
     come up and apply microcode. That's more than 80% of the actual
     onlining procedure.

     This can be reduced significantly by splitting the bringup
     mechanism into two parts:

       1) Run the prepare callbacks and kick the AP alive for each AP
          which needs to be brought up.

          The APs wake up, do their firmware initialization and run the
          low level kernel startup code including microcode loading in
          parallel up to the first synchronization point. (#1 and #2
          above)

       2) Run the rest of the bringup code strictly serialized per CPU
          (#3 - #5 above) as it's done today.

          Parallelizing that stage of the CPU bringup might be possible
          in theory, but it's questionable whether required surgery
          would be justified for a pretty small gain.

     If the system is large enough the first AP is already waiting at
     the first synchronization point when the boot CPU finished the
     wake-up of the last AP. That reduces the AP bringup time on that
     SKL from ~800ms to ~80ms, i.e. by a factor ~10x.

     The actual gain varies wildly depending on the system, CPU,
     microcode patch size and other factors. There are some
     opportunities to reduce the overhead further, but that needs some
     deep surgery in the x86 CPU bringup code.

     For now this is only enabled on x86, but the core functionality
     obviously works for all SMP capable architectures.

   - Enhancements for SMP function call tracing so it is possible to
     locate the scheduling and the actual execution points. That allows
     to measure IPI delivery time precisely"

* tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  trace,smp: Add tracepoints for scheduling remotelly called functions
  trace,smp: Add tracepoints around remotelly called functions
  MAINTAINERS: Add CPU HOTPLUG entry
  x86/smpboot: Fix the parallel bringup decision
  x86/realmode: Make stack lock work in trampoline_compat()
  x86/smp: Initialize cpu_primary_thread_mask late
  cpu/hotplug: Fix off by one in cpuhp_bringup_mask()
  x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
  x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
  x86/smpboot: Support parallel startup of secondary CPUs
  x86/smpboot: Implement a bit spinlock to protect the realmode stack
  x86/apic: Save the APIC virtual base address
  cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE
  x86/apic: Provide cpu_primary_thread mask
  x86/smpboot: Enable split CPU startup
  cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism
  cpu/hotplug: Reset task stack state in _cpu_up()
  cpu/hotplug: Remove unused state functions
  riscv: Switch to hotplug core state synchronization
  parisc: Switch to hotplug core state synchronization
  ...
2023-06-26 13:59:56 -07:00
Linus Torvalds
64bf6ae93e v6.5/vfs.misc
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZJU4SwAKCRCRxhvAZXjc
 ojOTAP9gT/z1gasIf8OwDHb4inZGnVpHh2ApKLvgMXH6ICtwRgD+OBtOcf438Lx1
 cpFSTVJlh21QXMOOXWHe/LRUV2kZ5wI=
 =zdfx
 -----END PGP SIGNATURE-----

Merge tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull misc vfs updates from Christian Brauner:
 "Miscellaneous features, cleanups, and fixes for vfs and individual fs

  Features:

   - Use mode 0600 for file created by cachefilesd so it can be run by
     unprivileged users. This aligns them with directories which are
     already created with mode 0700 by cachefilesd

   - Reorder a few members in struct file to prevent some false sharing
     scenarios

   - Indicate that an eventfd is used a semaphore in the eventfd's
     fdinfo procfs file

   - Add a missing uapi header for eventfd exposing relevant uapi
     defines

   - Let the VFS protect transitions of a superblock from read-only to
     read-write in addition to the protection it already provides for
     transitions from read-write to read-only. Protecting read-only to
     read-write transitions allows filesystems such as ext4 to perform
     internal writes, keeping writers away until the transition is
     completed

  Cleanups:

   - Arnd removed the architecture specific arch_report_meminfo()
     prototypes and added a generic one into procfs.h. Note, we got a
     report about a warning in amdpgpu codepaths that suggested this was
     bisectable to this change but we concluded it was a false positive

   - Remove unused parameters from split_fs_names()

   - Rename put_and_unmap_page() to unmap_and_put_page() to let the name
     reflect the order of the cleanup operation that has to unmap before
     the actual put

   - Unexport buffer_check_dirty_writeback() as it is not used outside
     of block device aops

   - Stop allocating aio rings from highmem

   - Protecting read-{only,write} transitions in the VFS used open-coded
     barriers in various places. Replace them with proper little helpers
     and document both the helpers and all barrier interactions involved
     when transitioning between read-{only,write} states

   - Use flexible array members in old readdir codepaths

  Fixes:

   - Use the correct type __poll_t for epoll and eventfd

   - Replace all deprecated strlcpy() invocations, whose return value
     isn't checked with an equivalent strscpy() call

   - Fix some kernel-doc warnings in fs/open.c

   - Reduce the stack usage in jffs2's xattr codepaths finally getting
     rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088
     bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
     royally annoying compilation warning

   - Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not
     fmode_t is required to avoid fmode_t to integer degradation
     warnings

   - Create coredumps with O_WRONLY instead of O_RDWR. There's a long
     explanation in that commit how O_RDWR is actually a bug which we
     found out with the help of Linus and git archeology

   - Fix "no previous prototype" warnings in the pipe codepaths

   - Add overflow calculations for remap_verify_area() as a signed
     addition overflow could be triggered in xfstests

   - Fix a null pointer dereference in sysv

   - Use an unsigned variable for length calculations in jfs avoiding
     compilation warnings with gcc 13

   - Fix a dangling pipe pointer in the watch queue codepath

   - The legacy mount option parser provided as a fallback by the VFS
     for filesystems not yet converted to the new mount api did prefix
     the generated mount option string with a leading ',' causing issues
     for some filesystems

   - Fix a repeated word in a comment in fs.h

   - autofs: Update the ctime when mtime is updated as mandated by
     POSIX"

* tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  readdir: Replace one-element arrays with flexible-array members
  fs: Provide helpers for manipulating sb->s_readonly_remount
  fs: Protect reconfiguration of sb read-write from racing writes
  eventfd: add a uapi header for eventfd userspace APIs
  autofs: set ctime as well when mtime changes on a dir
  eventfd: show the EFD_SEMAPHORE flag in fdinfo
  fs/aio: Stop allocating aio rings from HIGHMEM
  fs: Fix comment typo
  fs: unexport buffer_check_dirty_writeback
  fs: avoid empty option when generating legacy mount string
  watch_queue: prevent dangling pipe pointer
  fs.h: Optimize file struct to prevent false sharing
  highmem: Rename put_and_unmap_page() to unmap_and_put_page()
  cachefiles: Allow the cache to be non-root
  init: remove unused names parameter in split_fs_names()
  jfs: Use unsigned variable for length calculations
  fs/sysv: Null check to prevent null-ptr-deref bug
  fs: use UB-safe check for signed addition overflow in remap_verify_area
  procfs: consolidate arch_report_meminfo declaration
  fs: pipe: reveal missing function protoypes
  ...
2023-06-26 09:50:21 -07:00
Linus Torvalds
c2508ec5a5 mm: introduce new 'lock_mm_and_find_vma()' page fault helper
.. and make x86 use it.

This basically extracts the existing x86 "find and expand faulting vma"
code, but extends it to also take the mmap lock for writing in case we
actually do need to expand the vma.

We've historically short-circuited that case, and have some rather ugly
special logic to serialize the stack segment expansion (since we only
hold the mmap lock for reading) that doesn't match the normal VM
locking.

That slight violation of locking worked well, right up until it didn't:
the maple tree code really does want proper locking even for simple
extension of an existing vma.

So extract the code for "look up the vma of the fault" from x86, fix it
up to do the necessary write locking, and make it available as a helper
function for other architectures that can use the common helper.

Note: I say "common helper", but it really only handles the normal
stack-grows-down case.  Which is all architectures except for PA-RISC
and IA64.  So some rare architectures can't use the helper, but if they
care they'll just need to open-code this logic.

It's also worth pointing out that this code really would like to have an
optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
read-lock (for the common case) to taking the write lock (for having to
extend the vma) in the normal single-threaded situation where there is
no other locking activity.

But that _is_ all the very uncommon special case, so while it would be
nice to have such an operation, it probably doesn't matter in reality.
I did put in the skeleton code for such a possible future expansion,
even if it only acts as pseudo-documentation for what we're doing.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-24 14:12:54 -07:00
Hugh Dickins
653ba8108d x86: sme_populate_pgd() use pte_offset_kernel()
sme_populate_pgd() is an __init function for sme_encrypt_kernel():
it should use pte_offset_kernel() instead of pte_offset_map(), to avoid
the question of whether a pte_unmap() will be needed to balance.

Link: https://lkml.kernel.org/r/497d7777-736e-85f2-c37-aa6bcf155e4@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John David Anglin <dave.anglin@bell.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19 16:19:10 -07:00