/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * Based on arch/arm/include/asm/memory.h
 *
 * Copyright (C) 2000-2002 Russell King
 * Copyright (C) 2012 ARM Ltd.
 *
 * Note: this file should not be included by non-asm/.h files
 */
#ifndef __ASM_MEMORY_H
#define __ASM_MEMORY_H

#include <linux/const.h>
#include <linux/sizes.h>
#include <asm/page-def.h>

/*
 * Size of the PCI I/O space. This must remain a power of two so that
 * IO_SPACE_LIMIT acts as a mask for the low bits of I/O addresses.
 */
#define PCI_IO_SIZE		SZ_16M
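
/*
 * For example, with PCI_IO_SIZE == SZ_16M, IO_SPACE_LIMIT (conventionally
 * derived as PCI_IO_SIZE - 1, i.e. 0x00ffffff) can be ANDed with an I/O
 * address to keep it inside the window; this only works because the size
 * is a power of two.
 */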

/*
 * VMEMMAP_SIZE - allows the whole linear region to be covered by
 *                a struct page array
 *
 * If we are configured with a 52-bit kernel VA then our VMEMMAP_SIZE
 * needs to cover the memory region from the beginning of the 52-bit
 * PAGE_OFFSET all the way to PAGE_END for 48-bit. This allows us to
 * keep a constant PAGE_OFFSET and "fallback" to using the higher end
 * of the VMEMMAP where 52-bit support is not available in hardware.
 */
#define VMEMMAP_RANGE		(_PAGE_END(VA_BITS_MIN) - PAGE_OFFSET)
#define VMEMMAP_SIZE		((VMEMMAP_RANGE >> PAGE_SHIFT) * sizeof(struct page))
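
/*
 * Worked example: with VA_BITS == VA_BITS_MIN == 48 and 4K pages,
 * VMEMMAP_RANGE is 2^47 bytes, i.e. 2^35 pages; assuming a 64-byte
 * struct page, VMEMMAP_SIZE comes to 2^41 bytes (2 TiB).
 */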

/*
 * PAGE_OFFSET - the virtual address of the start of the linear map, at the
 *               start of the TTBR1 address space.
 * PAGE_END - the end of the linear map, where all other kernel mappings begin.
 * KIMAGE_VADDR - the virtual address of the start of the kernel image.
 * VA_BITS - the maximum number of bits for virtual addresses.
 */
#define VA_BITS			(CONFIG_ARM64_VA_BITS)
#define _PAGE_OFFSET(va)	(-(UL(1) << (va)))
#define PAGE_OFFSET		(_PAGE_OFFSET(VA_BITS))
#define KIMAGE_VADDR		(MODULES_END)
#define MODULES_END		(MODULES_VADDR + MODULES_VSIZE)
#define MODULES_VADDR		(_PAGE_END(VA_BITS_MIN))
#define MODULES_VSIZE		(SZ_2G)
#define VMEMMAP_START		(VMEMMAP_END - VMEMMAP_SIZE)
#define VMEMMAP_END		(-UL(SZ_1G))
#define PCI_IO_START		(VMEMMAP_END + SZ_8M)
#define PCI_IO_END		(PCI_IO_START + PCI_IO_SIZE)
#define FIXADDR_TOP		(-UL(SZ_8M))
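
/*
 * Illustrative values for a plain 48-bit VA configuration:
 *   PAGE_OFFSET   = 0xffff000000000000 (start of the linear map)
 *   MODULES_VADDR = 0xffff800000000000, with MODULES_END/KIMAGE_VADDR 2G above
 *   VMEMMAP_END   = 0xffffffffc0000000
 *   PCI_IO_START  = 0xffffffffc0800000, PCI_IO_END 16M above
 *   FIXADDR_TOP   = 0xffffffffff800000
 */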

#if VA_BITS > 48
#ifdef CONFIG_ARM64_16K_PAGES
#define VA_BITS_MIN		(47)
#else
#define VA_BITS_MIN		(48)
#endif
#else
#define VA_BITS_MIN		(VA_BITS)
#endif

#define _PAGE_END(va)		(-(UL(1) << ((va) - 1)))

#define KERNEL_START		_text
#define KERNEL_END		_end

/*
 * Generic and Software Tag-Based KASAN modes require 1/8th and 1/16th of the
 * kernel virtual address space for storing the shadow memory respectively.
 *
 * The mapping between a virtual memory address and its corresponding shadow
 * memory address is defined based on the formula:
 *
 *     shadow_addr = (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET
 *
 * where KASAN_SHADOW_SCALE_SHIFT is the order of the number of bits that map
 * to a single shadow byte and KASAN_SHADOW_OFFSET is a constant that offsets
 * the mapping. Note that KASAN_SHADOW_OFFSET does not point to the start of
 * the shadow memory region.
 *
 * Based on this mapping, we define two constants:
 *
 * KASAN_SHADOW_START: the start of the shadow memory region;
 * KASAN_SHADOW_END: the end of the shadow memory region.
 *
 * KASAN_SHADOW_END is defined first as the shadow address that corresponds to
 * the upper bound of possible virtual kernel memory addresses UL(1) << 64
 * according to the mapping formula.
 *
 * KASAN_SHADOW_START is defined second based on KASAN_SHADOW_END. The shadow
 * memory start must map to the lowest possible kernel virtual memory address
 * and thus it depends on the actual bitness of the address space.
 *
 * As KASAN inserts redzones between stack variables, this increases the stack
 * memory usage significantly. Thus, we double the (minimum) stack size.
 */
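
/*
 * Worked example: with generic KASAN (KASAN_SHADOW_SCALE_SHIFT == 3, assumed
 * here for illustration) and vabits_actual == 48, the shadow of the lowest
 * kernel address (UL(1) << 64) - (UL(1) << 48) is KASAN_SHADOW_END minus
 * (UL(1) << 45), i.e. KASAN_SHADOW_START, and the shadow region spans
 * 2^45 bytes: 1/8th of the 2^48-byte kernel VA space.
 */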
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
#define KASAN_SHADOW_OFFSET	_AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
#define KASAN_SHADOW_END	((UL(1) << (64 - KASAN_SHADOW_SCALE_SHIFT)) + KASAN_SHADOW_OFFSET)
#define _KASAN_SHADOW_START(va)	(KASAN_SHADOW_END - (UL(1) << ((va) - KASAN_SHADOW_SCALE_SHIFT)))
#define KASAN_SHADOW_START	_KASAN_SHADOW_START(vabits_actual)
#define PAGE_END		KASAN_SHADOW_START
#define KASAN_THREAD_SHIFT	1
#else
#define KASAN_THREAD_SHIFT	0
#define PAGE_END		(_PAGE_END(VA_BITS_MIN))
#endif /* CONFIG_KASAN */

#define MIN_THREAD_SHIFT	(14 + KASAN_THREAD_SHIFT)

/*
 * VMAP'd stacks are allocated at page granularity, so we must ensure that such
 * stacks are a multiple of page size.
 */
#if defined(CONFIG_VMAP_STACK) && (MIN_THREAD_SHIFT < PAGE_SHIFT)
#define THREAD_SHIFT		PAGE_SHIFT
#else
#define THREAD_SHIFT		MIN_THREAD_SHIFT
#endif

#if THREAD_SHIFT >= PAGE_SHIFT
#define THREAD_SIZE_ORDER	(THREAD_SHIFT - PAGE_SHIFT)
#endif

#define THREAD_SIZE		(UL(1) << THREAD_SHIFT)

/*
 * By aligning VMAP'd stacks to 2 * THREAD_SIZE, we can detect overflow by
 * checking sp & (1 << THREAD_SHIFT), which we can do cheaply in the entry
 * assembly.
 */
#ifdef CONFIG_VMAP_STACK
#define THREAD_ALIGN		(2 * THREAD_SIZE)
#else
#define THREAD_ALIGN		THREAD_SIZE
#endif
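
/*
 * For example, with 16 KiB stacks (THREAD_SHIFT == 14) aligned to 32 KiB,
 * bit 14 of a valid sp within the stack is zero; an overflow (or underflow)
 * of less than THREAD_SIZE flips that bit, so a single bit test in the
 * entry code detects it.
 */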

#define IRQ_STACK_SIZE		THREAD_SIZE

#define OVERFLOW_STACK_SIZE	SZ_4K

/*
 * With the minimum frame size of [x29, x30], exactly half the combined
 * sizes of the hyp and overflow stacks is the maximum size needed to
 * save the unwound stacktrace; plus an additional entry to delimit the
 * end.
 */
#define NVHE_STACKTRACE_SIZE	((OVERFLOW_STACK_SIZE + PAGE_SIZE) / 2 + sizeof(long))
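
/*
 * E.g. assuming 4K pages (so a page-sized hyp stack) and an 8-byte long,
 * this is (4096 + 4096) / 2 + 8 = 4104 bytes.
 */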

/*
 * Alignment of kernel segments (e.g. .text, .data).
 *
 *  4 KB granule:  16 level 3 entries, with contiguous bit
 * 16 KB granule:   4 level 3 entries, without contiguous bit
 * 64 KB granule:   1 level 3 entry
 */
#define SEGMENT_ALIGN		SZ_64K

/*
 * Memory types available.
 *
 * IMPORTANT: MT_NORMAL must be index 0 since vm_get_page_prot() may 'or' in
 *            the MT_NORMAL_TAGGED memory type for PROT_MTE mappings. Note
 *            that protection_map[] only contains MT_NORMAL attributes.
 */
#define MT_NORMAL		0
#define MT_NORMAL_TAGGED	1
#define MT_NORMAL_NC		2
#define MT_DEVICE_nGnRnE	3
#define MT_DEVICE_nGnRE		4

/*
 * Memory types for Stage-2 translation
 */
#define MT_S2_NORMAL		0xf
#define MT_S2_NORMAL_NC		0x5
#define MT_S2_DEVICE_nGnRE	0x1

/*
 * Memory types for Stage-2 translation when ID_AA64MMFR2_EL1.FWB is 0001
 * Stage-2 enforces Normal-WB and Device-nGnRE
 */
#define MT_S2_FWB_NORMAL	6
#define MT_S2_FWB_NORMAL_NC	5
#define MT_S2_FWB_DEVICE_nGnRE	1

#ifdef CONFIG_ARM64_4K_PAGES
#define IOREMAP_MAX_ORDER	(PUD_SHIFT)
#else
#define IOREMAP_MAX_ORDER	(PMD_SHIFT)
#endif

/*
 * Open-coded (swapper_pg_dir - reserved_pg_dir) as this cannot be calculated
 * until link time.
 */
#define RESERVED_SWAPPER_OFFSET	(PAGE_SIZE)

/*
 * Open-coded (swapper_pg_dir - tramp_pg_dir) as this cannot be calculated
 * until link time.
 */
#define TRAMP_SWAPPER_OFFSET	(2 * PAGE_SIZE)

#ifndef __ASSEMBLY__

#include <linux/bitops.h>
#include <linux/compiler.h>
#include <linux/mmdebug.h>
#include <linux/types.h>
#include <asm/boot.h>
#include <asm/bug.h>
#include <asm/sections.h>
#include <asm/sysreg.h>

static inline u64 __pure read_tcr(void)
{
	u64 tcr;

	// read_sysreg() uses asm volatile, so avoid it here
	asm("mrs %0, tcr_el1" : "=r"(tcr));
	return tcr;
}

#if VA_BITS > 48
// For reasons of #include hell, we can't use TCR_T1SZ_OFFSET/TCR_T1SZ_MASK here
#define vabits_actual		(64 - ((read_tcr() >> 16) & 63))
#else
#define vabits_actual		((u64)VA_BITS)
#endif
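
/*
 * Illustrative reading: bits [21:16] of TCR_EL1 hold T1SZ, so vabits_actual
 * evaluates to 64 - T1SZ; e.g. T1SZ == 16 gives 48-bit VAs, T1SZ == 12 gives
 * 52-bit VAs.
 */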

extern s64			memstart_addr;
/* PHYS_OFFSET - the physical address of the start of memory. */
#define PHYS_OFFSET		({ VM_BUG_ON(memstart_addr & 1); memstart_addr; })

/* the offset between the kernel virtual and physical mappings */
extern u64			kimage_voffset;

static inline unsigned long kaslr_offset(void)
{
	return (u64)&_text - KIMAGE_VADDR;
}

#ifdef CONFIG_RANDOMIZE_BASE
void kaslr_init(void);
static inline bool kaslr_enabled(void)
{
	extern bool __kaslr_is_enabled;

	return __kaslr_is_enabled;
}
#else
static inline void kaslr_init(void) { }
static inline bool kaslr_enabled(void) { return false; }
#endif

/*
 * Allow all memory at the discovery stage. We will clip it later.
 */
#define MIN_MEMBLOCK_ADDR	0
#define MAX_MEMBLOCK_ADDR	U64_MAX

/*
 * PFNs are used to describe any physical page; this means
 * PFN 0 == physical address 0.
 *
 * This is the PFN of the first RAM page in the kernel
 * direct-mapped view. We assume this is the first page
 * of RAM in the mem_map as well.
 */
#define PHYS_PFN_OFFSET		(PHYS_OFFSET >> PAGE_SHIFT)

/*
 * When dealing with data aborts, watchpoints, or instruction traps we may end
 * up with a tagged userland pointer. Clear the tag to get a sane pointer to
 * pass on to access_ok(), for instance.
 */
#define __untagged_addr(addr)	\
	((__force __typeof__(addr))sign_extend64((__force u64)(addr), 55))

#define untagged_addr(addr)	({				\
	u64 __addr = (__force u64)(addr);			\
	__addr &= __untagged_addr(__addr);			\
	(__force __typeof__(addr))__addr;			\
})
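
/*
 * Behavioural sketch: sign_extend64(addr, 55) copies bit 55 into bits 63:56,
 * so for a TTBR0 (user) pointer the tag byte is cleared, e.g.
 * untagged_addr(0x2b00ffffb0001000) == 0x0000ffffb0001000, while a TTBR1
 * (kernel) pointer passes through the AND above unchanged.
 */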

#if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
#define __tag_shifted(tag)	((u64)(tag) << 56)
#define __tag_reset(addr)	__untagged_addr(addr)
#define __tag_get(addr)		(__u8)((u64)(addr) >> 56)
#else
#define __tag_shifted(tag)	0UL
#define __tag_reset(addr)	(addr)
#define __tag_get(addr)		0
#endif /* CONFIG_KASAN_SW_TAGS || CONFIG_KASAN_HW_TAGS */
|
2018-12-28 08:30:16 +00:00
|
|
|
|
arm64/mm: fix variable 'tag' set but not used
When CONFIG_KASAN_SW_TAGS=n, set_tag() is compiled away. GCC throws a
warning,
mm/kasan/common.c: In function '__kasan_kmalloc':
mm/kasan/common.c:464:5: warning: variable 'tag' set but not used
[-Wunused-but-set-variable]
u8 tag = 0xff;
^~~
Fix it by making __tag_set() a static inline function the same as
arch_kasan_set_tag() in mm/kasan/kasan.h for consistency because there
is a macro in arch/arm64/include/asm/kasan.h,
#define arch_kasan_set_tag(addr, tag) __tag_set(addr, tag)
However, when CONFIG_DEBUG_VIRTUAL=n and CONFIG_SPARSEMEM_VMEMMAP=y,
page_to_virt() will call __tag_set() with a parameter of the wrong type,
so fix that as well. Also, keep page_to_virt() returning "void *" instead
of "const void *", so we will not need to add a similar cast in
lowmem_page_address().
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Will Deacon <will@kernel.org>
2019-08-01 14:47:05 +00:00
|
|
|
static inline const void *__tag_set(const void *addr, u8 tag)
|
|
|
|
{
|
2019-08-13 16:01:05 +00:00
|
|
|
u64 __addr = (u64)addr & ~__tag_shifted(0xff);
|
|
|
|
return (const void *)(__addr | __tag_shifted(tag));
|
|
|
|
}
|
|
|
|
|
2020-12-22 20:01:56 +00:00
|
|
|
#ifdef CONFIG_KASAN_HW_TAGS
|
2023-03-10 23:43:30 +00:00
|
|
|
#define arch_enable_tag_checks_sync() mte_enable_kernel_sync()
|
|
|
|
#define arch_enable_tag_checks_async() mte_enable_kernel_async()
|
|
|
|
#define arch_enable_tag_checks_asymm() mte_enable_kernel_asymm()
|
2023-03-10 23:43:32 +00:00
|
|
|
#define arch_suppress_tag_checks_start() mte_enable_tco()
|
|
|
|
#define arch_suppress_tag_checks_stop() mte_disable_tco()
|
2021-03-15 13:20:19 +00:00
|
|
|
#define arch_force_async_tag_fault() mte_check_tfsr_exit()
|
2020-12-22 20:01:56 +00:00
|
|
|
#define arch_get_random_tag() mte_get_random_tag()
|
|
|
|
#define arch_get_mem_tag(addr) mte_get_mem_tag(addr)
|
2021-04-30 05:59:55 +00:00
|
|
|
#define arch_set_mem_tag_range(addr, size, tag, init) \
|
|
|
|
mte_set_mem_tag_range((addr), (size), (tag), (init))
|
2020-12-22 20:01:56 +00:00
|
|
|
#endif /* CONFIG_KASAN_HW_TAGS */
|
|
|
|
|
2017-01-10 21:35:47 +00:00
|
|
|
/*
|
|
|
|
* Physical vs virtual RAM address space conversion. These are
|
|
|
|
* private definitions which should NOT be used outside memory.h
|
|
|
|
* files. Use virt_to_phys/phys_to_virt/__pa/__va instead.
|
|
|
|
*/
|
2017-01-10 21:35:50 +00:00
|
|
|
|
|
|
|
|
|
|
|
/*
|
2021-01-26 13:40:56 +00:00
|
|
|
* Check whether an arbitrary address is within the linear map, which
|
|
|
|
* lives in the [PAGE_OFFSET, PAGE_END) interval at the bottom of the
|
|
|
|
* kernel's TTBR1 address range.
|
2017-01-10 21:35:50 +00:00
|
|
|
*/
|
2021-02-01 19:06:34 +00:00
|
|
|
#define __is_lm_address(addr) (((u64)(addr) - PAGE_OFFSET) < (PAGE_END - PAGE_OFFSET))
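The check needs only one unsigned comparison because (u64)(addr) - PAGE_OFFSET wraps around for any address below PAGE_OFFSET and therefore compares as larger than the linear-map size. A stand-alone sketch with assumed (not real) layout constants:

/* Stand-alone sketch; EX_PAGE_OFFSET/EX_PAGE_END are assumed values, not the
 * kernel's. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EX_PAGE_OFFSET	0xffff000000000000ULL	/* assumed linear map start */
#define EX_PAGE_END	0xffff800000000000ULL	/* assumed linear map end */

static bool is_lm_address(uint64_t addr)
{
	/* Addresses below PAGE_OFFSET wrap around and fail the test too. */
	return (addr - EX_PAGE_OFFSET) < (EX_PAGE_END - EX_PAGE_OFFSET);
}

int main(void)
{
	printf("%d\n", is_lm_address(0xffff000012345678ULL));	/* 1: linear map */
	printf("%d\n", is_lm_address(0xffff800000100000ULL));	/* 0: above PAGE_END */
	printf("%d\n", is_lm_address(0x0000000012345678ULL));	/* 0: wraps around */
	return 0;
}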
|
2017-01-10 21:35:50 +00:00
|
|
|
|
2021-02-01 19:06:34 +00:00
|
|
|
#define __lm_to_phys(addr) (((addr) - PAGE_OFFSET) + PHYS_OFFSET)
|
2017-01-10 21:35:50 +00:00
|
|
|
#define __kimg_to_phys(addr) ((addr) - kimage_voffset)
|
|
|
|
|
|
|
|
#define __virt_to_phys_nodebug(x) ({ \
|
2019-08-13 15:26:54 +00:00
|
|
|
phys_addr_t __x = (phys_addr_t)(__tag_reset(x)); \
|
2019-08-13 16:22:51 +00:00
|
|
|
__is_lm_address(__x) ? __lm_to_phys(__x) : __kimg_to_phys(__x); \
|
2017-01-10 21:35:50 +00:00
|
|
|
})
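A stand-alone sketch of how the nodebug translation dispatches between the two offsets, using the linear-map offset for linear addresses and kimage_voffset for kernel-image addresses; every EX_* constant below is an assumption chosen only to keep the example self-contained:

/* Stand-alone sketch; every EX_* constant is an assumption made only to keep
 * the example self-contained. */
#include <stdint.h>
#include <stdio.h>

#define EX_PAGE_OFFSET		0xffff000000000000ULL
#define EX_PAGE_END		0xffff800000000000ULL
#define EX_PHYS_OFFSET		0x80000000ULL		/* assumed memstart_addr */
#define EX_KIMAGE_VOFFSET	0xffff800010000000ULL	/* assumed kimage_voffset */

static uint64_t ex_virt_to_phys(uint64_t va)
{
	if ((va - EX_PAGE_OFFSET) < (EX_PAGE_END - EX_PAGE_OFFSET))
		return (va - EX_PAGE_OFFSET) + EX_PHYS_OFFSET;	/* linear map */
	return va - EX_KIMAGE_VOFFSET;				/* kernel image */
}

int main(void)
{
	printf("%#llx\n", (unsigned long long)ex_virt_to_phys(0xffff000000001000ULL));
	printf("%#llx\n", (unsigned long long)ex_virt_to_phys(0xffff800010200000ULL));
	return 0;
}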
|
|
|
|
|
|
|
|
#define __pa_symbol_nodebug(x) __kimg_to_phys((phys_addr_t)(x))
|
|
|
|
|
|
|
|
#ifdef CONFIG_DEBUG_VIRTUAL
|
|
|
|
extern phys_addr_t __virt_to_phys(unsigned long x);
|
|
|
|
extern phys_addr_t __phys_addr_symbol(unsigned long x);
|
|
|
|
#else
|
|
|
|
#define __virt_to_phys(x) __virt_to_phys_nodebug(x)
|
|
|
|
#define __phys_addr_symbol(x) __pa_symbol_nodebug(x)
|
2019-08-13 16:06:29 +00:00
|
|
|
#endif /* CONFIG_DEBUG_VIRTUAL */
|
2017-01-10 21:35:47 +00:00
|
|
|
|
arm64: mm: use single quantity to represent the PA to VA translation
On arm64, the global variable memstart_addr represents the physical
address of PAGE_OFFSET, and so physical to virtual translations or
vice versa used to come down to simple additions or subtractions
involving the values of PAGE_OFFSET and memstart_addr.
When support for 52-bit virtual addressing was introduced, we had to
deal with PAGE_OFFSET potentially being outside of the region that
can be covered by the virtual range (as the 52-bit VA capable build
needs to be able to run on systems that are only 48-bit VA capable),
and for this reason, another translation was introduced, and recorded
in the global variable physvirt_offset.
However, if we go back to the original definition of memstart_addr,
i.e., the physical address of PAGE_OFFSET, it turns out that there is
no need for two separate translations: instead, we can simply subtract
the size of the unaddressable VA space from memstart_addr to make the
available physical memory appear in the 48-bit addressable VA region.
This simplifies things, but also fixes a bug on KASLR builds, which
may update memstart_addr later on in arm64_memblock_init() but fail
to update vmemmap and physvirt_offset accordingly.
Fixes: 5383cc6efed1 ("arm64: mm: Introduce vabits_actual")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Link: https://lore.kernel.org/r/20201008153602.9467-2-ardb@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
2020-10-08 15:35:59 +00:00
|
|
|
#define __phys_to_virt(x) ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET)
|
2017-01-10 21:35:47 +00:00
|
|
|
#define __phys_to_kimg(x) ((unsigned long)((x) + kimage_voffset))
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Convert a page to/from a physical address
|
|
|
|
*/
|
|
|
|
#define page_to_phys(page) (__pfn_to_phys(page_to_pfn(page)))
|
|
|
|
#define phys_to_page(phys) (pfn_to_page(__phys_to_pfn(phys)))
|
|
|
|
|
2012-03-05 11:49:27 +00:00
|
|
|
/*
|
|
|
|
* Note: Drivers should NOT use these. They are the wrong
|
|
|
|
* translation to use for DMA addresses. Use the driver
|
|
|
|
* DMA support - see dma-mapping.h.
|
|
|
|
*/
|
2014-07-28 15:25:48 +00:00
|
|
|
#define virt_to_phys virt_to_phys
|
2012-03-05 11:49:27 +00:00
|
|
|
static inline phys_addr_t virt_to_phys(const volatile void *x)
|
|
|
|
{
|
|
|
|
return __virt_to_phys((unsigned long)(x));
|
|
|
|
}
|
|
|
|
|
2014-07-28 15:25:48 +00:00
|
|
|
#define phys_to_virt phys_to_virt
|
2012-03-05 11:49:27 +00:00
|
|
|
static inline void *phys_to_virt(phys_addr_t x)
|
|
|
|
{
|
|
|
|
return (void *)(__phys_to_virt(x));
|
|
|
|
}
|
|
|
|
|
2022-05-30 08:46:39 +00:00
|
|
|
/* Needed already here for resolving __phys_to_pfn() in virt_to_pfn() */
|
|
|
|
#include <asm-generic/memory_model.h>
|
|
|
|
|
|
|
|
static inline unsigned long virt_to_pfn(const void *kaddr)
|
|
|
|
{
|
|
|
|
return __phys_to_pfn(virt_to_phys(kaddr));
|
|
|
|
}
|
|
|
|
|
2012-03-05 11:49:27 +00:00
|
|
|
/*
|
|
|
|
* Drivers should NOT use these either.
|
|
|
|
*/
|
|
|
|
#define __pa(x) __virt_to_phys((unsigned long)(x))
|
2017-01-10 21:35:50 +00:00
|
|
|
#define __pa_symbol(x) __phys_addr_symbol(RELOC_HIDE((unsigned long)(x), 0))
|
|
|
|
#define __pa_nodebug(x) __virt_to_phys_nodebug((unsigned long)(x))
|
2012-03-05 11:49:27 +00:00
|
|
|
#define __va(x) ((void *)__phys_to_virt((phys_addr_t)(x)))
|
|
|
|
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
|
2019-08-13 16:22:51 +00:00
|
|
|
#define sym_to_pfn(x) __phys_to_pfn(__pa_symbol(x))
|
2012-03-05 11:49:27 +00:00
|
|
|
|
|
|
|
/*
|
2019-08-13 16:22:51 +00:00
|
|
|
* virt_to_page(x) converts a _valid_ virtual address to a struct page *
|
|
|
|
* virt_addr_valid(x) indicates whether a virtual address is valid
|
2012-03-05 11:49:27 +00:00
|
|
|
*/
|
2014-10-28 05:44:01 +00:00
|
|
|
#define ARCH_PFN_OFFSET ((unsigned long)PHYS_PFN_OFFSET)
|
2012-03-05 11:49:27 +00:00
|
|
|
|
2021-04-20 09:35:59 +00:00
|
|
|
#if defined(CONFIG_DEBUG_VIRTUAL)
|
2021-03-08 16:10:23 +00:00
|
|
|
#define page_to_virt(x) ({ \
|
|
|
|
__typeof__(x) __page = x; \
|
|
|
|
void *__addr = __va(page_to_phys(__page)); \
|
|
|
|
(void *)__tag_set((const void *)__addr, page_kasan_tag(__page));\
|
|
|
|
})
|
2019-08-13 16:22:51 +00:00
|
|
|
#define virt_to_page(x) pfn_to_page(virt_to_pfn(x))
|
2016-03-30 14:46:01 +00:00
|
|
|
#else
|
2019-08-13 15:46:11 +00:00
|
|
|
#define page_to_virt(x) ({ \
|
|
|
|
__typeof__(x) __page = x; \
|
2020-11-10 18:05:11 +00:00
|
|
|
u64 __idx = ((u64)__page - VMEMMAP_START) / sizeof(struct page);\
|
|
|
|
u64 __addr = PAGE_OFFSET + (__idx * PAGE_SIZE); \
|
2019-08-13 15:46:11 +00:00
|
|
|
(void *)__tag_set((const void *)__addr, page_kasan_tag(__page));\
|
2018-12-28 08:30:57 +00:00
|
|
|
})
|
|
|
|
|
2019-08-13 15:46:11 +00:00
|
|
|
#define virt_to_page(x) ({ \
|
2020-11-10 18:05:11 +00:00
|
|
|
u64 __idx = (__tag_reset((u64)x) - PAGE_OFFSET) / PAGE_SIZE; \
|
|
|
|
u64 __addr = VMEMMAP_START + (__idx * sizeof(struct page)); \
|
|
|
|
(struct page *)__addr; \
|
2019-08-13 15:46:11 +00:00
|
|
|
})
|
2021-04-20 09:35:59 +00:00
|
|
|
#endif /* CONFIG_DEBUG_VIRTUAL */
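Without CONFIG_DEBUG_VIRTUAL, both conversions are pure index arithmetic against the vmemmap array: the position of a struct page in vmemmap equals the page's offset within the linear map, and vice versa. A stand-alone sketch with assumed constants, including a made-up sizeof(struct page):

/* Stand-alone sketch; the EX_* constants and the struct page size are
 * assumptions for illustration only. */
#include <stdint.h>
#include <stdio.h>

#define EX_PAGE_OFFSET		0xffff000000000000ULL
#define EX_VMEMMAP_START	0xfffffc0000000000ULL	/* assumed */
#define EX_PAGE_SIZE		4096ULL
#define EX_STRUCT_PAGE_SIZE	64ULL			/* assumed sizeof(struct page) */

static uint64_t ex_page_to_virt(uint64_t page)
{
	uint64_t idx = (page - EX_VMEMMAP_START) / EX_STRUCT_PAGE_SIZE;

	return EX_PAGE_OFFSET + idx * EX_PAGE_SIZE;
}

static uint64_t ex_virt_to_page(uint64_t virt)
{
	uint64_t idx = (virt - EX_PAGE_OFFSET) / EX_PAGE_SIZE;

	return EX_VMEMMAP_START + idx * EX_STRUCT_PAGE_SIZE;
}

int main(void)
{
	uint64_t va = EX_PAGE_OFFSET + 5 * EX_PAGE_SIZE;
	uint64_t pg = ex_virt_to_page(va);

	/* Round-trips back to the original page-aligned virtual address. */
	printf("page %#llx -> virt %#llx\n", (unsigned long long)pg,
	       (unsigned long long)ex_page_to_virt(pg));
	return 0;
}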
|
2012-03-05 11:49:27 +00:00
|
|
|
|
2019-08-13 14:52:23 +00:00
|
|
|
#define virt_addr_valid(addr) ({ \
|
2021-02-01 19:06:33 +00:00
|
|
|
__typeof__(addr) __addr = __tag_reset(addr); \
|
2021-07-01 01:51:19 +00:00
|
|
|
__is_lm_address(__addr) && pfn_is_map_memory(virt_to_pfn(__addr)); \
|
2019-08-13 14:52:23 +00:00
|
|
|
})
|
2012-03-05 11:49:27 +00:00
|
|
|
|
2020-06-29 04:38:31 +00:00
|
|
|
void dump_mem_limit(void);
|
2019-08-13 16:06:29 +00:00
|
|
|
#endif /* !ASSEMBLY */
|
2016-09-21 22:25:04 +00:00
|
|
|
|
arm64, mm, efi: Account for GICv3 LPI tables in static memblock reserve table
In the irqchip and EFI code, we have what basically amounts to a quirk
to work around a peculiarity in the GICv3 architecture, which permits
the system memory address of LPI tables to be programmable only once
after a CPU reset. This means kexec kernels must use the same memory
as the first kernel, and thus ensure that this memory has not been
given out for other purposes by the time the ITS init code runs, which
is not very early for secondary CPUs.
On systems with many CPUs, these reservations could overflow the
memblock reservation table, and this was addressed in commit:
eff896288872 ("efi/arm: Defer persistent reservations until after paging_init()")
However, this turns out to have made things worse, since the allocation
of page tables and heap space for the resized memblock reservation table
itself may overwrite the regions we are attempting to reserve, which may
cause all kinds of corruption, also considering that the ITS will still
be poking bits into that memory in response to incoming MSIs.
So instead, let's grow the static memblock reservation table on such
systems so it can accommodate these reservations at an earlier time.
This will permit us to revert the above commit in a subsequent patch.
[ mingo: Minor cleanups. ]
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190215123333.21209-2-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-15 12:33:32 +00:00
|
|
|
/*
|
|
|
|
* Given that the GIC architecture permits ITS implementations that can only be
|
|
|
|
* configured with an LPI table address once, GICv3 systems with many CPUs may
|
|
|
|
* end up reserving a lot of different regions after a kexec for their LPI
|
|
|
|
* tables (one per CPU), as we are forced to reuse the same memory after kexec
|
|
|
|
* (and thus reserve it persistently with EFI beforehand).
|
|
|
|
*/
|
|
|
|
#if defined(CONFIG_EFI) && defined(CONFIG_ARM_GIC_V3_ITS)
|
|
|
|
# define INIT_MEMBLOCK_RESERVED_REGIONS (INIT_MEMBLOCK_REGIONS + NR_CPUS + 1)
|
|
|
|
#endif
|
|
|
|
|
memblock,arm64: expand the static memblock memory table
In a system (Huawei Ascend ARM64 SoC) using HBM, when a multi-bit ECC error
occurs, the BIOS will mark the corresponding area (for example, 2 MB) as
unusable. When the system restarts, these areas are either not reported or
reported as EFI_UNUSABLE_MEMORY. Both cases lead to an increase in the
number of memblocks, and EFI_UNUSABLE_MEMORY leads to the larger number.
For example, if the EFI_UNUSABLE_MEMORY type is reported:
...
memory[0x92] [0x0000200834a00000-0x0000200835bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
memory[0x93] [0x0000200835c00000-0x0000200835dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x94] [0x0000200835e00000-0x00002008367fffff], 0x0000000000a00000 bytes on node 7 flags: 0x0
memory[0x95] [0x0000200836800000-0x00002008369fffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x96] [0x0000200836a00000-0x0000200837bfffff], 0x0000000001200000 bytes on node 7 flags: 0x0
memory[0x97] [0x0000200837c00000-0x0000200837dfffff], 0x0000000000200000 bytes on node 7 flags: 0x4
memory[0x98] [0x0000200837e00000-0x000020087fffffff], 0x0000000048200000 bytes on node 7 flags: 0x0
memory[0x99] [0x0000200880000000-0x0000200bcfffffff], 0x0000000350000000 bytes on node 6 flags: 0x0
memory[0x9a] [0x0000200bd0000000-0x0000200bd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9b] [0x0000200bd0200000-0x0000200bd07fffff], 0x0000000000600000 bytes on node 6 flags: 0x0
memory[0x9c] [0x0000200bd0800000-0x0000200bd09fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9d] [0x0000200bd0a00000-0x0000200fcfffffff], 0x00000003ff600000 bytes on node 6 flags: 0x0
memory[0x9e] [0x0000200fd0000000-0x0000200fd01fffff], 0x0000000000200000 bytes on node 6 flags: 0x4
memory[0x9f] [0x0000200fd0200000-0x0000200fffffffff], 0x000000002fe00000 bytes on node 6 flags: 0x0
...
The EFI memory map is parsed to construct the memblock arrays before the
memblock arrays can be resized. As a result, memory regions beyond
INIT_MEMBLOCK_REGIONS are lost.
Add a new macro, INIT_MEMBLOCK_MEMORY_REGIONS, to replace
INIT_MEMBLOCK_REGIONS as the definition of the size of the static
memblock.memory array.
Allow overriding the memblock.memory array size with an architecture-defined
INIT_MEMBLOCK_MEMORY_REGIONS and make arm64 set
INIT_MEMBLOCK_MEMORY_REGIONS to 1024 when CONFIG_EFI is enabled.
Link: https://lkml.kernel.org/r/20220615102742.96450-1-zhouguanghui1@huawei.com
Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Darren Hart <darren@os.amperecomputing.com>
Acked-by: Will Deacon <will@kernel.org> [arm64]
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Xu Qiang <xuqiang36@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-15 10:27:42 +00:00
|
|
|
/*
|
|
|
|
* Memory regions marked with the MEMBLOCK_NOMAP flag (for example, the memory
|
|
|
|
* of the EFI_UNUSABLE_MEMORY type) may divide a contiguous memory block into
|
|
|
|
* multiple parts. As a result, the number of memory regions is large.
|
|
|
|
*/
|
|
|
|
#ifdef CONFIG_EFI
|
|
|
|
#define INIT_MEMBLOCK_MEMORY_REGIONS (INIT_MEMBLOCK_REGIONS * 8)
|
|
|
|
#endif
|
|
|
|
|
2012-03-05 11:49:27 +00:00
|
|
|
|
2019-08-13 16:06:29 +00:00
|
|
|
#endif /* __ASM_MEMORY_H */
|