linux-stable/Documentation/admin-guide/kernel-parameters.txt

7427 lines
264 KiB
Plaintext
Raw Normal View History

accept_memory= [MM]
Format: { eager | lazy }
default: lazy
By default, unaccepted memory is accepted lazily to
avoid prolonged boot times. The lazy option will add
some runtime overhead until all memory is eventually
accepted. In most cases the overhead is negligible.
For some workloads or for debugging purposes
accept_memory=eager can be used to accept all memory
at once during boot.
acpi= [HW,ACPI,X86,ARM64,RISCV64]
Advanced Configuration and Power Interface
Format: { force | on | off | strict | noirq | rsdt |
copy_dsdt }
force -- enable ACPI if default was off
on -- enable ACPI but allow fallback to DT [arm64,riscv64]
off -- disable ACPI if default was on
noirq -- do not use ACPI for IRQ routing
strict -- Be less tolerant of platforms that are not
strictly ACPI specification compliant.
rsdt -- prefer RSDT over (default) XSDT
copy_dsdt -- copy DSDT to memory
For ARM64 and RISCV64, ONLY "acpi=off", "acpi=on" or
"acpi=force" are available
See also Documentation/power/runtime_pm.rst, pci=noacpi
acpi_apic_instance= [ACPI, IOAPIC]
Format: <int>
2: use 2nd APIC table, if available
1,0: use 1st APIC table
default: 0
acpi_backlight= [HW,ACPI]
{ vendor | video | native | none }
If set to vendor, prefer vendor-specific driver
(e.g. thinkpad_acpi, sony_acpi, etc.) instead
of the ACPI video.ko driver.
If set to video, use the ACPI video.ko driver.
If set to native, use the device's native backlight mode.
If set to none, disable the ACPI backlight interface.
acpi_force_32bit_fadt_addr
force FADT to use 32 bit addresses rather than the
64 bit X_* addresses. Some firmware have broken 64
bit addresses for force ACPI ignore these and use
the older legacy 32 bit addresses.
acpica_no_return_repair [HW, ACPI]
Disable AML predefined validation mechanism
This mechanism can repair the evaluation result to make
the return objects more ACPI specification compliant.
This option is useful for developers to identify the
root cause of an AML interpreter issue when the issue
has something to do with the repair mechanism.
acpi.debug_layer= [HW,ACPI,ACPI_DEBUG]
acpi.debug_level= [HW,ACPI,ACPI_DEBUG]
Format: <int>
CONFIG_ACPI_DEBUG must be enabled to produce any ACPI
debug output. Bits in debug_layer correspond to a
_COMPONENT in an ACPI source file, e.g.,
#define _COMPONENT ACPI_EVENTS
Bits in debug_level correspond to a level in
ACPI_DEBUG_PRINT statements, e.g.,
ACPI_DEBUG_PRINT((ACPI_DB_INFO, ...
The debug_level mask defaults to "info". See
Documentation/firmware-guide/acpi/debug.rst for more information about
debug layers and levels.
Enable processor driver info messages:
acpi.debug_layer=0x20000000
Enable AML "Debug" output, i.e., stores to the Debug
object while interpreting AML:
acpi.debug_layer=0xffffffff acpi.debug_level=0x2
Enable all messages related to ACPI hardware:
acpi.debug_layer=0x2 acpi.debug_level=0xffffffff
Some values produce so much output that the system is
unusable. The "log_buf_len" parameter may be useful
if you need to capture more output.
acpi_enforce_resources= [ACPI]
{ strict | lax | no }
Check for resource conflicts between native drivers
and ACPI OperationRegions (SystemIO and SystemMemory
only). IO ports and memory declared in ACPI might be
used by the ACPI subsystem in arbitrary AML code and
can interfere with legacy drivers.
strict (default): access to resources claimed by ACPI
is denied; legacy drivers trying to access reserved
resources will fail to bind to device using them.
lax: access to resources claimed by ACPI is allowed;
legacy drivers trying to access reserved resources
will bind successfully but a warning message is logged.
no: ACPI OperationRegions are not marked as reserved,
no further checks are performed.
ACPI: Fix x86 regression related to early mapping size limitation The following warning message is triggered: WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:136 __early_ioremap+0x11f/0x1f2() Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 3.15.0-rc1-00017-g86dfc6f3-dirty #298 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.99.99.x036.091920111209 09/19/2011 0000000000000009 ffffffff81b75c40 ffffffff817c627b 0000000000000000 ffffffff81b75c78 ffffffff81067b5d 000000000000007b 8000000000000563 00000000b96b20dc 0000000000000001 ffffffffff300e0c ffffffff81b75c88 Call Trace: [<ffffffff817c627b>] dump_stack+0x45/0x56 [<ffffffff81067b5d>] warn_slowpath_common+0x7d/0xa0 [<ffffffff81067c3a>] warn_slowpath_null+0x1a/0x20 [<ffffffff81d4b9d5>] __early_ioremap+0x11f/0x1f2 [<ffffffff81d4bc5b>] early_ioremap+0x13/0x15 [<ffffffff81d2b8f3>] __acpi_map_table+0x13/0x18 [<ffffffff817b8d1a>] acpi_os_map_memory+0x26/0x14e [<ffffffff813ff018>] acpi_tb_acquire_table+0x42/0x70 [<ffffffff813ff086>] acpi_tb_validate_table+0x27/0x37 [<ffffffff813ff0e5>] acpi_tb_verify_table+0x22/0xd8 [<ffffffff813ff6a8>] acpi_tb_install_non_fixed_table+0x60/0x1c9 [<ffffffff81d61024>] acpi_tb_parse_root_table+0x218/0x26a [<ffffffff81d1b120>] ? early_idt_handlers+0x120/0x120 [<ffffffff81d610cd>] acpi_initialize_tables+0x57/0x59 [<ffffffff81d5f25d>] acpi_table_init+0x1b/0x99 [<ffffffff81d2bca0>] acpi_boot_table_init+0x1e/0x85 [<ffffffff81d23043>] setup_arch+0x99d/0xcc6 [<ffffffff81d1b120>] ? early_idt_handlers+0x120/0x120 [<ffffffff81d1bbbe>] start_kernel+0x8b/0x415 [<ffffffff81d1b120>] ? early_idt_handlers+0x120/0x120 [<ffffffff81d1b5ee>] x86_64_start_reservations+0x2a/0x2c [<ffffffff81d1b72e>] x86_64_start_kernel+0x13e/0x14d ---[ end trace 11ae599a1898f4e7 ]--- when installing the following table during early stage: ACPI: SSDT 0x00000000B9638018 07A0C4 (v02 INTEL S2600CP 00004000 INTL 20100331) The regression is caused by the size limitation of the x86 early IO mapping. The root cause is: 1. ACPICA doesn't split IO memory mapping and table mapping; 2. Linux x86 OSL implements acpi_os_map_memory() using a size limited fix-map mechanism during early boot stage, which is more suitable for only IO mappings. This patch fixes this issue by utilizing acpi_gbl_verify_table_checksum to disable the table mapping during early stage and enabling it again for the late stage. In this way, the normal code path is not affected. Then after the code related to the root cause is cleaned up, the early checksum verification can be easily re-enabled. A new boot parameter - acpi_force_table_verification is introduced for the platforms that require the checksum verification to stop loading bad tables. This fix also covers the checksum verification for the table overrides. Now large tables can also be overridden using the initrd override mechanism. Signed-off-by: Lv Zheng <lv.zheng@intel.com> Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-05-31 00:15:02 +00:00
acpi_force_table_verification [HW,ACPI]
Enable table checksum verification during early stage.
By default, this is disabled due to x86 early mapping
size limitation.
acpi_irq_balance [HW,ACPI]
ACPI will balance active IRQs
default in APIC mode
acpi_irq_nobalance [HW,ACPI]
ACPI will not move active IRQs (default)
default in PIC mode
acpi_irq_isa= [HW,ACPI] If irq_balance, mark listed IRQs used by ISA
Format: <irq>,<irq>...
acpi_irq_pci= [HW,ACPI] If irq_balance, clear listed IRQs for
use by PCI
Format: <irq>,<irq>...
acpi_mask_gpe= [HW,ACPI]
Due to the existence of _Lxx/_Exx, some GPEs triggered
by unsupported hardware/firmware features can result in
GPE floodings that cannot be automatically disabled by
the GPE dispatcher.
This facility can be used to prevent such uncontrolled
GPE floodings.
Format: <byte> or <bitmap-list>
acpi_no_auto_serialize [HW,ACPI]
Disable auto-serialization of AML methods
AML control methods that contain the opcodes to create
named objects will be marked as "Serialized" by the
auto-serialization feature.
This feature is enabled by default.
This option allows to turn off the feature.
acpi_no_memhotplug [ACPI] Disable memory hotplug. Useful for kdump
kernels.
acpi_no_static_ssdt [HW,ACPI]
Disable installation of static SSDTs at early boot time
By default, SSDTs contained in the RSDT/XSDT will be
installed automatically and they will appear under
/sys/firmware/acpi/tables.
This option turns off this feature.
Note that specifying this option does not affect
dynamic table installation which will install SSDT
tables to /sys/firmware/acpi/tables/dynamic.
acpi_no_watchdog [HW,ACPI,WDT]
Ignore the ACPI-based watchdog interface (WDAT) and let
a native driver control the watchdog device instead.
acpi_rsdp= [ACPI,EFI,KEXEC]
Pass the RSDP address to the kernel, mostly used
on machines running EFI runtime service to boot the
second kernel for kdump.
ACPICA: Add boot option to disable auto return object repair Sometimes, there might be bugs caused by unexpected AML which is compliant to the Windows but not compliant to the Linux implementation. There is a predefined validation mechanism implemented in ACPICA to repair the unexpected AML evaluation results that are caused by the unexpected AMLs. For example, BIOS may return misorder _CST result and the repair mechanism can make an ascending order on the returned _CST package object based on the C-state type. This mechanism is quite useful to implement an AML interpreter with better compliance with the real world where Windows is the de-facto standard and BIOS codes are only tested on one platform thus not compliant to the ACPI specification. But if a compliance issue hasn't been figured out yet, it will be difficult for developers to identify if the unexpected evaluation result is caused by this mechanism or by the AML interpreter. For example, _PR0 is expected to be a control method, but BIOS may use Package: "Name(_PR0, Package(1) {P1PR})". This boot option can disable the predefined validation mechanism so that developers can make sure the root cause comes from the parser/executer. This patch adds a new kernel parameter to disable this feature. A build test has been made on a Dell Inspiron mini 1100 (i386 z530) machine when this patch is applied and the corresponding boot test is performed w/ or w/o the new kernel parameter specified. References: https://bugzilla.kernel.org/show_bug.cgi?id=67901 Tested-by: Fabian Wehning <fabian.wehning@googlemail.com> Signed-off-by: Lv Zheng <lv.zheng@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-02-11 03:01:52 +00:00
acpi_os_name= [HW,ACPI] Tell ACPI BIOS the name of the OS
Format: To spoof as Windows 98: ="Microsoft Windows"
acpi_rev_override [ACPI] Override the _REV object to return 5 (instead
of 2 which is mandated by ACPI 6) as the supported ACPI
specification revision (when using this switch, it may
be necessary to carry out a cold reboot _twice_ in a
row to make it take effect on the platform firmware).
acpi_osi= [HW,ACPI] Modify list of supported OS interface strings
ACPI: Add facility to disable all _OSI OS vendor strings This patch introduces "acpi_osi=!" command line to force Linux replying "UNSUPPORTED" to all of the _OSI strings. This patch is based on an ACPICA enhancement - the new API acpi_update_interfaces(). The _OSI object provides the platform with the ability to query OSPM to determine the set of ACPI related interfaces, behaviors, or features that the operating system supports. The argument passed to the _OSI is a string like the followings: 1. Feature Group String, examples include Module Device Processor Device 3.0 _SCP Extensions Processor Aggregator Device ... 2. OS Vendor String, examples include Linux FreeBSD Windows ... There are AML codes provided in the ACPI namespace written in the following style to determine OSPM interfaces / features: Method(OSCK) { if (CondRefOf(_OSI, Local0)) { if (\_OSI("Windows")) { Return (One) } if (\_OSI("Windows 2006")) { Return (Ones) } Return (Zero) } Return (Zero) } There is a debugging facility implemented in Linux. Users can pass "acpi_osi=" boot parameters to the kernel to tune the _OSI evaluation result so that certain AML codes can be executed. Current implementation includes: 1. 'acpi_osi=' - this makes CondRefOf(_OSI, Local0) TRUE 2. 'acpi_osi="Windows"' - this makes \_OSI("Windows") TRUE 3. 'acpi_osi="!Windows"' - this makes \_OSI("Windows") FALSE The function to implement this feature is also used as a quirk mechanism in the Linux ACPI subystem. When _OSI is evaluatated by the AML codes, ACPICA replies "SUPPORTED" to all Windows operating system vendor strings. This is because Windows operating systems return "SUPPORTED" if the argument to the _OSI method specifies an earlier version of Windows. Please refer to the following MSDN document: How to Identify the Windows Version in ACPI by Using _OSI http://msdn.microsoft.com/en-us/library/hardware/gg463275.aspx This adds difficulties when developers want to feed specific Windows operating system vendor string to the BIOS codes for debugging purpose, multiple acpi_osi="!xxx" have to be specified in the command line to force Linux replying "UNSUPPORTED" to the Windows OS vendor strings listed in the AML codes. Signed-off-by: Lv Zheng <lv.zheng@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-22 08:08:25 +00:00
acpi_osi="string1" # add string1
acpi_osi="!string2" # remove string2
acpi_osi=!* # remove all strings
ACPI: Add facility to disable all _OSI OS vendor strings This patch introduces "acpi_osi=!" command line to force Linux replying "UNSUPPORTED" to all of the _OSI strings. This patch is based on an ACPICA enhancement - the new API acpi_update_interfaces(). The _OSI object provides the platform with the ability to query OSPM to determine the set of ACPI related interfaces, behaviors, or features that the operating system supports. The argument passed to the _OSI is a string like the followings: 1. Feature Group String, examples include Module Device Processor Device 3.0 _SCP Extensions Processor Aggregator Device ... 2. OS Vendor String, examples include Linux FreeBSD Windows ... There are AML codes provided in the ACPI namespace written in the following style to determine OSPM interfaces / features: Method(OSCK) { if (CondRefOf(_OSI, Local0)) { if (\_OSI("Windows")) { Return (One) } if (\_OSI("Windows 2006")) { Return (Ones) } Return (Zero) } Return (Zero) } There is a debugging facility implemented in Linux. Users can pass "acpi_osi=" boot parameters to the kernel to tune the _OSI evaluation result so that certain AML codes can be executed. Current implementation includes: 1. 'acpi_osi=' - this makes CondRefOf(_OSI, Local0) TRUE 2. 'acpi_osi="Windows"' - this makes \_OSI("Windows") TRUE 3. 'acpi_osi="!Windows"' - this makes \_OSI("Windows") FALSE The function to implement this feature is also used as a quirk mechanism in the Linux ACPI subystem. When _OSI is evaluatated by the AML codes, ACPICA replies "SUPPORTED" to all Windows operating system vendor strings. This is because Windows operating systems return "SUPPORTED" if the argument to the _OSI method specifies an earlier version of Windows. Please refer to the following MSDN document: How to Identify the Windows Version in ACPI by Using _OSI http://msdn.microsoft.com/en-us/library/hardware/gg463275.aspx This adds difficulties when developers want to feed specific Windows operating system vendor string to the BIOS codes for debugging purpose, multiple acpi_osi="!xxx" have to be specified in the command line to force Linux replying "UNSUPPORTED" to the Windows OS vendor strings listed in the AML codes. Signed-off-by: Lv Zheng <lv.zheng@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-22 08:08:25 +00:00
acpi_osi=! # disable all built-in OS vendor
strings
acpi_osi=!! # enable all built-in OS vendor
strings
acpi_osi= # disable all strings
ACPI: Add facility to disable all _OSI OS vendor strings This patch introduces "acpi_osi=!" command line to force Linux replying "UNSUPPORTED" to all of the _OSI strings. This patch is based on an ACPICA enhancement - the new API acpi_update_interfaces(). The _OSI object provides the platform with the ability to query OSPM to determine the set of ACPI related interfaces, behaviors, or features that the operating system supports. The argument passed to the _OSI is a string like the followings: 1. Feature Group String, examples include Module Device Processor Device 3.0 _SCP Extensions Processor Aggregator Device ... 2. OS Vendor String, examples include Linux FreeBSD Windows ... There are AML codes provided in the ACPI namespace written in the following style to determine OSPM interfaces / features: Method(OSCK) { if (CondRefOf(_OSI, Local0)) { if (\_OSI("Windows")) { Return (One) } if (\_OSI("Windows 2006")) { Return (Ones) } Return (Zero) } Return (Zero) } There is a debugging facility implemented in Linux. Users can pass "acpi_osi=" boot parameters to the kernel to tune the _OSI evaluation result so that certain AML codes can be executed. Current implementation includes: 1. 'acpi_osi=' - this makes CondRefOf(_OSI, Local0) TRUE 2. 'acpi_osi="Windows"' - this makes \_OSI("Windows") TRUE 3. 'acpi_osi="!Windows"' - this makes \_OSI("Windows") FALSE The function to implement this feature is also used as a quirk mechanism in the Linux ACPI subystem. When _OSI is evaluatated by the AML codes, ACPICA replies "SUPPORTED" to all Windows operating system vendor strings. This is because Windows operating systems return "SUPPORTED" if the argument to the _OSI method specifies an earlier version of Windows. Please refer to the following MSDN document: How to Identify the Windows Version in ACPI by Using _OSI http://msdn.microsoft.com/en-us/library/hardware/gg463275.aspx This adds difficulties when developers want to feed specific Windows operating system vendor string to the BIOS codes for debugging purpose, multiple acpi_osi="!xxx" have to be specified in the command line to force Linux replying "UNSUPPORTED" to the Windows OS vendor strings listed in the AML codes. Signed-off-by: Lv Zheng <lv.zheng@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-22 08:08:25 +00:00
'acpi_osi=!' can be used in combination with single or
multiple 'acpi_osi="string1"' to support specific OS
vendor string(s). Note that such command can only
affect the default state of the OS vendor strings, thus
it cannot affect the default state of the feature group
strings and the current state of the OS vendor strings,
specifying it multiple times through kernel command line
is meaningless. This command is useful when one do not
care about the state of the feature group strings which
should be controlled by the OSPM.
ACPI: Add facility to disable all _OSI OS vendor strings This patch introduces "acpi_osi=!" command line to force Linux replying "UNSUPPORTED" to all of the _OSI strings. This patch is based on an ACPICA enhancement - the new API acpi_update_interfaces(). The _OSI object provides the platform with the ability to query OSPM to determine the set of ACPI related interfaces, behaviors, or features that the operating system supports. The argument passed to the _OSI is a string like the followings: 1. Feature Group String, examples include Module Device Processor Device 3.0 _SCP Extensions Processor Aggregator Device ... 2. OS Vendor String, examples include Linux FreeBSD Windows ... There are AML codes provided in the ACPI namespace written in the following style to determine OSPM interfaces / features: Method(OSCK) { if (CondRefOf(_OSI, Local0)) { if (\_OSI("Windows")) { Return (One) } if (\_OSI("Windows 2006")) { Return (Ones) } Return (Zero) } Return (Zero) } There is a debugging facility implemented in Linux. Users can pass "acpi_osi=" boot parameters to the kernel to tune the _OSI evaluation result so that certain AML codes can be executed. Current implementation includes: 1. 'acpi_osi=' - this makes CondRefOf(_OSI, Local0) TRUE 2. 'acpi_osi="Windows"' - this makes \_OSI("Windows") TRUE 3. 'acpi_osi="!Windows"' - this makes \_OSI("Windows") FALSE The function to implement this feature is also used as a quirk mechanism in the Linux ACPI subystem. When _OSI is evaluatated by the AML codes, ACPICA replies "SUPPORTED" to all Windows operating system vendor strings. This is because Windows operating systems return "SUPPORTED" if the argument to the _OSI method specifies an earlier version of Windows. Please refer to the following MSDN document: How to Identify the Windows Version in ACPI by Using _OSI http://msdn.microsoft.com/en-us/library/hardware/gg463275.aspx This adds difficulties when developers want to feed specific Windows operating system vendor string to the BIOS codes for debugging purpose, multiple acpi_osi="!xxx" have to be specified in the command line to force Linux replying "UNSUPPORTED" to the Windows OS vendor strings listed in the AML codes. Signed-off-by: Lv Zheng <lv.zheng@intel.com> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-07-22 08:08:25 +00:00
Examples:
1. 'acpi_osi=! acpi_osi="Windows 2000"' is equivalent
to 'acpi_osi="Windows 2000" acpi_osi=!', they all
can make '_OSI("Windows 2000")' TRUE.
'acpi_osi=' cannot be used in combination with other
'acpi_osi=' command lines, the _OSI method will not
exist in the ACPI namespace. NOTE that such command can
only affect the _OSI support state, thus specifying it
multiple times through kernel command line is also
meaningless.
Examples:
1. 'acpi_osi=' can make 'CondRefOf(_OSI, Local1)'
FALSE.
'acpi_osi=!*' can be used in combination with single or
multiple 'acpi_osi="string1"' to support specific
string(s). Note that such command can affect the
current state of both the OS vendor strings and the
feature group strings, thus specifying it multiple times
through kernel command line is meaningful. But it may
still not able to affect the final state of a string if
there are quirks related to this string. This command
is useful when one want to control the state of the
feature group strings to debug BIOS issues related to
the OSPM features.
Examples:
1. 'acpi_osi="Module Device" acpi_osi=!*' can make
'_OSI("Module Device")' FALSE.
2. 'acpi_osi=!* acpi_osi="Module Device"' can make
'_OSI("Module Device")' TRUE.
3. 'acpi_osi=! acpi_osi=!* acpi_osi="Windows 2000"' is
equivalent to
'acpi_osi=!* acpi_osi=! acpi_osi="Windows 2000"'
and
'acpi_osi=!* acpi_osi="Windows 2000" acpi_osi=!',
they all will make '_OSI("Windows 2000")' TRUE.
acpi_pm_good [X86]
Override the pmtimer bug detection: force the kernel
to assume that this machine's pmtimer latches its value
and always returns good values.
acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode
Format: { level | edge | high | low }
acpi_skip_timer_override [HW,ACPI]
Recognize and ignore IRQ0/pin2 Interrupt Override.
For broken nForce2 BIOS resulting in XT-PIC timer.
acpi_sleep= [HW,ACPI] Sleep options
Format: { s3_bios, s3_mode, s3_beep, s4_hwsig,
s4_nohwsig, old_ordering, nonvs,
sci_force_enable, nobl }
See Documentation/power/video.rst for information on
s3_bios and s3_mode.
s3_beep is for debugging; it makes the PC's speaker beep
as soon as the kernel's real-mode entry point is called.
s4_hwsig causes the kernel to check the ACPI hardware
signature during resume from hibernation, and gracefully
refuse to resume if it has changed. This complies with
the ACPI specification but not with reality, since
Windows does not do this and many laptops do change it
on docking. So the default behaviour is to allow resume
and simply warn when the signature changes, unless the
s4_hwsig option is enabled.
s4_nohwsig prevents ACPI hardware signature from being
used (or even warned about) during resume.
old_ordering causes the ACPI 1.0 ordering of the _PTS
control method, with respect to putting devices into
low power states, to be enforced (the ACPI 2.0 ordering
of _PTS is used by default).
nonvs prevents the kernel from saving/restoring the
ACPI NVS memory during suspend/hibernation and resume.
sci_force_enable causes the kernel to set SCI_EN directly
on resume from S1/S3 (which is against the ACPI spec,
but some broken systems don't work without it).
nobl causes the internal blacklist of systems known to
behave incorrectly in some ways with respect to system
suspend and resume to be ignored (use wisely).
acpi_use_timer_override [HW,ACPI]
Use timer override. For some broken Nvidia NF5 boards
that require a timer override, but don't have HPET
add_efi_memmap [EFI; X86] Include EFI memory map in
kernel's map of available physical RAM.
agp= [AGP]
{ off | try_unsupported }
off: disable AGP support
try_unsupported: try to drive unsupported chipsets
(may crash computer or cause data corruption)
ALSA [HW,ALSA]
See Documentation/sound/alsa-configuration.rst
alignment= [KNL,ARM]
Allow the default userspace alignment fault handler
behaviour to be specified. Bit 0 enables warnings,
bit 1 enables fixups, and bit 2 sends a segfault.
align_va_addr= [X86-64]
Align virtual addresses by clearing slice [14:12] when
allocating a VMA at process creation time. This option
gives you up to 3% performance improvement on AMD F15h
machines (where it is enabled by default) for a
CPU-intensive style benchmark, and it can vary highly in
a microbenchmark depending on workload and compiler.
32: only for 32-bit processes
64: only for 64-bit processes
on: enable for both 32- and 64-bit processes
off: disable for both 32- and 64-bit processes
alloc_snapshot [FTRACE]
Allocate the ftrace snapshot buffer on boot up when the
main buffer is allocated. This is handy if debugging
and you need to use tracing_snapshot() on boot up, and
do not want to use tracing_snapshot_alloc() as it needs
to be done where GFP_KERNEL allocations are allowed.
allow_mismatched_32bit_el0 [ARM64]
Allow execve() of 32-bit applications and setting of the
PER_LINUX32 personality on systems where only a strict
subset of the CPUs support 32-bit EL0. When this
parameter is present, the set of CPUs supporting 32-bit
EL0 is indicated by /sys/devices/system/cpu/aarch32_el0
and hot-unplug operations may be restricted.
See Documentation/arch/arm64/asymmetric-32bit.rst for more
information.
amd_iommu= [HW,X86-64]
Pass parameters to the AMD IOMMU driver in the system.
Possible values are:
fullflush - Deprecated, equivalent to iommu.strict=1
off - do not initialize any AMD IOMMU found in
the system
force_isolation - Force device isolation for all
devices. The IOMMU driver is not
allowed anymore to lift isolation
requirements as needed. This option
does not override iommu=pt
force_enable - Force enable the IOMMU on platforms known
to be buggy with IOMMU enabled. Use this
option with care.
pgtbl_v1 - Use v1 page table for DMA-API (Default).
pgtbl_v2 - Use v2 page table for DMA-API.
irtcachedis - Disable Interrupt Remapping Table (IRT) caching.
amd_iommu_dump= [HW,X86-64]
Enable AMD IOMMU driver option to dump the ACPI table
for AMD IOMMU. With this option enabled, AMD IOMMU
driver will print ACPI tables for AMD IOMMU during
IOMMU initialization.
amd_iommu_intr= [HW,X86-64]
Specifies one of the following AMD IOMMU interrupt
remapping modes:
legacy - Use legacy interrupt remapping mode.
vapic - Use virtual APIC mode, which allows IOMMU
to inject interrupts directly into guest.
This mode requires kvm-amd.avic=1.
(Default when IOMMU HW support is present.)
amd_pstate= [X86]
disable
Do not enable amd_pstate as the default
scaling driver for the supported processors
passive
Use amd_pstate with passive mode as a scaling driver.
In this mode autonomous selection is disabled.
Driver requests a desired performance level and platform
tries to match the same performance level if it is
satisfied by guaranteed performance level.
active
Use amd_pstate_epp driver instance as the scaling driver,
driver provides a hint to the hardware if software wants
to bias toward performance (0x0) or energy efficiency (0xff)
to the CPPC firmware. then CPPC power algorithm will
calculate the runtime workload and adjust the realtime cores
frequency.
guided
Activate guided autonomous mode. Driver requests minimum and
maximum performance level and the platform autonomously
selects a performance level in this range and appropriate
to the current workload.
amijoy.map= [HW,JOY] Amiga joystick support
Map of devices attached to JOY0DAT and JOY1DAT
Format: <a>,<b>
See also Documentation/input/joydev/joystick.rst
analog.map= [HW,JOY] Analog joystick and gamepad support
Specifies type or capabilities of an analog joystick
connected to one of 16 gameports
Format: <type1>,<type2>,..<type16>
apc= [HW,SPARC]
Power management functions (SPARCstation-4/5 + deriv.)
Format: noidle
Disable APC CPU standby support. SPARCstation-Fox does
not play well with APC CPU idle - disable it if you have
APC and your system crashes randomly.
apic= [APIC,X86] Advanced Programmable Interrupt Controller
Change the output verbosity while booting
Format: { quiet (default) | verbose | debug }
Change the amount of debugging information output
when initialising the APIC and IO-APIC components.
For X86-32, this can also be used to specify an APIC
driver name.
Format: apic=driver_name
Examples: apic=bigsmp
x86/apic: Introduce apic_extnmi command line parameter This patch introduces a command line parameter apic_extnmi: apic_extnmi=( bsp|all|none ) The default value is "bsp" and this is the current behavior: only the Boot-Strapping Processor receives an external NMI. "all" allows external NMIs to be broadcast to all CPUs. This would raise the success rate of panic on NMI when BSP hangs in NMI context or the external NMI is swallowed by other NMI handlers on the BSP. If you specify "none", no CPUs receive external NMIs. This is useful for the dump capture kernel so that it cannot be shot down by accidentally pressing the external NMI button (on platforms which have it) while saving a crash dump. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Bandan Das <bsd@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: "Maciej W. Rozycki" <macro@linux-mips.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/20151210014632.25437.43778.stgit@softrs Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-14 10:19:12 +00:00
apic_extnmi= [APIC,X86] External NMI delivery setting
Format: { bsp (default) | all | none }
bsp: External NMI is delivered only to CPU 0
all: External NMIs are broadcast to all CPUs as a
backup of CPU 0
none: External NMI is masked for all CPUs. This is
useful so that a dump capture kernel won't be
shot down by NMI
autoconf= [IPV6]
See Documentation/networking/ipv6.rst.
apm= [APM] Advanced Power Management
See header of arch/x86/kernel/apm_32.c.
apparmor= [APPARMOR] Disable or enable AppArmor at boot time
Format: { "0" | "1" }
See security/apparmor/Kconfig help text
0 -- disable.
1 -- enable.
Default value is set via kernel config option.
arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards
Format: <io>,<irq>,<nodeID>
arm64.nobti [ARM64] Unconditionally disable Branch Target
Identification support
arm64.nomops [ARM64] Unconditionally disable Memory Copy and Memory
Set instructions support
arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension
support
arm64.nopauth [ARM64] Unconditionally disable Pointer Authentication
support
arm64.nosme [ARM64] Unconditionally disable Scalable Matrix
Extension support
arm64.nosve [ARM64] Unconditionally disable Scalable Vector
Extension support
ataflop= [HW,M68k]
atarimouse= [HW,MOUSE] Atari Mouse
atkbd.extra= [HW] Enable extra LEDs and keys on IBM RapidAccess,
EzKey and similar keyboards
atkbd.reset= [HW] Reset keyboard during initialization
atkbd.set= [HW] Select keyboard code set
Format: <int> (2 = AT (default), 3 = PS/2)
atkbd.scroll= [HW] Enable scroll wheel on MS Office and similar
keyboards
atkbd.softraw= [HW] Choose between synthetic and real raw mode
Format: <bool> (0 = real, 1 = synthetic (default))
atkbd.softrepeat= [HW]
Use software keyboard repeat
audit= [KNL] Enable the audit sub-system
Format: { "0" | "1" | "off" | "on" }
0 | off - kernel audit is disabled and can not be
enabled until the next reboot
unset - kernel audit is initialized but disabled and
will be fully enabled by the userspace auditd.
1 | on - kernel audit is initialized and partially
enabled, storing at most audit_backlog_limit
messages in RAM until it is fully enabled by the
userspace auditd.
Default: unset
audit: add kernel set-up parameter to override default backlog limit The default audit_backlog_limit is 64. This was a reasonable limit at one time. systemd causes so much audit queue activity on startup that auditd doesn't start before the backlog queue has already overflowed by more than a factor of 2. On a system with audit= not set on the kernel command line, this isn't an issue since that history isn't kept for auditd when it is available. On a system with audit=1 set on the kernel command line, kaudit tries to keep that history until auditd is able to drain the queue. This default can be changed by the "-b" option in audit.rules once the system has booted, but won't help with lost messages on boot. One way to solve this would be to increase the default backlog queue size to avoid losing any messages before auditd is able to consume them. This would be overkill to the embedded community and insufficient for some servers. Another way to solve it might be to add a kconfig option to set the default based on the system type. An embedded system would get the current (or smaller) default, while Workstations might get more than now and servers might get more. None of these solutions helps if a system's compiled default is too small to see the lost messages without compiling a new kernel. This patch adds a kernel set-up parameter (audit already has one to enable/disable it) "audit_backlog_limit=<n>" that overrides the default to allow the system administrator to set the backlog limit. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2013-09-17 16:34:52 +00:00
audit_backlog_limit= [KNL] Set the audit queue size limit.
Format: <int> (must be >=0)
Default: 64
bau= [X86_UV] Enable the BAU on SGI UV. The default
behavior is to disable the BAU (i.e. bau=0).
Format: { "0" | "1" }
0 - Disable the BAU.
1 - Enable the BAU.
unset - Disable the BAU.
baycom_epp= [HW,AX25]
Format: <io>,<mode>
baycom_par= [HW,AX25] BayCom Parallel Port AX.25 Modem
Format: <io>,<mode>
See header of drivers/net/hamradio/baycom_par.c.
baycom_ser_fdx= [HW,AX25]
BayCom Serial Port AX.25 Modem (Full Duplex Mode)
Format: <io>,<irq>,<mode>[,<baud>]
See header of drivers/net/hamradio/baycom_ser_fdx.c.
baycom_ser_hdx= [HW,AX25]
BayCom Serial Port AX.25 Modem (Half Duplex Mode)
Format: <io>,<irq>,<mode>
See header of drivers/net/hamradio/baycom_ser_hdx.c.
bert_disable [ACPI]
Disable BERT OS support on buggy BIOSes.
bgrt_disable [ACPI][X86]
Disable BGRT to avoid flickering OEM logo.
blkdevparts= Manual partition parsing of block device(s) for
embedded devices based on command line input.
See Documentation/block/cmdline-partition.rst
boot_delay= Milliseconds to delay each printk during boot.
Only works if CONFIG_BOOT_PRINTK_DELAY is enabled,
and you may also have to specify "lpj=". Boot_delay
values larger than 10 seconds (10000) are assumed
erroneous and ignored.
Format: integer
bootconfig [KNL]
Extended command line options can be added to an initrd
and this will cause the kernel to look for it.
See Documentation/admin-guide/bootconfig.rst
bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards)
bttv.radio= Most important insmod options are available as
kernel args too.
bttv.pll= See Documentation/admin-guide/media/bttv.rst
bttv.tuner=
bulk_remove=off [PPC] This parameter disables the use of the pSeries
firmware feature for flushing multiple hpte entries
at a time.
c101= [NET] Moxa C101 synchronous serial card
cachesize= [BUGS=X86-32] Override level 2 CPU cache size detection.
Sometimes CPU hardware bugs make them report the cache
size incorrectly. The kernel will attempt work arounds
to fix known problems, but for some CPUs it is not
possible to determine what the correct size should be.
This option provides an override for these situations.
carrier_timeout=
[NET] Specifies amount of time (in seconds) that
the kernel should wait for a network carrier. By default
it waits 120 seconds.
ca_keys= [KEYS] This parameter identifies a specific key(s) on
the system trusted keyring to be used for certificate
trust validation.
format: { id:<keyid> | builtin }
cca= [MIPS] Override the kernel pages' cache coherency
algorithm. Accepted values range from 0 to 7
inclusive. See arch/mips/include/asm/pgtable-bits.h
for platform specific values (SB1, Loongson3 and
others).
ccw_timeout_log [S390]
See Documentation/arch/s390/common_io.rst for details.
cgroup_disable= [KNL] Disable a particular controller or optional feature
Format: {name of the controller(s) or feature(s) to disable}
The effects of cgroup_disable=foo are:
- foo isn't auto-mounted if you mount all cgroups in
a single hierarchy
- foo isn't visible as an individually mountable
subsystem
- if foo is an optional feature then the feature is
disabled and corresponding cgroup files are not
created
{Currently only "memory" controller deal with this and
cut the overhead, others just disable the usage. So
only cgroup_disable=memory is actually worthy}
Specifying "pressure" disables per-cgroup pressure
stall information accounting feature
cgroups: add cgroup support for enabling controllers at boot time The effects of cgroup_disable=foo are: - foo isn't auto-mounted if you mount all cgroups in a single hierarchy - foo isn't visible as an individually mountable subsystem As a result there will only ever be one call to foo->create(), at init time; all processes will stay in this group, and the group will never be mounted on a visible hierarchy. Any additional effects (e.g. not allocating metadata) are up to the foo subsystem. This doesn't handle early_init subsystems (their "disabled" bit isn't set be, but it could easily be extended to do so if any of the early_init systems wanted it - I think it would just involve some nastier parameter processing since it would occur before the command-line argument parser had been run. Hugh said: Ballpark figures, I'm trying to get this question out rather than processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead to the affected paths, booting with cgroup_disable=memory cuts that back to 1% overhead (due to slightly bigger struct page). I'm no expert on distros, they may have no interest whatever in CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or without it, or apply the cgroup_disable=memory patches. Unix bench's execl test result on x86_64 was == just after boot without mounting any cgroup fs.== mem_cgorup=off : Execl Throughput 43.0 3150.1 732.6 mem_cgroup=on : Execl Throughput 43.0 2932.6 682.0 == [lizf@cn.fujitsu.com: fix boot option parsing] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-04 21:29:57 +00:00
cgroup_no_v1= [KNL] Disable cgroup controllers and named hierarchies in v1
Format: { { controller | "all" | "named" }
[,{ controller | "all" | "named" }...] }
Like cgroup_disable, but only applies to cgroup v1;
the blacklisted controllers remain available in cgroup2.
"all" blacklists all controllers and "named" disables
named mounts. Specifying both "all" and "named" disables
all v1 hierarchies.
cgroup_favordynmods= [KNL] Enable or Disable favordynmods.
Format: { "true" | "false" }
Defaults to the value of CONFIG_CGROUP_FAVOR_DYNMODS.
cgroup.memory= [KNL] Pass options to the cgroup memory controller.
Format: <string>
nosocket -- Disable socket memory accounting.
nokmem -- Disable kernel memory accounting.
nobpf -- Disable BPF memory accounting.
checkreqprot= [SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }
See security/selinux/Kconfig help text.
0 -- check protection applied by kernel (includes
any implied execute protection).
1 -- check protection requested by application.
Default value is set via a kernel config option.
Value can be changed at runtime via
/sys/fs/selinux/checkreqprot.
Setting checkreqprot to 1 is deprecated.
cio_ignore= [S390]
See Documentation/arch/s390/common_io.rst for details.
It was a moderately busy cycle for documentation; highlights include: - After a long period of inactivity, the Japanese translations are seeing some much-needed maintenance and updating. - Reworked IOMMU documentation - Some new documentation for static-analysis tools - A new overall structure for the memory-management documentation. This is an LSFMM outcome that, it is hoped, will help encourage developers to fill in the many gaps. Optimism is eternal...but hopefully it will work. - More Chinese translations. Plus the usual typo fixes, updates, etc. -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmKLqZQPHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5YdgQH/2/9+EgQDes93f/+iKtbO23EV67392dwrmXS kYg8lR4948/Q3jzgMloUo6hNOoxXeV/sqmdHu0LjUhFN+BGsp9fFjd/jp0XhWcqA nnc9foGbpmeFPxHeAg2aqV84eeasLoO5lUUm2rNoPBLd6HFV+IYC5R4VZ+w42StB 5bYEOYwHXMvQZXkivZDse82YmvQK3/2rRGTUoFhME/Aap6rFgWJJ+XQcSKA7WmwW OpJqq+FOsjsxHe6IFVy6onzlqgGJM8zM2bLtqedid6yaE3uACcHMb/OyAjp0rdKF BQvaG+d3f7DugABqM6Y1oU75iBtJWWYgGeAm36JtX+3mz2uR/f0= =3UoR -----END PGP SIGNATURE----- Merge tag 'docs-5.19' of git://git.lwn.net/linux Pull documentation updates from Jonathan Corbet: "It was a moderately busy cycle for documentation; highlights include: - After a long period of inactivity, the Japanese translations are seeing some much-needed maintenance and updating. - Reworked IOMMU documentation - Some new documentation for static-analysis tools - A new overall structure for the memory-management documentation. This is an LSFMM outcome that, it is hoped, will help encourage developers to fill in the many gaps. Optimism is eternal...but hopefully it will work. - More Chinese translations. Plus the usual typo fixes, updates, etc" * tag 'docs-5.19' of git://git.lwn.net/linux: (70 commits) docs: pdfdocs: Add space for chapter counts >= 100 in TOC docs/zh_CN: Add dev-tools/gdb-kernel-debugging.rst Chinese translation input: Docs: correct ntrig.rst typo input: Docs: correct atarikbd.rst typos MAINTAINERS: Become the docs/zh_CN maintainer docs/zh_CN: fix devicetree usage-model translation mm,doc: Add new documentation structure Documentation: drop more IDE boot options and ide-cd.rst Documentation/process: use scripts/get_maintainer.pl on patches MAINTAINERS: Add entry for DOCUMENTATION/JAPANESE docs/trans/ja_JP/howto: Don't mention specific kernel versions docs/ja_JP/SubmittingPatches: Request summaries for commit references docs/ja_JP/SubmittingPatches: Add Suggested-by as a standard signature docs/ja_JP/SubmittingPatches: Randy has moved docs/ja_JP/SubmittingPatches: Suggest the use of scripts/get_maintainer.pl docs/ja_JP/SubmittingPatches: Update GregKH links Documentation/sysctl: document max_rcu_stall_to_panic Documentation: add missing angle bracket in cgroup-v2 doc Documentation: dev-tools: use literal block instead of code-block docs/zh_CN: add vm numa translation ...
2022-05-25 18:17:41 +00:00
clearcpuid=X[,X...] [X86]
Disable CPUID feature X for the kernel. See
arch/x86/include/asm/cpufeatures.h for the valid bit
It was a moderately busy cycle for documentation; highlights include: - After a long period of inactivity, the Japanese translations are seeing some much-needed maintenance and updating. - Reworked IOMMU documentation - Some new documentation for static-analysis tools - A new overall structure for the memory-management documentation. This is an LSFMM outcome that, it is hoped, will help encourage developers to fill in the many gaps. Optimism is eternal...but hopefully it will work. - More Chinese translations. Plus the usual typo fixes, updates, etc. -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmKLqZQPHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5YdgQH/2/9+EgQDes93f/+iKtbO23EV67392dwrmXS kYg8lR4948/Q3jzgMloUo6hNOoxXeV/sqmdHu0LjUhFN+BGsp9fFjd/jp0XhWcqA nnc9foGbpmeFPxHeAg2aqV84eeasLoO5lUUm2rNoPBLd6HFV+IYC5R4VZ+w42StB 5bYEOYwHXMvQZXkivZDse82YmvQK3/2rRGTUoFhME/Aap6rFgWJJ+XQcSKA7WmwW OpJqq+FOsjsxHe6IFVy6onzlqgGJM8zM2bLtqedid6yaE3uACcHMb/OyAjp0rdKF BQvaG+d3f7DugABqM6Y1oU75iBtJWWYgGeAm36JtX+3mz2uR/f0= =3UoR -----END PGP SIGNATURE----- Merge tag 'docs-5.19' of git://git.lwn.net/linux Pull documentation updates from Jonathan Corbet: "It was a moderately busy cycle for documentation; highlights include: - After a long period of inactivity, the Japanese translations are seeing some much-needed maintenance and updating. - Reworked IOMMU documentation - Some new documentation for static-analysis tools - A new overall structure for the memory-management documentation. This is an LSFMM outcome that, it is hoped, will help encourage developers to fill in the many gaps. Optimism is eternal...but hopefully it will work. - More Chinese translations. Plus the usual typo fixes, updates, etc" * tag 'docs-5.19' of git://git.lwn.net/linux: (70 commits) docs: pdfdocs: Add space for chapter counts >= 100 in TOC docs/zh_CN: Add dev-tools/gdb-kernel-debugging.rst Chinese translation input: Docs: correct ntrig.rst typo input: Docs: correct atarikbd.rst typos MAINTAINERS: Become the docs/zh_CN maintainer docs/zh_CN: fix devicetree usage-model translation mm,doc: Add new documentation structure Documentation: drop more IDE boot options and ide-cd.rst Documentation/process: use scripts/get_maintainer.pl on patches MAINTAINERS: Add entry for DOCUMENTATION/JAPANESE docs/trans/ja_JP/howto: Don't mention specific kernel versions docs/ja_JP/SubmittingPatches: Request summaries for commit references docs/ja_JP/SubmittingPatches: Add Suggested-by as a standard signature docs/ja_JP/SubmittingPatches: Randy has moved docs/ja_JP/SubmittingPatches: Suggest the use of scripts/get_maintainer.pl docs/ja_JP/SubmittingPatches: Update GregKH links Documentation/sysctl: document max_rcu_stall_to_panic Documentation: add missing angle bracket in cgroup-v2 doc Documentation: dev-tools: use literal block instead of code-block docs/zh_CN: add vm numa translation ...
2022-05-25 18:17:41 +00:00
numbers X. Note the Linux-specific bits are not necessarily
stable over kernel options, but the vendor-specific
ones should be.
It was a moderately busy cycle for documentation; highlights include: - After a long period of inactivity, the Japanese translations are seeing some much-needed maintenance and updating. - Reworked IOMMU documentation - Some new documentation for static-analysis tools - A new overall structure for the memory-management documentation. This is an LSFMM outcome that, it is hoped, will help encourage developers to fill in the many gaps. Optimism is eternal...but hopefully it will work. - More Chinese translations. Plus the usual typo fixes, updates, etc. -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmKLqZQPHGNvcmJldEBs d24ubmV0AAoJEBdDWhNsDH5YdgQH/2/9+EgQDes93f/+iKtbO23EV67392dwrmXS kYg8lR4948/Q3jzgMloUo6hNOoxXeV/sqmdHu0LjUhFN+BGsp9fFjd/jp0XhWcqA nnc9foGbpmeFPxHeAg2aqV84eeasLoO5lUUm2rNoPBLd6HFV+IYC5R4VZ+w42StB 5bYEOYwHXMvQZXkivZDse82YmvQK3/2rRGTUoFhME/Aap6rFgWJJ+XQcSKA7WmwW OpJqq+FOsjsxHe6IFVy6onzlqgGJM8zM2bLtqedid6yaE3uACcHMb/OyAjp0rdKF BQvaG+d3f7DugABqM6Y1oU75iBtJWWYgGeAm36JtX+3mz2uR/f0= =3UoR -----END PGP SIGNATURE----- Merge tag 'docs-5.19' of git://git.lwn.net/linux Pull documentation updates from Jonathan Corbet: "It was a moderately busy cycle for documentation; highlights include: - After a long period of inactivity, the Japanese translations are seeing some much-needed maintenance and updating. - Reworked IOMMU documentation - Some new documentation for static-analysis tools - A new overall structure for the memory-management documentation. This is an LSFMM outcome that, it is hoped, will help encourage developers to fill in the many gaps. Optimism is eternal...but hopefully it will work. - More Chinese translations. Plus the usual typo fixes, updates, etc" * tag 'docs-5.19' of git://git.lwn.net/linux: (70 commits) docs: pdfdocs: Add space for chapter counts >= 100 in TOC docs/zh_CN: Add dev-tools/gdb-kernel-debugging.rst Chinese translation input: Docs: correct ntrig.rst typo input: Docs: correct atarikbd.rst typos MAINTAINERS: Become the docs/zh_CN maintainer docs/zh_CN: fix devicetree usage-model translation mm,doc: Add new documentation structure Documentation: drop more IDE boot options and ide-cd.rst Documentation/process: use scripts/get_maintainer.pl on patches MAINTAINERS: Add entry for DOCUMENTATION/JAPANESE docs/trans/ja_JP/howto: Don't mention specific kernel versions docs/ja_JP/SubmittingPatches: Request summaries for commit references docs/ja_JP/SubmittingPatches: Add Suggested-by as a standard signature docs/ja_JP/SubmittingPatches: Randy has moved docs/ja_JP/SubmittingPatches: Suggest the use of scripts/get_maintainer.pl docs/ja_JP/SubmittingPatches: Update GregKH links Documentation/sysctl: document max_rcu_stall_to_panic Documentation: add missing angle bracket in cgroup-v2 doc Documentation: dev-tools: use literal block instead of code-block docs/zh_CN: add vm numa translation ...
2022-05-25 18:17:41 +00:00
X can also be a string as appearing in the flags: line
in /proc/cpuinfo which does not have the above
instability issue. However, not all features have names
in /proc/cpuinfo.
Note that using this option will taint your kernel.
Also note that user programs calling CPUID directly
or using the feature without checking anything
will still see it. This just prevents it from
being used by the kernel or shown in /proc/cpuinfo.
Also note the kernel might malfunction if you disable
some critical bits.
clk_ignore_unused
[CLK]
Prevents the clock framework from automatically gating
clocks that have not been explicitly enabled by a Linux
device driver but are enabled in hardware at reset or
by the bootloader/firmware. Note that this does not
force such clocks to be always-on nor does it reserve
those clocks in any way. This parameter is useful for
debug and development, but should not be needed on a
platform with proper driver support. For more
information, see Documentation/driver-api/clk.rst.
clock= [BUGS=X86-32, HW] gettimeofday clocksource override.
[Deprecated]
Forces specified clocksource (if available) to be used
when calculating gettimeofday(). If specified
clocksource is not available, it defaults to PIT.
Format: { pit | tsc | cyclone | pmtmr }
clocksource= Override the default clocksource
Format: <string>
Override the default clocksource and use the clocksource
with the name specified.
Some clocksource names to choose from, depending on
the platform:
[all] jiffies (this is the base, fallback clocksource)
[ACPI] acpi_pm
[ARM] imx_timer1,OSTS,netx_timer,mpu_timer2,
pxa_timer,timer3,32k_counter,timer0_1
[X86-32] pit,hpet,tsc;
scx200_hrt on Geode; cyclone on IBM x440
[MIPS] MIPS
[PARISC] cr16
[S390] tod
[SH] SuperH
[SPARC64] tick
[X86-64] hpet,tsc
clocksource.arm_arch_timer.evtstrm=
[ARM,ARM64]
Format: <bool>
Enable/disable the eventstream feature of the ARM
architected timer so that code using WFE-based polling
loops can be debugged more effectively on production
systems.
clocksource: Retry clock read if long delays detected When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. Therefore, re-read the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_cswd_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If more than one retry was required, a message is printed on the console (the occasional single retry is expected behavior, especially in guest OSes). If the maximum number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org
2021-05-27 19:01:19 +00:00
clocksource.max_cswd_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
unstable. Defaults to two retries, that is,
three attempts to read the clock under test.
clocksource: Retry clock read if long delays detected When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. Therefore, re-read the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_cswd_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If more than one retry was required, a message is printed on the console (the occasional single retry is expected behavior, especially in guest OSes). If the maximum number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org
2021-05-27 19:01:19 +00:00
clocksource.verify_n_cpus= [KNL]
Limit the number of CPUs checked for clocksources
marked with CLOCK_SOURCE_VERIFY_PERCPU that
are marked unstable due to excessive skew.
A negative value says to check all CPUs, while
zero says not to check any. Values larger than
nr_cpu_ids are silently truncated to nr_cpu_ids.
The actual CPUs are chosen randomly, with
no replacement if the same CPU is chosen twice.
clocksource: Provide kernel module to test clocksource watchdog When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. It would be good to have a way of testing the clocksource watchdog's ability to distinguish between these two causes of clock skew and instability. Therefore, provide a new clocksource-wdtest module selected by a new TEST_CLOCKSOURCE_WATCHDOG Kconfig option. This module has a single module parameter named "holdoff" that provides the number of seconds of delay before testing should start, which defaults to zero when built as a module and to 10 seconds when built directly into the kernel. Very large systems that boot slowly may need to increase the value of this module parameter. This module uses hand-crafted clocksource structures to do its testing, thus avoiding messing up timing for the rest of the kernel and for user applications. This module first verifies that the ->uncertainty_margin field of the clocksource structures are set sanely. It then tests the delay-detection capability of the clocksource watchdog, increasing the number of consecutive delays injected, first provoking console messages complaining about the delays and finally forcing a clock-skew event. Unexpected test results cause at least one WARN_ON_ONCE() console splat. If there are no splats, the test has passed. Finally, it fuzzes the value returned from a clocksource to test the clocksource watchdog's ability to detect time skew. This module checks the state of its clocksource after each test, and uses WARN_ON_ONCE() to emit a console splat if there are any failures. This should enable all types of test frameworks to detect any such failures. This facility is intended for diagnostic use only, and should be avoided on production systems. Reported-by: Chris Mason <clm@fb.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-5-paulmck@kernel.org
2021-05-27 19:01:23 +00:00
clocksource-wdtest.holdoff= [KNL]
Set the time in seconds that the clocksource
watchdog test waits before commencing its tests.
Defaults to zero when built as a module and to
10 seconds when built into the kernel.
cma=nn[MG]@[start[MG][-end[MG]]]
[KNL,CMA]
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
memory allocations. A value of 0 disables CMA
altogether. For more information, see
kernel/dma/contiguous.c
cma_pernuma=nn[MG]
[KNL,CMA]
Sets the size of kernel per-numa memory area for
contiguous memory allocations. A value of 0 disables
per-numa CMA altogether. And If this option is not
specified, the default value is 0.
With per-numa CMA enabled, DMA users on node nid will
first try to allocate buffer from the pernuma area
which is located in node nid, if the allocation fails,
they will fallback to the global default memory area.
numa_cma=<node>:nn[MG][,<node>:nn[MG]]
[KNL,CMA]
Sets the size of kernel numa memory area for
contiguous memory allocations. It will reserve CMA
area for the specified node.
With numa CMA enabled, DMA users on node nid will
first try to allocate buffer from the numa area
which is located in node nid, if the allocation fails,
they will fallback to the global default memory area.
cmo_free_hint= [PPC] Format: { yes | no }
Specify whether pages are marked as being inactive
when they are freed. This is used in CMO environments
to determine OS memory pressure for page stealing by
a hypervisor.
Default: yes
coherent_pool=nn[KMG] [ARM,KNL]
Sets the size of memory pool for coherent, atomic dma
allocations, by default set to 256K.
com20020= [HW,NET] ARCnet - COM20020 chipset
Format:
<io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]]
com90io= [HW,NET] ARCnet - COM90xx chipset (IO-mapped buffers)
Format: <io>[,<irq>]
com90xx= [HW,NET]
ARCnet - COM90xx chipset (memory-mapped buffers)
Format: <io>[,<irq>[,<memstart>]]
condev= [HW,S390] console device
conmode=
s390/con3215: Drop console data printout when buffer full Using z/VM the 3270 terminal emulator also emulates an IBM 3215 console which outputs line by line. When the screen is full, the console enters the MORE... state and waits for the operator to confirm the data on the screen by pressing a clear key. If this does not happen in the default time frame (currently 50 seconds) the console enters the HOLDING state. It then waits another time frame (currently 10 seconds) before the output continues on the next screen. When the operator presses the clear key during these wait times, the output continues immediately. This may lead to a very long boot time when the console has to print many messages, also the system may hang because of the console's limited buffer space and the system waits for the console output to drain and finally to finish. This problem can only occur when a terminal emulator is actually connected to the 3215 console driver. If not z/VM simply drops console output. Remedy this rare situation and add a kernel boot command line parameter con3215_drop. It can be set to 0 (do not drop) or 1 (do drop) which is the default. This instructs the kernel drop console data when the console buffer is full. This speeds up the boot time considerable and also does not hang the system anymore. Add a sysfs attribute file for console IBM 3215 named con_drop. This allows for changing the behavior after the boot, for example when during interactive debugging a panic/crash is expected. Here is a test of the new behavior using the following test program: #/bin/bash declare -i cnt=4 mode=$(cat /sys/bus/ccw/drivers/3215/con_drop) [ $mode = yes ] && cnt=25 echo "cons_drop $(cat /sys/bus/ccw/drivers/3215/con_drop)" echo "vmcp term more 5 2" vmcp term more 5 2 echo "Run $cnt iterations of "'echo t > /proc/sysrq-trigger' for i in $(seq $cnt) do echo "$i. command 'echo t > /proc/sysrq-trigger' at $(date +%F,%T)" echo t > /proc/sysrq-trigger sleep 1 done echo "droptest done" > /dev/kmsg # Output with sysfs attribute con_drop set to 1: # ./droptest.sh cons_drop yes vmcp term more 5 2 Run 25 iterations of echo t > /proc/sysrq-trigger 1. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:09 2. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:10 3. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:11 4. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:12 5. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:13 6. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:14 7. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:15 8. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:16 9. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:17 10. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:18 11. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:19 12. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:20 13. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:21 14. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:22 15. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:23 16. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:24 17. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:25 18. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:26 19. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:27 20. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:28 21. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:29 22. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:30 23. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:31 24. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:32 25. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:15:33 # There are no hangs anymore. Output with sysfs attribute con_drop set to 0 and identical setting for z/VM console 'term more 5 2'. Sometimes hitting the clear key at the x3270 console to progress output. # ./droptest.sh cons_drop no vmcp term more 5 2 Run 4 iterations of echo t > /proc/sysrq-trigger 1. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:20:58 2. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:24:32 3. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:28:04 4. command 'echo t > /proc/sysrq-trigger' at 2022-09-02,10:31:37 # Details: Enable function raw3215_write() to handle tab expansion and newlines and feed it with input not larger than the console buffer of 65536 bytes. Function raw3125_putchar() just forwards its character for output to raw3215_write(). This moves tab to blank conversion to one function raw3215_write() which also does call raw3215_make_room() to wait for enough free buffer space. Function handle_write() loops over all its input and segments input into chunks of console buffer size (should the input be larger). Rework tab expansion handling logic to avoid code duplication. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
2022-09-20 12:26:16 +00:00
con3215_drop= [S390] 3215 console drop mode.
Format: y|n|Y|N|1|0
When set to true, drop data on the 3215 console when
the console buffer is full. In this case the
operator using a 3270 terminal emulator (for example
x3270) does not have to enter the clear key for the
console output to advance and the kernel to continue.
This leads to a much faster boot time when a 3270
terminal emulator is active. If no 3270 terminal
emulator is used, this parameter has no effect.
console= [KNL] Output console device and options.
tty<n> Use the virtual console device <n>.
ttyS<n>[,options]
ttyUSB0[,options]
Use the specified serial port. The options are of
the form "bbbbpnf", where "bbbb" is the baud rate,
"p" is parity ("n", "o", or "e"), "n" is number of
bits, and "f" is flow control ("r" for RTS or
omit it). Default is "9600n8".
See Documentation/admin-guide/serial-console.rst for more
information. See
Documentation/networking/netconsole.rst for an
alternative.
uart[8250],io,<addr>[,options]
uart[8250],mmio,<addr>[,options]
uart[8250],mmio16,<addr>[,options]
uart[8250],mmio32,<addr>[,options]
uart[8250],0x<addr>[,options]
Start an early, polled-mode console on the 8250/16550
UART at the specified I/O port or MMIO address,
switching to the matching ttyS device later.
MMIO inter-register address stride is either 8-bit
(mmio), 16-bit (mmio16), or 32-bit (mmio32).
If none of [io|mmio|mmio16|mmio32], <addr> is assumed
to be equivalent to 'mmio'. 'options' are specified in
the same format described for ttyS above; if unspecified,
the h/w is not re-initialized.
hvc<n> Use the hypervisor console device <n>. This is for
both Xen and PowerPC hypervisors.
{ null | "" }
Use to disable console output, i.e., to have kernel
console messages discarded.
This must be the only console= parameter used on the
kernel command line.
If the device connected to the port is not a TTY but a braille
device, prepend "brl," before the device type, for instance
console=brl,ttyS0
For now, only VisioBraille is supported.
printk: add console_msg_format command line option 0day and kernelCI automatically parse kernel log - basically some sort of grepping using the pre-defined text patterns - in order to detect and report regressions/errors. There are several sources they get the kernel logs from: a) dmesg or /proc/ksmg This is the preferred way. Because `dmesg --raw' (see later Note) and /proc/kmsg output contains facility and log level, which greatly simplifies grepping for EMERG/ALERT/CRIT/ERR messages. b) serial consoles This option is harder to maintain, because serial console messages don't contain facility and log level. This patch introduces a `console_msg_format=' command line option, to switch between different message formatting on serial consoles. For the time being we have just two options - default and syslog. The "default" option just keeps the existing format. While the "syslog" option makes serial console messages to appear in syslog format [syslog() syscall], matching the `dmesg -S --raw' and `cat /proc/kmsg' output formats: - facility and log level - time stamp (depends on printk_time/PRINTK_TIME) - message <%u>[time stamp] text\n NOTE: while Kevin and Fengguang talk about "dmesg --raw", it's actually "dmesg -S --raw" that always prints messages in syslog format [per Petr Mladek]. Running "dmesg --raw" may produce output in non-syslog format sometimes. console_msg_format=syslog enables syslog format, thus in documentation we mention "dmesg -S --raw", not "dmesg --raw". Per Kevin Hilman: : Right now we can get this info from a "dmesg --raw" after bootup, : but it would be really nice in certain automation frameworks to : have a kernel command-line option to enable printing of loglevels : in default boot log. : : This is especially useful when ingesting kernel logs into advanced : search/analytics frameworks (I'm playing with and ELK stack: Elastic : Search, Logstash, Kibana). : : The other important reason for having this on the command line is that : for testing linux-next (and other bleeding edge developer branches), : it's common that we never make it to userspace, so can't even run : "dmesg --raw" (or equivalent.) So we really want this on the primary : boot (serial) console. Per Fengguang Wu, 0day scripts should quickly benefit from that feature, because they will be able to switch to a more reliable parsing, based on messages' facility and log levels [1]: `#{grep} -a -E -e '^<[0123]>' -e '^kern :(err |crit |alert |emerg )' instead of doing text pattern matching `#{grep} -a -F -f /lkp/printk-error-messages #{kmsg_file} | grep -a -v -E -f #{LKP_SRC}/etc/oops-pattern | grep -a -v -F -f #{LKP_SRC}/etc/kmsg-blacklist` [1] https://github.com/fengguang/lkp-tests/blob/master/lib/dmesg.rb Link: http://lkml.kernel.org/r/20171221054149.4398-1-sergey.senozhatsky@gmail.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Kevin Hilman <khilman@baylibre.com> Cc: Mark Brown <broonie@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: LKML <linux-kernel@vger.kernel.org> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Kevin Hilman <khilman@baylibre.com> Tested-by: Kevin Hilman <khilman@baylibre.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-12-21 05:41:49 +00:00
console_msg_format=
[KNL] Change console messages format
default
By default we print messages on consoles in
"[time stamp] text\n" format (time stamp may not be
printed, depending on CONFIG_PRINTK_TIME or
`printk_time' param).
syslog
Switch to syslog format: "<%u>[time stamp] text\n"
IOW, each message will have a facility and loglevel
prefix. The format is similar to one used by syslog()
syscall, or to executing "dmesg -S --raw" or to reading
from /proc/kmsg.
consoleblank= [KNL] The console blank (screen saver) timeout in
seconds. A value of 0 disables the blank timer.
Defaults to 0.
coredump_filter=
[KNL] Change the default value for
/proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.rst.
coresight_cpu_debug.enable
[ARM,ARM64]
Format: <bool>
Enable/disable the CPU sampling based debugging.
0: default value, disable debugging
1: enable debugging at boot time
cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
Format:
<first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system
cpuidle.governor=
[CPU_IDLE] Name of the cpuidle governor to use.
cpufreq.off=1 [CPU_FREQ]
disable the cpufreq sub-system
cpufreq.default_governor=
[CPU_FREQ] Name of the default cpufreq governor or
policy to use. This governor must be registered in the
kernel before the cpufreq driver probes.
cpu_init_udelay=N
[X86] Delay for N microsec between assert and de-assert
of APIC INIT to start processors. This delay occurs
on every CPU online, such as boot, and resume from suspend.
Default: 10000
cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE There is often significant latency in the early stages of CPU bringup, and time is wasted by waking each CPU (e.g. with SIPI/INIT/INIT on x86) and then waiting for it to respond before moving on to the next. Allow a platform to enable parallel setup which brings all to be onlined CPUs up to the CPUHP_BP_KICK_AP state. While this state advancement on the control CPU (BP) is single-threaded the important part is the last state CPUHP_BP_KICK_AP which wakes the to be onlined CPUs up. This allows the CPUs to run up to the first sychronization point cpuhp_ap_sync_alive() where they wait for the control CPU to release them one by one for the full onlining procedure. This parallelism depends on the CPU hotplug core sync mechanism which ensures that the parallel brought up CPUs wait for release before touching any state which would make the CPU visible to anything outside the hotplug control mechanism. To handle the SMT constraints of X86 correctly the bringup happens in two iterations when CONFIG_HOTPLUG_SMT is enabled. The control CPU brings up the primary SMT threads of each core first, which can load the microcode without the need to rendevouz with the thread siblings. Once that's completed it brings up the secondary SMT threads. Co-developed-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Helge Deller <deller@gmx.de> # parisc Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> # Steam Deck Link: https://lore.kernel.org/r/20230512205257.240231377@linutronix.de
2023-05-12 21:07:50 +00:00
cpuhp.parallel=
[SMP] Enable/disable parallel bringup of secondary CPUs
Format: <bool>
Default is enabled if CONFIG_HOTPLUG_PARALLEL=y. Otherwise
the parameter has no effect.
crash_kexec_post_notifiers
Run kdump after running panic-notifiers and dumping
kmsg. This only for the users who doubt kdump always
succeeds in any situation.
Note that this also increases risks of kdump failure,
because some panic notifiers can make the crashed
kernel more unstable.
crashkernel=size[KMG][@offset[KMG]]
[KNL] Using kexec, Linux can switch to a 'crash kernel'
upon panic. This parameter reserves the physical
memory region [offset, offset + size] for that kernel
image. If '@offset' is omitted, then a suitable offset
x86/kdump: Fall back to reserve high crashkernel memory crashkernel=xM tries to reserve memory for the crash kernel under 4G, which is enough, usually. But this could fail sometimes, for example when one tries to reserve a big chunk like 2G, for example. So let the crashkernel=xM just fall back to use high memory in case it fails to find a suitable low range. Do not set the ,high as default because it allocates extra low memory for DMA buffers and swiotlb, and this is not always necessary for all machines. Typically, crashkernel=128M usually works with low reservation under 4G, so keep <4G as default. [ bp: Massage. ] Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: linux-doc@vger.kernel.org Cc: "Paul E. McKenney" <paulmck@linux.ibm.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: piliu@redhat.com Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sinan Kaya <okaya@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thymo van Beers <thymovanbeers@gmail.com> Cc: vgoyal@redhat.com Cc: x86-ml <x86@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Zhimin Gu <kookoo.gu@intel.com> Link: https://lkml.kernel.org/r/20190422031905.GA8387@dhcp-128-65.nay.redhat.com
2019-04-22 03:19:05 +00:00
is selected automatically.
LoongArch: Use generic interface to support crashkernel=X,[high,low] LoongArch already supports two crashkernel regions in kexec-tools, so we can directly use the common interface to support crashkernel=X,[high,low] after commit 0ab97169aa0517079b ("crash_core: add generic function to do reservation"). With the help of newly changed function parse_crashkernel() and generic reserve_crashkernel_generic(), crashkernel reservation can be simplified by steps: 1) Add a new header file <asm/crash_core.h>, then define CRASH_ALIGN, CRASH_ADDR_LOW_MAX and CRASH_ADDR_HIGH_MAX and in <asm/crash_core.h>; 2) Add arch_reserve_crashkernel() to call parse_crashkernel() and reserve_crashkernel_generic(); 3) Add ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION Kconfig in arch/loongarch/Kconfig. One can reserve the crash kernel from high memory above DMA zone range by explicitly passing "crashkernel=X,high"; or reserve a memory range below 4G with "crashkernel=X,low". Besides, there are few rules need to take notice: 1) "crashkernel=X,[high,low]" will be ignored if "crashkernel=size" is specified. 2) "crashkernel=X,low" is valid only when "crashkernel=X,high" is passed and there is enough memory to be allocated under 4G. 3) When allocating crashkernel above 4G and no "crashkernel=X,low" is specified, a 128M low memory will be allocated automatically for swiotlb bounce buffer. See Documentation/admin-guide/kernel-parameters.txt for more information. Following test cases have been performed as expected: 1) crashkernel=256M //low=256M 2) crashkernel=1G //low=1G 3) crashkernel=4G //high=4G, low=128M(default) 4) crashkernel=4G crashkernel=256M,high //high=4G, low=128M(default), high is ignored 5) crashkernel=4G crashkernel=256M,low //high=4G, low=128M(default), low is ignored 6) crashkernel=4G,high //high=4G, low=128M(default) 7) crashkernel=256M,low //low=0M, invalid 8) crashkernel=4G,high crashkernel=256M,low //high=4G, low=256M 9) crashkernel=4G,high crashkernel=4G,low //high=0M, low=0M, invalid 10) crashkernel=512M@2560M //low=512M 11) crashkernel=1G,high crashkernel=0M,low //high=1G, low=0M Recommended usage in general: 1) In the case of small memory: crashkernel=512M 2) In the case of large memory: crashkernel=1024M,high crashkernel=128M,low Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-01-17 04:43:08 +00:00
[KNL, X86-64, ARM64, RISCV, LoongArch] Select a region
under 4G first, and fall back to reserve region above
4G when '@offset' hasn't been specified.
See Documentation/admin-guide/kdump/kdump.rst for further details.
crashkernel=range1:size1[,range2:size2,...][@offset]
[KNL] Same as above, but depends on the memory
in the running system. The syntax of range is
start-[end] where start and end are both
a memory unit (amount[KMG]). See also
Documentation/admin-guide/kdump/kdump.rst for an example.
crashkernel=size[KMG],high
LoongArch: Use generic interface to support crashkernel=X,[high,low] LoongArch already supports two crashkernel regions in kexec-tools, so we can directly use the common interface to support crashkernel=X,[high,low] after commit 0ab97169aa0517079b ("crash_core: add generic function to do reservation"). With the help of newly changed function parse_crashkernel() and generic reserve_crashkernel_generic(), crashkernel reservation can be simplified by steps: 1) Add a new header file <asm/crash_core.h>, then define CRASH_ALIGN, CRASH_ADDR_LOW_MAX and CRASH_ADDR_HIGH_MAX and in <asm/crash_core.h>; 2) Add arch_reserve_crashkernel() to call parse_crashkernel() and reserve_crashkernel_generic(); 3) Add ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION Kconfig in arch/loongarch/Kconfig. One can reserve the crash kernel from high memory above DMA zone range by explicitly passing "crashkernel=X,high"; or reserve a memory range below 4G with "crashkernel=X,low". Besides, there are few rules need to take notice: 1) "crashkernel=X,[high,low]" will be ignored if "crashkernel=size" is specified. 2) "crashkernel=X,low" is valid only when "crashkernel=X,high" is passed and there is enough memory to be allocated under 4G. 3) When allocating crashkernel above 4G and no "crashkernel=X,low" is specified, a 128M low memory will be allocated automatically for swiotlb bounce buffer. See Documentation/admin-guide/kernel-parameters.txt for more information. Following test cases have been performed as expected: 1) crashkernel=256M //low=256M 2) crashkernel=1G //low=1G 3) crashkernel=4G //high=4G, low=128M(default) 4) crashkernel=4G crashkernel=256M,high //high=4G, low=128M(default), high is ignored 5) crashkernel=4G crashkernel=256M,low //high=4G, low=128M(default), low is ignored 6) crashkernel=4G,high //high=4G, low=128M(default) 7) crashkernel=256M,low //low=0M, invalid 8) crashkernel=4G,high crashkernel=256M,low //high=4G, low=256M 9) crashkernel=4G,high crashkernel=4G,low //high=0M, low=0M, invalid 10) crashkernel=512M@2560M //low=512M 11) crashkernel=1G,high crashkernel=0M,low //high=1G, low=0M Recommended usage in general: 1) In the case of small memory: crashkernel=512M 2) In the case of large memory: crashkernel=1024M,high crashkernel=128M,low Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-01-17 04:43:08 +00:00
[KNL, X86-64, ARM64, RISCV, LoongArch] range could be
above 4G.
Allow kernel to allocate physical memory region from top,
so could be above 4G if system have more than 4G ram
installed. Otherwise memory region will be allocated
below 4G, if available.
It will be ignored if crashkernel=X is specified.
crashkernel=size[KMG],low
LoongArch: Use generic interface to support crashkernel=X,[high,low] LoongArch already supports two crashkernel regions in kexec-tools, so we can directly use the common interface to support crashkernel=X,[high,low] after commit 0ab97169aa0517079b ("crash_core: add generic function to do reservation"). With the help of newly changed function parse_crashkernel() and generic reserve_crashkernel_generic(), crashkernel reservation can be simplified by steps: 1) Add a new header file <asm/crash_core.h>, then define CRASH_ALIGN, CRASH_ADDR_LOW_MAX and CRASH_ADDR_HIGH_MAX and in <asm/crash_core.h>; 2) Add arch_reserve_crashkernel() to call parse_crashkernel() and reserve_crashkernel_generic(); 3) Add ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION Kconfig in arch/loongarch/Kconfig. One can reserve the crash kernel from high memory above DMA zone range by explicitly passing "crashkernel=X,high"; or reserve a memory range below 4G with "crashkernel=X,low". Besides, there are few rules need to take notice: 1) "crashkernel=X,[high,low]" will be ignored if "crashkernel=size" is specified. 2) "crashkernel=X,low" is valid only when "crashkernel=X,high" is passed and there is enough memory to be allocated under 4G. 3) When allocating crashkernel above 4G and no "crashkernel=X,low" is specified, a 128M low memory will be allocated automatically for swiotlb bounce buffer. See Documentation/admin-guide/kernel-parameters.txt for more information. Following test cases have been performed as expected: 1) crashkernel=256M //low=256M 2) crashkernel=1G //low=1G 3) crashkernel=4G //high=4G, low=128M(default) 4) crashkernel=4G crashkernel=256M,high //high=4G, low=128M(default), high is ignored 5) crashkernel=4G crashkernel=256M,low //high=4G, low=128M(default), low is ignored 6) crashkernel=4G,high //high=4G, low=128M(default) 7) crashkernel=256M,low //low=0M, invalid 8) crashkernel=4G,high crashkernel=256M,low //high=4G, low=256M 9) crashkernel=4G,high crashkernel=4G,low //high=0M, low=0M, invalid 10) crashkernel=512M@2560M //low=512M 11) crashkernel=1G,high crashkernel=0M,low //high=1G, low=0M Recommended usage in general: 1) In the case of small memory: crashkernel=512M 2) In the case of large memory: crashkernel=1024M,high crashkernel=128M,low Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-01-17 04:43:08 +00:00
[KNL, X86-64, ARM64, RISCV, LoongArch] range under 4G.
When crashkernel=X,high is passed, kernel could allocate
physical memory region above 4G, that cause second kernel
crash on system that require some amount of low memory,
e.g. swiotlb requires at least 64M+32K low memory, also
enough extra low memory is needed to make sure DMA buffers
for 32-bit devices won't run out. Kernel would try to allocate
default size of memory below 4G automatically. The default
size is platform dependent.
--> x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
--> arm64: 128MiB
--> riscv: 128MiB
LoongArch: Use generic interface to support crashkernel=X,[high,low] LoongArch already supports two crashkernel regions in kexec-tools, so we can directly use the common interface to support crashkernel=X,[high,low] after commit 0ab97169aa0517079b ("crash_core: add generic function to do reservation"). With the help of newly changed function parse_crashkernel() and generic reserve_crashkernel_generic(), crashkernel reservation can be simplified by steps: 1) Add a new header file <asm/crash_core.h>, then define CRASH_ALIGN, CRASH_ADDR_LOW_MAX and CRASH_ADDR_HIGH_MAX and in <asm/crash_core.h>; 2) Add arch_reserve_crashkernel() to call parse_crashkernel() and reserve_crashkernel_generic(); 3) Add ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION Kconfig in arch/loongarch/Kconfig. One can reserve the crash kernel from high memory above DMA zone range by explicitly passing "crashkernel=X,high"; or reserve a memory range below 4G with "crashkernel=X,low". Besides, there are few rules need to take notice: 1) "crashkernel=X,[high,low]" will be ignored if "crashkernel=size" is specified. 2) "crashkernel=X,low" is valid only when "crashkernel=X,high" is passed and there is enough memory to be allocated under 4G. 3) When allocating crashkernel above 4G and no "crashkernel=X,low" is specified, a 128M low memory will be allocated automatically for swiotlb bounce buffer. See Documentation/admin-guide/kernel-parameters.txt for more information. Following test cases have been performed as expected: 1) crashkernel=256M //low=256M 2) crashkernel=1G //low=1G 3) crashkernel=4G //high=4G, low=128M(default) 4) crashkernel=4G crashkernel=256M,high //high=4G, low=128M(default), high is ignored 5) crashkernel=4G crashkernel=256M,low //high=4G, low=128M(default), low is ignored 6) crashkernel=4G,high //high=4G, low=128M(default) 7) crashkernel=256M,low //low=0M, invalid 8) crashkernel=4G,high crashkernel=256M,low //high=4G, low=256M 9) crashkernel=4G,high crashkernel=4G,low //high=0M, low=0M, invalid 10) crashkernel=512M@2560M //low=512M 11) crashkernel=1G,high crashkernel=0M,low //high=1G, low=0M Recommended usage in general: 1) In the case of small memory: crashkernel=512M 2) In the case of large memory: crashkernel=1024M,high crashkernel=128M,low Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-01-17 04:43:08 +00:00
--> loongarch: 128MiB
This one lets the user specify own low range under 4G
for second kernel instead.
0: to disable low allocation.
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
cryptomgr.notests
[KNL] Disable crypto self-tests
cs89x0_dma= [HW,NET]
Format: <dma>
cs89x0_media= [HW,NET]
Format: { rj45 | aui | bnc }
csdlock_debug= [KNL] Enable or disable debug add-ons of cross-CPU
function call handling. When switched on,
additional debug data is printed to the console
in case a hanging CPU is detected, and that
CPU is pinged again in order to try to resolve
the hang situation. The default value of this
option depends on the CSD_LOCK_WAIT_DEBUG_DEFAULT
Kconfig option.
dasd= [HW,NET]
See header of drivers/s390/block/dasd_devmap.c.
db9.dev[2|3]= [HW,JOY] Multisystem joystick support via parallel port
(one device per port)
Format: <port#>,<type>
See also Documentation/input/devices/joystick-parport.rst
debug [KNL] Enable kernel debugging (events log level).
debug_boot_weak_hash
[KNL] Enable printing [hashed] pointers early in the
boot sequence. If enabled, we use a weak hash instead
of siphash to hash pointers. Use this option if you are
seeing instances of '(___ptrval___)') and need to see a
value (hashed pointer) instead. Cryptographically
insecure, please do not use on production kernels.
[PATCH] lockdep: locking API self tests Introduce DEBUG_LOCKING_API_SELFTESTS, which uses the generic lock debugging code's silent-failure feature to run a matrix of testcases. There are 210 testcases currently: +----------------------- | Locking API testsuite: +------------------------------+------+------+------+------+------+------+ | spin |wlock |rlock |mutex | wsem | rsem | -------------------------------+------+------+------+------+------+------+ A-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok | double unlock: ok | ok | ok | ok | ok | ok | bad unlock order: ok | ok | ok | ok | ok | ok | --------------------------------------+------+------+------+------+------+ recursive read-lock: | ok | | ok | --------------------------------------+------+------+------+------+------+ non-nested unlock: ok | ok | ok | ok | --------------------------------------+------+------+------+ hard-irqs-on + irq-safe-A/12: ok | ok | ok | soft-irqs-on + irq-safe-A/12: ok | ok | ok | hard-irqs-on + irq-safe-A/21: ok | ok | ok | soft-irqs-on + irq-safe-A/21: ok | ok | ok | sirq-safe-A => hirqs-on/12: ok | ok | ok | sirq-safe-A => hirqs-on/21: ok | ok | ok | hard-safe-A + irqs-on/12: ok | ok | ok | soft-safe-A + irqs-on/12: ok | ok | ok | hard-safe-A + irqs-on/21: ok | ok | ok | soft-safe-A + irqs-on/21: ok | ok | ok | hard-safe-A + unsafe-B #1/123: ok | ok | ok | soft-safe-A + unsafe-B #1/123: ok | ok | ok | hard-safe-A + unsafe-B #1/132: ok | ok | ok | soft-safe-A + unsafe-B #1/132: ok | ok | ok | hard-safe-A + unsafe-B #1/213: ok | ok | ok | soft-safe-A + unsafe-B #1/213: ok | ok | ok | hard-safe-A + unsafe-B #1/231: ok | ok | ok | soft-safe-A + unsafe-B #1/231: ok | ok | ok | hard-safe-A + unsafe-B #1/312: ok | ok | ok | soft-safe-A + unsafe-B #1/312: ok | ok | ok | hard-safe-A + unsafe-B #1/321: ok | ok | ok | soft-safe-A + unsafe-B #1/321: ok | ok | ok | hard-safe-A + unsafe-B #2/123: ok | ok | ok | soft-safe-A + unsafe-B #2/123: ok | ok | ok | hard-safe-A + unsafe-B #2/132: ok | ok | ok | soft-safe-A + unsafe-B #2/132: ok | ok | ok | hard-safe-A + unsafe-B #2/213: ok | ok | ok | soft-safe-A + unsafe-B #2/213: ok | ok | ok | hard-safe-A + unsafe-B #2/231: ok | ok | ok | soft-safe-A + unsafe-B #2/231: ok | ok | ok | hard-safe-A + unsafe-B #2/312: ok | ok | ok | soft-safe-A + unsafe-B #2/312: ok | ok | ok | hard-safe-A + unsafe-B #2/321: ok | ok | ok | soft-safe-A + unsafe-B #2/321: ok | ok | ok | hard-irq lock-inversion/123: ok | ok | ok | soft-irq lock-inversion/123: ok | ok | ok | hard-irq lock-inversion/132: ok | ok | ok | soft-irq lock-inversion/132: ok | ok | ok | hard-irq lock-inversion/213: ok | ok | ok | soft-irq lock-inversion/213: ok | ok | ok | hard-irq lock-inversion/231: ok | ok | ok | soft-irq lock-inversion/231: ok | ok | ok | hard-irq lock-inversion/312: ok | ok | ok | soft-irq lock-inversion/312: ok | ok | ok | hard-irq lock-inversion/321: ok | ok | ok | soft-irq lock-inversion/321: ok | ok | ok | hard-irq read-recursion/123: ok | soft-irq read-recursion/123: ok | hard-irq read-recursion/132: ok | soft-irq read-recursion/132: ok | hard-irq read-recursion/213: ok | soft-irq read-recursion/213: ok | hard-irq read-recursion/231: ok | soft-irq read-recursion/231: ok | hard-irq read-recursion/312: ok | soft-irq read-recursion/312: ok | hard-irq read-recursion/321: ok | soft-irq read-recursion/321: ok | --------------------------------+-----+---------------- Good, all 210 testcases passed! | --------------------------------+ Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 07:24:48 +00:00
debug_locks_verbose=
[KNL] verbose locking self-tests
Format: <int>
[PATCH] lockdep: locking API self tests Introduce DEBUG_LOCKING_API_SELFTESTS, which uses the generic lock debugging code's silent-failure feature to run a matrix of testcases. There are 210 testcases currently: +----------------------- | Locking API testsuite: +------------------------------+------+------+------+------+------+------+ | spin |wlock |rlock |mutex | wsem | rsem | -------------------------------+------+------+------+------+------+------+ A-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok | double unlock: ok | ok | ok | ok | ok | ok | bad unlock order: ok | ok | ok | ok | ok | ok | --------------------------------------+------+------+------+------+------+ recursive read-lock: | ok | | ok | --------------------------------------+------+------+------+------+------+ non-nested unlock: ok | ok | ok | ok | --------------------------------------+------+------+------+ hard-irqs-on + irq-safe-A/12: ok | ok | ok | soft-irqs-on + irq-safe-A/12: ok | ok | ok | hard-irqs-on + irq-safe-A/21: ok | ok | ok | soft-irqs-on + irq-safe-A/21: ok | ok | ok | sirq-safe-A => hirqs-on/12: ok | ok | ok | sirq-safe-A => hirqs-on/21: ok | ok | ok | hard-safe-A + irqs-on/12: ok | ok | ok | soft-safe-A + irqs-on/12: ok | ok | ok | hard-safe-A + irqs-on/21: ok | ok | ok | soft-safe-A + irqs-on/21: ok | ok | ok | hard-safe-A + unsafe-B #1/123: ok | ok | ok | soft-safe-A + unsafe-B #1/123: ok | ok | ok | hard-safe-A + unsafe-B #1/132: ok | ok | ok | soft-safe-A + unsafe-B #1/132: ok | ok | ok | hard-safe-A + unsafe-B #1/213: ok | ok | ok | soft-safe-A + unsafe-B #1/213: ok | ok | ok | hard-safe-A + unsafe-B #1/231: ok | ok | ok | soft-safe-A + unsafe-B #1/231: ok | ok | ok | hard-safe-A + unsafe-B #1/312: ok | ok | ok | soft-safe-A + unsafe-B #1/312: ok | ok | ok | hard-safe-A + unsafe-B #1/321: ok | ok | ok | soft-safe-A + unsafe-B #1/321: ok | ok | ok | hard-safe-A + unsafe-B #2/123: ok | ok | ok | soft-safe-A + unsafe-B #2/123: ok | ok | ok | hard-safe-A + unsafe-B #2/132: ok | ok | ok | soft-safe-A + unsafe-B #2/132: ok | ok | ok | hard-safe-A + unsafe-B #2/213: ok | ok | ok | soft-safe-A + unsafe-B #2/213: ok | ok | ok | hard-safe-A + unsafe-B #2/231: ok | ok | ok | soft-safe-A + unsafe-B #2/231: ok | ok | ok | hard-safe-A + unsafe-B #2/312: ok | ok | ok | soft-safe-A + unsafe-B #2/312: ok | ok | ok | hard-safe-A + unsafe-B #2/321: ok | ok | ok | soft-safe-A + unsafe-B #2/321: ok | ok | ok | hard-irq lock-inversion/123: ok | ok | ok | soft-irq lock-inversion/123: ok | ok | ok | hard-irq lock-inversion/132: ok | ok | ok | soft-irq lock-inversion/132: ok | ok | ok | hard-irq lock-inversion/213: ok | ok | ok | soft-irq lock-inversion/213: ok | ok | ok | hard-irq lock-inversion/231: ok | ok | ok | soft-irq lock-inversion/231: ok | ok | ok | hard-irq lock-inversion/312: ok | ok | ok | soft-irq lock-inversion/312: ok | ok | ok | hard-irq lock-inversion/321: ok | ok | ok | soft-irq lock-inversion/321: ok | ok | ok | hard-irq read-recursion/123: ok | soft-irq read-recursion/123: ok | hard-irq read-recursion/132: ok | soft-irq read-recursion/132: ok | hard-irq read-recursion/213: ok | soft-irq read-recursion/213: ok | hard-irq read-recursion/231: ok | soft-irq read-recursion/231: ok | hard-irq read-recursion/312: ok | soft-irq read-recursion/312: ok | hard-irq read-recursion/321: ok | soft-irq read-recursion/321: ok | --------------------------------+-----+---------------- Good, all 210 testcases passed! | --------------------------------+ Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 07:24:48 +00:00
Print debugging info while doing the locking API
self-tests.
Bitmask for the various LOCKTYPE_ tests. Defaults to 0
(no extra messages), setting it to -1 (all bits set)
will print _a_lot_ more information - normally only
useful to lockdep developers.
[PATCH] lockdep: locking API self tests Introduce DEBUG_LOCKING_API_SELFTESTS, which uses the generic lock debugging code's silent-failure feature to run a matrix of testcases. There are 210 testcases currently: +----------------------- | Locking API testsuite: +------------------------------+------+------+------+------+------+------+ | spin |wlock |rlock |mutex | wsem | rsem | -------------------------------+------+------+------+------+------+------+ A-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-A deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-A-B-C deadlock: ok | ok | ok | ok | ok | ok | A-B-B-C-C-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-D-D-A deadlock: ok | ok | ok | ok | ok | ok | A-B-C-D-B-C-D-A deadlock: ok | ok | ok | ok | ok | ok | double unlock: ok | ok | ok | ok | ok | ok | bad unlock order: ok | ok | ok | ok | ok | ok | --------------------------------------+------+------+------+------+------+ recursive read-lock: | ok | | ok | --------------------------------------+------+------+------+------+------+ non-nested unlock: ok | ok | ok | ok | --------------------------------------+------+------+------+ hard-irqs-on + irq-safe-A/12: ok | ok | ok | soft-irqs-on + irq-safe-A/12: ok | ok | ok | hard-irqs-on + irq-safe-A/21: ok | ok | ok | soft-irqs-on + irq-safe-A/21: ok | ok | ok | sirq-safe-A => hirqs-on/12: ok | ok | ok | sirq-safe-A => hirqs-on/21: ok | ok | ok | hard-safe-A + irqs-on/12: ok | ok | ok | soft-safe-A + irqs-on/12: ok | ok | ok | hard-safe-A + irqs-on/21: ok | ok | ok | soft-safe-A + irqs-on/21: ok | ok | ok | hard-safe-A + unsafe-B #1/123: ok | ok | ok | soft-safe-A + unsafe-B #1/123: ok | ok | ok | hard-safe-A + unsafe-B #1/132: ok | ok | ok | soft-safe-A + unsafe-B #1/132: ok | ok | ok | hard-safe-A + unsafe-B #1/213: ok | ok | ok | soft-safe-A + unsafe-B #1/213: ok | ok | ok | hard-safe-A + unsafe-B #1/231: ok | ok | ok | soft-safe-A + unsafe-B #1/231: ok | ok | ok | hard-safe-A + unsafe-B #1/312: ok | ok | ok | soft-safe-A + unsafe-B #1/312: ok | ok | ok | hard-safe-A + unsafe-B #1/321: ok | ok | ok | soft-safe-A + unsafe-B #1/321: ok | ok | ok | hard-safe-A + unsafe-B #2/123: ok | ok | ok | soft-safe-A + unsafe-B #2/123: ok | ok | ok | hard-safe-A + unsafe-B #2/132: ok | ok | ok | soft-safe-A + unsafe-B #2/132: ok | ok | ok | hard-safe-A + unsafe-B #2/213: ok | ok | ok | soft-safe-A + unsafe-B #2/213: ok | ok | ok | hard-safe-A + unsafe-B #2/231: ok | ok | ok | soft-safe-A + unsafe-B #2/231: ok | ok | ok | hard-safe-A + unsafe-B #2/312: ok | ok | ok | soft-safe-A + unsafe-B #2/312: ok | ok | ok | hard-safe-A + unsafe-B #2/321: ok | ok | ok | soft-safe-A + unsafe-B #2/321: ok | ok | ok | hard-irq lock-inversion/123: ok | ok | ok | soft-irq lock-inversion/123: ok | ok | ok | hard-irq lock-inversion/132: ok | ok | ok | soft-irq lock-inversion/132: ok | ok | ok | hard-irq lock-inversion/213: ok | ok | ok | soft-irq lock-inversion/213: ok | ok | ok | hard-irq lock-inversion/231: ok | ok | ok | soft-irq lock-inversion/231: ok | ok | ok | hard-irq lock-inversion/312: ok | ok | ok | soft-irq lock-inversion/312: ok | ok | ok | hard-irq lock-inversion/321: ok | ok | ok | soft-irq lock-inversion/321: ok | ok | ok | hard-irq read-recursion/123: ok | soft-irq read-recursion/123: ok | hard-irq read-recursion/132: ok | soft-irq read-recursion/132: ok | hard-irq read-recursion/213: ok | soft-irq read-recursion/213: ok | hard-irq read-recursion/231: ok | soft-irq read-recursion/231: ok | hard-irq read-recursion/312: ok | soft-irq read-recursion/312: ok | hard-irq read-recursion/321: ok | soft-irq read-recursion/321: ok | --------------------------------+-----+---------------- Good, all 210 testcases passed! | --------------------------------+ Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03 07:24:48 +00:00
infrastructure to debug (dynamic) objects We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 07:55:01 +00:00
debug_objects [KNL] Enable object debugging
debug_guardpage_minorder=
[KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
parameter allows control of the order of pages that will
be intentionally kept free (and hence protected) by the
buddy allocator. Bigger value increase the probability
of catching random memory corruption, but reduce the
amount of memory for normal system use. The maximum
possible value is MAX_PAGE_ORDER/2. Setting this
parameter to 1 or 2 should be enough to identify most
random memory corruption problems caused by bugs in
kernel or driver code when a CPU writes to (or reads
from) a random memory location. Note that there exists
a class of memory corruptions problems caused by buggy
H/W or F/W or by drivers badly programming DMA
(basically when memory is written at bus level and the
CPU MMU is bypassed) which are not detectable by
CONFIG_DEBUG_PAGEALLOC, hence this option will not
help tracking down these problems.
debug_pagealloc=
[KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter
enables the feature at boot time. By default, it is
disabled and the system will work mostly the same as a
kernel built without CONFIG_DEBUG_PAGEALLOC.
mm, page_owner, debug_pagealloc: save and dump freeing stack trace The debug_pagealloc functionality is useful to catch buggy page allocator users that cause e.g. use after free or double free. When page inconsistency is detected, debugging is often simpler by knowing the call stack of process that last allocated and freed the page. When page_owner is also enabled, we record the allocation stack trace, but not freeing. This patch therefore adds recording of freeing process stack trace to page owner info, if both page_owner and debug_pagealloc are configured and enabled. With only page_owner enabled, this info is not useful for the memory leak debugging use case. dump_page() is adjusted to print the info. An example result of calling __free_pages() twice may look like this (note the page last free stack trace): BUG: Bad page state in process bash pfn:13d8f8 page:ffffc31984f63e00 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x1affff800000000() raw: 01affff800000000 dead000000000100 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000 page dumped because: nonzero _refcount page_owner tracks the page as freed page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL) prep_new_page+0x143/0x150 get_page_from_freelist+0x289/0x380 __alloc_pages_nodemask+0x13c/0x2d0 khugepaged+0x6e/0xc10 kthread+0xf9/0x130 ret_from_fork+0x3a/0x50 page last free stack trace: free_pcp_prepare+0x134/0x1e0 free_unref_page+0x18/0x90 khugepaged+0x7b/0xc10 kthread+0xf9/0x130 ret_from_fork+0x3a/0x50 Modules linked in: CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x85/0xc0 bad_page.cold+0xba/0xbf rmqueue_pcplist.isra.0+0x6c5/0x6d0 rmqueue+0x2d/0x810 get_page_from_freelist+0x191/0x380 __alloc_pages_nodemask+0x13c/0x2d0 __get_free_pages+0xd/0x30 __pud_alloc+0x2c/0x110 copy_page_range+0x4f9/0x630 dup_mmap+0x362/0x480 dup_mm+0x68/0x110 copy_process+0x19e1/0x1b40 _do_fork+0x73/0x310 __x64_sys_clone+0x75/0x80 do_syscall_64+0x6e/0x1e0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f10af854a10 ... Link: http://lkml.kernel.org/r/20190820131828.22684-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-23 22:34:42 +00:00
Note: to get most of debug_pagealloc error reports, it's
useful to also enable the page_owner functionality.
on: enable the feature
debugfs= [KNL] This parameter enables what is exposed to userspace
and debugfs internal clients.
Format: { on, no-mount, off }
on: All functions are enabled.
no-mount:
Filesystem is not registered but kernel clients can
access APIs and a crashkernel can be used to read
its content. There is nothing to mount.
off: Filesystem is not registered and clients
get a -EPERM as result when trying to register files
or directories within debugfs.
This is equivalent of the runtime functionality if
debugfs was not enabled in the kernel at all.
Default value is set in build-time with a kernel configuration.
debugpat [X86] Enable PAT debugging
default_hugepagesz=
hugetlbfs: clean up command line processing With all hugetlb page processing done in a single file clean up code. - Make code match desired semantics - Update documentation with semantics - Make all warnings and errors messages start with 'HugeTLB:'. - Consistently name command line parsing routines. - Warn if !hugepages_supported() and command line parameters have been specified. - Add comments to code - Describe some of the subtle interactions - Describe semantics of command line arguments This patch also fixes issues with implicitly setting the number of gigantic huge pages to preallocate. Previously on X86 command line, hugepages=2 default_hugepagesz=1G would result in zero 1G pages being preallocated and, # grep HugePages_Total /proc/meminfo HugePages_Total: 0 # sysctl -a | grep nr_hugepages vm.nr_hugepages = 2 vm.nr_hugepages_mempolicy = 2 # cat /proc/sys/vm/nr_hugepages 2 After this patch 2 gigantic pages will be preallocated and all the proc, sysfs, sysctl and meminfo files will accurately reflect this. To address the issue with gigantic pages, a small change in behavior was made to command line processing. Previously the command line, hugepages=128 default_hugepagesz=2M hugepagesz=2M hugepages=256 would result in the allocation of 256 2M huge pages. The value 128 would be ignored without any warning. After this patch, 128 2M pages will be allocated and a warning message will be displayed indicating the value of 256 is ignored. This change in behavior is required because allocation of implicitly specified gigantic pages must be done when the default_hugepagesz= is encountered for gigantic pages. Previously the code waited until later in the boot process (hugetlb_init), to allocate pages of default size. However the bootmem allocator required for gigantic allocations is not available at this time. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sandipan Das <sandipan@linux.ibm.com> Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Acked-by: Will Deacon <will@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Longpeng <longpeng2@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Qian Cai <cai@lca.pw> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200417185049.275845-5-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 23:00:46 +00:00
[HW] The size of the default HugeTLB page. This is
the size represented by the legacy /proc/ hugepages
APIs. In addition, this is the default hugetlb size
used for shmget(), mmap() and mounting hugetlbfs
filesystems. If not specified, defaults to the
architecture's default huge page size. Huge page
sizes are architecture dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
driver core: allow stopping deferred probe after init Deferred probe will currently wait forever on dependent devices to probe, but sometimes a driver will never exist. It's also not always critical for a driver to exist. Platforms can rely on default configuration from the bootloader or reset defaults for things such as pinctrl and power domains. This is often the case with initial platform support until various drivers get enabled. There's at least 2 scenarios where deferred probe can render a platform broken. Both involve using a DT which has more devices and dependencies than the kernel supports. The 1st case is a driver may be disabled in the kernel config. The 2nd case is the kernel version may simply not have the dependent driver. This can happen if using a newer DT (provided by firmware perhaps) with a stable kernel version. Deferred probe issues can be difficult to debug especially if the console has dependencies or userspace fails to boot to a shell. There are also cases like IOMMUs where only built-in drivers are supported, so deferring probe after initcalls is not needed. The IOMMU subsystem implemented its own mechanism to handle this using OF_DECLARE linker sections. This commit adds makes ending deferred probe conditional on initcalls being completed or a debug timeout. Subsystems or drivers may opt-in by calling driver_deferred_probe_check_init_done() instead of unconditionally returning -EPROBE_DEFER. They may use additional information from DT or kernel's config to decide whether to continue to defer probe or not. The timeout mechanism is intended for debug purposes and WARNs loudly. The remaining deferred probe pending list will also be dumped after the timeout. Not that this timeout won't work for the console which needs to be enabled before userspace starts. However, if the console's dependencies are resolved, then the kernel log will be printed (as opposed to no output). Cc: Alexander Graf <agraf@suse.de> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-09 15:41:48 +00:00
deferred_probe_timeout=
[KNL] Debugging option to set a timeout in seconds for
deferred probe to give up waiting on dependencies to
probe. Only specific dependencies (subsystems or
driver core: Extend deferred probe timeout on driver registration The deferred probe timer that's used for this currently starts at late_initcall and runs for driver_deferred_probe_timeout seconds. The assumption being that all available drivers would be loaded and registered before the timer expires. This means, the driver_deferred_probe_timeout has to be pretty large for it to cover the worst case. But if we set the default value for it to cover the worst case, it would significantly slow down the average case. For this reason, the default value is set to 0. Also, with CONFIG_MODULES=y and the current default values of driver_deferred_probe_timeout=0 and fw_devlink=on, devices with missing drivers will cause their consumer devices to always defer their probes. This is because device links created by fw_devlink defer the probe even before the consumer driver's probe() is called. Instead of a fixed timeout, if we extend an unexpired deferred probe timer on every successful driver registration, with the expectation more modules would be loaded in the near future, then the default value of driver_deferred_probe_timeout only needs to be as long as the worst case time difference between two consecutive module loads. So let's implement that and set the default value to 10 seconds when CONFIG_MODULES=y. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Rob Herring <robh@kernel.org> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Will Deacon <will@kernel.org> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Kevin Hilman <khilman@kernel.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Cc: Paul Kocialkowski <paul.kocialkowski@bootlin.com> Cc: linux-gpio@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: iommu@lists.linux-foundation.org Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20220429220933.1350374-1-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-29 22:09:32 +00:00
drivers) that have opted in will be ignored. A timeout
of 0 will timeout at the end of initcalls. If the time
out hasn't expired, it'll be restarted by each
successful driver registration. This option will also
driver core: allow stopping deferred probe after init Deferred probe will currently wait forever on dependent devices to probe, but sometimes a driver will never exist. It's also not always critical for a driver to exist. Platforms can rely on default configuration from the bootloader or reset defaults for things such as pinctrl and power domains. This is often the case with initial platform support until various drivers get enabled. There's at least 2 scenarios where deferred probe can render a platform broken. Both involve using a DT which has more devices and dependencies than the kernel supports. The 1st case is a driver may be disabled in the kernel config. The 2nd case is the kernel version may simply not have the dependent driver. This can happen if using a newer DT (provided by firmware perhaps) with a stable kernel version. Deferred probe issues can be difficult to debug especially if the console has dependencies or userspace fails to boot to a shell. There are also cases like IOMMUs where only built-in drivers are supported, so deferring probe after initcalls is not needed. The IOMMU subsystem implemented its own mechanism to handle this using OF_DECLARE linker sections. This commit adds makes ending deferred probe conditional on initcalls being completed or a debug timeout. Subsystems or drivers may opt-in by calling driver_deferred_probe_check_init_done() instead of unconditionally returning -EPROBE_DEFER. They may use additional information from DT or kernel's config to decide whether to continue to defer probe or not. The timeout mechanism is intended for debug purposes and WARNs loudly. The remaining deferred probe pending list will also be dumped after the timeout. Not that this timeout won't work for the console which needs to be enabled before userspace starts. However, if the console's dependencies are resolved, then the kernel log will be printed (as opposed to no output). Cc: Alexander Graf <agraf@suse.de> Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-09 15:41:48 +00:00
dump out devices still on the deferred probe list after
retrying.
delayacct [KNL] Enable per-task delay accounting
dell_smm_hwmon.ignore_dmi=
[HW] Continue probing hardware even if DMI data
indicates that the driver is running on unsupported
hardware.
dell_smm_hwmon.force=
[HW] Activate driver even if SMM BIOS signature does
not match list of supported models and enable otherwise
blacklisted features.
dell_smm_hwmon.power_status=
[HW] Report power status in /proc/i8k
(disabled by default).
dell_smm_hwmon.restricted=
[HW] Allow controlling fans only if SYS_ADMIN
capability is set.
dell_smm_hwmon.fan_mult=
[HW] Factor to multiply fan speed with.
dell_smm_hwmon.fan_max=
[HW] Maximum configurable fan speed.
dfltcc= [HW,S390]
Format: { on | off | def_only | inf_only | always }
on: s390 zlib hardware support for compression on
level 1 and decompression (default)
off: No s390 zlib hardware support
def_only: s390 zlib hardware support for deflate
only (compression on level 1)
inf_only: s390 zlib hardware support for inflate
only (decompression)
always: Same as 'on' but ignores the selected compression
level always using hardware support (used for debugging)
dhash_entries= [KNL]
Set number of hash buckets for dentry cache.
disable_1tb_segments [PPC]
Disables the use of 1TB hash page table segments. This
causes the kernel to fall back to 256MB segments which
can be useful when debugging issues that require an SLB
miss to occur.
disable= [IPV6]
See Documentation/networking/ipv6.rst.
disable_radix [PPC]
Disable RADIX MMU mode on POWER9
disable_tlbie [PPC]
Disable TLBIE instruction. Currently does not work
with KVM, with HASH MMU, or with coherent accelerators.
x86, apic, kexec: Add disable_cpu_apicid kernel parameter Add disable_cpu_apicid kernel parameter. To use this kernel parameter, specify an initial APIC ID of the corresponding CPU you want to disable. This is mostly used for the kdump 2nd kernel to disable BSP to wake up multiple CPUs without causing system reset or hang due to sending INIT from AP to BSP. Kdump users first figure out initial APIC ID of the BSP, CPU0 in the 1st kernel, for example from /proc/cpuinfo and then set up this kernel parameter for the 2nd kernel using the obtained APIC ID. However, doing this procedure at each boot time manually is awkward, which should be automatically done by user-land service scripts, for example, kexec-tools on fedora/RHEL distributions. This design is more flexible than disabling BSP in kernel boot time automatically in that in kernel boot time we have no choice but referring to ACPI/MP table to obtain initial APIC ID for BSP, meaning that the method is not applicable to the systems without such BIOS tables. One assumption behind this design is that users get initial APIC ID of the BSP in still healthy state and so BSP is uniquely kept in CPU0. Thus, through the kernel parameter, only one initial APIC ID can be specified. In a comparison with disabled_cpu_apicid, we use read_apic_id(), not boot_cpu_physical_apicid, because on some platforms, the variable is modified to the apicid reported as BSP through MP table and this function is executed with the temporarily modified boot_cpu_physical_apicid. As a result, disabled_cpu_apicid kernel parameter doesn't work well for apicids of APs. Fixing the wrong handling of boot_cpu_physical_apicid requires some reviews and tests beyond some platforms and it could take some time. The fix here is a kind of workaround to focus on the main topic of this patch. Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Link: http://lkml.kernel.org/r/20140115064458.1545.38775.stgit@localhost6.localdomain6 Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-01-15 06:44:58 +00:00
disable_cpu_apicid= [X86,APIC,SMP]
Format: <int>
The number of initial APIC ID for the
corresponding CPU to be disabled at boot,
mostly used for the kdump 2nd kernel to
disable BSP to wake up multiple CPUs without
causing system reset or hang due to sending
INIT from AP to BSP.
disable_ddw [PPC/PSERIES]
Disable Dynamic DMA Window support. Use this
to workaround buggy firmware.
disable_ipv6= [IPV6]
See Documentation/networking/ipv6.rst.
disable_mtrr_cleanup [X86]
The kernel tries to adjust MTRR layout from continuous
to discrete, to make X server driver able to add WB
entry later. This parameter disables that.
disable_mtrr_trim [X86, Intel and AMD only]
x86, 32-bit: trim memory not covered by wb mtrrs On some machines, buggy BIOSes don't properly setup WB MTRRs to cover all available RAM, meaning the last few megs (or even gigs) of memory will be marked uncached. Since Linux tends to allocate from high memory addresses first, this causes the machine to be unusably slow as soon as the kernel starts really using memory (i.e. right around init time). This patch works around the problem by scanning the MTRRs at boot and figuring out whether the current end_pfn value (setup by early e820 code) goes beyond the highest WB MTRR range, and if so, trimming it to match. A fairly obnoxious KERN_WARNING is printed too, letting the user know that not all of their memory is available due to a likely BIOS bug. Something similar could be done on i386 if needed, but the boot ordering would be slightly different, since the MTRR code on i386 depends on the boot_cpu_data structure being setup. This patch fixes a bug in the last patch that caused the code to run on non-Intel machines (AMD machines apparently don't need it and it's untested on other non-Intel machines, so best keep it off). Further enhancements and fixes from: Yinghai Lu <Yinghai.Lu@Sun.COM> Andi Kleen <ak@suse.de> Signed-off-by: Jesse Barnes <jesse.barnes@intel.com> Tested-by: Justin Piszcz <jpiszcz@lucidpixels.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 12:33:18 +00:00
By default the kernel will trim any uncacheable
memory out of your available memory pool based on
MTRR settings. This parameter disables that behavior,
possibly causing your machine to run very slowly.
disable_timer_pin_1 [X86]
Disable PIN 1 of APIC timer
Can be useful to work around chipset bugs.
dis_ucode_ldr [X86] Disable the microcode loader.
dma_debug=off If the kernel is compiled with DMA_API_DEBUG support,
this option disables the debugging code at boot.
dma_debug_entries=<number>
This option allows to tune the number of preallocated
entries for DMA-API debugging code. One entry is
required per DMA-API allocation. Use this if the
DMA-API debugging code disables itself because the
architectural default is too low.
dma_debug_driver=<driver_name>
With this option the DMA-API debugging driver
filter feature can be enabled at boot time. Just
pass the driver to filter for as the parameter.
The filter can be disabled or changed to another
driver later using sysfs.
reg_file_data_sampling=
[X86] Controls mitigation for Register File Data
Sampling (RFDS) vulnerability. RFDS is a CPU
vulnerability which may allow userspace to infer
kernel data values previously stored in floating point
registers, vector registers, or integer registers.
RFDS only affects Intel Atom processors.
on: Turns ON the mitigation.
off: Turns OFF the mitigation.
This parameter overrides the compile time default set
by CONFIG_MITIGATION_RFDS. Mitigation cannot be
disabled when other VERW based mitigations (like MDS)
are enabled. In order to disable RFDS mitigation all
VERW based mitigations need to be disabled.
For details see:
Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
driver_async_probe= [KNL]
List of driver names to be probed asynchronously. *
matches with all driver names. If * is specified, the
rest of the listed driver names are those that will NOT
match the *.
Format: <driver_name1>,<driver_name2>...
drm: handle override and firmware EDID at drm_do_get_edid() level Handle debugfs override edid and firmware edid at the low level to transparently and completely replace the real edid. Previously, we practically only used the modes from the override EDID, and none of the other data, such as audio parameters. This change also prevents actual EDID reads when the EDID is to be overridden, but retains the DDC probe. This is useful if the reason for preferring override EDID are problems with reading the data, or corruption of the data. Move firmware EDID loading from helper to core, as the functionality moves to lower level as well. This will result in a change of module parameter from drm_kms_helper.edid_firmware to drm.edid_firmware, which arguably makes more sense anyway. Some future work remains related to override and firmware EDID validation. Like before, no validation is done for override EDID. The firmware EDID is validated separately in the loader. Some unification and deduplication would be in order, to validate all of them at the drm_do_get_edid() level, like "real" EDIDs. v2: move firmware loading to core v3: rebase, commit message refresh Cc: Abdiel Janulgue <abdiel.janulgue@linux.intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Abdiel Janulgue <abdiel.janulgue@linux.intel.com> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Acked-by: Dave Airlie <airlied@gmail.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/1e8a710bcac46e5136c1a7b430074893c81f364a.1505203831.git.jani.nikula@intel.com
2017-09-12 08:19:26 +00:00
drm.edid_firmware=[<connector>:]<file>[,[<connector>:]<file>]
Broken monitors, graphic adapters, KVMs and EDIDless
panels may send no or incorrect EDID data sets.
This parameter allows to specify an EDID data sets
in the /lib/firmware directory that are used instead.
drm: allow loading an EDID as firmware to override broken monitor Broken monitors and/or broken graphic boards may send erroneous or no EDID data. This also applies to broken KVM devices that are unable to correctly forward the EDID data of the connected monitor but invent their own fantasy data. This patch allows to specify an EDID data set to be used instead of probing the monitor for it. It contains built-in data sets of frequently used screen resolutions. In addition, a particular EDID data set may be provided in the /lib/firmware directory and loaded via the firmware interface. The name is passed to the kernel as module parameter of the drm_kms_helper module either when loaded options drm_kms_helper edid_firmware=edid/1280x1024.bin or as kernel commandline parameter drm_kms_helper.edid_firmware=edid/1280x1024.bin It is also possible to restrict the usage of a specified EDID data set to a particular connector. This is done by prepending the name of the connector to the name of the EDID data set using the syntax edid_firmware=[<connector>:]<edid> such as, for example, edid_firmware=DVI-I-1:edid/1920x1080.bin in which case no other connector will be affected. The built-in data sets are Resolution Name -------------------------------- 1024x768 edid/1024x768.bin 1280x1024 edid/1280x1024.bin 1680x1050 edid/1680x1050.bin 1920x1080 edid/1920x1080.bin They are ignored, if a file with the same name is available in the /lib/firmware directory. The built-in EDID data sets are based on standard timings that may not apply to a particular monitor and even crash it. Ideally, EDID data of the connected monitor should be used. They may be obtained through the drm/cardX/cardX-<connector>/edid entry in the /sys/devices PCI directory of a correctly working graphics adapter. It is even possible to specify the name of an EDID data set on-the-fly via the /sys/module interface, e.g. echo edid/myedid.bin >/sys/module/drm_kms_helper/parameters/edid_firmware The new screen mode is considered when the related kernel function is called for the first time after the change. Such calls are made when the X server is started or when the display settings dialog is opened in an already running X server. Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-03-18 21:37:33 +00:00
Generic built-in EDID data sets are used, if one of
edid/1024x768.bin, edid/1280x1024.bin,
edid/1680x1050.bin, or edid/1920x1080.bin is given
and no file with the same name exists. Details and
instructions how to build your own EDID data are
available in Documentation/admin-guide/edid.rst. An EDID
drm: allow loading an EDID as firmware to override broken monitor Broken monitors and/or broken graphic boards may send erroneous or no EDID data. This also applies to broken KVM devices that are unable to correctly forward the EDID data of the connected monitor but invent their own fantasy data. This patch allows to specify an EDID data set to be used instead of probing the monitor for it. It contains built-in data sets of frequently used screen resolutions. In addition, a particular EDID data set may be provided in the /lib/firmware directory and loaded via the firmware interface. The name is passed to the kernel as module parameter of the drm_kms_helper module either when loaded options drm_kms_helper edid_firmware=edid/1280x1024.bin or as kernel commandline parameter drm_kms_helper.edid_firmware=edid/1280x1024.bin It is also possible to restrict the usage of a specified EDID data set to a particular connector. This is done by prepending the name of the connector to the name of the EDID data set using the syntax edid_firmware=[<connector>:]<edid> such as, for example, edid_firmware=DVI-I-1:edid/1920x1080.bin in which case no other connector will be affected. The built-in data sets are Resolution Name -------------------------------- 1024x768 edid/1024x768.bin 1280x1024 edid/1280x1024.bin 1680x1050 edid/1680x1050.bin 1920x1080 edid/1920x1080.bin They are ignored, if a file with the same name is available in the /lib/firmware directory. The built-in EDID data sets are based on standard timings that may not apply to a particular monitor and even crash it. Ideally, EDID data of the connected monitor should be used. They may be obtained through the drm/cardX/cardX-<connector>/edid entry in the /sys/devices PCI directory of a correctly working graphics adapter. It is even possible to specify the name of an EDID data set on-the-fly via the /sys/module interface, e.g. echo edid/myedid.bin >/sys/module/drm_kms_helper/parameters/edid_firmware The new screen mode is considered when the related kernel function is called for the first time after the change. Such calls are made when the X server is started or when the display settings dialog is opened in an already running X server. Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-03-18 21:37:33 +00:00
data set will only be used for a particular connector,
if its name and a colon are prepended to the EDID
name. Each connector may use a unique EDID data
set by separating the files with a comma. An EDID
data set with no connector name will be used for
any connectors not explicitly specified.
drm: allow loading an EDID as firmware to override broken monitor Broken monitors and/or broken graphic boards may send erroneous or no EDID data. This also applies to broken KVM devices that are unable to correctly forward the EDID data of the connected monitor but invent their own fantasy data. This patch allows to specify an EDID data set to be used instead of probing the monitor for it. It contains built-in data sets of frequently used screen resolutions. In addition, a particular EDID data set may be provided in the /lib/firmware directory and loaded via the firmware interface. The name is passed to the kernel as module parameter of the drm_kms_helper module either when loaded options drm_kms_helper edid_firmware=edid/1280x1024.bin or as kernel commandline parameter drm_kms_helper.edid_firmware=edid/1280x1024.bin It is also possible to restrict the usage of a specified EDID data set to a particular connector. This is done by prepending the name of the connector to the name of the EDID data set using the syntax edid_firmware=[<connector>:]<edid> such as, for example, edid_firmware=DVI-I-1:edid/1920x1080.bin in which case no other connector will be affected. The built-in data sets are Resolution Name -------------------------------- 1024x768 edid/1024x768.bin 1280x1024 edid/1280x1024.bin 1680x1050 edid/1680x1050.bin 1920x1080 edid/1920x1080.bin They are ignored, if a file with the same name is available in the /lib/firmware directory. The built-in EDID data sets are based on standard timings that may not apply to a particular monitor and even crash it. Ideally, EDID data of the connected monitor should be used. They may be obtained through the drm/cardX/cardX-<connector>/edid entry in the /sys/devices PCI directory of a correctly working graphics adapter. It is even possible to specify the name of an EDID data set on-the-fly via the /sys/module interface, e.g. echo edid/myedid.bin >/sys/module/drm_kms_helper/parameters/edid_firmware The new screen mode is considered when the related kernel function is called for the first time after the change. Such calls are made when the X server is started or when the display settings dialog is opened in an already running X server. Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Dave Airlie <airlied@redhat.com>
2012-03-18 21:37:33 +00:00
dscc4.setup= [NET]
dt_cpu_ftrs= [PPC]
Format: {"off" | "known"}
Control how the dt_cpu_ftrs device-tree binding is
used for CPU feature discovery and setup (if it
exists).
off: Do not use it, fall back to legacy cpu table.
known: Do not pass through unknown features to guests
or userspace, only those that the kernel is aware of.
x86/efi: Retrieve and assign Apple device properties Apple's EFI drivers supply device properties which are needed to support Macs optimally. They contain vital information which cannot be obtained any other way (e.g. Thunderbolt Device ROM). They're also used to convey the current device state so that OS drivers can pick up where EFI drivers left (e.g. GPU mode setting). There's an EFI driver dubbed "AAPL,PathProperties" which implements a per-device key/value store. Other EFI drivers populate it using a custom protocol. The macOS bootloader /System/Library/CoreServices/boot.efi retrieves the properties with the same protocol. The kernel extension AppleACPIPlatform.kext subsequently merges them into the I/O Kit registry (see ioreg(8)) where they can be queried by other kernel extensions and user space. This commit extends the efistub to retrieve the device properties before ExitBootServices is called. It assigns them to devices in an fs_initcall so that they can be queried with the API in <linux/property.h>. Note that the device properties will only be available if the kernel is booted with the efistub. Distros should adjust their installers to always use the efistub on Macs. grub with the "linux" directive will not work unless the functionality of this commit is duplicated in grub. (The "linuxefi" directive should work but is not included upstream as of this writing.) The custom protocol has GUID 91BD12FE-F6C3-44FB-A5B7-5122AB303AE0 and looks like this: typedef struct { unsigned long version; /* 0x10000 */ efi_status_t (*get) ( IN struct apple_properties_protocol *this, IN struct efi_dev_path *device, IN efi_char16_t *property_name, OUT void *buffer, IN OUT u32 *buffer_len); /* EFI_SUCCESS, EFI_NOT_FOUND, EFI_BUFFER_TOO_SMALL */ efi_status_t (*set) ( IN struct apple_properties_protocol *this, IN struct efi_dev_path *device, IN efi_char16_t *property_name, IN void *property_value, IN u32 property_value_len); /* allocates copies of property name and value */ /* EFI_SUCCESS, EFI_OUT_OF_RESOURCES */ efi_status_t (*del) ( IN struct apple_properties_protocol *this, IN struct efi_dev_path *device, IN efi_char16_t *property_name); /* EFI_SUCCESS, EFI_NOT_FOUND */ efi_status_t (*get_all) ( IN struct apple_properties_protocol *this, OUT void *buffer, IN OUT u32 *buffer_len); /* EFI_SUCCESS, EFI_BUFFER_TOO_SMALL */ } apple_properties_protocol; Thanks to Pedro Vilaça for this blog post which was helpful in reverse engineering Apple's EFI drivers and bootloader: https://reverse.put.as/2016/06/25/apple-efi-firmware-passwords-and-the-scbo-myth/ If someone at Apple is reading this, please note there's a memory leak in your implementation of the del() function as the property struct is freed but the name and value allocations are not. Neither the macOS bootloader nor Apple's EFI drivers check the protocol version, but we do to avoid breakage if it's ever changed. It's been the same since at least OS X 10.6 (2009). The get_all() function conveniently fills a buffer with all properties in marshalled form which can be passed to the kernel as a setup_data payload. The number of device properties is dynamic and can change between a first invocation of get_all() (to determine the buffer size) and a second invocation (to retrieve the actual buffer), hence the peculiar loop which does not finish until the buffer size settles. The macOS bootloader does the same. The setup_data payload is later on unmarshalled in an fs_initcall. The idea is that most buses instantiate devices in "subsys" initcall level and drivers are usually bound to these devices in "device" initcall level, so we assign the properties in-between, i.e. in "fs" initcall level. This assumes that devices to which properties pertain are instantiated from a "subsys" initcall or earlier. That should always be the case since on macOS, AppleACPIPlatformExpert::matchEFIDevicePath() only supports ACPI and PCI nodes and we've fully scanned those buses during "subsys" initcall level. The second assumption is that properties are only needed from a "device" initcall or later. Seems reasonable to me, but should this ever not work out, an alternative approach would be to store the property sets e.g. in a btree early during boot. Then whenever device_add() is called, an EFI Device Path would have to be constructed for the newly added device, and looked up in the btree. That way, the property set could be assigned to the device immediately on instantiation. And this would also work for devices instantiated in a deferred fashion. It seems like this approach would be more complicated and require more code. That doesn't seem justified without a specific use case. For comparison, the strategy on macOS is to assign properties to objects in the ACPI namespace (AppleACPIPlatformExpert::mergeEFIProperties()). That approach is definitely wrong as it fails for devices not present in the namespace: The NHI EFI driver supplies properties for attached Thunderbolt devices, yet on Macs with Thunderbolt 1 only one device level behind the host controller is described in the namespace. Consequently macOS cannot assign properties for chained devices. With Thunderbolt 2 they started to describe three device levels behind host controllers in the namespace but this grossly inflates the SSDT and still fails if the user daisy-chained more than three devices. We copy the property names and values from the setup_data payload to swappable virtual memory and afterwards make the payload available to the page allocator. This is just for the sake of good housekeeping, it wouldn't occupy a meaningful amount of physical memory (4444 bytes on my machine). Only the payload is freed, not the setup_data header since otherwise we'd break the list linkage and we cannot safely update the predecessor's ->next link because there's no locking for the list. The payload is currently not passed on to kexec'ed kernels, same for PCI ROMs retrieved by setup_efi_pci(). This can be added later if there is demand by amending setup_efi_state(). The payload can then no longer be made available to the page allocator of course. Tested-by: Lukas Wunner <lukas@wunner.de> [MacBookPro9,1] Tested-by: Pierre Moreau <pierre.morrow@free.fr> [MacBookPro11,3] Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Andreas Noever <andreas.noever@gmail.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pedro Vilaça <reverser@put.as> Cc: Peter Jones <pjones@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: grub-devel@gnu.org Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20161112213237.8804-9-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-12 21:32:36 +00:00
dump_apple_properties [X86]
Dump name and content of EFI device properties on
x86 Macs. Useful for driver authors to determine
what data is available or for reverse-engineering.
dyndbg[="val"] [KNL,DYNAMIC_DEBUG]
<module>.dyndbg[="val"]
Enable debug messages at boot time. See
Documentation/admin-guide/dynamic-debug-howto.rst
for details.
early_ioremap_debug [KNL]
Enable debug messages in early_ioremap support. This
is useful for tracking down temporary early mappings
which are not unmapped.
earlycon= [KNL] Output early console device and options.
When used with no options, the early console is
determined by stdout-path property in device tree's
chosen node or the ACPI SPCR table if supported by
the platform.
cdns,<addr>[,options]
Start an early, polled-mode console on a Cadence
(xuartps) serial port at the specified address. Only
supported option is baud rate. If baud rate is not
specified, the serial port must already be setup and
configured.
uart[8250],io,<addr>[,options[,uartclk]]
uart[8250],mmio,<addr>[,options[,uartclk]]
uart[8250],mmio32,<addr>[,options[,uartclk]]
uart[8250],mmio32be,<addr>[,options[,uartclk]]
uart[8250],0x<addr>[,options]
Start an early, polled-mode console on the 8250/16550
UART at the specified I/O port or MMIO address.
MMIO inter-register address stride is either 8-bit
(mmio) or 32-bit (mmio32 or mmio32be).
If none of [io|mmio|mmio32|mmio32be], <addr> is assumed
to be equivalent to 'mmio'. 'options' are specified
in the same format described for "console=ttyS<n>"; if
unspecified, the h/w is not initialized. 'uartclk' is
the uart clock frequency; if unspecified, it is set
to 'BASE_BAUD' * 16.
pl011,<addr>
pl011,mmio32,<addr>
Start an early, polled-mode console on a pl011 serial
port at the specified address. The pl011 serial port
must already be setup and configured. Options are not
yet supported. If 'mmio32' is specified, then only
the driver will use only 32-bit accessors to read/write
the device registers.
liteuart,<addr>
Start an early console on a litex serial port at the
specified address. The serial port must already be
setup and configured. Options are not yet supported.
meson,<addr>
Start an early, polled-mode console on a meson serial
port at the specified address. The serial port must
already be setup and configured. Options are not yet
supported.
msm_serial,<addr>
Start an early, polled-mode console on an msm serial
port at the specified address. The serial port
must already be setup and configured. Options are not
yet supported.
msm_serial_dm,<addr>
Start an early, polled-mode console on an msm serial
dm port at the specified address. The serial port
must already be setup and configured. Options are not
yet supported.
owl,<addr>
Start an early, polled-mode console on a serial port
of an Actions Semi SoC, such as S500 or S900, at the
specified address. The serial port must already be
setup and configured. Options are not yet supported.
rda,<addr>
Start an early, polled-mode console on a serial port
of an RDA Micro SoC, such as RDA8810PL, at the
specified address. The serial port must already be
setup and configured. Options are not yet supported.
sbi
Use RISC-V SBI (Supervisor Binary Interface) for early
console.
smh Use ARM semihosting calls for early console.
s3c2410,<addr>
s3c2412,<addr>
s3c2440,<addr>
s3c6400,<addr>
s5pv210,<addr>
exynos4210,<addr>
Use early console provided by serial driver available
on Samsung SoCs, requires selecting proper type and
a correct base address of the selected UART port. The
serial port must already be setup and configured.
Options are not yet supported.
lantiq,<addr>
Start an early, polled-mode console on a lantiq serial
(lqasc) port at the specified address. The serial port
must already be setup and configured. Options are not
yet supported.
lpuart,<addr>
lpuart32,<addr>
Use early console provided by Freescale LP UART driver
found on Freescale Vybrid and QorIQ LS1021A processors.
A valid base address must be provided, and the serial
port must already be setup and configured.
ec_imx21,<addr>
ec_imx6q,<addr>
Start an early, polled-mode, output-only console on the
Freescale i.MX UART at the specified address. The UART
must already be setup and configured.
ar3700_uart,<addr>
Start an early, polled-mode console on the
Armada 3700 serial port at the specified
address. The serial port must already be setup
and configured. Options are not yet supported.
qcom_geni,<addr>
Start an early, polled-mode console on a Qualcomm
Generic Interface (GENI) based serial port at the
specified address. The serial port must already be
setup and configured. Options are not yet supported.
efi/x86: Convert x86 EFI earlyprintk into generic earlycon implementation Move the x86 EFI earlyprintk implementation to a shared location under drivers/firmware and tweak it slightly so we can expose it as an earlycon implementation (which is generic) rather than earlyprintk (which is only implemented for a few architectures) This also involves switching to write-combine mappings by default (which is required on ARM since device mappings lack memory semantics, and so memcpy/memset may not be used on them), and adding support for shared memory framebuffers on cache coherent non-x86 systems (which do not tolerate mismatched attributes). Note that 32-bit ARM does not populate its struct screen_info early enough for earlycon=efifb to work, so it is disabled there. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Alexander Graf <agraf@suse.de> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Jeffrey Hugo <jhugo@codeaurora.org> Cc: Lee Jones <lee.jones@linaro.org> Cc: Leif Lindholm <leif.lindholm@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Jones <pjones@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20190202094119.13230-10-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-02 09:41:18 +00:00
efifb,[options]
Start an early, unaccelerated console on the EFI
memory mapped framebuffer (if available). On cache
coherent non-x86 systems that use system memory for
the framebuffer, pass the 'ram' option so that it is
mapped with the correct attributes.
tty: serial: Add linflexuart driver for S32V234 Introduce support for LINFlex driver, based on: - the version of Freescale LPUART driver after commit b3e3bf2ef2c7 ("Merge 4.0-rc7 into tty-next"); - commit abf1e0a98083 ("tty: serial: fsl_lpuart: lock port on console write"). In this basic version, the driver can be tested using initramfs and relies on the clocks and pin muxing set up by U-Boot. Remarks concerning the earlycon support: - LinFlexD does not allow character transmissions in the INIT mode (see section 47.4.2.1 in the reference manual[1]). Therefore, a mutual exclusion between the first linflex_setup_watermark/linflex_set_termios executions and linflex_earlycon_putchar was employed and the characters normally sent to earlycon during initialization are kept in a buffer and sent afterwards. - Empirically, character transmission is also forbidden within the last 1-2 ms before entering the INIT mode, so we use an explicit timeout (PREINIT_DELAY) between linflex_earlycon_putchar and the first call to linflex_setup_watermark. - U-Boot currently uses the UART FIFO mode, while this driver makes the transition to the buffer mode. Therefore, the earlycon putchar function matches the U-Boot behavior before initializations and the Linux behavior after. [1] https://www.nxp.com/webapp/Download?colCode=S32V234RM Signed-off-by: Stoica Cosmin-Stefan <cosmin.stoica@nxp.com> Signed-off-by: Adrian.Nitu <adrian.nitu@freescale.com> Signed-off-by: Larisa Grigore <Larisa.Grigore@nxp.com> Signed-off-by: Ana Nedelcu <B56683@freescale.com> Signed-off-by: Mihaela Martinas <Mihaela.Martinas@freescale.com> Signed-off-by: Matthew Nunez <matthew.nunez@nxp.com> [stefan-gabriel.mirea@nxp.com: Reduced for upstreaming and implemented earlycon support] Signed-off-by: Stefan-Gabriel Mirea <stefan-gabriel.mirea@nxp.com> Link: https://lore.kernel.org/r/20190809112853.15846-6-stefan-gabriel.mirea@nxp.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-09 11:29:16 +00:00
linflex,<addr>
Use early console provided by Freescale LINFlexD UART
tty: serial: Add linflexuart driver for S32V234 Introduce support for LINFlex driver, based on: - the version of Freescale LPUART driver after commit b3e3bf2ef2c7 ("Merge 4.0-rc7 into tty-next"); - commit abf1e0a98083 ("tty: serial: fsl_lpuart: lock port on console write"). In this basic version, the driver can be tested using initramfs and relies on the clocks and pin muxing set up by U-Boot. Remarks concerning the earlycon support: - LinFlexD does not allow character transmissions in the INIT mode (see section 47.4.2.1 in the reference manual[1]). Therefore, a mutual exclusion between the first linflex_setup_watermark/linflex_set_termios executions and linflex_earlycon_putchar was employed and the characters normally sent to earlycon during initialization are kept in a buffer and sent afterwards. - Empirically, character transmission is also forbidden within the last 1-2 ms before entering the INIT mode, so we use an explicit timeout (PREINIT_DELAY) between linflex_earlycon_putchar and the first call to linflex_setup_watermark. - U-Boot currently uses the UART FIFO mode, while this driver makes the transition to the buffer mode. Therefore, the earlycon putchar function matches the U-Boot behavior before initializations and the Linux behavior after. [1] https://www.nxp.com/webapp/Download?colCode=S32V234RM Signed-off-by: Stoica Cosmin-Stefan <cosmin.stoica@nxp.com> Signed-off-by: Adrian.Nitu <adrian.nitu@freescale.com> Signed-off-by: Larisa Grigore <Larisa.Grigore@nxp.com> Signed-off-by: Ana Nedelcu <B56683@freescale.com> Signed-off-by: Mihaela Martinas <Mihaela.Martinas@freescale.com> Signed-off-by: Matthew Nunez <matthew.nunez@nxp.com> [stefan-gabriel.mirea@nxp.com: Reduced for upstreaming and implemented earlycon support] Signed-off-by: Stefan-Gabriel Mirea <stefan-gabriel.mirea@nxp.com> Link: https://lore.kernel.org/r/20190809112853.15846-6-stefan-gabriel.mirea@nxp.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-08-09 11:29:16 +00:00
serial driver for NXP S32V234 SoCs. A valid base
address must be provided, and the serial port must
already be setup and configured.
earlyprintk= [X86,SH,ARM,M68k,S390]
earlyprintk=vga
earlyprintk=sclp
earlyprintk=xen
earlyprintk=serial[,ttySn[,baudrate]]
earlyprintk=serial[,0x...[,baudrate]]
earlyprintk=ttySn[,baudrate]
earlyprintk=dbgp[debugController#]
x86/earlyprintk: Add a force option for pciserial device The "pciserial" earlyprintk variant helps much on many modern x86 platforms, but unfortunately there are still some platforms with PCI UART devices which have the wrong PCI class code. In that case, the current class code check does not allow for them to be used for logging. Add a sub-option "force" which overrides the class code check and thus the use of such device can be enforced. [ bp: massage formulations. ] Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Stuart R . Anderson" <stuart.r.anderson@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H Peter Anvin <hpa@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kai-Heng Feng <kai.heng.feng@canonical.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thymo van Beers <thymovanbeers@gmail.com> Cc: alan@linux.intel.com Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/20181002164921.25833-1-feng.tang@intel.com
2018-10-02 16:49:21 +00:00
earlyprintk=pciserial[,force],bus:device.function[,baudrate]
earlyprintk=xdbc[xhciController#]
earlyprintk=bios
earlyprintk is useful when the kernel crashes before
the normal console is initialized. It is not enabled by
default because it has some cosmetic problems.
Append ",keep" to not disable it when the real console
takes over.
Only one of vga, serial, or usb debug port can
be used at a time.
Currently only ttyS0 and ttyS1 may be specified by
name. Other I/O ports may be explicitly specified
on some architectures (x86 and arm at least) by
replacing ttySn with an I/O port address, like this:
earlyprintk=serial,0x1008,115200
You can find the port for a given device in
/proc/tty/driver/serial:
2: uart:ST16650V2 port:00001008 irq:18 ...
Interaction with the standard serial driver is not
very good.
The VGA output is eventually overwritten by
the real console.
The xen option can only be used in Xen domains.
The sclp output can only be used on s390.
The bios output can only be used on SuperH.
x86/earlyprintk: Add a force option for pciserial device The "pciserial" earlyprintk variant helps much on many modern x86 platforms, but unfortunately there are still some platforms with PCI UART devices which have the wrong PCI class code. In that case, the current class code check does not allow for them to be used for logging. Add a sub-option "force" which overrides the class code check and thus the use of such device can be enforced. [ bp: massage formulations. ] Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Stuart R . Anderson" <stuart.r.anderson@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H Peter Anvin <hpa@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kai-Heng Feng <kai.heng.feng@canonical.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thymo van Beers <thymovanbeers@gmail.com> Cc: alan@linux.intel.com Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/20181002164921.25833-1-feng.tang@intel.com
2018-10-02 16:49:21 +00:00
The optional "force" to "pciserial" enables use of a
PCI device even when its classcode is not of the
UART class.
edac_report= [HW,EDAC] Control how to report EDAC event
Format: {"on" | "off" | "force"}
on: enable EDAC to report H/W event. May be overridden
by other higher priority error reporting module.
off: disable H/W event reporting through EDAC.
force: enforce the use of EDAC to report H/W event.
default: on.
edd= [EDD]
Format: {"off" | "on" | "skip[mbr]"}
efi= [EFI]
Format: { "debug", "disable_early_pci_dma",
"nochunk", "noruntime", "nosoftreserve",
"novamap", "no_disable_early_pci_dma" }
debug: enable misc debug output.
disable_early_pci_dma: disable the busmaster bit on all
PCI bridges while in the EFI boot stub.
nochunk: disable reading files in "chunks" in the EFI
boot stub, as chunking can cause problems with some
firmware implementations.
noruntime : disable EFI runtime services support
nosoftreserve: The EFI_MEMORY_SP (Specific Purpose)
attribute may cause the kernel to reserve the
memory range for a memory mapping driver to
claim. Specify efi=nosoftreserve to disable this
reservation and treat the memory by its base type
(i.e. EFI_CONVENTIONAL_MEMORY / "System RAM").
novamap: do not call SetVirtualAddressMap().
efi: Allow disabling PCI busmastering on bridges during boot Add an option to disable the busmaster bit in the control register on all PCI bridges before calling ExitBootServices() and passing control to the runtime kernel. System firmware may configure the IOMMU to prevent malicious PCI devices from being able to attack the OS via DMA. However, since firmware can't guarantee that the OS is IOMMU-aware, it will tear down IOMMU configuration when ExitBootServices() is called. This leaves a window between where a hostile device could still cause damage before Linux configures the IOMMU again. If CONFIG_EFI_DISABLE_PCI_DMA is enabled or "efi=disable_early_pci_dma" is passed on the command line, the EFI stub will clear the busmaster bit on all PCI bridges before ExitBootServices() is called. This will prevent any malicious PCI devices from being able to perform DMA until the kernel reenables busmastering after configuring the IOMMU. This option may cause failures with some poorly behaved hardware and should not be enabled without testing. The kernel commandline options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma" may be used to override the default. Note that PCI devices downstream from PCI bridges are disconnected from their drivers first, using the UEFI driver model API, so that DMA can be disabled safely at the bridge level. [ardb: disconnect PCI I/O handles first, as suggested by Arvind] Co-developed-by: Matthew Garrett <mjg59@google.com> Signed-off-by: Matthew Garrett <mjg59@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arvind Sankar <nivedita@alum.mit.edu> Cc: Matthew Garrett <matthewgarrett@google.com> Cc: linux-efi@vger.kernel.org Link: https://lkml.kernel.org/r/20200103113953.9571-18-ardb@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-03 11:39:50 +00:00
no_disable_early_pci_dma: Leave the busmaster bit set
on all PCI bridges while in the EFI boot stub
efi_no_storage_paranoia [EFI; X86]
Using this parameter you can use more than 50% of
your efi variable storage. Use this parameter only if
you are really sure that your UEFI does sane gc and
fulfills the spec otherwise your board may brick.
efi_fake_mem= nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI; X86]
Add arbitrary attribute to specific memory range by
updating original EFI memory map.
Region of memory which aa attribute is added to is
from ss to ss+nn.
If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000
is specified, EFI_MEMORY_MORE_RELIABLE(0x10000)
attribute is added to range 0x100000000-0x180000000 and
0x10a0000000-0x1120000000.
If efi_fake_mem=8G@9G:0x40000 is specified, the
EFI_MEMORY_SP(0x40000) attribute is added to
range 0x240000000-0x43fffffff.
Using this parameter you can do debugging of EFI memmap
related features. For example, you can do debugging of
Address Range Mirroring feature even if your box
doesn't support it, or mark specific memory as
"soft reserved".
efivar_ssdt= [EFI; X86] Name of an EFI variable that contains an SSDT
that is to be dynamically loaded by Linux. If there are
multiple variables with the same name but with different
vendor GUIDs, all of them will be loaded. See
Documentation/admin-guide/acpi/ssdt-overlays.rst for details.
eisa_irq_edge= [PARISC,HW]
See header of drivers/parisc/eisa.c.
ekgdboc= [X86,KGDB] Allow early kernel console debugging
Format: ekgdboc=kbd
This is designed to be used in conjunction with
the boot argument: earlyprintk=vga
This parameter works in place of the kgdboc parameter
but can only be used if the backing tty is available
very early in the boot process. For early debugging
via a serial port see kgdboc_earlycon instead.
elanfreq= [X86-32]
See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
elfcorehdr=[size[KMG]@]offset[KMG] [PPC,SH,X86,S390]
Specifies physical address of start of kernel core
image elf header and optionally the size. Generally
kexec loader will pass this option to capture kernel.
See Documentation/admin-guide/kdump/kdump.rst for details.
enable_mtrr_cleanup [X86]
The kernel tries to adjust MTRR layout from continuous
to discrete, to make X server driver able to add WB
entry later. This parameter enables that.
enable_timer_pin_1 [X86]
Enable PIN 1 of APIC timer
Can be useful to work around chipset bugs
(in particular on some ATI chipsets).
The kernel tries to set a reasonable default.
enforcing= [SELINUX] Set initial enforcing status.
Format: {"0" | "1"}
See security/selinux/Kconfig help text.
0 -- permissive (log only, no denials).
1 -- enforcing (deny and log).
Default value is 0.
Value can be changed at runtime via
/sys/fs/selinux/enforce.
erst_disable [ACPI]
Disable Error Record Serialization Table (ERST)
support.
ether= [HW,NET] Ethernet cards parameters
This option is obsoleted by the "netdev=" option, which
has equivalent usage. See its documentation for details.
evm= [EVM]
Format: { "fix" }
Permit 'security.evm' to be updated regardless of
current integrity status.
early_page_ext [KNL] Enforces page_ext initialization to earlier
stages so cover more early boot allocations.
Please note that as side effect some optimizations
might be disabled to achieve that (e.g. parallelized
memory initialization is disabled) so the boot process
might take longer, especially on systems with a lot of
memory. Available with CONFIG_PAGE_EXTENSION=y.
failslab=
lib, include/linux: add usercopy failure capability Patch series "add fault injection to user memory access", v3. The goal of this series is to improve testing of fault-tolerance in usages of user memory access functions, by adding support for fault injection. syzkaller/syzbot are using the existing fault injection modes and will use this particular feature also. The first patch adds failure injection capability for usercopy functions. The second changes usercopy functions to use this new failure capability (copy_from_user, ...). The third patch adds get/put/clear_user failures to x86. This patch (of 3): Add a failure injection capability to improve testing of fault-tolerance in usages of user memory access functions. Add CONFIG_FAULT_INJECTION_USERCOPY to enable faults in usercopy functions. The should_fail_usercopy function is to be called by these functions (copy_from_user, get_user, ...) in order to fail or not. Signed-off-by: Albert van der Linde <alinde@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20200831171733.955393-1-alinde@google.com Link: http://lkml.kernel.org/r/20200831171733.955393-2-alinde@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 03:13:46 +00:00
fail_usercopy=
fail_page_alloc=
fail_make_request=[KNL]
General fault injection mechanism.
Format: <interval>,<probability>,<space>,<times>
See also Documentation/fault-injection/.
fb_tunnels= [NET]
Format: { initns | none }
See Documentation/admin-guide/sysctl/net.rst for
fb_tunnels_only_for_init_ns
floppy= [HW]
See Documentation/admin-guide/blockdev/floppy.rst.
forcepae [X86-32]
Forcefully enable Physical Address Extension (PAE).
Many Pentium M systems disable PAE but may have a
functionally usable PAE implementation.
Warning: use of this parameter will taint the kernel
and may cause unknown problems.
ftrace=[tracer]
[FTRACE] will set and start the specified tracer
as early as possible in order to facilitate early
boot debugging.
ftrace_boot_snapshot
[FTRACE] On boot up, a snapshot will be taken of the
ftrace ring buffer that can be read at:
/sys/kernel/tracing/snapshot.
This is useful if you need tracing information from kernel
boot up that is likely to be overridden by user space
start up functionality.
Optionally, the snapshot can also be defined for a tracing
instance that was created by the trace_instance= command
line parameter.
trace_instance=foo,sched_switch ftrace_boot_snapshot=foo
The above will cause the "foo" tracing instance to trigger
a snapshot at the end of boot up.
ftrace_dump_on_oops[=orig_cpu]
[FTRACE] will dump the trace buffers on oops.
If no parameter is passed, ftrace will dump
buffers of all CPUs, but if you pass orig_cpu, it will
dump only the buffer of the CPU that triggered the
oops.
ftrace_filter=[function-list]
[FTRACE] Limit the functions traced by the function
tracer at boot up. function-list is a comma-separated
list of functions. This list can be changed at run
time by the set_ftrace_filter file in the debugfs
tracing directory.
ftrace_notrace=[function-list]
[FTRACE] Do not trace the functions specified in
function-list. This list can be changed at run time
by the set_ftrace_notrace file in the debugfs
tracing directory.
ftrace_graph_filter=[function-list]
[FTRACE] Limit the top level callers functions traced
by the function graph tracer at boot up.
function-list is a comma-separated list of functions
that can be changed at run time by the
set_graph_function file in the debugfs tracing directory.
ftrace_graph_notrace=[function-list]
[FTRACE] Do not trace from the functions specified in
function-list. This list is a comma-separated list of
functions that can be changed at run time by the
set_graph_notrace file in the debugfs tracing directory.
ftrace_graph_max_depth=<uint>
[FTRACE] Used with the function graph tracer. This is
the max depth it will trace into a function. This value
can be changed at run time by the max_graph_depth file
in the tracefs tracing directory. default: 0 (no limit)
fw_devlink= [KNL] Create device links between consumer and supplier
devices by scanning the firmware to infer the
consumer/supplier relationships. This feature is
especially useful when drivers are loaded as modules as
it ensures proper ordering of tasks like device probing
(suppliers first, then consumers), supplier boot state
clean up (only after all consumers have probed),
suspend/resume & runtime PM (consumers first, then
suppliers).
Format: { off | permissive | on | rpm }
off -- Don't create device links from firmware info.
permissive -- Create device links from firmware info
but use it only for ordering boot state clean
up (sync_state() calls).
on -- Create device links from firmware info and use it
to enforce probe and suspend/resume ordering.
rpm -- Like "on", but also use to order runtime PM.
fw_devlink.strict=<bool>
[KNL] Treat all inferred dependencies as mandatory
dependencies. This only applies for fw_devlink=on|rpm.
Format: <bool>
fw_devlink.sync_state =
[KNL] When all devices that could probe have finished
probing, this parameter controls what to do with
devices that haven't yet received their sync_state()
calls.
Format: { strict | timeout }
strict -- Default. Continue waiting on consumers to
probe successfully.
timeout -- Give up waiting on consumers and call
sync_state() on any devices that haven't yet
received their sync_state() calls after
deferred_probe_timeout has expired or by
late_initcall() if !CONFIG_MODULES.
gamecon.map[2|3]=
[HW,JOY] Multisystem joystick and NES/SNES/PSX pad
support via parallel port (up to 5 devices per port)
Format: <port#>,<pad1>,<pad2>,<pad3>,<pad4>,<pad5>
See also Documentation/input/devices/joystick-parport.rst
gamma= [HW,DRM]
gart_fix_e820= [X86-64] disable the fix e820 for K8 GART
x86: disable the GART early, 64-bit For K8 system: 4G RAM with memory hole remapping enabled, or more than 4G RAM installed. when try to use kexec second kernel, and the first doesn't include gart_shutdown. the second kernel could have different aper position than the first kernel. and second kernel could use that hole as RAM that is still used by GART set by the first kernel. esp. when try to kexec 2.6.24 with sparse mem enable from previous kernel (from RHEL 5 or SLES 10). the new kernel will use aper by GART (set by first kernel) for vmemmap. and after new kernel setting one new GART. the position will be real RAM. the _mapcount set is lost. Bad page state in process 'swapper' page:ffffe2000e600020 flags:0x0000000000000000 mapping:0000000000000000 mapcount:1 count:0 Trying to fix it up, but a reboot is needed Backtrace: Pid: 0, comm: swapper Not tainted 2.6.24-rc7-smp-gcdf71a10-dirty #13 Call Trace: [<ffffffff8026401f>] bad_page+0x63/0x8d [<ffffffff80264169>] __free_pages_ok+0x7c/0x2a5 [<ffffffff80ba75d1>] free_all_bootmem_core+0xd0/0x198 [<ffffffff80ba3a42>] numa_free_all_bootmem+0x3b/0x76 [<ffffffff80ba3461>] mem_init+0x3b/0x152 [<ffffffff80b959d3>] start_kernel+0x236/0x2c2 [<ffffffff80b9511a>] _sinittext+0x11a/0x121 and [ffffe2000e600000-ffffe2000e7fffff] PMD ->ffff81001c200000 on node 0 phys addr is : 0x1c200000 RHEL 5.1 kernel -53 said: PCI-DMA: aperture base @ 1c000000 size 65536 KB new kernel said: Mapping aperture over 65536 KB of RAM @ 3c000000 So could try to disable that GART if possible. According to Ingo > hm, i'm wondering, instead of modifying the GART, why dont we simply > _detect_ whatever GART settings we have inherited, and propagate that > into our e820 maps? I.e. if there's inconsistency, then punch that out > from the memory maps and just dont use that memory. > > that way it would not matter whether the GART settings came from a [old > or crashing] Linux kernel that has not called gart_iommu_shutdown(), or > whether it's a BIOS that has set up an aperture hole inconsistent with > the memory map it passed. (or the memory map we _think_ i tried to pass > us) > > it would also be more robust to only read and do a memory map quirk > based on that, than actively trying to change the GART so early in the > bootup. Later on we have to re-enable the GART _anyway_ and have to > punch a hole for it. > > and as a bonus, we would have shored up our defenses against crappy > BIOSes as well. add e820 modification for gart inconsistent setting. gart_fix_e820=off could be used to disable e820 fix. Signed-off-by: Yinghai Lu <yinghai.lu@sun.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 12:33:09 +00:00
Format: off | on
default: on
x86/speculation: Add Gather Data Sampling mitigation Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to *DISABLE* it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and *STAY* enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-07-13 02:43:11 +00:00
gather_data_sampling=
[X86,INTEL] Control the Gather Data Sampling (GDS)
mitigation.
Gather Data Sampling is a hardware vulnerability which
allows unprivileged speculative access to data which was
previously stored in vector registers.
This issue is mitigated by default in updated microcode.
The mitigation may have a performance impact but can be
disabled. On systems without the microcode mitigation
disabling AVX serves as a mitigation.
force: Disable AVX to mitigate systems without
microcode mitigation. No effect if the microcode
mitigation is present. Known to cause crashes in
userspace with buggy AVX enumeration.
x86/speculation: Add Gather Data Sampling mitigation Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to *DISABLE* it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and *STAY* enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-07-13 02:43:11 +00:00
off: Disable GDS mitigation.
gcov: add gcov profiling infrastructure Enable the use of GCC's coverage testing tool gcov [1] with the Linux kernel. gcov may be useful for: * debugging (has this code been reached at all?) * test improvement (how do I change my test to cover these lines?) * minimizing kernel configurations (do I need this option if the associated code is never run?) The profiling patch incorporates the following changes: * change kbuild to include profiling flags * provide functions needed by profiling code * present profiling data as files in debugfs Note that on some architectures, enabling gcc's profiling option "-fprofile-arcs" for the entire kernel may trigger compile/link/ run-time problems, some of which are caused by toolchain bugs and others which require adjustment of architecture code. For this reason profiling the entire kernel is initially restricted to those architectures for which it is known to work without changes. This restriction can be lifted once an architecture has been tested and found compatible with gcc's profiling. Profiling of single files or directories is still available on all platforms (see config help text). [1] http://gcc.gnu.org/onlinedocs/gcc/Gcov.html Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Li Wei <W.Li@Sun.COM> Cc: Michael Ellerman <michaele@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com> Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-17 23:28:08 +00:00
gcov_persist= [GCOV] When non-zero (default), profiling data for
kernel modules is saved and remains accessible via
debugfs, even when the module is unloaded/reloaded.
When zero, profiling data is discarded and associated
debugfs files are removed at module unload time.
goldfish [X86] Enable the goldfish android emulator platform.
Don't use this when you are not running on the
android emulator
gpio-mockup.gpio_mockup_ranges
[HW] Sets the ranges of gpiochip of for this device.
Format: <start1>,<end1>,<start2>,<end2>...
gpio-mockup.gpio_mockup_named_lines
[HW] Let the driver know GPIO lines should be named.
gpt [EFI] Forces disk with valid GPT signature but
invalid Protective MBR to be treated as GPT. If the
primary GPT is corrupted, it enables the backup/alternate
GPT to be used instead.
grcan.enable0= [HW] Configuration of physical interface 0. Determines
the "Enable 0" bit of the configuration register.
Format: 0 | 1
Default: 0
grcan.enable1= [HW] Configuration of physical interface 1. Determines
the "Enable 0" bit of the configuration register.
Format: 0 | 1
Default: 0
grcan.select= [HW] Select which physical interface to use.
Format: 0 | 1
Default: 0
grcan.txsize= [HW] Sets the size of the tx buffer.
Format: <unsigned int> such that (txsize & ~0x1fffc0) == 0.
Default: 1024
grcan.rxsize= [HW] Sets the size of the rx buffer.
Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0.
Default: 1024
hardened_usercopy=
[KNL] Under CONFIG_HARDENED_USERCOPY, whether
hardening is enabled for this boot. Hardened
usercopy checking is used to protect the kernel
from reading or writing beyond known memory
allocation boundaries as a proactive defense
against bounds-checking flaws in the kernel's
copy_to_user()/copy_from_user() interface.
on Perform hardened usercopy checks (default).
off Disable hardened usercopy checks.
hardlockup_all_cpu_backtrace=
[KNL] Should the hard-lockup detector generate
backtraces on all cpus.
Format: 0 | 1
hashdist= [KNL,NUMA] Large hashes allocated during boot
are distributed across NUMA nodes. Defaults on
for 64-bit NUMA, off otherwise.
Format: 0 | 1 (for off | on)
hcl= [IA-64] SGI's Hardware Graph compatibility layer
hd= [EIDE] (E)IDE hard drive subsystem geometry
Format: <cyl>,<head>,<sect>
hest_disable [ACPI]
Disable Hardware Error Source Table (HEST) support;
corresponding firmware-first mode error processing
logic will be disabled.
hibernate= [HIBERNATION]
noresume Don't check if there's a hibernation image
present during boot.
nocompress Don't compress/decompress hibernation images.
no Disable hibernation and resume.
protect_image Turn on image protection during restoration
(that will set all pages holding image data
during restoration read-only).
highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact
size of <nn>. This works even on boxes that have no
highmem otherwise. This also works to reduce highmem
size on bigger boxes.
highres= [KNL] Enable/disable high resolution timer mode.
Valid parameters: "on", "off"
Default: "on"
hlt [BUGS=ARM,SH]
init: add "hostname" kernel parameter The gethostname system call returns the hostname for the current machine. However, the kernel has no mechanism to initially set the current machine's name in such a way as to guarantee that the first userspace process to call gethostname will receive a meaningful result. It relies on some unspecified userspace process to first call sethostname before gethostname can produce a meaningful name. Traditionally the machine's hostname is set from userspace by the init system. The init system, in turn, often relies on a configuration file (say, /etc/hostname) to provide the value that it will supply in the call to sethostname. Consequently, the file system containing /etc/hostname usually must be available before the hostname will be set. There may, however, be earlier userspace processes that could call gethostname before the file system containing /etc/hostname is mounted. Such a process will get some other, likely meaningless, name from gethostname (such as "(none)", "localhost", or "darkstar"). A real-world example where this can happen, and lead to undesirable results, is with mdadm. When assembling arrays, mdadm distinguishes between "local" arrays and "foreign" arrays. A local array is one that properly belongs to the current machine, and a foreign array is one that is (possibly temporarily) attached to the current machine, but properly belongs to some other machine. To determine if an array is local or foreign, mdadm may compare the "homehost" recorded on the array with the current hostname. If mdadm is run before the root file system is mounted, perhaps because the root file system itself resides on an md-raid array, then /etc/hostname isn't yet available and the init system will not yet have called sethostname, causing mdadm to incorrectly conclude that all of the local arrays are foreign. Solving this problem *could* be delegated to the init system. It could be left up to the init system (including any init system that starts within an initramfs, if one is in use) to ensure that sethostname is called before any other userspace process could possibly call gethostname. However, it may not always be obvious which processes could call gethostname (for example, udev itself might not call gethostname, but it could via udev rules invoke processes that do). Additionally, the init system has to ensure that the hostname configuration value is stored in some place where it will be readily accessible during early boot. Unfortunately, every init system will attempt to (or has already attempted to) solve this problem in a different, possibly incorrect, way. This makes getting consistently working configurations harder for users. I believe it is better for the kernel to provide the means by which the hostname may be set early, rather than making this a problem for the init system to solve. The option to set the hostname during early startup, via a kernel parameter, provides a simple, reliable way to solve this problem. It also could make system configuration easier for some embedded systems. [dmoulding@me.com: v2] Link: https://lkml.kernel.org/r/20220506060310.7495-2-dmoulding@me.com Link: https://lkml.kernel.org/r/20220505180651.22849-2-dmoulding@me.com Signed-off-by: Dan Moulding <dmoulding@me.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-18 00:31:37 +00:00
hostname= [KNL] Set the hostname (aka UTS nodename).
Format: <string>
This allows setting the system's hostname during early
startup. This sets the name returned by gethostname.
Using this parameter to set the hostname makes it
possible to ensure the hostname is correctly set before
any userspace processes run, avoiding the possibility
that a process may call gethostname before the hostname
has been explicitly set, resulting in the calling
process getting an incorrect result. The string must
not exceed the maximum allowed hostname length (usually
64 characters) and will be truncated otherwise.
hpet= [X86-32,HPET] option to control HPET usage
Format: { enable (default) | disable | force |
verbose }
disable: disable HPET and use PIT instead
force: allow force enabled of undocumented chips (ICH4,
VIA, nVidia)
verbose: show contents of HPET registers during setup
hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET
registers. Default set by CONFIG_HPET_MMAP_DEFAULT.
hugetlbfs: clean up command line processing With all hugetlb page processing done in a single file clean up code. - Make code match desired semantics - Update documentation with semantics - Make all warnings and errors messages start with 'HugeTLB:'. - Consistently name command line parsing routines. - Warn if !hugepages_supported() and command line parameters have been specified. - Add comments to code - Describe some of the subtle interactions - Describe semantics of command line arguments This patch also fixes issues with implicitly setting the number of gigantic huge pages to preallocate. Previously on X86 command line, hugepages=2 default_hugepagesz=1G would result in zero 1G pages being preallocated and, # grep HugePages_Total /proc/meminfo HugePages_Total: 0 # sysctl -a | grep nr_hugepages vm.nr_hugepages = 2 vm.nr_hugepages_mempolicy = 2 # cat /proc/sys/vm/nr_hugepages 2 After this patch 2 gigantic pages will be preallocated and all the proc, sysfs, sysctl and meminfo files will accurately reflect this. To address the issue with gigantic pages, a small change in behavior was made to command line processing. Previously the command line, hugepages=128 default_hugepagesz=2M hugepagesz=2M hugepages=256 would result in the allocation of 256 2M huge pages. The value 128 would be ignored without any warning. After this patch, 128 2M pages will be allocated and a warning message will be displayed indicating the value of 256 is ignored. This change in behavior is required because allocation of implicitly specified gigantic pages must be done when the default_hugepagesz= is encountered for gigantic pages. Previously the code waited until later in the boot process (hugetlb_init), to allocate pages of default size. However the bootmem allocator required for gigantic allocations is not available at this time. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sandipan Das <sandipan@linux.ibm.com> Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Acked-by: Will Deacon <will@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Longpeng <longpeng2@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Qian Cai <cai@lca.pw> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200417185049.275845-5-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 23:00:46 +00:00
hugepages= [HW] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
If this is the first HugeTLB parameter on the command
line, it specifies the number of pages to allocate for
hugetlbfs: extend the definition of hugepages parameter to support node allocation We can specify the number of hugepages to allocate at boot. But the hugepages is balanced in all nodes at present. In some scenarios, we only need hugepages in one node. For example: DPDK needs hugepages which are in the same node as NIC. If DPDK needs four hugepages of 1G size in node1 and system has 16 numa nodes we must reserve 64 hugepages on the kernel cmdline. But only four hugepages are used. The others should be free after boot. If the system memory is low(for example: 64G), it will be an impossible task. So extend the hugepages parameter to support specifying hugepages on a specific node. For example add following parameter: hugepagesz=1G hugepages=0:1,1:3 It will allocate 1 hugepage in node0 and 3 hugepages in node1. Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zhenguo Yao <yaozhenguo1@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Rapoport <rppt@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-05 20:43:28 +00:00
the default huge page size. If using node format, the
number of pages to allocate per-node can be specified.
See also Documentation/admin-guide/mm/hugetlbpage.rst.
Format: <integer> or (node format)
<node>:<integer>[,<node>:<integer>]
hugetlbfs: clean up command line processing With all hugetlb page processing done in a single file clean up code. - Make code match desired semantics - Update documentation with semantics - Make all warnings and errors messages start with 'HugeTLB:'. - Consistently name command line parsing routines. - Warn if !hugepages_supported() and command line parameters have been specified. - Add comments to code - Describe some of the subtle interactions - Describe semantics of command line arguments This patch also fixes issues with implicitly setting the number of gigantic huge pages to preallocate. Previously on X86 command line, hugepages=2 default_hugepagesz=1G would result in zero 1G pages being preallocated and, # grep HugePages_Total /proc/meminfo HugePages_Total: 0 # sysctl -a | grep nr_hugepages vm.nr_hugepages = 2 vm.nr_hugepages_mempolicy = 2 # cat /proc/sys/vm/nr_hugepages 2 After this patch 2 gigantic pages will be preallocated and all the proc, sysfs, sysctl and meminfo files will accurately reflect this. To address the issue with gigantic pages, a small change in behavior was made to command line processing. Previously the command line, hugepages=128 default_hugepagesz=2M hugepagesz=2M hugepages=256 would result in the allocation of 256 2M huge pages. The value 128 would be ignored without any warning. After this patch, 128 2M pages will be allocated and a warning message will be displayed indicating the value of 256 is ignored. This change in behavior is required because allocation of implicitly specified gigantic pages must be done when the default_hugepagesz= is encountered for gigantic pages. Previously the code waited until later in the boot process (hugetlb_init), to allocate pages of default size. However the bootmem allocator required for gigantic allocations is not available at this time. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sandipan Das <sandipan@linux.ibm.com> Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Acked-by: Will Deacon <will@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Longpeng <longpeng2@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Qian Cai <cai@lca.pw> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200417185049.275845-5-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-03 23:00:46 +00:00
hugepagesz=
[HW] The size of the HugeTLB pages. This is used in
conjunction with hugepages (above) to allocate huge
pages of a specific size at boot. The pair
hugepagesz=X hugepages=Y can be specified once for
each supported huge page size. Huge page sizes are
architecture dependent. See also
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
hugetlb_cma= [HW,CMA] The size of a CMA area used for allocation
of gigantic hugepages. Or using node format, the size
of a CMA area per node can be specified.
Format: nn[KMGTPE] or (node format)
<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
Reserve a CMA area of given size and allocate gigantic
hugepages using the CMA allocator. If enabled, the
boot-time allocation of gigantic hugepages is skipped.
mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap Add a kernel parameter hugetlb_free_vmemmap to enable the feature of freeing unused vmemmap pages associated with each hugetlb page on boot. We disable PMD mapping of vmemmap pages for x86-64 arch when this feature is enabled. Because vmemmap_remap_free() depends on vmemmap being base page mapped. Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 01:47:25 +00:00
hugetlb_free_vmemmap=
[KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap Add a kernel parameter hugetlb_free_vmemmap to enable the feature of freeing unused vmemmap pages associated with each hugetlb page on boot. We disable PMD mapping of vmemmap pages for x86-64 arch when this feature is enabled. Because vmemmap_remap_free() depends on vmemmap being base page mapped. Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 01:47:25 +00:00
enabled.
Control if HugeTLB Vmemmap Optimization (HVO) is enabled.
mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap Add a kernel parameter hugetlb_free_vmemmap to enable the feature of freeing unused vmemmap pages associated with each hugetlb page on boot. We disable PMD mapping of vmemmap pages for x86-64 arch when this feature is enabled. Because vmemmap_remap_free() depends on vmemmap being base page mapped. Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 01:47:25 +00:00
Allows heavy hugetlb users to free up some more
mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page Patch series "Free the 2nd vmemmap page associated with each HugeTLB page", v7. This series can minimize the overhead of struct page for 2MB HugeTLB pages significantly. It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB HugeTLB. It is a nice gain. Comments and reviews are welcome. Thanks. The main implementation and details can refer to the commit log of patch 1. In this series, I have changed the following four helpers, the following table shows the impact of the overhead of those helpers. +------------------+-----------------------+ | APIs | head page | tail page | +------------------+-----------+-----------+ | PageHead() | Y | N | +------------------+-----------+-----------+ | PageTail() | Y | N | +------------------+-----------+-----------+ | PageCompound() | N | N | +------------------+-----------+-----------+ | compound_head() | Y | N | +------------------+-----------+-----------+ Y: Overhead is increased. N: Overhead is _NOT_ increased. It shows that the overhead of those helpers on a tail page don't change between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off". But the overhead on a head page will be increased when "hugetlb_free_vmemmap=on" (except PageCompound()). So I believe that Matthew Wilcox's folio series will help with this. The users of PageHead() and PageTail() are much less than compound_head() and most users of PageTail() are VM_BUG_ON(), so I have done some tests about the overhead of compound_head() on head pages. I have tested the overhead of calling compound_head() on a head page, which is 2.11ns (Measure the call time of 10 million times compound_head(), and then average). For a head page whose address is not aligned with PAGE_SIZE or a non-compound page, the overhead of compound_head() is 2.54ns which is increased by 20%. For a head page whose address is aligned with PAGE_SIZE, the overhead of compound_head() is 2.97ns which is increased by 40%. Most pages are the former. I do not think the overhead is significant since the overhead of compound_head() itself is low. This patch (of 5): This patch minimizes the overhead of struct page for 2MB HugeTLB pages significantly. It further reduces the overhead of struct page by 12.5% for a 2MB HugeTLB compared to the previous approach, which means 2GB per 1TB HugeTLB (2MB type). After the feature of "Free sonme vmemmap pages of HugeTLB page" is enabled, the mapping of the vmemmap addresses associated with a 2MB HugeTLB page becomes the figure below. HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+---> PG_head | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | -------------> | 1 | | | +-----------+ +-----------+ | | | 2 | ----------------^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | 3 | ------------------+ | | | | | | +-----------+ | | | | | | | 4 | --------------------+ | | | | 2MB | +-----------+ | | | | | | 5 | ----------------------+ | | | | +-----------+ | | | | | 6 | ------------------------+ | | | +-----------+ | | | | 7 | --------------------------+ | | +-----------+ | | | | | | +-----------+ As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and remaped. However, the 2nd vmemmap page frame is also can be freed to the buddy allocator, then we can change the mapping from the figure above to the figure below. HugeTLB struct pages(8 pages) page frame(8 pages) +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+---> PG_head | | | 0 | -------------> | 0 | | | +-----------+ +-----------+ | | | 1 | ---------------^ ^ ^ ^ ^ ^ ^ | | +-----------+ | | | | | | | | | 2 | -----------------+ | | | | | | | +-----------+ | | | | | | | | 3 | -------------------+ | | | | | | +-----------+ | | | | | | | 4 | ---------------------+ | | | | 2MB | +-----------+ | | | | | | 5 | -----------------------+ | | | | +-----------+ | | | | | 6 | -------------------------+ | | | +-----------+ | | | | 7 | ---------------------------+ | | +-----------+ | | | | | | +-----------+ After we do this, all tail vmemmap pages (1-7) are mapped to the head vmemmap page frame (0). In other words, there are more than one page struct with PG_head associated with each HugeTLB page. We __know__ that there is only one head page struct, the tail page structs with PG_head are fake head page structs. We need an approach to distinguish between those two different types of page structs so that compound_head(), PageHead() and PageTail() can work properly if the parameter is the tail page struct but with PG_head. The following code snippet describes how to distinguish between real and fake head page struct. if (test_bit(PG_head, &page->flags)) { unsigned long head = READ_ONCE(page[1].compound_head); if (head & 1) { if (head == (unsigned long)page + 1) ==> head page struct else ==> tail page struct } else ==> head page struct } We can safely access the field of the @page[1] with PG_head because the @page is a compound page composed with at least two contiguous pages. [songmuchun@bytedance.com: restore lost comment changes] Link: https://lkml.kernel.org/r/20211101031651.75851-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20211101031651.75851-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 21:45:00 +00:00
memory (7 * PAGE_SIZE for each 2MB hugetlb page).
Format: { on | off (default) }
mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap Add a kernel parameter hugetlb_free_vmemmap to enable the feature of freeing unused vmemmap pages associated with each hugetlb page on boot. We disable PMD mapping of vmemmap pages for x86-64 arch when this feature is enabled. Because vmemmap_remap_free() depends on vmemmap being base page mapped. Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 01:47:25 +00:00
on: enable HVO
off: disable HVO
mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap Add a kernel parameter hugetlb_free_vmemmap to enable the feature of freeing unused vmemmap pages associated with each hugetlb page on boot. We disable PMD mapping of vmemmap pages for x86-64 arch when this feature is enabled. Because vmemmap_remap_free() depends on vmemmap being base page mapped. Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Tested-by: Chen Huang <chenhuang5@huawei.com> Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 01:47:25 +00:00
Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
the default is on.
mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory For now, the feature of hugetlb_free_vmemmap is not compatible with the feature of memory_hotplug.memmap_on_memory, and hugetlb_free_vmemmap takes precedence over memory_hotplug.memmap_on_memory. However, someone wants to make memory_hotplug.memmap_on_memory takes precedence over hugetlb_free_vmemmap since memmap_on_memory makes it more likely to succeed memory hotplug in close-to-OOM situations. So the decision of making hugetlb_free_vmemmap take precedence is not wise and elegant. The proper approach is to have hugetlb_vmemmap.c do the check whether the section which the HugeTLB pages belong to can be optimized. If the section's vmemmap pages are allocated from the added memory block itself, hugetlb_free_vmemmap should refuse to optimize the vmemmap, otherwise, do the optimization. Then both kernel parameters are compatible. So this patch introduces VmemmapSelfHosted to mask any non-optimizable vmemmap pages. The hugetlb_vmemmap can use this flag to detect if a vmemmap page can be optimized. [songmuchun@bytedance.com: walk vmemmap page tables to avoid false-positive] Link: https://lkml.kernel.org/r/20220620110616.12056-3-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220617135650.74901-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17 13:56:50 +00:00
Note that the vmemmap pages may be allocated from the added
memory block itself when memory_hotplug.memmap_on_memory is
enabled, those vmemmap pages cannot be optimized even if this
feature is enabled. Other vmemmap pages not allocated from
the added memory block itself do not be affected.
mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled The parameter of memory_hotplug.memmap_on_memory is not compatible with hugetlb_free_vmemmap. So disable it when hugetlb_free_vmemmap is enabled. [akpm@linux-foundation.org: remove unneeded include, per Oscar] Link: https://lkml.kernel.org/r/20210510030027.56044-9-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Chen Huang <chenhuang5@huawei.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Oliver Neukum <oneukum@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-01 01:47:29 +00:00
hung_task_panic=
[KNL] Should the hung task detector generate panics.
Format: 0 | 1
A value of 1 instructs the kernel to panic when a
hung task is detected. The default value is controlled
by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
option. The value selected by this boot parameter can
be changed later by the kernel.hung_task_panic sysctl.
hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC)
terminal devices. Valid values: 0..8
hvc_iucv_allow= [S390] Comma-separated list of z/VM user IDs.
If specified, z/VM IUCV HVC accepts connections
from listed z/VM user IDs only.
hv_nopvspin [X86,HYPER_V] Disables the paravirt spinlock optimizations
which allow the hypervisor to 'idle' the
guest on lock contention.
i2c_bus= [HW] Override the default board specific I2C bus speed
or register an additional I2C bus that is not
registered from board initialization code.
Format:
<bus_id>,<clkrate>
i8042.debug [HW] Toggle i8042 debug mode
i8042.unmask_kbd_data
[HW] Enable printing of interrupt data from the KBD port
(disabled by default, and as a pre-condition
requires that i8042.debug=1 be enabled)
i8042.direct [HW] Put keyboard port into non-translated mode
i8042.dumbkbd [HW] Pretend that controller can only read data from
keyboard and cannot control its state
(Don't attempt to blink the leds)
i8042.noaux [HW] Don't check for auxiliary (== mouse) port
i8042.nokbd [HW] Don't check/create keyboard port
i8042.noloop [HW] Disable the AUX Loopback command while probing
for the AUX port
i8042.nomux [HW] Don't check presence of an active multiplexing
controller
i8042.nopnp [HW] Don't use ACPIPnP / PnPBIOS to discover KBD/AUX
controllers
i8042.notimeout [HW] Ignore timeout condition signalled by controller
i8042.reset [HW] Reset the controller during init, cleanup and
suspend-to-ram transitions, only during s2r
transitions, or never reset
Format: { 1 | Y | y | 0 | N | n }
1, Y, y: always reset controller
0, N, n: don't ever reset controller
Default: only on s2r transitions on x86; most other
architectures force reset to be always executed
i8042.unlock [HW] Unlock (ignore) the keylock
i8042.kbdreset [HW] Reset device connected to KBD port
i8042.probe_defer
[HW] Allow deferred probing upon i8042 probe errors
i810= [HW,DRM]
i915.invert_brightness=
[DRM] Invert the sense of the variable that is used to
set the brightness of the panel backlight. Normally a
brightness value of 0 indicates backlight switched off,
and the maximum of the brightness value sets the backlight
to maximum brightness. If this parameter is set to 0
(default) and the machine requires it, or this parameter
is set to 1, a brightness value of 0 sets the backlight
to maximum brightness, and the maximum of the brightness
value switches the backlight off.
-1 -- never invert brightness
0 -- machine default
1 -- force brightness inversion
ia32_emulation= [X86-64]
Format: <bool>
When true, allows loading 32-bit programs and executing 32-bit
syscalls, essentially overriding IA32_EMULATION_DEFAULT_DISABLED at
boot time. When false, unconditionally disables IA32 emulation.
icn= [HW,ISDN]
Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]]
idle= [X86]
Format: idle=poll, idle=halt, idle=nomwait
Poll forces a polling idle loop that can slightly
improve the performance of waking up a idle CPU, but
will use a lot of power and make the system run hot.
Not recommended.
idle=halt: Halt is forced to be used for CPU idle.
In such case C2/C3 won't be used again.
idle=nomwait: Disable mwait for CPU C-states
idxd.sva= [HW]
Format: <bool>
Allow force disabling of Shared Virtual Memory (SVA)
support for the idxd driver. By default it is set to
true (1).
idxd.tc_override= [HW]
Format: <bool>
Allow override of default traffic class configuration
for the device. By default it is set to false (0).
ieee754= [MIPS] Select IEEE Std 754 conformance mode
Format: { strict | legacy | 2008 | relaxed }
Default: strict
Choose which programs will be accepted for execution
based on the IEEE 754 NaN encoding(s) supported by
the FPU and the NaN encoding requested with the value
of an ELF file header flag individually set by each
binary. Hardware implementations are permitted to
support either or both of the legacy and the 2008 NaN
encoding mode.
Available settings are as follows:
strict accept binaries that request a NaN encoding
supported by the FPU
legacy only accept legacy-NaN binaries, if supported
by the FPU
2008 only accept 2008-NaN binaries, if supported
by the FPU
relaxed accept any binaries regardless of whether
supported by the FPU
The FPU emulator is always able to support both NaN
encodings, so if no FPU hardware is present or it has
been disabled with 'nofpu', then the settings of
'legacy' and '2008' strap the emulator accordingly,
'relaxed' straps the emulator for both legacy-NaN and
2008-NaN, whereas 'strict' enables legacy-NaN only on
legacy processors and both NaN encodings on MIPS32 or
MIPS64 CPUs.
The setting for ABS.fmt/NEG.fmt instruction execution
mode generally follows that for the NaN encoding,
except where unsupported by hardware.
ignore_loglevel [KNL]
Ignore loglevel setting - this will print /all/
kernel messages to the console. Useful for debugging.
We also add it as printk module parameter, so users
could change it dynamically, usually by
/sys/module/printk/parameters/ignore_loglevel.
mm: warn about VmData over RLIMIT_DATA This patch provides a way of working around a slight regression introduced by commit 84638335900f ("mm: rework virtual memory accounting"). Before that commit RLIMIT_DATA have control only over size of the brk region. But that change have caused problems with all existing versions of valgrind, because it set RLIMIT_DATA to zero. This patch fixes rlimit check (limit actually in bytes, not pages) and by default turns it into warning which prints at first VmData misuse: "mmap: top (795): VmData 516096 exceed data ulimit 512000. Will be forbidden soon." Behavior is controlled by boot param ignore_rlimit_data=y/n and by sysfs /sys/module/kernel/parameters/ignore_rlimit_data. For now it set to "y". [akpm@linux-foundation.org: tweak kernel-parameters.txt text[ Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Kees Cook <keescook@google.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03 00:57:43 +00:00
ignore_rlimit_data
Ignore RLIMIT_DATA setting for data mappings,
print warning at first misuse. Can be changed via
/sys/module/kernel/parameters/ignore_rlimit_data.
ihash_entries= [KNL]
Set number of hash buckets for inode cache.
ima: integrity appraisal extension IMA currently maintains an integrity measurement list used to assert the integrity of the running system to a third party. The IMA-appraisal extension adds local integrity validation and enforcement of the measurement against a "good" value stored as an extended attribute 'security.ima'. The initial methods for validating 'security.ima' are hashed based, which provides file data integrity, and digital signature based, which in addition to providing file data integrity, provides authenticity. This patch creates and maintains the 'security.ima' xattr, containing the file data hash measurement. Protection of the xattr is provided by EVM, if enabled and configured. Based on policy, IMA calls evm_verifyxattr() to verify a file's metadata integrity and, assuming success, compares the file's current hash value with the one stored as an extended attribute in 'security.ima'. Changelov v4: - changed iint cache flags to hex values Changelog v3: - change appraisal default for filesystems without xattr support to fail Changelog v2: - fix audit msg 'res' value - removed unused 'ima_appraise=' values Changelog v1: - removed unused iint mutex (Dmitry Kasatkin) - setattr hook must not reset appraised (Dmitry Kasatkin) - evm_verifyxattr() now differentiates between no 'security.evm' xattr (INTEGRITY_NOLABEL) and no EVM 'protected' xattrs included in the 'security.evm' (INTEGRITY_NOXATTRS). - replace hash_status with ima_status (Dmitry Kasatkin) - re-initialize slab element ima_status on free (Dmitry Kasatkin) - include 'security.ima' in EVM if CONFIG_IMA_APPRAISE, not CONFIG_IMA - merged half "ima: ima_must_appraise_or_measure API change" (Dmitry Kasatkin) - removed unnecessary error variable in process_measurement() (Dmitry Kasatkin) - use ima_inode_post_setattr() stub function, if IMA_APPRAISE not configured (moved ima_inode_post_setattr() to ima_appraise.c) - make sure ima_collect_measurement() can read file Changelog: - add 'iint' to evm_verifyxattr() call (Dimitry Kasatkin) - fix the race condition between chmod, which takes the i_mutex and then iint->mutex, and ima_file_free() and process_measurement(), which take the locks in the reverse order, by eliminating iint->mutex. (Dmitry Kasatkin) - cleanup of ima_appraise_measurement() (Dmitry Kasatkin) - changes as a result of the iint not allocated for all regular files, but only for those measured/appraised. - don't try to appraise new/empty files - expanded ima_appraisal description in ima/Kconfig - IMA appraise definitions required even if IMA_APPRAISE not enabled - add return value to ima_must_appraise() stub - unconditionally set status = INTEGRITY_PASS *after* testing status, not before. (Found by Joe Perches) Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@intel.com>
2012-02-13 15:15:05 +00:00
ima_appraise= [IMA] appraise integrity measurements
Format: { "off" | "enforce" | "fix" | "log" }
ima: integrity appraisal extension IMA currently maintains an integrity measurement list used to assert the integrity of the running system to a third party. The IMA-appraisal extension adds local integrity validation and enforcement of the measurement against a "good" value stored as an extended attribute 'security.ima'. The initial methods for validating 'security.ima' are hashed based, which provides file data integrity, and digital signature based, which in addition to providing file data integrity, provides authenticity. This patch creates and maintains the 'security.ima' xattr, containing the file data hash measurement. Protection of the xattr is provided by EVM, if enabled and configured. Based on policy, IMA calls evm_verifyxattr() to verify a file's metadata integrity and, assuming success, compares the file's current hash value with the one stored as an extended attribute in 'security.ima'. Changelov v4: - changed iint cache flags to hex values Changelog v3: - change appraisal default for filesystems without xattr support to fail Changelog v2: - fix audit msg 'res' value - removed unused 'ima_appraise=' values Changelog v1: - removed unused iint mutex (Dmitry Kasatkin) - setattr hook must not reset appraised (Dmitry Kasatkin) - evm_verifyxattr() now differentiates between no 'security.evm' xattr (INTEGRITY_NOLABEL) and no EVM 'protected' xattrs included in the 'security.evm' (INTEGRITY_NOXATTRS). - replace hash_status with ima_status (Dmitry Kasatkin) - re-initialize slab element ima_status on free (Dmitry Kasatkin) - include 'security.ima' in EVM if CONFIG_IMA_APPRAISE, not CONFIG_IMA - merged half "ima: ima_must_appraise_or_measure API change" (Dmitry Kasatkin) - removed unnecessary error variable in process_measurement() (Dmitry Kasatkin) - use ima_inode_post_setattr() stub function, if IMA_APPRAISE not configured (moved ima_inode_post_setattr() to ima_appraise.c) - make sure ima_collect_measurement() can read file Changelog: - add 'iint' to evm_verifyxattr() call (Dimitry Kasatkin) - fix the race condition between chmod, which takes the i_mutex and then iint->mutex, and ima_file_free() and process_measurement(), which take the locks in the reverse order, by eliminating iint->mutex. (Dmitry Kasatkin) - cleanup of ima_appraise_measurement() (Dmitry Kasatkin) - changes as a result of the iint not allocated for all regular files, but only for those measured/appraised. - don't try to appraise new/empty files - expanded ima_appraisal description in ima/Kconfig - IMA appraise definitions required even if IMA_APPRAISE not enabled - add return value to ima_must_appraise() stub - unconditionally set status = INTEGRITY_PASS *after* testing status, not before. (Found by Joe Perches) Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@intel.com>
2012-02-13 15:15:05 +00:00
default: "enforce"
ima_appraise_tcb [IMA] Deprecated. Use ima_policy= instead.
ima: add appraise action keywords and default rules Unlike the IMA measurement policy, the appraise policy can not be dependent on runtime process information, such as the task uid, as the 'security.ima' xattr is written on file close and must be updated each time the file changes, regardless of the current task uid. This patch extends the policy language with 'fowner', defines an appraise policy, which appraises all files owned by root, and defines 'ima_appraise_tcb', a new boot command line option, to enable the appraise policy. Changelog v3: - separate the measure from the appraise rules in order to support measuring without appraising and appraising without measuring. - change appraisal default for filesystems without xattr support to fail - update default appraise policy for cgroups Changelog v1: - don't appraise RAMFS (Dmitry Kasatkin) - merged rest of "ima: ima_must_appraise_or_measure API change" commit (Dmtiry Kasatkin) ima_must_appraise_or_measure() called ima_match_policy twice, which searched the policy for a matching rule. Once for a matching measurement rule and subsequently for an appraisal rule. Searching the policy twice is unnecessary overhead, which could be noticeable with a large policy. The new version of ima_must_appraise_or_measure() does everything in a single iteration using a new version of ima_match_policy(). It returns IMA_MEASURE, IMA_APPRAISE mask. With the use of action mask only one efficient matching function is enough. Removed other specific versions of matching functions. Changelog: - change 'owner' to 'fowner' to conform to the new LSM conditions posted by Roberto Sassu. - fix calls to ima_log_string() Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@intel.com>
2011-03-10 03:25:48 +00:00
The builtin appraise policy appraises all files
owned by uid=0.
ima: define a canonical binary_runtime_measurements list format The IMA binary_runtime_measurements list is currently in platform native format. To allow restoring a measurement list carried across kexec with a different endianness than the targeted kernel, this patch defines little-endian as the canonical format. For big endian systems wanting to save/restore the measurement list from a system with a different endianness, a new boot command line parameter named "ima_canonical_fmt" is defined. Considerations: use of the "ima_canonical_fmt" boot command line option will break existing userspace applications on big endian systems expecting the binary_runtime_measurements list to be in platform native format. Link: http://lkml.kernel.org/r/1480554346-29071-10-git-send-email-zohar@linux.vnet.ibm.com Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Acked-by: Dmitry Kasatkin <dmitry.kasatkin@gmail.com> Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andreas Steffen <andreas.steffen@strongswan.org> Cc: Josh Sklar <sklar@linux.vnet.ibm.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-20 00:22:57 +00:00
ima_canonical_fmt [IMA]
Use the canonical format for the binary runtime
measurements, instead of host native format.
ima_hash= [IMA]
Format: { md5 | sha1 | rmd160 | sha256 | sha384
| sha512 | ... }
default: "sha1"
The list of supported hash algorithms is defined
in crypto/hash_info.h.
ima_policy= [IMA]
The builtin policies to load during IMA setup.
Format: "tcb | appraise_tcb | secure_boot |
fail_securely | critical_data"
The "tcb" policy measures all programs exec'd, files
mmap'd for exec, and all files opened with the read
mode bit set by either the effective uid (euid=0) or
uid=0.
The "appraise_tcb" policy appraises the integrity of
all files owned by root.
The "secure_boot" policy appraises the integrity
of files (eg. kexec kernel image, kernel modules,
firmware, policy, etc) based on file signatures.
The "fail_securely" policy forces file signature
verification failure also on privileged mounted
filesystems with the SB_I_UNVERIFIABLE_SIGNATURE
flag.
The "critical_data" policy measures kernel integrity
critical data.
ima_tcb [IMA] Deprecated. Use ima_policy= instead.
IMA: Minimal IMA policy and boot param for TCB IMA policy The IMA TCB policy is dangerous. A normal use can use all of a system's memory (which cannot be freed) simply by building and running lots of executables. The TCB policy is also nearly useless because logging in as root often causes a policy violation when dealing with utmp, thus rendering the measurements meaningless. There is no good fix for this in the kernel. A full TCB policy would need to be loaded in userspace using LSM rule matching to get both a protected and useful system. But, if too little is measured before userspace can load a real policy one again ends up with a meaningless set of measurements. One option would be to put the policy load inside the initrd in order to get it early enough in the boot sequence to be useful, but this runs into trouble with the LSM. For IMA to measure the LSM policy and the LSM policy loading mechanism it needs rules to do so, but we already talked about problems with defaulting to such broad rules.... IMA also depends on the files being measured to be on an FS which implements and supports i_version. Since the only FS with this support (ext4) doesn't even use it by default it seems silly to have any IMA rules by default. This should reduce the performance overhead of IMA to near 0 while still letting users who choose to configure their machine as such to inclue the ima_tcb kernel paramenter and get measurements during boot before they can load a customized, reasonable policy in userspace. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2009-05-21 19:47:06 +00:00
Load a policy which meets the needs of the Trusted
Computing Base. This means IMA will measure all
programs exec'd, files mmap'd for exec, and all files
opened for read by uid=0.
ima_template= [IMA]
Select one of defined IMA measurements template formats.
Formats: { "ima" | "ima-ng" | "ima-ngv2" | "ima-sig" |
"ima-sigv2" }
Default: "ima-ng"
ima_template_fmt=
[IMA] Define a custom template format.
Format: { "field1|...|fieldN" }
ima.ahash_minsize= [IMA] Minimum file size for asynchronous hash usage
Format: <min_file_size>
Set the minimal file size for using asynchronous hash.
If left unspecified, ahash usage is disabled.
ahash performance varies for different data sizes on
different crypto accelerators. This option can be used
to achieve the best performance for a particular HW.
ima.ahash_bufsize= [IMA] Asynchronous hash buffer size
Format: <bufsize>
Set hashing buffer size. Default: 4k.
ahash performance varies for different chunk sizes on
different crypto accelerators. This option can be used
to achieve best performance for particular HW.
init= [KNL]
Format: <full_path>
Run specified binary instead of /sbin/init as init
process.
initcall_debug [KNL] Trace initcalls as they are executed. Useful
for working out where the kernel is dying during
startup.
init/main.c: add initcall_blacklist kernel parameter When a module is built into the kernel the module_init() function becomes an initcall. Sometimes debugging through dynamic debug can help, however, debugging built in kernel modules is typically done by changing the .config, recompiling, and booting the new kernel in an effort to determine exactly which module caused a problem. This patchset can be useful stand-alone or combined with initcall_debug. There are cases where some initcalls can hang the machine before the console can be flushed, which can make initcall_debug output inaccurate. Having the ability to skip initcalls can help further debugging of these scenarios. Usage: initcall_blacklist=<list of comma separated initcalls> ex) added "initcall_blacklist=sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted ex) added "initcall_blacklist=foo_bar,sgi_uv_sysfs_init" as a kernel parameter and the log contains: blacklisting initcall foo_bar blacklisting initcall sgi_uv_sysfs_init ... ... initcall sgi_uv_sysfs_init blacklisted [akpm@linux-foundation.org: tweak printk text] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Rob Landley <rob@landley.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 23:12:17 +00:00
initcall_blacklist= [KNL] Do not execute a comma-separated list of
initcall functions. Useful for debugging built-in
modules and initcalls.
init/initramfs.c: do unpacking asynchronously Patch series "background initramfs unpacking, and CONFIG_MODPROBE_PATH", v3. These two patches are independent, but better-together. The second is a rather trivial patch that simply allows the developer to change "/sbin/modprobe" to something else - e.g. the empty string, so that all request_module() during early boot return -ENOENT early, without even spawning a usermode helper, needlessly synchronizing with the initramfs unpacking. The first patch delegates decompressing the initramfs to a worker thread, allowing do_initcalls() in main.c to proceed to the device_ and late_ initcalls without waiting for that decompression (and populating of rootfs) to finish. Obviously, some of those later calls may rely on the initramfs being available, so I've added synchronization points in the firmware loader and usermodehelper paths - there might be other places that would need this, but so far no one has been able to think of any places I have missed. There's not much to win if most of the functionality needed during boot is only available as modules. But systems with a custom-made .config and initramfs can boot faster, partly due to utilizing more than one cpu earlier, partly by avoiding known-futile modprobe calls (which would still trigger synchronization with the initramfs unpacking, thus eliminating most of the first benefit). This patch (of 2): Most of the boot process doesn't actually need anything from the initramfs, until of course PID1 is to be executed. So instead of doing the decompressing and populating of the initramfs synchronously in populate_rootfs() itself, push that off to a worker thread. This is primarily motivated by an embedded ppc target, where unpacking even the rather modest sized initramfs takes 0.6 seconds, which is long enough that the external watchdog becomes unhappy that it doesn't get attention soon enough. By doing the initramfs decompression in a worker thread, we get to do the device_initcalls and hence start petting the watchdog much sooner. Normal desktops might benefit as well. On my mostly stock Ubuntu kernel, my initramfs is a 26M xz-compressed blob, decompressing to around 126M. That takes almost two seconds: [ 0.201454] Trying to unpack rootfs image as initramfs... [ 1.976633] Freeing initrd memory: 29416K Before this patch, these lines occur consecutively in dmesg. With this patch, the timestamps on these two lines is roughly the same as above, but with 172 lines inbetween - so more than one cpu has been kept busy doing work that would otherwise only happen after the populate_rootfs() finished. Should one of the initcalls done after rootfs_initcall time (i.e., device_ and late_ initcalls) need something from the initramfs (say, a kernel module or a firmware blob), it will simply wait for the initramfs unpacking to be done before proceeding, which should in theory make this completely safe. But if some driver pokes around in the filesystem directly and not via one of the official kernel interfaces (i.e. request_firmware*(), call_usermodehelper*) that theory may not hold - also, I certainly might have missed a spot when sprinkling wait_for_initramfs(). So there is an escape hatch in the form of an initramfs_async= command line parameter. Link: https://lkml.kernel.org/r/20210313212528.2956377-1-linux@rasmusvillemoes.dk Link: https://lkml.kernel.org/r/20210313212528.2956377-2-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 01:05:42 +00:00
initramfs_async= [KNL]
Format: <bool>
Default: 1
This parameter controls whether the initramfs
image is unpacked asynchronously, concurrently
with devices being probed and
initialized. This should normally just work,
but as a debugging aid, one can get the
historical behaviour of the initramfs
unpacking being completed before device_ and
late_ initcalls.
initrd= [BOOT] Specify the location of the initial ramdisk
x86/setup: Add an initrdmem= option to specify initrd physical address Add the initrdmem option: initrdmem=ss[KMG],nn[KMG] which is used to specify the physical address of the initrd, almost always an address in FLASH. Also add code for x86 to use the existing phys_init_start and phys_init_size variables in the kernel. This is useful in cases where a kernel and an initrd is placed in FLASH, but there is no firmware file system structure in the FLASH. One such situation occurs when unused FLASH space on UEFI systems has been reclaimed by, e.g., taking it from the Management Engine. For example, on many systems, the ME is given half the FLASH part; not only is 2.75M of an 8M part unused; but 10.75M of a 16M part is unused. This space can be used to contain an initrd, but need to tell Linux where it is. This space is "raw": due to, e.g., UEFI limitations: it can not be added to UEFI firmware volumes without rebuilding UEFI from source or writing a UEFI device driver. It can be referenced only as a physical address and size. At the same time, if a kernel can be "netbooted" or loaded from GRUB or syslinux, the option of not using the physical address specification should be available. Then, it is easy to boot the kernel and provide an initrd; or boot the the kernel and let it use the initrd in FLASH. In practice, this has proven to be very helpful when integrating Linux into FLASH on x86. Hence, the most flexible and convenient path is to enable the initrdmem command line option in a way that it is the last choice tried. For example, on the DigitalLoggers Atomic Pi, an image into FLASH can be burnt in with a built-in command line which includes: initrdmem=0xff968000,0x200000 which specifies a location and size. [ bp: Massage commit message, make it passive. ] [akpm@linux-foundation.org: coding style fixes] Signed-off-by: Ronald G. Minnich <rminnich@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Link: http://lkml.kernel.org/r/CAP6exYLK11rhreX=6QPyDQmW7wPHsKNEFtXE47pjx41xS6O7-A@mail.gmail.com Link: https://lkml.kernel.org/r/20200426011021.1cskg0AGd%akpm@linux-foundation.org
2020-04-26 01:10:21 +00:00
initrdmem= [KNL] Specify a physical address and size from which to
load the initrd. If an initrd is compiled in or
specified in the bootparams, it takes priority over this
setting.
Format: ss[KMG],nn[KMG]
Default is 0, 0
mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options Patch series "add init_on_alloc/init_on_free boot options", v10. Provide init_on_alloc and init_on_free boot options. These are aimed at preventing possible information leaks and making the control-flow bugs that depend on uninitialized values more deterministic. Enabling either of the options guarantees that the memory returned by the page allocator and SL[AU]B is initialized with zeroes. SLOB allocator isn't supported at the moment, as its emulation of kmem caches complicates handling of SLAB_TYPESAFE_BY_RCU caches correctly. Enabling init_on_free also guarantees that pages and heap objects are initialized right after they're freed, so it won't be possible to access stale data by using a dangling pointer. As suggested by Michal Hocko, right now we don't let the heap users to disable initialization for certain allocations. There's not enough evidence that doing so can speed up real-life cases, and introducing ways to opt-out may result in things going out of control. This patch (of 2): The new options are needed to prevent possible information leaks and make control-flow bugs that depend on uninitialized values more deterministic. This is expected to be on-by-default on Android and Chrome OS. And it gives the opportunity for anyone else to use it under distros too via the boot args. (The init_on_free feature is regularly requested by folks where memory forensics is included in their threat models.) init_on_alloc=1 makes the kernel initialize newly allocated pages and heap objects with zeroes. Initialization is done at allocation time at the places where checks for __GFP_ZERO are performed. init_on_free=1 makes the kernel initialize freed pages and heap objects with zeroes upon their deletion. This helps to ensure sensitive data doesn't leak via use-after-free accesses. Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator returns zeroed memory. The two exceptions are slab caches with constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never zero-initialized to preserve their semantics. Both init_on_alloc and init_on_free default to zero, but those defaults can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and CONFIG_INIT_ON_FREE_DEFAULT_ON. If either SLUB poisoning or page poisoning is enabled, those options take precedence over init_on_alloc and init_on_free: initialization is only applied to unpoisoned allocations. Slowdown for the new features compared to init_on_free=0, init_on_alloc=0: hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%) hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%) Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%) Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%) Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%) Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%) The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline is within the standard error. The new features are also going to pave the way for hardware memory tagging (e.g. arm64's MTE), which will require both on_alloc and on_free hooks to set the tags for heap objects. With MTE, tagging will have the same cost as memory initialization. Although init_on_free is rather costly, there are paranoid use-cases where in-memory data lifetime is desired to be minimized. There are various arguments for/against the realism of the associated threat models, but given that we'll need the infrastructure for MTE anyway, and there are people who want wipe-on-free behavior no matter what the performance cost, it seems reasonable to include it in this series. [glider@google.com: v8] Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com [glider@google.com: v9] Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com [glider@google.com: v10] Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.cz> [page and dmapool parts Acked-by: James Morris <jamorris@linux.microsoft.com>] Cc: Christoph Lameter <cl@linux.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Sandeep Patil <sspatil@android.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 03:59:19 +00:00
init_on_alloc= [MM] Fill newly allocated pages and heap objects with
zeroes.
Format: 0 | 1
Default set by CONFIG_INIT_ON_ALLOC_DEFAULT_ON.
init_on_free= [MM] Fill freed pages and heap objects with zeroes.
Format: 0 | 1
Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON.
init_pkru= [X86] Specify the default memory protection keys rights
x86/pkeys: Default to a restrictive init PKRU PKRU is the register that lets you disallow writes or all access to a given protection key. The XSAVE hardware defines an "init state" of 0 for PKRU: its most permissive state, allowing access/writes to everything. Since we start off all new processes with the init state, we start all processes off with the most permissive possible PKRU. This is unfortunate. If a thread is clone()'d [1] before a program has time to set PKRU to a restrictive value, that thread will be able to write to all data, no matter what pkey is set on it. This weakens any integrity guarantees that we want pkeys to provide. To fix this, we define a very restrictive PKRU to override the XSAVE-provided value when we create a new FPU context. We choose a value that only allows access to pkey 0, which is as restrictive as we can practically make it. This does not cause any practical problems with applications using protection keys because we require them to specify initial permissions for each key when it is allocated, which override the restrictive default. In the end, this ensures that threads which do not know how to manage their own pkey rights can not do damage to data which is pkey-protected. I would have thought this was a pretty contrived scenario, except that I heard a bug report from an MPX user who was creating threads in some very early code before main(). It may be crazy, but folks evidently _do_ it. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-arch@vger.kernel.org Cc: Dave Hansen <dave@sr71.net> Cc: mgorman@techsingularity.net Cc: arnd@arndb.de Cc: linux-api@vger.kernel.org Cc: linux-mm@kvack.org Cc: luto@kernel.org Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20160729163021.F3C25D4A@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-07-29 16:30:21 +00:00
register contents for all processes. 0x55555554 by
default (disallow access to all but pkey 0). Can
override in debugfs after boot.
inport.irq= [HW] Inport (ATI XL and Microsoft) busmouse driver
Format: <irq>
int_pln_enable [X86] Enable power limit notification interrupt
integrity_audit=[IMA]
Format: { "0" | "1" }
0 -- basic integrity auditing messages. (Default)
1 -- additional integrity auditing messages.
intel_iommu= [DMAR] Intel IOMMU driver (DMAR) option
on
Enable intel iommu driver.
off
Disable intel iommu driver.
igfx_off [Default Off]
By default, gfx is mapped as normal device. If a gfx
device has a dedicated DMAR unit, the DMAR unit is
bypassed by not enabling DMAR with this option. In
this case, gfx device will use physical address for
DMA.
strict [Default Off]
Deprecated, equivalent to iommu.strict=1.
intel-iommu: Enable super page (2MiB, 1GiB, etc.) support There are no externally-visible changes with this. In the loop in the internal __domain_mapping() function, we simply detect if we are mapping: - size >= 2MiB, and - virtual address aligned to 2MiB, and - physical address aligned to 2MiB, and - on hardware that supports superpages. (and likewise for larger superpages). We automatically use a superpage for such mappings. We never have to worry about *breaking* superpages, since we trust that we will always *unmap* the same range that was mapped. So all we need to do is ensure that dma_pte_clear_range() will also cope with superpages. Adjust pfn_to_dma_pte() to take a superpage 'level' as an argument, so it can return a PTE at the appropriate level rather than always extending the page tables all the way down to level 1. Again, this is simplified by the fact that we should never encounter existing small pages when we're creating a mapping; any old mapping that used the same virtual range will have been entirely removed and its obsolete page tables freed. Provide an 'intel_iommu=sp_off' argument on the command line as a chicken bit. Not that it should ever be required. == The original commit seen in the iommu-2.6.git was Youquan's implementation (and completion) of my own half-baked code which I'd typed into an email. Followed by half a dozen subsequent 'fixes'. I've taken the unusual step of rewriting history and collapsing the original commits in order to keep the main history simpler, and make life easier for the people who are going to have to backport this to older kernels. And also so I can give it a more coherent commit comment which (hopefully) gives a better explanation of what's going on. The original sequence of commits leading to identical code was: Youquan Song (3): intel-iommu: super page support intel-iommu: Fix superpage alignment calculation error intel-iommu: Fix superpage level calculation error in dma_pfn_level_pte() David Woodhouse (4): intel-iommu: Precalculate superpage support for dmar_domain intel-iommu: Fix hardware_largepage_caps() intel-iommu: Fix inappropriate use of superpages in __domain_mapping() intel-iommu: Fix phys_pfn in __domain_mapping for sglist pages Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2011-05-25 18:13:49 +00:00
sp_off [Default Off]
By default, super page will be supported if Intel IOMMU
has the capability. With this option, super page will
not be supported.
sm_on
Enable the Intel IOMMU scalable mode if the hardware
advertises that it has support for the scalable mode
translation.
sm_off
Disallow use of the Intel IOMMU scalable mode.
tboot_noforce [Default Off]
Do not force the Intel IOMMU enabled under tboot.
By default, tboot will force Intel IOMMU on, which
could harm performance of some high-throughput
devices like 40GBit network cards, even if identity
mapping is enabled.
Note that using this option lowers the security
provided by tboot because it makes the system
vulnerable to DMA attacks.
intel_idle.max_cstate= [KNL,HW,ACPI,X86]
0 disables intel_idle and fall back on acpi_idle.
Update the maximum depth of C-state from 6 to 9 Hi Jon, This patch is an old one, we have corrected some minor issues on the newer one. Please only review the newest version from my last mail with this subject "[PATCH] ACPI: Update the maximum depth of C-state from 6 to 9". And I also attached it to this mail. Thanks, Baole On 7/11/2016 6:37 AM, Jonathan Corbet wrote: > On Mon, 4 Jul 2016 09:55:10 +0800 > "baolex.ni" <baolex.ni@intel.com> wrote: > >> Currently, CPUIDLE_STATE_MAX has been defined as 10 in the cpuidle head file, >> and max_cstate = CPUIDLE_STATE_MAX – 1, so 9 is the right maximum depth of C-state. >> This change is reflected in one place of the kernel-param file, >> but not in the other place where I suggest changing. >> >> Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com> >> Signed-off-by: Baole Ni <baolex.ni@intel.com> > > So why are there two signoffs on a single-line patch? Which one of you > is the actual author? > > Thanks, > > jon > From cf5f8aa6885874f6490b11507d3c0c86fa0a11f4 Mon Sep 17 00:00:00 2001 From: Chuansheng Liu <chuansheng.liu@intel.com> Date: Mon, 4 Jul 2016 08:52:51 +0800 Subject: [PATCH] Update the maximum depth of C-state from 6 to 9 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, CPUIDLE_STATE_MAX has been defined as 10 in the cpuidle head file, and max_cstate = CPUIDLE_STATE_MAX – 1, so 9 is the right maximum depth of C-state. This change is reflected in one place of the kernel-param file, but not in the other place where I suggest changing. Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com> Signed-off-by: Baole Ni <baolex.ni@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2016-07-11 01:57:37 +00:00
1 to 9 specify maximum depth of C-state.
intel_pstate= [X86]
disable
Do not enable intel_pstate as the default
scaling driver for the supported processors
active
Use intel_pstate driver to bypass the scaling
governors layer of cpufreq and provides it own
algorithms for p-state selection. There are two
P-state selection algorithms provided by
intel_pstate in the active mode: powersave and
performance. The way they both operate depends
on whether or not the hardware managed P-states
(HWP) feature has been enabled in the processor
and possibly on the processor model.
passive
Use intel_pstate as a scaling driver, but configure it
to work with generic cpufreq governors (instead of
enabling its internal governor). This mode cannot be
used along with the hardware-managed P-states (HWP)
feature.
force
Enable intel_pstate on systems that prohibit it by default
in favor of acpi-cpufreq. Forcing the intel_pstate driver
instead of acpi-cpufreq may disable platform features, such
as thermal controls and power capping, that rely on ACPI
P-States information being indicated to OSPM and therefore
should be used with caution. This option does not work with
processors that aren't supported by the intel_pstate driver
or on platforms that use pcc-cpufreq instead of acpi-cpufreq.
no_hwp
Do not enable hardware P state control (HWP)
if available.
hwp_only
Only load intel_pstate on systems which support
hardware P state control (HWP) if available.
support_acpi_ppc
Enforce ACPI _PPC performance limits. If the Fixed ACPI
Description Table, specifies preferred power management
profile as "Enterprise Server" or "Performance Server",
then this feature is turned on by default.
per_cpu_perf_limits
Allow per-logical-CPU P-State performance control limits using
cpufreq sysfs interface
intremap= [X86-64, Intel-IOMMU]
on enable Interrupt Remapping (default)
off disable Interrupt Remapping
nosid disable Source ID checking
x86, x2apic: Enable the bios request for x2apic optout On the platforms which are x2apic and interrupt-remapping capable, Linux kernel is enabling x2apic even if the BIOS doesn't. This is to take advantage of the features that x2apic brings in. Some of the OEM platforms are running into issues because of this, as their bios is not x2apic aware. For example, this was resulting in interrupt migration issues on one of the platforms. Also if the BIOS SMI handling uses APIC interface to send SMI's, then the BIOS need to be aware of x2apic mode that OS has enabled. On some of these platforms, BIOS doesn't have a HW mechanism to turnoff the x2apic feature to prevent OS from enabling it. To resolve this mess, recent changes to the VT-d2 specification: http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf includes a mechanism that provides BIOS a way to request system software to opt out of enabling x2apic mode. Look at the x2apic optout flag in the DMAR tables before enabling the x2apic mode in the platform. Also print a warning that we have disabled x2apic based on the BIOS request. Kernel boot parameter "intremap=no_x2apic_optout" can be used to override the BIOS x2apic optout request. Signed-off-by: Youquan Song <youquan.song@intel.com> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: yinghai@kernel.org Cc: joerg.roedel@amd.com Cc: tony.luck@intel.com Cc: dwmw2@infradead.org Link: http://lkml.kernel.org/r/20110824001456.171766616@sbsiddha-desk.sc.intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-24 00:05:18 +00:00
no_x2apic_optout
BIOS x2APIC opt-out request will be ignored
nopost disable Interrupt Posting
iomem= Disable strict checking of access to MMIO memory
strict regions from userspace.
relaxed
iommu= [X86]
off
force
noforce
biomerge
panic
nopanic
merge
nomerge
soft
pt [X86]
nopt [X86]
nobypass [PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.
iommu.forcedac= [ARM64, X86] Control IOVA allocation for PCI devices.
Format: { "0" | "1" }
0 - Try to allocate a 32-bit DMA address first, before
falling back to the full range if needed.
1 - Allocate directly from the full usable range,
forcing Dual Address Cycle for PCI cards supporting
greater than 32-bit addressing.
iommu.strict= [ARM64, X86, S390] Configure TLB invalidation behaviour
Format: { "0" | "1" }
0 - Lazy mode.
Request that DMA unmap operations use deferred
invalidation of hardware TLBs, for increased
throughput at the cost of reduced device isolation.
Will fall back to strict mode if not supported by
the relevant IOMMU driver.
1 - Strict mode.
DMA unmap operations invalidate IOMMU hardware TLBs
synchronously.
unset - Use value of CONFIG_IOMMU_DEFAULT_DMA_{LAZY,STRICT}.
Note: on x86, strict mode specified via one of the
legacy driver-specific options takes precedence.
iommu.passthrough=
[ARM64, X86] Configure DMA to bypass the IOMMU by default.
Format: { "0" | "1" }
0 - Use IOMMU translation for DMA.
1 - Bypass the IOMMU for DMA.
unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
io7= [HW] IO7 for Marvel-based Alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
io_delay= [X86] I/O delay method
0x80
Standard port 0x80 based delay
0xed
Alternate port 0xed based delay (needed on some systems)
x86: provide a DMI based port 0x80 I/O delay override. x86: provide a DMI based port 0x80 I/O delay override. Certain (HP) laptops experience trouble from our port 0x80 I/O delay writes. This patch provides for a DMI based switch to the "alternate diagnostic port" 0xed (as used by some BIOSes as well) for these. David P. Reed confirmed that port 0xed works for him and provides a proper delay. The symptoms of _not_ working are a hanging machine, with "hwclock" use being a direct trigger. Earlier versions of this attempted to simply use udelay(2), with the 2 being a value tested to be a nicely conservative upper-bound with help from many on the linux-kernel mailinglist but that approach has two problems. First, pre-loops_per_jiffy calibration (which is post PIT init while some implementations of the PIT are actually one of the historically problematic devices that need the delay) udelay() isn't particularly well-defined. We could initialise loops_per_jiffy conservatively (and based on CPU family so as to not unduly delay old machines) which would sort of work, but... Second, delaying isn't the only effect that a write to port 0x80 has. It's also a PCI posting barrier which some devices may be explicitly or implicitly relying on. Alan Cox did a survey and found evidence that additionally some drivers may be racy on SMP without the bus locking outb. Switching to an inb() makes the timing too unpredictable and as such, this DMI based switch should be the safest approach for now. Any more invasive changes should get more rigid testing first. It's moreover only very few machines with the problem and a DMI based hack seems to fit that situation. This also introduces a command-line parameter "io_delay" to override the DMI based choice again: io_delay=<standard|alternate> where "standard" means using the standard port 0x80 and "alternate" port 0xed. This retains the udelay method as a config (CONFIG_UDELAY_IO_DELAY) and command-line ("io_delay=udelay") choice for testing purposes as well. This does not change the io_delay() in the boot code which is using the same port 0x80 I/O delay but those do not appear to be a problem as David P. Reed reported the problem was already gone after using the udelay version. He moreover reported that booting with "acpi=off" also fixed things and seeing as how ACPI isn't touched until after this DMI based I/O port switch I believe it's safe to leave the ones in the boot code be. The DMI strings from David's HP Pavilion dv9000z are in there already and we need to get/verify the DMI info from other machines with the problem, notably the HP Pavilion dv6000z. This patch is partly based on earlier patches from Pavel Machek and David P. Reed. Signed-off-by: Rene Herman <rene.herman@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 12:30:05 +00:00
udelay
Simple two microseconds delay
none
No delay
x86: provide a DMI based port 0x80 I/O delay override. x86: provide a DMI based port 0x80 I/O delay override. Certain (HP) laptops experience trouble from our port 0x80 I/O delay writes. This patch provides for a DMI based switch to the "alternate diagnostic port" 0xed (as used by some BIOSes as well) for these. David P. Reed confirmed that port 0xed works for him and provides a proper delay. The symptoms of _not_ working are a hanging machine, with "hwclock" use being a direct trigger. Earlier versions of this attempted to simply use udelay(2), with the 2 being a value tested to be a nicely conservative upper-bound with help from many on the linux-kernel mailinglist but that approach has two problems. First, pre-loops_per_jiffy calibration (which is post PIT init while some implementations of the PIT are actually one of the historically problematic devices that need the delay) udelay() isn't particularly well-defined. We could initialise loops_per_jiffy conservatively (and based on CPU family so as to not unduly delay old machines) which would sort of work, but... Second, delaying isn't the only effect that a write to port 0x80 has. It's also a PCI posting barrier which some devices may be explicitly or implicitly relying on. Alan Cox did a survey and found evidence that additionally some drivers may be racy on SMP without the bus locking outb. Switching to an inb() makes the timing too unpredictable and as such, this DMI based switch should be the safest approach for now. Any more invasive changes should get more rigid testing first. It's moreover only very few machines with the problem and a DMI based hack seems to fit that situation. This also introduces a command-line parameter "io_delay" to override the DMI based choice again: io_delay=<standard|alternate> where "standard" means using the standard port 0x80 and "alternate" port 0xed. This retains the udelay method as a config (CONFIG_UDELAY_IO_DELAY) and command-line ("io_delay=udelay") choice for testing purposes as well. This does not change the io_delay() in the boot code which is using the same port 0x80 I/O delay but those do not appear to be a problem as David P. Reed reported the problem was already gone after using the udelay version. He moreover reported that booting with "acpi=off" also fixed things and seeing as how ACPI isn't touched until after this DMI based I/O port switch I believe it's safe to leave the ones in the boot code be. The DMI strings from David's HP Pavilion dv9000z are in there already and we need to get/verify the DMI info from other machines with the problem, notably the HP Pavilion dv6000z. This patch is partly based on earlier patches from Pavel Machek and David P. Reed. Signed-off-by: Rene Herman <rene.herman@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 12:30:05 +00:00
ip= [IP_PNP]
See Documentation/admin-guide/nfs/nfsroot.rst.
ipc: allow boot time extension of IPCMNI from 32k to 16M The maximum number of unique System V IPC identifiers was limited to 32k. That limit should be big enough for most use cases. However, there are some users out there requesting for more, especially those that are migrating from Solaris which uses 24 bits for unique identifiers. To satisfy the need of those users, a new boot time kernel option "ipcmni_extend" is added to extend the IPCMNI value to 16M. This is a 512X increase which should be big enough for users out there that need a large number of unique IPC identifier. The use of this new option will change the pattern of the IPC identifiers returned by functions like shmget(2). An application that depends on such pattern may not work properly. So it should only be used if the users really need more than 32k of unique IPC numbers. This new option does have the side effect of reducing the maximum number of unique sequence numbers from 64k down to 128. So it is a trade-off. The computation of a new IPC id is not done in the performance critical path. So a little bit of additional overhead shouldn't have any real performance impact. Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 22:46:29 +00:00
ipcmni_extend [KNL] Extend the maximum number of unique System V
IPC identifiers from 32,768 to 16,777,216.
irqaffinity= [SMP] Set the default irq affinity mask
2016-10-11 20:51:35 +00:00
The argument is a cpu list, as described above.
irqchip.gicv2_force_probe=
[ARM, ARM64]
Format: <bool>
Force the kernel to look for the second 4kB page
of a GICv2 controller even if the memory range
exposed by the device tree is too small.
irqchip.gicv3_nolpi=
[ARM, ARM64]
Force the kernel to ignore the availability of
LPIs (and by consequence ITSs). Intended for system
that use the kernel as a bootloader, and thus want
to let secondary kernels in charge of setting up
LPIs.
irqchip.gicv3_pseudo_nmi= [ARM64]
Enables support for pseudo-NMIs in the kernel. This
requires the kernel to be built with
CONFIG_ARM64_PSEUDO_NMI.
irqfixup [HW]
When an interrupt is not handled search all handlers
for it. Intended to get systems with badly broken
firmware running.
irqpoll [HW]
When an interrupt is not handled search all handlers
for it. Also check all handlers each timer
interrupt. Intended to get systems with badly broken
firmware running.
isapnp= [ISAPNP]
Format: <RDP>,<reset>,<pci_scan>,<verbosity>
isolcpus= [KNL,SMP,ISOL] Isolate a given set of CPUs from disturbance.
[Deprecated - use cpusets instead]
Format: [flag-list,]<cpu-list>
Specify one or more CPUs to isolate from disturbances
specified in the flag list (default: domain):
nohz
Disable the tick when a single task runs.
A residual 1Hz tick is offloaded to workqueues, which you
need to affine to housekeeping through the global
workqueue's affinity configured via the
/sys/devices/virtual/workqueue/cpumask sysfs file, or
by using the 'domain' flag described below.
NOTE: by default the global workqueue runs on all CPUs,
so to protect individual CPUs the 'cpumask' file has to
be configured manually after bootup.
domain
Isolate from the general SMP balancing and scheduling
algorithms. Note that performing domain isolation this way
is irreversible: it's not possible to bring back a CPU to
the domains once isolated through isolcpus. It's strongly
advised to use cpusets instead to disable scheduler load
balancing through the "cpuset.sched_load_balance" file.
It offers a much more flexible interface where CPUs can
move in and out of an isolated set anytime.
You can move a process onto or off an "isolated" CPU via
the CPU affinity syscalls or cpuset.
<cpu number> begins at 0 and the maximum value is
"number of CPUs in system - 1".
genirq, sched/isolation: Isolate from handling managed interrupts The affinity of managed interrupts is completely handled in the kernel and cannot be changed via the /proc/irq/* interfaces from user space. As the kernel tries to spread out interrupts evenly accross CPUs on x86 to prevent vector exhaustion, it can happen that a managed interrupt whose affinity mask contains both isolated and housekeeping CPUs is routed to an isolated CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts on the isolated CPU. Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding logic in the interrupt affinity selection code. The subparameter indicates to the interrupt affinity selection logic that it should try to avoid the above scenario. This isolation is best effort and only effective if the automatically assigned interrupt mask of a device queue contains isolated and housekeeping CPUs. If housekeeping CPUs are online then such interrupts are directed to the housekeeping CPU so that IO submitted on the housekeeping CPU cannot disturb the isolated CPU. If a queue's affinity mask contains only isolated CPUs then this parameter has no effect on the interrupt routing decision, though interrupts are only happening when tasks running on those isolated CPUs submit IO. IO submitted on housekeeping CPUs has no influence on those queues. If the affinity mask contains both housekeeping and isolated CPUs, but none of the contained housekeeping CPUs is online, then the interrupt is also routed to an isolated CPU. Interrupts are only delivered when one of the isolated CPUs in the affinity mask submits IO. If one of the contained housekeeping CPUs comes online, the CPU hotplug logic migrates the interrupt automatically back to the upcoming housekeeping CPU. Depending on the type of interrupt controller, this can require that at least one interrupt is delivered to the isolated CPU in order to complete the migration. [ tglx: Removed unused parameter, added and edited comments/documentation and rephrased the changelog so it contains more details. ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com
2020-01-20 09:16:25 +00:00
managed_irq
Isolate from being targeted by managed interrupts
which have an interrupt mask containing isolated
CPUs. The affinity of managed interrupts is
handled by the kernel and cannot be changed via
the /proc/irq/* interfaces.
This isolation is best effort and only effective
if the automatically assigned interrupt mask of a
device queue contains isolated and housekeeping
CPUs. If housekeeping CPUs are online then such
interrupts are directed to the housekeeping CPU
so that IO submitted on the housekeeping CPU
cannot disturb the isolated CPU.
If a queue's affinity mask contains only isolated
CPUs then this parameter has no effect on the
interrupt routing decision, though interrupts are
only delivered when tasks running on those
isolated CPUs submit IO. IO submitted on
housekeeping CPUs has no influence on those
queues.
genirq, sched/isolation: Isolate from handling managed interrupts The affinity of managed interrupts is completely handled in the kernel and cannot be changed via the /proc/irq/* interfaces from user space. As the kernel tries to spread out interrupts evenly accross CPUs on x86 to prevent vector exhaustion, it can happen that a managed interrupt whose affinity mask contains both isolated and housekeeping CPUs is routed to an isolated CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts on the isolated CPU. Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding logic in the interrupt affinity selection code. The subparameter indicates to the interrupt affinity selection logic that it should try to avoid the above scenario. This isolation is best effort and only effective if the automatically assigned interrupt mask of a device queue contains isolated and housekeeping CPUs. If housekeeping CPUs are online then such interrupts are directed to the housekeeping CPU so that IO submitted on the housekeeping CPU cannot disturb the isolated CPU. If a queue's affinity mask contains only isolated CPUs then this parameter has no effect on the interrupt routing decision, though interrupts are only happening when tasks running on those isolated CPUs submit IO. IO submitted on housekeeping CPUs has no influence on those queues. If the affinity mask contains both housekeeping and isolated CPUs, but none of the contained housekeeping CPUs is online, then the interrupt is also routed to an isolated CPU. Interrupts are only delivered when one of the isolated CPUs in the affinity mask submits IO. If one of the contained housekeeping CPUs comes online, the CPU hotplug logic migrates the interrupt automatically back to the upcoming housekeeping CPU. Depending on the type of interrupt controller, this can require that at least one interrupt is delivered to the isolated CPU in order to complete the migration. [ tglx: Removed unused parameter, added and edited comments/documentation and rephrased the changelog so it contains more details. ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com
2020-01-20 09:16:25 +00:00
The format of <cpu-list> is described above.
iucv= [HW,NET]
ivrs_ioapic [HW,X86-64]
Provide an override to the IOAPIC-ID<->DEVICE-ID
mapping provided in the IVRS ACPI table.
By default, PCI segment is 0, and can be omitted.
For example, to map IOAPIC-ID decimal 10 to
PCI segment 0x1 and PCI device 00:14.0,
write the parameter as:
ivrs_ioapic=10@0001:00:14.0
Deprecated formats:
* To map IOAPIC-ID decimal 10 to PCI device 00:14.0
write the parameter as:
ivrs_ioapic[10]=00:14.0
* To map IOAPIC-ID decimal 10 to PCI segment 0x1 and
PCI device 00:14.0 write the parameter as:
ivrs_ioapic[10]=0001:00:14.0
ivrs_hpet [HW,X86-64]
Provide an override to the HPET-ID<->DEVICE-ID
mapping provided in the IVRS ACPI table.
By default, PCI segment is 0, and can be omitted.
For example, to map HPET-ID decimal 10 to
PCI segment 0x1 and PCI device 00:14.0,
write the parameter as:
ivrs_hpet=10@0001:00:14.0
Deprecated formats:
* To map HPET-ID decimal 0 to PCI device 00:14.0
write the parameter as:
ivrs_hpet[0]=00:14.0
* To map HPET-ID decimal 10 to PCI segment 0x1 and
PCI device 00:14.0 write the parameter as:
ivrs_ioapic[10]=0001:00:14.0
ivrs_acpihid [HW,X86-64]
Provide an override to the ACPI-HID:UID<->DEVICE-ID
mapping provided in the IVRS ACPI table.
By default, PCI segment is 0, and can be omitted.
For example, to map UART-HID:UID AMD0020:0 to
PCI segment 0x1 and PCI device ID 00:14.5,
write the parameter as:
ivrs_acpihid=AMD0020:0@0001:00:14.5
Deprecated formats:
* To map UART-HID:UID AMD0020:0 to PCI segment is 0,
PCI device ID 00:14.5, write the parameter as:
ivrs_acpihid[00:14.5]=AMD0020:0
* To map UART-HID:UID AMD0020:0 to PCI segment 0x1 and
PCI device ID 00:14.5, write the parameter as:
ivrs_acpihid[0001:00:14.5]=AMD0020:0
js= [HW,JOY] Analog joystick
See Documentation/input/joydev/joystick.rst.
kasan_multi_shot
[KNL] Enforce KASAN (Kernel Address Sanitizer) to print
report on every invalid memory access. Without this
parameter KASAN will print report only for the first
invalid access.
keep_bootcon [KNL]
Do not unregister boot console at start. This is only
useful for debugging when something happens in the window
between unregistering the boot console and initializing
the real console.
keepinitrd [HW,ARM] See retain_initrd.
kernelcore= [KNL,X86,IA-64,PPC]
mm, page_alloc: extend kernelcore and movablecore for percent Both kernelcore= and movablecore= can be used to define the amount of ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires the system memory capacity to be known when specifying the command line, however. This introduces the ability to define both kernelcore= and movablecore= as a percentage of total system memory. This is convenient for systems software that wants to define the amount of ZONE_MOVABLE, for example, as a proportion of a system's memory rather than a hardcoded byte value. To define the percentage, the final character of the parameter should be a '%'. mhocko: "why is anyone using these options nowadays?" rientjes: : : Fragmentation of non-__GFP_MOVABLE pages due to low on memory : situations can pollute most pageblocks on the system, as much as 1GB of : slab being fragmented over 128GB of memory, for example. When the : amount of kernel memory is well bounded for certain systems, it is : better to aggressively reclaim from existing MIGRATE_UNMOVABLE : pageblocks rather than eagerly fallback to others. : : We have additional patches that help with this fragmentation if you're : interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE : pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and : draining of pcp lists back to the zone free area to prevent stranding. [rientjes@google.com: updates] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 23:23:09 +00:00
Format: nn[KMGTPE] | nn% | "mirror"
This parameter specifies the amount of memory usable by
the kernel for non-movable allocations. The requested
amount is spread evenly throughout all nodes in the
system as ZONE_NORMAL. The remaining memory is used for
movable memory in its own zone, ZONE_MOVABLE. In the
event, a node is too small to have both ZONE_NORMAL and
ZONE_MOVABLE, kernelcore memory will take priority and
other nodes will have a larger ZONE_MOVABLE.
ZONE_MOVABLE is used for the allocation of pages that
may be reclaimed or moved by the page migration
subsystem. Note that allocations like PTEs-from-HighMem
still use the HighMem zone if it exists, and the Normal
zone if it does not.
mm, page_alloc: extend kernelcore and movablecore for percent Both kernelcore= and movablecore= can be used to define the amount of ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires the system memory capacity to be known when specifying the command line, however. This introduces the ability to define both kernelcore= and movablecore= as a percentage of total system memory. This is convenient for systems software that wants to define the amount of ZONE_MOVABLE, for example, as a proportion of a system's memory rather than a hardcoded byte value. To define the percentage, the final character of the parameter should be a '%'. mhocko: "why is anyone using these options nowadays?" rientjes: : : Fragmentation of non-__GFP_MOVABLE pages due to low on memory : situations can pollute most pageblocks on the system, as much as 1GB of : slab being fragmented over 128GB of memory, for example. When the : amount of kernel memory is well bounded for certain systems, it is : better to aggressively reclaim from existing MIGRATE_UNMOVABLE : pageblocks rather than eagerly fallback to others. : : We have additional patches that help with this fragmentation if you're : interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE : pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and : draining of pcp lists back to the zone free area to prevent stranding. [rientjes@google.com: updates] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 23:23:09 +00:00
It is possible to specify the exact amount of memory in
the form of "nn[KMGTPE]", a percentage of total system
memory in the form of "nn%", or "mirror". If "mirror"
option is specified, mirrored (reliable) memory is used
for non-movable allocations and remaining memory is used
mm, page_alloc: extend kernelcore and movablecore for percent Both kernelcore= and movablecore= can be used to define the amount of ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires the system memory capacity to be known when specifying the command line, however. This introduces the ability to define both kernelcore= and movablecore= as a percentage of total system memory. This is convenient for systems software that wants to define the amount of ZONE_MOVABLE, for example, as a proportion of a system's memory rather than a hardcoded byte value. To define the percentage, the final character of the parameter should be a '%'. mhocko: "why is anyone using these options nowadays?" rientjes: : : Fragmentation of non-__GFP_MOVABLE pages due to low on memory : situations can pollute most pageblocks on the system, as much as 1GB of : slab being fragmented over 128GB of memory, for example. When the : amount of kernel memory is well bounded for certain systems, it is : better to aggressively reclaim from existing MIGRATE_UNMOVABLE : pageblocks rather than eagerly fallback to others. : : We have additional patches that help with this fragmentation if you're : interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE : pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and : draining of pcp lists back to the zone free area to prevent stranding. [rientjes@google.com: updates] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 23:23:09 +00:00
for Movable pages. "nn[KMGTPE]", "nn%", and "mirror"
are exclusive, so you cannot specify multiple forms.
echi-dbgp: Add kernel debugger support for the usb debug port This patch adds the capability to use the usb debug port with the kernel debugger. It is also still possible to use this functionality with or without the earlyprintk=dbgpX. It is possible to use the kgdbwait boot argument to debug very early in the kernel start up code. There are two ways to use this driver extension with a kernel boot argument. 1) kgdbdbgp=# -- Where # is the number of the usb debug controller You must use sysrq-g to break into the kernel debugger on another connection type other than the dbgp. 2) kgdbdbgp=#debugControlNum#,#Seconds# In this mode, the usb debug port is polled every #Seconds# for character input. It is possible to use gdb or press control-c to break into the kernel debugger. From the implementation perspective there are 3 high level changes. 1) Allow variable retries for the the hardware via dbgp_bulk_read(). The amount of retries for the dbgp_bulk_read() needed to be variable instead of fixed. We do not want to poll at all when the kernel is operating in interrupt driven mode. The polling only occurs if the kernel was booted when specifying some number of seconds via the kgdbdbgp boot argument (IE kgdbdbgp=0,1). In this case the loop count is reduced to 1 so as introduce the smallest amount of latency as possible. 2) Save the bulk IN endpoint address for use by the kgdb code. 3) The addition of the kgdb interface code. This consisted of adding in a character read function for the dbgp as well as a polling thread to allow the dbgp to interrupt the kernel execution. The rest is the typical kgdb I/O api. CC: Eric Biederman <ebiederm@xmission.com> CC: Yinghai Lu <yhlu.kernel@gmail.com> CC: linux-usb@vger.kernel.org Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-21 02:04:31 +00:00
kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
port as it is probed via PCI. The poll interval is
optional and is the number seconds in between
each poll cycle to the debug port in case you need
the functionality for interrupting the kernel with
gdb or control-c on the dbgp connection. When
not using this parameter you use sysrq-g to break into
the kernel debugger.
kgdboc= [KGDB,HW] kgdb over consoles.
Requires a tty driver that supports console polling,
or a supported polling keyboard driver (non-usb).
Serial only format: <serial_device>[,baud]
keyboard only format: kbd
keyboard and serial format: kbd,<serial_device>[,baud]
Optional Kernel mode setting:
kms, kbd format: kms,kbd
kms, kbd and serial format: kms,kbd,<ser_dev>[,baud]
kgdboc_earlycon= [KGDB,HW]
If the boot console provides the ability to read
characters and can work in polling mode, you can use
this parameter to tell kgdb to use it as a backend
until the normal console is registered. Intended to
be used together with the kgdboc parameter which
specifies the normal console to transition to.
The name of the early console should be specified
as the value of this parameter. Note that the name of
the early console might be different than the tty
name passed to kgdboc. It's OK to leave the value
blank and the first boot console that implements
read() will be picked.
kgdbwait [KGDB] Stop kernel execution and enter the
kernel debugger at the earliest opportunity.
kmac= [MIPS] Korina ethernet MAC address.
Configure the RouterBoard 532 series on-chip
Ethernet adapter MAC address.
kmemleak= [KNL] Boot-time kmemleak enable/disable
Valid arguments: on, off
Default: on
Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,
the default is off.
kprobe_event=[probe-list]
[FTRACE] Add kprobe events and enable at boot time.
The probe-list is a semicolon delimited list of probe
definitions. Each definition is same as kprobe_events
interface, but the parameters are comma delimited.
For example, to add a kprobe event on vfs_read with
arg1 and arg2, add to the command line;
kprobe_event=p,vfs_read,$arg1,$arg2
See also Documentation/trace/kprobetrace.rst "Kernel
Boot Parameter" section.
kpti= [ARM64] Control page table isolation of user
and kernel address spaces.
Default: enabled on cores which need mitigation.
0: force disabled
1: force enabled
kunit.enable= [KUNIT] Enable executing KUnit tests. Requires
CONFIG_KUNIT to be set to be fully enabled. The
default value can be overridden via
KUNIT_DEFAULT_ENABLED.
Default is 1 (enabled)
kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
Default is 0 (don't ignore, but inject #GP)
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 23:07:36 +00:00
kvm.eager_page_split=
[KVM,X86] Controls whether or not KVM will try to
proactively split all huge pages during dirty logging.
Eager page splitting reduces interruptions to vCPU
execution by eliminating the write-protection faults
and MMU lock contention that would otherwise be
required to split huge pages lazily.
VM workloads that rarely perform writes or that write
only to a small region of VM memory may benefit from
disabling eager page splitting to allow huge pages to
still be used for reads.
The behavior of eager page splitting depends on whether
KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled. If
disabled, all huge pages in a memslot will be eagerly
split when dirty logging is enabled on that memslot. If
enabled, eager page splitting will be performed during
the KVM_CLEAR_DIRTY ioctl, and only for the pages being
cleared.
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 23:07:36 +00:00
KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs Add support for Eager Page Splitting pages that are mapped by nested MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB pages, and then splitting all 2MiB pages to 4KiB pages. Note, Eager Page Splitting is limited to nested MMUs as a policy rather than due to any technical reason (the sp->role.guest_mode check could just be deleted and Eager Page Splitting would work correctly for all shadow MMU pages). There is really no reason to support Eager Page Splitting for tdp_mmu=N, since such support will eventually be phased out, and there is no current use case supporting Eager Page Splitting on hosts where TDP is either disabled or unavailable in hardware. Furthermore, future improvements to nested MMU scalability may diverge the code from the legacy shadow paging implementation. These improvements will be simpler to make if Eager Page Splitting does not have to worry about legacy shadow paging. Splitting huge pages mapped by nested MMUs requires dealing with some extra complexity beyond that of the TDP MMU: (1) The shadow MMU has a limit on the number of shadow pages that are allowed to be allocated. So, as a policy, Eager Page Splitting refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer pages available. (2) Splitting a huge page may end up re-using an existing lower level shadow page tables. This is unlike the TDP MMU which always allocates new shadow page tables when splitting. (3) When installing the lower level SPTEs, they must be added to the rmap which may require allocating additional pte_list_desc structs. Case (2) is especially interesting since it may require a TLB flush, unlike the TDP MMU which can fully split huge pages without any TLB flushes. Specifically, an existing lower level page table may point to even lower level page tables that are not fully populated, effectively unmapping a portion of the huge page, which requires a flush. As of this commit, a flush is always done always after dropping the huge page and before installing the lower level page table. This TLB flush could instead be delayed until the MMU lock is about to be dropped, which would batch flushes for multiple splits. However these flushes should be rare in practice (a huge page must be aliased in multiple SPTEs and have been split for NX Huge Pages in only some of them). Flushing immediately is simpler to plumb and also reduces the chances of tripping over a CPU bug (e.g. see iTLB multihit). [ This commit is based off of the original implementation of Eager Page Splitting from Peter in Google's kernel from 2016. ] Suggested-by: Peter Feiner <pfeiner@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-23-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-22 19:27:09 +00:00
Eager page splitting is only supported when kvm.tdp_mmu=Y.
KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19 23:07:36 +00:00
Default is Y (on).
kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
Default is false (don't support).
kvm.nx_huge_pages=
[KVM] Controls the software workaround for the
X86_BUG_ITLB_MULTIHIT bug.
force : Always deploy workaround.
off : Never deploy workaround.
auto : Deploy workaround based on the presence of
X86_BUG_ITLB_MULTIHIT.
Default is 'auto'.
If the software workaround is enabled for the host,
guests do need not to enable it for nested guests.
kvm.nx_huge_pages_recovery_ratio=
[KVM] Controls how many 4KiB pages are periodically zapped
back to huge pages. 0 disables the recovery, otherwise if
the value is N KVM will zap 1/Nth of the 4KiB pages every
period (see below). The default is 60.
kvm.nx_huge_pages_recovery_period_ms=
[KVM] Controls the time period at which KVM zaps 4KiB pages
back to huge pages. If the value is a non-zero N, KVM will
zap a portion (see ratio above) of the pages every N msecs.
If the value is 0 (the default), KVM will pick a period based
on the ratio, such that a page is zapped after 1 hour on average.
kvm-amd.nested= [KVM,AMD] Control nested virtualization feature in
KVM/SVM. Default is 1 (enabled).
kvm-amd.npt= [KVM,AMD] Control KVM's use of Nested Page Tables,
a.k.a. Two-Dimensional Page Tables. Default is 1
(enabled). Disable by KVM if hardware lacks support
for NPT.
kvm-arm.mode=
[KVM,ARM] Select one of KVM/arm64's modes of operation.
none: Forcefully disable KVM.
nvhe: Standard nVHE-based mode, without support for
protected guests.
protected: nVHE-based mode with support for guests whose
state is kept private from the host.
nested: VHE-based mode with support for nested
virtualization. Requires at least ARMv8.3
hardware.
Defaults to VHE/nVHE based on hardware support. Setting
mode to "protected" will disable kexec and hibernation
for the host. "nested" is experimental and should be
used with extreme caution.
kvm-arm.vgic_v3_group0_trap=
[KVM,ARM] Trap guest accesses to GICv3 group-0
system registers
kvm-arm.vgic_v3_group1_trap=
[KVM,ARM] Trap guest accesses to GICv3 group-1
system registers
kvm-arm.vgic_v3_common_trap=
[KVM,ARM] Trap guest accesses to GICv3 common
system registers
kvm-arm.vgic_v4_enable=
[KVM,ARM] Allow use of GICv4 for direct injection of
LPIs.
kvm_cma_resv_ratio=n [PPC]
Reserves given percentage from system memory area for
contiguous memory allocation for KVM hash pagetable
allocation.
By default it reserves 5% of total system memory.
Format: <integer>
Default: 5
kvm-intel.ept= [KVM,Intel] Control KVM's use of Extended Page Tables,
a.k.a. Two-Dimensional Page Tables. Default is 1
(enabled). Disable by KVM if hardware lacks support
for EPT.
kvm-intel.emulate_invalid_guest_state=
[KVM,Intel] Control whether to emulate invalid guest
state. Ignored if kvm-intel.enable_unrestricted_guest=1,
as guest state is never invalid for unrestricted
guests. This param doesn't apply to nested guests (L2),
as KVM never emulates invalid L2 guest state.
Default is 1 (enabled).
kvm-intel.flexpriority=
[KVM,Intel] Control KVM's use of FlexPriority feature
(TPR shadow). Default is 1 (enabled). Disable by KVM if
hardware lacks support for it.
kvm-intel.nested=
[KVM,Intel] Control nested virtualization feature in
KVM/VMX. Default is 1 (enabled).
kvm-intel.unrestricted_guest=
[KVM,Intel] Control KVM's use of unrestricted guest
feature (virtualized real and unpaged mode). Default
is 1 (enabled). Disable by KVM if EPT is disabled or
hardware lacks support for it.
kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
CVE-2018-3620.
Valid arguments: never, cond, always
always: L1D cache flush on every VMENTER.
cond: Flush L1D on VMENTER only when the code between
VMEXIT and VMENTER can leak host memory.
never: Disables the mitigation
Default is cond (do L1 cache flush in specific instances)
kvm-intel.vpid= [KVM,Intel] Control KVM's use of Virtual Processor
Identification feature (tagged TLBs). Default is 1
(enabled). Disable by KVM if hardware lacks support
for it.
l1d_flush= [X86,INTEL]
Control mitigation for L1D based snooping vulnerability.
Certain CPUs are vulnerable to an exploit against CPU
internal buffers which can forward information to a
disclosure gadget under certain conditions.
In vulnerable processors, the speculatively
forwarded data can be used in a cache side channel
attack, to access data to which the attacker does
not have direct access.
This parameter controls the mitigation. The
options are:
on - enable the interface for the mitigation
x86/bugs, kvm: Introduce boot-time control of L1TF mitigations Introduce the 'l1tf=' kernel command line option to allow for boot-time switching of mitigation that is used on processors affected by L1TF. The possible values are: full Provides all available mitigations for the L1TF vulnerability. Disables SMT and enables all mitigations in the hypervisors. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled. full,force Same as 'full', but disables SMT control. Implies the 'nosmt=force' command line option. sysfs control of SMT and the hypervisor flush control is disabled. flush Leaves SMT enabled and enables the conditional hypervisor mitigation. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled. flush,nosmt Disables SMT and enables the conditional hypervisor mitigation. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. If SMT is reenabled or flushing disabled at runtime hypervisors will issue a warning. flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is started in a potentially insecure configuration. off Disables hypervisor mitigations and doesn't emit any warnings. Default is 'flush'. Let KVM adhere to these semantics, which means: - 'lt1f=full,force' : Performe L1D flushes. No runtime control possible. - 'l1tf=full' - 'l1tf-flush' - 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if SMT has been runtime enabled or L1D flushing has been run-time enabled - 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted. - 'l1tf=off' : L1D flushes are not performed and no warnings are emitted. KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush' module parameter except when lt1f=full,force is set. This makes KVM's private 'nosmt' option redundant, and as it is a bit non-systematic anyway (this is something to control globally, not on hypervisor level), remove that option. Add the missing Documentation entry for the l1tf vulnerability sysfs file while at it. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
2018-07-13 14:23:25 +00:00
l1tf= [X86] Control mitigation of the L1TF vulnerability on
affected CPUs
The kernel PTE inversion protection is unconditionally
enabled and cannot be disabled.
full
Provides all available mitigations for the
L1TF vulnerability. Disables SMT and
enables all mitigations in the
hypervisors, i.e. unconditional L1D flush.
SMT control and L1D flush control via the
sysfs interface is still possible after
boot. Hypervisors will issue a warning
when the first VM is started in a
potentially insecure configuration,
i.e. SMT enabled or L1D flush disabled.
full,force
Same as 'full', but disables SMT and L1D
flush runtime control. Implies the
'nosmt=force' command line option.
(i.e. sysfs control of SMT is disabled.)
flush
Leaves SMT enabled and enables the default
hypervisor mitigation, i.e. conditional
L1D flush.
SMT control and L1D flush control via the
sysfs interface is still possible after
boot. Hypervisors will issue a warning
when the first VM is started in a
potentially insecure configuration,
i.e. SMT enabled or L1D flush disabled.
flush,nosmt
Disables SMT and enables the default
hypervisor mitigation.
SMT control and L1D flush control via the
sysfs interface is still possible after
boot. Hypervisors will issue a warning
when the first VM is started in a
potentially insecure configuration,
i.e. SMT enabled or L1D flush disabled.
flush,nowarn
Same as 'flush', but hypervisors will not
warn when a VM is started in a potentially
insecure configuration.
off
Disables hypervisor mitigations and doesn't
emit any warnings.
It also drops the swap size and available
RAM limit restriction on both hypervisor and
bare metal.
x86/bugs, kvm: Introduce boot-time control of L1TF mitigations Introduce the 'l1tf=' kernel command line option to allow for boot-time switching of mitigation that is used on processors affected by L1TF. The possible values are: full Provides all available mitigations for the L1TF vulnerability. Disables SMT and enables all mitigations in the hypervisors. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled. full,force Same as 'full', but disables SMT control. Implies the 'nosmt=force' command line option. sysfs control of SMT and the hypervisor flush control is disabled. flush Leaves SMT enabled and enables the conditional hypervisor mitigation. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled. flush,nosmt Disables SMT and enables the conditional hypervisor mitigation. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. If SMT is reenabled or flushing disabled at runtime hypervisors will issue a warning. flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is started in a potentially insecure configuration. off Disables hypervisor mitigations and doesn't emit any warnings. Default is 'flush'. Let KVM adhere to these semantics, which means: - 'lt1f=full,force' : Performe L1D flushes. No runtime control possible. - 'l1tf=full' - 'l1tf-flush' - 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if SMT has been runtime enabled or L1D flushing has been run-time enabled - 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted. - 'l1tf=off' : L1D flushes are not performed and no warnings are emitted. KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush' module parameter except when lt1f=full,force is set. This makes KVM's private 'nosmt' option redundant, and as it is a bit non-systematic anyway (this is something to control globally, not on hypervisor level), remove that option. Add the missing Documentation entry for the l1tf vulnerability sysfs file while at it. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
2018-07-13 14:23:25 +00:00
Default is 'flush'.
For details see: Documentation/admin-guide/hw-vuln/l1tf.rst
x86/bugs, kvm: Introduce boot-time control of L1TF mitigations Introduce the 'l1tf=' kernel command line option to allow for boot-time switching of mitigation that is used on processors affected by L1TF. The possible values are: full Provides all available mitigations for the L1TF vulnerability. Disables SMT and enables all mitigations in the hypervisors. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled. full,force Same as 'full', but disables SMT control. Implies the 'nosmt=force' command line option. sysfs control of SMT and the hypervisor flush control is disabled. flush Leaves SMT enabled and enables the conditional hypervisor mitigation. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled. flush,nosmt Disables SMT and enables the conditional hypervisor mitigation. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. If SMT is reenabled or flushing disabled at runtime hypervisors will issue a warning. flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is started in a potentially insecure configuration. off Disables hypervisor mitigations and doesn't emit any warnings. Default is 'flush'. Let KVM adhere to these semantics, which means: - 'lt1f=full,force' : Performe L1D flushes. No runtime control possible. - 'l1tf=full' - 'l1tf-flush' - 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if SMT has been runtime enabled or L1D flushing has been run-time enabled - 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted. - 'l1tf=off' : L1D flushes are not performed and no warnings are emitted. KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush' module parameter except when lt1f=full,force is set. This makes KVM's private 'nosmt' option redundant, and as it is a bit non-systematic anyway (this is something to control globally, not on hypervisor level), remove that option. Add the missing Documentation entry for the l1tf vulnerability sysfs file while at it. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
2018-07-13 14:23:25 +00:00
l2cr= [PPC]
l3cr= [PPC]
lapic [X86-32,APIC] Enable the local APIC even if BIOS
disabled it.
lapic= [X86,APIC] Do not use TSC deadline
value for LAPIC timer one-shot implementation. Default
back to the programmable timer unit in the LAPIC.
Format: notscdeadline
lapic_timer_c2_ok [X86,APIC] trust the local apic timer
in C2 power state.
libata.dma= [LIBATA] DMA control
libata.dma=0 Disable all PATA and SATA DMA
libata.dma=1 PATA and SATA Disk DMA only
libata.dma=2 ATAPI (CDROM) DMA only
libata.dma=4 Compact Flash DMA only
Combinations also work, so libata.dma=3 enables DMA
for disks and CDROMs, but not CFs.
libata.ignore_hpa= [LIBATA] Ignore HPA limit
libata.ignore_hpa=0 keep BIOS limits (default)
libata.ignore_hpa=1 ignore limits, using full disk
libata.noacpi [LIBATA] Disables use of ACPI in libata suspend/resume
when set.
Format: <int>
libata.force= [LIBATA] Force configurations. The format is a comma-
separated list of "[ID:]VAL" where ID is PORT[.DEVICE].
PORT and DEVICE are decimal numbers matching port, link
or device. Basically, it matches the ATA ID string
printed on console by libata. If the whole ID part is
omitted, the last PORT and DEVICE values are used. If
ID hasn't been specified yet, the configuration applies
to all ports, links and devices.
If only DEVICE is omitted, the parameter applies to
the port and all links and devices behind it. DEVICE
number of 0 either selects the first device or the
first fan-out link behind PMP device. It does not
select the host link. DEVICE number of 15 selects the
host link and device attached to it.
The VAL specifies the configuration to force. As long
as there is no ambiguity, shortcut notation is allowed.
For example, both 1.5 and 1.5G would work for 1.5Gbps.
The following configurations can be forced.
* Cable type: 40c, 80c, short40c, unk, ign or sata.
Any ID with matching PORT is used.
* SATA link speed limit: 1.5Gbps or 3.0Gbps.
* Transfer mode: pio[0-7], mwdma[0-4] and udma[0-7].
udma[/][16,25,33,44,66,100,133] notation is also
allowed.
* nohrst, nosrst, norst: suppress hard, soft and both
resets.
* rstonce: only attempt one reset during hot-unplug
link recovery.
* [no]dbdelay: Enable or disable the extra 200ms delay
before debouncing a link PHY and device presence
detection.
* [no]ncq: Turn on or off NCQ.
* [no]ncqtrim: Enable or disable queued DSM TRIM.
* [no]ncqati: Enable or disable NCQ trim on ATI chipset.
* [no]trim: Enable or disable (unqueued) TRIM.
* trim_zero: Indicate that TRIM command zeroes data.
* max_trim_128m: Set 128M maximum trim size limit.
* [no]dma: Turn on or off DMA transfers.
* atapi_dmadir: Enable ATAPI DMADIR bridge support.
* atapi_mod16_dma: Enable the use of ATAPI DMA for
commands that are not a multiple of 16 bytes.
* [no]dmalog: Enable or disable the use of the
READ LOG DMA EXT command to access logs.
* [no]iddevlog: Enable or disable access to the
identify device data log.
* [no]logdir: Enable or disable access to the general
purpose log directory.
* max_sec_128: Set transfer size limit to 128 sectors.
* max_sec_1024: Set or clear transfer size limit to
1024 sectors.
* max_sec_lba48: Set or clear transfer size limit to
65535 sectors.
* [no]lpm: Enable or disable link power management.
* [no]setxfer: Indicate if transfer speed mode setting
should be skipped.
ata: libata: cleanup fua support detection Move the detection of a device FUA support from ata_scsiop_mode_sense()/ata_dev_supports_fua() to device scan time in ata_dev_configure(). The function ata_dev_config_fua() is introduced to detect if a device supports FUA and this support is indicated using the new device flag ATA_DFLAG_FUA. In order to blacklist known buggy devices, the horkage flag ATA_HORKAGE_NO_FUA is introduced. Similarly to other horkage flags, the libata.force= arguments "fua" and "nofua" are also introduced to allow a user to control this horkage flag through the "force" libata module parameter. The ATA_DFLAG_FUA device flag is set only and only if all the following conditions are met: * libata.fua module parameter is set to 1 * The device supports the WRITE DMA FUA EXT command, * The device is not marked with the ATA_HORKAGE_NO_FUA flag, either from the blacklist or set by the user with libata.force=nofua * The device supports NCQ (while this is not mandated by the standards, this restriction is introduced to avoid problems with older non-NCQ devices). Enabling or diabling libata FUA support for all devices can now also be done using the "force=[no]fua" module parameter when libata.fua is set to 1. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
2022-10-14 09:05:38 +00:00
* [no]fua: Disable or enable FUA (Force Unit Access)
support for devices supporting this feature.
* dump_id: Dump IDENTIFY data.
* disable: Disable this device.
If there are multiple matching configurations changing
the same attribute, the last one is used.
load_ramdisk= [RAM] [Deprecated]
lockd.nlm_grace_period=P [NFS] Assign grace period.
Format: <integer>
lockd.nlm_tcpport=N [NFS] Assign TCP port.
Format: <integer>
lockd.nlm_timeout=T [NFS] Assign timeout value.
Format: <integer>
lockd.nlm_udpport=M [NFS] Assign UDP port.
Format: <integer>
lockdown= [SECURITY]
{ integrity | confidentiality }
Enable the kernel lockdown feature. If set to
integrity, kernel features that allow userland to
modify the running kernel are disabled. If set to
confidentiality, kernel features that allow userland
to extract confidential information from the kernel
are also disabled.
locktorture.acq_writer_lim= [KNL]
Set the time limit in jiffies for a lock
acquisition. Acquisitions exceeding this limit
will result in a splat once they do complete.
locktorture.bind_readers= [KNL]
Specify the list of CPUs to which the readers are
to be bound.
locktorture.bind_writers= [KNL]
Specify the list of CPUs to which the writers are
to be bound.
locktorture.call_rcu_chains= [KNL]
Specify the number of self-propagating call_rcu()
chains to set up. These are used to ensure that
there is a high probability of an RCU grace period
in progress at any given time. Defaults to 0,
which disables these call_rcu() chains.
locktorture.long_hold= [KNL]
Specify the duration in milliseconds for the
occasional long-duration lock hold time. Defaults
to 100 milliseconds. Select 0 to disable.
locktorture.nested_locks= [KNL]
Specify the maximum lock nesting depth that
locktorture is to exercise, up to a limit of 8
(MAX_NESTED_LOCKS). Specify zero to disable.
Note that this parameter is ineffective on types
of locks that do not support nested acquisition.
locktorture.nreaders_stress= [KNL]
Set the number of locking read-acquisition kthreads.
Defaults to being automatically set based on the
number of online CPUs.
locktorture.nwriters_stress= [KNL]
Set the number of locking write-acquisition kthreads.
locktorture.onoff_holdoff= [KNL]
Set time (s) after boot for CPU-hotplug testing.
locktorture.onoff_interval= [KNL]
Set time (s) between CPU-hotplug operations, or
zero to disable CPU-hotplug testing.
locktorture.rt_boost= [KNL]
Do periodic testing of real-time lock priority
boosting. Select 0 to disable, 1 to boost
only rt_mutex, and 2 to boost unconditionally.
Defaults to 2, which might seem to be an
odd choice, but which should be harmless for
non-real-time spinlocks, due to their disabling
of preemption. Note that non-realtime mutexes
disable boosting.
locktorture.rt_boost_factor= [KNL]
Number that determines how often and for how
long priority boosting is exercised. This is
scaled down by the number of writers, so that the
number of boosts per unit time remains roughly
constant as the number of writers increases.
On the other hand, the duration of each boost
increases with the number of writers.
locktorture.shuffle_interval= [KNL]
Set task-shuffle interval (jiffies). Shuffling
tasks allows some CPUs to go into dyntick-idle
mode during the locktorture test.
locktorture.shutdown_secs= [KNL]
Set time (s) after boot system shutdown. This
is useful for hands-off automated testing.
locktorture.stat_interval= [KNL]
Time (s) between statistics printk()s.
locktorture.stutter= [KNL]
Time (s) to stutter testing, for example,
specifying five seconds causes the test to run for
five seconds, wait for five seconds, and so on.
This tests the locking primitive's ability to
transition abruptly to and from idle.
locktorture.torture_type= [KNL]
Specify the locking implementation to test.
locktorture.verbose= [KNL]
Enable additional printk() statements.
locktorture.writer_fifo= [KNL]
Run the write-side locktorture kthreads at
sched_set_fifo() real-time priority.
logibm.irq= [HW,MOUSE] Logitech Bus Mouse Driver
Format: <irq>
loglevel= All Kernel Messages with a loglevel smaller than the
console loglevel will be printed to the console. It can
also be changed with klogd or other programs. The
loglevels are defined as follows:
0 (KERN_EMERG) system is unusable
1 (KERN_ALERT) action must be taken immediately
2 (KERN_CRIT) critical conditions
3 (KERN_ERR) error conditions
4 (KERN_WARNING) warning conditions
5 (KERN_NOTICE) normal but significant condition
6 (KERN_INFO) informational
7 (KERN_DEBUG) debug-level messages
log_buf_len=n[KMG] Sets the size of the printk ring buffer,
printk: allow increasing the ring buffer depending on the number of CPUs The default size of the ring buffer is too small for machines with a large amount of CPUs under heavy load. What ends up happening when debugging is the ring buffer overlaps and chews up old messages making debugging impossible unless the size is passed as a kernel parameter. An idle system upon boot up will on average spew out only about one or two extra lines but where this really matters is on heavy load and that will vary widely depending on the system and environment. There are mechanisms to help increase the kernel ring buffer for tracing through debugfs, and those interfaces even allow growing the kernel ring buffer per CPU. We also have a static value which can be passed upon boot. Relying on debugfs however is not ideal for production, and relying on the value passed upon bootup is can only used *after* an issue has creeped up. Instead of being reactive this adds a proactive measure which lets you scale the amount of contributions you'd expect to the kernel ring buffer under load by each CPU in the worst case scenario. We use num_possible_cpus() to avoid complexities which could be introduced by dynamically changing the ring buffer size at run time, num_possible_cpus() lets us use the upper limit on possible number of CPUs therefore avoiding having to deal with hotplugging CPUs on and off. This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT which is used to specify the maximum amount of contributions to the kernel ring buffer in the worst case before the kernel ring buffer flips over, the size is specified as a power of 2. The total amount of contributions made by each CPU must be greater than half of the default kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) in order to trigger an increase upon bootup. The kernel ring buffer is increased to the next power of two that would fit the required minimum kernel ring buffer size plus the additional CPU contribution. For example if LOG_BUF_SHIFT is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs in order to trigger an increase of the kernel ring buffer. With a LOG_CPU_BUF_SHIFT of 12 (4 KB) you'd require at least anything over > 64 possible CPUs to trigger an increase. If you had 128 possible CPUs the amount of minimum required kernel ring buffer bumps to: ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB Since we require the ring buffer to be a power of two the new required size would be 1024 KB. This CPU contributions are ignored when the "log_buf_len" kernel parameter is used as it forces the exact size of the ring buffer to an expected power of two value. [pmladek@suse.cz: fix build] Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.cz> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Tested-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Petr Mladek <pmladek@suse.cz> Cc: Joe Perches <joe@perches.com> Cc: Arun KS <arunks.linux@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 23:08:56 +00:00
in bytes. n must be a power of two and greater
than the minimal size. The minimal size is defined
by LOG_BUF_SHIFT kernel config parameter. There is
also CONFIG_LOG_CPU_MAX_BUF_SHIFT config parameter
that allows to increase the default size depending on
the number of CPUs. See init/Kconfig for more details.
logo.nologo [FB] Disables display of the built-in Linux logo.
This may be used to provide more screen space for
kernel log messages and is useful when debugging
kernel boot problems.
lp=0 [LP] Specify parallel ports to use, e.g,
lp=port[,port...] lp=none,parport0 (lp0 not configured, lp1 uses
lp=reset first parallel port). 'lp=0' disables the
lp=auto printer driver. 'lp=reset' (which can be
specified in addition to the ports) causes
attached printers to be reset. Using
lp=port1,port2,... specifies the parallel ports
to associate lp devices with, starting with
lp0. A port specification may be 'none' to skip
that lp device, or a parport name such as
'parport0'. Specifying 'lp=auto' instead of a
port specification list means that device IDs
from each port should be examined, to see if
an IEEE 1284-compliant printer is attached; if
so, the driver will manage that printer.
See also header of drivers/char/lp.c.
lpj=n [KNL]
Sets loops_per_jiffy to given constant, thus avoiding
time-consuming boot-time autodetection (up to 250 ms per
CPU). 0 enables autodetection (default). To determine
the correct value for your kernel, boot with normal
autodetection and see what value is printed. Note that
on SMP systems the preset will be applied to all CPUs,
which is likely to cause problems if your CPUs need
significantly divergent settings. An incorrect value
will cause delays in the kernel to be wrong, leading to
unpredictable I/O errors and other breakage. Although
unlikely, in the extreme case this might damage your
hardware.
ltpc= [NET]
Format: <io>,<irq>,<dma>
lsm.debug [SECURITY] Enable LSM initialization debugging output.
lsm=lsm1,...,lsmN
[SECURITY] Choose order of LSM initialization. This
overrides CONFIG_LSM, and the "security=" parameter.
machvec= [IA-64] Force the use of a particular machine-vector
(machvec) in a generic kernel.
Example: machvec=hpzx1
machtype= [Loongson] Share the same kernel image file between
different yeeloong laptops.
Example: machtype=lemote-yeeloong-2f-7inch
Docs: admin/kernel-parameters: edit a few boot options Clean up some of admin-guide/kernel-parameters.txt: a. "smt" should be "smt=" (S390) b. (dropped) c. Sparc supports the vdso= boot option d. make the tp_printk options (2) formatting similar to other options by adding spacing e. add "trace_clock=" with a reference to Documentation/trace/ftrace.rst f. use [IA-64] as documented instead of [ia64] g. fix formatting and text for test_suspend= h. fix formatting for swapaccount= i. fix formatting and grammar for video.brightness_switch_enabled= Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linux-ia64@vger.kernel.org Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: linux-pm@vger.kernel.org Cc: linux-acpi@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-04-03 05:48:20 +00:00
max_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory greater
than or equal to this physical address is ignored.
maxcpus= [SMP] Maximum number of processors that an SMP kernel
will bring up during bootup. maxcpus=n : n >= 0 limits
the kernel to bring up 'n' processors. Surely after
bootup you can bring up the other plugged cpu by executing
"echo 1 > /sys/devices/system/cpu/cpuX/online". So maxcpus
only takes effect during system bootup.
While n=0 is a special case, it is equivalent to "nosmp",
which also disables the IO APIC.
max_loop= [LOOP] The number of loop block devices that get
(loop.max_loop) unconditionally pre-created at init time. The default
number is configured by BLK_DEV_LOOP_MIN_COUNT. Instead
of statically allocating a predefined number, loop
devices can be requested on-demand with the
/dev/loop-control interface.
mce [X86-32] Machine Check Exception
mce=option [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
md= [HW] RAID subsystems devices and level
See Documentation/admin-guide/md.rst.
mdacon= [MDA]
Format: <first>,<last>
Specifies range of consoles to be captured by the MDA.
mds= [X86,INTEL]
Control mitigation for the Micro-architectural Data
Sampling (MDS) vulnerability.
Certain CPUs are vulnerable to an exploit against CPU
internal buffers which can forward information to a
disclosure gadget under certain conditions.
In vulnerable processors, the speculatively
forwarded data can be used in a cache side channel
attack, to access data to which the attacker does
not have direct access.
This parameter controls the MDS mitigation. The
options are:
full - Enable MDS mitigation on vulnerable CPUs
full,nosmt - Enable MDS mitigation and disable
SMT on vulnerable CPUs
off - Unconditionally disable MDS mitigation
x86/speculation: Fix incorrect MDS/TAA mitigation status For MDS vulnerable processors with TSX support, enabling either MDS or TAA mitigations will enable the use of VERW to flush internal processor buffers at the right code path. IOW, they are either both mitigated or both not. However, if the command line options are inconsistent, the vulnerabilites sysfs files may not report the mitigation status correctly. For example, with only the "mds=off" option: vulnerabilities/mds:Vulnerable; SMT vulnerable vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable The mds vulnerabilities file has wrong status in this case. Similarly, the taa vulnerability file will be wrong with mds mitigation on, but taa off. Change taa_select_mitigation() to sync up the two mitigation status and have them turned off if both "mds=off" and "tsx_async_abort=off" are present. Update documentation to emphasize the fact that both "mds=off" and "tsx_async_abort=off" have to be specified together for processors that are affected by both TAA and MDS to be effective. [ bp: Massage and add kernel-parameters.txt change too. ] Fixes: 1b42f017415b ("x86/speculation/taa: Add mitigation for TSX Async Abort") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: linux-doc@vger.kernel.org Cc: Mark Gross <mgross@linux.intel.com> Cc: <stable@vger.kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.com
2019-11-15 16:14:44 +00:00
On TAA-affected machines, mds=off can be prevented by
an active TAA mitigation as both vulnerabilities are
mitigated with the same mechanism so in order to disable
this mitigation, you need to specify tsx_async_abort=off
too.
Not specifying this option is equivalent to
mds=full.
For details see: Documentation/admin-guide/hw-vuln/mds.rst
mem=nn[KMG] [HEXAGON] Set the memory size.
Must be specified, otherwise memory size will be 0.
mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
mm/memory_hotplug.c: only respect mem= parameter during boot stage In commit 357b4da50a62 ("x86: respect memory size limiting via mem= parameter") a global varialbe max_mem_size is added to store the value parsed from 'mem= ', then checked when memory region is added. This truly stops those DIMMs from being added into system memory during boot-time. However, it also limits the later memory hotplug functionality. Any DIMM can't be hotplugged any more if its region is beyond the max_mem_size. We will get errors like: [ 216.387164] acpi PNP0C80:02: add_memory failed [ 216.389301] acpi PNP0C80:02: acpi_memory_enable_device() error [ 216.392187] acpi PNP0C80:02: Enumeration failure This will cause issue in a known use case where 'mem=' is added to the hypervisor. The memory that lies after 'mem=' boundary will be assigned to KVM guests. After commit 357b4da50a62 merged, memory can't be extended dynamically if system memory on hypervisor is not sufficient. So fix it by also checking if it's during boot-time restricting to add memory. Otherwise, skip the restriction. And also add this use case to document of 'mem=' kernel parameter. Fixes: 357b4da50a62 ("x86: respect memory size limiting via mem= parameter") Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200204050643.20925-1-bhe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 03:06:50 +00:00
Amount of memory to be used in cases as follows:
1 for test;
2 when the kernel is not able to see the whole system memory;
3 memory that lies after 'mem=' boundary is excluded from
the hypervisor, then assigned to KVM guests.
4 to limit the memory available for kdump kernel.
[ARC,MICROBLAZE] - the limit applies only to low memory,
high memory is not affected.
[ARM64] - only limits memory covered by the linear
mapping. The NOMAP regions are not affected.
mm/memory_hotplug.c: only respect mem= parameter during boot stage In commit 357b4da50a62 ("x86: respect memory size limiting via mem= parameter") a global varialbe max_mem_size is added to store the value parsed from 'mem= ', then checked when memory region is added. This truly stops those DIMMs from being added into system memory during boot-time. However, it also limits the later memory hotplug functionality. Any DIMM can't be hotplugged any more if its region is beyond the max_mem_size. We will get errors like: [ 216.387164] acpi PNP0C80:02: add_memory failed [ 216.389301] acpi PNP0C80:02: acpi_memory_enable_device() error [ 216.392187] acpi PNP0C80:02: Enumeration failure This will cause issue in a known use case where 'mem=' is added to the hypervisor. The memory that lies after 'mem=' boundary will be assigned to KVM guests. After commit 357b4da50a62 merged, memory can't be extended dynamically if system memory on hypervisor is not sufficient. So fix it by also checking if it's during boot-time restricting to add memory. Otherwise, skip the restriction. And also add this use case to document of 'mem=' kernel parameter. Fixes: 357b4da50a62 ("x86: respect memory size limiting via mem= parameter") Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200204050643.20925-1-bhe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 03:06:50 +00:00
[X86] Work as limiting max address. Use together
with memmap= to avoid physical address space collisions.
Without memmap= PCI devices could be placed at addresses
belonging to unused RAM.
mm/memory_hotplug.c: only respect mem= parameter during boot stage In commit 357b4da50a62 ("x86: respect memory size limiting via mem= parameter") a global varialbe max_mem_size is added to store the value parsed from 'mem= ', then checked when memory region is added. This truly stops those DIMMs from being added into system memory during boot-time. However, it also limits the later memory hotplug functionality. Any DIMM can't be hotplugged any more if its region is beyond the max_mem_size. We will get errors like: [ 216.387164] acpi PNP0C80:02: add_memory failed [ 216.389301] acpi PNP0C80:02: acpi_memory_enable_device() error [ 216.392187] acpi PNP0C80:02: Enumeration failure This will cause issue in a known use case where 'mem=' is added to the hypervisor. The memory that lies after 'mem=' boundary will be assigned to KVM guests. After commit 357b4da50a62 merged, memory can't be extended dynamically if system memory on hypervisor is not sufficient. So fix it by also checking if it's during boot-time restricting to add memory. Otherwise, skip the restriction. And also add this use case to document of 'mem=' kernel parameter. Fixes: 357b4da50a62 ("x86: respect memory size limiting via mem= parameter") Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200204050643.20925-1-bhe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 03:06:50 +00:00
Note that this only takes effects during boot time since
in above case 3, memory may need be hot added after boot
if system memory of hypervisor is not sufficient.
mem=nn[KMG]@ss[KMG]
[ARM,MIPS] - override the memory layout reported by
firmware.
Define a memory region of size nn[KMG] starting at
ss[KMG].
Multiple different regions can be specified with
multiple mem= parameters on the command line.
mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel
memory.
memblock=debug [KNL] Enable memblock debug messages.
memchunk=nn[KMG]
[KNL,SH] Allow user to override the default size for
per-device physically contiguous DMA buffers.
memhp_default_state=online/offline/online_kernel/online_movable
[KNL] Set the initial state for the memory hotplug
onlining policy. If not specified, the default value is
set according to the
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
option.
See Documentation/admin-guide/mm/memory-hotplug.rst.
memmap=exactmap [KNL,X86] Enable setting of an exact
E820 memory map, as specified by the user.
Such memmap=exactmap lines can be constructed based on
BIOS output or other requirements. See the memmap=nn@ss
option description.
memmap=nn[KMG]@ss[KMG]
[KNL, X86, MIPS, XTENSA] Force usage of a specific region of memory.
Region of memory to be used is from ss to ss+nn.
If @ss[KMG] is omitted, it is equivalent to mem=nn[KMG],
which limits max address to nn[KMG].
Multiple different regions can be specified,
comma delimited.
Example:
memmap=100M@2G,100M#3G,1G!1024G
memmap=nn[KMG]#ss[KMG]
[KNL,ACPI] Mark specific memory as ACPI data.
Region of memory to be marked is from ss to ss+nn.
memmap=nn[KMG]$ss[KMG]
[KNL,ACPI] Mark specific memory as reserved.
Region of memory to be reserved is from ss to ss+nn.
Example: Exclude memory from 0x18690000-0x1869ffff
memmap=64K$0x18690000
or
memmap=0x10000$0x18690000
Some bootloaders may need an escape character before '$',
like Grub2, otherwise '$' and the following number
will be eaten.
memmap=nn[KMG]!ss[KMG]
[KNL,X86] Mark specific memory as protected.
Region of memory to be used, from ss to ss+nn.
The memory region may be marked as e820 type 12 (0xc)
and is NVDIMM or ADR memory.
memmap=<size>%<offset>-<oldtype>+<newtype>
[KNL,ACPI] Convert memory within the specified region
from <oldtype> to <newtype>. If "-<oldtype>" is left
out, the whole region will be marked as <newtype>,
even if previously unavailable. If "+<newtype>" is left
out, matching memory will be removed. Types are
specified as e820 types, e.g., 1 = RAM, 2 = reserved,
3 = ACPI, 12 = PRAM.
memory_corruption_check=0/1 [X86]
Some BIOSes seem to corrupt the first 64k of
memory when doing things like suspend/resume.
Setting this option will scan the memory
looking for corruption. Enabling this will
both detect corruption and prevent the kernel
from using the memory being corrupted.
However, its intended as a diagnostic tool; if
repeatable BIOS-originated corruption always
affects the same memory, you can use memmap=
to prevent the kernel from using that memory.
memory_corruption_check_size=size [X86]
By default it checks for corruption in the low
64k, making this memory unavailable for normal
use. Use this parameter to scan for
corruption in more or less memory.
memory_corruption_check_period=seconds [X86]
By default it checks for corruption every 60
seconds. Use this parameter to check at some
other rate. 0 disables periodic checking.
memory_hotplug.memmap_on_memory
[KNL,X86,ARM] Boolean flag to enable this feature.
Format: {on | off (default)}
When enabled, runtime hotplugged memory will
mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory For now, the feature of hugetlb_free_vmemmap is not compatible with the feature of memory_hotplug.memmap_on_memory, and hugetlb_free_vmemmap takes precedence over memory_hotplug.memmap_on_memory. However, someone wants to make memory_hotplug.memmap_on_memory takes precedence over hugetlb_free_vmemmap since memmap_on_memory makes it more likely to succeed memory hotplug in close-to-OOM situations. So the decision of making hugetlb_free_vmemmap take precedence is not wise and elegant. The proper approach is to have hugetlb_vmemmap.c do the check whether the section which the HugeTLB pages belong to can be optimized. If the section's vmemmap pages are allocated from the added memory block itself, hugetlb_free_vmemmap should refuse to optimize the vmemmap, otherwise, do the optimization. Then both kernel parameters are compatible. So this patch introduces VmemmapSelfHosted to mask any non-optimizable vmemmap pages. The hugetlb_vmemmap can use this flag to detect if a vmemmap page can be optimized. [songmuchun@bytedance.com: walk vmemmap page tables to avoid false-positive] Link: https://lkml.kernel.org/r/20220620110616.12056-3-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220617135650.74901-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-17 13:56:50 +00:00
allocate its internal metadata (struct pages,
those vmemmap pages cannot be optimized even
if hugetlb_free_vmemmap is enabled) from the
hotadded memory which will allow to hotadd a
lot of memory without requiring additional
memory to do so.
This feature is disabled by default because it
has some implication on large (e.g. GB)
allocations in some configurations (e.g. small
memory blocks).
The state of the flag can be read in
/sys/module/memory_hotplug/parameters/memmap_on_memory.
Note that even when enabled, there are a few cases where
the feature is not effective.
memtest= [KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest
Format: <integer>
default : 0 <disable>
Specifies the number of memtest passes to be
performed. Each pass selects another test
pattern from a given set of patterns. Memtest
fills the memory with this pattern, validates
memory contents and reserves bad memory
regions that are detected.
mem_encrypt= [X86-64] AMD Secure Memory Encryption (SME) control
Valid arguments: on, off
Default: off
mem_encrypt=on: Activate SME
mem_encrypt=off: Do not activate SME
Refer to Documentation/virt/kvm/x86/amd-memory-encryption.rst
for details on when memory encryption can be activated.
PM / sleep: System sleep state selection interface rework There are systems in which the platform doesn't support any special sleep states, so suspend-to-idle (PM_SUSPEND_FREEZE) is the only available system sleep state. However, some user space frameworks only use the "mem" and (sometimes) "standby" sleep state labels, so the users of those systems need to modify user space in order to be able to use system suspend at all and that may be a pain in practice. Commit 0399d4db3edf (PM / sleep: Introduce command line argument for sleep state enumeration) attempted to address this problem by adding a command line argument to change the meaning of the "mem" string in /sys/power/state to make it trigger suspend-to-idle (instead of suspend-to-RAM). However, there also are systems in which the platform does support special sleep states, but suspend-to-idle is the preferred one anyway (it even may save more energy than the platform-provided sleep states in some cases) and the above commit doesn't help in those cases. For this reason, rework the system sleep state selection interface again (but preserve backwards compatibiliby). Namely, add a new sysfs file, /sys/power/mem_sleep, that will control the system suspend mode triggered by writing "mem" to /sys/power/state (in analogy with what /sys/power/disk does for hibernation). Make it select suspend-to-RAM ("deep" sleep) by default (if supported) and fall back to suspend-to-idle ("s2idle") otherwise and add a new command line argument, mem_sleep_default, allowing that default to be overridden if need be. At the same time, drop the relative_sleep_states command line argument that doesn't make sense any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Mario Limonciello <mario.limonciello@dell.com>
2016-11-21 21:45:40 +00:00
mem_sleep_default= [SUSPEND] Default system suspend mode:
s2idle - Suspend-To-Idle
shallow - Power-On Suspend or equivalent (if supported)
deep - Suspend-To-RAM or equivalent (if supported)
See Documentation/admin-guide/pm/sleep-states.rst.
PM / sleep: System sleep state selection interface rework There are systems in which the platform doesn't support any special sleep states, so suspend-to-idle (PM_SUSPEND_FREEZE) is the only available system sleep state. However, some user space frameworks only use the "mem" and (sometimes) "standby" sleep state labels, so the users of those systems need to modify user space in order to be able to use system suspend at all and that may be a pain in practice. Commit 0399d4db3edf (PM / sleep: Introduce command line argument for sleep state enumeration) attempted to address this problem by adding a command line argument to change the meaning of the "mem" string in /sys/power/state to make it trigger suspend-to-idle (instead of suspend-to-RAM). However, there also are systems in which the platform does support special sleep states, but suspend-to-idle is the preferred one anyway (it even may save more energy than the platform-provided sleep states in some cases) and the above commit doesn't help in those cases. For this reason, rework the system sleep state selection interface again (but preserve backwards compatibiliby). Namely, add a new sysfs file, /sys/power/mem_sleep, that will control the system suspend mode triggered by writing "mem" to /sys/power/state (in analogy with what /sys/power/disk does for hibernation). Make it select suspend-to-RAM ("deep" sleep) by default (if supported) and fall back to suspend-to-idle ("s2idle") otherwise and add a new command line argument, mem_sleep_default, allowing that default to be overridden if need be. At the same time, drop the relative_sleep_states command line argument that doesn't make sense any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Mario Limonciello <mario.limonciello@dell.com>
2016-11-21 21:45:40 +00:00
mfgpt_irq= [IA-32] Specify the IRQ to use for the
Multi-Function General Purpose Timers on AMD Geode
platforms.
mfgptfix [X86-32] Fix MFGPT timers on AMD Geode platforms when
the BIOS has incorrectly applied a workaround. TinyBIOS
version 0.98 is known to be affected, 0.99 fixes the
problem by letting the user disable the workaround.
mga= [HW,DRM]
microcode.force_minrev= [X86]
Format: <bool>
Enable or disable the microcode minimal revision
enforcement for the runtime microcode loader.
Docs: admin/kernel-parameters: edit a few boot options Clean up some of admin-guide/kernel-parameters.txt: a. "smt" should be "smt=" (S390) b. (dropped) c. Sparc supports the vdso= boot option d. make the tp_printk options (2) formatting similar to other options by adding spacing e. add "trace_clock=" with a reference to Documentation/trace/ftrace.rst f. use [IA-64] as documented instead of [ia64] g. fix formatting and text for test_suspend= h. fix formatting for swapaccount= i. fix formatting and grammar for video.brightness_switch_enabled= Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linux-ia64@vger.kernel.org Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: linux-pm@vger.kernel.org Cc: linux-acpi@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-04-03 05:48:20 +00:00
min_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory below this
physical address is ignored.
mini2440= [ARM,HW,KNL]
Format:[0..2][b][c][t]
Default: "0tb"
MINI2440 configuration specification:
0 - The attached screen is the 3.5" TFT
1 - The attached screen is the 7" TFT
2 - The VGA Shield is attached (1024x768)
Leaving out the screen size parameter will not load
the TFT driver, and the framebuffer will be left
unconfigured.
b - Enable backlight. The TFT backlight pin will be
linked to the kernel VESA blanking code and a GPIO
LED. This parameter is not necessary when using the
VGA shield.
c - Enable the s3c camera interface.
t - Reserved for enabling touchscreen support. The
touchscreen support is not enabled in the mainstream
kernel as of 2.6.30, a preliminary port can be found
in the "bleeding edge" mini2440 support kernel at
https://repo.or.cz/w/linux-2.6/mini2440.git
cpu/speculation: Add 'mitigations=' cmdline option Keeping track of the number of mitigations for all the CPU speculation bugs has become overwhelming for many users. It's getting more and more complicated to decide which mitigations are needed for a given architecture. Complicating matters is the fact that each arch tends to have its own custom way to mitigate the same vulnerability. Most users fall into a few basic categories: a) they want all mitigations off; b) they want all reasonable mitigations on, with SMT enabled even if it's vulnerable; or c) they want all reasonable mitigations on, with SMT disabled if vulnerable. Define a set of curated, arch-independent options, each of which is an aggregation of existing options: - mitigations=off: Disable all mitigations. - mitigations=auto: [default] Enable all the default mitigations, but leave SMT enabled, even if it's vulnerable. - mitigations=auto,nosmt: Enable all the default mitigations, disabling SMT if needed by a mitigation. Currently, these options are placeholders which don't actually do anything. They will be fleshed out in upcoming patches. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
2019-04-12 20:39:28 +00:00
mitigations=
[X86,PPC,S390,ARM64] Control optional mitigations for
CPU vulnerabilities. This is a set of curated,
x86/speculation: Support 'mitigations=' cmdline option Configure x86 runtime CPU speculation bug mitigations in accordance with the 'mitigations=' cmdline option. This affects Meltdown, Spectre v2, Speculative Store Bypass, and L1TF. The default behavior is unchanged. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/6616d0ae169308516cfdf5216bedd169f8a8291b.1555085500.git.jpoimboe@redhat.com
2019-04-12 20:39:29 +00:00
arch-independent options, each of which is an
aggregation of existing arch-specific options.
cpu/speculation: Add 'mitigations=' cmdline option Keeping track of the number of mitigations for all the CPU speculation bugs has become overwhelming for many users. It's getting more and more complicated to decide which mitigations are needed for a given architecture. Complicating matters is the fact that each arch tends to have its own custom way to mitigate the same vulnerability. Most users fall into a few basic categories: a) they want all mitigations off; b) they want all reasonable mitigations on, with SMT enabled even if it's vulnerable; or c) they want all reasonable mitigations on, with SMT disabled if vulnerable. Define a set of curated, arch-independent options, each of which is an aggregation of existing options: - mitigations=off: Disable all mitigations. - mitigations=auto: [default] Enable all the default mitigations, but leave SMT enabled, even if it's vulnerable. - mitigations=auto,nosmt: Enable all the default mitigations, disabling SMT if needed by a mitigation. Currently, these options are placeholders which don't actually do anything. They will be fleshed out in upcoming patches. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
2019-04-12 20:39:28 +00:00
off
Disable all optional CPU mitigations. This
improves system performance, but it may also
expose users to several CPU vulnerabilities.
x86/speculation: Add Gather Data Sampling mitigation Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to *DISABLE* it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and *STAY* enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-07-13 02:43:11 +00:00
Equivalent to: if nokaslr then kpti=0 [ARM64]
gather_data_sampling=off [X86]
kvm.nx_huge_pages=off [X86]
x86/speculation: Support 'mitigations=' cmdline option Configure x86 runtime CPU speculation bug mitigations in accordance with the 'mitigations=' cmdline option. This affects Meltdown, Spectre v2, Speculative Store Bypass, and L1TF. The default behavior is unchanged. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/6616d0ae169308516cfdf5216bedd169f8a8291b.1555085500.git.jpoimboe@redhat.com
2019-04-12 20:39:29 +00:00
l1tf=off [X86]
mds=off [X86]
x86/speculation: Add Gather Data Sampling mitigation Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to *DISABLE* it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and *STAY* enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-07-13 02:43:11 +00:00
mmio_stale_data=off [X86]
no_entry_flush [PPC]
no_uaccess_flush [PPC]
x86/speculation: Add Gather Data Sampling mitigation Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to *DISABLE* it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and *STAY* enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-07-13 02:43:11 +00:00
nobp=0 [S390]
nopti [X86,PPC]
nospectre_bhb [ARM64]
nospectre_v1 [X86,PPC]
nospectre_v2 [X86,PPC,S390,ARM64]
reg_file_data_sampling=off [X86]
retbleed=off [X86]
x86/speculation: Add Gather Data Sampling mitigation Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to *DISABLE* it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and *STAY* enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-07-13 02:43:11 +00:00
spec_store_bypass_disable=off [X86,PPC]
spectre_bhi=off [X86]
x86/speculation: Add Gather Data Sampling mitigation Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to *DISABLE* it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and *STAY* enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2023-07-13 02:43:11 +00:00
spectre_v2_user=off [X86]
srbds=off [X86,INTEL]
ssbd=force-off [ARM64]
tsx_async_abort=off [X86]
Exceptions:
This does not have any effect on
kvm.nx_huge_pages when
kvm.nx_huge_pages=force.
cpu/speculation: Add 'mitigations=' cmdline option Keeping track of the number of mitigations for all the CPU speculation bugs has become overwhelming for many users. It's getting more and more complicated to decide which mitigations are needed for a given architecture. Complicating matters is the fact that each arch tends to have its own custom way to mitigate the same vulnerability. Most users fall into a few basic categories: a) they want all mitigations off; b) they want all reasonable mitigations on, with SMT enabled even if it's vulnerable; or c) they want all reasonable mitigations on, with SMT disabled if vulnerable. Define a set of curated, arch-independent options, each of which is an aggregation of existing options: - mitigations=off: Disable all mitigations. - mitigations=auto: [default] Enable all the default mitigations, but leave SMT enabled, even if it's vulnerable. - mitigations=auto,nosmt: Enable all the default mitigations, disabling SMT if needed by a mitigation. Currently, these options are placeholders which don't actually do anything. They will be fleshed out in upcoming patches. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
2019-04-12 20:39:28 +00:00
auto (default)
Mitigate all CPU vulnerabilities, but leave SMT
enabled, even if it's vulnerable. This is for
users who don't want to be surprised by SMT
getting disabled across kernel upgrades, or who
have other ways of avoiding SMT-based attacks.
x86/speculation: Support 'mitigations=' cmdline option Configure x86 runtime CPU speculation bug mitigations in accordance with the 'mitigations=' cmdline option. This affects Meltdown, Spectre v2, Speculative Store Bypass, and L1TF. The default behavior is unchanged. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/6616d0ae169308516cfdf5216bedd169f8a8291b.1555085500.git.jpoimboe@redhat.com
2019-04-12 20:39:29 +00:00
Equivalent to: (default behavior)
cpu/speculation: Add 'mitigations=' cmdline option Keeping track of the number of mitigations for all the CPU speculation bugs has become overwhelming for many users. It's getting more and more complicated to decide which mitigations are needed for a given architecture. Complicating matters is the fact that each arch tends to have its own custom way to mitigate the same vulnerability. Most users fall into a few basic categories: a) they want all mitigations off; b) they want all reasonable mitigations on, with SMT enabled even if it's vulnerable; or c) they want all reasonable mitigations on, with SMT disabled if vulnerable. Define a set of curated, arch-independent options, each of which is an aggregation of existing options: - mitigations=off: Disable all mitigations. - mitigations=auto: [default] Enable all the default mitigations, but leave SMT enabled, even if it's vulnerable. - mitigations=auto,nosmt: Enable all the default mitigations, disabling SMT if needed by a mitigation. Currently, these options are placeholders which don't actually do anything. They will be fleshed out in upcoming patches. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
2019-04-12 20:39:28 +00:00
auto,nosmt
Mitigate all CPU vulnerabilities, disabling SMT
if needed. This is for users who always want to
be fully mitigated, even if it means losing SMT.
x86/speculation: Support 'mitigations=' cmdline option Configure x86 runtime CPU speculation bug mitigations in accordance with the 'mitigations=' cmdline option. This affects Meltdown, Spectre v2, Speculative Store Bypass, and L1TF. The default behavior is unchanged. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/6616d0ae169308516cfdf5216bedd169f8a8291b.1555085500.git.jpoimboe@redhat.com
2019-04-12 20:39:29 +00:00
Equivalent to: l1tf=flush,nosmt [X86]
mds=full,nosmt [X86]
tsx_async_abort=full,nosmt [X86]
x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data Processor MMIO Stale Data is a class of vulnerabilities that may expose data after an MMIO operation. For details please refer to Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst. These vulnerabilities are broadly categorized as: Device Register Partial Write (DRPW): Some endpoint MMIO registers incorrectly handle writes that are smaller than the register size. Instead of aborting the write or only copying the correct subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than specified by the write transaction may be written to the register. On some processors, this may expose stale data from the fill buffers of the core that created the write transaction. Shared Buffers Data Sampling (SBDS): After propagators may have moved data around the uncore and copied stale data into client core fill buffers, processors affected by MFBDS can leak data from the fill buffer. Shared Buffers Data Read (SBDR): It is similar to Shared Buffer Data Sampling (SBDS) except that the data is directly read into the architectural software-visible state. An attacker can use these vulnerabilities to extract data from CPU fill buffers using MDS and TAA methods. Mitigate it by clearing the CPU fill buffers using the VERW instruction before returning to a user or a guest. On CPUs not affected by MDS and TAA, user application cannot sample data from CPU fill buffers using MDS or TAA. A guest with MMIO access can still use DRPW or SBDR to extract data architecturally. Mitigate it with VERW instruction to clear fill buffers before VMENTER for MMIO capable guests. Add a kernel parameter mmio_stale_data={off|full|full,nosmt} to control the mitigation. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-05-20 03:29:11 +00:00
mmio_stale_data=full,nosmt [X86]
retbleed=auto,nosmt [X86]
cpu/speculation: Add 'mitigations=' cmdline option Keeping track of the number of mitigations for all the CPU speculation bugs has become overwhelming for many users. It's getting more and more complicated to decide which mitigations are needed for a given architecture. Complicating matters is the fact that each arch tends to have its own custom way to mitigate the same vulnerability. Most users fall into a few basic categories: a) they want all mitigations off; b) they want all reasonable mitigations on, with SMT enabled even if it's vulnerable; or c) they want all reasonable mitigations on, with SMT disabled if vulnerable. Define a set of curated, arch-independent options, each of which is an aggregation of existing options: - mitigations=off: Disable all mitigations. - mitigations=auto: [default] Enable all the default mitigations, but leave SMT enabled, even if it's vulnerable. - mitigations=auto,nosmt: Enable all the default mitigations, disabling SMT if needed by a mitigation. Currently, these options are placeholders which don't actually do anything. They will be fleshed out in upcoming patches. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
2019-04-12 20:39:28 +00:00
mminit_loglevel=
[KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
parameter allows control of the logging verbosity for
the additional memory initialisation checks. A value
of 0 disables mminit logging and a level of 4 will
log everything. Information is printed at KERN_DEBUG
so loglevel=8 may also need to be specified.
x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data Processor MMIO Stale Data is a class of vulnerabilities that may expose data after an MMIO operation. For details please refer to Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst. These vulnerabilities are broadly categorized as: Device Register Partial Write (DRPW): Some endpoint MMIO registers incorrectly handle writes that are smaller than the register size. Instead of aborting the write or only copying the correct subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than specified by the write transaction may be written to the register. On some processors, this may expose stale data from the fill buffers of the core that created the write transaction. Shared Buffers Data Sampling (SBDS): After propagators may have moved data around the uncore and copied stale data into client core fill buffers, processors affected by MFBDS can leak data from the fill buffer. Shared Buffers Data Read (SBDR): It is similar to Shared Buffer Data Sampling (SBDS) except that the data is directly read into the architectural software-visible state. An attacker can use these vulnerabilities to extract data from CPU fill buffers using MDS and TAA methods. Mitigate it by clearing the CPU fill buffers using the VERW instruction before returning to a user or a guest. On CPUs not affected by MDS and TAA, user application cannot sample data from CPU fill buffers using MDS or TAA. A guest with MMIO access can still use DRPW or SBDR to extract data architecturally. Mitigate it with VERW instruction to clear fill buffers before VMENTER for MMIO capable guests. Add a kernel parameter mmio_stale_data={off|full|full,nosmt} to control the mitigation. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-05-20 03:29:11 +00:00
mmio_stale_data=
[X86,INTEL] Control mitigation for the Processor
MMIO Stale Data vulnerabilities.
Processor MMIO Stale Data is a class of
vulnerabilities that may expose data after an MMIO
operation. Exposed data could originate or end in
the same CPU buffers as affected by MDS and TAA.
Therefore, similar to MDS and TAA, the mitigation
is to clear the affected CPU buffers.
This parameter controls the mitigation. The
options are:
full - Enable mitigation on vulnerable CPUs
full,nosmt - Enable mitigation and disable SMT on
vulnerable CPUs.
off - Unconditionally disable mitigation
On MDS or TAA affected machines,
mmio_stale_data=off can be prevented by an active
MDS or TAA mitigation as these vulnerabilities are
mitigated with the same mechanism so in order to
disable this mitigation, you need to specify
mds=off and tsx_async_abort=off too.
Not specifying this option is equivalent to
mmio_stale_data=full.
For details see:
Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
<module>.async_probe[=<bool>] [KNL]
If no <bool> value is specified or if the value
specified is not a valid <bool>, enable asynchronous
probe on this module. Otherwise, enable/disable
asynchronous probe on this module as indicated by the
<bool> value. See also: module.async_probe
module.async_probe=<bool>
[KNL] When set to true, modules will use async probing
by default. To enable/disable async probing for a
specific module, use the module specific control that
is documented under <module>.async_probe. When both
module.async_probe and <module>.async_probe are
specified, <module>.async_probe takes precedence for
the specific module.
module: add debugging auto-load duplicate module support The finit_module() system call can in the worst case use up to more than twice of a module's size in virtual memory. Duplicate finit_module() system calls are non fatal, however they unnecessarily strain virtual memory during bootup and in the worst case can cause a system to fail to boot. This is only known to currently be an issue on systems with larger number of CPUs. To help debug this situation we need to consider the different sources for finit_module(). Requests from the kernel that rely on module auto-loading, ie, the kernel's *request_module() API, are one source of calls. Although modprobe checks to see if a module is already loaded prior to calling finit_module() there is a small race possible allowing userspace to trigger multiple modprobe calls racing against modprobe and this not seeing the module yet loaded. This adds debugging support to the kernel module auto-loader (*request_module() calls) to easily detect duplicate module requests. To aid with possible bootup failure issues incurred by this, it will converge duplicates requests to a single request. This avoids any possible strain on virtual memory during bootup which could be incurred by duplicate module autoloading requests. Folks debugging virtual memory abuse on bootup can and should enable this to see what pr_warn()s come on, to see if module auto-loading is to blame for their wores. If they see duplicates they can further debug this by enabling the module.enable_dups_trace kernel parameter or by enabling CONFIG_MODULE_DEBUG_AUTOLOAD_DUPS_TRACE. Current evidence seems to point to only a few duplicates for module auto-loading. And so the source for other duplicates creating heavy virtual memory pressure due to larger number of CPUs should becoming from another place (likely udev). Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-04-14 05:28:39 +00:00
module.enable_dups_trace
[KNL] When CONFIG_MODULE_DEBUG_AUTOLOAD_DUPS is set,
this means that duplicate request_module() calls will
trigger a WARN_ON() instead of a pr_warn(). Note that
if MODULE_DEBUG_AUTOLOAD_DUPS_TRACE is set, WARN_ON()s
will always be issued and this option does nothing.
module.sig_enforce
[KNL] When CONFIG_MODULE_SIG is set, this means that
modules without (valid) signatures will fail to load.
Note that if CONFIG_MODULE_SIG_FORCE is set, that
is always true, so this option does nothing.
module_blacklist= [KNL] Do not load a comma-separated list of
modules. Useful for debugging problem modules.
mousedev.tap_time=
[MOUSE] Maximum time between finger touching and
leaving touchpad surface for touch to be considered
a tap and be reported as a left button click (for
touchpads working in absolute mode only).
Format: <msecs>
mousedev.xres= [MOUSE] Horizontal screen resolution, used for devices
reporting absolute coordinates, such as tablets
mousedev.yres= [MOUSE] Vertical screen resolution, used for devices
reporting absolute coordinates, such as tablets
mm, page_alloc: extend kernelcore and movablecore for percent Both kernelcore= and movablecore= can be used to define the amount of ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires the system memory capacity to be known when specifying the command line, however. This introduces the ability to define both kernelcore= and movablecore= as a percentage of total system memory. This is convenient for systems software that wants to define the amount of ZONE_MOVABLE, for example, as a proportion of a system's memory rather than a hardcoded byte value. To define the percentage, the final character of the parameter should be a '%'. mhocko: "why is anyone using these options nowadays?" rientjes: : : Fragmentation of non-__GFP_MOVABLE pages due to low on memory : situations can pollute most pageblocks on the system, as much as 1GB of : slab being fragmented over 128GB of memory, for example. When the : amount of kernel memory is well bounded for certain systems, it is : better to aggressively reclaim from existing MIGRATE_UNMOVABLE : pageblocks rather than eagerly fallback to others. : : We have additional patches that help with this fragmentation if you're : interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE : pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and : draining of pcp lists back to the zone free area to prevent stranding. [rientjes@google.com: updates] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 23:23:09 +00:00
movablecore= [KNL,X86,IA-64,PPC]
Format: nn[KMGTPE] | nn%
This parameter is the complement to kernelcore=, it
specifies the amount of memory used for migratable
allocations. If both kernelcore and movablecore is
specified, then kernelcore will be at *least* the
specified value but may be more. If movablecore on its
own is specified, the administrator must be careful
that the amount of memory usable for all allocations
is not too small.
2017-07-06 22:41:02 +00:00
movable_node [KNL] Boot-time switch to make hotplugable memory
NUMA nodes to be movable. This means that the memory
of such nodes will be usable only for movable
allocations which rules out almost all kernel
allocations. Use with caution!
mem-hotplug: introduce movable_node boot option The hot-Pluggable field in SRAT specifies which memory is hotpluggable. As we mentioned before, if hotpluggable memory is used by the kernel, it cannot be hot-removed. So memory hotplug users may want to set all hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it. Memory hotplug users may also set a node as movable node, which has ZONE_MOVABLE only, so that the whole node can be hot-removed. But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the kernel cannot use memory in movable nodes. This will cause NUMA performance down. And other users may be unhappy. So we need a way to allow users to enable and disable this functionality. In this patch, we introduce movable_node boot option to allow users to choose to not to consume hotpluggable memory at early boot time and later we can set it as ZONE_MOVABLE. To achieve this, the movable_node boot option will control the memblock allocation direction. That said, after memblock is ready, before SRAT is parsed, we should allocate memory near the kernel image as we explained in the previous patches. So if movable_node boot option is set, the kernel does the following: 1. After memblock is ready, make memblock allocate memory bottom up. 2. After SRAT is parsed, make memblock behave as default, allocate memory top down. Users can specify "movable_node" in kernel commandline to enable this functionality. For those who don't use memory hotplug or who don't want to lose their NUMA performance, just don't specify anything. The kernel will work as before. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Suggested-by: Ingo Molnar <mingo@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Toshi Kani <toshi.kani@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-12 23:08:10 +00:00
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>
MTD_Region= [MTD] Format:
<name>,<region-number>[,<base>,<size>,<buswidth>,<altbuswidth>]
mtdparts= [MTD]
See drivers/mtd/parsers/cmdlinepart.c
mtdset= [ARM]
ARM/S3C2412 JIVE boot control
See arch/arm/mach-s3c/mach-jive.c
mtouchusb.raw_coordinates=
[HW] Make the MicroTouch USB driver use raw coordinates
('y', default) or cooked coordinates ('n')
mtrr=debug [X86]
Enable printing debug information related to MTRR
registers at boot time.
mtrr_chunk_size=nn[KMG] [X86]
used for mtrr cleanup. It is largest continuous chunk
that could hold holes aka. UC entries.
mtrr_gran_size=nn[KMG] [X86]
Used for mtrr cleanup. It is granularity of mtrr block.
Default is 1.
Large value could prevent small alignment from
using up MTRRs.
mtrr_spare_reg_nr=n [X86]
Format: <integer>
Range: 0,7 : spare reg number
Default : 1
Used for mtrr cleanup. It is spare mtrr entries number.
Set to 2 or more if your graphical card needs more.
multitce=off [PPC] This parameter disables the use of the pSeries
firmware feature for updating multiple TCE entries
at a time.
n2= [NET] SDL Inc. RISCom/N2 synchronous serial card
netdev= [NET] Network devices parameters
Format: <irq>,<io>,<mem_start>,<mem_end>,<name>
Note that mem_start is often overloaded to mean
something different and driver-specific.
This usage is only documented in each driver source
file if at all.
netpoll.carrier_timeout=
[NET] Specifies amount of time (in seconds) that
netpoll should wait for a carrier. By default netpoll
waits 4 seconds.
nf_conntrack.acct=
[NETFILTER] Enable connection tracking flow accounting
0 to disable accounting
1 to enable accounting
Default value is 0.
nfs.cache_getent=
[NFS] sets the pathname to the program which is used
to update the NFS client cache entries.
nfs.cache_getent_timeout=
[NFS] sets the timeout after which an attempt to
update a cache entry is deemed to have failed.
nfs.callback_nr_threads=
[NFSv4] set the total number of threads that the
NFS client will assign to service NFSv4 callback
requests.
nfs.callback_tcpport=
[NFS] set the TCP port on which the NFSv4 callback
channel should listen.
nfs.delay_retrans=
[NFS] specifies the number of times the NFSv4 client
retries the request before returning an EAGAIN error,
after a reply of NFS4ERR_DELAY from the server.
Only applies if the softerr mount option is enabled,
and the specified value is >= 0.
nfs.enable_ino64=
[NFS] enable 64-bit inode numbers.
If zero, the NFS client will fake up a 32-bit inode
number for the readdir() and stat() syscalls instead
of returning the full 64-bit number.
The default is to return 64-bit inode numbers.
nfs.idmap_cache_timeout=
[NFS] set the maximum lifetime for idmapper cache
entries.
nfs.max_session_cb_slots=
[NFSv4.1] Sets the maximum number of session
slots the client will assign to the callback
channel. This determines the maximum number of
callbacks the client will process in parallel for
a particular server.
nfs.max_session_slots=
[NFSv4.1] Sets the maximum number of session slots
the client will attempt to negotiate with the server.
This limits the number of simultaneous RPC requests
that the client can send to the NFSv4.1 server.
Note that there is little point in setting this
value higher than the max_tcp_slot_table_limit.
nfs.nfs4_disable_idmapping=
[NFSv4] When set to the default of '1', this option
ensures that both the RPC level authentication
scheme and the NFS level operations agree to use
numeric uids/gids if the mount is using the
'sec=sys' security flavour. In effect it is
disabling idmapping, which can make migration from
legacy NFSv2/v3 systems to NFSv4 easier.
Servers that do not support this mode of operation
will be autodetected by the client, and it will fall
back to using the idmapper.
To turn off this behaviour, set the value to '0'.
nfs.nfs4_unique_id=
[NFS4] Specify an additional fixed unique ident-
ification string that NFSv4 clients can insert into
their nfs_client_id4 string. This is typically a
UUID that is generated at system install time.
nfs.recover_lost_locks=
[NFSv4] Attempt to recover locks that were lost due
to a lease timeout on the server. Please note that
doing this risks data corruption, since there are
no guarantees that the file will remain unchanged
after the locks are lost.
If you want to enable the kernel legacy behaviour of
attempting to recover these locks, then set this
parameter to '1'.
The default parameter value of '0' causes the kernel
not to attempt recovery of lost locks.
nfs.send_implementation_id=
[NFSv4.1] Send client implementation identification
information in exchange_id requests.
If zero, no implementation identification information
will be sent.
The default is to send the implementation identification
information.
nfs4.layoutstats_timer=
[NFSv4.2] Change the rate at which the kernel sends
layoutstats to the pNFS metadata server.
Setting this to value to 0 causes the kernel to use
whatever value is the default set by the layout
driver. A non-zero value sets the minimum interval
in seconds between layoutstats transmissions.
nfsd.inter_copy_offload_enable=
[NFSv4.2] When set to 1, the server will support
server-to-server copies for which this server is
the destination of the copy.
nfsd.nfs4_disable_idmapping=
[NFSv4] When set to the default of '1', the NFSv4
server will return only numeric uids and gids to
clients using auth_sys, and will accept numeric uids
and gids from such clients. This is intended to ease
migration from NFSv2/v3.
nfsd.nfsd4_ssc_umount_timeout=
[NFSv4.2] When used as the destination of a
server-to-server copy, knfsd temporarily mounts
the source server. It caches the mount in case
it will be needed again, and discards it if not
used for the number of milliseconds specified by
this parameter.
nfsaddrs= [NFS] Deprecated. Use ip= instead.
See Documentation/admin-guide/nfs/nfsroot.rst.
nfsroot= [NFS] nfs root filesystem for disk-less boxes.
See Documentation/admin-guide/nfs/nfsroot.rst.
nfsrootdebug [NFS] enable nfsroot debugging messages.
See Documentation/admin-guide/nfs/nfsroot.rst.
nmi_backtrace.backtrace_idle [KNL]
Dump stacks even of idle CPUs in response to an
NMI stack-backtrace request.
nmi_debug= [KNL,SH] Specify one or more actions to take
when a NMI is triggered.
Format: [state][,regs][,debounce][,die]
nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
Format: [panic,][nopanic,][num]
watchdog: enable the new user interface of the watchdog mechanism With the current user interface of the watchdog mechanism it is only possible to disable or enable both lockup detectors at the same time. This series introduces new kernel parameters and changes the semantics of some existing kernel parameters, so that the hard lockup detector and the soft lockup detector can be disabled or enabled individually. With this series applied, the user interface is as follows. - parameters in /proc/sys/kernel . soft_watchdog This is a new parameter to control and examine the run state of the soft lockup detector. . nmi_watchdog The semantics of this parameter have changed. It can now be used to control and examine the run state of the hard lockup detector. . watchdog This parameter is still available to control the run state of both lockup detectors at the same time. If this parameter is examined, it shows the logical OR of soft_watchdog and nmi_watchdog. . watchdog_thresh The semantics of this parameter are not affected by the patch. - kernel command line parameters . nosoftlockup The semantics of this parameter have changed. It can now be used to disable the soft lockup detector at boot time. . nmi_watchdog=0 or nmi_watchdog=1 Disable or enable the hard lockup detector at boot time. The patch introduces '=1' as a new option. . nowatchdog The semantics of this parameter are not affected by the patch. It is still available to disable both lockup detectors at boot time. Also, remove the proc_dowatchdog() function which is no longer needed. [dzickus@redhat.com: wrote changelog] [dzickus@redhat.com: update documentation for kernel params and sysctl] Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 22:44:13 +00:00
Valid num: 0 or 1
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
When panic is specified, panic when an NMI watchdog
timeout occurs (or 'nopanic' to not panic on an NMI
watchdog, if CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is set)
To disable both hard and soft lockup detectors,
please see 'nowatchdog'.
This is useful when you use a panic=... timeout and
need the box quickly up again.
These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.
no387 [BUGS=X86-32] Tells the kernel to use the 387 maths
emulation library even if a 387 maths coprocessor
is present.
riscv: Allow to downgrade paging mode from the command line Add 2 early command line parameters that allow to downgrade satp mode (using the same naming as x86): - "no5lvl": use a 4-level page table (down from sv57 to sv48) - "no4lvl": use a 3-level page table (down from sv57/sv48 to sv39) Note that going through the device tree to get the kernel command line works with ACPI too since the efi stub creates a device tree anyway with the command line. In KASAN kernels, we can't use the libfdt that early in the boot process since we are not ready to execute instrumented functions. So instead of using the "generic" libfdt, we compile our own versions of those functions that are not instrumented and that are prefixed so that they do not conflict with the generic ones. We also need the non-instrumented versions of the string functions and the prefixed versions of memcpy/memmove. This is largely inspired by commit aacd149b6238 ("arm64: head: avoid relocating the kernel twice for KASLR") from which I removed compilation flags that were not relevant to RISC-V at the moment (LTO, SCS). Also note that we have to link with -z norelro to avoid ld.lld to throw a warning with the new .got sections, like in commit 311bea3cb9ee ("arm64: link with -z norelro for LLD or aarch64-elf"). Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20230424092313.178699-2-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-24 09:23:13 +00:00
no4lvl [RISCV] Disable 4-level and 5-level paging modes. Forces
kernel to use 3-level paging instead.
no5lvl [X86-64,RISCV] Disable 5-level paging mode. Forces
kernel to use 4-level paging instead.
noaliencache [MM, NUMA, SLAB] Disables the allocation of alien
caches in the slab allocator. Saves per-node memory,
but will impact performance.
noalign [KNL,ARM]
noaltinstr [S390] Disables alternative instructions patching
(CPU alternatives feature).
noapic [SMP,APIC] Tells the kernel to not make use of any
IOAPICs that may be present in the system.
noautogroup Disable scheduler automatic task group creation.
nocache [ARM]
no_console_suspend
[HW] Never suspend the console
Disable suspending of consoles during suspend and
hibernate operations. Once disabled, debugging
messages can reach various consoles while the rest
of the system is being put to sleep (ie, while
debugging driver suspend/resume hooks). This may
not work reliably with all consoles, but is known
to work with serial and VGA consoles.
To facilitate more flexible debugging, we also add
console_suspend, a printk module parameter to control
it. Users could use console_suspend (usually
/sys/module/printk/parameters/console_suspend) to
turn on/off it dynamically.
no_debug_objects
[KNL] Disable object debugging
nodsp [SH] Disable hardware DSP at boot time.
noefi Disable EFI runtime services support.
no_entry_flush [PPC] Don't flush the L1-D cache when entering the kernel.
noexec [IA-64]
noexec32 [X86-64]
This affects only 32-bit executables.
noexec32=on: enable non-executable mappings (default)
read doesn't imply executable mappings
noexec32=off: disable non-executable mappings
read implies executable mappings
no_file_caps Tells the kernel not to honor file capabilities. The
only way then for a file to be executed with privilege
is to be setuid root or executed by root.
nofpu [MIPS,SH] Disable hardware FPU at boot time.
nofsgsbase [X86] Disables FSGSBASE instructions.
nofxsr [BUGS=X86-32] Disables x86 floating point extended
register save and restore. The kernel will only save
legacy floating-point registers on task switch.
nohalt [IA-64] Tells the kernel not to use the power saving
function PAL_HALT_LIGHT when idle. This increases
power-consumption. On the positive side, it reduces
interrupt wake-up latency, which may improve performance
in certain environments such as networked servers or
real-time systems.
no_hash_pointers
Force pointers printed to the console or buffers to be
unhashed. By default, when a pointer is printed via %p
format string, that pointer is "hashed", i.e. obscured
by hashing the pointer value. This is a security feature
that hides actual kernel addresses from unprivileged
users, but it also makes debugging the kernel more
difficult since unequal pointers can no longer be
compared. However, if this command-line option is
specified, then all normal pointers will have their true
value printed. This option should only be specified when
debugging the kernel. Please do not use on production
kernels.
nohibernate [HIBERNATION] Disable hibernation and resume.
nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,SH] Forces the kernel to
busy wait in do_idle() and not use the arch_cpu_idle()
implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP
to be effective. This is useful on platforms where the
sleep(SH) or wfi(ARM,ARM64) instructions do not work
correctly or when doing power measurements to evaluate
the impact of the sleep instructions. This is also
useful when using JTAG debugger.
nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings.
nohugevmalloc [KNL,X86,PPC,ARM64] Disable kernel huge vmalloc mappings.
nohz= [KNL] Boottime enable/disable dynamic ticks
Valid arguments: on, off
Default: on
nohz_full= [KNL,BOOT,SMP,ISOL]
2016-10-11 20:51:35 +00:00
The argument is a cpu list, as described above.
In kernels built with CONFIG_NO_HZ_FULL=y, set
nohz: Basic full dynticks interface For extreme usecases such as Real Time or HPC, having the ability to shutdown the tick when a single task runs on a CPU is a desired feature: * Reducing the amount of interrupts improves throughput for CPU-bound tasks. The CPU is less distracted from its real job, from an execution time and from the cache point of views. * This also improve latency response as we have less critical sections. Start with introducing a very simple interface to define full dynticks CPU: use a boot time option defined cpumask through the "nohz_extended=" kernel parameter. CPUs that are part of this range will have their tick shutdown whenever possible: provided they run a single task and they don't do kernel activity that require the periodic tick. These details will be later documented in Documentation/* An online CPU must be kept outside this range to handle the timekeeping. Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2012-12-18 16:32:19 +00:00
the specified list of CPUs whose tick will be stopped
2013-03-27 01:18:34 +00:00
whenever possible. The boot CPU will be forced outside
the range to maintain the timekeeping. Any CPUs
in this list will have their RCU callbacks offloaded,
just as if they had also been called out in the
rcu_nocbs= boot parameter.
nohz: Basic full dynticks interface For extreme usecases such as Real Time or HPC, having the ability to shutdown the tick when a single task runs on a CPU is a desired feature: * Reducing the amount of interrupts improves throughput for CPU-bound tasks. The CPU is less distracted from its real job, from an execution time and from the cache point of views. * This also improve latency response as we have less critical sections. Start with introducing a very simple interface to define full dynticks CPU: use a boot time option defined cpumask through the "nohz_extended=" kernel parameter. CPUs that are part of this range will have their tick shutdown whenever possible: provided they run a single task and they don't do kernel activity that require the periodic tick. These details will be later documented in Documentation/* An online CPU must be kept outside this range to handle the timekeeping. Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2012-12-18 16:32:19 +00:00
Note that this argument takes precedence over
the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
noinitrd [RAM] Tells the kernel not to load any configured
initial RAM disk.
nointremap [X86-64, Intel-IOMMU] Do not enable interrupt
remapping.
[Deprecated - use intremap=off]
nointroute [IA-64]
noinvpcid [X86] Disable the INVPCID cpu feature.
noiotrap [SH] Disables trapped I/O port accesses.
noirqdebug [X86-32] Disables the code which attempts to detect and
disable unhandled interrupt sources.
noisapnp [ISAPNP] Disables ISA PnP code.
nojitter [IA-64] Disables jitter checking for ITC timers.
nokaslr [KNL]
When CONFIG_RANDOMIZE_BASE is set, this disables
kernel and module base offset ASLR (Address Space
Layout Randomization).
no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
fault handling.
no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver
nolapic [X86-32,APIC] Do not enable or use the local APIC.
nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
nomca [IA-64] Disable machine check abort handling
nomce [X86-32] Disable Machine Check Exception
nomfgpt [X86-32] Disable Multi-Function General Purpose
Timer usage (for AMD Geode machines).
nomodeset Disable kernel modesetting. Most systems' firmware
sets up a display mode and provides framebuffer memory
for output. With nomodeset, DRM and fbdev drivers will
not load if they could possibly displace the pre-
initialized output. Only the system framebuffer will
be available for use. The respective drivers will not
perform display-mode changes or accelerated rendering.
Useful as error fallback, or for testing and debugging.
nomodule Disable module load
nonmi_ipi [X86] Disable using NMI IPIs during panic/reboot to
shutdown the other cpus. Instead use the REBOOT_VECTOR
irq.
nopat [X86] Disable PAT (page attribute table extension of
pagetables) support.
nopcid [X86-64] Disable the PCID cpu feature.
nopku [X86] Disable Memory Protection Keys CPU feature found
in some Intel CPUs.
nopti [X86-64]
Equivalent to pti=off
nopv= [X86,XEN,KVM,HYPER_V,VMWARE]
Disables the PV optimizations forcing the guest to run
as generic guest with no PV drivers. Currently support
XEN HVM, KVM, HYPER_V and VMWARE guest.
nopvspin [X86,XEN,KVM]
Disables the qspinlock slow path using PV optimizations
which allow the hypervisor to 'idle' the guest on lock
contention.
norandmaps Don't use address space randomization. Equivalent to
echo 0 > /proc/sys/kernel/randomize_va_space
noreplace-smp [X86-32,SMP] Don't replace SMP instructions
with UP alternatives
noresume [SWSUSP] Disables resume and restores original swap
space.
nosbagart [IA-64]
no-scroll [VGA] Disables scrollback.
This is required for the Braillex ib80-piezo Braille
reader made by F.H. Papenmeier (Germany).
nosgx [X86-64,SGX] Disables Intel SGX kernel support.
nosmap [PPC]
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by processor.
nosmep [PPC64s]
Disable SMEP (Supervisor Mode Execution Prevention)
even if it is supported by processor.
nosmp [SMP] Tells an SMP kernel to act as a UP kernel,
and disable the IO APIC. legacy for "maxcpus=0".
nosmt [KNL,MIPS,PPC,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
[KNL,X86,PPC] Disable symmetric multithreading (SMT).
nosmt=force: Force disable SMT, cannot be undone
via the sysfs control file.
nosoftlockup [KNL] Disable the soft-lockup detector.
nospec_store_bypass_disable
[HW] Disable all mitigations for the Speculative Store Bypass vulnerability
nospectre_bhb [ARM64] Disable all mitigations for Spectre-BHB (branch
history injection) vulnerability. System may allow data leaks
with this option.
nospectre_v1 [X86,PPC] Disable mitigations for Spectre Variant 1
(bounds check bypass). With this option data leaks are
possible in the system.
nospectre_v2 [X86,PPC_E500,ARM64] Disable all mitigations for
the Spectre variant 2 (indirect branch prediction)
vulnerability. System may allow data leaks with this
option.
no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES,RISCV] Disable
paravirtualized steal time accounting. steal time is
computed, but won't influence scheduler behaviour
nosync [HW,M68K] Disables sync negotiation for all devices.
no_timer_check [X86,APIC] Disables the code which tests for
broken timer IRQ sources.
no_uaccess_flush
[PPC] Don't flush the L1-D cache after accessing user data.
novmcoredd [KNL,KDUMP]
Disable device dump. Device dump allows drivers to
append dump data to vmcore so you can collect driver
specified debug info. Drivers can append the data
without any limit and this data is stored in memory,
so this may cause significant memory stress. Disabling
device dump can help save memory but the driver debug
data will be no longer available. This parameter
is only available when CONFIG_PROC_VMCORE_DEVICE_DUMP
is set.
no-vmw-sched-clock
[X86,PV_OPS] Disable paravirtualized VMware scheduler
clock and use the default one.
watchdog: enable the new user interface of the watchdog mechanism With the current user interface of the watchdog mechanism it is only possible to disable or enable both lockup detectors at the same time. This series introduces new kernel parameters and changes the semantics of some existing kernel parameters, so that the hard lockup detector and the soft lockup detector can be disabled or enabled individually. With this series applied, the user interface is as follows. - parameters in /proc/sys/kernel . soft_watchdog This is a new parameter to control and examine the run state of the soft lockup detector. . nmi_watchdog The semantics of this parameter have changed. It can now be used to control and examine the run state of the hard lockup detector. . watchdog This parameter is still available to control the run state of both lockup detectors at the same time. If this parameter is examined, it shows the logical OR of soft_watchdog and nmi_watchdog. . watchdog_thresh The semantics of this parameter are not affected by the patch. - kernel command line parameters . nosoftlockup The semantics of this parameter have changed. It can now be used to disable the soft lockup detector at boot time. . nmi_watchdog=0 or nmi_watchdog=1 Disable or enable the hard lockup detector at boot time. The patch introduces '=1' as a new option. . nowatchdog The semantics of this parameter are not affected by the patch. It is still available to disable both lockup detectors at boot time. Also, remove the proc_dowatchdog() function which is no longer needed. [dzickus@redhat.com: wrote changelog] [dzickus@redhat.com: update documentation for kernel params and sysctl] Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14 22:44:13 +00:00
nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).
lockup_detector: Combine nmi_watchdog and softlockup detector The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file: kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions V4: - documentation fixes - convert per_cpu to __get_cpu_var - powerpc compile fixes V5: - split apart warn flags for hard and soft lockups TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period [fweisbec: merged conflict patch] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-07 21:11:44 +00:00
nowb [ARM]
nox2apic [X86-64,APIC] Do not enable x2APIC mode.
2022-08-16 23:19:42 +00:00
NOTE: this parameter will be ignored on systems with the
LEGACY_XAPIC_DISABLED bit set in the
IA32_XAPIC_DISABLE_STATUS MSR.
noxsave [BUGS=X86] Disables x86 extended register state save
and restore using xsave. The kernel will fallback to
enabling legacy floating-point and sse state.
noxsaveopt [X86] Disables xsaveopt used in saving x86 extended
register states. The kernel will fall back to use
xsave to save the states. By using this parameter,
performance of saving the states is degraded because
xsave doesn't support modified optimization while
xsaveopt supports it on xsaveopt enabled systems.
noxsaves [X86] Disables xsaves and xrstors used in saving and
restoring x86 extended register state in compacted
form of xsave area. The kernel will fall back to use
xsaveopt and xrstor to save and restore the states
in standard form of xsave area. By using this
parameter, xsave area per process might occupy more
memory on xsaves enabled systems.
nps_mtm_hs_ctr= [KNL,ARC]
This parameter sets the maximum duration, in
cycles, each HW thread of the CTOP can run
without interruptions, before HW switches it.
The actual maximum duration is 16 times this
parameter's value.
Format: integer between 1 and 255
Default: 255
nptcg= [IA-64] Override max number of concurrent global TLB
purges which is reported from either PAL_VM_SUMMARY or
SAL PALO.
nr_cpus= [SMP] Maximum number of processors that an SMP kernel
could support. nr_cpus=n : n >= 1 limits the kernel to
support 'n' processors. It could be larger than the
number of already plugged CPU during bootup, later in
runtime you can physically add extra cpu until it reaches
n. So during boot up some boot time memory for per-cpu
variables need be pre-allocated for later physical cpu
hot plugging.
nr_uarts= [SERIAL] maximum number of UARTs to be registered.
numa=off [KNL, ARM64, PPC, RISCV, SPARC, X86] Disable NUMA, Only
set up a single NUMA node spanning all memory.
Documentation/admin-guide: kernel-parameters: correct the architectures for numa_balancing X86 isn't the only architecture supporting NUMA_BALANCING. ARM64, PPC, S390 and RISCV also support it: arch$ git grep NUMA_BALANCING arm64/Kconfig: select ARCH_SUPPORTS_NUMA_BALANCING arm64/configs/defconfig:CONFIG_NUMA_BALANCING=y arm64/include/asm/pgtable.h:#ifdef CONFIG_NUMA_BALANCING powerpc/configs/powernv_defconfig:CONFIG_NUMA_BALANCING=y powerpc/configs/ppc64_defconfig:CONFIG_NUMA_BALANCING=y powerpc/configs/pseries_defconfig:CONFIG_NUMA_BALANCING=y powerpc/include/asm/book3s/64/pgtable.h:#ifdef CONFIG_NUMA_BALANCING powerpc/include/asm/book3s/64/pgtable.h:#ifdef CONFIG_NUMA_BALANCING powerpc/include/asm/book3s/64/pgtable.h:#endif /* CONFIG_NUMA_BALANCING */ powerpc/include/asm/book3s/64/pgtable.h:#ifdef CONFIG_NUMA_BALANCING powerpc/include/asm/book3s/64/pgtable.h:#endif /* CONFIG_NUMA_BALANCING */ powerpc/include/asm/nohash/pgtable.h:#ifdef CONFIG_NUMA_BALANCING powerpc/include/asm/nohash/pgtable.h:#endif /* CONFIG_NUMA_BALANCING */ powerpc/platforms/Kconfig.cputype: select ARCH_SUPPORTS_NUMA_BALANCING riscv/Kconfig: select ARCH_SUPPORTS_NUMA_BALANCING riscv/include/asm/pgtable.h:#ifdef CONFIG_NUMA_BALANCING s390/Kconfig: select ARCH_SUPPORTS_NUMA_BALANCING s390/configs/debug_defconfig:CONFIG_NUMA_BALANCING=y s390/configs/defconfig:CONFIG_NUMA_BALANCING=y s390/include/asm/pgtable.h:#ifdef CONFIG_NUMA_BALANCING x86/Kconfig: select ARCH_SUPPORTS_NUMA_BALANCING if X86_64 x86/include/asm/pgtable.h:#ifdef CONFIG_NUMA_BALANCING x86/include/asm/pgtable.h:#endif /* CONFIG_NUMA_BALANCING */ On the other hand, setup_numabalancing() is implemented in mm/mempolicy.c which doesn't depend on architectures. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20210302084159.33688-1-song.bao.hua@hisilicon.com Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2021-03-02 08:41:59 +00:00
numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
NUMA balancing.
Allowed values are enable and disable
change zonelist order: zonelist order selection logic Make zonelist creation policy selectable from sysctl/boot option v6. This patch makes NUMA's zonelist (of pgdat) order selectable. Available order are Default(automatic)/ Node-based / Zone-based. [Default Order] The kernel selects Node-based or Zone-based order automatically. [Node-based Order] This policy treats the locality of memory as the most important parameter. Zonelist order is created by each zone's locality. This means lower zones (ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion. IOW. ZONE_DMA will be in the middle of zonelist. current 2.6.21 kernel uses this. Pros. * A user can expect local memory as much as possible. Cons. * lower zone will be exhansted before higher zone. This may cause OOM_KILL. Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL because of ZONE_DMA exhaution and you need the best locality. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. *node(0)'s memory allocation order: node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL. *node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. [Zone-based order] This policy treats the zone type as the most important parameter. Zonelist order is created by zone-type order. This means lower zone never be used bofere higher zone exhaustion. IOW. ZONE_DMA will be always at the tail of zonelist. Pros. * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted. Cons. * memory locality may not be best. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. *node(0)'s memory allocation order: node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA. *node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. bootoption "numa_zonelist_order=" and proc/sysctl is supporetd. command: %echo N > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Node-based order. command: %echo Z > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Zone-based order. Thanks to Lee Schermerhorn, he gives me much help and codes. [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order] [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 06:38:01 +00:00
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
mm, page_alloc: rip out ZONELIST_ORDER_ZONE Patch series "cleanup zonelists initialization", v1. This is aimed at cleaning up the zonelists initialization code we have but the primary motivation was bug report [2] which got resolved but the usage of stop_machine is just too ugly to live. Most patches are straightforward but 3 of them need a special consideration. Patch 1 removes zone ordered zonelists completely. I am CCing linux-api because this is a user visible change. As I argue in the patch description I do not think we have a strong usecase for it these days. I have kept sysctl in place and warn into the log if somebody tries to configure zone lists ordering. If somebody has a real usecase for it we can revert this patch but I do not expect anybody will actually notice runtime differences. This patch is not strictly needed for the rest but it made patch 6 easier to implement. Patch 7 removes stop_machine from build_all_zonelists without adding any special synchronization between iterators and updater which I _believe_ is acceptable as explained in the changelog. I hope I am not missing anything. Patch 8 then removes zonelists_mutex which is kind of ugly as well and not really needed AFAICS but a care should be taken when double checking my thinking. This patch (of 9): Supporting zone ordered zonelists costs us just a lot of code while the usefulness is arguable if existent at all. Mel has already made node ordering default on 64b systems. 32b systems are still using ZONELIST_ORDER_ZONE because it is considered better to fallback to a different NUMA node rather than consume precious lowmem zones. This argument is, however, weaken by the fact that the memory reclaim has been reworked to be node rather than zone oriented. This means that lowmem requests have to skip over all highmem pages on LRUs already and so zone ordering doesn't save the reclaim time much. So the only advantage of the zone ordering is under a light memory pressure when highmem requests do not ever hit into lowmem zones and the lowmem pressure doesn't need to reclaim. Considering that 32b NUMA systems are rather suboptimal already and it is generally advisable to use 64b kernel on such a HW I believe we should rather care about the code maintainability and just get rid of ZONELIST_ORDER_ZONE altogether. Keep systcl in place and warn if somebody tries to set zone ordering either from kernel command line or the sysctl. [mhocko@suse.com: reading vm.numa_zonelist_order will never terminate] Link: http://lkml.kernel.org/r/20170721143915.14161-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 23:20:13 +00:00
'node', 'default' can be specified
change zonelist order: zonelist order selection logic Make zonelist creation policy selectable from sysctl/boot option v6. This patch makes NUMA's zonelist (of pgdat) order selectable. Available order are Default(automatic)/ Node-based / Zone-based. [Default Order] The kernel selects Node-based or Zone-based order automatically. [Node-based Order] This policy treats the locality of memory as the most important parameter. Zonelist order is created by each zone's locality. This means lower zones (ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion. IOW. ZONE_DMA will be in the middle of zonelist. current 2.6.21 kernel uses this. Pros. * A user can expect local memory as much as possible. Cons. * lower zone will be exhansted before higher zone. This may cause OOM_KILL. Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL because of ZONE_DMA exhaution and you need the best locality. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. *node(0)'s memory allocation order: node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL. *node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. [Zone-based order] This policy treats the zone type as the most important parameter. Zonelist order is created by zone-type order. This means lower zone never be used bofere higher zone exhaustion. IOW. ZONE_DMA will be always at the tail of zonelist. Pros. * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted. Cons. * memory locality may not be best. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. *node(0)'s memory allocation order: node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA. *node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. bootoption "numa_zonelist_order=" and proc/sysctl is supporetd. command: %echo N > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Node-based order. command: %echo Z > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Zone-based order. Thanks to Lee Schermerhorn, he gives me much help and codes. [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order] [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 06:38:01 +00:00
This can be set from sysctl after boot.
See Documentation/admin-guide/sysctl/vm.rst for details.
change zonelist order: zonelist order selection logic Make zonelist creation policy selectable from sysctl/boot option v6. This patch makes NUMA's zonelist (of pgdat) order selectable. Available order are Default(automatic)/ Node-based / Zone-based. [Default Order] The kernel selects Node-based or Zone-based order automatically. [Node-based Order] This policy treats the locality of memory as the most important parameter. Zonelist order is created by each zone's locality. This means lower zones (ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion. IOW. ZONE_DMA will be in the middle of zonelist. current 2.6.21 kernel uses this. Pros. * A user can expect local memory as much as possible. Cons. * lower zone will be exhansted before higher zone. This may cause OOM_KILL. Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL because of ZONE_DMA exhaution and you need the best locality. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. *node(0)'s memory allocation order: node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL. *node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. [Zone-based order] This policy treats the zone type as the most important parameter. Zonelist order is created by zone-type order. This means lower zone never be used bofere higher zone exhaustion. IOW. ZONE_DMA will be always at the tail of zonelist. Pros. * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted. Cons. * memory locality may not be best. (example) assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL. *node(0)'s memory allocation order: node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA. *node(1)'s memory allocation order: node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA. bootoption "numa_zonelist_order=" and proc/sysctl is supporetd. command: %echo N > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Node-based order. command: %echo Z > /proc/sys/vm/numa_zonelist_order Will rebuild zonelist in Zone-based order. Thanks to Lee Schermerhorn, he gives me much help and codes. [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order] [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 06:38:01 +00:00
ohci1394_dma=early [HW] enable debugging via the ohci1394 driver.
See Documentation/core-api/debugging-via-ohci1394.rst for more
info.
olpc_ec_timeout= [OLPC] ms delay when issuing EC commands
Rather than timing out after 20 ms if an EC
command is not properly ACKed, override the length
of the timeout. We have interrupts disabled while
waiting for the ACK, so if this is set too high
interrupts *may* be lost!
omap_mux= [OMAP] Override bootloader pin multiplexing.
Format: <mux_mode0.mode_name=value>...
For example, to override I2C bus2:
omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100
onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration
Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
boundary - index of last SLC block on Flex-OneNAND.
The remaining blocks are configured as MLC blocks.
lock - Configure if Flex-OneNAND boundary should be locked.
Once locked, the boundary cannot be changed.
1 indicates lock status, 0 indicates unlock status.
oops=panic Always panic on oopses. Default is to just kill the
process, but there is a small probability of
deadlocking the machine.
This will also cause panics on machine check exceptions.
Useful together with panic=30 to trigger a reboot.
mm: shuffle initial free memory to improve memory-side-cache utilization Patch series "mm: Randomize free memory", v10. This patch (of 3): Randomization of the page allocator improves the average utilization of a direct-mapped memory-side-cache. Memory side caching is a platform capability that Linux has been previously exposed to in HPC (high-performance computing) environments on specialty platforms. In that instance it was a smaller pool of high-bandwidth-memory relative to higher-capacity / lower-bandwidth DRAM. Now, this capability is going to be found on general purpose server platforms where DRAM is a cache in front of higher latency persistent memory [1]. Robert offered an explanation of the state of the art of Linux interactions with memory-side-caches [2], and I copy it here: It's been a problem in the HPC space: http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/ A kernel module called zonesort is available to try to help: https://software.intel.com/en-us/articles/xeon-phi-software and this abandoned patch series proposed that for the kernel: https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com Dan's patch series doesn't attempt to ensure buffers won't conflict, but also reduces the chance that the buffers will. This will make performance more consistent, albeit slower than "optimal" (which is near impossible to attain in a general-purpose kernel). That's better than forcing users to deploy remedies like: "To eliminate this gradual degradation, we have added a Stream measurement to the Node Health Check that follows each job; nodes are rebooted whenever their measured memory bandwidth falls below 300 GB/s." A replacement for zonesort was merged upstream in commit cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability"). With this numa_emulation capability, memory can be split into cache sized ("near-memory" sized) numa nodes. A bind operation to such a node, and disabling workloads on other nodes, enables full cache performance. However, once the workload exceeds the cache size then cache conflicts are unavoidable. While HPC environments might be able to tolerate time-scheduling of cache sized workloads, for general purpose server platforms, the oversubscribed cache case will be the common case. The worst case scenario is that a server system owner benchmarks a workload at boot with an un-contended cache only to see that performance degrade over time, even below the average cache performance due to excessive conflicts. Randomization clips the peaks and fills in the valleys of cache utilization to yield steady average performance. Here are some performance impact details of the patches: 1/ An Intel internal synthetic memory bandwidth measurement tool, saw a 3X speedup in a contrived case that tries to force cache conflicts. The contrived cased used the numa_emulation capability to force an instance of the benchmark to be run in two of the near-memory sized numa nodes. If both instances were placed on the same emulated they would fit and cause zero conflicts. While on separate emulated nodes without randomization they underutilized the cache and conflicted unnecessarily due to the in-order allocation per node. 2/ A well known Java server application benchmark was run with a heap size that exceeded cache size by 3X. The cache conflict rate was 8% for the first run and degraded to 21% after page allocator aging. With randomization enabled the rate levelled out at 11%. 3/ A MongoDB workload did not observe measurable difference in cache-conflict rates, but the overall throughput dropped by 7% with randomization in one case. 4/ Mel Gorman ran his suite of performance workloads with randomization enabled on platforms without a memory-side-cache and saw a mix of some improvements and some losses [3]. While there is potentially significant improvement for applications that depend on low latency access across a wide working-set, the performance may be negligible to negative for other workloads. For this reason the shuffle capability defaults to off unless a direct-mapped memory-side-cache is detected. Even then, the page_alloc.shuffle=0 parameter can be specified to disable the randomization on those systems. Outside of memory-side-cache utilization concerns there is potentially security benefit from randomization. Some data exfiltration and return-oriented-programming attacks rely on the ability to infer the location of sensitive data objects. The kernel page allocator, especially early in system boot, has predictable first-in-first out behavior for physical pages. Pages are freed in physical address order when first onlined. Quoting Kees: "While we already have a base-address randomization (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and memory layouts would certainly be using the predictability of allocation ordering (i.e. for attacks where the base address isn't important: only the relative positions between allocated memory). This is common in lots of heap-style attacks. They try to gain control over ordering by spraying allocations, etc. I'd really like to see this because it gives us something similar to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator." While SLAB_FREELIST_RANDOM reduces the predictability of some local slab caches it leaves vast bulk of memory to be predictably in order allocated. However, it should be noted, the concrete security benefits are hard to quantify, and no known CVE is mitigated by this randomization. Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform a Fisher-Yates shuffle of the page allocator 'free_area' lists when they are initially populated with free memory at boot and at hotplug time. Do this based on either the presence of a page_alloc.shuffle=Y command line parameter, or autodetection of a memory-side-cache (to be added in a follow-on patch). The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e. 10, 4MB this trades off randomization granularity for time spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page allocator while still showing memory-side cache behavior improvements, and the expectation that the security implications of finer granularity randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM. The performance impact of the shuffling appears to be in the noise compared to other memory initialization work. This initial randomization can be undone over time so a follow-on patch is introduced to inject entropy on page free decisions. It is reasonable to ask if the page free entropy is sufficient, but it is not enough due to the in-order initial freeing of pages. At the start of that process putting page1 in front or behind page0 still keeps them close together, page2 is still near page1 and has a high chance of being adjacent. As more pages are added ordering diversity improves, but there is still high page locality for the low address pages and this leads to no significant impact to the cache conflict rate. [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/ [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM [3]: https://lkml.org/lkml/2018/10/12/309 [dan.j.williams@intel.com: fix shuffle enable] Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts] Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Robert Elliott <elliott@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 22:41:28 +00:00
page_alloc.shuffle=
[KNL] Boolean flag to control whether the page allocator
should randomize its free lists. The randomization may
be automatically enabled if the kernel detects it is
running on a platform with a direct-mapped memory-side
cache, and this parameter can be used to
override/disable that behavior. The state of the flag
can be read from sysfs at:
/sys/module/page_alloc/parameters/shuffle.
mm/page_owner: keep track of page owners This is the page owner tracking code which is introduced so far ago. It is resident on Andrew's tree, though, nobody tried to upstream so it remain as is. Our company uses this feature actively to debug memory leak or to find a memory hogger so I decide to upstream this feature. This functionality help us to know who allocates the page. When allocating a page, we store some information about allocation in extra memory. Later, if we need to know status of all pages, we can get and analyze it from this stored information. In previous version of this feature, extra memory is statically defined in struct page, but, in this version, extra memory is allocated outside of struct page. It enables us to turn on/off this feature at boottime without considerable memory waste. Although we already have tracepoint for tracing page allocation/free, using it to analyze page owner is rather complex. We need to enlarge the trace buffer for preventing overlapping until userspace program launched. And, launched program continually dump out the trace buffer for later analysis and it would change system behaviour with more possibility rather than just keeping it in memory, so bad for debug. Moreover, we can use page_owner feature further for various purposes. For example, we can use it for fragmentation statistics implemented in this patch. And, I also plan to implement some CMA failure debugging feature using this interface. I'd like to give the credit for all developers contributed this feature, but, it's not easy because I don't know exact history. Sorry about that. Below is people who has "Signed-off-by" in the patches in Andrew's tree. Contributor: Alexander Nyberg <alexn@dsv.su.se> Mel Gorman <mgorman@suse.de> Dave Hansen <dave@linux.vnet.ibm.com> Minchan Kim <minchan@kernel.org> Michal Nazarewicz <mina86@mina86.com> Andrew Morton <akpm@linux-foundation.org> Jungsoo Son <jungsoo.son@lge.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 00:56:01 +00:00
page_owner= [KNL] Boot-time page_owner enabling option.
Storage of the information about who allocated
each page is disabled in default. With this switch,
we can turn it on.
on: enable the feature
page_poison= [KNL] Boot-time parameter changing the state of
poisoning on the buddy allocator, available with
CONFIG_PAGE_POISONING=y.
off: turn off poisoning (default)
on: turn on poisoning
page_reporting.page_reporting_order=
[KNL] Minimal page reporting order
Format: <integer>
Adjust the minimal page reporting order. The page
reporting is disabled when it exceeds MAX_PAGE_ORDER.
panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
timeout = 0: wait forever
timeout < 0: reboot immediately
Format: <timeout>
kernel: add panic_on_taint Analogously to the introduction of panic_on_warn, this patch introduces a kernel option named panic_on_taint in order to provide a simple and generic way to stop execution and catch a coredump when the kernel gets tainted by any given flag. This is useful for debugging sessions as it avoids having to rebuild the kernel to explicitly add calls to panic() into the code sites that introduce the taint flags of interest. For instance, if one is interested in proceeding with a post-mortem analysis at the point a given code path is hitting a bad page (i.e. unaccount_page_cache_page(), or slab_bug()), a coredump can be collected by rebooting the kernel with 'panic_on_taint=0x20' amended to the command line. Another, perhaps less frequent, use for this option would be as a means for assuring a security policy case where only a subset of taints, or no single taint (in paranoid mode), is allowed for the running system. The optional switch 'nousertaint' is handy in this particular scenario, as it will avoid userspace induced crashes by writes to sysctl interface /proc/sys/kernel/tainted causing false positive hits for such policies. [akpm@linux-foundation.org: tweak kernel-parameters.txt wording] Suggested-by: Qian Cai <cai@lca.pw> Signed-off-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Bunk <bunk@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Takashi Iwai <tiwai@suse.de> Link: http://lkml.kernel.org/r/20200515175502.146720-1-aquini@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 04:40:17 +00:00
panic_on_taint= Bitmask for conditionally calling panic() in add_taint()
Format: <hex>[,nousertaint]
Hexadecimal bitmask representing the set of TAINT flags
that will cause the kernel to panic when add_taint() is
called with any of the flags in this set.
The optional switch "nousertaint" can be utilized to
prevent userspace forced crashes by writing to sysctl
/proc/sys/kernel/tainted any flagset matching with the
bitmask set on panic_on_taint.
See Documentation/admin-guide/tainted-kernels.rst for
extra details on the taint flags that users can pick
to compose the bitmask to assign to panic_on_taint.
panic_on_warn=1 panic() instead of WARN(). Useful to cause kdump
kernel: add panic_on_warn There have been several times where I have had to rebuild a kernel to cause a panic when hitting a WARN() in the code in order to get a crash dump from a system. Sometimes this is easy to do, other times (such as in the case of a remote admin) it is not trivial to send new images to the user. A much easier method would be a switch to change the WARN() over to a panic. This makes debugging easier in that I can now test the actual image the WARN() was seen on and I do not have to engage in remote debugging. This patch adds a panic_on_warn kernel parameter and /proc/sys/kernel/panic_on_warn calls panic() in the warn_slowpath_common() path. The function will still print out the location of the warning. An example of the panic_on_warn output: The first line below is from the WARN_ON() to output the WARN_ON()'s location. After that the panic() output is displayed. WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]() Kernel panic - not syncing: panic_on_warn set ... CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013 0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190 0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df Call Trace: [<ffffffff81665190>] dump_stack+0x46/0x58 [<ffffffff8165e2ec>] panic+0xd0/0x204 [<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module] [<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0 [<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module] [<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20 [<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module] [<ffffffff81002144>] do_one_initcall+0xd4/0x210 [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110 [<ffffffff810f8889>] load_module+0x16a9/0x1b30 [<ffffffff810f3d30>] ? store_uevent+0x70/0x70 [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180 [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0 [<ffffffff8166cf29>] system_call_fastpath+0x12/0x17 Successfully tested by me. hpa said: There is another very valid use for this: many operators would rather a machine shuts down than being potentially compromised either functionally or security-wise. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 23:45:50 +00:00
on a WARN().
panic_print= Bitmask for printing system info when panic happens.
User can chose combination of the following bits:
bit 0: print all tasks info
bit 1: print system memory info
bit 2: print timer info
bit 3: print locks info if CONFIG_LOCKDEP is on
bit 4: print ftrace buffer
bit 5: print all printk messages in buffer
bit 6: print all CPUs backtrace (if available in the arch)
*Be aware* that this option may print a _lot_ of lines,
so there are risks of losing older messages in the log.
Use this option carefully, maybe worth to setup a
bigger log buffer with "log_buf_len" along with this.
parkbd.port= [HW] Parallel port number the keyboard adapter is
connected to, default is 0.
Format: <parport#>
parkbd.mode= [HW] Parallel port keyboard adapter mode of operation,
0 for XT, 1 for AT (default is AT).
Format: <mode>
parport= [HW,PPT] Specify parallel ports. 0 disables.
Format: { 0 | auto | 0xBBB[,IRQ[,DMA]] }
Use 'auto' to force the driver to use any
IRQ/DMA settings detected (the default is to
ignore detected IRQ/DMA settings because of
possible conflicts). You can specify the base
address, IRQ, and DMA settings; IRQ and DMA
should be numbers, or 'auto' (for using detected
settings on that particular port), or 'nofifo'
(to avoid using a FIFO even if it is detected).
Parallel ports are assigned in the order they
are specified on the command line, starting
with parport0.
parport_init_mode= [HW,PPT]
Configure VIA parallel port to operate in
a specific mode. This is necessary on Pegasos
computer where firmware has no options for setting
up parallel port mode and sets it to spp.
Currently this function knows 686a and 8231 chips.
Format: [spp|ps2|epp|ecp|ecpepp]
pata_legacy.all= [HW,LIBATA]
Format: <int>
Set to non-zero to probe primary and secondary ISA
port ranges on PCI systems where no PCI PATA device
has been found at either range. Disabled by default.
pata_legacy.autospeed= [HW,LIBATA]
Format: <int>
Set to non-zero if a chip is present that snoops speed
changes. Disabled by default.
pata_legacy.ht6560a= [HW,LIBATA]
Format: <int>
Set to 1, 2, or 3 for HT 6560A on the primary channel,
the secondary channel, or both channels respectively.
Disabled by default.
pata_legacy.ht6560b= [HW,LIBATA]
Format: <int>
Set to 1, 2, or 3 for HT 6560B on the primary channel,
the secondary channel, or both channels respectively.
Disabled by default.
pata_legacy.iordy_mask= [HW,LIBATA]
Format: <int>
IORDY enable mask. Set individual bits to allow IORDY
for the respective channel. Bit 0 is for the first
legacy channel handled by this driver, bit 1 is for
the second channel, and so on. The sequence will often
correspond to the primary legacy channel, the secondary
legacy channel, and so on, but the handling of a PCI
bus and the use of other driver options may interfere
with the sequence. By default IORDY is allowed across
all channels.
pata_legacy.opti82c46x= [HW,LIBATA]
Format: <int>
Set to 1, 2, or 3 for Opti 82c611A on the primary
channel, the secondary channel, or both channels
respectively. Disabled by default.
pata_legacy.opti82c611a= [HW,LIBATA]
Format: <int>
Set to 1, 2, or 3 for Opti 82c465MV on the primary
channel, the secondary channel, or both channels
respectively. Disabled by default.
pata_legacy.pio_mask= [HW,LIBATA]
Format: <int>
PIO mode mask for autospeed devices. Set individual
bits to allow the use of the respective PIO modes.
Bit 0 is for mode 0, bit 1 is for mode 1, and so on.
All modes allowed by default.
pata_legacy.probe_all= [HW,LIBATA]
Format: <int>
Set to non-zero to probe tertiary and further ISA
port ranges on PCI systems. Disabled by default.
pata_legacy: Add `probe_mask' parameter like with ide-generic Carry the `probe_mask' parameter over from ide-generic to pata_legacy so that there is a way to prevent random poking at ISA port I/O locations in attempt to discover adapter option cards with libata like with the old IDE driver. By default all enabled locations are tried, however it may interfere with a different kind of hardware responding there. For example with a plain (E)ISA system the driver tries all the six possible locations: scsi host0: pata_legacy ata1: PATA max PIO4 cmd 0x1f0 ctl 0x3f6 irq 14 ata1.00: ATA-4: ST310211A, 3.54, max UDMA/100 ata1.00: 19541088 sectors, multi 16: LBA ata1.00: configured for PIO scsi 0:0:0:0: Direct-Access ATA ST310211A 3.54 PQ: 0 ANSI: 5 scsi 0:0:0:0: Attached scsi generic sg0 type 0 sd 0:0:0:0: [sda] 19541088 512-byte logical blocks: (10.0 GB/9.32 GiB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sda2 sda3 sd 0:0:0:0: [sda] Attached SCSI disk scsi host1: pata_legacy ata2: PATA max PIO4 cmd 0x170 ctl 0x376 irq 15 scsi host1: pata_legacy ata3: PATA max PIO4 cmd 0x1e8 ctl 0x3ee irq 11 scsi host1: pata_legacy ata4: PATA max PIO4 cmd 0x168 ctl 0x36e irq 10 scsi host1: pata_legacy ata5: PATA max PIO4 cmd 0x1e0 ctl 0x3e6 irq 8 scsi host1: pata_legacy ata6: PATA max PIO4 cmd 0x160 ctl 0x366 irq 12 however giving the kernel "pata_legacy.probe_mask=21" makes it try every other location only: scsi host0: pata_legacy ata1: PATA max PIO4 cmd 0x1f0 ctl 0x3f6 irq 14 ata1.00: ATA-4: ST310211A, 3.54, max UDMA/100 ata1.00: 19541088 sectors, multi 16: LBA ata1.00: configured for PIO scsi 0:0:0:0: Direct-Access ATA ST310211A 3.54 PQ: 0 ANSI: 5 scsi 0:0:0:0: Attached scsi generic sg0 type 0 sd 0:0:0:0: [sda] 19541088 512-byte logical blocks: (10.0 GB/9.32 GiB) sd 0:0:0:0: [sda] Write Protect is off sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00 sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sda2 sda3 sd 0:0:0:0: [sda] Attached SCSI disk scsi host1: pata_legacy ata2: PATA max PIO4 cmd 0x1e8 ctl 0x3ee irq 11 scsi host1: pata_legacy ata3: PATA max PIO4 cmd 0x1e0 ctl 0x3e6 irq 8 Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/alpine.DEB.2.21.2103211800110.21463@angie.orcam.me.uk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-21 19:55:32 +00:00
pata_legacy.probe_mask= [HW,LIBATA]
Format: <int>
Probe mask for legacy ISA PATA ports. Depending on
platform configuration and the use of other driver
options up to 6 legacy ports are supported: 0x1f0,
0x170, 0x1e8, 0x168, 0x1e0, 0x160, however probing
of individual ports can be disabled by setting the
corresponding bits in the mask to 1. Bit 0 is for
the first port in the list above (0x1f0), and so on.
By default all supported ports are probed.
pata_legacy.qdi= [HW,LIBATA]
Format: <int>
Set to non-zero to probe QDI controllers. By default
set to 1 if CONFIG_PATA_QDI_MODULE, 0 otherwise.
pata_legacy.winbond= [HW,LIBATA]
Format: <int>
Set to non-zero to probe Winbond controllers. Use
the standard I/O port (0x130) if 1, otherwise the
value given is the I/O port to use (typically 0x1b0).
By default set to 1 if CONFIG_PATA_WINBOND_VLB_MODULE,
0 otherwise.
pata_platform.pio_mask= [HW,LIBATA]
Format: <int>
Supported PIO mode mask. Set individual bits to allow
the use of the respective PIO modes. Bit 0 is for
mode 0, bit 1 is for mode 1, and so on. Mode 0 only
allowed by default.
pause_on_oops=<int>
Halt all CPUs after the first oops has been printed for
the specified number of seconds. This is to be used if
your oopses keep scrolling off the screen.
pcbit= [HW,ISDN]
pci=option[,option...] [PCI] various PCI subsystem options.
Some options herein operate on a specific device
or a set of devices (<pci_dev>). These are
specified in one of the following formats:
[<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]*
pci:<vendor>:<device>[:<subvendor>:<subdevice>]
Note: the first format specifies a PCI
bus/device/function address which may change
if new hardware is inserted, if motherboard
firmware changes, or due to changes caused
by other kernel parameters. If the
domain is left unspecified, it is
taken to be zero. Optionally, a path
to a device through multiple device/function
addresses can be specified after the base
address (this is more robust against
renumbering issues). The second format
selects devices using IDs from the
configuration space which may match multiple
devices in the system.
earlydump dump PCI config space before the kernel
changes anything
off [X86] don't probe for the PCI bus
bios [X86-32] force use of PCI BIOS, don't access
the hardware directly. Use this if your machine
has a non-standard PCI host bridge.
nobios [X86-32] disallow use of PCI BIOS, only direct
hardware access methods are allowed. Use this
if you experience crashes upon bootup and you
suspect they are caused by the BIOS.
conf1 [X86] Force use of PCI Configuration Access
Mechanism 1 (config address in IO port 0xCF8,
data in IO port 0xCFC, both 32-bit).
conf2 [X86] Force use of PCI Configuration Access
Mechanism 2 (IO port 0xCF8 is an 8-bit port for
the function, IO port 0xCFA, also 8-bit, sets
bus number. The config space is then accessed
through ports 0xC000-0xCFFF).
See http://wiki.osdev.org/PCI for more info
on the configuration access mechanisms.
noaer [PCIE] If the PCIEAER kernel config parameter is
enabled, this kernel boot option can be used to
disable the use of PCIE advanced error reporting.
nodomains [PCI] Disable support for multiple PCI
root domains (aka PCI segments, in ACPI-speak).
nommconf [X86] Disable use of MMCONFIG for PCI
Configuration
check_enable_amd_mmconf [X86] check for and enable
properly configured MMIO access to PCI
config space on AMD family 10h CPU
nomsi [MSI] If the PCI_MSI kernel config parameter is
enabled, this kernel boot option can be used to
disable the use of MSI interrupts system-wide.
noioapicquirk [APIC] Disable all boot interrupt quirks.
Safety option to keep boot IRQs enabled. This
should never be necessary.
ioapicreroute [APIC] Enable rerouting of boot IRQs to the
primary IO-APIC for bridges that cannot disable
boot IRQs. This fixes a source of spurious IRQs
when the system masks IRQs.
noioapicreroute [APIC] Disable workaround that uses the
boot IRQ equivalent of an IRQ that connects to
a chipset where boot IRQs cannot be disabled.
The opposite of ioapicreroute.
biosirq [X86-32] Use PCI BIOS calls to get the interrupt
routing table. These calls are known to be buggy
on several machines and they hang the machine
when used, but on other computers it's the only
way to get the interrupt routing table. Try
this option if the kernel is unable to allocate
IRQs or discover secondary PCI buses on your
motherboard.
rom [X86] Assign address space to expansion ROMs.
Use with caution as certain devices share
address decoders between ROMs and other
resources.
norom [X86] Do not assign address space to
PCI: boot parameter to avoid expansion ROM memory allocation Contention for scarce PCI memory resources has been growing due to an increasing number of PCI slots in large multi-node systems. The kernel currently attempts by default to allocate memory for all PCI expansion ROMs so there has also been an increasing number of PCI memory allocation failures seen on these systems. This occurs because the BIOS either (1) provides insufficient PCI memory resource for all the expansion ROMs or (2) provides adequate PCI memory resource for expansion ROMs but provides the space in kernel unexpected BIOS assigned P2P non-prefetch windows. The resulting PCI memory allocation failures may be benign when related to memory requests for expansion ROMs themselves but in some cases they can occur when attempting to allocate space for more critical BARs. This can happen when a successful expansion ROM allocation request consumes memory resource that was intended for a non-ROM BAR. We have seen this happen during PCI hotplug of an adapter that contains a P2P bridge where successful memory allocation for an expansion ROM BAR on device behind the bridge consumed memory that was intended for a non-ROM BAR on the P2P bridge. In all cases the allocation failure messages can be very confusing for users. This patch provides a new 'pci=norom' kernel boot parameter that can be used to disable the default PCI expansion ROM memory resource allocation. This provides a way to avoid the above described issues on systems that do not contain PCI devices for which drivers or user-level applications depend on the default PCI expansion ROM memory resource allocation behavior. Signed-off-by: Gary Hade <garyhade@us.ibm.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2008-05-12 20:57:46 +00:00
expansion ROMs that do not already have
BIOS assigned address ranges.
nobar [X86] Do not assign address space to the
BARs that weren't assigned by the BIOS.
irqmask=0xMMMM [X86] Set a bit mask of IRQs allowed to be
assigned automatically to PCI devices. You can
make the kernel exclude IRQs of your ISA cards
this way.
pirqaddr=0xAAAAA [X86] Specify the physical address
of the PIRQ table (normally generated
by the BIOS) if it is outside the
F0000h-100000h range.
lastbus=N [X86] Scan all buses thru bus #N. Can be
useful if the kernel is unable to find your
secondary buses and you want to tell it
explicitly which ones they are.
assign-busses [X86] Always assign all PCI bus
numbers ourselves, overriding
whatever the firmware may have done.
usepirqmask [X86] Honor the possible IRQ mask stored
in the BIOS $PIR table. This is needed on
some systems with broken BIOSes, notably
some HP Pavilion N5400 and Omnibook XE3
notebooks. This will have no effect if ACPI
IRQ routing is enabled.
noacpi [X86] Do not use ACPI for IRQ routing
or for PCI scanning.
use_crs [X86] Use PCI host bridge window information
from ACPI. On BIOSes from 2008 or later, this
is enabled by default. If you need to use this,
please report a bug.
nocrs [X86] Ignore PCI host bridge windows from ACPI.
If you need to use this, please report a bug.
x86/PCI: Add kernel cmdline options to use/ignore E820 reserved regions Some firmware supplies PCI host bridge _CRS that includes address space unusable by PCI devices, e.g., space occupied by host bridge registers or used by hidden PCI devices. To avoid this unusable space, Linux currently excludes E820 reserved regions from _CRS windows; see 4dc2287c1805 ("x86: avoid E820 regions when allocating address space"). However, this use of E820 reserved regions to clip things out of _CRS is not supported by ACPI, UEFI, or PCI Firmware specs, and some systems have E820 reserved regions that cover the entire memory window from _CRS. 4dc2287c1805 clips the entire window, leaving no space for hot-added or uninitialized PCI devices. For example, from a Lenovo IdeaPad 3 15IIL 81WE: BIOS-e820: [mem 0x4bc50000-0xcfffffff] reserved pci_bus 0000:00: root bus resource [mem 0x65400000-0xbfffffff window] pci 0000:00:15.0: BAR 0: [mem 0x00000000-0x00000fff 64bit] pci 0000:00:15.0: BAR 0: no space for [mem size 0x00001000 64bit] Future patches will add quirks to enable/disable E820 clipping automatically. Add a "pci=no_e820" kernel command line option to disable clipping with E820 reserved regions. Also add a matching "pci=use_e820" option to enable clipping with E820 reserved regions if that has been disabled by default by further patches in this patch-set. Both options taint the kernel because they are intended for debugging and workaround purposes until a quirk can set them automatically. [bhelgaas: commit log, add printk] Link: https://bugzilla.redhat.com/show_bug.cgi?id=1868899 Lenovo IdeaPad 3 Link: https://lore.kernel.org/r/20220519152150.6135-2-hdegoede@redhat.com Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Benoit Grégoire <benoitg@coeus.ca> Cc: Hui Wang <hui.wang@canonical.com>
2022-05-19 15:21:48 +00:00
use_e820 [X86] Use E820 reservations to exclude parts of
PCI host bridge windows. This is a workaround
for BIOS defects in host bridge _CRS methods.
If you need to use this, please report a bug to
<linux-pci@vger.kernel.org>.
no_e820 [X86] Ignore E820 reservations for PCI host
bridge windows. This is the default on modern
hardware. If you need to use this, please report
a bug to <linux-pci@vger.kernel.org>.
routeirq Do IRQ routing for all PCI devices.
This is normally done in pci_enable_device(),
so this option is a temporary workaround
for broken drivers that don't call it.
skip_isa_align [X86] do not align io start addr, so can
handle more pci cards
noearly [X86] Don't do any early type 1 scanning.
This might help on some broken boards which
machine check when some devices' config space
is read. But various workarounds are disabled
and some IOMMU drivers will not work.
PCI: optionally sort device lists breadth-first Problem: New Dell PowerEdge servers have 2 embedded ethernet ports, which are labeled NIC1 and NIC2 on the chassis, in the BIOS setup screens, and in the printed documentation. Assuming no other add-in ethernet ports in the system, Linux 2.4 kernels name these eth0 and eth1 respectively. Many people have come to expect this naming. Linux 2.6 kernels name these eth1 and eth0 respectively (backwards from expectations). I also have reports that various Sun and HP servers have similar behavior. Root cause: Linux 2.4 kernels walk the pci_devices list, which happens to be sorted in breadth-first order (or pcbios_find_device order on i386, which most often is breadth-first also). 2.6 kernels have both the pci_devices list and the pci_bus_type.klist_devices list, the latter is what is walked at driver load time to match the pci_id tables; this klist happens to be in depth-first order. On systems where, for physical routing reasons, NIC1 appears on a lower bus number than NIC2, but NIC2's bridge is discovered first in the depth-first ordering, NIC2 will be discovered before NIC1. If the list were sorted breadth-first, NIC1 would be discovered before NIC2. A PowerEdge 1955 system has the following topology which easily exhibits the difference between depth-first and breadth-first device lists. -[0000:00]-+-00.0 Intel Corporation 5000P Chipset Memory Controller Hub +-02.0-[0000:03-08]--+-00.0-[0000:04-07]--+-00.0-[0000:05-06]----00.0-[0000:06]----00.0 Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (labeled NIC2, 2.4 kernel name eth1, 2.6 kernel name eth0) +-1c.0-[0000:01-02]----00.0-[0000:02]----00.0 Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (labeled NIC1, 2.4 kernel name eth0, 2.6 kernel name eth1) Other factors, such as device driver load order and the presence of PCI slots at various points in the bus hierarchy further complicate this problem; I'm not trying to solve those here, just restore the device order, and thus basic behavior, that 2.4 kernels had. Solution: The solution can come in multiple steps. Suggested fix #1: kernel Patch below optionally sorts the two device lists into breadth-first ordering to maintain compatibility with 2.4 kernels. It adds two new command line options: pci=bfsort pci=nobfsort to force the sort order, or not, as you wish. It also adds DMI checks for the specific Dell systems which exhibit "backwards" ordering, to make them "right". Suggested fix #2: udev rules from userland Many people also have the expectation that embedded NICs are always discovered before add-in NICs (which this patch does not try to do). Using the PCI IRQ Routing Table provided by system BIOS, it's easy to determine which PCI devices are embedded, or if add-in, which PCI slot they're in. I'm working on a tool that would allow udev to name ethernet devices in ascending embedded, slot 1 .. slot N order, subsort by PCI bus/dev/fn breadth-first. It'll be possible to use it independent of udev as well for those distributions that don't use udev in their installers. Suggested fix #3: system board routing rules One can constrain the system board layout to put NIC1 ahead of NIC2 regardless of breadth-first or depth-first discovery order. This adds a significant level of complexity to board routing, and may not be possible in all instances (witness the above systems from several major manufacturers). I don't want to encourage this particular train of thought too far, at the expense of not doing #1 or #2 above. Feedback appreciated. Patch tested on a Dell PowerEdge 1955 blade with 2.6.18. You'll also note I took some liberty and temporarily break the klist abstraction to simplify and speed up the sort algorithm. I think that's both safe and appropriate in this instance. Signed-off-by: Matt Domsch <Matt_Domsch@dell.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-29 20:23:23 +00:00
bfsort Sort PCI devices into breadth-first order.
This sorting is done to get a device
order compatible with older (<= 2.4) kernels.
nobfsort Don't sort PCI devices into breadth-first order.
pcie_bus_tune_off Disable PCIe MPS (Max Payload Size)
tuning and use the BIOS-configured MPS defaults.
pcie_bus_safe Set every device's MPS to the largest value
supported by all devices below the root complex.
pcie_bus_perf Set device MPS to the largest allowable MPS
based on its parent bus. Also set MRRS (Max
Read Request Size) to the largest supported
value (no larger than the MPS that the device
or bus can support) for best performance.
pcie_bus_peer2peer Set every device's MPS to 128B, which
every device is guaranteed to support. This
configuration allows peer-to-peer DMA between
any pair of devices, possibly at the cost of
reduced performance. This also guarantees
that hot-added devices will work.
cbiosize=nn[KMG] The fixed amount of bus space which is
reserved for the CardBus bridge's IO window.
The default value is 256 bytes.
cbmemsize=nn[KMG] The fixed amount of bus space which is
reserved for the CardBus bridge's memory
window. The default value is 64 megabytes.
2009-03-16 08:13:39 +00:00
resource_alignment=
Format:
[<order of align>@]<pci_dev>[; ...]
2009-03-16 08:13:39 +00:00
Specifies alignment and device to reassign
aligned memory resources. How to
specify the device is described above.
2009-03-16 08:13:39 +00:00
If <order of align> is not specified,
PAGE_SIZE is used as alignment.
A PCI-PCI bridge can be specified if resource
2009-03-16 08:13:39 +00:00
windows need to be expanded.
To specify the alignment for several
instances of a device, the PCI vendor,
device, subvendor, and subdevice may be
specified, e.g., 12@pci:8086:9c22:103c:198f
for 4096-byte alignment.
ecrc= Enable/disable PCIe ECRC (transaction layer
end-to-end CRC checking). Only effective if
OS has native AER control (either granted by
ACPI _OSC or forced via "pcie_ports=native")
bios: Use BIOS/firmware settings. This is the
the default.
off: Turn ECRC off
on: Turn ECRC on.
hpiosize=nn[KMG] The fixed amount of bus space which is
reserved for hotplug bridge's IO window.
Default size is 256 bytes.
hpmmiosize=nn[KMG] The fixed amount of bus space which is
reserved for hotplug bridge's MMIO window.
Default size is 2 megabytes.
hpmmioprefsize=nn[KMG] The fixed amount of bus space which is
reserved for hotplug bridge's MMIO_PREF window.
Default size is 2 megabytes.
hpmemsize=nn[KMG] The fixed amount of bus space which is
reserved for hotplug bridge's MMIO and
MMIO_PREF window.
Default size is 2 megabytes.
hpbussize=nn The minimum amount of additional bus numbers
reserved for buses below a hotplug bridge.
Default is 1.
realloc= Enable/disable reallocating PCI bridge resources
if allocations done by BIOS are too small to
accommodate resources required by all child
devices.
off: Turn realloc off
on: Turn realloc on
realloc same as realloc=on
noari do not use PCIe ARI.
noats [PCIE, Intel-IOMMU, AMD-IOMMU]
do not use PCIe ATS (and IOMMU device IOTLB).
pcie_scan_all Scan all possible PCIe devices. Otherwise we
only look for one device below a PCIe downstream
port.
big_root_window Try to add a big 64bit memory window to the PCIe
root complex on AMD CPUs. Some GFX hardware
can resize a BAR to allow access to all VRAM.
Adding the window is slightly risky (it may
conflict with unreported devices), so this
taints the kernel.
disable_acs_redir=<pci_dev>[; ...]
Specify one or more PCI devices (in the format
specified above) separated by semicolons.
Each device specified will have the PCI ACS
redirect capabilities forced off which will
allow P2P traffic between devices through
bridges without forcing it upstream. Note:
this removes isolation between devices and
may put more devices in an IOMMU group.
force_floating [S390] Force usage of floating interrupts.
nomio [S390] Do not use MIO instructions.
norid [S390] ignore the RID field and force use of
one PCI domain per PCI function
PCI: optionally sort device lists breadth-first Problem: New Dell PowerEdge servers have 2 embedded ethernet ports, which are labeled NIC1 and NIC2 on the chassis, in the BIOS setup screens, and in the printed documentation. Assuming no other add-in ethernet ports in the system, Linux 2.4 kernels name these eth0 and eth1 respectively. Many people have come to expect this naming. Linux 2.6 kernels name these eth1 and eth0 respectively (backwards from expectations). I also have reports that various Sun and HP servers have similar behavior. Root cause: Linux 2.4 kernels walk the pci_devices list, which happens to be sorted in breadth-first order (or pcbios_find_device order on i386, which most often is breadth-first also). 2.6 kernels have both the pci_devices list and the pci_bus_type.klist_devices list, the latter is what is walked at driver load time to match the pci_id tables; this klist happens to be in depth-first order. On systems where, for physical routing reasons, NIC1 appears on a lower bus number than NIC2, but NIC2's bridge is discovered first in the depth-first ordering, NIC2 will be discovered before NIC1. If the list were sorted breadth-first, NIC1 would be discovered before NIC2. A PowerEdge 1955 system has the following topology which easily exhibits the difference between depth-first and breadth-first device lists. -[0000:00]-+-00.0 Intel Corporation 5000P Chipset Memory Controller Hub +-02.0-[0000:03-08]--+-00.0-[0000:04-07]--+-00.0-[0000:05-06]----00.0-[0000:06]----00.0 Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (labeled NIC2, 2.4 kernel name eth1, 2.6 kernel name eth0) +-1c.0-[0000:01-02]----00.0-[0000:02]----00.0 Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (labeled NIC1, 2.4 kernel name eth0, 2.6 kernel name eth1) Other factors, such as device driver load order and the presence of PCI slots at various points in the bus hierarchy further complicate this problem; I'm not trying to solve those here, just restore the device order, and thus basic behavior, that 2.4 kernels had. Solution: The solution can come in multiple steps. Suggested fix #1: kernel Patch below optionally sorts the two device lists into breadth-first ordering to maintain compatibility with 2.4 kernels. It adds two new command line options: pci=bfsort pci=nobfsort to force the sort order, or not, as you wish. It also adds DMI checks for the specific Dell systems which exhibit "backwards" ordering, to make them "right". Suggested fix #2: udev rules from userland Many people also have the expectation that embedded NICs are always discovered before add-in NICs (which this patch does not try to do). Using the PCI IRQ Routing Table provided by system BIOS, it's easy to determine which PCI devices are embedded, or if add-in, which PCI slot they're in. I'm working on a tool that would allow udev to name ethernet devices in ascending embedded, slot 1 .. slot N order, subsort by PCI bus/dev/fn breadth-first. It'll be possible to use it independent of udev as well for those distributions that don't use udev in their installers. Suggested fix #3: system board routing rules One can constrain the system board layout to put NIC1 ahead of NIC2 regardless of breadth-first or depth-first discovery order. This adds a significant level of complexity to board routing, and may not be possible in all instances (witness the above systems from several major manufacturers). I don't want to encourage this particular train of thought too far, at the expense of not doing #1 or #2 above. Feedback appreciated. Patch tested on a Dell PowerEdge 1955 blade with 2.6.18. You'll also note I took some liberty and temporarily break the klist abstraction to simplify and speed up the sort algorithm. I think that's both safe and appropriate in this instance. Signed-off-by: Matt Domsch <Matt_Domsch@dell.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-29 20:23:23 +00:00
pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power
Management.
off Disable ASPM.
force Enable ASPM even on devices that claim not to support it.
WARNING: Forcing ASPM on may cause system lockups.
pcie_ports= [PCIE] PCIe port services handling:
native Use native PCIe services (PME, AER, DPC, PCIe hotplug)
even if the platform doesn't give the OS permission to
use them. This may cause conflicts if the platform
also tries to use these services.
dpc-native Use native PCIe service for DPC only. May
cause conflicts if firmware uses AER or DPC.
compat Disable native PCIe services (PME, AER, DPC, PCIe
hotplug).
PCI: Put PCIe ports into D3 during suspend Currently the Linux PCI core does not touch power state of PCI bridges and PCIe ports when system suspend is entered. Leaving them in D0 consumes power unnecessarily and may prevent the CPU from entering deeper C-states. With recent PCIe hardware we can power down the ports to save power given that we take into account few restrictions: - The PCIe port hardware is recent enough, starting from 2015. - Devices connected to PCIe ports are effectively in D3cold once the port is transitioned to D3 (the config space is not accessible anymore and the link may be powered down). - Devices behind the PCIe port need to be allowed to transition to D3cold and back. There is a way both drivers and userspace can forbid this. - If the device behind the PCIe port is capable of waking the system it needs to be able to do so from D3cold. This patch adds a new flag to struct pci_device called 'bridge_d3'. This flag is set and cleared by the PCI core whenever there is a change in power management state of any of the devices behind the PCIe port. When system later on is suspended we only need to check this flag and if it is true transition the port to D3 otherwise we leave it in D0. Also provide override mechanism via command line parameter "pcie_port_pm=[off|force]" that can be used to disable or enable the feature regardless of the BIOS manufacturing date. Tested-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-06-02 08:17:12 +00:00
pcie_port_pm= [PCIE] PCIe port power management handling:
off Disable power management of all PCIe ports
force Forcibly enable power management of all PCIe ports
pcie_pme= [PCIE,PM] Native PCIe PME signaling options:
nomsi Do not use MSI for native PCIe PME signaling (this makes
PCI: PCIe: Ask BIOS for control of all native services at once After commit 852972acff8f10f3a15679be2059bb94916cba5d (ACPI: Disable ASPM if the platform won't provide _OSC control for PCIe) control of the PCIe Capability Structure is unconditionally requested by acpi_pci_root_add(), which in principle may cause problems to happen in two ways. First, the BIOS may refuse to give control of the PCIe Capability Structure if it is not asked for any of the _OSC features depending on it at the same time. Second, the BIOS may assume that control of the _OSC features depending on the PCIe Capability Structure will be requested in the future and may behave incorrectly if that doesn't happen. For this reason, control of the PCIe Capability Structure should always be requested along with control of any other _OSC features that may depend on it (ie. PCIe native PME, PCIe native hot-plug, PCIe AER). Rework the PCIe port driver so that (1) it checks which native PCIe port services can be enabled, according to the BIOS, and (2) it requests control of all these services simultaneously. In particular, this causes pcie_portdrv_probe() to fail if the BIOS refuses to grant control of the PCIe Capability Structure, which means that no native PCIe port services can be enabled for the PCIe Root Complex the given port belongs to. If that happens, ASPM is disabled to avoid problems with mishandling it by the part of the PCIe hierarchy for which control of the PCIe Capability Structure has not been received. Make it possible to override this behavior using 'pcie_ports=native' (use the PCIe native services regardless of the BIOS response to the control request), or 'pcie_ports=compat' (do not use the PCIe native services at all). Accordingly, rework the existing PCIe port service drivers so that they don't request control of the services directly. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2010-08-21 20:02:38 +00:00
all PCIe root ports use INTx for all services).
pcmv= [HW,PCMCIA] BadgePAD 4
pd_ignore_unused
[PM]
Keep all power-domains already enabled by bootloader on,
even if no driver has claimed them. This is useful
for debug and development, but should not be
needed on a platform with proper driver support.
pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at
boot time.
Format: { 0 | 1 }
See arch/parisc/kernel/pdc_chassis.c
percpu_alloc= Select which percpu first chunk allocator to use.
Currently supported values are "embed" and "page".
Archs may support subset or none of the selections.
See comments in mm/percpu.c for details on each
allocator. This parameter is primarily for debugging
and performance comparison.
pirq= [SMP,APIC] Manual mp-table setup
See Documentation/arch/x86/i386/IO-APIC.rst.
plip= [PPT,NET] Parallel port network link
Format: { parport<nr> | timid | 0 }
See also Documentation/admin-guide/parport.rst.
pmtmr= [X86] Manual setup of pmtmr I/O Port.
Override pmtimer IOPort with a hex value.
e.g. pmtmr=0x508
pmu_override= [PPC] Override the PMU.
This option takes over the PMU facility, so it is no
longer usable by perf. Setting this option starts the
PMU counters by setting MMCR0 to 0 (the FC bit is
cleared). If a number is given, then MMCR1 is set to
that number, otherwise (e.g., 'pmu_override=on'), MMCR1
remains 0.
pm_debug_messages [SUSPEND,KNL]
Enable suspend/resume debug messages during boot up.
pnp.debug=1 [PNP]
Enable PNP debug messages (depends on the
CONFIG_PNP_DEBUG_MESSAGES option). Change at run-time
via /sys/module/pnp/parameters/debug. We always show
current resource usage; turning this on also shows
possible settings and some assignment information.
pnpacpi= [ACPI]
{ off }
pnpbios= [ISAPNP]
{ on | off | curr | res | no-curr | no-res }
pnp_reserve_irq=
[ISAPNP] Exclude IRQs for the autoconfiguration
pnp_reserve_dma=
[ISAPNP] Exclude DMAs for the autoconfiguration
pnp_reserve_io= [ISAPNP] Exclude I/O ports for the autoconfiguration
Ranges are in pairs (I/O port base and size).
pnp_reserve_mem=
[ISAPNP] Exclude memory regions for the
autoconfiguration.
Ranges are in pairs (memory base and size).
ports= [IP_VS_FTP] IPVS ftp helper module
Default is 21.
Up to 8 (IP_VS_APP_MAX_PORTS) ports
may be specified.
Format: <port>,<port>....
powersave=off [PPC] This option disables power saving features.
It specifically disables cpuidle and sets the
platform machine description specific power_save
function to NULL. On Idle the CPU just reduces
execution priority.
ppc_strict_facility_enable
[PPC] This option catches any kernel floating point,
Altivec, VSX and SPE outside of regions specifically
allowed (eg kernel_enable_fpu()/kernel_disable_fpu()).
There is some performance impact when enabling this.
ppc_tm= [PPC]
Format: {"off"}
Disable Hardware Transactional Memory
preempt= [KNL]
Select preemption mode if you have CONFIG_PREEMPT_DYNAMIC
none - Limited to cond_resched() calls
voluntary - Limited to cond_resched() and might_sleep() calls
full - Any section that isn't explicitly preempt disabled
can be preempted anytime.
print-fatal-signals=
[KNL] debug: print fatal signals
If enabled, warn about various signal handling
related application anomalies: too many signals,
too many POSIX.1 timers, fatal signals causing a
coredump - etc.
If you hit the warning due to signal overflow,
you might want to try "ulimit -i unlimited".
default: off.
printk.always_kmsg_dump=
Trigger kmsg_dump for cases other than kernel oops or
panics
Format: <bool> (1/Y/y=enable, 0/N/n=disable)
default: disabled
printk.console_no_auto_verbose=
Disable console loglevel raise on oops, panic
or lockdep-detected issues (only if lock debug is on).
With an exception to setups with low baudrate on
serial console, keeping this 0 is a good choice
in order to provide more debug information.
Format: <bool>
default: 0 (auto_verbose is enabled)
printk: add kernel parameter to control writes to /dev/kmsg Add a "printk.devkmsg" kernel command line parameter which controls how userspace writes into /dev/kmsg. It has three options: * ratelimit - ratelimit logging from userspace. * on - unlimited logging from userspace * off - logging from userspace gets ignored The default setting is to ratelimit the messages written to it. This changes the kernel default setting of "on" to "ratelimit" and we do that because we want to keep userspace spamming /dev/kmsg to sane levels. This is especially moot when a small kernel log buffer wraps around and messages get lost. So the ratelimiting setting should be a sane setting where kernel messages should have a bit higher chance of survival from all the spamming. It additionally does not limit logging to /dev/kmsg while the system is booting if we haven't disabled it on the command line. Furthermore, we can control the logging from a lower priority sysctl interface - kernel.printk_devkmsg. That interface will succeed only if printk.devkmsg *hasn't* been supplied on the command line. If it has, then printk.devkmsg is a one-time setting which remains for the duration of the system lifetime. This "locking" of the setting is to prevent userspace from changing the logging on us through sysctl(2). This patch is based on previous patches from Linus and Steven. [bp@suse.de: fixes] Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Dave Young <dyoung@redhat.com> Cc: Franck Bui <fbui@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 21:04:07 +00:00
printk.devkmsg={on,off,ratelimit}
Control writing to /dev/kmsg.
on - unlimited logging to /dev/kmsg from userspace
off - logging to /dev/kmsg disabled
ratelimit - ratelimit the logging
Default: ratelimit
printk.time= Show timing data prefixed to each printk message line
Format: <bool> (1/Y/y=enable, 0/N/n=disable)
processor.max_cstate= [HW,ACPI]
Limit processor to maximum C-state
max_cstate=9 overrides any DMI blacklist limit.
processor.nocst [HW,ACPI]
Ignore the _CST method to determine C-states,
instead using the legacy FADT method
profile= [KNL] Enable kernel profiling via /proc/profile
Format: [<profiletype>,]<number>
Param: <profiletype>: "schedule", "sleep", or "kvm"
[defaults to kernel profiling]
Param: "schedule" - profile schedule points.
Param: "sleep" - profile D-state sleeping (millisecs).
Requires CONFIG_SCHEDSTATS
Param: "kvm" - profile VM exits.
Param: <number> - step/bucket size as a power of 2 for
statistical time based profiling.
prompt_ramdisk= [RAM] [Deprecated]
prot_virt= [S390] enable hosting protected virtual machines
isolated from the hypervisor (if hardware supports
that).
Format: <bool>
psi: make disabling/enabling easier for vendor kernels Mel Gorman reports a hackbench regression with psi that would prohibit shipping the suse kernel with it default-enabled, but he'd still like users to be able to opt in at little to no cost to others. With the current combination of CONFIG_PSI and the psi_disabled bool set from the commandline, this is a challenge. Do the following things to make it easier: 1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros to enable CONFIG_PSI in their kernel but leave the feature disabled unless a user requests it at boot-time. To avoid double negatives, rename psi_disabled= to psi=. 2. Make psi_disabled a static branch to eliminate any branch costs when the feature is disabled. In terms of numbers before and after this patch, Mel says: : The following is a comparision using CONFIG_PSI=n as a baseline against : your patch and a vanilla kernel : : 4.20.0-rc4 4.20.0-rc4 4.20.0-rc4 : kconfigdisable-v1r1 vanilla psidisable-v1r1 : Amean 1 1.3100 ( 0.00%) 1.3923 ( -6.28%) 1.3427 ( -2.49%) : Amean 3 3.8860 ( 0.00%) 4.1230 * -6.10%* 3.8860 ( -0.00%) : Amean 5 6.8847 ( 0.00%) 8.0390 * -16.77%* 6.7727 ( 1.63%) : Amean 7 9.9310 ( 0.00%) 10.8367 * -9.12%* 9.9910 ( -0.60%) : Amean 12 16.6577 ( 0.00%) 18.2363 * -9.48%* 17.1083 ( -2.71%) : Amean 18 26.5133 ( 0.00%) 27.8833 * -5.17%* 25.7663 ( 2.82%) : Amean 24 34.3003 ( 0.00%) 34.6830 ( -1.12%) 32.0450 ( 6.58%) : Amean 30 40.0063 ( 0.00%) 40.5800 ( -1.43%) 41.5087 ( -3.76%) : Amean 32 40.1407 ( 0.00%) 41.2273 ( -2.71%) 39.9417 ( 0.50%) : : It's showing that the vanilla kernel takes a hit (as the bisection : indicated it would) and that disabling PSI by default is reasonably : close in terms of performance for this particular workload on this : particular machine so; Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-30 22:09:58 +00:00
psi= [KNL] Enable or disable pressure stall information
tracking.
Format: <bool>
psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to
probe for; one of (bare|imps|exps|lifebook|any).
psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports
per second.
psmouse.resetafter= [HW,MOUSE]
Try to reset the device after so many bad packets
(0 = never).
psmouse.resolution=
[HW,MOUSE] Set desired mouse resolution, in dpi.
psmouse.smartscroll=
[HW,MOUSE] Controls Logitech smartscroll autorepeat.
0 = disabled, 1 = enabled (default).
pstore.backend= Specify the name of the pstore backend to use
pti= [X86-64] Control Page Table Isolation of user and
kernel address spaces. Disabling this feature
removes hardening, but improves performance of
system calls and interrupts.
on - unconditionally enable
off - unconditionally disable
auto - kernel detects whether your CPU model is
vulnerable to issues that PTI mitigates
Not specifying this option is equivalent to pti=auto.
pty.legacy_count=
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
quiet [KNL] Disable most log messages
r128= [HW,DRM]
radix_hcall_invalidate=on [PPC/PSERIES]
Disable RADIX GTSE feature and use hcall for TLB
invalidate.
raid= [HW,RAID]
See Documentation/admin-guide/md.rst.
ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
See Documentation/admin-guide/blockdev/ramdisk.rst.
ramdisk_start= [RAM] RAM disk image start address
random.trust_cpu=off
[KNL] Disable trusting the use of the CPU's
random number generator (if available) to
initialize the kernel's RNG.
random.trust_bootloader=off
[KNL] Disable trusting the use of the a seed
passed by the bootloader (if available) to
initialize the kernel's RNG.
stack: Optionally randomize kernel stack offset each syscall This provides the ability for architectures to enable kernel stack base address offset randomization. This feature is controlled by the boot param "randomize_kstack_offset=on/off", with its default value set by CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT. This feature is based on the original idea from the last public release of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt All the credit for the original idea goes to the PaX team. Note that the design and implementation of this upstream randomize_kstack_offset feature differs greatly from the RANDKSTACK feature (see below). Reasoning for the feature: This feature aims to make harder the various stack-based attacks that rely on deterministic stack structure. We have had many such attacks in past (just to name few): https://jon.oberheide.org/files/infiltrate12-thestackisback.pdf https://jon.oberheide.org/files/stackjacking-infiltrate11.pdf https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html As Linux kernel stack protections have been constantly improving (vmap-based stack allocation with guard pages, removal of thread_info, STACKLEAK), attackers have had to find new ways for their exploits to work. They have done so, continuing to rely on the kernel's stack determinism, in situations where VMAP_STACK and THREAD_INFO_IN_TASK_STRUCT were not relevant. For example, the following recent attacks would have been hampered if the stack offset was non-deterministic between syscalls: https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf (page 70: targeting the pt_regs copy with linear stack overflow) https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html (leaked stack address from one syscall as a target during next syscall) The main idea is that since the stack offset is randomized on each system call, it is harder for an attack to reliably land in any particular place on the thread stack, even with address exposures, as the stack base will change on the next syscall. Also, since randomization is performed after placing pt_regs, the ptrace-based approach[1] to discover the randomized offset during a long-running syscall should not be possible. Design description: During most of the kernel's execution, it runs on the "thread stack", which is pretty deterministic in its structure: it is fixed in size, and on every entry from userspace to kernel on a syscall the thread stack starts construction from an address fetched from the per-cpu cpu_current_top_of_stack variable. The first element to be pushed to the thread stack is the pt_regs struct that stores all required CPU registers and syscall parameters. Finally the specific syscall function is called, with the stack being used as the kernel executes the resulting request. The goal of randomize_kstack_offset feature is to add a random offset after the pt_regs has been pushed to the stack and before the rest of the thread stack is used during the syscall processing, and to change it every time a process issues a syscall. The source of randomness is currently architecture-defined (but x86 is using the low byte of rdtsc()). Future improvements for different entropy sources is possible, but out of scope for this patch. Further more, to add more unpredictability, new offsets are chosen at the end of syscalls (the timing of which should be less easy to measure from userspace than at syscall entry time), and stored in a per-CPU variable, so that the life of the value does not stay explicitly tied to a single task. As suggested by Andy Lutomirski, the offset is added using alloca() and an empty asm() statement with an output constraint, since it avoids changes to assembly syscall entry code, to the unwinder, and provides correct stack alignment as defined by the compiler. In order to make this available by default with zero performance impact for those that don't want it, it is boot-time selectable with static branches. This way, if the overhead is not wanted, it can just be left turned off with no performance impact. The generated assembly for x86_64 with GCC looks like this: ... ffffffff81003977: 65 8b 05 02 ea 00 7f mov %gs:0x7f00ea02(%rip),%eax # 12380 <kstack_offset> ffffffff8100397e: 25 ff 03 00 00 and $0x3ff,%eax ffffffff81003983: 48 83 c0 0f add $0xf,%rax ffffffff81003987: 25 f8 07 00 00 and $0x7f8,%eax ffffffff8100398c: 48 29 c4 sub %rax,%rsp ffffffff8100398f: 48 8d 44 24 0f lea 0xf(%rsp),%rax ffffffff81003994: 48 83 e0 f0 and $0xfffffffffffffff0,%rax ... As a result of the above stack alignment, this patch introduces about 5 bits of randomness after pt_regs is spilled to the thread stack on x86_64, and 6 bits on x86_32 (since its has 1 fewer bit required for stack alignment). The amount of entropy could be adjusted based on how much of the stack space we wish to trade for security. My measure of syscall performance overhead (on x86_64): lmbench: /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_syscall -N 10000 null randomize_kstack_offset=y Simple syscall: 0.7082 microseconds randomize_kstack_offset=n Simple syscall: 0.7016 microseconds So, roughly 0.9% overhead growth for a no-op syscall, which is very manageable. And for people that don't want this, it's off by default. There are two gotchas with using the alloca() trick. First, compilers that have Stack Clash protection (-fstack-clash-protection) enabled by default (e.g. Ubuntu[3]) add pagesize stack probes to any dynamic stack allocations. While the randomization offset is always less than a page, the resulting assembly would still contain (unreachable!) probing routines, bloating the resulting assembly. To avoid this, -fno-stack-clash-protection is unconditionally added to the kernel Makefile since this is the only dynamic stack allocation in the kernel (now that VLAs have been removed) and it is provably safe from Stack Clash style attacks. The second gotcha with alloca() is a negative interaction with -fstack-protector*, in that it sees the alloca() as an array allocation, which triggers the unconditional addition of the stack canary function pre/post-amble which slows down syscalls regardless of the static branch. In order to avoid adding this unneeded check and its associated performance impact, architectures need to carefully remove uses of -fstack-protector-strong (or -fstack-protector) in the compilation units that use the add_random_kstack() macro and to audit the resulting stack mitigation coverage (to make sure no desired coverage disappears). No change is visible for this on x86 because the stack protector is already unconditionally disabled for the compilation unit, but the change is required on arm64. There is, unfortunately, no attribute that can be used to disable stack protector for specific functions. Comparison to PaX RANDKSTACK feature: The RANDKSTACK feature randomizes the location of the stack start (cpu_current_top_of_stack), i.e. including the location of pt_regs structure itself on the stack. Initially this patch followed the same approach, but during the recent discussions[2], it has been determined to be of a little value since, if ptrace functionality is available for an attacker, they can use PTRACE_PEEKUSR/PTRACE_POKEUSR to read/write different offsets in the pt_regs struct, observe the cache behavior of the pt_regs accesses, and figure out the random stack offset. Another difference is that the random offset is stored in a per-cpu variable, rather than having it be per-thread. As a result, these implementations differ a fair bit in their implementation details and results, though obviously the intent is similar. [1] https://lore.kernel.org/kernel-hardening/2236FBA76BA1254E88B949DDB74E612BA4BC57C1@IRSMSX102.ger.corp.intel.com/ [2] https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/ [3] https://lists.ubuntu.com/archives/ubuntu-devel/2019-June/040741.html Co-developed-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210401232347.2791257-4-keescook@chromium.org
2021-04-01 23:23:44 +00:00
randomize_kstack_offset=
[KNL] Enable or disable kernel stack offset
randomization, which provides roughly 5 bits of
entropy, frustrating memory corruption attacks
that depend on stack address determinism or
cross-syscall address exposures. This is only
available on architectures that have defined
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET.
Format: <bool> (1/Y/y=enable, 0/N/n=disable)
Default is CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
ras=option[,option,...] [KNL] RAS-specific options
cec_disable [X86]
Disable the Correctable Errors Collector,
see CONFIG_RAS_CEC help text.
rcu_nocbs[=cpu-list]
[KNL] The optional argument is a cpu list,
as described above.
In kernels built with CONFIG_RCU_NOCB_CPU=y,
enable the no-callback CPU mode, which prevents
such CPUs' callbacks from being invoked in
softirq context. Invocation of such CPUs' RCU
callbacks will instead be offloaded to "rcuox/N"
kthreads created for that purpose, where "x" is
"p" for RCU-preempt, "s" for RCU-sched, and "g"
for the kthreads that mediate grace periods; and
"N" is the CPU number. This reduces OS jitter on
the offloaded CPUs, which can be useful for HPC
and real-time workloads. It can also improve
energy efficiency for asymmetric multiprocessors.
If a cpulist is passed as an argument, the specified
list of CPUs is set to no-callback mode from boot.
Otherwise, if the '=' sign and the cpulist
arguments are omitted, no CPU will be set to
no-callback mode from boot but the mode may be
toggled at runtime via cpusets.
Note that this argument takes precedence over
the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
rcu_nocb_poll [KNL]
Rather than requiring that offloaded CPUs
(specified by rcu_nocbs= above) explicitly
awaken the corresponding "rcuoN" kthreads,
make these kthreads poll for callbacks.
This improves the real-time response for the
offloaded CPUs by relieving them of the need to
wake up the corresponding kthread, but degrades
energy efficiency by requiring that the kthreads
periodically wake up to do the polling.
rcutree.blimit= [KNL]
Set maximum number of finished RCU callbacks to
process in one batch.
rcutree.do_rcu_barrier= [KNL]
Request a call to rcu_barrier(). This is
throttled so that userspace tests can safely
hammer on the sysfs variable if they so choose.
If triggered before the RCU grace-period machinery
is fully active, this will error out with EAGAIN.
rcutree.dump_tree= [KNL]
Dump the structure of the rcu_node combining tree
out at early boot. This is used for diagnostic
purposes, to verify correct tree setup.
rcutree.gp_cleanup_delay= [KNL]
Set the number of jiffies to delay each step of
RCU grace-period cleanup.
rcutree.gp_init_delay= [KNL]
Set the number of jiffies to delay each step of
RCU grace-period initialization.
rcutree.gp_preinit_delay= [KNL]
Set the number of jiffies to delay each step of
RCU grace-period pre-initialization, that is,
the propagation of recent CPU-hotplug changes up
the rcu_node combining tree.
rcutree.jiffies_till_first_fqs= [KNL]
Set delay from grace-period initialization to
first attempt to force quiescent states.
Units are jiffies, minimum value is zero,
and maximum value is HZ.
rcutree.jiffies_till_next_fqs= [KNL]
Set delay between subsequent attempts to force
quiescent states. Units are jiffies, minimum
value is one, and maximum value is HZ.
rcutree.jiffies_till_sched_qs= [KNL]
Set required age in jiffies for a
given grace period before RCU starts
soliciting quiescent-state help from
rcu_note_context_switch() and cond_resched().
If not specified, the kernel will calculate
a value based on the most recent settings
of rcutree.jiffies_till_first_fqs
and rcutree.jiffies_till_next_fqs.
This calculated value may be viewed in
rcutree.jiffies_to_sched_qs. Any attempt to set
rcutree.jiffies_to_sched_qs will be cheerfully
overwritten.
rcutree.kthread_prio= [KNL,BOOT]
Set the SCHED_FIFO priority of the RCU per-CPU
kthreads (rcuc/N). This value is also used for
the priority of the RCU boost threads (rcub/N)
and for the RCU grace-period kthreads (rcu_bh,
rcu_preempt, and rcu_sched). If RCU_BOOST is
set, valid values are 1-99 and the default is 1
(the least-favored priority). Otherwise, when
RCU_BOOST is not set, valid values are 0-99 and
the default is zero (non-realtime operation).
When RCU_NOCB_CPU is set, also adjust the
priority of NOCB callback kthreads.
rcutree.nocb_nobypass_lim_per_jiffy= [KNL]
On callback-offloaded (rcu_nocbs) CPUs,
RCU reduces the lock contention that would
otherwise be caused by callback floods through
use of the ->nocb_bypass list. However, in the
common non-flooded case, RCU queues directly to
the main ->cblist in order to avoid the extra
overhead of the ->nocb_bypass list and its lock.
But if there are too many callbacks queued during
a single jiffy, RCU pre-queues the callbacks into
the ->nocb_bypass queue. The definition of "too
many" is supplied by this kernel boot parameter.
rcutree.qhimark= [KNL]
Set threshold of queued RCU callbacks beyond which
batch limiting is disabled.
rcutree.qlowmark= [KNL]
Set threshold of queued RCU callbacks below which
batch limiting is re-enabled.
rcutree.qovld= [KNL]
Set threshold of queued RCU callbacks beyond which
RCU's force-quiescent-state scan will aggressively
enlist help from cond_resched() and sched IPIs to
help CPUs more quickly reach quiescent states.
Set to less than zero to make this be set based
on rcutree.qhimark at boot time and to zero to
disable more aggressive help enlistment.
rcutree.rcu_delay_page_cache_fill_msec= [KNL]
Set the page-cache refill delay (in milliseconds)
in response to low-memory conditions. The range
of permitted values is in the range 0:100000.
rcutree.rcu_divisor= [KNL]
Set the shift-right count to use to compute
the callback-invocation batch limit bl from
the number of callbacks queued on this CPU.
The result will be bounded below by the value of
the rcutree.blimit kernel parameter. Every bl
callbacks, the softirq handler will exit in
order to allow the CPU to do other work.
Please note that this callback-invocation batch
limit applies only to non-offloaded callback
invocation. Offloaded callbacks are instead
invoked in the context of an rcuoc kthread, which
scheduler will preempt as it does any other task.
rcutree.rcu_fanout_exact= [KNL]
Disable autobalancing of the rcu_node combining
tree. This is used by rcutorture, and might
possibly be useful for architectures having high
cache-to-cache transfer latencies.
rcutree.rcu_fanout_leaf= [KNL]
Change the number of CPUs assigned to each
leaf rcu_node structure. Useful for very
large systems, which will choose the value 64,
and for NUMA systems with large remote-access
latencies, which will choose a value aligned
with the appropriate hardware boundaries.
rcutree.rcu_min_cached_objs= [KNL]
Minimum number of objects which are cached and
maintained per one CPU. Object size is equal
to PAGE_SIZE. The cache allows to reduce the
pressure to page allocator, also it makes the
whole algorithm to behave better in low memory
condition.
rcutree.rcu_nocb_gp_stride= [KNL]
Set the number of NOCB callback kthreads in
each group, which defaults to the square root
of the number of CPUs. Larger numbers reduce
the wakeup overhead on the global grace-period
kthread, but increases that same overhead on
each group's NOCB grace-period kthread.
rcutree.rcu_kick_kthreads= [KNL]
Cause the grace-period kthread to get an extra
wake_up() if it sleeps three times longer than
it should at force-quiescent-state time.
This wake_up() will be accompanied by a
WARN_ONCE() splat and an ftrace_dump().
rcutree.rcu_resched_ns= [KNL]
Limit the time spend invoking a batch of RCU
callbacks to the specified number of nanoseconds.
By default, this limit is checked only once
every 32 callbacks in order to limit the pain
inflicted by local_clock() overhead.
rcutree.rcu_unlock_delay= [KNL]
In CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels,
this specifies an rcu_read_unlock()-time delay
in microseconds. This defaults to zero.
Larger delays increase the probability of
catching RCU pointer leaks, that is, buggy use
of RCU-protected pointers after the relevant
rcu_read_unlock() has completed.
rcutree.sysrq_rcu= [KNL]
Commandeer a sysrq key to dump out Tree RCU's
rcu_node tree with an eye towards determining
why a new grace period has not yet started.
rcutree.use_softirq= [KNL]
If set to zero, move all RCU_SOFTIRQ processing to
per-CPU rcuc kthreads. Defaults to a non-zero
value, meaning that RCU_SOFTIRQ is used by default.
Specify rcutree.use_softirq=0 to use rcuc kthreads.
But note that CONFIG_PREEMPT_RT=y kernels disable
this kernel boot parameter, forcibly setting it
to zero.
rcuscale.gp_async= [KNL]
Measure performance of asynchronous
grace-period primitives such as call_rcu().
rcuscale.gp_async_max= [KNL]
Specify the maximum number of outstanding
callbacks per writer thread. When a writer
thread exceeds this limit, it invokes the
corresponding flavor of rcu_barrier() to allow
previously posted callbacks to drain.
rcuscale.gp_exp= [KNL]
Measure performance of expedited synchronous
grace-period primitives.
rcuscale.holdoff= [KNL]
Set test-start holdoff period. The purpose of
this parameter is to delay the start of the
test until boot completes in order to avoid
interference.
rcuscale.kfree_by_call_rcu= [KNL]
In kernels built with CONFIG_RCU_LAZY=y, test
call_rcu() instead of kfree_rcu().
rcuscale.kfree_mult= [KNL]
Instead of allocating an object of size kfree_obj,
allocate one of kfree_mult * sizeof(kfree_obj).
Defaults to 1.
rcuscale.kfree_rcu_test= [KNL]
Set to measure performance of kfree_rcu() flooding.
rcuscale.kfree_rcu_test_double= [KNL]
Test the double-argument variant of kfree_rcu().
If this parameter has the same value as
rcuscale.kfree_rcu_test_single, both the single-
and double-argument variants are tested.
rcuscale.kfree_rcu_test_single= [KNL]
Test the single-argument variant of kfree_rcu().
If this parameter has the same value as
rcuscale.kfree_rcu_test_double, both the single-
and double-argument variants are tested.
rcuscale.kfree_nthreads= [KNL]
The number of threads running loops of kfree_rcu().
rcuscale.kfree_alloc_num= [KNL]
Number of allocations and frees done in an iteration.
rcuscale.kfree_loops= [KNL]
Number of loops doing rcuscale.kfree_alloc_num number
of allocations and frees.
rcuscale.minruntime= [KNL]
Set the minimum test run time in seconds. This
does not affect the data-collection interval,
but instead allows better measurement of things
like CPU consumption.
rcuscale.nreaders= [KNL]
Set number of RCU readers. The value -1 selects
N, where N is the number of CPUs. A value
"n" less than -1 selects N-n+1, where N is again
the number of CPUs. For example, -2 selects N
(the number of CPUs), -3 selects N+1, and so on.
A value of "n" less than or equal to -N selects
a single reader.
rcuscale.nwriters= [KNL]
Set number of RCU writers. The values operate
the same as for rcuscale.nreaders.
N, where N is the number of CPUs
rcuscale.scale_type= [KNL]
Specify the RCU implementation to test.
rcuscale.shutdown= [KNL]
Shut the system down after performance tests
complete. This is useful for hands-off automated
testing.
rcuscale.verbose= [KNL]
Enable additional printk() statements.
rcuscale.writer_holdoff= [KNL]
Write-side holdoff between grace periods,
in microseconds. The default of zero says
no holdoff.
rcuscale.writer_holdoff_jiffies= [KNL]
Additional write-side holdoff between grace
periods, but in jiffies. The default of zero
says no holdoff.
rcutorture.fqs_duration= [KNL]
Set duration of force_quiescent_state bursts
in microseconds.
rcutorture.fqs_holdoff= [KNL]
Set holdoff time within force_quiescent_state bursts
in microseconds.
rcutorture.fqs_stutter= [KNL]
Set wait time between force_quiescent_state bursts
in seconds.
rcutorture.fwd_progress= [KNL]
Specifies the number of kthreads to be used
for RCU grace-period forward-progress testing
for the types of RCU supporting this notion.
Defaults to 1 kthread, values less than zero or
greater than the number of CPUs cause the number
of CPUs to be used.
rcutorture.fwd_progress_div= [KNL]
Specify the fraction of a CPU-stall-warning
period to do tight-loop forward-progress testing.
rcutorture.fwd_progress_holdoff= [KNL]
Number of seconds to wait between successive
forward-progress tests.
rcutorture.fwd_progress_need_resched= [KNL]
Enclose cond_resched() calls within checks for
need_resched() during tight-loop forward-progress
testing.
rcutorture.gp_cond= [KNL]
Use conditional/asynchronous update-side
primitives, if available.
rcutorture.gp_exp= [KNL]
Use expedited update-side primitives, if available.
rcutorture.gp_normal= [KNL]
Use normal (non-expedited) asynchronous
update-side primitives, if available.
rcutorture.gp_sync= [KNL]
Use normal (non-expedited) synchronous
update-side primitives, if available. If all
of rcutorture.gp_cond=, rcutorture.gp_exp=,
rcutorture.gp_normal=, and rcutorture.gp_sync=
are zero, rcutorture acts as if is interpreted
they are all non-zero.
rcutorture.irqreader= [KNL]
Run RCU readers from irq handlers, or, more
accurately, from a timer handler. Not all RCU
flavors take kindly to this sort of thing.
rcutorture.leakpointer= [KNL]
Leak an RCU-protected pointer out of the reader.
This can of course result in splats, and is
intended to test the ability of things like
CONFIG_RCU_STRICT_GRACE_PERIOD=y to detect
such leaks.
rcutorture.n_barrier_cbs= [KNL]
Set callbacks/threads for rcu_barrier() testing.
rcutorture.nfakewriters= [KNL]
Set number of concurrent RCU writers. These just
stress RCU, they don't participate in the actual
test, hence the "fake".
rcutorture.nocbs_nthreads= [KNL]
Set number of RCU callback-offload togglers.
Zero (the default) disables toggling.
rcutorture.nocbs_toggle= [KNL]
Set the delay in milliseconds between successive
callback-offload toggling attempts.
rcutorture.nreaders= [KNL]
Set number of RCU readers. The value -1 selects
N-1, where N is the number of CPUs. A value
"n" less than -1 selects N-n-2, where N is again
the number of CPUs. For example, -2 selects N
(the number of CPUs), -3 selects N+1, and so on.
rcutorture.object_debug= [KNL]
Enable debug-object double-call_rcu() testing.
rcutorture.onoff_holdoff= [KNL]
Set time (s) after boot for CPU-hotplug testing.
rcutorture.onoff_interval= [KNL]
Set time (jiffies) between CPU-hotplug operations,
or zero to disable CPU-hotplug testing.
rcutorture.read_exit= [KNL]
Set the number of read-then-exit kthreads used
to test the interaction of RCU updaters and
task-exit processing.
rcutorture.read_exit_burst= [KNL]
The number of times in a given read-then-exit
episode that a set of read-then-exit kthreads
is spawned.
rcutorture.read_exit_delay= [KNL]
The delay, in seconds, between successive
read-then-exit testing episodes.
rcutorture.shuffle_interval= [KNL]
Set task-shuffle interval (s). Shuffling tasks
allows some CPUs to go into dyntick-idle mode
during the rcutorture test.
rcutorture.shutdown_secs= [KNL]
Set time (s) after boot system shutdown. This
is useful for hands-off automated testing.
rcutorture.stall_cpu= [KNL]
Duration of CPU stall (s) to test RCU CPU stall
warnings, zero to disable.
rcutorture.stall_cpu_block= [KNL]
Sleep while stalling if set. This will result
doc/rcutorture: Add description of rcutorture.stall_cpu_block If you build a kernel with CONFIG_PREEMPTION=n and CONFIG_PREEMPT_COUNT=y, then run the rcutorture tests specifying stalls as follows: runqemu kvm slirp nographic qemuparams="-m 1024 -smp 4" \ bootparams="console=ttyS0 rcutorture.stall_cpu=30 \ rcutorture.stall_no_softlockup=1 rcutorture.stall_cpu_block=1" -d The tests will produce the following splat: [ 10.841071] rcu-torture: rcu_torture_stall begin CPU stall [ 10.841073] rcu_torture_stall start on CPU 3. [ 10.841077] BUG: scheduling while atomic: rcu_torture_sta/66/0x0000000 .... [ 10.841108] Call Trace: [ 10.841110] <TASK> [ 10.841112] dump_stack_lvl+0x64/0xb0 [ 10.841118] dump_stack+0x10/0x20 [ 10.841121] __schedule_bug+0x8b/0xb0 [ 10.841126] __schedule+0x2172/0x2940 [ 10.841157] schedule+0x9b/0x150 [ 10.841160] schedule_timeout+0x2e8/0x4f0 [ 10.841192] schedule_timeout_uninterruptible+0x47/0x50 [ 10.841195] rcu_torture_stall+0x2e8/0x300 [ 10.841199] kthread+0x175/0x1a0 [ 10.841206] ret_from_fork+0x2c/0x50 This is because the rcutorture.stall_cpu_block=1 module parameter causes rcu_torture_stall() to invoke schedule_timeout_uninterruptible() within an RCU read-side critical section. This in turn results in a quiescent state (which prevents the stall) and a sleep in an atomic context (which produces the above splat). Although this code is operating as designed, the design has proven to be counterintuitive to many. This commit therefore updates the description in kernel-parameters.txt accordingly. [ paulmck: Apply Joel Fernandes feedback. ] Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-03-21 02:12:34 +00:00
in warnings from preemptible RCU in addition to
any other stall-related activity. Note that
in kernels built with CONFIG_PREEMPTION=n and
CONFIG_PREEMPT_COUNT=y, this parameter will
cause the CPU to pass through a quiescent state.
Given CONFIG_PREEMPTION=n, this will suppress
RCU CPU stall warnings, but will instead result
in scheduling-while-atomic splats.
Use of this module parameter results in splats.
rcutorture.stall_cpu_holdoff= [KNL]
Time to wait (s) after boot before inducing stall.
rcutorture.stall_cpu_irqsoff= [KNL]
Disable interrupts while stalling if set.
rcutorture.stall_gp_kthread= [KNL]
Duration (s) of forced sleep within RCU
grace-period kthread to test RCU CPU stall
warnings, zero to disable. If both stall_cpu
and stall_gp_kthread are specified, the
kthread is starved first, then the CPU.
rcutorture.stat_interval= [KNL]
Time (s) between statistics printk()s.
rcutorture.stutter= [KNL]
Time (s) to stutter testing, for example, specifying
five seconds causes the test to run for five seconds,
wait for five seconds, and so on. This tests RCU's
ability to transition abruptly to and from idle.
rcutorture.test_boost= [KNL]
Test RCU priority boosting? 0=no, 1=maybe, 2=yes.
"Maybe" means test if the RCU implementation
under test support RCU priority boosting.
rcutorture.test_boost_duration= [KNL]
Duration (s) of each individual boost test.
rcutorture.test_boost_interval= [KNL]
Interval (s) between each boost test.
rcutorture.test_no_idle_hz= [KNL]
Test RCU's dyntick-idle handling. See also the
rcutorture.shuffle_interval parameter.
rcutorture.torture_type= [KNL]
Specify the RCU implementation to test.
rcutorture.verbose= [KNL]
Enable additional printk() statements.
rcupdate.rcu_cpu_stall_ftrace_dump= [KNL]
Dump ftrace buffer after reporting RCU CPU
stall warning.
rcu: Restrict access to RCU CPU stall notifiers Although the RCU CPU stall notifiers can be useful for dumping state when tracking down delicate forward-progress bugs where NUMA effects cause cache lines to be delivered to a given CPU regularly, but always in a state that prevents that CPU from making forward progress. These bugs can be detected by the RCU CPU stall-warning mechanism, but in some cases, the stall-warnings printk()s disrupt the forward-progress bug before any useful state can be obtained. Unfortunately, the notifier mechanism added by commit 5b404fdabacf ("rcu: Add RCU CPU stall notifier") can make matters worse if used at all carelessly. For example, if the stall warning was caused by a lock not being released, then any attempt to acquire that lock in the notifier will hang. This will prevent not only the notifier from producing any useful output, but it will also prevent the stall-warning message from ever appearing. This commit therefore hides this new RCU CPU stall notifier mechanism under a new RCU_CPU_STALL_NOTIFIER Kconfig option that depends on both DEBUG_KERNEL and RCU_EXPERT. In addition, the rcupdate.rcu_cpu_stall_notifiers=1 kernel boot parameter must also be specified. The RCU_CPU_STALL_NOTIFIER Kconfig option's help text contains a warning and explains the dangers of careless use, recommending lockless notifier code. In addition, a WARN() is triggered each time that an attempt is made to register a stall-warning notifier in kernels built with CONFIG_RCU_CPU_STALL_NOTIFIER=y. This combination of measures will keep use of this mechanism confined to debug kernels and away from routine deployments. [ paulmck: Apply Dan Carpenter feedback. ] Fixes: 5b404fdabacf ("rcu: Add RCU CPU stall notifier") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-11-02 01:28:38 +00:00
rcupdate.rcu_cpu_stall_notifiers= [KNL]
Provide RCU CPU stall notifiers, but see the
warnings in the RCU_CPU_STALL_NOTIFIER Kconfig
option's help text. TL;DR: You almost certainly
do not want rcupdate.rcu_cpu_stall_notifiers.
rcupdate.rcu_cpu_stall_suppress= [KNL]
Suppress RCU CPU stall warning messages.
rcupdate.rcu_cpu_stall_suppress_at_boot= [KNL]
Suppress RCU CPU stall warning messages and
rcutorture writer stall warnings that occur
during early boot, that is, during the time
before the init task is spawned.
rcupdate.rcu_cpu_stall_timeout= [KNL]
Set timeout for RCU CPU stall warning messages.
The value is in seconds and the maximum allowed
value is 300 seconds.
rcupdate.rcu_exp_cpu_stall_timeout= [KNL]
Set timeout for expedited RCU CPU stall warning
messages. The value is in milliseconds
and the maximum allowed value is 21000
milliseconds. Please note that this value is
adjusted to an arch timer tick resolution.
Setting this to zero causes the value from
rcupdate.rcu_cpu_stall_timeout to be used (after
conversion from seconds to milliseconds).
rcu: Add RCU stall diagnosis information Because RCU CPU stall warnings are driven from the scheduling-clock interrupt handler, a workload consisting of a very large number of short-duration hardware interrupts can result in misleading stall-warning messages. On systems supporting only a single level of interrupts, that is, where interrupts handlers cannot be interrupted, this can produce misleading diagnostics. The stack traces will show the innocent-bystander interrupted task, not the interrupts that are at the very least exacerbating the stall. This situation can be improved by displaying the number of interrupts and the CPU time that they have consumed. Diagnosing other types of stalls can be eased by also providing the count of softirqs and the CPU time that they consumed as well as the number of context switches and the task-level CPU time consumed. Consider the following output given this change: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 0-....: (1250 ticks this GP) <omitted> rcu: hardirqs softirqs csw/system rcu: number: 624 45 0 rcu: cputime: 69 1 2425 ==> 2500(ms) This output shows that the number of hard and soft interrupts is small, there are no context switches, and the system takes up a lot of time. This indicates that the current task is looping with preemption disabled. The impact on system performance is negligible because snapshot is recorded only once for all continuous RCU stalls. This added debugging information is suppressed by default and can be enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or by booting with rcupdate.rcu_cpu_stall_cputime=1. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-11-19 09:25:06 +00:00
rcupdate.rcu_cpu_stall_cputime= [KNL]
Provide statistics on the cputime and count of
interrupts and tasks during the sampling period. For
multiple continuous RCU stalls, all sampling periods
begin at half of the first RCU stall timeout.
rcupdate.rcu_exp_stall_task_details= [KNL]
Print stack dumps of any tasks blocking the
current expedited RCU grace period during an
expedited RCU CPU stall warning.
rcupdate.rcu_expedited= [KNL]
Use expedited grace-period primitives, for
example, synchronize_rcu_expedited() instead
of synchronize_rcu(). This reduces latency,
but can increase CPU utilization, degrade
real-time latency, and degrade energy efficiency.
No effect on CONFIG_TINY_RCU kernels.
rcupdate.rcu_normal= [KNL]
Use only normal grace-period primitives,
for example, synchronize_rcu() instead of
synchronize_rcu_expedited(). This improves
real-time latency, CPU utilization, and
energy efficiency, but can expose users to
increased grace-period latency. This parameter
overrides rcupdate.rcu_expedited. No effect on
CONFIG_TINY_RCU kernels.
rcupdate.rcu_normal_after_boot= [KNL]
Once boot has completed (that is, after
rcu_end_inkernel_boot() has been invoked), use
only normal grace-period primitives. No effect
on CONFIG_TINY_RCU kernels.
But note that CONFIG_PREEMPT_RT=y kernels enables
this kernel boot parameter, forcibly setting
it to the value one, that is, converting any
post-boot attempt at an expedited RCU grace
period to instead use normal non-expedited
grace-period processing.
rcupdate.rcu_task_collapse_lim= [KNL]
Set the maximum number of callbacks present
at the beginning of a grace period that allows
the RCU Tasks flavors to collapse back to using
a single callback queue. This switching only
occurs when rcupdate.rcu_task_enqueue_lim is
set to the default value of -1.
rcupdate.rcu_task_contend_lim= [KNL]
Set the minimum number of callback-queuing-time
lock-contention events per jiffy required to
cause the RCU Tasks flavors to switch to per-CPU
callback queuing. This switching only occurs
when rcupdate.rcu_task_enqueue_lim is set to
the default value of -1.
rcupdate.rcu_task_enqueue_lim= [KNL]
Set the number of callback queues to use for the
RCU Tasks family of RCU flavors. The default
of -1 allows this to be automatically (and
dynamically) adjusted. This parameter is intended
for use in testing.
rcupdate.rcu_task_ipi_delay= [KNL]
Set time in jiffies during which RCU tasks will
avoid sending IPIs, starting with the beginning
of a given grace period. Setting a large
number avoids disturbing real-time workloads,
but lengthens grace periods.
rcupdate.rcu_task_lazy_lim= [KNL]
Number of callbacks on a given CPU that will
cancel laziness on that CPU. Use -1 to disable
cancellation of laziness, but be advised that
doing so increases the danger of OOM due to
callback flooding.
rcupdate.rcu_task_stall_info= [KNL]
Set initial timeout in jiffies for RCU task stall
informational messages, which give some indication
of the problem for those not patient enough to
wait for ten minutes. Informational messages are
only printed prior to the stall-warning message
for a given grace period. Disable with a value
less than or equal to zero. Defaults to ten
seconds. A change in value does not take effect
until the beginning of the next grace period.
rcupdate.rcu_task_stall_info_mult= [KNL]
Multiplier for time interval between successive
RCU task stall informational messages for a given
RCU tasks grace period. This value is clamped
to one through ten, inclusive. It defaults to
the value three, so that the first informational
message is printed 10 seconds into the grace
period, the second at 40 seconds, the third at
160 seconds, and then the stall warning at 600
seconds would prevent a fourth at 640 seconds.
rcupdate.rcu_task_stall_timeout= [KNL]
Set timeout in jiffies for RCU task stall
warning messages. Disable with a value less
than or equal to zero. Defaults to ten minutes.
A change in value does not take effect until
the beginning of the next grace period.
rcupdate.rcu_tasks_lazy_ms= [KNL]
Set timeout in milliseconds RCU Tasks asynchronous
callback batching for call_rcu_tasks().
A negative value will take the default. A value
of zero will disable batching. Batching is
always disabled for synchronize_rcu_tasks().
rcupdate.rcu_tasks_rude_lazy_ms= [KNL]
Set timeout in milliseconds RCU Tasks
Rude asynchronous callback batching for
call_rcu_tasks_rude(). A negative value
will take the default. A value of zero will
disable batching. Batching is always disabled
for synchronize_rcu_tasks_rude().
rcupdate.rcu_tasks_trace_lazy_ms= [KNL]
Set timeout in milliseconds RCU Tasks
Trace asynchronous callback batching for
call_rcu_tasks_trace(). A negative value
will take the default. A value of zero will
disable batching. Batching is always disabled
for synchronize_rcu_tasks_trace().
rcupdate.rcu_self_test= [KNL]
Run the RCU early boot self tests
rdinit= [KNL]
Format: <full_path>
Run specified binary instead of /init from the ramdisk,
used for early userspace startup. See initrd.
x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h There have been reports of RDRAND issues after resuming from suspend on some AMD family 15h and family 16h systems. This issue stems from a BIOS not performing the proper steps during resume to ensure RDRAND continues to function properly. RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND support using CPUID, including the kernel, will believe that RDRAND is not supported. Update the CPU initialization to clear the RDRAND CPUID bit for any family 15h and 16h processor that supports RDRAND. If it is known that the family 15h or family 16h system does not have an RDRAND resume issue or that the system will not be placed in suspend, the "rdrand=force" kernel parameter can be used to stop the clearing of the RDRAND CPUID bit. Additionally, update the suspend and resume path to save and restore the MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in place after resuming from suspend. Note, that clearing the RDRAND CPUID bit does not prevent a processor that normally supports the RDRAND instruction from executing it. So any code that determined the support based on family and model won't #UD. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chen Yu <yu.c.chen@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org> Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: <stable@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "x86@kernel.org" <x86@kernel.org> Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com
2019-08-19 15:52:35 +00:00
rdrand= [X86]
force - Override the decision by the kernel to hide the
advertisement of RDRAND support (this affects
certain AMD processors because of buggy BIOS
support, specifically around the suspend/resume
path).
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
mba, smba, bmec.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
reboot= [KNL]
Format (x86 or x86_64):
[w[arm] | c[old] | h[ard] | s[oft] | g[pio]] | d[efault] \
[[,]s[mp]#### \
[[,]b[ios] | a[cpi] | k[bd] | t[riple] | e[fi] | p[ci]] \
[[,]f[orce]
Where reboot_mode is one of warm (soft) or cold (hard) or gpio
(prefix with 'panic_' to set mode for panic
reboot only),
reboot_type is one of bios, acpi, kbd, triple, efi, or pci,
reboot_force is either force or not specified,
reboot_cpu is s[mp]#### with #### being the processor
to be used for rebooting.
refscale.holdoff= [KNL]
Set test-start holdoff period. The purpose of
this parameter is to delay the start of the
test until boot completes in order to avoid
interference.
refscale.lookup_instances= [KNL]
Number of data elements to use for the forms of
SLAB_TYPESAFE_BY_RCU testing. A negative number
is negated and multiplied by nr_cpu_ids, while
zero specifies nr_cpu_ids.
refscale.loops= [KNL]
Set the number of loops over the synchronization
primitive under test. Increasing this number
reduces noise due to loop start/end overhead,
but the default has already reduced the per-pass
noise to a handful of picoseconds on ca. 2020
x86 laptops.
refscale.nreaders= [KNL]
Set number of readers. The default value of -1
selects N, where N is roughly 75% of the number
of CPUs. A value of zero is an interesting choice.
refscale.nruns= [KNL]
Set number of runs, each of which is dumped onto
the console log.
refscale.readdelay= [KNL]
Set the read-side critical-section duration,
measured in microseconds.
refscale.scale_type= [KNL]
Specify the read-protection implementation to test.
refscale.shutdown= [KNL]
Shut down the system at the end of the performance
test. This defaults to 1 (shut it down) when
refscale is built into the kernel and to 0 (leave
it running) when refscale is built as a module.
refscale.verbose= [KNL]
Enable additional printk() statements.
refscale.verbose_batched= [KNL]
Batch the additional printk() statements. If zero
(the default) or negative, print everything. Otherwise,
print every Nth verbose statement, where N is the value
specified.
regulator_ignore_unused
[REGULATOR]
Prevents regulator framework from disabling regulators
that are unused, due no driver claiming them. This may
be useful for debug and development, but should not be
needed on a platform with proper driver support.
relax_domain_level=
[KNL, SMP] Set scheduler's default relax_domain_level.
See Documentation/admin-guide/cgroup-v1/cpusets.rst.
reserve= [KNL,BUGS] Force kernel to ignore I/O ports or memory
Format: <base1>,<size1>[,<base2>,<size2>,...]
Reserve I/O ports or memory so the kernel won't use
them. If <base> is less than 0x10000, the region
is assumed to be I/O ports; otherwise it is memory.
reservetop= [X86-32]
Format: nn[KMG]
Reserves a hole at the top of the kernel virtual
address space.
reset_devices [KNL] Force drivers to reset the underlying device
during initialization.
resume= [SWSUSP]
Specify the partition device for software suspend
Format:
{/dev/<dev> | PARTUUID=<uuid> | <int>:<int> | <hex>}
resume_offset= [SWSUSP]
Specify the offset from the beginning of the partition
given by "resume=" at which the swap header is located,
in <PAGE_SIZE> units (needed only for swap files).
See Documentation/power/swsusp-and-swap-files.rst
resumedelay= [HIBERNATION] Delay (in seconds) to pause before attempting to
read the resume files
resumewait [HIBERNATION] Wait (indefinitely) for resume device to show up.
Useful for devices that are detected asynchronously
(e.g. USB and MMC devices).
retain_initrd [RAM] Keep initrd memory after extraction. After boot, it will
be accessible via /sys/firmware/initrd.
retbleed= [X86] Control mitigation of RETBleed (Arbitrary
Speculative Code Execution with Return Instructions)
vulnerability.
AMD-based UNRET and IBPB mitigations alone do not stop
sibling threads from influencing the predictions of other
sibling threads. For that reason, STIBP is used on pro-
cessors that support it, and mitigate SMT on processors
that don't.
off - no mitigation
auto - automatically select a migitation
auto,nosmt - automatically select a mitigation,
disabling SMT if necessary for
the full mitigation (only on Zen1
and older without STIBP).
ibpb - On AMD, mitigate short speculation
windows on basic block boundaries too.
Safe, highest perf impact. It also
enables STIBP if present. Not suitable
on Intel.
ibpb,nosmt - Like "ibpb" above but will disable SMT
when STIBP is not available. This is
the alternative for systems which do not
have STIBP.
unret - Force enable untrained return thunks,
only effective on AMD f15h-f17h based
systems.
unret,nosmt - Like unret, but will disable SMT when STIBP
is not available. This is the alternative for
systems which do not have STIBP.
Selecting 'auto' will choose a mitigation method at run
time according to the CPU.
Not specifying this option is equivalent to retbleed=auto.
rfkill.default_state=
0 "airplane mode". All wifi, bluetooth, wimax, gps, fm,
etc. communication is blocked by default.
1 Unblocked.
rfkill.master_switch_mode=
0 The "airplane mode" button does nothing.
1 The "airplane mode" button toggles between everything
blocked and the previous configuration.
2 The "airplane mode" button toggles between everything
blocked and everything unblocked.
rhash_entries= [KNL,NET]
Set number of hash buckets for route cache
ring3mwait=disable
[KNL] Disable ring 3 MONITOR/MWAIT feature on supported
CPUs.
riscv_isa_fallback [RISCV]
When CONFIG_RISCV_ISA_FALLBACK is not enabled, permit
falling back to detecting extension support by parsing
"riscv,isa" property on devicetree systems when the
replacement properties are not found. See the Kconfig
entry for RISCV_ISA_FALLBACK.
ro [KNL] Mount root device read-only on boot
rodata= [KNL]
on Mark read-only kernel memory as read-only (default).
off Leave read-only kernel memory writable for debugging.
arm64: fix rodata=full On arm64, "rodata=full" has been suppored (but not documented) since commit: c55191e96caa9d78 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well") As it's necessary to determine the rodata configuration early during boot, arm64 has an early_param() handler for this, whereas init/main.c has a __setup() handler which is run later. Unfortunately, this split meant that since commit: f9a40b0890658330 ("init/main.c: return 1 from handled __setup() functions") ... passing "rodata=full" would result in a spurious warning from the __setup() handler (though RO permissions would be configured appropriately). Further, "rodata=full" has been broken since commit: 0d6ea3ac94ca77c5 ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") ... which caused strtobool() to parse "full" as false (in addition to many other values not documented for the "rodata=" kernel parameter. This patch fixes this breakage by: * Moving the core parameter parser to an __early_param(), such that it is available early. * Adding an (optional) arch hook which arm64 can use to parse "full". * Updating the documentation to mention that "full" is valid for arm64. * Having the core parameter parser handle "on" and "off" explicitly, such that any undocumented values (e.g. typos such as "ful") are reported as errors rather than being silently accepted. Note that __setup() and early_param() have opposite conventions for their return values, where __setup() uses 1 to indicate a parameter was handled and early_param() uses 0 to indicate a parameter was handled. Fixes: f9a40b089065 ("init/main.c: return 1 from handled __setup() functions") Fixes: 0d6ea3ac94ca ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jagdish Gediya <jvgediya@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2022-08-17 15:40:22 +00:00
full Mark read-only kernel memory and aliases as read-only
[arm64]
rockchip.usb_uart
Enable the uart passthrough on the designated usb port
on Rockchip SoCs. When active, the signals of the
debug-uart get routed to the D+ and D- pins of the usb
port and the regular usb controller gets disabled.
root= [KNL] Root filesystem
Usually this a a block device specifier of some kind,
see the early_lookup_bdev comment in
block/early-lookup.c for details.
Alternatively this can be "ram" for the legacy initial
ramdisk, "nfs" and "cifs" for root on a network file
system, or "mtd" and "ubi" for mounting from raw flash.
rootdelay= [KNL] Delay (in seconds) to pause before attempting to
mount the root filesystem
rootflags= [KNL] Set root filesystem mount option string
rootfstype= [KNL] Set root filesystem type
rootwait [KNL] Wait (indefinitely) for root device to show up.
Useful for devices that are detected asynchronously
(e.g. USB and MMC devices).
rootwait= [KNL] Maximum time (in seconds) to wait for root device
to show up before attempting to mount the root
filesystem.
rproc_mem=nn[KMG][@address]
[KNL,ARM,CMA] Remoteproc physical memory block.
Memory area to be used by remote processor image,
managed by CMA.
rw [KNL] Mount root device read-write on boot
S [KNL] Run init in single mode
s390_iommu= [HW,S390]
Set s390 IOTLB flushing mode
strict
With strict flushing every unmap operation will result
in an IOTLB flush. Default is lazy flushing before
reuse, which is faster. Deprecated, equivalent to
iommu.strict=1.
s390_iommu_aperture= [KNL,S390]
Specifies the size of the per device DMA address space
accessible through the DMA and IOMMU APIs as a decimal
factor of the size of main memory.
The default is 1 meaning that one can concurrently use
as many DMA addresses as physical memory is installed,
if supported by hardware, and thus map all of memory
once. With a value of 2 one can map all of memory twice
and so on. As a special case a factor of 0 imposes no
restrictions other than those given by hardware at the
cost of significant additional memory use for tables.
sa1100ir [NET]
See drivers/net/irda/sa1100_ir.c.
sched_verbose [KNL] Enables verbose scheduler debug messages.
sched/debug: Make schedstats a runtime tunable that is disabled by default schedstats is very useful during debugging and performance tuning but it incurs overhead to calculate the stats. As such, even though it can be disabled at build time, it is often enabled as the information is useful. This patch adds a kernel command-line and sysctl tunable to enable or disable schedstats on demand (when it's built in). It is disabled by default as someone who knows they need it can also learn to enable it when necessary. The benefits are dependent on how scheduler-intensive the workload is. If it is then the patch reduces the number of cycles spent calculating the stats with a small benefit from reducing the cache footprint of the scheduler. These measurements were taken from a 48-core 2-socket machine with Xeon(R) E5-2670 v3 cpus although they were also tested on a single socket machine 8-core machine with Intel i7-3770 processors. netperf-tcp 4.5.0-rc1 4.5.0-rc1 vanilla nostats-v3r1 Hmean 64 560.45 ( 0.00%) 575.98 ( 2.77%) Hmean 128 766.66 ( 0.00%) 795.79 ( 3.80%) Hmean 256 950.51 ( 0.00%) 981.50 ( 3.26%) Hmean 1024 1433.25 ( 0.00%) 1466.51 ( 2.32%) Hmean 2048 2810.54 ( 0.00%) 2879.75 ( 2.46%) Hmean 3312 4618.18 ( 0.00%) 4682.09 ( 1.38%) Hmean 4096 5306.42 ( 0.00%) 5346.39 ( 0.75%) Hmean 8192 10581.44 ( 0.00%) 10698.15 ( 1.10%) Hmean 16384 18857.70 ( 0.00%) 18937.61 ( 0.42%) Small gains here, UDP_STREAM showed nothing intresting and neither did the TCP_RR tests. The gains on the 8-core machine were very similar. tbench4 4.5.0-rc1 4.5.0-rc1 vanilla nostats-v3r1 Hmean mb/sec-1 500.85 ( 0.00%) 522.43 ( 4.31%) Hmean mb/sec-2 984.66 ( 0.00%) 1018.19 ( 3.41%) Hmean mb/sec-4 1827.91 ( 0.00%) 1847.78 ( 1.09%) Hmean mb/sec-8 3561.36 ( 0.00%) 3611.28 ( 1.40%) Hmean mb/sec-16 5824.52 ( 0.00%) 5929.03 ( 1.79%) Hmean mb/sec-32 10943.10 ( 0.00%) 10802.83 ( -1.28%) Hmean mb/sec-64 15950.81 ( 0.00%) 16211.31 ( 1.63%) Hmean mb/sec-128 15302.17 ( 0.00%) 15445.11 ( 0.93%) Hmean mb/sec-256 14866.18 ( 0.00%) 15088.73 ( 1.50%) Hmean mb/sec-512 15223.31 ( 0.00%) 15373.69 ( 0.99%) Hmean mb/sec-1024 14574.25 ( 0.00%) 14598.02 ( 0.16%) Hmean mb/sec-2048 13569.02 ( 0.00%) 13733.86 ( 1.21%) Hmean mb/sec-3072 12865.98 ( 0.00%) 13209.23 ( 2.67%) Small gains of 2-4% at low thread counts and otherwise flat. The gains on the 8-core machine were slightly different tbench4 on 8-core i7-3770 single socket machine Hmean mb/sec-1 442.59 ( 0.00%) 448.73 ( 1.39%) Hmean mb/sec-2 796.68 ( 0.00%) 794.39 ( -0.29%) Hmean mb/sec-4 1322.52 ( 0.00%) 1343.66 ( 1.60%) Hmean mb/sec-8 2611.65 ( 0.00%) 2694.86 ( 3.19%) Hmean mb/sec-16 2537.07 ( 0.00%) 2609.34 ( 2.85%) Hmean mb/sec-32 2506.02 ( 0.00%) 2578.18 ( 2.88%) Hmean mb/sec-64 2511.06 ( 0.00%) 2569.16 ( 2.31%) Hmean mb/sec-128 2313.38 ( 0.00%) 2395.50 ( 3.55%) Hmean mb/sec-256 2110.04 ( 0.00%) 2177.45 ( 3.19%) Hmean mb/sec-512 2072.51 ( 0.00%) 2053.97 ( -0.89%) In constract, this shows a relatively steady 2-3% gain at higher thread counts. Due to the nature of the patch and the type of workload, it's not a surprise that the result will depend on the CPU used. hackbench-pipes 4.5.0-rc1 4.5.0-rc1 vanilla nostats-v3r1 Amean 1 0.0637 ( 0.00%) 0.0660 ( -3.59%) Amean 4 0.1229 ( 0.00%) 0.1181 ( 3.84%) Amean 7 0.1921 ( 0.00%) 0.1911 ( 0.52%) Amean 12 0.3117 ( 0.00%) 0.2923 ( 6.23%) Amean 21 0.4050 ( 0.00%) 0.3899 ( 3.74%) Amean 30 0.4586 ( 0.00%) 0.4433 ( 3.33%) Amean 48 0.5910 ( 0.00%) 0.5694 ( 3.65%) Amean 79 0.8663 ( 0.00%) 0.8626 ( 0.43%) Amean 110 1.1543 ( 0.00%) 1.1517 ( 0.22%) Amean 141 1.4457 ( 0.00%) 1.4290 ( 1.16%) Amean 172 1.7090 ( 0.00%) 1.6924 ( 0.97%) Amean 192 1.9126 ( 0.00%) 1.9089 ( 0.19%) Some small gains and losses and while the variance data is not included, it's close to the noise. The UMA machine did not show anything particularly different pipetest 4.5.0-rc1 4.5.0-rc1 vanilla nostats-v2r2 Min Time 4.13 ( 0.00%) 3.99 ( 3.39%) 1st-qrtle Time 4.38 ( 0.00%) 4.27 ( 2.51%) 2nd-qrtle Time 4.46 ( 0.00%) 4.39 ( 1.57%) 3rd-qrtle Time 4.56 ( 0.00%) 4.51 ( 1.10%) Max-90% Time 4.67 ( 0.00%) 4.60 ( 1.50%) Max-93% Time 4.71 ( 0.00%) 4.65 ( 1.27%) Max-95% Time 4.74 ( 0.00%) 4.71 ( 0.63%) Max-99% Time 4.88 ( 0.00%) 4.79 ( 1.84%) Max Time 4.93 ( 0.00%) 4.83 ( 2.03%) Mean Time 4.48 ( 0.00%) 4.39 ( 1.91%) Best99%Mean Time 4.47 ( 0.00%) 4.39 ( 1.91%) Best95%Mean Time 4.46 ( 0.00%) 4.38 ( 1.93%) Best90%Mean Time 4.45 ( 0.00%) 4.36 ( 1.98%) Best50%Mean Time 4.36 ( 0.00%) 4.25 ( 2.49%) Best10%Mean Time 4.23 ( 0.00%) 4.10 ( 3.13%) Best5%Mean Time 4.19 ( 0.00%) 4.06 ( 3.20%) Best1%Mean Time 4.13 ( 0.00%) 4.00 ( 3.39%) Small improvement and similar gains were seen on the UMA machine. The gain is small but it stands to reason that doing less work in the scheduler is a good thing. The downside is that the lack of schedstats and tracepoints may be surprising to experts doing performance analysis until they find the existence of the schedstats= parameter or schedstats sysctl. It will be automatically activated for latencytop and sleep profiling to alleviate the problem. For tracepoints, there is a simple warning as it's not safe to activate schedstats in the context when it's known the tracepoint may be wanted but is unavailable. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-05 09:08:36 +00:00
schedstats= [KNL,X86] Enable or disable scheduled statistics.
Allowed values are enable and disable. This feature
incurs a small amount of overhead in the scheduler
but is useful for debugging and performance tuning.
sched_thermal_decay_shift=
[KNL, SMP] Set a decay shift for scheduler thermal
pressure signal. Thermal pressure signal follows the
default decay period of other scheduler pelt
signals(usually 32 ms but configurable). Setting
sched_thermal_decay_shift will left shift the decay
period for the thermal pressure signal by the shift
value.
i.e. with the default pelt decay period of 32 ms
sched_thermal_decay_shift thermal pressure decay pr
1 64 ms
2 128 ms
and so on.
Format: integer between 0 and 10
Default is 0.
scftorture.holdoff= [KNL]
Number of seconds to hold off before starting
test. Defaults to zero for module insertion and
to 10 seconds for built-in smp_call_function()
tests.
scftorture.longwait= [KNL]
Request ridiculously long waits randomly selected
up to the chosen limit in seconds. Zero (the
default) disables this feature. Please note
that requesting even small non-zero numbers of
seconds can result in RCU CPU stall warnings,
softlockup complaints, and so on.
scftorture.nthreads= [KNL]
Number of kthreads to spawn to invoke the
smp_call_function() family of functions.
The default of -1 specifies a number of kthreads
equal to the number of CPUs.
scftorture.onoff_holdoff= [KNL]
Number seconds to wait after the start of the
test before initiating CPU-hotplug operations.
scftorture.onoff_interval= [KNL]
Number seconds to wait between successive
CPU-hotplug operations. Specifying zero (which
is the default) disables CPU-hotplug operations.
scftorture.shutdown_secs= [KNL]
The number of seconds following the start of the
test after which to shut down the system. The
default of zero avoids shutting down the system.
Non-zero values are useful for automated tests.
scftorture.stat_interval= [KNL]
The number of seconds between outputting the
current test statistics to the console. A value
of zero disables statistics output.
scftorture.stutter_cpus= [KNL]
The number of jiffies to wait between each change
to the set of CPUs under test.
scftorture.use_cpus_read_lock= [KNL]
Use use_cpus_read_lock() instead of the default
preempt_disable() to disable CPU hotplug
while invoking one of the smp_call_function*()
functions.
scftorture.verbose= [KNL]
Enable additional printk() statements.
scftorture.weight_single= [KNL]
The probability weighting to use for the
smp_call_function_single() function with a zero
"wait" parameter. A value of -1 selects the
default if all other weights are -1. However,
if at least one weight has some other value, a
value of -1 will instead select a weight of zero.
scftorture.weight_single_wait= [KNL]
The probability weighting to use for the
smp_call_function_single() function with a
non-zero "wait" parameter. See weight_single.
scftorture.weight_many= [KNL]
The probability weighting to use for the
smp_call_function_many() function with a zero
"wait" parameter. See weight_single.
Note well that setting a high probability for
this weighting can place serious IPI load
on the system.
scftorture.weight_many_wait= [KNL]
The probability weighting to use for the
smp_call_function_many() function with a
non-zero "wait" parameter. See weight_single
and weight_many.
scftorture.weight_all= [KNL]
The probability weighting to use for the
smp_call_function_all() function with a zero
"wait" parameter. See weight_single and
weight_many.
scftorture.weight_all_wait= [KNL]
The probability weighting to use for the
smp_call_function_all() function with a
non-zero "wait" parameter. See weight_single
and weight_many.
skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
Format: { "0" | "1" }
0 -- disable. (may be 1 via CONFIG_CMDLINE="skew_tick=1"
1 -- enable.
Note: increases power consumption, thus should only be
enabled if running jitter sensitive (HPC/RT) workloads.
security= [SECURITY] Choose a legacy "major" security module to
enable at boot. This has been deprecated by the
"lsm=" parameter.
selinux= [SELINUX] Disable or enable SELinux at boot time.
Format: { "0" | "1" }
See security/selinux/Kconfig help text.
0 -- disable.
1 -- enable.
Default value is 1.
serialnumber [BUGS=X86-32]
sev=option[,option...] [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
shapers= [NET]
Maximal number of shapers.
show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller
Limit apic dumping. The parameter defines the maximal
number of local apics being dumped. Also it is possible
to set it to "all" by meaning -- no limit here.
Format: { 1 (default) | 2 | ... | all }.
The parameter valid if only apic=debug or
apic=verbose is specified.
Example: apic=debug show_lapic=all
simeth= [IA-64]
simscsi=
slram= [HW,MTD]
slab_merge [MM]
Enable merging of slabs with similar size when the
kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
slab_nomerge [MM]
Disable merging of slabs with similar size. May be
necessary if there is some reason to distinguish
mm: allow slab_nomerge to be set at build time Some hardened environments want to build kernels with slab_nomerge already set (so that they do not depend on remembering to set the kernel command line option). This is desired to reduce the risk of kernel heap overflows being able to overwrite objects from merged caches and changes the requirements for cache layout control, increasing the difficulty of these attacks. By keeping caches unmerged, these kinds of exploits can usually only damage objects in the same cache (though the risk to metadata exploitation is unchanged). Link: http://lkml.kernel.org/r/20170620230911.GA25238@beast Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Daniel Micay <danielmicay@gmail.com> Cc: David Windsor <dave@nullcore.net> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Daniel Micay <danielmicay@gmail.com> Cc: David Windsor <dave@nullcore.net> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Tejun Heo <tj@kernel.org> Cc: Daniel Mack <daniel@zonque.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Helge Deller <deller@gmx.de> Cc: Rik van Riel <riel@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 22:36:40 +00:00
allocs to different slabs, especially in hardened
environments where the risk of heap overflows and
layout control by attackers can usually be
frustrated by disabling merging. This will reduce
most of the exposure of a heap attack to a single
cache (risks via metadata attacks are mostly
unchanged). Debug options disable merging on their
own.
For more information see Documentation/mm/slub.rst.
slab_max_order= [MM, SLAB]
Determines the maximum allowed order for slabs.
A high setting may cause OOMs due to memory
fragmentation. Defaults to 1 for systems with
more than 32MB of RAM, 0 otherwise.
2020-08-07 06:18:35 +00:00
slub_debug[=options[,slabs][;[options[,slabs]]...] [MM, SLUB]
Enabling slub_debug allows one to determine the
culprit if slab objects become corrupted. Enabling
slub_debug can create guard zones around objects and
may poison objects when not in use. Also tracks the
last alloc / free. For more information see
Documentation/mm/slub.rst.
slub_max_order= [MM, SLUB]
Determines the maximum allowed order for slabs.
A high setting may cause OOMs due to memory
fragmentation. For more information see
Documentation/mm/slub.rst.
slub_min_objects= [MM, SLUB]
The minimum number of objects per slab. SLUB will
increase the slab order up to slub_max_order to
generate a sufficiently large slab able to contain
the number of objects indicated. The higher the number
of objects the smaller the overhead of tracking slabs
and the less frequently locks need to be acquired.
For more information see Documentation/mm/slub.rst.
slub_min_order= [MM, SLUB]
Determines the minimum page order for slabs. Must be
lower than slub_max_order.
For more information see Documentation/mm/slub.rst.
slub_merge [MM, SLUB]
Same with slab_merge.
slub_nomerge [MM, SLUB]
Same with slab_nomerge. This is supported for legacy.
See slab_nomerge for more information.
smart2= [HW]
Format: <io1>[,<io2>[,...,<io8>]]
smp.csd_lock_timeout= [KNL]
Specify the period of time in milliseconds
that smp_call_function() and friends will wait
for a CPU to release the CSD lock. This is
useful when diagnosing bugs involving CPUs
disabling interrupts for extended periods
of time. Defaults to 5,000 milliseconds, and
setting a value of zero disables this feature.
This feature may be more efficiently disabled
using the csdlock_debug- kernel parameter.
smp.panic_on_ipistall= [KNL]
If a csd_lock_timeout extends for more than
the specified number of milliseconds, panic the
system. By default, let CSD-lock acquisition
take as long as they take. Specifying 300,000
for this value provides a 5-minute timeout.
smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices
smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port
smsc-ircc2.ircc_sir= [HW] SIR base I/O port
smsc-ircc2.ircc_fir= [HW] FIR base I/O port
smsc-ircc2.ircc_irq= [HW] IRQ line
smsc-ircc2.ircc_dma= [HW] DMA channel
smsc-ircc2.ircc_transceiver= [HW] Transceiver type:
0: Toshiba Satellite 1800 (GP data pin select)
1: Fast pin select (default)
2: ATC IRMode
smt= [KNL,MIPS,S390] Set the maximum number of threads (logical
CPUs) to use per physical CPU on systems capable of
symmetric multithreading (SMT). Will be capped to the
actual hardware limit.
Format: <integer>
Default: -1 (no limit)
softlockup_panic=
[KNL] Should the soft-lockup detector generate panics.
Format: 0 | 1
A value of 1 instructs the soft-lockup detector
to panic the machine when a soft-lockup occurs. It is
also controlled by the kernel.softlockup_panic sysctl
and CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, which is the
respective build-time switch to that functionality.
kernel/watchdog.c: print traces for all cpus on lockup detection A 'softlockup' is defined as a bug that causes the kernel to loop in kernel mode for more than a predefined period to time, without giving other tasks a chance to run. Currently, upon detection of this condition by the per-cpu watchdog task, debug information (including a stack trace) is sent to the system log. On some occasions, we have observed that the "victim" rather than the actual "culprit" (i.e. the owner/holder of the contended resource) is reported to the user. Often this information has proven to be insufficient to assist debugging efforts. To avoid loss of useful debug information, for architectures which support NMI, this patch makes it possible to improve soft lockup reporting. This is accomplished by issuing an NMI to each cpu to obtain a stack trace. If NMI is not supported we just revert back to the old method. A sysctl and boot-time parameter is available to toggle this feature. [dzickus@redhat.com: add CONFIG_SMP in certain areas] [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations] [mq@suse.cz: fix warning] Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jan Moskyto Matejka <mq@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 20:22:05 +00:00
softlockup_all_cpu_backtrace=
[KNL] Should the soft-lockup detector generate
backtraces on all cpus.
Format: 0 | 1
kernel/watchdog.c: print traces for all cpus on lockup detection A 'softlockup' is defined as a bug that causes the kernel to loop in kernel mode for more than a predefined period to time, without giving other tasks a chance to run. Currently, upon detection of this condition by the per-cpu watchdog task, debug information (including a stack trace) is sent to the system log. On some occasions, we have observed that the "victim" rather than the actual "culprit" (i.e. the owner/holder of the contended resource) is reported to the user. Often this information has proven to be insufficient to assist debugging efforts. To avoid loss of useful debug information, for architectures which support NMI, this patch makes it possible to improve soft lockup reporting. This is accomplished by issuing an NMI to each cpu to obtain a stack trace. If NMI is not supported we just revert back to the old method. A sysctl and boot-time parameter is available to toggle this feature. [dzickus@redhat.com: add CONFIG_SMP in certain areas] [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations] [mq@suse.cz: fix warning] Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jan Moskyto Matejka <mq@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 20:22:05 +00:00
sonypi.*= [HW] Sony Programmable I/O Control Device driver
See Documentation/admin-guide/laptops/sonypi.rst
spectre_bhi= [X86] Control mitigation of Branch History Injection
(BHI) vulnerability. This setting affects the
deployment of the HW BHI control and the SW BHB
clearing sequence.
on - (default) Enable the HW or SW mitigation
as needed.
off - Disable the mitigation.
spectre_v2= [X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.
x86/speculation: Add command line control for indirect branch speculation Add command line control for user space indirect branch speculation mitigations. The new option is: spectre_v2_user= The initial options are: - on: Unconditionally enabled - off: Unconditionally disabled -auto: Kernel selects mitigation (default off for now) When the spectre_v2= command line argument is either 'on' or 'off' this implies that the application to application control follows that state even if a contradicting spectre_v2_user= argument is supplied. Originally-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.082720373@linutronix.de
2018-11-25 18:33:45 +00:00
The default operation protects the kernel from
user space attacks.
x86/speculation: Add command line control for indirect branch speculation Add command line control for user space indirect branch speculation mitigations. The new option is: spectre_v2_user= The initial options are: - on: Unconditionally enabled - off: Unconditionally disabled -auto: Kernel selects mitigation (default off for now) When the spectre_v2= command line argument is either 'on' or 'off' this implies that the application to application control follows that state even if a contradicting spectre_v2_user= argument is supplied. Originally-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.082720373@linutronix.de
2018-11-25 18:33:45 +00:00
on - unconditionally enable, implies
spectre_v2_user=on
off - unconditionally disable, implies
spectre_v2_user=off
auto - kernel detects whether your CPU model is
vulnerable
Selecting 'on' will, and 'auto' may, choose a
mitigation method at run time according to the
CPU, the available microcode, the setting of the
CONFIG_RETPOLINE configuration option, and the
compiler with which the kernel was built.
x86/speculation: Add command line control for indirect branch speculation Add command line control for user space indirect branch speculation mitigations. The new option is: spectre_v2_user= The initial options are: - on: Unconditionally enabled - off: Unconditionally disabled -auto: Kernel selects mitigation (default off for now) When the spectre_v2= command line argument is either 'on' or 'off' this implies that the application to application control follows that state even if a contradicting spectre_v2_user= argument is supplied. Originally-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.082720373@linutronix.de
2018-11-25 18:33:45 +00:00
Selecting 'on' will also enable the mitigation
against user space to user space task attacks.
Selecting 'off' will disable both the kernel and
the user space protections.
Specific mitigations can also be selected manually:
retpoline - replace indirect branches
retpoline,generic - Retpolines
retpoline,lfence - LFENCE; indirect branch
retpoline,amd - alias for retpoline,lfence
eibrs - Enhanced/Auto IBRS
eibrs,retpoline - Enhanced/Auto IBRS + Retpolines
eibrs,lfence - Enhanced/Auto IBRS + LFENCE
ibrs - use IBRS to protect kernel
Not specifying this option is equivalent to
spectre_v2=auto.
x86/speculation: Add command line control for indirect branch speculation Add command line control for user space indirect branch speculation mitigations. The new option is: spectre_v2_user= The initial options are: - on: Unconditionally enabled - off: Unconditionally disabled -auto: Kernel selects mitigation (default off for now) When the spectre_v2= command line argument is either 'on' or 'off' this implies that the application to application control follows that state even if a contradicting spectre_v2_user= argument is supplied. Originally-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.082720373@linutronix.de
2018-11-25 18:33:45 +00:00
spectre_v2_user=
[X86] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability between
user space tasks
on - Unconditionally enable mitigations. Is
enforced by spectre_v2=on
off - Unconditionally disable mitigations. Is
enforced by spectre_v2=off
prctl - Indirect branch speculation is enabled,
but mitigation can be enabled via prctl
per thread. The mitigation control state
is inherited on fork.
prctl,ibpb
- Like "prctl" above, but only STIBP is
controlled per thread. IBPB is issued
always when switching between different user
space processes.
x86/speculation: Add seccomp Spectre v2 user space protection mode If 'prctl' mode of user space protection from spectre v2 is selected on the kernel command-line, STIBP and IBPB are applied on tasks which restrict their indirect branch speculation via prctl. SECCOMP enables the SSBD mitigation for sandboxed tasks already, so it makes sense to prevent spectre v2 user space to user space attacks as well. The Intel mitigation guide documents how STIPB works: Setting bit 1 (STIBP) of the IA32_SPEC_CTRL MSR on a logical processor prevents the predicted targets of indirect branches on any logical processor of that core from being controlled by software that executes (or executed previously) on another logical processor of the same core. Ergo setting STIBP protects the task itself from being attacked from a task running on a different hyper-thread and protects the tasks running on different hyper-threads from being attacked. While the document suggests that the branch predictors are shielded between the logical processors, the observed performance regressions suggest that STIBP simply disables the branch predictor more or less completely. Of course the document wording is vague, but the fact that there is also no requirement for issuing IBPB when STIBP is used points clearly in that direction. The kernel still issues IBPB even when STIBP is used until Intel clarifies the whole mechanism. IBPB is issued when the task switches out, so malicious sandbox code cannot mistrain the branch predictor for the next user space task on the same logical processor. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185006.051663132@linutronix.de
2018-11-25 18:33:55 +00:00
seccomp
- Same as "prctl" above, but all seccomp
threads will enable the mitigation unless
they explicitly opt out.
seccomp,ibpb
- Like "seccomp" above, but only STIBP is
controlled per thread. IBPB is issued
always when switching between different
user space processes.
x86/speculation: Add command line control for indirect branch speculation Add command line control for user space indirect branch speculation mitigations. The new option is: spectre_v2_user= The initial options are: - on: Unconditionally enabled - off: Unconditionally disabled -auto: Kernel selects mitigation (default off for now) When the spectre_v2= command line argument is either 'on' or 'off' this implies that the application to application control follows that state even if a contradicting spectre_v2_user= argument is supplied. Originally-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.082720373@linutronix.de
2018-11-25 18:33:45 +00:00
auto - Kernel selects the mitigation depending on
the available CPU features and vulnerability.
x86/speculation: Add seccomp Spectre v2 user space protection mode If 'prctl' mode of user space protection from spectre v2 is selected on the kernel command-line, STIBP and IBPB are applied on tasks which restrict their indirect branch speculation via prctl. SECCOMP enables the SSBD mitigation for sandboxed tasks already, so it makes sense to prevent spectre v2 user space to user space attacks as well. The Intel mitigation guide documents how STIPB works: Setting bit 1 (STIBP) of the IA32_SPEC_CTRL MSR on a logical processor prevents the predicted targets of indirect branches on any logical processor of that core from being controlled by software that executes (or executed previously) on another logical processor of the same core. Ergo setting STIBP protects the task itself from being attacked from a task running on a different hyper-thread and protects the tasks running on different hyper-threads from being attacked. While the document suggests that the branch predictors are shielded between the logical processors, the observed performance regressions suggest that STIBP simply disables the branch predictor more or less completely. Of course the document wording is vague, but the fact that there is also no requirement for issuing IBPB when STIBP is used points clearly in that direction. The kernel still issues IBPB even when STIBP is used until Intel clarifies the whole mechanism. IBPB is issued when the task switches out, so malicious sandbox code cannot mistrain the branch predictor for the next user space task on the same logical processor. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185006.051663132@linutronix.de
2018-11-25 18:33:55 +00:00
x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl Switch the kernel default of SSBD and STIBP to the ones with CONFIG_SECCOMP=n (i.e. spec_store_bypass_disable=prctl spectre_v2_user=prctl) even if CONFIG_SECCOMP=y. Several motivations listed below: - If SMT is enabled the seccomp jail can still attack the rest of the system even with spectre_v2_user=seccomp by using MDS-HT (except on XEON PHI where MDS can be tamed with SMT left enabled, but that's a special case). Setting STIBP become a very expensive window dressing after MDS-HT was discovered. - The seccomp jail cannot attack the kernel with spectre-v2-HT regardless (even if STIBP is not set), but with MDS-HT the seccomp jail can attack the kernel too. - With spec_store_bypass_disable=prctl the seccomp jail can attack the other userland (guest or host mode) using spectre-v2-HT, but the userland attack is already mitigated by both ASLR and pid namespaces for host userland and through virt isolation with libkrun or kata. (if something if somebody is worried about spectre-v2-HT it's best to mount proc with hidepid=2,gid=proc on workstations where not all apps may run under container runtimes, rather than slowing down all seccomp jails, but the best is to add pid namespaces to the seccomp jail). As opposed MDS-HT is not mitigated and the seccomp jail can still attack all other host and guest userland if SMT is enabled even with spec_store_bypass_disable=seccomp. - If full security is required then MDS-HT must also be mitigated with nosmt and then spectre_v2_user=prctl and spectre_v2_user=seccomp would become identical. - Setting spectre_v2_user=seccomp is overall lower priority than to setting javascript.options.wasm false in about:config to protect against remote wasm MDS-HT, instead of worrying about Spectre-v2-HT and STIBP which again is already statistically well mitigated by other means in userland and it's fully mitigated in kernel with retpolines (unlike the wasm assist call with MDS-HT). - SSBD is needed to prevent reading the JIT memory and the primary user being the OpenJDK. However the primary user of SSBD wouldn't be covered by spec_store_bypass_disable=seccomp because it doesn't use seccomp and the primary user also explicitly declined to set PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS despite it easily could. In fact it would need to set it only when the sandboxing mechanism is enabled for javaws applets, but it still declined it by declaring security within the same user address space as an untenable objective for their JIT, even in the sandboxing case where performance would be a lesser concern (for the record: I kind of disagree in not setting PR_SPEC_STORE_BYPASS in the sandbox case and I prefer to run javaws through a wrapper that sets PR_SPEC_STORE_BYPASS if I need). In turn it can be inferred that even if the primary user of SSBD would use seccomp, they would invoke it with SECCOMP_FILTER_FLAG_SPEC_ALLOW by now. - runc/crun already set SECCOMP_FILTER_FLAG_SPEC_ALLOW by default, k8s and podman have a default json seccomp allowlist that cannot be slowed down, so for the #1 seccomp user this change is already a noop. - systemd/sshd or other apps that use seccomp, if they really need STIBP or SSBD, they need to explicitly set the PR_SET_SPECULATION_CTRL by now. The stibp/ssbd seccomp blind catch-all approach was done probably initially with a wishful thinking objective to pretend to have a peace of mind that it could magically fix it all. That was wishful thinking before MDS-HT was discovered, but after MDS-HT has been discovered it become just window dressing. - For qemu "-sandbox" seccomp jail it wouldn't make sense to set STIBP or SSBD. SSBD doesn't help with KVM because there's no JIT (if it's needed with TCG it should be an opt-in with PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS and it shouldn't slowdown KVM for nothing). For qemu+KVM STIBP would be even more window dressing than it is for all other apps, because in the qemu+KVM case there's not only the MDS attack to worry about with SMT enabled. Even after disabling SMT, there's still a theoretical spectre-v2 attack possible within the same thread context from guest mode to host ring3 that the host kernel retpoline mitigation has no theoretical chance to mitigate. On some kernels a ibrs-always/ibrs-retpoline opt-in model is provided that will enabled IBRS in the qemu host ring3 userland which fixes this theoretical concern. Only after enabling IBRS in the host userland it would then make sense to proceed and worry about STIBP and an attack on the other host userland, but then again SMT would need to be disabled for full security anyway, so that would render STIBP again a noop. - last but not the least: the lack of "spec_store_bypass_disable=prctl spectre_v2_user=prctl" means the moment a guest boots and sshd/systemd runs, the guest kernel will write to SPEC_CTRL MSR which will make the guest vmexit forever slower, forcing KVM to issue a very slow rdmsr instruction at every vmexit. So the end result is that SPEC_CTRL MSR is only available in GCE. Most other public cloud providers don't expose SPEC_CTRL, which means that not only STIBP/SSBD isn't available, but IBPB isn't available either (which would cause no overhead to the guest or the hypervisor because it's write only and requires no reading during vmexit). So the current default already net loss in security (missing IBPB) which means most public cloud providers cannot achieve a fully secure guest with nosmt (and nosmt is enough to fully mitigate MDS-HT). It also means GCE and is unfairly penalized in performance because it provides the option to enable full security in the guest as an opt-in (i.e. nosmt and IBPB). So this change will allow all cloud providers to expose SPEC_CTRL without incurring into any hypervisor slowdown and at the same time it will remove the unfair penalization of GCE performance for doing the right thing and it'll allow to get full security with nosmt with IBPB being available (and STIBP becoming meaningless). Example to put things in prospective: the STIBP enabled in seccomp has never been about protecting apps using seccomp like sshd from an attack from a malicious userland, but to the contrary it has always been about protecting the system from an attack from sshd, after a successful remote network exploit against sshd. In fact initially it wasn't obvious STIBP would work both ways (STIBP was about preventing the task that runs with STIBP to be attacked with spectre-v2-HT, but accidentally in the STIBP case it also prevents the attack in the other direction). In the hypothetical case that sshd has been remotely exploited the last concern should be STIBP being set, because it'll be still possible to obtain info even from the kernel by using MDS if nosmt wasn't set (and if it was set, STIBP is a noop in the first place). As opposed kernel cannot leak anything with spectre-v2 HT because of retpolines and the userland is mitigated by ASLR already and ideally PID namespaces too. If something it'd be worth checking if sshd run the seccomp thread under pid namespaces too if available in the running kernel. SSBD also would be a noop for sshd, since sshd uses no JIT. If sshd prefers to keep doing the STIBP window dressing exercise, it still can even after this change of defaults by opting-in with PR_SPEC_INDIRECT_BRANCH. Ultimately setting SSBD and STIBP by default for all seccomp jails is a bad sweet spot and bad default with more cons than pros that end up reducing security in the public cloud (by giving an huge incentive to not expose SPEC_CTRL which would be needed to get full security with IBPB after setting nosmt in the guest) and by excessively hurting performance to more secure apps using seccomp that end up having to opt out with SECCOMP_FILTER_FLAG_SPEC_ALLOW. The following is the verified result of the new default with SMT enabled: (gdb) print spectre_v2_user_stibp $1 = SPECTRE_V2_USER_PRCTL (gdb) print spectre_v2_user_ibpb $2 = SPECTRE_V2_USER_PRCTL (gdb) print ssb_mode $3 = SPEC_STORE_BYPASS_PRCTL Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201104235054.5678-1-aarcange@redhat.com Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/lkml/AAA2EF2C-293D-4D5B-BFA6-FF655105CD84@redhat.com Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/lkml/c0722838-06f7-da6b-138f-e0f26362f16a@redhat.com
2020-11-04 23:50:54 +00:00
Default mitigation: "prctl"
x86/speculation: Add command line control for indirect branch speculation Add command line control for user space indirect branch speculation mitigations. The new option is: spectre_v2_user= The initial options are: - on: Unconditionally enabled - off: Unconditionally disabled -auto: Kernel selects mitigation (default off for now) When the spectre_v2= command line argument is either 'on' or 'off' this implies that the application to application control follows that state even if a contradicting spectre_v2_user= argument is supplied. Originally-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Casey Schaufler <casey.schaufler@intel.com> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Jon Masters <jcm@redhat.com> Cc: Waiman Long <longman9394@gmail.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Dave Stewart <david.c.stewart@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20181125185005.082720373@linutronix.de
2018-11-25 18:33:45 +00:00
Not specifying this option is equivalent to
spectre_v2_user=auto.
spec_rstack_overflow=
[X86] Control RAS overflow mitigation on AMD Zen CPUs
off - Disable mitigation
microcode - Enable microcode mitigation only
safe-ret - Enable sw-only safe RET mitigation (default)
ibpb - Enable mitigation by issuing IBPB on
kernel entry
ibpb-vmexit - Issue IBPB only on VMEXIT
(cloud-specific mitigation)
x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation Contemporary high performance processors use a common industry-wide optimization known as "Speculative Store Bypass" in which loads from addresses to which a recent store has occurred may (speculatively) see an older value. Intel refers to this feature as "Memory Disambiguation" which is part of their "Smart Memory Access" capability. Memory Disambiguation can expose a cache side-channel attack against such speculatively read values. An attacker can create exploit code that allows them to read memory outside of a sandbox environment (for example, malicious JavaScript in a web page), or to perform more complex attacks against code running within the same privilege level, e.g. via the stack. As a first step to mitigate against such attacks, provide two boot command line control knobs: nospec_store_bypass_disable spec_store_bypass_disable=[off,auto,on] By default affected x86 processors will power on with Speculative Store Bypass enabled. Hence the provided kernel parameters are written from the point of view of whether to enable a mitigation or not. The parameters are as follows: - auto - Kernel detects whether your CPU model contains an implementation of Speculative Store Bypass and picks the most appropriate mitigation. - on - disable Speculative Store Bypass - off - enable Speculative Store Bypass [ tglx: Reordered the checks so that the whole evaluation is not done when the CPU does not support RDS ] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Ingo Molnar <mingo@kernel.org>
2018-04-26 02:04:21 +00:00
spec_store_bypass_disable=
[HW] Control Speculative Store Bypass (SSB) Disable mitigation
(Speculative Store Bypass vulnerability)
Certain CPUs are vulnerable to an exploit against a
a common industry wide performance optimization known
as "Speculative Store Bypass" in which recent stores
to the same memory location may not be observed by
later loads during speculative execution. The idea
is that such stores are unlikely and that they can
be detected prior to instruction retirement at the
end of a particular speculation execution window.
In vulnerable processors, the speculatively forwarded
store can be used in a cache side channel attack, for
example to read memory to which the attacker does not
directly have access (e.g. inside sandboxed code).
This parameter controls whether the Speculative Store
Bypass optimization is used.
On x86 the options are:
on - Unconditionally disable Speculative Store Bypass
off - Unconditionally enable Speculative Store Bypass
auto - Kernel detects whether the CPU model contains an
implementation of Speculative Store Bypass and
picks the most appropriate mitigation. If the
CPU is not vulnerable, "off" is selected. If the
CPU is vulnerable the default mitigation is
architecture and Kconfig dependent. See below.
prctl - Control Speculative Store Bypass per thread
via prctl. Speculative Store Bypass is enabled
for a process by default. The state of the control
is inherited on fork.
seccomp - Same as "prctl" above, but all seccomp threads
will disable SSB unless they explicitly opt out.
x86/bugs: Provide boot parameters for the spec_store_bypass_disable mitigation Contemporary high performance processors use a common industry-wide optimization known as "Speculative Store Bypass" in which loads from addresses to which a recent store has occurred may (speculatively) see an older value. Intel refers to this feature as "Memory Disambiguation" which is part of their "Smart Memory Access" capability. Memory Disambiguation can expose a cache side-channel attack against such speculatively read values. An attacker can create exploit code that allows them to read memory outside of a sandbox environment (for example, malicious JavaScript in a web page), or to perform more complex attacks against code running within the same privilege level, e.g. via the stack. As a first step to mitigate against such attacks, provide two boot command line control knobs: nospec_store_bypass_disable spec_store_bypass_disable=[off,auto,on] By default affected x86 processors will power on with Speculative Store Bypass enabled. Hence the provided kernel parameters are written from the point of view of whether to enable a mitigation or not. The parameters are as follows: - auto - Kernel detects whether your CPU model contains an implementation of Speculative Store Bypass and picks the most appropriate mitigation. - on - disable Speculative Store Bypass - off - enable Speculative Store Bypass [ tglx: Reordered the checks so that the whole evaluation is not done when the CPU does not support RDS ] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Ingo Molnar <mingo@kernel.org>
2018-04-26 02:04:21 +00:00
Default mitigations:
x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl Switch the kernel default of SSBD and STIBP to the ones with CONFIG_SECCOMP=n (i.e. spec_store_bypass_disable=prctl spectre_v2_user=prctl) even if CONFIG_SECCOMP=y. Several motivations listed below: - If SMT is enabled the seccomp jail can still attack the rest of the system even with spectre_v2_user=seccomp by using MDS-HT (except on XEON PHI where MDS can be tamed with SMT left enabled, but that's a special case). Setting STIBP become a very expensive window dressing after MDS-HT was discovered. - The seccomp jail cannot attack the kernel with spectre-v2-HT regardless (even if STIBP is not set), but with MDS-HT the seccomp jail can attack the kernel too. - With spec_store_bypass_disable=prctl the seccomp jail can attack the other userland (guest or host mode) using spectre-v2-HT, but the userland attack is already mitigated by both ASLR and pid namespaces for host userland and through virt isolation with libkrun or kata. (if something if somebody is worried about spectre-v2-HT it's best to mount proc with hidepid=2,gid=proc on workstations where not all apps may run under container runtimes, rather than slowing down all seccomp jails, but the best is to add pid namespaces to the seccomp jail). As opposed MDS-HT is not mitigated and the seccomp jail can still attack all other host and guest userland if SMT is enabled even with spec_store_bypass_disable=seccomp. - If full security is required then MDS-HT must also be mitigated with nosmt and then spectre_v2_user=prctl and spectre_v2_user=seccomp would become identical. - Setting spectre_v2_user=seccomp is overall lower priority than to setting javascript.options.wasm false in about:config to protect against remote wasm MDS-HT, instead of worrying about Spectre-v2-HT and STIBP which again is already statistically well mitigated by other means in userland and it's fully mitigated in kernel with retpolines (unlike the wasm assist call with MDS-HT). - SSBD is needed to prevent reading the JIT memory and the primary user being the OpenJDK. However the primary user of SSBD wouldn't be covered by spec_store_bypass_disable=seccomp because it doesn't use seccomp and the primary user also explicitly declined to set PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS despite it easily could. In fact it would need to set it only when the sandboxing mechanism is enabled for javaws applets, but it still declined it by declaring security within the same user address space as an untenable objective for their JIT, even in the sandboxing case where performance would be a lesser concern (for the record: I kind of disagree in not setting PR_SPEC_STORE_BYPASS in the sandbox case and I prefer to run javaws through a wrapper that sets PR_SPEC_STORE_BYPASS if I need). In turn it can be inferred that even if the primary user of SSBD would use seccomp, they would invoke it with SECCOMP_FILTER_FLAG_SPEC_ALLOW by now. - runc/crun already set SECCOMP_FILTER_FLAG_SPEC_ALLOW by default, k8s and podman have a default json seccomp allowlist that cannot be slowed down, so for the #1 seccomp user this change is already a noop. - systemd/sshd or other apps that use seccomp, if they really need STIBP or SSBD, they need to explicitly set the PR_SET_SPECULATION_CTRL by now. The stibp/ssbd seccomp blind catch-all approach was done probably initially with a wishful thinking objective to pretend to have a peace of mind that it could magically fix it all. That was wishful thinking before MDS-HT was discovered, but after MDS-HT has been discovered it become just window dressing. - For qemu "-sandbox" seccomp jail it wouldn't make sense to set STIBP or SSBD. SSBD doesn't help with KVM because there's no JIT (if it's needed with TCG it should be an opt-in with PR_SET_SPECULATION_CTRL+PR_SPEC_STORE_BYPASS and it shouldn't slowdown KVM for nothing). For qemu+KVM STIBP would be even more window dressing than it is for all other apps, because in the qemu+KVM case there's not only the MDS attack to worry about with SMT enabled. Even after disabling SMT, there's still a theoretical spectre-v2 attack possible within the same thread context from guest mode to host ring3 that the host kernel retpoline mitigation has no theoretical chance to mitigate. On some kernels a ibrs-always/ibrs-retpoline opt-in model is provided that will enabled IBRS in the qemu host ring3 userland which fixes this theoretical concern. Only after enabling IBRS in the host userland it would then make sense to proceed and worry about STIBP and an attack on the other host userland, but then again SMT would need to be disabled for full security anyway, so that would render STIBP again a noop. - last but not the least: the lack of "spec_store_bypass_disable=prctl spectre_v2_user=prctl" means the moment a guest boots and sshd/systemd runs, the guest kernel will write to SPEC_CTRL MSR which will make the guest vmexit forever slower, forcing KVM to issue a very slow rdmsr instruction at every vmexit. So the end result is that SPEC_CTRL MSR is only available in GCE. Most other public cloud providers don't expose SPEC_CTRL, which means that not only STIBP/SSBD isn't available, but IBPB isn't available either (which would cause no overhead to the guest or the hypervisor because it's write only and requires no reading during vmexit). So the current default already net loss in security (missing IBPB) which means most public cloud providers cannot achieve a fully secure guest with nosmt (and nosmt is enough to fully mitigate MDS-HT). It also means GCE and is unfairly penalized in performance because it provides the option to enable full security in the guest as an opt-in (i.e. nosmt and IBPB). So this change will allow all cloud providers to expose SPEC_CTRL without incurring into any hypervisor slowdown and at the same time it will remove the unfair penalization of GCE performance for doing the right thing and it'll allow to get full security with nosmt with IBPB being available (and STIBP becoming meaningless). Example to put things in prospective: the STIBP enabled in seccomp has never been about protecting apps using seccomp like sshd from an attack from a malicious userland, but to the contrary it has always been about protecting the system from an attack from sshd, after a successful remote network exploit against sshd. In fact initially it wasn't obvious STIBP would work both ways (STIBP was about preventing the task that runs with STIBP to be attacked with spectre-v2-HT, but accidentally in the STIBP case it also prevents the attack in the other direction). In the hypothetical case that sshd has been remotely exploited the last concern should be STIBP being set, because it'll be still possible to obtain info even from the kernel by using MDS if nosmt wasn't set (and if it was set, STIBP is a noop in the first place). As opposed kernel cannot leak anything with spectre-v2 HT because of retpolines and the userland is mitigated by ASLR already and ideally PID namespaces too. If something it'd be worth checking if sshd run the seccomp thread under pid namespaces too if available in the running kernel. SSBD also would be a noop for sshd, since sshd uses no JIT. If sshd prefers to keep doing the STIBP window dressing exercise, it still can even after this change of defaults by opting-in with PR_SPEC_INDIRECT_BRANCH. Ultimately setting SSBD and STIBP by default for all seccomp jails is a bad sweet spot and bad default with more cons than pros that end up reducing security in the public cloud (by giving an huge incentive to not expose SPEC_CTRL which would be needed to get full security with IBPB after setting nosmt in the guest) and by excessively hurting performance to more secure apps using seccomp that end up having to opt out with SECCOMP_FILTER_FLAG_SPEC_ALLOW. The following is the verified result of the new default with SMT enabled: (gdb) print spectre_v2_user_stibp $1 = SPECTRE_V2_USER_PRCTL (gdb) print spectre_v2_user_ibpb $2 = SPECTRE_V2_USER_PRCTL (gdb) print ssb_mode $3 = SPEC_STORE_BYPASS_PRCTL Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201104235054.5678-1-aarcange@redhat.com Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Link: https://lore.kernel.org/lkml/AAA2EF2C-293D-4D5B-BFA6-FF655105CD84@redhat.com Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/lkml/c0722838-06f7-da6b-138f-e0f26362f16a@redhat.com
2020-11-04 23:50:54 +00:00
X86: "prctl"
On powerpc the options are:
on,auto - On Power8 and Power9 insert a store-forwarding
barrier on kernel entry and exit. On Power7
perform a software flush on kernel entry and
exit.
off - No action.
Not specifying this option is equivalent to
spec_store_bypass_disable=auto.
spia_io_base= [HW,MTD]
spia_fio_base=
spia_pedr=
spia_peddr=
x86/split_lock: Enable split lock detection by kernel A split-lock occurs when an atomic instruction operates on data that spans two cache lines. In order to maintain atomicity the core takes a global bus lock. This is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). For real-time systems this may mean missing deadlines. For other systems it may just be very annoying. Some CPUs have the capability to raise an #AC trap when a split lock is attempted. Provide a command line option to give the user choices on how to handle this: split_lock_detect= off - not enabled (no traps for split locks) warn - warn once when an application does a split lock, but allow it to continue running. fatal - Send SIGBUS to applications that cause split lock On systems that support split lock detection the default is "warn". Note that if the kernel hits a split lock in any mode other than "off" it will OOPs. One implementation wrinkle is that the MSR to control the split lock detection is per-core, not per thread. This might result in some short lived races on HT systems in "warn" mode if Linux tries to enable on one thread while disabling on the other. Race analysis by Sean Christopherson: - Toggling of split-lock is only done in "warn" mode. Worst case scenario of a race is that a misbehaving task will generate multiple #AC exceptions on the same instruction. And this race will only occur if both siblings are running tasks that generate split-lock #ACs, e.g. a race where sibling threads are writing different values will only occur if CPUx is disabling split-lock after an #AC and CPUy is re-enabling split-lock after *its* previous task generated an #AC. - Transitioning between off/warn/fatal modes at runtime isn't supported and disabling is tracked per task, so hardware will always reach a steady state that matches the configured mode. I.e. split-lock is guaranteed to be enabled in hardware once all _TIF_SLD threads have been scheduled out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
2020-01-26 20:05:35 +00:00
split_lock_detect=
[X86] Enable split lock detection or bus lock detection
x86/split_lock: Enable split lock detection by kernel A split-lock occurs when an atomic instruction operates on data that spans two cache lines. In order to maintain atomicity the core takes a global bus lock. This is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). For real-time systems this may mean missing deadlines. For other systems it may just be very annoying. Some CPUs have the capability to raise an #AC trap when a split lock is attempted. Provide a command line option to give the user choices on how to handle this: split_lock_detect= off - not enabled (no traps for split locks) warn - warn once when an application does a split lock, but allow it to continue running. fatal - Send SIGBUS to applications that cause split lock On systems that support split lock detection the default is "warn". Note that if the kernel hits a split lock in any mode other than "off" it will OOPs. One implementation wrinkle is that the MSR to control the split lock detection is per-core, not per thread. This might result in some short lived races on HT systems in "warn" mode if Linux tries to enable on one thread while disabling on the other. Race analysis by Sean Christopherson: - Toggling of split-lock is only done in "warn" mode. Worst case scenario of a race is that a misbehaving task will generate multiple #AC exceptions on the same instruction. And this race will only occur if both siblings are running tasks that generate split-lock #ACs, e.g. a race where sibling threads are writing different values will only occur if CPUx is disabling split-lock after an #AC and CPUy is re-enabling split-lock after *its* previous task generated an #AC. - Transitioning between off/warn/fatal modes at runtime isn't supported and disabling is tracked per task, so hardware will always reach a steady state that matches the configured mode. I.e. split-lock is guaranteed to be enabled in hardware once all _TIF_SLD threads have been scheduled out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
2020-01-26 20:05:35 +00:00
When enabled (and if hardware support is present), atomic
instructions that access data across cache line
boundaries will result in an alignment check exception
for split lock detection or a debug exception for
bus lock detection.
x86/split_lock: Enable split lock detection by kernel A split-lock occurs when an atomic instruction operates on data that spans two cache lines. In order to maintain atomicity the core takes a global bus lock. This is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). For real-time systems this may mean missing deadlines. For other systems it may just be very annoying. Some CPUs have the capability to raise an #AC trap when a split lock is attempted. Provide a command line option to give the user choices on how to handle this: split_lock_detect= off - not enabled (no traps for split locks) warn - warn once when an application does a split lock, but allow it to continue running. fatal - Send SIGBUS to applications that cause split lock On systems that support split lock detection the default is "warn". Note that if the kernel hits a split lock in any mode other than "off" it will OOPs. One implementation wrinkle is that the MSR to control the split lock detection is per-core, not per thread. This might result in some short lived races on HT systems in "warn" mode if Linux tries to enable on one thread while disabling on the other. Race analysis by Sean Christopherson: - Toggling of split-lock is only done in "warn" mode. Worst case scenario of a race is that a misbehaving task will generate multiple #AC exceptions on the same instruction. And this race will only occur if both siblings are running tasks that generate split-lock #ACs, e.g. a race where sibling threads are writing different values will only occur if CPUx is disabling split-lock after an #AC and CPUy is re-enabling split-lock after *its* previous task generated an #AC. - Transitioning between off/warn/fatal modes at runtime isn't supported and disabling is tracked per task, so hardware will always reach a steady state that matches the configured mode. I.e. split-lock is guaranteed to be enabled in hardware once all _TIF_SLD threads have been scheduled out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
2020-01-26 20:05:35 +00:00
off - not enabled
warn - the kernel will emit rate-limited warnings
x86/split_lock: Enable split lock detection by kernel A split-lock occurs when an atomic instruction operates on data that spans two cache lines. In order to maintain atomicity the core takes a global bus lock. This is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). For real-time systems this may mean missing deadlines. For other systems it may just be very annoying. Some CPUs have the capability to raise an #AC trap when a split lock is attempted. Provide a command line option to give the user choices on how to handle this: split_lock_detect= off - not enabled (no traps for split locks) warn - warn once when an application does a split lock, but allow it to continue running. fatal - Send SIGBUS to applications that cause split lock On systems that support split lock detection the default is "warn". Note that if the kernel hits a split lock in any mode other than "off" it will OOPs. One implementation wrinkle is that the MSR to control the split lock detection is per-core, not per thread. This might result in some short lived races on HT systems in "warn" mode if Linux tries to enable on one thread while disabling on the other. Race analysis by Sean Christopherson: - Toggling of split-lock is only done in "warn" mode. Worst case scenario of a race is that a misbehaving task will generate multiple #AC exceptions on the same instruction. And this race will only occur if both siblings are running tasks that generate split-lock #ACs, e.g. a race where sibling threads are writing different values will only occur if CPUx is disabling split-lock after an #AC and CPUy is re-enabling split-lock after *its* previous task generated an #AC. - Transitioning between off/warn/fatal modes at runtime isn't supported and disabling is tracked per task, so hardware will always reach a steady state that matches the configured mode. I.e. split-lock is guaranteed to be enabled in hardware once all _TIF_SLD threads have been scheduled out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
2020-01-26 20:05:35 +00:00
about applications triggering the #AC
exception or the #DB exception. This mode is
the default on CPUs that support split lock
detection or bus lock detection. Default
behavior is by #AC if both features are
enabled in hardware.
x86/split_lock: Enable split lock detection by kernel A split-lock occurs when an atomic instruction operates on data that spans two cache lines. In order to maintain atomicity the core takes a global bus lock. This is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). For real-time systems this may mean missing deadlines. For other systems it may just be very annoying. Some CPUs have the capability to raise an #AC trap when a split lock is attempted. Provide a command line option to give the user choices on how to handle this: split_lock_detect= off - not enabled (no traps for split locks) warn - warn once when an application does a split lock, but allow it to continue running. fatal - Send SIGBUS to applications that cause split lock On systems that support split lock detection the default is "warn". Note that if the kernel hits a split lock in any mode other than "off" it will OOPs. One implementation wrinkle is that the MSR to control the split lock detection is per-core, not per thread. This might result in some short lived races on HT systems in "warn" mode if Linux tries to enable on one thread while disabling on the other. Race analysis by Sean Christopherson: - Toggling of split-lock is only done in "warn" mode. Worst case scenario of a race is that a misbehaving task will generate multiple #AC exceptions on the same instruction. And this race will only occur if both siblings are running tasks that generate split-lock #ACs, e.g. a race where sibling threads are writing different values will only occur if CPUx is disabling split-lock after an #AC and CPUy is re-enabling split-lock after *its* previous task generated an #AC. - Transitioning between off/warn/fatal modes at runtime isn't supported and disabling is tracked per task, so hardware will always reach a steady state that matches the configured mode. I.e. split-lock is guaranteed to be enabled in hardware once all _TIF_SLD threads have been scheduled out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
2020-01-26 20:05:35 +00:00
fatal - the kernel will send SIGBUS to applications
that trigger the #AC exception or the #DB
exception. Default behavior is by #AC if
both features are enabled in hardware.
x86/split_lock: Enable split lock detection by kernel A split-lock occurs when an atomic instruction operates on data that spans two cache lines. In order to maintain atomicity the core takes a global bus lock. This is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). For real-time systems this may mean missing deadlines. For other systems it may just be very annoying. Some CPUs have the capability to raise an #AC trap when a split lock is attempted. Provide a command line option to give the user choices on how to handle this: split_lock_detect= off - not enabled (no traps for split locks) warn - warn once when an application does a split lock, but allow it to continue running. fatal - Send SIGBUS to applications that cause split lock On systems that support split lock detection the default is "warn". Note that if the kernel hits a split lock in any mode other than "off" it will OOPs. One implementation wrinkle is that the MSR to control the split lock detection is per-core, not per thread. This might result in some short lived races on HT systems in "warn" mode if Linux tries to enable on one thread while disabling on the other. Race analysis by Sean Christopherson: - Toggling of split-lock is only done in "warn" mode. Worst case scenario of a race is that a misbehaving task will generate multiple #AC exceptions on the same instruction. And this race will only occur if both siblings are running tasks that generate split-lock #ACs, e.g. a race where sibling threads are writing different values will only occur if CPUx is disabling split-lock after an #AC and CPUy is re-enabling split-lock after *its* previous task generated an #AC. - Transitioning between off/warn/fatal modes at runtime isn't supported and disabling is tracked per task, so hardware will always reach a steady state that matches the configured mode. I.e. split-lock is guaranteed to be enabled in hardware once all _TIF_SLD threads have been scheduled out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
2020-01-26 20:05:35 +00:00
ratelimit:N -
Set system wide rate limit to N bus locks
per second for bus lock detection.
0 < N <= 1000.
N/A for split lock detection.
x86/split_lock: Enable split lock detection by kernel A split-lock occurs when an atomic instruction operates on data that spans two cache lines. In order to maintain atomicity the core takes a global bus lock. This is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). For real-time systems this may mean missing deadlines. For other systems it may just be very annoying. Some CPUs have the capability to raise an #AC trap when a split lock is attempted. Provide a command line option to give the user choices on how to handle this: split_lock_detect= off - not enabled (no traps for split locks) warn - warn once when an application does a split lock, but allow it to continue running. fatal - Send SIGBUS to applications that cause split lock On systems that support split lock detection the default is "warn". Note that if the kernel hits a split lock in any mode other than "off" it will OOPs. One implementation wrinkle is that the MSR to control the split lock detection is per-core, not per thread. This might result in some short lived races on HT systems in "warn" mode if Linux tries to enable on one thread while disabling on the other. Race analysis by Sean Christopherson: - Toggling of split-lock is only done in "warn" mode. Worst case scenario of a race is that a misbehaving task will generate multiple #AC exceptions on the same instruction. And this race will only occur if both siblings are running tasks that generate split-lock #ACs, e.g. a race where sibling threads are writing different values will only occur if CPUx is disabling split-lock after an #AC and CPUy is re-enabling split-lock after *its* previous task generated an #AC. - Transitioning between off/warn/fatal modes at runtime isn't supported and disabling is tracked per task, so hardware will always reach a steady state that matches the configured mode. I.e. split-lock is guaranteed to be enabled in hardware once all _TIF_SLD threads have been scheduled out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200126200535.GB30377@agluck-desk2.amr.corp.intel.com
2020-01-26 20:05:35 +00:00
If an #AC exception is hit in the kernel or in
firmware (i.e. not while executing in user mode)
the kernel will oops in either "warn" or "fatal"
mode.
#DB exception for bus lock is triggered only when
CPL > 0.
x86/speculation: Add Special Register Buffer Data Sampling (SRBDS) mitigation SRBDS is an MDS-like speculative side channel that can leak bits from the random number generator (RNG) across cores and threads. New microcode serializes the processor access during the execution of RDRAND and RDSEED. This ensures that the shared buffer is overwritten before it is released for reuse. While it is present on all affected CPU models, the microcode mitigation is not needed on models that enumerate ARCH_CAPABILITIES[MDS_NO] in the cases where TSX is not supported or has been disabled with TSX_CTRL. The mitigation is activated by default on affected processors and it increases latency for RDRAND and RDSEED instructions. Among other effects this will reduce throughput from /dev/urandom. * Enable administrator to configure the mitigation off when desired using either mitigations=off or srbds=off. * Export vulnerability status via sysfs * Rename file-scoped macros to apply for non-whitelist table initializations. [ bp: Massage, - s/VULNBL_INTEL_STEPPING/VULNBL_INTEL_STEPPINGS/g, - do not read arch cap MSR a second time in tsx_fused_off() - just pass it in, - flip check in cpu_set_bug_bits() to save an indentation level, - reflow comments. jpoimboe: s/Mitigated/Mitigation/ in user-visible strings tglx: Dropped the fused off magic for now ] Signed-off-by: Mark Gross <mgross@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
2020-04-16 15:54:04 +00:00
srbds= [X86,INTEL]
Control the Special Register Buffer Data Sampling
(SRBDS) mitigation.
Certain CPUs are vulnerable to an MDS-like
exploit which can leak bits from the random
number generator.
By default, this issue is mitigated by
microcode. However, the microcode fix can cause
the RDRAND and RDSEED instructions to become
much slower. Among other effects, this will
result in reduced throughput from /dev/urandom.
The microcode mitigation can be disabled with
the following option:
off: Disable mitigation and remove
performance impact to RDRAND and RDSEED
srcutree.big_cpu_lim [KNL]
Specifies the number of CPUs constituting a
large system, such that srcu_struct structures
should immediately allocate an srcu_node array.
This kernel-boot parameter defaults to 128,
but takes effect only when the low-order four
bits of srcutree.convert_to_big is equal to 3
(decide at boot).
srcutree.convert_to_big [KNL]
Specifies under what conditions an SRCU tree
srcu_struct structure will be converted to big
form, that is, with an rcu_node tree:
0: Never.
1: At init_srcu_struct() time.
2: When rcutorture decides to.
3: Decide at boot time (default).
srcu: Add contention-triggered addition of srcu_node tree This commit instruments the acquisitions of the srcu_struct structure's ->lock, enabling the initiation of a transition from SRCU_SIZE_SMALL to SRCU_SIZE_BIG when sufficient contention is experienced. The instrumentation counts the number of trylock failures within the confines of a single jiffy. If that number exceeds the value specified by the srcutree.small_contention_lim kernel boot parameter (which defaults to 100), and if the value specified by the srcutree.convert_to_big kernel boot parameter has the 0x10 bit set (defaults to 0), then a transition will be automatically initiated. By default, there will never be any transitions, so that none of the srcu_struct structures ever gains an srcu_node array. The useful values for srcutree.convert_to_big are: 0x00: Never convert. 0x01: Always convert at init_srcu_struct() time. 0x02: Convert when rcutorture prints its first round of statistics. 0x03: Decide conversion approach at boot given system size. 0x10: Convert if contention is encountered. 0x12: Convert if contention is encountered or when rcutorture prints its first round of statistics, whichever comes first. The value 0x11 acts the same as 0x01 because the conversion happens before there is any chance of contention. [ paulmck: Apply "static" feedback from kernel test robot. ] Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-01-28 04:32:05 +00:00
0x1X: Above plus if high contention.
Either way, the srcu_node tree will be sized based
on the actual runtime number of CPUs (nr_cpu_ids)
instead of the compile-time CONFIG_NR_CPUS.
srcutree.counter_wrap_check [KNL]
Specifies how frequently to check for
grace-period sequence counter wrap for the
srcu_data structure's ->srcu_gp_seq_needed field.
The greater the number of bits set in this kernel
parameter, the less frequently counter wrap will
be checked for. Note that the bottom two bits
are ignored.
srcutree.exp_holdoff [KNL]
Specifies how many nanoseconds must elapse
since the end of the last SRCU grace period for
a given srcu_struct until the next normal SRCU
grace period will be considered for automatic
expediting. Set to zero to disable automatic
expediting.
srcu: Make expedited RCU grace periods block even less frequently The purpose of commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") was to prevent a long series of never-blocking expedited SRCU grace periods from blocking kernel-live-patching (KLP) progress. Although it was successful, it also resulted in excessive boot times on certain embedded workloads running under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive" means increasing the boot time up into the three-to-four minute range. This increase in boot time was due to the more than 6000 back-to-back invocations of synchronize_rcu_expedited() within the KVM host OS, which in turn resulted from qemu's emulation of a long series of MMIO accesses. Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") did not significantly help this particular use case. Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values of non-sleeping per phase counts on a system with preemption enabled, and observed the following boot times: +──────────────────────────+────────────────+ | SRCU_MAX_NODELAY_PHASE | Boot time (s) | +──────────────────────────+────────────────+ | 100 | 30.053 | | 150 | 25.151 | | 200 | 20.704 | | 250 | 15.748 | | 500 | 11.401 | | 1000 | 11.443 | | 10000 | 11.258 | | 1000000 | 11.154 | +──────────────────────────+────────────────+ Analysis on the experiment results show additional improvements with CPU-bound delays approaching one jiffy in duration. This improvement was also seen when number of per-phase iterations were scaled to one jiffy. This commit therefore scales per-grace-period phase number of non-sleeping polls so that non-sleeping polls extend for about one jiffy. In addition, the delay-calculation call to srcu_get_delay() in srcu_gp_end() is replaced with a simple check for an expedited grace period. This change schedules callback invocation immediately after expedited grace periods complete, which results in greatly improved boot times. Testing done by Marc and Zhangfei confirms that this change recovers most of the performance degradation in boottime; for CONFIG_HZ_250 configuration, specifically, boot times improve from 3m50s to 41s on Marc's setup; and from 2m40s to ~9.7s on Zhangfei's setup. In addition to the changes to default per phase delays, this change adds 3 new kernel parameters - srcutree.srcu_max_nodelay, srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay. This allows users to configure the srcu grace period scanning delays in order to more quickly react to additional use cases. Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reported-by: yueluck <yueluck@163.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Tested-by: Marc Zyngier <maz@kernel.org> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-01 03:15:45 +00:00
srcutree.srcu_max_nodelay [KNL]
Specifies the number of no-delay instances
per jiffy for which the SRCU grace period
worker thread will be rescheduled with zero
delay. Beyond this limit, worker thread will
be rescheduled with a sleep delay of one jiffy.
srcutree.srcu_max_nodelay_phase [KNL]
Specifies the per-grace-period phase, number of
non-sleeping polls of readers. Beyond this limit,
grace period worker thread will be rescheduled
with a sleep delay of one jiffy, between each
rescan of the readers, for a grace period phase.
srcutree.srcu_retry_check_delay [KNL]
Specifies number of microseconds of non-sleeping
delay between each non-sleeping poll of readers.
srcu: Add contention-triggered addition of srcu_node tree This commit instruments the acquisitions of the srcu_struct structure's ->lock, enabling the initiation of a transition from SRCU_SIZE_SMALL to SRCU_SIZE_BIG when sufficient contention is experienced. The instrumentation counts the number of trylock failures within the confines of a single jiffy. If that number exceeds the value specified by the srcutree.small_contention_lim kernel boot parameter (which defaults to 100), and if the value specified by the srcutree.convert_to_big kernel boot parameter has the 0x10 bit set (defaults to 0), then a transition will be automatically initiated. By default, there will never be any transitions, so that none of the srcu_struct structures ever gains an srcu_node array. The useful values for srcutree.convert_to_big are: 0x00: Never convert. 0x01: Always convert at init_srcu_struct() time. 0x02: Convert when rcutorture prints its first round of statistics. 0x03: Decide conversion approach at boot given system size. 0x10: Convert if contention is encountered. 0x12: Convert if contention is encountered or when rcutorture prints its first round of statistics, whichever comes first. The value 0x11 acts the same as 0x01 because the conversion happens before there is any chance of contention. [ paulmck: Apply "static" feedback from kernel test robot. ] Co-developed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-01-28 04:32:05 +00:00
srcutree.small_contention_lim [KNL]
Specifies the number of update-side contention
events per jiffy will be tolerated before
initiating a conversion of an srcu_struct
structure to big form. Note that the value of
srcutree.convert_to_big must have the 0x10 bit
set for contention-based conversions to occur.
ssbd= [ARM64,HW]
Speculative Store Bypass Disable control
On CPUs that are vulnerable to the Speculative
Store Bypass vulnerability and offer a
firmware based mitigation, this parameter
indicates how the mitigation should be used:
force-on: Unconditionally enable mitigation for
for both kernel and userspace
force-off: Unconditionally disable mitigation for
for both kernel and userspace
kernel: Always enable mitigation in the
kernel, and offer a prctl interface
to allow userspace to register its
interest in being mitigated too.
mm: larger stack guard gap, between vmas Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 11:03:24 +00:00
stack_guard_gap= [MM]
override the default stack gap protection. The value
is in page units and it defines how many pages prior
to (for stacks growing down) resp. after (for stacks
growing up) the main stack are reserved for no other
mapping. Default value is 256 pages.
stack_depot_disable= [KNL]
Setting this to true through kernel command line will
disable the stack depot thereby saving the static memory
consumed by the stack hash table. By default this is set
to false.
stacktrace [FTRACE]
Enabled the stack tracer on boot up.
stacktrace_filter=[function-list]
[FTRACE] Limit the functions that the stack tracer
will trace at boot up. function-list is a comma-separated
list of functions. This list can be changed at run
time by the stack_trace_filter file in the debugfs
tracing directory. Note, this enables stack tracing
and the stacktrace above is not needed.
sti= [PARISC,HW]
Format: <num>
Set the STI (builtin display/keyboard on the HP-PARISC
machines) console (graphic card) which should be used
as the initial boot-console.
See also comment in drivers/video/console/sticore.c.
sti_font= [HW]
See comment in drivers/video/console/sticore.c.
stifb= [HW]
Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]]
strict_sas_size=
[X86]
Format: <bool>
Enable or disable strict sigaltstack size checks
against the required signal frame size which
depends on the supported FPU features. This can
be used to filter out binaries which have
not yet been made aware of AT_MINSIGSTKSZ.
stress_hpt [PPC]
Limits the number of kernel HPT entries in the hash
page table to increase the rate of hash page table
faults on kernel addresses.
stress_slb [PPC]
Limits the number of kernel SLB entries, and flushes
them frequently to increase the rate of SLB faults
on kernel addresses.
sunrpc.min_resvport=
sunrpc.max_resvport=
[NFS,SUNRPC]
SunRPC servers often require that client requests
originate from a privileged port (i.e. a port in the
range 0 < portnr < 1024).
An administrator who wishes to reserve some of these
ports for other uses may adjust the range that the
kernel's sunrpc client considers to be privileged
using these two parameters to set the minimum and
maximum port values.
sunrpc.svc_rpc_per_connection_limit=
[NFS,SUNRPC]
Limit the number of requests that the server will
process in parallel from a single connection.
The default value is 0 (no limit).
sunrpc.pool_mode=
[NFS]
Control how the NFS server code allocates CPUs to
service thread pools. Depending on how many NICs
you have and where their interrupts are bound, this
option will affect which CPUs will do NFS serving.
Note: this parameter cannot be changed while the
NFS server is running.
auto the server chooses an appropriate mode
automatically using heuristics
global a single global pool contains all CPUs
percpu one pool for each CPU
pernode one pool for each NUMA node (equivalent
to global on non-NUMA machines)
sunrpc.tcp_slot_table_entries=
sunrpc.udp_slot_table_entries=
[NFS,SUNRPC]
Sets the upper limit on the number of simultaneous
RPC calls that can be sent from the client to a
server. Increasing these values may allow you to
improve throughput, but will also increase the
amount of memory reserved for use by the client.
suspend.pm_test_delay=
[SUSPEND]
Sets the number of seconds to remain in a suspend test
mode before resuming the system (see
/sys/power/pm_test). Only available when CONFIG_PM_DEBUG
is set. Default value is 5.
svm= [PPC]
Format: { on | off | y | n | 1 | 0 }
This parameter controls use of the Protected
Execution Facility on pSeries.
swiotlb= [ARM,IA-64,PPC,MIPS,X86]
Format: { <int> [,<int>] | force | noforce }
<int> -- Number of I/O TLB slabs
<int> -- Second integer after comma. Number of swiotlb
areas with their own lock. Will be rounded up
to a power of 2.
force -- force using of bounce buffers even if they
wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
switches= [HW,M68k]
kernel/sysctl: support setting sysctl parameters from kernel command line Patch series "support setting sysctl parameters from kernel command line", v3. This series adds support for something that seems like many people always wanted but nobody added it yet, so here's the ability to set sysctl parameters via kernel command line options in the form of sysctl.vm.something=1 The important part is Patch 1. The second, not so important part is an attempt to clean up legacy one-off parameters that do the same thing as a sysctl. I don't want to remove them completely for compatibility reasons, but with generic sysctl support the idea is to remove the one-off param handlers and treat the parameters as aliases for the sysctl variants. I have identified several parameters that mention sysctl counterparts in Documentation/admin-guide/kernel-parameters.txt but there might be more. The conversion also has varying level of success: - numa_zonelist_order is converted in Patch 2 together with adding the necessary infrastructure. It's easy as it doesn't really do anything but warn on deprecated value these days. - hung_task_panic is converted in Patch 3, but there's a downside that now it only accepts 0 and 1, while previously it was any integer value - nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic, so there's no straighforward conversion possible - traceoff_on_warning is a flag without value and it would be required to handle that somehow in the conversion infractructure, which seems pointless for a single flag This patch (of 5): A recently proposed patch to add vm_swappiness command line parameter in addition to existing sysctl [1] made me wonder why we don't have a general support for passing sysctl parameters via command line. Googling found only somebody else wondering the same [2], but I haven't found any prior discussion with reasons why not to do this. Settings the vm_swappiness issue aside (the underlying issue might be solved in a different way), quick search of kernel-parameters.txt shows there are already some that exist as both sysctl and kernel parameter - hung_task_panic, nmi_watchdog, numa_zonelist_order, traceoff_on_warning. A general mechanism would remove the need to add more of those one-offs and might be handy in situations where configuration by e.g. /etc/sysctl.d/ is impractical. Hence, this patch adds a new parse_args() pass that looks for parameters prefixed by 'sysctl.' and tries to interpret them as writes to the corresponding sys/ files using an temporary in-kernel procfs mount. This mechanism was suggested by Eric W. Biederman [3], as it handles all dynamically registered sysctl tables, even though we don't handle modular sysctls. Errors due to e.g. invalid parameter name or value are reported in the kernel log. The processing is hooked right before the init process is loaded, as some handlers might be more complicated than simple setters and might need some subsystems to be initialized. At the moment the init process can be started and eventually execute a process writing to /proc/sys/ then it should be also fine to do that from the kernel. Sysctls registered later on module load time are not set by this mechanism - it's expected that in such scenarios, setting sysctl values from userspace is practical enough. [1] https://lore.kernel.org/r/BL0PR02MB560167492CA4094C91589930E9FC0@BL0PR02MB5601.namprd02.prod.outlook.com/ [2] https://unix.stackexchange.com/questions/558802/how-to-set-sysctl-using-kernel-command-line-parameter [3] https://lore.kernel.org/r/87bloj2skm.fsf@x220.int.ebiederm.org/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Christian Brauner <christian.brauner@ubuntu.com> Link: http://lkml.kernel.org/r/20200427180433.7029-1-vbabka@suse.cz Link: http://lkml.kernel.org/r/20200427180433.7029-2-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-08 04:40:24 +00:00
sysctl.*= [KNL]
Set a sysctl parameter, right before loading the init
process, as if the value was written to the respective
/proc/sys/... file. Both '.' and '/' are recognized as
separators. Unrecognized parameters and invalid values
are reported in the kernel log. Sysctls registered
later by a loaded module cannot be set this way.
Example: sysctl.vm.swappiness=40
sysrq_always_enabled
[KNL]
Ignore sysrq setting - this boot parameter will
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
ram pages. This is used to specify the TCP metrics
cache size. See Documentation/networking/ip-sysctl.rst
"tcp_no_metrics_save" section for more details.
tdfx= [HW,DRM]
Docs: admin/kernel-parameters: edit a few boot options Clean up some of admin-guide/kernel-parameters.txt: a. "smt" should be "smt=" (S390) b. (dropped) c. Sparc supports the vdso= boot option d. make the tp_printk options (2) formatting similar to other options by adding spacing e. add "trace_clock=" with a reference to Documentation/trace/ftrace.rst f. use [IA-64] as documented instead of [ia64] g. fix formatting and text for test_suspend= h. fix formatting for swapaccount= i. fix formatting and grammar for video.brightness_switch_enabled= Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linux-ia64@vger.kernel.org Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: linux-pm@vger.kernel.org Cc: linux-acpi@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-04-03 05:48:20 +00:00
test_suspend= [SUSPEND]
Format: { "mem" | "standby" | "freeze" }[,N]
Specify "mem" (for Suspend-to-RAM) or "standby" (for
standby suspend) or "freeze" (for suspend type freeze)
as the system sleep state during system startup with
the optional capability to repeat N number of times.
The system is woken from this state using a
wakeup-capable RTC alarm.
thash_entries= [KNL,NET]
Set number of hash buckets for TCP connection
thermal.act= [HW,ACPI]
-1: disable all active trip points in all thermal zones
<degrees C>: override all lowest active trip points
thermal.crt= [HW,ACPI]
-1: disable all critical trip points in all thermal zones
<degrees C>: override all critical trip points
thermal.off= [HW,ACPI]
1: disable ACPI thermal control
thermal.psv= [HW,ACPI]
-1: disable all passive trip points
<degrees C>: override all passive trip points to this
value
ACPI: thermal: expose "thermal.tzp=" to set global polling frequency Thermal Zone Polling frequency (_TZP) is an optional ACPI object recommending the rate that the OS should poll the associated thermal zone. If _TZP is 0, no polling should be used. If _TZP is non-zero, then the platform recommends that the OS poll the thermal zone at the specified rate. The minimum period is 30 seconds. The maximum period is 5 minutes. (note _TZP and thermal.tzp units are in deci-seconds, so _TZP = 300 corresponds to 30 seconds) If _TZP is not present, ACPI 3.0b recommends that the thermal zone be polled at an "OS provided default frequency". However, common industry practice is: 1. The BIOS never specifies any _TZP 2. High volume OS's from this century never poll any thermal zones Ie. The OS depends on the platform's ability to provoke thermal events when necessary, and the "OS provided default frequency" is "never":-) There is a proposal that ACPI 4.0 be updated to reflect common industry practice -- ie. no _TZP, no polling. The Linux kernel already follows this practice -- thermal zones are not polled unless _TZP is present and non-zero. But thermal zone polling is useful as a workaround for systems which have ACPI thermal control, but have an issue preventing thermal events. Indeed, some Linux distributions still set a non-zero thermal polling frequency for this reason. But rather than ask the user to write a polling frequency into all the /proc/acpi/thermal_zone/*/polling_frequency files, here we simply document and expose the already existing module parameter to do the same at system level, to simplify debugging those broken platforms. Note that thermal.tzp is a module-load time parameter only. Signed-off-by: Len Brown <len.brown@intel.com>
2007-08-12 04:12:26 +00:00
thermal.tzp= [HW,ACPI]
Specify global default ACPI thermal zone polling rate
<deci-seconds>: poll all this frequency
0: no polling (default)
genirq: Provide forced interrupt threading Add a commandline parameter "threadirqs" which forces all interrupts except those marked IRQF_NO_THREAD to run threaded. That's mostly a debug option to allow retrieving better debug data from crashing interrupt handlers. If "threadirqs" is not enabled on the kernel command line, then there is no impact in the interrupt hotpath. Architecture code needs to select CONFIG_IRQ_FORCED_THREADING after marking the interrupts which cant be threaded IRQF_NO_THREAD. All interrupts which have IRQF_TIMER set are implict marked IRQF_NO_THREAD. Also all PER_CPU interrupts are excluded. Forced threading hard interrupts also forces all soft interrupt handling into thread context. When enabled it might slow down things a bit, but for debugging problems in interrupt code it's a reasonable penalty as it does not immediately crash and burn the machine when an interrupt handler is buggy. Some test results on a Core2Duo machine: Cache cold run of: # time git grep irq_desc non-threaded threaded real 1m18.741s 1m19.061s user 0m1.874s 0m1.757s sys 0m5.843s 0m5.427s # iperf -c server non-threaded [ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec [ 3] 0.0-10.0 sec 1.09 GBytes 934 Mbits/sec [ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec threaded [ 3] 0.0-10.0 sec 1.09 GBytes 939 Mbits/sec [ 3] 0.0-10.0 sec 1.09 GBytes 934 Mbits/sec [ 3] 0.0-10.0 sec 1.09 GBytes 937 Mbits/sec Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110223234956.772668648@linutronix.de>
2011-02-23 23:52:23 +00:00
threadirqs [KNL]
Force threading of all interrupt handlers except those
marked explicitly IRQF_NO_THREAD.
genirq: Provide forced interrupt threading Add a commandline parameter "threadirqs" which forces all interrupts except those marked IRQF_NO_THREAD to run threaded. That's mostly a debug option to allow retrieving better debug data from crashing interrupt handlers. If "threadirqs" is not enabled on the kernel command line, then there is no impact in the interrupt hotpath. Architecture code needs to select CONFIG_IRQ_FORCED_THREADING after marking the interrupts which cant be threaded IRQF_NO_THREAD. All interrupts which have IRQF_TIMER set are implict marked IRQF_NO_THREAD. Also all PER_CPU interrupts are excluded. Forced threading hard interrupts also forces all soft interrupt handling into thread context. When enabled it might slow down things a bit, but for debugging problems in interrupt code it's a reasonable penalty as it does not immediately crash and burn the machine when an interrupt handler is buggy. Some test results on a Core2Duo machine: Cache cold run of: # time git grep irq_desc non-threaded threaded real 1m18.741s 1m19.061s user 0m1.874s 0m1.757s sys 0m5.843s 0m5.427s # iperf -c server non-threaded [ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec [ 3] 0.0-10.0 sec 1.09 GBytes 934 Mbits/sec [ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec threaded [ 3] 0.0-10.0 sec 1.09 GBytes 939 Mbits/sec [ 3] 0.0-10.0 sec 1.09 GBytes 934 Mbits/sec [ 3] 0.0-10.0 sec 1.09 GBytes 937 Mbits/sec Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110223234956.772668648@linutronix.de>
2011-02-23 23:52:23 +00:00
topology= [S390]
Format: {off | on}
Specify if the kernel should make use of the cpu
topology information if the hardware supports this.
The scheduler will make use of this information and
e.g. base its process migration decisions on it.
Default is on.
topology_updates= [KNL, PPC, NUMA]
Format: {off}
Specify if the kernel should ignore (off)
topology updates sent by the hypervisor to this
LPAR.
torture.disable_onoff_at_boot= [KNL]
Prevent the CPU-hotplug component of torturing
until after init has spawned.
torture.ftrace_dump_at_shutdown= [KNL]
Dump the ftrace buffer at torture-test shutdown,
even if there were no errors. This can be a
very costly operation when many torture tests
are running concurrently, especially on systems
with rotating-rust storage.
torture.verbose_sleep_frequency= [KNL]
Specifies how many verbose printk()s should be
emitted between each sleep. The default of zero
disables verbose-printk() sleeping.
torture.verbose_sleep_duration= [KNL]
Duration of each verbose-printk() sleep in jiffies.
tp720= [HW,PS2]
tpm_suspend_pcr=[HW,TPM]
Format: integer pcr id
Specify that at suspend time, the tpm driver
should extend the specified pcr with zeros,
as a workaround for some chips which fail to
flush the last written pcr on TPM_SaveState.
This will guarantee that all the other pcrs
are saved.
tpm_tis.interrupts= [HW,TPM]
Enable interrupts for the MMIO based physical layer
for the FIFO interface. By default it is set to false
(0). For more information about TPM hardware interfaces
defined by Trusted Computing Group (TCG) see
https://trustedcomputinggroup.org/resource/pc-client-platform-tpm-profile-ptp-specification/
tp_printk [FTRACE]
Have the tracepoints sent to printk as well as the
tracing ring buffer. This is useful for early boot up
where the system hangs or reboots and does not give the
option for reading the tracing buffer or performing a
ftrace_dump_on_oops.
To turn off having tracepoints sent to printk,
echo 0 > /proc/sys/kernel/tracepoint_printk
Note, echoing 1 into this file without the
tracepoint_printk kernel cmdline option has no effect.
The tp_printk_stop_on_boot (see below) can also be used
to stop the printing of events to console at
late_initcall_sync.
** CAUTION **
Having tracepoints sent to printk() and activating high
frequency tracepoints such as irq or sched, can cause
the system to live lock.
tp_printk_stop_on_boot [FTRACE]
When tp_printk (above) is set, it can cause a lot of noise
on the console. It may be useful to only include the
printing of events during boot up, as user space may
make the system inoperable.
This command line option will stop the printing of events
to console at the late_initcall_sync() time frame.
trace_buf_size=nn[KMG]
[FTRACE] will set tracing buffer size on each cpu.
Docs: admin/kernel-parameters: edit a few boot options Clean up some of admin-guide/kernel-parameters.txt: a. "smt" should be "smt=" (S390) b. (dropped) c. Sparc supports the vdso= boot option d. make the tp_printk options (2) formatting similar to other options by adding spacing e. add "trace_clock=" with a reference to Documentation/trace/ftrace.rst f. use [IA-64] as documented instead of [ia64] g. fix formatting and text for test_suspend= h. fix formatting for swapaccount= i. fix formatting and grammar for video.brightness_switch_enabled= Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linux-ia64@vger.kernel.org Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: linux-pm@vger.kernel.org Cc: linux-acpi@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-04-03 05:48:20 +00:00
trace_clock= [FTRACE] Set the clock used for tracing events
at boot up.
local - Use the per CPU time stamp counter
(converted into nanoseconds). Fast, but
depending on the architecture, may not be
in sync between CPUs.
global - Event time stamps are synchronize across
CPUs. May be slower than the local clock,
but better for some race conditions.
counter - Simple counting of events (1, 2, ..)
note, some counts may be skipped due to the
infrastructure grabbing the clock more than
once per event.
uptime - Use jiffies as the time stamp.
perf - Use the same clock that perf uses.
mono - Use ktime_get_mono_fast_ns() for time stamps.
mono_raw - Use ktime_get_raw_fast_ns() for time
stamps.
boot - Use ktime_get_boot_fast_ns() for time stamps.
Architectures may add more clocks. See
Documentation/trace/ftrace.rst for more details.
trace_event=[event-list]
[FTRACE] Set and start specified trace events in order
to facilitate early boot debugging. The event-list is a
comma-separated list of trace events to enable. See
also Documentation/trace/events.rst
trace_instance=[instance-info]
[FTRACE] Create a ring buffer instance early in boot up.
This will be listed in:
/sys/kernel/tracing/instances
Events can be enabled at the time the instance is created
via:
trace_instance=<name>,<system1>:<event1>,<system2>:<event2>
Note, the "<system*>:" portion is optional if the event is
unique.
trace_instance=foo,sched:sched_switch,irq_handler_entry,initcall
will enable the "sched_switch" event (note, the "sched:" is optional, and
the same thing would happen if it was left off). The irq_handler_entry
event, and all events under the "initcall" system.
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
that can be enabled or disabled just as if you were
to echo the option name into
/sys/kernel/tracing/trace_options
For example, to enable stacktrace option (to dump the
stack trace of each event), add to the command line:
trace_options=stacktrace
See also Documentation/trace/ftrace.rst "trace options"
section.
trace_trigger=[trigger-list]
[FTRACE] Add a event trigger on specific events.
Set a trigger on top of a specific event, with an optional
filter.
The format is is "trace_trigger=<event>.<trigger>[ if <filter>],..."
Where more than one trigger may be specified that are comma deliminated.
For example:
trace_trigger="sched_switch.stacktrace if prev_state == 2"
The above will enable the "stacktrace" trigger on the "sched_switch"
event but only trigger it if the "prev_state" of the "sched_switch"
event is "2" (TASK_UNINTERUPTIBLE).
See also "Event triggers" in Documentation/trace/events.rst
traceoff_on_warning
[FTRACE] enable this option to disable tracing when a
warning is hit. This turns off "tracing_on". Tracing can
be enabled again by echoing '1' into the "tracing_on"
file located in /sys/kernel/tracing/
This option is useful, as it disables the trace before
the WARNING dump is called, which prevents the trace to
be filled with content caused by the warning output.
This option can also be set at run time via the sysctl
option: kernel/traceoff_on_warning
transparent_hugepage=
[KNL]
Format: [always|madvise|never]
Can be used to control the default behavior of the system
with respect to transparent hugepages.
See Documentation/admin-guide/mm/transhuge.rst
for more details.
trusted.source= [KEYS]
Format: <string>
This parameter identifies the trust source as a backend
for trusted keys implementation. Supported trust
sources:
- "tpm"
- "tee"
- "caam"
If not specified then it defaults to iterating through
the trust source list starting with TPM and assigns the
first trust source as a backend which is initialized
successfully during iteration.
trusted.rng= [KEYS]
Format: <string>
The RNG used to generate key material for trusted keys.
Can be one of:
- "kernel"
- the same value as trusted.source: "tpm" or "tee"
- "default"
If not specified, "default" is used. In this case,
the RNG's choice is left to each individual trust source.
tsc= Disable clocksource stability checks for TSC.
x86: Skip verification by the watchdog for TSC clocksource. Impact: Changes timekeeping on Vmware (or with tsc=reliable). This is achieved by resetting the CLOCKSOURCE_MUST_VERIFY flag. We add a tsc=reliable commandline option to enable this. This enables legacy hardware without HPET, LAPIC, or ACPI timers to enter high-resolution timer mode. Along with that have extended this to be used in virtualization environement too. Now we also set this flag if the X86_FEATURE_TSC_RELIABLE bit is set. This is important since there is a wrap-around problem with the acpi_pm timer. The acpi_pm counter is just 24bits and this can overflow in ~4 seconds. With the NO_HZ kernels in virtualized environment, there can be situations when the guest is descheduled for longer duration, as a result we may miss the wrap of the acpi counter. When TSC is used as a clocksource and acpi_pm timer is being used as the watchdog clocksource this error in acpi_pm results in TSC being marked as unstable, and essentially results in time dropping in chunks of 4 seconds whenever this wrap is missed. Since the virtualized TSC is reliable on VMware, we should always use the TSCs clocksource on VMware, so we skip the verfication at runtime, by checking for the feature bit. Since we reset the flag for mgeode systems too, i have combined the mgeode case with the feature bit check. Signed-off-by: Jeff Hansen <jhansen@cardaccess-inc.com> Signed-off-by: Alok N Kataria <akataria@vmware.com> Signed-off-by: Dan Hecht <dhecht@vmware.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2008-10-25 00:22:01 +00:00
Format: <string>
[x86] reliable: mark tsc clocksource as reliable, this
disables clocksource verification at runtime, as well
as the stability checks done at bootup. Used to enable
high-resolution timer mode on older hardware, and in
virtualized environment.
[x86] noirqtime: Do not use TSC to do irq accounting.
Used to run time disable IRQ_TIME_ACCOUNTING on any
platforms where RDTSC is slow and this accounting
can add overhead.
[x86] unstable: mark the TSC clocksource as unstable, this
marks the TSC unconditionally unstable at bootup and
avoids any further wobbles once the TSC watchdog notices.
[x86] nowatchdog: disable clocksource watchdog. Used
in situations with strict latency requirements (where
interruptions from clocksource watchdog are not
acceptable).
x86/tsc: Add option to force frequency recalibration with HW timer The kernel assumes that the TSC frequency which is provided by the hardware / firmware via MSRs or CPUID(0x15) is correct after applying a few basic consistency checks. This disables the TSC recalibration against HPET or PM timer. As a result there is no mechanism to validate that frequency in cases where a firmware or hardware defect is suspected. And there was case that some user used atomic clock to measure the TSC frequency and reported an inaccuracy issue, which was later fixed in firmware. Add an option 'recalibrate' for 'tsc' kernel parameter to force the tsc freq recalibration with HPET or PM timer, and warn if the deviation from previous value is more than about 500 PPM, which provides a way to verify the data from hardware / firmware. There is no functional change to existing work flow. Recently there was a real-world case: "The 40ms/s divergence between TSC and HPET was observed on hardware that is quite recent" [1], on that platform the TSC frequence 1896 MHz was got from CPUID(0x15), and the force-reclibration with HPET/PMTIMER both calibrated out value of 1975 MHz, which also matched with check from software 'chronyd', indicating it's a problem of BIOS or firmware. [Thanks tglx for helping improving the commit log] [ paulmck: Wordsmith Kconfig help text. ] [1]. https://lore.kernel.org/lkml/20221117230910.GI4001@paulmck-ThinkPad-P17-Gen-1/ Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: <x86@kernel.org> Cc: <linux-doc@vger.kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-04 08:19:38 +00:00
[x86] recalibrate: force recalibration against a HW timer
(HPET or PM timer) on systems whose TSC frequency was
obtained from HW or FW using either an MSR or CPUID(0x15).
Warn if the difference is more than 500 ppm.
[x86] watchdog: Use TSC as the watchdog clocksource with
which to check other HW timers (HPET or PM timer), but
only on systems where TSC has been deemed trustworthy.
This will be suppressed by an earlier tsc=nowatchdog and
can be overridden by a later tsc=nowatchdog. A console
message will flag any such suppression or overriding.
x86: Skip verification by the watchdog for TSC clocksource. Impact: Changes timekeeping on Vmware (or with tsc=reliable). This is achieved by resetting the CLOCKSOURCE_MUST_VERIFY flag. We add a tsc=reliable commandline option to enable this. This enables legacy hardware without HPET, LAPIC, or ACPI timers to enter high-resolution timer mode. Along with that have extended this to be used in virtualization environement too. Now we also set this flag if the X86_FEATURE_TSC_RELIABLE bit is set. This is important since there is a wrap-around problem with the acpi_pm timer. The acpi_pm counter is just 24bits and this can overflow in ~4 seconds. With the NO_HZ kernels in virtualized environment, there can be situations when the guest is descheduled for longer duration, as a result we may miss the wrap of the acpi counter. When TSC is used as a clocksource and acpi_pm timer is being used as the watchdog clocksource this error in acpi_pm results in TSC being marked as unstable, and essentially results in time dropping in chunks of 4 seconds whenever this wrap is missed. Since the virtualized TSC is reliable on VMware, we should always use the TSCs clocksource on VMware, so we skip the verfication at runtime, by checking for the feature bit. Since we reset the flag for mgeode systems too, i have combined the mgeode case with the feature bit check. Signed-off-by: Jeff Hansen <jhansen@cardaccess-inc.com> Signed-off-by: Alok N Kataria <akataria@vmware.com> Signed-off-by: Dan Hecht <dhecht@vmware.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2008-10-25 00:22:01 +00:00
tsc_early_khz= [X86] Skip early TSC calibration and use the given
value instead. Useful when the early TSC frequency discovery
procedure is not reliable, such as on overclocked systems
with CPUID.16h support and partial CPUID.15h support.
Format: <unsigned int>
tsx= [X86] Control Transactional Synchronization
Extensions (TSX) feature in Intel processors that
support TSX control.
This parameter controls the TSX feature. The options are:
on - Enable TSX on the system. Although there are
mitigations for all known security vulnerabilities,
TSX has been known to be an accelerator for
several previous speculation-related CVEs, and
so there may be unknown security risks associated
with leaving it enabled.
off - Disable TSX on the system. (Note that this
option takes effect only on newer CPUs which are
not vulnerable to MDS, i.e., have
MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 and which get
the new IA32_TSX_CTRL MSR through a microcode
update. This new MSR allows for the reliable
deactivation of the TSX functionality.)
auto - Disable TSX if X86_BUG_TAA is present,
otherwise enable TSX on the system.
Not specifying this option is equivalent to tsx=off.
See Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
for more details.
tsx_async_abort= [X86,INTEL] Control mitigation for the TSX Async
Abort (TAA) vulnerability.
Similar to Micro-architectural Data Sampling (MDS)
certain CPUs that support Transactional
Synchronization Extensions (TSX) are vulnerable to an
exploit against CPU internal buffers which can forward
information to a disclosure gadget under certain
conditions.
In vulnerable processors, the speculatively forwarded
data can be used in a cache side channel attack, to
access data to which the attacker does not have direct
access.
This parameter controls the TAA mitigation. The
options are:
full - Enable TAA mitigation on vulnerable CPUs
if TSX is enabled.
full,nosmt - Enable TAA mitigation and disable SMT on
vulnerable CPUs. If TSX is disabled, SMT
is not disabled because CPU is not
vulnerable to cross-thread TAA attacks.
off - Unconditionally disable TAA mitigation
x86/speculation: Fix incorrect MDS/TAA mitigation status For MDS vulnerable processors with TSX support, enabling either MDS or TAA mitigations will enable the use of VERW to flush internal processor buffers at the right code path. IOW, they are either both mitigated or both not. However, if the command line options are inconsistent, the vulnerabilites sysfs files may not report the mitigation status correctly. For example, with only the "mds=off" option: vulnerabilities/mds:Vulnerable; SMT vulnerable vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable The mds vulnerabilities file has wrong status in this case. Similarly, the taa vulnerability file will be wrong with mds mitigation on, but taa off. Change taa_select_mitigation() to sync up the two mitigation status and have them turned off if both "mds=off" and "tsx_async_abort=off" are present. Update documentation to emphasize the fact that both "mds=off" and "tsx_async_abort=off" have to be specified together for processors that are affected by both TAA and MDS to be effective. [ bp: Massage and add kernel-parameters.txt change too. ] Fixes: 1b42f017415b ("x86/speculation/taa: Add mitigation for TSX Async Abort") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: linux-doc@vger.kernel.org Cc: Mark Gross <mgross@linux.intel.com> Cc: <stable@vger.kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.com
2019-11-15 16:14:44 +00:00
On MDS-affected machines, tsx_async_abort=off can be
prevented by an active MDS mitigation as both vulnerabilities
are mitigated with the same mechanism so in order to disable
this mitigation, you need to specify mds=off too.
Not specifying this option is equivalent to
tsx_async_abort=full. On CPUs which are MDS affected
and deploy MDS mitigation, TAA mitigation is not
required and doesn't provide any additional
mitigation.
For details see:
Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
turbografx.map[2|3]= [HW,JOY]
TurboGraFX parallel port interface
Format:
<port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7>
See also Documentation/input/devices/joystick-parport.rst
udbg-immortal [PPC] When debugging early kernel crashes that
happen after console_init() and before a proper
console driver takes over, this boot options might
help "seeing" what's going on.
uhash_entries= [KNL,NET]
Set number of hash buckets for UDP/UDP-Lite connections
uhci-hcd.ignore_oc=
[USB] Ignore overcurrent events (default N).
Some badly-designed motherboards generate lots of
bogus events, for ports that aren't wired to
anything. Set this parameter to avoid log spamming.
Note that genuine overcurrent events won't be
reported either.
unknown_nmi_panic
[X86] Cause panic on unknown NMI.
unwind_debug [X86-64]
Enable unwinder debug output. This can be
useful for debugging certain unwinder error
conditions, including corrupt stacks and
bad/missing unwinder metadata.
usbcore.authorized_default=
[USB] Default USB device authorization:
(default -1 = authorized (same as 1),
0 = not authorized, 1 = authorized, 2 = authorized
if device connected to internal port)
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2). This
is the time required before an idle device will be
autosuspended. Devices for which the delay is set
to a negative value won't be autosuspended at all.
usbcore.usbfs_snoop=
[USB] Set to log all usbfs traffic (default 0 = off).
usbcore.usbfs_snoop_max=
[USB] Maximum number of bytes to snoop in each URB
(default = 65536).
usbcore.blinkenlights=
[USB] Set to cycle leds on hubs (default 0 = off).
usbcore.old_scheme_first=
[USB] Start with the old device initialization
USB: hub: Revert commit bd0e6c9614b9 ("usb: hub: try old enumeration scheme first for high speed devices") Commit bd0e6c9614b9 ("usb: hub: try old enumeration scheme first for high speed devices") changed the way the hub driver enumerates high-speed devices. Instead of using the "new" enumeration scheme first and switching to the "old" scheme if that doesn't work, we start with the "old" scheme. In theory this is better because the "old" scheme is slightly faster -- it involves resetting the device only once instead of twice. However, for a long time Windows used only the "new" scheme. Zeng Tao said that Windows 8 and later use the "old" scheme for high-speed devices, but apparently there are some devices that don't like it. William Bader reports that the Ricoh webcam built into his Sony Vaio laptop not only doesn't enumerate under the "old" scheme, it gets hung up so badly that it won't then enumerate under the "new" scheme! Only a cold reset will fix it. Therefore we will revert the commit and go back to trying the "new" scheme first for high-speed devices. Reported-and-tested-by: William Bader <williambader@hotmail.com> Ref: https://bugzilla.kernel.org/show_bug.cgi?id=207219 Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Fixes: bd0e6c9614b9 ("usb: hub: try old enumeration scheme first for high speed devices") CC: Zeng Tao <prime.zeng@hisilicon.com> CC: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/Pine.LNX.4.44L0.2004221611230.11262-100000@iolanthe.rowland.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-22 20:13:08 +00:00
scheme (default 0 = off).
usbcore.usbfs_memory_mb=
[USB] Memory limit (in MB) for buffers allocated by
usbfs (default = 16, 0 = max = 2047).
usbcore.use_both_schemes=
[USB] Try the other device initialization scheme
if the first one fails (default 1 = enabled).
usbcore.initial_descriptor_timeout=
[USB] Specifies timeout for the initial 64-byte
USB_REQ_GET_DESCRIPTOR request in milliseconds
(default 5000 = 5.0 seconds).
usbcore.nousb [USB] Disable the USB subsystem
usbcore.quirks=
[USB] A list of quirk entries to augment the built-in
usb core quirk list. List entries are separated by
commas. Each entry has the form
VendorID:ProductID:Flags. The IDs are 4-digit hex
numbers and Flags is a set of letters. Each letter
will change the built-in quirk; setting it if it is
clear and clearing it if it is set. The letters have
the following meanings:
a = USB_QUIRK_STRING_FETCH_255 (string
descriptors must not be fetched using
a 255-byte read);
b = USB_QUIRK_RESET_RESUME (device can't resume
correctly so reset it instead);
c = USB_QUIRK_NO_SET_INTF (device can't handle
Set-Interface requests);
d = USB_QUIRK_CONFIG_INTF_STRINGS (device can't
handle its Configuration or Interface
strings);
e = USB_QUIRK_RESET (device can't be reset
(e.g morph devices), don't use reset);
f = USB_QUIRK_HONOR_BNUMINTERFACES (device has
more interface descriptions than the
bNumInterfaces count, and can't handle
talking to these interfaces);
g = USB_QUIRK_DELAY_INIT (device needs a pause
during initialization, after we read
the device descriptor);
h = USB_QUIRK_LINEAR_UFRAME_INTR_BINTERVAL (For
high speed and super speed interrupt
endpoints, the USB 2.0 and USB 3.0 spec
require the interval in microframes (1
microframe = 125 microseconds) to be
calculated as interval = 2 ^
(bInterval-1).
Devices with this quirk report their
bInterval as the result of this
calculation instead of the exponent
variable used in the calculation);
i = USB_QUIRK_DEVICE_QUALIFIER (device can't
handle device_qualifier descriptor
requests);
j = USB_QUIRK_IGNORE_REMOTE_WAKEUP (device
generates spurious wakeup, ignore
remote wakeup capability);
k = USB_QUIRK_NO_LPM (device can't handle Link
Power Management);
l = USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL
(Device reports its bInterval as linear
frames instead of the USB 2.0
calculation);
m = USB_QUIRK_DISCONNECT_SUSPEND (Device needs
to be disconnected before suspend to
prevent spurious wakeup);
n = USB_QUIRK_DELAY_CTRL_MSG (Device needs a
pause after every control message);
o = USB_QUIRK_HUB_SLOW_RESET (Hub needs extra
delay after resetting its port);
usb: new quirk to reduce the SET_ADDRESS request timeout This patch introduces a new USB quirk, USB_QUIRK_SHORT_SET_ADDRESS_REQ_TIMEOUT, which modifies the timeout value for the SET_ADDRESS request. The standard timeout for USB request/command is 5000 ms, as recommended in the USB 3.2 specification (section 9.2.6.1). However, certain scenarios, such as connecting devices through an APTIV hub, can lead to timeout errors when the device enumerates as full speed initially and later switches to high speed during chirp negotiation. In such cases, USB analyzer logs reveal that the bus suspends for 5 seconds due to incorrect chirp parsing and resumes only after two consecutive timeout errors trigger a hub driver reset. Packet(54) Dir(?) Full Speed J(997.100 us) Idle( 2.850 us) _______| Time Stamp(28 . 105 910 682) _______|_____________________________________________________________Ch0 Packet(55) Dir(?) Full Speed J(997.118 us) Idle( 2.850 us) _______| Time Stamp(28 . 106 910 632) _______|_____________________________________________________________Ch0 Packet(56) Dir(?) Full Speed J(399.650 us) Idle(222.582 us) _______| Time Stamp(28 . 107 910 600) _______|_____________________________________________________________Ch0 Packet(57) Dir Chirp J( 23.955 ms) Idle(115.169 ms) _______| Time Stamp(28 . 108 532 832) _______|_____________________________________________________________Ch0 Packet(58) Dir(?) Full Speed J (Suspend)( 5.347 sec) Idle( 5.366 us) _______| Time Stamp(28 . 247 657 600) _______|_____________________________________________________________Ch0 This 5-second delay in device enumeration is undesirable, particularly in automotive applications where quick enumeration is crucial (ideally within 3 seconds). The newly introduced quirks provide the flexibility to align with a 3-second time limit, as required in specific contexts like automotive applications. By reducing the SET_ADDRESS request timeout to 500 ms, the system can respond more swiftly to errors, initiate rapid recovery, and ensure efficient device enumeration. This change is vital for scenarios where rapid smartphone enumeration and screen projection are essential. To use the quirk, please write "vendor_id:product_id:p" to /sys/bus/usb/drivers/hub/module/parameter/quirks For example, echo "0x2c48:0x0132:p" > /sys/bus/usb/drivers/hub/module/parameters/quirks" Signed-off-by: Hardik Gajjar <hgajjar@de.adit-jv.com> Reviewed-by: Alan Stern <stern@rowland.harvard.edu> Link: https://lore.kernel.org/r/20231027152029.104363-2-hgajjar@de.adit-jv.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-27 15:20:29 +00:00
p = USB_QUIRK_SHORT_SET_ADDRESS_REQ_TIMEOUT
(Reduce timeout of the SET_ADDRESS
request from 5000 ms to 500 ms);
Example: quirks=0781:5580:bk,0a5c:5834:gij
usbhid.mousepoll=
[USBHID] The interval which mice are to be polled at.
usbhid.jspoll=
[USBHID] The interval which joysticks are to be polled at.
usbhid.kbpoll=
[USBHID] The interval which keyboards are to be polled at.
usb-storage.delay_use=
[UMS] The delay in seconds before a new device is
scanned for Logical Units (default 1).
usb-storage.quirks=
[UMS] A list of quirks entries to supplement or
override the built-in unusual_devs list. List
entries are separated by commas. Each entry has
the form VID:PID:Flags where VID and PID are Vendor
and Product ID values (4-digit hex numbers) and
Flags is a set of characters, each corresponding
to a common usb-storage quirk flag as follows:
a = SANE_SENSE (collect more than 18 bytes
of sense data, not on uas);
b = BAD_SENSE (don't collect more than 18
bytes of sense data, not on uas);
c = FIX_CAPACITY (decrease the reported
device capacity by one sector);
d = NO_READ_DISC_INFO (don't use
READ_DISC_INFO command, not on uas);
e = NO_READ_CAPACITY_16 (don't use
READ_CAPACITY_16 command);
f = NO_REPORT_OPCODES (don't use report opcodes
command, uas only);
g = MAX_SECTORS_240 (don't transfer more than
240 sectors at a time, uas only);
h = CAPACITY_HEURISTICS (decrease the
reported device capacity by one
sector if the number is odd);
i = IGNORE_DEVICE (don't bind to this
device);
j = NO_REPORT_LUNS (don't use report luns
command, uas only);
k = NO_SAME (do not use WRITE_SAME, uas only)
l = NOT_LOCKABLE (don't try to lock and
unlock ejectable media, not on uas);
m = MAX_SECTORS_64 (don't transfer more
than 64 sectors = 32 KB at a time,
not on uas);
2011-06-07 15:35:52 +00:00
n = INITIAL_READ10 (force a retry of the
initial READ(10) command, not on uas);
o = CAPACITY_OK (accept the capacity
reported by the device, not on uas);
p = WRITE_CACHE (the device cache is ON
by default, not on uas);
r = IGNORE_RESIDUE (the device reports
bogus residue values, not on uas);
s = SINGLE_LUN (the device has only one
Logical Unit);
t = NO_ATA_1X (don't allow ATA(12) and ATA(16)
commands, uas only);
u = IGNORE_UAS (don't bind to the uas driver);
w = NO_WP_DETECT (don't test whether the
medium is write-protected).
y = ALWAYS_SYNC (issue a SYNCHRONIZE_CACHE
even if the device claims no cache,
not on uas)
Example: quirks=0419:aaf5:rl,0421:0433:rc
user_debug= [KNL,ARM]
Format: <int>
See arch/arm/Kconfig.debug help text.
1 - undefined instruction events
2 - system calls
4 - invalid data aborts
8 - SIGSEGV faults
16 - SIGBUS faults
Example: user_debug=31
x86, mm: Allow highmem user page tables to be disabled at boot time Distros generally (I looked at Debian, RHEL5 and SLES11) seem to enable CONFIG_HIGHPTE for any x86 configuration which has highmem enabled. This means that the overhead applies even to machines which have a fairly modest amount of high memory and which therefore do not really benefit from allocating PTEs in high memory but still pay the price of the additional mapping operations. Running kernbench on a 4G box I found that with CONFIG_HIGHPTE=y but no actual highptes being allocated there was a reduction in system time used from 59.737s to 55.9s. With CONFIG_HIGHPTE=y and highmem PTEs being allocated: Average Optimal load -j 4 Run (std deviation): Elapsed Time 175.396 (0.238914) User Time 515.983 (5.85019) System Time 59.737 (1.26727) Percent CPU 263.8 (71.6796) Context Switches 39989.7 (4672.64) Sleeps 42617.7 (246.307) With CONFIG_HIGHPTE=y but with no highmem PTEs being allocated: Average Optimal load -j 4 Run (std deviation): Elapsed Time 174.278 (0.831968) User Time 515.659 (6.07012) System Time 55.9 (1.07799) Percent CPU 263.8 (71.266) Context Switches 39929.6 (4485.13) Sleeps 42583.7 (373.039) This patch allows the user to control the allocation of PTEs in highmem from the command line ("userpte=nohigh") but retains the status-quo as the default. It is possible that some simple heuristic could be developed which allows auto-tuning of this option however I don't have a sufficiently large machine available to me to perform any particularly meaningful experiments. We could probably handwave up an argument for a threshold at 16G of total RAM. Assuming 768M of lowmem we have 196608 potential lowmem PTE pages. Each page can map 2M of RAM in a PAE-enabled configuration, meaning a maximum of 384G of RAM could potentially be mapped using lowmem PTEs. Even allowing generous factor of 10 to account for other required lowmem allocations, generous slop to account for page sharing (which reduces the total amount of RAM mappable by a given number of PT pages) and other innacuracies in the estimations it would seem that even a 32G machine would not have a particularly pressing need for highmem PTEs. I think 32G could be considered to be at the upper bound of what might be sensible on a 32 bit machine (although I think in practice 64G is still supported). It's seems questionable if HIGHPTE is even a win for any amount of RAM you would sensibly run a 32 bit kernel on rather than going 64 bit. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> LKML-Reference: <1266403090-20162-1-git-send-email-ian.campbell@citrix.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-02-17 10:38:10 +00:00
userpte=
[X86] Flags controlling user PTE allocations.
nohigh = do not allocate PTE pages in
HIGHMEM regardless of setting
of CONFIG_HIGHPTE.
Docs: admin/kernel-parameters: edit a few boot options Clean up some of admin-guide/kernel-parameters.txt: a. "smt" should be "smt=" (S390) b. (dropped) c. Sparc supports the vdso= boot option d. make the tp_printk options (2) formatting similar to other options by adding spacing e. add "trace_clock=" with a reference to Documentation/trace/ftrace.rst f. use [IA-64] as documented instead of [ia64] g. fix formatting and text for test_suspend= h. fix formatting for swapaccount= i. fix formatting and grammar for video.brightness_switch_enabled= Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linux-ia64@vger.kernel.org Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: linux-pm@vger.kernel.org Cc: linux-acpi@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-04-03 05:48:20 +00:00
vdso= [X86,SH,SPARC]
On X86_32, this is an alias for vdso32=. Otherwise:
vdso=1: enable VDSO (the default)
[PATCH] vdso: randomize the i386 vDSO by moving it into a vma Move the i386 VDSO down into a vma and thus randomize it. Besides the security implications, this feature also helps debuggers, which can COW a vma-backed VDSO just like a normal DSO and can thus do single-stepping and other debugging features. It's good for hypervisors (Xen, VMWare) too, which typically live in the same high-mapped address space as the VDSO, hence whenever the VDSO is used, they get lots of guest pagefaults and have to fix such guest accesses up - which slows things down instead of speeding things up (the primary purpose of the VDSO). There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support for older glibcs that still rely on a prelinked high-mapped VDSO. Newer distributions (using glibc 2.3.3 or later) can turn this option off. Turning it off is also recommended for security reasons: attackers cannot use the predictable high-mapped VDSO page as syscall trampoline anymore. There is a new vdso=[0|1] boot option as well, and a runtime /proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned on/off. (This version of the VDSO-randomization patch also has working ELF coredumping, the previous patch crashed in the coredumping code.) This code is a combined work of the exec-shield VDSO randomization code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell started this patch and i completed it. [akpm@osdl.org: cleanups] [akpm@osdl.org: compile fix] [akpm@osdl.org: compile fix 2] [akpm@osdl.org: compile fix 3] [akpm@osdl.org: revernt MAXMEM change] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Cc: Gerd Hoffmann <kraxel@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Zachary Amsden <zach@vmware.com> Cc: Andi Kleen <ak@muc.de> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 09:53:50 +00:00
vdso=0: disable VDSO mapping
vdso32= [X86] Control the 32-bit vDSO
vdso32=1: enable 32-bit VDSO
vdso32=0 or vdso32=2: disable 32-bit VDSO
See the help text for CONFIG_COMPAT_VDSO for more
details. If CONFIG_COMPAT_VDSO is set, the default is
vdso32=0; otherwise, the default is vdso32=1.
For compatibility with older kernels, vdso32=2 is an
alias for vdso32=0.
Try vdso32=0 if you encounter an error that says:
dl_main: Assertion `(void *) ph->p_vaddr == _rtld_local._dl_sysinfo_dso' failed!
vector= [IA-64,SMP]
vector=percpu: enable percpu vector domain
video= [FB] Frame buffer configuration
See Documentation/fb/modedb.rst.
Docs: admin/kernel-parameters: edit a few boot options Clean up some of admin-guide/kernel-parameters.txt: a. "smt" should be "smt=" (S390) b. (dropped) c. Sparc supports the vdso= boot option d. make the tp_printk options (2) formatting similar to other options by adding spacing e. add "trace_clock=" with a reference to Documentation/trace/ftrace.rst f. use [IA-64] as documented instead of [ia64] g. fix formatting and text for test_suspend= h. fix formatting for swapaccount= i. fix formatting and grammar for video.brightness_switch_enabled= Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linux-ia64@vger.kernel.org Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: linux-pm@vger.kernel.org Cc: linux-acpi@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-04-03 05:48:20 +00:00
video.brightness_switch_enabled= [ACPI]
Format: [0|1]
If set to 1, on receiving an ACPI notify event
generated by hotkey, video driver will adjust brightness
level and then send out the event to user space through
Docs: admin/kernel-parameters: edit a few boot options Clean up some of admin-guide/kernel-parameters.txt: a. "smt" should be "smt=" (S390) b. (dropped) c. Sparc supports the vdso= boot option d. make the tp_printk options (2) formatting similar to other options by adding spacing e. add "trace_clock=" with a reference to Documentation/trace/ftrace.rst f. use [IA-64] as documented instead of [ia64] g. fix formatting and text for test_suspend= h. fix formatting for swapaccount= i. fix formatting and grammar for video.brightness_switch_enabled= Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linux-ia64@vger.kernel.org Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: linux-pm@vger.kernel.org Cc: linux-acpi@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2022-04-03 05:48:20 +00:00
the allocated input device. If set to 0, video driver
will only send out the event without touching backlight
brightness level.
default: 1
virtio_mmio.device=
[VMMIO] Memory mapped virtio (platform) device.
<size>@<baseaddr>:<irq>[:<id>]
where:
<size> := size (can use standard suffixes
like K, M and G)
<baseaddr> := physical base address
<irq> := interrupt number (as passed to
request_irq())
<id> := (optional) platform device id
example:
virtio_mmio.device=1K@0x100b0000:48:7
Can be used multiple times for multiple devices.
vga= [BOOT,X86-32] Select a particular video mode
See Documentation/arch/x86/boot.rst and
Documentation/admin-guide/svga.rst.
Use vga=ask for menu.
This is actually a boot loader parameter; the value is
passed to the kernel using a special protocol.
mm: provide kernel parameter to allow disabling page init poisoning Patch series "Address issues slowing persistent memory initialization", v5. The main thing this patch set achieves is that it allows us to initialize each node worth of persistent memory independently. As a result we reduce page init time by about 2 minutes because instead of taking 30 to 40 seconds per node and going through each node one at a time, we process all 4 nodes in parallel in the case of a 12TB persistent memory setup spread evenly over 4 nodes. This patch (of 3): On systems with a large amount of memory it can take a significant amount of time to initialize all of the page structs with the PAGE_POISON_PATTERN value. I have seen it take over 2 minutes to initialize a system with over 12TB of RAM. In order to work around the issue I had to disable CONFIG_DEBUG_VM and then the boot time returned to something much more reasonable as the arch_add_memory call completed in milliseconds versus seconds. However in doing that I had to disable all of the other VM debugging on the system. In order to work around a kernel that might have CONFIG_DEBUG_VM enabled on a system that has a large amount of memory I have added a new kernel parameter named "vm_debug" that can be set to "-" in order to disable it. Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 22:07:45 +00:00
vm_debug[=options] [KNL] Available with CONFIG_DEBUG_VM=y.
May slow down system boot speed, especially when
enabled on systems with a large amount of memory.
All options are enabled by default, and this
interface is meant to allow for selectively
enabling or disabling specific virtual memory
debugging features.
Available options are:
P Enable page structure init time poisoning
- Disable all of the above options
vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact
size of <nn>. This can be used to increase the
minimum size (128MB on x86). It can also be used to
decrease the size and leave more room for directly
mapped kernel RAM.
vmcp_cma=nn[MG] [KNL,S390]
Sets the memory size reserved for contiguous memory
allocations for the vmcp device driver.
vmhalt= [KNL,S390] Perform z/VM CP command after system halt.
Format: <command>
vmpanic= [KNL,S390] Perform z/VM CP command after kernel panic.
Format: <command>
vmpoff= [KNL,S390] Perform z/VM CP command after power off.
Format: <command>
vsyscall= [X86-64]
Controls the behavior of vsyscalls (i.e. calls to
fixed addresses of 0xffffffffff600x00 from legacy
code). Most statically-linked binaries and older
versions of glibc use these calls. Because these
functions are at fixed addresses, they make nice
targets for exploits that can control RIP.
emulate Vsyscalls turn into traps and are emulated
reasonably safely. The vsyscall page is
readable.
xonly [default] Vsyscalls turn into traps and are
emulated reasonably safely. The vsyscall
page is not readable.
none Vsyscalls don't work at all. This makes
them quite hard to use for exploits but
might break your system.
vt.color= [VT] Default text color.
Format: 0xYX, X = foreground, Y = background.
Default: 0x07 = light gray on black.
vt.cur_default= [VT] Default cursor shape.
Format: 0xCCBBAA, where AA, BB, and CC are the same as
the parameters of the <Esc>[?A;B;Cc escape sequence;
see VGA-softcursor.txt. Default: 2 = underline.
vt.default_blu= [VT]
Format: <blue0>,<blue1>,<blue2>,...,<blue15>
Change the default blue palette of the console.
This is a 16-member array composed of values
ranging from 0-255.
vt.default_grn= [VT]
Format: <green0>,<green1>,<green2>,...,<green15>
Change the default green palette of the console.
This is a 16-member array composed of values
ranging from 0-255.
vt.default_red= [VT]
Format: <red0>,<red1>,<red2>,...,<red15>
Change the default red palette of the console.
This is a 16-member array composed of values
ranging from 0-255.
vt.default_utf8=
[VT]
Format=<0|1>
Set system-wide default UTF-8 mode for all tty's.
Default is 1, i.e. UTF-8 mode is enabled for all
newly opened terminals.
vt.global_cursor_default=
[VT]
Format=<-1|0|1>
Set system-wide default for whether a cursor
is shown on new VTs. Default is -1,
i.e. cursors will be created by default unless
overridden by individual drivers. 0 will hide
cursors, 1 will display them.
vt.italic= [VT] Default color for italic text; 0-15.
Default: 2 = green.
vt.underline= [VT] Default color for underlined text; 0-15.
Default: 3 = cyan.
watchdog timers [HW,WDT] For information on watchdog timers,
see Documentation/watchdog/watchdog-parameters.rst
or other driver-specific files in the
Documentation/watchdog/ directory.
watchdog_thresh=
[KNL]
Set the hard lockup detector stall duration
threshold in seconds. The soft lockup detector
threshold is set to twice the value. A value of 0
disables both lockup detectors. Default is 10
seconds.
workqueue.unbound_cpus=
[KNL,SMP] Specify to constrain one or some CPUs
to use in unbound workqueues.
Format: <cpu-list>
By default, all online CPUs are available for
unbound workqueues.
workqueue: implement lockup detector Workqueue stalls can happen from a variety of usage bugs such as missing WQ_MEM_RECLAIM flag or concurrency managed work item indefinitely staying RUNNING. These stalls can be extremely difficult to hunt down because the usual warning mechanisms can't detect workqueue stalls and the internal state is pretty opaque. To alleviate the situation, this patch implements workqueue lockup detector. It periodically monitors all worker_pools periodically and, if any pool failed to make forward progress longer than the threshold duration, triggers warning and dumps workqueue state as follows. BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s! Showing busy workqueues and worker pools: workqueue events: flags=0x0 pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256 pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent workqueue events_power_efficient: flags=0x80 pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256 pending: check_lifetime, neigh_periodic_work workqueue cgroup_pidlist_destroy: flags=0x0 pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1 pending: cgroup_pidlist_destroy_work_fn ... The detection mechanism is controller through kernel parameter workqueue.watchdog_thresh and can be updated at runtime through the sysfs module parameter file. v2: Decoupled from softlockup control knobs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2015-12-08 16:28:04 +00:00
workqueue.watchdog_thresh=
If CONFIG_WQ_WATCHDOG is configured, workqueue can
warn stall conditions and dump internal state to
help debugging. 0 disables workqueue stall
detection; otherwise, it's the stall threshold
duration in seconds. The default value is 30 and
it can be updated at runtime by writing to the
corresponding sysfs file.
workqueue: Automatically mark CPU-hogging work items CPU_INTENSIVE If a per-cpu work item hogs the CPU, it can prevent other work items from starting through concurrency management. A per-cpu workqueue which intends to host such CPU-hogging work items can choose to not participate in concurrency management by setting %WQ_CPU_INTENSIVE; however, this can be error-prone and difficult to debug when missed. This patch adds an automatic CPU usage based detection. If a concurrency-managed work item consumes more CPU time than the threshold (10ms by default) continuously without intervening sleeps, wq_worker_tick() which is called from scheduler_tick() will detect the condition and automatically mark it CPU_INTENSIVE. The mechanism isn't foolproof: * Detection depends on tick hitting the work item. Getting preempted at the right timings may allow a violating work item to evade detection at least temporarily. * nohz_full CPUs may not be running ticks and thus can fail detection. * Even when detection is working, the 10ms detection delays can add up if many CPU-hogging work items are queued at the same time. However, in vast majority of cases, this should be able to detect violations reliably and provide reasonable protection with a small increase in code complexity. If some work items trigger this condition repeatedly, the bigger problem likely is the CPU being saturated with such per-cpu work items and the solution would be making them UNBOUND. The next patch will add a debug mechanism to help spot such cases. v4: Documentation for workqueue.cpu_intensive_thresh_us added to kernel-parameters.txt. v3: Switch to use wq_worker_tick() instead of hooking into preemptions as suggested by Peter. v2: Lai pointed out that wq_worker_stopping() also needs to be called from preemption and rtlock paths and an earlier patch was updated accordingly. This patch adds a comment describing the risk of infinte recursions and how they're avoided. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
2023-05-18 03:02:08 +00:00
workqueue.cpu_intensive_thresh_us=
Per-cpu work items which run for longer than this
threshold are automatically considered CPU intensive
and excluded from concurrency management to prevent
them from noticeably delaying other per-cpu work
items. Default is 10000 (10ms).
If CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel
will report the work functions which violate this
threshold repeatedly. They are likely good
candidates for using WQ_UNBOUND workqueues instead.
workqueue.power_efficient
Per-cpu workqueues are generally preferred because
they show better performance thanks to cache
locality; unfortunately, per-cpu workqueues tend to
be more power hungry than unbound workqueues.
Enabling this makes the per-cpu workqueues which
were observed to contribute significantly to power
consumption unbound, leading to measurably lower
power usage at the cost of small performance
overhead.
The default value of this parameter is determined by
the config option CONFIG_WQ_POWER_EFFICIENT_DEFAULT.
workqueue.default_affinity_scope=
Select the default affinity scope to use for unbound
workqueues. Can be one of "cpu", "smt", "cache",
"numa" and "system". Default is "cache". For more
information, see the Affinity Scopes section in
Documentation/core-api/workqueue.rst.
This can be changed after boot by writing to the
matching /sys/module/workqueue/parameters file. All
workqueues with the "default" affinity scope will be
updated accordignly.
workqueue.debug_force_rr_cpu
Workqueue used to implicitly guarantee that work
items queued without explicit CPU specified are put
on the local CPU. This guarantee is no longer true
and while local CPU is still preferred work items
may be put on foreign CPUs. This debug option
forces round-robin CPU selection to flush out
usages which depend on the now broken guarantee.
When enabled, memory and cache locality will be
impacted.
writecombine= [LOONGARCH] Control the MAT (Memory Access Type) of
ioremap_wc().
on - Enable writecombine, use WUC for ioremap_wc()
off - Disable writecombine, use SUC for ioremap_wc()
x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
xen_512gb_limit [KNL,X86-64,XEN]
Restricts the kernel running paravirtualized under Xen
to use only up to 512 GB of RAM. The reason to do so is
crash analysis tools and Xen tools for doing domain
save/restore/migration must be enabled to handle larger
domains.
xen_emul_unplug= [HW,X86,XEN]
Unplug Xen emulated devices
Format: [unplug0,][unplug1]
ide-disks -- unplug primary master IDE devices
aux-ide-disks -- unplug non-primary-master IDE devices
nics -- unplug network devices
all -- unplug all emulated devices (NICs and IDE disks)
unnecessary -- unplugging emulated devices is
unnecessary even if the host did not respond to
the unplug protocol
never -- do not unplug even if version check succeeds
xen_legacy_crash [X86,XEN]
Crash from Xen panic notifier, without executing late
panic() code such as dumping handler.
xen_msr_safe= [X86,XEN]
Format: <bool>
Select whether to always use non-faulting (safe) MSR
access functions when running as Xen PV guest. The
default value is controlled by CONFIG_XEN_PV_MSR_SAFE.
xen_nopvspin [X86,XEN]
Disables the qspinlock slowpath using Xen PV optimizations.
This parameter is obsoleted by "nopvspin" parameter, which
has equivalent effect for XEN platform.
xen_nopv [X86]
Disables the PV optimizations forcing the HVM guest to
run as generic HVM guest with no PV drivers.
This option is obsoleted by the "nopv" option, which
has equivalent effect for XEN platform.
xen_no_vector_callback
[KNL,X86,XEN] Disable the vector callback for Xen
event channel interrupts.
xen_scrub_pages= [XEN]
Boolean option to control scrubbing pages before giving them back
to Xen, for use by other domains. Can be also changed at runtime
with /sys/devices/system/xen_memory/xen_memory0/scrub_pages.
Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
xen_timer_slop= [X86-64,XEN]
Set the timer slop (in nanoseconds) for the virtual Xen
timers (default is 100000). This adjusts the minimum
delta of virtualized Xen timers, where lower values
improve timer resolution at the expense of processing
more timer interrupts.
xen.balloon_boot_timeout= [XEN]
The time (in seconds) to wait before giving up to boot
in case initial ballooning fails to free enough memory.
Applies only when running as HVM or PVH guest and
started with less memory configured than allowed at
max. Default is 180.
xen.event_eoi_delay= [XEN]
How long to delay EOI handling in case of event
storms (jiffies). Default is 10.
xen.event_loop_timeout= [XEN]
After which time (jiffies) the event handling loop
should start to delay EOI handling. Default is 2.
xen.fifo_events= [XEN]
Boolean parameter to disable using fifo event handling
even if available. Normally fifo event handling is
preferred over the 2-level event handling, as it is
fairer and the number of possible event channels is
much higher. Default is on (use fifo events).
xirc2ps_cs= [NET,PCMCIA]
Format:
<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]
xive= [PPC]
By default on POWER9 and above, the kernel will
natively use the XIVE interrupt controller. This option
allows the fallback firmware mode to be used:
off Fallback to firmware control of XIVE interrupt
controller on both pseries and powernv
platforms. Only useful on POWER9 and above.
xive.store-eoi=off [PPC]
By default on POWER10 and above, the kernel will use
stores for EOI handling when the XIVE interrupt mode
is active. This option allows the XIVE driver to use
loads instead, as on POWER9.
xhci-hcd.quirks [USB,KNL]
A hex value specifying bitmask with supplemental xhci
host controller quirks. Meaning of each bit can be
consulted in header drivers/usb/host/xhci.h.
xmon [PPC]
Format: { early | on | rw | ro | off }
Controls if xmon debugger is enabled. Default is off.
Passing only "xmon" is equivalent to "xmon=early".
early Call xmon as early as possible on boot; xmon
debugger is called from setup_arch().
on xmon debugger hooks will be installed so xmon
is only called on a kernel crash. Default mode,
i.e. either "ro" or "rw" mode, is controlled
with CONFIG_XMON_DEFAULT_RO_MODE.
rw xmon debugger hooks will be installed so xmon
is called only on a kernel crash, mode is write,
meaning SPR registers, memory and, other data
can be written using xmon commands.
ro same as "rw" option above but SPR registers,
memory, and other data can't be written using
xmon commands.
off xmon is disabled.