Commit Graph

45189 Commits

Author SHA1 Message Date
Nicolas Saenz Julienne d6800af51c KVM: x86: hyper-v: Don't auto-enable stimer on write from user-space
Don't apply the stimer's counter side effects when modifying its
value from user-space, as this may trigger spurious interrupts.

For example:
 - The stimer is configured in auto-enable mode.
 - The stimer's count is set and the timer enabled.
 - The stimer expires, an interrupt is injected.
 - The VM is live migrated.
 - The stimer config and count are deserialized, auto-enable is ON, the
   stimer is re-enabled.
 - The stimer expires right away, and injects an unwarranted interrupt.

Cc: stable@vger.kernel.org
Fixes: 1f4b34f825 ("kvm/x86: Hyper-V SynIC timers")
Signed-off-by: Nicolas Saenz Julienne <nsaenz@amazon.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231017155101.40677-1-nsaenz@amazon.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-17 16:41:02 -07:00
Mingwei Zhang 5a989bbead KVM: x86: Update the variable naming in kvm_x86_ops.sched_in()
Update the variable with name 'kvm' in kvm_x86_ops.sched_in() to 'vcpu' to
avoid confusions. Variable naming in KVM has a clear convention that 'kvm'
refers to pointer of type 'struct kvm *', while 'vcpu' refers to pointer of
type 'struct kvm_vcpu *'.

Fix this 9-year old naming issue for fun.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231017232610.4008690-1-mizhang@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-17 16:40:01 -07:00
Paolo Bonzini 2e9064facc x86/microcode/amd: Fix snprintf() format string warning in W=1 build
Building with GCC 11.x results in the following warning:

  arch/x86/kernel/cpu/microcode/amd.c: In function ‘find_blobs_in_containers’:
  arch/x86/kernel/cpu/microcode/amd.c:504:58: error: ‘h.bin’ directive output may be truncated writing 5 bytes into a region of size between 1 and 7 [-Werror=format-truncation=]
  arch/x86/kernel/cpu/microcode/amd.c:503:17: note: ‘snprintf’ output between 35 and 41 bytes into a destination of size 36

The issue is that GCC does not know that the family can only be a byte
(it ultimately comes from CPUID).  Suggest the right size to the compiler
by marking the argument as char-size ("hh").  While at it, instead of
using the slightly more obscure precision specifier use the width with
zero padding (over 23000 occurrences in kernel sources, vs 500 for
the idiom using the precision).

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Closes: https://lore.kernel.org/oe-kbuild-all/202308252255.2HPJ6x5Q-lkp@intel.com/
Link: https://lore.kernel.org/r/20231016224858.2829248-1-pbonzini@redhat.com
2023-10-17 23:51:58 +02:00
David Matlack 3d30bfcbdc KVM: x86/mmu: Stop kicking vCPUs to sync the dirty log when PML is disabled
Stop kicking vCPUs in kvm_arch_sync_dirty_log() when PML is disabled.
Kicking vCPUs when PML is disabled serves no purpose and could
negatively impact guest performance.

This restores KVM's behavior to prior to 5.12 commit a018eba538 ("KVM:
x86: Move MMU's PML logic to common code"), which replaced a
static_call_cond(kvm_x86_flush_log_dirty) with unconditional calls to
kvm_vcpu_kick().

Fixes: a018eba538 ("KVM: x86: Move MMU's PML logic to common code")
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20231016221228.1348318-1-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-17 13:54:52 -07:00
Peng Hao 26951ec862 KVM: x86: Use octal for file permission
Convert all module params to octal permissions to improve code readability
and to make checkpatch happy:

  WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using
           octal permissions '0444'.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/r/20231013113020.77523-1-flyingpeng@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-17 10:29:10 -07:00
Hou Wenlong d2a285d65b x86/head/64: Move the __head definition to <asm/init.h>
Move the __head section definition to a header to widen its use.

An upcoming patch will mark the code as __head in mem_encrypt_identity.c too.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/0583f57977be184689c373fe540cbd7d85ca2047.1697525407.git.houwenlong.hwl@antgroup.com
2023-10-17 14:51:14 +02:00
Babu Moger 4cee14bcb1 x86/resctrl: Display RMID of resource group
In x86, hardware uses RMID to identify a monitoring group. When a user
creates a monitor group these details are not visible. These details
can help resctrl debugging.

Add RMID(mon_hw_id) to the monitor groups display in the resctrl interface.
Users can see these details when resctrl is mounted with "-o debug" option.

Add RFTYPE_MON_BASE that complements existing RFTYPE_CTRL_BASE and
represents files belonging to monitoring groups.

Other architectures do not use "RMID". Use the name mon_hw_id to refer
to "RMID" in an effort to keep the naming generic.

For example:
  $cat /sys/fs/resctrl/mon_groups/mon_grp1/mon_hw_id
  3

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Link: https://lore.kernel.org/r/20231017002308.134480-10-babu.moger@amd.com
2023-10-17 14:05:40 +02:00
Babu Moger 918f211b5e x86/resctrl: Add support for the files of MON groups only
Files unique to monitoring groups have the RFTYPE_MON flag. When a new
monitoring group is created the resctrl files with flags RFTYPE_BASE
(files common to all resource groups) and RFTYPE_MON (files unique to
monitoring groups) are created to support interacting with the new
monitoring group.

A resource group can support both monitoring and control, also termed
a CTRL_MON resource group. CTRL_MON groups should get both monitoring
and control resctrl files but that is not the case. Only the
RFTYPE_BASE and RFTYPE_CTRL files are created for CTRL_MON groups.

Ensure that files with the RFTYPE_MON flag are created for CTRL_MON groups.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Link: https://lore.kernel.org/r/20231017002308.134480-9-babu.moger@amd.com
2023-10-17 14:05:24 +02:00
Babu Moger ca8dad225e x86/resctrl: Display CLOSID for resource group
In x86, hardware uses CLOSID to identify a control group. When a user
creates a control group this information is not visible to the user. It
can help resctrl debugging.

Add CLOSID(ctrl_hw_id) to the control groups display in the resctrl
interface. Users can see this detail when resctrl is mounted with the
"-o debug" option.

Other architectures do not use "CLOSID". Use the names ctrl_hw_id to refer
to "CLOSID" in an effort to keep the naming generic.

For example:
  $cat /sys/fs/resctrl/ctrl_grp1/ctrl_hw_id
  1

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Link: https://lore.kernel.org/r/20231017002308.134480-8-babu.moger@amd.com
2023-10-17 14:05:14 +02:00
Babu Moger cb07d71f01 x86/resctrl: Introduce "-o debug" mount option
Add "-o debug" option to mount resctrl filesystem in debug mode.  When
in debug mode resctrl displays files that have the new RFTYPE_DEBUG flag
to help resctrl debugging.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Link: https://lore.kernel.org/r/20231017002308.134480-7-babu.moger@amd.com
2023-10-17 13:07:17 +02:00
Babu Moger d27567a0eb x86/resctrl: Move default group file creation to mount
The default resource group and its files are created during kernel init
time. Upcoming changes will make some resctrl files optional based on
a mount parameter. If optional files are to be added to the default
group based on the mount option, then each new file needs to be created
separately and call kernfs_activate() again.

Create all files of the default resource group during resctrl mount,
destroyed during unmount, to avoid scattering resctrl file addition
across two separate code flows.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Link: https://lore.kernel.org/r/20231017002308.134480-6-babu.moger@amd.com
2023-10-17 12:50:53 +02:00
Babu Moger df5f3a1dd8 x86/resctrl: Unwind properly from rdt_enable_ctx()
rdt_enable_ctx() enables the features provided during resctrl mount.

Additions to rdt_enable_ctx() are required to also modify error paths
of rdt_enable_ctx() callers to ensure correct unwinding if errors
are encountered after calling rdt_enable_ctx(). This is error prone.

Introduce rdt_disable_ctx() to refactor the error unwinding of
rdt_enable_ctx() to simplify future additions. This also simplifies
cleanup in rdt_kill_sb().

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Link: https://lore.kernel.org/r/20231017002308.134480-5-babu.moger@amd.com
2023-10-17 12:49:02 +02:00
Babu Moger d41592435c x86/resctrl: Rename rftype flags for consistency
resctrl associates rftype flags with its files so that files can be chosen
based on the resource, whether it is info or base, and if it is control
or monitor type file. These flags use the RF_ as well as RFTYPE_ prefixes.

Change the prefix to RFTYPE_ for all these flags to be consistent.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Link: https://lore.kernel.org/r/20231017002308.134480-4-babu.moger@amd.com
2023-10-17 11:59:14 +02:00
Babu Moger 6846dc1a31 x86/resctrl: Simplify rftype flag definitions
The rftype flags are bitmaps used for adding files under the resctrl
filesystem. Some of these bitmap defines have one extra level of
indirection which is not necessary.

Drop the RF_* defines and simplify the macros.

  [ bp: Massage commit message. ]

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Link: https://lore.kernel.org/r/20231017002308.134480-3-babu.moger@amd.com
2023-10-17 11:51:16 +02:00
Babu Moger fe2a20ea0b x86/resctrl: Add multiple tasks to the resctrl group at once
The resctrl task assignment for monitor or control group needs to be
done one at a time. For example:

  $mount -t resctrl resctrl /sys/fs/resctrl/
  $mkdir /sys/fs/resctrl/ctrl_grp1
  $echo 123 > /sys/fs/resctrl/ctrl_grp1/tasks
  $echo 456 > /sys/fs/resctrl/ctrl_grp1/tasks
  $echo 789 > /sys/fs/resctrl/ctrl_grp1/tasks

This is not user-friendly when dealing with hundreds of tasks.

Support multiple task assignment in one command with tasks ids separated
by commas. For example:

  $echo 123,456,789 > /sys/fs/resctrl/ctrl_grp1/tasks

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Tan Shaopeng <tan.shaopeng@jp.fujitsu.com>
Link: https://lore.kernel.org/r/20231017002308.134480-2-babu.moger@amd.com
2023-10-17 11:27:50 +02:00
Joerg Roedel 63e44bc520 x86/sev: Check for user-space IOIO pointing to kernel space
Check the memory operand of INS/OUTS before emulating the instruction.
The #VC exception can get raised from user-space, but the memory operand
can be manipulated to access kernel memory before the emulation actually
begins and after the exception handler has run.

  [ bp: Massage commit message. ]

Fixes: 597cfe4821 ("x86/boot/compressed/64: Setup a GHCB-based VC Exception handler")
Reported-by: Tom Dohrmann <erbse.13@gmx.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
2023-10-17 10:58:16 +02:00
Arnd Bergmann acfc788233 vgacon: remove screen_info dependency
The vga console driver is fairly self-contained, and only used by
architectures that explicitly initialize the screen_info settings.

Chance every instance that picks the vga console by setting conswitchp
to call a function instead, and pass a reference to the screen_info
there.

Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Acked-by: Khalid Azzi <khalid@gonehiking.org>
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20231009211845.3136536-6-arnd@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-17 10:17:02 +02:00
Linus Torvalds 86d6a628a2 ARM:
- Fix the handling of the phycal timer offset when FEAT_ECV
   and CNTPOFF_EL2 are implemented.
 
 - Restore the functionnality of Permission Indirection that
   was broken by the Fine Grained Trapping rework
 
 - Cleanup some PMU event sharing code
 
 MIPS:
 
 - Fix W=1 build.
 
 s390:
 
 - One small fix for gisa to avoid stalls.
 
 x86:
 
 - Truncate writes to PMU counters to the counter's width to avoid spurious
   overflows when emulating counter events in software.
 
 - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined
   architectural behavior).
 
 - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to
   kick the guest out of emulated halt.
 
 - Fix for loading XSAVE state from an old kernel into a new one.
 
 - Fixes for AMD AVIC
 
 selftests:
 
 - Play nice with %llx when formatting guest printf and assert statements.
 
 - Clean up stale test metadata.
 
 - Zero-initialize structures in memslot perf test to workaround a suspected
   "may be used uninitialized" false positives from GCC.
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmUtvXgUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOE3gf/Q0Xvi/oU/+iDMuvfCbMZg/nhbrsa
 WmE/zXLrCF0DknppAsWulkhLGL2ceL6X+f37f2vWpBdG9SVDG/vSAg+QQDwsXiKN
 hTJoaybtMMPZM64emPZKeLMrq3A/U32qIMmWMJkoQRyz6dftUhGqZEuy1jw8oomJ
 n9idRDCMkbo+bick4URt0FEuI3Tf6dPIlG7P5hObFTw+nenzzxTjoxWZ432Mgyod
 yqveEke4hcQ+6K1zdAcDNZqT9ZhxeTxAO4yrBAYfnFoPLhUXKDUumkqAQPNOhKTo
 YN+b29kHBm+HvYkHN785FQla/13wjE1aq5TUj5J7NEDv4uRXDefDq2OAeg==
 =b9AY
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM:

   - Fix the handling of the phycal timer offset when FEAT_ECV and
     CNTPOFF_EL2 are implemented

   - Restore the functionnality of Permission Indirection that was
     broken by the Fine Grained Trapping rework

   - Cleanup some PMU event sharing code

  MIPS:

   - Fix W=1 build

  s390:

   - One small fix for gisa to avoid stalls

  x86:

   - Truncate writes to PMU counters to the counter's width to avoid
     spurious overflows when emulating counter events in software

   - Set the LVTPC entry mask bit when handling a PMI (to match
     Intel-defined architectural behavior)

   - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work
     to kick the guest out of emulated halt

   - Fix for loading XSAVE state from an old kernel into a new one

   - Fixes for AMD AVIC

  selftests:

   - Play nice with %llx when formatting guest printf and assert
     statements

   - Clean up stale test metadata

   - Zero-initialize structures in memslot perf test to workaround a
     suspected 'may be used uninitialized' false positives from GCC"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
  KVM: arm64: timers: Correctly handle TGE flip with CNTPOFF_EL2
  KVM: arm64: POR{E0}_EL1 do not need trap handlers
  KVM: arm64: Add nPIR{E0}_EL1 to HFG traps
  KVM: MIPS: fix -Wunused-but-set-variable warning
  KVM: arm64: pmu: Drop redundant check for non-NULL kvm_pmu_events
  KVM: SVM: Fix build error when using -Werror=unused-but-set-variable
  x86: KVM: SVM: refresh AVIC inhibition in svm_leave_nested()
  x86: KVM: SVM: add support for Invalid IPI Vector interception
  x86: KVM: SVM: always update the x2avic msr interception
  KVM: selftests: Force load all supported XSAVE state in state test
  KVM: selftests: Load XSAVE state into untouched vCPU during state test
  KVM: selftests: Touch relevant XSAVE state in guest for state test
  KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
  x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
  KVM: selftests: Zero-initialize entire test_result in memslot perf test
  KVM: selftests: Remove obsolete and incorrect test case metadata
  KVM: selftests: Treat %llx like %lx when formatting guest printf
  KVM: x86/pmu: Synthesize at most one PMI per VM-exit
  KVM: x86: Mask LVTPC when handling a PMI
  KVM: x86/pmu: Truncate counter value to allowed width on write
  ...
2023-10-16 18:34:17 -07:00
Yazen Ghannam 1bae0cfe4a x86/mce: Cleanup mce_usable_address()
Move Intel-specific checks into a helper function.

Explicitly use "bool" for return type.

No functional change intended.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613141142.36801-4-yazen.ghannam@amd.com
2023-10-16 15:37:01 +02:00
Yazen Ghannam 48da1ad8ba x86/mce: Define amd_mce_usable_address()
Currently, all valid MCA_ADDR values are assumed to be usable on AMD
systems. However, this is not correct in most cases. Notifiers expecting
usable addresses may then operate on inappropriate values.

Define a helper function to do AMD-specific checks for a usable memory
address. List out all known cases.

  [ bp: Tone down the capitalized words. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613141142.36801-3-yazen.ghannam@amd.com
2023-10-16 15:31:32 +02:00
Yazen Ghannam 495a91d099 x86/MCE/AMD: Split amd_mce_is_memory_error()
Define helper functions for legacy and SMCA systems in order to reuse
individual checks in later changes.

Describe what each function is checking for, and correct the XEC bitmask
for SMCA.

No functional change intended.

  [ bp: Use "else in amd_mce_is_memory_error() to make the conditional
    balanced, for readability. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613141142.36801-2-yazen.ghannam@amd.com
2023-10-16 15:04:53 +02:00
Hou Wenlong 7f6874eddd x86/head/64: Add missing __head annotation to startup_64_load_idt()
This function is currently only used in the head code and is only called
from startup_64_setup_env(). Although it would be inlined by the
compiler, it would be better to mark it as __head too in case it doesn't.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/efcc5b5e18af880e415d884e072bf651c1fa7c34.1689130310.git.houwenlong.hwl@antgroup.com
2023-10-16 13:38:24 +02:00
Hou Wenlong dc62830090 x86/head/64: Mark 'startup_gdt[]' and 'startup_gdt_descr' as __initdata
As 'startup_gdt[]' and 'startup_gdt_descr' are only used in booting,
mark them as __initdata to allow them to be freed after boot.

Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/c85903a7cfad37d14a7e5a4df9fc7119a3669fb3.1689130310.git.houwenlong.hwl@antgroup.com
2023-10-16 13:38:24 +02:00
Sandipan Das 744940f192 perf/x86/amd/uncore: Pass through error code for initialization failures, instead of -ENODEV
Pass through the appropriate error code when the registration of hotplug
callbacks fail during initialization, instead of returning a blanket -ENODEV.

[ mingo: Updated the changelog. ]

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231016060743.332051-1-sandipan.das@amd.com
2023-10-16 11:42:01 +02:00
Linus Torvalds fbe1bf1e5f Revert "x86/smp: Put CPUs into INIT on shutdown if possible"
This reverts commit 45e34c8af5, and the
two subsequent fixes to it:

  3f874c9b2a ("x86/smp: Don't send INIT to non-present and non-booted CPUs")
  b1472a60a5 ("x86/smp: Don't send INIT to boot CPU")

because it seems to result in hung machines at shutdown.  Particularly
some Dell machines, but Thomas says

 "The rest seems to be Lenovo and Sony with Alderlake/Raptorlake CPUs -
  at least that's what I could figure out from the various bug reports.

  I don't know which CPUs the DELL machines have, so I can't say it's a
  pattern.

  I agree with the revert for now"

Ashok Raj chimes in:

 "There was a report (probably this same one), and it turns out it was a
  bug in the BIOS SMI handler.

  The client BIOS's were waiting for the lowest APICID to be the SMI
  rendevous master. If this is MeteorLake, the BSP wasn't the one with
  the lowest APIC and it triped here.

  The BIOS change is also being pushed to others for assimilation :)

  Server BIOS's had this correctly for a while now"

and it does look likely to be some bad interaction between SMI and the
non-BSP cores having put into INIT (and thus unresponsive until reset).

Link: https://bbs.archlinux.org/viewtopic.php?pid=2124429
Link: https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
Link: https://forum.artixlinux.org/index.php/topic,5997.0.html
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2241279
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-10-15 12:02:02 -07:00
Linus Torvalds ddf2085598 Fix a Longsoon build warning by harmonizing the arch_[un]register_cpu()
prototypes between architectures.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUrobMRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1ggUw/7BGHe360tsrsMAcOHcwvvGhnQ0UKuoqLs
 IJl3dfsdX+JnL7cpbNcRBVDqgH2seIwdQFa73gALColcxntEBbnC/gVS4QLLSxSp
 HIq5C1OELT/jPMOjc6aimJx/qPvW/CLgo2WJx78rv0ykkf1RJIzqCTVKf8VQX6Vu
 t0/9jEhBNuL8DZthJ5ZV448WtlJcdnWXVGxq/UHEheV219Rjtp3NGf8s/K+WMzF3
 x9Zhmb+/UPgjhaZtrQDP2mf7ZYgmVhLvJTRSQdQNrcDe/ZaNrCrEGOwHuOpQ0vXw
 v42rd1AVGV/xgIhfBOABLdb5snBbQMDvYLcma04bkBd6H6WPFJZ1PvnGovTagxUO
 FP4117VBA5ARwZemxwGEPJkNF9lVEPSBVDv7bx2OO0zVCViuAbKJXxGzCW/GiSGA
 BRk5FogxJ7TcjWsYWWaZfYlq8RFI5UI3K/IEQIUpQKtC9OMhScdp532xEP48dc8u
 pHjPjVoYCXtoD/ZD73ZJJduN6Hn30HEE/IJ6+YKJRo4EZquUrGtbgG46QbFBqxqW
 xIPPTx7OPAaAfq020+c6BuUnta1iEY6I4De/+XbRdQf+AIWqfLsWNI8en8aO4t5p
 rtlrYD2Si1F0KBWtDWR7JhCm8CD2klWVWrD9d4DLpz9ljHXKa9d7BYp3BxkvgcQm
 x8f1D9yC9X4=
 =20Na
 -----END PGP SIGNATURE-----

Merge tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CPU hotplug fix from Ingo Molnar:
 "Fix a Longsoon build warning by harmonizing the
  arch_[un]register_cpu() prototypes between architectures"

* tag 'smp-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu-hotplug: Provide prototypes for arch CPU registration
2023-10-15 08:44:56 -07:00
Paolo Bonzini 88e4cd893f KVM x86/pmu fixes for 6.6:
- Truncate writes to PMU counters to the counter's width to avoid spurious
    overflows when emulating counter events in software.
 
  - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined
    architectural behavior).
 
  - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to
    kick the guest out of emulated halt.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmUp1FESHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5IRsQAIsk+UwTP+q+ZzkpkSOJ+ocmKU97/GbW
 snB+F5FwNXnWEPzHIV+Ldv+WUpmHilTrylk2t5jLyew783TPxTnLmNAa+D3iSSBP
 jSGzCIqR2uRHOxhuJgkKvdOkfuS7vob1KcKrfOwKCSss78VhKGkMGIi66/81RTxo
 zxpzva+F2YtbCwKWXewOvR4CsWhjVqOGRTCmjF6t8PpFDGqwZdu0ornBHC2gvkUI
 iDHWVBg5Rz/akqxjEVL94SP5qdFSaVG+F3Z8xpnn+tfPncEK/xPFdGHGKwOy5Jvt
 4dQLc6TGmS2+NGPU3eAJOr+GZKryQth1CI+5RDlnoKQXjQ3laJwjmgyCRbUYLoZh
 /R7f5YJrhGheUvCCmagY1g2x41qp/CTG1RnX1SVTIGH9h+5LSVcCukCL9Tx2/B4v
 eU8nrzhUuijSqG6TiyAV5hvFqMQf3LWWcjSSW58kIWmXLpqdb/Xp6wiFHjOM7wZM
 c1br+6AwKZwKNdqn3/cnlBnLc+1jq/PWFnuF9svjKn5JTOyg8kddmyWUkDqiLOeZ
 /jqqwRJQUZppy4DxFHdkuQxnTsrztNzs/vhQtF6MIgFRULrs4FaiTUxuAs72skqm
 Fv/IIuyHWjST9HY8dgTx8PLqUevEc7zekmhN1Cj5KwhlHxKYWSZfew80CO7h2qhJ
 IvAC70QC+BsW
 =g8g3
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-pmu-6.6-fixes' of https://github.com/kvm-x86/linux into HEAD

KVM x86/pmu fixes for 6.6:

 - Truncate writes to PMU counters to the counter's width to avoid spurious
   overflows when emulating counter events in software.

 - Set the LVTPC entry mask bit when handling a PMI (to match Intel-defined
   architectural behavior).

 - Treat KVM_REQ_PMI as a wake event instead of queueing host IRQ work to
   kick the guest out of emulated halt.
2023-10-15 08:24:18 -04:00
Linus Torvalds dc9b2e683b Fix a false-positive KASAN warning, fix an AMD erratum on Zen4 CPUs,
and fix kernel-doc build warnings.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUrEOQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hG+g/+OUdKjz6u3rod9ZR0/u0koyXFsat2V+CD
 9OSUsDvff/7alRhztUoyj3g1t+brp1GuvyPTJ+KN3jJO08PHfQgm7eH873dguPSs
 k4o45eKglYuCb+9CQvOJggqOp7W2tvA1BW8pU/YdNgwxOARZFBeY6nkxW7oii9Ff
 TLRtzVfBAShYm7QK8HIae7DpPwqt1RcINcWRa4HX93/Gh364Q3Cywj+8DfARweWq
 J+47bgrMP5o+AHWiZs0AGsflXy0PYmrmFzh/d6iMqLSZuAhyG6heyzuThAimze5L
 5O89Dh/cPHTwQdY1pOAYezMVRNECecsrv6nlu9uTjCkvutopd1EB+WfZh2z1qaPB
 3t6JKliUSRUrckk/RtM8qIcb6VV7G0+6ez9n6EGo+9ci2On8afN2va16qo/rrwU7
 dfCyRLxaoLA4AZWjvtVSMHRlA4Iox+pH7AlBCeALIStHXZfaPJIjNkfyO5iT3Mug
 6KyI8QDpVvGfWz5GwE+4GaB6IKekcGOqROubK9X8mtNg+FhDU7xcIEM2f28etmYt
 kmKWD9yJdKiaW+bwJ3C2krUhtxQgW4wNjLrrJTU3m8tsc+7SwEKbUls+OHHEUyMD
 rE7X64fuL9E+VXyOg58bN4XbUpLm94z6E4IuBPsLthdKPsz9Lq6R2BZlpZQ4cVyG
 efglSGvEHnU=
 =5l6l
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
 "Fix a false-positive KASAN warning, fix an AMD erratum on Zen4 CPUs,
  and fix kernel-doc build warnings"

* tag 'x86-urgent-2023-10-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/alternatives: Disable KASAN in apply_alternatives()
  x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs
  x86/resctrl: Fix kernel-doc warnings
2023-10-14 15:32:20 -07:00
Linus Torvalds 23931d9353 Fix an LBR sampling bug.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUrDnYRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iDkw//TMHzE+uiDjzzMwR+Gi1o4UngrDQIX06k
 t9Bds36w3z0VuUAO/UFWt8+9gfZ+QngX3AGI6ti9OV+uMbl3R8SpOhZR6ppBoPvk
 272BosojR0rL5l6I9PYd4CVaGGi47R9BOlXYCqy7u+gJkDAr+Fz/HbwjR6rh2oys
 gJ5sJt6QoP4IBAIp2mLf5UR++AiTC+9WQgtDnAHDVphUEoaQLWs4qeVak2mHdotl
 vxRGMFnYD8pceGfPQjkUuxfSxcbRLAJa/TnxR6En5e02+jjaVW/JAfmMTmGTAqH1
 a0XsxgCxRlwVya6gYOYYW1eiJ7RLnthfmsCd4goKR+3tpxlBpODsQ7ZpwL/t6e4H
 ZWznZj6oLaM4eDKIA4WnkSDB0rlka/fjA4rbbCr86sgv3qKEGHj5foDfPsSLBbvU
 KkWrzf91orCI8cJYg3w7XeZMtrZ0hT9fRIVFNJjNfromvL4FHUvRdDxYTaLCuvc5
 FuSEmjxc6+xaFkwZcXeEgD5QPvHvpXgXe7R+m7+JEM+cIkpeqs1j147ufRwI7JRJ
 bQnBw+m33f9eD7xOC3Rl0N6gfl9luRzKWNxONLEfsWLZ6v8Pn1CK4yBH9qVAGBPn
 7BHiYr4dnhA0wwapx6Iz4NR4bnY4fUORrh4Zzn0gRi2lTk2J/6PJlhd+Oxpr30aG
 fxPojbcoZ64=
 =fFDw
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2023-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 perf event fix from Ingo Molnar:
 "Fix an LBR sampling bug"

* tag 'perf-urgent-2023-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/lbr: Filter vsyscall addresses
2023-10-14 15:09:55 -07:00
Brian Gerst 1a09a27153 x86/entry/32: Clean up syscall fast exit tests
Merge compat and native code and clarify comments.

No change in functionality expected.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231011224351.130935-4-brgerst@gmail.com
2023-10-13 13:05:28 +02:00
Brian Gerst 58978b44df x86/entry/64: Use TASK_SIZE_MAX for canonical RIP test
Using shifts to determine if an address is canonical is difficult for
the compiler to optimize when the virtual address width is variable
(LA57 feature) without using inline assembly.  Instead, compare RIP
against TASK_SIZE_MAX.  The only user executable address outside of that
range is the deprecated vsyscall page, which can fall back to using IRET.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231011224351.130935-3-brgerst@gmail.com
2023-10-13 13:05:28 +02:00
Brian Gerst ca282b486a x86/entry/64: Convert SYSRET validation tests to C
No change in functionality expected.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231011224351.130935-2-brgerst@gmail.com
2023-10-13 13:05:28 +02:00
Ingo Molnar 92fe9bb77b x86/apic, x86/hyperv: Use u32 in hv_snp_boot_ap() too
The data type for APIC IDs was standardized to 'u32' in the
following recent commit:

   db4a4086a2 ("x86/apic: Use u32 for wakeup_secondary_cpu[_64]()")

Which changed the function arguments type signature of the
apic->wakeup_secondary_cpu() APIC driver function.

Propagate this to hv_snp_boot_ap() as well, which also addresses a
'assignment from incompatible pointer type' build warning that triggers
under the -Werror=incompatible-pointer-types GCC warning.

Fixes: db4a4086a2 ("x86/apic: Use u32 for wakeup_secondary_cpu[_64]()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230814085113.233274223@linutronix.de
2023-10-13 12:26:58 +02:00
Dan Carpenter 7543365739 perf/x86/amd/uncore: Fix uninitialized return value in amd_uncore_init()
Some of the error paths in this function return don't initialize the
error code.  Return -ENODEV by default.

Fixes: d6389d3ccc ("perf/x86/amd/uncore: Refactor uncore management")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/cec62eba-c4b8-4cb7-9671-58894dd4b974@moroto.mountain
2023-10-13 09:32:50 +02:00
Jakub Kicinski 0e6bb5b7f4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

kernel/bpf/verifier.c
  829955981c ("bpf: Fix verifier log for async callback return values")
  a923819fb2 ("bpf: Treat first argument as return value for bpf_throw")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-12 17:07:34 -07:00
Kirill A. Shutemov d35652a5fc x86/alternatives: Disable KASAN in apply_alternatives()
Fei has reported that KASAN triggers during apply_alternatives() on
a 5-level paging machine:

	BUG: KASAN: out-of-bounds in rcu_is_watching()
	Read of size 4 at addr ff110003ee6419a0 by task swapper/0/0
	...
	__asan_load4()
	rcu_is_watching()
	trace_hardirqs_on()
	text_poke_early()
	apply_alternatives()
	...

On machines with 5-level paging, cpu_feature_enabled(X86_FEATURE_LA57)
gets patched. It includes KASAN code, where KASAN_SHADOW_START depends on
__VIRTUAL_MASK_SHIFT, which is defined with cpu_feature_enabled().

KASAN gets confused when apply_alternatives() patches the
KASAN_SHADOW_START users. A test patch that makes KASAN_SHADOW_START
static, by replacing __VIRTUAL_MASK_SHIFT with 56, works around the issue.

Fix it for real by disabling KASAN while the kernel is patching alternatives.

[ mingo: updated the changelog ]

Fixes: 6657fca06e ("x86/mm: Allow to boot without LA57 if CONFIG_X86_5LEVEL=y")
Reported-by: Fei Yang <fei.yang@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231012100424.1456-1-kirill.shutemov@linux.intel.com
2023-10-12 20:27:16 +02:00
Borislav Petkov deedec0a15 x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations
The comments introduced in <asm/msr-index.h> in the merge conflict fixup in:

  8f4156d587 ("Merge branch 'x86/urgent' into perf/core, to resolve conflict")

... aren't right: AMD naming schemes are more complex than implied,
family 0x17 is Zen1 and 2, family 0x19 is spread around Zen 3 and 4.

So there's indeed four separate MSR namespaces for:

  MSR_F17H_
  MSR_F19H_
  MSR_ZEN2_
  MSR_ZEN4_

... and the namespaces cannot be merged.

Fix it up. No change in functionality.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/D99589F4-BC5D-430B-87B2-72C20370CF57@exactcode.com
2023-10-12 20:10:39 +02:00
Lukas Bulwahn faed498d0d hardening: x86: drop reference to removed config AMD_IOMMU_V2
Commit 5a0b11a180 ("iommu/amd: Remove iommu_v2 module") removes the
config AMD_IOMMU_V2.

Remove the reference to this config in the x86 architecture-specific
hardening config fragment as well.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20231012045040.22088-1-lukas.bulwahn@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-10-12 09:08:57 -07:00
Tom Lendacky 3e93467346 KVM: SVM: Fix build error when using -Werror=unused-but-set-variable
Commit 916e3e5f26 ("KVM: SVM: Do not use user return MSR support for
virtualized TSC_AUX") introduced a local variable used for the rdmsr()
function for the high 32-bits of the MSR value. This variable is not used
after being set and triggers a warning or error, when treating warnings
as errors, when the unused-but-set-variable flag is set. Mark this
variable as __maybe_unused to fix this.

Fixes: 916e3e5f26 ("KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <0da9874b6e9fcbaaa5edeb345d7e2a7c859fc818.1696271334.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:09:36 -04:00
Maxim Levitsky 3fdc6087df x86: KVM: SVM: refresh AVIC inhibition in svm_leave_nested()
svm_leave_nested() similar to a nested VM exit, get the vCPU out of nested
mode and thus should end the local inhibition of AVIC on this vCPU.

Failure to do so, can lead to hangs on guest reboot.

Raise the KVM_REQ_APICV_UPDATE request to refresh the AVIC state of the
current vCPU in this case.

Fixes: f44509f849 ("KVM: x86: SVM: allow AVIC to co-exist with a nested guest running")
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928173354.217464-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:09:00 -04:00
Maxim Levitsky 2dcf37abf9 x86: KVM: SVM: add support for Invalid IPI Vector interception
In later revisions of AMD's APM, there is a new 'incomplete IPI' exit code:

"Invalid IPI Vector - The vector for the specified IPI was set to an
illegal value (VEC < 16)"

Note that tests on Zen2 machine show that this VM exit doesn't happen and
instead AVIC just does nothing.

Add support for this exit code by doing nothing, instead of filling
the kernel log with errors.

Also replace an unthrottled 'pr_err()' if another unknown incomplete
IPI exit happens with vcpu_unimpl()

(e.g in case AMD adds yet another 'Invalid IPI' exit reason)

Cc: <stable@vger.kernel.org>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928173354.217464-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:08:59 -04:00
Maxim Levitsky b65235f6e1 x86: KVM: SVM: always update the x2avic msr interception
The following problem exists since x2avic was enabled in the KVM:

svm_set_x2apic_msr_interception is called to enable the interception of
the x2apic msrs.

In particular it is called at the moment the guest resets its apic.

Assuming that the guest's apic was in x2apic mode, the reset will bring
it back to the xapic mode.

The svm_set_x2apic_msr_interception however has an erroneous check for
'!apic_x2apic_mode()' which prevents it from doing anything in this case.

As a result of this, all x2apic msrs are left unintercepted, and that
exposes the bare metal x2apic (if enabled) to the guest.
Oops.

Remove the erroneous '!apic_x2apic_mode()' check to fix that.

This fixes CVE-2023-5090

Fixes: 4d1d7942e3 ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode")
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928173354.217464-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:08:59 -04:00
Sean Christopherson 8647c52e95 KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
Mask off xfeatures that aren't exposed to the guest only when saving guest
state via KVM_GET_XSAVE{2} instead of modifying user_xfeatures directly.
Preserving the maximal set of xfeatures in user_xfeatures restores KVM's
ABI for KVM_SET_XSAVE, which prior to commit ad856280dd ("x86/kvm/fpu:
Limit guest user_xfeatures to supported bits of XCR0") allowed userspace
to load xfeatures that are supported by the host, irrespective of what
xfeatures are exposed to the guest.

There is no known use case where userspace *intentionally* loads xfeatures
that aren't exposed to the guest, but the bug fixed by commit ad856280dd
was specifically that KVM_GET_SAVE{2} would save xfeatures that weren't
exposed to the guest, e.g. would lead to userspace unintentionally loading
guest-unsupported xfeatures when live migrating a VM.

Restricting KVM_SET_XSAVE to guest-supported xfeatures is especially
problematic for QEMU-based setups, as QEMU has a bug where instead of
terminating the VM if KVM_SET_XSAVE fails, QEMU instead simply stops
loading guest state, i.e. resumes the guest after live migration with
incomplete guest state, and ultimately results in guest data corruption.

Note, letting userspace restore all host-supported xfeatures does not fix
setups where a VM is migrated from a host *without* commit ad856280dd,
to a target with a subset of host-supported xfeatures.  However there is
no way to safely address that scenario, e.g. KVM could silently drop the
unsupported features, but that would be a clear violation of KVM's ABI and
so would require userspace to opt-in, at which point userspace could
simply be updated to sanitize the to-be-loaded XSAVE state.

Reported-by: Tyler Stachecki <stachecki.tyler@gmail.com>
Closes: https://lore.kernel.org/all/20230914010003.358162-1-tstachecki@bloomberg.net
Fixes: ad856280dd ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Message-Id: <20230928001956.924301-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:08:58 -04:00
Sean Christopherson 18164f66e6 x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer
Plumb an xfeatures mask into __copy_xstate_to_uabi_buf() so that KVM can
constrain which xfeatures are saved into the userspace buffer without
having to modify the user_xfeatures field in KVM's guest_fpu state.

KVM's ABI for KVM_GET_XSAVE{2} is that features that are not exposed to
guest must not show up in the effective xstate_bv field of the buffer.
Saving only the guest-supported xfeatures allows userspace to load the
saved state on a different host with a fewer xfeatures, so long as the
target host supports the xfeatures that are exposed to the guest.

KVM currently sets user_xfeatures directly to restrict KVM_GET_XSAVE{2} to
the set of guest-supported xfeatures, but doing so broke KVM's historical
ABI for KVM_SET_XSAVE, which allows userspace to load any xfeatures that
are supported by the *host*.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230928001956.924301-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-10-12 11:08:58 -04:00
Suma Hegde 5150542b8e
platform/x86/amd/hsmp: add support for metrics tbl
AMD MI300 MCM provides GET_METRICS_TABLE message to retrieve
all the system management information from SMU.

The metrics table is made available as hexadecimal sysfs binary file
under per socket sysfs directory created at
/sys/devices/platform/amd_hsmp/socket%d/metrics_bin

Metrics table definitions will be documented as part of Public PPR.
The same is defined in the amd_hsmp.h header.

Signed-off-by: Suma Hegde <suma.hegde@amd.com>
Reviewed-by: Naveen Krishna Chatradhi <nchatrad@amd.com>
Link: https://lore.kernel.org/r/20231010120310.3464066-2-suma.hegde@amd.com
[ij: lseek -> lseek(), dram -> DRAM in dev_err()]
[ij: added period to terminate a documentation sentence]
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2023-10-12 16:29:58 +03:00
Peter Zijlstra f577cd57bf sched/topology: Rename 'DIE' domain to 'PKG'
While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain
for the package mask. :-)

Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the
name less ambiguous.

[ Shrikanth Hegde: rename on s390 as well. ]
[ Valentin Schneider: also rename it in the comments. ]
[ mingo: port to recent kernels & find all remaining occurances. ]

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net
2023-10-12 09:38:16 +02:00
Thomas Zimmermann 052ddf7b86 fbdev: Replace fb_pgprotect() with pgprot_framebuffer()
Rename the fbdev mmap helper fb_pgprotect() to pgprot_framebuffer().
The helper sets VMA page-access flags for framebuffers in device I/O
memory.

Also clean up the helper's parameters and return value. Instead of
the VMA instance, pass the individial parameters separately: existing
page-access flags, the VMAs start and end addresses and the offset
in the underlying device memory rsp file. Return the new page-access
flags. These changes align pgprot_framebuffer() with other pgprot_()
functions.

v4:
	* fix commit message (Christophe)
v3:
	* rename fb_pgprotect() to pgprot_framebuffer() (Arnd)

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230922080636.26762-3-tzimmermann@suse.de
2023-10-12 09:20:46 +02:00
Paul E. McKenney f44075ecaf x86/nmi: Fix out-of-order NMI nesting checks & false positive warning
The ->idt_seq and ->recv_jiffies variables added by:

  1a3ea611fc ("x86/nmi: Accumulate NMI-progress evidence in exc_nmi()")

... place the exit-time check of the bottom bit of ->idt_seq after the
this_cpu_dec_return() that re-enables NMI nesting.  This can result in
the following sequence of events on a given CPU in kernels built with
CONFIG_NMI_CHECK_CPU=y:

  o   An NMI arrives, and ->idt_seq is incremented to an odd number.
      In addition, nmi_state is set to NMI_EXECUTING==1.

  o   The NMI is processed.

  o   The this_cpu_dec_return(nmi_state) zeroes nmi_state and returns
      NMI_EXECUTING==1, thus opting out of the "goto nmi_restart".

  o   Another NMI arrives and ->idt_seq is incremented to an even
      number, triggering the warning.  But all is just fine, at least
      assuming we don't get so many closely spaced NMIs that the stack
      overflows or some such.

Experience on the fleet indicates that the MTBF of this false positive
is about 70 years.  Or, for those who are not quite that patient, the
MTBF appears to be about one per week per 4,000 systems.

Fix this false-positive warning by moving the "nmi_restart" label before
the initial ->idt_seq increment/check and moving the this_cpu_dec_return()
to follow the final ->idt_seq increment/check.  This way, all nested NMIs
that get past the NMI_NOT_RUNNING check get a clean ->idt_seq slate.
And if they don't get past that check, they will set nmi_state to
NMI_LATCHED, which will cause the this_cpu_dec_return(nmi_state)
to restart.

Fixes: 1a3ea611fc ("x86/nmi: Accumulate NMI-progress evidence in exc_nmi()")
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/0cbff831-6e3d-431c-9830-ee65ee7787ff@paulmck-laptop
2023-10-12 08:35:15 +02:00
Lu Yao 441ccc3512 x86/msi: Fix compile error caused by CONFIG_GENERIC_MSI_IRQ=y && !CONFIG_X86_LOCAL_APIC
When compiling the x86 kernel, if X86_LOCAL_APIC is not enabled but
GENERIC_MSI_IRQ is selected in '.config', the following compilation
error will occur:

  include/linux/gpio/driver.h:38:19: error:
    field 'msiinfo' has incomplete type

  kernel/irq/msi.c:752:5: error: invalid use of incomplete typedef
    'msi_alloc_info_t' {aka 'struct irq_alloc_info'}

  kernel/irq/msi.c:740:1: error: control reaches end of non-void function

This is because file such as 'kernel/irq/msi.c' only depends on
'GENERIC_MSI_IRQ', and uses 'struct msi_alloc_info_t'. However,
this struct depends on 'X86_LOCAL_APIC'.

When enable 'GENERIC_MSI_IRQ' or 'X86_LOCAL_APIC' will select
'IRQ_DOMAIN_HIERARCHY', so exposing this struct using
'IRQ_DOMAIN_HIERARCHY' rather than 'X86_LOCAL_APIC'.

Under the above conditions, if 'HPET_TIMER' is selected, the following
compilation error will occur:

  arch/x86/kernel/hpet.c:550:13: error: ‘x86_vector_domain’ undeclared

  arch/x86/kernel/hpet.c:600:9: error: implicit declaration of
    function ‘init_irq_alloc_info’

This is because 'x86_vector_domain' is defined in 'kernel/apic/vector.c'
which is compiled only when 'X86_LOCAL_APIC' is enabled. Besides,
function 'msi_domain_set_affinity' is defined in 'include/linux/msi.h'
which depends on 'GENERIC_MSI_IRQ'. So use 'X86_LOCAL_APIC' and
'GENERIC_MSI_IRQ' to expose these code.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Lu Yao <yaolu@kylinos.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231012032659.323251-1-yaolu@kylinos.cn
2023-10-12 08:13:27 +02:00
Ingo Molnar 8f4156d587 Merge branch 'x86/urgent' into perf/core, to resolve conflict
Resolve an MSR enumeration conflict.

Conflicts:
	arch/x86/include/asm/msr-index.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-11 22:54:03 +02:00
Fenghua Yu 4dba8f10b8 x86/resctrl: Add sparse_masks file in info
Add the interface in resctrl FS to show if sparse cache allocation
bit masks are supported on the platform. Reading the file returns
either a "1" if non-contiguous 1s are supported and "0" otherwise.
The file path is /sys/fs/resctrl/info/{resource}/sparse_masks, where
{resource} can be either "L2" or "L3".

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Peter Newman <peternewman@google.com>
Link: https://lore.kernel.org/r/7300535160beba41fd8aa073749ec1ee29b4621f.1696934091.git.maciej.wieczor-retman@intel.com
2023-10-11 21:51:24 +02:00
Maciej Wieczor-Retman 0e3cd31f6e x86/resctrl: Enable non-contiguous CBMs in Intel CAT
The setting for non-contiguous 1s support in Intel CAT is
hardcoded to false. On these systems, writing non-contiguous
1s into the schemata file will fail before resctrl passes
the value to the hardware.

In Intel CAT CPUID.0x10.1:ECX[3] and CPUID.0x10.2:ECX[3] stopped
being reserved and now carry information about non-contiguous 1s
value support for L3 and L2 cache respectively. The CAT
capacity bitmask (CBM) supports a non-contiguous 1s value if
the bit is set.

The exception are Haswell systems where non-contiguous 1s value
support needs to stay disabled since they can't make use of CPUID
for Cache allocation.

Originally-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Peter Newman <peternewman@google.com>
Link: https://lore.kernel.org/r/1849b487256fe4de40b30f88450cba3d9abc9171.1696934091.git.maciej.wieczor-retman@intel.com
2023-10-11 21:48:52 +02:00
Maciej Wieczor-Retman 39c6eed1f6 x86/resctrl: Rename arch_has_sparse_bitmaps
Rename arch_has_sparse_bitmaps to arch_has_sparse_bitmasks to ensure
consistent terminology throughout resctrl.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Peter Newman <peternewman@google.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Babu Moger <babu.moger@amd.com>
Tested-by: Peter Newman <peternewman@google.com>
Link: https://lore.kernel.org/r/e330fcdae873ef1a831e707025a4b70fa346666e.1696934091.git.maciej.wieczor-retman@intel.com
2023-10-11 19:43:43 +02:00
Russell King (Oracle) c4dd854f74 cpu-hotplug: Provide prototypes for arch CPU registration
Provide common prototypes for arch_register_cpu() and
arch_unregister_cpu(). These are called by acpi_processor.c, with weak
versions, so the prototype for this is already set. It is generally not
necessary for function prototypes to be conditional on preprocessor macros.

Some architectures (e.g. Loongarch) are missing the prototype for this, and
rather than add it to Loongarch's asm/cpu.h, do the job once for everyone.

Since this covers everyone, remove the now unnecessary prototypes in
asm/cpu.h, and therefore remove the 'static' from one of ia64's
arch_register_cpu() definitions.

[ tglx: Bring back the ia64 part and remove the ACPI prototypes ]

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/E1qkoRr-0088Q8-Da@rmk-PC.armlinux.org.uk
2023-10-11 14:27:37 +02:00
Borislav Petkov (AMD) f454b18e07 x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs
Fix erratum #1485 on Zen4 parts where running with STIBP disabled can
cause an #UD exception. The performance impact of the fix is negligible.

Reported-by: René Rebe <rene@exactcode.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: René Rebe <rene@exactcode.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/D99589F4-BC5D-430B-87B2-72C20370CF57@exactcode.com
2023-10-11 11:00:11 +02:00
Alexander Shishkin d6f274b7c8 x86/sev: Drop unneeded #include
Commit:

  20f07a044a ("x86/sev: Move common memory encryption code to mem_encrypt.c")

... forgot to remove the include of virtio_config.h from mem_encrypt_amd.c
when it moved the related code to mem_encrypt.c (from where this include
subsequently got removed by a later commit).

Remove it now.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20231010145220.3960055-3-alexander.shishkin@linux.intel.com
2023-10-11 10:15:47 +02:00
Alexander Shishkin 6e74b12515 x86/sev: Move sev_setup_arch() to mem_encrypt.c
Since commit:

  4d96f91091 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()")

... the SWIOTLB bounce buffer size adjustment and restricted virtio memory
setting also inadvertently apply to TDX: the code is using
cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) as a gatekeeping condition,
which is also true for TDX, and this is also what we want.

To reflect this, move the corresponding code to generic mem_encrypt.c.

No functional changes intended.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20231010145220.3960055-2-alexander.shishkin@linux.intel.com
2023-10-11 10:15:47 +02:00
Maciej Wieczor-Retman f05fd4ce99 x86/resctrl: Fix remaining kernel-doc warnings
The kernel test robot reported kernel-doc warnings here:

  arch/x86/kernel/cpu/resctrl/rdtgroup.c:915: warning: Function parameter or member 'of' not described in 'rdt_bit_usage_show'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:915: warning: Function parameter or member 'seq' not described in 'rdt_bit_usage_show'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:915: warning: Function parameter or member 'v' not described in 'rdt_bit_usage_show'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:1144: warning: Function parameter or member 'type' not described in '__rdtgroup_cbm_overlaps'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:1224: warning: Function parameter or member 'rdtgrp' not described in 'rdtgroup_mode_test_exclusive'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'of' not described in 'rdtgroup_mode_write'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'buf' not described in 'rdtgroup_mode_write'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'nbytes' not described in 'rdtgroup_mode_write'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:1261: warning: Function parameter or member 'off' not described in 'rdtgroup_mode_write'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:1370: warning: Function parameter or member 'of' not described in 'rdtgroup_size_show'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:1370: warning: Function parameter or member 's' not described in 'rdtgroup_size_show'
  arch/x86/kernel/cpu/resctrl/rdtgroup.c:1370: warning: Function parameter or member 'v' not described in 'rdtgroup_size_show'

The first two functions are missing an argument description while the
other three are file callbacks and don't require a kernel-doc comment.

Closes: https://lore.kernel.org/oe-kbuild-all/202310070434.mD8eRNAz-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Newman <peternewman@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/r/20231011064843.246592-1-maciej.wieczor-retman@intel.com
2023-10-11 09:44:41 +02:00
Yan Zhao c9f65a3f2d KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
For KVM_X86_QUIRK_CD_NW_CLEARED is on, remove the IPAT (ignore PAT) bit in
EPT memory types when cache is disabled and non-coherent DMA are present.

To correctly emulate CR0.CD=1, UC + IPAT are required as memtype in EPT.
However, as with commit fb279950ba ("KVM: vmx: obey
KVM_QUIRK_CD_NW_CLEARED"), WB + IPAT are now returned to workaround a BIOS
issue that guest MTRRs are enabled too late. Without this workaround, a
super slow guest boot-up is expected during the pre-guest-MTRR-enabled
period due to UC as the effective memory type for all guest memory.

Absent emulating CR0.CD=1 with UC, it makes no sense to set IPAT when KVM
is honoring the guest memtype.
Removing the IPAT bit in this patch allows effective memory type to honor
PAT values as well, as WB is the weakest memtype. It means if a guest
explicitly claims UC as the memtype in PAT, the effective memory is UC
instead of previous WB. If, for some unknown reason, a guest meets a slow
boot-up issue with the removal of IPAT, it's desired to fix the blamed PAT
in the guest.

Returning guest MTRR type as if CR0.CD=0 is also not preferred because
KVMs ABI for the quirk also requires KVM to force WB memtype regardless of
guest MTRRs to workaround the slow guest boot-up issue.

In the future, honoring guest PAT will also allow KVM to more precisely
zap SPTEs when the effective memtype changes.  E.g. by not forcing WB when
CR0.CD=1, instead of zapping SPTEs when guest MTRRs change, KVM can skip
MTRR-induced zaps if CR0.CD=1 and zap SPTEs for non-WB MTRR ranges when
CR0.CD is toggled (WB MTRR SPTEs can be kept because they're WB regardless
of CR0.CD).

The change of removing IPAT has been verified with normal boot-up time
on old OVMF of commit c9e5618f84b0cb54a9ac2d7604f7b7e7859b45a7 as well,
dated back to Apr 14 2015.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20230714065326.20557-1-yan.y.zhao@intel.com
[sean: massage changelog to apply patch without full series]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-10 17:04:42 -07:00
Yan Zhao 362ff6dca5 KVM: x86/mmu: Zap KVM TDP when noncoherent DMA assignment starts/stops
Zap KVM TDP when noncoherent DMA assignment starts (noncoherent dma count
transitions from 0 to 1) or stops (noncoherent dma count transitions
from 1 to 0). Before the zap, test if guest MTRR is to be honored after
the assignment starts or was honored before the assignment stops.

When there's no noncoherent DMA device, EPT memory type is
((MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT)

When there're noncoherent DMA devices, EPT memory type needs to honor
guest CR0.CD and MTRR settings.

So, if noncoherent DMA count transitions between 0 and 1, EPT leaf entries
need to be zapped to clear stale memory type.

This issue might be hidden when the device is statically assigned with
VFIO adding/removing MMIO regions of the noncoherent DMA devices for
several times during guest boot, and current KVM MMU will call
kvm_mmu_zap_all_fast() on the memslot removal.

But if the device is hot-plugged, or if the guest has mmio_always_on for
the device, the MMIO regions of it may only be added for once, then there's
no path to do the EPT entries zapping to clear stale memory type.

Therefore do the EPT zapping when noncoherent assignment starts/stops to
ensure stale entries cleaned away.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20230714065223.20432-1-yan.y.zhao@intel.com
[sean: fix misspelled words in comment and changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-10 17:04:38 -07:00
Joel Granados ecd5d66383 x86/vdso: Remove now superfluous sentinel element from ctl_table array
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel element from abi_table2. This removal is safe because
register_sysctl implicitly uses ARRAY_SIZE() in addition to checking for
the sentinel.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-10-10 15:22:02 -07:00
Joel Granados 83e291d3f5 arch/x86: Remove now superfluous sentinel elem from ctl_table arrays
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove sentinel element from sld_sysctl and itmt_kern_table. This
removal is safe because register_sysctl_init and register_sysctl
implicitly use the array size in addition to checking for the sentinel.

Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for x86
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-10-10 15:22:02 -07:00
Linus Torvalds b711538a40 hyperv-fixes for v6.6-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmUk4fcTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXhhqCACWsBYTB0EJ3oMJnzfnHeuN418ZDx/O
 AL0k0O5MT6roEFmvGUhzJ/jsoxL+W+Wj3aFwzReyOSQpgjTTF/Ja26LPvxRzDxKi
 sZPojnR2ykW31l7y+eh1p9qSM/aYvTMDP5zO7L1fBnWMAGMv8w8RezpCJ7bh4BgA
 FTMZZrvKYVT9hCGkYqKUZGBtDTPZ56WE+MCiRxTWQvF+4QKaIff0tpno8V7203bE
 D/b4+Ouh19RXFTC5dUq/0JtAdV2AadrPHnScUupc8Hk/MMFiU5CzvH4bAqiwXBcU
 YqqlD3kZbIqqbKE93+03jvyrRDvDGlq+rpA3KMk5MBAfrkM4DytpWvMs
 =SVq1
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - fixes for Hyper-V VTL code (Saurabh Sengar and Olaf Hering)

 - fix hv_kvp_daemon to support keyfile based connection profile
   (Shradha Gupta)

* tag 'hyperv-fixes-signed-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hv/hv_kvp_daemon:Support for keyfile based connection profile
  hyperv: reduce size of ms_hyperv_info
  x86/hyperv: Add common print prefix "Hyper-V" in hv_init
  x86/hyperv: Remove hv_vtl_early_init initcall
  x86/hyperv: Restrict get_vtl to only VTL platforms
2023-10-10 11:01:21 -07:00
Thomas Gleixner 48525fd1ea x86/cpu: Provide debug interface
Provide debug files which dump the topology related information of
cpuinfo_x86. This is useful to validate the upcoming conversion of the
topology evaluation for correctness or bug compatibility.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.353191313@linutronix.de
2023-10-10 14:38:19 +02:00
Thomas Gleixner 90781f0c4c x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids
Per CPU cpuinfo is used to persist the logical package and die IDs. That's
really not the right place simply because cpuinfo is subject to be
reinitialized when a CPU goes through an offline/online cycle.

This works by chance today, but that's far from correct and neither obvious
nor documented.

Add a per cpu datastructure which persists those logical IDs, which allows
to cleanup the CPUID evaluation code.

This is a temporary workaround until the larger topology management is in
place, which makes all of this logical management mechanics obsolete.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.292947071@linutronix.de
2023-10-10 14:38:19 +02:00
Thomas Gleixner db4a4086a2 x86/apic: Use u32 for wakeup_secondary_cpu[_64]()
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.233274223@linutronix.de
2023-10-10 14:38:19 +02:00
Thomas Gleixner 59f7928cd4 x86/apic: Use u32 for [gs]et_apic_id()
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.172569282@linutronix.de
2023-10-10 14:38:19 +02:00
Thomas Gleixner 01ccf9bbd2 x86/apic: Use u32 for phys_pkg_id()
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width even if that callback going to be removed soonish.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.113097126@linutronix.de
2023-10-10 14:38:19 +02:00
Thomas Gleixner 8aa2a4178d x86/apic: Use u32 for cpu_present_to_apicid()
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width and fixup a few related usage sites for consistency sake.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085113.054064391@linutronix.de
2023-10-10 14:38:19 +02:00
Thomas Gleixner 5d376b8fb1 x86/apic: Use u32 for check_apicid_used()
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width and move the default implementation to local.h as there are
no users outside the apic directory.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.981956102@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner 4705243d23 x86/apic: Use u32 for APIC IDs in global data
APIC IDs are used with random data types u16, u32, int, unsigned int,
unsigned long.

Make it all consistently use u32 because that reflects the hardware
register width and fixup the most obvious usage sites of that.

The APIC callbacks will be addressed separately.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.922905727@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner 9ff4275bc8 x86/apic: Use BAD_APICID consistently
APIC ID checks compare with BAD_APICID all over the place, but some
initializers and some code which fiddles with global data structure use
-1[U] instead. That simply cannot work at all.

Fix it up and use BAD_APICID consistently all over the place.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.862835121@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner 6e29032340 x86/cpu: Move cpu_l[l2]c_id into topology info
The topology IDs which identify the LLC and L2 domains clearly belong to
the per CPU topology information.

Move them into cpuinfo_x86::cpuinfo_topo and get rid of the extra per CPU
data and the related exports.

This also paves the way to do proper topology evaluation during early boot
because it removes the only per CPU dependency for that.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.803864641@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner 22dc963162 x86/cpu: Move logical package and die IDs into topology info
Yet another topology related data pair. Rename logical_proc_id to
logical_pkg_id so it fits the common naming conventions.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.745139505@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner 594957d723 x86/cpu: Remove pointless evaluation of x86_coreid_bits
cpuinfo_x86::x86_coreid_bits is only used by the AMD numa topology code. No
point in evaluating it on non AMD systems.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.687588373@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner e3c0c5d52a x86/cpu: Move cu_id into topology info
No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.628405546@linutronix.de
2023-10-10 14:38:18 +02:00
Thomas Gleixner e95256335d x86/cpu: Move cpu_core_id into topology info
Rename it to core_id and stick it to the other ID fields.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.566519388@linutronix.de
2023-10-10 14:38:17 +02:00
Thomas Gleixner 8a169ed40f x86/cpu: Move cpu_die_id into topology info
Move the next member.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.388185134@linutronix.de
2023-10-10 14:38:17 +02:00
Thomas Gleixner 02fb601d27 x86/cpu: Move phys_proc_id into topology info
Rename it to pkg_id which is the terminology used in the kernel.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.329006989@linutronix.de
2023-10-10 14:38:17 +02:00
Thomas Gleixner b9655e702d x86/cpu: Encapsulate topology information in cpuinfo_x86
The topology related information is randomly scattered across cpuinfo_x86.

Create a new structure cpuinfo_topo and move in a first step initial_apicid
and apicid into it.

Aside of being better readable this is in preparation for replacing the
horribly fragile CPU topology evaluation code further down the road.

Consolidate APIC ID fields to u32 as that represents the hardware type.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.269787744@linutronix.de
2023-10-10 14:38:17 +02:00
Thomas Gleixner 965e05ff8a x86/apic: Fake primary thread mask for XEN/PV
The SMT control mechanism got added as speculation attack vector
mitigation. The implemented logic relies on the primary thread mask to
be set up properly.

This turns out to be an issue with XEN/PV guests because their CPU hotplug
mechanics do not enumerate APICs and therefore the mask is never correctly
populated.

This went unnoticed so far because by chance XEN/PV ends up with
smp_num_siblings == 2. So cpu_smt_control stays at its default value
CPU_SMT_ENABLED and the primary thread mask is never evaluated in the
context of CPU hotplug.

This stopped "working" with the upcoming overhaul of the topology
evaluation which legitimately provides a fake topology for XEN/PV. That
sets smp_num_siblings to 1, which causes the core CPU hot-plug core to
refuse to bring up the APs.

This happens because cpu_smt_control is set to CPU_SMT_NOT_SUPPORTED which
causes cpu_bootable() to evaluate the unpopulated primary thread mask with
the conclusion that all non-boot CPUs are not valid to be plugged.

The core code has already been made more robust against this kind of fail,
but the primary thread mask really wants to be populated to avoid other
issues all over the place.

Just fake the mask by pretending that all XEN/PV vCPUs are primary threads,
which is consistent because all of XEN/PVs topology is fake or non-existent.

Fixes: 6a4d2657e0 ("x86/smp: Provide topology_is_primary_thread()")
Fixes: f54d4434c2 ("x86/apic: Provide cpu_primary_thread mask")
Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Juergen Gross <jgross@suse.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230814085112.210011520@linutronix.de
2023-10-10 14:38:17 +02:00
Pu Wen ee545b94d3 x86/cpu/hygon: Fix the CPU topology evaluation for real
Hygon processors with a model ID > 3 have CPUID leaf 0xB correctly
populated and don't need the fixed package ID shift workaround. The fixup
is also incorrect when running in a guest.

Fixes: e0ceeae708 ("x86/CPU/hygon: Fix phys_proc_id calculation logic for multi-die processors")
Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/tencent_594804A808BD93A4EBF50A994F228E3A7F07@qq.com
Link: https://lore.kernel.org/r/20230814085112.089607918@linutronix.de
2023-10-10 14:38:16 +02:00
Like Xu bf328e22e4 KVM: x86: Don't sync user-written TSC against startup values
The legacy API for setting the TSC is fundamentally broken, and only
allows userspace to set a TSC "now", without any way to account for
time lost between the calculation of the value, and the kernel eventually
handling the ioctl.

To work around this, KVM has a hack which, if a TSC is set with a value
which is within a second's worth of the last TSC "written" to any vCPU in
the VM, assumes that userspace actually intended the two TSC values to be
in sync and adjusts the newly-written TSC value accordingly.

Thus, when a VMM restores a guest after suspend or migration using the
legacy API, the TSCs aren't necessarily *right*, but at least they're
in sync.

This trick falls down when restoring a guest which genuinely has been
running for less time than the 1 second of imprecision KVM allows for in
in the legacy API.  On *creation*, the first vCPU starts its TSC counting
from zero, and the subsequent vCPUs synchronize to that.  But then when
the VMM tries to restore a vCPU's intended TSC, because the VM has been
alive for less than 1 second and KVM's default TSC value for new vCPU's is
'0', the intended TSC is within a second of the last "written" TSC and KVM
incorrectly adjusts the intended TSC in an attempt to synchronize.

But further hacks can be piled onto KVM's existing hackish ABI, and
declare that the *first* value written by *userspace* (on any vCPU)
should not be subject to this "correction", i.e. KVM can assume that the
first write from userspace is not an attempt to sync up with TSC values
that only come from the kernel's default vCPU creation.

To that end: Add a flag, kvm->arch.user_set_tsc, protected by
kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in
the VM *has* been set by userspace, and make the 1-second slop hack only
trigger if user_set_tsc is already set.

Note that userspace can explicitly request a *synchronization* of the
TSC by writing zero. For the purpose of user_set_tsc, an explicit
synchronization counts as "setting" the TSC, i.e. if userspace then
subsequently writes an explicit non-zero value which happens to be within
1 second of the previous value, the new value will be "corrected".  This
behavior is deliberate, as treating explicit synchronization as "setting"
the TSC preserves KVM's existing behaviour inasmuch as possible (KVM
always applied the 1-second "correction" regardless of whether the write
came from userspace vs. the kernel).

Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Original-by: Oliver Upton <oliver.upton@linux.dev>
Original-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Tested-by: Yong He <alexyonghe@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231008025335.7419-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-09 17:29:52 -07:00
Yan Zhao 9a3768191d KVM: x86/mmu: Zap SPTEs on MTRR update iff guest MTRRs are honored
When guest MTRRs are updated, zap SPTEs and do zap range calcluation if and
only if KVM's MMU is honoring guest MTRRs, which is the only time that KVM
incorporates the guest's MTRR type into the final memtype.

Suggested-by: Chao Gao <chao.gao@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20230714065156.20375-1-yan.y.zhao@intel.com
[sean: rephrase shortlog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-09 14:35:14 -07:00
Yan Zhao 7a18c7c2b6 KVM: x86/mmu: Zap SPTEs when CR0.CD is toggled iff guest MTRRs are honored
Zap SPTEs when CR0.CD is toggled if and only if KVM's MMU is honoring
guest MTRRs, which is the only time that KVM incorporates the guest's
CR0.CD into the final memtype.

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20230714065122.20315-1-yan.y.zhao@intel.com
[sean: rephrase shortlog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-09 14:35:13 -07:00
Yan Zhao 1affe455d6 KVM: x86/mmu: Add helpers to return if KVM honors guest MTRRs
Add helpers to check if KVM honors guest MTRRs instead of open coding the
logic in kvm_tdp_page_fault().  Future fixes and cleanups will also need
to determine if KVM should honor guest MTRRs, e.g. for CR0.CD toggling and
and non-coherent DMA transitions.

Provide an inner helper, __kvm_mmu_honors_guest_mtrrs(), so that KVM can
check if guest MTRRs were honored when stopping non-coherent DMA.

Note, there is no need to explicitly check that TDP is enabled, KVM clears
shadow_memtype_mask when TDP is disabled, i.e. it's non-zero if and only
if EPT is enabled.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20230714065006.20201-1-yan.y.zhao@intel.com
Link: https://lore.kernel.org/r/20230714065043.20258-1-yan.y.zhao@intel.com
[sean: squash into a one patch, drop explicit TDP check massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-09 14:34:57 -07:00
Jim Mattson 8b0e00fba9 KVM: x86: Virtualize HWCR.TscFreqSel[bit 24]
On certain CPUs, Linux guests expect HWCR.TscFreqSel[bit 24] to be
set. If it isn't set, they complain:
	[Firmware Bug]: TSC doesn't count with P0 frequency!

Allow userspace (and the guest) to set this bit in the virtual HWCR to
eliminate the above complaint.

Allow the guest to write the bit even though its is R/O on *some* CPUs.
Like many bits in HWRC, TscFreqSel is not architectural at all. On Family
10h[1], it was R/W and powered on as 0. In Family 15h, one of the "changes
relative to Family 10H Revision D processors[2] was:

  • MSRC001_0015 [Hardware Configuration (HWCR)]:
  • Dropped TscFreqSel; TSC can no longer be selected to run at NB P0-state.

Despite the "Dropped" above, that same document later describes
HWCR[bit 24] as follows:

  TscFreqSel: TSC frequency select. Read-only. Reset: 1. 1=The TSC
  increments at the P0 frequency

If the guest clears the bit, the worst case scenario is the guest will be
no worse off than it is today, e.g. the whining may return after a guest
clears the bit and kexec()'s into a new kernel.

[1] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/31116.pdf
[2] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/42301_15h_Mod_00h-0Fh_BKDG.pdf,
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20230929230246.1954854-3-jmattson@google.com
[sean: elaborate on why the bit is writable by the guest]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-09 12:36:16 -07:00
Jim Mattson 598a790fc2 KVM: x86: Allow HWCR.McStatusWrEn to be cleared once set
When HWCR is set to 0, store 0 in vcpu->arch.msr_hwcr.

Fixes: 191c8137a9 ("x86/kvm: Implement HWCR support")
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20230929230246.1954854-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-09 12:36:15 -07:00
Uros Bizjak 636d6a8b85 locking/atomic/x86: Introduce arch_sync_try_cmpxchg()
Introduce the arch_sync_try_cmpxchg() macro to improve code using
sync_try_cmpxchg() locking primitive. The new definitions use existing
__raw_try_cmpxchg() macros, but use its own "lock; " prefix.

The new macros improve assembly of the cmpxchg loop in
evtchn_fifo_unmask() from drivers/xen/events/events_fifo.c from:

 57a:	85 c0                	test   %eax,%eax
 57c:	78 52                	js     5d0 <...>
 57e:	89 c1                	mov    %eax,%ecx
 580:	25 ff ff ff af       	and    $0xafffffff,%eax
 585:	c7 04 24 00 00 00 00 	movl   $0x0,(%rsp)
 58c:	81 e1 ff ff ff ef    	and    $0xefffffff,%ecx
 592:	89 4c 24 04          	mov    %ecx,0x4(%rsp)
 596:	89 44 24 08          	mov    %eax,0x8(%rsp)
 59a:	8b 74 24 08          	mov    0x8(%rsp),%esi
 59e:	8b 44 24 04          	mov    0x4(%rsp),%eax
 5a2:	f0 0f b1 32          	lock cmpxchg %esi,(%rdx)
 5a6:	89 04 24             	mov    %eax,(%rsp)
 5a9:	8b 04 24             	mov    (%rsp),%eax
 5ac:	39 c1                	cmp    %eax,%ecx
 5ae:	74 07                	je     5b7 <...>
 5b0:	a9 00 00 00 40       	test   $0x40000000,%eax
 5b5:	75 c3                	jne    57a <...>
 <...>

to:

 578:	a9 00 00 00 40       	test   $0x40000000,%eax
 57d:	74 2b                	je     5aa <...>
 57f:	85 c0                	test   %eax,%eax
 581:	78 40                	js     5c3 <...>
 583:	89 c1                	mov    %eax,%ecx
 585:	25 ff ff ff af       	and    $0xafffffff,%eax
 58a:	81 e1 ff ff ff ef    	and    $0xefffffff,%ecx
 590:	89 4c 24 04          	mov    %ecx,0x4(%rsp)
 594:	89 44 24 08          	mov    %eax,0x8(%rsp)
 598:	8b 4c 24 08          	mov    0x8(%rsp),%ecx
 59c:	8b 44 24 04          	mov    0x4(%rsp),%eax
 5a0:	f0 0f b1 0a          	lock cmpxchg %ecx,(%rdx)
 5a4:	89 44 24 04          	mov    %eax,0x4(%rsp)
 5a8:	75 30                	jne    5da <...>
 <...>
 5da:	8b 44 24 04          	mov    0x4(%rsp),%eax
 5de:	eb 98                	jmp    578 <...>

The new code removes move instructions from 585: 5a6: and 5a9:
and the compare from 5ac:. Additionally, the compiler assumes that
cmpxchg success is more probable and optimizes code flow accordingly.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
2023-10-09 18:14:25 +02:00
Ingo Molnar fdb8b7a1af Linux 6.6-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmUjFeceHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNCAH/RDI8G44DCV9Ps5U
 rl/FMf6iLUxU6fCS3Wwe8vtppLjPP7Y16AH5HKMumoDIqTfh9ZAUVKhZfT+PTgz3
 /oFXcGzZQLTcdbtH7XK2/zk7N/RI25/rDiCDd1uIJVCNii+hsBKS6Ihc4wXadxaR
 0z3lwoEKp2egeaeqmJWMzJLdjRrYhLs33+SEciVYqTiIvlWsM5QBm/sMvES7V57s
 TXrs5/y7yXtDBZ2PgYNCBRLyBazjqB28x07aQoePOAs6nFXl5N/wWPW/4wirWFHT
 s9LYZlmVo+O+RHWj10ASm/2l+ihgn959ZfRj1VekK2AWU1x/VzSPcuCXKvsrUoa+
 xEjL+vM=
 =efE3
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-09 18:09:23 +02:00
Sandipan Das 25e5684782 perf/x86/amd/uncore: Add memory controller support
Unified Memory Controller (UMC) events were introduced with Zen 4 as a
part of the Performance Monitoring Version 2 (PerfMonV2) enhancements.
An event is specified using the EventSelect bits and the RdWrMask bits
can be used for additional filtering of read and write requests.

As of now, a maximum of 12 channels of DDR5 are available on each socket
and each channel is controlled by a dedicated UMC. Each UMC, in turn,
has its own set of performance monitoring counters.

Since the MSR address space for the UMC PERF_CTL and PERF_CTR registers
are reused across sockets, uncore groups are created on the basis of
socket IDs. Hence, group exclusivity is mandatory while opening events
so that events for an UMC can only be opened on CPUs which are on the
same socket as the corresponding memory channel.

For each socket, the total number of available UMC counters and active
memory channels are determined from CPUID leaf 0x80000022 EBX and ECX
respectively. Usually, on Zen 4, each UMC has four counters.

MSR assignments are determined on the basis of active UMCs. E.g. if
UMCs 1, 4 and 9 are active for a given socket, then

  * UMC 1 gets MSRs 0xc0010800 to 0xc0010807 as PERF_CTLs and PERF_CTRs
  * UMC 4 gets MSRs 0xc0010808 to 0xc001080f as PERF_CTLs and PERF_CTRs
  * UMC 9 gets MSRs 0xc0010810 to 0xc0010817 as PERF_CTLs and PERF_CTRs

If there are sockets without any online CPUs when the amd_uncore driver
is loaded, UMCs for such sockets will not be discoverable since the
mechanism relies on executing the CPUID instruction on an online CPU
from the socket.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/b25f391205c22733493abec1ed850b71784edc5f.1696425185.git.sandipan.das@amd.com
2023-10-09 16:12:25 +02:00
Sandipan Das 83a43c6221 perf/x86/amd/uncore: Add group exclusivity
In some cases, it may be necessary to restrict opening PMU events to a
subset of CPUs. E.g. Unified Memory Controller (UMC) PMUs are specific
to each active memory channel and the MSR address space for the PERF_CTL
and PERF_CTR registers is reused on each socket. Thus, opening events
for a specific UMC PMU should be restricted to CPUs belonging to the
same socket as that of the UMC. The "cpumask" of the PMU should also
reflect this accordingly.

Uncore PMUs which require this can use the new group attribute in struct
amd_uncore_pmu to set a valid group ID during the scan() phase. Later,
during init(), an uncore context for a CPU will be unavailable if the
group ID does not match.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/937d6d71010a48ea4e069f4904b3116a5f99ecdf.1696425185.git.sandipan.das@amd.com
2023-10-09 16:12:24 +02:00
Sandipan Das 7ef0343855 perf/x86/amd/uncore: Use rdmsr if rdpmc is unavailable
Not all uncore PMUs may support the use of the RDPMC instruction for
reading counters. In such cases, read the count from the corresponding
PERF_CTR register using the RDMSR instruction.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/e9d994e32a3fcb39fa59fcf43ab4260d11aba097.1696425185.git.sandipan.das@amd.com
2023-10-09 16:12:24 +02:00
Sandipan Das 07888daa05 perf/x86/amd/uncore: Move discovery and registration
Uncore PMUs have traditionally been registered in the module init path.
This is fine for the existing DF and L3 PMUs since the CPUID information
does not vary across CPUs but not for the memory controller (UMC) PMUs
since information like active memory channels can vary for each socket
depending on how the DIMMs have been physically populated.

To overcome this, the discovery of PMU information using CPUID is moved
to the startup of UNCORE_STARTING. This cannot be done in the startup of
UNCORE_PREP since the hotplug callback does not run on the CPU that is
being brought online.

Previously, the startup of UNCORE_PREP was used for allocating uncore
contexts following which, the startup of UNCORE_STARTING was used to
find and reuse an existing sibling context, if possible. Any unused
contexts were added to a list for reclaimation later during the startup
of UNCORE_ONLINE.

Since all required CPUID info is now available only after the startup of
UNCORE_STARTING has completed, context allocation has been moved to the
startup of UNCORE_ONLINE. Before allocating contexts, the first CPU that
comes online has to take up the additional responsibility of registering
the PMUs. This is a one-time process though. Since sibling discovery now
happens prior to deciding whether a new context is required, there is no
longer a need to track and free up unused contexts.

The teardown of UNCORE_ONLINE and UNCORE_PREP functionally remain the
same.

Overall, the flow of control described above is achieved using the
following handlers for managing uncore PMUs. It is mandatory to define
them for each type of uncore PMU.

  * scan() runs during startup of UNCORE_STARTING and collects PMU info
    using CPUID.

  * init() runs during startup of UNCORE_ONLINE, registers PMUs and sets
    up uncore contexts.

  * move() runs during teardown of UNCORE_ONLINE and migrates uncore
    contexts to a shared sibling, if possible.

  * free() runs during teardown of UNCORE_PREP and frees up uncore
    contexts.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/e6c447e48872fcab8452e0dd81b1c9cb09f39eb4.1696425185.git.sandipan.das@amd.com
2023-10-09 16:12:23 +02:00
Sandipan Das d6389d3ccc perf/x86/amd/uncore: Refactor uncore management
Since struct amd_uncore is used to manage per-cpu contexts, rename it to
amd_uncore_ctx in order to better reflect its purpose. Add a new struct
amd_uncore_pmu to encapsulate all attributes which are shared by per-cpu
contexts for a corresponding PMU. These include the number of counters,
active mask, MSR and RDPMC base addresses, etc. Since the struct pmu is
now embedded, the corresponding amd_uncore_pmu for a given event can be
found by simply using container_of().

Finally, move all PMU-specific code to separate functions. While the
original event management functions continue to provide the base
functionality, all PMU-specific quirks and customizations are applied in
separate functions.

The motivation is to simplify the management of uncore PMUs.

Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/24b38c49a5dae65d8c96e5d75a2b96ae97aaa651.1696425185.git.sandipan.das@amd.com
2023-10-09 16:12:23 +02:00
Tero Kristo 05276d4831 perf/x86/cstate: Allow reading the package statistics from local CPU
The MSR registers for reading the package residency counters are
available on every CPU of the package. To avoid doing unnecessary SMP
calls to read the values for these from the various CPUs inside a
package, allow reading them from any CPU of the package.

Suggested-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230912124432.3616761-2-tero.kristo@linux.intel.com
2023-10-09 16:12:22 +02:00
Joerg Roedel b9cb9c4558 x86/sev: Check IOBM for IOIO exceptions from user-space
Check the IO permission bitmap (if present) before emulating IOIO #VC
exceptions for user-space. These permissions are checked by hardware
already before the #VC is raised, but due to the VC-handler decoding
race it needs to be checked again in software.

Fixes: 25189d08e5 ("x86/sev-es: Add support for handling IOIO exceptions")
Reported-by: Tom Dohrmann <erbse.13@gmx.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Dohrmann <erbse.13@gmx.de>
Cc: <stable@kernel.org>
2023-10-09 15:47:57 +02:00
Borislav Petkov (AMD) a37cd2a59d x86/sev: Disable MMIO emulation from user mode
A virt scenario can be constructed where MMIO memory can be user memory.
When that happens, a race condition opens between when the hardware
raises the #VC and when the #VC handler gets to emulate the instruction.

If the MOVS is replaced with a MOVS accessing kernel memory in that
small race window, then write to kernel memory happens as the access
checks are not done at emulation time.

Disable MMIO emulation in user mode temporarily until a sensible use
case appears and justifies properly handling the race window.

Fixes: 0118b604c2 ("x86/sev-es: Handle MMIO String Instructions")
Reported-by: Tom Dohrmann <erbse.13@gmx.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Tom Dohrmann <erbse.13@gmx.de>
Cc: <stable@kernel.org>
2023-10-09 15:45:34 +02:00
Lucy Mielke 38cd5b6a87 perf/x86/intel/pt: Fix kernel-doc comments
Some parameters or return codes were either wrong or missing,
update them.

Signed-off-by: Lucy Mielke <lucymielke@icloud.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/ZSOjQW3e2nJR4bAo@fedora.fritz.box
2023-10-09 12:26:22 +02:00
JP Kobryn e53899771a perf/x86/lbr: Filter vsyscall addresses
We found that a panic can occur when a vsyscall is made while LBR sampling
is active. If the vsyscall is interrupted (NMI) for perf sampling, this
call sequence can occur (most recent at top):

    __insn_get_emulate_prefix()
    insn_get_emulate_prefix()
    insn_get_prefixes()
    insn_get_opcode()
    decode_branch_type()
    get_branch_type()
    intel_pmu_lbr_filter()
    intel_pmu_handle_irq()
    perf_event_nmi_handler()

Within __insn_get_emulate_prefix() at frame 0, a macro is called:

    peek_nbyte_next(insn_byte_t, insn, i)

Within this macro, this dereference occurs:

    (insn)->next_byte

Inspecting registers at this point, the value of the next_byte field is the
address of the vsyscall made, for example the location of the vsyscall
version of gettimeofday() at 0xffffffffff600000. The access to an address
in the vsyscall region will trigger an oops due to an unhandled page fault.

To fix the bug, filtering for vsyscalls can be done when
determining the branch type. This patch will return
a "none" branch if a kernel address if found to lie in the
vsyscall region.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
2023-10-08 12:25:18 +02:00
Kees Cook a56d5551e1 perf/x86/rapl: Annotate 'struct rapl_pmus' with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS=y (for
array indexing) and CONFIG_FORTIFY_SOURCE=y (for strcpy/memcpy-family
functions).

Found with Coccinelle:

  https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]

Add __counted_by for 'struct rapl_pmus'.

No change in functionality intended.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20231006201754.work.473-kees@kernel.org
2023-10-08 12:18:17 +02:00
Randy Dunlap 025d5ac978 x86/resctrl: Fix kernel-doc warnings
The kernel test robot reported kernel-doc warnings here:

  monitor.c:34: warning: Cannot understand  * @rmid_free_lru    A least recently used list of free RMIDs on line 34 - I thought it was a doc line
  monitor.c:41: warning: Cannot understand  * @rmid_limbo_count     count of currently unused but (potentially) on line 41 - I thought it was a doc line
  monitor.c:50: warning: Cannot understand  * @rmid_entry - The entry in the limbo and free lists.  on line 50 - I thought it was a doc line

We don't have a syntax for documenting individual data items via
kernel-doc, so remove the "/**" kernel-doc markers and add a hyphen
for consistency.

Fixes: 6a445edce6 ("x86/intel_rdt/cqm: Add RDT monitoring initialization")
Fixes: 24247aeeab ("x86/intel_rdt/cqm: Improve limbo list processing")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231006235132.16227-1-rdunlap@infradead.org
2023-10-08 11:45:16 +02:00
Waiman Long 2743fe89d4 x86/idle: Disable IBRS when CPU is offline to improve single-threaded performance
Commit bf5835bcdb ("intel_idle: Disable IBRS during long idle")
disables IBRS when the CPU enters long idle. However, when a CPU
becomes offline, the IBRS bit is still set when X86_FEATURE_KERNEL_IBRS
is enabled. That will impact the performance of a sibling CPU. Mitigate
this performance impact by clearing all the mitigation bits in SPEC_CTRL
MSR when offline. When the CPU is online again, it will be re-initialized
and so restoring the SPEC_CTRL value isn't needed.

Add a comment to say that native_play_dead() is a __noreturn function,
but it can't be marked as such to avoid confusion about the missing
MSR restoration code.

When DPDK is running on an isolated CPU thread processing network packets
in user space while its sibling thread is idle. The performance of the
busy DPDK thread with IBRS on and off in the sibling idle thread are:

                                IBRS on         IBRS off
                                -------         --------
  packets/second:                  7.8M           10.4M
  avg tsc cycles/packet:         282.26          209.86

This is a 25% performance degradation. The test system is a Intel Xeon
4114 CPU @ 2.20GHz.

[ mingo: Extended the changelog with performance data from the 0/4 mail. ]

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20230727184600.26768-3-longman@redhat.com
2023-10-07 11:33:28 +02:00
Waiman Long e3e3bab184 x86/speculation: Add __update_spec_ctrl() helper
Add a new __update_spec_ctrl() helper which is a variant of
update_spec_ctrl() that can be used in a noinstr function.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20230727184600.26768-2-longman@redhat.com
2023-10-07 11:33:28 +02:00
Baolin Wang 55d2a0bd5e mm: add statistics for PUD level pagetable
Recently, we found that cross-die access to pagetable pages on ARM64
machines can cause performance fluctuations in our business.  Currently,
there are no PMU events available to track this situation on our ARM64
machines, so accurate pagetable accounting can help to analyze this issue,
but now the PUD level pagetable accounting is missed.

So introduce pagetable_pud_ctor/dtor() to help to get accurate PUD
pagetable accounting, as well as converting the architectures which use
generic PUD pagetable allocation to add corresponding PUD pagetable
accounting.  Moreover this patch will mark the PUD level pagetable with
PG_table flag, which will help to do sanity validation in
unpoison_memory().

On my testing machine, I can see more pagetables statistics after the patch
with page-types tool:

Before patch:
        flags           page-count      MB  symbolic-flags                     long-symbolic-flags
0x0000000004000000           27326      106  __________________________g_________________       pgtable
After patch:
0x0000000004000000           27541      107  __________________________g_________________       pgtable

Link: https://lkml.kernel.org/r/876c71c03a7e69c17722a690e3225a4f7b172fb2.1695017383.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06 14:44:10 -07:00
Sohil Mehta 2fd0ebad27 arch: Reserve map_shadow_stack() syscall number for all architectures
commit c35559f94e ("x86/shstk: Introduce map_shadow_stack syscall")
recently added support for map_shadow_stack() but it is limited to x86
only for now. There is a possibility that other architectures (namely,
arm64 and RISC-V), that are implementing equivalent support for shadow
stacks, might need to add support for it.

Independent of that, reserving arch-specific syscall numbers in the
syscall tables of all architectures is good practice and would help
avoid future conflicts. map_shadow_stack() is marked as a conditional
syscall in sys_ni.c. Adding it to the syscall tables of other
architectures is harmless and would return ENOSYS when exercised.

Note, map_shadow_stack() was assigned #453 during the merge process
since #452 was taken by fchmodat2().

For Powerpc, map it to sys_ni_syscall() as is the norm for Powerpc
syscall tables.

For Alpha, map_shadow_stack() takes up #563 as Alpha still diverges from
the common syscall numbering system in the other architectures.

Link: https://lore.kernel.org/lkml/20230515212255.GA562920@debug.ba.rivosinc.com/
Link: https://lore.kernel.org/lkml/b402b80b-a7c6-4ef0-b977-c0f5f582b78a@sirena.org.uk/

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-10-06 22:26:51 +02:00
Kirill A. Shutemov 9ee4318c15 x86/tdx: Mark TSC reliable
In x86 virtualization environments, including TDX, RDTSC instruction is
handled without causing a VM exit, resulting in minimal overhead and
jitters. On the other hand, other clock sources (such as HPET, ACPI
timer, APIC, etc.) necessitate VM exits to implement, resulting in more
fluctuating measurements compared to TSC. Thus, those clock sources are
not effective for calibrating TSC.

As a foundation, the host TSC is guaranteed to be invariant on any
system which enumerates TDX support.

TDX guests and the TDX module build on that foundation by enforcing:

  - Virtual TSC is monotonously incrementing for any single VCPU;
  - Virtual TSC values are consistent among all the TD’s VCPUs at the
    level supported by the CPU:
    + VMM is required to set the same TSC_ADJUST;
    + VMM must not modify from initial value of TSC_ADJUST before
      SEAMCALL;
  - The frequency is determined by TD configuration:
    + Virtual TSC frequency is specified by VMM on TDH.MNG.INIT;
    + Virtual TSC starts counting from 0 at TDH.MNG.INIT;

The result is that a reliable TSC is a TDX architectural guarantee.

Use the TSC as the only reliable clock source in TD guests, bypassing
unstable calibration.

This is similar to what the kernel already does in some VMWare and
HyperV environments.

[ dhansen: changelog tweaks ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Erdem Aktas <erdemaktas@google.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/all/20231006144549.2633-1-kirill.shutemov%40linux.intel.com
2023-10-06 10:00:04 -07:00
Mario Limonciello 7d08f21f8c x86/PCI: Avoid PME from D3hot/D3cold for AMD Rembrandt and Phoenix USB4
Iain reports that USB devices can't be used to wake a Lenovo Z13 from
suspend.  This occurs because on some AMD platforms, even though the Root
Ports advertise PME_Support for D3hot and D3cold, wakeup events from
devices on a USB4 controller don't result in wakeup interrupts from the
Root Port when amd-pmc has put the platform in a hardware sleep state.

If amd-pmc will be involved in the suspend, remove D3hot and D3cold from
the PME_Support mask of Root Ports above USB4 controllers so we avoid those
states if we need wakeups.

Restore D3 support at resume so that it can be used by runtime suspend.

This affects both AMD Rembrandt and Phoenix SoCs.

"pm_suspend_target_state == PM_SUSPEND_ON" means we're doing runtime
suspend, and amd-pmc will not be involved.  In that case PMEs work as
advertised in D3hot/D3cold, so we don't need to do anything.

Note that amd-pmc is technically optional, and there's no need for this
quirk if it's not present, but we assume it's always present because power
consumption is so high without it.

Fixes: 9d26d3a8f1 ("PCI: Put PCIe ports into D3 during suspend")
Link: https://lore.kernel.org/r/20231004144959.158840-1-mario.limonciello@amd.com
Reported-by: Iain Lane <iain@orangesquash.org.uk>
Closes: https://forums.lenovo.com/t5/Ubuntu/Z13-can-t-resume-from-suspend-with-external-USB-keyboard/m-p/5217121
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
[bhelgaas: commit log, move to arch/x86/pci/fixup.c, add #includes]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
2023-10-06 09:09:47 -05:00
Jithu Joseph 97a5e801b3
platform/x86/intel/ifs: Store IFS generation number
IFS generation number is reported via MSR_INTEGRITY_CAPS.  As IFS
support gets added to newer CPUs, some differences are expected during
IFS image loading and test flows.

Define MSR bitmasks to extract and store the generation in driver data,
so that driver can modify its MSR interaction appropriately.

Signed-off-by: Jithu Joseph <jithu.joseph@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Link: https://lore.kernel.org/r/20231005195137.3117166-2-jithu.joseph@intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2023-10-06 13:05:13 +03:00
David Woodhouse 5d6d6a7d7e KVM: x86: Refine calculation of guest wall clock to use a single TSC read
When populating the guest's PV wall clock information, KVM currently does
a simple 'kvm_get_real_ns() - get_kvmclock_ns(kvm)'. This is an antipattern
which should be avoided; when working with the relationship between two
clocks, it's never correct to obtain one of them "now" and then the other
at a slightly different "now" after an unspecified period of preemption
(which might not even be under the control of the kernel, if this is an
L1 hosting an L2 guest under nested virtualization).

Add a kvm_get_wall_clock_epoch() function to return the guest wall clock
epoch in nanoseconds using the same method as __get_kvmclock() — by using
kvm_get_walltime_and_clockread() to calculate both the wall clock and KVM
clock time from a *single* TSC reading.

The condition using get_cpu_tsc_khz() is equivalent to the version in
__get_kvmclock() which separately checks for the CONSTANT_TSC feature or
the per-CPU cpu_tsc_khz. Which is what get_cpu_tsc_khz() does anyway.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/bfc6d3d7cfb88c47481eabbf5a30a264c58c7789.camel@infradead.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-05 19:36:16 -07:00
Jakub Kicinski 2606cf059c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts (or adjacent changes of note).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-05 13:16:47 -07:00
Chang S. Bae e12a68b3c6 crypto: x86/aesni - Perform address alignment early for XTS mode
Currently, the alignment of each field in struct aesni_xts_ctx occurs
right before every access. However, it's possible to perform this
alignment ahead of time.

Introduce a helper function that converts struct crypto_skcipher *tfm
to struct aesni_xts_ctx *ctx and returns an aligned address. Utilize
this helper function at the beginning of each XTS function and then
eliminate redundant alignment code.

Suggested-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/all/ZFWQ4sZEVu%2FLHq+Q@gmail.com/
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Cc: linux-crypto@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-10-05 18:16:30 +08:00
Chang S. Bae d148736ff1 crypto: x86/aesni - Correct the data type in struct aesni_xts_ctx
Currently, every field in struct aesni_xts_ctx is defined as a byte
array of the same size as struct crypto_aes_ctx. This data type
is obscure and the choice lacks justification.

To rectify this, update the field type in struct aesni_xts_ctx to
match its actual structure.

Suggested-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/all/ZFWQ4sZEVu%2FLHq+Q@gmail.com/
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Cc: linux-crypto@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-10-05 18:16:30 +08:00
Chang S. Bae 62496a2dea crypto: x86/aesni - Refactor the common address alignment code
The address alignment code has been duplicated for each mode. Instead
of duplicating the same code, refactor the alignment code and simplify
the alignment helpers.

Suggested-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/all/20230526065414.GB875@sol.localdomain/
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Cc: linux-crypto@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-10-05 18:16:30 +08:00
Brian Gerst bab9fa6dc5 x86/entry/32: Remove SEP test for SYSEXIT
SEP must be already be present in order for do_fast_syscall_32() to be
called on native 32-bit, so checking it again is unnecessary.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230721161018.50214-6-brgerst@gmail.com
2023-10-05 10:06:43 +02:00
Brian Gerst 0d3109ad2e x86/entry/32: Convert do_fast_syscall_32() to bool return type
Doesn't have to be 'long' - this simplifies the code a bit.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230721161018.50214-5-brgerst@gmail.com
2023-10-05 10:06:42 +02:00
Brian Gerst eec62f61e1 x86/entry/compat: Combine return value test from syscall handler
Move the sysret32_from_system_call label to remove a duplicate test of
the return value from the syscall handler.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230721161018.50214-4-brgerst@gmail.com
2023-10-05 10:06:42 +02:00
Brian Gerst eb43c9b151 x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
This comment comes from a time when the kernel attempted to use SYSRET
on all returns to userspace, including interrupts and exceptions.  Ever
since commit fffbb5dc ("Move opportunistic sysret code to syscall code
path"), SYSRET is only used for returning from system calls. The
specific tracing issue listed in this comment is not possible anymore.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20230721161018.50214-2-brgerst@gmail.com
2023-10-05 10:06:42 +02:00
Ingo Molnar 3fc18b06b8 Linux 6.6-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmUZ4WEeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGnIYH/07zef2U1nlqI+ro
 HRL2GlWGIs9yE70Oax+A3eYUYsjJIPu0yiDhFHUgOV3VyAALo44ZX/WNwKCGsI3e
 zhuOeItyyVcLGZXVC/jxSu0uveyfEiEYIWRYGyQ6Sna8Ksdk/qwhNgQNotdWdQG5
 7xt8z32couglu0uOkxcGqjTxmbjO6WSM5qi7Ts+xLsgrcS5cRuNhAg/vezp9bfeL
 1IUieCih4RJFgar/6LPOiB8uoVXEBonVbtlTRRqYdnqcsSIC+ACR9ZFk/+X88b5z
 S+Ta5VTcOAPu+2M/lSGe+PlUECvoBNK0SIYnaVCP2paPmDxfDXOFvSy/qJE87/7L
 9BeonFw=
 =8FTr
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-rc4' into x86/entry, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-05 10:05:51 +02:00
Paul Durrant 409f2e92a2 KVM: x86/xen: ignore the VCPU_SSHOTTMR_future flag
Upstream Xen now ignores _VCPU_SSHOTTMR_future[1], since the only guest
kernel ever to use it was buggy. By ignoring the flag the guest will
always get a callback if it sets a negative timeout which upstream Xen
has determined not to cause problems for any guest setting the flag.

[1] https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=19c6cbd909

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20231004174628.2073263-1-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 15:22:58 -07:00
Josh Poimboeuf e47d86083c KVM: x86: Add SBPB support
Add support for the AMD Selective Branch Predictor Barrier (SBPB) by
advertising the CPUID bit and handling PRED_CMD writes accordingly.

Note, like SRSO_NO and IBPB_BRTYPE before it, advertise support for SBPB
even if it's not enumerated by in the raw CPUID.  Some CPUs that gained
support via a uCode patch don't report SBPB via CPUID (the kernel forces
the flag).

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/a4ab1e7fe50096d50fde33e739ed2da40b41ea6a.1692919072.git.jpoimboe@kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 15:19:32 -07:00
Josh Poimboeuf 6f0f23ef76 KVM: x86: Add IBPB_BRTYPE support
Add support for the IBPB_BRTYPE CPUID flag, which indicates that IBPB
includes branch type prediction flushing.

Note, like SRSO_NO, advertise support for IBPB_BRTYPE even if it's not
enumerated by in the raw CPUID, i.e. bypass the cpuid_count() in
__kvm_cpu_cap_mask().  Some CPUs that gained support via a uCode patch
don't report IBPB_BRTYPE via CPUID (the kernel forces the flag).

Opportunistically use kvm_cpu_cap_check_and_set() for SRSO_NO instead
of manually querying host support (cpu_feature_enabled() and
boot_cpu_has() yield the same end result in this case).

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/79d5f5914fb42c2c62418ffbcd78f138645ded21.1692919072.git.jpoimboe@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 15:15:52 -07:00
Sean Christopherson 0068299540 KVM: SVM: Treat all "skip" emulation for SEV guests as outright failures
Treat EMULTYPE_SKIP failures on SEV guests as unhandleable emulation
instead of simply resuming the guest, and drop the hack-a-fix which
effects that behavior for the INT3/INTO injection path.  If KVM can't
skip an instruction for which KVM has already done partial emulation,
resuming the guest is undesirable as doing so may corrupt guest state.

Link: https://lore.kernel.org/r/20230825013621.2845700-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 15:08:53 -07:00
Sean Christopherson aeb904f6b9 KVM: x86: Refactor can_emulate_instruction() return to be more expressive
Refactor and rename can_emulate_instruction() to allow vendor code to
return more than true/false, e.g. to explicitly differentiate between
"retry", "fault", and "unhandleable".  For now, just do the plumbing, a
future patch will expand SVM's implementation to signal outright failure
if KVM attempts EMULTYPE_SKIP on an SEV guest.

No functional change intended (or rather, none that are visible to the
guest or userspace).

Link: https://lore.kernel.org/r/20230825013621.2845700-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 15:08:53 -07:00
Frederic Weisbecker 448e9f34d9 rcu: Standardize explicit CPU-hotplug calls
rcu_report_dead() and rcutree_migrate_callbacks() have their headers in
rcupdate.h while those are pure rcutree calls, like the other CPU-hotplug
functions.

Also rcu_cpu_starting() and rcu_report_dead() have different naming
conventions while they mirror each other's effects.

Fix the headers and propose a naming that relates both functions and
aligns with the prefix of other rcutree CPU-hotplug functions.

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-04 22:29:45 +02:00
David Woodhouse 77c9b9dea4 KVM: x86/xen: Use fast path for Xen timer delivery
Most of the time there's no need to kick the vCPU and deliver the timer
event through kvm_xen_inject_timer_irqs(). Use kvm_xen_set_evtchn_fast()
directly from the timer callback, and only fall back to the slow path if
delivering the timer would block, i.e. if kvm_xen_set_evtchn_fast()
returns -EWOULDBLOCK.  If delivery fails for any other reason, do nothing
and just let it fail silently, as that is what the slow path would end up
doing anyways.

This gives a significant improvement in timer latency testing (using
nanosleep() for various periods and then measuring the actual time
elapsed).

However, there was a reason[1] the fast path was dropped when this support
was first added. The current code holds vcpu->mutex for all operations on
the kvm->arch.timer_expires field, and the fast path introduces a
potential race condition. Avoid that race by ensuring the hrtimer is
(temporarily) cancelled before making changes in kvm_xen_start_timer(),
and also when reading the values out for KVM_XEN_VCPU_ATTR_TYPE_TIMER.

[1] https://lore.kernel.org/kvm/846caa99-2e42-4443-1070-84e49d2f11d2@redhat.com

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/f21ee3bd852761e7808240d4ecaec3013c649dc7.camel@infradead.org
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 12:29:21 -07:00
Peng Hao ee11ab6bb0 KVM: X86: Reduce size of kvm_vcpu_arch structure when CONFIG_KVM_XEN=n
When CONFIG_KVM_XEN=n, the size of kvm_vcpu_arch can be reduced
from 5100+ to 4400+ by adding macro control.

Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Link: https://lore.kernel.org/all/CAPm50aKwbZGeXPK5uig18Br8CF1hOS71CE2j_dLX+ub7oJdpGg@mail.gmail.com
[sean: fix whitespace damage]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-10-04 12:26:02 -07:00
Baoquan He 9c08a2a139 x86: kdump: use generic interface to simplify crashkernel reservation code
With the help of newly changed function parse_crashkernel() and generic
reserve_crashkernel_generic(), crashkernel reservation can be simplified
by steps:

1) Add a new header file <asm/crash_core.h>, and define CRASH_ALIGN,
   CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX and
   DEFAULT_CRASH_KERNEL_LOW_SIZE in <asm/crash_core.h>;

2) Add arch_reserve_crashkernel() to call parse_crashkernel() and
   reserve_crashkernel_generic(), and do the ARCH specific work if
   needed.

3) Add ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION Kconfig in
   arch/x86/Kconfig.

When adding DEFAULT_CRASH_KERNEL_LOW_SIZE, add crash_low_size_default() to
calculate crashkernel low memory because x86_64 has special requirement.

The old reserve_crashkernel_low() and reserve_crashkernel() can be
removed.

[bhe@redhat.com: move crash_low_size_default() code into <asm/crash_core.h>]
  Link: https://lkml.kernel.org/r/ZQpeAjOmuMJBFw1/@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20230914033142.676708-7-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:41:58 -07:00
Baoquan He a9e1a3d84e crash_core: change the prototype of function parse_crashkernel()
Add two parameters 'low_size' and 'high' to function parse_crashkernel(),
later crashkernel=,high|low parsing will be added.  Make adjustments in
all call sites of parse_crashkernel() in arch.

Link: https://lkml.kernel.org/r/20230914033142.676708-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Jiahao <chenjiahao16@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:41:58 -07:00
Qi Zheng e5985c4098 kvm: mmu: dynamically allocate the x86-mmu shrinker
Use new APIs to dynamically allocate the x86-mmu shrinker.

Link: https://lkml.kernel.org/r/20230911094444.68966-3-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04 10:32:23 -07:00
Uros Bizjak 5e0eb67974 locking/local, arch: Rewrite local_add_unless() as a static inline function
Rewrite local_add_unless() as a static inline function with boolean
return value, similar to the arch_atomic_add_unless() arch fallbacks.

The function is currently unused.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230731084458.28096-1-ubizjak@gmail.com
2023-10-04 11:38:11 +02:00
Justin Stitt c9babd5d95 x86/tdx: Replace deprecated strncpy() with strtomem_pad()
strncpy() works perfectly here in all cases, however, it is deprecated and
as such we should prefer more robust and less ambiguous string APIs:

    https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings

Let's use strtomem_pad() as this matches the functionality of strncpy()
and is _not_ deprecated.

Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://github.com/KSPP/linux/issues/90
Link: https://lore.kernel.org/r/20231003-strncpy-arch-x86-coco-tdx-tdx-c-v2-1-0bd21174a217@google.com
2023-10-04 09:34:07 +02:00
Zhu Wang 8ae292c66d x86/lib: Address kernel-doc warnings
Fix all kernel-doc warnings in csum-wrappers_64.c:

  arch/x86/lib/csum-wrappers_64.c:25: warning: Excess function parameter 'isum' description in 'csum_and_copy_from_user'
  arch/x86/lib/csum-wrappers_64.c:25: warning: Excess function parameter 'errp' description in 'csum_and_copy_from_user'
  arch/x86/lib/csum-wrappers_64.c:49: warning: Excess function parameter 'isum' description in 'csum_and_copy_to_user'
  arch/x86/lib/csum-wrappers_64.c:49: warning: Excess function parameter 'errp' description in 'csum_and_copy_to_user'
  arch/x86/lib/csum-wrappers_64.c:71: warning: Excess function parameter 'sum' description in 'csum_partial_copy_nocheck'

Signed-off-by: Zhu Wang <wangzhu9@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
2023-10-03 22:46:47 +02:00
Zhu Wang 90879f5dfc x86/fpu/xstate: Address kernel-doc warning
Fix kernel-doc warning:

  arch/x86/kernel/fpu/xstate.c:1753: warning: Excess function parameter 'tsk' description in 'fpu_xstate_prctl'

Signed-off-by: Zhu Wang <wangzhu9@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
2023-10-03 22:46:12 +02:00
David Reaver 618e77d774 perf/x86/rapl: Fix "Using plain integer as NULL pointer" Sparse warning
Change 0 to NULL when initializing the test field of perf_msr structs to
avoid the following sparse warnings:

  make C=2 arch/x86/events/rapl.o

  CHECK   arch/x86/events/rapl.c
  ...
  arch/x86/events/rapl.c:540:59: warning: Using plain integer as NULL pointer
  arch/x86/events/rapl.c:542:59: warning: Using plain integer as NULL pointer
  arch/x86/events/rapl.c:543:59: warning: Using plain integer as NULL pointer
  arch/x86/events/rapl.c:544:59: warning: Using plain integer as NULL pointer

Signed-off-by: David Reaver <me@davidreaver.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230801155651.108076-1-me@davidreaver.com
2023-10-03 21:25:56 +02:00
Uros Bizjak bcc6ec3d95 perf/x86/rapl: Use local64_try_cmpxchg in rapl_event_update()
Use local64_try_cmpxchg() instead of local64_cmpxchg(*ptr, old, new) == old.

X86 CMPXCHG instruction returns success in ZF flag, so this change saves a
compare after CMPXCHG (and related move instruction in front of CMPXCHG).

Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when CMPXCHG
fails. There is no need to re-read the value in the loop.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20230807145134.3176-2-ubizjak@gmail.com
2023-10-03 21:13:45 +02:00
Uros Bizjak 1ce19bf90b perf/x86/rapl: Stop doing cpu_relax() in the local64_cmpxchg() loop in rapl_event_update()
According to the following commit:

   f5fe24ef17 ("lockref: stop doing cpu_relax in the cmpxchg loop")

   "On the x86-64 architecture even a failing cmpxchg grants exclusive
    access to the cacheline, making it preferable to retry the failed op
    immediately instead of stalling with the pause instruction."

Based on the above observation, remove cpu_relax() from the
local64_cmpxchg() loop of rapl_event_update().

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20230807145134.3176-1-ubizjak@gmail.com
2023-10-03 21:13:23 +02:00
Sohil Mehta ccab211af3 syscalls: Cleanup references to sys_lookup_dcookie()
commit 'be65de6b03aa ("fs: Remove dcookies support")' removed the
syscall definition for lookup_dcookie.  However, syscall tables still
point to the old sys_lookup_dcookie() definition. Update syscall tables
of all architectures to directly point to sys_ni_syscall() instead.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org> # for perf
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-10-03 19:51:37 +02:00
GUO Zihua bfb32e2008 x86/sev: Make boot_ghcb_page[] static
boot_ghcb_page is not used by any other file, so make it static.

This also resolves sparse warning:

  arch/x86/boot/compressed/sev.c:28:13: warning: symbol 'boot_ghcb_page' was not declared. Should it be static?

Signed-off-by: GUO Zihua <guozihua@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
2023-10-03 15:31:27 +02:00
Wang Jinchao 9f76d60626 x86/boot: Harmonize the style of array-type parameter for fixup_pointer() calls
The usage of '&' before the array parameter is redundant because '&array'
is equivalent to 'array'. Therefore, there is no need to include '&'
before the array parameter. In fact, using '&' can cause more confusion,
especially for individuals who are not familiar with the address-of
operation for arrays. They might mistakenly believe that one is different
from the other and spend additional time realizing that they are actually
the same.

Harmonizing the style by removing the unnecessary '&' would save time for
those individuals.

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/ZMt24BGEX9IhPSY6@fedora
2023-10-03 11:28:38 +02:00
Yazen Ghannam 2a565258b3 x86/amd_nb: Use Family 19h Models 60h-7Fh Function 4 IDs
Three PCI IDs for DF Function 4 were defined but not used.

Add them to the "link" list.

Fixes: f8faf34966 ("x86/amd_nb: Add AMD PCI IDs for SMN communication")
Fixes: 23a5b8bb02 ("x86/amd_nb: Add PCI ID for family 19h model 78h")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230803150430.3542854-1-yazen.ghannam@amd.com
2023-10-03 11:25:01 +02:00
Masahiro Yamada 8b01de8030 x86/headers: Remove <asm/export.h>
All *.S files under arch/x86/ have been converted to include
<linux/export.h> instead of <asm/export.h>.

Remove <asm/export.h>.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230806145958.380314-3-masahiroy@kernel.org
2023-10-03 10:38:08 +02:00
Masahiro Yamada 94ea9c0521 x86/headers: Replace #include <asm/export.h> with #include <linux/export.h>
The following commit:

  ddb5cdbafa ("kbuild: generate KSYMTAB entries by modpost")

deprecated <asm/export.h>, which is now a wrapper of <linux/export.h>.

Use <linux/export.h> in *.S as well as in *.c files.

After all the <asm/export.h> lines are replaced, <asm/export.h> and
<asm-generic/export.h> will be removed.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230806145958.380314-2-masahiroy@kernel.org
2023-10-03 10:38:07 +02:00
Masahiro Yamada b425232c67 x86/headers: Remove unnecessary #include <asm/export.h>
There is no EXPORT_SYMBOL() line there, hence #include <asm/export.h>
is unnecessary.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230806145958.380314-1-masahiroy@kernel.org
2023-10-03 10:38:07 +02:00
Yuntao Wang 001470fed5 x86/boot: Fix incorrect startup_gdt_descr.size
Since the size value is added to the base address to yield the last valid
byte address of the GDT, the current size value of startup_gdt_descr is
incorrect (too large by one), fix it.

[ mingo: This probably never mattered, because startup_gdt[] is only used
         in a very controlled fashion - but make it consistent nevertheless. ]

Fixes: 866b556efa ("x86/head/64: Install startup GDT")
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lore.kernel.org/r/20230807084547.217390-1-ytcoode@gmail.com
2023-10-03 10:28:29 +02:00
Ingo Molnar de80193308 Linux 6.6-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmUZ4WEeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGnIYH/07zef2U1nlqI+ro
 HRL2GlWGIs9yE70Oax+A3eYUYsjJIPu0yiDhFHUgOV3VyAALo44ZX/WNwKCGsI3e
 zhuOeItyyVcLGZXVC/jxSu0uveyfEiEYIWRYGyQ6Sna8Ksdk/qwhNgQNotdWdQG5
 7xt8z32couglu0uOkxcGqjTxmbjO6WSM5qi7Ts+xLsgrcS5cRuNhAg/vezp9bfeL
 1IUieCih4RJFgar/6LPOiB8uoVXEBonVbtlTRRqYdnqcsSIC+ACR9ZFk/+X88b5z
 S+Ta5VTcOAPu+2M/lSGe+PlUECvoBNK0SIYnaVCP2paPmDxfDXOFvSy/qJE87/7L
 9BeonFw=
 =8FTr
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-rc4' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-03 09:32:25 +02:00
Dave Hansen 3e32552652 x86/boot: Move x86_cache_alignment initialization to correct spot
c->x86_cache_alignment is initialized from c->x86_clflush_size.
However, commit fbf6449f84 moved c->x86_clflush_size initialization
to later in boot without moving the c->x86_cache_alignment assignment:

  fbf6449f84 ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach")

This presumably left c->x86_cache_alignment set to zero for longer
than it should be.

The result was an oops on 32-bit kernels while accessing a pointer
at 0x20.  The 0x20 came from accessing a structure member at offset
0x10 (buffer->cpumask) from a ZERO_SIZE_PTR=0x10.  kmalloc() can
evidently return ZERO_SIZE_PTR when it's given 0 as its alignment
requirement.

Move the c->x86_cache_alignment initialization to be after
c->x86_clflush_size has an actual value.

Fixes: fbf6449f84 ("x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20231002220045.1014760-1-dave.hansen@linux.intel.com
2023-10-03 09:27:12 +02:00
Saurabh Sengar 0c436a5829 x86/numa: Add Devicetree support
Hyper-V has usecases where it needs to fetch NUMA information from
Devicetree. Currently, it is not possible to extract the NUMA information
from Devicetree for the x86 arch.

Add support for Devicetree in the x86_numa_init() function, allowing the
retrieval of NUMA node information from the Devicetree.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/1692949657-16446-2-git-send-email-ssengar@linux.microsoft.com
2023-10-02 21:30:20 +02:00
Saurabh Sengar 0d294c8c4e x86/of: Move the x86_flattree_get_config() call out of x86_dtb_init()
Fetching the device tree configuration before initmem_init() is necessary
to allow the parsing of NUMA node information. However moving the entire
x86_dtb_init() call before initmem_init() is not correct as APIC/IO-APIC enumeration
has to be after initmem_init().

Thus, move the x86_flattree_get_config() call out of x86_dtb_init(),
into setup_arch(), to call it before initmem_init(), and
leave the ACPI/IOAPIC registration sequence as-is.

[ mingo: Updated the changelog for clarity. ]

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/1692949657-16446-1-git-send-email-ssengar@linux.microsoft.com
2023-10-02 21:30:09 +02:00
Tom Lendacky 62d5e970d0 x86/sev: Change npages to unsigned long in snp_accept_memory()
In snp_accept_memory(), the npages variables value is calculated from
phys_addr_t variables but is an unsigned int. A very large range passed
into snp_accept_memory() could lead to truncating npages to zero. This
doesn't happen at the moment but let's be prepared.

Fixes: 6c32117963 ("x86/sev: Add SNP-specific unaccepted memory support")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/6d511c25576494f682063c9fb6c705b526a3757e.1687441505.git.thomas.lendacky@amd.com
2023-10-02 14:55:41 +02:00
Tom Lendacky 6bc6f7d9d7 x86/sev: Use the GHCB protocol when available for SNP CPUID requests
SNP retrieves the majority of CPUID information from the SNP CPUID page.
But there are times when that information needs to be supplemented by the
hypervisor, for example, obtaining the initial APIC ID of the vCPU from
leaf 1.

The current implementation uses the MSR protocol to retrieve the data from
the hypervisor, even when a GHCB exists. The problem arises when an NMI
arrives on return from the VMGEXIT. The NMI will be immediately serviced
and may generate a #VC requiring communication with the hypervisor.

Since a GHCB exists in this case, it will be used. As part of using the
GHCB, the #VC handler will write the GHCB physical address into the GHCB
MSR and the #VC will be handled.

When the NMI completes, processing resumes at the site of the VMGEXIT
which is expecting to read the GHCB MSR and find a CPUID MSR protocol
response. Since the NMI handling overwrote the GHCB MSR response, the
guest will see an invalid reply from the hypervisor and self-terminate.

Fix this problem by using the GHCB when it is available. Any NMI
received is properly handled because the GHCB contents are copied into
a backup page and restored on NMI exit, thus preserving the active GHCB
request or result.

  [ bp: Touchups. ]

Fixes: ee0bfa08a3 ("x86/compressed/64: Add support for SEV-SNP CPUID table in #VC handlers")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/a5856fa1ebe3879de91a8f6298b6bbd901c61881.1690578565.git.thomas.lendacky@amd.com
2023-10-02 14:55:39 +02:00
Linus Torvalds d2c5231581 Fourteen hotfixes, eleven of which are cc:stable. The remainder pertain
to issues which were introduced after 6.5.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZRmSDAAKCRDdBJ7gKXxA
 jlSaAQCe3SnBdjRmuzbp5iIfNJOY7GXLN4NwMsArRUxRGY27IwD+KWhXZP/ydVnt
 ZgS4x9rmarHuh5Pxds+6SRGhihRz/Ak=
 =sf/5
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Fourteen hotfixes, eleven of which are cc:stable. The remainder
  pertain to issues which were introduced after 6.5"

* tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  Crash: add lock to serialize crash hotplug handling
  selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error
  mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified
  mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions()
  mm, memcg: reconsider kmem.limit_in_bytes deprecation
  mm: zswap: fix potential memory corruption on duplicate store
  arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries
  mm: hugetlb: add huge page size param to set_huge_pte_at()
  maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states
  maple_tree: add mas_is_active() to detect in-tree walks
  nilfs2: fix potential use after free in nilfs_gccache_submit_read_data()
  mm: abstract moving to the next PFN
  mm: report success more often from filemap_map_folio_range()
  fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
2023-10-01 13:33:25 -07:00
Linus Torvalds ec8c298121 Misc fixes: a kerneldoc build warning fix, add SRSO mitigation for
AMD-derived Hygon processors, and fix a SGX kernel crash in the
 page fault handler that can trigger when ksgxd races to reclaim
 the SECS special page, by making the SECS page unswappable.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUZNa4RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hFYA//aCXuIxVgMdgwDs7uuaghAX7v3NWlx1Pu
 qxioHgpGimOl5Sm28siT29GCnFK3a+DBd7wCNQ9yhRxTIESwAG9wGlD8cfZISvyI
 qinPU6Yo0OLEAI//g2IWzr/Hw8QecrjLGGoqhFj8m2vsLANWcTXkeRoFxNwWlobx
 OSGQL+SYP5tuAhsrjbsQMHiOAUxXAdAuT62R8nYgCcj6A/VTvjSUF9N3C6G/CtnM
 Y7pi8n4VPxDuU8dwhi1HptHzBrAl6GYXiC1A9UgddKaDk710R+Fe4+LkJR4NVrnL
 zXv5YY4qoHzVcQ3gznBUOTwDPDWjsGPaTU2Gcya3FRXxZlc1i7p+/kBvxAxQz05H
 Z6ixkfnWOhdYPWbr7yau7r0RRR4ZvK22pIoAfxKdTYbbI5lOUqbSof0d9mHAY+An
 tHkYqqFAabnPs4ogGT4tK7nHr9pnCFEEfd2JAAPDu78XkVlokvV7e4sm7bjS4D4D
 tIHpp3gd04PaEq9I2mEvP57/Sn2fYr0PO3mg6jUppv35k2+hjjkfPilKzqPbcXP6
 bD4gdjYXQV367kCfpN2SXUaPtn0UZfqdEol1UVzteVyNOgXHiPzCd/2K7YtW2MM5
 wlJ35BDvC7uQr8XxOJBQoAETJQ7TtePnhIjDHOp+WYzn8dmd+r31/EJMIOC0TD/C
 nlvkY3/gvYA=
 =YK/4
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:
 "Misc fixes: a kerneldoc build warning fix, add SRSO mitigation for
  AMD-derived Hygon processors, and fix a SGX kernel crash in the page
  fault handler that can trigger when ksgxd races to reclaim the SECS
  special page, by making the SECS page unswappable"

* tag 'x86-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Resolves SECS reclaim vs. page fault for EAUG race
  x86/srso: Add SRSO mitigation for Hygon processors
  x86/kgdb: Fix a kerneldoc warning when build with W=1
2023-10-01 09:50:58 -07:00
Linus Torvalds 3a38c57a87 Misc fixes: work around an AMD microcode bug on certain models,
and fix kexec kernel PMI handlers on AMD systems that get loaded
 on older kernels that have an unexpected register state.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUZLo8RHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gzXw/8C6YUlO05xpocZuBzbQU/BtpAj1P8VwTe
 gSPD3vaPeXvEFSDqoNBEXXs54VRpAnHfLOGmr5DrjG9h8yr7MU5/hUfov6a8dLtH
 rAjwYOtZapHz/nc/+knbGUcM4d7H8IRYF62A4Vy2uugLlfNQY6YAuEeg/5/ykS5Y
 mCUAWgwwygPdN5VOQZNlJluEw6DexyUmM3fg9Gbq+Hx1bycPJtUzub5q3fGQ/FcG
 wxUUgL1lKIUoOHXL2D3Xq0fE7A8khXJIjzxqBysSrnMqe7BDS3ur1SoLjDPGaXcP
 f2wBpEGdSU9xxfEL3W8gdtmsffihU8ExbNZYFdpTlOf02aB7z03M0Lr4NwsJDiCZ
 Qek6IpK0ZHyC67X7UC7iTzAnI/c32jhkoTzHXpI1aOX/dX8TLNy0r/FKxoUcbv16
 HDPy4RnuhNQS3X8P1EqcXJ72AEF0Ce5S9tU8MfSKDWSElfXtCtE+hvIilqoUaW9F
 RTtHl6cOezCEkx3SryhhZibMaOVrLGFMeF01lpHXckwHQt8q2TMMEaCRNYjpGfwM
 zsBS1+E1/XeVsVqadM3/vbu7SyUUimlPZy8gc2gkSqyq8NZcs+OqbYaYF00v4lcN
 TtxRv9I3J8oXaZWTA7XUC+00IwMuYVjdv4ZJL/rRWpQaatoAFRLf0rS5/LgY9kSc
 xIYuX63A4Wo=
 =rzXG
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event fixes from Ingo Molnar:
 "Misc fixes: work around an AMD microcode bug on certain models, and
  fix kexec kernel PMI handlers on AMD systems that get loaded on older
  kernels that have an unexpected register state"

* tag 'perf-urgent-2023-10-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd: Do not WARN() on every IRQ
  perf/x86/amd/core: Fix overflow reset on hotplug
2023-10-01 09:34:53 -07:00
Matthew Wilcox (Oracle) ce60f27bb6 mm: abstract moving to the next PFN
In order to fix the L1TF vulnerability, x86 can invert the PTE bits for
PROT_NONE VMAs, which means we cannot move from one PTE to the next by
adding 1 to the PFN field of the PTE.  This results in the BUG reported at
[1].

Abstract advancing the PTE to the next PFN through a pte_next_pfn()
function/macro.

Link: https://lkml.kernel.org/r/20230920040958.866520-1-willy@infradead.org
Fixes: bcc6cc8325 ("mm: add default definition of set_ptes()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: syzbot+55cc72f8cc3a549119df@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000d099fa0604f03351@google.com [1]
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-29 17:20:46 -07:00
Baolin Liu b5034c6385 x86/cpu/amd: Remove redundant 'break' statement
This break is after the return statement, so it is redundant & confusing,
and should be deleted.

Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/396ba14d.2726.189d957b74b.Coremail.liubaolin12138@163.com
2023-09-29 11:24:09 +02:00
Haitao Huang c6c2adcba5 x86/sgx: Resolves SECS reclaim vs. page fault for EAUG race
The SGX EPC reclaimer (ksgxd) may reclaim the SECS EPC page for an
enclave and set secs.epc_page to NULL. The SECS page is used for EAUG
and ELDU in the SGX page fault handler. However, the NULL check for
secs.epc_page is only done for ELDU, not EAUG before being used.

Fix this by doing the same NULL check and reloading of the SECS page as
needed for both EAUG and ELDU.

The SECS page holds global enclave metadata. It can only be reclaimed
when there are no other enclave pages remaining. At that point,
virtually nothing can be done with the enclave until the SECS page is
paged back in.

An enclave can not run nor generate page faults without a resident SECS
page. But it is still possible for a #PF for a non-SECS page to race
with paging out the SECS page: when the last resident non-SECS page A
triggers a #PF in a non-resident page B, and then page A and the SECS
both are paged out before the #PF on B is handled.

Hitting this bug requires that race triggered with a #PF for EAUG.
Following is a trace when it happens.

BUG: kernel NULL pointer dereference, address: 0000000000000000
RIP: 0010:sgx_encl_eaug_page+0xc7/0x210
Call Trace:
 ? __kmem_cache_alloc_node+0x16a/0x440
 ? xa_load+0x6e/0xa0
 sgx_vma_fault+0x119/0x230
 __do_fault+0x36/0x140
 do_fault+0x12f/0x400
 __handle_mm_fault+0x728/0x1110
 handle_mm_fault+0x105/0x310
 do_user_addr_fault+0x1ee/0x750
 ? __this_cpu_preempt_check+0x13/0x20
 exc_page_fault+0x76/0x180
 asm_exc_page_fault+0x27/0x30

Fixes: 5a90d2c3f5 ("x86/sgx: Support adding of pages to an initialized enclave")
Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20230728051024.33063-1-haitao.huang%40linux.intel.com
2023-09-28 16:16:40 -07:00
Adam Dunlap fbf6449f84 x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach
Instead of setting x86_virt_bits to a possibly-correct value and then
correcting it later, do all the necessary checks before setting it.

At this point, the #VC handler references boot_cpu_data.x86_virt_bits,
and in the previous version, it would be triggered by the CPUIDs between
the point at which it is set to 48 and when it is set to the correct
value.

Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Adam Dunlap <acdunlap@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jacob Xu <jacobhxu@google.com>
Link: https://lore.kernel.org/r/20230912002703.3924521-3-acdunlap@google.com
2023-09-28 22:49:35 +02:00
Adam Dunlap f79936545f x86/sev-es: Allow copy_from_kernel_nofault() in earlier boot
Previously, if copy_from_kernel_nofault() was called before
boot_cpu_data.x86_virt_bits was set up, then it would trigger undefined
behavior due to a shift by 64.

This ended up causing boot failures in the latest version of ubuntu2204
in the gcp project when using SEV-SNP.

Specifically, this function is called during an early #VC handler which
is triggered by a CPUID to check if NX is implemented.

Fixes: 1aa9aa8ee5 ("x86/sev-es: Setup GHCB-based boot #VC handler")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Adam Dunlap <acdunlap@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Jacob Xu <jacobhxu@google.com>
Link: https://lore.kernel.org/r/20230912002703.3924521-2-acdunlap@google.com
2023-09-28 22:49:35 +02:00
Tao Su 629d3698f6 KVM: x86: Clear bit12 of ICR after APIC-write VM-exit
When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR
MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi()
thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay,
but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit.

Under the x2APIC section, regarding ICR, the SDM says:

  It remains readable only to aid in debugging; however, software should
  not assume the value returned by reading the ICR is the last written
  value.

I.e. the guest is allowed to set bit 12.  However, the SDM also gives KVM
free reign to do whatever it wants with the bit, so long as KVM's behavior
doesn't confuse userspace or break KVM's ABI.

Clear bit 12 so that it reads back as '0'. This approach is safer than
"do nothing" and is consistent with the case where IPI virtualization is
disabled or not supported, i.e.,

  handle_fastpath_set_x2apic_icr_irqoff() -> kvm_x2apic_icr_write()

Opportunistically replace the TODO with a comment calling out that eating
the write is likely faster than a conditional branch around the busy bit.

Link: https://lore.kernel.org/all/ZPj6iF0Q7iynn62p@google.com/
Fixes: 5413bcba7e ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode")
Cc: stable@vger.kernel.org
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20230914055504.151365-1-tao1.su@linux.intel.com
[sean: tweak changelog, replace TODO with comment, drop local "val"]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-28 10:42:16 -07:00
Haitao Shan 9cfec6d097 KVM: x86: Fix lapic timer interrupt lost after loading a snapshot.
When running android emulator (which is based on QEMU 2.12) on
certain Intel hosts with kernel version 6.3-rc1 or above, guest
will freeze after loading a snapshot. This is almost 100%
reproducible. By default, the android emulator will use snapshot
to speed up the next launching of the same android guest. So
this breaks the android emulator badly.

I tested QEMU 8.0.4 from Debian 12 with an Ubuntu 22.04 guest by
running command "loadvm" after "savevm". The same issue is
observed. At the same time, none of our AMD platforms is impacted.
More experiments show that loading the KVM module with
"enable_apicv=false" can workaround it.

The issue started to show up after commit 8e6ed96cdd ("KVM: x86:
fire timer when it is migrated and expired, and in oneshot mode").
However, as is pointed out by Sean Christopherson, it is introduced
by commit 967235d320 ("KVM: vmx: clear pending interrupts on
KVM_SET_LAPIC"). commit 8e6ed96cdd ("KVM: x86: fire timer when
it is migrated and expired, and in oneshot mode") just makes it
easier to hit the issue.

Having both commits, the oneshot lapic timer gets fired immediately
inside the KVM_SET_LAPIC call when loading the snapshot. On Intel
platforms with APIC virtualization and posted interrupt processing,
this eventually leads to setting the corresponding PIR bit. However,
the whole PIR bits get cleared later in the same KVM_SET_LAPIC call
by apicv_post_state_restore. This leads to timer interrupt lost.

The fix is to move vmx_apicv_post_state_restore to the beginning of
the KVM_SET_LAPIC call and rename to vmx_apicv_pre_state_restore.
What vmx_apicv_post_state_restore does is actually clearing any
former apicv state and this behavior is more suitable to carry out
in the beginning.

Fixes: 967235d320 ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC")
Cc: stable@vger.kernel.org
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Haitao Shan <hshan@google.com>
Link: https://lore.kernel.org/r/20230913000215.478387-1-hshan@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-28 10:19:29 -07:00
Peter Gonda bc3d7c5570 KVM: SVM: Update SEV-ES shutdown intercepts with more metadata
Currently if an SEV-ES VM shuts down userspace sees KVM_RUN struct with
only errno=EINVAL. This is a very limited amount of information to debug
the situation. Instead return KVM_EXIT_SHUTDOWN to alert userspace the VM
is shutting down and is not usable any further.

Signed-off-by: Peter Gonda <pgonda@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230907162449.1739785-1-pgonda@google.com
[sean: tweak changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-28 10:15:00 -07:00
Kyle Meyer f10a570b09 KVM: x86: Add CONFIG_KVM_MAX_NR_VCPUS to allow up to 4096 vCPUs
Add a Kconfig entry to set the maximum number of vCPUs per KVM guest and
set the default value to 4096 when MAXSMP is enabled, as there are use
cases that want to create more than the currently allowed 1024 vCPUs and
are more than happy to eat the memory overhead.

The Hyper-V TLFS doesn't allow more than 64 sparse banks, i.e. allows a
maximum of 4096 virtual CPUs. Cap KVM's maximum number of virtual CPUs
to 4096 to avoid exceeding Hyper-V's limit as KVM support for Hyper-V is
unconditional, and alternatives like dynamically disabling Hyper-V
enlightenments that rely on sparse banks would require non-trivial code
changes.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://lore.kernel.org/r/20230824215244.3897419-1-kyle.meyer@hpe.com
[sean: massage changelog with --verbose, document #ifdef mess]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-28 09:25:19 -07:00
Alexey Dobriyan b3bee1e7c3 x86/boot: Compile boot code with -std=gnu11 too
Use -std=gnu11 for consistency with main kernel code.

It doesn't seem to change anything in vmlinux.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: https://lore.kernel.org/r/2058761e-12a4-4b2f-9690-3c3c1c9902a5@p183
2023-09-28 10:11:27 +02:00
Pu Wen a5ef7d68ce x86/srso: Add SRSO mitigation for Hygon processors
Add mitigation for the speculative return stack overflow vulnerability
which exists on Hygon processors too.

Signed-off-by: Pu Wen <puwen@hygon.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/tencent_4A14812842F104E93AA722EC939483CEFF05@qq.com
2023-09-28 09:57:07 +02:00
Michal Luczaj 4346db6e6e KVM: x86: Force TLB flush on userspace changes to special registers
Userspace can directly modify the content of vCPU's CR0, CR3, and CR4 via
KVM_SYNC_X86_SREGS and KVM_SET_SREGS{,2}. Make sure that KVM flushes guest
TLB entries and paging-structure caches if a (partial) guest TLB flush is
architecturally required based on the CRn changes.  To keep things simple,
flush whenever KVM resets the MMU context, i.e. if any bits in CR0, CR3,
CR4, or EFER are modified.  This is extreme overkill, but stuffing state
from userspace is not such a hot path that preserving guest TLB state is a
priority.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230814222358.707877-3-mhal@rbox.co
[sean: call out that the flushing on MMU context resets is for simplicity]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-27 12:58:33 -07:00
Michal Luczaj 9dbb029b9c KVM: x86: Remove redundant vcpu->arch.cr0 assignments
Drop the vcpu->arch.cr0 assignment after static_call(kvm_x86_set_cr0).
CR0 was already set by {vmx,svm}_set_cr0().

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/r/20230814222358.707877-2-mhal@rbox.co
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-27 12:57:48 -07:00
Xin Li (Intel) 1882366217 x86/entry: Fix typos in comments
Fix 2 typos in the comments.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: https://lore.kernel.org/r/20230926061319.1929127-1-xin@zytor.com
2023-09-27 10:05:04 +02:00
Xin Li (Intel) da4aff622a x86/entry: Remove unused argument %rsi passed to exc_nmi()
exc_nmi() only takes one argument of type struct pt_regs *, but
asm_exc_nmi() calls it with 2 arguments. The second one passed
in %rsi seems to be a leftover, so simply remove it.

Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Link: https://lore.kernel.org/r/20230926061319.1929127-1-xin@zytor.com
2023-09-27 10:04:54 +02:00
Muralidhara M K 24775700ea x86/amd_nb: Add AMD Family MI300 PCI IDs
Add new Root, Device 18h Function 3, and Function 4 PCI IDS
for AMD F19h Model 90h-9fh (MI300A).

Signed-off-by: Muralidhara M K <muralidhara.mk@amd.com>
Signed-off-by: Suma Hegde <suma.hegde@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20230926051932.193239-1-suma.hegde@amd.com
2023-09-27 09:53:23 +02:00
Jim Mattson 73554b29bd KVM: x86/pmu: Synthesize at most one PMI per VM-exit
When the irq_work callback, kvm_pmi_trigger_fn(), is invoked during a
VM-exit that also invokes __kvm_perf_overflow() as a result of
instruction emulation, kvm_pmu_deliver_pmi() will be called twice
before the next VM-entry.

Calling kvm_pmu_deliver_pmi() twice is unlikely to be problematic now that
KVM sets the LVTPC mask bit when delivering a PMI.  But using IRQ work to
trigger the PMI is still broken, albeit very theoretically.

E.g. if the self-IPI to trigger IRQ work is be delayed long enough for the
vCPU to be migrated to a different pCPU, then it's possible for
kvm_pmi_trigger_fn() to race with the kvm_pmu_deliver_pmi() from
KVM_REQ_PMI and still generate two PMIs.

KVM could set the mask bit using an atomic operation, but that'd just be
piling on unnecessary code to workaround what is effectively a hack.  The
*only* reason KVM uses IRQ work is to ensure the PMI is treated as a wake
event, e.g. if the vCPU just executed HLT.

Remove the irq_work callback for synthesizing a PMI, and all of the
logic for invoking it. Instead, to prevent a vcpu from leaving C0 with
a PMI pending, add a check for KVM_REQ_PMI to kvm_vcpu_has_events().

Fixes: 9cd803d496 ("KVM: x86: Update vPMCs when retiring instructions")
Signed-off-by: Jim Mattson <jmattson@google.com>
Tested-by: Mingwei Zhang <mizhang@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20230925173448.3518223-2-mizhang@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-25 14:42:52 -07:00
Jim Mattson a16eb25b09 KVM: x86: Mask LVTPC when handling a PMI
Per the SDM, "When the local APIC handles a performance-monitoring
counters interrupt, it automatically sets the mask flag in the LVT
performance counter register."  Add this behavior to KVM's local APIC
emulation.

Failure to mask the LVTPC entry results in spurious PMIs, e.g. when
running Linux as a guest, PMI handlers that do a "late_ack" spew a large
number of "dazed and confused" spurious NMI warnings.

Fixes: f5132b0138 ("KVM: Expose a version 2 architectural PMU to a guests")
Cc: stable@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
Tested-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20230925173448.3518223-3-mizhang@google.com
[sean: massage changelog, correct Fixes]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-25 14:40:29 -07:00
Roman Kagan b29a2acd36 KVM: x86/pmu: Truncate counter value to allowed width on write
Performance counters are defined to have width less than 64 bits.  The
vPMU code maintains the counters in u64 variables but assumes the value
to fit within the defined width.  However, for Intel non-full-width
counters (MSR_IA32_PERFCTRx) the value receieved from the guest is
truncated to 32 bits and then sign-extended to full 64 bits.  If a
negative value is set, it's sign-extended to 64 bits, but then in
kvm_pmu_incr_counter() it's incremented, truncated, and compared to the
previous value for overflow detection.

That previous value is not truncated, so it always evaluates bigger than
the truncated new one, and a PMI is injected.  If the PMI handler writes
a negative counter value itself, the vCPU never quits the PMI loop.

Turns out that Linux PMI handler actually does write the counter with
the value just read with RDPMC, so when no full-width support is exposed
via MSR_IA32_PERF_CAPABILITIES, and the guest initializes the counter to
a negative value, it locks up.

This has been observed in the field, for example, when the guest configures
atop to use perfevents and runs two instances of it simultaneously.

To address the problem, maintain the invariant that the counter value
always fits in the defined bit width, by truncating the received value
in the respective set_msr methods.  For better readability, factor the
out into a helper function, pmc_write_counter(), shared by vmx and svm
parts.

Fixes: 9cd803d496 ("KVM: x86: Update vPMCs when retiring instructions")
Cc: stable@vger.kernel.org
Signed-off-by: Roman Kagan <rkagan@amazon.de>
Link: https://lore.kernel.org/all/20230504120042.785651-1-rkagan@amazon.de
Tested-by: Like Xu <likexu@tencent.com>
[sean: tweak changelog, s/set/write in the helper]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-09-25 14:30:44 -07:00
David Howells 066baf92be
iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
copy_mc_to_user() has the destination marked __user on powerpc, but not on
x86; the latter results in a sparse warning in lib/iov_iter.c.

Fix this by applying the tag on x86 too.

Fixes: ec6347bb43 ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230925120309.1731676-3-dhowells@redhat.com
cc: Dan Williams <dan.j.williams@intel.com>
cc: Thomas Gleixner <tglx@linutronix.de>
cc: Ingo Molnar <mingo@redhat.com>
cc: Borislav Petkov <bp@alien8.de>
cc: Dave Hansen <dave.hansen@linux.intel.com>
cc: "H. Peter Anvin" <hpa@zytor.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: x86@kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-09-25 14:30:27 +02:00
Breno Leitao 599522d9d2 perf/x86/amd: Do not WARN() on every IRQ
Zen 4 systems running buggy microcode can hit a WARN_ON() in the PMI
handler, as shown below, several times while perf runs. A simple
`perf top` run is enough to render the system unusable:

  WARNING: CPU: 18 PID: 20608 at arch/x86/events/amd/core.c:944 amd_pmu_v2_handle_irq+0x1be/0x2b0

This happens because the Performance Counter Global Status Register
(PerfCntGlobalStatus) has one or more bits set which are considered
reserved according to the "AMD64 Architecture Programmer’s Manual,
Volume 2: System Programming, 24593":

  https://www.amd.com/system/files/TechDocs/24593.pdf

To make this less intrusive, warn just once if any reserved bit is set
and prompt the user to update the microcode. Also sanitize the value to
what the code is handling, so that the overflow events continue to be
handled for the number of counters that are known to be sane.

Going forward, the following microcode patch levels are recommended
for Zen 4 processors in order to avoid such issues with reserved bits:

  Family=0x19 Model=0x11 Stepping=0x01: Patch=0x0a10113e
  Family=0x19 Model=0x11 Stepping=0x02: Patch=0x0a10123e
  Family=0x19 Model=0xa0 Stepping=0x01: Patch=0x0aa00116
  Family=0x19 Model=0xa0 Stepping=0x02: Patch=0x0aa00212

Commit f2eb058afc57 ("linux-firmware: Update AMD cpu microcode") from
the linux-firmware tree has binaries that meet the minimum required
patch levels.

  [ sandipan: - add message to prompt users to update microcode
              - rework commit message and call out required microcode levels ]

Fixes: 7685665c39 ("perf/x86/amd/core: Add PerfMonV2 overflow handling")
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/3540f985652f41041e54ee82aa53e7dbd55739ae.1694696888.git.sandipan.das@amd.com/
2023-09-25 11:30:31 +02:00
Linus Torvalds 8a511e7efc ARM:
* Fix EL2 Stage-1 MMIO mappings where a random address was used
 
 * Fix SMCCC function number comparison when the SVE hint is set
 
 RISC-V:
 
 * Fix KVM_GET_REG_LIST API for ISA_EXT registers
 
 * Fix reading ISA_EXT register of a missing extension
 
 * Fix ISA_EXT register handling in get-reg-list test
 
 * Fix filtering of AIA registers in get-reg-list test
 
 x86:
 
 * Fixes for TSC_AUX virtualization
 
 * Stop zapping page tables asynchronously, since we don't
   zap them as often as before
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmUQU5YUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNcdwf/X8eHQ5yfAE0J70xs4VZ1z7B8i77q
 P54401z/q0FyQ4yyTHwbUv/FgVYscZ0efYogrkd3uuoPNtLmN2xKj1tM95A2ncP/
 v318ljevZ0FWZ6J471Xu9MM3u15QmjC3Wai9z6IP4tz0S2rUhOYTJdFqlNf6gQSu
 P8n9l2j3ZLAiUNizXa8M7350gCUFCBi37dvLLVTYOxbu17hZtmNjhNpz5G7YNc9y
 zmJIJh30ZnMGUgMylLfcW0ZoqQFNIkNg3yyr9YjY68bTW5aspXdhp9u0zI+01xYF
 sT+tOXBPPLi9MBuckX+oLMsvNXEZWxos2oMow3qziMo83neG+jU+WhjLHg==
 =+sqe
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM:

   - Fix EL2 Stage-1 MMIO mappings where a random address was used

   - Fix SMCCC function number comparison when the SVE hint is set

  RISC-V:

   - Fix KVM_GET_REG_LIST API for ISA_EXT registers

   - Fix reading ISA_EXT register of a missing extension

   - Fix ISA_EXT register handling in get-reg-list test

   - Fix filtering of AIA registers in get-reg-list test

  x86:

   - Fixes for TSC_AUX virtualization

   - Stop zapping page tables asynchronously, since we don't zap them as
     often as before"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX
  KVM: SVM: Fix TSC_AUX virtualization setup
  KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway
  KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
  KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe()
  KVM: x86/mmu: Open code leaf invalidation from mmu_notifier
  KVM: riscv: selftests: Selectively filter-out AIA registers
  KVM: riscv: selftests: Fix ISA_EXT register handling in get-reg-list
  RISC-V: KVM: Fix riscv_vcpu_get_isa_ext_single() for missing extensions
  RISC-V: KVM: Fix KVM_GET_REG_LIST API for ISA_EXT registers
  KVM: selftests: Assert that vasprintf() is successful
  KVM: arm64: nvhe: Ignore SVE hint in SMCCC function ID
  KVM: arm64: Properly return allocated EL2 VA from hyp_alloc_private_va_range()
2023-09-24 14:14:35 -07:00
Hugh Dickins f4c5ca9850 x86_64: Show CR4.PSE on auxiliaries like on BSP
Set CR4.PSE in secondary_startup_64: the Intel SDM is clear that it does
not matter whether it's 0 or 1 when 4-level-pts are enabled, but it's
distracting to find CR4 different on BSP and auxiliaries - on x86_64,
BSP alone got to add the PSE bit, in probe_page_size_mask().

Peter Zijlstra adds:

   "I think the point is that PSE bit is completely without
    meaning in long mode.

    But yes, having the same CR4 bits set across BSP and APs is
    definitely sane."

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/103ad03a-8c93-c3e2-4226-f79af4d9a074@google.com
2023-09-24 13:23:54 +02:00
Kees Cook d9a01959d9 x86/platform/uv: Annotate struct uv_rtc_timer_head with __counted_by
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).

Found with Coccinelle:

  https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci

Add __counted_by for struct uv_rtc_timer_head.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230922175151.work.118-kees@kernel.org
2023-09-24 12:02:58 +02:00
Christophe JAILLET 94adf495e7 x86/kgdb: Fix a kerneldoc warning when build with W=1
When compiled with W=1, the following warning is generated:

  arch/x86/kernel/kgdb.c:698: warning: Cannot understand  *
   on line 698 - I thought it was a doc line

Remove the corresponding empty comment line to fix the warning.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/aad659537c1d4ebd86912a6f0be458676c8e69af.1695401178.git.christophe.jaillet@wanadoo.fr
2023-09-24 11:00:13 +02:00
Paolo Bonzini 5804c19b80 KVM/riscv fixes for 6.6, take #1
- Fix KVM_GET_REG_LIST API for ISA_EXT registers
 - Fix reading ISA_EXT register of a missing extension
 - Fix ISA_EXT register handling in get-reg-list test
 - Fix filtering of AIA registers in get-reg-list test
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmUMDssACgkQrUjsVaLH
 LAckSg/+IZ5DPvPs81rUpL3i1Z5SrK4jXWL2zyvMIksBEYmFD2NPNvinVZ4Sxv6u
 IzWNKJcAp4nA/+qPGPLXCURDe1W6PCDvO4SShjYm2UkPtNIfiskmFr3MunXZysgm
 I7USJgj9ev+46yfOnwlYrwpZ8sQk7Z6nLTI/6Osk4Q7Sm0Vjoobh6krub7LNjeKQ
 y6p+vxrXj+Owc5H9bgl0wAi6GOmOJKAM+cZU5DygQxjOgiUgNbOzsVgbLDTvExNq
 gISUU4PoAO7/U1NKEaaopbe7C0KNQHTnegedtXsDzg7WTBah77/MNBt4snbfiR27
 6rODklZlG/kAGIHdVtYC+zf8AfPqvGTIT8SLGmzQlyVlHujFBGn0L41NmMzW+EeA
 7UpfUk8vPiiGhefBE+Ml3yqiReogo+aRhL1mZoI39rPusd7DMnwx97KpBlAcYuTI
 PTgqycIMRmq2dSCHya+nrOVpwwV3Qx4G8Alpq1jOa7XDMeGMj4h521NQHjWckIK2
 IJ2a0RtzB10+Z91nLV+amdAno326AnxJC7dR26O6uqVSPJy/nHE2GAc49gFKeWq6
 QmzgzY1sU2Y02/TM8miyKSl3i+bpZSIPzXCKlOm1TowBKO+sfJzn/yMon9sVaVhb
 4Wjgg3vgE74y9FVsL4JXR/PfrZH5Aq77J1R+/pMtsNTtVYrt1Sk=
 =ytFs
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-fixes-6.6-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv fixes for 6.6, take #1

- Fix KVM_GET_REG_LIST API for ISA_EXT registers
- Fix reading ISA_EXT register of a missing extension
- Fix ISA_EXT register handling in get-reg-list test
- Fix filtering of AIA registers in get-reg-list test
2023-09-23 05:35:55 -04:00
Tom Lendacky 916e3e5f26 KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX
When the TSC_AUX MSR is virtualized, the TSC_AUX value is swap type "B"
within the VMSA. This means that the guest value is loaded on VMRUN and
the host value is restored from the host save area on #VMEXIT.

Since the value is restored on #VMEXIT, the KVM user return MSR support
for TSC_AUX can be replaced by populating the host save area with the
current host value of TSC_AUX. And, since TSC_AUX is not changed by Linux
post-boot, the host save area can be set once in svm_hardware_enable().
This eliminates the two WRMSR instructions associated with the user return
MSR support.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <d381de38eb0ab6c9c93dda8503b72b72546053d7.1694811272.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23 05:35:49 -04:00
Tom Lendacky e0096d01c4 KVM: SVM: Fix TSC_AUX virtualization setup
The checks for virtualizing TSC_AUX occur during the vCPU reset processing
path. However, at the time of initial vCPU reset processing, when the vCPU
is first created, not all of the guest CPUID information has been set. In
this case the RDTSCP and RDPID feature support for the guest is not in
place and so TSC_AUX virtualization is not established.

This continues for each vCPU created for the guest. On the first boot of
an AP, vCPU reset processing is executed as a result of an APIC INIT
event, this time with all of the guest CPUID information set, resulting
in TSC_AUX virtualization being enabled, but only for the APs. The BSP
always sees a TSC_AUX value of 0 which probably went unnoticed because,
at least for Linux, the BSP TSC_AUX value is 0.

Move the TSC_AUX virtualization enablement out of the init_vmcb() path and
into the vcpu_after_set_cpuid() path to allow for proper initialization of
the support after the guest CPUID information has been set.

With the TSC_AUX virtualization support now in the vcpu_set_after_cpuid()
path, the intercepts must be either cleared or set based on the guest
CPUID input.

Fixes: 296d5a17e7 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <4137fbcb9008951ab5f0befa74a0399d2cce809a.1694811272.git.thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23 05:35:49 -04:00
Paolo Bonzini e8d93d5d93 KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway
svm_recalc_instruction_intercepts() is always called at least once
before the vCPU is started, so the setting or clearing of the RDTSCP
intercept can be dropped from the TSC_AUX virtualization support.

Extracted from a patch by Tom Lendacky.

Cc: stable@vger.kernel.org
Fixes: 296d5a17e7 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23 05:35:49 -04:00
Sean Christopherson 0df9dab891 KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously
Stop zapping invalidate TDP MMU roots via work queue now that KVM
preserves TDP MMU roots until they are explicitly invalidated.  Zapping
roots asynchronously was effectively a workaround to avoid stalling a vCPU
for an extended during if a vCPU unloaded a root, which at the time
happened whenever the guest toggled CR0.WP (a frequent operation for some
guest kernels).

While a clever hack, zapping roots via an unbound worker had subtle,
unintended consequences on host scheduling, especially when zapping
multiple roots, e.g. as part of a memslot.  Because the work of zapping a
root is no longer bound to the task that initiated the zap, things like
the CPU affinity and priority of the original task get lost.  Losing the
affinity and priority can be especially problematic if unbound workqueues
aren't affined to a small number of CPUs, as zapping multiple roots can
cause KVM to heavily utilize the majority of CPUs in the system, *beyond*
the CPUs KVM is already using to run vCPUs.

When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
zap can result in KVM occupying all logical CPUs for ~8ms, and result in
high priority tasks not being scheduled in in a timely manner.  In v5.15,
which doesn't preserve unloaded roots, the issues were even more noticeable
as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.

Consuming all CPUs for an extended duration can lead to significant jitter
throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
is a semi-frequent operation as memslots are deleted and recreated with
different host virtual addresses to react to host GPU drivers allocating
and freeing GPU blobs.  On ChromeOS, the jitter manifests as audio blips
during games due to the audio server's tasks not getting scheduled in
promptly, despite the tasks having a high realtime priority.

Deleting memslots isn't exactly a fast path and should be avoided when
possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
memslot shenanigans, but KVM is squarely in the wrong.  Not to mention
that removing the async zapping eliminates a non-trivial amount of
complexity.

Note, one of the subtle behaviors hidden behind the async zapping is that
KVM would zap invalidated roots only once (ignoring partial zaps from
things like mmu_notifier events).  Preserve this behavior by adding a flag
to identify roots that are scheduled to be zapped versus roots that have
already been zapped but not yet freed.

Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
encounter invalid roots, as it's not at all obvious why zapping
invalidated roots shouldn't simply zap all invalid roots.

Reported-by: Pattara Teerapong <pteerapong@google.com>
Cc: David Stevens <stevensd@google.com>
Cc: Yiwei Zhang<zzyiwei@google.com>
Cc: Paul Hsia <paulhsia@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230916003916.2545000-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23 05:35:48 -04:00
Paolo Bonzini 441a5dfcd9 KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe()
All callers except the MMU notifier want to process all address spaces.
Remove the address space ID argument of for_each_tdp_mmu_root_yield_safe()
and switch the MMU notifier to use __for_each_tdp_mmu_root_yield_safe().

Extracted out of a patch by Sean Christopherson <seanjc@google.com>

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23 05:35:12 -04:00
Linus Torvalds b61ec8d0f1 Fix the patching ordering between static calls and return thunks.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmUN5ZsACgkQEsHwGGHe
 VUrx/BAAwPv/CY946qbSyhl1AvpvjkWEi2O6KJfz0OUPCkL0qBfX9eLTJjDKiDoS
 qlqO+yT8qmapf4oZ1rKyVH7AIHmy0b2N9Ks3T0I6lcvZ9JP5ADtoeT6RdnS+ALQb
 l82HlkhmIBJLC6r2wnxlRPKGv+VrRXuk0uiqvFVhslHyAeslxp3kkK6pAASBqEuf
 k0Hc8+s/qh8lpXi/O898QJYNRL0Abfxd/Jfnn5JHedA/SB6XAPwN7s5Uk2mQL7s8
 nujV8msV7EgRz6Svjwph+3BaU3ul2bIsbm27R0GVEhFJ9eYbjsn8pr62/HRh2X+v
 8lgijBDlbB7fcZQBvIzyyP2UFNbM6QN7uFzZOYL92fl+JAqOQ5B6SQeyrLo65w7Y
 5zxPZxMPKkuzCyRYdZUJx0KPQS1cagtn/Q8D0fy61jvAufh6sxl9BXXYZimaI0rD
 uN6AlTTR8nTMM+9KlHUG1wSC3Yz9UYwk5GVKk7dhhorpJuVj5yj+hEVHsP8AXceJ
 WLxh/pUItIiPuYc45eEwNGo6x4rS3ARrXYQezXyAyyLv4tP2Zu7iJx+3NVkRFHIx
 kU4H0aC33Au0lvFkeO6g/qZ+u46sGf2VsuVPBiVreARY+CM5HRKA16soPIdh0Uzw
 anq8nMaRNL/myVeVAMdJi6mO9sxtmJwCzQWN7fh6fH0c+8/NmeY=
 =xLBC
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 rethunk fixes from Borislav Petkov:
 "Fix the patching ordering between static calls and return thunks"

* tag 'x86_urgent_for_v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86,static_call: Fix static-call vs return-thunk
  x86/alternatives: Remove faulty optimization
2023-09-22 12:35:56 -07:00
Linus Torvalds e583bffeb8 Misc x86 fixes:
- Fix a kexec bug,
  - Fix an UML build bug,
  - Fix a handful of SRSO related bugs,
  - Fix a shadow stacks handling bug & robustify related code.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmUNbQIRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jVIg/9EChW7qFTda8joR41Uayg07VIOpGirDLu
 7hjzOnt4Ni93cfFNUBkKDKXoWxGAiOD+cRDnT6+zsJAvAZR26Y3UNVLYlAy+lFKK
 9kRxeDM7nOEKqCC+zneinMFcIKmZRttMLpj8O901jB2S08x4UarnNx5SaWEcqYbn
 rf1XIEuEvlxqMfafNueS/TRadV52qVU8Y+2+inkDnM7dDNwt+rCs5tQ9ebJos8QO
 tsAoQes1G+0mjPrpqAgsIic5e3QCHliwVr8ASQrKZ9KR+fokEJRbSBNjgHUCNeVN
 0bHBhuDJBSniC7QmAQGEizrZWmHl2HxwYYnCE0gd0g24b7sDTwWBFSXWratCrPdX
 e4qYq36xonWdQcbpVF8ijMXF/R810vDyis/yc/BTeo5EBWf6aM/cx1dmS9DUxRpA
 QsIW2amSCPVYwYE5MAC+K/DM9nq1tk8YnKi4Mob3L28+W3CmVwSwT9S86z2QLlZu
 +KgVV4yBtJY1FJqVur5w3awhFtqLfBdfIvs6uQCd9VZXVPbBfS8+rOQmmhFixEDu
 FSPlAChmXYTAM2R+UxcEvw1Zckrtd2BCOa8UrY2lq57mduBK1EymdpfjlrUAChLG
 x7fQBOGNgOTLwYcsurIdS5jAqiudpnJ/KDG8ZAmKsVoW96JCPp9B3tVZMp9tT30C
 8HRwSPX4384=
 =58St
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes from Ingo Molnar:

 - Fix a kexec bug

 - Fix an UML build bug

 - Fix a handful of SRSO related bugs

 - Fix a shadow stacks handling bug & robustify related code

* tag 'x86-urgent-2023-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/shstk: Add warning for shadow stack double unmap
  x86/shstk: Remove useless clone error handling
  x86/shstk: Handle vfork clone failure correctly
  x86/srso: Fix SBPB enablement for spec_rstack_overflow=off
  x86/srso: Don't probe microcode in a guest
  x86/srso: Set CPUID feature bits independently of bug or mitigation status
  x86/srso: Fix srso_show_state() side effect
  x86/asm: Fix build of UML with KASAN
  x86/mm, kexec, ima: Use memblock_free_late() from ima_free_kexec_buffer()
2023-09-22 12:26:42 -07:00
Saurabh Sengar 203a521bd9 x86/hyperv: Add common print prefix "Hyper-V" in hv_init
Add "#define pr_fmt()" in hv_init.c to use "Hyper-V:" as common
print prefix for all pr_*() statements in this file.

Remove the "Hyper-V:" already prefixed in couple of prints.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/1695123361-8877-1-git-send-email-ssengar@linux.microsoft.com
2023-09-22 18:43:09 +00:00
Saurabh Sengar 14058f72cf x86/hyperv: Remove hv_vtl_early_init initcall
There has been cases reported where HYPERV_VTL_MODE is enabled by mistake,
on a non Hyper-V platforms. This causes the hv_vtl_early_init function to
be called in an non Hyper-V/VTL platforms which results the memory
corruption.

Remove the early_initcall for hv_vtl_early_init and call it at the end of
hyperv_init to make sure it is never called in a non Hyper-V platform by
mistake.

Reported-by: Mathias Krause <minipli@grsecurity.net>
Closes: https://lore.kernel.org/lkml/40467722-f4ab-19a5-4989-308225b1f9f0@grsecurity.net/
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Acked-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/1695358720-27681-1-git-send-email-ssengar@linux.microsoft.com
2023-09-22 18:41:29 +00:00
Saurabh Sengar f2a55d08d7 x86/hyperv: Restrict get_vtl to only VTL platforms
When Linux runs in a non-default VTL (CONFIG_HYPERV_VTL_MODE=y),
get_vtl() must never fail as its return value is used in negotiations
with the host. In the more generic case, (CONFIG_HYPERV_VTL_MODE=n) the
VTL is always zero so there's no need to do the hypercall.

Make get_vtl() BUG() in case of failure and put the implementation under
"if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)" to avoid the call altogether in
the most generic use case.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/1695182675-13405-1-git-send-email-ssengar@linux.microsoft.com
2023-09-22 18:40:53 +00:00
Peter Zijlstra aee9d30b97 x86,static_call: Fix static-call vs return-thunk
Commit

  7825451fa4 ("static_call: Add call depth tracking support")

failed to realize the problem fixed there is not specific to call depth
tracking but applies to all return-thunk uses.

Move the fix to the appropriate place and condition.

Fixes: ee88d363d1 ("x86,static_call: Use alternative RET encoding")
Reported-by: David Kaplan <David.Kaplan@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
2023-09-22 18:58:24 +02:00
Josh Poimboeuf 4ba89dd6dd x86/alternatives: Remove faulty optimization
The following commit

  095b8303f3 ("x86/alternative: Make custom return thunk unconditional")

made '__x86_return_thunk' a placeholder value.  All code setting
X86_FEATURE_RETHUNK also changes the value of 'x86_return_thunk'.  So
the optimization at the beginning of apply_returns() is dead code.

Also, before the above-mentioned commit, the optimization actually had a
bug It bypassed __static_call_fixup(), causing some raw returns to
remain unpatched in static call trampolines.  Thus the 'Fixes' tag.

Fixes: d2408e043e ("x86/alternative: Optimize returns patching")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/16d19d2249d4485d8380fb215ffaae81e6b8119e.1693889988.git.jpoimboe@kernel.org
2023-09-22 18:52:39 +02:00
Kees Cook 215199e3d9 hardening: Provide Kconfig fragments for basic options
Inspired by Salvatore Mesoraca's earlier[1] efforts to provide some
in-tree guidance for kernel hardening Kconfig options, add a new fragment
named "hardening-basic.config" (along with some arch-specific fragments)
that enable a basic set of kernel hardening options that have the least
(or no) performance impact and remove a reasonable set of legacy APIs.

Using this fragment is as simple as running "make hardening.config".

More extreme fragments can be added[2] in the future to cover all the
recognized hardening options, and more per-architecture files can be
added too.

For now, document the fragments directly via comments. Perhaps .rst
documentation can be generated from them in the future (rather than the
other way around).

[1] https://lore.kernel.org/kernel-hardening/1536516257-30871-1-git-send-email-s.mesoraca16@gmail.com/
[2] https://github.com/KSPP/linux/issues/14

Cc: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Cc: x86@kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kbuild@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-09-22 09:50:55 -07:00
Sandipan Das 23d2626b84 perf/x86/amd/core: Fix overflow reset on hotplug
Kernels older than v5.19 do not support PerfMonV2 and the PMI handler
does not clear the overflow bits of the PerfCntrGlobalStatus register.
Because of this, loading a recent kernel using kexec from an older
kernel can result in inconsistent register states on Zen 4 systems.

The PMI handler of the new kernel gets confused and shows a warning when
an overflow occurs because some of the overflow bits are set even if the
corresponding counters are inactive. These are remnants from overflows
that were handled by the older kernel.

During CPU hotplug, the PerfCntrGlobalCtl and PerfCntrGlobalStatus
registers should always be cleared for PerfMonV2-capable processors.
However, a condition used for NB event constaints applicable only to
older processors currently prevents this from happening. Move the reset
sequence to an appropriate place and also clear the LBR Freeze bit.

Fixes: 21d59e3e2c ("perf/x86/amd/core: Detect PerfMonV2 support")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/882a87511af40792ba69bb0e9026f19a2e71e8a3.1694696888.git.sandipan.das@amd.com
2023-09-22 12:05:14 +02:00
Fangrui Song b8ec60e118 x86/speculation, objtool: Use absolute relocations for annotations
.discard.retpoline_safe sections do not have the SHF_ALLOC flag.  These
sections referencing text sections' STT_SECTION symbols with PC-relative
relocations like R_386_PC32 [0] is conceptually not suitable.  Newer
LLD will report warnings for REL relocations even for relocatable links [1]:

    ld.lld: warning: vmlinux.a(drivers/i2c/busses/i2c-i801.o):(.discard.retpoline_safe+0x120): has non-ABS relocation R_386_PC32 against symbol ''

Switch to absolute relocations instead, which indicate link-time
addresses.  In a relocatable link, these addresses are also output
section offsets, used by checks in tools/objtool/check.c.  When linking
vmlinux, these .discard.* sections will be discarded, therefore it is
not a problem that R_X86_64_32 cannot represent a kernel address.

Alternatively, we could set the SHF_ALLOC flag for .discard.* sections,
but I think non-SHF_ALLOC for sections to be discarded makes more sense.

Note: if we decide to never support REL architectures (e.g. arm, i386),
we can utilize R_*_NONE relocations (.reloc ., BFD_RELOC_NONE, sym),
making .discard.* sections zero-sized.  That said, the section content
waste is 4 bytes per entry, much smaller than sizeof(Elf{32,64}_Rel).

  [0] commit 1c0c1faf56 ("objtool: Use relative pointers for annotations")
  [1] https://github.com/ClangBuiltLinux/linux/issues/1937

Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20230920001728.1439947-1-maskray@google.com
2023-09-22 11:41:24 +02:00
Paolo Bonzini 7deda2ce5b x86/cpu: Clear SVM feature if disabled by BIOS
When SVM is disabled by BIOS, one cannot use KVM but the
SVM feature is still shown in the output of /proc/cpuinfo.
On Intel machines, VMX is cleared by init_ia32_feat_ctl(),
so do the same on AMD and Hygon processors.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230921114940.957141-1-pbonzini@redhat.com
2023-09-22 10:55:26 +02:00
Ingo Molnar ad42474325 x86/bitops: Remove unused __sw_hweight64() assembly implementation on x86-32
Header cleanups in the fast-headers tree highlighted that we have an
unused assembly implementation for __sw_hweight64():

    WARNING: modpost: EXPORT symbol "__sw_hweight64" [vmlinux] version ...

__arch_hweight64() on x86-32 is defined in the
arch/x86/include/asm/arch_hweight.h header as an inline, using
__arch_hweight32():

  #ifdef CONFIG_X86_32
  static inline unsigned long __arch_hweight64(__u64 w)
  {
          return  __arch_hweight32((u32)w) +
                  __arch_hweight32((u32)(w >> 32));
  }

*But* there's also a __sw_hweight64() assembly implementation:

  arch/x86/lib/hweight.S

  SYM_FUNC_START(__sw_hweight64)
  #ifdef CONFIG_X86_64
  ...
  #else /* CONFIG_X86_32 */
        /* We're getting an u64 arg in (%eax,%edx): unsigned long hweight64(__u64 w) */
        pushl   %ecx

        call    __sw_hweight32
        movl    %eax, %ecx                      # stash away result
        movl    %edx, %eax                      # second part of input
        call    __sw_hweight32
        addl    %ecx, %eax                      # result

        popl    %ecx
        ret
  #endif

But this __sw_hweight64 assembly implementation is unused - and it's
essentially doing the same thing that the inline wrapper does.

Remove the assembly version and add a comment about it.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
2023-09-22 09:34:50 +02:00
Ingo Molnar d73a105586 x86/mm: Move arch_memory_failure() and arch_is_platform_page() definitions from <asm/processor.h> to <asm/pgtable.h>
<linux/mm.h> relies on these definitions being included first,
which is true currently due to historic header spaghetti,
but in the future <asm/processor.h> will not guaranteed to be
included by the MM code.

Move these definitions over into a suitable MM header.

This is a preparatory patch for x86 header dependency simplifications
and reductions.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
2023-09-22 09:32:03 +02:00
Paolo Abeni e9cbc89067 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-21 21:49:45 +02:00
peterz@infradead.org 0f4b5f9722 futex: Add sys_futex_requeue()
Finish off the 'simple' futex2 syscall group by adding
sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are
too numerous to fit into a regular syscall. As such, use struct
futex_waitv to pass the 'source' and 'destination' futexes to the
syscall.

This syscall implements what was previously known as FUTEX_CMP_REQUEUE
and uses {val, uaddr, flags} for source and {uaddr, flags} for
destination.

This design explicitly allows requeueing between different types of
futex by having a different flags word per uaddr.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20230921105248.511860556@noisy.programming.kicks-ass.net
2023-09-21 19:22:10 +02:00