After successful update, the late loading routine prints an update
summary similar to:
microcode: load: updated on 128 primary CPUs with 128 siblings
microcode: revision: 0x21000170 -> 0x21000190
Remove the redundant message in the Intel side of the driver.
[ bp: Massage commit message. ]
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/ZWjYhedNfhAUmt0k@a4bf019067fa.jf.intel.com
The requested info will be stored in 'guest_xsave->region' referenced by
the incoming pointer "struct kvm_xsave *guest_xsave", thus there is no need
to explicitly use return void expression for a void function "static void
kvm_vcpu_ioctl_x86_get_xsave(...)". The issue is caught with [-Wpedantic].
Fixes: 2d287ec65e79 ("x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231007064019.17472-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Set .owner for all KVM-owned filed types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed. Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.
Userspace can invoke delete_module() the instant the last reference to KVM
is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.
Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.
To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.
This reverts commit 70375c2d8f, and fixes
several other file types that have been buggy since their introduction.
Fixes: 70375c2d8f ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d6 ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fix the comment about what can and cannot happen when mmu_unsync_pages_lock
is not help. The comment correctly mentions "clearing sp->unsync", but then
it talks about unsync going from 0 to 1.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-5-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
It is cheap to take tdp_mmu_pages_lock in all write-side critical sections.
We already do it all the time when zapping with read_lock(), so it is not
a problem to do it from the kvm_tdp_mmu_zap_all() path (aka
kvm_arch_flush_shadow_all(), aka VM destruction and MMU notifier release).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-4-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The "bool shared" argument is more or less unnecessary in the
for_each_*_tdp_mmu_root_yield_safe() macros. Many users check for
the lock before calling it; all of them either call small functions
that do the check, or end up calling tdp_mmu_set_spte_atomic() and
tdp_mmu_iter_set_spte(). Add a few assertions to make up for the
lost check in for_each_*_tdp_mmu_root_yield_safe(), but even this
is probably overkill and mostly for documentation reasons.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-3-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Neither tdp_mmu_next_root nor kvm_tdp_mmu_put_root need to know
if the lock is taken for read or write. Either way, protection
is achieved via RCU and tdp_mmu_pages_lock. Remove the argument
and just assert that the lock is taken.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-2-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Re-check that the given SPTE is still a leaf and present SPTE after a
failed cmpxchg in clear_dirty_gfn_range(). clear_dirty_gfn_range()
intends to only operate on present leaf SPTEs, but that could change
after a failed cmpxchg.
A check for present was added in commit 3354ef5a59 ("KVM: x86/mmu:
Check for present SPTE when clearing dirty bit in TDP MMU") but the
check for leaf is still buried in tdp_root_for_each_leaf_pte() and does
not get rechecked on retry.
Fixes: a6a0b05da9 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20231027172640.2335197-3-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fix an off-by-1 error when passing in the range of pages to
kvm_mmu_try_split_huge_pages() during CLEAR_DIRTY_LOG. Specifically, end
is the last page that needs to be split (inclusive) so pass in `end + 1`
since kvm_mmu_try_split_huge_pages() expects the `end` to be
non-inclusive.
At worst this will cause a huge page to be write-protected instead of
eagerly split, which is purely a performance issue, not a correctness
issue. But even that is unlikely as it would require userspace pass in a
bitmap where the last page is the only 4K page on a huge page that needs
to be split.
Reported-by: Vipin Sharma <vipinsh@google.com>
Fixes: cb00a70bd4 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG")
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20231027172640.2335197-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
cpuid.c utilizes vmemdup_user() and array_size() to copy two userspace
arrays. This, currently, does not check for an overflow.
Use the new wrapper vmemdup_array_user() to copy the arrays more safely,
as vmemdup_user() doesn't check for overflow.
Note, KVM explicitly checks the number of entries before duplicating the
array, i.e. adding the overflow check should be a glorified nop.
Suggested-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://lore.kernel.org/r/20231102181526.43279-2-pstanner@redhat.com
[sean: call out that KVM pre-checks the number of entries]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Explicitly track emulated counter events instead of using the common
counter value that's shared with the hardware counter owned by perf.
Bumping the common counter requires snapshotting the pre-increment value
in order to detect overflow from emulation, and the snapshot approach is
inherently flawed.
Snapshotting the previous counter at every increment assumes that there is
at most one emulated counter event per emulated instruction (or rather,
between checks for KVM_REQ_PMU). That's mostly holds true today because
KVM only emulates (branch) instructions retired, but the approach will
fall apart if KVM ever supports event types that don't have a 1:1
relationship with instructions.
And KVM already has a relevant bug, as handle_invalid_guest_state()
emulates multiple instructions without checking KVM_REQ_PMU, i.e. could
miss an overflow event due to clobbering pmc->prev_counter. Not checking
KVM_REQ_PMU is problematic in both cases, but at least with the emulated
counter approach, the resulting behavior is delayed overflow detection,
as opposed to completely lost detection.
Tracking the emulated count fixes another bug where the snapshot approach
can signal spurious overflow due to incorporating both the emulated count
and perf's count in the check, i.e. if overflow is detected by perf, then
KVM's emulation will also incorrectly signal overflow. Add a comment in
the related code to call out the need to process emulated events *after*
pausing the perf event (big kudos to Mingwei for figuring out that
particular wrinkle).
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Roman Kagan <rkagan@amazon.de>
Cc: Jim Mattson <jmattson@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231103230541.352265-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Update a PMC's sample period in pmc_write_counter() to deduplicate code
across all callers of pmc_write_counter(). Opportunistically move
pmc_write_counter() into pmc.c now that it's doing more work. WRMSR isn't
such a hot path that an extra CALL+RET pair will be problematic, and the
order of function definitions needs to be changed anyways, i.e. now is a
convenient time to eat the churn.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove code that unnecessarily clears event_count and need_cleanup in
kvm_pmu_init(), the entire kvm_pmu is zeroed just a few lines earlier.
Vendor code doesn't set event_count or need_cleanup during .init(), and
if either VMX or SVM did set those fields it would be a flagrant bug.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop kvm_vcpu_reset()'s call to kvm_pmu_reset(), the call is performed
only for RESET, which is really just the same thing as vCPU creation,
and kvm_arch_vcpu_create() *just* called kvm_pmu_init(), i.e. there can't
possibly be any work to do.
Unlike Intel, AMD's amd_pmu_refresh() does fill all_valid_pmc_idx even if
guest CPUID is empty, but everything that is at all dynamic is guaranteed
to be '0'/NULL, e.g. it should be impossible for KVM to have already
created a perf event.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Stop all counters and release all perf events before refreshing the vPMU,
i.e. before reconfiguring the vPMU to respond to changes in the vCPU
model.
Clear need_cleanup in kvm_pmu_reset() as well so that KVM doesn't
prematurely stop counters, e.g. if KVM enters the guest and enables
counters before the vCPU is scheduled out.
Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the common (or at least "ignored") aspects of resetting the vPMU to
common x86 code, along with the stop/release helpers that are no used only
by the common pmu.c.
There is no need to manually handle fixed counters as all_valid_pmc_idx
tracks both fixed and general purpose counters, and resetting the vPMU is
far from a hot path, i.e. the extra bit of overhead to the PMC from the
index is a non-issue.
Zero fixed_ctr_ctrl in common code even though it's Intel specific.
Ensuring it's zero doesn't harm AMD/SVM in any way, and stopping the fixed
counters via all_valid_pmc_idx, but not clearing the associated control
bits, would be odd/confusing.
Make the .reset() hook optional as SVM no longer needs vendor specific
handling.
Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Instruction with %rip-relative address operand is one byte shorter than
its absolute address counterpart and is also compatible with position
independent executable (-fpie) build.
No functional changes intended.
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20231031075312.47525-1-ubizjak@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When vNMI is enabled, rely entirely on hardware to correctly handle NMI
blocking, i.e. don't intercept IRET to detect when NMIs are no longer
blocked. KVM already correctly ignores svm->nmi_masked when vNMI is
enabled, so the effect of the bug is essentially an unnecessary VM-Exit.
KVM intercepts IRET for two reasons:
- To track NMI masking to be able to know at any point of time if NMI
is masked.
- To track NMI windows (to inject another NMI after the guest executes
IRET, i.e. unblocks NMIs)
When vNMI is enabled, both cases are handled by hardware:
- NMI masking state resides in int_ctl.V_NMI_BLOCKING and can be read by
KVM at will.
- Hardware automatically "injects" pending virtual NMIs when virtual NMIs
become unblocked.
However, even though pending a virtual NMI for hardware to handle is the
most common way to synthesize a guest NMI, KVM may still directly inject
an NMI via when KVM is handling two "simultaneous" NMIs (see comments in
process_nmi() for details on KVM's simultaneous NMI handling). Per AMD's
APM, hardware sets the BLOCKING flag when software directly injects an NMI
as well, i.e. KVM doesn't need to manually mark vNMIs as blocked:
If Event Injection is used to inject an NMI when NMI Virtualization is
enabled, VMRUN sets V_NMI_MASK in the guest state.
Note, it's still possible that KVM could trigger a spurious IRET VM-Exit.
When running a nested guest, KVM disables vNMI for L2 and thus will enable
IRET interception (in both vmcb01 and vmcb02) while running L2 reason. If
a nested VM-Exit happens before L2 executes IRET, KVM can end up running
L1 with vNMI enable and IRET intercepted. This is also a benign bug, and
even less likely to happen, i.e. can be safely punted to a future fix.
Fixes: fa4c027a79 ("KVM: x86: Add support for SVM's Virtual NMI")
Link: https://lore.kernel.org/all/ZOdnuDZUd4mevCqe@google.como
Cc: Santosh Shukla <santosh.shukla@amd.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <santosh.shukla@amd.com>
Link: https://lore.kernel.org/r/20231018192021.1893261-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a sanity check that FLUSHBYASID is available if SEV is supported in
hardware, as SEV (and beyond) guests are bound to a single ASID, i.e. KVM
can't "flush" by assigning a new, fresh ASID to the guest. If FLUSHBYASID
isn't supported for some bizarre reason, KVM would completely fail to do
TLB flushes for SEV+ guests (see pre_svm_run() and pre_sev_run()).
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20231018193617.1895752-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Advertise support for FLUSHBYASID when nested SVM is enabled, as KVM can
always emulate flushing TLB entries for a vmcb12 ASID, e.g. by running L2
with a new, fresh ASID in vmcb02. Some modern hypervisors, e.g. VMWare
Workstation 17, require FLUSHBYASID support and will refuse to run if it's
not present.
Punt on proper support, as "Honor L1's request to flush an ASID on nested
VMRUN" is one of the TODO items in the (incomplete) list of issues that
need to be addressed in order for KVM to NOT do a full TLB flush on every
nested SVM transition (see nested_svm_transition_tlb_flush()).
Reported-by: Stefan Sterz <s.sterz@proxmox.com>
Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231018194104.1896415-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Revert KVM's made-up consistency check on SVM's TLB control. The APM says
that unsupported encodings are reserved, but the APM doesn't state that
VMRUN checks for a supported encoding. Unless something is called out
in "Canonicalization and Consistency Checks" or listed as MBZ (Must Be
Zero), AMD behavior is typically to let software shoot itself in the foot.
This reverts commit 174a921b69.
Fixes: 174a921b69 ("nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB")
Reported-by: Stefan Sterz <s.sterz@proxmox.com>
Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231018194104.1896415-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Don't force a masterclock update when a vCPU synchronizes to the current
TSC generation, e.g. when userspace hotplugs a pre-created vCPU into the
VM. Unnecessarily updating the masterclock is undesirable as it can cause
kvmclock's time to jump, which is particularly painful on systems with a
stable TSC as kvmclock _should_ be fully reliable on such systems.
The unexpected time jumps are due to differences in the TSC=>nanoseconds
conversion algorithms between kvmclock and the host's CLOCK_MONOTONIC_RAW
(the pvclock algorithm is inherently lossy). When updating the
masterclock, KVM refreshes the "base", i.e. moves the elapsed time since
the last update from the kvmclock/pvclock algorithm to the
CLOCK_MONOTONIC_RAW algorithm. Synchronizing kvmclock with
CLOCK_MONOTONIC_RAW is the lesser of evils when the TSC is unstable, but
adds no real value when the TSC is stable.
Prior to commit 7f187922dd ("KVM: x86: update masterclock values on TSC
writes"), KVM did NOT force an update when synchronizing a vCPU to the
current generation.
commit 7f187922dd
Author: Marcelo Tosatti <mtosatti@redhat.com>
Date: Tue Nov 4 21:30:44 2014 -0200
KVM: x86: update masterclock values on TSC writes
When the guest writes to the TSC, the masterclock TSC copy must be
updated as well along with the TSC_OFFSET update, otherwise a negative
tsc_timestamp is calculated at kvm_guest_time_update.
Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
"if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
becomes redundant, so remove the do_request boolean and collapse
everything into a single condition.
Before that, KVM only re-synced the masterclock if the masterclock was
enabled or disabled Note, at the time of the above commit, VMX
synchronized TSC on *guest* writes to MSR_IA32_TSC:
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
which is why the changelog specifically says "guest writes", but the bug
that was being fixed wasn't unique to guest write, i.e. a TSC write from
the host would suffer the same problem.
So even though KVM stopped synchronizing on guest writes as of commit
0c899c25d7 ("KVM: x86: do not attempt TSC synchronization on guest
writes"), simply reverting commit 7f187922dd is not an option. Figuring
out how a negative tsc_timestamp could be computed requires a bit more
sleuthing.
In kvm_write_tsc() (at the time), except for KVM's "less than 1 second"
hack, KVM snapshotted the vCPU's current TSC *and* the current time in
nanoseconds, where kvm->arch.cur_tsc_nsec is the current host kernel time
in nanoseconds:
ns = get_kernel_ns();
...
if (usdiff < USEC_PER_SEC &&
vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
...
} else {
/*
* We split periods of matched TSC writes into generations.
* For each generation, we track the original measured
* nanosecond time, offset, and write, so if TSCs are in
* sync, we can match exact offset, and if not, we can match
* exact software computation in compute_guest_tsc()
*
* These values are tracked in kvm->arch.cur_xxx variables.
*/
kvm->arch.cur_tsc_generation++;
kvm->arch.cur_tsc_nsec = ns;
kvm->arch.cur_tsc_write = data;
kvm->arch.cur_tsc_offset = offset;
matched = false;
pr_debug("kvm: new tsc generation %llu, clock %llu\n",
kvm->arch.cur_tsc_generation, data);
}
...
/* Keep track of which generation this VCPU has synchronized to */
vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
Note that the above creates a new generation and sets "matched" to false!
But because kvm_track_tsc_matching() looks for matched+1, i.e. doesn't
require the vCPU that creates the new generation to match itself, KVM
would immediately compute vcpus_matched as true for VMs with a single vCPU.
As a result, KVM would skip the masterlock update, even though a new TSC
generation was created:
vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
atomic_read(&vcpu->kvm->online_vcpus));
if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
if (!ka->use_master_clock)
do_request = 1;
if (!vcpus_matched && ka->use_master_clock)
do_request = 1;
if (do_request)
kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
On hardware without TSC scaling support, vcpu->tsc_catchup is set to true
if the guest TSC frequency is faster than the host TSC frequency, even if
the TSC is otherwise stable. And for that mode, kvm_guest_time_update(),
by way of compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the
kernel time at the last TSC write, to compute the guest TSC relative to
kernel time:
static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
{
u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
vcpu->arch.virtual_tsc_mult,
vcpu->arch.virtual_tsc_shift);
tsc += vcpu->arch.this_tsc_write;
return tsc;
}
Except the "kernel_ns" passed to compute_guest_tsc() isn't the current
kernel time, it's the masterclock snapshot!
spin_lock(&ka->pvclock_gtod_sync_lock);
use_master_clock = ka->use_master_clock;
if (use_master_clock) {
host_tsc = ka->master_cycle_now;
kernel_ns = ka->master_kernel_ns;
}
spin_unlock(&ka->pvclock_gtod_sync_lock);
if (vcpu->tsc_catchup) {
u64 tsc = compute_guest_tsc(v, kernel_ns);
if (tsc > tsc_timestamp) {
adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
tsc_timestamp = tsc;
}
}
And so when KVM skips the masterclock update after a TSC write, i.e. after
a new TSC generation is started, the "kernel_ns-vcpu->arch.this_tsc_nsec"
is *guaranteed* to generate a negative value, because this_tsc_nsec was
captured after ka->master_kernel_ns.
Forcing a masterclock update essentially fudged around that problem, but
in a heavy handed way that introduced undesirable side effects, i.e.
unnecessarily forces a masterclock update when a new vCPU joins the party
via hotplug.
Note, KVM forces masterclock updates in other weird ways that are also
likely unnecessary, e.g. when establishing a new Xen shared info page and
when userspace creates a brand new vCPU. But the Xen thing is firmly a
separate mess, and there are no known userspace VMMs that utilize kvmclock
*and* create new vCPUs after the VM is up and running. I.e. the other
issues are future problems.
Reported-by: Dongli Zhang <dongli.zhang@oracle.com>
Closes: https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@oracle.com
Fixes: 7f187922dd ("KVM: x86: update masterclock values on TSC writes")
Cc: David Woodhouse <dwmw2@infradead.org>
Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>
Tested-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lore.kernel.org/r/20231018195638.1898375-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use a switch statement with macro-generated case statements to handle
translating feature flags in order to reduce the probability of runtime
errors due to copy+paste goofs, to make compile-time errors easier to
debug, and to make the code more readable.
E.g. the compiler won't directly generate an error for duplicate if
statements
if (x86_feature == X86_FEATURE_SGX1)
return KVM_X86_FEATURE_SGX1;
else if (x86_feature == X86_FEATURE_SGX2)
return KVM_X86_FEATURE_SGX1;
and so instead reverse_cpuid_check() will fail due to the untranslated
entry pointing at a Linux-defined leaf, which provides practically no
hint as to what is broken
arch/x86/kvm/reverse_cpuid.h:108:2: error: call to __compiletime_assert_450 declared with 'error' attribute:
BUILD_BUG_ON failed: x86_leaf == CPUID_LNX_4
BUILD_BUG_ON(x86_leaf == CPUID_LNX_4);
^
whereas duplicate case statements very explicitly point at the offending
code:
arch/x86/kvm/reverse_cpuid.h:125:2: error: duplicate case value '361'
KVM_X86_TRANSLATE_FEATURE(SGX2);
^
arch/x86/kvm/reverse_cpuid.h:124:2: error: duplicate case value '360'
KVM_X86_TRANSLATE_FEATURE(SGX1);
^
And without macros, the opposite type of copy+paste goof doesn't generate
any error at compile-time, e.g. this yields no complaints:
case X86_FEATURE_SGX1:
return KVM_X86_FEATURE_SGX1;
case X86_FEATURE_SGX2:
return KVM_X86_FEATURE_SGX1;
Note, __feature_translate() is forcibly inlined and the feature is known
at compile-time, so the code generation between an if-elif sequence and a
switch statement should be identical.
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20231024001636.890236-2-jmattson@google.com
[sean: use a macro, rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
The low five bits {INTEL_PSFD, IPRED_CTRL, RRSBA_CTRL, DDPD_U, BHI_CTRL}
advertise the availability of specific bits in IA32_SPEC_CTRL. Since KVM
dynamically determines the legal IA32_SPEC_CTRL bits for the underlying
hardware, the hard work has already been done. Just let userspace know
that a guest can use these IA32_SPEC_CTRL bits.
The sixth bit (MCDT_NO) states that the processor does not exhibit MXCSR
Configuration Dependent Timing (MCDT) behavior. This is an inherent
property of the physical processor that is inherited by the virtual
CPU. Pass that information on to userspace.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20231024001636.890236-1-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Don't enable KVM_WERROR by default for x86-64 builds as KVM's one-off
-Werror enabling is *mostly* superseded by the kernel-wide WERROR, and
enabling KVM_WERROR by default can cause problems for developers working
on other subsystems. E.g. subsystems that have a "zero W=1 regressions"
rule can inadvertently build KVM with -Werror and W=1, and end up with
build failures that are completely uninteresting to the developer (W=1 is
prone to false positives, especially on older compilers).
Keep KVM_WERROR as there are combinations where enabling WERROR isn't
feasible, e.g. the default FRAME_WARN=1024 on i386 builds generates a
non-zero number of warnings and thus errors, and there are far too many
warnings throughout the kernel to enable WERROR with W=1 (building KVM
with -Werror is desirable (with a sane compiler) as W=1 does generate
useful warnings).
Opportunistically drop the dependency on !COMPILE_TEST as it's completely
meaningless (it was copied from i195's -Werror Kconfig), as the kernel's
WERROR is explicitly *enabled* for COMPILE_TEST=y kernel's, i.e. enabling
-Werror is obviosly not dependent on COMPILE_TEST=n.
Reported-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/all/20231006205415.3501535-1-kuba@kernel.org
Link: https://lore.kernel.org/r/20231018151906.1841689-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The same code is used for retrieving package ID procedure from GIDNIDMAP
register. Factor out topology_gidnid_map() to avoid code duplication.
Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20231127185246.2371939-3-alexander.antonov@linux.intel.com
Get logical socket id instead of physical id in discover_upi_topology()
to avoid out-of-bound access on 'upi = &type->topology[nid][idx];' line
that leads to NULL pointer dereference in upi_fill_topology()
Fixes: f680b6e606 ("perf/x86/intel/uncore: Enable UPI topology discovery for Icelake Server")
Reported-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://lore.kernel.org/r/20231127185246.2371939-2-alexander.antonov@linux.intel.com
A write-access violation page fault kernel crash was observed while running
cpuhotplug LTP testcases on SEV-ES enabled systems. The crash was
observed during hotplug, after the CPU was offlined and the process
was migrated to different CPU. setup_ghcb() is called again which
tries to update ghcb_version in sev_es_negotiate_protocol(). Ideally this
is a read_only variable which is initialised during booting.
Trying to write it results in a pagefault:
BUG: unable to handle page fault for address: ffffffffba556e70
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
[ ...]
Call Trace:
<TASK>
? __die_body.cold+0x1a/0x1f
? __die+0x2a/0x35
? page_fault_oops+0x10c/0x270
? setup_ghcb+0x71/0x100
? __x86_return_thunk+0x5/0x6
? search_exception_tables+0x60/0x70
? __x86_return_thunk+0x5/0x6
? fixup_exception+0x27/0x320
? kernelmode_fixup_or_oops+0xa2/0x120
? __bad_area_nosemaphore+0x16a/0x1b0
? kernel_exc_vmm_communication+0x60/0xb0
? bad_area_nosemaphore+0x16/0x20
? do_kern_addr_fault+0x7a/0x90
? exc_page_fault+0xbd/0x160
? asm_exc_page_fault+0x27/0x30
? setup_ghcb+0x71/0x100
? setup_ghcb+0xe/0x100
cpu_init_exception_handling+0x1b9/0x1f0
The fix is to call sev_es_negotiate_protocol() only in the BSP boot phase,
and it only needs to be done once in any case.
[ mingo: Refined the changelog. ]
Fixes: 95d33bfaa3 ("x86/sev: Register GHCB memory when SEV-SNP is active")
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Bo Gan <bo.gan@broadcom.com>
Signed-off-by: Bo Gan <bo.gan@broadcom.com>
Signed-off-by: Ashwin Dayanand Kamat <ashwin.kamat@broadcom.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/1701254429-18250-1-git-send-email-kashwindayan@vmware.com
When there are two racing NMIs on x86, the first NMI invokes NMI handler and
the 2nd NMI is latched until IRET is executed.
If panic on NMI and panic kexec are enabled, the first NMI triggers
panic and starts booting the next kernel via kexec. Note that the 2nd
NMI is still latched. During the early boot of the next kernel, once
an IRET is executed as a result of a page fault, then the 2nd NMI is
unlatched and invokes the NMI handler.
However, NMI handler is not set up at the early stage of boot, which
results in a boot failure.
Avoid such problems by setting up a NOP handler for early NMIs.
[ mingo: Refined the changelog. ]
Signed-off-by: Jun'ichi Nomura <junichi.nomura@nec.com>
Signed-off-by: Derek Barbosa <debarbos@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
This check is superfluous now that the minimum version of binutils to
build the kernel is 2.25. This also fixes an error seen with
llvm-objdump because it does not support '-v' prior to LLVM 13:
llvm-objdump: error: unknown argument '-v'
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: dde24a87c5
Link: https://lore.kernel.org/r/20231129-objdump-reformat-llvm-v3-3-0d855e79314d@kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/1362
GNU objdump and LLVM objdump have differing output formats.
Specifically, GNU objump will format its output as: address:<tab>hex,
whereas LLVM objdump displays its output as address:<space>hex.
objdump_reformat.awk incorrectly handles this discrepancy due to
the unexpected space and as a result insn_decoder_test fails, as
its input is garbled.
The instruction line being tokenized now handles a space and colon,
or tab delimiter.
Signed-off-by: Samuel Zeter <samuelzeter@gmail.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20231129-objdump-reformat-llvm-v3-2-0d855e79314d@kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/1364
If there is "wait" mnemonic in the line being parsed, it is incorrectly
handled by the script, and an extra line of "fwait" in
objdump_reformat's output is inserted. As insn_decoder_test relies upon
the formatted output, the test fails.
This is reproducible when disassembling with llvm-objdump:
Pre-processed lines:
ffffffff81033e72: 9b wait
ffffffff81033e73: 48 c7 c7 89 50 42 82 movq
After objdump_reformat.awk:
ffffffff81033e72: 9b fwait
ffffffff81033e72: wait
ffffffff81033e73: 48 c7 c7 89 50 42 82 movq
The regex match now accepts spaces or tabs, along with the "fwait"
instruction.
Signed-off-by: Samuel Zeter <samuelzeter@gmail.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20231129-objdump-reformat-llvm-v3-1-0d855e79314d@kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/1364
The AMD IBS PMU doesn't handle branch stacks, so it should not accept
events with brstack.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231130062246.290-1-ravi.bangoria@amd.com
Declare the kvm_x86_ops hooks used to wire up paravirt TLB flushes when
running under Hyper-V if and only if CONFIG_HYPERV!=n. Wrapping yet more
code with IS_ENABLED(CONFIG_HYPERV) eliminates a handful of conditional
branches, and makes it super obvious why the hooks *might* be valid.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231018192325.1893896-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When querying whether or not a vCPU "is" running in kernel mode, directly
get the CPL if the vCPU is the currently loaded vCPU. In scenarios where
a guest is profiled via perf-kvm, querying vcpu->arch.preempted_in_kernel
from kvm_guest_state() is wrong if vCPU is actively running, i.e. isn't
scheduled out due to being preempted and so preempted_in_kernel is stale.
This affects perf/core's ability to accurately tag guest RIP with
PERF_RECORD_MISC_GUEST_{KERNEL|USER} and record it in the sample. This
causes perf/tool to fail to connect the vCPU RIPs to the guest kernel
space symbols when parsing these samples due to incorrect PERF_RECORD_MISC
flags:
Before (perf-report of a cpu-cycles sample):
1.23% :58945 [unknown] [u] 0xffffffff818012e0
After:
1.35% :60703 [kernel.vmlinux] [g] asm_exc_page_fault
Note, checking preempted_in_kernel in kvm_arch_vcpu_in_kernel() is awful
as nothing in the API's suggests that it's safe to use if and only if the
vCPU was preempted. That can be cleaned up in the future, for now just
fix the glaring correctness bug.
Note #2, checking vcpu->preempted is NOT safe, as getting the CPL on VMX
requires VMREAD, i.e. is correct if and only if the vCPU is loaded. If
the target vCPU *was* preempted, then it can be scheduled back in after
the check on vcpu->preempted in kvm_vcpu_on_spin(), i.e. KVM could end up
trying to do VMREAD on a VMCS that isn't loaded on the current pCPU.
Signed-off-by: Like Xu <likexu@tencent.com>
Fixes: e1bfc24577 ("KVM: Move x86's perf guest info callbacks to generic KVM")
Link: https://lore.kernel.org/r/20231123075818.12521-1-likexu@tencent.com
[sean: massage changelong, add Fixes]
Signed-off-by: Sean Christopherson <seanjc@google.com>
intel_idle_irq() re-enables IRQs very early. As a result, an interrupt
may fire before mwait() is eventually called. If such an interrupt queues
a timer, it may go unnoticed until mwait returns and the idle loop
handles the tick re-evaluation. And monitoring TIF_NEED_RESCHED doesn't
help because a local timer enqueue doesn't set that flag.
The issue is mitigated by the fact that this idle handler is only invoked
for shallow C-states when, presumably, the next tick is supposed to be
close enough. There may still be rare cases though when the next tick
is far away and the selected C-state is shallow, resulting in a timer
getting ignored for a while.
Fix this with using sti_mwait() whose IRQ-reenablement only triggers
upon calling mwait(), dealing with the race while keeping the interrupt
latency within acceptable bounds.
Fixes: c227233ad6 (intel_idle: enable interrupts before C1 on Xeons)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231115151325.6262-3-frederic@kernel.org
Add a note to make sure we never miss and break the requirements behind
it.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231115151325.6262-2-frederic@kernel.org
Setting X86_BUG_AMD_E400 in init_amd() is early enough.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: http://lore.kernel.org/r/20231120104152.13740-12-bp@alien8.de
Set it in init_amd_gh() unconditionally as that is the F10h init
function.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: http://lore.kernel.org/r/20231120104152.13740-11-bp@alien8.de
Add X86_FEATURE flags for each Zen generation. They should be used from
now on instead of checking f/m/s.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lore.kernel.org/r/20231120104152.13740-2-bp@alien8.de
Use the governed feature framework to track if Linear Address Masking (LAM)
is "enabled", i.e. if LAM can be used by the guest.
Using the framework to avoid the relative expensive call guest_cpuid_has()
during cr3 and vmexit handling paths for LAM.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-14-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
LAM is enumerated by CPUID.7.1:EAX.LAM[bit 26]. Advertise the feature to
userspace and enable it as the final step after the LAM virtualization
support for supervisor and user pointers.
SGX LAM support is not advertised yet. SGX LAM support is enumerated in
SGX's own CPUID and there's no hard requirement that it must be supported
when LAM is reported in CPUID leaf 0x7.
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Jingqi Liu <jingqi.liu@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-13-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add support to allow guests to set the new CR3 control bits for Linear
Address Masking (LAM) and add implementation to get untagged address for
user pointers.
LAM modifies the canonical check for 64-bit linear addresses, allowing
software to use the masked/ignored address bits for metadata. Hardware
masks off the metadata bits before using the linear addresses to access
memory. LAM uses two new CR3 non-address bits, LAM_U48 (bit 62) and
LAM_U57 (bit 61), to configure LAM for user pointers. LAM also changes
VMENTER to allow both bits to be set in VMCS's HOST_CR3 and GUEST_CR3 for
virtualization.
When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set
any of the two LAM control bits. However, when EPT is off, the actual CR3
used by the guest is generated from the shadow MMU root which is different
from the CR3 that is *set* by the guest, and KVM needs to manually apply
any active control bits to VMCS's GUEST_CR3 based on the cached CR3 *seen*
by the guest.
KVM manually checks guest's CR3 to make sure it points to a valid guest
physical address (i.e. to support smaller MAXPHYSADDR in the guest). Extend
this check to allow the two LAM control bits to be set. After check, LAM
bits of guest CR3 will be stripped off to extract guest physical address.
In case of nested, for a guest which supports LAM, both VMCS12's HOST_CR3
and GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when
L0 enters L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2
directly. KVM also manually checks VMCS12's HOST_CR3 and GUEST_CR3 being
valid physical address. Extend such check to allow the new LAM control bits
too.
Note, LAM doesn't have a global control bit to turn on/off LAM completely,
but purely depends on hardware's CPUID to determine it can be enabled or
not. That means, when EPT is on, even when KVM doesn't expose LAM to guest,
the guest can still set LAM control bits in CR3 w/o causing problem. This
is an unfortunate virtualization hole. KVM could choose to intercept CR3 in
this case and inject fault but this would hurt performance when running a
normal VM w/o LAM support. This is undesirable. Just choose to let the
guest do such illegal thing as the worst case is guest being killed when
KVM eventually find out such illegal behaviour and that the guest is
misbehaving.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-12-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add support to allow guests to set the new CR4 control bit for LAM and add
implementation to get untagged address for supervisor pointers.
LAM modifies the canonicality check applied to 64-bit linear addresses for
data accesses, allowing software to use of the untranslated address bits for
metadata and masks the metadata bits before using them as linear addresses
to access memory. LAM uses CR4.LAM_SUP (bit 28) to configure and enable LAM
for supervisor pointers. It also changes VMENTER to allow the bit to be set
in VMCS's HOST_CR4 and GUEST_CR4 to support virtualization. Note CR4.LAM_SUP
is allowed to be set even not in 64-bit mode, but it will not take effect
since LAM only applies to 64-bit linear addresses.
Move CR4.LAM_SUP out of CR4_RESERVED_BITS, its reservation depends on vcpu
supporting LAM or not. Leave it intercepted to prevent guest from setting
the bit if LAM is not exposed to guest as well as to avoid vmread every time
when KVM fetches its value, with the expectation that guest won't toggle the
bit frequently.
Set CR4.LAM_SUP bit in the emulated IA32_VMX_CR4_FIXED1 MSR for guests to
allow guests to enable LAM for supervisor pointers in nested VMX operation.
Hardware is not required to do TLB flush when CR4.LAM_SUP toggled, KVM
doesn't need to emulate TLB flush based on it. There's no other features
or vmx_exec_controls connection, and no other code needed in
{kvm,vmx}_set_cr4().
Skip address untag for instruction fetches (which includes branch targets),
operand of INVLPG instructions, and implicit system accesses, all of which
are not subject to untagging. Note, get_untagged_addr() isn't invoked for
implicit system accesses as there is no reason to do so, but check the
flag anyways for documentation purposes.
Signed-off-by: Robert Hoo <robert.hu@linux.intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-11-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Stub in vmx_get_untagged_addr() and wire up calls from the emulator (via
get_untagged_addr()) and "direct" calls from various VM-Exit handlers in
VMX where LAM untagging is supposed to be applied. Defer implementing
the guts of vmx_get_untagged_addr() to future patches purely to make the
changes easier to consume.
LAM is active only for 64-bit linear addresses and several types of
accesses are exempted.
- Cases need to untag address (handled in get_vmx_mem_address())
Operand(s) of VMX instructions and INVPCID.
Operand(s) of SGX ENCLS.
- Cases LAM doesn't apply to (no change needed)
Operand of INVLPG.
Linear address in INVPCID descriptor.
Linear address in INVVPID descriptor.
BASEADDR specified in SECS of ECREATE.
Note:
- LAM doesn't apply to write to control registers or MSRs
- LAM masking is applied before walking page tables, i.e. the faulting
linear address in CR2 doesn't contain the metadata.
- The guest linear address saved in VMCS doesn't contain metadata.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-10-binbin.wu@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Introduce a new interface get_untagged_addr() to kvm_x86_ops to untag
the metadata from linear address. Call the interface in linearization
of instruction emulator for 64-bit mode.
When enabled feature like Intel Linear Address Masking (LAM) or AMD Upper
Address Ignore (UAI), linear addresses may be tagged with metadata that
needs to be dropped prior to canonicality checks, i.e. the metadata is
ignored.
Introduce get_untagged_addr() to kvm_x86_ops to hide the vendor specific
code, as sadly LAM and UAI have different semantics. Pass the emulator
flags to allow vendor specific implementation to precisely identify the
access type (LAM doesn't untag certain accesses).
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-9-binbin.wu@linux.intel.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove kvm_vcpu_is_illegal_gpa() and use !kvm_vcpu_is_legal_gpa() instead.
The "illegal" helper actually predates the "legal" helper, the only reason
the "illegal" variant wasn't removed by commit 4bda0e9786 ("KVM: x86:
Add a helper to check for a legal GPA") was to avoid code churn. Now that
CR3 has a dedicated helper, there are fewer callers, and so the code churn
isn't that much of a deterrent.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-8-binbin.wu@linux.intel.com
[sean: provide a bit of history in the changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add and use kvm_vcpu_is_legal_cr3() to check CR3's legality to provide
a clear distinction between CR3 and GPA checks. This will allow exempting
bits from kvm_vcpu_is_legal_cr3() without affecting general GPA checks,
e.g. for upcoming features that will use high bits in CR3 for feature
enabling.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-7-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop non-PA bits when getting GFN for guest's PGD with the maximum theoretical
mask for guest MAXPHYADDR.
Do it unconditionally because it's harmless for 32-bit guests, querying 64-bit
mode would be more expensive, and for EPT the mask isn't tied to guest mode.
Using PT_BASE_ADDR_MASK would be technically wrong (PAE paging has 64-bit
elements _except_ for CR3, which has only 32 valid bits), it wouldn't matter
in practice though.
Opportunistically use GENMASK_ULL() to define __PT_BASE_ADDR_MASK.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-6-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add an emulation flag X86EMUL_F_INVLPG, which is used to identify an
instruction that does TLB invalidation without true memory access.
Only invlpg & invlpga implemented in emulator belong to this kind.
invlpga doesn't need additional information for emulation. Just pass
the flag to em_invlpg().
Linear Address Masking (LAM) and Linear Address Space Separation (LASS)
don't apply to addresses that are inputs to TLB invalidation. The flag
will be consumed to support LAM/LASS virtualization.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-5-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add an emulation flag X86EMUL_F_IMPLICIT to identify implicit system access
in instruction emulation. Don't bother wiring up any usage at this point,
as Linear Address Space Separation (LASS) will be the first "real" consumer
of the flag and LASS support will require dedicated hooks, i.e. there
aren't any existing calls where passing X86EMUL_F_IMPLICIT is meaningful.
Add the IMPLICIT flag even though there's no imminent usage so that
Linear Address Masking (LAM) support can reference the flag to document
that addresses for implicit accesses aren't untagged.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-4-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Consolidate @write and @fetch of __linearize() into a set of flags so that
additional flags can be added without needing more/new boolean parameters,
to precisely identify the access type.
No functional change intended.
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Xuelian Guo <xuelian.guo@intel.com>
Link: https://lore.kernel.org/r/20230913124227.12574-2-binbin.wu@linux.intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Ever since the eventfd type was introduced back in 2007 in commit
e1ad7468c7 ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.
Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org
Acked-by: Xu Yilun <yilun.xu@intel.com>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl
Acked-by: Eric Farman <farman@linux.ibm.com> # s390
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Commit
0f08c3b229 ("x86/smp: Reduce code duplication")
removed the only use of CONFIG_X86_32_SMP.
Remove the now obsolete config X86_32_SMP too.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231128090016.29676-1-lukas.bulwahn@gmail.com
Today the percpu struct vcpu_info is allocated via DEFINE_PER_CPU(),
meaning that it could cross a page boundary. In this case registering
it with the hypervisor will fail, resulting in a panic().
This can easily be fixed by using DEFINE_PER_CPU_ALIGNED() instead,
as struct vcpu_info is guaranteed to have a size of 64 bytes, matching
the cache line size of x86 64-bit processors (Xen doesn't support
32-bit processors).
Fixes: 5ead97c84f ("xen: Core Xen implementation")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.con>
Link: https://lore.kernel.org/r/20231124074852.25161-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
The long names of the SMCA banks are only used by the MCE decoder
module.
Move them out of the arch code and into the decoder module.
[ bp: Name the long names array "smca_long_names", drop local ptr in
decode_smca_error(), constify arrays. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231118193248.1296798-5-yazen.ghannam@amd.com
and remove the driver version announcement to avoid version
confusion when distros backport fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmVjFKgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1h/7hAAs5hL1KvrfdW0VpAW91MbX6mtIe7Emc8T
LCiBJtl9UngRdASUC9CGrcIZ5JIps0702gAq0qPVzk5zKxC22ySWsqMZybask+eF
d6E7amMtF+KX0wiCZSuC66StCKA08JfrUXgxvYHnxDjNqERYFmVr1QabGL1IN5lZ
KUrVUyvN8VOnzypOiQ98lXGWDJYwaV7t+IzMMh7mT5OUkoo09e6tFm7IF+NWg5xe
NYCcvZqyo0Ipld7HOjlGHYG+blFkDxJpfTby5UevZybXsPd00cxBzDSR9zs1sYeG
Kt6cgwDhfewfcM1QFVvNV/SDVsPp1BlVvMUa6Xa3vtsnWCit8zQqMbkYYWUaTcIh
yUJvtzh/xZQtxaQ8Z8SbI7EhUBOFJXoHWV9JoEe3gWsWA5thu+4iOCh/P8C2ON3n
6kLSgNQ4GAylH34MWoS84t2Jxv7XmNZljR/78ucRQrJ1JJIEA+r4sJ9hK9btxqf2
0n86StHuwtNXSQwEhDcacqUpFPLZ65Za1Y9AXc69CDuiwj3DvTVBMwjEOBbnGTrZ
dL9QOYG5gkklOx4o5ePj7RoLrzz/j6dj6idmu8FxZZ4q+QB9vvL2lRHusJnUEloE
yxR7WWOB/kyUZT7FriLHRuEP7yQNRSvLs6U7b8uXCHiGcAb2mN0fFi2m/BcJGS4A
hHA0t9WyNBM=
=eYQ4
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode fixes from Ingo Molnar:
"Fix/enhance x86 microcode version reporting: fix the bootup log spam,
and remove the driver version announcement to avoid version confusion
when distros backport fixes"
* tag 'x86-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Rework early revisions reporting
x86/microcode: Remove the driver announcement and version
code resulting in non-working events on those platforms.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmVjEt8RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g54g/9EKugplQhVSfEXhzfBx38ICyodLxJuomL
x96auVes+H3CEC6CYADzAY14GPDxymzOVSK/Sy8fWkzyGVWG23SZgB0i/hOTIsFL
CVZLrVztidvkWWzJj5RnzNstZsUXp4QfNtcIw3VvCcsgDhk8WVen1xe5XbIuzamJ
HbTrlrL2S3feAWaWZUh6c/3unHsLhCupLg8u0DTe97OFlRRlSj+wtmJCMCQoRRpL
/LVItmmxfRbuMeyY8GGR6kb2qcb2+Zr6AUOXUPIoYOqARLuaV81F0kD2zwpPwB68
rFPYBG2ld2IHkbNFgxfCfNqP2DuY5TnSpS8SJzcfTAp6saIA5Ua3QLwg8IumV/2X
G3ZI/G54er2+hhddX7tUjJIiutwNHxGkiOKy3Bnxm8K8c2H0I8XZ97fB1OWQUnVy
pnwMzz3tlUOfF9SV8K0zttu3Dar9dcoBybT8HL/Wxr4RWF+FuZBiQNjjd92k3L4u
Ua8Q6nLD/IRXwQfiEWzDSsWIQqxuSDGxvGHBgYJm4d5DxblFAeKmKMkFnO6Z17Kh
fm1M4nW/YawSQYlG7Jo5XG+dWESJPhM5IAVBw19o0oYkqHV2gfmUerG8hvJYR7DY
CpLrw4S9UDOLzsYAtNoozB/GMQnXQvjRz8mILWiwNJxpbo1saeZUj43zHvzqr7ta
Vj96qrQLuNU=
=jOSq
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 perf event fix from Ingo Molnar:
"Fix a bug in the Intel hybrid CPUs hardware-capabilities enumeration
code resulting in non-working events on those platforms"
* tag 'perf-urgent-2023-11-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Correct incorrect 'or' operation for PMU capabilities
The same as Granite Rapids, the Sierra Forest and Grand Ridge also
supports the discovery table feature and the same type of the uncore
units. The difference of the available units and counters can be
retrieved from the discovery table automatically.
Just add the CPU model ID.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Ammy Yi <ammy.yi@intel.com>
Link: https://lore.kernel.org/r/20231117163939.2468007-5-kan.liang@linux.intel.com
The free-running counters for IIO uncore blocks on Granite Rapids are
similar to Sapphire Rapids. The key difference is the offset of the
registers. The number of the IIO uncore blocks can also be retrieved
from the discovery table.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Ammy Yi <ammy.yi@intel.com>
Link: https://lore.kernel.org/r/20231117163939.2468007-4-kan.liang@linux.intel.com
The same as Sapphire Rapids, Granite Rapids also supports the discovery
table feature. All the basic uncore PMON information can be retrieved
from the discovery table which resides in the BIOS.
There are 4 new units are added on Granite Rapids, b2cmi, b2cxl, ubox,
and mdf_sbo. The layout of the counters is exactly the same as the
generic uncore counters. Only add a name for the new units. All the
details can be retrieved from the discovery table.
The description of the new units can be found at
https://www.intel.com/content/www/us/en/secure/content-details/772943/content-details.html
The other units, e.g., cha, iio, irp, pcu, and imc, are the same as
Sapphire Rapids.
Ignore the upi and b2upi units in the discovery table, which are broken
for now.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Ammy Yi <ammy.yi@intel.com>
Link: https://lore.kernel.org/r/20231117163939.2468007-3-kan.liang@linux.intel.com
The current perf doesn't save the complete address of an uncore unit.
The complete address of each unit is calculated by the base address +
offset. The type of the base address is u64, while the type of offset is
unsigned.
In the old platforms (without the discovery table method), the base
address and offset are hard coded in the driver. Perf can always use the
lowest address as the base address. Everything works well.
In the new platforms (starting from SPR), the discovery table provides
a complete address for all uncore units. To follow the current
framework/codes, when parsing the discovery table, the complete address
of the first box is stored as a base address. The offset of the
following units is calculated by the complete address of the unit minus
the base address (the address of the first unit). On GNR, the latter
units may have a lower address compared to the first unit. So the offset
is a negative value. The upper 32 bits are lost when casting a negative
u64 to an unsigned type.
Use u64 to replace unsigned for the uncore offsets array to correct the
above case. There is no functional change.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Ammy Yi <ammy.yi@intel.com>
Link: https://lore.kernel.org/r/20231117163939.2468007-2-kan.liang@linux.intel.com
Factor out SPR_UNCORE_MMIO_COMMON_FORMAT which can be reused by
Granite Rapids in the following patch.
Granite Rapids have more uncore units than Sapphire Rapids. Add new
parameters to support adjustable uncore units.
No functional change.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Ammy Yi <ammy.yi@intel.com>
Link: https://lore.kernel.org/r/20231117163939.2468007-1-kan.liang@linux.intel.com
intel_epb_init() is called as a subsys_initcall() to register cpuhp
callbacks. The callbacks make use of get_cpu_device() which will return
NULL unless register_cpu() has been called. register_cpu() is called
from topology_init(), which is also a subsys_initcall().
This is fragile. Moving the register_cpu() to a different
subsys_initcall() leads to a NULL dereference during boot.
Make intel_epb_init() a late_initcall(), user-space can't provide a
policy before this point anyway.
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The VDSO functions are defined as globals in the kernel sources but intended
to be called from userspace, so there is no need to declare them in a kernel
side header.
Without a prototype, this now causes warnings such as
arch/mips/vdso/vgettimeofday.c:14:5: error: no previous prototype for '__vdso_clock_gettime' [-Werror=missing-prototypes]
arch/mips/vdso/vgettimeofday.c:28:5: error: no previous prototype for '__vdso_gettimeofday' [-Werror=missing-prototypes]
arch/mips/vdso/vgettimeofday.c:36:5: error: no previous prototype for '__vdso_clock_getres' [-Werror=missing-prototypes]
arch/mips/vdso/vgettimeofday.c:42:5: error: no previous prototype for '__vdso_clock_gettime64' [-Werror=missing-prototypes]
arch/sparc/vdso/vclock_gettime.c:254:1: error: no previous prototype for '__vdso_clock_gettime' [-Werror=missing-prototypes]
arch/sparc/vdso/vclock_gettime.c:282:1: error: no previous prototype for '__vdso_clock_gettime_stick' [-Werror=missing-prototypes]
arch/sparc/vdso/vclock_gettime.c:307:1: error: no previous prototype for '__vdso_gettimeofday' [-Werror=missing-prototypes]
arch/sparc/vdso/vclock_gettime.c:343:1: error: no previous prototype for '__vdso_gettimeofday_stick' [-Werror=missing-prototypes]
Most architectures have already added workarounds for these by adding
declarations somewhere, but since these are all compatible, we should
really just have one copy, with an #ifdef check for the 32-bit vs
64-bit variant and use that everywhere.
Unfortunately, the sparc an um versions are currently incompatible
since they never added support for __vdso_clock_gettime64() in 32-bit
userland. For the moment, I'm leaving this one out, as I can't
easily test it and it requires a larger rework.
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
some architectures run into a -Wmissing-prototypes warning
for trap_init()
arch/microblaze/kernel/traps.c:21:6: warning: no previous prototype for 'trap_init' [-Wmissing-prototypes]
Include the right header to avoid this consistently, removing
the extra declarations on m68k and x86 that were added as local
workarounds already.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The prototype was hidden in an #ifdef on x86, which causes a warning:
kernel/irq_work.c:72:13: error: no previous prototype for 'arch_irq_work_raise' [-Werror=missing-prototypes]
Some architectures have a working prototype, while others don't.
Fix this by providing it in only one place that is always visible.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Commit 1e8f93e183 ("x86: Consolidate port I/O helpers") moved some
port I/O helpers to <asm/shared/io.h>, which caused the 'bw' parameter
in the BUILDIO() macro to become unused. Remove it.
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231123034911.217791-1-ytcoode@gmail.com
AMD systems generally allow MCA "simulation" where MCA registers can be
written with valid data and the full MCA handling flow can be tested by
software.
However, the platform on Scalable MCA systems, can prevent software from
writing data to the MCA registers. There is no architectural way to
determine this configuration. Therefore, the MCE injection module will
check for this behavior by writing and reading back a test status value.
This is done during module init, and the check can run on any CPU with
any valid MCA bank.
If MCA_STATUS writes are ignored by the platform, then there are no side
effects on the hardware state.
If the writes are not ignored, then the test status value will remain in
the hardware MCA_STATUS register. It is likely that the value will not
be overwritten by hardware or software, since the tested CPU and bank
are arbitrary. Therefore, the user may see a spurious, synthetic MCA
error reported whenever MCA is polled for this CPU.
Clear the test value immediately after writing it. It is very unlikely
that a valid MCA error is logged by hardware during the test. Errors
that cause an #MC won't be affected.
Fixes: 891e465a1b ("x86/mce: Check whether writes to MCA_STATUS are getting ignored")
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231118193248.1296798-2-yazen.ghannam@amd.com
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmVdgqYTHHdlaS5saXVA
a2VybmVsLm9yZwAKCRB2FHBfkEGgXhBsCACzUGLF3vOQdrmgTMymzaaOzfLJtvNW
oQ34FwMJMOAyJ6FxM12IJPHA2j+azl9CPjQc5O6F2CBcF8hVj2mDIINQIi+4wpV5
FQv445g2KFml/+AJr/1waz1GmhHtr1rfu7B7NX6tPUtOpxKN7AHAQXWYmHnwK8BJ
5Mh2a/7Lphjin4M1FWCeBTj0JtqF1oVAW2L9jsjqogq1JV0a51DIFutROtaPSC/4
ssTLM5Rqpnw8Z1GWVYD2PObIW4A+h1LV1tNGOIoGW6mX56mPU+KmVA7tTKr8Je/i
z3Jk8bZXFyLvPW2+KNJacbldKNcfwAFpReffNz/FO3R16Stq9Ta1mcE2
=wXju
-----END PGP SIGNATURE-----
Merge tag 'hyperv-fixes-signed-20231121' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- One fix for the KVP daemon (Ani Sinha)
- Fix for the detection of E820_TYPE_PRAM in a Gen2 VM (Saurabh Sengar)
- Micro-optimization for hv_nmi_unknown() (Uros Bizjak)
* tag 'hyperv-fixes-signed-20231121' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Use atomic_try_cmpxchg() to micro-optimize hv_nmi_unknown()
x86/hyperv: Fix the detection of E820_TYPE_PRAM in a Gen2 VM
hv/hv_kvp_daemon: Some small fixes for handling NM keyfiles
This type predates recorded history in tglx/history.git, making it older
than Feb 5th 2002.
This structure is literally old enough to drink in most juristictions in
the world, and has not been used once in that time.
Lay it to rest in /dev/null.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20231102-x86-apic-v1-3-bf049a2a0ed6@citrix.com
The type is not used any more.
Replace the constants with plain defines so they can live outside of an
__ASSEMBLY__ block, allowing for more cleanup in subsequent changes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20231102-x86-apic-v1-2-bf049a2a0ed6@citrix.com
This field is set to APIC_DELIVERY_MODE_FIXED in all cases, and is read
exactly once. Fold the constant in uv_program_mmr() and drop the field.
Searching for the origin of the stale HyperV comment reveals commit
a31e58e129 ("x86/apic: Switch all APICs to Fixed delivery mode") which
notes:
As a consequence of this change, the apic::irq_delivery_mode field is
now pointless, but this needs to be cleaned up in a separate patch.
6 years is long enough for this technical debt to have survived.
[ bp: Fold in
https://lore.kernel.org/r/20231121123034.1442059-1-andrew.cooper3@citrix.com
]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20231102-x86-apic-v1-1-bf049a2a0ed6@citrix.com
The AMD side of the loader issues the microcode revision for each
logical thread on the system, which can become really noisy on huge
machines. And doing that doesn't make a whole lot of sense - the
microcode revision is already in /proc/cpuinfo.
So in case one is interested in the theoretical support of mixed silicon
steppings on AMD, one can check there.
What is also missing on the AMD side - something which people have
requested before - is showing the microcode revision the CPU had
*before* the early update.
So abstract that up in the main code and have the BSP on each vendor
provide those revision numbers.
Then, dump them only once on driver init.
On Intel, do not dump the patch date - it is not needed.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/CAHk-=wg=%2B8rceshMkB4VnKxmRccVLtBLPBawnewZuuqyx5U=3A@mail.gmail.com
First of all, the print is useless. The driver will either load and say
which microcode revision the machine has or issue an error.
Then, the version number is meaningless and actively confusing, as Yazen
mentioned recently: when a subset of patches are backported to a distro
kernel, one can't assume the driver version is the same as the upstream
one. And besides, the version number of the loader hasn't been used and
incremented for a long time. So drop it.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231115210212.9981-2-bp@alien8.de
Make the CONFIG_DEBUG_ENTRY=y check that validates CS is a user segment
unconditional and move it nearer to IRET.
PRE:
140,026,608 cycles:k ( +- 0.01% )
236,696,176 instructions:k # 1.69 insn per cycle ( +- 0.00% )
POST:
139,957,681 cycles:k ( +- 0.01% )
236,681,819 instructions:k # 1.69 insn per cycle ( +- 0.00% )
(this is with --repeat 100 and the run-to-run variance is bigger than
the difference shown)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231120143626.753200755@infradead.org
The code in common_interrupt_return() does a bunch of unconditional
work that is really only needed on PTI kernels. Specifically it
unconditionally copies the IRET frame back onto the entry stack,
swizzles onto the entry stack and does IRET from there.
However, without PTI we can simply IRET from whatever stack we're on.
ivb-ep, mitigations=off, gettid-1m:
PRE:
140,118,538 cycles:k ( +- 0.01% )
236,692,878 instructions:k # 1.69 insn per cycle ( +- 0.00% )
POST:
140,026,608 cycles:k ( +- 0.01% )
236,696,176 instructions:k # 1.69 insn per cycle ( +- 0.00% )
(this is with --repeat 100 and the run-to-run variance is bigger than
the difference shown)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20231120143626.638107480@infradead.org
When running perf-stat command on Intel hybrid platform, perf-stat
reports the following errors:
sudo taskset -c 7 ./perf stat -vvvv -e cpu_atom/instructions/ sleep 1
Opening: cpu/cycles/:HG
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
config 0xa00000000
disabled 1
------------------------------------------------------------
sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
sys_perf_event_open failed, error -16
Performance counter stats for 'sleep 1':
<not counted> cpu_atom/instructions/
It looks the cpu_atom/instructions/ event can't be enabled on atom PMU
even when the process is pinned on atom core. Investigation shows that
exclusive_event_init() helper always returns -EBUSY error in the perf
event creation. That's strange since the atom PMU should not be an
exclusive PMU.
Further investigation shows the issue was introduced by commit:
97588df87b ("perf/x86/intel: Add common intel_pmu_init_hybrid()")
The commit originally intents to clear the bit PERF_PMU_CAP_AUX_OUTPUT
from PMU capabilities if intel_cap.pebs_output_pt_available is not set,
but it incorrectly uses 'or' operation and leads to all PMU capabilities
bits are set to 1 except bit PERF_PMU_CAP_AUX_OUTPUT.
Testing this fix on Intel hybrid platforms, the observed issues
disappear.
Fixes: 97588df87b ("perf/x86/intel: Add common intel_pmu_init_hybrid()")
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231121014628.729989-1-dapeng1.mi@linux.intel.com
- Fix a back-to-back signals handling scenario when shadow stack is in
use
- A documentation fix
- Add Kirill as TDX maintainer
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmVaChkACgkQEsHwGGHe
VUraNQ/+KyCyJgG6bdIB3tS9qKr0Z4REaXQ+UQ7DfAjlhrzw7C6f4VReNLp3ohEv
RdxNjKLEueYFQAo+v8uKGkqYIT6H1ob9uW+RjtjN+OJqWN/3AfK7CTx8HI1bJsW5
wKM+Ey81cID0iQDiNPAdzRnu7suKKjF5jLwztAw6EYOsTRfUnLZ8Ct84uHBWd58v
kZ+WkEyeOyeJo+Vdx07d/LEcCJ+S9G6WfA0AnhHPOZxRZTn2RhqNsnJvqTeOvWUM
PSN9NjxFk0ymidwnhR1urw1wHGgTT990vNsPIHLE72TwXrWEOM14Xkq1XNI4PfD1
Bp74ySpF0YUQrvgBW4V3qXgBFls4DkKys1amd2kK5KQGEpcXZm7ZPnI5w2NKMsY4
1Tk379W/1jPY8cyZjIqn92eFEkAjfID4eHICLj5IJhVMUusNEPmxgoycvKDqI8sK
NihF1wUjyfRibh4ujYaurqKUBgxVHo2dyXPPo7UNzeaMfvqkFaxgwNJVF0gQ+MyI
5BzeY71RCFb8ZKtCT6SVN6oUeWLg+QAZApoJVDDnhF9InG+wJj+D400T7pZnNHbo
ag6L2gJFJ2+XsV8DJhiaII0gfbf9cUppn4G7RcvQfL2HivYnZV3q1dBKf6C35H44
Kpz5w/eoJPOIcuZ48a6ph80zuRpuN6MSBigZ0G2Q7IwrmFx1Vcg=
=PGYO
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Ignore invalid x2APIC entries in order to not waste per-CPU data
- Fix a back-to-back signals handling scenario when shadow stack is in
use
- A documentation fix
- Add Kirill as TDX maintainer
* tag 'x86_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/acpi: Ignore invalid x2APIC entries
x86/shstk: Delay signal entry SSP write until after user accesses
x86/Documentation: Indent 'note::' directive for protocol version number note
MAINTAINERS: Add Intel TDX entry
tl;dr: The num_digits() function has a theoretical overflow issue.
But it doesn't affect any actual in-tree users. Fix it by using
a larger type for one of the local variables.
Long version:
There is an overflow in variable m in function num_digits when val
is >= 1410065408 which leads to the digit calculation loop to
iterate more times than required. This results in either more
digits being counted or in some cases (for example where val is
1932683193) the value of m eventually overflows to zero and the
while loop spins forever).
Currently the function num_digits is currently only being used for
small values of val in the SMP boot stage for digit counting on the
number of cpus and NUMA nodes, so the overflow is never encountered.
However it is useful to fix the overflow issue in case the function
is used for other purposes in the future. (The issue was discovered
while investigating the digit counting performance in various
kernel helper functions rather than any real-world use-case).
The simplest fix is to make m a long long, the overhead in
multiplication speed for a long long is very minor for small values
of val less than 10000 on modern processors. The alternative
fix is to replace the multiplication with a constant division
by 10 loop (this compiles down to an multiplication and shift)
without needing to make m a long long, but this is slightly slower
than the fix in this commit when measured on a range of x86
processors).
[ dhansen: subject and changelog tweaks ]
Fixes: 646e29a178 ("x86: Improve the printout of the SMP bootup CPU table")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231102174901.2590325-1-colin.i.king%40gmail.com
The x86 SHA-256 module contains four implementations: SSSE3, AVX, AVX2,
and SHA-NI. Commit 1c43c0f1f8 ("crypto: x86/sha - load modules based
on CPU features") made the module be autoloaded when SSSE3, AVX, or AVX2
is detected. The omission of SHA-NI appears to be an oversight, perhaps
because of the outdated file-level comment. This patch fixes this,
though in practice this makes no difference because SSSE3 is a subset of
the other three features anyway. Indeed, sha256_ni_transform() executes
SSSE3 instructions such as pshufb.
Reviewed-by: Roxana Nicolescu <roxana.nicolescu@canonical.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86 SHA-1 module contains four implementations: SSSE3, AVX, AVX2,
and SHA-NI. Commit 1c43c0f1f8 ("crypto: x86/sha - load modules based
on CPU features") made the module be autoloaded when SSSE3, AVX, or AVX2
is detected. The omission of SHA-NI appears to be an oversight, perhaps
because of the outdated file-level comment. This patch fixes this,
though in practice this makes no difference because SSSE3 is a subset of
the other three features anyway. Indeed, sha1_ni_transform() executes
SSSE3 instructions such as pshufb.
Reviewed-by: Roxana Nicolescu <roxana.nicolescu@canonical.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The same as the Sierra Forest, the Grand Ridge supports core C1/C6 and
module C6. But it doesn't support pkg C6 residency counter.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231116142245.1233485-4-kan.liang@linux.intel.com
A new module C6 Residency Counter is introduced in the Sierra Forest.
The scope of the new counter is module (A cluster of cores shared L2
cache). Create a brand new cstate_module PMU to profile the new counter.
The only differences between the new cstate_module PMU and the existing
cstate PMU are the scope and events.
Regarding the choice of the new cstate_module PMU name, the current
naming rule of a cstate PMU is "cstate_" + the scope of the PMU. The
scope of the PMU is the cores shared L2. On SRF, Intel calls it
"module", while the internal Linux sched code calls it "cluster". The
"cstate_module" is used as the new PMU name, because
- The Cstate PMU driver is a Intel specific driver. It doesn't impact
other ARCHs. The name makes it consistent with the documentation.
- The "cluster" mainly be used by the scheduler developer, while the
user of cstate PMU is more likely a researcher reading HW docs and
optimizing power.
- In the Intel's SDM, the "cluster" has a different meaning/scope for
topology. Using it will mislead the end users.
Besides the module C6, the core C1/C6 and pkg C6 residency counters are
supported in the Sierra Forest as well.
Suggested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231116142245.1233485-3-kan.liang@linux.intel.com
Intel cstate PMU driver will invoke the topology_cluster_cpumask() to
retrieve the CPU mask of a cluster. A modpost error is triggered since
the symbol cpu_clustergroup_mask is not exported.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231116142245.1233485-2-kan.liang@linux.intel.com
The events of the cstate_core and cstate_pkg PMU have the same format.
They both need to create a "events" group (with empty attrs). The
attr_groups can be shared.
Remove the dedicated attr_groups for each cstate PMU. Use the shared
cstate_attr_groups to replace.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231116142245.1233485-1-kan.liang@linux.intel.com
mce_device_create() is called only from mce_cpu_online() which in turn
will be called iff MCA support is available. That is, at the time of
mce_device_create() call it's guaranteed that MCA support is available.
No need to duplicate this check so remove it.
[ bp: Massage commit message. ]
Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231107165529.407349-1-nik.borisov@suse.com
Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM
to provide features, enhancements, and optimizations that are kludgly
or outright impossible to implement in a generic memory subsystem.
The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which
similar to the generic memfd_create(), creates an anonymous file and
returns a file descriptor that refers to it. Again like "regular"
memfd files, guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped.
The key differences between memfd files (and every other memory subystem)
is that guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
convert a guest memory area between the shared and guest-private states.
A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
specify attributes for a given page of guest memory. In the long term,
it will likely be extended to allow userspace to specify per-gfn RWX
protections, including allowing memory to be writable in the guest
without it also being writable in host userspace.
The immediate and driving use case for guest_memfd are Confidential
(CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
For such use cases, being able to map memory into KVM guests without
requiring said memory to be mapped into the host is a hard requirement.
While SEV+ and TDX prevent untrusted software from reading guest private
data by encrypting guest memory, pKVM provides confidentiality and
integrity *without* relying on memory encryption. In addition, with
SEV-SNP and especially TDX, accessing guest private memory can be fatal
to the host, i.e. KVM must be prevent host userspace from accessing
guest memory irrespective of hardware behavior.
Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
for example hardening userspace against unintentional accesses to guest
memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to
define the allow guest protection (with an exception granted to mapping
guest memory executable), and similarly KVM currently requires the guest
mapping size to be a strict subset of the host userspace mapping size.
Decoupling the mappings sizes would allow userspace to precisely map
only what is needed and with the required permissions, without impacting
guest performance.
A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to DMA from or into guest memory).
guest_memfd is the result of 3+ years of development and exploration;
taking on memory management responsibilities in KVM was not the first,
second, or even third choice for supporting CoCo VMs. But after many
failed attempts to avoid KVM-specific backing memory, and looking at
where things ended up, it is quite clear that of all approaches tried,
guest_memfd is the simplest, most robust, and most extensible, and the
right thing to do for KVM and the kernel at-large.
The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.
Pending post-merge work includes:
- hugepage support
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using
the same memory attributes introduced here
- SNP and TDX support
There are two non-KVM patches buried in the middle of this series:
fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
mm: Add AS_UNMOVABLE to mark mapping as completely unmovable
The first is small and mostly suggested-by Christian Brauner; the second
a bit less so but it was written by an mm person (Vlastimil Babka).
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.
The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that's isn't universally accessible. I.e. relying SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.
At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-24-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs. Confidentials VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.
Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
KVM_ADDRESS_SPACE_NUM.
No functional change intended.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory. For such VMs,
KVM_MEM_GUEST_MEMFD memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes. To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace. Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.
Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits. In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.
Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disallow creating hugepages with mixed memory attributes, e.g. shared
versus private, as mapping a hugepage in this case would allow the guest
to access memory with the wrong attributes, e.g. overlaying private memory
with a shared hugepage.
Tracking whether or not attributes are mixed via the existing
disallow_lpage field, but use the most significant bit in 'disallow_lpage'
to indicate a hugepage has mixed attributes instead using the normal
refcounting. Whether or not attributes are mixed is binary; either they
are or they aren't. Attempting to squeeze that info into the refcount is
unnecessarily complex as it would require knowing the previous state of
the mixed count when updating attributes. Using a flag means KVM just
needs to ensure the current status is reflected in the memslots.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20231027182217.3615211-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
the probability of exiting to userspace with a stale run->exit_reason that
*appears* to be valid.
To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of using
e.g. BUS_MCEERR_AR. And in the more distant future, KVM will also likely
utilize the same functionality to let userspace "intercept" and handle
memory faults when the userspace mapping is missing, i.e. when fast gup()
fails.
Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure fields in conjunction with a
a non-zero, negative return code, e.g. -EFAULT or -EHWPOISON. And because
KVM already returns -EFAULT in many paths, there's a relatively high
probability that KVM could return -EFAULT without setting run->exit_reason,
in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
whatever exit reason happened to be in the run structure.
Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.
Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoorthy@google.com
Link: https://lore.kernel.org/all/ZFFbwOXZ5uI%2Fgdaf@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-19-seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to struct alt_instr, make the struct paravirt_patch_site packed
and get rid of all the .align directives and save 2 bytes for one
PARA_SITE entry on X86_64.
[ bp: Massage commit message. ]
Suggested-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/6dcb20159ded36586c5f7f2ae159e4e030256627.1686301237.git.houwenlong.hwl@antgroup.com
Similar to the alternative patching, use a relative reference for original
instruction offset rather than absolute one, which saves 8 bytes for one
PARA_SITE entry on x86_64. As a result, a R_X86_64_PC32 relocation is
generated instead of an R_X86_64_64 one, which also reduces relocation
metadata on relocatable builds. Hardcode the alignment to 4 now.
[ bp: Massage commit message. ]
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/9e6053107fbaabc0d33e5d2865c5af2c67ec9925.1686301237.git.houwenlong.hwl@antgroup.com
Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).
KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory. With guest private memory,
there will be two kind of memory conversions:
- explicit conversion: happens when the guest explicitly calls into KVM
to map a range (as private or shared)
- implicit conversion: happens when the guest attempts to access a gfn
that is configured in the "wrong" state (private vs. shared)
On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesriable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.
KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.
Note! To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'! Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.
Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages. It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.
Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com
Link: https://lore.kernel.org/all/ZQ3AmLO2SYv3DszH@google.com
Cc: Anish Moorthy <amoorthy@google.com>
Cc: David Matlack <dmatlack@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20231027182217.3615211-10-seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail. The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.
Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
makes detection of bad flags a bit more robust, e.g. if the new fd field
is guarded only by a flag and not a new ioctl(), then a userspace bug
(setting a "bad" flag) would generate out-of-bounds access instead of an
-EINVAL error.
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-9-seanjc@google.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior. Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.
Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared. PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.
bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.
Acked-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Message-Id: <20231027182217.3615211-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently in mmu_notifier invalidate path, hva range is recorded and then
checked against by mmu_invalidate_retry_hva() in the page fault handling
path. However, for the soon-to-be-introduced private memory, a page fault
may not have a hva associated, checking gfn(gpa) makes more sense.
For existing hva based shared memory, gfn is expected to also work. The
only downside is when aliasing multiple gfns to a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected small.
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Xu Yilun <yilun.xu@intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Message-Id: <20231027182217.3615211-4-seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AMD does not have the requirement for a synchronization barrier when
acccessing a certain group of MSRs. Do not incur that unnecessary
penalty there.
There will be a CPUID bit which explicitly states that a MFENCE is not
needed. Once that bit is added to the APM, this will be extended with
it.
While at it, move to processor.h to avoid include hell. Untangling that
file properly is a matter for another day.
Some notes on the performance aspect of why this is relevant, courtesy
of Kishon VijayAbraham <Kishon.VijayAbraham@amd.com>:
On a AMD Zen4 system with 96 cores, a modified ipi-bench[1] on a VM
shows x2AVIC IPI rate is 3% to 4% lower than AVIC IPI rate. The
ipi-bench is modified so that the IPIs are sent between two vCPUs in the
same CCX. This also requires to pin the vCPU to a physical core to
prevent any latencies. This simulates the use case of pinning vCPUs to
the thread of a single CCX to avoid interrupt IPI latency.
In order to avoid run-to-run variance (for both x2AVIC and AVIC), the
below configurations are done:
1) Disable Power States in BIOS (to prevent the system from going to
lower power state)
2) Run the system at fixed frequency 2500MHz (to prevent the system
from increasing the frequency when the load is more)
With the above configuration:
*) Performance measured using ipi-bench for AVIC:
Average Latency: 1124.98ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 42.6759M/s [Total number of IPIs sent in a second from
48 vCPUs simultaneously]
*) Performance measured using ipi-bench for x2AVIC:
Average Latency: 1172.42ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 40.9432M/s [Total number of IPIs sent in a second from
48 vCPUs simultaneously]
From above, x2AVIC latency is ~4% more than AVIC. However, the expectation is
x2AVIC performance to be better or equivalent to AVIC. Upon analyzing
the perf captures, it is observed significant time is spent in
weak_wrmsr_fence() invoked by x2apic_send_IPI().
With the fix to skip weak_wrmsr_fence()
*) Performance measured using ipi-bench for x2AVIC:
Average Latency: 1117.44ns [Time to send IPI from one vCPU to another vCPU]
Cumulative throughput: 42.9608M/s [Total number of IPIs sent in a second from
48 vCPUs simultaneously]
Comparing the performance of x2AVIC with and without the fix, it can be seen
the performance improves by ~4%.
Performance captured using an unmodified ipi-bench using the 'mesh-ipi' option
with and without weak_wrmsr_fence() on a Zen4 system also showed significant
performance improvement without weak_wrmsr_fence(). The 'mesh-ipi' option ignores
CCX or CCD and just picks random vCPU.
Average throughput (10 iterations) with weak_wrmsr_fence(),
Cumulative throughput: 4933374 IPI/s
Average throughput (10 iterations) without weak_wrmsr_fence(),
Cumulative throughput: 6355156 IPI/s
[1] https://github.com/bytedance/kvm-utils/tree/master/microbenchmark/ipi-bench
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230622095212.20940-1-bp@alien8.de
Memory errors don't happen very often, especially fatal ones. However,
in large-scale scenarios such as data centers, that probability
increases with the amount of machines present.
When a fatal machine check happens, mce_panic() is called based on the
severity grading of that error. The page containing the error is not
marked as poison.
However, when kexec is enabled, tools like makedumpfile understand when
pages are marked as poison and do not touch them so as not to cause
a fatal machine check exception again while dumping the previous
kernel's memory.
Therefore, mark the page containing the error as poisoned so that the
kexec'ed kernel can avoid accessing the page.
[ bp: Rewrite commit message and comment. ]
Co-developed-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Link: https://lore.kernel.org/r/20231014051754.3759099-1-zhiquan1.li@intel.com
After
0b62f6cb07 ("x86/microcode/32: Move early loading after paging enable"),
the global variable relocated_ramdisk is no longer used anywhere except
for the relocate_initrd() function. Make it a local variable of that
function.
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Link: https://lore.kernel.org/r/20231113034026.130679-1-ytcoode@gmail.com
The Processor capability bits notify ACPI of the OS capabilities, and
so ACPI can adjust the return of other Processor methods taking the OS
capabilities into account.
When Linux is running as a Xen dom0, the hypervisor is the entity
in charge of processor power management, and hence Xen needs to make
sure the capabilities reported by _OSC/_PDC match the capabilities of
the driver in Xen.
Introduce a small helper to sanitize the buffer when running as Xen
dom0.
When Xen supports HWP, this serves as the equivalent of commit
a21211672c ("ACPI / processor: Request native thermal interrupt
handling via _OSC") to avoid SMM crashes. Xen will set bit
ACPI_PROC_CAP_COLLAB_PROC_PERF (bit 12) in the capability bits and the
_OSC/_PDC call will apply it.
[ jandryuk: Mention Xen HWP's need. Support _OSC & _PDC ]
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20231108212517.72279-1-jandryuk@gmail.com
Signed-off-by: Juergen Gross <jgross@suse.com>
A Gen2 VM doesn't support legacy PCI/PCIe, so both raw_pci_ops and
raw_pci_ext_ops are NULL, and pci_subsys_init() -> pcibios_init()
doesn't call pcibios_resource_survey() -> e820__reserve_resources_late();
as a result, any emulated persistent memory of E820_TYPE_PRAM (12) via
the kernel parameter memmap=nn[KMG]!ss is not added into iomem_resource
and hence can't be detected by register_e820_pmem().
Fix this by directly calling e820__reserve_resources_late() in
hv_pci_init(), which is called from arch_initcall(pci_arch_init).
It's ok to move a Gen2 VM's e820__reserve_resources_late() from
subsys_initcall(pci_subsys_init) to arch_initcall(pci_arch_init) because
the code in-between doesn't depend on the E820 resources.
e820__reserve_resources_late() depends on e820__reserve_resources(),
which has been called earlier from setup_arch().
For a Gen-2 VM, the new hv_pci_init() also adds any memory of
E820_TYPE_PMEM (7) into iomem_resource, and acpi_nfit_register_region() ->
acpi_nfit_insert_resource() -> region_intersects() returns
REGION_INTERSECTS, so the memory of E820_TYPE_PMEM won't get added twice.
Changed the local variable "int gen2vm" to "bool gen2vm".
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1699691867-9827-1-git-send-email-ssengar@linux.microsoft.com>
Most architectures that support kprobes declare this function in their
own asm/kprobes.h header and provide an override, but some are missing
the prototype, which causes a warning for the __weak stub implementation:
kernel/kprobes.c:1865:12: error: no previous prototype for 'kprobe_exceptions_notify' [-Werror=missing-prototypes]
1865 | int __weak kprobe_exceptions_notify(struct notifier_block *self,
Move the prototype into linux/kprobes.h so it is visible to all
the definitions.
Link: https://lore.kernel.org/all/20231108125843.3806765-4-arnd@kernel.org/
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Currently, the kernel enumerates the possible CPUs by parsing both ACPI
MADT Local APIC entries and x2APIC entries. So CPUs with "valid" APIC IDs,
even if they have duplicated APIC IDs in Local APIC and x2APIC, are always
enumerated.
Below is what ACPI MADT Local APIC and x2APIC describes on an
Ivebridge-EP system,
[02Ch 0044 1] Subtable Type : 00 [Processor Local APIC]
[02Fh 0047 1] Local Apic ID : 00
...
[164h 0356 1] Subtable Type : 00 [Processor Local APIC]
[167h 0359 1] Local Apic ID : 39
[16Ch 0364 1] Subtable Type : 00 [Processor Local APIC]
[16Fh 0367 1] Local Apic ID : FF
...
[3ECh 1004 1] Subtable Type : 09 [Processor Local x2APIC]
[3F0h 1008 4] Processor x2Apic ID : 00000000
...
[B5Ch 2908 1] Subtable Type : 09 [Processor Local x2APIC]
[B60h 2912 4] Processor x2Apic ID : 00000077
As a result, kernel shows "smpboot: Allowing 168 CPUs, 120 hotplug CPUs".
And this wastes significant amount of memory for the per-cpu data.
Plus this also breaks https://lore.kernel.org/all/87edm36qqb.ffs@tglx/,
because __max_logical_packages is over-estimated by the APIC IDs in
the x2APIC entries.
According to https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#processor-local-x2apic-structure:
"[Compatibility note] On some legacy OSes, Logical processors with APIC
ID values less than 255 (whether in XAPIC or X2APIC mode) must use the
Processor Local APIC structure to convey their APIC information to OSPM,
and those processors must be declared in the DSDT using the Processor()
keyword. Logical processors with APIC ID values 255 and greater must use
the Processor Local x2APIC structure and be declared using the Device()
keyword."
Therefore prevent the registration of x2APIC entries with an APIC ID less
than 255 if the local APIC table enumerates valid APIC IDs.
[ tglx: Simplify the logic ]
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230702162802.344176-1-rui.zhang@intel.com
When a signal is being delivered, the kernel needs to make accesses to
userspace. These accesses could encounter an access error, in which case
the signal delivery itself will trigger a segfault. Usually this would
result in the kernel killing the process. But in the case of a SEGV signal
handler being configured, the failure of the first signal delivery will
result in *another* signal getting delivered. The second signal may
succeed if another thread has resolved the issue that triggered the
segfault (i.e. a well timed mprotect()/mmap()), or the second signal is
being delivered to another stack (i.e. an alt stack).
On x86, in the non-shadow stack case, all the accesses to userspace are
done before changes to the registers (in pt_regs). The operation is
aborted when an access error occurs, so although there may be writes done
for the first signal, control flow changes for the signal (regs->ip,
regs->sp, etc) are not committed until all the accesses have already
completed successfully. This means that the second signal will be
delivered as if it happened at the time of the first signal. It will
effectively replace the first aborted signal, overwriting the half-written
frame of the aborted signal. So on sigreturn from the second signal,
control flow will resume happily from the point of control flow where the
original signal was delivered.
The problem is, when shadow stack is active, the shadow stack SSP
register/MSR is updated *before* some of the userspace accesses. This
means if the earlier accesses succeed and the later ones fail, the second
signal will not be delivered at the same spot on the shadow stack as the
first one. So on sigreturn from the second signal, the SSP will be
pointing to the wrong location on the shadow stack (off by a frame).
Pengfei privately reported that while using a shadow stack enabled glibc,
the “signal06” test in the LTP test-suite hung. It turns out it is
testing the above described double signal scenario. When this test was
compiled with shadow stack, the first signal pushed a shadow stack
sigframe, then the second pushed another. When the second signal was
handled, the SSP was at the first shadow stack signal frame instead of
the original location. The test then got stuck as the #CP from the twice
incremented SSP was incorrect and generated segfaults in a loop.
Fix this by adjusting the SSP register only after any userspace accesses,
such that there can be no failures after the SSP is adjusted. Do this by
moving the shadow stack sigframe push logic to happen after all other
userspace accesses.
Note, sigreturn (as opposed to the signal delivery dealt with in this
patch) has ordering behavior that could lead to similar failures. The
ordering issues there extend beyond shadow stack to include the alt stack
restoration. Fixing that would require cross-arch changes, and the
ordering today does not cause any known test or apps breakages. So leave
it as is, for now.
[ dhansen: minor changelog/subject tweak ]
Fixes: 05e36022c0 ("x86/shstk: Handle signals for shadow stack")
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20231107182251.91276-1-rick.p.edgecombe%40intel.com
Link: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/signal/signal06.c
- Introduce configfs-tsm as a shared ABI for confidential computing
attestation reports
- Convert sev-guest to additionally support configfs-tsm alongside its
vendor specific ioctl()
- Added signed attestation report retrieval to the tdx-guest driver
forgoing a new vendor specific ioctl()
- Misc. cleanups and a new __free() annotation for kvfree()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCZUQhiQAKCRDfioYZHlFs
Z2gMAQCJdtP0f2kH+pvf3oxAkA1OubKBqJqWOppeyrhTsNMpDQEA9ljXH9h7eRB/
2NQ6USrU6jqcdu3gB5Tzq8J/ZZabMQU=
=1Eiv
-----END PGP SIGNATURE-----
Merge tag 'tsm-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux
Pull unified attestation reporting from Dan Williams:
"In an ideal world there would be a cross-vendor standard attestation
report format for confidential guests along with a common device
definition to act as the transport.
In the real world the situation ended up with multiple platform
vendors inventing their own attestation report formats with the
SEV-SNP implementation being a first mover to define a custom
sev-guest character device and corresponding ioctl(). Later, this
configfs-tsm proposal intercepted an attempt to add a tdx-guest
character device and a corresponding new ioctl(). It also anticipated
ARM and RISC-V showing up with more chardevs and more ioctls().
The proposal takes for granted that Linux tolerates the vendor report
format differentiation until a standard arrives. From talking with
folks involved, it sounds like that standardization work is unlikely
to resolve anytime soon. It also takes the position that kernfs ABIs
are easier to maintain than ioctl(). The result is a shared configfs
mechanism to return per-vendor report-blobs with the option to later
support a standard when that arrives.
Part of the goal here also is to get the community into the
"uncomfortable, but beneficial to the long term maintainability of the
kernel" state of talking to each other about their differentiation and
opportunities to collaborate. Think of this like the device-driver
equivalent of the common memory-management infrastructure for
confidential-computing being built up in KVM.
As for establishing an "upstream path for cross-vendor
confidential-computing device driver infrastructure" this is something
I want to discuss at Plumbers. At present, the multiple vendor
proposals for assigning devices to confidential computing VMs likely
needs a new dedicated repository and maintainer team, but that is a
discussion for v6.8.
For now, Greg and Thomas have acked this approach and this is passing
is AMD, Intel, and Google tests.
Summary:
- Introduce configfs-tsm as a shared ABI for confidential computing
attestation reports
- Convert sev-guest to additionally support configfs-tsm alongside
its vendor specific ioctl()
- Added signed attestation report retrieval to the tdx-guest driver
forgoing a new vendor specific ioctl()
- Misc cleanups and a new __free() annotation for kvfree()"
* tag 'tsm-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/linux:
virt: tdx-guest: Add Quote generation support using TSM_REPORTS
virt: sevguest: Add TSM_REPORTS support for SNP_GET_EXT_REPORT
mm/slab: Add __free() support for kvfree
virt: sevguest: Prep for kernel internal get_ext_report()
configfs-tsm: Introduce a shared ABI for attestation reports
virt: coco: Add a coco/Makefile and coco/Kconfig
virt: sevguest: Fix passing a stack buffer as a scatterlist target
Gleixner:
- Restructure the code needed for it and add a temporary initrd mapping
on 32-bit so that the loader can access the microcode blobs. This in
itself is a preparation for the next major improvement:
- Do not load microcode on 32-bit before paging has been enabled.
Handling this has caused an endless stream of headaches, issues, ugly
code and unnecessary hacks in the past. And there really wasn't any
sensible reason to do that in the first place. So switch the 32-bit
loading to happen after paging has been enabled and turn the loader
code "real purrty" again
- Drop mixed microcode steppings loading on Intel - there, a single patch
loaded on the whole system is sufficient
- Rework late loading to track which CPUs have updated microcode
successfully and which haven't, act accordingly
- Move late microcode loading on Intel in NMI context in order to
guarantee concurrent loading on all threads
- Make the late loading CPU-hotplug-safe and have the offlined threads
be woken up for the purpose of the update
- Add support for a minimum revision which determines whether late
microcode loading is safe on a machine and the microcode does not
change software visible features which the machine cannot use anyway
since feature detection has happened already. Roughly, the minimum
revision is the smallest revision number which must be loaded
currently on the system so that late updates can be allowed
- Other nice leanups, fixess, etc all over the place
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmVE0xkACgkQEsHwGGHe
VUrCuBAAhOqqwkYPiGXPWd2hvdn1zGtD5fvEdXn3Orzd+Lwc6YaQTsCxCjIO/0ws
8inpPFuOeGz4TZcplzipi3G5oatPVc7ORDuW+/BvQQQljZOsSKfhiaC29t6dvS6z
UG3sbCXKVwlJ5Kwv3Qe4eWur4Ex6GeFDZkIvBCmbaAdGPFlfu1i2uO1yBooNP1Rs
GiUkp+dP1/KREWwR/dOIsHYL2QjWIWfHQEWit/9Bj46rxE9ERx/TWt3AeKPfKriO
Wp0JKp6QY78jg6a0a2/JVmbT1BKz69Z9aPp6hl4P2MfbBYOnqijRhdezFW0NyqV2
pn6nsuiLIiXbnSOEw0+Wdnw5Q0qhICs5B5eaBfQrwgfZ8pxPHv2Ir777GvUTV01E
Dv0ZpYsHa+mHe17nlK8V3+4eajt0PetExcXAYNiIE+pCb7pLjjKkl8e+lcEvEsO0
QSL3zG5i5RWUMPYUvaFRgepWy3k/GPIoDQjRcUD3P+1T0GmnogNN10MMNhmOzfWU
pyafe4tJUOVsq0HJ7V/bxIHk2p+Q+5JLKh5xBm9janE4BpabmSQnvFWNblVfK4ig
M9ohjI/yMtgXROC4xkNXgi8wE5jfDKBghT6FjTqKWSV45vknF1mNEjvuaY+aRZ3H
MB4P3HCj+PKWJimWHRYnDshcytkgcgVcYDiim8va/4UDrw8O2ks=
=JOZu
-----END PGP SIGNATURE-----
Merge tag 'x86_microcode_for_v6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loading updates from Borislac Petkov:
"Major microcode loader restructuring, cleanup and improvements by
Thomas Gleixner:
- Restructure the code needed for it and add a temporary initrd
mapping on 32-bit so that the loader can access the microcode
blobs. This in itself is a preparation for the next major
improvement:
- Do not load microcode on 32-bit before paging has been enabled.
Handling this has caused an endless stream of headaches, issues,
ugly code and unnecessary hacks in the past. And there really
wasn't any sensible reason to do that in the first place. So switch
the 32-bit loading to happen after paging has been enabled and turn
the loader code "real purrty" again
- Drop mixed microcode steppings loading on Intel - there, a single
patch loaded on the whole system is sufficient
- Rework late loading to track which CPUs have updated microcode
successfully and which haven't, act accordingly
- Move late microcode loading on Intel in NMI context in order to
guarantee concurrent loading on all threads
- Make the late loading CPU-hotplug-safe and have the offlined
threads be woken up for the purpose of the update
- Add support for a minimum revision which determines whether late
microcode loading is safe on a machine and the microcode does not
change software visible features which the machine cannot use
anyway since feature detection has happened already. Roughly, the
minimum revision is the smallest revision number which must be
loaded currently on the system so that late updates can be allowed
- Other nice leanups, fixess, etc all over the place"
* tag 'x86_microcode_for_v6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
x86/microcode/intel: Add a minimum required revision for late loading
x86/microcode: Prepare for minimal revision check
x86/microcode: Handle "offline" CPUs correctly
x86/apic: Provide apic_force_nmi_on_cpu()
x86/microcode: Protect against instrumentation
x86/microcode: Rendezvous and load in NMI
x86/microcode: Replace the all-in-one rendevous handler
x86/microcode: Provide new control functions
x86/microcode: Add per CPU control field
x86/microcode: Add per CPU result state
x86/microcode: Sanitize __wait_for_cpus()
x86/microcode: Clarify the late load logic
x86/microcode: Handle "nosmt" correctly
x86/microcode: Clean up mc_cpu_down_prep()
x86/microcode: Get rid of the schedule work indirection
x86/microcode: Mop up early loading leftovers
x86/microcode/amd: Use cached microcode for AP load
x86/microcode/amd: Cache builtin/initrd microcode early
x86/microcode/amd: Cache builtin microcode too
x86/microcode/amd: Use correct per CPU ucode_cpu_info
...
- Implement the binary search in modpost for faster symbol lookup
- Respect HOSTCC when linking host programs written in Rust
- Change the binrpm-pkg target to generate kernel-devel RPM package
- Fix endianness issues for tee and ishtp MODULE_DEVICE_TABLE
- Unify vdso_install rules
- Remove unused __memexit* annotations
- Eliminate stale whitelisting for __devinit/__devexit from modpost
- Enable dummy-tools to handle the -fpatchable-function-entry flag
- Add 'userldlibs' syntax
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmVFIZgVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGeKwP+wd2kCrxAgS4zPffOcO3cVHfZwJe
AXOrTp/v73gzxb9eHXH6TmEDf1Rv7EwW3fmmGJosopJGD6itBqzJa4bNDrbq40rY
XStmg0NRmTrIG20CHGgaGWxb8/7WMrYfu0rhFdUXJjmbny6XwJ3US9FvDPC0mZz7
w9VCq5CZOqMsJcQyGkAR7uCHDRzNWiZ/Vnfbz3aa6abFzp7dsjhOgDy5SQ6qZgQz
AwHHKNEN+G3HWmGDZqcbV9aDaCk4btnz64h843RAxjy2HNJF360Ohm2KOcdJr5lo
DSSStkogBkZNSRQPtqtfknDjzITjeF4JAnUw5ivOtt8ERaO3JRUcr5gHjfw5iV/n
o4pC1SXmFzdfoN4dogoYF9rz3j955mSFlT/DSbSbuQS/ELzQs0nsqERxhV4zNCsX
KvYPUqKzZLW3i8pHNuhh7z7t4Nbz1zXqUa19FvaLNtFTCtS8/IA868a59S0uqT9I
EAIqrNy9qAsk8UuQUxWVx0qf9f5wKGYxW62iMIF9F2lsFRWA8H588CFPUuSU9Bhk
KAsvzq249MUGJd0RAjF92EWJgNz/nYzZfFTEL5HKAVauYY5UCyR3AVjrak761I8z
ctVskA7eVkaW4eARfcp15Fna15FHVzxBJ3B26oKYIJBQfJLjzZcV8XeMtEcQjEGU
jzl+oRqB/Q3oD7Nx
=PeX7
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- Implement the binary search in modpost for faster symbol lookup
- Respect HOSTCC when linking host programs written in Rust
- Change the binrpm-pkg target to generate kernel-devel RPM package
- Fix endianness issues for tee and ishtp MODULE_DEVICE_TABLE
- Unify vdso_install rules
- Remove unused __memexit* annotations
- Eliminate stale whitelisting for __devinit/__devexit from modpost
- Enable dummy-tools to handle the -fpatchable-function-entry flag
- Add 'userldlibs' syntax
* tag 'kbuild-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (30 commits)
kbuild: support 'userldlibs' syntax
kbuild: dummy-tools: pretend we understand -fpatchable-function-entry
kbuild: Correct missing architecture-specific hyphens
modpost: squash ALL_{INIT,EXIT}_TEXT_SECTIONS to ALL_TEXT_SECTIONS
modpost: merge sectioncheck table entries regarding init/exit sections
modpost: use ALL_INIT_SECTIONS for the section check from DATA_SECTIONS
modpost: disallow the combination of EXPORT_SYMBOL and __meminit*
modpost: remove EXIT_SECTIONS macro
modpost: remove MEM_INIT_SECTIONS macro
modpost: remove more symbol patterns from the section check whitelist
modpost: disallow *driver to reference .meminit* sections
linux/init: remove __memexit* annotations
modpost: remove ALL_EXIT_DATA_SECTIONS macro
kbuild: simplify cmd_ld_multi_m
kbuild: avoid too many execution of scripts/pahole-flags.sh
kbuild: remove ARCH_POSTLINK from module builds
kbuild: unify no-compiler-targets and no-sync-config-targets
kbuild: unify vdso_install rules
docs: kbuild: add INSTALL_DTBS_PATH
UML: remove unused cmd_vdso_install
...
Here is the big set of tty/serial driver changes for 6.7-rc1. Included
in here are:
- console/vgacon cleanups and removals from Arnd
- tty core and n_tty cleanups from Jiri
- lots of 8250 driver updates and cleanups
- sc16is7xx serial driver updates
- dt binding updates
- first set of port lock wrapers from Thomas for the printk fixes
coming in future releases
- other small serial and tty core cleanups and updates
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZUTbaw8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yk9+gCeKdoRb8FDwGCO/GaoHwR4EzwQXhQAoKXZRmN5
LTtw9sbfGIiBdOTtgLPb
=6PJr
-----END PGP SIGNATURE-----
Merge tag 'tty-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty and serial updates from Greg KH:
"Here is the big set of tty/serial driver changes for 6.7-rc1. Included
in here are:
- console/vgacon cleanups and removals from Arnd
- tty core and n_tty cleanups from Jiri
- lots of 8250 driver updates and cleanups
- sc16is7xx serial driver updates
- dt binding updates
- first set of port lock wrapers from Thomas for the printk fixes
coming in future releases
- other small serial and tty core cleanups and updates
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (193 commits)
serdev: Replace custom code with device_match_acpi_handle()
serdev: Simplify devm_serdev_device_open() function
serdev: Make use of device_set_node()
tty: n_gsm: add copyright Siemens Mobility GmbH
tty: n_gsm: fix race condition in status line change on dead connections
serial: core: Fix runtime PM handling for pending tx
vgacon: fix mips/sibyte build regression
dt-bindings: serial: drop unsupported samsung bindings
tty: serial: samsung: drop earlycon support for unsupported platforms
tty: 8250: Add note for PX-835
tty: 8250: Fix IS-200 PCI ID comment
tty: 8250: Add Brainboxes Oxford Semiconductor-based quirks
tty: 8250: Add support for Intashield IX cards
tty: 8250: Add support for additional Brainboxes PX cards
tty: 8250: Fix up PX-803/PX-857
tty: 8250: Fix port count of PX-257
tty: 8250: Add support for Intashield IS-100
tty: 8250: Add support for Brainboxes UP cards
tty: 8250: Add support for additional Brainboxes UC cards
tty: 8250: Remove UC-257 and UC-431
...
there's little I can say which isn't in the individual changelogs.
The lengthier patch series are
- "kdump: use generic functions to simplify crashkernel reservation in
arch", from Baoquan He. This is mainly cleanups and consolidation of
the "crashkernel=" kernel parameter handling.
- After much discussion, David Laight's "minmax: Relax type checks in
min() and max()" is here. Hopefully reduces some typecasting and the
use of min_t() and max_t().
- A group of patches from Oleg Nesterov which clean up and slightly fix
our handling of reads from /proc/PID/task/... and which remove
task_struct.therad_group.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA
jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4
63BNcG3+hM9nwGJHb5lyh5m79nBMRg0=
=On4u
-----END PGP SIGNATURE-----
Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
"As usual, lots of singleton and doubleton patches all over the tree
and there's little I can say which isn't in the individual changelogs.
The lengthier patch series are
- 'kdump: use generic functions to simplify crashkernel reservation
in arch', from Baoquan He. This is mainly cleanups and
consolidation of the 'crashkernel=' kernel parameter handling
- After much discussion, David Laight's 'minmax: Relax type checks in
min() and max()' is here. Hopefully reduces some typecasting and
the use of min_t() and max_t()
- A group of patches from Oleg Nesterov which clean up and slightly
fix our handling of reads from /proc/PID/task/... and which remove
task_struct.thread_group"
* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
scripts/gdb/vmalloc: disable on no-MMU
scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
.mailmap: add address mapping for Tomeu Vizoso
mailmap: update email address for Claudiu Beznea
tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
.mailmap: map Benjamin Poirier's address
scripts/gdb: add lx_current support for riscv
ocfs2: fix a spelling typo in comment
proc: test ProtectionKey in proc-empty-vm test
proc: fix proc-empty-vm test with vsyscall
fs/proc/base.c: remove unneeded semicolon
do_io_accounting: use sig->stats_lock
do_io_accounting: use __for_each_thread()
ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
ocfs2: fix a typo in a comment
scripts/show_delta: add __main__ judgement before main code
treewide: mark stuff as __ro_after_init
fs: ocfs2: check status values
proc: test /proc/${pid}/statm
compiler.h: move __is_constexpr() to compiler.h
...
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory' feature.
To increase the feature's checking coverage. "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes is possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In thes series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
FgeUPAD1oasg6CP+INZvCj34waNxwAc=
=E+Y4
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
API:
- Add virtual-address based lskcipher interface.
- Optimise ahash/shash performance in light of costly indirect calls.
- Remove ahash alignmask attribute.
Algorithms:
- Improve AES/XTS performance of 6-way unrolling for ppc.
- Remove some uses of obsolete algorithms (md4, md5, sha1).
- Add FIPS 202 SHA-3 support in pkcs1pad.
- Add fast path for single-page messages in adiantum.
- Remove zlib-deflate.
Drivers:
- Add support for S4 in meson RNG driver.
- Add STM32MP13x support in stm32.
- Add hwrng interface support in qcom-rng.
- Add support for deflate algorithm in hisilicon/zip.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmVB3vgACgkQxycdCkmx
i6dsOBAAykbnX8BpnpnOXYywE9ZWrl98rAk51MK0N9olZNfg78zRPIv7fFxFdC20
SDJrDSNPmn0Qvaa5e0EfoAdklsm0k2GkXL/BwPKMKWUsyIoJVYI3WrBMnjBy9xMp
yfME+h0bKoXJCZKnYkIUSGUejmUPSyRlEylrXoFlH/VWYwAaii/x9zwreQoF+0LR
KI24A1q8AYs6Dw9HSfndaAub9GOzrqKYs6fSaMG+77Y4UC5aoi5J9Bp2G3uVyHay
x/0bZtIxKXS9wn+LeG/3GspX23x/I5VwBOdAoMigrYmAIaIg5qgyMszudltTAs4R
zF1Kh7WsnM5+vpnBSeigzo+/GGOU3QTz8y3tBTg+3ZR7GWGOwQLiizhOYqCyOfAH
pIm6c++sZw/OOHiL69Nt4HeLKzGNYYWk3s4X/B/6cqoouPfOsfBaQobZNx9zfy7q
ZNEvSVBjrFX/L6wDSotny1LTWLUNjHbmLaMV5uQZ/SQKEtv19fp2Dl7SsLkHH+3v
ldOAwfoJR6QcSwz3Ez02TUAvQhtP172Hnxi7u44eiZu2aUboLhCFr7aEU6kVdBCx
1rIRVHD1oqlOEDRwPRXzhF3I8R4QDORJIxZ6UUhg7yueuI+XCGDsBNC+LqBrBmSR
IbdjqmSDUBhJyM5yMnt1VFYhqKQ/ZzwZ3JQviwW76Es9pwEIolM=
=IZmR
-----END PGP SIGNATURE-----
Merge tag 'v6.7-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Add virtual-address based lskcipher interface
- Optimise ahash/shash performance in light of costly indirect calls
- Remove ahash alignmask attribute
Algorithms:
- Improve AES/XTS performance of 6-way unrolling for ppc
- Remove some uses of obsolete algorithms (md4, md5, sha1)
- Add FIPS 202 SHA-3 support in pkcs1pad
- Add fast path for single-page messages in adiantum
- Remove zlib-deflate
Drivers:
- Add support for S4 in meson RNG driver
- Add STM32MP13x support in stm32
- Add hwrng interface support in qcom-rng
- Add support for deflate algorithm in hisilicon/zip"
* tag 'v6.7-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (283 commits)
crypto: adiantum - flush destination page before unmapping
crypto: testmgr - move pkcs1pad(rsa,sha3-*) to correct place
Documentation/module-signing.txt: bring up to date
module: enable automatic module signing with FIPS 202 SHA-3
crypto: asymmetric_keys - allow FIPS 202 SHA-3 signatures
crypto: rsa-pkcs1pad - Add FIPS 202 SHA-3 support
crypto: FIPS 202 SHA-3 register in hash info for IMA
x509: Add OIDs for FIPS 202 SHA-3 hash and signatures
crypto: ahash - optimize performance when wrapping shash
crypto: ahash - check for shash type instead of not ahash type
crypto: hash - move "ahash wrapping shash" functions to ahash.c
crypto: talitos - stop using crypto_ahash::init
crypto: chelsio - stop using crypto_ahash::init
crypto: ahash - improve file comment
crypto: ahash - remove struct ahash_request_priv
crypto: ahash - remove crypto_ahash_alignmask
crypto: gcm - stop using alignmask of ahash
crypto: chacha20poly1305 - stop using alignmask of ahash
crypto: ccm - stop using alignmask of ahash
net: ipv6: stop checking crypto_ahash_alignmask
...
* Generalized infrastructure for 'writable' ID registers, effectively
allowing userspace to opt-out of certain vCPU features for its guest
* Optimization for vSGI injection, opportunistically compressing MPIDR
to vCPU mapping into a table
* Improvements to KVM's PMU emulation, allowing userspace to select
the number of PMCs available to a VM
* Guest support for memory operation instructions (FEAT_MOPS)
* Cleanups to handling feature flags in KVM_ARM_VCPU_INIT, squashing
bugs and getting rid of useless code
* Changes to the way the SMCCC filter is constructed, avoiding wasted
memory allocations when not in use
* Load the stage-2 MMU context at vcpu_load() for VHE systems, reducing
the overhead of errata mitigations
* Miscellaneous kernel and selftest fixes
LoongArch:
* New architecture. The hardware uses the same model as x86, s390
and RISC-V, where guest/host mode is orthogonal to supervisor/user
mode. The virtualization extensions are very similar to MIPS,
therefore the code also has some similarities but it's been cleaned
up to avoid some of the historical bogosities that are found in
arch/mips. The kernel emulates MMU, timer and CSR accesses, while
interrupt controllers are only emulated in userspace, at least for
now.
RISC-V:
* Support for the Smstateen and Zicond extensions
* Support for virtualizing senvcfg
* Support for virtualized SBI debug console (DBCN)
S390:
* Nested page table management can be monitored through tracepoints
and statistics
x86:
* Fix incorrect handling of VMX posted interrupt descriptor in KVM_SET_LAPIC,
which could result in a dropped timer IRQ
* Avoid WARN on systems with Intel IPI virtualization
* Add CONFIG_KVM_MAX_NR_VCPUS, to allow supporting up to 4096 vCPUs without
forcing more common use cases to eat the extra memory overhead.
* Add virtualization support for AMD SRSO mitigation (IBPB_BRTYPE and
SBPB, aka Selective Branch Predictor Barrier).
* Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
creating the original vCPU would cause KVM to try to synchronize the vCPU's
TSC and thus clobber the correct TSC being set by userspace.
* Compute guest wall clock using a single TSC read to avoid generating an
inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
* "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
Likewise "virtualize" (ignore) MSR_AMD64_TW_CFG to appease Windows Server
2022.
* Don't apply side effects to Hyper-V's synthetic timer on writes from
userspace to fix an issue where the auto-enable behavior can trigger
spurious interrupts, i.e. do auto-enabling only for guest writes.
* Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
without PML enabled.
* Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
* Harden the fast page fault path to guard against encountering an invalid
root when walking SPTEs.
* Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
* Use the fast path directly from the timer callback when delivering Xen
timer events, instead of waiting for the next iteration of the run loop.
This was not done so far because previously proposed code had races,
but now care is taken to stop the hrtimer at critical points such as
restarting the timer or saving the timer information for userspace.
* Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
* Optimize injection of PMU interrupts that are simultaneous with NMIs.
* Usual handful of fixes for typos and other warts.
x86 - MTRR/PAT fixes and optimizations:
* Clean up code that deals with honoring guest MTRRs when the VM has
non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.
* Zap EPT entries when non-coherent DMA assignment stops/start to prevent
using stale entries with the wrong memtype.
* Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y.
This was done as a workaround for virtual machine BIOSes that did not
bother to clear CR0.CD (because ancient KVM/QEMU did not bother to
set it, in turn), and there's zero reason to extend the quirk to
also ignore guest PAT.
x86 - SEV fixes:
* Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
running an SEV-ES guest.
* Clean up the recognition of emulation failures on SEV guests, when KVM would
like to "skip" the instruction but it had already been partially emulated.
This makes it possible to drop a hack that second guessed the (insufficient)
information provided by the emulator, and just do the right thing.
Documentation:
* Various updates and fixes, mostly for x86
* MTRR and PAT fixes and optimizations:
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmVBZc0UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroP1LQf+NgsmZ1lkGQlKdSdijoQ856w+k0or
l2SV1wUwiEdFPSGK+RTUlHV5Y1ni1dn/CqCVIJZKEI3ZtZ1m9/4HKIRXvbMwFHIH
hx+E4Lnf8YUjsGjKTLd531UKcpphztZavQ6pXLEwazkSkDEra+JIKtooI8uU+9/p
bd/eF1V+13a8CHQf1iNztFJVxqBJbVlnPx4cZDRQQvewskIDGnVDtwbrwCUKGtzD
eNSzhY7si6O2kdQNkuA8xPhg29dYX9XLaCK2K1l8xOUm8WipLdtF86GAKJ5BVuOL
6ek/2QCYjZ7a+coAZNfgSEUi8JmFHEqCo7cnKmWzPJp+2zyXsdudqAhT1g==
=UIxm
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Generalized infrastructure for 'writable' ID registers, effectively
allowing userspace to opt-out of certain vCPU features for its
guest
- Optimization for vSGI injection, opportunistically compressing
MPIDR to vCPU mapping into a table
- Improvements to KVM's PMU emulation, allowing userspace to select
the number of PMCs available to a VM
- Guest support for memory operation instructions (FEAT_MOPS)
- Cleanups to handling feature flags in KVM_ARM_VCPU_INIT, squashing
bugs and getting rid of useless code
- Changes to the way the SMCCC filter is constructed, avoiding wasted
memory allocations when not in use
- Load the stage-2 MMU context at vcpu_load() for VHE systems,
reducing the overhead of errata mitigations
- Miscellaneous kernel and selftest fixes
LoongArch:
- New architecture for kvm.
The hardware uses the same model as x86, s390 and RISC-V, where
guest/host mode is orthogonal to supervisor/user mode. The
virtualization extensions are very similar to MIPS, therefore the
code also has some similarities but it's been cleaned up to avoid
some of the historical bogosities that are found in arch/mips. The
kernel emulates MMU, timer and CSR accesses, while interrupt
controllers are only emulated in userspace, at least for now.
RISC-V:
- Support for the Smstateen and Zicond extensions
- Support for virtualizing senvcfg
- Support for virtualized SBI debug console (DBCN)
S390:
- Nested page table management can be monitored through tracepoints
and statistics
x86:
- Fix incorrect handling of VMX posted interrupt descriptor in
KVM_SET_LAPIC, which could result in a dropped timer IRQ
- Avoid WARN on systems with Intel IPI virtualization
- Add CONFIG_KVM_MAX_NR_VCPUS, to allow supporting up to 4096 vCPUs
without forcing more common use cases to eat the extra memory
overhead.
- Add virtualization support for AMD SRSO mitigation (IBPB_BRTYPE and
SBPB, aka Selective Branch Predictor Barrier).
- Fix a bug where restoring a vCPU snapshot that was taken within 1
second of creating the original vCPU would cause KVM to try to
synchronize the vCPU's TSC and thus clobber the correct TSC being
set by userspace.
- Compute guest wall clock using a single TSC read to avoid
generating an inaccurate time, e.g. if the vCPU is preempted
between multiple TSC reads.
- "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which
complain about a "Firmware Bug" if the bit isn't set for select
F/M/S combos. Likewise "virtualize" (ignore) MSR_AMD64_TW_CFG to
appease Windows Server 2022.
- Don't apply side effects to Hyper-V's synthetic timer on writes
from userspace to fix an issue where the auto-enable behavior can
trigger spurious interrupts, i.e. do auto-enabling only for guest
writes.
- Remove an unnecessary kick of all vCPUs when synchronizing the
dirty log without PML enabled.
- Advertise "support" for non-serializing FS/GS base MSR writes as
appropriate.
- Harden the fast page fault path to guard against encountering an
invalid root when walking SPTEs.
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
- Use the fast path directly from the timer callback when delivering
Xen timer events, instead of waiting for the next iteration of the
run loop. This was not done so far because previously proposed code
had races, but now care is taken to stop the hrtimer at critical
points such as restarting the timer or saving the timer information
for userspace.
- Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future
flag.
- Optimize injection of PMU interrupts that are simultaneous with
NMIs.
- Usual handful of fixes for typos and other warts.
x86 - MTRR/PAT fixes and optimizations:
- Clean up code that deals with honoring guest MTRRs when the VM has
non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.
- Zap EPT entries when non-coherent DMA assignment stops/start to
prevent using stale entries with the wrong memtype.
- Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y
This was done as a workaround for virtual machine BIOSes that did
not bother to clear CR0.CD (because ancient KVM/QEMU did not bother
to set it, in turn), and there's zero reason to extend the quirk to
also ignore guest PAT.
x86 - SEV fixes:
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts
SHUTDOWN while running an SEV-ES guest.
- Clean up the recognition of emulation failures on SEV guests, when
KVM would like to "skip" the instruction but it had already been
partially emulated. This makes it possible to drop a hack that
second guessed the (insufficient) information provided by the
emulator, and just do the right thing.
Documentation:
- Various updates and fixes, mostly for x86
- MTRR and PAT fixes and optimizations"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (164 commits)
KVM: selftests: Avoid using forced target for generating arm64 headers
tools headers arm64: Fix references to top srcdir in Makefile
KVM: arm64: Add tracepoint for MMIO accesses where ISV==0
KVM: arm64: selftest: Perform ISB before reading PAR_EL1
KVM: arm64: selftest: Add the missing .guest_prepare()
KVM: arm64: Always invalidate TLB for stage-2 permission faults
KVM: x86: Service NMI requests after PMI requests in VM-Enter path
KVM: arm64: Handle AArch32 SPSR_{irq,abt,und,fiq} as RAZ/WI
KVM: arm64: Do not let a L1 hypervisor access the *32_EL2 sysregs
KVM: arm64: Refine _EL2 system register list that require trap reinjection
arm64: Add missing _EL2 encodings
arm64: Add missing _EL12 encodings
KVM: selftests: aarch64: vPMU test for validating user accesses
KVM: selftests: aarch64: vPMU register test for unimplemented counters
KVM: selftests: aarch64: vPMU register test for implemented counters
KVM: selftests: aarch64: Introduce vpmu_counter_access test
tools: Import arm_pmuv3.h
KVM: arm64: PMU: Allow userspace to limit PMCR_EL0.N for the guest
KVM: arm64: Sanitize PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} before first run
KVM: arm64: Add {get,set}_user for PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmVBaU8UHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vwEdxAAo++s98+ZaaTdUuoV0Zpft1fuY6Yr
mR80jUDxjHDbcI1G4iNVUSWG6pGIdlURnrBp5kU74FV9R2Ps3Fl49XQUHowE0HfH
D/qmihiJQdnMsQKwzw3XGoTSINrDcF6nLafl9brBItVkgjNxfxSEbnweJMBf+Boc
rpRXHzxbVHVjwwhBLODF2Wt/8sQ24w9c+wcQkpo7im8ZZReoigNMKgEa4J7tLlqA
vTyPR/K6QeU8IBUk2ObCY3GeYrVuqi82eRK3Uwzu7IkQwA9orE416Okvq3Z026/h
TUAivtrcygHaFRdGNvzspYLbc2hd2sEXF+KKKb6GNAjxuDWUhVQW4ObY4FgFkZ65
Gqz/05D6c1dqTS3vTxp3nZYpvPEbNnO1RaGRL4h0/mbU+QSPSlHXWd9Lfg6noVVd
3O+CcstQK8RzMiiWLeyctRPV5XIf7nGVQTJW5aCLajlHeJWcvygNpNG4N57j/hXQ
gyEHrz3idXXHXkBKmyWZfre6YpLkxZtKyONZDHWI/AVhU0TgRdJWmqpRfC1kVVUe
IUWBRcPUF4/r3jEu6t10N/aDWQN1uQzIsJNnCrKzAddPDTTYQJk8VVzKPo8SVxPD
X+OjEMgBB/fXUfkJ7IMwgYnWaFJhxthrs6/3j1UqRvGYRoulE4NdWwJDky9UYIHd
qV3dzuAxC/cpv08=
=G//C
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
"Enumeration:
- Use acpi_evaluate_dsm_typed() instead of open-coding _DSM
evaluation to learn device characteristics (Andy Shevchenko)
- Tidy multi-function header checks using new PCI_HEADER_TYPE_MASK
definition (Ilpo Järvinen)
- Simplify config access error checking in various drivers (Ilpo
Järvinen)
- Use pcie_capability_clear_word() (not
pcie_capability_clear_and_set_word()) when only clearing (Ilpo
Järvinen)
- Add pci_get_base_class() to simplify finding devices using base
class only (ignoring subclass and programming interface) (Sui
Jingfeng)
- Add pci_is_vga(), which includes ancient PCI_CLASS_NOT_DEFINED_VGA
devices from before the Class Code was added to PCI (Sui Jingfeng)
- Use pci_is_vga() for vgaarb, sysfs "boot_vga", virtio, qxl to
include ancient VGA devices (Sui Jingfeng)
Resource management:
- Make pci_assign_unassigned_resources() non-init because sparc uses
it after init (Randy Dunlap)
Driver binding:
- Retain .remove() and .probe() callbacks (previously __init) because
sysfs may cause them to be called later (Uwe Kleine-König)
- Prevent xHCI driver from claiming AMD VanGogh USB3 DRD device, so
it can be claimed by dwc3 instead (Vicki Pfau)
PCI device hotplug:
- Add Ampere Altra Attention Indicator extension driver for acpiphp
(D Scott Phillips)
Power management:
- Quirk VideoPropulsion Torrent QN16e with longer delay after reset
(Lukas Wunner)
- Prevent users from overriding drivers that say we shouldn't use
D3cold (Lukas Wunner)
- Avoid PME from D3hot/D3cold for AMD Rembrandt and Phoenix USB4
because wakeup interrupts from those states don't work if amd-pmc
has put the platform in a hardware sleep state (Mario Limonciello)
IOMMU:
- Disable ATS for Intel IPU E2000 devices with invalidation message
endianness erratum (Bartosz Pawlowski)
Error handling:
- Factor out interrupt enable/disable into helpers (Kai-Heng Feng)
Peer-to-peer DMA:
- Fix flexible-array usage in struct pci_p2pdma_pagemap in case we
ever use pagemaps with multiple entries (Gustavo A. R. Silva)
ASPM:
- Revert a change that broke when drivers disabled L1 and users later
enabled an L1.x substate via sysfs, and fix a similar issue when
users disabled L1 via sysfs (Heiner Kallweit)
Endpoint framework:
- Fix double free in __pci_epc_create() (Dan Carpenter)
- Use IS_ERR_OR_NULL() to simplify endpoint core (Ruan Jinjie)
Cadence PCIe controller driver:
- Drop unused "is_rc" member (Li Chen)
Freescale Layerscape PCIe controller driver:
- Enable 64-bit addressing in endpoint mode (Guanhua Gao)
Intel VMD host bridge driver:
- Fix multi-function header check (Ilpo Järvinen)
Microsoft Hyper-V host bridge driver:
- Annotate struct hv_dr_state with __counted_by (Kees Cook)
NVIDIA Tegra194 PCIe controller driver:
- Drop setting of LNKCAP_MLW (max link width) since dw_pcie_setup()
already does this via dw_pcie_link_set_max_link_width() (Yoshihiro
Shimoda)
Qualcomm PCIe controller driver:
- Use PCIE_SPEED2MBS_ENC() to simplify encoding of link speed
(Manivannan Sadhasivam)
- Add a .write_dbi2() callback so DBI2 register writes, e.g., for
setting the BAR size, work correctly (Manivannan Sadhasivam)
- Enable ASPM for platforms that use 1.9.0 ops, because the PCI core
doesn't enable ASPM states that haven't been enabled by the
firmware (Manivannan Sadhasivam)
Renesas R-Car Gen4 PCIe controller driver:
- Add DesignWare core support (set max link width, EDMA_UNROLL flag,
.pre_init(), .deinit(), etc) for use by R-Car Gen4 driver
(Yoshihiro Shimoda)
- Add driver and DT schema for DesignWare-based Renesas R-Car Gen4
controller in both host and endpoint mode (Yoshihiro Shimoda)
Xilinx NWL PCIe controller driver:
- Update ECAM size to support 256 buses (Thippeswamy Havalige)
- Stop setting bridge primary/secondary/subordinate bus numbers,
since PCI core does this (Thippeswamy Havalige)
Xilinx XDMA controller driver:
- Add driver and DT schema for Zynq UltraScale+ MPSoCs devices with
Xilinx XDMA Soft IP (Thippeswamy Havalige)
Miscellaneous:
- Use FIELD_GET()/FIELD_PREP() to simplify and reduce use of _SHIFT
macros (Ilpo Järvinen, Bjorn Helgaas)
- Remove logic_outb(), _outw(), outl() duplicate declarations (John
Sanpe)
- Replace unnecessary UTF-8 in Kconfig help text because menuconfig
doesn't render it correctly (Liu Song)"
* tag 'pci-v6.7-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (102 commits)
PCI: qcom-ep: Add dedicated callback for writing to DBI2 registers
PCI: Simplify pcie_capability_clear_and_set_word() to ..._clear_word()
PCI: endpoint: Fix double free in __pci_epc_create()
PCI: xilinx-xdma: Add Xilinx XDMA Root Port driver
dt-bindings: PCI: xilinx-xdma: Add schemas for Xilinx XDMA PCIe Root Port Bridge
PCI: xilinx-cpm: Move IRQ definitions to a common header
PCI: xilinx-nwl: Modify ECAM size to enable support for 256 buses
PCI: xilinx-nwl: Rename the NWL_ECAM_VALUE_DEFAULT macro
dt-bindings: PCI: xilinx-nwl: Modify ECAM size in the DT example
PCI: xilinx-nwl: Remove redundant code that sets Type 1 header fields
PCI: hotplug: Add Ampere Altra Attention Indicator extension driver
PCI/AER: Factor out interrupt toggling into helpers
PCI: acpiphp: Allow built-in drivers for Attention Indicators
PCI/portdrv: Use FIELD_GET()
PCI/VC: Use FIELD_GET()
PCI/PTM: Use FIELD_GET()
PCI/PME: Use FIELD_GET()
PCI/ATS: Use FIELD_GET()
PCI/ATS: Show PASID Capability register width in bitmasks
PCI/ASPM: Fix L1 substate handling in aspm_attr_store_common()
...
To help make the move of sysctls out of kernel/sysctl.c not incur a size
penalty sysctl has been changed to allow us to not require the sentinel, the
final empty element on the sysctl array. Joel Granados has been doing all this
work. On the v6.6 kernel we got the major infrastructure changes required to
support this. For v6.7-rc1 we have all arch/ and drivers/ modified to remove
the sentinel. Both arch and driver changes have been on linux-next for a bit
less than a month. It is worth re-iterating the value:
- this helps reduce the overall build time size of the kernel and run time
memory consumed by the kernel by about ~64 bytes per array
- the extra 64-byte penalty is no longer inncurred now when we move sysctls
out from kernel/sysctl.c to their own files
For v6.8-rc1 expect removal of all the sentinels and also then the unneeded
check for procname == NULL.
The last 2 patches are fixes recently merged by Krister Johansen which allow
us again to use softlockup_panic early on boot. This used to work but the
alias work broke it. This is useful for folks who want to detect softlockups
super early rather than wait and spend money on cloud solutions with nothing
but an eventual hung kernel. Although this hadn't gone through linux-next it's
also a stable fix, so we might as well roll through the fixes now.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmVCqKsSHG1jZ3JvZkBr
ZXJuZWwub3JnAAoJEM4jHQowkoinEgYQAIpkqRL85DBwems19Uk9A27lkctwZ6Fc
HdslQCObQTsbuKVimZFP4IL2beUfUE0cfLZCXlzp+4nRDOf6vyhyf3w19jPQtI0Q
YdqwTk9y6G5VjDsb35QK0+UBloY/kZ1H3/LW4uCwjXTuksUGmWW2Qvey35696Scv
hDMLADqKQmdpYxLUaNi9QyYbEAjYtOai2ezg3+i7hTG168t1k/Ab2BxIFrPVsCR2
FAiq05L4ugWjNskdsWBjck05JZsx9SK/qcAxpIPoUm4nGiFNHApXE0E0hs3vsnmn
WIHIbxCQw8ZlUDlmw4S+0YH3NFFzFbWfmW8k2b0f2qZTJm/rU4KiJfcJVknkAUVF
raFox6XDW0AUQ9L/NOUJ9ip5rup57GcFrMYocdJ3PPAvvmHKOb1D1O741p75RRcc
9j7zwfIRrzjPUqzhsQS/GFjdJu3lJNmEBK1AcgrVry6WoItrAzJHKPPDC7TwaNmD
eXpjxMl1sYzzHqtVh4hn+xkUYphj/6gTGMV8zdo+/FopFswgeJW9G8kHtlEWKDPk
MRIKwACmfetP6f3ngHunBg+BOipbjCANL7JI0nOhVOQoaULxCCPx+IPJ6GfSyiuH
AbcjH8DGI7fJbUkBFoF0dsRFZ2gH8ds1PYMbWUJ6x3FtuCuv5iIuvQYoaWU6itm7
6f0KvCogg0fU
=Qf50
-----END PGP SIGNATURE-----
Merge tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull sysctl updates from Luis Chamberlain:
"To help make the move of sysctls out of kernel/sysctl.c not incur a
size penalty sysctl has been changed to allow us to not require the
sentinel, the final empty element on the sysctl array. Joel Granados
has been doing all this work. On the v6.6 kernel we got the major
infrastructure changes required to support this. For v6.7-rc1 we have
all arch/ and drivers/ modified to remove the sentinel. Both arch and
driver changes have been on linux-next for a bit less than a month. It
is worth re-iterating the value:
- this helps reduce the overall build time size of the kernel and run
time memory consumed by the kernel by about ~64 bytes per array
- the extra 64-byte penalty is no longer inncurred now when we move
sysctls out from kernel/sysctl.c to their own files
For v6.8-rc1 expect removal of all the sentinels and also then the
unneeded check for procname == NULL.
The last two patches are fixes recently merged by Krister Johansen
which allow us again to use softlockup_panic early on boot. This used
to work but the alias work broke it. This is useful for folks who want
to detect softlockups super early rather than wait and spend money on
cloud solutions with nothing but an eventual hung kernel. Although
this hadn't gone through linux-next it's also a stable fix, so we
might as well roll through the fixes now"
* tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (23 commits)
watchdog: move softlockup_panic back to early_param
proc: sysctl: prevent aliased sysctls from getting passed to init
intel drm: Remove now superfluous sentinel element from ctl_table array
Drivers: hv: Remove now superfluous sentinel element from ctl_table array
raid: Remove now superfluous sentinel element from ctl_table array
fw loader: Remove the now superfluous sentinel element from ctl_table array
sgi-xp: Remove the now superfluous sentinel element from ctl_table array
vrf: Remove the now superfluous sentinel element from ctl_table array
char-misc: Remove the now superfluous sentinel element from ctl_table array
infiniband: Remove the now superfluous sentinel element from ctl_table array
macintosh: Remove the now superfluous sentinel element from ctl_table array
parport: Remove the now superfluous sentinel element from ctl_table array
scsi: Remove now superfluous sentinel element from ctl_table array
tty: Remove now superfluous sentinel element from ctl_table array
xen: Remove now superfluous sentinel element from ctl_table array
hpet: Remove now superfluous sentinel element from ctl_table array
c-sky: Remove now superfluous sentinel element from ctl_talbe array
powerpc: Remove now superfluous sentinel element from ctl_table arrays
riscv: Remove now superfluous sentinel element from ctl_table array
x86/vdso: Remove now superfluous sentinel element from ctl_table array
...
The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will
be maintained as an LTS kernel.
The architecture specific system call tables are updated for
the added map_shadow_stack() syscall and to remove references
to the long-gone sys_lookup_dcookie() syscall.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmVC40IACgkQYKtH/8kJ
Uidhmw/9EX+aWSXGoObJ3fngaNSMw+PmrEuP8qEKBHxfKHcCdX3hc451Oh4GlhaQ
tru91pPwgNvN2/rfoKusxT+V4PemGIzfNni/04rp+P0kvmdw5otQ2yNhsQNsfVmq
XGWvkxF4P2GO6bkjjfR/1dDq7GtlyXtwwPDKeLbYb6TnJOZjtx+EAN27kkfSn1Ms
R4Sa3zJ+DfHUmHL5S9g+7UD/CZ5GfKNmIskI4Mz5GsfoUz/0iiU+Bge/9sdcdSJQ
kmbLy5YnVzfooLZ3TQmBFsO3iAMWb0s/mDdtyhqhTVmTUshLolkPYyKnPFvdupyv
shXcpEST2XJNeaDRnL2K4zSCdxdbnCZHDpjfl9wfioBg7I8NfhXKpf1jYZHH1de4
LXq8ndEFEOVQw/zSpYWfQq1sux8Jiqr+UK/ukbVeFWiGGIUs91gEWtPAf8T0AZo9
ujkJvaWGl98O1g5wmBu0/dAR6QcFJMDfVwbmlIFpU8O+MEaz6X8mM+O5/T0IyTcD
eMbAUjj4uYcU7ihKzHEv/0SS9Of38kzff67CLN5k8wOP/9NlaGZ78o1bVle9b52A
BdhrsAefFiWHp1jT6Y9Rg4HOO/TguQ9e6EWSKOYFulsiLH9LEFaB9RwZLeLytV0W
vlAgY9rUW77g1OJcb7DoNv33nRFuxsKqsnz3DEIXtgozo9CzbYI=
=H1vH
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull ia64 removal and asm-generic updates from Arnd Bergmann:
- The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will be
maintained as an LTS kernel.
- The architecture specific system call tables are updated for the
added map_shadow_stack() syscall and to remove references to the
long-gone sys_lookup_dcookie() syscall.
* tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
hexagon: Remove unusable symbols from the ptrace.h uapi
asm-generic: Fix spelling of architecture
arch: Reserve map_shadow_stack() syscall number for all architectures
syscalls: Cleanup references to sys_lookup_dcookie()
Documentation: Drop or replace remaining mentions of IA64
lib/raid6: Drop IA64 support
Documentation: Drop IA64 from feature descriptions
kernel: Drop IA64 support from sig_fault handlers
arch: Remove Itanium (IA-64) architecture
* Handle retrying/resuming page conversion hypercalls
* Make sure to use the (shockingly) reliable TSC in TDX guests
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmVBlqMACgkQaDWVMHDJ
krBrhBAArKay0MvzmdzS4IQs8JqkmuMEHI6WabYv2POPjJNXrn5MelLH972pLuX9
NJ3+yeOLmNMYwqu5qwLCxyeO5CtqEyT2lNumUrxAtHQG4+oS2RYJYUalxMuoGxt8
fAHxbItFg0TobBSUtwcnN2R2WdXwPuUW0Co+pJfLlZV4umVM7QANO1nf1g8YmlDD
sVtpDaeKJRdylmwgWgAyGow0tDKd6oZB9j/vOHvZRrEQ+DMjEtG75fjwbjbu43Cl
tI/fbxKjzAkOFcZ7PEPsQ8jE1h9DXU+JzTML9Nu/cPMalxMeBg3Dol/JOEbqgreI
4W8Lg7g071EkUcQDxpwfe4aS6rsfsbwUIV4gJVkg9ZhlT7RayWsFik2CfBpJ4IMQ
TM8BxtCEGCz3cxVvg3mstX9rRA7eNlXOzcKE/8Y7cpSsp94bA9jtf2GgUSUoi9St
y+fIEei8mgeHutdiFh8psrmR7hp6iX/ldMFqHtjNo6xatf2KjdVHhVSU13Jz544z
43ATNi1gZeHOgfwlAlIxLPDVDJidHuux3f6g2vfMkAqItyEqFauC1HA1pIDgckoY
9FpBPp9vNUToSPp6reB6z/PkEBIrG2XtQh82JLt2CnCb6aTUtnPds+psjtT4sSE/
a9SQvZLWWmpj+BlI2yrtfJzhy7SwhltgdjItQHidmCNEn0PYfTc=
=FJ1Y
-----END PGP SIGNATURE-----
Merge tag 'x86_tdx_for_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 TDX updates from Dave Hansen:
"The majority of this is a rework of the assembly and C wrappers that
are used to talk to the TDX module and VMM. This is a nice cleanup in
general but is also clearing the way for using this code when Linux is
the TDX VMM.
There are also some tidbits to make TDX guests play nicer with Hyper-V
and to take advantage the hardware TSC.
Summary:
- Refactor and clean up TDX hypercall/module call infrastructure
- Handle retrying/resuming page conversion hypercalls
- Make sure to use the (shockingly) reliable TSC in TDX guests"
[ TLA reminder: TDX is "Trust Domain Extensions", Intel's guest VM
confidentiality technology ]
* tag 'x86_tdx_for_6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tdx: Mark TSC reliable
x86/tdx: Fix __noreturn build warning around __tdx_hypercall_failed()
x86/virt/tdx: Make TDX_MODULE_CALL handle SEAMCALL #UD and #GP
x86/virt/tdx: Wire up basic SEAMCALL functions
x86/tdx: Remove 'struct tdx_hypercall_args'
x86/tdx: Reimplement __tdx_hypercall() using TDX_MODULE_CALL asm
x86/tdx: Make TDX_HYPERCALL asm similar to TDX_MODULE_CALL
x86/tdx: Extend TDX_MODULE_CALL to support more TDCALL/SEAMCALL leafs
x86/tdx: Pass TDCALL/SEAMCALL input/output registers via a structure
x86/tdx: Rename __tdx_module_call() to __tdcall()
x86/tdx: Make macros of TDCALLs consistent with the spec
x86/tdx: Skip saving output regs when SEAMCALL fails with VMFailInvalid
x86/tdx: Zero out the missing RSI in TDX_HYPERCALL macro
x86/tdx: Retry partially-completed page conversion hypercalls
kernel:
- add initial vmemdup-user-array
core:
- fix platform remove() to return void
- drm_file owner updated to reflect owner
- move size calcs to drm buddy allocator
- let GPUVM build as a module
- allow variable number of run-queues in scheduler
edid:
- handle bad h/v sync_end in EDIDs
panfrost:
- add Boris as maintainer
fbdev:
- use fb_ops helpers more
- only allow logo use from fbcon
- rename fb_pgproto to pgprot_framebuffer
- add HPD state to drm_connector_oob_hotplug_event
- convert to fbdev i/o mem helpers
i915:
- Enable meteorlake by default
- Early Xe2 LPD/Lunarlake display enablement
- Rework subplatforms into IP version checks
- GuC based TLB invalidation for Meteorlake
- Display rework for future Xe driver integration
- LNL FBC features
- LNL display feature capability reads
- update recommended fw versions for DG2+
- drop fastboot module parameter
- added deviceid for Arrowlake-S
- drop preproduction workarounds
- don't disable preemption for resets
- cleanup inlines in headers
- PXP firmware loading fix
- Fix sg list lengths
- DSC PPS state readout/verification
- Add more RPL P/U PCI IDs
- Add new DG2-G12 stepping
- DP enhanced framing support to state checker
- Improve shared link bandwidth management
- stop using GEM macros in display code
- refactor related code into display code
- locally enable W=1 warnings
- remove PSR watchdog timers on LNL
amdgpu:
- RAS/FRU EEPROM updatse
- IP discovery updatses
- GC 11.5 support
- DCN 3.5 support
- VPE 6.1 support
- NBIO 7.11 support
- DML2 support
- lots of IP updates
- use flexible arrays for bo list handling
- W=1 fixes
- Enable seamless boot in more cases
- Enable context type property for HDMI
- Rework GPUVM TLB flushing
- VCN IB start/size alignment fixes
amdkfd:
- GC 10/11 fixes
- GC 11.5 support
- use partial migration in GPU faults
radeon:
- W=1 Fixes
- fix some possible buffer overflow/NULL derefs
nouveau:
- update uapi for NO_PREFETCH
- scheduler/fence fixes
- rework suspend/resume for GSP-RM
- rework display in preparation for GSP-RM
habanalabs:
- uapi: expose tsc clock
- uapi: block access to eventfd through control device
- uapi: force dma-buf export to PAGE_SIZE alignments
- complete move to accel subsystem
- move firmware interface include files
- perform hard reset on PCIe AXI drain event
- optimise user interrupt handling
msm:
- DP: use existing helpers for DPCD
- DPU: interrupts reworked
- gpu: a7xx (a730/a740) support
- decouple msm_drv from kms for headless devices
mediatek:
- MT8188 dsi/dp/edp support
- DDP GAMMA - 12 bit LUT support
- connector dynamic selection capability
rockchip:
- rv1126 mipi-dsi/vop support
- add planar formats
ast:
- rename constants
panels:
- Mitsubishi AA084XE01
- JDI LPM102A188A
- LTK050H3148W-CTA6
ivpu:
- power management fixes
qaic:
- add detach slice bo api
komeda:
- add NV12 writeback
tegra:
- support NVSYNC/NHSYNC
- host1x suspend fixes
ili9882t:
- separate into own driver
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmVAgzYACgkQDHTzWXnE
hr7ZEQ//UXne3tyGOsU3X8r+lstLFDMa90a3hvTg6hX+Q0MjHd/clwkKFkLpkipL
n7gIZlaHl11dRs0FzrIZA5EVAAgjMLKmIl10NBDFec6ZFA3VERcggx8y61uifI15
VviMR1VbLHYZaCdyrQOK0A4wcktWnKXyoXp7cwy9crdc2GOBMUZkdIqtvD7jHxQx
UMIFnzi1CyKUX/Fjt/JceYcNk9y2ZGkzakYO3sHcUdv4DPu9qX4kNzpjF691AZBP
UeKWvCswTRVg2M0kuo/RYIBzqaTmOlk6dHLWBognIeZPyuyhCcaGC2d64c6tShwQ
dtHdi+IgyQ8s2qb350ymKTQUP7xA/DfZBwH7LvrZALBxeQGYQN1CnsgDMOS2wcUc
XrRFiS7PxEOtMMBctcPBnnoV5ttnsLLlPpzM9puh9sUFMn6CgLzcAMqXdqxzMajH
+dz2aD1N0vMqq4varozOg9SC2QamgUiPN/TQfrulhCTCfQaXczy5x1OYiIz65+Sl
mKoe2WASuP9Ve8do4N/wEwH5SZY2ItipBdUTRxttY9NTanmV0X5DjZBXH5b9XGci
Zl5Ar613f9zwm5T5BVA5k6s3ZbGY6QcP5pDNTCPaSgitfFXIdReBZ2CaYzK3MPg/
Wit/TXrud9yT6VPpI1igboMyasf5QubV1MY1K83kOCWr9u8R2CM=
=l79u
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Highlights:
- AMD adds some more upcoming HW platforms
- Intel made Meteorlake stable and started adding Lunarlake
- nouveau has a bunch of display rework in prepartion for the NVIDIA
GSP firmware support
- msm adds a7xx support
- habanalabs has finished migration to accel subsystem
Detail summary:
kernel:
- add initial vmemdup-user-array
core:
- fix platform remove() to return void
- drm_file owner updated to reflect owner
- move size calcs to drm buddy allocator
- let GPUVM build as a module
- allow variable number of run-queues in scheduler
edid:
- handle bad h/v sync_end in EDIDs
panfrost:
- add Boris as maintainer
fbdev:
- use fb_ops helpers more
- only allow logo use from fbcon
- rename fb_pgproto to pgprot_framebuffer
- add HPD state to drm_connector_oob_hotplug_event
- convert to fbdev i/o mem helpers
i915:
- Enable meteorlake by default
- Early Xe2 LPD/Lunarlake display enablement
- Rework subplatforms into IP version checks
- GuC based TLB invalidation for Meteorlake
- Display rework for future Xe driver integration
- LNL FBC features
- LNL display feature capability reads
- update recommended fw versions for DG2+
- drop fastboot module parameter
- added deviceid for Arrowlake-S
- drop preproduction workarounds
- don't disable preemption for resets
- cleanup inlines in headers
- PXP firmware loading fix
- Fix sg list lengths
- DSC PPS state readout/verification
- Add more RPL P/U PCI IDs
- Add new DG2-G12 stepping
- DP enhanced framing support to state checker
- Improve shared link bandwidth management
- stop using GEM macros in display code
- refactor related code into display code
- locally enable W=1 warnings
- remove PSR watchdog timers on LNL
amdgpu:
- RAS/FRU EEPROM updatse
- IP discovery updatses
- GC 11.5 support
- DCN 3.5 support
- VPE 6.1 support
- NBIO 7.11 support
- DML2 support
- lots of IP updates
- use flexible arrays for bo list handling
- W=1 fixes
- Enable seamless boot in more cases
- Enable context type property for HDMI
- Rework GPUVM TLB flushing
- VCN IB start/size alignment fixes
amdkfd:
- GC 10/11 fixes
- GC 11.5 support
- use partial migration in GPU faults
radeon:
- W=1 Fixes
- fix some possible buffer overflow/NULL derefs
nouveau:
- update uapi for NO_PREFETCH
- scheduler/fence fixes
- rework suspend/resume for GSP-RM
- rework display in preparation for GSP-RM
habanalabs:
- uapi: expose tsc clock
- uapi: block access to eventfd through control device
- uapi: force dma-buf export to PAGE_SIZE alignments
- complete move to accel subsystem
- move firmware interface include files
- perform hard reset on PCIe AXI drain event
- optimise user interrupt handling
msm:
- DP: use existing helpers for DPCD
- DPU: interrupts reworked
- gpu: a7xx (a730/a740) support
- decouple msm_drv from kms for headless devices
mediatek:
- MT8188 dsi/dp/edp support
- DDP GAMMA - 12 bit LUT support
- connector dynamic selection capability
rockchip:
- rv1126 mipi-dsi/vop support
- add planar formats
ast:
- rename constants
panels:
- Mitsubishi AA084XE01
- JDI LPM102A188A
- LTK050H3148W-CTA6
ivpu:
- power management fixes
qaic:
- add detach slice bo api
komeda:
- add NV12 writeback
tegra:
- support NVSYNC/NHSYNC
- host1x suspend fixes
ili9882t:
- separate into own driver"
* tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm: (1803 commits)
drm/amdgpu: Remove unused variables from amdgpu_show_fdinfo
drm/amdgpu: Remove duplicate fdinfo fields
drm/amd/amdgpu: avoid to disable gfxhub interrupt when driver is unloaded
drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems
drm/amdgpu: Retrieve CE count from ce_count_lo_chip in EccInfo table
drm/amdgpu: Identify data parity error corrected in replay mode
drm/amdgpu: Fix typo in IP discovery parsing
drm/amd/display: fix S/G display enablement
drm/amdxcp: fix amdxcp unloads incompletely
drm/amd/amdgpu: fix the GPU power print error in pm info
drm/amdgpu: Use pcie domain of xcc acpi objects
drm/amd: check num of link levels when update pcie param
drm/amdgpu: Add a read to GFX v9.4.3 ring test
drm/amd/pm: call smu_cmn_get_smc_version in is_mode1_reset_supported.
drm/amdgpu: get RAS poison status from DF v4_6_2
drm/amdgpu: Use discovery table's subrevision
drm/amd/display: 3.2.256
drm/amd/display: add interface to query SubVP status
drm/amd/display: Read before writing Backlight Mode Set Register
drm/amd/display: Disable SYMCLK32_SE RCO on DCN314
...
Highlights:
- asus-wmi: Support for screenpad and solve brightness key
press duplication
- int3472: Eliminate the last use of deprecated GPIO functions
- mlxbf-pmc: New HW support
- msi-ec: Support new EC configurations
- thinkpad_acpi: Support reading aux MAC address during passthrough
- wmi: Fixes & improvements
- x86-android-tablets: Detection fix and avoid use of GPIO private APIs
- Debug & metrics interface improvements
- Miscellaneous cleanups / fixes / improvements
The following is an automated shortlog grouped by driver:
acer-wmi:
- Remove void function return
amd/hsmp:
- add support for metrics tbl
- create plat specific struct
- Fix iomem handling
- improve the error log
amd/pmc:
- Add dump_custom_stb module parameter
- Add PMFW command id to support S2D force flush
- Handle overflow cases where the num_samples range is higher
- Use flex array when calling amd_pmc_stb_debugfs_open_v2()
asus-wireless:
- Replace open coded acpi_match_acpi_device()
asus-wmi:
- add support for ASUS screenpad
- Do not report brightness up/down keys when also reported by acpi_video
gpiolib: acpi:
- Add a ignore interrupt quirk for Peaq C1010
- Check if a GPIO is listed in ignore_interrupt earlier
hp-bioscfg:
- Annotate struct bios_args with __counted_by
inspur-platform-profile:
- Add platform profile support
int3472:
- Add new skl_int3472_fill_gpiod_lookup() helper
- Add new skl_int3472_gpiod_get_from_temp_lookup() helper
- Stop using gpiod_toggle_active_low()
- Switch to devm_get_gpiod()
intel: bytcrc_pwrsrc:
- Convert to platform remove callback returning void
intel/ifs:
- Add new CPU support
- Add new error code
- ARRAY BIST for Sierra Forest
- Gen2 scan image loading
- Gen2 Scan test support
- Metadata validation for start_chunk
- Refactor image loading code
- Store IFS generation number
- Validate image size
intel_speed_select_if:
- Remove hardcoded map size
- Use devm_ioremap_resource
intel/tpmi:
- Add debugfs support for read/write blocked
- Add defines to get version information
intel-uncore-freq:
- Ignore minor version change
ISST:
- Allow level 0 to be not present
- Ignore minor version change
- Use fuse enabled mask instead of allowed levels
mellanox:
- Fix misspelling error in routine name
- Rename some init()/exit() functions for consistent naming
mlxbf-bootctl:
- Convert to platform remove callback returning void
mlxbf-pmc:
- Add support for BlueField-3
mlxbf-tmfifo:
- Convert to platform remove callback returning void
mlx-Convert to platform remove callback returning void:
- mlx-Convert to platform remove callback returning void
mlxreg-hotplug:
- Convert to platform remove callback returning void
mlxreg-io:
- Convert to platform remove callback returning void
mlxreg-lc:
- Convert to platform remove callback returning void
msi-ec:
- Add more EC configs
- rename fn_super_swap
nvsw-sn2201:
- Convert to platform remove callback returning void
sel3350-Convert to platform remove callback returning void:
- sel3350-Convert to platform remove callback returning void
siemens: simatic-ipc-batt-apollolake:
- Convert to platform remove callback returning void
siemens: simatic-ipc-batt:
- Convert to platform remove callback returning void
siemens: simatic-ipc-batt-elkhartlake:
- Convert to platform remove callback returning void
siemens: simatic-ipc-batt-f7188x:
- Convert to platform remove callback returning void
siemens: simatic-ipc-batt:
- Simplify simatic_ipc_batt_remove()
surface: acpi-notify:
- Convert to platform remove callback returning void
surface: aggregator:
- Annotate struct ssam_event with __counted_by
surface: aggregator-cdev:
- Convert to platform remove callback returning void
surface: aggregator-registry:
- Convert to platform remove callback returning void
surface: dtx:
- Convert to platform remove callback returning void
surface: gpe:
- Convert to platform remove callback returning void
surface: hotplug:
- Convert to platform remove callback returning void
surface: surface3-wmi:
- Convert to platform remove callback returning void
think-lmi:
- Add bulk save feature
- Replace kstrdup() + strreplace() with kstrdup_and_replace()
- Use strreplace() to replace a character by nul
thinkpad_acpi:
- Add battery quirk for Thinkpad X120e
- replace deprecated strncpy with memcpy
- sysfs interface to auxmac
tools/power/x86/intel-speed-select:
- Display error for core-power support
- Increase max CPUs in one request
- No TRL for non compute domains
- Sanitize integer arguments
- turbo-mode enable disable swapped
- Update help for TRL
- Use cgroup isolate for CPU 0
- v1.18 release
wmi:
- Decouple probe deferring from wmi_block_list
- Decouple WMI device removal from wmi_block_list
- Fix opening of char device
- Fix probe failure when failing to register WMI devices
- Fix refcounting of WMI devices in legacy functions
x86-android-tablets:
- Add a comment about x86_android_tablet_get_gpiod()
- Create a platform_device from module_init()
- Drop "linux,power-supply-name" from lenovo_yt3_bq25892_0_props[]
- Fix Lenovo Yoga Tablet 2 830F/L vs 1050F/L detection
- Remove invalid_aei_gpiochip from Peaq C1010
- Remove invalid_aei_gpiochip support
- Stop using gpiolib private APIs
- Use platform-device as gpio-keys parent
xo15-ebook:
- Replace open coded acpi_match_acpi_device()
Merges:
- Merge branch 'pdx86/platform-drivers-x86-int3472' into review-ilpo
- Merge branch 'pdx86/platform-drivers-x86-mellanox-init' into review-ilpo
- Merge remote-tracking branch 'intel-speed-select/intel-sst' into review-ilpo
- Merge remote-tracking branch 'pdx86/platform-drivers-x86-android-tablets' into review-hans
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSCSUwRdwTNL2MhaBlZrE9hU+XOMQUCZT+lBwAKCRBZrE9hU+XO
Mck0AQCFU7dYLCF4d1CXtHf1eZhSXLpYdhcO+C08JGGoM+MqSgD+Jyb9KJHk4pxE
FvKG51I9neyAne9lvNrLodHRzxCYgAo=
=duM8
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver updates from Ilpo Järvinen:
- asus-wmi: Support for screenpad and solve brightness key press
duplication
- int3472: Eliminate the last use of deprecated GPIO functions
- mlxbf-pmc: New HW support
- msi-ec: Support new EC configurations
- thinkpad_acpi: Support reading aux MAC address during passthrough
- wmi: Fixes & improvements
- x86-android-tablets: Detection fix and avoid use of GPIO private APIs
- Debug & metrics interface improvements
- Miscellaneous cleanups / fixes / improvements
* tag 'platform-drivers-x86-v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (80 commits)
platform/x86: inspur-platform-profile: Add platform profile support
platform/x86: thinkpad_acpi: Add battery quirk for Thinkpad X120e
platform/x86: wmi: Decouple WMI device removal from wmi_block_list
platform/x86: wmi: Fix opening of char device
platform/x86: wmi: Fix probe failure when failing to register WMI devices
platform/x86: wmi: Fix refcounting of WMI devices in legacy functions
platform/x86: wmi: Decouple probe deferring from wmi_block_list
platform/x86/amd/hsmp: Fix iomem handling
platform/x86: asus-wmi: Do not report brightness up/down keys when also reported by acpi_video
platform/x86: thinkpad_acpi: replace deprecated strncpy with memcpy
tools/power/x86/intel-speed-select: v1.18 release
tools/power/x86/intel-speed-select: Use cgroup isolate for CPU 0
tools/power/x86/intel-speed-select: Increase max CPUs in one request
tools/power/x86/intel-speed-select: Display error for core-power support
tools/power/x86/intel-speed-select: No TRL for non compute domains
tools/power/x86/intel-speed-select: turbo-mode enable disable swapped
tools/power/x86/intel-speed-select: Update help for TRL
tools/power/x86/intel-speed-select: Sanitize integer arguments
platform/x86: acer-wmi: Remove void function return
platform/x86/amd/pmc: Add dump_custom_stb module parameter
...
Core & protocols
----------------
- Support usec resolution of TCP timestamps, enabled selectively by
a route attribute.
- Defer regular TCP ACK while processing socket backlog, try to send
a cumulative ACK at the end. Increase single TCP flow performance
on a 200Gbit NIC by 20% (100Gbit -> 120Gbit).
- The Fair Queuing (FQ) packet scheduler:
- add built-in 3 band prio / WRR scheduling
- support bypass if the qdisc is mostly idle (5% speed up for TCP RR)
- improve inactive flow reporting
- optimize the layout of structures for better cache locality
- Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern
replacement for the old MD5 option.
- Add more retransmission timeout (RTO) related statistics to TCP_INFO.
- Support sending fragmented skbs over vsock sockets.
- Make sure we send SIGPIPE for vsock sockets if socket was shutdown().
- Add sysctl for ignoring lower limit on lifetime in Router
Advertisement PIO, based on an in-progress IETF draft.
- Add sysctl to control activation of TCP ping-pong mode.
- Add sysctl to make connection timeout in MPTCP configurable.
- Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps
limit the number of wakeups.
- Support netlink GET for MDB (multicast forwarding), allowing user
space to request a single MDB entry instead of dumping the entire
table.
- Support selective FDB flushing in the VXLAN tunnel driver.
- Allow limiting learned FDB entries in bridges, prevent OOM attacks.
- Allow controlling via configfs netconsole targets which were created
via the kernel cmdline at boot, rather than via configfs at runtime.
- Support multiple PTP timestamp event queue readers with different
filters.
- MCTP over I3C.
BPF
---
- Add new veth-like netdevice where BPF program defines the logic
of the xmit routine. It can operate in L3 and L2 mode.
- Support exceptions - allow asserting conditions which should
never be true but are hard for the verifier to infer.
With some extra flexibility around handling of the exit / failure.
https://lwn.net/Articles/938435/
- Add support for local per-cpu kptr, allow allocating and storing
per-cpu objects in maps. Access to those objects operates on
the value for the current CPU. This allows to deprecate local
one-off implementations of per-CPU storage like
BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps.
- Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is
for systemd to re-implement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services.
- Enable open-coded task_vma iteration, after maple tree conversion
made it hard to directly walk VMAs in tracing programs.
- Add open-coded task, css_task and css iterator support.
One of the use cases is customizable OOM victim selection via BPF.
- Allow source address selection with bpf_*_fib_lookup().
- Add ability to pin BPF timer to the current CPU.
- Prevent creation of infinite loops by combining tail calls and
fentry/fexit programs.
- Add missed stats for kprobes to retrieve the number of missed kprobe
executions and subsequent executions of BPF programs.
- Inherit system settings for CPU security mitigations.
- Add BPF v4 CPU instruction support for arm32 and s390x.
Changes to common code
----------------------
- overflow: add DEFINE_FLEX() for on-stack definition of structs
with flexible array members.
- Process doc update with more guidance for reviewers.
Driver API
----------
- Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy
mutex in most places and remove a lot of smaller locks.
- Create a common DPLL configuration API. Allow configuring
and querying state of PLL circuits used for clock syntonization,
in network time distribution.
- Unify fragmented and full page allocation APIs in page pool code.
Let drivers be ignorant of PAGE_SIZE.
- Rework PHY state machine to avoid races with calls to phy_stop().
- Notify DSA drivers of MAC address changes on user ports, improve
correctness of offloads which depend on matching port MAC addresses.
- Allow antenna control on injected WiFi frames.
- Reduce the number of variants of napi_schedule().
- Simplify error handling when composing devlink health messages.
Misc
----
- A lot of KCSAN data race "fixes", from Eric.
- A lot of __counted_by() annotations, from Kees.
- A lot of strncpy -> strscpy and printf format fixes.
- Replace master/slave terminology with conduit/user in DSA drivers.
- Handful of KUnit tests for netdev and WiFi core.
Removed
-------
- AppleTalk COPS.
- AppleTalk ipddp.
- TI AR7 CPMAC Ethernet driver.
Drivers
-------
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- add a driver for the Intel E2000 IPUs
- make CRC/FCS stripping configurable
- cross-timestamping for E823 devices
- basic support for E830 devices
- use aux-bus for managing client drivers
- i40e: report firmware versions via devlink
- nVidia/Mellanox:
- support 4-port NICs
- increase max number of channels to 256
- optimize / parallelize SF creation flow
- Broadcom (bnxt):
- enhance NIC temperature reporting
- support PAM4 speeds and lane configuration
- Marvell OcteonTX2:
- PTP pulse-per-second output support
- enable hardware timestamping for VFs
- Solarflare/AMD:
- conntrack NAT offload and offload for tunnels
- Wangxun (ngbe/txgbe):
- expose HW statistics
- Pensando/AMD:
- support PCI level reset
- narrow down the condition under which skbs are linearized
- Netronome/Corigine (nfp):
- support CHACHA20-POLY1305 crypto in IPsec offload
- Ethernet NICs embedded, slower, virtual:
- Synopsys (stmmac):
- add Loongson-1 SoC support
- enable use of HW queues with no offload capabilities
- enable PPS input support on all 5 channels
- increase TX coalesce timer to 5ms
- RealTek USB (r8152): improve efficiency of Rx by using GRO frags
- xen: support SW packet timestamping
- add drivers for implementations based on TI's PRUSS (AM64x EVM)
- nVidia/Mellanox Ethernet datacenter switches:
- avoid poor HW resource use on Spectrum-4 by better block selection
for IPv6 multicast forwarding and ordering of blocks in ACL region
- Ethernet embedded switches:
- Microchip:
- support configuring the drive strength for EMI compliance
- ksz9477: partial ACL support
- ksz9477: HSR offload
- ksz9477: Wake on LAN
- Realtek:
- rtl8366rb: respect device tree config of the CPU port
- Ethernet PHYs:
- support Broadcom BCM5221 PHYs
- TI dp83867: support hardware LED blinking
- CAN:
- add support for Linux-PHY based CAN transceivers
- at91_can: clean up and use rx-offload helpers
- WiFi:
- MediaTek (mt76):
- new sub-driver for mt7925 USB/PCIe devices
- HW wireless <> Ethernet bridging in MT7988 chips
- mt7603/mt7628 stability improvements
- Qualcomm (ath12k):
- WCN7850:
- enable 320 MHz channels in 6 GHz band
- hardware rfkill support
- enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS
to make scan faster
- read board data variant name from SMBIOS
- QCN9274: mesh support
- RealTek (rtw89):
- TDMA-based multi-channel concurrency (MCC)
- Silicon Labs (wfx):
- Remain-On-Channel (ROC) support
- Bluetooth:
- ISO: many improvements for broadcast support
- mark BCM4378/BCM4387 as BROKEN_LE_CODED
- add support for QCA2066
- btmtksdio: enable Bluetooth wakeup from suspend
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmU8XsYACgkQMUZtbf5S
Irv19RAAnud/24OOF5XMEJkIcYlnfqximh4XO6PujRSYkSkOUJdZTF6iJPgf3pSP
YpwoHYbYKHYfeOf8+3bTNESiQNSnoVmvmvwiS6/7lZ3behHUrGLQzW9Htc3EZyWH
2h6QkDZ5OOjfg0bwYSfp3vXkmMH2k8WE9Y0NvCkhcohqZi13Rmp14RnyPmNb2d1V
yZRYDMSM133KqE6gnBr1Ct65IEvnKeGlCUN2mTGqOJgdn6DZMsyxvtt0y4rmN7Ab
41+CgPU5SfxfbYpW+Dl2HJpgfte3WrC57KC6AM0PAPJzPmQWgeB/m9mjz/apj6Bg
bhsEIo7FdvbCnQm3yWPhK2OgCAcSwLr8jfGMU+Q+W4VnL5SRRR3Rm0zjsze+kHNP
OfqJgxzl3DpvoJqVBy1h5FGcZt0XHwhksm4cTxWqIahsF+veY0ECBXbuBBQx9XTF
Y7INfI8ulg7wISJs+CJfIClYkgOibTw2u8taBS5ikbtgxNqp5D4QqODn7UefQap1
PR/IDYODF+zRgmMJLeBqSa6fij6BkfOEDiOWak5kggBoZdtbtmeKI6tzze06CNdW
lWv1WEhRufxnwK+IuWsEkjhiMbs2WGLvkJ5JbgQV9BfqHfIfiqBCrcWtT/WbQnGt
lmU46CXh1t/FZEqbmK9h+8vsIIfrcDl6jb5npEiKPRG00vDKRTM=
=46nS
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Support usec resolution of TCP timestamps, enabled selectively by a
route attribute.
- Defer regular TCP ACK while processing socket backlog, try to send
a cumulative ACK at the end. Increase single TCP flow performance
on a 200Gbit NIC by 20% (100Gbit -> 120Gbit).
- The Fair Queuing (FQ) packet scheduler:
- add built-in 3 band prio / WRR scheduling
- support bypass if the qdisc is mostly idle (5% speed up for TCP RR)
- improve inactive flow reporting
- optimize the layout of structures for better cache locality
- Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern
replacement for the old MD5 option.
- Add more retransmission timeout (RTO) related statistics to
TCP_INFO.
- Support sending fragmented skbs over vsock sockets.
- Make sure we send SIGPIPE for vsock sockets if socket was
shutdown().
- Add sysctl for ignoring lower limit on lifetime in Router
Advertisement PIO, based on an in-progress IETF draft.
- Add sysctl to control activation of TCP ping-pong mode.
- Add sysctl to make connection timeout in MPTCP configurable.
- Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps
limit the number of wakeups.
- Support netlink GET for MDB (multicast forwarding), allowing user
space to request a single MDB entry instead of dumping the entire
table.
- Support selective FDB flushing in the VXLAN tunnel driver.
- Allow limiting learned FDB entries in bridges, prevent OOM attacks.
- Allow controlling via configfs netconsole targets which were
created via the kernel cmdline at boot, rather than via configfs at
runtime.
- Support multiple PTP timestamp event queue readers with different
filters.
- MCTP over I3C.
BPF:
- Add new veth-like netdevice where BPF program defines the logic of
the xmit routine. It can operate in L3 and L2 mode.
- Support exceptions - allow asserting conditions which should never
be true but are hard for the verifier to infer. With some extra
flexibility around handling of the exit / failure:
https://lwn.net/Articles/938435/
- Add support for local per-cpu kptr, allow allocating and storing
per-cpu objects in maps. Access to those objects operates on the
value for the current CPU.
This allows to deprecate local one-off implementations of per-CPU
storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps.
- Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is
for systemd to re-implement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services.
- Enable open-coded task_vma iteration, after maple tree conversion
made it hard to directly walk VMAs in tracing programs.
- Add open-coded task, css_task and css iterator support. One of the
use cases is customizable OOM victim selection via BPF.
- Allow source address selection with bpf_*_fib_lookup().
- Add ability to pin BPF timer to the current CPU.
- Prevent creation of infinite loops by combining tail calls and
fentry/fexit programs.
- Add missed stats for kprobes to retrieve the number of missed
kprobe executions and subsequent executions of BPF programs.
- Inherit system settings for CPU security mitigations.
- Add BPF v4 CPU instruction support for arm32 and s390x.
Changes to common code:
- overflow: add DEFINE_FLEX() for on-stack definition of structs with
flexible array members.
- Process doc update with more guidance for reviewers.
Driver API:
- Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy
mutex in most places and remove a lot of smaller locks.
- Create a common DPLL configuration API. Allow configuring and
querying state of PLL circuits used for clock syntonization, in
network time distribution.
- Unify fragmented and full page allocation APIs in page pool code.
Let drivers be ignorant of PAGE_SIZE.
- Rework PHY state machine to avoid races with calls to phy_stop().
- Notify DSA drivers of MAC address changes on user ports, improve
correctness of offloads which depend on matching port MAC
addresses.
- Allow antenna control on injected WiFi frames.
- Reduce the number of variants of napi_schedule().
- Simplify error handling when composing devlink health messages.
Misc:
- A lot of KCSAN data race "fixes", from Eric.
- A lot of __counted_by() annotations, from Kees.
- A lot of strncpy -> strscpy and printf format fixes.
- Replace master/slave terminology with conduit/user in DSA drivers.
- Handful of KUnit tests for netdev and WiFi core.
Removed:
- AppleTalk COPS.
- AppleTalk ipddp.
- TI AR7 CPMAC Ethernet driver.
Drivers:
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- add a driver for the Intel E2000 IPUs
- make CRC/FCS stripping configurable
- cross-timestamping for E823 devices
- basic support for E830 devices
- use aux-bus for managing client drivers
- i40e: report firmware versions via devlink
- nVidia/Mellanox:
- support 4-port NICs
- increase max number of channels to 256
- optimize / parallelize SF creation flow
- Broadcom (bnxt):
- enhance NIC temperature reporting
- support PAM4 speeds and lane configuration
- Marvell OcteonTX2:
- PTP pulse-per-second output support
- enable hardware timestamping for VFs
- Solarflare/AMD:
- conntrack NAT offload and offload for tunnels
- Wangxun (ngbe/txgbe):
- expose HW statistics
- Pensando/AMD:
- support PCI level reset
- narrow down the condition under which skbs are linearized
- Netronome/Corigine (nfp):
- support CHACHA20-POLY1305 crypto in IPsec offload
- Ethernet NICs embedded, slower, virtual:
- Synopsys (stmmac):
- add Loongson-1 SoC support
- enable use of HW queues with no offload capabilities
- enable PPS input support on all 5 channels
- increase TX coalesce timer to 5ms
- RealTek USB (r8152): improve efficiency of Rx by using GRO frags
- xen: support SW packet timestamping
- add drivers for implementations based on TI's PRUSS (AM64x EVM)
- nVidia/Mellanox Ethernet datacenter switches:
- avoid poor HW resource use on Spectrum-4 by better block
selection for IPv6 multicast forwarding and ordering of blocks
in ACL region
- Ethernet embedded switches:
- Microchip:
- support configuring the drive strength for EMI compliance
- ksz9477: partial ACL support
- ksz9477: HSR offload
- ksz9477: Wake on LAN
- Realtek:
- rtl8366rb: respect device tree config of the CPU port
- Ethernet PHYs:
- support Broadcom BCM5221 PHYs
- TI dp83867: support hardware LED blinking
- CAN:
- add support for Linux-PHY based CAN transceivers
- at91_can: clean up and use rx-offload helpers
- WiFi:
- MediaTek (mt76):
- new sub-driver for mt7925 USB/PCIe devices
- HW wireless <> Ethernet bridging in MT7988 chips
- mt7603/mt7628 stability improvements
- Qualcomm (ath12k):
- WCN7850:
- enable 320 MHz channels in 6 GHz band
- hardware rfkill support
- enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to
make scan faster
- read board data variant name from SMBIOS
- QCN9274: mesh support
- RealTek (rtw89):
- TDMA-based multi-channel concurrency (MCC)
- Silicon Labs (wfx):
- Remain-On-Channel (ROC) support
- Bluetooth:
- ISO: many improvements for broadcast support
- mark BCM4378/BCM4387 as BROKEN_LE_CODED
- add support for QCA2066
- btmtksdio: enable Bluetooth wakeup from suspend"
* tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits)
net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers
net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos()
net: mana: Use xdp_set_features_flag instead of direct assignment
vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size()
iavf: delete the iavf client interface
iavf: add a common function for undoing the interrupt scheme
iavf: use unregister_netdev
iavf: rely on netdev's own registered state
iavf: fix the waiting time for initial reset
iavf: in iavf_down, don't queue watchdog_task if comms failed
iavf: simplify mutex_trylock+sleep loops
iavf: fix comments about old bit locks
doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name
tools: ynl: introduce option to process unknown attributes or types
ipvlan: properly track tx_errors
netdevsim: Block until all devices are released
nfp: using napi_build_skb() to replace build_skb()
net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy"
net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN
net: dsa: microchip: Refactor switch shutdown routine for WoL preparation
...
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
running an SEV-ES guest.
- Clean up handling "failures" when KVM detects it can't emulate the "skip"
action for an instruction that has already been partially emulated. Drop a
hack in the SVM code that was fudging around the emulator code not giving
SVM enough information to do the right thing.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8GHYSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5hwkQAIR8l1gWz/caz29biBzmRnDS+aZOXcYM
8V8WBJqJgMKE9egibF4sADAlhInXzg19Xr7bQs6VfuvmdXrCn0UJ/nLorX+H85A2
pph6iNlWO6tyQAjvk/AieaeUyZOqpCFmKOgxfN2Fr/Lrn7u3AdjXC20qPeFJSLXr
YOTCQ704yvjjJp4yVA8JlclAQu38hanKiO5SZdlLzbuhUgWwQk4DVP2ZsYnhX+RO
F6exxORvMnYF/LJe/kR2/DMLf2JWvyUmjRrGWoeRoksOw5BlXMc5HyTPHSJ2jDac
lJaNtmZkTY1bDVWZk7N03ze5aFJa4DaqJdIFLtgujrFW8thog0P48aH6vmKi4UAA
bXme9GFYbmJTkemaGRnrzidFV12uPNvvanS+1PDOw4sn4HpscoMSpZw5PeH2kBwV
6uKNCJCwLtk8oe50yroKD7rJ/ASB7CeoqzbIL9s2TA0HSAskIf65T4eZp01uniyd
Q98yCdrG2mudsg5aU5yMfe0LwZby5BB5kUCqIe4hyRC68GJR8wkAzhaFRgCn4aJE
yaTyjnT2V3PGMEEJOPFdSF3VQGztljzQiXlEvBVj3zvMGQNTo2NhmS3ka4W+wW5G
avRYv8dITlGRs6J2gV1vp8Eb5LzDrwRpRURSmzeP5rR58saKdljTZgNfOzfLeFr1
WhLzonLz52IS
=U0fq
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-svm-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM SVM changes for 6.7:
- Report KVM_EXIT_SHUTDOWN instead of EINVAL if KVM intercepts SHUTDOWN while
running an SEV-ES guest.
- Clean up handling "failures" when KVM detects it can't emulate the "skip"
action for an instruction that has already been partially emulated. Drop a
hack in the SVM code that was fudging around the emulator code not giving
SVM enough information to do the right thing.
- Handle NMI/SMI requests after PMU/PMI requests so that a PMI=>NMI doesn't
require redoing the entire run loop due to the NMI not being detected until
the final kvm_vcpu_exit_request() check before entering the guest.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8G/sSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5/FQP/1B0tk5TMe/Xfe/q4ng+J2eMr10TpbH5
uWRpxN6seRmH7cqfZwsNH86FubNRf3h9U/jOK3C9Q9dIhrq9MB1dZePDjF/xmZcz
4lhM76fHTeRNxJ1o+j2ApiK9U2dDAbBTLA8iGi+OTs/sAuvbNUELY7d3Ht2TqJjb
e9tGT+SavbTsg0UHEmteFHepMCe577AchL2T6jPbUaaVB05N7uD/qvIGDOLQvyaC
KHWqY5f+eFN+3JdGEefCiS4XCAWXBPSs7Ybq5SduxS7rnB7m96Vkidwk1DLjnyUt
+KNtb8JXBsMMuyaYZHrl4mPZyvOfmZxXOz9CzCYXzcQlsnkJqIyy3CiZFVEAqdq2
kXtOhNEqByAKVCWvcoJvfO/VGd/w3KP5XYP3GHXJ8gsS3sDORnL5PYWIvPNfjdlu
x7nsnk7PbaGdspSPqfKblwUvET1fePs1yjKECUMl4iJ6Wfr+QfKEpPUXQ6f79r+h
DrhPE9DIWyMMbre0p8E7uTFsteVerUx/GVDh7jtn6LCUKwWmKAZ43sKR2d35GAvG
x7ZKCcKl5U9vmC8c6q/eAZUE7CeNy1QBGXhYX6oP28NGxl5AzZ/Q8aYMsv0uqhyF
cwYbVKA5Wl5fovMrnjs8wwkKqa9cHdzy7JmhyhBV5k5ggfSUeD7mG0UM5eRxr6ZM
TOa/97QeXa7v
=mjsy
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-pmu-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM PMU change for 6.7:
- Handle NMI/SMI requests after PMU/PMI requests so that a PMI=>NMI doesn't
require redoing the entire run loop due to the NMI not being detected until
the final kvm_vcpu_exit_request() check before entering the guest.
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
- Use the fast path directly from the timer callback when delivering Xen timer
events. Avoid the problematic races with using the fast path by ensuring
the hrtimer isn't running when (re)starting the timer or saving the timer
information (for userspace).
- Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8He8SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5KyQP+wUH3n6hhJGScsSCpWXK6r8q+Y2ZBftY
ecXuoTfeBJmsoTbnExF7K600DtbxHY5jjxt3ROmoUCertCFRCoq6pi5v4rbRDDQ1
fmGkht43A6zAuHQ0Ntvkq4rNEmISAbzLP4EXOxZJ/Hxld91T8IutMFo7NN/YfOSx
nb+qgb7B25T7ODGvzahRjxnoevCHBN/TdKeDrvsoWeMpVw+CDYqquQOcLfHMaBAN
DqGwZzpdVqRQqg3TOuBGCiv5IcvskjkFUh0y6cEYkCR/MruLoT6CygoLImEV2naW
RU0ZU9Y4cjf+BV/faQEdP6mDQwwCUHWLxDpXUVn03KQYQHlA7q6UgRKxy35ixZ5w
Euxvg4m2ZGgJjsVLqTTMUlbLSNxD6wWZAVxGH7w8XghKrNmoj1IoajPZS+1rwyO2
5rUynMKf3HMT6oeqqZH95aChlUMiAvaPYPc+ogku8Bt1zJQVv/xnk/6T95Vw6C/t
KfYsV80rmJd/EL/fUXYX3mCMcZGHyv80QlOEc0uR4f25HGszCG8qHiSaUtnvQUjQ
xaguSuO1Cf7sdhHPWj4p/US+Jerrgd8nzoQGvKUOkdLsQzU71xwjvTZNlmmBYKKO
zgGIXZfaXa4JibAqnRrC+V8UdDPOwKvOEzmH0joLEzkTISnIG2LycvZ6tG7sTcMU
0sIg2dvhJx/G
=Z2eM
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-xen-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 Xen changes for 6.7:
- Omit "struct kvm_vcpu_xen" entirely when CONFIG_KVM_XEN=n.
- Use the fast path directly from the timer callback when delivering Xen timer
events. Avoid the problematic races with using the fast path by ensuring
the hrtimer isn't running when (re)starting the timer or saving the timer
information (for userspace).
- Follow the lead of upstream Xen and ignore the VCPU_SSHOTTMR_future flag.
- Clean up code that deals with honoring guest MTRRs when the VM has
non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.
- Zap EPT entries when non-coherent DMA assignment stops/start to prevent
using stale entries with the wrong memtype.
- Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y, as
there's zero reason to ignore guest PAT if the effective MTRR memtype is WB.
This will also allow for future optimizations of handling guest MTRR updates
for VMs with non-coherent DMA and the quirk enabled.
- Harden the fast page fault path to guard against encountering an invalid
root when walking SPTEs.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8FG8SHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5tYMP+gJd3raPnpmai4NyaFaZNP6/5YsXuUMj
XBvHH7hBGHmjd1sV+O62fhUvNk4+M/1f1rERutP4s7yXEXxQfC9G/MQFgLBfyiW8
xR+RQkNrz8HsG8mHFBZ0Ei6OofhP+BRTYDRU7kbctKDh/4Hp5AOZAxYHs/ZhOho1
Lw6upZbQLCkdt72eEKbfocg6Tf400hWEyarBRXFe4KJzWq7KMjAPgqA/3Vx0lF6u
zX73Zr6tV0mcf3QXd58Q4CUwOuwMo1aTangmOhEeC09JplF2okLV36h6WrCF8qqO
gvmDrMA450Yc215peOJGBJzoZJrNjMIHZ2m+4Ifag6Z/jJoam4vjzUZmmrzx+Gbj
Ot5lmXCVRXCdHmUNdYQ6yR27WaVP3C3ItkxwNZGMPoh2G08NGyLLY1kwzRyITEH4
M9jYTRBZaeue57ad5Ms9FaneBLWwPxajTX90rWZbl2kzfd8PG5cF1VroESBLoa0f
I2kDcd7988xLTOMl1sfO8ci21Ve7rQc0hA6WlOXrDxb26OvYrftYXeXOCowN6kqP
czXIu5ZPmLI1btimZQXGMdxKkw5wwe3wDC3y5gKrm+rTfORUXoOUDoITIpmPCnAp
Dzfr5la3RI1GjHhzR80x4vXQC9BgJ9WrEwJub/RqVfE3T3ohw+NZl+AeM1xB9eT1
2mJWm6GFEm9Y
=Zfbr
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-mmu-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MMU changes for 6.7:
- Clean up code that deals with honoring guest MTRRs when the VM has
non-coherent DMA and host MTRRs are ignored, i.e. EPT is enabled.
- Zap EPT entries when non-coherent DMA assignment stops/start to prevent
using stale entries with the wrong memtype.
- Don't ignore guest PAT for CR0.CD=1 && KVM_X86_QUIRK_CD_NW_CLEARED=y, as
there's zero reason to ignore guest PAT if the effective MTRR memtype is WB.
This will also allow for future optimizations of handling guest MTRR updates
for VMs with non-coherent DMA and the quirk enabled.
- Harden the fast page fault path to guard against encountering an invalid
root when walking SPTEs.
- Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
forcing more common use cases to eat the extra memory overhead.
- Add IBPB and SBPB virtualization support.
- Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
creating the original vCPU would cause KVM to try to synchronize the vCPU's
TSC and thus clobber the correct TSC being set by userspace.
- Compute guest wall clock using a single TSC read to avoid generating an
inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
- "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
- Don't apply side effects to Hyper-V's synthetic timer on writes from
userspace to fix an issue where the auto-enable behavior can trigger
spurious interrupts, i.e. do auto-enabling only for guest writes.
- Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
without PML enabled.
- Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
- Use octal notation for file permissions through KVM x86.
- Fix a handful of typo fixes and warts.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8EugSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5xS0P+gPTDO81CUZO70LrO2W4E7toRBf/F9x1
/v5D/76p9hG32Z6+BJs/xxDxJFagw75MtoR5oKivtXiip3TxbfOyDOlaQkIRo85E
/d95il/LRidL3Mv3TXRj1lykXnxSSz9tigAGEZti1Y9Fn9fXEIwurJH7dU5cBI1E
fin5bsDaTNRjG4jjTiEUbnKPRTlD/S7CQJn4CaYvZhMv/eJkYDLyBBVy4VLoLzvD
ctL6VJQLGPVxbxr9mEmulaqMrSuDIQQLkRVQJAViKyerBInTEc5d/GPCHuE8O3zi
0r/QSJbMS9titWLz07NhJ1UH4VJNyaEhRlyJPSFhBW4h6dzUb3EXdUe0Hwa+JH/S
H2cVqsANItTCIhvDtuEGIRDahu0eD+63h90InJ0gEVL1kSJS+UWZHB71PkUEQgAV
2OsuT1D26fuxrv+0b9ioBZURycqKw++zGsrwyVhe77eBgqBJ12tbL4TAD+QNjaQ5
HZTCe6YV83gZoOMeVkoTGSf96s9lGORgxsaAIXmFuLB9RVCVXhVh0ph2HZsnV8Hw
ZXEXpBEFo7GUhb0NIvsk2W73QL87A3fLv15yITWc8KuC7/dXP9z6KpSKjFySS69X
uWD1MVx6shhvbg97UzoJlXc3/z0aVzmdZJudE5d0gcFvAjIItqp6ICPOoKxfj8pT
tqRZu3kVHd61
=sfp8
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-misc-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.7:
- Add CONFIG_KVM_MAX_NR_VCPUS to allow supporting up to 4096 vCPUs without
forcing more common use cases to eat the extra memory overhead.
- Add IBPB and SBPB virtualization support.
- Fix a bug where restoring a vCPU snapshot that was taken within 1 second of
creating the original vCPU would cause KVM to try to synchronize the vCPU's
TSC and thus clobber the correct TSC being set by userspace.
- Compute guest wall clock using a single TSC read to avoid generating an
inaccurate time, e.g. if the vCPU is preempted between multiple TSC reads.
- "Virtualize" HWCR.TscFreqSel to make Linux guests happy, which complain
about a "Firmware Bug" if the bit isn't set for select F/M/S combos.
- Don't apply side effects to Hyper-V's synthetic timer on writes from
userspace to fix an issue where the auto-enable behavior can trigger
spurious interrupts, i.e. do auto-enabling only for guest writes.
- Remove an unnecessary kick of all vCPUs when synchronizing the dirty log
without PML enabled.
- Advertise "support" for non-serializing FS/GS base MSR writes as appropriate.
- Use octal notation for file permissions through KVM x86.
- Fix a handful of typo fixes and warts.
- Purge VMX's posted interrupt descriptor *before* loading APIC state when
handling KVM_SET_LAPIC. Purging the PID after loading APIC state results in
lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC
state, i.e. can immediately pend an IRQ if the expiry is in the past.
- Clear the ICR.BUSY bit when handling trap-like x2APIC writes to suppress a
WARN due to KVM expecting the BUSY bit to be cleared when sending IPIs.
-----BEGIN PGP SIGNATURE-----
iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmU8DrkSHHNlYW5qY0Bn
b29nbGUuY29tAAoJEGCRIgFNDBL5Z54P/jQbjmwbvAVDa1g2HGr16ouFb3v+rCgV
eXiN2/8tDlQl9JkvZsjOY1AtJauS3RZu/SncT3IxCCvHVrKKt0xgnhWT//rAMX3N
zrrFxXJmilFdd2ijsuK8flHVFO5i06Kx1NRjSmAMb76et1et1EI0I62F1rom8ocn
l0tryTkV0qhWKlgxzLBNzzhzxxA2Mvb6vFhut+nfa0IisnQ+roh5dV4JS7Cdhy+5
qIQ4J86JdDR6kmraZLCv8r7vIQn5EQPVCAC2WWalds7iWUW3ixy01NLweVLaVtA3
xWRHYK7azvu3SJ9HvmSMRZzzTElQBYnE2d37ycyYewHbDUJ6FCXxh7H+BehYaVBW
CK6GLaMHY8vyXLcB2cRAzZ7GgBSSDevL11/CMc2ueGc51AOsDCxFh6v69TU+/PiQ
dMOYiNTuwV9QC3raIXPCc2Q9rqGp20xBqXIv5XLEyIxqcFQESvtRggg6zJdBZ3zB
T5/xAXDf++rKFiu1wskFAbwsk3M4T36u1WlK9iayC9dPvCDGRzsle3tt9npmoj2X
3e0Uz7fkRLki1YZW1qLmvqoYJ9bX9B9BjuKkORIqSqXuwcgPYEzHt1GcRByI2FSk
KSTnNwgs/Bs498LsDdU1Z740ZkYFGmu2nK+CUvwxP+uejPsnlhik5kGqLSIeA7zA
9kDpcdV+tZKJ
=atZ2
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-apic-6.7' of https://github.com/kvm-x86/linux into HEAD
KVM x86 APIC changes for 6.7:
- Purge VMX's posted interrupt descriptor *before* loading APIC state when
handling KVM_SET_LAPIC. Purging the PID after loading APIC state results in
lost APIC timer IRQs as the APIC timer can be armed as part of loading APIC
state, i.e. can immediately pend an IRQ if the expiry is in the past.
- Clear the ICR.BUSY bit when handling trap-like x2APIC writes. This avoids a
WARN, due to KVM expecting the BUSY bit to be cleared when sending IPIs.
A small one compared to the previous one in terms of features. In terms
of lines, as usual, the 'alloc' version upgrade accounts for most of them.
Toolchain and infrastructure:
- Upgrade to Rust 1.73.0.
This time around, due to how the kernel and Rust schedules have
aligned, there are two upgrades in fact. They contain the fixes for
a few issues we reported to the Rust project.
In addition, a few cleanups indicated by the upgraded compiler
or possible thanks to it. For instance, the compiler now detects
redundant explicit links.
- A couple changes to the Rust 'Makefile' so that it can be used with
toybox tools, allowing Rust to be used in the Android kernel build.
x86:
- Enable IBT if enabled in C.
Documentation:
- Add "The Rust experiment" section to the Rust index page.
MAINTAINERS
- Add Maintainer Entry Profile field ('P:').
- Update our 'W:' field to point to the webpage we have been building
this year.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPjU5OPd5QIZ9jqqOGXyLc2htIW0FAmU79aIACgkQGXyLc2ht
IW03Rw//ZcbRDxVEbD9UE+aM6wtKQ1eY8QIB7gNivctv1R8spKlnsLpF/VdBQV4v
yWmMPG/Vnp7Xbkcg2kZrbo2J82NgD6ACJWxWHVb8K2N/hoIVwrQXiQUmtg8bAa5B
aDso+8WcWZOF6uzu5ww7kv8AbAOO3ReCxnIVPexeWQtZVAGeBd4BiVecoTL0mCbA
7MMNyyKxjnRSo72Sj4iFoVPjb/IkOHYRaPQA/QOvG3bwin5nTvxM+94v+UZ4D7fp
THWuZjJiC19L/C2/GGweK6mnpV2lExdZl4RC3JInu8s3R6jwGuRxUNE4vCnO9DlY
QBkeUV3qwMCG/LnAb+iyClDM5aEU2wWBFl1NwNy0yEQM/gBqk6+4HQxxB177Wte3
V65f4Sz19baci1SNCk+rFe/1EK8/UoD2jo42DXnuIUnGq5VJtVRNxbn/2tR0kNZn
9DGwR1U8shMytNen5xFLi03q+p1Zuez/YKzmTmahv8zn2JysUgj3ctFK2M3mCiFu
+HWvPFBbUbOH+/g3txBXtnvY757Nei5siSKONVZTOez6VYR//jlvLNNBbUp3vPFE
w2TuR6HcFVR8z/kjcTxIgS+xzbsT8FkugJwRdLeQU4Ky03o2Z6uosM3nky9TzWjb
oVR7rFgb9KygvP67+uVQS5OETQVO0EwutlEsQzzs7cbnoSt0ckk=
=rKLl
-----END PGP SIGNATURE-----
Merge tag 'rust-6.7' of https://github.com/Rust-for-Linux/linux
Pull rust updates from Miguel Ojeda:
"A small one compared to the previous one in terms of features. In
terms of lines, as usual, the 'alloc' version upgrade accounts for
most of them.
Toolchain and infrastructure:
- Upgrade to Rust 1.73.0
This time around, due to how the kernel and Rust schedules have
aligned, there are two upgrades in fact. They contain the fixes for
a few issues we reported to the Rust project.
In addition, a few cleanups indicated by the upgraded compiler or
possible thanks to it. For instance, the compiler now detects
redundant explicit links.
- A couple changes to the Rust 'Makefile' so that it can be used with
toybox tools, allowing Rust to be used in the Android kernel build.
x86:
- Enable IBT if enabled in C
Documentation:
- Add "The Rust experiment" section to the Rust index page
MAINTAINERS:
- Add Maintainer Entry Profile field ('P:').
- Update our 'W:' field to point to the webpage we have been building
this year"
* tag 'rust-6.7' of https://github.com/Rust-for-Linux/linux:
docs: rust: add "The Rust experiment" section
x86: Enable IBT in Rust if enabled in C
rust: Use grep -Ev rather than relying on GNU grep
rust: Use awk instead of recent xargs
rust: upgrade to Rust 1.73.0
rust: print: use explicit link in documentation
rust: task: remove redundant explicit link
rust: kernel: remove `#[allow(clippy::new_ret_no_self)]`
MAINTAINERS: add Maintainer Entry Profile field for Rust
MAINTAINERS: update Rust webpage
rust: upgrade to Rust 1.72.1
rust: arc: add explicit `drop()` around `Box::from_raw()`
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R. Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmU/3cUWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsEoEACBGPSiOmfSWdH3TOnIG270PD24
jGjg8KFv7RC/JTOdYmpLl0okdlGT9LvjN/ToSSDEw3PIayxoXUdhkbYy0MYtiV3m
yz2ozDTzJuplQX/W2fPE+nXSzIwHao2zjPPFjHnT7lt8IIjhgjiOtLfZ2gGUkW99
Mdu2aWh3u0r4tC8OS23++yN5ibRc5l72efsjDWjZ0aPXnxE1bjmLMiIPiizpndIf
beasPuDBs98sJVYouemCwnsPXuXOPz3Q1Cpo/fTd+TMTJCLSemCQZCTuOBU0acI/
ZjLCgCaJU1yIYKBMtrIN4G9kITZniXX3/Nm4o6NQMVlcCqMeNaHuflomqWoqWfhE
UPbRo2eghZOaMNiCKLLvZDIqPrh1IcsiEl6Ef3W4hICc42GTK96IuGisIvDXwQ4N
/SzTOupJuN42noh3z1M3XuZy5RoXJ99IYDNY5CTKf9IdqvA0bbGkU3nb1gZH/xw9
BjTqKzR/7K1kTXuSgagDZ1Wceej9pZxhX7E3IHYsP8ZOvKug3EeL4yybVwQ3HRfq
Qnzcp/qPB9cOkLSQXveRTFTsj2mX28Gixct/iDuc1jIYwGQlY1gI6dcUcqby6ptM
BrQti7eR2NH2+T3aE2UVCIWsZVhx7NaSF+z8JxfAuu56jicc4xJVsi8zrNveWX5M
m2VXyBl3121BVtKi4w==
=0iVF
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"One of the more voluminous set of changes is for adding the new
__counted_by annotation[1] to gain run-time bounds checking of
dynamically sized arrays with UBSan.
- Add LKDTM test for stuck CPUs (Mark Rutland)
- Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo)
- Refactor more 1-element arrays into flexible arrays (Gustavo A. R.
Silva)
- Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem
Shaikh)
- Convert group_info.usage to refcount_t (Elena Reshetova)
- Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva)
- Add Kconfig fragment for basic hardening options (Kees Cook, Lukas
Bulwahn)
- Fix randstruct GCC plugin performance mode to stay in groups (Kees
Cook)
- Fix strtomem() compile-time check for small sources (Kees Cook)"
* tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits)
hwmon: (acpi_power_meter) replace open-coded kmemdup_nul
reset: Annotate struct reset_control_array with __counted_by
kexec: Annotate struct crash_mem with __counted_by
virtio_console: Annotate struct port_buffer with __counted_by
ima: Add __counted_by for struct modsig and use struct_size()
MAINTAINERS: Include stackleak paths in hardening entry
string: Adjust strtomem() logic to allow for smaller sources
hardening: x86: drop reference to removed config AMD_IOMMU_V2
randstruct: Fix gcc-plugin performance mode to stay in group
mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by
drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by
irqchip/imx-intmux: Annotate struct intmux_data with __counted_by
KVM: Annotate struct kvm_irq_routing_table with __counted_by
virt: acrn: Annotate struct vm_memory_region_batch with __counted_by
hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by
sparc: Annotate struct cpuinfo_tree with __counted_by
isdn: kcapi: replace deprecated strncpy with strscpy_pad
isdn: replace deprecated strncpy with strscpy
NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by
nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by
...
This pull request contains the following branches:
rcu/torture: RCU torture, locktorture and generic torture infrastructure
updates that include various fixes, cleanups and consolidations.
Among the user visible things, ftrace dumps can now be found into
their own file, and module parameters get better documented and
reported on dumps.
rcu/fixes: Generic and misc fixes all over the place. Some highlights:
* Hotplug handling has seen some light cleanups and comments.
* An RCU barrier can now be triggered through sysfs to serialize
memory stress testing and avoid OOM.
* Object information is now dumped in case of invalid callback
invocation.
* Also various SRCU issues, too hard to trigger to deserve urgent
pull requests, have been fixed.
rcu/docs: RCU documentation updates
rcu/refscale: RCU reference scalability test minor fixes and doc
improvements.
rcu/tasks: RCU tasks minor fixes
rcu/stall: Stall detection updates. Introduce RCU CPU Stall notifiers
that allows a subsystem to provide informations to help debugging.
Also cure some false positive stalls.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmU21h0ACgkQhSRUR1CO
jHdUgA/+Myy5K5OxNrqlF/gIK+flOSg635RyZ0DBx8OMXZ/fAg9qRI+PKt5I4Lha
eXAg6EtmwSgHmIbjcg8WzsvwniEsqqjOF+n1qil447fHUI2Qqw6c7fIm/MXQkeHJ
qA7CODDRtsAnwnjmTteasmMeGV0bmXDENxhNrAZBFnVkRgTqfyDbFcn+nxOaPK6b
fmbKvnB07WUg1KOV8/MbEtAZPb8QgHo58bXSZRKjKkiqRQWB/D3On+tShFK7SYJi
wIqQ96MLyUXLaIWQ47v6xEO4PZO+3o1wAryvP1DRdb5UrPjO6yKFfQaoo5Mza92G
zhBJhnXkVvCoNoCU7GKJIDV54SgDHaB6Sf1GN5cjwfujOkLuGCyg0CpKktCGm7uH
n3X66PVep608Uj2Y/pAo/hv3Hbv7lCu4nfrERvVLG9YoxUvTJDsKmBv+SF/g2mxF
rHqFa39HUPr1yHA5WjqOQS3lLdqCXEGKvNi6zXCvOceiDbHbiJFkBo6p8TVrbSMX
FCOWZ3LoE+6uiLu/lLOEroTjeBd8GhDh1LgWgyVK7o0LhP1018DSBolrpcSwnmOo
Q/E4G2x+aPWs+5NTOmMGOIPY70khKQIM3c8YZelSRffJBo6O3yV68h6X45NQxYvx
keLvrDaza8h4hKwaof/QaX4ZJgTOZ0xjpawr1vR0hbK8LNtPrUw=
=cVD7
-----END PGP SIGNATURE-----
Merge tag 'rcu-next-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks
Pull RCU updates from Frederic Weisbecker:
- RCU torture, locktorture and generic torture infrastructure updates
that include various fixes, cleanups and consolidations.
Among the user visible things, ftrace dumps can now be found into
their own file, and module parameters get better documented and
reported on dumps.
- Generic and misc fixes all over the place. Some highlights:
* Hotplug handling has seen some light cleanups and comments
* An RCU barrier can now be triggered through sysfs to serialize
memory stress testing and avoid OOM
* Object information is now dumped in case of invalid callback
invocation
* Also various SRCU issues, too hard to trigger to deserve urgent
pull requests, have been fixed
- RCU documentation updates
- RCU reference scalability test minor fixes and doc improvements.
- RCU tasks minor fixes
- Stall detection updates. Introduce RCU CPU Stall notifiers that
allows a subsystem to provide informations to help debugging. Also
cure some false positive stalls.
* tag 'rcu-next-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (56 commits)
srcu: Only accelerate on enqueue time
locktorture: Check the correct variable for allocation failure
srcu: Fix callbacks acceleration mishandling
rcu: Comment why callbacks migration can't wait for CPUHP_RCUTREE_PREP
rcu: Standardize explicit CPU-hotplug calls
rcu: Conditionally build CPU-hotplug teardown callbacks
rcu: Remove references to rcu_migrate_callbacks() from diagrams
rcu: Assume rcu_report_dead() is always called locally
rcu: Assume IRQS disabled from rcu_report_dead()
rcu: Use rcu_segcblist_segempty() instead of open coding it
rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects
srcu: Fix srcu_struct node grpmask overflow on 64-bit systems
torture: Convert parse-console.sh to mktemp
rcutorture: Traverse possible cpu to set maxcpu in rcu_nocb_toggle()
rcutorture: Replace schedule_timeout*() 1-jiffy waits with HZ/20
torture: Add kvm.sh --debug-info argument
locktorture: Rename readers_bind/writers_bind to bind_readers/bind_writers
doc: Catch-up update for locktorture module parameters
locktorture: Add call_rcu_chains module parameter
locktorture: Add new module parameters to lock_torture_print_module_parms()
...
- Limit the hardcoded topology quirk for Hygon CPUs to those which have a
model ID less than 4. The newer models have the topology CPUID leaf 0xB
correctly implemented and are not affected.
- Make SMT control more robust against enumeration failures
SMT control was added to allow controlling SMT at boottime or
runtime. The primary purpose was to provide a simple mechanism to
disable SMT in the light of speculation attack vectors.
It turned out that the code is sensible to enumeration failures and
worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration
which means the primary thread mask is not set up correctly. By chance
a XEN/PV boot ends up with smp_num_siblings == 2, which makes the
hotplug control stay at its default value "enabled". So the mask is
never evaluated.
The ongoing rework of the topology evaluation caused XEN/PV to end up
with smp_num_siblings == 1, which sets the SMT control to "not
supported" and the empty primary thread mask causes the hotplug core to
deny the bringup of the APS.
Make the decision logic more robust and take 'not supported' and 'not
implemented' into account for the decision whether a CPU should be
booted or not.
- Fake primary thread mask for XEN/PV
Pretend that all XEN/PV vCPUs are primary threads, which makes the
usage of the primary thread mask valid on XEN/PV. That is consistent
with because all of the topology information on XEN/PV is fake or even
non-existent.
- Encapsulate topology information in cpuinfo_x86
Move the randomly scattered topology data into a separate data
structure for readability and as a preparatory step for the topology
evaluation overhaul.
- Consolidate APIC ID data type to u32
It's fixed width hardware data and not randomly u16, int, unsigned long
or whatever developers decided to use.
- Cure the abuse of cpuinfo for persisting logical IDs.
Per CPU cpuinfo is used to persist the logical package and die
IDs. That's really not the right place simply because cpuinfo is
subject to be reinitialized when a CPU goes through an offline/online
cycle.
Use separate per CPU data for the persisting to enable the further
topology management rework. It will be removed once the new topology
management is in place.
- Provide a debug interface for inspecting topology information
Useful in general and extremly helpful for validating the topology
management rework in terms of correctness or "bug" compatibility.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+yX0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoROUD/4vlvKEcpm9rbI5DzLcaq4DFHKbyEZF
cQtzuOSM/9vTc9DHnuoNNLl9TWSYxiVYnejf3E21evfsqspYlzbTH8bId9XBCUid
6B68AJW842M2erNuwj0b0HwF1z++zpDmBDyhGOty/KQhoM8pYOHMvntAmbzJbuso
Dgx6BLVFcboTy6RwlfRa0EE8f9W5V+JbmG/VBDpdyCInal7VrudoVFZmWQnPIft7
zwOJpAoehkp8OKq7geKDf79yWxu9a1sNPd62HtaVEvfHwehHqE6OaMLss1us+0vT
SJ/D6gmRQBOwcXaZL0wL1dG7Km9Et4AisOvzhXGvTa5b2D5oljVoqJ7V7FTf5g3u
y3aqWbeUJzERUbeJt1HoGVAKyA4GtZOvg+TNIysf6F1Z4khl9alfa9jiqjj4g1au
zgItq/ZMBEBmJ7X4FxQUEUVBG2CDsEidyNBDRcimWQUDfBakV/iCs0suD8uu8ZOD
K5jMx8Hi2+xFx7r1YqsfsyMBYOf/zUZw65RbNe+kI992JbJ9nhcODbnbo5MlAsyv
vcqlK5FwXgZ4YAC8dZHU/tyTiqAW7oaOSkqKwTP5gcyNEqsjQHV//q6v+uqtjfYn
1C4oUsRHT2vJiV9ktNJTA4GQHIYF4geGgpG8Ih2SjXsSzdGtUd3DtX1iq0YiLEOk
eHhYsnniqsYB5g==
=xrz8
-----END PGP SIGNATURE-----
Merge tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Thomas Gleixner:
- Limit the hardcoded topology quirk for Hygon CPUs to those which have
a model ID less than 4.
The newer models have the topology CPUID leaf 0xB correctly
implemented and are not affected.
- Make SMT control more robust against enumeration failures
SMT control was added to allow controlling SMT at boottime or
runtime. The primary purpose was to provide a simple mechanism to
disable SMT in the light of speculation attack vectors.
It turned out that the code is sensible to enumeration failures and
worked only by chance for XEN/PV. XEN/PV has no real APIC enumeration
which means the primary thread mask is not set up correctly. By
chance a XEN/PV boot ends up with smp_num_siblings == 2, which makes
the hotplug control stay at its default value "enabled". So the mask
is never evaluated.
The ongoing rework of the topology evaluation caused XEN/PV to end up
with smp_num_siblings == 1, which sets the SMT control to "not
supported" and the empty primary thread mask causes the hotplug core
to deny the bringup of the APS.
Make the decision logic more robust and take 'not supported' and 'not
implemented' into account for the decision whether a CPU should be
booted or not.
- Fake primary thread mask for XEN/PV
Pretend that all XEN/PV vCPUs are primary threads, which makes the
usage of the primary thread mask valid on XEN/PV. That is consistent
with because all of the topology information on XEN/PV is fake or
even non-existent.
- Encapsulate topology information in cpuinfo_x86
Move the randomly scattered topology data into a separate data
structure for readability and as a preparatory step for the topology
evaluation overhaul.
- Consolidate APIC ID data type to u32
It's fixed width hardware data and not randomly u16, int, unsigned
long or whatever developers decided to use.
- Cure the abuse of cpuinfo for persisting logical IDs.
Per CPU cpuinfo is used to persist the logical package and die IDs.
That's really not the right place simply because cpuinfo is subject
to be reinitialized when a CPU goes through an offline/online cycle.
Use separate per CPU data for the persisting to enable the further
topology management rework. It will be removed once the new topology
management is in place.
- Provide a debug interface for inspecting topology information
Useful in general and extremly helpful for validating the topology
management rework in terms of correctness or "bug" compatibility.
* tag 'x86-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/apic, x86/hyperv: Use u32 in hv_snp_boot_ap() too
x86/cpu: Provide debug interface
x86/cpu/topology: Cure the abuse of cpuinfo for persisting logical ids
x86/apic: Use u32 for wakeup_secondary_cpu[_64]()
x86/apic: Use u32 for [gs]et_apic_id()
x86/apic: Use u32 for phys_pkg_id()
x86/apic: Use u32 for cpu_present_to_apicid()
x86/apic: Use u32 for check_apicid_used()
x86/apic: Use u32 for APIC IDs in global data
x86/apic: Use BAD_APICID consistently
x86/cpu: Move cpu_l[l2]c_id into topology info
x86/cpu: Move logical package and die IDs into topology info
x86/cpu: Remove pointless evaluation of x86_coreid_bits
x86/cpu: Move cu_id into topology info
x86/cpu: Move cpu_core_id into topology info
hwmon: (fam15h_power) Use topology_core_id()
scsi: lpfc: Use topology_core_id()
x86/cpu: Move cpu_die_id into topology info
x86/cpu: Move phys_proc_id into topology info
x86/cpu: Encapsulate topology information in cpuinfo_x86
...
- Make the quirk for non-maskable MSI interrupts in the affinity setter
functional again.
It was broken by a MSI core code update, which restructured the code in
a way that the quirk flag was not longer set correctly.
Trying to restore the core logic caused a deeper inspection and it
turned out that the extra quirk flag is not required at all because
it's the inverse of the reservation mode bit, which only can be set
when the MSI interrupt is maskable.
So the trivial fix is to use the reservation mode check in the affinity
setter function and remove almost 40 lines of code related to the
no-mask quirk flag.
- Cure a Kconfig dependency issue which causes compile fails by correcting
the conditionals in the affected heaer files.
- Clean up coding style in the UV APIC driver.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+w2ETHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodRIEACugQpAiND53Cz9MTJE1XWGheBn8n+y
WZtbVckK9LXHfnf99dtigtq3ycPBg2Mx/pItT1c71iN/NyVPNCC/3mM+ntP53etX
06v7VoIgRpF+GRQsIbjvVfBkKWS3G5zqHq6hazE0lcPhPEiOIMSBcusSQaA3C3sp
BT23rrx/59JKCb/187pum0Obx+n1oH+MG91wJ1v0zNOGgfx1u5gBPyKDZ0JAsMPx
r05991Lp4K80ooEsk/PKgpcZuPZD3QRwFDLHJwh2ITjsC22ItOnhN/c/1J8MZtk9
5YBaIzy+vyQd+nksqmDtB48FD0ioW/WLUR5zV9MMpD3RQAGgrxQmXeTqScu/j8tn
I4QZGi80HZSWLwFjWL2DoAw5LHZDh8ksxIYZBHE+8LHx0ymyX+7//naZnNFn9cXM
K5orouZFQi+dvfD7MAla73fiibPs6cHjGzRKfsfNlLBNNj+/ffcEolLlKGhEX9fx
R1w7gbtNs/RDnfoa45cvmez8UxJB5zei7l2HWbSecR2DDTQ9aGfEU/0NEr9muBzK
cuZIpmgD09VvS84jCxdYbaXA5Cau3NLOE5Os6UN9Wa3dUjWP5PArWkc8RzxP3Bgx
PANtytesqJ9YeDDg9SyylPw6d3EYNldtFZDXlD32K2O2SDmUZxxJa+0og8K9YNJN
aIUcqiSaYhg5rQ==
=2lhV
-----END PGP SIGNATURE-----
Merge tag 'x86-apic-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 APIC updates from Thomas Gleixner:
- Make the quirk for non-maskable MSI interrupts in the affinity setter
functional again.
It was broken by a MSI core code update, which restructured the code
in a way that the quirk flag was not longer set correctly.
Trying to restore the core logic caused a deeper inspection and it
turned out that the extra quirk flag is not required at all because
it's the inverse of the reservation mode bit, which only can be set
when the MSI interrupt is maskable.
So the trivial fix is to use the reservation mode check in the
affinity setter function and remove almost 40 lines of code related
to the no-mask quirk flag.
- Cure a Kconfig dependency issue which causes compile failures by
correcting the conditionals in the affected header files.
- Clean up coding style in the UV APIC driver.
* tag 'x86-apic-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic/msi: Fix misconfigured non-maskable MSI quirk
x86/msi: Fix compile error caused by CONFIG_GENERIC_MSI_IRQ=y && !CONFIG_X86_LOCAL_APIC
x86/platform/uv/apic: Clean up inconsistent indenting
- Add new NX-stack self-test
- Improve NUMA partial-CFMWS handling
- Fix #VC handler bugs resulting in SEV-SNP boot failures
- Drop the 4MB memory size restriction on minimal NUMA nodes
- Reorganize headers a bit, in preparation to header dependency reduction efforts
- Misc cleanups & fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU9Ek4RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gIJQ/+Mg6mzMaThyNXqhJszeZJBmDaBv2sqjAB
5tcferg1nJBdNBzX8bJ95UFt9fIqeYAcgH00qlQCYSmyzbC1TQTk9U2Pre1zbOw4
042ONK8sygKSje1zdYleHoBeqwnxD2VNM0NwBElhGjumwHRng/tbLiI9wx6qiz+C
VsFXavkBszHGA1pjy9wZLGixYIH5jCygMpH134Wp+CIhpS+C4nftcGdIL1D5Oil1
6Tm2XeI6uyfiQhm9IOwDjfoYeC7gUjx1rp8rHseGUMJxyO/BX9q5j1ixbsVriqfW
97ucYuRL9mza7ic516C9v7OlAA3AGH2xWV+SYOGK88i9Co4kYzP4WnamxXqOsD8+
popxG55oa6QelhaouTBZvgERpZ4fWupSDs/UccsDaE9leMCerNEbGHEzt/Mm/2sw
xopjMQ0y5Kn6/fS0dLv8U+XHu4ANkvXJkFd6Ny0h/WfgGefuQOOTG9ruYgfeqqB8
dViQ4R7CO8ySjD45KawAZl/EqL86x1M/CI1nlt0YY4vNwUuOJbebL7Jn8w3Fjxm5
FVfUlDmcPdhZfL9Vnrsi6MIou1cU1yJPw4D6sXJ4sg4s7A4ebBcRRrjayVQ4msjv
Q7cvBOMnWEHhOV11pvP50FmQuj74XW3bUqiuWrnK1SypvnhHavF6kc1XYpBLs1xZ
y8nueJW2qPw=
=tT5F
-----END PGP SIGNATURE-----
Merge tag 'x86-mm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm handling updates from Ingo Molnar:
- Add new NX-stack self-test
- Improve NUMA partial-CFMWS handling
- Fix #VC handler bugs resulting in SEV-SNP boot failures
- Drop the 4MB memory size restriction on minimal NUMA nodes
- Reorganize headers a bit, in preparation to header dependency
reduction efforts
- Misc cleanups & fixes
* tag 'x86-mm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Drop the 4 MB restriction on minimal NUMA node memory size
selftests/x86/lam: Zero out buffer for readlink()
x86/sev: Drop unneeded #include
x86/sev: Move sev_setup_arch() to mem_encrypt.c
x86/tdx: Replace deprecated strncpy() with strtomem_pad()
selftests/x86/mm: Add new test that userspace stack is in fact NX
x86/sev: Make boot_ghcb_page[] static
x86/boot: Move x86_cache_alignment initialization to correct spot
x86/sev-es: Set x86_virt_bits to the correct value straight away, instead of a two-phase approach
x86/sev-es: Allow copy_from_kernel_nofault() in earlier boot
x86_64: Show CR4.PSE on auxiliaries like on BSP
x86/iommu/docs: Update AMD IOMMU specification document URL
x86/sev/docs: Update document URL in amd-memory-encryption.rst
x86/mm: Move arch_memory_failure() and arch_is_platform_page() definitions from <asm/processor.h> to <asm/pgtable.h>
ACPI/NUMA: Apply SRAT proximity domain to entire CFMWS window
x86/numa: Introduce numa_fill_memblks()
- Make IA32_EMULATION boot time configurable with
the new ia32_emulation=<bool> boot option.
- Clean up fast syscall return validation code: convert
it to C and refactor the code.
- As part of this, optimize the canonical RIP test code.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU9DiARHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iNAw//cLn9gBXMVPDiCDVUOTqjkZ+OwIF11Y9v
WatksSe5hrw0Bzl5CiSvtrWpTkKPnhyM8Lc1WD8l0YSMKprdkQfNAvQOPv0IMLjk
XP1pgQhAiXwB87XL/G2sA6RunuK56zlnl7KJiDrQThrS/WOfrq3UkB2vyYEP4GtP
69WZ/WM++u74uEml0+HZ0Z9HVvzwYl1VQPdTYfl52S4H3U8MXL89YEsPr13Ttq88
FMKdXJ/VvItuVM/ZHHqFkGvRJjUtDWePLu29b684Ap6onDJ7uMMw86Gj5UxXtdpB
Axsjuwlca8sCPotcqohay6IdyxIth6lMdvjPv0KhA+/QMrHbDaluv88YQs4k7Add
1GPULH6oeDTHxMPOcJmFuSTpMY8HP6O9ZIXB6ogQRkLaDJKaWr5UQU7L2VBQ/WUy
NRa6mba0XHYrz6U7DmtsdL0idWBJeJokHmaIcGJ/pp6gMznvufm2+SoJ6w6wcYva
VTSTyrAAj/N9/TzJ5i8S2+yDPI9GanFpZJfYbW/rT9XGutvXWVKe3AmUNgR8O+hE
JiEMfpR0TtXXlrik74jur/RPZhaFIE8MeCvJrkJ3oxQlPThYSTMBAlUOtD7kOfNT
onjPrumREX4hOIBU+nnC9VrJMqxX9lz4xDzqw3jvX99Ma0o8Wx/UndWELX8tAYwd
j8M8NWAbv90=
=YkaP
-----END PGP SIGNATURE-----
Merge tag 'x86-entry-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry updates from Ingo Molnar:
- Make IA32_EMULATION boot time configurable with
the new ia32_emulation=<bool> boot option
- Clean up fast syscall return validation code: convert
it to C and refactor the code
- As part of this, optimize the canonical RIP test code
* tag 'x86-entry-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/32: Clean up syscall fast exit tests
x86/entry/64: Use TASK_SIZE_MAX for canonical RIP test
x86/entry/64: Convert SYSRET validation tests to C
x86/entry/32: Remove SEP test for SYSEXIT
x86/entry/32: Convert do_fast_syscall_32() to bool return type
x86/entry/compat: Combine return value test from syscall handler
x86/entry/64: Remove obsolete comment on tracing vs. SYSRET
x86: Make IA32_EMULATION boot time configurable
x86/entry: Make IA32 syscalls' availability depend on ia32_enabled()
x86/elf: Make loading of 32bit processes depend on ia32_enabled()
x86/entry: Compile entry_SYSCALL32_ignore() unconditionally
x86/entry: Rename ignore_sysret()
x86: Introduce ia32_enabled()
- Micro-optimize the x86 bitops code
- Define target-specific {raw,this}_cpu_try_cmpxchg{64,128}() to improve code generation
- Define and use raw_cpu_try_cmpxchg() preempt_count_set()
- Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op
- Remove the unused __sw_hweight64() implementation on x86-32
- Misc fixes and cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU9BJkRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iBKBAAl++uUEmjJM2nfS1AtKS5Rn0YJGoj7ZLy
MMjFJwYzw7nWqbkCZJqQATicmOVvdEUacYibYYfX4QkH3ylC9Av3ta1HeUEzqLpX
qG8W4jJPu/qlAOtGI4mQkq/yVminea6xy6l0vcCF+pezUKnwADP/YSL2Zsg03UsX
Nelty29NrpN/qCcLUJk40CHRjBhfYBVEk+HtCMahnftzLSNZWHYTgoYA9x4zm4Hg
LGiION+dwJKwaafaNw8/k1ikRJYfc5c1ZImi7YoOeCXXtBq7VHOHf76axok42m4s
2FJ7QefioQ/Gs1Gojxd99080F4H/Elt6uj05hR2493ncN4WVNKqYBOMezrlG686D
CuoOZvMaJ1LyaEntu1YWF3dtHIDTVjHhe5dyVclgu2a+v1gP/16xTLIf2z/cWtuX
whOcalFc7AdGXnLtGACHnhbc5Yh/Ex9Y49+6WYgBrDcgNDCGdTOa/m8kFmKFgOjh
x8Ot1xUjFI3utGgMlfKpYqw281ws+DjPiVvGi+fmUf8eBsq2IJuO9/f9oyfFfFSj
1bzwrL5Qop/ccZAOxtuLnVVVZ+Cz/5wHmMTI0rRE5BsoiVUrtWZfD+E0/vqdavjG
0eabDTdwUgXNBp6ex0TajXKtQY7FKg3tzIuPqHRvhRCvlt77MoSdcWRCc37ac5GJ
M6mAeLcitu8=
=7WUt
-----END PGP SIGNATURE-----
Merge tag 'x86-asm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 assembly code updates from Ingo Molnar:
- Micro-optimize the x86 bitops code
- Define target-specific {raw,this}_cpu_try_cmpxchg{64,128}() to
improve code generation
- Define and use raw_cpu_try_cmpxchg() preempt_count_set()
- Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op
- Remove the unused __sw_hweight64() implementation on x86-32
- Misc fixes and cleanups
* tag 'x86-asm-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/lib: Address kernel-doc warnings
x86/entry: Fix typos in comments
x86/entry: Remove unused argument %rsi passed to exc_nmi()
x86/bitops: Remove unused __sw_hweight64() assembly implementation on x86-32
x86/percpu: Do not clobber %rsi in percpu_{try_,}cmpxchg{64,128}_op
x86/percpu: Use raw_cpu_try_cmpxchg() in preempt_count_set()
x86/percpu: Define raw_cpu_try_cmpxchg and this_cpu_try_cmpxchg()
x86/percpu: Define {raw,this}_cpu_try_cmpxchg{64,128}
x86/asm/bitops: Use __builtin_clz{l|ll} to evaluate constant expressions
- Add AMD Unified Memory Controller (UMC) events introduced with Zen 4
- Simplify & clean up the uncore management code
- Fall back from RDPMC to RDMSR on certain uncore PMUs
- Improve per-package and cstate event reading
- Extend the Intel ref-cycles event to GP counters
- Fix Intel MTL event constraints
- Improve the Intel hybrid CPU handling code
- Micro-optimize the RAPL code
- Optimize perf_cgroup_switch()
- Improve large AUX area error handling
- Misc fixes and cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU89YsRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iQqQ/9EF9mG4te5By4qN+B7jADCmE71xG5ViKz
sp4Thl86SHxhwFuiHn8dMUixrp+qbcemi5yTbQ9TF8cKl4s3Ju2CihU8jaauUp0a
iS5W0IliMqLD1pxQoXAPLuPVInVYgrNOCbR4l6l7D6ervh5Z6PVEf7SVeAP3L5wo
QV/V3NKkrYeNQL+FoKhCH8Vhxw0HxUmKJO7UhW6yuCt7BAok9Es18h3OVnn+7es4
BB7VI/JvdmXf2ioKhTPnDXJjC+vh5vnwiBoTcdQ2W9ADhWUvfL4ozxOXT6z7oC3A
nwBOdXf8w8Rqnqqd8hduop1QUrusMxlEVgOMCk27qHx97uWgPceZWdoxDXGHBiRK
fqJAwXERf9wp5/M57NDlPwyf/43Hocdx2CdLkQBpfD78/k/sB5hW0KxnzY0FUI9x
jBRQyWD05IDJATBaMHz+VbrexS+Itvjp2QvSiSm9zislYD4zA9fQ3lAgFhEpcUbA
ZA/nN4t+CbiGEAsJEuBPlvSC1ahUwVP/0nz3PFlVWFDqAx0mXgVNKBe083A9yh7I
dVisVY6KPAVDzyOc1LqzU8WFXNFnIkIIaLrb6fRHJVEM8MDfpLPS/a+7AHdRcDP4
yq6fjVVjyP7e9lSQLYBUP3/3uiVnWQj92l6V6CrcgDMX5rDOb0VN+BQrmhPR6fWY
WEim6WZrZj4=
=OLsv
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance event updates from Ingo Molnar:
- Add AMD Unified Memory Controller (UMC) events introduced with Zen 4
- Simplify & clean up the uncore management code
- Fall back from RDPMC to RDMSR on certain uncore PMUs
- Improve per-package and cstate event reading
- Extend the Intel ref-cycles event to GP counters
- Fix Intel MTL event constraints
- Improve the Intel hybrid CPU handling code
- Micro-optimize the RAPL code
- Optimize perf_cgroup_switch()
- Improve large AUX area error handling
- Misc fixes and cleanups
* tag 'perf-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
perf/x86/amd/uncore: Pass through error code for initialization failures, instead of -ENODEV
perf/x86/amd/uncore: Fix uninitialized return value in amd_uncore_init()
x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations
perf: Optimize perf_cgroup_switch()
perf/x86/amd/uncore: Add memory controller support
perf/x86/amd/uncore: Add group exclusivity
perf/x86/amd/uncore: Use rdmsr if rdpmc is unavailable
perf/x86/amd/uncore: Move discovery and registration
perf/x86/amd/uncore: Refactor uncore management
perf/core: Allow reading package events from perf_event_read_local
perf/x86/cstate: Allow reading the package statistics from local CPU
perf/x86/intel/pt: Fix kernel-doc comments
perf/x86/rapl: Annotate 'struct rapl_pmus' with __counted_by
perf/core: Rename perf_proc_update_handler() -> perf_event_max_sample_rate_handler(), for readability
perf/x86/rapl: Fix "Using plain integer as NULL pointer" Sparse warning
perf/x86/rapl: Use local64_try_cmpxchg in rapl_event_update()
perf/x86/rapl: Stop doing cpu_relax() in the local64_cmpxchg() loop in rapl_event_update()
perf/core: Bail out early if the request AUX area is out of bound
perf/x86/intel: Extend the ref-cycles event to GP counters
perf/x86/intel: Fix broken fixed event constraints extension
...
- Fix potential MAX_NAME_LEN limit related build failures
- Fix scripts/faddr2line symbol filtering bug
- Fix scripts/faddr2line on LLVM=1
- Fix scripts/faddr2line to accept readelf output with mapping symbols
- Minor cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU88VYRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g2rQ//dvzezrAs+ZEhKLbRLSabbAlCeJ+J9zuP
c0xBmaLwUh47sSDKfBLLEFN3IMDfgMdKjfb3E32vT/WQ+ASdfEMs6FfwRtaErypG
XfZFpfC2WE1+Gq0MAgrXYuQgDv1Lygdimoy0aCwMlrgb7ZgWL1xorG0VSEemyKhd
CoRFURKjeJIKJN1oOvTXKhp/SZyk39KHXeF4qSAjIGkrzsfDtEUSNR6NjBmeGUS4
zNVWus/CucHK/6MMpHtdWw1/Ygemc1CBzYC3ZSMGimqy4Rqe2RsiGa0Y3XhlMCyn
ekNFuUm9bxStaTknM3ZXga0xHPdKnTPkihxykLDzo0Nh9eysuFlmFrFJ2xL/B87k
IxlpXvwxjxTSmGDhGQFVnXma6M2le3YFWGClS8UyhSPG08qg09ClwZ8OtVDi8ITI
rj0VoFbFLuc8aeHF/tyF2t323JmcMHq0aHi+kMUElszm6+B+fPnD54gHU+REXVxO
YIRkK9RY52mfU4KFf8xlO/UhFF6nP8pgE8pVnNF4lC034M0t4z+i/TLjOsspjVt3
yMoZakD7sfUkAaCBq4mVfdWwo5UzTVse0BarbEcKxoME6wLEfN+efE850zGdy7n1
iRC9AddddEyo4BnSHbWdWu/PDYJKPiH7dAtHBcfnEMJjLQewnRHlsHHbCA55jtrX
363jNE3x6K4=
=9U5x
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
"Misc fixes and cleanups:
- Fix potential MAX_NAME_LEN limit related build failures
- Fix scripts/faddr2line symbol filtering bug
- Fix scripts/faddr2line on LLVM=1
- Fix scripts/faddr2line to accept readelf output with mapping
symbols
- Minor cleanups"
* tag 'objtool-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
scripts/faddr2line: Skip over mapping symbols in output from readelf
scripts/faddr2line: Use LLVM addr2line and readelf if LLVM=1
scripts/faddr2line: Don't filter out non-function symbols from readelf
objtool: Remove max symbol name length limitation
objtool: Propagate early errors
objtool: Use 'the fallthrough' pseudo-keyword
x86/speculation, objtool: Use absolute relocations for annotations
x86/unwind/orc: Remove redundant initialization of 'mid' pointer in __orc_find()
- Fair scheduler (SCHED_OTHER) improvements:
- Remove the old and now unused SIS_PROP code & option
- Scan cluster before LLC in the wake-up path
- Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
- NUMA scheduling improvements:
- Improve the VMA access-PID code to better skip/scan VMAs
- Extend tracing to cover VMA-skipping decisions
- Improve/fix the recently introduced sched_numa_find_nth_cpu() code
- Generalize numa_map_to_online_node()
- Energy scheduling improvements:
- Remove the EM_MAX_COMPLEXITY limit
- Add tracepoints to track energy computation
- Make the behavior of the 'sched_energy_aware' sysctl more consistent
- Consolidate and clean up access to a CPU's max compute capacity
- Fix uclamp code corner cases
- RT scheduling improvements:
- Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
- Drive the ->rto_mask with rt_rq->pushable_tasks updates
- Scheduler scalability improvements:
- Rate-limit updates to tg->load_avg
- On x86 disable IBRS when CPU is offline to improve single-threaded performance
- Micro-optimize in_task() and in_interrupt()
- Micro-optimize the PSI code
- Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
- Core scheduler infrastructure improvements:
- Use saved_state to reduce some spurious freezer wakeups
- Bring in a handful of fast-headers improvements to scheduler headers
- Make the scheduler UAPI headers more widely usable by user-space
- Simplify the control flow of scheduler syscalls by using lock guards
- Fix sched_setaffinity() vs. CPU hotplug race
- Scheduler debuggability improvements:
- Disallow writing invalid values to sched_rt_period_us
- Fix a race in the rq-clock debugging code triggering warnings
- Fix a warning in the bandwidth distribution code
- Micro-optimize in_atomic_preempt_off() checks
- Enforce that the tasklist_lock is held in for_each_thread()
- Print the TGID in sched_show_task()
- Remove the /proc/sys/kernel/sched_child_runs_first sysctl
- Misc cleanups & fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU8/NoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gN+xAAvKGYNZBCBG4jowxccgqAbCx81KOhhsy/
KUaOmdLPg9WaXuqjZ5sggXQCMT0wUqBYAmqV7ts53VhWcma2I1ap4dCM6Jj+RLrc
vNwkeNetsikiZtarMoCJs5NahL8ULh3liBaoAkkToPjQ5r43aZ/eKwDovEdIKc+g
+Vgn7jUY8ssIrAOKT1midSwY1y8kAU2AzWOSFDTgedkJP4PgOu9/lBl9jSJ2sYaX
N4XqONYPXTwOHUtvmzkYILxLz0k0GgJ7hmt78E8Xy2rC4taGCRwCfCMBYxREuwiP
huo3O1P/iIe5svm4/EBUvcpvf44eAWTV+CD0dnJPwOc9IvFhpSzqSZZAsyy/JQKt
Lnzmc/xmyc1PnXCYJfHuXrw2/m+MyUHaegPzh5iLJFrlqa79GavOElj0jNTAMzbZ
39fybzPtuFP+64faRfu0BBlQZfORPBNc/oWMpPKqgP58YGuveKTWaUF5rl5lM7Ne
nm07uOmq02JVR8YzPl/FcfhU2dPMawWuMwUjEr2eU+lAunY3PF88vu0FALj7iOBd
66F8qrtpDHJanOxrdEUwSJ7hgw79qY1iw66Db7cQYjMazFKZONxArQPqFUZ0ngLI
n9hVa7brg1bAQKrQflqjcIAIbpVu3SjPEl15cKpAJTB/gn5H66TQgw8uQ6HfG+h2
GtOsn1nlvuk=
=GDqb
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_OTHER) improvements:
- Remove the old and now unused SIS_PROP code & option
- Scan cluster before LLC in the wake-up path
- Use candidate prev/recent_used CPU if scanning failed for cluster
wakeup
NUMA scheduling improvements:
- Improve the VMA access-PID code to better skip/scan VMAs
- Extend tracing to cover VMA-skipping decisions
- Improve/fix the recently introduced sched_numa_find_nth_cpu() code
- Generalize numa_map_to_online_node()
Energy scheduling improvements:
- Remove the EM_MAX_COMPLEXITY limit
- Add tracepoints to track energy computation
- Make the behavior of the 'sched_energy_aware' sysctl more
consistent
- Consolidate and clean up access to a CPU's max compute capacity
- Fix uclamp code corner cases
RT scheduling improvements:
- Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
- Drive the ->rto_mask with rt_rq->pushable_tasks updates
Scheduler scalability improvements:
- Rate-limit updates to tg->load_avg
- On x86 disable IBRS when CPU is offline to improve single-threaded
performance
- Micro-optimize in_task() and in_interrupt()
- Micro-optimize the PSI code
- Avoid updating PSI triggers and ->rtpoll_total when there are no
state changes
Core scheduler infrastructure improvements:
- Use saved_state to reduce some spurious freezer wakeups
- Bring in a handful of fast-headers improvements to scheduler
headers
- Make the scheduler UAPI headers more widely usable by user-space
- Simplify the control flow of scheduler syscalls by using lock
guards
- Fix sched_setaffinity() vs. CPU hotplug race
Scheduler debuggability improvements:
- Disallow writing invalid values to sched_rt_period_us
- Fix a race in the rq-clock debugging code triggering warnings
- Fix a warning in the bandwidth distribution code
- Micro-optimize in_atomic_preempt_off() checks
- Enforce that the tasklist_lock is held in for_each_thread()
- Print the TGID in sched_show_task()
- Remove the /proc/sys/kernel/sched_child_runs_first sysctl
... and misc cleanups & fixes"
* tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
sched/fair: Remove SIS_PROP
sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
sched/fair: Scan cluster before scanning LLC in wake-up path
sched: Add cpus_share_resources API
sched/core: Fix RQCF_ACT_SKIP leak
sched/fair: Remove unused 'curr' argument from pick_next_entity()
sched/nohz: Update comments about NEWILB_KICK
sched/fair: Remove duplicate #include
sched/psi: Update poll => rtpoll in relevant comments
sched: Make PELT acronym definition searchable
sched: Fix stop_one_cpu_nowait() vs hotplug
sched/psi: Bail out early from irq time accounting
sched/topology: Rename 'DIE' domain to 'PKG'
sched/psi: Delete the 'update_total' function parameter from update_triggers()
sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
sched/numa: Complete scanning of inactive VMAs when there is no alternative
sched/numa: Complete scanning of partial VMAs regardless of PID activity
sched/numa: Move up the access pid reset logic
sched/numa: Trace decisions related to skipping VMAs
...
- Futex improvements:
- Add the 'futex2' syscall ABI, which is an attempt to get away from the
multiplex syscall and adds a little room for extentions, while lifting
some limitations.
- Fix futex PI recursive rt_mutex waiter state bug
- Fix inter-process shared futexes on no-MMU systems
- Use folios instead of pages
- Micro-optimizations of locking primitives:
- Improve arch_spin_value_unlocked() on asm-generic ticket spinlock
architectures, to improve lockref code generation.
- Improve the x86-32 lockref_get_not_zero() main loop by adding
build-time CMPXCHG8B support detection for the relevant lockref code,
and by better interfacing the CMPXCHG8B assembly code with the compiler.
- Introduce arch_sync_try_cmpxchg() on x86 to improve sync_try_cmpxchg()
code generation. Convert some sync_cmpxchg() users to sync_try_cmpxchg().
- Micro-optimize rcuref_put_slowpath()
- Locking debuggability improvements:
- Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well
- Enforce atomicity of sched_submit_work(), which is de-facto atomic but
was un-enforced previously.
- Extend <linux/cleanup.h>'s no_free_ptr() with __must_check semantics
- Fix ww_mutex self-tests
- Clean up const-propagation in <linux/seqlock.h> and simplify
the API-instantiation macros a bit.
- RT locking improvements:
- Provide the rt_mutex_*_schedule() primitives/helpers and use them
in the rtmutex code to avoid recursion vs. rtlock on the PI state.
- Add nested blocking lockdep asserts to rt_mutex_lock(), rtlock_lock()
and rwbase_read_lock().
- Plus misc fixes & cleanups
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU877IRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1g9jw/+N7rxQ78dmFCYh4UWnLCYvuKP0/ivHErG
493JcB8MupuA2tfJHIkDdr4aM2mNq2E61w69/WlZAQWWD6pdOhwgF5Xf5eoEcJm0
vsAhWBGLxihXdtevPuMAx0dEpg3AMp2wc6i5PkN831KdPUgCNsrKq9Bfnfef7/G8
MQTSHjmtba6jxleyxfEa4tE2xe5PJX825nRfkX2e1cf+stkYua+uJFxVxUfxFWGE
4pBy70D9OC7MsJ44WWOA1gwkVtMMiBTmRPNjlP8Gz2GQ0f3ERHRwYk3jDHOPHZI6
0GNt7pE3IMXQn2UuDtfkvv9IFTd+U5qD+APnWIn2ntWXqzGLFqOlmovMrobVn7El
olYDCyweWPG71m1Qblsb1VK2QjRPQVJ9NAEg8RlDHIu2ThxHbMysDVGPVOYnPFq4
S8QFpmldzbNoPU4rDJyT1fAmoUIrusBHkl+Us3yGfC74iM+fHnDEvaSoMZbzEdY1
x/Nocj9XgKEgfXdYzrCWFmZ9xXqHkO25/wDL6yKqBdQtvaEalXuHTT6mQcYxrUPm
Xx1BPan2Jg7p4u2oOFcVtKewUtRH9KBx8qytr5S+JK4PJbrBsixMnr84HLd/3X2V
ykYkO+367T5MTYv4TnJDE5vdurzUqekKSCFPY3skPujPJfdLj1vsPzYf9iMkCLdo
hU2f/R+Wpdk=
=36Ff
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Info Molnar:
"Futex improvements:
- Add the 'futex2' syscall ABI, which is an attempt to get away from
the multiplex syscall and adds a little room for extentions, while
lifting some limitations.
- Fix futex PI recursive rt_mutex waiter state bug
- Fix inter-process shared futexes on no-MMU systems
- Use folios instead of pages
Micro-optimizations of locking primitives:
- Improve arch_spin_value_unlocked() on asm-generic ticket spinlock
architectures, to improve lockref code generation
- Improve the x86-32 lockref_get_not_zero() main loop by adding
build-time CMPXCHG8B support detection for the relevant lockref
code, and by better interfacing the CMPXCHG8B assembly code with
the compiler
- Introduce arch_sync_try_cmpxchg() on x86 to improve
sync_try_cmpxchg() code generation. Convert some sync_cmpxchg()
users to sync_try_cmpxchg().
- Micro-optimize rcuref_put_slowpath()
Locking debuggability improvements:
- Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well
- Enforce atomicity of sched_submit_work(), which is de-facto atomic
but was un-enforced previously.
- Extend <linux/cleanup.h>'s no_free_ptr() with __must_check
semantics
- Fix ww_mutex self-tests
- Clean up const-propagation in <linux/seqlock.h> and simplify the
API-instantiation macros a bit
RT locking improvements:
- Provide the rt_mutex_*_schedule() primitives/helpers and use them
in the rtmutex code to avoid recursion vs. rtlock on the PI state.
- Add nested blocking lockdep asserts to rt_mutex_lock(),
rtlock_lock() and rwbase_read_lock()
.. plus misc fixes & cleanups"
* tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
futex: Don't include process MM in futex key on no-MMU
locking/seqlock: Fix grammar in comment
alpha: Fix up new futex syscall numbers
locking/seqlock: Propagate 'const' pointers within read-only methods, remove forced type casts
locking/lockdep: Fix string sizing bug that triggers a format-truncation compiler-warning
locking/seqlock: Change __seqprop() to return the function pointer
locking/seqlock: Simplify SEQCOUNT_LOCKNAME()
locking/atomics: Use atomic_try_cmpxchg_release() to micro-optimize rcuref_put_slowpath()
locking/atomic, xen: Use sync_try_cmpxchg() instead of sync_cmpxchg()
locking/atomic/x86: Introduce arch_sync_try_cmpxchg()
locking/atomic: Add generic support for sync_try_cmpxchg() and its fallback
locking/seqlock: Fix typo in comment
futex/requeue: Remove unnecessary ‘NULL’ initialization from futex_proxy_trylock_atomic()
locking/local, arch: Rewrite local_add_unless() as a static inline function
locking/debug: Fix debugfs API return value checks to use IS_ERR()
locking/ww_mutex/test: Make sure we bail out instead of livelock
locking/ww_mutex/test: Fix potential workqueue corruption
locking/ww_mutex/test: Use prng instead of rng to avoid hangs at bootup
futex: Add sys_futex_requeue()
futex: Add flags2 argument to futex_requeue()
...
actually used in the amd_nb.c enumeration
- Add support for extracting NUMA information from devicetree for
Hyper-V usages
- Add PCI device IDs for the new AMD MI300 AI accelerators
- Annotate an array in struct uv_rtc_timer_head with the new
__counted_by attribute
- Rework UV's NMI action parameter handling
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmU78I4ACgkQEsHwGGHe
VUoDFQ//a8KYzcj2wC+GA+tAScMCP4lT3hsSYoHvFE6Fd4HBngMUBcaEHY0InxWc
jfXRHFZBS54sLMH2XPVOnozTBDLUEF+4cnApj9iVf//RwjSz/mnJ1P/fqwGQHmfY
eyOKn38nzkdcf+MW69HAyOeOLxOhOcxEdrzJ4AuPijjTH0GfHWXtuZrqCgaRB6XZ
yYw7PwO9azSBIYA3c87IPo5uGDa0Ht/BWWHv+XD21DjH0E2n2jaGPgExojBwpI1F
zJZkx8xJWlQANSD9qiYWoYFdEluGAJ6Fyn989hMRVUqFLbGB9ppEu0BD5TWor08A
UMX3LYIXM+706EnpOYKc5m5dajbUNOEUl73M6S6J67ugvRUWjCZpoejEx1Aejup7
aCcYrkbSiuVPFn8Wj9/JdkItu+zyzzsp3fjEdYlYVfefLh3EY4IrFAkQGUk7A0cA
p6Idau8j4Cx+4lvNEFGMk/u7SAxcInJx7NDy8rJL9CG1BdbjzkwJH+GJFELuI/zM
dSd+N2r5YUMClHR8Tor17PDh61i+ovg1kPhOyDyf7bNv0HQG0Hbcded4lyiVrsXa
m5GNYRXEQFZXMeBxo6dJKDAj2rdSPsACC2FBTa9B+bciklcWid4FVTvf/9sSFeQv
cvGFfGbFdW/uddAHsWqHMiuTfBsQneZ9UbZCKq/kAkHAtfdhRa8=
=7JVZ
-----END PGP SIGNATURE-----
Merge tag 'x86_platform_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform updates from Borislav Petkov:
- Make sure PCI function 4 IDs of AMD family 0x19, models 0x60-0x7f are
actually used in the amd_nb.c enumeration
- Add support for extracting NUMA information from devicetree for
Hyper-V usages
- Add PCI device IDs for the new AMD MI300 AI accelerators
- Annotate an array in struct uv_rtc_timer_head with the new
__counted_by attribute
- Rework UV's NMI action parameter handling
* tag 'x86_platform_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/amd_nb: Use Family 19h Models 60h-7Fh Function 4 IDs
x86/numa: Add Devicetree support
x86/of: Move the x86_flattree_get_config() call out of x86_dtb_init()
x86/amd_nb: Add AMD Family MI300 PCI IDs
x86/platform/uv: Annotate struct uv_rtc_timer_head with __counted_by
x86/platform/uv: Rework NMI "action" modparam handling
virtualization support is disabled in the BIOS on AMD and Hygon
platforms
- A minor cleanup
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmU77KoACgkQEsHwGGHe
VUrophAAtfsB+WhRydin0V6kjQeH+RbiWyx/jOw6eNqvzOzaOPxVXn0cAHRSgAO4
+S8tKIqaWpXNNNKpOIKBVaDkh9qr50/p36/jfVkXi8GOLYrK633F0BMjcG4+/vYQ
A9b5iNiJhZ7xWE6+qRrqdg+o+a6UyPUGz34HNp3KwJVTdaHU2OnXXwuWeiUkgRrJ
uQSfLc4+UIeefIzNy8Tqg083iaENBYMya7U90rzewD64NF0bsA15AEPut/6tnUVq
ej3UU3cqO7nKXyhuZX+zpt856MZFa1rNYVXUAfoAO4xhqdN0Q5LFWO506sqajNx/
hqbT+hKDoC03zuLmbZO21s/uWQdtVFo63FU0h9QBRp1m6Ug5P3rQQCK8ydJc5xwr
Yd7je6UPK9jIKBo9VP1qmsyzGwADNevNf1qGExHI2T6Wml7HgDmPysAHnGiKqRGI
1o9+Yqa+VBt8Wml9M8Ny+dLyr5F/2uq8sMrQedQlXdFMSzVm2JYecukJ5BvUWE/r
Qyll8mTpIdgGXjBt56lMrgH7ibMC5ct/4MvTHOHuA997g/PwuwtWj7QyKXpUq2Rf
o/c3zKKWIFxevjzwU86haCBaz+5xAQlB6dJw61ExxsmUuT/kZzkN15w6aqGZtpns
PsARwnvuwZJ7vfqFLIa0ZkPN4OgnkRX7HlNqrVyKpONDTocZd9E=
=i9On
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
- Make sure the "svm" feature flag is cleared from /proc/cpuinfo when
virtualization support is disabled in the BIOS on AMD and Hygon
platforms
- A minor cleanup
* tag 'x86_cpu_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Remove redundant 'break' statement
x86/cpu: Clear SVM feature if disabled by BIOS
Intel's CAT implementation
- Other improvements to resctrl code: better configuration,
simplifications, debugging support, fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmU7x3YACgkQEsHwGGHe
VUqj4BAAn9HiQPuBWW5UPVpLoHBmKHtoNuIn2AWD3wcRwFwd+mO1JbPgQzMude0C
QV/Dpm+PPxyFNATtCiRtqns3qHSt8wVMy/mrOKT7R20mmBxIhMb+323YvoamFSzc
gamSKDZJFEp8Dqj2ccnwpFIdPjlTZGuOCumcxrHbrEs10ezsZ1UCSOjlQrRJSrFX
M/KCVmrwVt6VR24Sz3K7e6atWgnl5Gj926VB0wXFSVAHII22Pirx7rdsHZXkgI0z
pBHomVyZxyjzo3XV9szG1h+3iTPIebWH6A+25YGpmh7PZeFzuJhn6XXbpZ35tjSw
3EKjkbwJyDXLLfAV+MMzVYZeCzpxy5MEuTW6aNi4y59k6GAhyeHClq9HWePo8rp7
lMXVfSeFpdtG0n7WUVF2ctm7mAqTF8id2WGNfvXoP/bzB2mkXQ4x1GV+TvxAyex8
OAk7iHk0IhrakfDj1XAE1o0BiSkGKaNo0eT8LnuByaUvHQHSBo/24fFcFOC9V1NL
K4eGfgn7yyXBFWvJch5LtdmG3LHQEJ9Dh4zZ8TkyHOt42Lc32JdHtBrRGdWRMyq9
p5lhLvPwuumjfjTeaXG4ABacdED1a8fiUzydumHmAux+an7irTqxwP51Y9MrxR1O
37+YBgEcO2nmubCUKUjvBga4ztFo0f/hMVqGqnME+tYwr7NzXx8=
=GGQX
-----END PGP SIGNATURE-----
Merge tag 'x86_cache_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:
- Add support for non-contiguous capacity bitmasks being added to
Intel's CAT implementation
- Other improvements to resctrl code: better configuration,
simplifications, debugging support, fixes
* tag 'x86_cache_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Display RMID of resource group
x86/resctrl: Add support for the files of MON groups only
x86/resctrl: Display CLOSID for resource group
x86/resctrl: Introduce "-o debug" mount option
x86/resctrl: Move default group file creation to mount
x86/resctrl: Unwind properly from rdt_enable_ctx()
x86/resctrl: Rename rftype flags for consistency
x86/resctrl: Simplify rftype flag definitions
x86/resctrl: Add multiple tasks to the resctrl group at once
Documentation/x86: Document resctrl's new sparse_masks
x86/resctrl: Add sparse_masks file in info
x86/resctrl: Enable non-contiguous CBMs in Intel CAT
x86/resctrl: Rename arch_has_sparse_bitmaps
x86/resctrl: Fix remaining kernel-doc warnings
machinery and other, general cleanups to the hw mitigations code,
by Josh Poimboeuf
- Improve the return thunk detection by objtool as it is absolutely
important that the default return thunk is not used after returns
have been patched. Future work to detect and report this better is
pending
- Other misc cleanups and fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmU7mFEACgkQEsHwGGHe
VUpbBxAAtS4X5LCntPWUsDEBU80SBYAunEp0Wd0ttYEj+UrEk4tvnWVGFiIEr47A
PrRKK9JCJtC6ko0+dwPtMi66L/T7mCpoNPI1kzfRG1IHJBfvCTGJhzZsesogvkA2
1X9Je+RCVW4xVybIryxhjMGdB6jUoGEU1a4DmQXq481qiLB3ilvA1bIAaNo9BBYP
rxKPrPcdOxn2NjxuOWg+FXjSc8LuAVSu3HqsgCW2AHJ6XIKEYWEq9FkXhwj9OJOr
ax1F4qD1IY++jYZO9DJiltjeJyj0wC+yp8kDDURoLbcTk85WHlpD5vK0g64mELOA
y0375thHep+vsrtQ/qZAmi/eVTaTekgbi7McahjoZebK7FbKOYRk6GZ+5+m29AVr
DfQSJ7xQQqbCbpimeFmZ+gQf7mFexyDWvjUPyBl+OelOY1umdPM9IZVTnqib5LPr
D2M+uqWfJhSwACi2o05LRv0gyhkAz0bGHrwZPmCVuxE5kBbhOpj4aT87fetUp/MW
8lEFa3PHx/gkh2VOJ7ZgKzpeD75Vjo8TRAXOe4O2jn/L54gNEJ+1mukvrjW3+lp1
ShmcZokl3ldPq6F5ioE+u45hVAfHkaruWM+5Rj3hsA/fdFN3isTVLhIRIsypPTKc
p1ITT8Yhek8vkm9PcRBE5xWRmEZ2XE5ooDld930nJxra8QNVVQw=
=E7c4
-----END PGP SIGNATURE-----
Merge tag 'x86_bugs_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 hw mitigation updates from Borislav Petkov:
- A bunch of improvements, cleanups and fixlets to the SRSO mitigation
machinery and other, general cleanups to the hw mitigations code, by
Josh Poimboeuf
- Improve the return thunk detection by objtool as it is absolutely
important that the default return thunk is not used after returns
have been patched. Future work to detect and report this better is
pending
- Other misc cleanups and fixes
* tag 'x86_bugs_for_6.7_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/retpoline: Document some thunk handling aspects
x86/retpoline: Make sure there are no unconverted return thunks due to KCSAN
x86/callthunks: Delete unused "struct thunk_desc"
x86/vdso: Run objtool on vdso32-setup.o
objtool: Fix return thunk patching in retpolines
x86/srso: Remove unnecessary semicolon
x86/pti: Fix kernel warnings for pti= and nopti cmdline options
x86/calldepth: Rename __x86_return_skl() to call_depth_return_thunk()
x86/nospec: Refactor UNTRAIN_RET[_*]
x86/rethunk: Use SYM_CODE_START[_LOCAL]_NOALIGN macros
x86/srso: Disentangle rethunk-dependent options
x86/srso: Move retbleed IBPB check into existing 'has_microcode' code block
x86/bugs: Remove default case for fully switched enums
x86/srso: Remove 'pred_cmd' label
x86/srso: Unexport untraining functions
x86/srso: Improve i-cache locality for alias mitigation
x86/srso: Fix unret validation dependencies
x86/srso: Fix vulnerability reporting for missing microcode
x86/srso: Print mitigation for retbleed IBPB case
x86/srso: Print actual mitigation if requested mitigation isn't possible
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppQwAKCRCRxhvAZXjc
om2kAP4u+eLsrhJHfUPUttGUEkSkZE+5/s/f1A/1GcV51usLSgEAu8urxAnP49GW
INaDABXaFfKx8/KI/H2YFZPKGwlNEwY=
=PDDE
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull iov_iter updates from Christian Brauner:
"This contain's David's iov_iter cleanup work to convert the iov_iter
iteration macros to inline functions:
- Remove last_offset from iov_iter as it was only used by ITER_PIPE
- Add a __user tag on copy_mc_to_user()'s dst argument on x86 to
match that on powerpc and get rid of a sparse warning
- Convert iter->user_backed to user_backed_iter() in the sound PCM
driver
- Convert iter->user_backed to user_backed_iter() in a couple of
infiniband drivers
- Renumber the type enum so that the ITER_* constants match the order
in iterate_and_advance*()
- Since the preceding patch puts UBUF and IOVEC at 0 and 1, change
user_backed_iter() to just use the type value and get rid of the
extra flag
- Convert the iov_iter iteration macros to always-inline functions to
make the code easier to follow. It uses function pointers, but they
get optimised away
- Move the check for ->copy_mc to _copy_from_iter() and
copy_page_from_iter_atomic() rather than in memcpy_from_iter_mc()
where it gets repeated for every segment. Instead, we check once
and invoke a side function that can use iterate_bvec() rather than
iterate_and_advance() and supply a different step function
- Move the copy-and-csum code to net/ where it can be in proximity
with the code that uses it
- Fold memcpy_and_csum() in to its two users
- Move csum_and_copy_from_iter_full() out of line and merge in
csum_and_copy_from_iter() since the former is the only caller of
the latter
- Move hash_and_copy_to_iter() to net/ where it can be with its only
caller"
* tag 'vfs-6.7.iov_iter' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
iov_iter, net: Move hash_and_copy_to_iter() to net/
iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
iov_iter, net: Fold in csum_and_memcpy()
iov_iter, net: Move csum_and_copy_to/from_iter() to net/
iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
iov_iter: Convert iterate*() to inline funcs
iov_iter: Derive user-backedness from the iterator type
iov_iter: Renumber ITER_* constants
infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC
sound: Fix snd_pcm_readv()/writev() to use iov access functions
iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
- Fix a possible CPU hotplug deadlock bug caused by the new
TSC synchronization code.
- Fix a legacy PIC discovery bug that results in device troubles on
affected systems, such as non-working keybards, etc.
- Add a new Intel CPU model number to <asm/intel-family.h>.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU85uARHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jSrhAAr3FrzkDWqLUUuDMWFEEVkTGz9GLc2uMP
bGT7HApp301wGpFv35bbBVp7HKLSB9nFeLsWKQF6MBa2uSUiCxSBcSNfHVJ6jX1O
Ac4hgl0Q/tZl0yZoag3u4j3FS57d9Qkzpg0QLTz9MVnQXEPTcDkli9tfSbRpxx3B
WvcGwAD+VeMjeghIJHxTJW9KwouIf3gAzqoKE7TKEozJ4S2LbNs3nEsDGsRHJsPN
jy/Fxlny6pdlkJArP/ILA9WO3yqBn2doIN9oqWKQCQtYbt1USnuJb5oCOXYAOijG
A9iLJmMjDyN2TZ8lKYHBcHVmEVTeB+6VnikejcAaT4EMrArj5mdjL3K7McZ24NFh
s8mx/S+v4mBExwNsaoq8kuUEZySWxNbFPLBjOq8BKIZYYDQUp1NK58CRWv9BAB6y
GjgYlMYI6EGDb+QAoTKZ5KNqrwtUPuk2ijjJTB15DpgLkl7XmyoVOulrtVDrhuGZ
Uuge9h0CLKnqD7lLfJdI7NLYxmZxPiQKbL9vrIqlk989UM5ItvttxWBNzhAdSTsG
VIaXAOJgGDj74dJC+3b0umugKJcrycUFBy5/RxOg+OcnSPd5g/L4XWcLLk+OElE3
pFBetWmO2HU4pxhL1GdWsDuF+RkYug4XKQqQP2LJ8Zepff/jkF+hVJoLYUhQIF6A
OMQv1N6/nSg=
=k19e
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix a possible CPU hotplug deadlock bug caused by the new TSC
synchronization code
- Fix a legacy PIC discovery bug that results in device troubles on
affected systems, such as non-working keybards, etc
- Add a new Intel CPU model number to <asm/intel-family.h>
* tag 'x86-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Defer marking TSC unstable to a worker
x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
x86/cpu: Add model number for Intel Arrow Lake mobile processor
The '%.ko' rule in arch/*/Makefile.postlink does nothing but call the
'true' command.
Remove the unneeded code.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Nicolas Schier <n.schier@avm.de>
Currently, there is no standard implementation for vdso_install,
leading to various issues:
1. Code duplication
Many architectures duplicate similar code just for copying files
to the install destination.
Some architectures (arm, sparc, x86) create build-id symlinks,
introducing more code duplication.
2. Unintended updates of in-tree build artifacts
The vdso_install rule depends on the vdso files to install.
It may update in-tree build artifacts. This can be problematic,
as explained in commit 19514fc665 ("arm, kbuild: make
"make install" not depend on vmlinux").
3. Broken code in some architectures
Makefile code is often copied from one architecture to another
without proper adaptation.
'make vdso_install' for parisc does not work.
'make vdso_install' for s390 installs vdso64, but not vdso32.
To address these problems, this commit introduces a generic vdso_install
rule.
Architectures that support vdso_install need to define vdso-install-y
in arch/*/Makefile. vdso-install-y lists the files to install.
For example, arch/x86/Makefile looks like this:
vdso-install-$(CONFIG_X86_64) += arch/x86/entry/vdso/vdso64.so.dbg
vdso-install-$(CONFIG_X86_X32_ABI) += arch/x86/entry/vdso/vdsox32.so.dbg
vdso-install-$(CONFIG_X86_32) += arch/x86/entry/vdso/vdso32.so.dbg
vdso-install-$(CONFIG_IA32_EMULATION) += arch/x86/entry/vdso/vdso32.so.dbg
These files will be installed to $(MODLIB)/vdso/ with the .dbg suffix,
if exists, stripped away.
vdso-install-y can optionally take the second field after the colon
separator. This is needed because some architectures install a vdso
file as a different base name.
The following is a snippet from arch/arm64/Makefile.
vdso-install-$(CONFIG_COMPAT_VDSO) += arch/arm64/kernel/vdso32/vdso.so.dbg:vdso32.so
This will rename vdso.so.dbg to vdso32.so during installation. If such
architectures change their implementation so that the base names match,
this workaround will go away.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Sven Schnelle <svens@linux.ibm.com> # s390
Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>
Reviewed-by: Guo Ren <guoren@kernel.org>
Acked-by: Helge Deller <deller@gmx.de> # parisc
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Service NMI and SMI requests after PMI requests in vcpu_enter_guest() so
that KVM does not need to cancel and redo the VM-Enter if the guest
configures its PMIs to be delivered as NMIs (likely) or SMIs (unlikely).
Because APIC emulation "injects" NMIs via KVM_REQ_NMI, handling PMI
requests after NMI requests (the likely case) means KVM won't detect the
pending NMI request until the final check for outstanding requests.
Detecting requests at the final stage is costly as KVM has already loaded
guest state, potentially queued events for injection, disabled IRQs,
dropped SRCU, etc., most of which needs to be unwound.
Note that changing the order of request processing doesn't change the end
result, as KVM's final check for outstanding requests prevents entering
the guest until all requests are serviced. I.e. KVM will ultimately
coalesce events (or not) regardless of the ordering.
Using SPEC2017 benchmark programs running along with Intel vtune in a VM
demonstrates that the following code change reduces 800~1500 canceled
VM-Enters per second.
Some glory details:
Probe the invocation to vmx_cancel_injection():
$ perf probe -a vmx_cancel_injection
$ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds
Partial results when SPEC2017 with Intel vtune are running in the VM:
On kernel without the change:
10.010018010 14254 probe:vmx_cancel_injection
20.037646388 15207 probe:vmx_cancel_injection
30.078739816 15261 probe:vmx_cancel_injection
40.114033258 15085 probe:vmx_cancel_injection
50.149297460 15112 probe:vmx_cancel_injection
60.185103088 15104 probe:vmx_cancel_injection
On kernel with the change:
10.003595390 40 probe:vmx_cancel_injection
20.017855682 31 probe:vmx_cancel_injection
30.028355883 34 probe:vmx_cancel_injection
40.038686298 31 probe:vmx_cancel_injection
50.048795162 20 probe:vmx_cancel_injection
60.069057747 19 probe:vmx_cancel_injection
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231002040839.2630027-1-mizhang@google.com
[sean: hoist PMU/PMI above SMI too, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tetsuo reported the following lockdep splat when the TSC synchronization
fails during CPU hotplug:
tsc: Marking TSC unstable due to check_tsc_sync_source failed
WARNING: inconsistent lock state
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x23/0x5a0
{IN-HARDIRQ-W} state was registered at:
_raw_spin_lock_irqsave+0x3f/0x60
clocksource_mark_unstable+0x1b/0x90
mark_tsc_unstable+0x41/0x50
check_tsc_sync_source+0x14f/0x180
sysvec_call_function_single+0x69/0x90
Possible unsafe locking scenario:
lock(watchdog_lock);
<Interrupt>
lock(watchdog_lock);
stack backtrace:
_raw_spin_lock+0x30/0x40
clocksource_watchdog+0x23/0x5a0
run_timer_softirq+0x2a/0x50
sysvec_apic_timer_interrupt+0x6e/0x90
The reason is the recent conversion of the TSC synchronization function
during CPU hotplug on the control CPU to a SMP function call. In case
that the synchronization with the upcoming CPU fails, the TSC has to be
marked unstable via clocksource_mark_unstable().
clocksource_mark_unstable() acquires 'watchdog_lock', but that lock is
taken with interrupts enabled in the watchdog timer callback to minimize
interrupt disabled time. That's obviously a possible deadlock scenario,
Before that change the synchronization function was invoked in thread
context so this could not happen.
As it is not crucical whether the unstable marking happens slightly
delayed, defer the call to a worker thread which avoids the lock context
problem.
Fixes: 9d349d47f0 ("x86/smpboot: Make TSC synchronization function call based")
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87zg064ceg.ffs@tglx
David and a few others reported that on certain newer systems some legacy
interrupts fail to work correctly.
Debugging revealed that the BIOS of these systems leaves the legacy PIC in
uninitialized state which makes the PIC detection fail and the kernel
switches to a dummy implementation.
Unfortunately this fallback causes quite some code to fail as it depends on
checks for the number of legacy PIC interrupts or the availability of the
real PIC.
In theory there is no reason to use the PIC on any modern system when
IO/APIC is available, but the dependencies on the related checks cannot be
resolved trivially and on short notice. This needs lots of analysis and
rework.
The PIC detection has been added to avoid quirky checks and force selection
of the dummy implementation all over the place, especially in VM guest
scenarios. So it's not an option to revert the relevant commit as that
would break a lot of other scenarios.
One solution would be to try to initialize the PIC on detection fail and
retry the detection, but that puts the burden on everything which does not
have a PIC.
Fortunately the ACPI/MADT table header has a flag field, which advertises
in bit 0 that the system is PCAT compatible, which means it has a legacy
8259 PIC.
Evaluate that bit and if set avoid the detection routine and keep the real
PIC installed, which then gets initialized (for nothing) and makes the rest
of the code with all the dependencies work again.
Fixes: e179f69141 ("x86, irq, pic: Probe for legacy PIC and set legacy_pic appropriately")
Reported-by: David Lazar <dlazar@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: David Lazar <dlazar@gmail.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Cc: stable@vger.kernel.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218003
Link: https://lore.kernel.org/r/875y2u5s8g.ffs@tglx
For "reasons" Intel has code-named this CPU with a "_H" suffix.
[ dhansen: As usual, apply this and send it upstream quickly to
make it easier for anyone who is doing work that
consumes this. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20231025202513.12358-1-tony.luck%40intel.com
The branch counters logging (A.K.A LBR event logging) introduces a
per-counter indication of precise event occurrences in LBRs. It can
provide a means to attribute exposed retirement latency to combinations
of events across a block of instructions. It also provides a means of
attributing Timed LBR latencies to events.
The feature is first introduced on SRF/GRR. It is an enhancement of the
ARCH LBR. It adds new fields in the LBR_INFO MSRs to log the occurrences
of events on the GP counters. The information is displayed by the order
of counters.
The design proposed in this patch requires that the events which are
logged must be in a group with the event that has LBR. If there are
more than one LBR group, the counters logging information only from the
current group (overflowed) are stored for the perf tool, otherwise the
perf tool cannot know which and when other groups are scheduled
especially when multiplexing is triggered. The user can ensure it uses
the maximum number of counters that support LBR info (4 by now) by
making the group large enough.
The HW only logs events by the order of counters. The order may be
different from the order of enabling which the perf tool can understand.
When parsing the information of each branch entry, convert the counter
order to the enabled order, and store the enabled order in the extension
space.
Unconditionally reset LBRs for an LBR event group when it's deleted. The
logged counter information is only valid for the current LBR group. If
another LBR group is scheduled later, the information from the stale
LBRs would be otherwise wrongly interpreted.
Add a sanity check in intel_pmu_hw_config(). Disable the feature if other
counter filters (inv, cmask, edge, in_tx) are set or LBR call stack mode
is enabled. (For the LBR call stack mode, we cannot simply flush the
LBR, since it will break the call stack. Also, there is no obvious usage
with the call stack mode for now.)
Only applying the PERF_SAMPLE_BRANCH_COUNTERS doesn't require any branch
stack setup.
Expose the maximum number of supported counters and the width of the
counters into the sysfs. The perf tool can use the information to parse
the logged counters in each branch.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-5-kan.liang@linux.intel.com
Some attrs and is_visible implementations are rather far away from one
another which makes the whole thing hard to interpret.
There are only two attribute groups which have both .attrs and
.is_visible, group_default and group_caps_lbr. Move them together.
No functional changes.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-4-kan.liang@linux.intel.com
Add a helper function to check call stack sample type.
The later patch will invoke the function in several places.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-3-kan.liang@linux.intel.com
Currently, branch_sample_type !=0 is used to check whether a branch
stack setup is required. But it doesn't check the sample type,
unnecessary branch stack setup may be done for a counting event. E.g.,
perf record -e "{branch-instructions,branch-misses}:S" -j any
Also, the event only with the new PERF_SAMPLE_BRANCH_COUNTERS branch
sample type may not require a branch stack setup either.
Add a new flag NEEDS_BRANCH_STACK to indicate whether the event requires
a branch stack setup. Replace the needs_branch_stack() by checking the
new flag.
The counting event check is implemented here. The later patch will take
the new PERF_SAMPLE_BRANCH_COUNTERS into account.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-2-kan.liang@linux.intel.com
Currently, the additional information of a branch entry is stored in a
u64 space. With more and more information added, the space is running
out. For example, the information of occurrences of events will be added
for each branch.
Two places were suggested to append the counters.
https://lore.kernel.org/lkml/20230802215814.GH231007@hirez.programming.kicks-ass.net/
One place is right after the flags of each branch entry. It changes the
existing struct perf_branch_entry. The later ARCH specific
implementation has to be really careful to consistently pick
the right struct.
The other place is right after the entire struct perf_branch_stack.
The disadvantage is that the pointer of the extra space has to be
recorded. The common interface perf_sample_save_brstack() has to be
updated.
The latter is much straightforward, and should be easily understood and
maintained. It is implemented in the patch.
Add a new branch sample type, PERF_SAMPLE_BRANCH_COUNTERS, to indicate
the event which is recorded in the branch info.
The "u64 counters" may store the occurrences of several events. The
information regarding the number of events/counters and the width of
each counter should be exposed via sysfs as a reference for the perf
tool. Define the branch_counter_nr and branch_counter_width ABI here.
The support will be implemented later in the Intel-specific patch.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231025201626.3000228-1-kan.liang@linux.intel.com
commit ef8dd01538 ("genirq/msi: Make interrupt allocation less
convoluted"), reworked the code so that the x86 specific quirk for affinity
setting of non-maskable PCI/MSI interrupts is not longer activated if
necessary.
This could be solved by restoring the original logic in the core MSI code,
but after a deeper analysis it turned out that the quirk flag is not
required at all.
The quirk is only required when the PCI/MSI device cannot mask the MSI
interrupts, which in turn also prevents reservation mode from being enabled
for the affected interrupt.
This allows ot remove the NOMASK quirk bit completely as msi_set_affinity()
can instead check whether reservation mode is enabled for the interrupt,
which gives exactly the same answer.
Even in the momentary non-existing case that the reservation mode would be
not set for a maskable MSI interrupt this would not cause any harm as it
just would cause msi_set_affinity() to go needlessly through the
functionaly equivalent slow path, which works perfectly fine with maskable
interrupts as well.
Rework msi_set_affinity() to query the reservation mode and remove all
NOMASK quirk logic from the core code.
[ tglx: Massaged changelog ]
Fixes: ef8dd01538 ("genirq/msi: Make interrupt allocation less convoluted")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20231026032036.2462428-1-den@valinux.co.jp
A bug was recently found via CONFIG_DEBUG_ENTRY=y, and the x86
tree kinda is the main source of changes to the x86 entry code,
so enable this debug option by default in our defconfigs.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
In general users, don't have the necessary information to determine
whether late loading of a new microcode version is safe and does not
modify anything which the currently running kernel uses already, e.g.
removal of CPUID bits or behavioural changes of MSRs.
To address this issue, Intel has added a "minimum required version"
field to a previously reserved field in the microcode header. Microcode
updates should only be applied if the current microcode version is equal
to, or greater than this minimum required version.
Thomas made some suggestions on how meta-data in the microcode file could
provide Linux with information to decide if the new microcode is suitable
candidate for late loading. But even the "simpler" option requires a lot of
metadata and corresponding kernel code to parse it, so the final suggestion
was to add the 'minimum required version' field in the header.
When microcode changes visible features, microcode will set the minimum
required version to its own revision which prevents late loading.
Old microcode blobs have the minimum revision field always set to 0, which
indicates that there is no information and the kernel considers it
unsafe.
This is a pure OS software mechanism. The hardware/firmware ignores this
header field.
For early loading there is no restriction because OS visible features
are enumerated after the early load and therefore a change has no
effect.
The check is always enabled, but by default not enforced. It can be
enforced via Kconfig or kernel command line.
If enforced, the kernel refuses to late load microcode with a minimum
required version field which is zero or when the currently loaded
microcode revision is smaller than the minimum required revision.
If not enforced the load happens independent of the revision check to
stay compatible with the existing behaviour, but it influences the
decision whether the kernel is tainted or not. If the check signals that
the late load is safe, then the kernel is not tainted.
Early loading is not affected by this.
[ tglx: Massaged changelog and fixed up the implementation ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.776467264@linutronix.de
Applying microcode late can be fatal for the running kernel when the
update changes functionality which is in use already in a non-compatible
way, e.g. by removing a CPUID bit.
There is no way for admins which do not have access to the vendors deep
technical support to decide whether late loading of such a microcode is
safe or not.
Intel has added a new field to the microcode header which tells the
minimal microcode revision which is required to be active in the CPU in
order to be safe.
Provide infrastructure for handling this in the core code and a command
line switch which allows to enforce it.
If the update is considered safe the kernel is not tainted and the annoying
warning message not emitted. If it's enforced and the currently loaded
microcode revision is not safe for late loading then the load is aborted.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211724.079611170@linutronix.de
Offline CPUs need to be parked in a safe loop when microcode update is
in progress on the primary CPU. Currently, offline CPUs are parked in
mwait_play_dead(), and for Intel CPUs, its not a safe instruction,
because the MWAIT instruction can be patched in the new microcode update
that can cause instability.
- Add a new microcode state 'UCODE_OFFLINE' to report status on per-CPU
basis.
- Force NMI on the offline CPUs.
Wake up offline CPUs while the update is in progress and then return
them back to mwait_play_dead() after microcode update is complete.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.660850472@linutronix.de
When SMT siblings are soft-offlined and parked in one of the play_dead()
variants they still react on NMI, which is problematic on affected Intel
CPUs. The default play_dead() variant uses MWAIT on modern CPUs, which is
not guaranteed to be safe when updated concurrently.
Right now late loading is prevented when not all SMT siblings are online,
but as they still react on NMI, it is possible to bring them out of their
park position into a trivial rendezvous handler.
Provide a function which allows to do that. I does sanity checks whether
the target is in the cpus_booted_once_mask and whether the APIC driver
supports it.
Mark X2APIC and XAPIC as capable, but exclude 32bit and the UV and NUMACHIP
variants as that needs feedback from the relevant experts.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.603100036@linutronix.de
The wait for control loop in which the siblings are waiting for the
microcode update on the primary thread must be protected against
instrumentation as instrumentation can end up in #INT3, #DB or #PF,
which then returns with IRET. That IRET reenables NMI which is the
opposite of what the NMI rendezvous is trying to achieve.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.545969323@linutronix.de
stop_machine() does not prevent the spin-waiting sibling from handling
an NMI, which is obviously violating the whole concept of rendezvous.
Implement a static branch right in the beginning of the NMI handler
which is nopped out except when enabled by the late loading mechanism.
The late loader enables the static branch before stop_machine() is
invoked. Each CPU has an nmi_enable in its control structure which
indicates whether the CPU should go into the update routine.
This is required to bridge the gap between enabling the branch and
actually being at the point where it is required to enter the loader
wait loop.
Each CPU which arrives in the stopper thread function sets that flag and
issues a self NMI right after that. If the NMI function sees the flag
clear, it returns. If it's set it clears the flag and enters the
rendezvous.
This is safe against a real NMI which hits in between setting the flag
and sending the NMI to itself. The real NMI will be swallowed by the
microcode update and the self NMI will then let stuff continue.
Otherwise this would end up with a spurious NMI.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.489900814@linutronix.de
with a new handler which just separates the control flow of primary and
secondary CPUs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.433704135@linutronix.de
The current all in one code is unreadable and really not suited for
adding future features like uniform loading with package or system
scope.
Provide a set of new control functions which split the handling of the
primary and secondary CPUs. These will replace the current rendezvous
all in one function in the next step. This is intentionally a separate
change because diff makes an complete unreadable mess otherwise.
So the flow separates the primary and the secondary CPUs into their own
functions which use the control field in the per CPU ucode_ctrl struct.
primary() secondary()
wait_for_all() wait_for_all()
apply_ucode() wait_for_release()
release() apply_ucode()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.377922731@linutronix.de
Add a per CPU control field to ucode_ctrl and define constants for it
which are going to be used to control the loading state machine.
In theory this could be a global control field, but a global control does
not cover the following case:
15 primary CPUs load microcode successfully
1 primary CPU fails and returns with an error code
With global control the sibling of the failed CPU would either try again or
the whole operation would be aborted with the consequence that the 15
siblings do not invoke the apply path and end up with inconsistent software
state. The result in dmesg would be inconsistent too.
There are two additional fields added and initialized:
ctrl_cpu and secondaries. ctrl_cpu is the CPU number of the primary thread
for now, but with the upcoming uniform loading at package or system scope
this will be one CPU per package or just one CPU. Secondaries hands the
control CPU a CPU mask which will be required to release the secondary CPUs
out of the wait loop.
Preparatory change for implementing a properly split control flow for
primary and secondary CPUs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.319959519@linutronix.de
The microcode rendezvous is purely acting on global state, which does
not allow to analyze fails in a coherent way.
Introduce per CPU state where the results are written into, which allows to
analyze the return codes of the individual CPUs.
Initialize the state when walking the cpu_present_mask in the online
check to avoid another for_each_cpu() loop.
Enhance the result print out with that.
The structure is intentionally named ucode_ctrl as it will gain control
fields in subsequent changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211723.632681010@linutronix.de
The code is too complicated for no reason:
- The return value is pointless as this is a strict boolean.
- It's way simpler to count down from num_online_cpus() and check for
zero.
- The timeout argument is pointless as this is always one second.
- Touching the NMI watchdog every 100ns does not make any sense, neither
does checking every 100ns. This is really not a hotpath operation.
Preload the atomic counter with the number of online CPUs and simplify the
whole timeout logic. Delay for one microsecond and touch the NMI watchdog
once per millisecond.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.204251527@linutronix.de
reload_store() is way too complicated. Split the inner workings out and
make the following enhancements:
- Taint the kernel only when the microcode was actually updated. If. e.g.
the rendezvous fails, then nothing happened and there is no reason for
tainting.
- Return useful error codes
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/r/20231002115903.145048840@linutronix.de
On CPUs where microcode loading is not NMI-safe the SMT siblings which
are parked in one of the play_dead() variants still react to NMIs.
So if an NMI hits while the primary thread updates the microcode the
resulting behaviour is undefined. The default play_dead() implementation on
modern CPUs is using MWAIT which is not guaranteed to be safe against
a microcode update which affects MWAIT.
Take the cpus_booted_once_mask into account to detect this case and
refuse to load late if the vendor specific driver does not advertise
that late loading is NMI safe.
AMD stated that this is safe, so mark the AMD driver accordingly.
This requirement will be partially lifted in later changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.087472735@linutronix.de
This function has nothing to do with suspend. It's a hotplug
callback. Remove the bogus comment.
Drop the pointless debug printk. The hotplug core provides tracepoints
which track the invocation of those callbacks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115903.028651784@linutronix.de
Scheduling work on all CPUs to collect the microcode information is just
another extra step for no value. Let the CPU hotplug callback registration
do it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211723.354748138@linutronix.de
Get rid of the initrd_gone hack which was required to keep
find_microcode_in_initrd() functional after init.
As find_microcode_in_initrd() is now only used during init, mark it
accordingly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211723.298854846@linutronix.de
Now that the microcode cache is initialized before the APs are brought
up, there is no point in scanning builtin/initrd microcode during AP
loading.
Convert the AP loader to utilize the cache, which in turn makes the CPU
hotplug callback which applies the microcode after initrd/builtin is
gone, obsolete as the early loading during late hotplug operations
including the resume path depends now only on the cache.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211723.243426023@linutronix.de
There is no reason to scan builtin/initrd microcode on each AP.
Cache the builtin/initrd microcode in an early initcall so that the
early AP loader can utilize the cache.
The existing fs initcall which invoked save_microcode_in_initrd_amd() is
still required to maintain the initrd_gone flag. Rename it accordingly.
This will be removed once the AP loader code is converted to use the
cache.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211723.187566507@linutronix.de
save_microcode_in_initrd_amd() fails to cache builtin microcode and only
scans initrd.
Use find_blobs_in_containers() instead which covers both.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231010150702.495139089@linutronix.de
find_blobs_in_containers() is invoked on every CPU but overwrites
unconditionally ucode_cpu_info of CPU0.
Fix this by using the proper CPU data and move the assignment into the
call site apply_ucode_from_containers() so that the function can be
reused.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231010150702.433454320@linutronix.de
Microcode is applied on the APs during early bringup. There is no point
in trying to apply the microcode again during the hotplug operations and
neither at the point where the microcode device is initialized.
Collect CPU info and microcode revision in setup_online_cpu() for now.
This will move to the CPU hotplug callback later.
[ bp: Leave the starting notifier for the following scenario:
- boot, late load, suspend to disk, resume
without the starting notifier, only the last core manages to update the
microcode upon resume:
# rdmsr -a 0x8b
10000bf
10000bf
10000bf
10000bf
10000bf
10000dc <----
This is on an AMD F10h machine.
For the future, one should check whether potential unification of
the CPU init path could cover the resume path too so that this can
be simplified even more.
tglx: This is caused by the odd handling of APs which try to find the
microcode blob in builtin or initrd instead of caching the microcode
blob during early init before the APs are brought up. Will be cleaned
up in a later step. ]
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231017211723.018821624@linutronix.de
Take a cpu_signature argument and work from there. Move the match()
helper next to the callsite as there is no point for having it in
a header.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115902.797820205@linutronix.de
Nothing needs struct ucode_cpu_info. Make it take struct cpu_signature,
let it return a boolean and simplify the implementation. Rename it now
that the silly name clash with collect_cpu_info() is gone.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.851573238@linutronix.de
Deduplicate the early and late apply() functions.
[ bp: Rename the function which does the actual application to
__apply_microcode() to differentiate it from
microcode_ops.apply_microcode(). ]
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231017211722.795508212@linutronix.de
There are situations where the late microcode is loaded into memory but
is not applied:
1) The rendezvous fails
2) The microcode is rejected by the CPUs
If any of this happens then the pointer which was updated at firmware
load time is stale and subsequent CPU hotplug operations either fail to
update or create inconsistent microcode state.
Save the loaded microcode in a separate pointer before the late load is
attempted and when successful, update the hotplug pointer accordingly
via a new microcode_ops callback.
Remove the pointless fallback in the loader to a microcode pointer which
is never populated.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115902.505491309@linutronix.de
The early loading code is overly complicated:
- It scans the builtin/initrd for microcode not only on the BSP, but also
on all APs during early boot and then later in the boot process it
scans again to duplicate and save the microcode before initrd goes
away.
That's a pointless exercise because this can be simply done before
bringing up the APs when the memory allocator is up and running.
- Saving the microcode from within the scan loop is completely
non-obvious and a left over of the microcode cache.
This can be done at the call site now which makes it obvious.
Rework the code so that only the BSP scans the builtin/initrd microcode
once during early boot and save it away in an early initcall for later
use.
[ bp: Test and fold in a fix from tglx ontop which handles the need to
distinguish what save_microcode() does depending on when it is
called:
- when on the BSP during early load, it needs to find a newer
revision than the one currently loaded on the BSP
- later, before SMP init, it still runs on the BSP and gets the BSP
revision just loaded and uses that revision to know which patch
to save for the APs. For that it needs to find the exact one as
on the BSP.
]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.629085215@linutronix.de
These flags are not made conditional on compiler support because at the
moment exactly one version of rustc supported, and that one supports
these flags.
Building without these additional flags will manifest as objtool
printing a large number of errors about missing ENDBR and if CFI is
enabled (not currently possible) will result in incorrectly structured
function prefixes.
Signed-off-by: Matthew Maurer <mmaurer@google.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231009224347.2076221-1-mmaurer@google.com
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmU1ngkeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGrsIH/0k/+gdBBYFFdEym
foRhKir9WV3ZX4oIozJjA1f7T+qVYclKs6kaYm3gNepRBb6AoG8pdgv4MMAqhYsf
QMe2XHi0MrO/qKBgfNfivxEa9jq+0QK5uvTbqCRqCAB8LfwVyDqapCmg3EuiZcPW
UbMITmnwLIfXgPxvp9rabmCsTqO6FLbf0GDOVIkNSAIDBXMpcO1iffjrWUbhRa7n
oIoiJmWJLcXLxPWDsRKbpJwzw2cIG08YhfQYAiQnC3YaeRm1FKLDIICRBsmfYzja
rWv9r4dn4TDfV4/AnjggQnsZvz2yPCxNaFSQIT88nIeiLvyuUTJ9j8aidsSfMZQf
xZAbzbA=
=NoQv
-----END PGP SIGNATURE-----
BackMerge tag 'v6.6-rc7' into drm-next
This is needed to add the msm pr which is based on a higher base.
Signed-off-by: Dave Airlie <airlied@redhat.com>
After a lot of experimenting (see thread Link points to) document for
now the issues and requirements for future improvements to the thunk
handling and potential issuing of a diagnostic when the default thunk
hasn't been patched out.
This documentation is only temporary and that close before the merge
window it is only a placeholder for those future improvements.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231010171020.462211-1-david.kaplan@amd.com
vdso32-setup.c is part of the main kernel image and should not be
excluded from objtool. Objtool is necessary in part for ensuring that
returns in this file are correctly patched to the appropriate return
thunk at runtime.
Signed-off-by: David Kaplan <david.kaplan@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231010171020.462211-3-david.kaplan@amd.com
Parse the pti= and nopti cmdline options using early_param to fix 'Unknown
kernel command line parameters "nopti", will be passed to user space'
warnings in the kernel log when nopti or pti= are passed to the kernel
cmdline on x86 platforms.
Additionally allow the kernel to warn for malformed pti= options.
Signed-off-by: Jo Van Bulck <jo.vanbulck@cs.kuleuven.be>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Link: https://lore.kernel.org/r/20230819080921.5324-2-jo.vanbulck@cs.kuleuven.be
CONFIG_RETHUNK, CONFIG_CPU_UNRET_ENTRY and CONFIG_CPU_SRSO are all
tangled up. De-spaghettify the code a bit.
Some of the rethunk-related code has been shuffled around within the
'.text..__x86.return_thunk' section, but otherwise there are no
functional changes. srso_alias_untrain_ret() and srso_alias_safe_ret()
((which are very address-sensitive) haven't moved.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/2845084ed303d8384905db3b87b77693945302b4.1693889988.git.jpoimboe@kernel.org
For enum switch statements which handle all possible cases, remove the
default case so a compiler warning gets printed if one of the enums gets
accidentally omitted from the switch statement.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/fcf6feefab991b72e411c2aed688b18e65e06aed.1693889988.git.jpoimboe@kernel.org
SBPB is only enabled in two distinct cases:
1) when SRSO has been disabled with srso=off
2) when SRSO has been fixed (in future HW)
Simplify the control flow by getting rid of the 'pred_cmd' label and
moving the SBPB enablement check to the two corresponding code sites.
This makes it more clear when exactly SBPB gets enabled.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/bb20e8569cfa144def5e6f25e610804bc4974de2.1693889988.git.jpoimboe@kernel.org
Move srso_alias_return_thunk() to the same section as
srso_alias_safe_ret() so they can share a cache line.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/eadaf5530b46a7ae8b936522da45ae555d2b3393.1693889988.git.jpoimboe@kernel.org
CONFIG_CPU_SRSO isn't dependent on CONFIG_CPU_UNRET_ENTRY (AMD
Retbleed), so the two features are independently configurable. Fix
several issues for the (presumably rare) case where CONFIG_CPU_SRSO is
enabled but CONFIG_CPU_UNRET_ENTRY isn't.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/299fb7740174d0f2335e91c58af0e9c242b4bac1.1693889988.git.jpoimboe@kernel.org
The SRSO default safe-ret mitigation is reported as "mitigated" even if
microcode hasn't been updated. That's wrong because userspace may still
be vulnerable to SRSO attacks due to IBPB not flushing branch type
predictions.
Report the safe-ret + !microcode case as vulnerable.
Also report the microcode-only case as vulnerable as it leaves the
kernel open to attacks.
Fixes: fb3bd914b3 ("x86/srso: Add a Speculative RAS Overflow mitigation")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/a8a14f97d1b0e03ec255c81637afdf4cf0ae9c99.1693889988.git.jpoimboe@kernel.org
When overriding the requested mitigation with IBPB due to retbleed=ibpb,
print the mitigation in the usual format instead of a custom error
message.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/ec3af919e267773d896c240faf30bfc6a1fd6304.1693889988.git.jpoimboe@kernel.org
If the kernel wasn't compiled to support the requested option, print the
actual option that ends up getting used.
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/7e7a12ea9d85a9f76ca16a3efb71f262dee46ab1.1693889988.git.jpoimboe@kernel.org
Make the SBPB check more robust against the (possible) case where future
HW has SRSO fixed but doesn't have the SRSO_NO bit set.
Fixes: 1b5277c0ea ("x86/srso: Add SRSO_NO support")
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/cee5050db750b391c9f35f5334f8ff40e66c01b9.1693889988.git.jpoimboe@kernel.org
Qi Zheng reported crashes in a production environment and provided a
simplified example as a reproducer:
| For example, if we use Qemu to start a two NUMA node kernel,
| one of the nodes has 2M memory (less than NODE_MIN_SIZE),
| and the other node has 2G, then we will encounter the
| following panic:
|
| BUG: kernel NULL pointer dereference, address: 0000000000000000
| <...>
| RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
| <...>
| Call Trace:
| <TASK>
| deactivate_slab()
| bootstrap()
| kmem_cache_init()
| start_kernel()
| secondary_startup_64_no_verify()
The crashes happen because of inconsistency between the nodemask that
has nodes with less than 4MB as memoryless, and the actual memory fed
into the core mm.
The commit:
9391a3f9c7 ("[PATCH] x86_64: Clear more state when ignoring empty node in SRAT parsing")
... that introduced minimal size of a NUMA node does not explain why
a node size cannot be less than 4MB and what boot failures this
restriction might fix.
Fixes have been submitted to the core MM code to tighten up the
memory topologies it accepts and to not crash on weird input:
mm: page_alloc: skip memoryless nodes entirely
mm: memory_hotplug: drop memoryless node from fallback lists
Andrew has accepted them into the -mm tree, but there are no
stable SHA1's yet.
This patch drops the limitation for minimal node size on x86:
- which works around the crash without the fixes to the core MM.
- makes x86 topologies less weird,
- removes an arbitrary and undocumented limitation on NUMA topologies.
[ mingo: Improved changelog clarity. ]
Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
Tested-by: Mario Casquero <mcasquer@redhat.com>
Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/ZS+2qqjEO5/867br@gmail.com
Implement the ->digest method to improve performance on single-page
messages by reducing the number of indirect calls.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Implement a ->digest function for sha256-ssse3, sha256-avx, sha256-avx2,
and sha256-ni. This improves the performance of crypto_shash_digest()
with these algorithms by reducing the number of indirect calls that are
made.
For now, don't bother with this for sha224, since sha224 is rarely used.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
the guest kernel gets to emulate certain instructions in SEV-{ES,SNP}
guests by:
- disabling emulation of MMIO instructions when coming from user mode
- checking the IO permission bitmap before emulating IO instructions and
verifying the memory operands of INS/OUTS insns.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmUufCwACgkQEsHwGGHe
VUoTHA//YO81VH8JkvfKwxh322mbD+TDTkgWgcpClsWnkIZQdyCpKVTwsWWuhwX5
FmCEc3I75hRK3ts3sdhZYOS94gKVUyWf2ERm2qMD02+08tS3K/TxJyx5xBMz9U03
VOiWRC1rp33MZ0eCrXenTbA7Xay6AhU34pz4qSdEvkUKUU6YIdCfnspFXSi84Uqy
tgmyPDJhSH/3hE46EJSHd4m6c8PO3Su/oUJHMy/refbxAscf9NNdWpGlPY285Aox
RTA0mOYQRRKf0YFkGabLY9IIcL0w+NXMhMVEMFNiXyxFvaM8CONhK6SDmzvcUngB
gOfsN6nD4JDqfH11gXCdxS3n0IZuAAMHyEigktvp1qnyNEDTBUtbfUkyqvITg+JC
u3KMFSSYB58colTK/bkhE0IHnH2bKzhkDuVKzmJn/OCTxf0xxfGsnjbdw0JxMO81
/9ORx8/QKWzv411AH2DUNh4vIJqDxVTJJb8zkScnYStX2ust6Ra+jYIr+mmf46md
+Rzo5qoe/GnAtReCdGFg3w339nEbUz51n5uqm9KN4QnH39wg5R8nPiAUMHOlO1Zm
PNvNgSZUkiiJpMci/KBbyFzPJTO7YjjRql7GWRwhWrclSPOrq49kocK5eIEYS4ol
cd5cKF92hHsnwycz2dZsDQwYqEQ5J+c6kZTwfUwJcoUBxCWP/qI=
=MNCv
-----END PGP SIGNATURE-----
Merge tag 'sev_fixes_for_v6.6' of //git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"Take care of a race between when the #VC exception is raised and when
the guest kernel gets to emulate certain instructions in SEV-{ES,SNP}
guests by:
- disabling emulation of MMIO instructions when coming from user mode
- checking the IO permission bitmap before emulating IO instructions
and verifying the memory operands of INS/OUTS insns"
* tag 'sev_fixes_for_v6.6' of //git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Check for user-space IOIO pointing to kernel space
x86/sev: Check IOBM for IOIO exceptions from user-space
x86/sev: Disable MMIO emulation from user mode
In TDX guest, the attestation process is used to verify the TDX guest
trustworthiness to other entities before provisioning secrets to the
guest. The first step in the attestation process is TDREPORT
generation, which involves getting the guest measurement data in the
format of TDREPORT, which is further used to validate the authenticity
of the TDX guest. TDREPORT by design is integrity-protected and can
only be verified on the local machine.
To support remote verification of the TDREPORT in a SGX-based
attestation, the TDREPORT needs to be sent to the SGX Quoting Enclave
(QE) to convert it to a remotely verifiable Quote. SGX QE by design can
only run outside of the TDX guest (i.e. in a host process or in a
normal VM) and guest can use communication channels like vsock or
TCP/IP to send the TDREPORT to the QE. But for security concerns, the
TDX guest may not support these communication channels. To handle such
cases, TDX defines a GetQuote hypercall which can be used by the guest
to request the host VMM to communicate with the SGX QE. More details
about GetQuote hypercall can be found in TDX Guest-Host Communication
Interface (GHCI) for Intel TDX 1.0, section titled
"TDG.VP.VMCALL<GetQuote>".
Trusted Security Module (TSM) [1] exposes a common ABI for Confidential
Computing Guest platforms to get the measurement data via ConfigFS.
Extend the TSM framework and add support to allow an attestation agent
to get the TDX Quote data (included usage example below).
report=/sys/kernel/config/tsm/report/report0
mkdir $report
dd if=/dev/urandom bs=64 count=1 > $report/inblob
hexdump -C $report/outblob
rmdir $report
GetQuote TDVMCALL requires TD guest pass a 4K aligned shared buffer
with TDREPORT data as input, which is further used by the VMM to copy
the TD Quote result after successful Quote generation. To create the
shared buffer, allocate a large enough memory and mark it shared using
set_memory_decrypted() in tdx_guest_init(). This buffer will be re-used
for GetQuote requests in the TDX TSM handler.
Although this method reserves a fixed chunk of memory for GetQuote
requests, such one time allocation can help avoid memory fragmentation
related allocation failures later in the uptime of the guest.
Since the Quote generation process is not time-critical or frequently
used, the current version uses a polling model for Quote requests and
it also does not support parallel GetQuote requests.
Link: https://lore.kernel.org/lkml/169342399185.3934343.3035845348326944519.stgit@dwillia2-xfh.jf.intel.com/ [1]
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Erdem Aktas <erdemaktas@google.com>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Tested-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Hyper-V enabled Windows Server 2022 KVM VM cannot be started on Zen1 Ryzen
since it crashes at boot with SYSTEM_THREAD_EXCEPTION_NOT_HANDLED +
STATUS_PRIVILEGED_INSTRUCTION (in other words, because of an unexpected #GP
in the guest kernel).
This is because Windows tries to set bit 8 in MSR_AMD64_TW_CFG and can't
handle receiving a #GP when doing so.
Give this MSR the same treatment that commit 2e32b71906
("x86, kvm: Add MSR_AMD64_BU_CFG2 to the list of ignored MSRs") gave
MSR_AMD64_BU_CFG2 under justification that this MSR is baremetal-relevant
only.
Although apparently it was then needed for Linux guests, not Windows as in
this case.
With this change, the aforementioned guest setup is able to finish booting
successfully.
This issue can be reproduced either on a Summit Ridge Ryzen (with
just "-cpu host") or on a Naples EPYC (with "-cpu host,stepping=1" since
EPYC is ordinarily stepping 2).
Alternatively, userspace could solve the problem by using MSR filters, but
forcing every userspace to define a filter isn't very friendly and doesn't
add much, if any, value. The only potential hiccup is if one of these
"baremetal-only" MSRs ever requires actual emulation and/or has F/M/S
specific behavior. But if that happens, then KVM can still punt *that*
handling to userspace since userspace MSR filters "win" over KVM's default
handling.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/1ce85d9c7c9e9632393816cf19c902e0a3f411f1.1697731406.git.maciej.szmigiero@oracle.com
[sean: call out MSR filtering alternative]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Sanitize the microcode scan loop, fixup printks and move the loading
function for builtin microcode next to the place where it is used and mark
it __init.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115902.389400871@linutronix.de
so it becomes less obfuscated and rename it because there is nothing
generic about it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231002115902.330295409@linutronix.de
Mixed steppings aren't supported on Intel CPUs. Only one microcode patch
is required for the entire system. The caching of microcode blobs which
match the family and model is therefore pointless and in fact is
dysfunctional as CPU hotplug updates use only a single microcode blob,
i.e. the one where *intel_ucode_patch points to.
Remove the microcode cache and make it an AMD local feature.
[ tglx:
- save only at the end. Otherwise random microcode ends up in the
pointer for early loading
- free the ucode patch pointer in save_microcode_patch() only
after kmemdup() has succeeded, as reported by Andrew Cooper ]
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231017211722.404362809@linutronix.de
Don't initialize "spte" and "sptep" in fast_page_fault() as they are both
guaranteed (for all intents and purposes) to be written at the start of
every loop iteration. Add a sanity check that "sptep" is non-NULL after
walking the shadow page tables, as encountering a NULL root would result
in "spte" not being written, i.e. would lead to uninitialized data or the
previous value being consumed.
Signed-off-by: Li zeming <zeming@nfschina.com>
Link: https://lore.kernel.org/r/20230905182006.2964-1-zeming@nfschina.com
[sean: rewrite changelog with --verbose]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Replace clear_bit_and_unlock_is_negative_byte() with
xor_unlock_is_negative_byte(). We have a few places that like to lock a
folio, set a flag and unlock it again. Allow for the possibility of
combining the latter two operations for efficiency. We are guaranteed
that the caller holds the lock, so it is safe to unlock it with the xor.
The caller must guarantee that nobody else will set the flag without
holding the lock; it is not safe to do this with the PG_dirty flag, for
example.
Link: https://lkml.kernel.org/r/20231004165317.1061855-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Define an X86_FEATURE_* flag for CPUID.80000021H:EAX.[bit 1], and
advertise the feature to userspace via KVM_GET_SUPPORTED_CPUID.
Per AMD's "Processor Programming Reference (PPR) for AMD Family 19h
Model 61h, Revision B1 Processors (56713-B1-PUB)," this CPUID bit
indicates that a WRMSR to MSR_FS_BASE, MSR_GS_BASE, or
MSR_KERNEL_GS_BASE is non-serializing. This is a change in previously
architected behavior.
Effectively, this CPUID bit is a "defeature" bit, or a reverse
polarity feature bit. When this CPUID bit is clear, the feature
(serialization on WRMSR to any of these three MSRs) is available. When
this CPUID bit is set, the feature is not available.
KVM_GET_SUPPORTED_CPUID must pass this bit through from the underlying
hardware, if it is set. Leaving the bit clear claims that WRMSR to
these three MSRs will be serializing in a guest running under
KVM. That isn't true. Though KVM could emulate the feature by
intercepting writes to the specified MSRs, it does not do so
today. The guest is allowed direct read/write access to these MSRs
without interception, so the innate hardware behavior is preserved
under KVM.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231005031237.1652871-1-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The 'kvmclock_periodic_sync' is a readonly param that cannot change after
bootup.
The kvm_arch_vcpu_postcreate() is not going to schedule the
kvmclock_sync_work if kvmclock_periodic_sync == false.
As a result, the "if (!kvmclock_periodic_sync)" can never be true if the
kvmclock_sync_work = kvmclock_sync_fn() is scheduled.
Link: https://lore.kernel.org/kvm/a461bf3f-c17e-9c3f-56aa-726225e8391d@oracle.com
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lore.kernel.org/r/20231001213637.76686-1-dongli.zhang@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>