KVM x86 changes for 6.5:

- Move handling of PAT out of MTRR code and dedup SVM+VMX code
 
  - Fix output of PIC poll command emulation when there's an interrupt
 
  - Fix a longstanding bug in the reporting of the number of entries returned by
    KVM_GET_CPUID2
 
  - Add a maintainer's handbook to document KVM x86 processes, preferred coding
    style, testing expectations, etc.
 
  - Misc cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAwFiEEMHr+pfEFOIzK+KY1YJEiAU0MEvkFAmSaGMMSHHNlYW5qY0Bn
 b29nbGUuY29tAAoJEGCRIgFNDBL5iDIP/0PwY3J5odTEUTnAyuDFPimd5PBt9k/O
 B414wdpSKVgzq+0An4qM9mKRnklVIh2p8QqQTvDhcBUg3xb6CX9xZ4ery7hp/T5O
 tr5bAXs2AYX6jpxvsopt+w+E9j6fvkJhcJCRU9im3QbrqwUE+ecyU5OHvmv2n/GO
 syVZJbPOYuoLPKDjlSMrScE6fWEl9UOvHc5BK/vafTeyisMG3vv1BSmJj6GuiNNk
 TS1RRIg//cOZghQyDfdXt0azTmakNZyNn35xnoX9x8SRmdRykyUjQeHmeqWxPDso
 kiGO+CGancfS57S6ZtCkJjqEWZ1o/zKdOxr8MMf/3nJhv4kY7/5XtlVoACv5soW9
 bZEmNiXIaSbvKNMwAlLJxHFbLa1sMdSCb345CIuMdt5QiWJ53ZiTyIAJX6+eL+Zf
 8nkeekgPf5VUs6Zt0RdRPyvo+W7Vp9BtI87yDXm1nQKpbys2pt6CD3YB/oF4QViG
 a5cyGoFuqRQbS3nmbshIlR7EanTuxbhLZKrNrFnolZ5e624h3Cnk2hVsfTznVGiX
 vNHWM80phk1CWB9McErrZVkGfjlyVyBL13CBB2XF7Dl6PfF6/N22a9bOuTJD3tvk
 PlNx4hvZm3esvvyGpjfbSajTKYE8O7rxiE1KrF0BpZ5IUl5WSiTr6XCy/yI/mIeM
 hay2IWhPOF2z
 =D0BH
 -----END PGP SIGNATURE-----

Merge tag 'kvm-x86-misc-6.5' of https://github.com/kvm-x86/linux into HEAD

KVM x86 changes for 6.5:

* Move handling of PAT out of MTRR code and dedup SVM+VMX code

* Fix output of PIC poll command emulation when there's an interrupt

* Fix a longstanding bug in the reporting of the number of entries returned by
  KVM_GET_CPUID2

* Add a maintainer's handbook to document KVM x86 processes, preferred coding
  style, testing expectations, etc.

* Misc cleanups
Paolo Bonzini 2023-07-01 07:08:59 -04:00
commit 36b68d360a
14 changed files with 493 additions and 86 deletions

View file

@ -17,3 +17,4 @@ Contents:
maintainer-tip
maintainer-netdev
maintainer-kvm-x86

View file

@ -0,0 +1,390 @@
.. SPDX-License-Identifier: GPL-2.0
KVM x86
=======
Foreword
--------
KVM strives to be a welcoming community; contributions from newcomers are
valued and encouraged. Please do not be discouraged or intimidated by the
length of this document and the many rules/guidelines it contains. Everyone
makes mistakes, and everyone was a newbie at some point. So long as you make
an honest effort to follow KVM x86's guidelines, are receptive to feedback,
and learn from any mistakes you make, you will be welcomed with open arms, not
torches and pitchforks.
TL;DR
-----
Testing is mandatory. Be consistent with established styles and patterns.
Trees
-----
KVM x86 is currently in a transition period from being part of the main KVM
tree, to being "just another KVM arch". As such, KVM x86 is split across the
main KVM tree, ``git.kernel.org/pub/scm/virt/kvm/kvm.git``, and a KVM x86
specific tree, ``github.com/kvm-x86/linux.git``.
Generally speaking, fixes for the current cycle are applied directly to the
main KVM tree, while all development for the next cycle is routed through the
KVM x86 tree. In the unlikely event that a fix for the current cycle is routed
through the KVM x86 tree, it will be applied to the ``fixes`` branch before
making its way to the main KVM tree.
Note, this transition period is expected to last quite some time, i.e. will be
the status quo for the foreseeable future.
Branches
~~~~~~~~
The KVM x86 tree is organized into multiple topic branches. The purpose of
using finer-grained topic branches is to make it easier to keep tabs on an area
of development, and to limit the collateral damage of human errors and/or buggy
commits, e.g. dropping the HEAD commit of a topic branch has no impact on other
in-flight commits' SHA1 hashes, and having to reject a pull request due to bugs
delays only that topic branch.
All topic branches, except for ``next`` and ``fixes``, are rolled into ``next``
via a Cthulhu merge on an as-needed basis, i.e. when a topic branch is updated.
As a result, force pushes to ``next`` are common.
Lifecycle
~~~~~~~~~
Fixes that target the current release, a.k.a. mainline, are typically applied
directly to the main KVM tree, i.e. do not route through the KVM x86 tree.
Changes that target the next release are routed through the KVM x86 tree. Pull
requests (from KVM x86 to main KVM) are sent for each KVM x86 topic branch,
typically the week before Linus' opening of the merge window, e.g. the week
following rc7 for "normal" releases. If all goes well, the topic branches are
rolled into the main KVM pull request sent during Linus' merge window.
The KVM x86 tree doesn't have its own official merge window, but there's a soft
close around rc5 for new features, and a soft close around rc6 for fixes (for
the next release; see above for fixes that target the current release).
Timeline
~~~~~~~~
Submissions are typically reviewed and applied in FIFO order, with some wiggle
room for the size of a series, patches that are "cache hot", etc. Fixes,
especially for the current release and/or stable trees, get to jump the queue.
Patches that will be taken through a non-KVM tree (most often through the tip
tree) and/or have other acks/reviews also jump the queue to some extent.
Note, the vast majority of review is done between rc1 and rc6, give or take.
The period between rc6 and the next rc1 is used to catch up on other tasks,
i.e. radio silence during this period isn't unusual.
Pings to get a status update are welcome, but keep in mind the timing of the
current release cycle and have realistic expectations. If you are pinging for
acceptance, i.e. not just for feedback or an update, please do everything you
can, within reason, to ensure that your patches are ready to be merged! Pings
on series that break the build or fail tests lead to unhappy maintainers!
Development
-----------
Base Tree/Branch
~~~~~~~~~~~~~~~~
Fixes that target the current release, a.k.a. mainline, should be based on
``git://git.kernel.org/pub/scm/virt/kvm/kvm.git master``. Note, fixes do not
automatically warrant inclusion in the current release. There is no singular
rule, but typically only fixes for bugs that are urgent, critical, and/or were
introduced in the current release should target the current release.
Everything else should be based on ``kvm-x86/next``, i.e. there is no need to
select a specific topic branch as the base. If there are conflicts and/or
dependencies across topic branches, it is the maintainer's job to sort them
out.
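
For example, one way to set up such a branch, assuming the KVM x86 remote has
not yet been added (the remote and branch names here are arbitrary)::

  git remote add kvm-x86 https://github.com/kvm-x86/linux.git
  git fetch kvm-x86
  git checkout -b my_feature kvm-x86/next
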
The only exception to using ``kvm-x86/next`` as the base is if a patch/series
is a multi-arch series, i.e. has non-trivial modifications to common KVM code
and/or has more than superficial changes to other architectures' code. Multi-
arch patch/series should instead be based on a common, stable point in KVM's
history, e.g. the release candidate upon which ``kvm-x86 next`` is based. If
you're unsure whether a patch/series is truly multi-arch, err on the side of
caution and treat it as multi-arch, i.e. use a common base.
Coding Style
~~~~~~~~~~~~
When it comes to style, naming, patterns, etc., consistency is the number one
priority in KVM x86. If all else fails, match what already exists.
With a few caveats listed below, follow the tip tree maintainers' preferred
:ref:`maintainer-tip-coding-style`, as patches/series often touch both KVM and
non-KVM x86 files, i.e. draw the attention of KVM *and* tip tree maintainers.
Using reverse fir tree, a.k.a. reverse Christmas tree or reverse XMAS tree, for
variable declarations isn't strictly required, though it is still preferred.
Except for a handful of special snowflakes, do not use kernel-doc comments for
functions. The vast majority of "public" KVM functions aren't truly public as
they are intended only for KVM-internal consumption (there are plans to
privatize KVM's headers and exports to enforce this).
Comments
~~~~~~~~
Write comments using imperative mood and avoid pronouns. Use comments to
provide a high level overview of the code, and/or to explain why the code does
what it does. Do not reiterate what the code literally does; let the code
speak for itself. If the code itself is inscrutable, comments will not help.
SDM and APM References
~~~~~~~~~~~~~~~~~~~~~~
Much of KVM's code base is directly tied to architectural behavior defined in
Intel's Software Developer's Manual (SDM) and AMD's Architecture Programmer's
Manual (APM). Use of "Intel's SDM" and "AMD's APM", or even just "SDM" or
"APM", without additional context is a-ok.
Do not reference specific sections, tables, figures, etc. by number, especially
not in comments. Instead, if necessary (see below), copy-paste the relevant
snippet and reference sections/tables/figures by name. The layouts of the SDM
and APM are constantly changing, and so the numbers/labels aren't stable.
Generally speaking, do not explicitly reference or copy-paste from the SDM or
APM in comments. With few exceptions, KVM *must* honor architectural behavior,
therefore it's implied that KVM behavior is emulating SDM and/or APM behavior.
Note, referencing the SDM/APM in changelogs to justify the change and provide
context is perfectly ok and encouraged.
Shortlog
~~~~~~~~
The preferred prefix format is ``KVM: <topic>:``, where ``<topic>`` is one of::
- x86
- x86/mmu
- x86/pmu
- x86/xen
- selftests
- SVM
- nSVM
- VMX
- nVMX
**DO NOT use x86/kvm!** ``x86/kvm`` is used exclusively for Linux-as-a-KVM-guest
changes, i.e. for arch/x86/kernel/kvm.c. Do not use file names or complete file
paths as the subject/shortlog prefix.
Note, these don't align with the topic branches (the topic branches care much
more about code conflicts).
All names are case sensitive! ``KVM: x86:`` is good, ``kvm: vmx:`` is not.
Capitalize the first word of the condensed patch description, but omit ending
punctuation. E.g.::
KVM: x86: Fix a null pointer dereference in function_xyz()
not::
kvm: x86: fix a null pointer dereference in function_xyz.
If a patch touches multiple topics, traverse up the conceptual tree to find the
first common parent (which is often simply ``x86``). When in doubt,
``git log path/to/file`` should provide a reasonable hint.
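
For example, to see which prefixes have been used for a given file (the path
below is purely illustrative)::

  git log --oneline -- arch/x86/kvm/vmx/pmu_intel.c | head -20
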
New topics do occasionally pop up, but please start an on-list discussion if
you want to propose introducing a new topic, i.e. don't go rogue.
See :ref:`the_canonical_patch_format` for more information, with one amendment:
do not treat the 70-75 character limit as an absolute, hard limit. Instead,
use 75 characters as a firm-but-not-hard limit, and use 80 characters as a hard
limit. I.e. let the shortlog run a few characters over the standard limit if
you have good reason to do so.
Changelog
~~~~~~~~~
Most importantly, write changelogs using imperative mood and avoid pronouns.
See :ref:`describe_changes` for more information, with one amendment: lead with
a short blurb on the actual changes, and then follow up with the context and
background. Note! This order directly conflicts with the tip tree's preferred
approach! Please follow the tip tree's preferred style when sending patches
that primarily target arch/x86 code that is _NOT_ KVM code.
Stating what a patch does before diving into details is preferred by KVM x86
for several reasons. First and foremost, what code is actually being changed
is arguably the most important information, and so that info should be easy to
find. Changelogs that bury the "what's actually changing" in a one-liner after
3+ paragraphs of background make it very hard to find that information.
For initial review, one could argue the "what's broken" is more important, but
for skimming logs and git archaeology, the gory details matter less and less.
E.g. when doing a series of "git blame", the details of each change along the
way are useless, the details only matter for the culprit. Providing the "what
changed" makes it easy to quickly determine whether or not a commit might be of
interest.
Another benefit of stating "what's changing" first is that it's almost always
possible to state "what's changing" in a single sentence. Conversely, all but
the most simple bugs require multiple sentences or paragraphs to fully describe
the problem. If both the "what's changing" and "what's the bug" are super
short then the order doesn't matter. But if one is shorter (almost always the
"what's changing), then covering the shorter one first is advantageous because
it's less of an inconvenience for readers/reviewers that have a strict ordering
preference. E.g. having to skip one sentence to get to the context is less
painful than having to skip three paragraphs to get to "what's changing".
Fixes
~~~~~
If a change fixes a KVM/kernel bug, add a Fixes: tag even if the change doesn't
need to be backported to stable kernels, and even if the change fixes a bug in
an older release.
Conversely, if a fix does need to be backported, explicitly tag the patch with
"Cc: stable@vger.kernel" (though the email itself doesn't need to Cc: stable);
KVM x86 opts out of backporting Fixes: by default. Some auto-selected patches
do get backported, but require explicit maintainer approval (search MANUALSEL).
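
One possible way to generate a correctly formatted Fixes: tag (the culprit
SHA1 is a placeholder)::

  git log -1 --abbrev=12 --format='Fixes: %h ("%s")' <culprit_sha1>
  # if the fix needs to be backported, also add: Cc: stable@vger.kernel.org
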
Function References
~~~~~~~~~~~~~~~~~~~
When a function is mentioned in a comment, changelog, or shortlog (or anywhere
for that matter), use the format ``function_name()``. The parentheses provide
context and disambiguate the reference.
Testing
-------
At a bare minimum, *all* patches in a series must build cleanly for KVM_INTEL=m
KVM_AMD=m, and KVM_WERROR=y. Building every possible combination of Kconfigs
isn't feasible, but the more the merrier. KVM_SMM, KVM_XEN, PROVE_LOCKING, and
X86_64 are particularly interesting knobs to turn.
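
For example, one way to flip the relevant Kconfigs and rebuild only KVM,
assuming an existing .config (a sketch, not the only supported flow)::

  ./scripts/config --module KVM_INTEL --module KVM_AMD --enable KVM_WERROR
  make olddefconfig
  make -j"$(nproc)" arch/x86/kvm/
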
Running KVM selftests and KVM-unit-tests is also mandatory (and stating the
obvious, the tests need to pass). The only exception is for changes that have
negligible probability of affecting runtime behavior, e.g. patches that only
modify comments. When possible and relevant, testing on both Intel and AMD is
strongly preferred. Booting an actual VM is encouraged, but not mandatory.
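
E.g. minimal invocations for both suites (sketches; each suite has many more
options and dedicated documentation)::

  # KVM selftests, from the top of the kernel tree
  make -C tools/testing/selftests TARGETS=kvm run_tests
  # KVM-unit-tests, from a separate kvm-unit-tests checkout
  ./configure && make && ./run_tests.sh
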
For changes that touch KVM's shadow paging code, running with TDP (EPT/NPT)
disabled is mandatory. For changes that affect common KVM MMU code, running
with TDP disabled is strongly encouraged. For all other changes, if the code
being modified depends on and/or interacts with a module param, testing with
the relevant settings is mandatory.
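
E.g. one way to disable TDP when KVM is built as modules (assumes no VMs are
running)::

  modprobe -r kvm_intel && modprobe kvm_intel ept=0   # Intel
  modprobe -r kvm_amd && modprobe kvm_amd npt=0       # AMD
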
Note, KVM selftests and KVM-unit-tests do have known failures. If you suspect
a failure is not due to your changes, verify that the *exact same* failure
occurs with and without your changes.
Changes that touch reStructuredText documentation, i.e. .rst files, must build
htmldocs cleanly, i.e. with no new warnings or errors.
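
E.g., where ``SPHINXDIRS`` can optionally be used to limit the build to the
directories that were touched::

  make SPHINXDIRS="process" htmldocs
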
If you can't fully test a change, e.g. due to lack of hardware, clearly state
what level of testing you were able to do, e.g. in the cover letter.
New Features
~~~~~~~~~~~~
With one exception, new features *must* come with test coverage. KVM specific
tests aren't strictly required, e.g. if coverage is provided by running a
sufficiently enabled guest VM, or by running a related kernel selftest in a VM,
but dedicated KVM tests are preferred in all cases. Negative testcases in
particular are mandatory for enabling of new hardware features as error and
exception flows are rarely exercised simply by running a VM.
The only exception to this rule is if KVM is simply advertising support for a
feature via KVM_GET_SUPPORTED_CPUID, i.e. for instructions/features that KVM
can't prevent a guest from using and for which there is no true enabling.
Note, "new features" does not just mean "new hardware features"! New features
that can't be well validated using existing KVM selftests and/or KVM-unit-tests
must come with tests.
Posting new feature development without tests to get early feedback is more
than welcome, but such submissions should be tagged RFC, and the cover letter
should clearly state what type of feedback is requested/expected. Do not abuse
the RFC process; RFCs will typically not receive in-depth review.
Bug Fixes
~~~~~~~~~
Except for "obvious" found-by-inspection bugs, fixes must be accompanied by a
reproducer for the bug being fixed. In many cases the reproducer is implicit,
e.g. for build errors and test failures, but it should still be clear to
readers what is broken and how to verify the fix. Some leeway is given for
bugs that are found via non-public workloads/tests, but providing regression
tests for such bugs is strongly preferred.
In general, regression tests are preferred for any bug that is not trivial to
hit. E.g. even if the bug was originally found by a fuzzer such as syzkaller,
a targeted regression test may be warranted if the bug requires hitting a
one-in-a-million type race condition.
Note, KVM bugs are rarely urgent *and* non-trivial to reproduce. Ask yourself
if a bug is really truly the end of the world before posting a fix without a
reproducer.
Posting
-------
Links
~~~~~
Do not explicitly reference bug reports, prior versions of a patch/series, etc.
via ``In-Reply-To:`` headers. Using ``In-Reply-To:`` becomes an unholy mess
for large series and/or when the version count gets high, and ``In-Reply-To:``
is useless for anyone that doesn't have the original message, e.g. if someone
wasn't Cc'd on the bug report or if the list of recipients changes between
versions.
To link to a bug report, previous version, or anything of interest, use lore
links. For referencing previous version(s), generally speaking do not include
a Link: in the changelog as there is no need to record the history in git, i.e.
put the link in the cover letter or in the section git ignores. Do provide a
formal Link: for bug reports and/or discussions that led to the patch. The
context of why a change was made is highly valuable for future readers.
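
E.g. a formal reference to a bug report (the message ID is a placeholder)::

  Link: https://lore.kernel.org/all/<message_id_of_bug_report>
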
Git Base
~~~~~~~~
If you are using git version 2.9.0 or later (Googlers, this is all of you!),
use ``git format-patch`` with the ``--base`` flag to automatically include the
base tree information in the generated patches.
Note, ``--base=auto`` works as expected if and only if a branch's upstream is
set to the base topic branch, e.g. it will do the wrong thing if your upstream
is set to your personal repository for backup purposes. An alternative "auto"
solution is to derive the names of your development branches based on their
KVM x86 topic, and feed that into ``--base``. E.g. ``x86/pmu/my_branch_name``,
and then write a small wrapper to extract ``pmu`` from the current branch name
to yield ``--base=x/pmu``, where ``x`` is whatever name your repository uses to
track the KVM x86 remote.
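
E.g. a minimal sketch of such a wrapper (the function name is arbitrary),
assuming branches are named ``<arch>/<topic>/<whatever>`` and ``x`` tracks the
KVM x86 remote::

  kvm_x86_format_patch() {
          local topic
          topic=$(git rev-parse --abbrev-ref HEAD | cut -d/ -f2)
          git format-patch --base="x/${topic}" "$@"
  }
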
Co-Posting Tests
~~~~~~~~~~~~~~~~
KVM selftests that are associated with KVM changes, e.g. regression tests for
bug fixes, should be posted along with the KVM changes as a single series. The
standard kernel rules for bisection apply, i.e. KVM changes that result in test
failures should be ordered after the selftests updates, and vice versa, new
tests that fail due to KVM bugs should be ordered after the KVM fixes.
KVM-unit-tests should *always* be posted separately. Tools, e.g. b4 am, don't
know that KVM-unit-tests is a separate repository and get confused when patches
in a series apply on different trees. To tie KVM-unit-tests patches back to
KVM patches, first post the KVM changes and then provide a lore Link: to the
KVM patch/series in the KVM-unit-tests patch(es).
Notifications
-------------
When a patch/series is officially accepted, a notification email will be sent
in reply to the original posting (cover letter for multi-patch series). The
notification will include the tree and topic branch, along with the SHA1s of
the commits of applied patches.
If a subset of patches is applied, this will be clearly stated in the
notification. Unless stated otherwise, it's implied that any patches in the
series that were not accepted need more work and should be submitted in a new
version.
If for some reason a patch is dropped after officially being accepted, a reply
will be sent to the notification email explaining why the patch was dropped, as
well as the next steps.
SHA1 Stability
~~~~~~~~~~~~~~
SHA1s are not 100% guaranteed to be stable until they land in Linus' tree! A
SHA1 is *usually* stable once a notification has been sent, but things happen.
In most cases, an update to the notification email will be provided if an applied
patch's SHA1 changes. However, in some scenarios, e.g. if all KVM x86 branches
need to be rebased, individual notifications will not be given.
Vulnerabilities
---------------
Bugs that can be exploited by the guest to attack the host (kernel or
userspace), or that can be exploited by a nested VM to *its* host (L2 attacking
L1), are of particular interest to KVM. Please follow the protocol for
:ref:`securitybugs` if you suspect a bug can lead to an escape, data leak, etc.

View file

@ -452,6 +452,8 @@ and can be added to an existing kernel config by running:
Some of these options are x86-specific and can be left out when testing
on other architectures.
.. _maintainer-tip-coding-style:
Coding style notes
------------------

View file

@ -205,7 +205,7 @@ Shadow pages contain the following information:
role.passthrough:
The page is not backed by a guest page table, but its first entry
points to one. This is set if NPT uses 5-level page tables (host
CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=1).
CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=0).
gfn:
Either the guest page table containing the translations shadowed by this
page, or the base page frame for linear translations. See role.direct.

View file

@ -11436,6 +11436,7 @@ M: Sean Christopherson <seanjc@google.com>
M: Paolo Bonzini <pbonzini@redhat.com>
L: kvm@vger.kernel.org
S: Supported
P: Documentation/process/maintainer-kvm-x86.rst
T: git git://git.kernel.org/pub/scm/virt/kvm/kvm.git
F: arch/x86/include/asm/kvm*
F: arch/x86/include/asm/svm.h

View file

@ -501,20 +501,15 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
struct kvm_cpuid2 *cpuid,
struct kvm_cpuid_entry2 __user *entries)
{
int r;
r = -E2BIG;
if (cpuid->nent < vcpu->arch.cpuid_nent)
goto out;
r = -EFAULT;
return -E2BIG;
if (copy_to_user(entries, vcpu->arch.cpuid_entries,
vcpu->arch.cpuid_nent * sizeof(struct kvm_cpuid_entry2)))
goto out;
return 0;
return -EFAULT;
out:
cpuid->nent = vcpu->arch.cpuid_nent;
return r;
return 0;
}
/* Mask kvm_cpu_caps for @leaf with the raw CPUID capabilities of this CPU. */

View file

@ -411,7 +411,10 @@ static u32 pic_poll_read(struct kvm_kpic_state *s, u32 addr1)
pic_clear_isr(s, ret);
if (addr1 >> 7 || ret != 2)
pic_update_irq(s->pics_state);
/* Bit 7 is 1, means there's an interrupt */
ret |= 0x80;
} else {
/* Bit 7 is 0, means there's no interrupt */
ret = 0x07;
pic_update_irq(s->pics_state);
}

View file

@ -51,11 +51,6 @@
#define mod_64(x, y) ((x) % (y))
#endif
#define PRId64 "d"
#define PRIx64 "llx"
#define PRIu64 "u"
#define PRIo64 "o"
/* 14 is the version for Xeon and Pentium 8.4.8*/
#define APIC_VERSION 0x14UL
#define LAPIC_MMIO_LENGTH (1 << 12)

View file

@ -25,10 +25,24 @@
#define IA32_MTRR_DEF_TYPE_FE (1ULL << 10)
#define IA32_MTRR_DEF_TYPE_TYPE_MASK (0xff)
static bool is_mtrr_base_msr(unsigned int msr)
{
/* MTRR base MSRs use even numbers, masks use odd numbers. */
return !(msr & 0x1);
}
static struct kvm_mtrr_range *var_mtrr_msr_to_range(struct kvm_vcpu *vcpu,
unsigned int msr)
{
int index = (msr - MTRRphysBase_MSR(0)) / 2;
return &vcpu->arch.mtrr_state.var_ranges[index];
}
static bool msr_mtrr_valid(unsigned msr)
{
switch (msr) {
case 0x200 ... 0x200 + 2 * KVM_NR_VAR_MTRR - 1:
case MTRRphysBase_MSR(0) ... MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1):
case MSR_MTRRfix64K_00000:
case MSR_MTRRfix16K_80000:
case MSR_MTRRfix16K_A0000:
@ -41,7 +55,6 @@ static bool msr_mtrr_valid(unsigned msr)
case MSR_MTRRfix4K_F0000:
case MSR_MTRRfix4K_F8000:
case MSR_MTRRdefType:
case MSR_IA32_CR_PAT:
return true;
}
return false;
@ -52,7 +65,7 @@ static bool valid_mtrr_type(unsigned t)
return t < 8 && (1 << t) & 0x73; /* 0, 1, 4, 5, 6 */
}
bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
static bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
int i;
u64 mask;
@ -60,9 +73,7 @@ bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
if (!msr_mtrr_valid(msr))
return false;
if (msr == MSR_IA32_CR_PAT) {
return kvm_pat_valid(data);
} else if (msr == MSR_MTRRdefType) {
if (msr == MSR_MTRRdefType) {
if (data & ~0xcff)
return false;
return valid_mtrr_type(data & 0xff);
@ -74,7 +85,8 @@ bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
}
/* variable MTRRs */
WARN_ON(!(msr >= 0x200 && msr < 0x200 + 2 * KVM_NR_VAR_MTRR));
WARN_ON(!(msr >= MTRRphysBase_MSR(0) &&
msr <= MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1)));
mask = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
if ((msr & 1) == 0) {
@ -88,7 +100,6 @@ bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
return (data & mask) == 0;
}
EXPORT_SYMBOL_GPL(kvm_mtrr_valid);
static bool mtrr_is_enabled(struct kvm_mtrr *mtrr_state)
{
@ -308,10 +319,8 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
{
struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
gfn_t start, end;
int index;
if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
!kvm_arch_has_noncoherent_dma(vcpu->kvm))
if (!tdp_enabled || !kvm_arch_has_noncoherent_dma(vcpu->kvm))
return;
if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
@ -326,8 +335,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
end = ~0ULL;
} else {
/* variable range MTRRs. */
index = (msr - 0x200) / 2;
var_mtrr_range(&mtrr_state->var_ranges[index], &start, &end);
var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end);
}
kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
@ -342,21 +350,18 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
struct kvm_mtrr_range *tmp, *cur;
int index, is_mtrr_mask;
index = (msr - 0x200) / 2;
is_mtrr_mask = msr - 0x200 - 2 * index;
cur = &mtrr_state->var_ranges[index];
cur = var_mtrr_msr_to_range(vcpu, msr);
/* remove the entry if it's in the list. */
if (var_mtrr_range_is_valid(cur))
list_del(&mtrr_state->var_ranges[index].node);
list_del(&cur->node);
/*
* Set all illegal GPA bits in the mask, since those bits must
* implicitly be 0. The bits are then cleared when reading them.
*/
if (!is_mtrr_mask)
if (is_mtrr_base_msr(msr))
cur->base = data;
else
cur->mask = data | kvm_vcpu_reserved_gpa_bits_raw(vcpu);
@ -382,8 +387,6 @@ int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
*(u64 *)&vcpu->arch.mtrr_state.fixed_ranges[index] = data;
else if (msr == MSR_MTRRdefType)
vcpu->arch.mtrr_state.deftype = data;
else if (msr == MSR_IA32_CR_PAT)
vcpu->arch.pat = data;
else
set_var_mtrr_msr(vcpu, msr, data);
@ -411,21 +414,16 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
return 1;
index = fixed_msr_to_range_index(msr);
if (index >= 0)
if (index >= 0) {
*pdata = *(u64 *)&vcpu->arch.mtrr_state.fixed_ranges[index];
else if (msr == MSR_MTRRdefType)
} else if (msr == MSR_MTRRdefType) {
*pdata = vcpu->arch.mtrr_state.deftype;
else if (msr == MSR_IA32_CR_PAT)
*pdata = vcpu->arch.pat;
else { /* Variable MTRRs */
int is_mtrr_mask;
index = (msr - 0x200) / 2;
is_mtrr_mask = msr - 0x200 - 2 * index;
if (!is_mtrr_mask)
*pdata = vcpu->arch.mtrr_state.var_ranges[index].base;
} else {
/* Variable MTRRs */
if (is_mtrr_base_msr(msr))
*pdata = var_mtrr_msr_to_range(vcpu, msr)->base;
else
*pdata = vcpu->arch.mtrr_state.var_ranges[index].mask;
*pdata = var_mtrr_msr_to_range(vcpu, msr)->mask;
*pdata &= ~kvm_vcpu_reserved_gpa_bits_raw(vcpu);
}

View file

@ -752,7 +752,7 @@ static bool msr_write_intercepted(struct kvm_vcpu *vcpu, u32 msr)
BUG_ON(offset == MSR_INVALID);
return !!test_bit(bit_write, &tmp);
return test_bit(bit_write, &tmp);
}
static void set_msr_interception_bitmap(struct kvm_vcpu *vcpu, u32 *msrpm,
@ -2939,9 +2939,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
break;
case MSR_IA32_CR_PAT:
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
return 1;
vcpu->arch.pat = data;
ret = kvm_set_msr_common(vcpu, msr);
if (ret)
break;
svm->vmcb01.ptr->save.g_pat = data;
if (is_guest_mode(vcpu))
nested_vmcb02_compute_g_pat(svm);

View file

@ -2287,19 +2287,16 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
goto find_uret_msr;
case MSR_IA32_CR_PAT:
if (!kvm_pat_valid(data))
return 1;
ret = kvm_set_msr_common(vcpu, msr_info);
if (ret)
break;
if (is_guest_mode(vcpu) &&
get_vmcs12(vcpu)->vm_exit_controls & VM_EXIT_SAVE_IA32_PAT)
get_vmcs12(vcpu)->guest_ia32_pat = data;
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT)
vmcs_write64(GUEST_IA32_PAT, data);
vcpu->arch.pat = data;
break;
}
ret = kvm_set_msr_common(vcpu, msr_info);
break;
case MSR_IA32_MCG_EXT_CTL:
if ((!msr_info->host_initiated &&

View file

@ -1017,13 +1017,11 @@ void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);
}
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
if (static_cpu_has(X86_FEATURE_PKU) &&
if (cpu_feature_enabled(X86_FEATURE_PKU) &&
vcpu->arch.pkru != vcpu->arch.host_pkru &&
((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE)))
write_pkru(vcpu->arch.pkru);
#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
}
EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);
@ -1032,15 +1030,13 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
if (vcpu->arch.guest_state_protected)
return;
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
if (static_cpu_has(X86_FEATURE_PKU) &&
if (cpu_feature_enabled(X86_FEATURE_PKU) &&
((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE))) {
vcpu->arch.pkru = rdpkru();
if (vcpu->arch.pkru != vcpu->arch.host_pkru)
write_pkru(vcpu->arch.host_pkru);
}
#endif /* CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS */
if (kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE)) {
@ -1427,15 +1423,14 @@ int kvm_emulate_rdpmc(struct kvm_vcpu *vcpu)
EXPORT_SYMBOL_GPL(kvm_emulate_rdpmc);
/*
* List of msr numbers which we expose to userspace through KVM_GET_MSRS
* and KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST.
*
* The three MSR lists(msrs_to_save, emulated_msrs, msr_based_features)
* extract the supported MSRs from the related const lists.
* msrs_to_save is selected from the msrs_to_save_all to reflect the
* capabilities of the host cpu. This capabilities test skips MSRs that are
* kvm-specific. Those are put in emulated_msrs_all; filtering of emulated_msrs
* may depend on host virtualization features rather than host cpu features.
* The three MSR lists(msrs_to_save, emulated_msrs, msr_based_features) track
* the set of MSRs that KVM exposes to userspace through KVM_GET_MSRS,
* KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST. msrs_to_save holds MSRs that
* require host support, i.e. should be probed via RDMSR. emulated_msrs holds
* MSRs that KVM emulates without strictly requiring host support.
* msr_based_features holds MSRs that enumerate features, i.e. are effectively
* CPUID leafs. Note, msr_based_features isn't mutually exclusive with
* msrs_to_save and emulated_msrs.
*/
static const u32 msrs_to_save_base[] = {
@ -1531,11 +1526,11 @@ static const u32 emulated_msrs_all[] = {
MSR_IA32_UCODE_REV,
/*
* The following list leaves out MSRs whose values are determined
* by arch/x86/kvm/vmx/nested.c based on CPUID or other MSRs.
* We always support the "true" VMX control MSRs, even if the host
* processor does not, so I am putting these registers here rather
* than in msrs_to_save_all.
* KVM always supports the "true" VMX control MSRs, even if the host
* does not. The VMX MSRs as a whole are considered "emulated" as KVM
* doesn't strictly require them to exist in the host (ignoring that
* KVM would refuse to load in the first place if the core set of MSRs
* aren't supported).
*/
MSR_IA32_VMX_BASIC,
MSR_IA32_VMX_TRUE_PINBASED_CTLS,
@ -1631,7 +1626,7 @@ static u64 kvm_get_arch_capabilities(void)
* If we're doing cache flushes (either "always" or "cond")
* we will do one whenever the guest does a vmlaunch/vmresume.
* If an outer hypervisor is doing the cache flush for us
* (VMENTER_L1D_FLUSH_NESTED_VM), we can safely pass that
* (ARCH_CAP_SKIP_VMENTRY_L1DFLUSH), we can safely pass that
* capability to the guest too, and if EPT is disabled we're not
* vulnerable. Overall, only VMENTER_L1D_FLUSH_NEVER will
* require a nested hypervisor to do a flush of its own.
@ -1809,7 +1804,7 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
unsigned long *bitmap = ranges[i].bitmap;
if ((index >= start) && (index < end) && (flags & type)) {
allowed = !!test_bit(index - start, bitmap);
allowed = test_bit(index - start, bitmap);
break;
}
}
@ -3702,8 +3697,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
}
break;
case 0x200 ... MSR_IA32_MC0_CTL2 - 1:
case MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) ... 0x2ff:
case MSR_IA32_CR_PAT:
if (!kvm_pat_valid(data))
return 1;
vcpu->arch.pat = data;
break;
case MTRRphysBase_MSR(0) ... MSR_MTRRfix4K_F8000:
case MSR_MTRRdefType:
return kvm_mtrr_set_msr(vcpu, msr, data);
case MSR_IA32_APICBASE:
return kvm_set_apic_base(vcpu, msr_info);
@ -4110,9 +4111,12 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = kvm_scale_tsc(rdtsc(), ratio) + offset;
break;
}
case MSR_IA32_CR_PAT:
msr_info->data = vcpu->arch.pat;
break;
case MSR_MTRRcap:
case 0x200 ... MSR_IA32_MC0_CTL2 - 1:
case MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) ... 0x2ff:
case MTRRphysBase_MSR(0) ... MSR_MTRRfix4K_F8000:
case MSR_MTRRdefType:
return kvm_mtrr_get_msr(vcpu, msr_info->index, &msr_info->data);
case 0xcd: /* fsb frequency */
msr_info->data = 3;

View file

@ -309,7 +309,6 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data);
int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,

View file

@ -163,6 +163,25 @@ static void set_cpuid_after_run(struct kvm_vcpu *vcpu)
ent->eax = eax;
}
static void test_get_cpuid2(struct kvm_vcpu *vcpu)
{
struct kvm_cpuid2 *cpuid = allocate_kvm_cpuid2(vcpu->cpuid->nent + 1);
int i, r;
vcpu_ioctl(vcpu, KVM_GET_CPUID2, cpuid);
TEST_ASSERT(cpuid->nent == vcpu->cpuid->nent,
"KVM didn't update nent on success, wanted %u, got %u\n",
vcpu->cpuid->nent, cpuid->nent);
for (i = 0; i < vcpu->cpuid->nent; i++) {
cpuid->nent = i;
r = __vcpu_ioctl(vcpu, KVM_GET_CPUID2, cpuid);
TEST_ASSERT(r && errno == E2BIG, KVM_IOCTL_ERROR(KVM_GET_CPUID2, r));
TEST_ASSERT(cpuid->nent == i, "KVM modified nent on failure");
}
free(cpuid);
}
int main(void)
{
struct kvm_vcpu *vcpu;
@ -183,5 +202,7 @@ int main(void)
set_cpuid_after_run(vcpu);
test_get_cpuid2(vcpu);
kvm_vm_free(vm);
}