Commit graph

41591 commits

Author SHA1 Message Date
Roman Gushchin
e33c267ab7 mm: shrinkers: provide shrinkers with names
Currently shrinkers are anonymous objects.  For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g.  for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.

This commit adds names to shrinkers.  register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.

In some cases it's not possible to determine a good name at the time when
a shrinker is allocated.  For such cases shrinker_debugfs_rename() is
provided.

The expected format is:
    <subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.

After this change the shrinker debugfs directory looks like:
  $ cd /sys/kernel/debug/shrinker/
  $ ls
    dquota-cache-16     sb-devpts-28     sb-proc-47       sb-tmpfs-42
    mm-shadow-18        sb-devtmpfs-5    sb-proc-48       sb-tmpfs-43
    mm-zspool:zram0-34  sb-hugetlbfs-17  sb-pstore-31     sb-tmpfs-44
    rcu-kfree-0         sb-hugetlbfs-33  sb-rootfs-2      sb-tmpfs-49
    sb-aio-20           sb-iomem-12      sb-securityfs-6  sb-tracefs-13
    sb-anon_inodefs-15  sb-mqueue-21     sb-selinuxfs-22  sb-xfs:vda1-36
    sb-bdev-3           sb-nsfs-4        sb-sockfs-8      sb-zsmalloc-19
    sb-bpf-32           sb-pipefs-14     sb-sysfs-26      thp-deferred_split-10
    sb-btrfs:vda2-24    sb-proc-25       sb-tmpfs-1       thp-zero-9
    sb-cgroup2-30       sb-proc-39       sb-tmpfs-27      xfs-buf:vda1-37
    sb-configfs-23      sb-proc-41       sb-tmpfs-29      xfs-inodegc:vda1-38
    sb-dax-11           sb-proc-45       sb-tmpfs-35
    sb-debugfs-7        sb-proc-46       sb-tmpfs-40

[roman.gushchin@linux.dev: fix build warnings]
  Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
  Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03 18:08:40 -07:00
akpm
46a3b11253 Merge branch 'master' into mm-stable 2022-06-27 10:31:34 -07:00
Linus Torvalds
e946554905 ARM64:
* Fix a regression with pKVM when kmemleak is enabled
 
 * Add Oliver Upton as an official KVM/arm64 reviewer
 
 selftests:
 
 * deal with compiler optimizations around hypervisor exits
 
 x86:
 
 * MAINTAINERS reorganization
 
 * Two SEV fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmK1cjIUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroMayQf+JOOggLacjPPa/t/CE8kIcbX0IWc+
 epEdq/f0qgxJlAjUB9YKgMr2Io9jPScyTdY8t6uS0WyZ7Q1NyAogXfds/dF4wElm
 IMWWfLTSU3gzCmzPh8n6SfbWtRGJKsOukK0cDIIh86h5YnXDmeyVjJrDvEVQOnzG
 TjHOKYuFXGPj8/NKwcrxqBFHK9DBNxn9b/UBRArG+5AZM0mx3Jl8LJMYUDEIyAyO
 yhNfTh7gPPidEiJLkFDyHWKg5rhO3fbn8UrncY+eTmSBqMHvvY0+eka6urwihN0v
 ExmKqy00ES51c/6r+zqsqYICqVSqiaNNWF4lp1HTp7LrUBtxyZAqnkBbHQ==
 =U2ol
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "ARM64:

   - Fix a regression with pKVM when kmemleak is enabled

   - Add Oliver Upton as an official KVM/arm64 reviewer

  selftests:

   - deal with compiler optimizations around hypervisor exits

  x86:

   - MAINTAINERS reorganization

   - Two SEV fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SEV: Init target VMCBs in sev_migrate_from
  KVM: x86/svm: add __GFP_ACCOUNT to __sev_dbg_{en,de}crypt_user()
  MAINTAINERS: Reorganize KVM/x86 maintainership
  selftests: KVM: Handle compiler optimizations in ucall
  KVM: arm64: Add Oliver as a reviewer
  KVM: arm64: Prevent kmemleak from accessing pKVM memory
  tools/kvm_stat: fix display of error when multiple processes are found
2022-06-24 12:17:47 -07:00
Peter Gonda
6defa24d3b KVM: SEV: Init target VMCBs in sev_migrate_from
The target VMCBs during an intra-host migration need to correctly setup
for running SEV and SEV-ES guests. Add sev_init_vmcb() function and make
sev_es_init_vmcb() static. sev_init_vmcb() uses the now private function
to init SEV-ES guests VMCBs when needed.

Fixes: 0b020f5af0 ("KVM: SEV: Add support for SEV-ES intra host migration")
Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Signed-off-by: Peter Gonda <pgonda@google.com>
Cc: Marc Orr <marcorr@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Message-Id: <20220623173406.744645-1-pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 04:10:18 -04:00
Mingwei Zhang
ebdec859fa KVM: x86/svm: add __GFP_ACCOUNT to __sev_dbg_{en,de}crypt_user()
Adding the accounting flag when allocating pages within the SEV function,
since these memory pages should belong to individual VM.

No functional change intended.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20220623171858.2083637-1-mizhang@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24 03:58:06 -04:00
Linus Torvalds
399bd66e21 Networking fixes for 5.19-rc4, including fixes from bpf and netfilter.
Current release - regressions:
   - netfilter: cttimeout: fix slab-out-of-bounds read in cttimeout_net_exit
 
 Current release - new code bugs:
   - bpf: ftrace: keep address offset in ftrace_lookup_symbols
 
   - bpf: force cookies array to follow symbols sorting
 
 Previous releases - regressions:
   - ipv4: ping: fix bind address validity check
 
   - tipc: fix use-after-free read in tipc_named_reinit
 
   - eth: veth: add updating of trans_start
 
 Previous releases - always broken:
   - sock: redo the psock vs ULP protection check
 
   - netfilter: nf_dup_netdev: fix skb_under_panic
 
   - bpf: fix request_sock leak in sk lookup helpers
 
   - eth: igb: fix a use-after-free issue in igb_clean_tx_ring
 
   - eth: ice: prohibit improper channel config for DCB
 
   - eth: at803x: fix null pointer dereference on AR9331 phy
 
   - eth: virtio_net: fix xdp_rxq_info bug after suspend/resume
 
 Misc:
   - eth: hinic: replace memcpy() with direct assignment
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmK0P+0SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkmBkP/1m5Et04wgtlEfQJtudZj0Sadra0tu6P
 vaYlqtiRNMziSY/hxG1p4w7giM4gD7fD3S12Pc/ueCaUwxxILN/eZ/hNgCq9huf6
 IbmVmfq6YNZwDaNzFDP8UcIqjnxbg1B3XD41dN7+FggA9ogGFkOvuAcJByzdANVX
 BLOkQmGP22+pNJmniH3KYvCZlHIa+LVeRjdjdM+1/LKDs2pxpBi97obyzb5zUiE5
 c5E7+BhkGI9X6V1TuHVCHIEFssYNWLiTJcw76HptWmK9Z/DlDEeVlHzKbAMNTycl
 I8eTLXnqgye0KCKOqJ4fN+YN42ypdDzrUILKMHGEddG1lOot/2XChgp8+EqMY7Nx
 Gjpjh28jTsKdCZMFF3lxDGxeonHciP6lZA80g3GNk4FWUVrqnKEYpdy+6psTkpDr
 HahjmFWylGXfmPIKJrsiVGIyxD4ObkRF6SSH7L8j5tAVGxaB5MDFrCws136kACCk
 ZyZiXTS0J3Cn1fAb2/vGKgDFhbEWykITYPaiVo7pyrO1jju5qQTtiKiABpcX0Ejs
 WxvPA8HB61+kEapIzBLhhxRl25CXTleGE986au2MVh0I/HuQBxVExrRE9FgThjwk
 YbSKhR2JOcD5B94HRQXVsQ05q02JzxmB0kVbqSLcIAbCOuo++LZCIdwR5XxSpF6s
 AAFhqQycWowh
 =JFWo
 -----END PGP SIGNATURE-----

Merge tag 'net-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - netfilter: cttimeout: fix slab-out-of-bounds read in
     cttimeout_net_exit

Current release - new code bugs:

   - bpf: ftrace: keep address offset in ftrace_lookup_symbols

   - bpf: force cookies array to follow symbols sorting

  Previous releases - regressions:

   - ipv4: ping: fix bind address validity check

   - tipc: fix use-after-free read in tipc_named_reinit

   - eth: veth: add updating of trans_start

  Previous releases - always broken:

   - sock: redo the psock vs ULP protection check

   - netfilter: nf_dup_netdev: fix skb_under_panic

   - bpf: fix request_sock leak in sk lookup helpers

   - eth: igb: fix a use-after-free issue in igb_clean_tx_ring

   - eth: ice: prohibit improper channel config for DCB

   - eth: at803x: fix null pointer dereference on AR9331 phy

   - eth: virtio_net: fix xdp_rxq_info bug after suspend/resume

  Misc:

   - eth: hinic: replace memcpy() with direct assignment"

* tag 'net-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
  net: openvswitch: fix parsing of nw_proto for IPv6 fragments
  sock: redo the psock vs ULP protection check
  Revert "net/tls: fix tls_sk_proto_close executed repeatedly"
  virtio_net: fix xdp_rxq_info bug after suspend/resume
  igb: Make DMA faster when CPU is active on the PCIe link
  net: dsa: qca8k: reduce mgmt ethernet timeout
  net: dsa: qca8k: reset cpu port on MTU change
  MAINTAINERS: Add a maintainer for OCP Time Card
  hinic: Replace memcpy() with direct assignment
  Revert "drivers/net/ethernet/neterion/vxge: Fix a use-after-free bug in vxge-main.c"
  net: phy: smsc: Disable Energy Detect Power-Down in interrupt mode
  ice: ethtool: Prohibit improper channel config for DCB
  ice: ethtool: advertise 1000M speeds properly
  ice: Fix switchdev rules book keeping
  ice: ignore protocol field in GTP offload
  netfilter: nf_dup_netdev: add and use recursion counter
  netfilter: nf_dup_netdev: do not push mac header a second time
  selftests: netfilter: correct PKTGEN_SCRIPT_PATHS in nft_concat_range.sh
  net/tls: fix tls_sk_proto_close executed repeatedly
  erspan: do not assume transport header is always set
  ...
2022-06-23 09:01:01 -05:00
Linus Torvalds
ca1fdab7fd First batch of EFI fixes for v5.19
- remove pointless include of asm/efi.h, which does not exist on ia64
 - fix DXE service marshalling prototype for mixed mode
 -----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmKx7iAACgkQw08iOZLZ
 jyTwxgv/dnQ+zEsHGSVmfQkucP+Oj49gzNgX4EaeRly9WvpUCj4Nar4aAjCRBaur
 Jcx2aX0z8fnGzh4CtqfI/J5BzU4fHHuLm1sRXkCCNMkMbVscdTghGiYIPCkXlHYS
 cgQmqzM/vM2lGu2seDfib0ZCbsBhFgSfq9mk9BAHzK+zxdn70fUm+gjL5oGUAXyC
 hJOvJJEom0nfhE0SL88EmRiKcborny98+bkaMb5U+309LkISY0o8RjMxgFK6V/q2
 jxUxw1MQgzdGEjCy46A4SJ5wIpqGIzB5FlU6JXSWsDP8brIADmimlSjp36FRDvti
 1Td1Vjy7a3FSrRHJJXM2Tm5JJJOdFJywvIcNZT3UzWvwbVFHriivwjbIxZwFQUIo
 4TRdrQkkdnAGK/e9nc0ZGUT9/D5ooVh+mkLCxRuems8l8J052+DSL6vqeM8OJh1c
 xZnmSC76ouNURPJdLgjfniVFcfygC9XhZl6lK+Od654DN0+eYzgm9y55BvevTA+P
 NIONRRTU
 =4yyi
 -----END PGP SIGNATURE-----

Merge tag 'efi-urgent-for-v5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fixes from Ard Biesheuvel:

 - remove pointless include of asm/efi.h, which does not exist on ia64

 - fix DXE service marshalling prototype for mixed mode

* tag 'efi-urgent-for-v5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi/x86: libstub: Fix typo in __efi64_argmap* name
  efi: sysfb_efi: remove unnecessary <asm/efi.h> include
2022-06-21 12:20:11 -05:00
Evgeniy Baskov
aa6d1ed107 efi/x86: libstub: Fix typo in __efi64_argmap* name
The actual name of the DXE services function used
is set_memory_space_attributes(), not set_memory_space_descriptor().

Change EFI mixed mode helper macro name to match the function name.

Fixes: 31f1a0edff ("efi/x86: libstub: Make DXE calls mixed mode safe")
Signed-off-by: Evgeniy Baskov <baskov@ispras.ru>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2022-06-21 18:11:46 +02:00
Linus Torvalds
05c6ca8512 X86 updates:
- Make RESERVE_BRK() work again with older binutils. The recent
    'simplification' broke that.
 
  - Make early #VE handling increment RIP when successful.
 
  - Make the #VE code consistent vs. the RIP adjustments and add comments.
 
  - Handle load_unaligned_zeropad() across page boundaries correctly in #VE
    when the second page is shared.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKvIG4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaqCD/9NAUyHTjKDqdWuMD/ITU8ymDr+Ix8z
 vUlysdXbxJg6MvT12ZbhJUFTKAsXskGAAnXz/EtZ8zTQQVzTjis/HooJh4XLeuO4
 NLh9KV9FvH7w69e6Jg31MGkOUJU3BV+WYUx1f34zbQ8FHftxUwu+M47UYExPYKDR
 VIbNeQIpqoBfjTSPVGXlWl/panuZG6RV+PRcvxV3yeRRA8zyCB/WTmNkoDjbw4fl
 YCWwJF7/m4iT3LtoaFXWVGFzSRZoGHbhSdgEOZGIZ7sjvydoaQo402JuhW3WLI2m
 oXLVZ+2wOPGBKp3WQ1t3mpfScBvCiN3SW4pSPDQ+E8fT/RQiRMb29c9S6ANdm3nT
 27fYMJOq+xxex5gOYzdgLz7O99M08uOn2bxJwB+IBIr5jEFH9b4EffeEWsfdZBsi
 1AzkXCi+Ib0ZYAndxUP068m+4iW0LtuApm0fg6LhtdDmBGquj+88OZOUK7Z/kW/N
 IkjgCeqFgmdNb/+Z3XrdYobaAl6J4toIqA4A+O8yL6gJfn9PnaMGsYtA8c5yQchD
 kFoTu5pCALY2KjZkKFRMuEbMH2oj3sjjb7f6mYAHxec6jikIx2c5HswA4sLmzHAN
 GG2MDUH12bWoLfeA4IRwTRz/vh8IeZNq5ZzdCnS6KHUNk5OJRGLtRphKy8z+pOYx
 +i9ThZFBV8pBzg==
 =sRtG
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:

 - Make RESERVE_BRK() work again with older binutils. The recent
   'simplification' broke that.

 - Make early #VE handling increment RIP when successful.

 - Make the #VE code consistent vs. the RIP adjustments and add
   comments.

 - Handle load_unaligned_zeropad() across page boundaries correctly in
   #VE when the second page is shared.

* tag 'x86-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
  x86/tdx: Clarify RIP adjustments in #VE handler
  x86/tdx: Fix early #VE handling
  x86/mm: Fix RESERVE_BRK() for older binutils
2022-06-19 09:58:28 -05:00
Linus Torvalds
5d770f11a1 Build tool updates:
- Remove obsolete CONFIG_X86_SMAP reference from objtool
 
  - Fix overlapping text section failures in faddr2line for real
 
  - Remove OBJECT_FILES_NON_STANDARD usage from x86 ftrace and replace it
    with finegrained annotations so objtool can validate that code
    correctly.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKvHbwTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoQlsEADC7BotE1Mi8HITScIlHT19bkZ7Bm6M
 g46RCtUw0u+lo7BclIetNYGWQ0z+D+1DbmZTBv5D22N/Wd6MrwOuuwRVtJ2dEosF
 l/EK1qKLxCGMdQ1x0KOmrjzHB4WmcUH0pQwDHa5N5s5IyrJRRFHtoLk3EYp6QIhi
 mIZTJgGUEe/aS7BtFM5xARprm7JxdbhXgXVbnC1ctzpOi2y6O7F4Ek3x4j5zTGV/
 49iuO2ljqmKdoFlPa1fhuw+O9sTUf+RmKLPEU+2nm7Wg+wlIwKIB3yDDBoZ9Q7lm
 YdorW0/506x4qOwa8KAjxXEb2qgZkL/fQcMS0jQQKNNDD8GHU+3YdjCNbrmZjA+T
 Ib0rs9YsxgLstnmqzU3QfarhgtdzIMhzF0cofoFDK0a8tHLqOhZF4j7wLmYo52ec
 6Hrt8IvJPW2b3TC6KfRAF+uWsDDoaum0OOFVjVu1b9G6fFf0i8h5JGWdkJh8cau9
 0qvAYg0sDjRf7zAZ6kbbGDVCljECyypIdCBF3ihA6yMhJX16VihEWTJyqIaoO+a0
 0XzPvkhX+6mY0kqn+5HvvhLEY4POA8pFKeTPx4eABvKnw7m01I9cJfHa7HEBQHe6
 ZJA7qWMBRrPfK2Ogrx8zohj2sOdHMHlIQU+6Ai+7+BHs13Nje4umGMu2YuZrnfIF
 gQ1ayCN8OsUwVA==
 =12KJ
 -----END PGP SIGNATURE-----

Merge tag 'objtool-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull build tooling updates from Thomas Gleixner:

 - Remove obsolete CONFIG_X86_SMAP reference from objtool

 - Fix overlapping text section failures in faddr2line for real

 - Remove OBJECT_FILES_NON_STANDARD usage from x86 ftrace and replace it
   with finegrained annotations so objtool can validate that code
   correctly.

* tag 'objtool-urgent-2022-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/ftrace: Remove OBJECT_FILES_NON_STANDARD usage
  faddr2line: Fix overlapping text section failures, the sequel
  objtool: Fix obsolete reference to CONFIG_X86_SMAP
2022-06-19 09:54:16 -05:00
Jakub Kicinski
582573f1b2 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2022-06-17

We've added 12 non-merge commits during the last 4 day(s) which contain
a total of 14 files changed, 305 insertions(+), 107 deletions(-).

The main changes are:

1) Fix x86 JIT tailcall count offset on BPF-2-BPF call, from Jakub Sitnicki.

2) Fix a kprobe_multi link bug which misplaces BPF cookies, from Jiri Olsa.

3) Fix an infinite loop when processing a module's BTF, from Kumar Kartikeya Dwivedi.

4) Fix getting a rethook only in RCU available context, from Masami Hiramatsu.

5) Fix request socket refcount leak in sk lookup helpers, from Jon Maxwell.

6) Fix xsk xmit behavior which wrongly adds skb to already full cq, from Ciara Loftus.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  rethook: Reject getting a rethook if RCU is not watching
  fprobe, samples: Add use_trace option and show hit/missed counter
  bpf, docs: Update some of the JIT/maintenance entries
  selftest/bpf: Fix kprobe_multi bench test
  bpf: Force cookies array to follow symbols sorting
  ftrace: Keep address offset in ftrace_lookup_symbols
  selftests/bpf: Shuffle cookies symbols in kprobe multi test
  selftests/bpf: Test tail call counting with bpf2bpf and data on stack
  bpf, x86: Fix tail call count offset calculation on bpf2bpf call
  bpf: Limit maximum modifier chain length in btf_check_type_tags
  bpf: Fix request_sock leak in sk lookup helpers
  xsk: Fix generic transmit when completion queue reservation fails
====================

Link: https://lore.kernel.org/r/20220617202119.2421-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-06-17 18:30:01 -07:00
Kirill A. Shutemov
1e7769653b x86/tdx: Handle load_unaligned_zeropad() page-cross to a shared page
load_unaligned_zeropad() can lead to unwanted loads across page boundaries.
The unwanted loads are typically harmless. But, they might be made to
totally unrelated or even unmapped memory. load_unaligned_zeropad()
relies on exception fixup (#PF, #GP and now #VE) to recover from these
unwanted loads.

In TDX guests, the second page can be shared page and a VMM may configure
it to trigger #VE.

The kernel assumes that #VE on a shared page is an MMIO access and tries to
decode instruction to handle it. In case of load_unaligned_zeropad() it
may result in confusion as it is not MMIO access.

Fix it by detecting split page MMIO accesses and failing them.
load_unaligned_zeropad() will recover using exception fixups.

The issue was discovered by analysis and reproduced artificially. It was
not triggered during testing.

[ dhansen: fix up changelogs and comments for grammar and clarity,
	   plus incorporate Kirill's off-by-one fix]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220614120135.14812-4-kirill.shutemov@linux.intel.com
2022-06-17 15:37:33 -07:00
Linus Torvalds
32efdbffff pci-v5.19-fixes-2
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmKs17UUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vzlVQ//Qygje8u1x0qQqGFykMf3+cev2ECV
 /sfOHtRoG9qZ4gKQOGShr9gyeV+9VFl0+CEJE4z9/YiAnoOr5JMAMixYRjHQU9MB
 k3TFkDqD0KBgmXdAubVP5HQGZgA+mryvtEhr5rm45PooJJSjuh1ds87YYO+Z/t7s
 c2AzvpHFLjECB6LRHqHieyp4CeWn5tw8im7uMUmfKkkXF5ckqw3e+7gwJzRrAukg
 GgbLb2JZLzXSl1HOx/2GvPW8RzyXRbbJmpvu9LsNKeoqP006F3chDVIncqG3Q9QQ
 LvrjudC/eY/2Fee3tpP6gjl9A5ALvXT/k4gTw4Lwm8OlaxrYZ3gwcBHZepp3F08s
 qVCncjFooeHAMiJDkGvtf6N8k8VnOy4zvg2qDpKOl2NO3jH95nGi0LGMf0/GXvfh
 P4bwqjGcWKSo9C2amagZ49rzaJBIQRms8ItM6WPvCYirjyYi3PeUMl/dylstbnuq
 DQuprZtgOEGlPGUg/CO4fCpCIkKzpvOV6157z39mZS5HOT5ugvF4k+hwSxZd0hsM
 rJI7Te48Z46Y/qoFesgOglJwwEZs4RqHAvMTGC+V5Ftj9gHe7wo2oOxlhCyFW1Qd
 jec3UhXEzTVxnjsu7peHyfwwmGWFUkG16P17hWpncpG6azW1dybQOCsRy21/F+q2
 i1nd61lXpuhKQKQ=
 =5y1d
 -----END PGP SIGNATURE-----

Merge tag 'pci-v5.19-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull pci fix from Bjorn Helgaas:
 "Revert clipping of PCI host bridge windows to avoid E820 regions,
  which broke several machines by forcing unnecessary BAR reassignments
  (Hans de Goede)"

* tag 'pci-v5.19-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
  x86/PCI: Revert "x86/PCI: Clip only host bridge windows for E820 regions"
2022-06-17 15:12:20 -05:00
Hans de Goede
a2b36ffbf5 x86/PCI: Revert "x86/PCI: Clip only host bridge windows for E820 regions"
This reverts commit 4c5e242d3e.

Prior to 4c5e242d3e ("x86/PCI: Clip only host bridge windows for E820
regions"), E820 regions did not affect PCI host bridge windows.  We only
looked at E820 regions and avoided them when allocating new MMIO space.
If firmware PCI bridge window and BAR assignments used E820 regions, we
left them alone.

After 4c5e242d3e, we removed E820 regions from the PCI host bridge
windows before looking at BARs, so firmware assignments in E820 regions
looked like errors, and we moved things around to fit in the space left
(if any) after removing the E820 regions.  This unnecessary BAR
reassignment broke several machines.

Guilherme reported that Steam Deck fails to boot after 4c5e242d3e.  We
clipped the window that contained most 32-bit BARs:

  BIOS-e820: [mem 0x00000000a0000000-0x00000000a00fffff] reserved
  acpi PNP0A08:00: clipped [mem 0x80000000-0xf7ffffff window] to [mem 0xa0100000-0xf7ffffff window] for e820 entry [mem 0xa0000000-0xa00fffff]

which forced us to reassign all those BARs, for example, this NVMe BAR:

  pci 0000:00:01.2: PCI bridge to [bus 01]
  pci 0000:00:01.2:   bridge window [mem 0x80600000-0x806fffff]
  pci 0000:01:00.0: BAR 0: [mem 0x80600000-0x80603fff 64bit]
  pci 0000:00:01.2: can't claim window [mem 0x80600000-0x806fffff]: no compatible bridge window
  pci 0000:01:00.0: can't claim BAR 0 [mem 0x80600000-0x80603fff 64bit]: no compatible bridge window

  pci 0000:00:01.2: bridge window: assigned [mem 0xa0100000-0xa01fffff]
  pci 0000:01:00.0: BAR 0: assigned [mem 0xa0100000-0xa0103fff 64bit]

All the reassignments were successful, so the devices should have been
functional at the new addresses, but some were not.

Andy reported a similar failure on an Intel MID platform.  Benjamin
reported a similar failure on a VMWare Fusion VM.

Note: this is not a clean revert; this revert keeps the later change to
make the clipping dependent on a new pci_use_e820 bool, moving the checking
of this bool to arch_remove_reservations().

[bhelgaas: commit log, add more reporters and testers]
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216109
Reported-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Reported-by: Jongman Heo <jongman.heo@gmail.com>
Fixes: 4c5e242d3e ("x86/PCI: Clip only host bridge windows for E820 regions")
Link: https://lore.kernel.org/r/20220612144325.85366-1-hdegoede@redhat.com
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2022-06-17 14:24:14 -05:00
Linus Torvalds
2d806a688f hyperv-fixes for 5.19-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEIbPD0id6easf0xsudhRwX5BBoF4FAmKsjCwTHHdlaS5saXVA
 a2VybmVsLm9yZwAKCRB2FHBfkEGgXpAqB/0akBXX+rxp/MzKKiuzOa53rV5IqPe6
 dbDhJtsyCeX2xiNtiDYqCsCmKRt+rmQgzrKykRLCeQatrLmjbUzEZYo1QmpmdFjK
 icZtYdso1FEW6v4h1j+p7dl/l3nqXWcG9gjCDecma9ZxXnpzNw5O9kBCpUyWGzwS
 85RJy2X5n7gyqiAyrDD5MSiUAL2n2FG2lHwK4EYjXsfPJEesx/gh8DEzVkMJWSqa
 nPyvtnPxGSDr22KTymF/Z2BJSYMcxaU528Ztto+DE8NhVWKk99eiMpyXagH3LygD
 gOHmnRS23eJhw/Mq44IHGf7t94bFuUBQraVvaGnaXGLoLv946oNMfLhl
 =miu9
 -----END PGP SIGNATURE-----

Merge tag 'hyperv-fixes-signed-20220617' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull hyperv fixes from Wei Liu:

 - Fix hv_init_clocksource annotation (Masahiro Yamada)

 - Two bug fixes for vmbus driver (Saurabh Sengar)

 - Fix SEV negotiation (Tianyu Lan)

 - Fix comments in code (Xiang Wang)

 - One minor fix to HID driver (Michael Kelley)

* tag 'hyperv-fixes-signed-20220617' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  x86/Hyper-V: Add SEV negotiate protocol support in Isolation VM
  Drivers: hv: vmbus: Release cpu lock in error case
  HID: hyperv: Correctly access fields declared as __le16
  clocksource: hyper-v: unexport __init-annotated hv_init_clocksource()
  Drivers: hv: Fix syntax errors in comments
  Drivers: hv: vmbus: Don't assign VMbus channel interrupts to isolated CPUs
2022-06-17 13:39:12 -05:00
Peter Xu
d92725256b mm: avoid unnecessary page fault retires on shared memory types
I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page.  It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).

Then after that throttling we return VM_FAULT_RETRY.

We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.

However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.

It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.

To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.

To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock.  It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.

This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:

  Before: 650.980 ms (+-1.94%)
  After:  569.396 ms (+-1.38%)

I believe it could help more than that.

We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.

Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.

I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.

Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>	[arm part]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Will Deacon <will@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Chris Zankel <chris@zankel.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Helge Deller <deller@gmx.de>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16 19:48:27 -07:00
Peter Xu
cd16dd0373 mm/x86: remove dead code for hugetlbpage.c
It seems to exist since the old times and never used once.  Remove them.

Link: https://lkml.kernel.org/r/20220525195220.10241-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16 19:48:27 -07:00
Jakub Sitnicki
ff672c67ee bpf, x86: Fix tail call count offset calculation on bpf2bpf call
On x86-64 the tail call count is passed from one BPF function to another
through %rax. Additionally, on function entry, the tail call count value
is stored on stack right after the BPF program stack, due to register
shortage.

The stored count is later loaded from stack either when performing a tail
call - to check if we have not reached the tail call limit - or before
calling another BPF function call in order to pass it via %rax.

In the latter case, we miscalculate the offset at which the tail call count
was stored on function entry. The JIT does not take into account that the
allocated BPF program stack is always a multiple of 8 on x86, while the
actual stack depth does not have to be.

This leads to a load from an offset that belongs to the BPF stack, as shown
in the example below:

SEC("tc")
int entry(struct __sk_buff *skb)
{
	/* Have data on stack which size is not a multiple of 8 */
	volatile char arr[1] = {};
	return subprog_tail(skb);
}

int entry(struct __sk_buff * skb):
   0: (b4) w2 = 0
   1: (73) *(u8 *)(r10 -1) = r2
   2: (85) call pc+1#bpf_prog_ce2f79bb5f3e06dd_F
   3: (95) exit

int entry(struct __sk_buff * skb):
   0xffffffffa0201788:  nop    DWORD PTR [rax+rax*1+0x0]
   0xffffffffa020178d:  xor    eax,eax
   0xffffffffa020178f:  push   rbp
   0xffffffffa0201790:  mov    rbp,rsp
   0xffffffffa0201793:  sub    rsp,0x8
   0xffffffffa020179a:  push   rax
   0xffffffffa020179b:  xor    esi,esi
   0xffffffffa020179d:  mov    BYTE PTR [rbp-0x1],sil
   0xffffffffa02017a1:  mov    rax,QWORD PTR [rbp-0x9]	!!! tail call count
   0xffffffffa02017a8:  call   0xffffffffa02017d8       !!! is at rbp-0x10
   0xffffffffa02017ad:  leave
   0xffffffffa02017ae:  ret

Fix it by rounding up the BPF stack depth to a multiple of 8, when
calculating the tail call count offset on stack.

Fixes: ebf7d1f508 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220616162037.535469-2-jakub@cloudflare.com
2022-06-16 21:48:24 +02:00
Tianyu Lan
49d6a3c062 x86/Hyper-V: Add SEV negotiate protocol support in Isolation VM
Hyper-V Isolation VM current code uses sev_es_ghcb_hv_call()
to read/write MSR via GHCB page and depends on the sev code.
This may cause regression when sev code changes interface
design.

The latest SEV-ES code requires to negotiate GHCB version before
reading/writing MSR via GHCB page and sev_es_ghcb_hv_call() doesn't
work for Hyper-V Isolation VM. Add Hyper-V ghcb related implementation
to decouple SEV and Hyper-V code. Negotiate GHCB version in the
hyperv_init() and use the version to communicate with Hyper-V
in the ghcb hv call function.

Fixes: 2ea29c5abb ("x86/sev: Save the negotiated GHCB version")
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20220614014553.1915929-1-ltykernel@gmail.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
2022-06-15 18:27:40 +00:00
Kirill A. Shutemov
cdd85786f4 x86/tdx: Clarify RIP adjustments in #VE handler
After successful #VE handling, tdx_handle_virt_exception() has to move
RIP to the next instruction. The handler needs to know the length of the
instruction.

If the #VE happened due to instruction execution, the GET_VEINFO TDX
module call provides info on the instruction in R10, including its length.

For #VE due to EPT violation, the info in R10 is not populand and the
kernel must decode the instruction manually to find out its length.

Restructure the code to make it explicit that the instruction length
depends on the type of #VE. Make individual #VE handlers return
the instruction length on success or -errno on failure.

[ dhansen: fix up changelog and comments ]

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220614120135.14812-3-kirill.shutemov@linux.intel.com
2022-06-15 11:05:16 -07:00
Kirill A. Shutemov
60428d8bc2 x86/tdx: Fix early #VE handling
tdx_early_handle_ve() does not increment RIP after successfully
handling the exception.  That leads to infinite loop of exceptions.

Move RIP when exceptions are successfully handled.

[ dhansen: make problem statement more clear ]

Fixes: 32e72854fa ("x86/tdx: Port I/O: Add early boot support")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lkml.kernel.org/r/20220614120135.14812-2-kirill.shutemov@linux.intel.com
2022-06-15 10:52:59 -07:00
Linus Torvalds
24625f7d91 ARM64:
* Properly reset the SVE/SME flags on vcpu load
 
 * Fix a vgic-v2 regression regarding accessing the pending
 state of a HW interrupt from userspace (and make the code
 common with vgic-v3)
 
 * Fix access to the idreg range for protected guests
 
 * Ignore 'kvm-arm.mode=protected' when using VHE
 
 * Return an error from kvm_arch_init_vm() on allocation failure
 
 * A bunch of small cleanups (comments, annotations, indentation)
 
 RISC-V:
 
 * Typo fix in arch/riscv/kvm/vmid.c
 
 * Remove broken reference pattern from MAINTAINERS entry
 
 x86-64:
 
 * Fix error in page tables with MKTME enabled
 
 * Dirty page tracking performance test extended to running a nested
   guest
 
 * Disable APICv/AVIC in cases that it cannot implement correctly
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmKjTIAUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNhPQgAiIVtp8aepujUM/NhkNyK3SIdLzlS
 oZCZiS6bvaecKXi/QvhBU0EBxAEyrovk3lmVuYNd41xI+PDjyaA4SDIl5DnToGUw
 bVPNFSYqjpF939vUUKjc0RCdZR4o5g3Od3tvWoHTHviS1a8aAe5o9pcpHpD0D6Mp
 Gc/o58nKAOPl3htcFKmjymqo3Y6yvkJU9NB7DCbL8T5mp5pJ959Mw1/LlmBaAzJC
 OofrynUm4NjMyAj/mAB1FhHKFyQfjBXLhiVlS0SLiiEA/tn9/OXyVFMKG+n5VkAZ
 Q337GMFe2RikEIuMEr3Rc4qbZK3PpxHhaj+6MPRuM0ho/P4yzl2Nyb/OhA==
 =h81Q
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "While last week's pull request contained miscellaneous fixes for x86,
  this one covers other architectures, selftests changes, and a bigger
  series for APIC virtualization bugs that were discovered during 5.20
  development. The idea is to base 5.20 development for KVM on top of
  this tag.

  ARM64:

   - Properly reset the SVE/SME flags on vcpu load

   - Fix a vgic-v2 regression regarding accessing the pending state of a
     HW interrupt from userspace (and make the code common with vgic-v3)

   - Fix access to the idreg range for protected guests

   - Ignore 'kvm-arm.mode=protected' when using VHE

   - Return an error from kvm_arch_init_vm() on allocation failure

   - A bunch of small cleanups (comments, annotations, indentation)

  RISC-V:

   - Typo fix in arch/riscv/kvm/vmid.c

   - Remove broken reference pattern from MAINTAINERS entry

  x86-64:

   - Fix error in page tables with MKTME enabled

   - Dirty page tracking performance test extended to running a nested
     guest

   - Disable APICv/AVIC in cases that it cannot implement correctly"

[ This merge also fixes a misplaced end parenthesis bug introduced in
  commit 3743c2f025 ("KVM: x86: inhibit APICv/AVIC on changes to APIC
  ID or APIC base") pointed out by Sean Christopherson ]

Link: https://lore.kernel.org/all/20220610191813.371682-1-seanjc@google.com/

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (34 commits)
  KVM: selftests: Restrict test region to 48-bit physical addresses when using nested
  KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2
  KVM: selftests: Clean up LIBKVM files in Makefile
  KVM: selftests: Link selftests directly with lib object files
  KVM: selftests: Drop unnecessary rule for STATIC_LIBS
  KVM: selftests: Add a helper to check EPT/VPID capabilities
  KVM: selftests: Move VMX_EPT_VPID_CAP_AD_BITS to vmx.h
  KVM: selftests: Refactor nested_map() to specify target level
  KVM: selftests: Drop stale function parameter comment for nested_map()
  KVM: selftests: Add option to create 2M and 1G EPT mappings
  KVM: selftests: Replace x86_page_size with PG_LEVEL_XX
  KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE
  KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put
  KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking
  KVM: x86: disable preemption while updating apicv inhibition
  KVM: x86: SVM: fix avic_kick_target_vcpus_fast
  KVM: x86: SVM: remove avic's broken code that updated APIC ID
  KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base
  KVM: x86: document AVIC/APICv inhibit reasons
  KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs
  ...
2022-06-14 07:57:18 -07:00
Linus Torvalds
8e8afafb0b Yet another hw vulnerability with a software mitigation: Processor MMIO
Stale Data.
 
 They are a class of MMIO-related weaknesses which can expose stale data
 by propagating it into core fill buffers. Data which can then be leaked
 using the usual speculative execution methods.
 
 Mitigations include this set along with microcode updates and are
 similar to MDS and TAA vulnerabilities: VERW now clears those buffers
 too.
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKXMkMTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWGPD/idalLIhhV5F2+hZIKm0WSnsBxAOh9K
 7y8xBxpQQ5FUfW3vm7Pg3ro6VJp7w2CzKoD4lGXzGHriusn3qst3vkza9Ay8xu8g
 RDwKe6hI+p+Il9BV9op3f8FiRLP9bcPMMReW/mRyYsOnJe59hVNwRAL8OG40PY4k
 hZgg4Psfvfx8bwiye5efjMSe4fXV7BUCkr601+8kVJoiaoszkux9mqP+cnnB5P3H
 zW1d1jx7d6eV1Y063h7WgiNqQRYv0bROZP5BJkufIoOHUXDpd65IRF3bDnCIvSEz
 KkMYJNXb3qh7EQeHS53NL+gz2EBQt+Tq1VH256qn6i3mcHs85HvC68gVrAkfVHJE
 QLJE3MoXWOqw+mhwzCRrEXN9O1lT/PqDWw8I4M/5KtGG/KnJs+bygmfKBbKjIVg4
 2yQWfMmOgQsw3GWCRjgEli7aYbDJQjany0K/qZTq54I41gu+TV8YMccaWcXgDKrm
 cXFGUfOg4gBm4IRjJ/RJn+mUv6u+/3sLVqsaFTs9aiib1dpBSSUuMGBh548Ft7g2
 5VbFVSDaLjB2BdlcG7enlsmtzw0ltNssmqg7jTK/L7XNVnvxwUoXw+zP7RmCLEYt
 UV4FHXraMKNt2ZketlomC8ui2hg73ylUp4pPdMXCp7PIXp9sVamRTbpz12h689VJ
 /s55bWxHkR6S
 =LBxT
 -----END PGP SIGNATURE-----

Merge tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 MMIO stale data fixes from Thomas Gleixner:
 "Yet another hw vulnerability with a software mitigation: Processor
  MMIO Stale Data.

  They are a class of MMIO-related weaknesses which can expose stale
  data by propagating it into core fill buffers. Data which can then be
  leaked using the usual speculative execution methods.

  Mitigations include this set along with microcode updates and are
  similar to MDS and TAA vulnerabilities: VERW now clears those buffers
  too"

* tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation/mmio: Print SMT warning
  KVM: x86/speculation: Disable Fill buffer clear within guests
  x86/speculation/mmio: Reuse SRBDS mitigation for SBDS
  x86/speculation/srbds: Update SRBDS mitigation selection
  x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data
  x86/speculation/mmio: Enable CPU Fill buffer clearing on idle
  x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations
  x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data
  x86/speculation: Add a common function for MD_CLEAR mitigation update
  x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug
  Documentation: Add documentation for Processor MMIO Stale Data
2022-06-14 07:43:15 -07:00
Josh Poimboeuf
e32683c6f7 x86/mm: Fix RESERVE_BRK() for older binutils
With binutils 2.26, RESERVE_BRK() causes a build failure:

  /tmp/ccnGOKZ5.s: Assembler messages:
  /tmp/ccnGOKZ5.s:98: Error: missing ')'
  /tmp/ccnGOKZ5.s:98: Error: missing ')'
  /tmp/ccnGOKZ5.s:98: Error: missing ')'
  /tmp/ccnGOKZ5.s:98: Error: junk at end of line, first unrecognized
  character is `U'

The problem is this line:

  RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE)

Specifically, the INIT_PGT_BUF_SIZE macro which (via PAGE_SIZE's use
_AC()) has a "1UL", which makes older versions of the assembler unhappy.
Unfortunately the _AC() macro doesn't work for inline asm.

Inline asm was only needed here to convince the toolchain to add the
STT_NOBITS flag.  However, if a C variable is placed in a section whose
name is prefixed with ".bss", GCC and Clang automatically set
STT_NOBITS.  In fact, ".bss..page_aligned" already relies on this trick.

So fix the build failure (and simplify the macro) by allocating the
variable in C.

Also, add NOLOAD to the ".brk" output section clause in the linker
script.  This is a failsafe in case the ".bss" prefix magic trick ever
stops working somehow.  If there's a section type mismatch, the GNU
linker will force the ".brk" output section to be STT_NOBITS.  The LLVM
linker will fail with a "section type mismatch" error.

Note this also changes the name of the variable from .brk.##name to
__brk_##name.  The variable names aren't actually used anywhere, so it's
harmless.

Fixes: a1e2c031ec ("x86/mm: Simplify RESERVE_BRK()")
Reported-by: Joe Damato <jdamato@fastly.com>
Reported-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Joe Damato <jdamato@fastly.com>
Link: https://lore.kernel.org/r/22d07a44c80d8e8e1e82b9a806ddc8c6bbb2606e.1654759036.git.jpoimboe@kernel.org
2022-06-13 10:15:04 +02:00
Linus Torvalds
f2ecc964b9 xen: branch for v5.19-rc2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYqLLcQAKCRCAXGG7T9hj
 vggVAP0Wjf81TxvIGrwtcn1q2LW7xyNp5TQk1T6GYx40sHXWlQD/SCKl5iorSGn+
 +g1eXYmOpnSIAcMX3B3T4Ra8uo3mnA8=
 =PnB/
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-5.19a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:

 - a small cleanup removing "export" of an __init function

 - a small series adding a new infrastructure for platform flags

 - a series adding generic virtio support for Xen guests (frontend side)

* tag 'for-linus-5.19a-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: unexport __init-annotated xen_xlate_map_ballooned_pages()
  arm/xen: Assign xen-grant DMA ops for xen-grant DMA devices
  xen/grant-dma-ops: Retrieve the ID of backend's domain for DT devices
  xen/grant-dma-iommu: Introduce stub IOMMU driver
  dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops
  xen/virtio: Enable restricted memory access using Xen grant mappings
  xen/grant-dma-ops: Add option to restrict memory access under Xen
  xen/grants: support allocating consecutive grants
  arm/xen: Introduce xen_setup_dma_ops()
  virtio: replace arch_has_restricted_virtio_memory_access()
  kernel: add platform_has() infrastructure
2022-06-10 09:57:11 -07:00
Paolo Bonzini
e3cdaab5ff KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE
Commit 74fd41ed16 ("KVM: x86: nSVM: support PAUSE filtering when L0
doesn't intercept PAUSE") introduced passthrough support for nested pause
filtering, (when the host doesn't intercept PAUSE) (either disabled with
kvm module param, or disabled with '-overcommit cpu-pm=on')

Before this commit, L1 KVM didn't intercept PAUSE at all; afterwards,
the feature was exposed as supported by KVM cpuid unconditionally, thus
if L1 could try to use it even when the L0 KVM can't really support it.

In this case the fallback caused KVM to intercept each PAUSE instruction;
in some cases, such intercept can slow down the nested guest so much
that it can fail to boot.  Instead, before the problematic commit KVM
was already setting both thresholds to 0 in vmcb02, but after the first
userspace VM exit shrink_ple_window was called and would reset the
pause_filter_count to the default value.

To fix this, change the fallback strategy - ignore the guest threshold
values, but use/update the host threshold values unless the guest
specifically requests disabling PAUSE filtering (either simple or
advanced).

Also fix a minor bug: on nested VM exit, when PAUSE filter counter
were copied back to vmcb01, a dirty bit was not set.

Thanks a lot to Suravee Suthikulpanit for debugging this!

Fixes: 74fd41ed16 ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE")
Reported-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220518072709.730031-1-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:21 -04:00
Maxim Levitsky
ba8ec27324 KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put
Now that these functions are always called with preemption disabled,
remove the preempt_disable()/preempt_enable() pair inside them.

No functional change intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-8-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:20 -04:00
Maxim Levitsky
66c768d30e KVM: x86: disable preemption while updating apicv inhibition
Currently nothing prevents preemption in kvm_vcpu_update_apicv.

On SVM, If the preemption happens after we update the
vcpu->arch.apicv_active, the preemption itself will
'update' the inhibition since the AVIC will be first disabled
on vCPU unload and then enabled, when the current task
is loaded again.

Then we will try to update it again, which will lead to a warning
in __avic_vcpu_load, that the AVIC is already enabled.

Fix this by disabling preemption in this code.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-6-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:19 -04:00
Maxim Levitsky
603ccef42c KVM: x86: SVM: fix avic_kick_target_vcpus_fast
There are two issues in avic_kick_target_vcpus_fast

1. It is legal to issue an IPI request with APIC_DEST_NOSHORT
   and a physical destination of 0xFF (or 0xFFFFFFFF in case of x2apic),
   which must be treated as a broadcast destination.

   Fix this by explicitly checking for it.
   Also don’t use ‘index’ in this case as it gives no new information.

2. It is legal to issue a logical IPI request to more than one target.
   Index field only provides index in physical id table of first
   such target and therefore can't be used before we are sure
   that only a single target was addressed.

   Instead, parse the ICRL/ICRH, double check that a unicast interrupt
   was requested, and use that info to figure out the physical id
   of the target vCPU.
   At that point there is no need to use the index field as well.

In addition to fixing the above	issues,	also skip the call to
kvm_apic_match_dest.

It is possible to do this now, because now as long as AVIC is not
inhibited, it is guaranteed that none of the vCPUs changed their
apic id from its default value.

This fixes boot of windows guest with AVIC enabled because it uses
IPI with 0xFF destination and no destination shorthand.

Fixes: 7223fd2d53 ("KVM: SVM: Use target APIC ID to complete AVIC IRQs when possible")
Cc: stable@vger.kernel.org

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:19 -04:00
Maxim Levitsky
f5f9089f76 KVM: x86: SVM: remove avic's broken code that updated APIC ID
AVIC is now inhibited if the guest changes the apic id,
and therefore this code is no longer needed.

There are several ways this code was broken, including:

1. a vCPU was only allowed to change its apic id to an apic id
of an existing vCPU.

2. After such change, the vCPU whose apic id entry was overwritten,
could not correctly change its own apic id, because its own
entry is already overwritten.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:18 -04:00
Maxim Levitsky
3743c2f025 KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base
Neither of these settings should be changed by the guest and it is
a burden to support it in the acceleration code, so just inhibit
this code instead.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:18 -04:00
Maxim Levitsky
a9603ae0e4 KVM: x86: document AVIC/APICv inhibit reasons
These days there are too many AVIC/APICv inhibit
reasons, and it doesn't hurt to have some documentation
for them.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:17 -04:00
Yuan Yao
d2263de137 KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs
Assign shadow_me_value, not shadow_me_mask, to PAE root entries,
a.k.a. shadow PDPTRs, when host memory encryption is supported.  The
"mask" is the set of all possible memory encryption bits, e.g. MKTME
KeyIDs, whereas "value" holds the actual value that needs to be
stuffed into host page tables.

Using shadow_me_mask results in a failed VM-Entry due to setting
reserved PA bits in the PDPTRs, and ultimately causes an OOPS due to
physical addresses with non-zero MKTME bits sending to_shadow_page()
into the weeds:

set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.
BUG: unable to handle page fault for address: ffd43f00063049e8
PGD 86dfd8067 P4D 0
Oops: 0000 [#1] PREEMPT SMP
RIP: 0010:mmu_free_root_page+0x3c/0x90 [kvm]
 kvm_mmu_free_roots+0xd1/0x200 [kvm]
 __kvm_mmu_unload+0x29/0x70 [kvm]
 kvm_mmu_unload+0x13/0x20 [kvm]
 kvm_arch_destroy_vm+0x8a/0x190 [kvm]
 kvm_put_kvm+0x197/0x2d0 [kvm]
 kvm_vm_release+0x21/0x30 [kvm]
 __fput+0x8e/0x260
 ____fput+0xe/0x10
 task_work_run+0x6f/0xb0
 do_exit+0x327/0xa90
 do_group_exit+0x35/0xa0
 get_signal+0x911/0x930
 arch_do_signal_or_restart+0x37/0x720
 exit_to_user_mode_prepare+0xb2/0x140
 syscall_exit_to_user_mode+0x16/0x30
 do_syscall_64+0x4e/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: e54f1ff244 ("KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask")
Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20220608012015.19566-1-yuan.yao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09 10:52:16 -04:00
Paolo Bonzini
66da65005a KVM/riscv fixes for 5.19, take #1
- Typo fix in arch/riscv/kvm/vmid.c
 
 - Remove broken reference pattern from MAINTAINERS entry
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEZdn75s5e6LHDQ+f/rUjsVaLHLAcFAmKhgGkACgkQrUjsVaLH
 LAfCCw/+LEz6af/lm4PXr5CGkJ91xq2xMpU9o39jMBm0RltzsG7zt/90SaUdK/Oz
 MpX7CLFgb1Cm2ZZ/+l5cBlLc7NUaMMxHH9dpScyrYC8xAb75QYimpe/jfjuMyXjO
 IaYJB2WCs2gfTYXA58c4sB2WR5rLahLnQGJrwW2CfMSvpv/nAyEZyWYtgXw8tSxH
 oM06Z/cLWU53uWuX0hbKAVQMdAIrQK5H+z46bhbpFC6gk/XSvaBBEngoOiiE6lC6
 uM8i8ZIeUgqSeWWreczd6H25eYwyLuVxXHWSIgbdvEcvBUn0VDO+Ox4UA2ab3g3d
 uHubqdRY5GnrkbRK0ue6tXfON8NxGlKwlcc6kp9Vqxb3Jxjr2qwToTubHYAUVXUi
 XzrvSxoZRRikwstb1+PNXECCNYUHkNdj4FBA4WoF0Y3Br1IfSwZLUX+EKkY/DHv+
 L4MhFFNqsQPzVly2wNiyxuWwRQyxupHekeMQlp13P9vZnGcptxxEyuQlM1Hf40ST
 iiOC8L+TCQzc5dN156/KjQIUFPud4huJO+0xHQtang628yVzQazzcxD+ialPkcqt
 JnpMmNbvvNzFYLoB3dQ/36flmYRA6SbK4Tt4bdhls+UcnLnfHDZow7OLmX5yj8+A
 QiKx6IOS6KI10LXhVZguAmZuKjXajyLVaCWpBl0tV6XpV9Y5t98=
 =w6dT
 -----END PGP SIGNATURE-----

Merge tag 'kvm-riscv-fixes-5.19-1' of https://github.com/kvm-riscv/linux into HEAD

KVM/riscv fixes for 5.19, take #1

- Typo fix in arch/riscv/kvm/vmid.c

- Remove broken reference pattern from MAINTAINERS entry
2022-06-09 09:45:00 -04:00
Linus Torvalds
34f4335c16 * Fix syzkaller NULL pointer dereference
* Fix TDP MMU performance issue with disabling dirty logging
 * Fix 5.14 regression with SVM TSC scaling
 * Fix indefinite stall on applying live patches
 * Fix unstable selftest
 * Fix memory leak from wrong copy-and-paste
 * Fix missed PV TLB flush when racing with emulation
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmKglysUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroOJDAgArpPcAnJbeT2VQTQcp94e4tp9k1Sf
 gmUewajco4zFVB/sldE0fIporETkaX+FYYPiaNDdNgJ2lUw/HUJBN7KoFEYTZ37N
 Xx/qXiIXQYFw1bmxTnacLzIQtD3luMCzOs/6/Q7CAFZIBpUtUEjkMlQOBuxoKeG0
 B0iLCTJSw0taWcN170aN8G6T+5+bdR3AJW1k2wkgfESfYF9NfJoTUHQj9WTMzM2R
 aBRuXvUI/rWKvQY3DfoRmgg9Ig/SirSC+abbKIs4H08vZIEUlPk3WOZSKpsN/Wzh
 3XDnVRxgnaRLx6NI/ouI2UYJCmjPKbNcueGCf5IfUcHvngHjAEG/xxe4Qw==
 =zQ9u
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:

 - syzkaller NULL pointer dereference

 - TDP MMU performance issue with disabling dirty logging

 - 5.14 regression with SVM TSC scaling

 - indefinite stall on applying live patches

 - unstable selftest

 - memory leak from wrong copy-and-paste

 - missed PV TLB flush when racing with emulation

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: do not report a vCPU as preempted outside instruction boundaries
  KVM: x86: do not set st->preempted when going back to user space
  KVM: SVM: fix tsc scaling cache logic
  KVM: selftests: Make hyperv_clock selftest more stable
  KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
  x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()
  KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()
  entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
  KVM: Don't null dereference ops->destroy
2022-06-08 09:16:31 -07:00
Paolo Bonzini
6cd88243c7 KVM: x86: do not report a vCPU as preempted outside instruction boundaries
If a vCPU is outside guest mode and is scheduled out, it might be in the
process of making a memory access.  A problem occurs if another vCPU uses
the PV TLB flush feature during the period when the vCPU is scheduled
out, and a virtual address has already been translated but has not yet
been accessed, because this is equivalent to using a stale TLB entry.

To avoid this, only report a vCPU as preempted if sure that the guest
is at an instruction boundary.  A rescheduling request will be delivered
to the host physical CPU as an external interrupt, so for simplicity
consider any vmexit *not* instruction boundary except for external
interrupts.

It would in principle be okay to report the vCPU as preempted also
if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
vmentry/vmexit overhead unnecessarily, and optimistic spinning is
also unlikely to succeed.  However, leave it for later because right
now kvm_vcpu_check_block() is doing memory accesses.  Even
though the TLB flush issue only applies to virtual memory address,
it's very much preferrable to be conservative.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:21:07 -04:00
Paolo Bonzini
54aa83c901 KVM: x86: do not set st->preempted when going back to user space
Similar to the Xen path, only change the vCPU's reported state if the vCPU
was actually preempted.  The reason for KVM's behavior is that for example
optimistic spinning might not be a good idea if the guest is doing repeated
exits to userspace; however, it is confusing and unlikely to make a difference,
because well-tuned guests will hardly ever exit KVM_RUN in the first place.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08 04:21:06 -04:00
Maxim Levitsky
11d39e8cc4 KVM: SVM: fix tsc scaling cache logic
SVM uses a per-cpu variable to cache the current value of the
tsc scaling multiplier msr on each cpu.

Commit 1ab9287add
("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
broke this caching logic.

Refactor the code so that all TSC scaling multiplier writes go through
a single function which checks and updates the cache.

This fixes the following scenario:

1. A CPU runs a guest with some tsc scaling ratio.

2. New guest with different tsc scaling ratio starts on this CPU
   and terminates almost immediately.

   This ensures that the short running guest had set the tsc scaling ratio just
   once when it was set via KVM_SET_TSC_KHZ. Due to the bug,
   the per-cpu cache is not updated.

3. The original guest continues to run, it doesn't restore the msr
   value back to its own value, because the cache matches,
   and thus continues to run with a wrong tsc scaling ratio.

Fixes: 1ab9287add ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:28:50 -04:00
Ben Gardon
5ba7c4c6d1 KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
Currently disabling dirty logging with the TDP MMU is extremely slow.
On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds
to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with
the shadow MMU.

When disabling dirty logging, zap non-leaf parent entries to allow
replacement with huge pages instead of recursing and zapping all of the
child, leaf entries. This reduces the number of TLB flushes required.
and reduces the disable dirty log time with the TDP MMU to ~3 seconds.

Opportunistically add a WARN() to catch GFNs that are mapped at a
higher level than their max level.

Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220525230904.1584480-1-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:28:49 -04:00
Jan Beulich
1df931d95f x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()
As noted (and fixed) a couple of times in the past, "=@cc<cond>" outputs
and clobbering of "cc" don't work well together. The compiler appears to
mean to reject such, but doesn't - in its upstream form - quite manage
to yet for "cc". Furthermore two similar macros don't clobber "cc", and
clobbering "cc" is pointless in asm()-s for x86 anyway - the compiler
always assumes status flags to be clobbered there.

Fixes: 989b5db215 ("x86/uaccess: Implement macros for CMPXCHG on user addresses")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Message-Id: <485c0c0b-a3a7-0b7c-5264-7d00c01de032@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:28:49 -04:00
Shaoqin Huang
cf4a8693d9 KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()
When freeing obsolete previous roots, check prev_roots as intended, not
the current root.

Signed-off-by: Shaoqin Huang <shaoqin.huang@intel.com>
Fixes: 527d5cd7ee ("KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped")
Message-Id: <20220607005905.2933378-1-shaoqin.huang@intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07 11:28:48 -04:00
Josh Poimboeuf
7b6c7a877c x86/ftrace: Remove OBJECT_FILES_NON_STANDARD usage
The file-wide OBJECT_FILES_NON_STANDARD annotation is used with
CONFIG_FRAME_POINTER to tell objtool to skip the entire file when frame
pointers are enabled.  However that annotation is now deprecated because
it doesn't work with IBT, where objtool runs on vmlinux.o instead of
individual translation units.

Instead, use more fine-grained function-specific annotations:

- The 'save_mcount_regs' macro does funny things with the frame pointer.
  Use STACK_FRAME_NON_STANDARD_FP to tell objtool to ignore the
  functions using it.

- The return_to_handler() "function" isn't actually a callable function.
  Instead of being called, it's returned to.  The real return address
  isn't on the stack, so unwinding is already doomed no matter which
  unwinder is used.  So just remove the STT_FUNC annotation, telling
  objtool to ignore it.  That also removes the implicit
  ANNOTATE_NOENDBR, which now needs to be made explicit.

Fixes the following warning:

  vmlinux.o: warning: objtool: __fentry__+0x16: return with modified stack frame

Fixes: ed53a0d971 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/b7a7a42fe306aca37826043dac89e113a1acdbac.1654268610.git.jpoimboe@kernel.org
2022-06-06 11:50:22 -07:00
Juergen Gross
fa1f57421e xen/virtio: Enable restricted memory access using Xen grant mappings
In order to support virtio in Xen guests add a config option XEN_VIRTIO
enabling the user to specify whether in all Xen guests virtio should
be able to access memory via Xen grant mappings only on the host side.

Also set PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS feature from the guest
initialization code on Arm and x86 if CONFIG_XEN_VIRTIO is enabled.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/1654197833-25362-5-git-send-email-olekstysh@gmail.com
Signed-off-by: Juergen Gross <jgross@suse.com>
2022-06-06 16:07:30 +02:00
Juergen Gross
3f9dfbebdc virtio: replace arch_has_restricted_virtio_memory_access()
Instead of using arch_has_restricted_virtio_memory_access() together
with CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS, replace those
with platform_has() and a new platform feature
PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Borislav Petkov <bp@suse.de>
2022-06-06 08:22:01 +02:00
Linus Torvalds
815b196c70 Hot fixes for 5.19-rc1.
- fixes for material merged during this merge window
 
 - cc:stable fixes for more longstanding issues
 
 - minor mailmap and MAINTAINERS updates
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYpz1+QAKCRDdBJ7gKXxA
 jrudAP9EvjTg4KhmXDoUpgJYc2oPg27nIhu1LWT8VFdsVQ6mPwEA//HPvPhjah8u
 C1M183VxKL9trZf22DBn2BbD3kBDIAo=
 =9LgC
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm hotfixes from Andrew Morton:
 "Fixups for various recently-added and longer-term issues and a few
  minor tweaks:

   - fixes for material merged during this merge window

   - cc:stable fixes for more longstanding issues

   - minor mailmap and MAINTAINERS updates"

* tag 'mm-hotfixes-stable-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm/oom_kill.c: fix vm_oom_kill_table[] ifdeffery
  x86/kexec: fix memory leak of elf header buffer
  mm/memremap: fix missing call to untrack_pfn() in pagemap_range()
  mm: page_isolation: use compound_nr() correctly in isolate_single_pageblock()
  mm: hugetlb_vmemmap: fix CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON
  MAINTAINERS: add maintainer information for z3fold
  mailmap: update Josh Poimboeuf's email
2022-06-05 17:05:38 -07:00
Linus Torvalds
d717180e7f A single fix for x86/SGX to prevent that memory which is allocated for an
SGX enclave is accounted to the wrong memory control group.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcd1MTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaalD/0TdNTH+LiM0BpEZ4VHIAFhE9mgfaU/
 1HIZcXEvAzPqS+iLMYAPo2dS7hNKv1GCCD8HcuOdEwC/CyTdrcpvhCNeQXCagF38
 BHtzVCMFd/Y6U7ERNVsaHiuHFSkF+3QHef4Gzljzblgj1FK7s55z9tlQmE3pElOg
 UGfRoD32ODUtQPmOCjlOhFjsUUtFpdpXFCbjPPFdOqJ80LbdKR2s/0IBpHMk1xoz
 ESmS10tVC3a5np1/4Ge8vRCZnewOpulL/Is84Q8MbCvxI8NQh9pD7Imom/wRjSAS
 19N+sWh7ywuUtAOVqJ23dDc6SOL3yjM4HbmsEYRGPsgzuJ5crezLcrKgCFeGmz/4
 4zbU3R9hzzXQy8ZqNjbj71FKswfUDcMLb26GA/62d2N6zR7O0TSzfIrpIYp+GwJ3
 5KaM0LiKoW/LXGfwEdEBWpCkK1OKgMXmZ5IQlr5bRz3Qihqzkk65Dgfo66XRt+jb
 DhMHW+cMfLwSX72QER6LyP2jPfUSCZgy5Pn8LfXUH30fc084gyrAPq2eqtVnf0lf
 Hq5/r1nMosPE0CtxHM1vNRj5M052nQxXhDhdsTcoO6PVBrvEjJbkanj3XbNRk66T
 FDWGWmdtDC2su6p0ezwbARMxYnsSS40GVsp/DoOu+SHxlAm9VkomY+QDJ2FuoJnb
 K0XfW5vV9MEsvw==
 =cuLD
 -----END PGP SIGNATURE-----

Merge tag 'x86-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 SGX fix from Thomas Gleixner:
 "A single fix for x86/SGX to prevent that memory which is allocated for
  an SGX enclave is accounted to the wrong memory control group"

* tag 'x86-urgent-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Set active memcg prior to shmem allocation
2022-06-05 11:00:43 -07:00
Linus Torvalds
0b7da15c21 Use PAGE_ALIGNED() instead of open coding it in the x86/mm code.
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcdlsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaDeD/oDyKpbOf+ta83AyZm+h6Q+KzYf44n6
 XRaT8Ip1f/aUk+Ucv2YLpZT/LRiBenmJpsdDRaGhje83pS2cnXStd8UhEBjiWmeq
 vBGR8QpXJtFna1Ouizwwq+Pp2mnktK+YEk+Qf17z9UwrRzqXlgTIJqPWsVY1plmV
 lHVhThdkr09PR4aSBN9TQzTFxG0Qtf3mHbsRyPTzODldMtSbdCy9Zy2xlTigDIJX
 uTU/ykoB8qZpJguacAZpExwxvoRHy95cpxNCQfC5paeAv3gJ0U1b3PAJA4VkVlko
 h/2crEzHV0MZRx12o1le72K/2UfuvTeb3949VD3ZBIE5fPSH/K/niwl9BtntXSQj
 Bs0Kbm3k94tJz2VunG0FO7ayJ8/KL98zLHXQR6yRBTGvtkj9m7vhUMaN268NSLQH
 meAIuMreJejfsKV2Z/z9y+nSWebKdEQ0xkOq51qe3s2ivroKshTUNR0h7S35xyw5
 +R8lzJZH4rVVBOcxzCLTaIVsJbVHGfE65gb6JIyUT2sRRxXm53gYPC2rvIRzi8eA
 1h1ODpz0pn3LewXsigOJGmiK11Dda5uJAF58t1BJmGi9kXSH/CVfYBTcDOXcYNHy
 sA+oq+yrQ5RSywv1D9zuDzOgbcEZ1mB82aWuy4Li+k9uhdCyNIsNy7Nbw9paCDZk
 Oy+IMDLroCieSw==
 =KN22
 -----END PGP SIGNATURE-----

Merge tag 'x86-mm-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm cleanup from Thomas Gleixner:
 "Use PAGE_ALIGNED() instead of open coding it in the x86/mm code"

* tag 'x86-mm-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Use PAGE_ALIGNED(x) instead of IS_ALIGNED(x, PAGE_SIZE)
2022-06-05 10:57:35 -07:00
Linus Torvalds
9784edd73a X86 microcode updates:
- Disable late microcode loading by default. Unless the HW people get
     their act together and provide a required minimum version in the
     microcode header for making a halfways informed decision its just
     lottery and broken.
 
   - Warn and taint the kernel when microcode is loaded late
 
   - Remove the old unused microcode loader interface
 
   - Remove a redundant perf callback from the microcode loader
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcdjETHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoUUTD/92cLMB0g7XP8yFN+ZiHl7uoaDtE0UR
 WfapnMNL3tKKVnuwEMg83AjtyQ+O1JNZ6iS5K8jmiBnBByg6EQccz8pfe3jUQ6Ar
 gOLWV1F4tRbJD2NqqiWOo/l5qs5hJaz/QeE+oyCP0fvw6DOZixepG5RzveFSSwAa
 G2Q03GsGEu84SXlVAjagMSU6tYlBnlZBfKRB8NiNxkW8CLcJY0NfDCunbN6icEbH
 AQHXeviM3GWMKJA9R9DeSvYq9PbN5o2UVmcFQWsDAzZ8Ne2qCqskjoGNjuQ9s+72
 G35fm5d7dtIcrYg4PSJN0JDJP4HbcfSjhUrbdH4iAClTkGnNQERfuDV9O92/lYJE
 hd9c8yCegD0NWQ4dMNNrM5PSbWbQK7ajqRYVvqqouJZpH+IDtajA1jxEe+1msB8P
 xmXQDcdSMOyVs+Bw3Djf3tt8Qqhu4jxnf3y711oLklPwwh9lq9SvaWiX9ZFoYgdn
 1HVtQUAOdgDmncs5BQ8dpuwtoYXH5p31n0wh57emyFXl7wA93eWouuFczQ6mSol9
 LLdd9c+q9mBFIo0ult+fVhEOTDJF+27s3YXOpge6BAqei/SQIU4c5oq51CukV7ap
 TPzWAayq0lsAwXn1k6r86Zkewh1C2SQTyk1J3zMehZlSVpwSWbEQASYlKywmh6YB
 N+6/0XtHDAVK5Q==
 =DQjr
 -----END PGP SIGNATURE-----

Merge tag 'x86-microcode-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 microcode updates from Thomas Gleixner:

 - Disable late microcode loading by default. Unless the HW people get
   their act together and provide a required minimum version in the
   microcode header for making a halfways informed decision its just
   lottery and broken.

 - Warn and taint the kernel when microcode is loaded late

 - Remove the old unused microcode loader interface

 - Remove a redundant perf callback from the microcode loader

* tag 'x86-microcode-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode: Remove unnecessary perf callback
  x86/microcode: Taint and warn on late loading
  x86/microcode: Default-disable late loading
  x86/microcode: Rip out the OLD_INTERFACE
2022-06-05 10:55:23 -07:00
Linus Torvalds
a925128092 A set of small x86 cleanups:
- Remove unused headers in the IDT code
 
   - Kconfig indendation and comment fixes
 
   - Fix all 'the the' typos in one go instead of waiting for bots to fix
     one at a time.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcdUsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofu/EACJEYM67sgOGX/OPxSI2QrcqIPajI/u
 EMrNi69jR8XBgFUwnYRLC+eoC7nvYdpTaUHzQklS2xhE8lcZ4PcMejy9nHECe8MI
 sYA38gXeGamM4pzFQgpsX0Eoq1OX3iH165dCnSgRfGg2Zv6YovmcGk2fkHtA0fXn
 Sqp5fy33wK2U+ghY5MrJwVQ2SshbDp4p7SJ80iLCfdHvtKzQi02EH4CjrZ/guoJL
 bjdiWXA+eIDrPXhoPiBkFQ3cG/vHPc/oj2SEAnBV5oC+hdgjFebiz6CNYbFw0QI9
 MnJQlvhu/oe66J6sRGfqPABm4yh4omNSbjNjbWr9ahoVPvprH9gJ2EMBx6qOT3pe
 sG6pluiQAZXBoOpRqR45vws4Ypq5onyv4OwMzEFNZVT9kzr1qrMJtsXIlffM/hHE
 zgygCV1nqWznUueZKcI6XkXVtawte9wfpujDuZhCgoD/UaIixulW8zpK/0h9iUI5
 03u0lXse20h7kEmJYZ+vgQwSci/6i10U1X7+VngIrBAt24gtzigdKd6FLljSPp0y
 GOc9c79qNdm0Ayofko/m+XtwXw8UJ2Pbtkvku8qmyUR3ffUlBD+qcTPIZ6TZ8aPb
 w17k64zxEMQRYlMm8uHRg8KVJXuWD8nO3BSzwpwPVyckpsL4CkEcgdij7HMN7dLO
 +GCpzbvafupD6Q==
 =vSDr
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cleanups from Thomas Gleixner:
 "A set of small x86 cleanups:

   - Remove unused headers in the IDT code

   - Kconfig indendation and comment fixes

   - Fix all 'the the' typos in one go instead of waiting for bots to
     fix one at a time"

* tag 'x86-cleanups-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Fix all occurences of the "the the" typo
  x86/idt: Remove unused headers
  x86/Kconfig: Fix indentation of arch/x86/Kconfig.debug
  x86/Kconfig: Fix indentation and add endif comments to arch/x86/Kconfig
2022-06-05 10:53:41 -07:00
Linus Torvalds
1fd9f4ce84 Use strlcpy() instead of strscpy() in arch_setup()
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcdNYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYod9oD/9Y3JOKOKfXKz6VTZeIkaq8bwizjCxX
 tzTdeDy6pCYHv1z63rhwTpB4lUl+dt5CUiaf9YjEU3bdgNBtba/C0j6rx2Zf7Qk0
 hw2CjsPahEdFGRRgzbMF6fOUHxsOV/fZ5S4w9XcV7u5QAMkoj/w3rl7Eh2Vn9KbL
 B8I7Cl+Vyec2feknWau3s9vt4GRt+EYR08YmjWL1bxzjFss8JTs0mpzVnuR/QurU
 O20UIGS/167EvacmC15Ehht3EJEOykte3iWVFCMgEwFsZTOpByCQGGnDQCVik0o9
 6gYESc6fRNfTbC+rRGMs2LWXwsYfJMAqYLkSQIfOIETqysxu2HoWoCFWhpmaKLYr
 DEL3mVTy1LB7TqY6+C0P2UCeU9CuNr3fejQf4SsDIsNmTlUuv+FHrDCfi9cotX/G
 gmRa/29BASMgoVzF/QnzrEUGvEqU5S7wJgBxAD0cTw4IwvXz80KgbHNEl9utOCjB
 ceXoPh3zOaEnBZn7B5HXRq52r2KOA+T7dL/6blPfuYokZrKftq8z9fLDXEqKSLFY
 2lSxtowzAZiUxZDI5Z6qoBmEeEXrbxK3r7ro42KXetvudDWAoCspf2Qz8kEgZOCV
 ykDMPEnhesL8eE1LJzaJBSggz3LmQAslxIZ+CcZlFMAI+vOmbFEbMYVYxDzyJECN
 LEK9uNoEY1mMeQ==
 =gNpf
 -----END PGP SIGNATURE-----

Merge tag 'x86-boot-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 boot update from Thomas Gleixner:
 "Use strlcpy() instead of strscpy() in arch_setup()"

* tag 'x86-boot-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/setup: Use strscpy() to replace deprecated strlcpy()
2022-06-05 10:49:42 -07:00