Go to file
Gavin Shan 1dfe889fd2 KVM: Avoid illegal stage2 mapping on invalid memory slot
commit 2230f9e117 upstream.

We run into guest hang in edk2 firmware when KSM is kept as running on
the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash
device (TYPE_PFLASH_CFI01) during the operation of sector erasing or
buffered write. The status is returned by reading the memory region of
the pflash device and the read request should have been forwarded to QEMU
and emulated by it. Unfortunately, the read request is covered by an
illegal stage2 mapping when the guest hang issue occurs. The read request
is completed with QEMU bypassed and wrong status is fetched. The edk2
firmware runs into an infinite loop with the wrong status.

The illegal stage2 mapping is populated due to same page sharing by KSM
at (C) even the associated memory slot has been marked as invalid at (B)
when the memory slot is requested to be deleted. It's notable that the
active and inactive memory slots can't be swapped when we're in the middle
of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count
is elevated, and kvm_swap_active_memslots() will busy loop until it reaches
to zero again. Besides, the swapping from the active to the inactive memory
slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(),
corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().

  CPU-A                    CPU-B
  -----                    -----
                           ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
                           kvm_vm_ioctl_set_memory_region
                           kvm_set_memory_region
                           __kvm_set_memory_region
                           kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
                             kvm_invalidate_memslot
                               kvm_copy_memslot
                               kvm_replace_memslot
                               kvm_swap_active_memslots        (A)
                               kvm_arch_flush_shadow_memslot   (B)
  same page sharing by KSM
  kvm_mmu_notifier_invalidate_range_start
        :
  kvm_mmu_notifier_change_pte
    kvm_handle_hva_range
    __kvm_handle_hva_range
    kvm_set_spte_gfn            (C)
        :
  kvm_mmu_notifier_invalidate_range_end

Fix the issue by skipping the invalid memory slot at (C) to avoid the
illegal stage2 mapping so that the read request for the pflash's status
is forwarded to QEMU and emulated by it. In this way, the correct pflash's
status can be returned from QEMU to break the infinite loop in the edk2
firmware.

We tried a git-bisect and the first problematic commit is cd4c718352 ("
KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this,
clean_dcache_guest_page() is called after the memory slots are iterated
in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called
before the iteration on the memory slots before this commit. This change
literally enlarges the racy window between kvm_mmu_notifier_change_pte()
and memory slot removal so that we're able to reproduce the issue in a
practical test case. However, the issue exists since commit d5d8184d35
("KVM: ARM: Memory virtualization setup").

Cc: stable@vger.kernel.org # v3.9+
Fixes: d5d8184d35 ("KVM: ARM: Memory virtualization setup")
Reported-by: Shuai Hu <hshuai@redhat.com>
Reported-by: Zhenyu Zhang <zhenyzha@redhat.com>
Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Message-Id: <20230615054259.14911-1-gshan@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-28 11:14:05 +02:00
Documentation mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM 2023-06-14 11:16:59 +02:00
LICENSES LICENSES: Add the copyleft-next-0.3.1 license 2022-11-08 15:44:01 +01:00
arch riscv: Link with '-z norelro' 2023-06-28 11:13:59 +02:00
block blk-cgroup: Flush stats before releasing blkcg_gq 2023-06-21 16:02:19 +02:00
certs Kbuild updates for v6.3 2023-02-26 11:53:25 -08:00
crypto KEYS: asymmetric: Copy sig and digest in public_key_verify_signature() 2023-06-09 10:48:25 +02:00
drivers thermal/intel/intel_soc_dts_iosf: Fix reporting wrong temperatures 2023-06-28 11:14:05 +02:00
fs nilfs2: prevent general protection fault in nilfs_clear_dirty_page() 2023-06-28 11:14:04 +02:00
include ACPI: sleep: Avoid breaking S3 wakeup due to might_sleep() 2023-06-28 11:14:04 +02:00
init gcc: disable '-Warray-bounds' for gcc-13 too 2023-04-23 09:56:20 -07:00
io_uring io_uring/net: save msghdr->msg_control for retries 2023-06-21 16:02:09 +02:00
ipc Merge branch 'work.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:20:07 -08:00
kernel tick/common: Align tick period during sched_timer setup 2023-06-28 11:13:58 +02:00
lib lib: cpu_rmap: Fix potential use-after-free in irq_cpu_rmap_release() 2023-06-14 11:16:51 +02:00
mm memfd: check for non-NULL file_seals in memfd_create() syscall 2023-06-28 11:14:04 +02:00
net neighbour: delete neigh_lookup_nodev as not used 2023-06-21 16:02:18 +02:00
rust rust: allow to use INIT_STACK_ALL_ZERO 2023-04-19 19:34:43 +02:00
samples samples/bpf: Fix fout leak in hbm's run_bpf_prog 2023-05-24 17:30:06 +01:00
scripts scripts: fix the gfp flags header path in gfp-translate 2023-06-28 11:14:04 +02:00
security selinux: don't use make's grouped targets feature yet 2023-06-09 10:48:20 +02:00
sound ALSA: hda/realtek: Add a quirk for Compaq N14JP6 2023-06-21 16:02:11 +02:00
tools selftests: mptcp: join: skip mixed tests if not supported 2023-06-28 11:14:04 +02:00
usr initramfs: Check negative timestamp to prevent broken cpio archive 2023-04-16 17:37:01 +09:00
virt KVM: Avoid illegal stage2 mapping on invalid memory slot 2023-06-28 11:14:05 +02:00
.clang-format cpumask: re-introduce constant-sized cpumask optimizations 2023-03-05 14:30:34 -08:00
.cocciconfig scripts: add Linux .cocciconfig for coccinelle 2016-07-22 12:13:39 +02:00
.get_maintainer.ignore get_maintainer: add Alan to .get_maintainer.ignore 2022-08-20 15:17:44 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for *.dtso files 2023-02-26 15:28:23 +09:00
.gitignore kbuild: rpm-pkg: move source components to rpmbuild/SOURCES 2023-03-16 22:45:56 +09:00
.mailmap Networking fixes for 6.3-rc8, including fixes from netfilter and bpf 2023-04-20 11:03:51 -07:00
.rustfmt.toml rust: add `.rustfmt.toml` 2022-09-28 09:02:20 +02:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
Kbuild Kbuild updates for v6.1 2022-10-10 12:00:45 -07:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS MAINTAINERS: Resume MPTCP co-maintainer role 2023-04-19 18:10:24 -07:00
Makefile Linux 6.3.9 2023-06-21 16:02:19 +02:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.