Commit graph

855297 commits

Author SHA1 Message Date
Wanpeng Li
4d151bf3b8 KVM: LAPIC: Make lapic timer unpinned
Commit 61abdbe0bc ("kvm: x86: make lapic hrtimer pinned") pinned the
lapic timer to avoid to wait until the next kvm exit for the guest to
see KVM_REQ_PENDING_TIMER set. There is another solution to give a kick
after setting the KVM_REQ_PENDING_TIMER bit, make lapic timer unpinned
will be used in follow up patches.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-17 18:21:22 +02:00
Daniel Drake
9af93db9e1 platform/x86: asus: Rename "fan mode" to "fan boost mode"
The Asus WMI spec indicates that the function being controlled here
is called "Fan Boost Mode". The user-facing documentation also calls it
this.

The spec uses the term "fan mode" is used to refer to other things,
including functionality expected to appear on future products.
We missed this before as we are not dealing with the most readable of
specs, and didn't forsee any confusion around shortening the name.

Rename "fan mode" to "fan boost mode" to improve consistency with the
spec and to avoid a future naming conflict.

There is no interface breakage here since this has yet to be included
in an official kernel release. I also updated the kernel version listed
under ABI accordingly.

Signed-off-by: Daniel Drake <drake@endlessm.com>
Acked-by: Yurii Pavlovskyi <yurii.pavlovskyi@gmail.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
2019-07-17 19:07:58 +03:00
Linus Torvalds
57a8ec387e Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "VM:
   - z3fold fixes and enhancements by Henry Burns and Vitaly Wool

   - more accurate reclaimed slab caches calculations by Yafang Shao

   - fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
     Christoph Hellwig

   - !CONFIG_MMU fixes by Christoph Hellwig

   - new novmcoredd parameter to omit device dumps from vmcore, by
     Kairui Song

   - new test_meminit module for testing heap and pagealloc
     initialization, by Alexander Potapenko

   - ioremap improvements for huge mappings, by Anshuman Khandual

   - generalize kprobe page fault handling, by Anshuman Khandual

   - device-dax hotplug fixes and improvements, by Pavel Tatashin

   - enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V

   - add pte_devmap() support for arm64, by Robin Murphy

   - unify locked_vm accounting with a helper, by Daniel Jordan

   - several misc fixes

  core/lib:
   - new typeof_member() macro including some users, by Alexey Dobriyan

   - make BIT() and GENMASK() available in asm, by Masahiro Yamada

   - changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
     code generation, by Alexey Dobriyan

   - rbtree code size optimizations, by Michel Lespinasse

   - convert struct pid count to refcount_t, by Joel Fernandes

  get_maintainer.pl:
   - add --no-moderated switch to skip moderated ML's, by Joe Perches

  misc:
   - ptrace PTRACE_GET_SYSCALL_INFO interface

   - coda updates

   - gdb scripts, various"

[ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
  fs/select.c: use struct_size() in kmalloc()
  mm: add account_locked_vm utility function
  arm64: mm: implement pte_devmap support
  mm: introduce ARCH_HAS_PTE_DEVMAP
  mm: clean up is_device_*_page() definitions
  mm/mmap: move common defines to mman-common.h
  mm: move MAP_SYNC to asm-generic/mman-common.h
  device-dax: "Hotremove" persistent memory that is used like normal RAM
  mm/hotplug: make remove_memory() interface usable
  device-dax: fix memory and resource leak if hotplug fails
  include/linux/lz4.h: fix spelling and copy-paste errors in documentation
  ipc/mqueue.c: only perform resource calculation if user valid
  include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
  scripts/gdb: add helpers to find and list devices
  scripts/gdb: add lx-genpd-summary command
  drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
  kernel/pid.c: convert struct pid count to refcount_t
  drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
  select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
  select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
  ...
2019-07-17 08:58:04 -07:00
Nikos Tsironis
c663e04097 dm kcopyd: Increase default sub-job size to 512KB
Currently, kcopyd has a sub-job size of 64KB and a maximum number of 8
sub-jobs. As a result, for any kcopyd job, we have a maximum of 512KB of
I/O in flight.

This upper limit to the amount of in-flight I/O under-utilizes fast
devices and results in decreased throughput, e.g., when writing to a
snapshotted thin LV with I/O size less than the pool's block size (so
COW is performed using kcopyd).

Increase kcopyd's default sub-job size to 512KB, so we have a maximum of
4MB of I/O in flight for each kcopyd job. This results in an up to 96%
improvement of bandwidth when writing to a snapshotted thin LV, with I/O
sizes less than the pool's block size.

Also, add dm_mod.kcopyd_subjob_size_kb module parameter to allow users
to fine tune the sub-job size of kcopyd. The default value of this
parameter is 512KB and the maximum allowed value is 1024KB.

We evaluate the performance impact of the change by running the
snap_breaking_throughput benchmark, from the device mapper test suite
[1].

The benchmark:

  1. Creates a 1G thin LV
  2. Provisions the thin LV
  3. Takes a snapshot of the thin LV
  4. Writes to the thin LV with:

      dd if=/dev/zero of=/dev/vg/thin_lv oflag=direct bs=<I/O size>

Running this benchmark with various thin pool block sizes and dd I/O
sizes (all combinations triggering the use of kcopyd) we get the
following results:

+-----------------+-------------+------------------+-----------------+
| Pool block size | dd I/O size | BW before (MB/s) | BW after (MB/s) |
+-----------------+-------------+------------------+-----------------+
|       1 MB      |      256 KB |       242        |       280       |
|       1 MB      |      512 KB |       238        |       295       |
|                 |             |                  |                 |
|       2 MB      |      256 KB |       238        |       354       |
|       2 MB      |      512 KB |       241        |       380       |
|       2 MB      |        1 MB |       245        |       394       |
|                 |             |                  |                 |
|       4 MB      |      256 KB |       248        |       412       |
|       4 MB      |      512 KB |       234        |       432       |
|       4 MB      |        1 MB |       251        |       474       |
|       4 MB      |        2 MB |       257        |       504       |
|                 |             |                  |                 |
|       8 MB      |      256 KB |       239        |       420       |
|       8 MB      |      512 KB |       256        |       431       |
|       8 MB      |        1 MB |       264        |       467       |
|       8 MB      |        2 MB |       264        |       502       |
|       8 MB      |        4 MB |       281        |       537       |
+-----------------+-------------+------------------+-----------------+

[1] https://github.com/jthornber/device-mapper-test-suite

Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:24:16 -04:00
Mike Snitzer
3ee25485ba dm snapshot: fix oversights in optional discard support
__find_snapshots_sharing_cow() should always be used with _origins_lock
held so fix snapshot_io_hints() accordingly.  Also, once a snapshot is
being merged discards must not be allowed -- otherwise incorrect or
duplicate work will be performed.

Fixes: 2e6023850e ("dm snapshot: add optional discard support features")
Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:12:30 -04:00
Damien Le Moal
3b8cafdd54 dm zoned: fix zone state management race
dm-zoned uses the zone flag DMZ_ACTIVE to indicate that a zone of the
backend device is being actively read or written and so cannot be
reclaimed. This flag is set as long as the zone atomic reference
counter is not 0. When this atomic is decremented and reaches 0 (e.g.
on BIO completion), the active flag is cleared and set again whenever
the zone is reused and BIO issued with the atomic counter incremented.
These 2 operations (atomic inc/dec and flag set/clear) are however not
always executed atomically under the target metadata mutex lock and
this causes the warning:

WARN_ON(!test_bit(DMZ_ACTIVE, &zone->flags));

in dmz_deactivate_zone() to be displayed. This problem is regularly
triggered with xfstests generic/209, generic/300, generic/451 and
xfs/077 with XFS being used as the file system on the dm-zoned target
device. Similarly, xfstests ext4/303, ext4/304, generic/209 and
generic/300 trigger the warning with ext4 use.

This problem can be easily fixed by simply removing the DMZ_ACTIVE flag
and managing the "ACTIVE" state by directly looking at the reference
counter value. To do so, the functions dmz_activate_zone() and
dmz_deactivate_zone() are changed to inline functions respectively
calling atomic_inc() and atomic_dec(), while the dmz_is_active() macro
is changed to an inline function calling atomic_read().

Fixes: 3b1a94c88b ("dm zoned: drive-managed zoned block device target")
Cc: stable@vger.kernel.org
Reported-by: Masato Suzuki <masato.suzuki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2019-07-17 11:03:48 -04:00
Darrick J. Wong
5d907307ad iomap: move internal declarations into fs/iomap/
Move internal function declarations out of fs/internal.h into
include/linux/iomap.h so that our transition is complete.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-07-17 07:21:02 -07:00
Darrick J. Wong
cb7181ff4b iomap: move the main iteration code into a separate file
Move the main iteration code into a separate file so that we can group
related functions in a single file instead of having a single enormous
source file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-07-17 07:20:43 -07:00
Darrick J. Wong
afc51aaa22 iomap: move the buffered IO code into a separate file
Move the buffered IO code into a separate file so that we can group
related functions in a single file instead of having a single enormous
source file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-07-17 07:16:00 -07:00
Darrick J. Wong
db074436f4 iomap: move the direct IO code into a separate file
Move the direct IO code into a separate file so that we can group
related functions in a single file instead of having a single enormous
source file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-07-17 07:16:00 -07:00
Darrick J. Wong
56a178981d iomap: move the SEEK_HOLE code into a separate file
Move the SEEK_HOLE/SEEK_DATA code into a separate file so that we can
group related functions in a single file instead of having a single
enormous source file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-07-17 07:14:10 -07:00
Darrick J. Wong
5157fb8f5a iomap: move the file mapping reporting code into a separate file
Move the file mapping reporting code (FIEMAP/FIBMAP) into a separate
file so that we can group related functions in a single file instead of
having a single enormous source file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-07-17 07:14:10 -07:00
Darrick J. Wong
a45c0eccc5 iomap: move the swapfile code into a separate file
Move the swapfile activation code into a separate file so that we can
group related functions in a single file instead of having a single
enormous source file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2019-07-17 07:14:10 -07:00
Masahiro Yamada
c35c87d6f2 kbuild: modsign: read modules.order instead of $(MODVERDIR)/*.mod
Towards the goal of removing MODVERDIR, read out modules.order to get
the list of modules to be signed. This is simpler than parsing *.mod
files in $(MODVERDIR).

The modules_sign target is only supported for in-kernel modules.
So, this commit does not take care of external modules.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-07-17 22:39:27 +09:00
Masahiro Yamada
d56aec102c kbuild: modinst: read modules.order instead of $(MODVERDIR)/*.mod
Towards the goal of removing MODVERDIR, read out modules.order to get
the list of modules to be installed. This is simpler than parsing *.mod
files in $(MODVERDIR).

For external modules, $(KBUILD_EXTMOD)/modules.order should be read.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-07-17 22:39:27 +09:00
Masahiro Yamada
0e5d8b7fb2 scsi: remove pointless $(MODVERDIR)/$(obj)/53c700.ver
Nothing depends on this, so it is dead code.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-07-17 22:39:27 +09:00
Masahiro Yamada
e0e1b1ec39 kbuild: remove duplication from modules.order in sub-directories
Currently, only the top-level modules.order drops duplicated entries.

The modules.order files in sub-directories potentially contain
duplication. To list out the paths of all modules, I want to use
modules.order instead of parsing *.mod files in $(MODVERDIR).

To achieve this, I want to rip off duplication from modules.order
of external modules too.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-07-17 22:39:27 +09:00
Masahiro Yamada
1bd9a46801 kbuild: get rid of kernel/ prefix from in-tree modules.{order,builtin}
Removing the 'kernel/' prefix will make our life easier because we can
simply do 'cat modules.order' to get all built modules with full paths.

Currently, we parse the first line of '*.mod' files in $(MODVERDIR).
Since we have duplicated functionality here, I plan to remove MODVERDIR
entirely.

In fact, modules.order is generated also for external modules in a
broken format. It adds the 'kernel/' prefix to the absolute path of
the module, like this:

  kernel//path/to/your/external/module/foo.ko

This is fine for now since modules.order is not used for external
modules. However, I want to sanitize the format everywhere towards
the goal of removing MODVERDIR.

We cannot change the format of installed module.{order,builtin}.
So, 'make modules_install' will add the 'kernel/' prefix while copying
them to $(MODLIB)/.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-07-17 22:39:27 +09:00
Masahiro Yamada
7e13191879 kbuild: do not create empty modules.order in the prepare stage
Currently, $(objtree)/modules.order is touched in two places.

In the 'prepare0' rule, scripts/Makefile.build creates an empty
modules.order while processing 'obj=.'

In the 'modules' rule, the top-level Makefile overwrites it with
the correct list of modules.

While this might be a good side-effect that modules.order is made
empty every time (probably this is not intended functionality),
I personally do not like this behavior.

Create modules.order only when it is sensible to do so.

This avoids creating the following pointless files:

  scripts/basic/modules.order
  scripts/dtc/modules.order
  scripts/gcc-plugins/modules.order
  scripts/genksyms/modules.order
  scripts/mod/modules.order
  scripts/modules.order
  scripts/selinux/genheaders/modules.order
  scripts/selinux/mdp/modules.order
  scripts/selinux/modules.order

Going forward, $(objtree)/modules.order lists the modules that
was built in the last successful build.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-07-17 22:39:27 +09:00
Himanshu Jha
d09778d16e coccinelle: api: add devm_platform_ioremap_resource script
Use recently introduced devm_platform_ioremap_resource
helper which wraps platform_get_resource() and
devm_ioremap_resource() together. This helps produce much
cleaner code and remove local `struct resource` declaration.

Signed-off-by: Himanshu Jha <himanshujha199640@gmail.com>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-07-17 22:37:51 +09:00
Masahiro Yamada
4bd01de8f2 kbuild: compile-test headers listed in header-test-m as well
It will be useful to control the header-test by a tristate option.

If CONFIG_FOO is a tristate option, you can write like this:

  header-test-$(CONFIG_FOO) += foo.h

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-07-17 22:37:51 +09:00
Masahiro Yamada
c04d1e46fc kbuild: remove unused hostcc-option
We can re-add this whenever it is needed. At this moment, it is unused.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-07-17 22:37:51 +09:00
Masahiro Yamada
46457133ac kbuild: remove tag files by distclean instead of mrproper
It takes somewhat long time to generate these tag files.
Keep such precious files until we run 'make distclean'.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-07-17 22:37:51 +09:00
Masahiro Yamada
89ff7131f7 kbuild: add --hash-style= and --build-id unconditionally
As commit 1e0221374e ("mips: vdso: drop unnecessary cc-ldoption")
explained, these flags are supported by the minimal required version
of binutils. They are supported by ld.lld too.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
2019-07-17 22:37:51 +09:00
Masahiro Yamada
5ef872636c kbuild: get rid of misleading $(AS) from documents
The assembler files in the kernel are *.S instead of *.s, so they must
be preprocessed. Since 'as' of GNU binutils is not able to preprocess,
we always use $(CC) as an assembler driver.

$(AS) is almost unused in Kbuild. As of v5.2, there is just one place
that directly invokes $(AS).

  $ git grep -e '$(AS)' -e '${AS}' -e '$AS' -e '$(AS:' -e '${AS:' -- :^Documentation
  drivers/net/wan/Makefile:  AS68K = $(AS)

The documentation about *_AFLAGS* sounds like the flags were passed
to $(AS). This is somewhat misleading.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
2019-07-17 22:37:51 +09:00
Masahiro Yamada
8e2442a5f8 kconfig: fix missing choice values in auto.conf
Since commit 00c864f890 ("kconfig: allow all config targets to write
auto.conf if missing"), Kconfig creates include/config/auto.conf in the
defconfig stage when it is missing.

Joonas Kylmälä reported incorrect auto.conf generation under some
circumstances.

To reproduce it, apply the following diff:

|  --- a/arch/arm/configs/imx_v6_v7_defconfig
|  +++ b/arch/arm/configs/imx_v6_v7_defconfig
|  @@ -345,14 +345,7 @@ CONFIG_USB_CONFIGFS_F_MIDI=y
|   CONFIG_USB_CONFIGFS_F_HID=y
|   CONFIG_USB_CONFIGFS_F_UVC=y
|   CONFIG_USB_CONFIGFS_F_PRINTER=y
|  -CONFIG_USB_ZERO=m
|  -CONFIG_USB_AUDIO=m
|  -CONFIG_USB_ETH=m
|  -CONFIG_USB_G_NCM=m
|  -CONFIG_USB_GADGETFS=m
|  -CONFIG_USB_FUNCTIONFS=m
|  -CONFIG_USB_MASS_STORAGE=m
|  -CONFIG_USB_G_SERIAL=m
|  +CONFIG_USB_FUNCTIONFS=y
|   CONFIG_MMC=y
|   CONFIG_MMC_SDHCI=y
|   CONFIG_MMC_SDHCI_PLTFM=y

And then, run:

$ make ARCH=arm mrproper imx_v6_v7_defconfig

You will see CONFIG_USB_FUNCTIONFS=y is correctly contained in the
.config, but not in the auto.conf.

Please note drivers/usb/gadget/legacy/Kconfig is included from a choice
block in drivers/usb/gadget/Kconfig. So USB_FUNCTIONFS is a choice value.

This is probably a similar situation described in commit beaaddb625
("kconfig: tests: test defconfig when two choices interact").

When sym_calc_choice() is called, the choice symbol forgets the
SYMBOL_DEF_USER unless all of its choice values are explicitly set by
the user.

The choice symbol is given just one chance to recall it because
set_all_choice_values() is called if SYMBOL_NEED_SET_CHOICE_VALUES
is set.

When sym_calc_choice() is called again, the choice symbol forgets it
forever, since SYMBOL_NEED_SET_CHOICE_VALUES is a one-time aid.
Hence, we cannot call sym_clear_all_valid() again and again.

It is crazy to repeat set and unset of internal flags. However, we
cannot simply get rid of "sym->flags &= flags | ~SYMBOL_DEF_USER;"
Doing so would re-introduce the problem solved by commit 5d09598d48
("kconfig: fix new choices being skipped upon config update").

To work around the issue, conf_write_autoconf() stopped calling
sym_clear_all_valid().

conf_write() must be changed accordingly. Currently, it clears
SYMBOL_WRITE after the symbol is written into the .config file. This
is needed to prevent it from writing the same symbol multiple times in
case the symbol is declared in two or more locations. I added the new
flag SYMBOL_WRITTEN, to track the symbols that have been written.

Anyway, this is a cheesy workaround in order to suppress the issue
as far as defconfig is concerned.

Handling of choices is totally broken. sym_clear_all_valid() is called
every time a user touches a symbol from the GUI interface. To reproduce
it, just add a new symbol drivers/usb/gadget/legacy/Kconfig, then touch
around unrelated symbols from menuconfig. USB_FUNCTIONFS will disappear
from the .config file.

I added the Fixes tag since it is more fatal than before. But, this
has been broken since long long time before, and still it is.
We should take a closer look to fix this correctly somehow.

Fixes: 00c864f890 ("kconfig: allow all config targets to write auto.conf if missing")
Cc: linux-stable <stable@vger.kernel.org> # 4.19+
Reported-by: Joonas Kylmälä <joonas.kylmala@iki.fi>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Tested-by: Joonas Kylmälä <joonas.kylmala@iki.fi>
2019-07-17 22:37:51 +09:00
Like Xu
4d1a082da9 KVM: x86/vPMU: reset pmc->counter to 0 for pmu fixed_counters
To avoid semantic inconsistency, the fixed_counters in Intel vPMU
need to be reset to 0 in intel_pmu_reset() as gp_counters does.

Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-17 12:23:20 +02:00
Christoph Hellwig
a5008b59cd dma-direct: only limit the mapping size if swiotlb could be used
Don't just check for a swiotlb buffer, but also if buffering might
be required for this particular device.

Fixes: 133d624b1c ("dma: Introduce dma_max_mapping_size()")
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2019-07-17 08:25:45 +02:00
Christoph Hellwig
b866455423 dma-mapping: add a dma_addressing_limited helper
This helper returns if the device has issues addressing all present
memory in the system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-17 08:25:40 +02:00
Zhenzhong Duan
b23e5844df xen/pv: Fix a boot up hang revealed by int3 self test
Commit 7457c0da02 ("x86/alternatives: Add int3_emulate_call()
selftest") is used to ensure there is a gap setup in int3 exception stack
which could be used for inserting call return address.

This gap is missed in XEN PV int3 exception entry path, then below panic
triggered:

[    0.772876] general protection fault: 0000 [#1] SMP NOPTI
[    0.772886] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0+ #11
[    0.772893] RIP: e030:int3_magic+0x0/0x7
[    0.772905] RSP: 3507:ffffffff82203e98 EFLAGS: 00000246
[    0.773334] Call Trace:
[    0.773334]  alternative_instructions+0x3d/0x12e
[    0.773334]  check_bugs+0x7c9/0x887
[    0.773334]  ? __get_locked_pte+0x178/0x1f0
[    0.773334]  start_kernel+0x4ff/0x535
[    0.773334]  ? set_init_arg+0x55/0x55
[    0.773334]  xen_start_kernel+0x571/0x57a

For 64bit PV guests, Xen's ABI enters the kernel with using SYSRET, with
%rcx/%r11 on the stack. To convert back to "normal" looking exceptions,
the xen thunks do 'xen_*: pop %rcx; pop %r11; jmp *'.

E.g. Extracting 'xen_pv_trap xenint3' we have:
xen_xenint3:
 pop %rcx;
 pop %r11;
 jmp xenint3

As xenint3 and int3 entry code are same except xenint3 doesn't generate
a gap, we can fix it by using int3 and drop useless xenint3.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:59 +02:00
Zhenzhong Duan
bef6e0ae74 x86/xen: Add "nopv" support for HVM guest
PVH guest needs PV extentions to work, so "nopv" parameter should be
ignored for PVH but not for HVM guest.

If PVH guest boots up via the Xen-PVH boot entry, xen_pvh is set early,
we know it's PVH guest and ignore "nopv" parameter directly.

If PVH guest boots up via the normal boot entry same as HVM guest, it's
hard to distinguish PVH and HVM guest at that time. In this case, we
have to panic early if PVH is detected and nopv is enabled to avoid a
worse situation later.

Remove static from bool_x86_init_noop/x86_op_int_noop so they could be
used globally. Move xen_platform_hvm() after xen_hvm_guest_late_init()
to avoid compile error.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:59 +02:00
Zhenzhong Duan
cc8f3b4dd2 x86/paravirt: Remove const mark from x86_hyper_xen_hvm variable
.. as "nopv" support needs it to be changeable at boot up stage.

Checkpatch reports warning, so move variable declarations from
hypervisor.c to hypervisor.h

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:59 +02:00
Zhenzhong Duan
b39b049749 xen: Map "xen_nopv" parameter to "nopv" and mark it obsolete
Clean up unnecessory code after that operation.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:58 +02:00
Zhenzhong Duan
3097834637 x86: Add "nopv" parameter to disable PV extensions
In virtualization environment, PV extensions (drivers, interrupts,
timers, etc) are enabled in the majority of use cases which is the
best option.

However, in some cases (kexec not fully working, benchmarking)
we want to disable PV extensions. We have "xen_nopv" for that purpose
but only for XEN. For a consistent admin experience a common command
line parameter "nopv" set across all PV guest implementations is a
better choice.

There are guest types which just won't work without PV extensions,
like Xen PV, Xen PVH and jailhouse. add a "ignore_nopv" member to
struct hypervisor_x86 set to true for those guest types and call
the detect functions only if nopv is false or ignore_nopv is true.

Suggested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:58 +02:00
Zhenzhong Duan
1b37683cda x86/xen: Mark xen_hvm_need_lapic() and xen_x2apic_para_available() as __init
.. as they are only called at early bootup stage. In fact, other
functions in x86_hyper_xen_hvm.init.* are all marked as __init.

Unexport xen_hvm_need_lapic as it's never used outside.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:58 +02:00
Juergen Gross
814bbf49dc xen: remove tmem driver
The Xen tmem (transcendent memory) driver can be removed, as the
related Xen hypervisor feature never made it past the "experimental"
state and will be removed in future Xen versions (>= 4.13).

The xen-selfballoon driver depends on tmem, so it can be removed, too.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:58 +02:00
Zhenzhong Duan
090d54bcbc Revert "x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized"
This reverts commit ca5d376e17.

Commit 8990cac6e5 ("x86/jump_label: Initialize static branching
early") adds jump_label_init() call in setup_arch() to make static
keys initialized early, so we could use the original simpler code
again.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:57 +02:00
Juergen Gross
bce5963bcb xen/events: fix binding user event channels to cpus
When binding an interdomain event channel to a vcpu via
IOCTL_EVTCHN_BIND_INTERDOMAIN not only the event channel needs to be
bound, but the affinity of the associated IRQi must be changed, too.
Otherwise the IRQ and the event channel won't be moved to another vcpu
in case the original vcpu they were bound to is going offline.

Cc: <stable@vger.kernel.org> # 4.13
Fixes: c48f64ab47 ("xen-evtchn: Bind dyn evtchn:qemu-dm interrupt to next online VCPU")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2019-07-17 08:09:57 +02:00
Christoph Hellwig
07d9aa1434 scsi: megaraid_sas: set an unlimited max_segment_size
When using a virt_boundary_mask, as done for NVMe devices attached to
megaraid_sas controllers, we require an unlimited max_segment_size as the
virt boundary merging code assumes that.  But we also need to propagate
that to the DMA mapping layer to make dma-debug happy.  The SCSI layer
takes care of that when using the per-host virt_boundary setting, but
given that megaraid_sas only wants to set the virt_boundary for actual
NVMe devices, we can't rely on that.  The DMA layer maximum segment is
global to the HBA however, so we have to set it explicitly.  This patch
assumes that megaraid_sas does not have a segment size limitation, which
seems true based on the SGL format, but will need to be verified.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-07-16 23:02:11 -04:00
Christoph Hellwig
ce0ad85310 scsi: mpt3sas: set an unlimited max_segment_size for SAS 3.0 HBAs
When using a virt_boundary_mask, as done for NVMe devices attached to
mpt3sas controllers, we require an unlimited max_segment_size as the virt
boundary merging code assumes that.  But we also need to propagate that to
the DMA mapping layer to make dma-debug happy.  The SCSI layer takes care
of that when using the per-host virt_boundary setting, but given that
mpt3sas only wants to set the virt_boundary for actual NVMe devices, we
can't rely on that.  The DMA layer maximum segment is global to the HBA
however, so we have to set it explicitly.  This patch assumes that mpt3sas
does not have a segment size limitation, which seems true based on the SGL
format, but will need to be verified.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-07-16 23:02:11 -04:00
Christoph Hellwig
8c175d3131 scsi: IB/srp: set virt_boundary_mask in the scsi host
This ensures all proper DMA layer handling is taken care of by the SCSI
midlayer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-07-16 23:02:11 -04:00
Christoph Hellwig
09a4460ba4 scsi: IB/iser: set virt_boundary_mask in the scsi host
This ensures all proper DMA layer handling is taken care of by the SCSI
midlayer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-07-16 23:02:10 -04:00
Christoph Hellwig
83eed4592f scsi: storvsc: set virt_boundary_mask in the scsi host template
This ensures all proper DMA layer handling is taken care of by the SCSI
midlayer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-07-16 23:02:10 -04:00
Christoph Hellwig
552a990ca1 scsi: ufshcd: set max_segment_size in the scsi host template
We need to also mirror the value to the device to ensure IOMMU merging
doesn't undo it, and the SCSI host level parameter will ensure that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-07-16 23:01:49 -04:00
Christoph Hellwig
bdd17bdef7 scsi: core: take the DMA max mapping size into account
We need to limit the device's max_sectors to what the DMA mapping
implementation can support.  If not, we risk running out of swiotlb
buffers easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-07-16 23:01:49 -04:00
Christoph Hellwig
7ad388d8e4 scsi: core: add a host / host template field for the virt boundary
This allows drivers setting it up easily instead of branching out to block
layer calls in slave_alloc, and ensures the upgraded max_segment_size
setting gets picked up by the DMA layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Kashyap Desai < kashyap.desai@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-07-16 23:01:49 -04:00
Al Viro
56cbb429d9 switch the remnants of releasing the mountpoint away from fs_pin
We used to need rather convoluted ordering trickery to guarantee
that dput() of ex-mountpoints happens before the final mntput()
of the same.  Since we don't need that anymore, there's no point
playing with fs_pin for that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-16 22:52:37 -04:00
Al Viro
2763d11912 get rid of detach_mnt()
Lift getting the original mount (dentry is actually not needed at all)
of the mountpoint into the callers - to do_move_mount() and pivot_root()
level.  That simplifies the cleanup in those and allows to get saner
arguments for attach_mnt_recursive().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-16 22:50:11 -04:00
Pankaj Gupta
8c2e408e73 virtio_pmem: fix sparse warning
This patch fixes below sparse warning related to __virtio
type in virtio pmem driver. This is reported by Intel test
bot on linux-next tree.

nd_virtio.c:56:28: warning: incorrect type in assignment
                                (different base types)
nd_virtio.c:56:28:    expected unsigned int [unsigned] [usertype] type
nd_virtio.c:56:28:    got restricted __virtio32
nd_virtio.c:93:59: warning: incorrect type in argument 2
                                (different base types)
nd_virtio.c:93:59:    expected restricted __virtio32 [usertype] val
nd_virtio.c:93:59:    got unsigned int [unsigned] [usertype] ret

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2019-07-16 19:44:26 -07:00
Al Viro
4edbe133f8 make struct mountpoint bear the dentry reference to mountpoint, not struct mount
Using dput_to_list() to shift the contributing reference from ->mnt_mountpoint
to ->mnt_mp->m_dentry.  Dentries are dropped (with dput_to_list()) as soon
as struct mountpoint is destroyed; in cases where we are under namespace_sem
we use the global list, shrinking it in namespace_unlock().  In case of
detaching stuck MNT_LOCKed children at final mntput_no_expire() we use a local
list and shrink it ourselves.  ->mnt_ex_mountpoint crap is gone.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-16 22:43:40 -04:00