Commit graph

17291 commits

Author SHA1 Message Date
Balbir Singh
b973685e70 powerpc/mm/radix: Split linear mapping on hot-unplug
commit 4dd5f8a99e upstream.

This patch splits the linear mapping if the hot-unplug range is
smaller than the mapping size. The code detects if the mapping needs
to be split into a smaller size and if so, uses the stop machine
infrastructure to clear the existing mapping and then remap the
remaining range using a smaller page size.

The code will skip any region of the mapping that overlaps with kernel
text and warn about it once. We don't want to remove a mapping where
the kernel text and the LMB we intend to remove overlap in the same
TLB mapping as it may affect the currently executing code.

I've tested these changes under a kvm guest with 2 vcpus, from a split
mapping point of view, some of the caveats mentioned above applied to
the testing I did.

Fixes: 4b5d62ca17 ("powerpc/mm: add radix__remove_section_mapping()")
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
[mpe: Tweak change log to match updated behaviour]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22 15:42:19 +01:00
Benjamin Herrenschmidt
40be210c83 powerpc: Fix DABR match on hash based systems
commit f23ab3efb1 upstream.

Commit 398a719d34 ("powerpc/mm: Update bits used to skip hash_page")
mistakenly dropped the DSISR_DABRMATCH bit from the mask of bit tested
to skip trying to hash a page.

As a result, the DABR matches would no longer be detected.

This adds it back. We open code it in the 2 places where it matters
rather than fold it into DSISR_BAD_FAULT_32S/64S because this isn't
technically a bad fault and while we would never hit it with the
current code, I prefer if page_fault_is_bad() didn't trigger on these.

Fixes: 398a719d34 ("powerpc/mm: Update bits used to skip hash_page")
Cc: stable@vger.kernel.org # v4.14
Tested-by: Pedro Miraglia Franco de Carvalho <pedromfc@br.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22 15:42:17 +01:00
Cédric Le Goater
3b09911d3b powerpc/xive: Use hw CPU ids when configuring the CPU queues
commit 8e036c8d30 upstream.

The CPU event notification queues on sPAPR should be configured using
a hardware CPU identifier.

The problem did not show up on the Power Hypervisor because pHyp
supports 8 threads per core which keeps CPU number contiguous. This is
not the case on all sPAPR virtual machines, some use SMT=1.

Also improve error logging by adding the CPU number.

Fixes: eac1e731b5 ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22 15:42:16 +01:00
Alexey Kardashevskiy
892674b505 powerpc/mm: Flush radix process translations when setting MMU type
commit 62e984ddfd upstream.

Radix guests do normally invalidate process-scoped translations when a
new pid is allocated but migrated guests do not invalidate these so
migrated guests crash sometime, especially easy to reproduce with
migration happening within first 10 seconds after the guest boot start
on the same machine.

This adds the "Invalidate process-scoped translations" flush to fix
radix guests migration.

Fixes: 2ee13be34b ("KVM: PPC: Book3S HV: Update kvmppc_set_arch_compat() for ISA v3.00")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Tested-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22 15:42:16 +01:00
Nathan Fontenot
4386f223b4 powerpc/numa: Invalidate numa_cpu_lookup_table on cpu remove
commit 1d9a090783 upstream.

When DLPAR removing a CPU, the unmapping of the cpu from a node in
unmap_cpu_from_node() should also invalidate the CPUs entry in the
numa_cpu_lookup_table. There is not a guarantee that on a subsequent
DLPAR add of the CPU the associativity will be the same and thus
could be in a different node. Invalidating the entry in the
numa_cpu_lookup_table causes the associativity to be read from the
device tree at the time of the add.

The current behavior of not invalidating the CPUs entry in the
numa_cpu_lookup_table can result in scenarios where the the topology
layout of CPUs in the partition does not match the device tree
or the topology reported by the HMC.

This bug looks like it was introduced in 2004 in the commit titled
"ppc64: cpu hotplug notifier for numa", which is 6b15e4e87e32 in the
linux-fullhist tree. Hence tag it for all stable releases.

Cc: stable@vger.kernel.org
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22 15:42:16 +01:00
Mahesh Salgaonkar
5b98d31481 powerpc/radix: Remove trace_tlbie call from radix__flush_tlb_all
commit 8d81296cfc upstream.

radix__flush_tlb_all() is called only in kexec path in real mode and any
tracepoints at this stage will make kexec to fail if enabled.

To verify enable tlbie trace before kexec.

$ echo 1 > /sys/kernel/debug/tracing/events/powerpc/tlbie/enable
== kexec into new kernel and kexec fails.

Fix this by not calling trace_tlbie from radix__flush_tlb_all().

Fixes: 0428491cba ("powerpc/mm: Trace tlbie(l) instructions")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22 15:42:16 +01:00
Ulf Magnusson
e747a02d9f KVM: PPC: Book3S PR: Fix broken select due to misspelling
commit 57ea5f161a upstream.

Commit 76d837a4c0 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code
on non-pseries platforms") added a reference to the globally undefined
symbol PPC_SERIES. Looking at the rest of the commit, PPC_PSERIES was
probably intended.

Change PPC_SERIES to PPC_PSERIES.

Discovered with the
https://github.com/ulfalizer/Kconfiglib/blob/master/examples/list_undefined.py
script.

Fixes: 76d837a4c0 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms")
Signed-off-by: Ulf Magnusson <ulfalizer@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16 20:23:03 +01:00
Paul Mackerras
be54d79b43 KVM: PPC: Book3S HV: Drop locks before reading guest memory
commit 36ee41d161 upstream.

Running with CONFIG_DEBUG_ATOMIC_SLEEP reveals that HV KVM tries to
read guest memory, in order to emulate guest instructions, while
preempt is disabled and a vcore lock is held.  This occurs in
kvmppc_handle_exit_hv(), called from post_guest_process(), when
emulating guest doorbell instructions on POWER9 systems, and also
when checking whether we have hit a hypervisor breakpoint.
Reading guest memory can cause a page fault and thus cause the
task to sleep, so we need to avoid reading guest memory while
holding a spinlock or when preempt is disabled.

To fix this, we move the preempt_enable() in kvmppc_run_core() to
before the loop that calls post_guest_process() for each vcore that
has just run, and we drop and re-take the vcore lock around the calls
to kvmppc_emulate_debug_inst() and kvmppc_emulate_doorbell_instr().

Dropping the lock is safe with respect to the iteration over the
runnable vcpus in post_guest_process(); for_each_runnable_thread
is actually safe to use locklessly.  It is possible for a vcpu
to become runnable and add itself to the runnable_threads array
(code near the beginning of kvmppc_run_vcpu()) and then get included
in the iteration in post_guest_process despite the fact that it
has not just run.  This is benign because vcpu->arch.trap and
vcpu->arch.ceded will be zero.

Fixes: 579006944e ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16 20:23:03 +01:00
Paul Mackerras
88b64450cc KVM: PPC: Book3S HV: Make sure we don't re-enter guest without XIVE loaded
commit 43ff3f6523 upstream.

This fixes a bug where it is possible to enter a guest on a POWER9
system without having the XIVE (interrupt controller) context loaded.
This can happen because we unload the XIVE context from the CPU
before doing the real-mode handling for machine checks.  After the
real-mode handler runs, it is possible that we re-enter the guest
via a fast path which does not load the XIVE context.

To fix this, we move the unloading of the XIVE context to come after
the real-mode machine check handler is called.

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16 20:23:03 +01:00
Eric Biggers
8d906d183b crypto: hash - annotate algorithms taking optional key
commit a208fa8f33 upstream.

We need to consistently enforce that keyed hashes cannot be used without
setting the key.  To do this we need a reliable way to determine whether
a given hash algorithm is keyed or not.  AF_ALG currently does this by
checking for the presence of a ->setkey() method.  However, this is
actually slightly broken because the CRC-32 algorithms implement
->setkey() but can also be used without a key.  (The CRC-32 "key" is not
actually a cryptographic key but rather represents the initial state.
If not overridden, then a default initial state is used.)

Prepare to fix this by introducing a flag CRYPTO_ALG_OPTIONAL_KEY which
indicates that the algorithm has a ->setkey() method, but it is not
required to be called.  Then set it on all the CRC-32 algorithms.

The same also applies to the Adler-32 implementation in Lustre.

Also, the cryptd and mcryptd templates have to pass through the flag
from their underlying algorithm.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16 20:23:00 +01:00
Michal Suchanek
b4a9ffad97 powerpc/pseries: include linux/types.h in asm/hvcall.h
commit 1b689a95ce upstream.

Commit 6e032b350c ("powerpc/powernv: Check device-tree for RFI flush
settings") uses u64 in asm/hvcall.h without including linux/types.h

This breaks hvcall.h users that do not include the header themselves.

Fixes: 6e032b350c ("powerpc/powernv: Check device-tree for RFI flush settings")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-16 20:22:42 +01:00
Masahiro Yamada
abc5896b77 .gitignore: move *.dtb and *.dtb.S patterns to the top-level .gitignore
commit 10b62a2f78 upstream.

Most of DT files are compiled under arch/*/boot/dts/, but we have some
other directories, like drivers/of/unittest-data/.  We often miss to
add gitignore patterns per directory.  Since there are no source files
that end with .dtb or .dtb.S, we can ignore the patterns globally.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-13 10:19:46 +01:00
Michael Ellerman
d6eded6c94 powerpc/64s: Allow control of RFI flush via debugfs
commit 236003e6b5 upstream.

Expose the state of the RFI flush (enabled/disabled) via debugfs, and
allow it to be enabled/disabled at runtime.

eg: $ cat /sys/kernel/debug/powerpc/rfi_flush
    1
    $ echo 0 > /sys/kernel/debug/powerpc/rfi_flush
    $ cat /sys/kernel/debug/powerpc/rfi_flush
    0

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-07 11:12:17 -08:00
Michael Ellerman
517bdccc3a powerpc/64s: Wire up cpu_show_meltdown()
commit fd6e440f20 upstream.

The recent commit 87590ce6e3 ("sysfs/cpu: Add vulnerability folder")
added a generic folder and set of files for reporting information on
CPU vulnerabilities. One of those was for meltdown:

  /sys/devices/system/cpu/vulnerabilities/meltdown

This commit wires up that file for 64-bit Book3S powerpc.

For now we default to "Vulnerable" unless the RFI flush is enabled.
That may not actually be true on all hardware, further patches will
refine the reporting based on the CPU/platform etc. But for now we
default to being pessimists.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-07 11:12:17 -08:00
Jan H. Schönherr
40ba283e26 KVM: Let KVM_SET_SIGNAL_MASK work as advertised
[ Upstream commit 20b7035c66 ]

KVM API says for the signal mask you set via KVM_SET_SIGNAL_MASK, that
"any unblocked signal received [...] will cause KVM_RUN to return with
-EINTR" and that "the signal will only be delivered if not blocked by
the original signal mask".

This, however, is only true, when the calling task has a signal handler
registered for a signal. If not, signal evaluation is short-circuited for
SIG_IGN and SIG_DFL, and the signal is either ignored without KVM_RUN
returning or the whole process is terminated.

Make KVM_SET_SIGNAL_MASK behave as advertised by utilizing logic similar
to that in do_sigtimedwait() to avoid short-circuiting of signals.

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03 17:39:06 +01:00
Oliver O'Halloran
5dc5971854 powerpc/powernv: Check device-tree for RFI flush settings
commit 6e032b350c upstream.

New device-tree properties are available which tell the hypervisor
settings related to the RFI flush. Use them to determine the
appropriate flush instruction to use, and whether the flush is
required.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:11 +01:00
Michael Neuling
4b5158cefc powerpc/pseries: Query hypervisor for RFI flush settings
commit 8989d56878 upstream.

A new hypervisor call is available which tells the guest settings
related to the RFI flush. Use it to query the appropriate flush
instruction(s), and whether the flush is required.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:11 +01:00
Michael Ellerman
9472d895cd powerpc/64s: Support disabling RFI flush with no_rfi_flush and nopti
commit bc9c9304a4 upstream.

Because there may be some performance overhead of the RFI flush, add
kernel command line options to disable it.

We add a sensibly named 'no_rfi_flush' option, but we also hijack the
x86 option 'nopti'. The RFI flush is not the same as KPTI, but if we
see 'nopti' we can guess that the user is trying to avoid any overhead
of Meltdown mitigations, and it means we don't have to educate every
one about a different command line option.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:11 +01:00
Michael Ellerman
b434c155ab powerpc/64s: Add support for RFI flush of L1-D cache
commit aa8a5e0062 upstream.

On some CPUs we can prevent the Meltdown vulnerability by flushing the
L1-D cache on exit from kernel to user mode, and from hypervisor to
guest.

This is known to be the case on at least Power7, Power8 and Power9. At
this time we do not know the status of the vulnerability on other CPUs
such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale
CPUs. As more information comes to light we can enable this, or other
mechanisms on those CPUs.

The vulnerability occurs when the load of an architecturally
inaccessible memory region (eg. userspace load of kernel memory) is
speculatively executed to the point where its result can influence the
address of a subsequent speculatively executed load.

In order for that to happen, the first load must hit in the L1,
because before the load is sent to the L2 the permission check is
performed. Therefore if no kernel addresses hit in the L1 the
vulnerability can not occur. We can ensure that is the case by
flushing the L1 whenever we return to userspace. Similarly for
hypervisor vs guest.

In order to flush the L1-D cache on exit, we add a section of nops at
each (h)rfi location that returns to a lower privileged context, and
patch that with some sequence. Newer firmwares are able to advertise
to us that there is a special nop instruction that flushes the L1-D.
If we do not see that advertised, we fall back to doing a displacement
flush in software.

For guest kernels we support migration between some CPU versions, and
different CPUs may use different flush instructions. So that we are
prepared to migrate to a machine with a different flush instruction
activated, we may have to patch more than one flush instruction at
boot if the hypervisor tells us to.

In the end this patch is mostly the work of Nicholas Piggin and
Michael Ellerman. However a cast of thousands contributed to analysis
of the issue, earlier versions of the patch, back ports testing etc.
Many thanks to all of them.

Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:10 +01:00
Nicholas Piggin
9488c6b916 powerpc/64s: Convert slb_miss_common to use RFI_TO_USER/KERNEL
commit c7305645eb upstream.

In the SLB miss handler we may be returning to user or kernel. We need
to add a check early on and save the result in the cr4 register, and
then we bifurcate the return path based on that.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:10 +01:00
Nicholas Piggin
bcac5d3653 powerpc/64: Convert fast_exception_return to use RFI_TO_USER/KERNEL
commit a08f828cf4 upstream.

Similar to the syscall return path, in fast_exception_return we may be
returning to user or kernel context. We already have a test for that,
because we conditionally restore r13. So use that existing test and
branch, and bifurcate the return based on that.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:10 +01:00
Nicholas Piggin
627700e455 powerpc/64: Convert the syscall exit path to use RFI_TO_USER/KERNEL
commit b8e90cb7bc upstream.

In the syscall exit path we may be returning to user or kernel
context. We already have a test for that, because we conditionally
restore r13. So use that existing test and branch, and bifurcate the
return based on that.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:10 +01:00
Nicholas Piggin
11caf810bd powerpc/64s: Simple RFI macro conversions
commit 222f20f140 upstream.

This commit does simple conversions of rfi/rfid to the new macros that
include the expected destination context. By simple we mean cases
where there is a single well known destination context, and it's
simply a matter of substituting the instruction for the appropriate
macro.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:10 +01:00
Nicholas Piggin
bcba6b9024 powerpc/64: Add macros for annotating the destination of rfid/hrfid
commit 50e51c13b3 upstream.

The rfid/hrfid ((Hypervisor) Return From Interrupt) instruction is
used for switching from the kernel to userspace, and from the
hypervisor to the guest kernel. However it can and is also used for
other transitions, eg. from real mode kernel code to virtual mode
kernel code, and it's not always clear from the code what the
destination context is.

To make it clearer when reading the code, add macros which encode the
expected destination context.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:09 +01:00
Michael Neuling
4167dcbc91 powerpc/pseries: Add H_GET_CPU_CHARACTERISTICS flags & wrapper
commit 191eccb158 upstream.

A new hypervisor call has been defined to communicate various
characteristics of the CPU to guests. Add definitions for the hcall
number, flags and a wrapper function.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-23 19:58:09 +01:00
David Gibson
02453a0f8f KVM: PPC: Book3S HV: Always flush TLB in kvmppc_alloc_reset_hpt()
commit ecba8297aa upstream.

The KVM_PPC_ALLOCATE_HTAB ioctl(), implemented by kvmppc_alloc_reset_hpt()
is supposed to completely clear and reset a guest's Hashed Page Table (HPT)
allocating or re-allocating it if necessary.

In the case where an HPT of the right size already exists and it just
zeroes it, it forces a TLB flush on all guest CPUs, to remove any stale TLB
entries loaded from the old HPT.

However, that situation can arise when the HPT is resizing as well - or
even when switching from an RPT to HPT - so those cases need a TLB flush as
well.

So, move the TLB flush to trigger in all cases except for errors.

Fixes: f98a8bf9ee ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:45:24 +01:00
Serhii Popovych
c8e754fe3b KVM: PPC: Book3S HV: Fix use after free in case of multiple resize requests
commit 4ed11aeefd upstream.

When serving multiple resize requests following could happen:

    CPU0                                    CPU1
    ----                                    ----
    kvm_vm_ioctl_resize_hpt_prepare(1);
      -> schedule_work()
                                            /* system_rq might be busy: delay */
    kvm_vm_ioctl_resize_hpt_prepare(2);
      mutex_lock();
      if (resize) {
         ...
         release_hpt_resize();
      }
      ...                                   resize_hpt_prepare_work()
      -> schedule_work()                    {
      mutex_unlock()                           /* resize->kvm could be wrong */
                                               struct kvm *kvm = resize->kvm;

                                               mutex_lock(&kvm->lock);   <<<< UAF
                                               ...
                                            }

i.e. a second resize request with different order could be started by
kvm_vm_ioctl_resize_hpt_prepare(), causing the previous request to be
free()d when there's still an active worker thread which will try to
access it.  This leads to a use after free in point marked with UAF on
the diagram above.

To prevent this from happening, instead of unconditionally releasing a
pre-existing resize structure from the prepare ioctl(), we check if
the existing structure has an in-progress worker.  We do that by
checking if the resize->error == -EBUSY, which is safe because the
resize->error field is protected by the kvm->lock.  If there is an
active worker, instead of releasing, we mark the structure as stale by
unlinking it from kvm_struct.

In the worker thread we check for a stale structure (with kvm->lock
held), and in that case abort, releasing the stale structure ourself.
We make the check both before and the actual allocation.  Strictly,
only the check afterwards is needed, the check before is an
optimization: if the structure happens to become stale before the
worker thread is dispatched, rather than during the allocation, it
means we can avoid allocating then immediately freeing a potentially
substantial amount of memory.

This fixes following or similar host kernel crash message:

[  635.277361] Unable to handle kernel paging request for data at address 0x00000000
[  635.277438] Faulting instruction address: 0xc00000000052f568
[  635.277446] Oops: Kernel access of bad area, sig: 11 [#1]
[  635.277451] SMP NR_CPUS=2048 NUMA PowerNV
[  635.277470] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter nfsv3 nfs_acl nfs
lockd grace fscache kvm_hv kvm rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi
scsi_transport_iscsi ib_srpt target_core_mod ext4 ib_srp scsi_transport_srp
ib_ipoib mbcache jbd2 rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ocrdma(T)
ib_core ses enclosure scsi_transport_sas sg shpchp leds_powernv ibmpowernv i2c_opal
i2c_core powernv_rng ipmi_powernv ipmi_devintf ipmi_msghandler ip_tables xfs
libcrc32c sr_mod sd_mod cdrom lpfc nvme_fc(T) nvme_fabrics nvme_core ipr nvmet_fc(T)
tg3 nvmet libata be2net crc_t10dif crct10dif_generic scsi_transport_fc ptp scsi_tgt
pps_core crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[  635.278687] CPU: 40 PID: 749 Comm: kworker/40:1 Tainted: G
------------ T 3.10.0.bz1510771+ #1
[  635.278782] Workqueue: events resize_hpt_prepare_work [kvm_hv]
[  635.278851] task: c0000007e6840000 ti: c0000007e9180000 task.ti: c0000007e9180000
[  635.278919] NIP: c00000000052f568 LR: c0000000009ea310 CTR: c0000000009ea4f0
[  635.278988] REGS: c0000007e91837f0 TRAP: 0300   Tainted: G
------------ T  (3.10.0.bz1510771+)
[  635.279077] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002022  XER:
00000000
[  635.279248] CFAR: c000000000009368 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
GPR00: c0000000009ea310 c0000007e9183a70 c000000001250b00 c0000007e9183b10
GPR04: 0000000000000000 0000000000000000 c0000007e9183650 0000000000000000
GPR08: c0000007ffff7b80 00000000ffffffff 0000000080000028 d00000000d2529a0
GPR12: 0000000000002200 c000000007b56800 c000000000120028 c0000007f135bb40
GPR16: 0000000000000000 c000000005c1e018 c000000005c1e018 0000000000000000
GPR20: 0000000000000001 c0000000011bf778 0000000000000001 fffffffffffffef7
GPR24: 0000000000000000 c000000f1e262e50 0000000000000002 c0000007e9180000
GPR28: c000000f1e262e4c c000000f1e262e50 0000000000000000 c0000007e9183b10
[  635.280149] NIP [c00000000052f568] __list_add+0x38/0x110
[  635.280197] LR [c0000000009ea310] __mutex_lock_slowpath+0xe0/0x2c0
[  635.280253] Call Trace:
[  635.280277] [c0000007e9183af0] [c0000000009ea310] __mutex_lock_slowpath+0xe0/0x2c0
[  635.280356] [c0000007e9183b70] [c0000000009ea554] mutex_lock+0x64/0x70
[  635.280426] [c0000007e9183ba0] [d00000000d24da04]
resize_hpt_prepare_work+0xe4/0x1c0 [kvm_hv]
[  635.280507] [c0000007e9183c40] [c000000000113c0c] process_one_work+0x1dc/0x680
[  635.280587] [c0000007e9183ce0] [c000000000114250] worker_thread+0x1a0/0x520
[  635.280655] [c0000007e9183d80] [c00000000012010c] kthread+0xec/0x100
[  635.280724] [c0000007e9183e30] [c00000000000a4b8] ret_from_kernel_thread+0x5c/0xa4
[  635.280814] Instruction dump:
[  635.280880] 7c0802a6 fba1ffe8 fbc1fff0 7cbd2b78 fbe1fff8 7c9e2378 7c7f1b78
f8010010
[  635.281099] f821ff81 e8a50008 7fa52040 40de00b8 <e8be0000> 7fbd2840 40de008c
7fbff040
[  635.281324] ---[ end trace b628b73449719b9d ]---

Fixes: b5baa68773 ("KVM: PPC: Book3S HV: KVM-HV HPT resizing implementation")
Signed-off-by: Serhii Popovych <spopovyc@redhat.com>
[dwg: Replaced BUG_ON()s with WARN_ONs() and reworded commit message
 for clarity]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:45:24 +01:00
Serhii Popovych
5584082e85 KVM: PPC: Book3S HV: Drop prepare_done from struct kvm_resize_hpt
commit 3073774e63 upstream.

Currently the kvm_resize_hpt structure has two fields relevant to the
state of an ongoing resize: 'prepare_done', which indicates whether
the worker thread has completed or not, and 'error' which indicates
whether it was successful or not.

Since the success/failure isn't known until completion, this is
confusingly redundant.  This patch consolidates the information into
just the 'error' value: -EBUSY indicates the worked is still in
progress, other negative values indicate (completed) failure, 0
indicates successful completion.

As a bonus this reduces size of struct kvm_resize_hpt by
__alignof__(struct kvm_hpt_info) and saves few bytes of code.

While there correct comment in struct kvm_resize_hpt which references
a non-existent semaphore (leftover from an early draft).

Assert with WARN_ON() in case of HPT allocation thread work runs more
than once for resize request or resize_hpt_allocate() returns -EBUSY
that is treated specially.

Change comparison against zero to make checkpatch.pl happy.

Signed-off-by: Serhii Popovych <spopovyc@redhat.com>
[dwg: Changed BUG_ON()s to WARN_ON()s and altered commit message for
 clarity]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:45:24 +01:00
Alexey Kardashevskiy
fead438752 KVM: PPC: Book3S PR: Fix WIMG handling under pHyp
commit 6c7d47c33e upstream.

Commit 96df226 ("KVM: PPC: Book3S PR: Preserve storage control bits")
added code to preserve WIMG bits but it missed 2 special cases:
- a magic page in kvmppc_mmu_book3s_64_xlate() and
- guest real mode in kvmppc_handle_pagefault().

For these ptes, WIMG was 0 and pHyp failed on these causing a guest to
stop in the very beginning at NIP=0x100 (due to bd9166ffe "KVM: PPC:
Book3S PR: Exit KVM on failed mapping").

According to LoPAPR v1.1 14.5.4.1.2 H_ENTER:

 The hypervisor checks that the WIMG bits within the PTE are appropriate
 for the physical page number else H_Parameter return. (For System Memory
 pages WIMG=0010, or, 1110 if the SAO option is enabled, and for IO pages
 WIMG=01**.)

This hence initializes WIMG to non-zero value HPTE_R_M (0x10), as expected
by pHyp.

[paulus@ozlabs.org - fix compile for 32-bit]

Fixes: 96df226 "KVM: PPC: Book3S PR: Preserve storage control bits"
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Ruediger Oertel <ro@suse.de>
Reviewed-by: Greg Kurz <groug@kaod.org>
Tested-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:45:24 +01:00
John Sperbeck
49b52457e1 powerpc/mm: Fix SEGV on mapped region to return SEGV_ACCERR
commit ecb101aed8 upstream.

The recent refactoring of the powerpc page fault handler in commit
c3350602e8 ("powerpc/mm: Make bad_area* helper functions") caused
access to protected memory regions to indicate SEGV_MAPERR instead of
the traditional SEGV_ACCERR in the si_code field of a user-space
signal handler. This can confuse debug libraries that temporarily
change the protection of memory regions, and expect to use SEGV_ACCERR
as an indication to restore access to a region.

This commit restores the previous behavior. The following program
exhibits the issue:

    $ ./repro read  || echo "FAILED"
    $ ./repro write || echo "FAILED"
    $ ./repro exec  || echo "FAILED"

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/mman.h>
    #include <assert.h>

    static void segv_handler(int n, siginfo_t *info, void *arg) {
            _exit(info->si_code == SEGV_ACCERR ? 0 : 1);
    }

    int main(int argc, char **argv)
    {
            void *p = NULL;
            struct sigaction act = {
                    .sa_sigaction = segv_handler,
                    .sa_flags = SA_SIGINFO,
            };

            assert(argc == 2);
            p = mmap(NULL, getpagesize(),
                    (strcmp(argv[1], "write") == 0) ? PROT_READ : 0,
                    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            assert(p != MAP_FAILED);

            assert(sigaction(SIGSEGV, &act, NULL) == 0);
            if (strcmp(argv[1], "read") == 0)
                    printf("%c", *(unsigned char *)p);
            else if (strcmp(argv[1], "write") == 0)
                    *(unsigned char *)p = 0;
            else if (strcmp(argv[1], "exec") == 0)
                    ((void (*)(void))p)();
            return 1;  /* failed to generate SEGV */
    }

Fixes: c3350602e8 ("powerpc/mm: Make bad_area* helper functions")
Signed-off-by: John Sperbeck <jsperbeck@google.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Add commit references in change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10 09:31:21 +01:00
Ravi Bangoria
a971d10f67 powerpc/perf: Dereference BHRB entries safely
commit f41d84dddc upstream.

It's theoretically possible that branch instructions recorded in
BHRB (Branch History Rolling Buffer) entries have already been
unmapped before they are processed by the kernel. Hence, trying to
dereference such memory location will result in a crash. eg:

    Unable to handle kernel paging request for data at address 0xd000000019c41764
    Faulting instruction address: 0xc000000000084a14
    NIP [c000000000084a14] branch_target+0x4/0x70
    LR [c0000000000eb828] record_and_restart+0x568/0x5c0
    Call Trace:
    [c0000000000eb3b4] record_and_restart+0xf4/0x5c0 (unreliable)
    [c0000000000ec378] perf_event_interrupt+0x298/0x460
    [c000000000027964] performance_monitor_exception+0x54/0x70
    [c000000000009ba4] performance_monitor_common+0x114/0x120

Fix it by deferefencing the addresses safely.

Fixes: 691231846c ("powerpc/perf: Fix setting of "to" addresses for BHRB")
Suggested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
[mpe: Use probe_kernel_read() which is clearer, tweak change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-29 17:53:49 +01:00
Laurent Vivier
5aa30b450a KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()
commit 7333b5aca4 upstream.

When we migrate a VM from a POWER8 host (XICS) to a POWER9 host
(XICS-on-XIVE), we have an error:

qemu-kvm: Unable to restore KVM interrupt controller state \
          (0xff000000) for CPU 0: Invalid argument

This is because kvmppc_xics_set_icp() checks the new state
is internaly consistent, and especially:

...
   1129         if (xisr == 0) {
   1130                 if (pending_pri != 0xff)
   1131                         return -EINVAL;
...

On the other side, kvmppc_xive_get_icp() doesn't set
neither the pending_pri value, nor the xisr value (set to 0)
(and kvmppc_xive_set_icp() ignores the pending_pri value)

As xisr is 0, pending_pri must be set to 0xff.

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-29 17:53:48 +01:00
Cédric Le Goater
8708f68283 KVM: PPC: Book3S: fix XIVE migration of pending interrupts
commit dc1c4165d1 upstream.

When restoring a pending interrupt, we are setting the Q bit to force
a retrigger in xive_finish_unmask(). But we also need to force an EOI
in this case to reach the same initial state : P=1, Q=0.

This can be done by not setting 'old_p' for pending interrupts which
will inform xive_finish_unmask() that an EOI needs to be sent.

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-29 17:53:48 +01:00
Thomas Gleixner
ee8e8b2df6 arch, mm: Allow arch_dup_mmap() to fail
commit c10e83f598 upstream.

In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
allowed to fail. Fix up all instances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-29 17:53:43 +01:00
Daniel Borkmann
8a681dfd8f bpf, ppc64: do not reload skb pointers in non-skb context
[ Upstream commit 87338c8e2c ]

The assumption of unconditionally reloading skb pointers on
BPF helper calls where bpf_helper_changes_pkt_data() holds
true is wrong. There can be different contexts where the helper
would enforce a reload such as in case of XDP. Here, we do
have a struct xdp_buff instead of struct sk_buff as context,
thus this will access garbage.

JITs only ever need to deal with cached skb pointer reload
when ld_abs/ind was seen, therefore guard the reload behind
SEEN_SKB.

Fixes: 156d0e290e ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:32 +01:00
Nicholas Piggin
fa21a13d76 powerpc/watchdog: Do not trigger SMP crash from touch_nmi_watchdog
[ Upstream commit 80e4d70b06 ]

In xmon, touch_nmi_watchdog() is not expected to be checking that
other CPUs have not touched the watchdog, so the code will just call
touch_nmi_watchdog() once before re-enabling hard interrupts.

Just update our CPU's state, and ignore apparently stuck SMP threads.

Arguably touch_nmi_watchdog should check for SMP lockups, and callers
should be fixed, but that's not trivial for the input code of xmon.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:28 +01:00
Nicholas Piggin
97f41b41c4 powerpc/xmon: Avoid tripping SMP hardlockup watchdog
[ Upstream commit 064996d62a ]

The SMP hardlockup watchdog cross-checks other CPUs for lockups, which
causes xmon headaches because it's assuming interrupts hard disabled
means no watchdog troubles. Try to improve that by calling
touch_nmi_watchdog() in obvious places where secondaries are spinning.

Also annotate these spin loops with spin_begin/end calls.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25 14:26:28 +01:00
Breno Leitao
9cd0192298 powerpc/xmon: Check before calling xive functions
[ Upstream commit 402e172a2c ]

Currently xmon could call XIVE functions from OPAL even if the XIVE is
disabled or does not exist in the system, as in POWER8 machines. This
causes the following exception:

 1:mon> dx
 cpu 0x1: Vector: 700 (Program Check) at [c000000423c93450]
     pc: c00000000009cfa4: opal_xive_dump+0x50/0x68
     lr: c0000000000997b8: opal_return+0x0/0x50

This patch simply checks if XIVE is enabled before calling XIVE
functions.

Fixes: 243e25112d ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Suggested-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:34 +01:00
Michael Ellerman
8ee1eada4f powerpc/perf/hv-24x7: Fix incorrect comparison in memord
[ Upstream commit 05c14c0313 ]

In the hv-24x7 code there is a function memord() which tries to
implement a sort function return -1, 0, 1. However one of the
conditions is incorrect, such that it can never be true, because we
will have already returned.

I don't believe there is a bug in practice though, because the
comparisons are an optimisation prior to calling memcmp().

Fix it by swapping the second comparision, so it can be true.

Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:33 +01:00
Tyrel Datwyler
f8b31d88a6 powerpc/pseries/vio: Dispose of virq mapping on vdevice unregister
[ Upstream commit b8f89fea59 ]

When a vdevice is DLPAR removed from the system the vio subsystem
doesn't bother unmapping the virq from the irq_domain. As a result we
have a virq mapped to a hardware irq that is no longer valid for the
irq_domain. A side effect is that we are left with /proc/irq/<irq#>
affinity entries, and attempts to modify the smp_affinity of the irq
will fail.

In the following observed example the kernel log is spammed by
ics_rtas_set_affinity errors after the removal of a VSCSI adapter.
This is a result of irqbalance trying to adjust the affinity every 10
seconds.

  rpadlpar_io: slot U8408.E8E.10A7ACV-V5-C25 removed
  ics_rtas_set_affinity: ibm,set-xive irq=655385 returns -3
  ics_rtas_set_affinity: ibm,set-xive irq=655385 returns -3

This patch fixes the issue by calling irq_dispose_mapping() on the
virq of the viodev on unregister.

Fixes: f2ab621996 ("powerpc/pseries: Add PFO support to the VIO bus")
Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:25 +01:00
Christophe Leroy
0fe5286e49 powerpc/ipic: Fix status get and status clear
[ Upstream commit 6b148a7ce7 ]

IPIC Status is provided by register IPIC_SERSR and not by IPIC_SERMR
which is the mask register.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:25 +01:00
William A. Kennington III
2637cb1e3d powerpc/opal: Fix EBUSY bug in acquiring tokens
[ Upstream commit 71e24d7731 ]

The current code checks the completion map to look for the first token
that is complete. In some cases, a completion can come in but the
token can still be on lease to the caller processing the completion.
If this completed but unreleased token is the first token found in the
bitmap by another tasks trying to acquire a token, then the
__test_and_set_bit call will fail since the token will still be on
lease. The acquisition will then fail with an EBUSY.

This patch reorganizes the acquisition code to look at the
opal_async_token_map for an unleased token. If the token has no lease
it must have no outstanding completions so we should never see an
EBUSY, unless we have leased out too many tokens. Since
opal_async_get_token_inrerruptible is protected by a semaphore, we
will practically never see EBUSY anymore.

Fixes: 8d72482322 ("powerpc/powernv: Infrastructure to support OPAL async completion")
Signed-off-by: William A. Kennington III <wak@google.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:25 +01:00
Shriya
f9565e1e07 powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfo
[ Upstream commit cd77b5ce20 ]

The call to /proc/cpuinfo in turn calls cpufreq_quick_get() which
returns the last frequency requested by the kernel, but may not
reflect the actual frequency the processor is running at. This patch
makes a call to cpufreq_get() instead which returns the current
frequency reported by the hardware.

Fixes: fb5153d05a ("powerpc: powernv: Implement ppc_md.get_proc_freq()")
Signed-off-by: Shriya <shriyak@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20 10:10:24 +01:00
Jeff Layton
fc2f802193 fcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall
commit 4d2dc2cc76 upstream.

Currently, we're capping the values too low in the F_GETLK64 case. The
fields in that structure are 64-bit values, so we shouldn't need to do
any sort of fixup there.

Make sure we check that assumption at build time in the future however
by ensuring that the sizes we're copying will fit.

With this, we no longer need COMPAT_LOFF_T_MAX either, so remove it.

Fixes: 94073ad77f (fs/locks: don't mess with the address limit in compat_fcntl64)
Reported-by: Vitaly Lipatov <lav@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-17 15:07:59 +01:00
Madhavan Srinivasan
86d1d015fe powerpc/perf: Fix pmu_count to count only nest imc pmus
[ Upstream commit de34787f10 ]

"pmu_count" in opal_imc_counters_probe() is intended to hold
the number of successful nest imc pmu registerations. But
current code also counts other imc units like core_imc and
thread_imc. Patch add a check to count only nest imc pmus.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:53:06 +01:00
Nicholas Piggin
f11f5c4b87 powerpc/64s: Initialize ISAv3 MMU registers before setting partition table
commit 371b80447f upstream.

kexec can leave MMU registers set when booting into a new kernel,
the PIDR (Process Identification Register) in particular. The boot
sequence does not zero PIDR, so it only gets set when CPUs first
switch to a userspace processes (until then it's running a kernel
thread with effective PID = 0).

This leaves a window where a process table entry and page tables are
set up due to user processes running on other CPUs, that happen to
match with a stale PID. The CPU with that PID may cause speculative
accesses that address quadrant 0 (aka userspace addresses), which will
result in cached translations and PWC (Page Walk Cache) for that
process, on a CPU which is not in the mm_cpumask and so they will not
be invalidated properly.

The most common result is the kernel hanging in infinite page fault
loops soon after kexec (usually in schedule_tail, which is usually the
first non-speculative quadrant 0 access to a new PID) due to a stale
PWC. However being a stale translation error, it could result in
anything up to security and data corruption problems.

Fix this by zeroing out PIDR at boot and kexec.

Fixes: 7e381c0ff6 ("powerpc/mm/radix: Add mmu context handling callback for radix")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:52:57 +01:00
David Gibson
d46be2f67c Revert "powerpc: Do not call ppc_md.panic in fadump panic notifier"
commit ab9dbf771f upstream.

This reverts commit a3b2cb30f2.

That commit tried to fix problems with panic on powerpc in certain
circumstances, where some output from the generic panic code was being
dropped.

Unfortunately, it breaks things worse in other circumstances. In
particular when running a PAPR guest, it will now attempt to reboot
instead of informing the hypervisor (KVM or PowerVM) that the guest
has crashed. The crash notification is important to some
virtualization management layers.

Revert it for now until we can come up with a better solution.

Fixes: a3b2cb30f2 ("powerpc: Do not call ppc_md.panic in fadump panic notifier")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Tweak change log a bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14 09:52:57 +01:00
Naveen N. Rao
1ffabfc1d5 powerpc/kprobes: Disable preemption before invoking probe handler for optprobes
commit 8a2d71a3f2 upstream.

Per Documentation/kprobes.txt, probe handlers need to be invoked with
preemption disabled. Update optimized_callback() to do so. Also move
get_kprobe_ctlblk() invocation post preemption disable, since it
accesses pre-cpu data.

This was not an issue so far since optprobes wasn't selected if
CONFIG_PREEMPT was enabled. Commit a30b85df7d ("kprobes: Use
synchronize_rcu_tasks() for optprobe with CONFIG_PREEMPT=y") changes
this.

Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-10 13:40:43 +01:00
Naveen N. Rao
f8d0785281 powerpc/jprobes: Disable preemption when triggered through ftrace
commit 6baea433bc upstream.

KPROBES_SANITY_TEST throws the below splat when CONFIG_PREEMPT is
enabled:

  Kprobe smoke test: started
  DEBUG_LOCKS_WARN_ON(val > preempt_count())
  ------------[ cut here ]------------
  WARNING: CPU: 19 PID: 1 at kernel/sched/core.c:3094 preempt_count_sub+0xcc/0x140
  Modules linked in:
  CPU: 19 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc7-nnr+ #97
  task: c0000000fea80000 task.stack: c0000000feb00000
  NIP:  c00000000011d3dc LR: c00000000011d3d8 CTR: c000000000a090d0
  REGS: c0000000feb03400 TRAP: 0700   Not tainted  (4.13.0-rc7-nnr+)
  MSR:  8000000000021033 <SF,ME,IR,DR,RI,LE>  CR: 28000282  XER: 00000000
  CFAR: c00000000015aa18 SOFTE: 0
  <snip>
  NIP preempt_count_sub+0xcc/0x140
  LR  preempt_count_sub+0xc8/0x140
  Call Trace:
    preempt_count_sub+0xc8/0x140 (unreliable)
    kprobe_handler+0x228/0x4b0
    program_check_exception+0x58/0x3b0
    program_check_common+0x16c/0x170
    --- interrupt: 0 at kprobe_target+0x8/0x20
                     LR = init_test_probes+0x248/0x7d0
    kp+0x0/0x80 (unreliable)
    livepatch_handler+0x38/0x74
    init_kprobes+0x1d8/0x208
    do_one_initcall+0x68/0x1d0
    kernel_init_freeable+0x298/0x374
    kernel_init+0x24/0x160
    ret_from_kernel_thread+0x5c/0x70
  Instruction dump:
  419effdc 3d22001b 39299240 81290000 2f890000 409effc8 3c82ffcb 3c62ffcb
  3884bc68 3863bc18 4803d5fd 60000000 <0fe00000> 4bffffa8 60000000 60000000
  ---[ end trace 432dd46b4ce3d29f ]---
  Kprobe smoke test: passed successfully

The issue is that we aren't disabling preemption in
kprobe_ftrace_handler(). Disable it.

Fixes: ead514d5fb ("powerpc/kprobes: Add support for KPROBES_ON_FTRACE")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
[mpe: Trim oops a little for formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-10 13:40:43 +01:00
Michael Ellerman
c12e358867 powerpc/kexec: Fix kexec/kdump in P9 guest kernels
commit 2621e945fb upstream.

The code that cleans up the IAMR/AMOR before kexec'ing failed to
remember that when we're running as a guest AMOR is not writable, it's
hypervisor privileged.

They symptom is that the kexec stops before entering purgatory and
nothing else is seen on the console. If you examine the state of the
system all threads will be in the 0x700 program check handler.

Fix it by making the write to AMOR dependent on HV mode.

Fixes: 1e2a516e89 ("powerpc/kexec: Fix radix to hash kexec due to IAMR/AMOR")
Reported-by: Yilin Zhang <yilzhang@redhat.com>
Debugged-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05 11:26:31 +01:00