Commit graph

753341 commits

Author SHA1 Message Date
Dmitry Safonov
acf4602001 x86/mm: Drop TS_COMPAT on 64-bit exec() syscall
The x86 mmap() code selects the mmap base for an allocation depending on
the bitness of the syscall. For 64bit sycalls it select mm->mmap_base and
for 32bit mm->mmap_compat_base.

exec() calls mmap() which in turn uses in_compat_syscall() to check whether
the mapping is for a 32bit or a 64bit task. The decision is made on the
following criteria:

  ia32    child->thread.status & TS_COMPAT
   x32    child->pt_regs.orig_ax & __X32_SYSCALL_BIT
  ia64    !ia32 && !x32

__set_personality_x32() was dropping TS_COMPAT flag, but
set_personality_64bit() has kept compat syscall flag making
in_compat_syscall() return true during the first exec() syscall.

Which in result has user-visible effects, mentioned by Alexey:
1) It breaks ASAN
$ gcc -fsanitize=address wrap.c -o wrap-asan
$ ./wrap32 ./wrap-asan true
==1217==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==1217==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==1217==Process memory map follows:
        0x000000400000-0x000000401000   /home/izbyshev/test/gcc/asan-exec-from-32bit/wrap-asan
        0x000000600000-0x000000601000   /home/izbyshev/test/gcc/asan-exec-from-32bit/wrap-asan
        0x000000601000-0x000000602000   /home/izbyshev/test/gcc/asan-exec-from-32bit/wrap-asan
        0x0000f7dbd000-0x0000f7de2000   /lib64/ld-2.27.so
        0x0000f7fe2000-0x0000f7fe3000   /lib64/ld-2.27.so
        0x0000f7fe3000-0x0000f7fe4000   /lib64/ld-2.27.so
        0x0000f7fe4000-0x0000f7fe5000
        0x7fed9abff000-0x7fed9af54000
        0x7fed9af54000-0x7fed9af6b000   /lib64/libgcc_s.so.1
[snip]

2) It doesn't seem to be great for security if an attacker always knows
that ld.so is going to be mapped into the first 4GB in this case
(the same thing happens for PIEs as well).

The testcase:
$ cat wrap.c

int main(int argc, char *argv[]) {
  execvp(argv[1], &argv[1]);
  return 127;
}

$ gcc wrap.c -o wrap
$ LD_SHOW_AUXV=1 ./wrap ./wrap true |& grep AT_BASE
AT_BASE:         0x7f63b8309000
AT_BASE:         0x7faec143c000
AT_BASE:         0x7fbdb25fa000

$ gcc -m32 wrap.c -o wrap32
$ LD_SHOW_AUXV=1 ./wrap32 ./wrap true |& grep AT_BASE
AT_BASE:         0xf7eff000
AT_BASE:         0xf7cee000
AT_BASE:         0x7f8b9774e000

Fixes: 1b028f784e ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Fixes: ada26481df ("x86/mm: Make in_compat_syscall() work during exec")
Reported-by: Alexey Izbyshev <izbyshev@ispras.ru>
Bisected-by: Alexander Monakov <amonakov@ispras.ru>
Investigated-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Alexander Monakov <amonakov@ispras.ru>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20180517233510.24996-1-dima@arista.com
2018-05-19 12:31:05 +02:00
Thomas Gleixner
fed71f7d98 x86/apic/x2apic: Initialize cluster ID properly
Rick bisected a regression on large systems which use the x2apic cluster
mode for interrupt delivery to the commit wich reworked the cluster
management.

The problem is caused by a missing initialization of the clusterid field
in the shared cluster data structures. So all structures end up with
cluster ID 0 which only allows sharing between all CPUs which belong to
cluster 0. All other CPUs with a cluster ID > 0 cannot share the data
structure because they cannot find existing data with their cluster
ID. This causes malfunction with IPIs because IPIs are sent to the wrong
cluster and the caller waits for ever that the target CPU handles the IPI.

Add the missing initialization when a upcoming CPU is the first in a
cluster so that the later booting CPUs can find the data and share it for
proper operation.

Fixes: 023a611748 ("x86/apic/x2apic: Simplify cluster management")
Reported-by: Rick Warner <rick@microway.com>
Bisected-by: Rick Warner <rick@microway.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Rick Warner <rick@microway.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1805171418210.1947@nanos.tec.linutronix.de
2018-05-17 21:00:12 +02:00
Kirill A. Shutemov
589bb62be3 x86/boot/compressed/64: Fix moving page table out of trampoline memory
cleanup_trampoline() relocates the top-level page table out of
trampoline memory. We use 'top_pgtable' as our new top-level page table.

But if the 'top_pgtable' would be referenced from C in a usual way,
the address of the table will be calculated relative to RIP.
After kernel gets relocated, the address will be in the middle of
decompression buffer and the page table may get overwritten.
This leads to a crash.

We calculate the address of other page tables relative to the relocation
address. It makes them safe. We should do the same for 'top_pgtable'.

Calculate the address of 'top_pgtable' in assembly and pass down to
cleanup_trampoline().

Move the page table to .pgtable section where the rest of page tables
are. The section is @nobits so we save 4k in kernel image.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: e9d0e6330e ("x86/boot/compressed/64: Prepare new top-level page table for trampoline")
Link: http://lkml.kernel.org/r/20180516080131.27913-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-16 12:15:13 +02:00
Kirill A. Shutemov
5c9b0b1c49 x86/boot/compressed/64: Set up GOT for paging_prepare() and cleanup_trampoline()
Eric and Hugh have reported instant reboot due to my recent changes in
decompression code.

The root cause is that I didn't realize that we need to adjust GOT to be
able to run C code that early.

The problem is only visible with an older toolchain. Binutils >= 2.24 is
able to eliminate GOT references by replacing them with RIP-relative
address loads:

  https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commitdiff;h=80d873266dec

We need to adjust GOT two times:

 - before calling paging_prepare() using the initial load address
 - before calling C code from the relocated kernel

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 194a9749c7 ("x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G")
Link: http://lkml.kernel.org/r/20180516080131.27913-2-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-16 12:15:13 +02:00
Dave Hansen
2fa9d1cfaf x86/pkeys: Do not special case protection key 0
mm_pkey_is_allocated() treats pkey 0 as unallocated.  That is
inconsistent with the manpages, and also inconsistent with
mm->context.pkey_allocation_map.  Stop special casing it and only
disallow values that are actually bad (< 0).

The end-user visible effect of this is that you can now use
mprotect_pkey() to set pkey=0.

This is a bit nicer than what Ram proposed[1] because it is simpler
and removes special-casing for pkey 0.  On the other hand, it does
allow applications to pkey_free() pkey-0, but that's just a silly
thing to do, so we are not going to protect against it.

The scenario that could happen is similar to what happens if you free
any other pkey that is in use: it might get reallocated later and used
to protect some other data.  The most likely scenario is that pkey-0
comes back from pkey_alloc(), an access-disable or write-disable bit
is set in PKRU for it, and the next stack access will SIGSEGV.  It's
not horribly different from if you mprotect()'d your stack or heap to
be unreadable or unwritable, which is generally very foolish, but also
not explicitly prevented by the kernel.

1. http://lkml.kernel.org/r/1522112702-27853-1-git-send-email-linuxram@us.ibm.com

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>p
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Fixes: 58ab9a088d ("x86/pkeys: Check against max pkey to avoid overflows")
Link: http://lkml.kernel.org/r/20180509171358.47FD785E@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
3488a600d9 x86/pkeys/selftests: Add a test for pkey 0
Protection key 0 is the default key for all memory and will
not normally come back from pkey_alloc().  But, you might
still want pass it to mprotect_pkey().

This check ensures that you can use pkey 0.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171356.9E40B254@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
acb25d761d x86/pkeys/selftests: Save off 'prot' for allocations
This makes it possible to to tell what 'prot' a given allocation
is supposed to have.  That way, if we want to change just the
pkey, we know what 'prot' to pass to mprotect_pkey().

Also, keep a record of the most recent allocation so the tests
can easily find it.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171354.AA23E228@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
3d64f4ed15 x86/pkeys/selftests: Fix pointer math
We dump out the entire area of the siginfo where the si_pkey_ptr is
supposed to be.  But, we do some math on the poitner, which is a u32.
We intended to do byte math, not u32 math on the pointer.

Cast it over to a u8* so it works.

Also, move this block of code to below th si_code check.  It doesn't
hurt anything, but the si_pkey field is gibberish for other signal
types.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171352.9BE09819@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
0a0b152083 x86/pkeys: Override pkey when moving away from PROT_EXEC
I got a bug report that the following code (roughly) was
causing a SIGSEGV:

	mprotect(ptr, size, PROT_EXEC);
	mprotect(ptr, size, PROT_NONE);
	mprotect(ptr, size, PROT_READ);
	*ptr = 100;

The problem is hit when the mprotect(PROT_EXEC)
is implicitly assigned a protection key to the VMA, and made
that key ACCESS_DENY|WRITE_DENY.  The PROT_NONE mprotect()
failed to remove the protection key, and the PROT_NONE->
PROT_READ left the PTE usable, but the pkey still in place
and left the memory inaccessible.

To fix this, we ensure that we always "override" the pkee
at mprotect() if the VMA does not have execute-only
permissions, but the VMA has the execute-only pkey.

We had a check for PROT_READ/WRITE, but it did not work
for PROT_NONE.  This entirely removes the PROT_* checks,
which ensures that PROT_NONE now works.

Reported-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Fixes: 62b5f7d013 ("mm/core, x86/mm/pkeys: Add execute-only protection keys support")
Link: http://lkml.kernel.org/r/20180509171351.084C5A71@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
f50b487832 x86/pkeys/selftests: Fix pkey exhaustion test off-by-one
In our "exhaust all pkeys" test, we make sure that there
is the expected number available.  Turns out that the
test did not cover the execute-only key, but discussed
it anyway.  It did *not* discuss the test-allocated
key.

Now that we have a test for the mprotect(PROT_EXEC) case,
this off-by-one issue showed itself.  Correct the off-by-
one and add the explanation for the case we missed.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171350.E1656B95@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
6af17cf89e x86/pkeys/selftests: Add PROT_EXEC test
Under the covers, implement executable-only memory with
protection keys when userspace calls mprotect(PROT_EXEC).

But, we did not have a selftest for that.  Now we do.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171348.9EEE4BEF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
3fcd2b2d92 x86/pkeys/selftests: Factor out "instruction page"
We currently have an execute-only test, but it is for
the explicit mprotect_pkey() interface.  We will soon
add a test for the implicit mprotect(PROT_EXEC)
enterface.  We need this code in both tests.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171347.C64AB733@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
7e7fd67ca3 x86/pkeys/selftests: Allow faults on unknown keys
The exec-only pkey is allocated inside the kernel and userspace
is not told what it is.  So, allow PK faults to occur that have
an unknown key.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171345.7FC7DA00@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
caf9eb6b4c x86/pkeys/selftests: Avoid printf-in-signal deadlocks
printf() and friends are unusable in signal handlers.  They deadlock.
The pkey selftest does not do any normal printing in signal handlers,
only extra debugging.  So, just print the format string so we get
*some* output when debugging.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171344.C53FD2F3@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
a50093d604 x86/pkeys/selftests: Remove dead debugging code, fix dprint_in_signal
There is some noisy debug code at the end of the signal handler.  It was
disabled by an early, unconditional "return".  However, that return also
hid a dprint_in_signal=0, which kept dprint_in_signal=1 and effectively
locked us into permanent dprint_in_signal=1 behavior.

Remove the return and the dead code, fixing dprint_in_signal.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171342.846B9B2E@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
86b9eea230 x86/pkeys/selftests: Stop using assert()
If we use assert(), the program "crashes".  That can be scary to users,
so stop doing it.  Just exit with a >0 exit code instead.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171340.E63EF7DA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Dave Hansen
55556b0b20 x86/pkeys/selftests: Give better unexpected fault error messages
do_not_expect_pk_fault() is a helper that we call when we do not expect
a PK fault to have occurred.  But, it is a function, which means that
it obscures the line numbers from pkey_assert().  It also gives no
details.

Replace it with an implementation that gives nice line numbers and
also lets callers pass in a more descriptive message about what
happened that caused the unexpected fault.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20180509171338.55D13B64@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Andy Lutomirski
59c2a7226f x86/selftests: Add mov_to_ss test
This exercises a nasty corner case of the x86 ISA.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/67e08b69817171da8026e0eb3af0214b06b4d74f.1525800455.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Ingo Molnar
73bb4d6cd1 x86/mpx/selftests: Adjust the self-test to fresh distros that export the MPX ABI
Fix this warning:

  mpx-mini-test.c:422:0: warning: "SEGV_BNDERR" redefined

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: dave.hansen@intel.com
Cc: linux-mm@kvack.org
Cc: linuxram@us.ibm.com
Cc: mpe@ellerman.id.au
Cc: shakeelb@google.com
Cc: shuah@kernel.org
Link: http://lkml.kernel.org/r/20180514085908.GA12798@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Ingo Molnar
0fb96620dc x86/pkeys/selftests: Adjust the self-test to fresh distros that export the pkeys ABI
Ubuntu 18.04 started exporting pkeys details in header files, resulting
in build failures and warnings in the pkeys self-tests:

  protection_keys.c:232:0: warning: "SEGV_BNDERR" redefined
  protection_keys.c:387:5: error: conflicting types for ‘pkey_get’
  protection_keys.c:409:5: error: conflicting types for ‘pkey_set’
  ...

Fix these namespace conflicts and double definitions, plus also
clean up the ABI definitions to make it all a bit more readable ...

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: dave.hansen@intel.com
Cc: linux-mm@kvack.org
Cc: linuxram@us.ibm.com
Cc: mpe@ellerman.id.au
Cc: shakeelb@google.com
Cc: shuah@kernel.org
Link: http://lkml.kernel.org/r/20180514085623.GB7094@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:45 +02:00
Alexander Potapenko
4a09f0210c x86/boot/64/clang: Use fixup_pointer() to access '__supported_pte_mask'
Clang builds with defconfig started crashing after the following
commit:

  fb43d6cb91 ("x86/mm: Do not auto-massage page protections")

This was caused by introducing a new global access in __startup_64().

Code in __startup_64() can be relocated during execution, but the compiler
doesn't have to generate PC-relative relocations when accessing globals
from that function. Clang actually does not generate them, which leads
to boot-time crashes. To work around this problem, every global pointer
must be adjusted using fixup_pointer().

Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: dvyukov@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Cc: md@google.com
Cc: mka@chromium.org
Fixes: fb43d6cb91 ("x86/mm: Do not auto-massage page protections")
Link: http://lkml.kernel.org/r/20180509091822.191810-1-glider@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 11:14:30 +02:00
Ingo Molnar
4fe875e4bd objtool, kprobes/x86: Sync the latest <asm/insn.h> header with tools/objtool/arch/x86/include/asm/insn.h
The following commit:

  ee6a7354a3: kprobes/x86: Prohibit probing on exception masking instructions

Modified <asm/insn.h>, adding the insn_masking_exception() function.

Sync the tooling version of the header to it, to fix this warning:

  Warning: synced file at 'tools/objtool/arch/x86/include/asm/insn.h' differs from latest kernel version at 'arch/x86/include/asm/insn.h'

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Francis Deslauriers <francis.deslauriers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "David S . Miller" <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-14 10:15:54 +02:00
Alexei Starovoitov
b1ae32dbab x86/cpufeature: Guard asm_volatile_goto usage for BPF compilation
Workaround for the sake of BPF compilation which utilizes kernel
headers, but clang does not support ASM GOTO and fails the build.

Fixes: d0266046ad ("x86: Remove FAST_FEATURE_TESTS")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel@iogearbox.net
Cc: peterz@infradead.org
Cc: netdev@vger.kernel.org
Cc: bp@alien8.de
Cc: yhs@fb.com
Cc: kernel-team@fb.com
Cc: torvalds@linux-foundation.org
Cc: davem@davemloft.net
Link: https://lkml.kernel.org/r/20180513193222.1997938-1-ast@kernel.org
2018-05-13 21:49:14 +02:00
Masami Hiramatsu
13ebe18c94 uprobes/x86: Prohibit probing on MOV SS instruction
Since MOV SS and POP SS instructions will delay the exceptions until the
next instruction is executed, single-stepping on it by uprobes must be
prohibited.

uprobe already rejects probing on POP SS (0x1f), but allows probing on MOV
SS (0x8e and reg == 2).  This checks the target instruction and if it is
MOV SS or POP SS, returns -ENOTSUPP to reject probing.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Francis Deslauriers <francis.deslauriers@efficios.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "David S . Miller" <davem@davemloft.net>
Link: https://lkml.kernel.org/r/152587072544.17316.5950935243917346341.stgit@devbox
2018-05-13 19:52:56 +02:00
Masami Hiramatsu
ee6a7354a3 kprobes/x86: Prohibit probing on exception masking instructions
Since MOV SS and POP SS instructions will delay the exceptions until the
next instruction is executed, single-stepping on it by kprobes must be
prohibited.

However, kprobes usually executes those instructions directly on trampoline
buffer (a.k.a. kprobe-booster), except for the kprobes which has
post_handler. Thus if kprobe user probes MOV SS with post_handler, it will
do single-stepping on the MOV SS.

This means it is safe that if it is used via ftrace or perf/bpf since those
don't use the post_handler.

Anyway, since the stack switching is a rare case, it is safer just
rejecting kprobes on such instructions.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Francis Deslauriers <francis.deslauriers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "David S . Miller" <davem@davemloft.net>
Link: https://lkml.kernel.org/r/152587069574.17316.3311695234863248641.stgit@devbox
2018-05-13 19:52:55 +02:00
Tetsuo Handa
a466ef76b8 x86/kexec: Avoid double free_page() upon do_kexec_load() failure
>From ff82bedd3e12f0d3353282054ae48c3bd8c72012 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 9 May 2018 12:12:39 +0900
Subject: [PATCH v3] x86/kexec: avoid double free_page() upon do_kexec_load() failure.

syzbot is reporting crashes after memory allocation failure inside
do_kexec_load() [1]. This is because free_transition_pgtable() is called
by both init_transition_pgtable() and machine_kexec_cleanup() when memory
allocation failed inside init_transition_pgtable().

Regarding 32bit code, machine_kexec_free_page_tables() is called by both
machine_kexec_alloc_page_tables() and machine_kexec_cleanup() when memory
allocation failed inside machine_kexec_alloc_page_tables().

Fix this by leaving the error handling to machine_kexec_cleanup()
(and optionally setting NULL after free_page()).

[1] https://syzkaller.appspot.com/bug?id=91e52396168cf2bdd572fe1e1bc0bc645c1c6b40

Fixes: f5deb79679 ("x86: kexec: Use one page table in x86_64 machine_kexec")
Fixes: 92be3d6bdf ("kexec/i386: allocate page table pages dynamically")
Reported-by: syzbot <syzbot+d96f60296ef613fe1d69@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: thomas.lendacky@amd.com
Cc: prudo@linux.vnet.ibm.com
Cc: Huang Ying <ying.huang@intel.com>
Cc: syzkaller-bugs@googlegroups.com
Cc: takahiro.akashi@linaro.org
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: akpm@linux-foundation.org
Cc: dyoung@redhat.com
Cc: kirill.shutemov@linux.intel.com
Link: https://lkml.kernel.org/r/201805091942.DGG12448.tMFVFSJFQOOLHO@I-love.SAKURA.ne.jp
2018-05-13 19:50:06 +02:00
Linus Torvalds
ccda3c4b77 some small SMB3 fixes for 4.17-rc5, some for stable
-----BEGIN PGP SIGNATURE-----
 
 iQHHBAABCgAxFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlr04iwTHHNtZnJlbmNo
 QGdtYWlsLmNvbQAKCRCKLL1wB3JPUaQ/C/4xelt5SEsr3AEj9SUJFwjHZJ09pSCI
 T1e/ztJLVorE2VAKhVub9bN5hP4vrAfUzWTHDeDB02eQy8BLUaUKFMxhN2UjRfix
 pdL9vLIGMXNCuGROWnO3UHbE99b1Dht8pdpmt+tTevSwYnHj7ACbyoZJsESOhged
 5WDwUCM2FTmyu4L3UqD8ZDp5jrvx2q6qTOpck53v3lLbnkRx/rNLLdamFXo1IRd3
 cypXIck+v+g1pv0mbs5cYdtHRQqiiP7RbY7bmvNJrV0G95iNGgonuWaCwiMOLqmZ
 R9aw9ktGZepu2aWlhhiV1nNxI1zo7pvMSoRYT2Pd1ZaxUbTb3LvBwkviAFgz8Dft
 a10MzvLMrxY79q/4FdWf+FdsQ0QHh98XwsMdAFuraI0zlNN+9/0UPDYd8X/LFgoK
 3GOZDNyQ6LPvCsJ4ZmzugneuCRzWk/HU4y1uo67vjqUODKjPPZ8qXGOz0sNSJfnp
 gjpdw6G2GW4PLH4q3oppbNfetAaqhm24pMw=
 =qAFz
 -----END PGP SIGNATURE-----

Merge tag '4.17-rc4-SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Some small SMB3 fixes for 4.17-rc5, some for stable"

* tag '4.17-rc4-SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: directory sync should not return an error
  cifs: smb2ops: Fix listxattr() when there are no EAs
  cifs: smbd: Enable signing with smbdirect
  cifs: Allocate validate negotiation request through kmalloc
2018-05-12 18:49:53 -07:00
Linus Torvalds
427fbe8926 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux
Pull thermal fixes from Zhang Rui:

 - fix NULL pointer dereference on module load/probe for int3403_thermal
   driver

 - fix an emergency shutdown issue on exynos thermal driver

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux:
  thermal: exynos: Propagate error value from tmu_read()
  thermal: exynos: Reading temperature makes sense only when TMU is turned on
  thermal: int3403_thermal: Fix NULL pointer deref on module load / probe
2018-05-12 10:58:57 -07:00
Linus Torvalds
0d4cafd12f for-linus-20180511
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJa9gfhAAoJEPfTWPspceCm1vkP/3sNyf+FJy0Jz+EEHN01wJZP
 z77IRN+r5dKIi2bnOpfVSRTpEyCP9QTOAmv1waNygMFG6OcVmDFsgFoe/fnqGVo8
 VZjiQ+rpVrP8rEXzi1nMJunuImmx3JIV8E44+IKWKulipxHy38WXy73tWSEBrkoN
 uvL2UGJZOaKMoxC8XRsqDK/JVrhFSif44hNp8STPAo3e3YBPs95xJpV5qeIlJ8i1
 K2CXIiLVUUyM17vnW7dZR+nMoXHMXbQWwWpbWvgZ/ZDpPREHHddG+9NRkrinkGZI
 pLIyyPcamFifhiFvbV+3oWDiG/f/2CXFo45iO/uuxxW8JQgCMyQq4TukiTYrlbCt
 ciayjj0MQy8lF0IRIuFSgTyxp8iJrBYJEkaqBZcIRNdMTEFnE/ig43G7i7LqmZon
 7YjCC1EBZ6UNKgXaazqX+573wFUScCHMcmHcTWqE4Yfj93Q8mVTVedyYaJTwFSIq
 bNM2dspae+4iBxQa3z4w66dIcDC5ngJyKq2w2YdcO3H4KnT3FpTp/yv8GGW9y8lZ
 JiP2MYA2j4G3JtJLE40L6L2FnmI70APbHSX1JiLsS22J4UpLGElo8NUe5uwYPd/u
 jy7J6VAu/sl6e/+PlBIRV01pOGXzPSg3Fis9jZFtZJy2zddZVw+grapI6cs3o1Hq
 DYZnS0nFd2GmvnKmTI9p
 =Oq/z
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180511' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Just a few NVMe fixes this round - one fixing a use-after-free, one
  fixes the return value after controller reset, and the last one fixes
  an issue where some drives will spuriously EIO. We should get these
  into 4.17"

* tag 'for-linus-20180511' of git://git.kernel.dk/linux-block:
  nvme: add quirk to force medium priority for SQ creation
  nvme: Fix sync controller reset return
  nvme: fix use-after-free in nvme_free_ns_head
2018-05-12 10:55:48 -07:00
Linus Torvalds
f0ab773f5c Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  rbtree: include rcu.h
  scripts/faddr2line: fix error when addr2line output contains discriminator
  ocfs2: take inode cluster lock before moving reflinked inode from orphan dir
  mm, oom: fix concurrent munlock and oom reaper unmap, v3
  mm: migrate: fix double call of radix_tree_replace_slot()
  proc/kcore: don't bounds check against address 0
  mm: don't show nr_indirectly_reclaimable in /proc/vmstat
  mm: sections are not offlined during memory hotremove
  z3fold: fix reclaim lock-ups
  init: fix false positives in W+X checking
  lib/find_bit_benchmark.c: avoid soft lockup in test_find_first_bit()
  KASAN: prohibit KASAN+STRUCTLEAK combination
  MAINTAINERS: update Shuah's email address
2018-05-11 18:04:12 -07:00
Sebastian Andrzej Siewior
2075b16e32 rbtree: include rcu.h
Since commit c1adf20052 ("Introduce rb_replace_node_rcu()")
rbtree_augmented.h uses RCU related data structures but does not include
the header file.  It works as long as it gets somehow included before
that and fails otherwise.

Link: http://lkml.kernel.org/r/20180504103159.19938-1-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Changbin Du
78eb0c6356 scripts/faddr2line: fix error when addr2line output contains discriminator
When addr2line output contains discriminator, the current awk script
cannot parse it.  This patch fixes it by extracting key words using
regex which is more reliable.

  $ scripts/faddr2line vmlinux tlb_flush_mmu_free+0x26
  tlb_flush_mmu_free+0x26/0x50:
  tlb_flush_mmu_free at mm/memory.c:258 (discriminator 3)
  scripts/faddr2line: eval: line 173: unexpected EOF while looking for matching `)'

Link: http://lkml.kernel.org/r/1525323379-25193-1-git-send-email-changbin.du@intel.com
Fixes: 6870c0165f ("scripts/faddr2line: show the code context")
Signed-off-by: Changbin Du <changbin.du@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Ashish Samant
e438302920 ocfs2: take inode cluster lock before moving reflinked inode from orphan dir
While reflinking an inode, we create a new inode in orphan directory,
then take EX lock on it, reflink the original inode to orphan inode and
release EX lock.  Once the lock is released another node could request
it in EX mode from ocfs2_recover_orphans() which causes downconvert of
the lock, on this node, to NL mode.

Later we attempt to initialize security acl for the orphan inode and
move it to the reflink destination.  However, while doing this we dont
take EX lock on the inode.  This could potentially cause problems
because we could be starting transaction, accessing journal and
modifying metadata of the inode while holding NL lock and with another
node holding EX lock on the inode.

Fix this by taking orphan inode cluster lock in EX mode before
initializing security and moving orphan inode to reflink destination.
Use the __tracker variant while taking inode lock to avoid recursive
locking in the ocfs2_init_security_and_acl() call chain.

Link: http://lkml.kernel.org/r/1523475107-7639-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <ge.changwei@h3c.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
David Rientjes
27ae357fa8 mm, oom: fix concurrent munlock and oom reaper unmap, v3
Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.

This allows munlock_vma_pages_all() to concurrently run while the oom
reaper is operating on a vma.  Since munlock_vma_pages_range() depends
on clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
is zapped by the oom reaper during follow_page_mask() after the check
for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a
kernel oops.

Fix this by manually freeing all possible memory from the mm before
doing the munlock and then setting MMF_OOM_SKIP.  The oom reaper can not
run on the mm anymore so the munlock is safe to do in exit_mmap().  It
also matches the logic that the oom reaper currently uses for
determining when to set MMF_OOM_SKIP itself, so there's no new risk of
excessive oom killing.

This issue fixes CVE-2018-1000200.

Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com
Fixes: 2129258024 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Naoya Horiguchi
013567be19 mm: migrate: fix double call of radix_tree_replace_slot()
radix_tree_replace_slot() is called twice for head page, it's obviously
a bug.  Let's fix it.

Link: http://lkml.kernel.org/r/20180423072101.GA12157@hori1.linux.bs1.fc.nec.co.jp
Fixes: e71769ae52 ("mm: enable thp migration for shmem thp")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Zi Yan <zi.yan@sent.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Laura Abbott
3955333df9 proc/kcore: don't bounds check against address 0
The existing kcore code checks for bad addresses against __va(0) with
the assumption that this is the lowest address on the system.  This may
not hold true on some systems (e.g.  arm64) and produce overflows and
crashes.  Switch to using other functions to validate the address range.

It's currently only seen on arm64 and it's not clear if anyone wants to
use that particular combination on a stable release.  So this is not
urgent for stable.

Link: http://lkml.kernel.org/r/20180501201143.15121-1-labbott@redhat.com
Signed-off-by: Laura Abbott <labbott@redhat.com>
Tested-by: Dave Anderson <anderson@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>a
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Roman Gushchin
7aaf772723 mm: don't show nr_indirectly_reclaimable in /proc/vmstat
Don't show nr_indirectly_reclaimable in /proc/vmstat, because there is
no need to export this vm counter to userspace, and some changes are
expected in reclaimable object accounting, which can alter this counter.

Link: http://lkml.kernel.org/r/20180425191422.9159-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Pavel Tatashin
27227c7338 mm: sections are not offlined during memory hotremove
Memory hotplug and hotremove operate with per-block granularity.  If the
machine has a large amount of memory (more than 64G), the size of a
memory block can span multiple sections.  By mistake, during hotremove
we set only the first section to offline state.

The bug was discovered because kernel selftest started to fail:
  https://lkml.kernel.org/r/20180423011247.GK5563@yexl-desktop

After commit, "mm/memory_hotplug: optimize probe routine".  But, the bug
is older than this commit.  In this optimization we also added a check
for sections to be in a proper state during hotplug operation.

Link: http://lkml.kernel.org/r/20180427145257.15222-1-pasha.tatashin@oracle.com
Fixes: 2d070eab2e ("mm: consider zone which is not fully populated to have holes")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Vitaly Wool
6098d7e136 z3fold: fix reclaim lock-ups
Do not try to optimize in-page object layout while the page is under
reclaim.  This fixes lock-ups on reclaim and improves reclaim
performance at the same time.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180430125800.444cae9706489f412ad12621@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: <Oleksiy.Avramchenko@sony.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Jeffrey Hugo
ae646f0b9c init: fix false positives in W+X checking
load_module() creates W+X mappings via __vmalloc_node_range() (from
layout_and_allocate()->move_module()->module_alloc()) by using
PAGE_KERNEL_EXEC.  These mappings are later cleaned up via
"call_rcu_sched(&freeinit->rcu, do_free_init)" from do_init_module().

This is a problem because call_rcu_sched() queues work, which can be run
after debug_checkwx() is run, resulting in a race condition.  If hit,
the race results in a nasty splat about insecure W+X mappings, which
results in a poor user experience as these are not the mappings that
debug_checkwx() is intended to catch.

This issue is observed on multiple arm64 platforms, and has been
artificially triggered on an x86 platform.

Address the race by flushing the queued work before running the
arch-defined mark_rodata_ro() which then calls debug_checkwx().

Link: http://lkml.kernel.org/r/1525103946-29526-1-git-send-email-jhugo@codeaurora.org
Fixes: e1a58320a3 ("x86/mm: Warn on W^X mappings")
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Reported-by: Timur Tabi <timur@codeaurora.org>
Reported-by: Jan Glauber <jan.glauber@caviumnetworks.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Yury Norov
4ba281d5bd lib/find_bit_benchmark.c: avoid soft lockup in test_find_first_bit()
test_find_first_bit() is intentionally sub-optimal, and may cause soft
lockup due to long time of run on some systems.  So decrease length of
bitmap to traverse to avoid lockup.

With the change below, time of test execution doesn't exceed 0.2 seconds
on my testing system.

Link: http://lkml.kernel.org/r/20180420171949.15710-1-ynorov@caviumnetworks.com
Fixes: 4441fca0a2 ("lib: test module for find_*_bit() functions")
Signed-off-by: Yury Norov <ynorov@caviumnetworks.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Dmitry Vyukov
c9cf87ea6a KASAN: prohibit KASAN+STRUCTLEAK combination
Currently STRUCTLEAK inserts initialization out of live scope of variables
from KASAN point of view.  This leads to KASAN false positive reports.
Prohibit this combination for now.

Link: http://lkml.kernel.org/r/20180419172451.104700-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Shuah Khan (Samsung OSG)
1d1c8e5f0d MAINTAINERS: update Shuah's email address
Update email address in MAINTAINERS file due to IT infrastructure changes
at Samsung.

Link: http://lkml.kernel.org/r/20180501212815.25911-1-shuah@kernel.org
Signed-off-by: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Linus Torvalds
4bc871984f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Verify lengths of keys provided by the user is AF_KEY, from Kevin
    Easton.

 2) Add device ID for BCM89610 PHY. Thanks to Bhadram Varka.

 3) Add Spectre guards to some ATM code, courtesy of Gustavo A. R.
    Silva.

 4) Fix infinite loop in NSH protocol code. To Eric Dumazet we are most
    grateful for this fix.

 5) Line up /proc/net/netlink headers properly. This fix from YU Bo, we
    do appreciate.

 6) Use after free in TLS code. Once again we are blessed by the
    honorable Eric Dumazet with this fix.

 7) Fix regression in TLS code causing stalls on partial TLS records.
    This fix is bestowed upon us by Andrew Tomt.

 8) Deal with too small MTUs properly in LLC code, another great gift
    from Eric Dumazet.

 9) Handle cached route flushing properly wrt. MTU locking in ipv4, to
    Hangbin Liu we give thanks for this.

10) Fix regression in SO_BINDTODEVIC handling wrt. UDP socket demux.
    Paolo Abeni, he gave us this.

11) Range check coalescing parameters in mlx4 driver, thank you Moshe
    Shemesh.

12) Some ipv6 ICMP error handling fixes in rxrpc, from our good brother
    David Howells.

13) Fix kexec on mlx5 by freeing IRQs in shutdown path. Daniel Juergens,
    you're the best!

14) Don't send bonding RLB updates to invalid MAC addresses. Debabrata
    Benerjee saved us!

15) Uh oh, we were leaking in udp_sendmsg and ping_v4_sendmsg. The ship
    is now water tight, thanks to Andrey Ignatov.

16) IPSEC memory leak in ixgbe from Colin Ian King, man we've got holes
    everywhere!

17) Fix error path in tcf_proto_create, Jiri Pirko what would we do
    without you!

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (92 commits)
  net sched actions: fix refcnt leak in skbmod
  net: sched: fix error path in tcf_proto_create() when modules are not configured
  net sched actions: fix invalid pointer dereferencing if skbedit flags missing
  ixgbe: fix memory leak on ipsec allocation
  ixgbevf: fix ixgbevf_xmit_frame()'s return type
  ixgbe: return error on unsupported SFP module when resetting
  ice: Set rq_last_status when cleaning rq
  ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg
  mlxsw: core: Fix an error handling path in 'mlxsw_core_bus_device_register()'
  bonding: send learning packets for vlans on slave
  bonding: do not allow rlb updates to invalid mac
  net/mlx5e: Err if asked to offload TC match on frag being first
  net/mlx5: E-Switch, Include VF RDMA stats in vport statistics
  net/mlx5: Free IRQs in shutdown path
  rxrpc: Trace UDP transmission failure
  rxrpc: Add a tracepoint to log ICMP/ICMP6 and error messages
  rxrpc: Fix the min security level for kernel calls
  rxrpc: Fix error reception on AF_INET6 sockets
  rxrpc: Fix missing start of call timeout
  qed: fix spelling mistake: "taskelt" -> "tasklet"
  ...
2018-05-11 14:14:46 -07:00
Linus Torvalds
a1f45efbb9 NFS client fixes for Linux 4.17-rc4
Bugfixes:
 - Fix a possible NFSoRDMA list corruption during recovery
 - Fix sunrpc tracepoint crashes
 
 Other change:
 - Update Trond's email in the MAINTAINERS file
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlr2ABEACgkQ18tUv7Cl
 QOvLew//WipZ1of+dZpiGa95pqVKBrIxq5R1y8LACmEKaiyfHOOoFcaopI7YDU1r
 OkBRZkldMLKOSGZsQ9xEjh3OOPgW60oInFZ2sD2qjnph23x09IcDbiCp8iJ0PTFI
 iD9ioUKc3h7FSl0pJQSjIo9+9fFsTZzIioxP7tDZt2Kog5OMIZeWAqRIj1xmgu5i
 TX793gTFJ+SfMSkvWZM5oOHVEmW/oXAgWsgaVXEqkdjK2JI6KYKqAgMj0CLvvNIo
 S2eeJjbyd9Hl59lDo50NzrZEQESlPYod6ZDfEOmF50mxC3MCLlmtAgwXKknVaY1N
 1L4tFuBoXBLV0jctBztuqMIDKXncoNlsCvr38WqkBaFxikKpK8dFqeByh+wCTdtz
 pwMPHFDQmQB1mIwqzQa+O6MAZ5n3a/cgyWQtoymlq5ddQU3roB2euWXRmaoXPudY
 SnmEVYxq839Ukw16qNa1HkKkroy8Zzqr5+sS30w/l916U9/S3ZolXF+XU5ux+6hQ
 Mlu9aW5SCP4S5QresaAcjPcBdLvbjN8/h/I8bdCmPRCGVKSkcxSz2MZYUli8UxAq
 tht4tQtuCY1XInQPnuf20egnJnrhpgQjb8Xx5BvTtcEkFvz9F36lzK4ot0lqQzTo
 tGDDW8gpeskt0Z1PC4eD1gq/E+FSywP7gg/g32AMdk2GpCewBog=
 =xfe9
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.17-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "These patches fix both a possible corruption during NFSoRDMA MR
  recovery, and a sunrpc tracepoint crash.

  Additionally, Trond has a new email address to put in the MAINTAINERS
  file"

* tag 'nfs-for-4.17-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  Change Trond's email address in MAINTAINERS
  sunrpc: Fix latency trace point crashes
  xprtrdma: Fix list corruption / DMAR errors during MR recovery
2018-05-11 13:56:43 -07:00
Roman Mashak
a52956dfc5 net sched actions: fix refcnt leak in skbmod
When application fails to pass flags in netlink TLV when replacing
existing skbmod action, the kernel will leak refcnt:

$ tc actions get action skbmod index 1
total acts 0

        action order 0: skbmod pipe set smac 00:11:22:33:44:55
         index 1 ref 1 bind 0

For example, at this point a buggy application replaces the action with
index 1 with new smac 00:aa:22:33:44:55, it fails because of zero flags,
however refcnt gets bumped:

$ tc actions get actions skbmod index 1
total acts 0

        action order 0: skbmod pipe set smac 00:11:22:33:44:55
         index 1 ref 2 bind 0
$

Tha patch fixes this by calling tcf_idr_release() on existing actions.

Fixes: 86da71b573 ("net_sched: Introduce skbmod action")
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 16:37:03 -04:00
Linus Torvalds
ac42803695 These patches fix two long-standing bugs in the DIO code path, one of
which is a crash trivially triggerable with splice().
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJa9bYgAAoJEEp/3jgCEfOLlz4H/2rlMpTGRn2mHwiK3DmVsmXT
 J9YRAClWEmJOAbtwjwToRB9QWbx8vkVRTI+CDJDK4IMVXgTCpAexvymdzeC54gZw
 PzTRfSjXEGa4fVF5P+TwWrz4nGgBq+uEz35dK7ZB0zJZHPiZbXdMChCNnPaVdPQY
 2kat9alDAuSyBt7HLt1J+EwKaFBUJcCL0TJJUirFrO3HYN9npj/QcvKSdK0OpPYA
 WzJDduuCFv4V3o9j03sP0SRw2jYJqzWqw3nrs+47uxPeStDZ9OK7cXztP4BdbDjv
 WYlNewcC6H578MvB0GFeQ8ED9RJj3C/6JRhQReF45COUQaNS91i1gQi7KRGviOw=
 =StVr
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.17-rc5' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "These patches fix two long-standing bugs in the DIO code path, one of
  which is a crash trivially triggerable with splice()"

* tag 'ceph-for-4.17-rc5' of git://github.com/ceph/ceph-client:
  ceph: fix iov_iter issues in ceph_direct_read_write()
  libceph: add osd_req_op_extent_osd_data_bvecs()
  ceph: fix rsize/wsize capping in ceph_direct_read_write()
2018-05-11 13:36:06 -07:00
Jiri Pirko
d68d75fdc3 net: sched: fix error path in tcf_proto_create() when modules are not configured
In case modules are not configured, error out when tp->ops is null
and prevent later null pointer dereference.

Fixes: 33a48927c1 ("sched: push TC filter protocol creation into a separate function")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-11 16:34:38 -04:00
Linus Torvalds
3f5f8596ed Fixes for critical regressions and a build failure.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJa9eaYAAoJELcQ+SIFb8HamGAH/imdKSv7GJe7OOZyJ1AjHPIc
 BLklw6RA6ejGpKdFNVzK77g+5DmpH+00sInWJNDginQCop+quH5aqTvLdVaoFh/w
 /XU+aZClSNvqI3wesPPKTKuWj/bofDks7pGfiXsBXrSIT6SGQVMC7U/iL9YhkyN2
 93hgdYs1qEoy2d3MgFfyRmAJKhPVnr4jbzN5RLXpbZox9NMYBa4qPL2GXvhx0A7T
 3DtXykSmfBAHBHbAnLCIvwgJ9aQg9TUVMKJo3jXynTvnD+5qgLYiE8y3j+Yg6qRm
 WG1G5f5sU9fEPgkYApg30wdrO3HbSi9POsLDW0Inzlt1/WUMrC3iJ8Gza7K58kM=
 =VZ/j
 -----END PGP SIGNATURE-----

Merge tag 'sh-for-4.17-fixes' of git://git.libc.org/linux-sh

Pull arch/sh fixes from Rich Felker:
 "Fixes for critical regressions and a build failure.

  The regressions were introduced in 4.15 and 4.17-rc1 and prevented
  booting on affected systems"

* tag 'sh-for-4.17-fixes' of git://git.libc.org/linux-sh:
  sh: switch to NO_BOOTMEM
  sh: mm: Fix unprotected access to struct device
  sh: fix build failure for J2 cpu with SMP disabled
2018-05-11 13:14:24 -07:00
Linus Torvalds
7404bc2773 arm64 fixes:
- Mitigate Spectre-v2 for NVIDIA Denver CPUs
 
 - Free memblocks corresponding to freed initrd area
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCgAGBQJa9bgsAAoJELescNyEwWM0yYwIAKvMuUU8d6fy/5EdjTm2uG9p
 DoSw+ezHeiUrphQwNvOc/fj0vGutM+sftcmghRV1KmP7lvAqk/zvK57PAZjwQ5ua
 i1X2AJemKr7Gs77FV5Y6Jgkkd2kaIh3n86d9/hM7n9TfAt31vPAYCapb8h3LbRBJ
 bjZXoTHeujZAIMLGyxzLGVlk9MdW2UjQ3LvWGby/mFEPuktJKkApxBSNQOJOuRKw
 Ny/eCwFhbyLzDA4zXw7hASld/J+WWBhk0m8ks2qy7BD/F2auZX/p5flU/NoE1VXi
 JevclGif18iQtZQRV/hJ1woLROfbp6cRKWaVB4cEFKSnB2mG6FLSfrYyvbCj6LE=
 =lZDP
 -----END PGP SIGNATURE-----

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:
 "There's a small memblock accounting problem when freeing the initrd
  and a Spectre-v2 mitigation for NVIDIA Denver CPUs which just requires
  a match on the CPU ID register.

  Summary:

   - Mitigate Spectre-v2 for NVIDIA Denver CPUs

   - Free memblocks corresponding to freed initrd area"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: capabilities: Add NVIDIA Denver CPU to bp_harden list
  arm64: Add MIDR encoding for NVIDIA CPUs
  arm64: To remove initrd reserved area entry from memblock
2018-05-11 13:09:04 -07:00