Commit graph

2475 commits

Author SHA1 Message Date
Andy Lutomirski
dbd68d8e84 x86/mm: Fix flush_tlb_page() on Xen
flush_tlb_page() passes a bogus range to flush_tlb_others() and
expects the latter to fix it up.  native_flush_tlb_others() has the
fixup but Xen's version doesn't.  Move the fixup to
flush_tlb_others().

AFAICS the only real effect is that, without this fix, Xen would
flush everything instead of just the one page on remote vCPUs in
when flush_tlb_page() was called.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: e7b52ffd45 ("x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range")
Link: http://lkml.kernel.org/r/10ed0e4dfea64daef10b87fb85df1746999b4dba.1492844372.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26 10:02:06 +02:00
Andy Lutomirski
ce27374fab x86/mm: Make flush_tlb_mm_range() more predictable
I'm about to rewrite the function almost completely, but first I
want to get a functional change out of the way.  Currently, if
flush_tlb_mm_range() does not flush the local TLB at all, it will
never do individual page flushes on remote CPUs.  This seems to be
an accident, and preserving it will be awkward.  Let's change it
first so that any regressions in the rewrite will be easier to
bisect and so that the rewrite can attempt to change no visible
behavior at all.

The fix is simple: we can simply avoid short-circuiting the
calculation of base_pages_to_flush.

As a side effect, this also eliminates a potential corner case: if
tlb_single_page_flush_ceiling == TLB_FLUSH_ALL, flush_tlb_mm_range()
could have ended up flushing the entire address space one page at a
time.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4b29b771d9975aad7154c314534fec235618175a.1492844372.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26 10:02:06 +02:00
Andy Lutomirski
29961b59a5 x86/mm: Remove flush_tlb() and flush_tlb_current_task()
I was trying to figure out what how flush_tlb_current_task() would
possibly work correctly if current->mm != current->active_mm, but I
realized I could spare myself the effort: it has no callers except
the unused flush_tlb() macro.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e52d64c11690f85e9f1d69d7b48cc2269cd2e94b.1492844372.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26 10:02:06 +02:00
Kirill A. Shutemov
e6ab9c4d43 x86/mm/64: Fix crash in remove_pagetable()
remove_pagetable() does page walk using p*d_page_vaddr() plus cast.
It's not canonical approach -- we usually use p*d_offset() for that.

It works fine as long as all page table levels are present. We broke the
invariant by introducing folded p4d page table level.

As result, remove_pagetable() interprets PMD as PUD and it leads to
crash:

	BUG: unable to handle kernel paging request at ffff880300000000
	IP: memchr_inv+0x60/0x110
	PGD 317d067
	P4D 317d067
	PUD 3180067
	PMD 33f102067
	PTE 8000000300000060

Let's fix this by using p*d_offset() instead of p*d_page_vaddr() for
page walk.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: f2a6a70501 ("x86: Convert the rest of the code to support p4d_t")
Link: http://lkml.kernel.org/r/20170425092557.21852-1-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26 08:26:43 +02:00
Ingo Molnar
6dd29b3df9 Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"
This reverts commit 2947ba054a.

Dan Williams reported dax-pmem kernel warnings with the following signature:

   WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
   percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic

... and bisected it to this commit, which suggests possible memory corruption
caused by the x86 fast-GUP conversion.

He also pointed out:

 "
  This is similar to the backtrace when we were not properly handling
  pud faults and was fixed with this commit: 220ced1676 "mm: fix
  get_user_pages() vs device-dax pud mappings"

  I've found some missing _devmap checks in the generic
  get_user_pages_fast() path, but this does not fix the regression
  [...]
 "

So given that there are known bugs, and a pretty robust looking bisection
points to this commit suggesting that are unknown bugs in the conversion
as well, revert it for the time being - we'll re-try in v4.13.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dann.frazier@canonical.com
Cc: dave.hansen@intel.com
Cc: steve.capper@linaro.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23 11:45:20 +02:00
Juergen Gross
84bbabc3a4 x86/mm: Fix dump pagetables for 4 levels of page tables
Commit fdd3d8ce0e ("x86/dump_pagetables: Add support for 5-level
paging") introduced an error for dumping with only 4 levels by setting
PGD_LEVEL_MULT to a wrong value.

This is leading to e.g. addresses printed as "(null)" for ranges:

  x86/mm: Found insecure W+X mapping at address (null)/(null)

Make PGD_LEVEL_MULT a multiple of PTRS_PER_P4D instead of PTRS_PER_PUD

Fixes: fdd3d8ce0e ("x86/dump_pagetables: Add support for 5-level paging")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20170412143634.6846-1-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-13 00:26:30 +02:00
Kees Cook
a4866aa812 mm: Tighten x86 /dev/mem with zeroing reads
Under CONFIG_STRICT_DEVMEM, reading System RAM through /dev/mem is
disallowed. However, on x86, the first 1MB was always allowed for BIOS
and similar things, regardless of it actually being System RAM. It was
possible for heap to end up getting allocated in low 1MB RAM, and then
read by things like x86info or dd, which would trip hardened usercopy:

usercopy: kernel memory exposure attempt detected from ffff880000090000 (dma-kmalloc-256) (4096 bytes)

This changes the x86 exception for the low 1MB by reading back zeros for
System RAM areas instead of blindly allowing them. More work is needed to
extend this to mmap, but currently mmap doesn't go through usercopy, so
hardened usercopy won't Oops the kernel.

Reported-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Tested-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-04-12 11:40:23 -07:00
Joerg Roedel
5ed386ec09 x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space
When this function fails it just sends a SIGSEGV signal to
user-space using force_sig(). This signal is missing
essential information about the cause, e.g. the trap_nr or
an error code.

Fix this by propagating the error to the only caller of
mpx_handle_bd_fault(), do_bounds(), which sends the correct
SIGSEGV signal to the process.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fe3d197f84 ('x86, mpx: On-demand kernel allocation of bounds tables')
Link: http://lkml.kernel.org/r/1491488362-27198-1-git-send-email-joro@8bytes.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-12 08:40:58 +02:00
Ingo Molnar
b6466d53af Merge branch 'x86/urgent' into x86/cpu, to resolve conflict
Conflicts:
	arch/x86/kernel/cpu/intel_rdt_schemata.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-11 10:47:28 +02:00
Ingo Molnar
e5185a76a2 Merge branch 'x86/boot' into x86/mm, to avoid conflict
There's a conflict between ongoing level-5 paging support and
the E820 rewrite. Since the E820 rewrite is essentially ready,
merge it into x86/mm to reduce tree conflicts.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-11 08:56:05 +02:00
Ingo Molnar
4729277156 Merge branch 'WIP.x86/boot' into x86/boot, to pick up ready branch
The E820 rework in WIP.x86/boot has gone through a couple of weeks
of exposure in -tip, merge it in a wider fashion.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-11 08:49:31 +02:00
Thomas Gleixner
b678c91aef Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()"
This reverts commit 474aeffd88 due to testing
failures.

Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20170406124459.dwn5zhpr2xqg3lqm@node.shutemov.name
2017-04-08 00:00:53 +02:00
David Howells
3c2e2e6816 Annotate hardware config module parameters in arch/x86/mm/
When the kernel is running in secure boot mode, we lock down the kernel to
prevent userspace from modifying the running kernel image.  Whilst this
includes prohibiting access to things like /dev/mem, it must also prevent
access by means of configuring driver modules in such a way as to cause a
device to access or modify the kernel image.

To this end, annotate module_param* statements that refer to hardware
configuration and indicate for future reference what type of parameter they
specify.  The parameter parser in the core sees this information and can
skip such parameters with an error message if the kernel is locked down.
The module initialisation then runs as normal, but just sees whatever the
default values for those parameters is.

Note that we do still need to do the module initialisation because some
drivers have viable defaults set in case parameters aren't specified and
some drivers support automatic configuration (e.g. PNP or PCI) in addition
to manually coded parameters.

This patch annotates drivers in arch/x86/mm/.

[Note: With respect to testmmiotrace, an additional patch will be added
 separately that makes the module refuse to load if the kernel is locked
 down.]

Suggested-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
cc: Ingo Molnar <mingo@kernel.org>
cc: Thomas Gleixner <tglx@linutronix.de>
cc: "H. Peter Anvin" <hpa@zytor.com>
cc: x86@kernel.org
cc: linux-kernel@vger.kernel.org
cc: nouveau@lists.freedesktop.org
2017-04-04 16:54:21 +01:00
Kirill A. Shutemov
5480bb61cf x86/kasan: Extend KASAN to support 5-level paging
This patch bring support for a non-folded additional page table level.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170330080731.65421-7-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-04 08:22:34 +02:00
Kirill A. Shutemov
b8504058a0 x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y
Extends pagetable headers to support the new paging mode.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170330080731.65421-6-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-04 08:22:34 +02:00
Ingo Molnar
7f75540ff2 Linux 4.11-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJY4ZYkAAoJEHm+PkMAQRiGsq4H/R4PMXDoe2XhSSk7IoT97pXV
 /A8np/scAPjzEgYUidbb54OSqWwsPRuPGWONTFeSrE2u0L4wln/REI91jg7QetLq
 IisncExlYeJ/XQ+iO0ZZh9fLbqwIlEJFdSXmyIFr3m/TBxe8a61C8j93oNgM1tHT
 yuwzlq7c3sLq2hsmUG2HyL2kJsEfRasv4Rk0yhFuti12zVsBoTW4qmZuMauq+gdf
 f7cSYgiHhPTdb2o+azg5O7uYNHaQQBxdUMlIuhhYtVOUq+pFDO23SLHSFIW2NwOm
 Zn5R6CFSrLsCw0Bx0v8Xlc151QUbaRK4h9lhUhkBr6d3uNShU1NQ9JojpSvYwBo=
 =vP6E
 -----END PGP SIGNATURE-----

Merge tag 'v4.11-rc5' into x86/mm, to refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-03 16:36:32 +02:00
Wei Yang
474aeffd88 x86/mm/numa: Remove numa_nodemask_from_meminfo()
numa_nodemask_from_meminfo() generates a nodemask of nodes which have
memory according to a meminfo descriptor.

The two callsites of that function both set bits in copies of the
numa_nodes_parsed nodemask. In both cases, the information in supplied
numa_meminfo is a subset of numa_nodes_parsed. So setting those bits
again is not really necessary.

Here are the three call paths which show that the supplied numa_meminfo
argument describes memory regions in nodes which are already in
numa_nodes_parsed:

    x86_numa_init()
        numa_init()
            Case 1:
            acpi_numa_init()
	    acpi_parse_memory_affinity()
                    numa_add_memblk()
                    node_set(numa_nodes_parsed)
                acpi_parse_slit()
		 acpi_numa_slit_init()
		  numa_set_distance()
		   numa_alloc_distance()
                    numa_nodemask_from_meminfo()

            Case 2:
            amd_numa_init()
                numa_add_memblk()
                node_set(numa_nodes_parsed)

            Case 3
            dummy_numa_init()
                node_set(numa_nodes_parsed)
                numa_add_memblk()

            numa_register_memblks()
                numa_nodemask_from_meminfo()

Thus, in all three cases, the respective bit in numa_nodes_parsed is
set, which means it is not necessary to set it again in a copy of
numa_nodes_parsed.

So remove that function.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/20170314030801.13656-2-richard.weiyang@gmail.com
[ Heavily massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-03 11:54:37 +02:00
Wei Yang
43dac8f6a7 x86/mm/numa: Improve alloc_node_data() error path message
alloc_node_data() tries to allocate from the local node first and, if
that attempt fails, falls back to any node. Improve the error message to
issue the initial node for ease during debugging.

Fix a typo in the comments, while at it.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Link: http://lkml.kernel.org/r/20170314030801.13656-1-richard.weiyang@gmail.com
[ Masssage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-03 11:54:37 +02:00
Borislav Petkov
952a6c2c09 x86/boot/32: Flip the logic in test_wp_bit()
... to have a natural "likely()" in the code flow and thus have the
success case with a branch 99.999% of the times non-taken and function
return code following it instead of jumping to it each time.

This puts the panic() call at the end of the function - it is going to
be practically unreachable anyway.

The C code is a bit more readable too.

No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: thgarnie@google.com
Link: http://lkml.kernel.org/r/20170330080101.ywsf5rg6ilzu4itk@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-31 08:08:31 +02:00
Andy Lutomirski
4af1711051 x86/boot/32: Rewrite test_wp_bit()
This code seems to be very old and has gotten only minor updates.
It's overcomplicated and has a bunch of comments that are, at best,
of purely historical interest.  Nowadays we have a shiny function
probe_kernel_write() that does more or less exactly what we need.
Use it.

I switched the page that we test from swapper_pg_dir to
empty_zero_page because writing zero to empty_zero_page is more
obviously safe than writing to the paging structures.  (It's
extremely unlikely that any of this would cause problems in practice
because the write will fail on any supported CPU.)

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/0b9e64ab0236de30e7572213cea77bf95ae2e990.1490831211.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:08:33 +02:00
Ingo Molnar
73fa1362a7 Merge branch 'x86/cpu' into x86/mm, before applying dependent patch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 09:07:54 +02:00
Kirill A. Shutemov
fdd3d8ce0e x86/dump_pagetables: Add support for 5-level paging
Simple extension to support one more page table level.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170328104806.41711-1-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-30 08:20:17 +02:00
Kirill A. Shutemov
f2a6a70501 x86: Convert the rest of the code to support p4d_t
This patch converts x86 to use proper folding of a new (fifth) page table level
with <asm-generic/pgtable-nop4d.h>.

That's a bit of a kitchen sink patch, but I don't see how to split it further
without hurting bisectability.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170317185515.8636-7-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-27 08:56:58 +02:00
Kirill A. Shutemov
d691a3cf80 x86/kasan: Prepare clear_pgds() to switch to <asm-generic/pgtable-nop4d.h>
With folded p4d, pgd_clear() is a nop. Change clear_pgds() to use
p4d_clear() instead.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170317185515.8636-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-27 08:56:41 +02:00
Kirill A. Shutemov
4547833602 x86/mm/pat: Add 5-level paging support
Straight-forward extension of existing code to support additional page
table level.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170317185515.8636-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-27 08:56:34 +02:00
Baoquan He
a46f60d760 x86/mm/KASLR: Exclude EFI region from KASLR VA space randomization
Currently KASLR is enabled on three regions: the direct mapping of physical
memory, vamlloc and vmemmap. However the EFI region is also mistakenly
included for VA space randomization because of misusing EFI_VA_START macro
and assuming EFI_VA_START < EFI_VA_END.

(This breaks kexec and possibly other things that rely on stable addresses.)

The EFI region is reserved for EFI runtime services virtual mapping which
should not be included in KASLR ranges. In Documentation/x86/x86_64/mm.txt,
we can see:

  ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space

EFI uses the space from -4G to -64G thus EFI_VA_START > EFI_VA_END,
Here EFI_VA_START = -4G, and EFI_VA_END = -64G.

Changing EFI_VA_START to EFI_VA_END in mm/kaslr.c fixes this problem.

Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Thomas Garnier <thgarnie@google.com>
Cc: <stable@vger.kernel.org> #4.8+
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/1490331592-31860-1-git-send-email-bhe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-24 09:04:27 +01:00
Kirill A. Shutemov
2947ba054a x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation
This patch provides all required callbacks required by the generic
get_user_pages_fast() code and switches x86 over - and removes
the platform specific implementation.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K . V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dann Frazier <dann.frazier@canonical.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
[ Minor readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-18 09:48:03 +01:00
Thomas Garnier
f991376e44 x86/mm: Correct fixmap header usage on adaptable MODULES_END
This patch removes fixmap header usage on non-x86 code that was
introduced by the adaptable MODULE_END change.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170317175034.4701-1-thgarnie@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-18 09:48:00 +01:00
Tobias Klauser
6bce725a78 x86/mpx: Make unnecessarily global function static
Make the function get_user_bd_entry() static as it is not used outside of
arch/x86/mm/mpx.c

This fixes a sparse warning.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:17:05 +01:00
Thomas Garnier
f06bdd4001 x86/mm: Adapt MODULES_END based on fixmap section size
This patch aligns MODULES_END to the beginning of the fixmap section.
It optimizes the space available for both sections. The address is
pre-computed based on the number of pages required by the fixmap
section.

It will allow GDT remapping in the fixmap section. The current
MODULES_END static address does not provide enough space for the kernel
to support a large number of processors.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Luis R . Rodriguez <mcgrof@kernel.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: kasan-dev@googlegroups.com
Cc: kernel-hardening@lists.openwall.com
Cc: kvm@vger.kernel.org
Cc: lguest@lists.ozlabs.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-pm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: zijun_hu <zijun_hu@htc.com>
Link: http://lkml.kernel.org/r/20170314170508.100882-1-thgarnie@google.com
[ Small build fix. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:06:24 +01:00
Dmitry Safonov
e13b73dd9c x86/hugetlb: Adjust to the new native/compat mmap bases
Commit 1b028f784e introduced two mmap() bases for 32-bit syscalls and for
64-bit syscalls. The mmap() code in x86 was modified to handle the
separation, but the patch series missed to update the hugetlb code.

As a consequence a 32bit application mapping a file on hugetlbfs uses the
64-bit mmap base for address space allocation, which fails.

Adjust the hugetlb mapping code to use the proper bases depending on the
syscall invocation mode (64-bit or compat).

[ tglx: Massaged changelog and switched from asm/compat.h to linux/compat.h ]

Fixes: commit 1b028f784e ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: 0x7f454c46@gmail.com
Cc: linux-mm@kvack.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20170314114126.9280-1-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-14 16:29:16 +01:00
Kirill A. Shutemov
b50858ce3e x86/mm/vmalloc: Add 5-level paging support
Modify vmalloc_fault() to handle additional page table level.

With 4-level paging, copying happens on p4d level, as we have pgd_none()
always false if p4d_t is folded.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170313143309.16020-6-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-14 08:45:08 +01:00
Kirill A. Shutemov
ea3b5e60ce x86/mm/ident_map: Add 5-level paging support
Add additional page table level handing. It's mostly mechanical.

The only quirk is that with p4d folded, 'pgd' is equal to 'p4d' in
kernel_ident_mapping_init(). The pgd entry has to point to the pud
page table in this case.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170313143309.16020-5-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-14 08:45:08 +01:00
Kirill A. Shutemov
0318e5abe1 x86/mm/gup: Add 5-level paging support
Extend get_user_pages_fast() to handle an additional page table level.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170313143309.16020-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-14 08:45:08 +01:00
Kirill A. Shutemov
e0c4f6750e x86/mm: Convert trivial cases of page table walk to 5-level paging
This patch only covers simple cases. Less trivial cases will be
converted with separate patches.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20170313143309.16020-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-14 08:45:08 +01:00
Andrey Ryabinin
be3606ff73 x86/kasan: Fix boot with KASAN=y and PROFILE_ANNOTATED_BRANCHES=y
The kernel doesn't boot with both PROFILE_ANNOTATED_BRANCHES=y and KASAN=y
options selected. With branch profiling enabled we end up calling
ftrace_likely_update() before kasan_early_init(). ftrace_likely_update() is
built with KASAN instrumentation, so calling it before kasan has been
initialized leads to crash.

Use DISABLE_BRANCH_PROFILING define to make sure that we don't call
ftrace_likely_update() from early code before kasan_early_init().

Fixes: ef7f0d6a6c ("x86_64: add KASan support")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Alexander Potapenko <glider@google.com>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: lkp@01.org
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/r/20170313163337.1704-1-aryabinin@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-14 00:00:55 +01:00
Dmitry Safonov
1b028f784e x86/mm: Introduce mmap_compat_base() for 32-bit mmap()
mmap() uses a base address, from which it starts to look for a free space
for allocation.

The base address is stored in mm->mmap_base, which is calculated during
exec(). The address depends on task's size, set rlimit for stack, ASLR
randomization. The base depends on the task size and the number of random
bits which are different for 64-bit and 32bit applications.

Due to the fact, that the base address is fixed, its mmap() from a compat
(32bit) syscall issued by a 64bit task will return a address which is based
on the 64bit base address and does not fit into the 32bit address space
(4GB). The returned pointer is truncated to 32bit, which results in an
invalid address.

To solve store a seperate compat address base plus a compat legacy address
base in mm_struct. These bases are calculated at exec() time and can be
used later to address the 32bit compat mmap() issued by 64 bit
applications.

As a consequence of this change 32-bit applications issuing a 64-bit
syscall (after doing a long jump) will get a 64-bit mapping now. Before
this change 32-bit applications always got a 32bit mapping.

[ tglx: Massaged changelog and added a comment ]

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: 0x7f454c46@gmail.com
Cc: linux-mm@kvack.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20170306141721.9188-4-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-13 14:59:22 +01:00
Dmitry Safonov
8f3e474f3c x86/mm: Add task_size parameter to mmap_base()
To correctly handle 32-bit and 64-bit mmap() syscalls in 64bit applications
its required to have separate address bases to place a mapping.

The tasksize can be used as an indicator to select the proper parameters
for mmap_base().

This requires the following changes:

 - Add task_size argument to mmap_base() and make the calculation based on it.
 - Provide mmap_legacy_base() as a seperate function
 - Use the new functions in arch_pick_mmap_layout()

[ tglx: Massaged changelog ]

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: 0x7f454c46@gmail.com
Cc: linux-mm@kvack.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20170306141721.9188-3-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-13 14:59:22 +01:00
Dmitry Safonov
6a0b41d1e2 x86/mm: Introduce arch_rnd() to compute 32/64 mmap random base
The compat (32bit) mmap() sycall issued by a 64-bit task results in a
mapping above 4GB. That's outside the compat mode address space and
prevents CRIU to restore 32bit processes from a 64bit application.

As a first step to address this, split out the address base randomizing
calculation from arch_mmap_rnd() into a helper function, which can be used
independent of mmap_ia32() based decisions.

[ tglx: Massaged changelog ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: 0x7f454c46@gmail.com
Cc: linux-mm@kvack.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20170306141721.9188-2-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-13 14:59:22 +01:00
Mathias Krause
6415813bae x86/cpu: Drop wp_works_ok member of struct cpuinfo_x86
Remove the wp_works_ok member of struct cpuinfo_x86. It's an
optimization back from Linux v0.99 times where we had no fixup support
yet and did the CR0.WP test via special code in the page fault handler.
The < 0 test was an optimization to not do the special casing for each
NULL ptr access violation but just for the first one doing the WP test.
Today it serves no real purpose as the test no longer needs special code
in the page fault handler and the only call side -- mem_init() -- calls
it just once, anyway. However, Xen pre-initializes it to 1, to skip the
test.

Doing the test again for Xen should be no issue at all, as even the
commit introducing skipping the test (commit d560bc6157 ("x86, xen:
Suppress WP test on Xen")) mentioned it being ban aid only. And, in
fact, testing the patch on Xen showed nothing breaks.

The pre-fixup times are long gone and with the removal of the fallback
handling code in commit a5c2a893db ("x86, 386 removal: Remove
CONFIG_X86_WP_WORKS_OK") the kernel requires a working CR0.WP anyway.
So just get rid of the "optimization" and do the test unconditionally.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Arnd Hannemann <hannemann@nets.rwth-aachen.de>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/1486933932-585-3-git-send-email-minipli@googlemail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-11 14:30:24 +01:00
Dan Williams
b2e593e271 x86, mm: unify exit paths in gup_pte_range()
All exit paths from gup_pte_range() require pte_unmap() of the original
pte page before returning.  Refactor the code to have a single exit
point to do the unmap.

This mirrors the flow of the generic gup_pte_range() in mm/gup.c.

Link: http://lkml.kernel.org/r/148804251828.36605.14910389618497006945.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Dan Williams
ef947b2529 x86, mm: fix gup_pte_range() vs DAX mappings
gup_pte_range() fails to check pte_allows_gup() before translating a DAX
pte entry, pte_devmap(), to a page.  This allows writes to read-only
mappings, and bypasses the DAX cacheline dirty tracking due to missed
'mkwrite' faults.  The gup_huge_pmd() path and the gup_huge_pud() path
correctly check pte_allows_gup() before checking for _devmap() entries.

Fixes: 3565fce3a6 ("mm, x86: get_user_pages() for dax mappings")
Link: http://lkml.kernel.org/r/148804251312.36605.12665024794196605053.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Xiong Zhou <xzhou@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-09 17:01:09 -08:00
Ingo Molnar
9164bb4a18 sched/headers: Prepare to move 'init_task' and 'init_thread_union' from <linux/sched.h> to <linux/sched/task.h>
Update all usage sites first.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:38 +01:00
Ingo Molnar
589ee62844 sched/headers: Prepare to remove the <linux/mm_types.h> dependency from <linux/sched.h>
Update code that relied on sched.h including various MM types for them.

This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
Ingo Molnar
68db0cf106 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:36 +01:00
Ingo Molnar
b17b01533b sched/headers: Prepare for new header dependencies before moving code to <linux/sched/debug.h>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar
010426079e sched/headers: Prepare for new header dependencies before moving more code to <linux/sched/mm.h>
We are going to split more MM APIs out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.

The APIs that we are going to move are:

  arch_pick_mmap_layout()
  arch_get_unmapped_area()
  arch_get_unmapped_area_topdown()
  mm_update_next_owner()

Include the header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Ingo Molnar
3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Ingo Molnar
0871d5a66d Merge branch 'linus' into WIP.x86/boot, to fix up conflicts and to pick up updates
Conflicts:
	arch/x86/xen/setup.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-01 09:02:26 +01:00
Jinbum Park
2959a5f726 mm: add arch-independent testcases for RODATA
This patch makes arch-independent testcases for RODATA.  Both x86 and
x86_64 already have testcases for RODATA, But they are arch-specific
because using inline assembly directly.

And cacheflush.h is not a suitable location for rodata-test related
things.  Since they were in cacheflush.h, If someone change the state of
CONFIG_DEBUG_RODATA_TEST, It cause overhead of kernel build.

To solve the above issues, write arch-independent testcases and move it
to shared location.

[jinb.park7@gmail.com: fix config dependency]
  Link: http://lkml.kernel.org/r/20170209131625.GA16954@pjb1027-Latitude-E5410
Link: http://lkml.kernel.org/r/20170129105436.GA9303@pjb1027-Latitude-E5410
Signed-off-by: Jinbum Park <jinb.park7@gmail.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:48 -08:00
Mike Rapoport
897ab3e0c4 userfaultfd: non-cooperative: add event for memory unmaps
When a non-cooperative userfaultfd monitor copies pages in the
background, it may encounter regions that were already unmapped.
Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely
changes in the virtual memory layout.

Since there might be different uffd contexts for the affected VMAs, we
first should create a temporary representation for the unmap event for
each uffd context and then notify them one by one to the appropriate
userfault file descriptors.

The event notification occurs after the mmap_sem has been released.

[arnd@arndb.de: fix nommu build]
  Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
[mhocko@suse.com: fix nommu build]
  Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:55 -08:00
Dan Williams
220ced1676 mm: fix get_user_pages() vs device-dax pud mappings
A new unit test for the device-dax 1GB enabling currently fails with
this warning before hanging the test thread:

 WARNING: CPU: 0 PID: 21 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1e3/0x1f0
 percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic
 [..]
 CPU: 0 PID: 21 Comm: rcuos/1 Tainted: G           O    4.10.0-rc7-next-20170207+ #944
 [..]
 Call Trace:
  dump_stack+0x86/0xc3
  __warn+0xcb/0xf0
  warn_slowpath_fmt+0x5f/0x80
  ? rcu_nocb_kthread+0x27a/0x510
  ? dax_pmem_percpu_exit+0x50/0x50 [dax_pmem]
  percpu_ref_switch_to_atomic_rcu+0x1e3/0x1f0
  ? percpu_ref_exit+0x60/0x60
  rcu_nocb_kthread+0x339/0x510
  ? rcu_nocb_kthread+0x27a/0x510
  kthread+0x101/0x140

The get_user_pages() path needs to arrange for references to be taken
against the dev_pagemap instance backing the pud mapping.  Refactor the
existing __gup_device_huge_pmd() to also account for the pud case.

Link: http://lkml.kernel.org/r/148653181153.38226.9605457830505509385.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Matthew Wilcox
a00cc7d9dd mm, x86: add support for PUD-sized transparent hugepages
The current transparent hugepage code only supports PMDs.  This patch
adds support for transparent use of PUDs with DAX.  It does not include
support for anonymous pages.  x86 support code also added.

Most of this patch simply parallels the work that was done for huge
PMDs.  The only major difference is how the new ->pud_entry method in
mm_walk works.  The ->pmd_entry method replaces the ->pte_entry method,
whereas the ->pud_entry method works along with either ->pmd_entry or
->pte_entry.  The pagewalk code takes care of locking the PUD before
calling ->pud_walk, so handlers do not need to worry whether the PUD is
stable.

[dave.jiang@intel.com: fix SMP x86 32bit build for native_pud_clear()]
  Link: http://lkml.kernel.org/r/148719066814.31111.3239231168815337012.stgit@djiang5-desk3.ch.intel.com
[dave.jiang@intel.com: native_pud_clear missing on i386 build]
  Link: http://lkml.kernel.org/r/148640375195.69754.3315433724330910314.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545059381.17912.8602162635537598445.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Kirill A. Shutemov
ecf1385d72 mm: drop unused argument of zap_page_range()
There's no users of zap_page_range() who wants non-NULL 'details'.
Let's drop it.

Link: http://lkml.kernel.org/r/20170118122429.43661-3-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:30 -08:00
Yasuaki Ishimatsu
ddffe98d16 mm/memory_hotplug: set magic number to page->freelist instead of page->lru.next
To identify that pages of page table are allocated from bootmem
allocator, magic number sets to page->lru.next.

But page->lru list is initialized in reserve_bootmem_region().  So when
calling free_pagetable(), the function cannot find the magic number of
pages.  And free_pagetable() frees the pages by free_reserved_page() not
put_page_bootmem().

But if the pages are allocated from bootmem allocator and used as page
table, the pages have private flag.  So before freeing the pages, we
should clear the private flag by put_page_bootmem().

Before applying the commit 7bfec6f47b ("mm, page_alloc: check multiple
page fields with a single branch"), we could find the following visible
issue:

  BUG: Bad page state in process kworker/u1024:1
  page:ffffea103cfd8040 count:0 mapcount:0 mappi
  flags: 0x6fffff80000800(private)
  page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
  bad because of flags: 0x800(private)
  <snip>
  Call Trace:
  [...] dump_stack+0x63/0x87
  [...] bad_page+0x114/0x130
  [...] free_pages_prepare+0x299/0x2d0
  [...] free_hot_cold_page+0x31/0x150
  [...] __free_pages+0x25/0x30
  [...] free_pagetable+0x6f/0xb4
  [...] remove_pagetable+0x379/0x7ff
  [...] vmemmap_free+0x10/0x20
  [...] sparse_remove_one_section+0x149/0x180
  [...] __remove_pages+0x2e9/0x4f0
  [...] arch_remove_memory+0x63/0xc0
  [...] remove_memory+0x8c/0xc0
  [...] acpi_memory_device_remove+0x79/0xa5
  [...] acpi_bus_trim+0x5a/0x8d
  [...] acpi_bus_trim+0x38/0x8d
  [...] acpi_device_hotplug+0x1b7/0x418
  [...] acpi_hotplug_work_fn+0x1e/0x29
  [...] process_one_work+0x152/0x400
  [...] worker_thread+0x125/0x4b0
  [...] kthread+0xd8/0xf0
  [...] ret_from_fork+0x22/0x40

And the issue still silently occurs.

Until freeing the pages of page table allocated from bootmem allocator,
the page->freelist is never used.  So the patch sets magic number to
page->freelist instead of page->lru.next.

[isimatu.yasuaki@jp.fujitsu.com: fix merge issue]
  Link: http://lkml.kernel.org/r/722b1cc4-93ac-dd8b-2be2-7a7e313b3b0b@gmail.com
Link: http://lkml.kernel.org/r/2c29bd9f-5b67-02d0-18a3-8828e78bbb6f@gmail.com
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:29 -08:00
Andrey Ryabinin
025205f8f3 x86/mm/ptdump: Add address marker for KASAN shadow region
Annotate the KASAN shadow with address markers in page table
dump output:

$ cat /sys/kernel/debug/kernel_page_tables
...

---[ Vmemmap ]---
0xffffea0000000000-0xffffea0003000000          48M     RW         PSE     GLB NX pmd
0xffffea0003000000-0xffffea0004000000          16M                               pmd
0xffffea0004000000-0xffffea0005000000          16M     RW         PSE     GLB NX pmd
0xffffea0005000000-0xffffea0040000000         944M                               pmd
0xffffea0040000000-0xffffea8000000000         511G                               pud
0xffffea8000000000-0xffffec0000000000        1536G                               pgd
---[ KASAN shadow ]---
0xffffec0000000000-0xffffed0000000000           1T     ro                 GLB NX pte
0xffffed0000000000-0xffffed0018000000         384M     RW         PSE     GLB NX pmd
0xffffed0018000000-0xffffed0020000000         128M                               pmd
0xffffed0020000000-0xffffed0028200000         130M     RW         PSE     GLB NX pmd
0xffffed0028200000-0xffffed0040000000         382M                               pmd
0xffffed0040000000-0xffffed8000000000         511G                               pud
0xffffed8000000000-0xfffff50000000000        7680G                               pgd
0xfffff50000000000-0xfffffbfff0000000     7339776M     ro                 GLB NX pte
0xfffffbfff0000000-0xfffffbfff0200000           2M                               pmd
0xfffffbfff0200000-0xfffffbfff0a00000           8M     RW         PSE     GLB NX pmd
0xfffffbfff0a00000-0xfffffbffffe00000         244M                               pmd
0xfffffbffffe00000-0xfffffc0000000000           2M     ro                 GLB NX pte
---[ KASAN shadow end ]---
0xfffffc0000000000-0xffffff0000000000           3T                               pgd
---[ ESPfix Area ]---
...

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: kasan-dev@googlegroups.com
Cc: Tobias Regnery <tobias.regnery@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/r/20170214100839.17186-2-aryabinin@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-16 19:53:25 +01:00
Andrey Ryabinin
243b72aae2 x86/mm/ptdump: Optimize check for W+X mappings for CONFIG_KASAN=y
Enabling both DEBUG_WX=y and KASAN=y options significantly increases
boot time (dozens of seconds at least).
KASAN fills kernel page tables with repeated values to map several
TBs of the virtual memory to the single kasan_zero_page:

    kasan_zero_pud ->
        kasan_zero_pmd->
            kasan_zero_pte->
                kasan_zero_page

So, the page table walker used to find W+X mapping check the same
kasan_zero_p?d page table entries a lot more than once.
With patch pud walker will skip the pud if it has the same value as
the previous one . Skipping done iff we search for W+X mappings,
so this optimization won't affect the page table dump via debugfs.

This dropped time spend in W+X check from ~30 sec to reasonable 0.1 sec:

Before:
[    4.579991] Freeing unused kernel memory: 1000K
[   35.257523] x86/mm: Checked W+X mappings: passed, no W+X pages found.

After:
[    5.138756] Freeing unused kernel memory: 1000K
[    5.266496] x86/mm: Checked W+X mappings: passed, no W+X pages found.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: kasan-dev@googlegroups.com
Cc: Tobias Regnery <tobias.regnery@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: http://lkml.kernel.org/r/20170214100839.17186-1-aryabinin@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-16 19:53:25 +01:00
Thomas Gleixner
5b1ad68f9b Merge branch 'linus' into x86/mm
Make sure to get the latest fixes before applying the ptdump enhancements.
2017-02-16 19:51:27 +01:00
Andrey Ryabinin
146fbb7669 x86/mm/ptdump: Fix soft lockup in page table walker
CONFIG_KASAN=y needs a lot of virtual memory mapped for its shadow.
In that case ptdump_walk_pgd_level_core() takes a lot of time to
walk across all page tables and doing this without
a rescheduling causes soft lockups:

 NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [swapper/0:1]
 ...
 Call Trace:
  ptdump_walk_pgd_level_core+0x40c/0x550
  ptdump_walk_pgd_level_checkwx+0x17/0x20
  mark_rodata_ro+0x13b/0x150
  kernel_init+0x2f/0x120
  ret_from_fork+0x2c/0x40

I guess that this issue might arise even without KASAN on huge machines
with several terabytes of RAM.

Stick cond_resched() in pgd loop to fix this.

Reported-by: Tobias Regnery <tobias.regnery@gmail.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: kasan-dev@googlegroups.com
Cc: Alexander Potapenko <glider@google.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20170210095405.31802-1-aryabinin@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10 11:00:23 +01:00
Geliang Tang
1013fe32a6 x86/mm/pat: Use rb_entry()
To make the code clearer, use rb_entry() instead of open coding it

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/974a91cd4ed2d04c92e4faa4765077e38f248d6b.1482157956.git.geliangtang@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-04 17:18:00 +01:00
John Ogness
459fbe0069 x86/mm/cpa: Avoid wbinvd() for PREEMPT
Although wbinvd() is faster than flushing many individual pages, it blocks
the memory bus for "long" periods of time (>100us), thus directly causing
unusually large latencies on all CPUs, regardless of any CPU isolation
features that may be active. This is an unpriviledged operatation as it is
exposed to user space via the graphics subsystem.

For 1024 pages, flushing those pages individually can take up to 2200us,
but the task remains fully preemptible during that time.

Signed-off-by: John Ogness <john.ogness@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-30 15:33:52 +01:00
Ingo Molnar
0c6fc11ac3 x86/boot/e820: Rename the remaining E820 APIs to the e820__*() prefix
Three more renames left:

   e820_end_of_ram_pfn()      =>  e820__end_of_ram_pfn()
   e820_end_of_low_ram_pfn()  =>  e820__end_of_low_ram_pfn()
   e820_reallocate_tables()   =>  e820__reallocate_tables()

After this all E820 API calls are prefixed with "e820__", making
it much easier to grep for E820 functionality in the kernel.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 22:55:26 +01:00
Ingo Molnar
08b46d5dd8 x86/boot/e820: Clean up the E820 table size define names
We've got a number of defines related to the E820 table and its size:

	E820MAP
	E820NR
	E820_X_MAX
	E820MAX

The first two denote byte offsets into the zeropage (struct boot_params),
and can are not used in the kernel and can be removed.

The E820_*_MAX values have an inconsistent structure and it's unclear in any
case what they mean. 'X' presuably goes for extended - but it's not very
expressive altogether.

Change these over to:

	E820_MAX_ENTRIES_ZEROPAGE
	E820_MAX_ENTRIES

... which are self-explanatory names.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 22:55:23 +01:00
Ingo Molnar
09821ff1d5 x86/boot/e820: Prefix the E820_* type names with "E820_TYPE_"
So there's a number of constants that start with "E820" but which
are not types - these create a confusing mixture when seen together
with 'enum e820_type' values:

	E820MAP
	E820NR
	E820_X_MAX
	E820MAX

To better differentiate the 'enum e820_type' values prefix them
with E820_TYPE_.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 22:55:22 +01:00
Ingo Molnar
3bce64f019 x86/boot/e820: Rename e820_any_mapped()/e820_all_mapped() to e820__mapped_any()/e820__mapped_all()
The 'any' and 'all' are modified to the 'mapped' concept, so move them last in the name.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:31 +01:00
Ingo Molnar
4270fd8b4c x86/boot/e820: Move the memblock_find_dma_reserve() function and rename it to memblock_set_dma_reserve()
We introduced memblock_find_dma_reserve() in this commit:

   6f2a75369e x86, memblock: Use memblock_memory_size()/memblock_free_memory_size() to get correct dma_reserve

But there's several problems with it:

 - The changelog is full of typos and is incomprehensible in general, and
   the comments in the code are not much better either.

 - The function was inexplicably placed into e820.c, while it has very
   little connection to the E820 table: when we call
   memblock_find_dma_reserve() then memblock is already set up and we
   are not using the E820 table anymore.

 - The function is a wrapper around set_dma_reserve(), but changed the 'set'
   name to 'find' - actively misleading about its primary purpose, which is
   still to set the DMA-reserve value.

 - The function is limited to 64-bit systems, but neither the changelog nor
   the comments explain why. The change would appear to be relevant to
   32-bit systems as well, as the ISA DMA zone is the first 16 MB of RAM.

So address some of these problems:

 - Move it into arch/x86/mm/init.c, next to the other zone setup related
   functions.

 - Clean up the code flow and names of local variables a bit.

 - Rename it to memblock_set_dma_reserve()

 - Improve the comments.

No change in functionality. Enabling it for 32-bit systems is left
for a separate patch.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 14:42:28 +01:00
Ingo Molnar
9de94dbb90 x86/boot/e820: Remove unnecessary #include <linux/ioport.h> from asm/e820/api.h
There's a completely unnecessary inclusion of linux/ioport.h near
the end of the asm/e820/api.h file.

Remove it and fix up unrelated code that learned to rely on this
spurious inclusion of a generic header.

No change in functionality.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 09:33:15 +01:00
Ingo Molnar
5520b7e7d2 x86/boot/e820: Remove spurious asm/e820/api.h inclusions
A commonly used lowlevel x86 header, asm/pgtable.h, includes asm/e820/api.h
spuriously, without making direct use of it.

Removing it is not simple: over the years various .c code learned to rely
on this indirect inclusion.

Remove the unnecessary include - this should speed up the kernel build a bit,
as a large header is not included anymore in totally unrelated code.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 09:31:14 +01:00
Ingo Molnar
66441bd3cf x86/boot/e820: Move asm/e820.h to asm/e820/api.h
In line with asm/e820/types.h, move the e820 API declarations to
asm/e820/api.h and update all usage sites.

This is just a mechanical, obviously correct move & replace patch,
there will be subsequent changes to clean up the code and to make
better use of the new header organization.

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-28 09:31:13 +01:00
Tobias Klauser
4538286257 x86/mpx: Use compatible types in comparison to fix sparse error
info->si_addr is of type void __user *, so it should be compared against
something from the same address space.

This fixes the following sparse error:

  arch/x86/mm/mpx.c:296:27: error: incompatible types in comparison expression (different address spaces)

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 09:32:06 +01:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Mark Rutland
cb02de96ec x86/mpx: Move bd_addr to mm_context_t
Currently bd_addr lives in mm_struct, which is otherwise architecture
independent. Architecture-specific data is supposed to live within
mm_context_t (itself contained in mm_struct).

Other x86-specific context like the pkey accounting data lives in
mm_context_t, and there's no readon the MPX data can't also live there.
So as to keep the arch-specific data togather, and to set a good example
for others, this patch moves bd_addr into x86's mm_context_t.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1481892055-24596-1-git-send-email-mark.rutland@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-17 12:29:56 +01:00
Kirill A. Shutemov
5372e155a2 x86/mm: Drop unused argument 'removed' from sync_global_pgds()
Since commit af2cf278ef ("x86/mm/hotplug: Don't remove PGD entries in
remove_pagetable()") there are no callers of sync_global_pgds() which set
the 'removed' argument to 1.

Remove the argument and the related conditionals in the function.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Link: http://lkml.kernel.org/r/20161214234403.137556-1-kirill.shutemov@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-15 12:46:07 +01:00
Boris Ostrovsky
aec03f89e9 ACPI/NUMA: Do not map pxm to node when NUMA is turned off
acpi_map_pxm_to_node() unconditially maps nodes even when NUMA is turned
off. So acpi_get_node() might return a node > 0, which is fatal when NUMA
is disabled as the rest of the kernel assumes that only node 0 exists.

Expose numa_off to the acpi code and return NUMA_NO_NODE when it's set.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: fenghua.yu@intel.com
Cc: tony.luck@intel.com
Cc: linux-ia64@vger.kernel.org
Cc: catalin.marinas@arm.com
Cc: rjw@rjwysocki.net
Cc: will.deacon@arm.com
Cc: linux-acpi@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: lenb@kernel.org
Link: http://lkml.kernel.org/r/1481602709-18260-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-15 11:32:32 +01:00
Linus Torvalds
518bacf5a5 Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - do a large round of simplifications after all CPUs do 'eager' FPU
     context switching in v4.9: remove CR0 twiddling, remove leftover
     eager/lazy bts, etc (Andy Lutomirski)

   - more FPU code simplifications: remove struct fpu::counter, clarify
     nomenclature, remove unnecessary arguments/functions and better
     structure the code (Rik van Riel)"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Remove clts()
  x86/fpu: Remove stts()
  x86/fpu: Handle #NM without FPU emulation as an error
  x86/fpu, lguest: Remove CR0.TS support
  x86/fpu, kvm: Remove host CR0.TS manipulation
  x86/fpu: Remove irq_ts_save() and irq_ts_restore()
  x86/fpu: Stop saving and restoring CR0.TS in fpu__init_check_bugs()
  x86/fpu: Get rid of two redundant clts() calls
  x86/fpu: Finish excising 'eagerfpu'
  x86/fpu: Split old_fpu & new_fpu handling into separate functions
  x86/fpu: Remove 'cpu' argument from __cpu_invalidate_fpregs_state()
  x86/fpu: Split old & new FPU code paths
  x86/fpu: Remove __fpregs_(de)activate()
  x86/fpu: Rename lazy restore functions to "register state valid"
  x86/fpu, kvm: Remove KVM vcpu->fpu_counter
  x86/fpu: Remove struct fpu::counter
  x86/fpu: Remove use_eager_fpu()
  x86/fpu: Remove the XFEATURE_MASK_EAGER/LAZY distinction
  x86/fpu: Hard-disable lazy FPU mode
  x86/crypto, x86/fpu: Remove X86_FEATURE_EAGER_FPU #ifdef from the crc32c code
2016-12-12 14:27:49 -08:00
Linus Torvalds
5645688f9d Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
 "The main changes in this development cycle were:

   - a large number of call stack dumping/printing improvements: higher
     robustness, better cross-context dumping, improved output, etc.
     (Josh Poimboeuf)

   - vDSO getcpu() performance improvement for future Intel CPUs with
     the RDPID instruction (Andy Lutomirski)

   - add two new Intel AVX512 features and the CPUID support
     infrastructure for it: AVX512IFMA and AVX512VBMI. (Gayatri Kammela,
     He Chen)

   - more copy-user unification (Borislav Petkov)

   - entry code assembly macro simplifications (Alexander Kuleshov)

   - vDSO C/R support improvements (Dmitry Safonov)

   - misc fixes and cleanups (Borislav Petkov, Paul Bolle)"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  scripts/decode_stacktrace.sh: Fix address line detection on x86
  x86/boot/64: Use defines for page size
  x86/dumpstack: Make stack name tags more comprehensible
  selftests/x86: Add test_vdso to test getcpu()
  x86/vdso: Use RDPID in preference to LSL when available
  x86/dumpstack: Handle NULL stack pointer in show_trace_log_lvl()
  x86/cpufeatures: Enable new AVX512 cpu features
  x86/cpuid: Provide get_scattered_cpuid_leaf()
  x86/cpuid: Cleanup cpuid_regs definitions
  x86/copy_user: Unify the code by removing the 64-bit asm _copy_*_user() variants
  x86/unwind: Ensure stack grows down
  x86/vdso: Set vDSO pointer only after success
  x86/prctl/uapi: Remove #ifdef for CHECKPOINT_RESTORE
  x86/unwind: Detect bad stack return address
  x86/dumpstack: Warn on stack recursion
  x86/unwind: Warn on bad frame pointer
  x86/decoder: Use stderr if insn sanity test fails
  x86/decoder: Use stdout if insn decoder test is successful
  mm/page_alloc: Remove kernel address exposure in free_reserved_area()
  x86/dumpstack: Remove raw stack dump
  ...
2016-12-12 13:49:57 -08:00
Linus Torvalds
0719dbf5e1 Merge branch 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull mm/PAT cleanup from Ingo Molnar:
 "A single cleanup for a generic interface that was originally
  introduced for PAT"

* 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pat, mm: Make track_pfn_insert() return void
2016-12-12 11:14:52 -08:00
Ingo Molnar
064e6a8ba6 Merge branch 'linus' into x86/fpu, to resolve conflicts
Conflicts:
	arch/x86/kernel/fpu/core.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 07:18:09 +01:00
Andy Lutomirski
fc0e81b2be x86/traps: Ignore high word of regs->cs in early_fixup_exception()
On the 80486 DX, it seems that some exceptions may leave garbage in
the high bits of CS.  This causes sporadic failures in which
early_fixup_exception() refuses to fix up an exception.

As far as I can tell, this has been buggy for a long time, but the
problem seems to have been exacerbated by commits:

  1e02ce4ccc ("x86: Store a per-cpu shadow copy of CR4")
  e1bfc11c5a ("x86/init: Fix cr4_init_shadow() on CR4-less machines")

This appears to have broken for as long as we've had early
exception handling.

[ Note to stable maintainers: This patch is needed all the way back to 3.4,
  but it will only apply to 4.6 and up, as it depends on commit:

    0e861fbb5b ("x86/head: Move early exception panic code into early_fixup_exception()")

  If you want to backport to kernels before 4.6, please don't backport the
  prerequisites (there was a big chain of them that rewrote a lot of the
  early exception machinery); instead, ask me and I can send you a one-liner
  that will apply. ]

Reported-by: Matthew Whitehead <tedheadster@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 4c5023a3fa ("x86-32: Handle exception table entries during early boot")
Link: http://lkml.kernel.org/r/cb32c69920e58a1a58e7b5cad975038a69c0ce7d.1479609510.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-21 08:06:54 +01:00
Borislav Petkov
308a047c3f x86/pat, mm: Make track_pfn_insert() return void
It only returns 0 so we can save us the testing of its retval
everywhere.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: mcgrof@suse.com
Cc: dri-devel@lists.freedesktop.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Airlie <airlied@redhat.com>
Cc: dan.j.williams@intel.com
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20161026174839.rusfxkm3xt4ennhe@pd.tnic
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-09 21:36:07 +01:00
Ingo Molnar
c29c716662 Merge branch 'core/urgent' into x86/fpu, to merge fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:47:40 +01:00
Ingo Molnar
05b93c19d5 Merge branch 'linus' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:41:06 +01:00
Linus Torvalds
bdb520845b patches to fix a regression in 4.9-rc1 on x86 PAT
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYEFHpAAoJEAx081l5xIa+tyYP/0xq+ZqHwS90k1mge/2uWYB3
 sVQvFFIV55r6siOjIdDek+dsHq7IGFOChbsxegGyGvfwYVjzSmdoBwO1aMTV+Ii9
 OoqLS/53kts9jHOVm1UNsbxW1lzJVWoFWpMY57KDodWsWxVbd0NuP9mfTRIH2Sfj
 MmymKigXgwHSndn07+2xp9jI9Y5krtOLl+4YDsly7JF2IR7UBRRoW8n/WHR75lny
 MNn2Vtn9NBwxDieFQc/KQGUQ1nC8wB0c3wtGDDQIux0gp6IVW+pQoCLo6CMtgHXB
 IXGDojVA9KpcyEUz5RkBsVHYmvZR1PoS+nrnEE6b/C8p7UDuyCrk1Zfy0ZTGV/hq
 LKmfRKB3NWbgKnBbqOdFYhsh/iyVjqoNdGYqfR4qJx5JGIltVWbjYwlwUpImlrIY
 gKqtAdVFaFuoJs8MpFharxFlBf/o9DPDTPTWPQxGI16y7poH+86v7QmAJT9dJHRE
 pf3oyYI3eHTeIQb42f7PHSp4hsVJMX5Awkm9a8b9PhNlu/3cHUOYkCT060ripMBc
 ZksIUqKFzuk+TDRTnQrCQjaC4vJ6s8XUwntFhfHCZUmnnH8YhYpryDwdyzavcUvX
 or8rkKsO/+Jxa1kRr8d2c1JYi2FIMHBP0srAu43WeYyAsSPFIL/9l5flIeHi2Ow3
 tSHbCo4W5YRbQaVcBzxG
 =prah
 -----END PGP SIGNATURE-----

Merge tag 'drm-x86-pat-regression-fix' of git://people.freedesktop.org/~airlied/linux

Pull drm x86/pat regression fixes from Dave Airlie:
 "This is a standalone pull request for the fix for a regression
  introduced in -rc1 by a change to vm_insert_mixed to start using the
  PAT range tracking to validate page protections. With this fix in
  place, all the VRAM mappings for GPU drivers ended up at UC instead of
  WC.

  There are probably better ways to fix this long term, but nothing I'd
  considered for -fixes that wouldn't need more settling in time. So
  I've just created a new arch API that the drivers can reserve all
  their VRAM aperture ranges as WC"

* tag 'drm-x86-pat-regression-fix' of git://people.freedesktop.org/~airlied/linux:
  drm/drivers: add support for using the arch wc mapping API.
  x86/io: add interface to reserve io memtype for a resource range. (v1.1)
2016-10-28 09:36:07 -07:00
Masahiro Yamada
c0a0aba8e4 kconfig.h: remove config_enabled() macro
The use of config_enabled() is ambiguous.  For config options,
IS_ENABLED(), IS_REACHABLE(), etc.  will make intention clearer.
Sometimes config_enabled() has been used for non-config options because
it is useful to check whether the given symbol is defined or not.

I have been tackling on deprecating config_enabled(), and now is the
time to finish this work.

Some new users have appeared for v4.9-rc1, but it is trivial to replace
them:

 - arch/x86/mm/kaslr.c
  replace config_enabled() with IS_ENABLED() because
  CONFIG_X86_ESPFIX64 and CONFIG_EFI are boolean.

 - include/asm-generic/export.h
  replace config_enabled() with __is_defined().

Then, config_enabled() can be removed now.

Going forward, please use IS_ENABLED(), IS_REACHABLE(), etc. for config
options, and __is_defined() for non-config symbols.

Link: http://lkml.kernel.org/r/1476616078-32252-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-27 18:43:43 -07:00
Dave Airlie
8ef4227615 x86/io: add interface to reserve io memtype for a resource range. (v1.1)
A recent change to the mm code in:
87744ab383 mm: fix cache mode tracking in vm_insert_mixed()

started enforcing checking the memory type against the registered list for
amixed pfn insertion mappings. It happens that the drm drivers for a number
of gpus relied on this being broken. Currently the driver only inserted
VRAM mappings into the tracking table when they came from the kernel,
and userspace mappings never landed in the table. This led to a regression
where all the mapping end up as UC instead of WC now.

I've considered a number of solutions but since this needs to be fixed
in fixes and not next, and some of the solutions were going to introduce
overhead that hadn't been there before I didn't consider them viable at
this stage. These mainly concerned hooking into the TTM io reserve APIs,
but these API have a bunch of fast paths I didn't want to unwind to add
this to.

The solution I've decided on is to add a new API like the arch_phys_wc
APIs (these would have worked but wc_del didn't take a range), and
use them from the drivers to add a WC compatible mapping to the table
for all VRAM on those GPUs. This means we can then create userspace
mapping that won't get degraded to UC.

v1.1: use CONFIG_X86_PAT + add some comments in io.h

Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: x86@kernel.org
Cc: mcgrof@suse.com
Cc: Dan Williams <dan.j.williams@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-10-26 15:45:38 +10:00
Josh Poimboeuf
bb5e5ce545 x86/dumpstack: Remove kernel text addresses from stack dump
Printing kernel text addresses in stack dumps is of questionable value,
especially now that address randomization is becoming common.

It can be a security issue because it leaks kernel addresses.  It also
affects the usefulness of the stack dump.  Linus says:

  "I actually spend time cleaning up commit messages in logs, because
  useless data that isn't actually information (random hex numbers) is
  actively detrimental.

  It makes commit logs less legible.

  It also makes it harder to parse dumps.

  It's not useful. That makes it actively bad.

  I probably look at more oops reports than most people. I have not
  found the hex numbers useful for the last five years, because they are
  just randomized crap.

  The stack content thing just makes code scroll off the screen etc, for
  example."

The only real downside to removing these addresses is that they can be
used to disambiguate duplicate symbol names.  However such cases are
rare, and the context of the stack dump should be enough to be able to
figure it out.

There's now a 'faddr2line' script which can be used to convert a
function address to a file name and line:

  $ ./scripts/faddr2line ~/k/vmlinux write_sysrq_trigger+0x51/0x60
  write_sysrq_trigger+0x51/0x60:
  write_sysrq_trigger at drivers/tty/sysrq.c:1098

Or gdb can be used:

  $ echo "list *write_sysrq_trigger+0x51" |gdb ~/k/vmlinux |grep "is in"
  (gdb) 0xffffffff815b5d83 is in driver_probe_device (/home/jpoimboe/git/linux/drivers/base/dd.c:378).

(But note that when there are duplicate symbol names, gdb will only show
the first symbol it finds.  faddr2line is recommended over gdb because
it handles duplicates and it also does function size checking.)

Here's an example of what a stack dump looks like after this change:

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: sysrq_handle_crash+0x45/0x80
  PGD 36bfa067 [   29.650644] PUD 7aca3067
  Oops: 0002 [#1] PREEMPT SMP
  Modules linked in: ...
  CPU: 1 PID: 786 Comm: bash Tainted: G            E   4.9.0-rc1+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
  task: ffff880078582a40 task.stack: ffffc90000ba8000
  RIP: 0010:sysrq_handle_crash+0x45/0x80
  RSP: 0018:ffffc90000babdc8 EFLAGS: 00010296
  RAX: ffff880078582a40 RBX: 0000000000000063 RCX: 0000000000000001
  RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000292
  RBP: ffffc90000babdc8 R08: 0000000b31866061 R09: 0000000000000000
  R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
  R13: 0000000000000007 R14: ffffffff81ee8680 R15: 0000000000000000
  FS:  00007ffb43869700(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 000000007a3e9000 CR4: 00000000001406e0
  Stack:
   ffffc90000babe00 ffffffff81572d08 ffffffff81572bd5 0000000000000002
   0000000000000000 ffff880079606600 00007ffb4386e000 ffffc90000babe20
   ffffffff81573201 ffff880036a3fd00 fffffffffffffffb ffffc90000babe40
  Call Trace:
   __handle_sysrq+0x138/0x220
   ? __handle_sysrq+0x5/0x220
   write_sysrq_trigger+0x51/0x60
   proc_reg_write+0x42/0x70
   __vfs_write+0x37/0x140
   ? preempt_count_sub+0xa1/0x100
   ? __sb_start_write+0xf5/0x210
   ? vfs_write+0x183/0x1a0
   vfs_write+0xb8/0x1a0
   SyS_write+0x58/0xc0
   entry_SYSCALL_64_fastpath+0x1f/0xc2
  RIP: 0033:0x7ffb42f55940
  RSP: 002b:00007ffd33bb6b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007ffb42f55940
  RDX: 0000000000000002 RSI: 00007ffb4386e000 RDI: 0000000000000001
  RBP: 0000000000000011 R08: 00007ffb4321ea40 R09: 00007ffb43869700
  R10: 00007ffb43869700 R11: 0000000000000246 R12: 0000000000778a10
  R13: 00007ffd33bb5c00 R14: 0000000000000007 R15: 0000000000000010
  Code: 34 e8 d0 34 bc ff 48 c7 c2 3b 2b 57 81 be 01 00 00 00 48 c7 c7 e0 dd e5 81 e8 a8 55 ba ff c7 05 0e 3f de 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 e8 4c 49 bc ff 84 c0 75 c3 48 c7
  RIP: sysrq_handle_crash+0x45/0x80 RSP: ffffc90000babdc8
  CR2: 0000000000000000

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/69329cb29b8f324bb5fcea14d61d224807fb6488.1477405374.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-25 18:40:37 +02:00
Linus Torvalds
63ae602cea Merge branch 'gup_flag-cleanups'
Merge the gup_flags cleanups from Lorenzo Stoakes:
 "This patch series adjusts functions in the get_user_pages* family such
  that desired FOLL_* flags are passed as an argument rather than
  implied by flags.

  The purpose of this change is to make the use of FOLL_FORCE explicit
  so it is easier to grep for and clearer to callers that this flag is
  being used.  The use of FOLL_FORCE is an issue as it overrides missing
  VM_READ/VM_WRITE flags for the VMA whose pages we are reading
  from/writing to, which can result in surprising behaviour.

  The patch series came out of the discussion around commit 38e0885465
  ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
  which addressed a BUG_ON() being triggered when a page was faulted in
  with PROT_NONE set but having been overridden by FOLL_FORCE.
  do_numa_page() was run on the assumption the page _must_ be one marked
  for NUMA node migration as an actual PROT_NONE page would have been
  dealt with prior to this code path, however FOLL_FORCE introduced a
  situation where this assumption did not hold.

  See

      https://marc.info/?l=linux-mm&m=147585445805166

  for the patch proposal"

Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
FOLL_WRITE by me.

[ This branch was rebased recently to add a few more acked-by's and
  reviewed-by's ]

* gup_flag-cleanups:
  mm: replace access_process_vm() write parameter with gup_flags
  mm: replace access_remote_vm() write parameter with gup_flags
  mm: replace __access_remote_vm() write parameter with gup_flags
  mm: replace get_user_pages_remote() write/force parameters with gup_flags
  mm: replace get_user_pages() write/force parameters with gup_flags
  mm: replace get_vaddr_frames() write/force parameters with gup_flags
  mm: replace get_user_pages_locked() write/force parameters with gup_flags
  mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
  mm: remove write/force parameters from __get_user_pages_unlocked()
  mm: remove write/force parameters from __get_user_pages_locked()
  mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
2016-10-19 08:39:47 -07:00
Lorenzo Stoakes
768ae309a9 mm: replace get_user_pages() write/force parameters with gup_flags
This removes the 'write' and 'force' from get_user_pages() and replaces
them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
as use of this flag can result in surprising behaviour (and hence bugs)
within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19 08:11:43 -07:00
Lorenzo Stoakes
c164154f66 mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
This removes the 'write' and 'force' use from get_user_pages_unlocked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-18 14:13:37 -07:00
Andy Lutomirski
e63650840e x86/fpu: Finish excising 'eagerfpu'
Now that eagerfpu= is gone, remove it from the docs and some
comments.  Also sync the changes to tools/.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/cf430dd4481d41280e93ac6cf0def1007a67fc8e.1476740397.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-18 09:56:03 +02:00
Linus Torvalds
4cdf8dbe2d Merge branch 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess.h prepwork from Al Viro:
 "Preparations to tree-wide switch to use of linux/uaccess.h (which,
  obviously, will allow to start unifying stuff for real). The last step
  there, ie

    PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
    sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
            `git grep -l "$PATT"|grep -v ^include/linux/uaccess.h`

  is not taken here - I would prefer to do it once just before or just
  after -rc1.  However, everything should be ready for it"

* 'work.uaccess2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  remove a stray reference to asm/uaccess.h in docs
  sparc64: separate extable_64.h, switch elf_64.h to it
  score: separate extable.h, switch module.h to it
  mips: separate extable.h, switch module.h to it
  x86: separate extable.h, switch sections.h to it
  remove stray include of asm/uaccess.h from cacheflush.h
  mn10300: remove a bogus processor.h->uaccess.h include
  xtensa: split uaccess.h into C and asm sides
  bonding: quit messing with IOCTL
  kill __kernel_ds_p off
  mn10300: finish verify_area() off
  frv: move HAVE_ARCH_UNMAPPED_AREA to pgtable.h
  exceptions: detritus removal
2016-10-11 23:38:39 -07:00
Linus Torvalds
93c26d7dc0 Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull protection keys syscall interface from Thomas Gleixner:
 "This is the final step of Protection Keys support which adds the
  syscalls so user space can actually allocate keys and protect memory
  areas with them. Details and usage examples can be found in the
  documentation.

  The mm side of this has been acked by Mel"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pkeys: Update documentation
  x86/mm/pkeys: Do not skip PKRU register if debug registers are not used
  x86/pkeys: Fix pkeys build breakage for some non-x86 arches
  x86/pkeys: Add self-tests
  x86/pkeys: Allow configuration of init_pkru
  x86/pkeys: Default to a restrictive init PKRU
  pkeys: Add details of system call use to Documentation/
  generic syscalls: Wire up memory protection keys syscalls
  x86: Wire up protection keys system calls
  x86/pkeys: Allocation/free syscalls
  x86/pkeys: Make mprotect_key() mask off additional vm_flags
  mm: Implement new pkey_mprotect() system call
  x86/pkeys: Add fault handling for PF_PK page fault bit
2016-10-10 11:01:51 -07:00
Linus Torvalds
a8adc0f091 Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
 "Header file and a wrapper functions cleanup"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Migrate exception table users off module.h and onto extable.h
  x86: Clean up various simple wrapper functions
2016-10-03 17:18:52 -07:00
Linus Torvalds
3ef0a61a46 Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
 "The changes in this cycle were:

   - Save e820 table RAM footprint on larger kernel configurations.
     (Denys Vlasenko)

   - pmem related fixes (Dan Williams)

   - theoretical e820 boundary condition fix (Wei Yang)"

* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Fix kdump, cleanup aborted E820_PRAM max_pfn manipulation
  x86/e820: Use much less memory for e820/e820_saved, save up to 120k
  x86/e820: Prepare e280 code for switch to dynamic storage
  x86/e820: Mark some static functions __init
  x86/e820: Fix very large 'size' handling boundary condition
2016-10-03 16:46:53 -07:00
Linus Torvalds
1a4a2bc460 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull low-level x86 updates from Ingo Molnar:
 "In this cycle this topic tree has become one of those 'super topics'
  that accumulated a lot of changes:

   - Add CONFIG_VMAP_STACK=y support to the core kernel and enable it on
     x86 - preceded by an array of changes. v4.8 saw preparatory changes
     in this area already - this is the rest of the work. Includes the
     thread stack caching performance optimization. (Andy Lutomirski)

   - switch_to() cleanups and all around enhancements. (Brian Gerst)

   - A large number of dumpstack infrastructure enhancements and an
     unwinder abstraction. The secret long term plan is safe(r) live
     patching plus maybe another attempt at debuginfo based unwinding -
     but all these current bits are standalone enhancements in a frame
     pointer based debug environment as well. (Josh Poimboeuf)

   - More __ro_after_init and const annotations. (Kees Cook)

   - Enable KASLR for the vmemmap memory region. (Thomas Garnier)"

[ The virtually mapped stack changes are pretty fundamental, and not
  x86-specific per se, even if they are only used on x86 right now. ]

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  x86/asm: Get rid of __read_cr4_safe()
  thread_info: Use unsigned long for flags
  x86/alternatives: Add stack frame dependency to alternative_call_2()
  x86/dumpstack: Fix show_stack() task pointer regression
  x86/dumpstack: Remove dump_trace() and related callbacks
  x86/dumpstack: Convert show_trace_log_lvl() to use the new unwinder
  oprofile/x86: Convert x86_backtrace() to use the new unwinder
  x86/stacktrace: Convert save_stack_trace_*() to use the new unwinder
  perf/x86: Convert perf_callchain_kernel() to use the new unwinder
  x86/unwind: Add new unwind interface and implementations
  x86/dumpstack: Remove NULL task pointer convention
  fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y
  sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK
  lib/syscall: Pin the task stack in collect_syscall()
  x86/process: Pin the target stack in get_wchan()
  x86/dumpstack: Pin the target stack when dumping it
  kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
  sched/core: Add try_get_task_stack() and put_task_stack()
  x86/entry/64: Fix a minor comment rebase error
  iommu/amd: Don't put completion-wait semaphore on stack
  ...
2016-10-03 16:13:28 -07:00
Linus Torvalds
110a9e42b6 Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Ingo Molnar:
 "The main changes are:

   - Persistent CPU/node numbering across CPU hotplug/unplug events.
     This is a pretty involved series of changes that first fetches all
     the information during bootup and then uses it for the various
     hotplug/unplug methods. (Gu Zheng, Dou Liyang)

   - IO-APIC hot-add/remove fixes and enhancements. (Rui Wang)

   - ... various fixes, cleanups and enhancements"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
  x86/apic: Fix silent & fatal merge conflict in __generic_processor_info()
  acpi: Fix broken error check in map_processor()
  acpi: Validate processor id when mapping the processor
  acpi: Provide mechanism to validate processors in the ACPI tables
  x86/acpi: Set persistent cpuid <-> nodeid mapping when booting
  x86/acpi: Enable MADT APIs to return disabled apicids
  x86/acpi: Introduce persistent storage for cpuid <-> apicid mapping
  x86/acpi: Enable acpi to register all possible cpus at boot time
  x86/numa: Online memory-less nodes at boot time
  x86/apic: Get rid of apic_version[] array
  x86/apic: Order irq_enter/exit() calls correctly vs. ack_APIC_irq()
  x86/ioapic: Ignore root bridges without a companion ACPI device
  x86/apic: Update comment about disabling processor focus
  x86/smpboot: Check APIC ID before setting up default routing
  x86/ioapic: Fix IOAPIC failing to request resource
  x86/ioapic: Fix lost IOAPIC resource after hot-removal and hotadd
  x86/ioapic: Fix setup_res() failing to get resource
  x86/ioapic: Support hot-removal of IOAPICs present during boot
  x86/ioapic: Change prototype of acpi_ioapic_add()
  x86/apic, ACPI: Fix incorrect assignment when handling apic/x2apic entries
  ...
2016-10-03 15:36:06 -07:00
Thomas Gleixner
d7e25c66c9 Merge branch 'x86/urgent' into x86/asm
Get the cr4 fixes so we can apply the final cleanup
2016-09-30 12:38:28 +02:00
Al Viro
df720ac12f exceptions: detritus removal
externs and defines for stuff that is never used

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:15:14 -04:00
Thomas Gleixner
1e1b37273c Merge branch 'x86/urgent' into x86/apic
Bring in the upstream modifications so we can fixup the silent merge
conflict which is introduced by this merge.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-26 15:47:03 -04:00
Tang Chen
2532fc318d x86/numa: Online memory-less nodes at boot time
For now, x86 does not support memory-less node. A node without memory
will not be onlined, and the cpus on it will be mapped to the other
online nodes with memory in init_cpu_to_node(). The reason of doing this
is to ensure each cpu has mapped to a node with memory, so that it will
be able to allocate local memory for that cpu.

But we don't have to do it in this way.

In this series of patches, we are going to construct cpu <-> node mapping
for all possible cpus at boot time, which is a persistent mapping. It means
that the cpu will be mapped to the node which it belongs to, and will never
be changed. If a node has only cpus but no memory, the cpus on it will be
mapped to a memory-less node. And the memory-less node should be onlined.

Allocate pgdats for all memory-less nodes and online them at boot
time. Then build zonelists for these nodes. As a result, when cpus on these
memory-less nodes try to allocate memory from local node, it will
automatically fall back to the proper zones in the zonelists.

Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: mika.j.penttila@gmail.com
Cc: len.brown@intel.com
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: rafael@kernel.org
Cc: rjw@rjwysocki.net
Cc: yasu.isimatu@gmail.com
Cc: linux-mm@kvack.org
Cc: linux-acpi@vger.kernel.org
Cc: isimatu.yasuaki@jp.fujitsu.com
Cc: gongzhaogang@inspur.com
Cc: tj@kernel.org
Cc: izumi.taku@jp.fujitsu.com
Cc: cl@linux.com
Cc: chen.tang@easystack.cn
Cc: akpm@linux-foundation.org
Cc: kamezawa.hiroyu@jp.fujitsu.com
Cc: lenb@kernel.org
Link: http://lkml.kernel.org/r/1472114120-3281-2-git-send-email-douly.fnst@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-21 21:18:38 +02:00