Commit Graph

577 Commits

Author SHA1 Message Date
Christoph Hellwig a7fce1f7ca sparc/iommu: use sbus_iommu_unmap_page in sbus_iommu_unmap_sg
Use the page-level helper instead of duplicating the logic, while also
fixing the incorrect handling of larger than page sized offsets in
the sg variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-08 17:11:56 -07:00
Christoph Hellwig 031abf0b70 sparc/iommu: use !PageHighMem to check if a page has a kernel mapping
This deobsfucates the check a bit, and prepares for future changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-08 17:11:56 -07:00
Mike Rapoport b1e1c869ff sparc: add checks for the return value of memblock_alloc*()
Add panic() calls if memblock_alloc*() returns NULL.

Most of the changes are simply addition of

        if(!ptr)
                panic();

statements after the calls to memblock_alloc*() variants.

Exceptions are pcpu_populate_pte() and kernel_map_range() that were
slightly refactored to accommodate the change.

Link: http://lkml.kernel.org/r/1548057848-15136-16-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
Mike Rapoport 9415673e3e arch: use memblock_alloc() instead of memblock_alloc_from(size, align, 0)
The last parameter of memblock_alloc_from() is the lower limit for the
memory allocation.  When it is 0, the call is equivalent to
memblock_alloc().

Link: http://lkml.kernel.org/r/1548057848-15136-13-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Paul Burton <paul.burton@mips.com> # MIPS part
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Juergen Gross <jgross@suse.com>			[Xen]
Cc: Mark Salter <msalter@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:01 -07:00
Mike Rapoport b63a07d69d arch: simplify several early memory allocations
There are several early memory allocations in arch/ code that use
memblock_phys_alloc() to allocate memory, convert the returned physical
address to the virtual address and then set the allocated memory to
zero.

Exactly the same behaviour can be achieved simply by calling
memblock_alloc(): it allocates the memory in the same way as
memblock_phys_alloc(), then it performs the phys_to_virt() conversion
and clears the allocated memory.

Replace the longer sequence with a simpler call to memblock_alloc().

Link: http://lkml.kernel.org/r/1546248566-14910-6-git-send-email-rppt@linux.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07 18:32:03 -08:00
Anshuman Khandual 98fa15f34c mm: replace all open encodings for NUMA_NO_NODE
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code.  Please let me know if this
might have missed some instances.  This might also have replaced some
false positives.  I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where invalid node number is
encoded as -1.  Even though implicitly understood it is always better to
have macros in there.  Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE.  This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Joel Fernandes (Google) 4cf5892495 mm: treewide: remove unused address argument from pte_alloc functions
Patch series "Add support for fast mremap".

This series speeds up the mremap(2) syscall by copying page tables at
the PMD level even for non-THP systems.  There is concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle architecture related in the future that may make the scheme not
work.  Also we find that there is no point in passing the 'address' to
pte_alloc since its unused.  This patch therefore removes this argument
tree-wide resulting in a nice negative diff as well.  Also ensuring
along the way that the enabled architectures do not do anything funky
with the 'address' argument that goes unnoticed by the optimization.

Build and boot tested on x86-64.  Build tested on arm64.  The config
enablement patch for arm64 will be posted in the future after more
testing.

The changes were obtained by applying the following Coccinelle script.
(thanks Julia for answering all Coccinelle questions!).
Following fix ups were done manually:
* Removal of address argument from  pte_fragment_alloc
* Removal of pte_alloc_one_fast definitions from m68k and microblaze.

// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.

virtual patch

@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@

 fn(...
- , T2 E2
 )
 { ... }

@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)

@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)

@pte_alloc_func_call depends on patch exists@
expression E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@

 fn(...
-,  E2
 )

@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@

(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)

Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:47 -08:00
Christoph Hellwig 518a2f1925 dma-mapping: zero memory returned from dma_alloc_*
If we want to map memory from the DMA allocator to userspace it must be
zeroed at allocation time to prevent stale data leaks.   We already do
this on most common architectures, but some architectures don't do this
yet, fix them up, either by passing GFP_ZERO when we use the normal page
allocator or doing a manual memset otherwise.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Acked-by: Sam Ravnborg <sam@ravnborg.org> [sparc]
2018-12-20 08:13:52 +01:00
Christoph Hellwig 6c503d0d88 sparc/iommu: fix ->map_sg return value
Just decrementing the sz value will lead to an incorrect return value.
Instead of just introducing a local variable switch to the standard
for_each_sg helper and standard naming of the arguments.

Fixes: ce65d36f3e ("sparc: remove the sparc32_dma_ops indirection")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
2018-12-19 18:00:40 +01:00
Christoph Hellwig 1e2934ad7f sparc/io-unit: fix ->map_sg return value
Just decrementing the sz value will lead to an incorrect return value.
Instead of just introducing a local variable switch to the standard
for_each_sg helper and standard naming of the arguments.

Fixes: ce65d36f3e ("sparc: remove the sparc32_dma_ops indirection")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
2018-12-19 18:00:40 +01:00
Christoph Hellwig ce65d36f3e sparc: remove the sparc32_dma_ops indirection
There is no good reason to have a double indirection for the sparc32
dma ops, so remove the sparc32_dma_ops and define separate dma_map_ops
instance for the different IOMMU types.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net>
2018-12-11 14:45:29 +01:00
Mike Rapoport 57c8a661d9 mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h
into include/linux/memblock.h and remove the redundant header.

The includes were replaced with the semantic patch below and then
semi-automated removal of duplicated '#include <linux/memblock.h>

@@
@@
- #include <linux/bootmem.h>
+ #include <linux/memblock.h>

[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
[sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
  Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
[sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
  Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport c6ffc5ca8f memblock: rename free_all_bootmem to memblock_free_all
The conversion is done using

sed -i 's@free_all_bootmem@memblock_free_all@' \
    $(git grep -l free_all_bootmem)

Link: http://lkml.kernel.org/r/1536927045-23536-26-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport 4fc4a09e4c memblock: replace __alloc_bootmem with memblock_alloc_from
The functions are equivalent, just the later does not require nobootmem
translation layer.

The conversion is done using the following semantic patch:

@@
expression size, align, goal;
@@
- __alloc_bootmem(size, align, goal)
+ memblock_alloc_from(size, align, goal)

Link: http://lkml.kernel.org/r/1536927045-23536-21-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:16 -07:00
Mike Rapoport 9a8dd708d5 memblock: rename memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*
Make it explicit that the caller gets a physical address rather than a
virtual one.

This will also allow using meblock_alloc prefix for memblock allocations
returning virtual address, which is done in the following patches.

The conversion is done using the following semantic patch:

@@
expression e1, e2, e3;
@@
(
- memblock_alloc(e1, e2)
+ memblock_phys_alloc(e1, e2)
|
- memblock_alloc_nid(e1, e2, e3)
+ memblock_phys_alloc_nid(e1, e2, e3)
|
- memblock_alloc_try_nid(e1, e2, e3)
+ memblock_phys_alloc_try_nid(e1, e2, e3)
)

Link: http://lkml.kernel.org/r/1536927045-23536-7-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Serge Semin <fancer.lancer@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 08:54:15 -07:00
David S. Miller 2b4792eaa9 sparc64: Export __node_distance.
Some drivers reference it via node_distance(), for example the
NVME host driver core.

ERROR: "__node_distance" [drivers/nvme/host/nvme-core.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:92: __modpost] Error 1

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-26 15:11:56 -07:00
Mike Rapoport 8818d89cf8 sparc32: split ramdisk detection and reservation to a helper function
The detection and reservation of ramdisk memory were separated to allow
bootmem bitmap initialization after the ramdisk boundaries are detected.
Since the bootmem initialization is removed, the reservation of ramdisk
memory is done immediately after its boundaries are found.

Split the entire block into a separate helper function.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Sam Ravnborg <sam@ravnborg.org>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:40:26 -07:00
Mike Rapoport cca079ef8a sparc32: switch to NO_BOOTMEM
Each populated sparc_phys_bank is added to memblock.memory. The
reserve_bootmem() calls are replaced with memblock_reserve(), and the
bootmem bitmap initialization is droppped.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:40:26 -07:00
Mike Rapoport 60ef861682 sparc: mm/init_32: kill trailing whitespace
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:40:26 -07:00
Souptick Joarder 50a7ca3c6f mm: convert return type of handle_mm_fault() caller to vm_fault_t
Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")

In this patch all the caller of handle_mm_fault() are changed to return
vm_fault_t type.

Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Thomas Petazzoni f0afc6b18d sparc: move MSI related definitions to where they are used
The definitions in arch/sparc/include/asm/msi.h are only used in
arch/sparc/mm/srmmu.c, so it makes sense to have them in the C file
directly.

In addition, having a custom arch/sparc/include/asm/msi.h prevents
from using the asm-generic version of this header, which is necessary
to be able to include <linux/msi.h> when CONFIG_GENERIC_MSI_IRQ_DOMAIN
is enabled.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-30 13:00:56 -07:00
Stefan Agner d7dc899abe treewide: use PHYS_ADDR_MAX to avoid type casting ULLONG_MAX
With PHYS_ADDR_MAX there is now a type safe variant for all bits set.
Make use of it.

Patch created using a semantic patch as follows:

// <smpl>
@@
typedef phys_addr_t;
@@
-(phys_addr_t)ULLONG_MAX
+PHYS_ADDR_MAX
// </smpl>

Link: http://lkml.kernel.org/r/20180419214204.19322-1-stefan@agner.ch
Signed-off-by: Stefan Agner <stefan@agner.ch>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-15 07:55:25 +09:00
Eric W. Biederman d1f5bef63f signal/sparc: Use force_sig_fault where appropriate
Filling in struct siginfo before calling force_sig_info a tedious and
error prone process, where once in a great while the wrong fields
are filled out, and siginfo has been inconsistently cleared.

Simplify this process by using the helper force_sig_fault.  Which
takes as a parameters all of the information it needs, ensures
all of the fiddly bits of filling in struct siginfo are done properly
and then calls force_sig_info.

In short about a 5 line reduction in code for every time force_sig_info
is called, which makes the calling function clearer.

Cc: David Miller <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-04-25 10:44:10 -05:00
Eric W. Biederman 3eb0f5193b signal: Ensure every siginfo we send has all bits initialized
Call clear_siginfo to ensure every stack allocated siginfo is properly
initialized before being passed to the signal sending functions.

Note: It is not safe to depend on C initializers to initialize struct
siginfo on the stack because C is allowed to skip holes when
initializing a structure.

The initialization of struct siginfo in tracehook_report_syscall_exit
was moved from the helper user_single_step_siginfo into
tracehook_report_syscall_exit itself, to make it clear that the local
variable siginfo gets fully initialized.

In a few cases the scope of struct siginfo has been reduced to make it
clear that siginfo siginfo is not used on other paths in the function
in which it is declared.

Instances of using memset to initialize siginfo have been replaced
with calls clear_siginfo for clarity.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-04-25 10:40:51 -05:00
Michael S. Tsirkin d081107867 mm/gup.c: document return value
__get_user_pages_fast handles errors differently from
get_user_pages_fast: the former always returns the number of pages
pinned, the later might return a negative error code.

Link: http://lkml.kernel.org/r/1522962072-182137-6-git-send-email-mst@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-13 17:10:27 -07:00
Huang Ying cb9f753a37 mm: fix races between swapoff and flush dcache
Thanks to commit 4b3ef9daa4 ("mm/swap: split swap cache into 64MB
trunks"), after swapoff the address_space associated with the swap
device will be freed.  So page_mapping() users which may touch the
address_space need some kind of mechanism to prevent the address_space
from being freed during accessing.

The dcache flushing functions (flush_dcache_page(), etc) in architecture
specific code may access the address_space of swap device for anonymous
pages in swap cache via page_mapping() function.  But in some cases
there are no mechanisms to prevent the swap device from being swapoff,
for example,

  CPU1					CPU2
  __get_user_pages()			swapoff()
    flush_dcache_page()
      mapping = page_mapping()
        ...				  exit_swap_address_space()
        ...				    kvfree(spaces)
        mapping_mapped(mapping)

The address space may be accessed after being freed.

But from cachetlb.txt and Russell King, flush_dcache_page() only care
about file cache pages, for anonymous pages, flush_anon_page() should be
used.  The implementation of flush_dcache_page() in all architectures
follows this too.  They will check whether page_mapping() is NULL and
whether mapping_mapped() is true to determine whether to flush the
dcache immediately.  And they will use interval tree (mapping->i_mmap)
to find all user space mappings.  While mapping_mapped() and
mapping->i_mmap isn't used by anonymous pages in swap cache at all.

So, to fix the race between swapoff and flush dcache, __page_mapping()
is add to return the address_space for file cache pages and NULL
otherwise.  All page_mapping() invoking in flush dcache functions are
replaced with page_mapping_file().

[akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:26 -07:00
Khalid Aziz 74a0496748 sparc64: Add support for ADI (Application Data Integrity)
ADI is a new feature supported on SPARC M7 and newer processors to allow
hardware to catch rogue accesses to memory. ADI is supported for data
fetches only and not instruction fetches. An app can enable ADI on its
data pages, set version tags on them and use versioned addresses to
access the data pages. Upper bits of the address contain the version
tag. On M7 processors, upper four bits (bits 63-60) contain the version
tag. If a rogue app attempts to access ADI enabled data pages, its
access is blocked and processor generates an exception. Please see
Documentation/sparc/adi.txt for further details.

This patch extends mprotect to enable ADI (TSTATE.mcde), enable/disable
MCD (Memory Corruption Detection) on selected memory ranges, enable
TTE.mcd in PTEs, return ADI parameters to userspace and save/restore ADI
version tags on page swap out/in or migration. ADI is not enabled by
default for any task. A task must explicitly enable ADI on a memory
range and set version tag for ADI to be effective for the task.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18 07:38:48 -07:00
David S. Miller cfb61b5e3e sparc64: Fix regression in pmdp_invalidate().
pmdp_invalidate() was changed to update the pmd atomically
(to not lose dirty/access bits) and return the original pmd
value.

However, in doing so, we lost a lot of the essential work that
set_pmd_at() does, namely to update hugepage mapping counts and
queuing up the batched TLB flush entry.

Thus we were not flushing entries out of the TLB when making
such PMD changes.

Fix this by abstracting the accounting work of set_pmd_at() out into a
separate function, and call it from pmdp_establish().

Fixes: a8e654f01c ("sparc64: update pmdp_invalidate() to return old pmd value")
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-15 14:18:00 -07:00
Linus Torvalds 3ff1b28caa libnvdimm for 4.16
* Require struct page by default for filesystem DAX to remove a number of
   surprising failure cases.  This includes failures with direct I/O, gdb and
   fork(2).
 
 * Add support for the new Platform Capabilities Structure added to the NFIT in
   ACPI 6.2a.  This new table tells us whether the platform supports flushing
   of CPU and memory controller caches on unexpected power loss events.
 
 * Revamp vmem_altmap and dev_pagemap handling to clean up code and better
   support future future PCI P2P uses.
 
 * Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has become
   out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL spec, and
   instead rely on the generic ND_CMD_CALL approach used by the two other IOCTL
   families, NVDIMM_FAMILY_{HPE,MSFT}.
 
 * Enhance nfit_test so we can test some of the new things added in version 1.6
   of the DSM specification.  This includes testing firmware download and
   simulating the Last Shutdown State (LSS) status.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaeOg0AAoJEJ/BjXdf9fLBAFoQAI/IgcgJ2h9lfEpgjBRTC44t
 2p8dxwT1Ofw3Y1aR/tI8nYRXjRtAGuP4UIeRVnb1CL/N7PagJyoMGU+6hmzg+ptY
 c7cEDvw6nZOhrFwXx/xn7R53sYG8zH+UE6+jTR/PP/G4mQJfFCg4iF9R72Y7z0n7
 aurf82Kz137NPUy6dNr4V9bmPMJWAaOci9WOj5SKddR5ZSNbjoxylTwQRvre5y4r
 7HQTScEkirABOdSf1JoXTSUXCH/RC9UFFXR03ScHstGb1HjCj3KdcicVc50Q++Ub
 qsEudhE6i44PEW1Hh4Qkg6hjHMEa8qHP+ShBuRuVaUmlghYTQn66niJAYLZilwdz
 EVjE7vR+toHA5g3YCalEmYVutUEhIDkh/xfpd7vM6ZorUGJy95a2elEJs2fHBffC
 gEhnCip7FROPcK5RDNUM8hBgnG/q5wwWPQMKY+6rKDZQx3mXssCrKp2Vlx7kBwMG
 rpblkEpYjPonbLEHxsSU8yTg9Uq55ciIWgnOToffcjZvjbihi8WUVlHcwHUMPf/o
 DWElg+4qmG0Sdd4S2NeAGwTl1Ewrf2RrtUGMjHtH4OUFs1wo6ZmfrxFzzMfoZ1Od
 ko/s65v4uwtTzECh2o+XQaNsReR5YETXxmA40N/Jpo7/7twABIoZ/ASvj/3ZBYj+
 sie+u2rTod8/gQWSfHpJ
 =MIMX
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Ross Zwisler:

 - Require struct page by default for filesystem DAX to remove a number
   of surprising failure cases. This includes failures with direct I/O,
   gdb and fork(2).

 - Add support for the new Platform Capabilities Structure added to the
   NFIT in ACPI 6.2a. This new table tells us whether the platform
   supports flushing of CPU and memory controller caches on unexpected
   power loss events.

 - Revamp vmem_altmap and dev_pagemap handling to clean up code and
   better support future future PCI P2P uses.

 - Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
   become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
   spec, and instead rely on the generic ND_CMD_CALL approach used by
   the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.

 - Enhance nfit_test so we can test some of the new things added in
   version 1.6 of the DSM specification. This includes testing firmware
   download and simulating the Last Shutdown State (LSS) status.

* tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
  libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
  acpi, nfit: fix register dimm error handling
  libnvdimm, namespace: make min namespace size 4K
  tools/testing/nvdimm: force nfit_test to depend on instrumented modules
  libnvdimm/nfit_test: adding support for unit testing enable LSS status
  libnvdimm/nfit_test: add firmware download emulation
  nfit-test: Add platform cap support from ACPI 6.2a to test
  libnvdimm: expose platform persistence attribute for nd_region
  acpi: nfit: add persistent memory control flag for nd_region
  acpi: nfit: Add support for detect platform CPU cache flush on power loss
  device-dax: Fix trailing semicolon
  libnvdimm, btt: fix uninitialized err_lock
  dax: require 'struct page' by default for filesystem dax
  ext2: auto disable dax instead of failing mount
  ext4: auto disable dax instead of failing mount
  mm, dax: introduce pfn_t_special()
  mm: Fix devm_memremap_pages() collision handling
  mm: Fix memory size alignment in devm_memremap_pages_release()
  memremap: merge find_dev_pagemap into get_dev_pagemap
  memremap: change devm_memremap_pages interface to use struct dev_pagemap
  ...
2018-02-06 10:41:33 -08:00
Nitin Gupta a8e654f01c sparc64: update pmdp_invalidate() to return old pmd value
It's required to avoid losing dirty and accessed bits.

[akpm@linux-foundation.org: add a `do' to the do-while loop]
Link: http://lkml.kernel.org/r/20171213105756.69879-9-kirill.shutemov@linux.intel.com
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:38 -08:00
Christoph Hellwig 24b6d41643 mm: pass the vmem_altmap to vmemmap_free
We can just pass this on instead of having to do a radix tree lookup
without proper locking a few levels into the callchain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig 7b73d978a5 mm: pass the vmem_altmap to vmemmap_populate
We can just pass this on instead of having to do a radix tree lookup
without proper locking a few levels into the callchain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Kees Cook 10a7e9d849 Do not hash userspace addresses in fault handlers
The hashing of %p was designed to restrict kernel addresses. There is
no reason to hash the userspace values seen during a segfault report,
so switch these to %px. (Some architectures already use %lx.)

Fixes: ad67b74d24 ("printk: hash addresses printed with %p")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-19 17:04:43 -08:00
Linus Torvalds f6f3732162 Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths"
This reverts commits 5c9d2d5c26, c7da82b894, and e7fe7b5cae.

We'll probably need to revisit this, but basically we should not
complicate the get_user_pages_fast() case, and checking the actual page
table protection key bits will require more care anyway, since the
protection keys depend on the exact state of the VM in question.

Particularly when doing a "remote" page lookup (ie in somebody elses VM,
not your own), you need to be much more careful than this was.  Dave
Hansen says:

 "So, the underlying bug here is that we now a get_user_pages_remote()
  and then go ahead and do the p*_access_permitted() checks against the
  current PKRU. This was introduced recently with the addition of the
  new p??_access_permitted() calls.

  We have checks in the VMA path for the "remote" gups and we avoid
  consulting PKRU for them. This got missed in the pkeys selftests
  because I did a ptrace read, but not a *write*. I also didn't
  explicitly test it against something where a COW needed to be done"

It's also not entirely clear that it makes sense to check the protection
key bits at this level at all.  But one possible eventual solution is to
make the get_user_pages_fast() case just abort if it sees protection key
bits set, which makes us fall back to the regular get_user_pages() case,
which then has a vma and can do the check there if we want to.

We'll see.

Somewhat related to this all: what we _do_ want to do some day is to
check the PAGE_USER bit - it should obviously always be set for user
pages, but it would be a good check to have back.  Because we have no
generic way to test for it, we lost it as part of moving over from the
architecture-specific x86 GUP implementation to the generic one in
commit e585513b76 ("x86/mm/gup: Switch GUP to the generic
get_user_page_fast() implementation").

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-15 18:53:22 -08:00
Dan Williams c7da82b894 mm: replace pmd_write with pmd_access_permitted in fault + gup paths
The 'access_permitted' helper is used in the gup-fast path and goes
beyond the simple _PAGE_RW check to also:

 - validate that the mapping is writable from a protection keys
   standpoint

 - validate that the pte has _PAGE_USER set since all fault paths where
   pmd_write is must be referencing user-memory.

Link: http://lkml.kernel.org/r/151043111049.2842.15241454964150083466.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:42 -08:00
Dan Williams e7fe7b5cae mm: replace pud_write with pud_access_permitted in fault + gup paths
The 'access_permitted' helper is used in the gup-fast path and goes
beyond the simple _PAGE_RW check to also:

 - validate that the mapping is writable from a protection keys
   standpoint

 - validate that the pte has _PAGE_USER set since all fault paths where
   pud_write is must be referencing user-memory.

[dan.j.williams@intel.com: fix powerpc compile error]
  Link: http://lkml.kernel.org/r/151129127237.37405.16073414520854722485.stgit@dwillia2-desk3.amr.corp.intel.com
Link: http://lkml.kernel.org/r/151043110453.2842.2166049702068628177.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:42 -08:00
Mel Gorman 2d4894b5d2 mm: remove cold parameter from free_hot_cold_page*
Most callers users of free_hot_cold_page claim the pages being released
are cache hot.  The exception is the page reclaim paths where it is
likely that enough pages will be freed in the near future that the
per-cpu lists are going to be recycled and the cache hotness information
is lost.  As no one really cares about the hotness of pages being
released to the allocator, just ditch the parameter.

The APIs are renamed to indicate that it's no longer about hot/cold
pages.  It should also be less confusing as there are subtle differences
between them.  __free_pages drops a reference and frees a page when the
refcount reaches zero.  free_hot_cold_page handled pages whose refcount
was already zero which is non-obvious from the name.  free_unref_page
should be more obvious.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

[mgorman@techsingularity.net: add pages to head, not tail]
  Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Pavel Tatashin df8ee57889 sparc64: simplify vmemmap_populate
Remove duplicating code by using common functions vmemmap_pud_populate
and vmemmap_pgd_populate.

Link: http://lkml.kernel.org/r/20171013173214.27300-5-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:05 -08:00
Pavel Tatashin 2a20aa1710 sparc64/mm: set fields in deferred pages
Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to
first initializing struct pages by going through __init_single_page().

With deferred struct page feature enabled there is a case where we set
some fields prior to initializing:

mem_init() {
     register_page_bootmem_info();
     free_all_bootmem();
     ...
}

When register_page_bootmem_info() is called only non-deferred struct
pages are initialized.  But, this function goes through some reserved
pages which might be part of the deferred, and thus are not yet
initialized.

mem_init
register_page_bootmem_info
register_page_bootmem_info_node
 get_page_bootmem
  .. setting fields here ..
  such as: page->freelist = (void *)type;

free_all_bootmem()
free_low_memory_core_early()
 for_each_reserved_mem_region()
  reserve_bootmem_region()
   init_reserved_page() <- Only if this is deferred reserved page
    __init_single_pfn()
     __init_single_page()
      memset(0) <-- Loose the set fields here

We end up with similar issue as in the previous patch, where currently
we do not observe problem as memory is zeroed.  But, if flag asserts are
changed we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page
memory during allocation, we must make sure that struct pages are
properly initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Link: http://lkml.kernel.org/r/20171013173214.27300-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:05 -08:00
Levin, Alexander (Sasha Levin) 75f296d93b kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACK
Convert all allocations that used a NOTRACK flag to stop using it.

Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Kirill A. Shutemov c4812909f5 mm: introduce wrappers to access mm->nr_ptes
Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
and nr_pud.

The patch also makes nr_ptes accounting dependent onto CONFIG_MMU.  Page
table accounting doesn't make sense if you don't have page tables.

It's preparation for consolidation of page-table counters in mm_struct.

Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Kirill A. Shutemov b4e98d9ac7 mm: account pud page tables
On a machine with 5-level paging support a process can allocate
significant amount of memory and stay unnoticed by oom-killer and memory
cgroup.  The trick is to allocate a lot of PUD page tables.  We don't
account PUD page tables, only PMD and PTE.

We already addressed the same issue for PMD page tables, see commit
dc6c9a35b6 ("mm: account pmd page tables to the process").
Introduction of 5-level paging brings the same issue for PUD page
tables.

The patch expands accounting to PUD level.

[kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/]
  Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com
[heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting]
  Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com
Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Nitin Gupta 4dbe87d5a7 sparc64: Cleanup hugepage table walk functions
Flatten out nested code structure in huge_pte_offset()
and huge_pte_alloc().

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 21:48:07 -07:00
Nitin Gupta df7b2155bb sparc64: Add 16GB hugepage support
Adds support for 16GB hugepage size. To use this page size
use kernel parameters as:

default_hugepagesz=16G hugepagesz=16G hugepages=10

Testing:

Tested with the stream benchmark which allocates 48G of
arrays backed by 16G hugepages and does RW operation on
them in parallel.

Orabug: 25362942

Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 21:48:07 -07:00
Nitin Gupta 44382b0195 sparc64: Support huge PUD case in get_user_pages
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-15 21:48:06 -07:00
Vijay Kumar fdaccf74fe sparc64: Increase max_phys_bits to 51 and VA bits to 53 for M8.
On M8 chips, use a max_phys_bits value of 51.

Also, M8 supports VA bits up to 54 bits. However, for now
restrict VA bits to 53 due to 4-level pagetable limitation.

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:08:53 -07:00
Allen Pais 7d484acb2f sparc64: recognize and support sparc M8 cpu type
Recognize SPARC-M8 cpu type, hardware caps and cpu
distribution map.

Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: David Aldridge <david.j.aldridge@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:08:53 -07:00
Nitin Gupta 8399e4b88a sparc64: Register hugepages during arch init
Add hstate for each supported hugepage size using
arch initcall. This change fixes some hugepage
parameter parsing inconsistencies:

case 1: no hugepage parameters

 Without hugepage parameters, only a hugepages-8192kB entry is visible
 in sysfs.  It's different from x86_64 where both 2M and 1G hugepage
 sizes are available.

case 2: default_hugepagesz=[64K|256M|2G]

 When specifying only a default_hugepagesz parameter, the default
 hugepage size isn't really changed and it stays at 8M. This is again
 different from x86_64.

Orabug: 25869946

Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-20 12:37:09 -07:00
Linus Torvalds 3b06b1a744 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:

 - Fix symbol version generation for assembler on sparc, from
   Nagarathnam Muthusamy.

 - Fix compound page handling in gup_huge_pmd(), from Nitin Gupta.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Fix gup_huge_pmd
  Adding the type of exported symbols
  sed regex in Makefile.build requires line break between exported symbols
  Adding asm-prototypes.h for genksyms to generate crc
2017-07-11 21:34:24 -07:00
Thomas Meyer a94c33dd1f lib/extable.c: use bsearch() library function in search_extable()
[thomas@m3y3r.de: v3: fix arch specific implementations]
  Link: http://lkml.kernel.org/r/1497890858.12931.7.camel@m3y3r.de
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:35 -07:00
Punit Agrawal 7868a2087e mm/hugetlb: add size parameter to huge_pte_offset()
A poisoned or migrated hugepage is stored as a swap entry in the page
tables.  On architectures that support hugepages consisting of
contiguous page table entries (such as on arm64) this leads to ambiguity
in determining the page table entry to return in huge_pte_offset() when
a poisoned entry is encountered.

Let's remove the ambiguity by adding a size parameter to convey
additional information about the requested address.  Also fixup the
definition/usage of huge_pte_offset() throughout the tree.

Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: James Hogan <james.hogan@imgtec.com> (odd fixer:METAG ARCHITECTURE)
Cc: Ralf Baechle <ralf@linux-mips.org> (supporter:MIPS)
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:34 -07:00
Nitin Gupta dbd2667a4f sparc64: Fix gup_huge_pmd
The function assumes that each PMD points to head of a
huge page. This is not correct as a PMD can point to
start of any 8M region with a, say 256M, hugepage. The
fix ensures that it points to the correct head of any PMD
huge page.

Cc: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-25 15:23:57 -04:00
Hugh Dickins 1be7107fbe mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 21:50:20 +08:00
Pavel Tatashin 0197e41ce7 sparc64: delete old wrap code
The old method that is using xcall and softint to get new context id is
deleted, as it is replaced by a method of using per_cpu_secondary_mm
without xcall to perform the context wrap.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 13:45:29 -07:00
Pavel Tatashin a0582f26ec sparc64: new context wrap
The current wrap implementation has a race issue: it is called outside of
the ctx_alloc_lock, and also does not wait for all CPUs to complete the
wrap.  This means that a thread can get a new context with a new version
and another thread might still be running with the same context. The
problem is especially severe on CPUs with shared TLBs, like sun4v. I used
the following test to very quickly reproduce the problem:
- start over 8K processes (must be more than context IDs)
- write and read values at a  memory location in every process.

Very quickly memory corruptions start happening, and what we read back
does not equal what we wrote.

Several approaches were explored before settling on this one:

Approach 1:
Move smp_new_mmu_context_version() inside ctx_alloc_lock, and wait for
every process to complete the wrap. (Note: every CPU must WAIT before
leaving smp_new_mmu_context_version_client() until every one arrives).

This approach ends up with deadlocks, as some threads own locks which other
threads are waiting for, and they never receive softint until these threads
exit smp_new_mmu_context_version_client(). Since we do not allow the exit,
deadlock happens.

Approach 2:
Handle wrap right during mondo interrupt. Use etrap/rtrap to enter into
into C code, and issue new versions to every CPU.
This approach adds some overhead to runtime: in switch_mm() we must add
some checks to make sure that versions have not changed due to wrap while
we were loading the new secondary context. (could be protected by PSTATE_IE
but that degrades performance as on M7 and older CPUs as it takes 50 cycles
for each access). Also, we still need a global per-cpu array of MMs to know
where we need to load new contexts, otherwise we can change context to a
thread that is going way (if we received mondo between switch_mm() and
switch_to() time). Finally, there are some issues with window registers in
rtrap() when context IDs are changed during CPU mondo time.

The approach in this patch is the simplest and has almost no impact on
runtime.  We use the array with mm's where last secondary contexts were
loaded onto CPUs and bump their versions to the new generation without
changing context IDs. If a new process comes in to get a context ID, it
will go through get_new_mmu_context() because of version mismatch. But the
running processes do not need to be interrupted. And wrap is quicker as we
do not need to xcall and wait for everyone to receive and complete wrap.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 13:45:29 -07:00
Pavel Tatashin 7a5b4bbf49 sparc64: add per-cpu mm of secondary contexts
The new wrap is going to use information from this array to figure out
mm's that currently have valid secondary contexts setup.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 13:45:29 -07:00
Pavel Tatashin c4415235b2 sparc64: redefine first version
CTX_FIRST_VERSION defines the first context version, but also it defines
first context. This patch redefines it to only include the first context
version.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 13:45:28 -07:00
Pavel Tatashin 5889748573 sparc64: reset mm cpumask after wrap
After a wrap (getting a new context version) a process must get a new
context id, which means that we would need to flush the context id from
the TLB before running for the first time with this ID on every CPU. But,
we use mm_cpumask to determine if this process has been running on this CPU
before, and this mask is not reset after a wrap. So, there are two possible
fixes for this issue:

1. Clear mm cpumask whenever mm gets a new context id
2. Unconditionally flush context every time process is running on a CPU

This patch implements the first solution

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 13:45:28 -07:00
Liam R. Howlett f322980b74 sparc/mm/hugepages: Fix setup_hugepagesz for invalid values.
hugetlb_bad_size needs to be called on invalid values.  Also change the
pr_warn to a pr_err to better align with other platforms.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 13:45:03 -07:00
Mike Kravetz 654f480762 sparc64: mm: fix copy_tsb to correctly copy huge page TSBs
When a TSB grows beyond its current capacity, a new TSB is allocated
and copy_tsb is called to copy entries from the old TSB to the new.
A hash shift based on page size is used to calculate the index of an
entry in the TSB.  copy_tsb has hard coded PAGE_SHIFT in these
calculations.  However, for huge page TSBs the value REAL_HPAGE_SHIFT
should be used.  As a result, when copy_tsb is called for a huge page
TSB the entries are placed at the incorrect index in the newly
allocated TSB.  When doing hardware table walk, the MMU does not
match these entries and we end up in the TSB miss handling code.
This code will then create and write an entry to the correct index
in the TSB.  We take a performance hit for the table walk miss and
recreation of these entries.

Pass a new parameter to copy_tsb that is the page size shift to be
used when copying the TSB.

Suggested-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-06 13:45:02 -07:00
Orlando Arias deba804c90 sparc: Fix -Wstringop-overflow warning
Greetings,

GCC 7 introduced the -Wstringop-overflow flag to detect buffer overflows
in calls to string handling functions [1][2]. Due to the way
``empty_zero_page'' is declared in arch/sparc/include/setup.h, this
causes a warning to trigger at compile time in the function mem_init(),
which is subsequently converted to an error. The ensuing patch fixes
this issue and aligns the declaration of empty_zero_page to that of
other architectures. Thank you.

Cheers,
Orlando.

[1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg02308.html
[2] https://gcc.gnu.org/gcc-7/changes.html

Signed-off-by: Orlando Arias <oarias@knights.ucf.edu>

--------------------------------------------------------------------------------
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 12:07:46 -07:00
Nitin Gupta 544f8f9358 sparc64: Fix hugepage page table free
Make sure the start adderess is aligned to PMD_SIZE
boundary when freeing page table backing a hugepage
region. The issue was causing segfaults when a region
backed by 64K pages was unmapped since such a region
is in general not PMD_SIZE aligned.

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-18 13:11:07 -07:00
Guenter Roeck 9d262d9511 sparc32: Export vac_cache_size to fix build error
sparc32:allmodconfig fails to build with the following error.

ERROR: "vac_cache_size" [drivers/infiniband/sw/rxe/rdma_rxe.ko] undefined!

Fixes: cb88645596 ("infiniband: Fix alignment of mmap cookies ...")
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Doug Ledford <dledford@redhat.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 12:03:30 -07:00
Nitin Gupta 76811263b3 sparc64: Fix memory corruption when THP is enabled
The memory corruption was happening due to incorrect
TLB/TSB flushing of hugepages.

Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-06 12:03:30 -07:00
bob picco adfae8a5d8 sparc64: kern_addr_valid regression
I encountered this bug when using /proc/kcore to examine the kernel. Plus a
coworker inquired about debugging tools. We computed pa but did
not use it during the maximum physical address bits test. Instead we used
the identity mapped virtual address which will always fail this test.

I believe the defect came in here:
[bpicco@zareason linus.git]$ git describe --contains bb4e6e85da
v3.18-rc1~87^2~4
.

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27 21:20:23 -07:00
Nitin Gupta 85b1da7c47 sparc64: Add support for 2G hugepages
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27 21:20:23 -07:00
Nitin Gupta 59f1183dd3 sparc64: Fix size check in huge_pte_alloc
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-27 21:20:22 -07:00
Ingo Molnar 589ee62844 sched/headers: Prepare to remove the <linux/mm_types.h> dependency from <linux/sched.h>
Update code that relied on sched.h including various MM types for them.

This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:37 +01:00
Ingo Molnar b17b01533b sched/headers: Prepare for new header dependencies before moving code to <linux/sched/debug.h>
We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/debug.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:34 +01:00
Ingo Molnar 010426079e sched/headers: Prepare for new header dependencies before moving more code to <linux/sched/mm.h>
We are going to split more MM APIs out of <linux/sched.h>, which
will have to be picked up from a couple of .c files.

The APIs that we are going to move are:

  arch_pick_mmap_layout()
  arch_get_unmapped_area()
  arch_get_unmapped_area_topdown()
  mm_update_next_owner()

Include the header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:30 +01:00
Nitin Gupta ac65e2828d sparc64: Fix build error in flush_tsb_user_page
Patch "sparc64: Add 64K page size support"
unconditionally used __flush_huge_tsb_one_entry()
which is available only when hugetlb support is
enabled.

Another issue was incorrect TSB flushing for 64K
pages in flush_tsb_user().

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24 08:26:21 -08:00
Pavel Tatashin cd429ce2d0 sparc64: memblock resizes are not handled properly
In add_node_ranges() when memblock resize happens, the iterator keeps using
the previous freed array. This bug cause hangs on machine where there are
over 128 memory blocks during boot. For example, on machines where memory
interleaving is small.
The problem is seen on T4-4 because it cant have 2T of memory, and memory
is  interleaved at 8G. So we have 2T/8G = 256 regions to set node IDs. The
starting size of regions array is 128. Thus, we have to double at least one
time (actually we have to double twice because some memory is already
reserved and thus we need more than 256 regions). We start using an
incorrect pointer to the array after the first doubling.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 08:33:23 -08:00
Pavel Tatashin 1537b26dab sparc64: use latency groups to improve add_node_ranges speed
add_node_ranges() takes 2.6s - 3.6s per 1T of boot time. On machine with 6T
memory it takes 15.4s, on 32T it would take 82s-115s of boot time.
This function sets NUMA ids for memory blocks, and scans the whole memory a
page at a time to do so. But, we could use values in latency groups mask
and match to determine the boundaries without checking every single page.
With the fix the add_node_ranges() time is reduced from 15.4s down to 0.2s
on machine with 6T memory.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 08:33:23 -08:00
Nitin Gupta dcd1912d21 sparc64: Add 64K page size support
This patch depends on:
[v6] sparc64: Multi-page size support

- Testing

Tested on Sonoma by running stream benchmark instance which allocated
48G worth of 64K pages.

boot params: default_hugepagesz=64K hugepagesz=64K hugepages=1310720

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 08:32:10 -08:00
Nitin Gupta c7d9f77d33 sparc64: Multi-page size support
Add support for using multiple hugepage sizes simultaneously
on mainline. Currently, support for 256M has been added which
can be used along with 8M pages.

Page tables are set like this (e.g. for 256M page):
    VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and TSB is set similarly:
    VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]

- Testing

Tested on Sonoma (which supports 256M pages) by running stream
benchmark instances in parallel: one instance uses 8M pages and
another uses 256M pages, consuming 48G each.

Boot params used:

default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=10000

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 08:30:28 -08:00
Bhumika Goyal cbc41b43e7 sparc32: mm: srmmu: add __ro_after_init to sparc32_cachetlb_ops structures
The objects viking_ops, viking_sun4d_smp_ops and smp_cachetlb_ops of
type sparc32_cachetlb_ops are not modified anywhere after getting modified
in the init functions. Inside init their reference is also stored in a
pointer of type const struct sparc32_cachetlb_ops *. So these structures
are never modified after init, therefore add __ro_after to the declaration
of these structures.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-23 08:21:34 -08:00
Michal Hocko 6d23f8a5d4 arch, mm: remove arch specific show_mem
We have a generic implementation for quite some time already.  If there
is any arch specific information to be printed then we should add a
callback called from the generic code rather than duplicate the whole
show_mem.

The current code has resulted in the code duplication and the output
divergence which is both confusing and adds maintainance costs.

Let's just get rid of this mess.

Link: http://lkml.kernel.org/r/20170117091543.25850-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn> [UniCore32]
Acked-by: Helge Deller <deller@gmx.de> [for parisc]
Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:30 -08:00
Paul Gortmaker 5437344c2b sparc: migrate exception table users onto extable.h
This file was using module.h for search_exception_table.  We've
now separated that content out into its own file "extable.h" so
now move over to that.  Unlike most other instances, we can't
delete the module.h include here since the file needs that for
the within_module_init definition.

Reported-by: kbuild test robot <lkp@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2017-01-26 10:58:13 -05:00
Linus Torvalds 7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Thomas Tai 87a349f9cc sparc64: fix compile warning section mismatch in find_node()
A compile warning is introduced by a commit to fix the find_node().
This patch fix the compile warning by moving find_node() into __init
section. Because find_node() is only used by memblock_nid_range() which
is only used by a __init add_node_ranges(). find_node() and
memblock_nid_range() should also be inside __init section.

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14 13:40:46 -05:00
Thomas Tai 74a5ed5c4f sparc64: Fix find_node warning if numa node cannot be found
When booting up LDOM, find_node() warns that a physical address
doesn't match a NUMA node.

WARNING: CPU: 0 PID: 0 at arch/sparc/mm/init_64.c:835
find_node+0xf4/0x120 find_node: A physical address doesn't
match a NUMA node rule. Some physical memory will be
owned by node 0.Modules linked in:

CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.0-rc3 #4
Call Trace:
 [0000000000468ba0] __warn+0xc0/0xe0
 [0000000000468c74] warn_slowpath_fmt+0x34/0x60
 [00000000004592f4] find_node+0xf4/0x120
 [0000000000dd0774] add_node_ranges+0x38/0xe4
 [0000000000dd0b1c] numa_parse_mdesc+0x268/0x2e4
 [0000000000dd0e9c] bootmem_init+0xb8/0x160
 [0000000000dd174c] paging_init+0x808/0x8fc
 [0000000000dcb0d0] setup_arch+0x2c8/0x2f0
 [0000000000dc68a0] start_kernel+0x48/0x424
 [0000000000dcb374] start_early_boot+0x27c/0x28c
 [0000000000a32c08] tlb_fixup_done+0x4c/0x64
 [0000000000027f08] 0x27f08

It is because linux use an internal structure node_masks[] to
keep the best memory latency node only. However, LDOM mdesc can
contain single latency-group with multiple memory latency nodes.

If the address doesn't match the best latency node within
node_masks[], it should check for an alternative via mdesc.
The warning message should only be printed if the address
doesn't match any node_masks[] nor within mdesc. To minimize
the impact of searching mdesc every time, the last matched
mask and index is stored in a variable.

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Chris Hyser <chris.hyser@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-10 16:47:38 -08:00
David S. Miller a74ad5e660 sparc64: Handle extremely large kernel TLB range flushes more gracefully.
When the vmalloc area gets fragmented, and because the firmware
mapping area sits between where modules live and the vmalloc area, we
can sometimes receive requests for enormous kernel TLB range flushes.

When this happens the cpu just spins flushing billions of pages and
this triggers the NMI watchdog and other problems.

We took care of this on the TSB side by doing a linear scan of the
table once we pass a certain threshold.

Do something similar for the TLB flush, however we are limited by
the TLB flush facilities provided by the different chip variants.

First of all we use an (mostly arbitrary) cut-off of 256K which is
about 32 pages.  This can be tuned in the future.

The huge range code path for each chip works as follows:

1) On spitfire we flush all non-locked TLB entries using diagnostic
   acceses.

2) On cheetah we use the "flush all" TLB flush.

3) On sun4v/hypervisor we do a TLB context flush on context 0, which
   unlike previous chips does not remove "permanent" or locked
   entries.

We could probably do something better on spitfire, such as limiting
the flush to kernel TLB entries or even doing range comparisons.
However that probably isn't worth it since those chips are old and
the TLB only had 64 entries.

Reported-by: James Clarke <jrtc27@jrtc27.com>
Tested-by: James Clarke <jrtc27@jrtc27.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 09:11:05 -07:00
David S. Miller a236441bb6 sparc64: Fix illegal relative branches in hypervisor patched TLB cross-call code.
Just like the non-cross-call TLB flush handlers, the cross-call ones need
to avoid doing PC-relative branches outside of their code blocks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 10:20:14 -07:00
David S. Miller 830cda3f98 sparc64: Fix instruction count in comment for __hypervisor_flush_tlb_pending.
Noticed by James Clarke.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26 10:08:22 -07:00
David S. Miller 849c498766 sparc64: Handle extremely large kernel TSB range flushes sanely.
If the number of pages we are flushing is more than twice the number
of entries in the TSB, just scan the TSB table for matches rather
than probing each and every page in the range.

Based upon a patch and report by James Clarke.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-25 19:43:17 -07:00
David S. Miller b429ae4d5b sparc64: Fix illegal relative branches in hypervisor patched TLB code.
When we copy code over to patch another piece of code, we can only use
PC-relative branches that target code within that piece of code.

Such PC-relative branches cannot be made to external symbols because
the patch moves the location of the code and thus modifies the
relative address of external symbols.

Use an absolute jmpl to fix this problem.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-25 16:23:26 -07:00
Lorenzo Stoakes c164154f66 mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
This removes the 'write' and 'force' use from get_user_pages_unlocked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-18 14:13:37 -07:00
Paul Gortmaker cdd4f4c710 sparc: migrate exception table users off module.h and onto extable.h
These files were only including module.h for exception table
related functions.  We've now separated that content out into its
own file "extable.h" so now move over to that and avoid all the
extra header content in module.h that we don't really need to compile
these files.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 01:42:30 -04:00
Atish Patra ebb99a4c12 sparc64: Fix irq stack bootmem allocation.
Currently, irq stack bootmem is allocated for all possible cpus
before nr_cpus value changes the list of possible cpus. As a result,
there is unnecessary wastage of bootmemory.

Move the irq stack bootmem allocation so that it happens after
possible cpu list is modified based on nr_cpus value.

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 08:24:03 -07:00
Mike Kravetz 1e953d846a sparc64 mm: Fix more TSB sizing issues
Commit af1b1a9b36 ("sparc64 mm: Fix base TSB sizing when hugetlb
pages are used") addressed the difference between hugetlb and THP
pages when computing TSB sizes.  The following additional issues
were also discovered while working with the code.

In order to save memory, THP makes use of a huge zero page.  This huge
zero page does not count against a task's RSS, but it does consume TSB
entries.  This is similar to hugetlb pages.  Therefore, count huge
zero page entries in hugetlb_pte_count.

Accounting of THP pages is done in the routine set_pmd_at().
Unfortunately, this does not catch the case where a THP page is split.
To handle this case, decrement the count in pmdp_invalidate().
pmdp_invalidate is only called when splitting a THP.  However, 'sanity
checks' are added in case it is ever called for other purposes.

A more general issue exists with HPAGE_SIZE accounting.
hugetlb_pte_count tracks the number of HPAGE_SIZE (8M) pages.  This
value is used to size the TSB for HPAGE_SIZE pages.  However,
each HPAGE_SIZE page consists of two REAL_HPAGE_SIZE (4M) pages.
The TSB contains an entry for each REAL_HPAGE_SIZE page.  Therefore,
the number of REAL_HPAGE_SIZE pages should be used to size the huge
page TSB.  A new compile time constant REAL_HPAGE_PER_HPAGE is used
to multiply hugetlb_pte_count before sizing the TSB.

Changes from V1
- Fixed build issue if hugetlb or THP not configured

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 08:24:02 -07:00
Paul Gortmaker bdf2f59e64 sparc64: fix section mismatch in find_numa_latencies_for_group
To fix:

  WARNING: vmlinux.o(.text.unlikely+0x580): Section mismatch in
  reference from the function find_numa_latencies_for_group() to the
  function .init.text:find_mlgroup()

  The function find_numa_latencies_for_group() references the
  function __init find_mlgroup().  This is often because
  find_numa_latencies_for_group lacks a __init annotation or the
  annotation of find_mlgroup is wrong.

It turns out find_numa_latencies_for_group is only called from:
    static int __init numa_parse_mdesc(void)
and hence we can tag find_numa_latencies_for_group with __init.

In doing so we see that find_best_numa_node_for_mlgroup is only
called from within __init and hence can also be marked with __init.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: sparclinux@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 08:24:02 -07:00
Nitin Gupta 7bc3777ca1 sparc64: Trim page tables for 8M hugepages
For PMD aligned (8M) hugepages, we currently allocate
all four page table levels which is wasteful. We now
allocate till PMD level only which saves memory usage
from page tables.

Also, when freeing page table for 8M hugepage backed region,
make sure we don't try to access non-existent PTE level.

Orabug: 22630259

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-29 10:49:16 -07:00
Mike Kravetz af1b1a9b36 sparc64 mm: Fix base TSB sizing when hugetlb pages are used
do_sparc64_fault() calculates both the base and huge page RSS sizes and
uses this information in calls to tsb_grow().  The calculation for base
page TSB size is not correct if the task uses hugetlb pages.  hugetlb
pages are not accounted for in RSS, therefore the call to get_mm_rss(mm)
does not include hugetlb pages.  However, the number of pages based on
huge_pte_count (which does include hugetlb pages) is subtracted from
this value.  This will result in an artificially small and often negative
RSS calculation.  The base TSB size is then often set to max_tsb_size
as the passed RSS is unsigned, so a negative value looks really big.

THP pages are also accounted for in huge_pte_count, and THP pages are
accounted for in RSS so the calculation in do_sparc64_fault() is correct
if a task only uses THP pages.

A single huge_pte_count is not sufficient for TSB sizing if both hugetlb
and THP pages can be used.  Instead of a single counter, use two:  one
for hugetlb and one for THP.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-28 21:22:12 -07:00
Kirill A. Shutemov dcddffd41d mm: do not pass mm_struct into handle_mm_fault
We always have vma->vm_mm around.

Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Michal Hocko 32d6bd9059 tree wide: get rid of __GFP_REPEAT for order-0 allocations part I
This is the third version of the patchset previously sent [1].  I have
basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
rid of superfluous gfp flags" which went through dm tree.  I am sending
it now because it is tree wide and chances for conflicts are reduced
considerably when we want to target rc2.  I plan to send the next step
and rename the flag and move to a better semantic later during this
release cycle so we will have a new semantic ready for 4.8 merge window
hopefully.

Motivation:

While working on something unrelated I've checked the current usage of
__GFP_REPEAT in the tree.  It seems that a majority of the usage is and
always has been bogus because __GFP_REPEAT has always been about costly
high order allocations while we are using it for order-0 or very small
orders very often.  It seems that a big pile of them is just a
copy&paste when a code has been adopted from one arch to another.

I think it makes some sense to get rid of them because they are just
making the semantic more unclear.  Please note that GFP_REPEAT is
documented as

* __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt

* _might_ fail.  This depends upon the particular VM implementation.
  while !costly requests have basically nofail semantic.  So one could
  reasonably expect that order-0 request with __GFP_REPEAT will not loop
  for ever.  This is not implemented right now though.

I would like to move on with __GFP_REPEAT and define a better semantic
for it.

  $ git grep __GFP_REPEAT origin/master | wc -l
  111
  $ git grep __GFP_REPEAT | wc -l
  36

So we are down to the third after this patch series.  The remaining
places really seem to be relying on __GFP_REPEAT due to large allocation
requests.  This still needs some double checking which I will do later
after all the simple ones are sorted out.

I am touching a lot of arch specific code here and I hope I got it right
but as a matter of fact I even didn't compile test for some archs as I
do not have cross compiler for them.  Patches should be quite trivial to
review for stupid compile mistakes though.  The tricky parts are usually
hidden by macro definitions and thats where I would appreciate help from
arch maintainers.

[1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org

This patch (of 19):

__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations.  Yet we
have the full kernel tree with its usage for apparently order-0
allocations.  This is really confusing because __GFP_REPEAT is
explicitly documented to allow allocation failures which is a weaker
semantic than the current order-0 has (basically nofail).

Let's simply drop __GFP_REPEAT from those places.  This would allow to
identify place which really need allocator to retry harder and formulate
a more specific semantic for what the flag is supposed to do actually.

Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Crispin <blogic@openwrt.org>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
David S. Miller 9ea46abe22 sparc64: Take ctx_alloc_lock properly in hugetlb_setup().
On cheetahplus chips we take the ctx_alloc_lock in order to
modify the TLB lookup parameters for the indexed TLBs, which
are stored in the context register.

This is called with interrupts disabled, however ctx_alloc_lock
is an IRQ safe lock, therefore we must take acquire/release it
properly with spin_{lock,unlock}_irq().

Reported-by: Meelis Roos <mroos@linux.ee>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-25 12:51:32 -07:00
Nitin Gupta 24e49ee3d7 sparc64: Reduce TLB flushes during hugepte changes
During hugepage map/unmap, TSB and TLB flushes are currently
issued at every PAGE_SIZE'd boundary which is unnecessary.
We now issue the flush at REAL_HPAGE_SIZE boundaries only.

Without this patch workloads which unmap a large hugepage
backed VMA region get CPU lockups due to excessive TLB
flush calls.

Orabug: 22365539, 22643230, 22995196

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 18:44:27 -07:00
Sam Ravnborg 6b1cabe899 sparc32: drop superfluous cast in calls to __nocache_pa()
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 17:55:42 -07:00
Sam Ravnborg 6e6e41879e sparc32: fix build with STRICT_MM_TYPECHECKS
Based on recent thread on linux-arch (some weeks ago) I
decided to check how much work was required to build sparc32
with STRICT_MM_TYPECHECKS enabled.

The resulting binary (checked srmmu.o) was to my suprise smaller with
STRICT_MM_TYPECHECKS defined, than without.

As I have no working gear to test sparc32 bits at for the moment,
I did not enable STRICT_MM_TYPECHECKS - but was tempeted to do so.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20 17:55:42 -07:00
Khalid Aziz c5b8b5beee sparc64: recognize and support Sonoma CPU type
Add code to recognize SPARC-Sonoma cpu correctly and update cpu hardware
caps and cpu distribution map. SPARC-Sonoma is based upon SPARC-M7 core
along with additional PCI functions added on and is reported by firmware
as "SPARC-SN".

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 16:43:47 -04:00
Linus Torvalds d4dc3b2442 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:
 "Minor typing cleanup from Joe Perches, and some comment typo fixes
  from Adam Buchbinder"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc: Convert naked unsigned uses to unsigned int
  sparc: Fix misspellings in comments.
2016-03-28 15:14:11 -05:00
Joe Perches 9ef595d83a sparc: Convert naked unsigned uses to unsigned int
Use the more normal kernel definition/declaration style.

Done via:

$ git ls-files arch/sparc | \
  xargs ./scripts/checkpatch.pl -f --fix-inplace --types=unspecified_int

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20 21:28:58 -07:00
Linus Torvalds 643ad15d47 Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 protection key support from Ingo Molnar:
 "This tree adds support for a new memory protection hardware feature
  that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

  There's a background article at LWN.net:

      https://lwn.net/Articles/643797/

  The gist is that protection keys allow the encoding of
  user-controllable permission masks in the pte.  So instead of having a
  fixed protection mask in the pte (which needs a system call to change
  and works on a per page basis), the user can map a (handful of)
  protection mask variants and can change the masks runtime relatively
  cheaply, without having to change every single page in the affected
  virtual memory range.

  This allows the dynamic switching of the protection bits of large
  amounts of virtual memory, via user-space instructions.  It also
  allows more precise control of MMU permission bits: for example the
  executable bit is separate from the read bit (see more about that
  below).

  This tree adds the MM infrastructure and low level x86 glue needed for
  that, plus it adds a high level API to make use of protection keys -
  if a user-space application calls:

        mmap(..., PROT_EXEC);

  or

        mprotect(ptr, sz, PROT_EXEC);

  (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
  this special case, and will set a special protection key on this
  memory range.  It also sets the appropriate bits in the Protection
  Keys User Rights (PKRU) register so that the memory becomes unreadable
  and unwritable.

  So using protection keys the kernel is able to implement 'true'
  PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
  PROT_READ as well.  Unreadable executable mappings have security
  advantages: they cannot be read via information leaks to figure out
  ASLR details, nor can they be scanned for ROP gadgets - and they
  cannot be used by exploits for data purposes either.

  We know about no user-space code that relies on pure PROT_EXEC
  mappings today, but binary loaders could start making use of this new
  feature to map binaries and libraries in a more secure fashion.

  There is other pending pkeys work that offers more high level system
  call APIs to manage protection keys - but those are not part of this
  pull request.

  Right now there's a Kconfig that controls this feature
  (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
  (like most x86 CPU feature enablement code that has no runtime
  overhead), but it's not user-configurable at the moment.  If there's
  any serious problem with this then we can make it configurable and/or
  flip the default"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
  mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
  x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
  mm/core, x86/mm/pkeys: Add execute-only protection keys support
  x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
  x86/mm/pkeys: Allow kernel to modify user pkey rights register
  x86/fpu: Allow setting of XSAVE state
  x86/mm: Factor out LDT init from context init
  mm/core, x86/mm/pkeys: Add arch_validate_pkey()
  mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
  x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
  x86/mm/pkeys: Add Kconfig prompt to existing config option
  x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
  x86/mm/pkeys: Dump PKRU with other kernel registers
  mm/core, x86/mm/pkeys: Differentiate instruction fetches
  x86/mm/pkeys: Optimize fault handling in access_error()
  mm/core: Do not enforce PKEY permissions on remote mm access
  um, pkeys: Add UML arch_*_access_permitted() methods
  mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
  x86/mm/gup: Simplify get_user_pages() PTE bit handling
  ...
2016-03-20 19:08:56 -07:00
Kirill A. Shutemov 3ed3a4f0dd mm: cleanup *pte_alloc* interfaces
There are few things about *pte_alloc*() helpers worth cleaning up:

 - 'vma' argument is unused, let's drop it;

 - most __pte_alloc() callers do speculative check for pmd_none(),
   before taking ptl: let's introduce pte_alloc() macro which does
   the check.

   The only direct user of __pte_alloc left is userfaultfd, which has
   different expectation about atomicity wrt pmd.

 - pte_alloc_map() and pte_alloc_map_lock() are redefined using
   pte_alloc().

[sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
[sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Dave Hansen d4edcf0d56 mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm
We will soon modify the vanilla get_user_pages() so it can no
longer be used on mm/tasks other than 'current/current->mm',
which is by far the most common way it is called.  For now,
we allow the old-style calls, but warn when they are used.
(implemented in previous patch)

This patch switches all callers of:

	get_user_pages()
	get_user_pages_unlocked()
	get_user_pages_locked()

to stop passing tsk/mm so they will no longer see the warnings.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-16 10:11:12 +01:00
Toshi Kani 35d98e93fe arch: Set IORESOURCE_SYSTEM_RAM flag for System RAM
Set IORESOURCE_SYSTEM_RAM in flags of resource ranges with
"System RAM", "Kernel code", "Kernel data", and "Kernel bss".

Note that:

 - IORESOURCE_SYSRAM (i.e. modifier bit) is set in flags when
   IORESOURCE_MEM is already set. IORESOURCE_SYSTEM_RAM is defined
   as (IORESOURCE_MEM|IORESOURCE_SYSRAM).

 - Some archs do not set 'flags' for children nodes, such as
   "Kernel code".  This patch does not change 'flags' in this
   case.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@linux-mips.org
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-parisc@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/1453841853-11383-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-30 09:49:57 +01:00
Kirill A. Shutemov 99f1bc0116 sparc, thp: remove infrastructure for handling splitting PMDs
With new refcounting we don't need to mark PMDs splitting.  Let's drop
code to handle this.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov ddc58f27f9 mm: drop tail page refcounting
Tail page refcounting is utterly complicated and painful to support.

It uses ->_mapcount on tail pages to store how many times this page is
pinned.  get_page() bumps ->_mapcount on tail page in addition to
->_count on head.  This information is required by split_huge_page() to
be able to distribute pins from head of compound page to tails during
the split.

We will need ->_mapcount to account PTE mappings of subpages of the
compound page.  We eliminate need in current meaning of ->_mapcount in
tail pages by forbidding split entirely if the page is pinned.

The only user of tail page refcounting is THP which is marked BROKEN for
now.

Let's drop all this mess.  It makes get_page() and put_page() much
simpler.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Nitin Gupta 36beca6571 sparc64: Fix numa node distance initialization
Orabug: 22495713

Currently, NUMA node distance matrix is initialized only
when a machine descriptor (MD) exists. However, sun4u
machines (e.g. Sun Blade 2500) do not have an MD and thus
distance values were left uninitialized. The initialization
is now moved such that it happens on both sun4u and sun4v.

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Tested-by: Mikael Pettersson <mikpelinux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-14 16:58:59 -05:00
Nitin Gupta 52708d690b sparc64: Fix numa distance values
Orabug: 21896119

Use machine descriptor (MD) to get node latency
values instead of just using default values.

Testing:
On an T5-8 system with:
 - total nodes = 8
 - self latencies = 0x26d18
 - latency to other nodes = 0x3a598
   => latency ratio = ~1.5

output of numactl --hardware

 - before fix:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10

 - after fix:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  15  15  15  15  15  15  15
  1:  15  10  15  15  15  15  15  15
  2:  15  15  10  15  15  15  15  15
  3:  15  15  15  10  15  15  15  15
  4:  15  15  15  15  10  15  15  15
  5:  15  15  15  15  15  10  15  15
  6:  15  15  15  15  15  15  10  15
  7:  15  15  15  15  15  15  15  10

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Reviewed-by: Chris Hyser <chris.hyser@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-04 12:14:49 -08:00
David Ahern 2bf7c3efc3 sparc64: Convert BUG_ON to warning
Pagefault handling has a BUG_ON path that panics the system. Convert it to
a warning instead. There is no need to bring down the system for this kind
of failure.

The following was hit while running:
    perf sched record -g -- make -j 16

[3609412.782801] kernel BUG at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:416!
[3609412.782833]               \|/ ____ \|/
[3609412.782833]               "@'/ .. \`@"
[3609412.782833]               /_| \__/ |_\
[3609412.782833]                  \__U_/
[3609412.782870] cat(4516): Kernel bad sw trap 5 [#1]
[3609412.782889] CPU: 0 PID: 4516 Comm: cat Tainted: G            E   4.1.0-rc8+ #6
[3609412.782909] task: fff8000126e31f80 ti: fff8000110d90000 task.ti: fff8000110d90000
[3609412.782931] TSTATE: 0000004411001603 TPC: 000000000096b164 TNPC: 000000000096b168 Y: 0000004e    Tainted: G            E
[3609412.782964] TPC: <do_sparc64_fault+0x5e4/0x6a0>
[3609412.782979] g0: 000000000096abe0 g1: 0000000000d314c4 g2: 0000000000000000 g3: 0000000000000001
[3609412.783009] g4: fff8000126e31f80 g5: fff80001302d2000 g6: fff8000110d90000 g7: 00000000000000ff
[3609412.783045] o0: 0000000000aff6a8 o1: 00000000000001a0 o2: 0000000000000001 o3: 0000000000000054
[3609412.783080] o4: fff8000100026820 o5: 0000000000000001 sp: fff8000110d935f1 ret_pc: 000000000096b15c
[3609412.783117] RPC: <do_sparc64_fault+0x5dc/0x6a0>
[3609412.783137] l0: 000007feff996000 l1: 0000000000030001 l2: 0000000000000004 l3: fff8000127bd0120
[3609412.783174] l4: 0000000000000054 l5: fff8000127bd0188 l6: 0000000000000000 l7: fff8000110d9dba8
[3609412.783210] i0: fff8000110d93f60 i1: fff8000110ca5530 i2: 000000000000003f i3: 0000000000000054
[3609412.783244] i4: fff800010000081a i5: fff8000100000398 i6: fff8000110d936a1 i7: 0000000000407c6c
[3609412.783286] I7: <sparc64_realfault_common+0x10/0x20>
[3609412.783308] Call Trace:
[3609412.783329]  [0000000000407c6c] sparc64_realfault_common+0x10/0x20
[3609412.783353] Disabling lock debugging due to kernel taint
[3609412.783379] Caller[0000000000407c6c]: sparc64_realfault_common+0x10/0x20
[3609412.783449] Caller[fff80001002283e4]: 0xfff80001002283e4
[3609412.783471] Instruction DUMP: 921021a0  7feaff91  901222a8 <91d02005> 82086100  02f87f7b  808a2873  81cfe008  01000000
[3609412.783542] Kernel panic - not syncing: Fatal exception
[3609412.784605] Press Stop-A (L1-A) to return to the boot prom
[3609412.784615] ---[ end Kernel panic - not syncing: Fatal exception

With this patch rather than a panic I occasionally get something like this:
    perf sched record -g -m 1024  -- make -j N

where N is based on number of cpus (128 to 1024 for a T7-4 and 8 for an 8 cpu
VM on a T5-2).

WARNING: CPU: 211 PID: 52565 at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:417 do_sparc64_fault+0x340/0x70c()
address (7feffcd6000) != regs->tpc (fff80001004873c0)
Modules linked in: ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 cdc_ether usbnet mii ixgbe mdio igb i2c_algo_bit i2c_core ptp crc32c_sparc64 camellia_sparc64 des_sparc64 des_generic md5_sparc64 sha512_sparc64 sha1_sparc64 uio_pdrv_genirq uio usb_storage mpt3sas scsi_transport_sas raid_class aes_sparc64 sunvnet sunvdc sha256_sparc64(E) sha256_generic(E)
CPU: 211 PID: 52565 Comm: ld Tainted: G        W   E   4.1.0-rc8+ #19
Call Trace:
 [000000000045ce30] warn_slowpath_common+0x7c/0xa0
 [000000000045ceec] warn_slowpath_fmt+0x30/0x40
 [000000000098ad64] do_sparc64_fault+0x340/0x70c
 [0000000000407c2c] sparc64_realfault_common+0x10/0x20
---[ end trace 62ee02065a01a049 ]---
ld[52565]: segfault at fff80001004873c0 ip fff80001004873c0 (rpc fff8000100158868) sp 000007feffcd70e1 error 30002 in libc-2.12.so[fff8000100410000+184000]

The segfault is horrible, but better than a system panic.

An 8-cpu VM on a T5-2 also showed the above traces from time to time,
so it is a general problem and not specific to the T7 or baremetal.

Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-25 06:01:02 -07:00
Tony Luck fc6daaf931 mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
Some high end Intel Xeon systems report uncorrectable memory errors as a
recoverable machine check.  Linux has included code for some time to
process these and just signal the affected processes (or even recover
completely if the error was in a read only page that can be replaced by
reading from disk).

But we have no recovery path for errors encountered during kernel code
execution.  Except for some very specific cases were are unlikely to ever
be able to recover.

Enter memory mirroring. Actually 3rd generation of memory mirroing.

Gen1: All memory is mirrored
	Pro: No s/w enabling - h/w just gets good data from other side of the
	     mirror
	Con: Halves effective memory capacity available to OS/applications

Gen2: Partial memory mirror - just mirror memory begind some memory controllers
	Pro: Keep more of the capacity
	Con: Nightmare to enable. Have to choose between allocating from
	     mirrored memory for safety vs. NUMA local memory for performance

Gen3: Address range partial memory mirror - some mirror on each memory
      controller
	Pro: Can tune the amount of mirror and keep NUMA performance
	Con: I have to write memory management code to implement

The current plan is just to use mirrored memory for kernel allocations.
This has been broken into two phases:

1) This patch series - find the mirrored memory, use it for boot time
   allocations

2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
   unused mirrored memory from mm/memblock.c and only give it out to
   select kernel allocations (this is still being scoped because
   page_alloc.c is scary).

This patch (of 3):

Add extra "flags" to memblock to allow selection of memory based on
attribute.  No functional changes

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:44 -07:00
Zhang Zhen e81f2d2237 mm/hugetlb: reduce arch dependent code about huge_pmd_unshare
Currently we have many duplicates in definitions of huge_pmd_unshare.  In
all architectures this function just returns 0 when
CONFIG_ARCH_WANT_HUGE_PMD_SHARE is N.

This patch puts the default implementation in mm/hugetlb.c and lets these
architectures use the common code.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: David Rientjes <rientjes@google.com>
Cc: James Yang <James.Yang@freescale.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:41 -07:00
Ingo Molnar f407a82586 Merge branch 'linus' into sched/core, to resolve conflict
Conflicts:
	arch/sparc/include/asm/topology_64.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-06-02 08:05:42 +02:00
Khalid Aziz 494e5b6fae sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE
sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE

Bit 9 of TTE is CV (Cacheable in V-cache) on sparc v9 processor while
the same bit 9 is MCDE (Memory Corruption Detection Enable) on M7
processor. This creates a conflicting usage of the same bit. Kernel
sets TTE.cv bit on all pages for sun4v architecture which works well
for sparc v9 but enables memory corruption detection on M7 processor
which is not the intent. This patch adds code to determine if kernel
is running on M7 processor and takes steps to not enable memory
corruption detection in TTE erroneously.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-31 22:15:01 -07:00
David Hildenbrand 70ffdb9393 mm/fault, arch: Use pagefault_disable() to check for disabled pagefaults in the handler
Introduce faulthandler_disabled() and use it to check for irq context and
disabled pagefaults (via pagefault_disable()) in the pagefault handlers.

Please note that we keep the in_atomic() checks in place - to detect
whether in irq context (in which case preemption is always properly
disabled).

In contrast, preempt_disable() should never be used to disable pagefaults.
With !CONFIG_PREEMPT_COUNT, preempt_disable() doesn't modify the preempt
counter, and therefore the result of in_atomic() differs.
We validate that condition by using might_fault() checks when calling
might_sleep().

Therefore, add a comment to faulthandler_disabled(), describing why this
is needed.

faulthandler_disabled() and pagefault_disable() are defined in
linux/uaccess.h, so let's properly add that include to all relevant files.

This patch is based on a patch from Thomas Gleixner.

Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David.Laight@ACULAB.COM
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: benh@kernel.crashing.org
Cc: bigeasy@linutronix.de
Cc: borntraeger@de.ibm.com
Cc: daniel.vetter@intel.com
Cc: heiko.carstens@de.ibm.com
Cc: herbert@gondor.apana.org.au
Cc: hocko@suse.cz
Cc: hughd@google.com
Cc: mst@redhat.com
Cc: paulus@samba.org
Cc: ralf@linux-mips.org
Cc: schwidefsky@de.ibm.com
Cc: yang.shi@windriver.com
Link: http://lkml.kernel.org/r/1431359540-32227-7-git-send-email-dahi@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 08:39:15 +02:00
David Hildenbrand 2cb7c9cb42 sched/preempt, mm/kmap: Explicitly disable/enable preemption in kmap_atomic_*
The existing code relies on pagefault_disable() implicitly disabling
preemption, so that no schedule will happen between kmap_atomic() and
kunmap_atomic().

Let's make this explicit, to prepare for pagefault_disable() not
touching preemption anymore.

Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: David.Laight@ACULAB.COM
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: benh@kernel.crashing.org
Cc: bigeasy@linutronix.de
Cc: borntraeger@de.ibm.com
Cc: daniel.vetter@intel.com
Cc: heiko.carstens@de.ibm.com
Cc: herbert@gondor.apana.org.au
Cc: hocko@suse.cz
Cc: hughd@google.com
Cc: mst@redhat.com
Cc: paulus@samba.org
Cc: ralf@linux-mips.org
Cc: schwidefsky@de.ibm.com
Cc: yang.shi@windriver.com
Link: http://lkml.kernel.org/r/1431359540-32227-5-git-send-email-dahi@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 08:39:14 +02:00
David S. Miller 3c08158e0e sparc: Fix /proc/kcore
/proc/kcore investigates the "System RAM" elements in /proc/iomem to
initialize it's memory tables.  Therefore we have to register them
before it tries to do so.  kcore uses device_initcall() so let's
use arch_initcall() for the registry.

Also we need ARCH_PROC_KCORE_TEXT to get the virtual addresses of
the kernel image correct.

Reported-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 19:15:28 -07:00
Andrea Arcangeli a7b780750e mm: gup: use get_user_pages_unlocked within get_user_pages_fast
This allows the get_user_pages_fast slow path to release the mmap_sem
before blocking.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:05 -08:00
Naoya Horiguchi 61f77eda9b mm/hugetlb: reduce arch dependent code around follow_huge_*
Currently we have many duplicates in definitions around
follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
patch tries to remove the m.  The basic idea is to put the default
implementation for these functions in mm/hugetlb.c as weak symbols
(regardless of CONFIG_ARCH_WANT_GENERAL_HUGETL B), and to implement
arch-specific code only when the arch needs it.

For follow_huge_addr(), only powerpc and ia64 have their own
implementation, and in all other architectures this function just returns
ERR_PTR(-EINVAL).  So this patch sets returning ERR_PTR(-EINVAL) as
default.

As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
always return 0 in your architecture (like in ia64 or sparc,) it's never
called (the callsite is optimized away) no matter how implemented it is.
So in such architectures, we don't need arch-specific implementation.

In some architecture (like mips, s390 and tile,) their current
arch-specific follow_huge_(pmd|pud)() are effectively identical with the
common code, so this patch lets these architecture use the common code.

One exception is metag, where pmd_huge() could return non-zero but it
expects follow_huge_pmd() to always return NULL.  This means that we need
arch-specific implementation which returns NULL.  This behavior looks
strange to me (because non-zero pmd_huge() implies that the architecture
supports PMD-based hugepage, so follow_huge_pmd() can/should return some
relevant value,) but that's beyond this cleanup patch, so let's keep it.

Justification of non-trivial changes:
- in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
  patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
  is true when follow_huge_pmd() can be called (note that pmd_huge() has
  the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
- in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
  code. This patch forces these archs use PMD_MASK, but it's OK because
  they are identical in both archs.
  In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20.
  In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
  PMD_SHIFT is define as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
  PTE_ORDER is always 0, so these are identical.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:01 -08:00
Linus Torvalds 33692f2759 vm: add VM_FAULT_SIGSEGV handling support
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
"you should SIGSEGV" error, because the SIGSEGV case was generally
handled by the caller - usually the architecture fault handler.

That results in lots of duplication - all the architecture fault
handlers end up doing very similar "look up vma, check permissions, do
retries etc" - but it generally works.  However, there are cases where
the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.

In particular, when accessing the stack guard page, libsigsegv expects a
SIGSEGV.  And it usually got one, because the stack growth is handled by
that duplicated architecture fault handler.

However, when the generic VM layer started propagating the error return
from the stack expansion in commit fee7e49d45 ("mm: propagate error
from stack expansion even for guard page"), that now exposed the
existing VM_FAULT_SIGBUS result to user space.  And user space really
expected SIGSEGV, not SIGBUS.

To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
duplicate architecture fault handlers about it.  They all already have
the code to handle SIGSEGV, so it's about just tying that new return
value to the existing code, but it's all a bit annoying.

This is the mindless minimal patch to do this.  A more extensive patch
would be to try to gather up the mostly shared fault handling logic into
one generic helper routine, and long-term we really should do that
cleanup.

Just from this patch, you can generally see that most architectures just
copied (directly or indirectly) the old x86 way of doing things, but in
the meantime that original x86 model has been improved to hold the VM
semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
"newer" things, so it would be a good idea to bring all those
improvements to the generic case and teach other architectures about
them too.

Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
Tested-by: Jan Engelhardt <jengelh@inai.de>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-29 10:51:32 -08:00
Andreas Larsson 66d0f7ec9f sparc32: destroy_context() and switch_mm() needs to disable interrupts.
Load balancing can be triggered in the critical sections protected by
srmmu_context_spinlock in destroy_context() and switch_mm() and can hang
the cpu waiting for the rq lock of another cpu that in turn has called
switch_mm hangning on srmmu_context_spinlock leading to deadlock.

So, disable interrupt while taking srmmu_context_spinlock in
destroy_context() and switch_mm() so we don't deadlock.

See also commit 77b838fa1e ("[SPARC64]: destroy_context() needs to disable
interrupts.")

Signed-off-by: Andreas Larsson <andreas@gaisler.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-18 12:47:54 -05:00
Joonsoo Kim 031bc5743f mm/debug-pagealloc: make debug-pagealloc boottime configurable
Now, we have prepared to avoid using debug-pagealloc in boottime.  So
introduce new kernel-parameter to disable debug-pagealloc in boottime, and
makes related functions to be disabled in this case.

Only non-intuitive part is change of guard page functions.  Because guard
page is effective only if debug-pagealloc is enabled, turning off
according to debug-pagealloc is reasonable thing to do.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Jungsoo Son <jungsoo.son@lge.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:48 -08:00
David S. Miller 06090e8ed8 sparc64: Implement __get_user_pages_fast().
It is not sufficient to only implement get_user_pages_fast(), you
must also implement the atomic version __get_user_pages_fast()
otherwise you end up using the weak symbol fallback implementation
which simply returns zero.

This is dangerous, because it causes the futex code to loop forever
if transparent hugepages are supported (see get_futex_key()).

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-24 09:59:02 -07:00
Linus Torvalds 0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
David S. Miller d195b71bad sparc64: Kill unnecessary tables and increase MAX_BANKS.
swapper_low_pmd_dir and swapper_pud_dir are actually completely
useless and unnecessary.

We just need swapper_pg_dir[].  Naturally the other page table chunks
will be allocated on an as-needed basis.  Since the kernel actually
accesses these tables in the PAGE_OFFSET view, there is not even a TLB
locality advantage of placing them in the kernel image.

Use the hard coded vmlinux.ld.S slot for swapper_pg_dir which is
naturally page aligned.

Increase MAX_BANKS to 1024 in order to handle heavily fragmented
virtual guests.

Even with this MAX_BANKS increase, the kernel is 20K+ smaller.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-05 16:53:40 -07:00
David S. Miller bb4e6e85da sparc64: Adjust vmalloc region size based upon available virtual address bits.
In order to accomodate embedded per-cpu allocation with large numbers
of cpus and numa nodes, we have to use as much virtual address space
as possible for the vmalloc region.  Otherwise we can get things like:

PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000

So, once we select a value for PAGE_OFFSET, derive the size of the
vmalloc region based upon that.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-05 16:53:40 -07:00
David S. Miller 7c0fa0f24b sparc64: Increase MAX_PHYS_ADDRESS_BITS to 53.
Make sure, at compile time, that the kernel can properly support
whatever MAX_PHYS_ADDRESS_BITS is defined to.

On M7 chips, use a max_phys_bits value of 49.

Based upon a patch by Bob Picco.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-05 16:53:40 -07:00
David S. Miller c06240c7f5 sparc64: Use kernel page tables for vmemmap.
For sparse memory configurations, the vmemmap array behaves terribly
and it takes up an inordinate amount of space in the BSS section of
the kernel image unconditionally.

Just build huge PMDs and look them up just like we do for TLB misses
in the vmalloc area.

Kernel BSS shrinks by about 2MB.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-05 16:53:39 -07:00
David S. Miller 0dd5b7b09e sparc64: Fix physical memory management regressions with large max_phys_bits.
If max_phys_bits needs to be > 43 (f.e. for T4 chips), things like
DEBUG_PAGEALLOC stop working because the 3-level page tables only
can cover up to 43 bits.

Another problem is that when we increased MAX_PHYS_ADDRESS_BITS up to
47, several statically allocated tables became enormous.

Compounding this is that we will need to support up to 49 bits of
physical addressing for M7 chips.

The two tables in question are sparc64_valid_addr_bitmap and
kpte_linear_bitmap.

The first holds a bitmap, with 1 bit for each 4MB chunk of physical
memory, indicating whether that chunk actually exists in the machine
and is valid.

The second table is a set of 2-bit values which tell how large of a
mapping (4MB, 256MB, 2GB, 16GB, respectively) we can use at each 256MB
chunk of ram in the system.

These tables are huge and take up an enormous amount of the BSS
section of the sparc64 kernel image.  Specifically, the
sparc64_valid_addr_bitmap is 4MB, and the kpte_linear_bitmap is 128K.

So let's solve the space wastage and the DEBUG_PAGEALLOC problem
at the same time, by using the kernel page tables (as designed) to
manage this information.

We have to keep using large mappings when DEBUG_PAGEALLOC is disabled,
and we do this by encoding huge PMDs and PUDs.

On a T4-2 with 256GB of ram the kernel page table takes up 16K with
DEBUG_PAGEALLOC disabled and 256MB with it enabled.  Furthermore, this
memory is dynamically allocated at run time rather than coded
statically into the kernel image.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-05 16:53:39 -07:00
David S. Miller 8c82dc0e88 sparc64: Adjust KTSB assembler to support larger physical addresses.
As currently coded the KTSB accesses in the kernel only support up to
47 bits of physical addressing.

Adjust the instruction and patching sequence in order to support
arbitrary 64 bits addresses.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-05 16:53:39 -07:00
David S. Miller 4397bed080 sparc64: Define VA hole at run time, rather than at compile time.
Now that we use 4-level page tables, we can provide up to 53-bits of
virtual address space to the user.

Adjust the VA hole based upon the capabilities of the cpu type probed.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-05 16:53:39 -07:00
David S. Miller ac55c76814 sparc64: Switch to 4-level page tables.
This has become necessary with chips that support more than 43-bits
of physical addressing.

Based almost entirely upon a patch by Bob Picco.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-05 16:53:38 -07:00
David S. Miller 473ad7f4fb sparc64: Fix reversed start/end in flush_tlb_kernel_range()
When we have to split up a flush request into multiple pieces
(in order to avoid the firmware range) we don't specify the
arguments in the right order for the second piece.

Fix the order, or else we get hangs as the code tries to
flush "a lot" of entries and we get lockups like this:

[ 4422.981276] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [expect:117032]
[ 4422.996130] Modules linked in: ipv6 loop usb_storage igb ptp sg sr_mod ehci_pci ehci_hcd pps_core n2_rng rng_core
[ 4423.016617] CPU: 12 PID: 117032 Comm: expect Not tainted 3.17.0-rc4+ #1608
[ 4423.030331] task: fff8003cc730e220 ti: fff8003d99d54000 task.ti: fff8003d99d54000
[ 4423.045282] TSTATE: 0000000011001602 TPC: 00000000004521e8 TNPC: 00000000004521ec Y: 00000000    Not tainted
[ 4423.064905] TPC: <__flush_tlb_kernel_range+0x28/0x40>
[ 4423.074964] g0: 000000000052fd10 g1: 00000001295a8000 g2: ffffff7176ffc000 g3: 0000000000002000
[ 4423.092324] g4: fff8003cc730e220 g5: fff8003dfedcc000 g6: fff8003d99d54000 g7: 0000000000000006
[ 4423.109687] o0: 0000000000000000 o1: 0000000000000000 o2: 0000000000000003 o3: 00000000f0000000
[ 4423.127058] o4: 0000000000000080 o5: 00000001295a8000 sp: fff8003d99d56d01 ret_pc: 000000000052ff54
[ 4423.145121] RPC: <__purge_vmap_area_lazy+0x314/0x3a0>
[ 4423.155185] l0: 0000000000000000 l1: 0000000000000000 l2: 0000000000a38040 l3: 0000000000000000
[ 4423.172559] l4: fff8003dae8965e0 l5: ffffffffffffffff l6: 0000000000000000 l7: 00000000f7e2b138
[ 4423.189913] i0: fff8003d99d576a0 i1: fff8003d99d576a8 i2: fff8003d99d575e8 i3: 0000000000000000
[ 4423.207284] i4: 0000000000008008 i5: fff8003d99d575c8 i6: fff8003d99d56df1 i7: 0000000000530c24
[ 4423.224640] I7: <free_vmap_area_noflush+0x64/0x80>
[ 4423.234193] Call Trace:
[ 4423.239051]  [0000000000530c24] free_vmap_area_noflush+0x64/0x80
[ 4423.251029]  [0000000000531a7c] remove_vm_area+0x5c/0x80
[ 4423.261628]  [0000000000531b80] __vunmap+0x20/0x120
[ 4423.271352]  [000000000071cf18] n_tty_close+0x18/0x40
[ 4423.281423]  [00000000007222b0] tty_ldisc_close+0x30/0x60
[ 4423.292183]  [00000000007225a4] tty_ldisc_reinit+0x24/0xa0
[ 4423.303120]  [0000000000722ab4] tty_ldisc_hangup+0xd4/0x1e0
[ 4423.314232]  [0000000000719aa0] __tty_hangup+0x280/0x3c0
[ 4423.324835]  [0000000000724cb4] pty_close+0x134/0x1a0
[ 4423.334905]  [000000000071aa24] tty_release+0x104/0x500
[ 4423.345316]  [00000000005511d0] __fput+0x90/0x1e0
[ 4423.354701]  [000000000047fa54] task_work_run+0x94/0xe0
[ 4423.365126]  [0000000000404b44] __handle_signal+0xc/0x2c

Fixes: 4ca9a23765 ("sparc64: Guard against flushing openfirmware mappings.")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-04 21:05:14 -07:00
bob picco 7c21d533ab sparc64: mem boot option correction
The "mem" boot option can result in many unexpected consequences. This patch
attempts to prevent boot hangs which have been experienced on T4-4 and T5-8.
Basically the boot loader allocates vmlinuz and initrd higher in available
OBP physical memory. For example, on a 2Tb T5-8 it isn't possible to boot
with mem=20G.

The patch utilizes memblock to avoid reserved regions and trim memory which
is only free. Other improvements are possible for a multi-node machine.

This is a snippet of the boot log with mem=20G on T5-8 with the patch applied:
MEMBLOCK configuration:	<- before memory reduction
 memory size = 0x1ffad6ce000 reserved size = 0xa1adf44
 memory.cnt  = 0xb
 memory[0x0]    [0x00000030400000-0x00003fdde47fff], 0x3fada48000 bytes
 memory[0x1]    [0x00003fdde4e000-0x00003fdde4ffff], 0x2000 bytes
 memory[0x2]    [0x00080000000000-0x00083fffffffff], 0x4000000000 bytes
 memory[0x3]    [0x00100000000000-0x00103fffffffff], 0x4000000000 bytes
 memory[0x4]    [0x00180000000000-0x00183fffffffff], 0x4000000000 bytes
 memory[0x5]    [0x00200000000000-0x00203fffffffff], 0x4000000000 bytes
 memory[0x6]    [0x00280000000000-0x00283fffffffff], 0x4000000000 bytes
 memory[0x7]    [0x00300000000000-0x00303fffffffff], 0x4000000000 bytes
 memory[0x8]    [0x00380000000000-0x00383fffc71fff], 0x3fffc72000 bytes
 memory[0x9]    [0x00383fffc92000-0x00383fffca1fff], 0x10000 bytes
 memory[0xa]    [0x00383fffcb4000-0x00383fffcb5fff], 0x2000 bytes
 reserved.cnt  = 0x2
 reserved[0x0]  [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes
 reserved[0x1]  [0x00380004000000-0x0038000d02f74a], 0x902f74b bytes
...
MEMBLOCK configuration:	<- after reduction of memory
 memory size = 0x50a1adf44 reserved size = 0xa1adf44
 memory.cnt  = 0x4
 memory[0x0]    [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes
 memory[0x1]    [0x00380004000000-0x0038050d01d74a], 0x50901d74b bytes
 memory[0x2]    [0x00383fffc92000-0x00383fffca1fff], 0x10000 bytes
 memory[0x3]    [0x00383fffcb4000-0x00383fffcb5fff], 0x2000 bytes
 reserved.cnt  = 0x2
 reserved[0x0]  [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes
 reserved[0x1]  [0x00380004000000-0x0038000d02f74a], 0x902f74b bytes
...
Early memory node ranges
  node   7: [mem 0x380000000000-0x38000117dfff]
  node   7: [mem 0x380004000000-0x380f0d01bfff]
  node   7: [mem 0x383fffc92000-0x383fffca1fff]
  node   7: [mem 0x383fffcb4000-0x383fffcb5fff]
Could not find start_pfn for node 0
Could not find start_pfn for node 1
Could not find start_pfn for node 2
Could not find start_pfn for node 3
Could not find start_pfn for node 4
Could not find start_pfn for node 5
Could not find start_pfn for node 6
.

The patch was tested on T4-1, T5-8 and Jalap?no.

Cc: sparclinux@vger.kernel.org
Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-16 18:23:11 -07:00
bob picco 3dee9df548 sparc64: find_node adjustment
We have seen an issue with guest boot into LDOM that causes early boot failures
because of no matching rules for node identitity of the memory. I analyzed this
on my T4 and concluded there might not be a solution. I saw the issue in
mainline too when booting into the control/primary domain - with guests
configured.  Note, this could be a firmware bug on some older machines.

I'll provide a full explanation of the issues below. Should we not find a
matching BEST latency group for a real address (RA) then we will assume node 0.
On the T4-2 here with the information provided I can't see an alternative.

Technically the LDOM shown below should match the MBLOCK to the
favorable latency group. However other factors must be considered too. Were
the memory controllers configured "fine" grained interleave or "coarse"
grain interleaved -  T4. Also should a "group" MD node be considered a NUMA
node?

There has to be at least one Machine Description (MD) "group" and hence one
NUMA node. The group can have one or more latency groups (lg) - more than one
memory controller. The current code chooses the smallest latency as the most
favorable per group. The latency and lg information is in MLGROUP below.
MBLOCK is the base and size of the RAs for the machine as fetched from OBP
/memory "available" property. My machine has one MBLOCK but more would be
possible - with holes?

For a T4-2 the following information has been gathered:
with LDOM guest
MEMBLOCK configuration:
 memory size = 0x27f870000
 memory.cnt  = 0x3
 memory[0x0]    [0x00000020400000-0x0000029fc67fff], 0x27f868000 bytes
 memory[0x1]    [0x0000029fd8a000-0x0000029fd8bfff], 0x2000 bytes
 memory[0x2]    [0x0000029fd92000-0x0000029fd97fff], 0x6000 bytes
 reserved.cnt  = 0x2
 reserved[0x0]  [0x00000020800000-0x000000216c15c0], 0xec15c1 bytes
 reserved[0x1]  [0x00000024800000-0x0000002c180c1e], 0x7980c1f bytes
MBLOCK[0]: base[20000000] size[280000000] offset[0]
(note: "base" and "size" reported in "MBLOCK" encompass the "memory[X]" values)
(note: (RA + offset) & mask = val is the formula to detect a match for the
memory controller. should there be no match for find_node node, a return
value of -1 resulted for the node - BAD)

There is one group. It has these forward links
MLGROUP[1]: node[545] latency[1f7e8] match[200000000] mask[200000000]
MLGROUP[2]: node[54d] latency[2de60] match[0] mask[200000000]
NUMA NODE[0]: node[545] mask[200000000] val[200000000] (latency[1f7e8])
(note: "val" is the best lg's (smallest latency) "match")

no LDOM guest - bare metal
MEMBLOCK configuration:
 memory size = 0xfdf2d0000
 memory.cnt  = 0x3
 memory[0x0]    [0x00000020400000-0x00000fff6adfff], 0xfdf2ae000 bytes
 memory[0x1]    [0x00000fff6d2000-0x00000fff6e7fff], 0x16000 bytes
 memory[0x2]    [0x00000fff766000-0x00000fff771fff], 0xc000 bytes
 reserved.cnt  = 0x2
 reserved[0x0]  [0x00000020800000-0x00000021a04580], 0x1204581 bytes
 reserved[0x1]  [0x00000024800000-0x0000002c7d29fc], 0x7fd29fd bytes
MBLOCK[0]: base[20000000] size[fe0000000] offset[0]

there are two groups
group node[16d5]
MLGROUP[0]: node[1765] latency[1f7e8] match[0] mask[200000000]
MLGROUP[3]: node[177d] latency[2de60] match[200000000] mask[200000000]
NUMA NODE[0]: node[1765] mask[200000000] val[0] (latency[1f7e8])
group node[171d]
MLGROUP[2]: node[1775] latency[2de60] match[0] mask[200000000]
MLGROUP[1]: node[176d] latency[1f7e8] match[200000000] mask[200000000]
NUMA NODE[1]: node[176d] mask[200000000] val[200000000] (latency[1f7e8])
(note: for this two "group" bare metal machine, 1/2 memory is in group one's
lg and 1/2 memory is in group two's lg).

Cc: sparclinux@vger.kernel.org
Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-16 17:55:09 -07:00
bob picco 4ccb927289 sparc64: sun4v TLB error power off events
We've witnessed a few TLB events causing the machine to power off because
of prom_halt. In one case it was some nfs related area during rmmod. Another
was an mmapper of /dev/mem. A more recent one is an ITLB issue with
a bad pagesize which could be a hardware bug. Bugs happen but we should
attempt to not power off the machine and/or hang it when possible.

This is a DTLB error from an mmapper of /dev/mem:
[root@sparcie ~]# SUN4V-DTLB: Error at TPC[fffff80100903e6c], tl 1
SUN4V-DTLB: TPC<0xfffff80100903e6c>
SUN4V-DTLB: O7[fffff801081979d0]
SUN4V-DTLB: O7<0xfffff801081979d0>
SUN4V-DTLB: vaddr[fffff80100000000] ctx[1250] pte[98000000000f0610] error[2]
.

This is recent mainline for ITLB:
[ 3708.179864] SUN4V-ITLB: TPC<0xfffffc010071cefc>
[ 3708.188866] SUN4V-ITLB: O7[fffffc010071cee8]
[ 3708.197377] SUN4V-ITLB: O7<0xfffffc010071cee8>
[ 3708.206539] SUN4V-ITLB: vaddr[e0003] ctx[1a3c] pte[2900000dcc800eeb] error[4]
.

Normally sun4v_itlb_error_report() and sun4v_dtlb_error_report() would call
prom_halt() and drop us to OF command prompt "ok". This isn't the case for
LDOMs and the machine powers off.

For the HV reported error of HV_ENORADDR for HV HV_MMU_MAP_ADDR_TRAP we cause
a SIGBUS error by qualifying it within do_sparc64_fault() for fault code mask
of FAULT_CODE_BAD_RA. This is done when trap level (%tl) is less or equal
one("1"). Otherwise, for %tl > 1,  we proceed eventually to die_if_kernel().

The logic of this patch was partially inspired by David Miller's feedback.

Power off of large sparc64 machines is painful. Plus die_if_kernel provides
more context. A reset sequence isn't a brief period on large sparc64 but
better than power-off/power-on sequence.

Cc: sparclinux@vger.kernel.org
Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-16 17:46:44 -07:00
Christoph Lameter 494fc42170 sparc: Replace __get_cpu_var uses
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x).  This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset.  Thereby address calculations are avoided and less registers
are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e.  using a global
register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

	DEFINE_PER_CPU(int, y);
	int *x = &__get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

	DEFINE_PER_CPU(int, y[20]);
	int *x = __get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

	DEFINE_PER_CPU(int, y);
	int x = __get_cpu_var(y)

   Converts to

	int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

	DEFINE_PER_CPU(struct mystruct, y);
	struct mystruct x = __get_cpu_var(y);

   Converts to

	memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

	DEFINE_PER_CPU(int, y)
	__get_cpu_var(y) = x;

   Converts to

	__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y)++

   Converts to

	__this_cpu_inc(y)

Cc: sparclinux@vger.kernel.org
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:55 -04:00
David S. Miller 5b6ff9df05 sparc64: Fix up merge thinko.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 19:09:19 -07:00
David S. Miller e9011d0866 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Conflicts:
	arch/sparc/mm/init_64.c

Conflict was simple non-overlapping additions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05 18:57:18 -07:00
David S. Miller 4ca9a23765 sparc64: Guard against flushing openfirmware mappings.
Based almost entirely upon a patch by Christopher Alexander Tobias
Schulze.

In commit db64fe0225 ("mm: rewrite vmap
layer") lazy VMAP tlb flushing was added to the vmalloc layer.  This
causes problems on sparc64.

Sparc64 has two VMAP mapped regions and they are not contiguous with
eachother.  First we have the malloc mapping area, then another
unrelated region, then the vmalloc region.

This "another unrelated region" is where the firmware is mapped.

If the lazy TLB flushing logic in the vmalloc code triggers after
we've had both a module unload and a vfree or similar, it will pass an
address range that goes from somewhere inside the malloc region to
somewhere inside the vmalloc region, and thus covering the
openfirmware area entirely.

The sparc64 kernel learns about openfirmware's dynamic mappings in
this region early in the boot, and then services TLB misses in this
area.  But openfirmware has some locked TLB entries which are not
mentioned in those dynamic mappings and we should thus not disturb
them.

These huge lazy TLB flush ranges causes those openfirmware locked TLB
entries to be removed, resulting in all kinds of problems including
hard hangs and crashes during reboot/reset.

Besides causing problems like this, such huge TLB flush ranges are
also incredibly inefficient.  A plea has been made with the author of
the VMAP lazy TLB flushing code, but for now we'll put a safety guard
into our flush_tlb_kernel_range() implementation.

Since the implementation has become non-trivial, stop defining it as a
macro and instead make it a function in a C source file.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-04 20:16:00 -07:00
David S. Miller 18f3813252 sparc64: Do not insert non-valid PTEs into the TSB hash table.
The assumption was that update_mmu_cache() (and the equivalent for PMDs) would
only be called when the PTE being installed will be accessible by the user.

This is not true for code paths originating from remove_migration_pte().

There are dire consequences for placing a non-valid PTE into the TSB.  The TLB
miss frramework assumes thatwhen a TSB entry matches we can just load it into
the TLB and return from the TLB miss trap.

So if a non-valid PTE is in there, we will deadlock taking the TLB miss over
and over, never satisfying the miss.

Just exit early from update_mmu_cache() and friends in this situation.

Based upon a report and patch from Christopher Alexander Tobias Schulze.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-04 16:34:01 -07:00
bob picco f6d4fb5cc0 sparc64 - add mem to iomem resource
This patch adds sparc RAM to /proc/iomem. It also identifies the
code, data and bss regions of the kernel.

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-21 21:37:05 -07:00
Linus Torvalds c4222e4635 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next
Pull sparc fixes from David Miller:
 "Sparc sparse fixes from Sam Ravnborg"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next: (67 commits)
  sparc64: fix sparse warnings in int_64.c
  sparc64: fix sparse warning in ftrace.c
  sparc64: fix sparse warning in kprobes.c
  sparc64: fix sparse warning in kgdb_64.c
  sparc64: fix sparse warnings in compat_audit.c
  sparc64: fix sparse warnings in init_64.c
  sparc64: fix sparse warnings in aes_glue.c
  sparc: fix sparse warnings in smp_32.c + smp_64.c
  sparc64: fix sparse warnings in perf_event.c
  sparc64: fix sparse warnings in kprobes.c
  sparc64: fix sparse warning in tsb.c
  sparc64: clean up compat_sigset_t.seta handling
  sparc64: fix sparse "Should it be static?" warnings in signal32.c
  sparc64: fix sparse warnings in sys_sparc32.c
  sparc64: fix sparse warning in pci.c
  sparc64: fix sparse warnings in smp_64.c
  sparc64: fix sparse warning in prom_64.c
  sparc64: fix sparse warning in btext.c
  sparc64: fix sparse warnings in sys_sparc_64.c + unaligned_64.c
  sparc64: fix sparse warning in process_64.c
  ...

Conflicts:
	arch/sparc/include/asm/pgtable_64.h
2014-06-19 07:50:07 -10:00
Naoya Horiguchi c177c81e09 hugetlb: restrict hugepage_migration_support() to x86_64
Currently hugepage migration is available for all archs which support
pmd-level hugepage, but testing is done only for x86_64 and there're
bugs for other archs.  So to avoid breaking such archs, this patch
limits the availability strictly to x86_64 until developers of other
archs get interested in enabling this feature.

Simply disabling hugepage migration on non-x86_64 archs is not enough to
fix the reported problem where sys_move_pages() hits the BUG_ON() in
follow_page(FOLL_GET), so let's fix this by checking if hugepage
migration is supported in vma_migratable().

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>	[3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:53:51 -07:00
Sam Ravnborg 48d372164d sparc64: fix sparse warnings in int_64.c
Fix following warnings:
init_64.c:798:5: warning: symbol 'numa_cpu_lookup_table' was not declared. Should it be static?
init_64.c:799:11: warning: symbol 'numa_cpumask_lookup_table' was not declared. Should it be static?

The warnings were present with an allnoconfig
Fix so the variables are only declared if CONFIG_NEED_MULTIPLE_NODES is defined.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-18 19:01:35 -07:00
Sam Ravnborg 59dec13b27 sparc64: fix sparse warnings in init_64.c
Fix following warnings:
init_64.c:191:10: warning: symbol 'dcpage_flushes' was not declared. Should it be static?
init_64.c:193:10: warning: symbol 'dcpage_flushes_xcall' was not declared. Should it be static?

Add extern declaration to asm/setup.h and drop local declaration in smp_64.h

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-18 19:01:34 -07:00
Sam Ravnborg 8c7260c0d9 sparc64: fix sparse warning in tsb.c
Fix following warning:
tsb.c:290:5: warning: symbol 'sysctl_tsb_ratio' was not declared. Should it be static?

Add extern declaration in asm/setup.h and remove local declaration
in kernel/sysctl.c

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-18 19:01:32 -07:00
Sam Ravnborg 8df52620e6 sparc64: fix sparse warnings in sys_sparc_64.c + unaligned_64.c
Fix following warnings:
kernel/sys_sparc_64.c:643:17: warning: symbol 'sys_kern_features' was not declared. Should it be static?
kernel/unaligned_64.c:297:17: warning: symbol 'kernel_unaligned_trap' was not declared. Should it be static?
kernel/unaligned_64.c:387:5: warning: symbol 'handle_popc' was not declared. Should it be static?
kernel/unaligned_64.c:428:5: warning: symbol 'handle_ldf_stq' was not declared. Should it be static?
kernel/unaligned_64.c:553:6: warning: symbol 'handle_ld_nf' was not declared. Should it be static?
kernel/unaligned_64.c:579:6: warning: symbol 'handle_lddfmna' was not declared. Should it be static?
kernel/unaligned_64.c:643:6: warning: symbol 'handle_stdfmna' was not declared. Should it be static?

Functions that are only used in kernel/ - add prototypes in kernel.h
Functions used outside kernel/ - add prototype in asm/setup.h
Removed local prototypes

One of the local prototypes had wrong signature (return void - not int).

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-18 19:01:30 -07:00