Commit graph

50148 commits

Author SHA1 Message Date
Jan Kara
48f2301c07 hugetlbfs: use pagevec_lookup_range() in remove_inode_hugepages()
We want only pages from given range in remove_inode_hugepages().  Use
pagevec_lookup_range() instead of pagevec_lookup().

Link: http://lkml.kernel.org/r/20170726114704.7626-8-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
2b85a6171d ext4: use pagevec_lookup_range() in writeback code
Both occurences of pagevec_lookup() actually want only pages from a
given range.  Use pagevec_lookup_range() for the lookup.

Link: http://lkml.kernel.org/r/20170726114704.7626-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Jan Kara
dec0da7b60 ext4: use pagevec_lookup_range() in ext4_find_unwritten_pgoff()
Use pagevec_lookup_range() in ext4_find_unwritten_pgoff() since we are
interested only in pages in the given range.  Simplify the logic as a
result of not getting pages out of range and index getting automatically
advanced.

Link: http://lkml.kernel.org/r/20170726114704.7626-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jan Kara
c10f778ddf fs: fix performance regression in clean_bdev_aliases()
Commit e64855c6cf ("fs: Add helper to clean bdev aliases under a bh
and use it") added a wrapper for clean_bdev_aliases() that invalidates
bdev aliases underlying a single buffer head.

However this has caused a performance regression for bonnie++ benchmark
on ext4 filesystem when delayed allocation is turned off (ext3 mode) -
average of 3 runs:

  Hmean SeqOut Char  164787.55 (  0.00%) 107189.06 (-34.95%)
  Hmean SeqOut Block 219883.89 (  0.00%) 168870.32 (-23.20%)

The reason for this regression is that clean_bdev_aliases() is slower
when called for a single block because pagevec_lookup() it uses will end
up iterating through the radix tree until it finds a page (which may
take a while) but we are only interested whether there's a page at a
particular index.

Fix the problem by using pagevec_lookup_range() instead which avoids the
needless iteration.

Fixes: e64855c6cf ("fs: Add helper to clean bdev aliases under a bh and use it")
Link: http://lkml.kernel.org/r/20170726114704.7626-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jan Kara
d72dc8a25a mm: make pagevec_lookup() update index
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue.  Most callers want this
and also pagevec_lookup_tag() already does this.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jan Kara
26b433d0da fscache: remove unused ->now_uncached callback
Patch series "Ranged pagevec lookup", v2.

In this series I make pagevec_lookup() update the index (to be
consistent with pagevec_lookup_tag() and also as a preparation for
ranged lookups), provide ranged variant of pagevec_lookup() and use it
in places where it makes sense.  This not only removes some common code
but is also a measurable performance win for some use cases (see patch
4/10) where radix tree is sparse and searching & grabing of a page after
the end of the range has measurable overhead.

This patch (of 10):

The callback doesn't ever get called.  Remove it.

Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jun Piao
964f14a0d3 ocfs2: clean up some dead code
clean up some unused functions and parameters.

Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Jan Kara
01ffb56bc1 ocfs2: make ocfs2_set_acl() static
The function is never called outside of fs/ocfs2/acl.c.

Link: http://lkml.kernel.org/r/20170801141252.19675-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Nicolas Iooss
2f52074d35 dax: initialize variable pfn before using it
dax_pmd_insert_mapping() contains the following code:

        pfn_t pfn;
        if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
            goto fallback;
        /* ... */
    fallback:
      trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);

When the condition in the if statement fails, the function calls
trace_dax_pmd_insert_mapping_fallback() with an uninitialized pfn value.

This issue has been found while building the kernel with clang.  The
compiler reported:

    fs/dax.c:1280:6: error: variable 'pfn' is used uninitialized
    whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
        if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    fs/dax.c:1310:60: note: uninitialized use occurs here
      trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
                                                                     ^~~

Link: http://lkml.kernel.org/r/20170903083000.587-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
917f34526c dax: use PG_PMD_COLOUR instead of open coding
Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
equivalent page offset mask.

Link: http://lkml.kernel.org/r/20170822222436.18926-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Slusarz, Marcin" <marcin.slusarz@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
a2e050f5a9 dax: explain how read(2)/write(2) addresses are validated
Add a comment explaining how the user addresses provided to read(2) and
write(2) are validated in the DAX I/O path.

We call dax_copy_from_iter() or copy_to_iter() on these without calling
access_ok() first in the DAX code, and there was a concern that the user
might be able to read/write to arbitrary kernel addresses with this
path.

Link: http://lkml.kernel.org/r/20170816173615.10098-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
527b19d080 dax: move all DAX radix tree defs to fs/dax.c
Now that we no longer insert struct page pointers in DAX radix trees the
page cache code no longer needs to know anything about DAX exceptional
entries.  Move all the DAX exceptional entry definitions from dax.h to
fs/dax.c.

Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
d01ad197ac dax: remove DAX code from page_cache_tree_insert()
Now that we no longer insert struct page pointers in DAX radix trees we
can remove the special casing for DAX in page_cache_tree_insert().

This also allows us to make dax_wake_mapping_entry_waiter() local to
fs/dax.c, removing it from dax.h.

Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
91d25ba8a6 dax: use common 4k zero page for dax mmap reads
When servicing mmap() reads from file holes the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree.

This has three major drawbacks:

1) It consumes memory unnecessarily. For every 4k page that is read via
   a DAX mmap() over a hole, we allocate a new page cache page. This
   means that if you read 1GiB worth of pages, you end up using 1GiB of
   zeroed memory. This is easily visible by looking at the overall
   memory consumption of the system or by looking at /proc/[pid]/smaps:

	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
	Size:            1048576 kB
	Rss:             1048576 kB
	Pss:             1048576 kB
	Shared_Clean:          0 kB
	Shared_Dirty:          0 kB
	Private_Clean:   1048576 kB
	Private_Dirty:         0 kB
	Referenced:      1048576 kB
	Anonymous:             0 kB
	LazyFree:              0 kB
	AnonHugePages:         0 kB
	ShmemPmdMapped:        0 kB
	Shared_Hugetlb:        0 kB
	Private_Hugetlb:       0 kB
	Swap:                  0 kB
	SwapPss:               0 kB
	KernelPageSize:        4 kB
	MMUPageSize:           4 kB
	Locked:                0 kB

2) It is slower than using a common zero page because each page fault
   has more work to do. Instead of just inserting a common zero page we
   have to allocate a page cache page, zero it, and then insert it. Here
   are the average latencies of dax_load_hole() as measured by ftrace on
   a random test box:

    Old method, using zeroed page cache pages:	3.4 us
    New method, using the common 4k zero page:	0.8 us

   This was the average latency over 1 GiB of sequential reads done by
   this simple fio script:

     [global]
     size=1G
     filename=/root/dax/data
     fallocate=none
     [io]
     rw=read
     ioengine=mmap

3) The fact that we had to check for both DAX exceptional entries and
   for page cache pages in the radix tree made the DAX code more
   complex.

Solve these issues by following the lead of the DAX PMD code and using a
common 4k zero page instead.  As with the PMD code we will now insert a
DAX exceptional entry into the radix tree instead of a struct page
pointer which allows us to remove all the special casing in the DAX
code.

Note that we do still pretty aggressively check for regular pages in the
DAX radix tree, especially where we take action based on the bits set in
the page.  If we ever find a regular page in our radix tree now that
most likely means that someone besides DAX is inserting pages (which has
happened lots of times in the past), and we want to find that out early
and fail loudly.

This solution also removes the extra memory consumption.  Here is that
same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
code:

	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
	Size:            1048576 kB
	Rss:                   0 kB
	Pss:                   0 kB
	Shared_Clean:          0 kB
	Shared_Dirty:          0 kB
	Private_Clean:         0 kB
	Private_Dirty:         0 kB
	Referenced:            0 kB
	Anonymous:             0 kB
	LazyFree:              0 kB
	AnonHugePages:         0 kB
	ShmemPmdMapped:        0 kB
	Shared_Hugetlb:        0 kB
	Private_Hugetlb:       0 kB
	Swap:                  0 kB
	SwapPss:               0 kB
	KernelPageSize:        4 kB
	MMUPageSize:           4 kB
	Locked:                0 kB

Overall system memory consumption is similarly improved.

Another major change is that we remove dax_pfn_mkwrite() from our fault
flow, and instead rely on the page fault itself to make the PTE dirty
and writeable.  The following description from the patch adding the
vm_insert_mixed_mkwrite() call explains this a little more:

   "To be able to use the common 4k zero page in DAX we need to have our
    PTE fault path look more like our PMD fault path where a PTE entry
    can be marked as dirty and writeable as it is first inserted rather
    than waiting for a follow-up dax_pfn_mkwrite() =>
    finish_mkwrite_fault() call.

    Right now we can rely on having a dax_pfn_mkwrite() call because we
    can distinguish between these two cases in do_wp_page():

            case 1: 4k zero page => writable DAX storage
            case 2: read-only DAX storage => writeable DAX storage

    This distinction is made by via vm_normal_page(). vm_normal_page()
    returns false for the common 4k zero page, though, just as it does
    for DAX ptes. Instead of special casing the DAX + 4k zero page case
    we will simplify our DAX PTE page fault sequence so that it matches
    our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
    We will instead use dax_iomap_fault() to handle write-protection
    faults.

    This means that insert_pfn() needs to follow the lead of
    insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
    'mkwrite' is set insert_pfn() will do the work that was previously
    done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"

Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Ross Zwisler
e30331ff05 dax: relocate some dax functions
dax_load_hole() will soon need to call dax_insert_mapping_entry(), so it
needs to be moved lower in dax.c so the definition exists.

dax_wake_mapping_entry_waiter() will soon be removed from dax.h and be
made static to dax.c, so we need to move its definition above all its
callers.

Link: http://lkml.kernel.org/r/20170724170616.25810-3-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Linus Torvalds
bafb0762cb Char/Misc drivers for 4.14-rc1
Here is the big char/misc driver update for 4.14-rc1.
 
 Lots of different stuff in here, it's been an active development cycle
 for some reason.  Highlights are:
   - updated binder driver, this brings binder up to date with what
     shipped in the Android O release, plus some more changes that
     happened since then that are in the Android development trees.
   - coresight updates and fixes
   - mux driver file renames to be a bit "nicer"
   - intel_th driver updates
   - normal set of hyper-v updates and changes
   - small fpga subsystem and driver updates
   - lots of const code changes all over the driver trees
   - extcon driver updates
   - fmc driver subsystem upadates
   - w1 subsystem minor reworks and new features and drivers added
   - spmi driver updates
 
 Plus a smattering of other minor driver updates and fixes.
 
 All of these have been in linux-next with no reported issues for a
 while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWa1+Ew8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yl26wCgquufNylfhxr65NbJrovduJYzRnUAniCivXg8
 bePIh/JI5WxWoHK+wEbY
 =hYWx
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big char/misc driver update for 4.14-rc1.

  Lots of different stuff in here, it's been an active development cycle
  for some reason. Highlights are:

   - updated binder driver, this brings binder up to date with what
     shipped in the Android O release, plus some more changes that
     happened since then that are in the Android development trees.

   - coresight updates and fixes

   - mux driver file renames to be a bit "nicer"

   - intel_th driver updates

   - normal set of hyper-v updates and changes

   - small fpga subsystem and driver updates

   - lots of const code changes all over the driver trees

   - extcon driver updates

   - fmc driver subsystem upadates

   - w1 subsystem minor reworks and new features and drivers added

   - spmi driver updates

  Plus a smattering of other minor driver updates and fixes.

  All of these have been in linux-next with no reported issues for a
  while"

* tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (244 commits)
  ANDROID: binder: don't queue async transactions to thread.
  ANDROID: binder: don't enqueue death notifications to thread todo.
  ANDROID: binder: Don't BUG_ON(!spin_is_locked()).
  ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctl
  ANDROID: binder: push new transactions to waiting threads.
  ANDROID: binder: remove proc waitqueue
  android: binder: Add page usage in binder stats
  android: binder: fixup crash introduced by moving buffer hdr
  drivers: w1: add hwmon temp support for w1_therm
  drivers: w1: refactor w1_slave_show to make the temp reading functionality separate
  drivers: w1: add hwmon support structures
  eeprom: idt_89hpesx: Support both ACPI and OF probing
  mcb: Fix an error handling path in 'chameleon_parse_cells()'
  MCB: add support for SC31 to mcb-lpc
  mux: make device_type const
  char: virtio: constify attribute_group structures.
  Documentation/ABI: document the nvmem sysfs files
  lkdtm: fix spelling mistake: "incremeted" -> "incremented"
  perf: cs-etm: Fix ETMv4 CONFIGR entry in perf.data file
  nvmem: include linux/err.h from header
  ...
2017-09-05 11:08:17 -07:00
Linus Torvalds
44b1671fae Driver core update for 4.14-rc1
Here is the "big" driver core update for 4.14-rc1.
 
 It's really not all that big, the largest thing here being some firmware
 tests to help ensure that that crazy api is working properly.
 
 There's also a new uevent for when a driver is bound or unbound from a
 device, fixing a hole in the driver model that's been there since the
 very beginning.  Many thanks to Dmitry for being persistent and pointing
 out how wrong I was about this all along :)
 
 Patches for the new uevents are already in the systemd tree, if people
 want to play around with them.
 
 Otherwise just a number of other small api changes and updates here,
 nothing major.  All of these patches have been in linux-next for a
 while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWa1/IQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yn8jACfdQg+YXGxTExonxnyiWgoDMMSO2gAn1ETOaak
 itLO5ll4b6EQ0r3pU27d
 =pCYl
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core update from Greg KH:
 "Here is the "big" driver core update for 4.14-rc1.

  It's really not all that big, the largest thing here being some
  firmware tests to help ensure that that crazy api is working properly.

  There's also a new uevent for when a driver is bound or unbound from a
  device, fixing a hole in the driver model that's been there since the
  very beginning. Many thanks to Dmitry for being persistent and
  pointing out how wrong I was about this all along :)

  Patches for the new uevents are already in the systemd tree, if people
  want to play around with them.

  Otherwise just a number of other small api changes and updates here,
  nothing major. All of these patches have been in linux-next for a
  while with no reported issues"

* tag 'driver-core-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (28 commits)
  driver core: bus: Fix a potential double free
  Do not disable driver and bus shutdown hook when class shutdown hook is set.
  base: topology: constify attribute_group structures.
  base: Convert to using %pOF instead of full_name
  kernfs: Clarify lockdep name for kn->count
  fbdev: uvesafb: remove DRIVER_ATTR() usage
  xen: xen-pciback: remove DRIVER_ATTR() usage
  driver core: Document struct device:dma_ops
  mod_devicetable: Remove excess description from structured comment
  test_firmware: add batched firmware tests
  firmware: enable a debug print for batched requests
  firmware: define pr_fmt
  firmware: send -EINTR on signal abort on fallback mechanism
  test_firmware: add test case for SIGCHLD on sync fallback
  initcall_debug: add deferred probe times
  Input: axp20x-pek - switch to using devm_device_add_group()
  Input: synaptics_rmi4 - use devm_device_add_group() for attributes in F01
  Input: gpio_keys - use devm_device_add_group() for attributes
  driver core: add devm_device_add_group() and friends
  driver core: add device_{add|remove}_group() helpers
  ...
2017-09-05 10:41:21 -07:00
Linus Torvalds
5f82e71a00 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:

 - Add 'cross-release' support to lockdep, which allows APIs like
   completions, where it's not the 'owner' who releases the lock, to be
   tracked. It's all activated automatically under
   CONFIG_PROVE_LOCKING=y.

 - Clean up (restructure) the x86 atomics op implementation to be more
   readable, in preparation of KASAN annotations. (Dmitry Vyukov)

 - Fix static keys (Paolo Bonzini)

 - Add killable versions of down_read() et al (Kirill Tkhai)

 - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)

 - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)

 - Remove smp_mb__before_spinlock() and convert its usages, introduce
   smp_mb__after_spinlock() (Peter Zijlstra)

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
  locking/lockdep/selftests: Fix mixed read-write ABBA tests
  sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
  acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
  locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
  smp: Avoid using two cache lines for struct call_single_data
  locking/lockdep: Untangle xhlock history save/restore from task independence
  locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
  futex: Remove duplicated code and fix undefined behaviour
  Documentation/locking/atomic: Finish the document...
  locking/lockdep: Fix workqueue crossrelease annotation
  workqueue/lockdep: 'Fix' flush_work() annotation
  locking/lockdep/selftests: Add mixed read-write ABBA tests
  mm, locking/barriers: Clarify tlb_flush_pending() barriers
  locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
  locking/lockdep: Explicitly initialize wq_barrier::done::map
  locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
  locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
  locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
  locking/refcounts, x86/asm: Implement fast refcount overflow protection
  locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
  ...
2017-09-04 11:52:29 -07:00
Linus Torvalds
f213a6c84c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - fix affine wakeups (Peter Zijlstra)

   - improve CPU onlining (and general bootup) scalability on systems
     with ridiculous number (thousands) of CPUs (Peter Zijlstra)

   - sched/numa updates (Rik van Riel)

   - sched/deadline updates (Byungchul Park)

   - sched/cpufreq enhancements and related cleanups (Viresh Kumar)

   - sched/debug enhancements (Xie XiuQi)

   - various fixes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  sched/debug: Optimize sched_domain sysctl generation
  sched/topology: Avoid pointless rebuild
  sched/topology, cpuset: Avoid spurious/wrong domain rebuilds
  sched/topology: Improve comments
  sched/topology: Fix memory leak in __sdt_alloc()
  sched/completion: Document that reinit_completion() must be called after complete_all()
  sched/autogroup: Fix error reporting printk text in autogroup_create()
  sched/fair: Fix wake_affine() for !NUMA_BALANCING
  sched/debug: Intruduce task_state_to_char() helper function
  sched/debug: Show task state in /proc/sched_debug
  sched/debug: Use task_pid_nr_ns in /proc/$pid/sched
  sched/core: Remove unnecessary initialization init_idle_bootup_task()
  sched/deadline: Change return value of cpudl_find()
  sched/deadline: Make find_later_rq() choose a closer CPU in topology
  sched/numa: Scale scan period with tasks in group and shared/private
  sched/numa: Slow down scan rate if shared faults dominate
  sched/pelt: Fix false running accounting
  sched: Mark pick_next_task_dl() and build_sched_domain() as static
  sched/cpupri: Don't re-initialize 'struct cpupri'
  sched/deadline: Don't re-initialize 'struct cpudl'
  ...
2017-09-04 09:10:24 -07:00
Ingo Molnar
edc2988c54 Merge branch 'linus' into locking/core, to fix up conflicts
Conflicts:
	mm/page_alloc.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-09-04 11:01:18 +02:00
Linus Torvalds
69c0067aa3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc fixes from Al Viro:
 "Loose ends and regressions from the last merge window.

  Strictly speaking, only binfmt_flat thing is a build regression per
  se - the rest is 'only sparse cares about that' stuff"

[ This came in before the 4.13 release and could have gone there, but it
  was late in the release and nothing seemed critical enough to care, so
  I'm pulling it in the 4.14 merge window instead  - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  binfmt_flat: fix arch/m32r and arch/microblaze flat_put_addr_at_rp()
  compat_hdio_ioctl: Fix a declaration
  <linux/uaccess.h>: Fix copy_in_user() declaration
  annotate RWF_... flags
  teach SYSCALL_DEFINE/COMPAT_SYSCALL_DEFINE to handle __bitwise arguments
2017-09-03 16:09:03 -07:00
Linus Torvalds
d0d6ab53c9 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs version warning fix from Steve French:
 "As requested, additional kernel warning messages to clarify the
  default dialect changes"

[ There is still some discussion about exactly which version should be
  the new default.  Longer-term we have auto-negotiation coming, but
  that's not there yet..  - Linus ]

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  Fix warning messages when mounting to older servers
2017-09-01 20:57:27 -07:00
Oleg Nesterov
138e4ad67a epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove()
The race was introduced by me in commit 971316f050 ("epoll:
ep_unregister_pollwait() can use the freed pwq->whead").  I did not
realize that nothing can protect eventpoll after ep_poll_callback() sets
->whead = NULL, only whead->lock can save us from the race with
ep_free() or ep_remove().

Move ->whead = NULL to the end of ep_poll_callback() and add the
necessary barriers.

TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
before this patch.

Hopefully this explains use-after-free reported by syzcaller:

	BUG: KASAN: use-after-free in debug_spin_lock_before
	...
	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148

this is spin_lock(eventpoll->lock),

	...
	Freed by task 17774:
	...
	 kfree+0xe8/0x2c0 mm/slub.c:3883
	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865

Fixes: 971316f050 ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
Reported-by: 范龙飞 <long7573@126.com>
Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-01 13:07:35 -07:00
Linus Torvalds
b8a78bb4d1 ceph fscache page locking fix from Zheng, marked for stable.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZqYGlAAoJEEp/3jgCEfOLx58H/1jnP79H03/kchVBCGPLCKjs
 E+pgHpb2922EGeYmEUoxfq627SiCODap/jo6JFVpsd+JnmHLZiMzmEzGpDce6fn9
 /YY5u3WNtmnKtyPvl0kzspK0ujaeCuiRyarULXBiHveL2ZQINKus4F9MiZphNnt4
 X4hgo866+esEf6LocuEkMoEGvgN7vk/Q9nDPgD/YoFrhCuwdvLJpBnE65CGbQyk5
 n3g0qlBR+yorDr1stdlSyVUDPkF5FQjhQTqkpi1oPAhsNPKgVPyZzRIEQEA+nI+N
 wTsQ0SMKfST4PNaRNdUuO1xwszYziqqlLZ2KwaaLIDHlElcbQR1S3GUKz6hddJc=
 =oesm
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "ceph fscache page locking fix from Zheng, marked for stable"

* tag 'ceph-for-4.13-rc8' of git://github.com/ceph/ceph-client:
  ceph: fix readpage from fscache
2017-09-01 12:46:30 -07:00
Steve French
7e682f766f Fix warning messages when mounting to older servers
When mounting to older servers, such as Windows XP (or even Windows 7),
the limited error messages that can be passed back to user space can
get confusing since the default dialect has changed from SMB1 (CIFS) to
more secure SMB3 dialect. Log additional information when the user chooses
to use the default dialects and when the server does not support the
dialect requested.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-09-01 00:18:44 -05:00
Linus Torvalds
e89ce1f89f two cifs bug fixes for stable
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQGcBAABAgAGBQJZpxWIAAoJEIosvXAHck9Rl1AL/2wNdEvDvDYZ48IeJN/R0Agb
 nhsXycDshjjUSMghm5IMXvaOitwQL3bcbVeGBDOF4UYoDgJZSuOeOvzYhAJaykiy
 Cs2rwve9Vuw9pliY1D4705g2jduvca6KP4Rzcmf8zeRqG8siQLDlFI8MqYcWiz6d
 F8O9EH+n+Daqf4L5b9IB/3aXZiQLDXKC5HjwJJlPjF6fhkqMbwFLSgTYPmCoM2M2
 oC4jDIPIvM7+p5K8Hs3UvkUaeUoMV1Hx/bba3gGFLK0lnGiD2davZPNBLR46nt6j
 L6l0ujg1pv0kOnTsyfiZp/m43UF5IGeLacBj5DczDrio/8FYKcLCoTIXpWB/dMn/
 NR9kGw/Y9M7jv+qumwrFzMaqv4ztMc2QssFsm37gHEtkEIBPiPTYixRIG/OkVRo7
 u8sZic1DvFot0YalZqpIX6Ss6egiY5CI4q7AWU8XwpkiW+K/RBN8g31Cmcx6scU2
 6/ywpNm2tP5D7AfH13jQh8CQOstnrFH50fUCyXpASA==
 =VgIs
 -----END PGP SIGNATURE-----

Merge tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Two cifs bug fixes for stable"

* tag 'cifs-fixes-for-4.13-rc7-and-stable' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: remove endian related sparse warning
  CIFS: Fix maximum SMB2 header size
2017-08-31 18:45:04 -07:00
Linus Torvalds
ea25c43179 Merge branch 'mmu_notifier_fixes'
Merge mmu_notifier fixes from Jérôme Glisse:
 "The invalidate_page callback suffered from 2 pitfalls. First it used
  to happen after page table lock was release and thus a new page might
  have been setup for the virtual address before the call to
  invalidate_page().

  This is in a weird way fixed by commit c7ab0d2fdc ("mm: convert
  try_to_unmap_one() to use page_vma_mapped_walk()") which moved the
  callback under the page table lock. Which also broke several existing
  user of the mmu_notifier API that assumed they could sleep inside this
  callback.

  The second pitfall was invalidate_page being the only callback not
  taking a range of address in respect to invalidation but was giving an
  address and a page. Lot of the callback implementer assumed this could
  never be THP and thus failed to invalidate the appropriate range for
  THP pages.

  By killing this callback we unify the mmu_notifier callback API to
  always take a virtual address range as input.

  There is now two clear API (I am not mentioning the youngess API which
  is seldomly used):

   - invalidate_range_start()/end() callback (which allow you to sleep)

   - invalidate_range() where you can not sleep but happen right after
     page table update under page table lock

  Note that a lot of existing user feels broken in respect to
  range_start/ range_end. Many user only have range_start() callback but
  there is nothing preventing them to undo what was invalidated in their
  range_start() callback after it returns but before any CPU page table
  update take place.

  The code pattern use in kvm or umem odp is an example on how to
  properly avoid such race. In a nutshell use some kind of sequence
  number and active range invalidation counter to block anything that
  might undo what the range_start() callback did.

  If you do not care about keeping fully in sync with CPU page table (ie
  you can live with CPU page table pointing to new different page for a
  given virtual address) then you can take a reference on the pages
  inside the range_start callback and drop it in range_end or when your
  driver is done with those pages.

  Last alternative is to use invalidate_range() if you can do
  invalidation without sleeping as invalidate_range() callback happens
  under the CPU page table spinlock right after the page table is
  updated.

  The first two patches convert existing mmu_notifier_invalidate_page()
  calls to mmu_notifier_invalidate_range() and bracket those call with
  call to mmu_notifier_invalidate_range_start()/end().

  The next ten patches remove existing invalidate_page() callback as it
  can no longer happen.

  Finally the last page remove the invalidate_page() callback completely
  so it can RIP.

  Changes since v1:
   - remove more dead code in kvm (no testing impact)
   - more accurate end address computation (patch 2) in page_mkclean_one
     and try_to_unmap_one
   - added tested-by/reviewed-by gotten so far"

* emailed patches from Jérôme Glisse <jglisse@redhat.com>:
  mm/mmu_notifier: kill invalidate_page
  KVM: update to new mmu_notifier semantic v2
  xen/gntdev: update to new mmu_notifier semantic
  sgi-gru: update to new mmu_notifier semantic
  misc/mic/scif: update to new mmu_notifier semantic
  iommu/intel: update to new mmu_notifier semantic
  iommu/amd: update to new mmu_notifier semantic
  IB/hfi1: update to new mmu_notifier semantic
  IB/umem: update to new mmu_notifier semantic
  drm/amdgpu: update to new mmu_notifier semantic
  powerpc/powernv: update to new mmu_notifier semantic
  mm/rmap: update to new mmu_notifier semantic v2
  dax: update to new mmu_notifier semantic
2017-08-31 17:30:01 -07:00
Dave Kleikamp
c227390c91 jfs should use MAX_LFS_FILESIZE when calculating s_maxbytes
jfs had previously avoided the use of MAX_LFS_FILESIZE because it hadn't
accounted for the whole 32-bit index range on 32-bit systems.  That has
been fixed by commit 0cc3b0ec23 ("Clarify (and fix) MAX_LFS_FILESIZE
macros"), so we can simplify the code now.

Suggested by Andreas Dilger.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: jfs-discussion@lists.sourceforge.net
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 17:02:21 -07:00
Jérôme Glisse
a4d1a88525 dax: update to new mmu_notifier semantic
Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
and make sure it is bracketed by calls to *_invalidate_range_start()/end().

Note that because we can not presume the pmd value or pte value we have
to assume the worst and unconditionaly report an invalidation as
happening.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-31 16:12:59 -07:00
Yan, Zheng
dd2bc47348 ceph: fix readpage from fscache
ceph_readpage() unlocks page prematurely prematurely in the case
that page is reading from fscache. Caller of readpage expects that
page is uptodate when it get unlocked. So page shoule get locked
by completion callback of fscache_read_or_alloc_pages()

Cc: stable@vger.kernel.org # 4.1+, needs backporting for < 4.7
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-09-01 00:04:26 +02:00
Christoph Hellwig
ddef7ed2b5 annotate RWF_... flags
[AV: added missing annotations in syscalls.h/compat.h]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-31 17:32:38 -04:00
Steve French
6e3c1529c3 CIFS: remove endian related sparse warning
Recent patch had an endian warning ie
cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-08-30 14:43:11 -05:00
Pavel Shilovsky
9e37b1784f CIFS: Fix maximum SMB2 header size
Currently the maximum size of SMB2/3 header is set incorrectly which
leads to hanging of directory listing operations on encrypted SMB3
connections. Fix this by setting the maximum size to 170 bytes that
is calculated as RFC1002 length field size (4) + transform header
size (52) + SMB2 header size (64) + create response size (56).

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
2017-08-30 14:42:30 -05:00
Helge Deller
79de3cbe9a fs/select: Fix memory corruption in compat_get_fd_set()
Commit 464d62421c ("select: switch compat_{get,put}_fd_set() to
compat_{get,put}_bitmap()") changed the calculation on how many bytes
need to be zeroed when userspace handed over a NULL pointer for a fdset
array in the select syscall.

The calculation was changed in compat_get_fd_set() wrongly from
	memset(fdset, 0, ((nr + 1) & ~1)*sizeof(compat_ulong_t));
to
	memset(fdset, 0, ALIGN(nr, BITS_PER_LONG));

The ALIGN(nr, BITS_PER_LONG) calculates the number of _bits_ which need
to be zeroed in the target fdset array (rounded up to the next full bits
for an unsigned long).

But the memset() call expects the number of _bytes_ to be zeroed.

This leads to clearing more memory than wanted (on the stack area or
even at kmalloc()ed memory areas) and to random kernel crashes as we
have seen them on the parisc platform.

The correct change should have been

	memset(fdset, 0, (ALIGN(nr, BITS_PER_LONG) / BITS_PER_LONG) * BYTES_PER_LONG);

which is the same as can be archieved with a call to

	zero_fd_set(nr, fdset).

Fixes: 464d62421c ("select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()"
Acked-by:: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-28 16:09:19 -07:00
Waiman Long
39bf04db6b kernfs: Clarify lockdep name for kn->count
The reference count in kernfs_node structure is treated like a rwsem by
using lockdep instrumentation code. The lockdep name, however, is still
"s_active" which is carried over from the old sysfs code. As s_active
is no longer the variable name, its use may confuse users on where the
lock is when it is reported by lockdep. So it is changed to "kn->count"
which is how this variable is normally referenced in kernfs code.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-28 16:50:15 +02:00
Greg Kroah-Hartman
9749c37275 Merge 4.13-rc7 into char-misc-next
We want the binder fix in here as well for testing and merge issues.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-28 10:19:01 +02:00
Linus Torvalds
b3242dba9f Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "6 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm/memblock.c: reversed logic in memblock_discard()
  fork: fix incorrect fput of ->exe_file causing use-after-free
  mm/madvise.c: fix freeing of locked page with MADV_FREE
  dax: fix deadlock due to misaligned PMD faults
  mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled
  PM/hibernate: touch NMI watchdog when creating snapshot
2017-08-25 18:02:27 -07:00
Linus Torvalds
105065c3f7 Two nfsd bugfixes, neither 4.13 regressions, but both potentially
serious.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZoFxzAAoJECebzXlCjuG+T+4QAJhvEAPfoqxAJcjpy5Wgal96
 1QmHR1owRyA85MMVHhnVUClzzezECc8uXOxRvRFx+4pCW4PRwY3CRa6H0Acrte0l
 npxWi6CiOkuLTCA+NNVnJAty7zBp2Ag0hYJc2NFwhZJ1cVOcIab6Pc7U6jyoB7Nh
 d10rmB7eYsevZgKaCwxxlieFIkIDrPhIJzku5Zy7PXneITzDKX8kEaIs+JkuJ3xt
 H2w3ERpeeDVDlRd6ffo2OwXKaQkCmMNb64c2YA6yZptOHikuR5ARuvZxbOGveHrM
 uCrxAFgETBIusmBC45W9MmTw4c3GgDcW8/yx09pLWD7UDwsbOLMspXl9usX5sgaq
 Py3HpyPpZjovmfJUCI4UW/RWyo4El5T3IlknHjjg5AfnA3fe15xZVKcmKetVe4k9
 QxWKenwv+0hnOztF5Xotiysw+08aF6rIe3QQ/n6ZMathZAqvaaKsHa5TICL78anO
 F1WqwEKx7c7wg1ZnvV2uAeVsGobHi6Y5LAsyKx3dZMfZmVjqZe4wxGSD5eFAore5
 t4QWDWnLY0t/iPrYpLB1vINXvgD1T6b3rvnMiwm2B+ITMNzNOgLK0vYsNjzsk0uL
 gIOGma2LN7HwtKlsZHZewsR2rsIPcQ4D9FfPZBo1+jSYLzL4ktHWTalFCngwylhe
 y7iV/D+jvrHzrMr9T6rl
 =L3ES
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Two nfsd bugfixes, neither 4.13 regressions, but both potentially
  serious"

* tag 'nfsd-4.13-2' of git://linux-nfs.org/~bfields/linux:
  net: sunrpc: svcsock: fix NULL-pointer exception
  nfsd: Limit end of page list when decoding NFSv4 WRITE
2017-08-25 17:27:26 -07:00
Linus Torvalds
8c7932a32e some bug fixes for stable for cifs
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQGcBAABAgAGBQJZn2HQAAoJEIosvXAHck9RXMAL/iVeR4DjmXLwGQtOIQUzj0pv
 0JRubkh8/ud5VvfznjDvy0bBl/jodCK6N2wU7iqBhJUYW5Tc/TLaRt6MZ2KT4pLo
 PrD64hdjEtxkU5si+LOVLU11KndEIIQUV5+Mh9Zqj51DTHsyXJHPi/98HjNJm5Gq
 pXfUk+4eq229Pqq1JuPtfPaNHH/fZCODLf82vDQZedlaZhzHgXtDg6iQM0SalNhg
 iQSAWvmFr5lHlMs5/QMkhurvSaS38GXd+npWUGlJmFymlQbpqzpPGdYMgjnzLxDC
 Jw/Uowzo136CWSkSQV2DudKveNfIrVDYGgb97NgtZxsXYlBuJu4rCJvpLOsm6zap
 ZRnSReRvEIr6/TvMJ2wnRioz0JkbpPz8gMg7EUzfaexZtuAHXx6bguf2RjrnLJiH
 jhV+U+1uwTOgJejbvju/KVV6AP9kECyE5tZjuDF8FenfWkboqAYNaxxWVAfZreF5
 wMF0FeJWoGUxwYgRvd8neG1VWB5LQO8rNaQmYNBi7w==
 =MlGX
 -----END PGP SIGNATURE-----

Merge tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Some bug fixes for stable for cifs"

* tag 'cifs-fixes-for-4.13-rc6-and-stable' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()
  cifs: Fix df output for users with quota limits
2017-08-25 17:22:33 -07:00
Ross Zwisler
fffa281b48 dax: fix deadlock due to misaligned PMD faults
In DAX there are two separate places where the 2MiB range of a PMD is
defined.

The first is in the page tables, where a PMD mapping inserted for a
given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary below the
address to the 2MiB boundary above the address.

So, for example, a fault at address 3MiB (0x30 0000) falls within the
PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).

The second PMD range is in the mapping->page_tree, where a given file
offset is covered by a radix tree entry that spans from one 2MiB aligned
file offset to another 2MiB aligned file offset.

So, for example, the file offset for 3MiB (pgoff 768) falls within the
PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
512) to 4MiB (pgoff 1024).

This system works so long as the addresses and file offsets for a given
mapping both have the same offsets relative to the start of each PMD.

Consider the case where the starting address for a given file isn't 2MiB
aligned - say our faulting address is 3 MiB (0x30 0000), but that
corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
the mapping are misaligned so that the 2MiB range defined in the page
tables never matches up with the 2MiB range defined in the radix tree.

The current code notices this case for DAX faults to storage with the
following test in dax_pmd_insert_mapping():

	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
		goto unlock_fallback;

This test makes sure that the pfn we get from the driver is 2MiB
aligned, and relies on the assumption that the 2MiB alignment of the pfn
we get back from the driver matches the 2MiB alignment of the faulting
address.

However, faults to holes were not checked and we could hit the problem
described above.

This was reported in response to the NVML nvml/src/test/pmempool_sync
TEST5:

	$ cd nvml/src/test/pmempool_sync
	$ make TEST5

You can grab NVML here:

	https://github.com/pmem/nvml/

The dmesg warning you see when you hit this error is:

  WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310

Where we notice in dax_insert_mapping_entry() that the radix tree entry
we are about to replace doesn't match the locked entry that we had
previously inserted into the tree.  This happens because the initial
insertion was done in grab_mapping_entry() using a pgoff calculated from
the faulting address (vmf->address), and the replacement in
dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
vmf->pgoff.

In our failure case those two page offsets (one calculated from
vmf->address, one using vmf->pgoff) point to different order 9 radix
tree entries.

This failure case can result in a deadlock because the radix tree unlock
also happens on the pgoff calculated from vmf->address.  This means that
the locked radix tree entry that we swapped in to the tree in
dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
future faults to that 2MiB range will block forever.

Fix this by validating that the faulting address's PMD offset matches
the PMD offset from the start of the file.  This check is done at the
very beginning of the fault and covers faults that would have mapped to
storage as well as faults to holes.  I left the COLOUR check in
dax_pmd_insert_mapping() in place in case we ever hit the insanity
condition where the alignment of the pfn we get from the driver doesn't
match the alignment of the userspace address.

Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-25 16:12:46 -07:00
Ingo Molnar
3a9ff4fd04 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:07:13 +02:00
Ingo Molnar
10c9850cb2 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-25 11:04:51 +02:00
Chuck Lever
fc788f64f1 nfsd: Limit end of page list when decoding NFSv4 WRITE
When processing an NFSv4 WRITE operation, argp->end should never
point past the end of the data in the final page of the page list.
Otherwise, nfsd4_decode_compound can walk into uninitialized memory.

More critical, nfsd4_decode_write is failing to increment argp->pagelen
when it increments argp->pagelist.  This can cause later xdr decoders
to assume more data is available than really is, which can cause server
crashes on malformed requests.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24 18:05:30 -04:00
Linus Torvalds
b71a5e3fe8 Merge branch 'for-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
 "We have one more fixup that stems from the blk_status_t conversion
  that did not quite cover everything.

  The normal cases were not affected because the code is 0, but any
  error and retries could mix up new and old values"

* 'for-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix blk_status_t/errno confusion
2017-08-24 14:10:31 -07:00
Eric W. Biederman
311fc65c9f pty: Repair TIOCGPTPEER
The implementation of TIOCGPTPEER has two issues.

When /dev/ptmx (as opposed to /dev/pts/ptmx) is opened the wrong
vfsmount is passed to dentry_open.  Which results in the kernel displaying
the wrong pathname for the peer.

The second is simply by caching the vfsmount and dentry of the peer it leaves
them open, in a way they were not previously Which because of the inreased
reference counts can cause unnecessary behaviour differences resulting in
regressions.

To fix these move the ioctl into tty_io.c at a generic level allowing
the ioctl to have access to the struct file on which the ioctl is
being called.  This allows the path of the slave to be derived when
opening the slave through TIOCGPTPEER instead of requiring the path to
the slave be cached.  Thus removing the need for caching the path.

A new function devpts_ptmx_path is factored out of devpts_acquire and
used to implement a function devpts_mntget.   The new function devpts_mntget
takes a filp to perform the lookup on and fsi so that it can confirm
that the superblock that is found by devpts_ptmx_path is the proper superblock.

v2: Lots of fixes to make the code actually work
v3: Suggestions by Linus
    - Removed the unnecessary initialization of filp in ptm_open_peer
    - Simplified devpts_ptmx_path as gotos are no longer required

[ This is the fix for the issue that was reverted in commit
  143c97cc65, but this time without breaking 'pbuilder' due to
  increased reference counts   - Linus ]

Fixes: 54ebbfb160 ("tty: add TIOCGPTPEER ioctl")
Reported-by: Christian Brauner <christian.brauner@canonical.com>
Reported-and-tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-24 13:23:03 -07:00
Omar Sandoval
58efbc9f54 Btrfs: fix blk_status_t/errno confusion
This fixes several instances of blk_status_t and bare errno ints being
mixed up, some of which are real bugs.

In the normal case, 0 matches BLK_STS_OK, so we don't observe any
effects of the missing conversion, but in case of errors or passes
through the repair/retry paths, the errors get mixed up.

The changes were identified using 'sparse', we don't have reports of the
buggy behaviour.

Fixes: 4e4cbee93d ("block: switch bios to blk_status_t")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-24 17:19:02 +02:00
Linus Torvalds
143c97cc65 Revert "pty: fix the cached path of the pty slave file descriptor in the master"
This reverts commit c8c03f1858.

It turns out that while fixing the ptmx file descriptor to have the
correct 'struct path' to the associated slave pty is a really good
thing, it breaks some user space tools for a very annoying reason.

The problem is that /dev/ptmx and its associated slave pty (/dev/pts/X)
are on different mounts.  That was what caused us to have the wrong path
in the first place (we would mix up the vfsmount of the 'ptmx' node,
with the dentry of the pty slave node), but it also means that now while
we use the right vfsmount, having the pty master open also keeps the pts
mount busy.

And it turn sout that that makes 'pbuilder' very unhappy, as noted by
Stefan Lippers-Hollmann:

 "This patch introduces a regression for me when using pbuilder
  0.228.7[2] (a helper to build Debian packages in a chroot and to
  create and update its chroots) when trying to umount /dev/ptmx (inside
  the chroot) on Debian/ unstable (full log and pbuilder configuration
  file[3] attached).

  [...]
  Setting up build-essential (12.3) ...
  Processing triggers for libc-bin (2.24-15) ...
  I: unmounting dev/ptmx filesystem
  W: Could not unmount dev/ptmx: umount: /var/cache/pbuilder/build/1340/dev/ptmx: target is busy
          (In some cases useful info about processes that
           use the device is found by lsof(8) or fuser(1).)"

apparently pbuilder tries to unmount the /dev/pts filesystem while still
holding at least one master node open, which is arguably not very nice,
but we don't break user space even when fixing other bugs.

So this commit has to be reverted.

I'll try to figure out a way to avoid caching the path to the slave pty
in the master pty.  The only thing that actually wants that slave pty
path is the "TIOCGPTPEER" ioctl, and I think we could just recreate the
path at that time.

Reported-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Cc: Eric W Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian.brauner@canonical.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-23 18:16:11 -07:00
Ronnie Sahlberg
d3edede29f cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup()
Add checking for the path component length and verify it is <= the maximum
that the server advertizes via FileFsAttributeInformation.

With this patch cifs.ko will now return ENAMETOOLONG instead of ENOENT
when users to access an overlong path.

To test this, try to cd into a (non-existing) directory on a CIFS share
that has a too long name:
cd /mnt/aaaaaaaaaaaaaaa...

and it now should show a good error message from the shell:
bash: cd: /mnt/aaaaaaaaaaaaaaaa...aaaaaa: File name too long

rh bz 1153996

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org>
2017-08-23 13:34:52 -05:00
Sachin Prabhu
42bec214d8 cifs: Fix df output for users with quota limits
The df for a SMB2 share triggers a GetInfo call for
FS_FULL_SIZE_INFORMATION. The values returned are used to populate
struct statfs.

The problem is that none of the information returned by the call
contains the total blocks available on the filesystem. Instead we use
the blocks available to the user ie. quota limitation when filling out
statfs.f_blocks. The information returned does contain Actual free units
on the filesystem and is used to populate statfs.f_bfree. For users with
quota enabled, it can lead to situations where the total free space
reported is more than the total blocks on the system ending up with df
reports like the following

 # df -h /mnt/a
Filesystem         Size  Used Avail Use% Mounted on
//192.168.22.10/a  2.5G -2.3G  2.5G    - /mnt/a

To fix this problem, we instead populate both statfs.f_bfree with the
same value as statfs.f_bavail ie. CallerAvailableAllocationUnits. This
is similar to what is done already in the code for cifs and df now
reports the quota information for the user used to mount the share.

 # df --si /mnt/a
Filesystem         Size  Used Avail Use% Mounted on
//192.168.22.10/a  2.7G  101M  2.6G   4% /mnt/a

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Pierguido Lambri <plambri@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org>
2017-08-23 13:33:21 -05:00
Linus Torvalds
98b9f8a454 Fix a clang build regression and an potential xattr corruption bug.
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlmdAHoACgkQ8vlZVpUN
 gaNVsgf/SRn6HaOpX7BdrtkXqjV8VvLZsDmsZPkhchdmTxMpIFJNf16/sg0hqdyJ
 wcTx3y+BkBSjBXLtqK+hslVyg4pUjSBWWZyZ9Dtyi5+B92CJJJBdaHIpcdvd3Ek1
 J/HPQjqcPXL43Cg5SQ0/KgVMhCze9I4bEbNm2evC18bC15hZAVP0FK1hT3FNpyIB
 fhOu9FZdnzlcBlnLdfTqgIEPaHzc6zcJnqpSbkT0InjiJf5cxDionhoaBzUh9Jzg
 bKvkFRDTDWDrBcYStuHwgpELmVVYJGbwjzMVOAcmeCiSJqNbU1/Ym5t3e3rflKmi
 6YEyDhK43iZGiR4/QUffrCxEIzfqrA==
 =dOeQ
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a clang build regression and an potential xattr corruption bug"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: add missing xattr hash update
  ext4: fix clang build regression
2017-08-22 21:30:52 -07:00