Commit graph

175 commits

Author SHA1 Message Date
Jan Kara
dec0da7b60 ext4: use pagevec_lookup_range() in ext4_find_unwritten_pgoff()
Use pagevec_lookup_range() in ext4_find_unwritten_pgoff() since we are
interested only in pages in the given range.  Simplify the logic as a
result of not getting pages out of range and index getting automatically
advanced.

Link: http://lkml.kernel.org/r/20170726114704.7626-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Jan Kara
d72dc8a25a mm: make pagevec_lookup() update index
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue.  Most callers want this
and also pagevec_lookup_tag() already does this.

Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:26 -07:00
Ross Zwisler
91d25ba8a6 dax: use common 4k zero page for dax mmap reads
When servicing mmap() reads from file holes the current DAX code
allocates a page cache page of all zeroes and places the struct page
pointer in the mapping->page_tree radix tree.

This has three major drawbacks:

1) It consumes memory unnecessarily. For every 4k page that is read via
   a DAX mmap() over a hole, we allocate a new page cache page. This
   means that if you read 1GiB worth of pages, you end up using 1GiB of
   zeroed memory. This is easily visible by looking at the overall
   memory consumption of the system or by looking at /proc/[pid]/smaps:

	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
	Size:            1048576 kB
	Rss:             1048576 kB
	Pss:             1048576 kB
	Shared_Clean:          0 kB
	Shared_Dirty:          0 kB
	Private_Clean:   1048576 kB
	Private_Dirty:         0 kB
	Referenced:      1048576 kB
	Anonymous:             0 kB
	LazyFree:              0 kB
	AnonHugePages:         0 kB
	ShmemPmdMapped:        0 kB
	Shared_Hugetlb:        0 kB
	Private_Hugetlb:       0 kB
	Swap:                  0 kB
	SwapPss:               0 kB
	KernelPageSize:        4 kB
	MMUPageSize:           4 kB
	Locked:                0 kB

2) It is slower than using a common zero page because each page fault
   has more work to do. Instead of just inserting a common zero page we
   have to allocate a page cache page, zero it, and then insert it. Here
   are the average latencies of dax_load_hole() as measured by ftrace on
   a random test box:

    Old method, using zeroed page cache pages:	3.4 us
    New method, using the common 4k zero page:	0.8 us

   This was the average latency over 1 GiB of sequential reads done by
   this simple fio script:

     [global]
     size=1G
     filename=/root/dax/data
     fallocate=none
     [io]
     rw=read
     ioengine=mmap

3) The fact that we had to check for both DAX exceptional entries and
   for page cache pages in the radix tree made the DAX code more
   complex.

Solve these issues by following the lead of the DAX PMD code and using a
common 4k zero page instead.  As with the PMD code we will now insert a
DAX exceptional entry into the radix tree instead of a struct page
pointer which allows us to remove all the special casing in the DAX
code.

Note that we do still pretty aggressively check for regular pages in the
DAX radix tree, especially where we take action based on the bits set in
the page.  If we ever find a regular page in our radix tree now that
most likely means that someone besides DAX is inserting pages (which has
happened lots of times in the past), and we want to find that out early
and fail loudly.

This solution also removes the extra memory consumption.  Here is that
same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
code:

	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
	Size:            1048576 kB
	Rss:                   0 kB
	Pss:                   0 kB
	Shared_Clean:          0 kB
	Shared_Dirty:          0 kB
	Private_Clean:         0 kB
	Private_Dirty:         0 kB
	Referenced:            0 kB
	Anonymous:             0 kB
	LazyFree:              0 kB
	AnonHugePages:         0 kB
	ShmemPmdMapped:        0 kB
	Shared_Hugetlb:        0 kB
	Private_Hugetlb:       0 kB
	Swap:                  0 kB
	SwapPss:               0 kB
	KernelPageSize:        4 kB
	MMUPageSize:           4 kB
	Locked:                0 kB

Overall system memory consumption is similarly improved.

Another major change is that we remove dax_pfn_mkwrite() from our fault
flow, and instead rely on the page fault itself to make the PTE dirty
and writeable.  The following description from the patch adding the
vm_insert_mixed_mkwrite() call explains this a little more:

   "To be able to use the common 4k zero page in DAX we need to have our
    PTE fault path look more like our PMD fault path where a PTE entry
    can be marked as dirty and writeable as it is first inserted rather
    than waiting for a follow-up dax_pfn_mkwrite() =>
    finish_mkwrite_fault() call.

    Right now we can rely on having a dax_pfn_mkwrite() call because we
    can distinguish between these two cases in do_wp_page():

            case 1: 4k zero page => writable DAX storage
            case 2: read-only DAX storage => writeable DAX storage

    This distinction is made by via vm_normal_page(). vm_normal_page()
    returns false for the common 4k zero page, though, just as it does
    for DAX ptes. Instead of special casing the DAX + 4k zero page case
    we will simplify our DAX PTE page fault sequence so that it matches
    our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
    We will instead use dax_iomap_fault() to handle write-protection
    faults.

    This means that insert_pfn() needs to follow the lead of
    insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
    'mkwrite' is set insert_pfn() will do the work that was previously
    done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"

Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:24 -07:00
Jan Kara
fcf5ea1099 ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize
ext4_find_unwritten_pgoff() does not properly handle a situation when
starting index is in the middle of a page and blocksize < pagesize. The
following command shows the bug on filesystem with 1k blocksize:

  xfs_io -f -c "falloc 0 4k" \
            -c "pwrite 1k 1k" \
            -c "pwrite 3k 1k" \
            -c "seek -a -r 0" foo

In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048,
SEEK_DATA) will return the correct result.

Fix the problem by neglecting buffers in a page before starting offset.

Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org # 3.8+
2017-08-05 17:43:24 -04:00
Linus Torvalds
bc2c6421cb The first major feature for ext4 this merge window is the largedir
feature, which allows ext4 directories to support over 2 billion
 directory entries (assuming ~64 byte file names; in practice, users
 will run into practical performance limits first.)  This feature was
 originally written by the Lustre team, and credit goes to Artem
 Blagodarenko from Seagate for getting this feature upstream.
 
 The second major major feature allows ext4 to support extended
 attribute values up to 64k.  This feature was also originally from
 Lustre, and has been enhanced by Tahsin Erdogan from Google with a
 deduplication feature so that if multiple files have the same xattr
 value (for example, Windows ACL's stored by Samba), only one copy will
 be stored on disk for encoding and caching efficiency.
 
 We also have the usual set of bug fixes, cleanups, and optimizations.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllhl5AACgkQ8vlZVpUN
 gaOiNQf+L23sT9KIQmFwQP38vkBVw67Eo7gBfevmk7oqQLiRppT5mmLzW8EWEDxR
 PVaDQXvSZi18wSCAAcCd1ZqeIZk0P6tst0ufnIT60tGlZdUlwSLyrqvV/30axR2g
 6kcnv90ZszrQNx5U8q8bMzNrs1KtyPHFCRzavFsBX11WezNSpWnH2in/uxO+t9Jy
 F2zlrLUrE2m9AVMH48Dh6LbeaB6pqgr4k3jq1jG4Iqb2h9xgU8OKhs8gL07YS+Qi
 5A7s8GIvYQSoZUO9DOOie2f1zhpO0KrhXchyZTJukVQH7TsmFxoSh0vhXnP1Bohu
 CNLV6dzetDT0VfmPr1WhVe7lhZeeVw==
 =FFkF
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The first major feature for ext4 this merge window is the largedir
  feature, which allows ext4 directories to support over 2 billion
  directory entries (assuming ~64 byte file names; in practice, users
  will run into practical performance limits first.) This feature was
  originally written by the Lustre team, and credit goes to Artem
  Blagodarenko from Seagate for getting this feature upstream.

  The second major major feature allows ext4 to support extended
  attribute values up to 64k. This feature was also originally from
  Lustre, and has been enhanced by Tahsin Erdogan from Google with a
  deduplication feature so that if multiple files have the same xattr
  value (for example, Windows ACL's stored by Samba), only one copy will
  be stored on disk for encoding and caching efficiency.

  We also have the usual set of bug fixes, cleanups, and optimizations"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
  ext4: fix spelling mistake: "prellocated" -> "preallocated"
  ext4: fix __ext4_new_inode() journal credits calculation
  ext4: skip ext4_init_security() and encryption on ea_inodes
  fs: generic_block_bmap(): initialize all of the fields in the temp bh
  ext4: change fast symlink test to not rely on i_blocks
  ext4: require key for truncate(2) of encrypted file
  ext4: don't bother checking for encryption key in ->mmap()
  ext4: check return value of kstrtoull correctly in reserved_clusters_store
  ext4: fix off-by-one fsmap error on 1k block filesystems
  ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
  ext4: return EIO on read error in ext4_find_entry
  ext4: forbid encrypting root directory
  ext4: send parallel discards on commit completions
  ext4: avoid unnecessary stalls in ext4_evict_inode()
  ext4: add nombcache mount option
  ext4: strong binding of xattr inode references
  ext4: eliminate xattr entry e_hash recalculation for removes
  ext4: reserve space for xattr entries/names
  quota: add get_inode_usage callback to transfer multi-inode charges
  ext4: xattr inode deduplication
  ...
2017-07-09 09:31:22 -07:00
Eric Biggers
66e0aaadce ext4: don't bother checking for encryption key in ->mmap()
Since only an open file can be mmap'ed, and we only allow open()ing an
encrypted file when its key is available, there is no need to check for
the key again before permitting each mmap().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-06-23 19:41:38 -04:00
Goldwyn Rodrigues
728fbc0e10 ext4: nowait aio support
Return EAGAIN if any of the following checks fail for direct I/O:
  + i_rwsem is lockable
  + Writing beyond end of file (will trigger allocation)
  + Blocks are not allocated at the write location

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-20 07:12:03 -06:00
Eryu Guan
624327f879 ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()
ext4_find_unwritten_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.

When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size ext4 on x86_64 host.

  # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
  	    -c "seek -d 0" /mnt/ext4/testfile
  wrote 1024/1024 bytes at offset 2048
  1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec)
  Whence  Result
  DATA    EOF

Data at offset 2k was missed, and lseek(2) returned ENXIO.

This is unconvered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.

Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-05-24 18:02:20 -04:00
Jan Kara
3f1d5bad3f ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff()
There is an off-by-one error in loop termination conditions in
ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:34:23 -04:00
Jan Kara
7d95eddf31 ext4: fix SEEK_HOLE
Currently, SEEK_HOLE implementation in ext4 may both return that there's
a hole at some offset although that offset already has data and skip
some holes during a search for the next hole. The first problem is
demostrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec)
Whence	Result
HOLE	0

Where we can see that SEEK_HOLE wrongly returned offset 0 as containing
a hole although we have written data there. The second problem can be
demonstrated by:

xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
       -c "seek -h 0" file

wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec)
Whence	Result
HOLE	139264

Where we can see that hole at offsets 56k..128k has been ignored by the
SEEK_HOLE call.

The underlying problem is in the ext4_find_unwritten_pgoff() which is
just buggy. In some cases it fails to update returned offset when it
finds a hole (when no pages are found or when the first found page has
higher index than expected), in some cases conditions for detecting hole
are just missing (we fail to detect a situation where indices of
returned pages are not contiguous).

Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page
indices and also handle all cases where we got less pages then expected
in one place and handle it properly there.

CC: stable@vger.kernel.org
Fixes: c8c0df241c
CC: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-21 22:33:23 -04:00
Jan Kara
fb26a1cbed ext4: return to starting transaction in ext4_dax_huge_fault()
DAX will return to locking exceptional entry before mapping blocks for a
page fault to fix possible races with concurrent writes.  To avoid lock
inversion between exceptional entry lock and transaction start, start
the transaction already in ext4_dax_huge_fault().

Fixes: 9f141d6ef6
Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:16 -07:00
David Howells
99652ea56a ext4: Add statx support
Return enhanced file attributes from the Ext4 filesystem.  This includes
the following:

 (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.

 (2) Certain FS_xxx_FL flags are mapped to stx_attribute flags.

This requires that all ext4 inodes have a getattr call, not just some of
them, so to this end, split the ext4_getattr() function and only call part
of it where appropriate.

Example output:

	[root@andromeda ~]# touch foo
	[root@andromeda ~]# chattr +ai foo
	[root@andromeda ~]# /tmp/test-statx foo
	statx(foo) = 0
	results=fff
	  Size: 0               Blocks: 0          IO Block: 4096    regular file
	Device: 08:12           Inode: 2101950     Links: 1
	Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
	Access: 2016-02-11 17:08:29.031795451+0000
	Modify: 2016-02-11 17:08:29.031795451+0000
	Change: 2016-02-11 17:11:11.987790114+0000
	 Birth: 2016-02-11 17:08:29.031795451+0000
	Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-03 01:05:58 -04:00
Dave Jiang
c791ace1e7 mm: replace FAULT_FLAG_SIZE with parameter to huge_fault
Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has
been somewhat painful with getting the flags set and removed at the
correct locations.  More than one kernel oops was introduced due to
difficulties of getting the placement correctly.

Remove the flag values and introduce an input parameter to huge_fault
that indicates the size of the page entry.  This makes the code easier
to trace and should avoid the issues we see with the fault flags where
removal of the flag was necessary in the fallback paths.

Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Dave Jiang
a2d581675d mm,fs,dax: change ->pmd_fault to ->huge_fault
Patch series "1G transparent hugepage support for device dax", v2.

The following series implements support for 1G trasparent hugepage on
x86 for device dax.  The bulk of the code was written by Mathew Wilcox a
while back supporting transparent 1G hugepage for fs DAX.  I have
forward ported the relevant bits to 4.10-rc.  The current submission has
only the necessary code to support device DAX.

Comments from Dan Williams: So the motivation and intended user of this
functionality mirrors the motivation and users of 1GB page support in
hugetlbfs.  Given expected capacities of persistent memory devices an
in-memory database may want to reduce tlb pressure beyond what they can
already achieve with 2MB mappings of a device-dax file.  We have
customer feedback to that effect as Willy mentioned in his previous
version of these patches [1].

[1]: https://lkml.org/lkml/2016/1/31/52

Comments from Nilesh @ Oracle:

There are applications which have a process model; and if you assume
10,000 processes attempting to mmap all the 6TB memory available on a
server; we are looking at the following:

processes         : 10,000
memory            :    6TB
pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB
pmd @ 2M page size: 120,000 / 512 = ~240GB
pud @ 1G page size: 240GB / 512 = ~480MB

As you can see with 2M pages, this system will use up an exorbitant
amount of DRAM to hold the page tables; but the 1G pages finally brings
it down to a reasonable level.  Memory sizes will keep increasing; so
this number will keep increasing.

An argument can be made to convert the applications from process model
to thread model, but in the real world that may not be always practical.
Hopefully this helps explain the use case where this is valuable.

This patch (of 3):

In preparation for adding the ability to handle PUD pages, convert
vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault.  The
vm_fault structure is extended to include a union of the different page
table pointers that may be needed, and three flag bits are reserved to
indicate which type of pointer is in the union.

[ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
  Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
[dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
  Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Dave Jiang
11bac80004 mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.

Remove the vma parameter to simplify things.

[arnd@arndb.de: fix ARM build]
  Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Dave Jiang
f42003917b mm, dax: change pmd_fault() to take only vmf parameter
pmd_fault() and related functions really only need the vmf parameter since
the additional parameters are all included in the vmf struct.  Remove the
additional parameter and simplify pmd_fault() and friends.

Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:26 -08:00
Dave Jiang
d8a849e1bc mm, dax: make pmd_fault() and friends be the same as fault()
Instead of passing in multiple parameters in the pmd_fault() handler,
a vmf can be passed in just like a fault() handler. This will simplify
code and remove the need for the actual pmd fault handlers to allocate a
vmf. Related functions are also modified to do the same.

[dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off]
  Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:26 -08:00
Christoph Hellwig
ff5462e39c ext4: fix DAX write locking
Unlike O_DIRECT DAX is not an optional opt-in feature selected by the
application, so we'll have to provide the traditional synchronіzation
of overlapping writes as we do for buffered writes.

This was broken historically for DAX, but got fixed for ext2 and XFS
as part of the iomap conversion.  Fix up ext4 as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-02-08 14:39:27 -05:00
Theodore Ts'o
0db1ff222d ext4: add shutdown bit and check for it
Add a shutdown bit that will cause ext4 processing to fail immediately
with EIO.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-05 01:28:48 -05:00
Jan Kara
1db175428e ext4: Simplify DAX fault path
Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we
can use transaction starting in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). It also provides us proper retries in case of ENOSPC.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-26 20:29:25 -08:00
Jan Kara
e2ae766c1b ext4: convert DAX faults to iomap infrastructure
Convert DAX faults to use iomap infrastructure. We would not have to start
transaction in ext4_dax_fault() anymore since ext4_iomap_begin takes
care of that but so far we do that to avoid lock inversion of
transaction start with DAX entry lock which gets acquired in
dax_iomap_fault() before calling ->iomap_begin handler.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:51:24 -05:00
Jan Kara
776722e85d ext4: DAX iomap write support
Implement DAX writes using the new iomap infrastructure instead of
overloading the direct IO path.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 18:09:11 -05:00
Jan Kara
364443cbcf ext4: convert DAX reads to iomap infrastructure
Implement basic iomap_begin function that handles reading and use it for
DAX reads.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:36:06 -05:00
Jan Kara
213bcd9ccb ext4: factor out checks from ext4_file_write_iter()
Factor out checks of 'from' and whether we are overwriting out of
ext4_file_write_iter() so that the function is easier to follow.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-11-20 17:29:51 -05:00
Linus Torvalds
97d2116708 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "xattr stuff from Andreas

  This completes the switch to xattr_handler ->get()/->set() from
  ->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Remove {get,set,remove}xattr inode operations
  xattr: Stop calling {get,set,remove}xattr inode operations
  vfs: Check for the IOP_XATTR flag in listxattr
  xattr: Add __vfs_{get,set,remove}xattr helpers
  libfs: Use IOP_XATTR flag for empty directory handling
  vfs: Use IOP_XATTR flag for bad-inode handling
  vfs: Add IOP_XATTR inode operations flag
  vfs: Move xattr_resolve_name to the front of fs/xattr.c
  ecryptfs: Switch to generic xattr handlers
  sockfs: Get rid of getxattr iop
  sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
  kernfs: Switch to generic xattr handlers
  hfs: Switch to generic xattr handlers
  jffs2: Remove jffs2_{get,set,remove}xattr macros
  xattr: Remove unnecessary NULL attribute name check
2016-10-10 17:11:50 -07:00
Linus Torvalds
b66484cd74 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - fsnotify updates

 - ocfs2 updates

 - all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
  console: don't prefer first registered if DT specifies stdout-path
  cred: simpler, 1D supplementary groups
  CREDITS: update Pavel's information, add GPG key, remove snail mail address
  mailmap: add Johan Hovold
  .gitattributes: set git diff driver for C source code files
  uprobes: remove function declarations from arch/{mips,s390}
  spelling.txt: "modeled" is spelt correctly
  nmi_backtrace: generate one-line reports for idle cpus
  arch/tile: adopt the new nmi_backtrace framework
  nmi_backtrace: do a local dump_stack() instead of a self-NMI
  nmi_backtrace: add more trigger_*_cpu_backtrace() methods
  min/max: remove sparse warnings when they're nested
  Documentation/filesystems/proc.txt: add more description for maps/smaps
  mm, proc: fix region lost in /proc/self/smaps
  proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
  proc: add LSM hook checks to /proc/<tid>/timerslack_ns
  proc: relax /proc/<tid>/timerslack_ns capability requirements
  meminfo: break apart a very long seq_printf with #ifdefs
  seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
  proc: faster /proc/*/status
  ...
2016-10-07 21:38:00 -07:00
Andreas Gruenbacher
fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Toshi Kani
dbe6ec8156 ext2/4, xfs: call thp_get_unmapped_area() for pmd mappings
To support DAX pmd mappings with unmodified applications, filesystems
need to align an mmap address by the pmd size.

Call thp_get_unmapped_area() from f_op->get_unmapped_area.

Note, there is no change in behavior for a non-DAX file.

Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Jan Kara
51e8137b82 ext4: remove plugging from ext4_file_write_iter()
do_blockdev_direct_IO() takes care of properly plugging direct IO so
there's no need to plug again inside ext4_file_write_iter().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:57:41 -04:00
Jan Kara
4b0524aae0 ext4: allow unlocked direct IO when pages are cached
Currently we do not allow unlocked (meaning without inode_lock) direct
IO when the file has any pages cached. This check is not needed anymore
as we keep inode lock until ext4_direct_IO_write() and thus can happily
writeback and evict any pages conflicting with current direct IO write.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-30 01:55:32 -04:00
Fabian Frederick
518eaa6387 ext4: create EXT4_MAX_BLOCKS() macro
Create a macro to calculate length + offset -> maximum blocks
This adds more readability.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-09-15 11:55:01 -04:00
Linus Torvalds
0e06f5c0de Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2

 - most(?) of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
  thp: fix comments of __pmd_trans_huge_lock()
  cgroup: remove unnecessary 0 check from css_from_id()
  cgroup: fix idr leak for the first cgroup root
  mm: memcontrol: fix documentation for compound parameter
  mm: memcontrol: remove BUG_ON in uncharge_list
  mm: fix build warnings in <linux/compaction.h>
  mm, thp: convert from optimistic swapin collapsing to conservative
  mm, thp: fix comment inconsistency for swapin readahead functions
  thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
  shmem: split huge pages beyond i_size under memory pressure
  thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
  khugepaged: add support of collapse for tmpfs/shmem pages
  shmem: make shmem_inode_info::lock irq-safe
  khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
  thp: extract khugepaged from mm/huge_memory.c
  shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
  shmem: add huge pages support
  shmem: get_unmapped_area align huge page
  shmem: prepare huge= mount option and sysfs knob
  mm, rmap: account shmem thp pages
  ...
2016-07-26 19:55:54 -07:00
Ross Zwisler
6b524995a7 dax: remote unused fault wrappers
Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
dax_pmd_fault() respectively, and update all callers.

The dax_fault() and dax_pmd_fault() wrappers were initially intended to
capture some filesystem independent functionality around page faults
(calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
and ctime).

However, the following commits:

   5726b27b09 ("ext2: Add locking for DAX faults")
   ea3d7209ca ("ext4: fix races between page faults and hole punching")

added locking to the ext2 and ext4 filesystems after these common
operations but before __dax_fault() and __dax_pmd_fault() were called.
This means that these wrappers are no longer used, and are unlikely to
be used in the future.

XFS has had locking analogous to what was recently added to ext2 and
ext4 since DAX support was initially introduced by:

   6b698edeee ("xfs: add DAX file operations support")

Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Jaegeuk Kim
a7550b30ab ext4 crypto: migrate into vfs's crypto engine
This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-07-10 14:01:03 -04:00
Linus Torvalds
315227f6da DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on
   any device. This enables the use of DAX in the presence of these
   errors by making all sector-aligned zeroing go through the driver.
 - The driver (already) has the ability to clear errors on writes that
   are sent through the block layer using 'DSMs' defined in ACPI 6.1.
 
 Other misc changes:
 
 - When mounting DAX filesystems, check to make sure the partition
   is page aligned. This is a requirement for DAX, and previously, we
   allowed such unaligned mounts to succeed, but subsequent reads/writes
   would fail.
 
 - Misc/cleanup fixes from Jan that remove unused code from DAX related to
   zeroing, writeback, and some size checks.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8
 15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1
 +yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr
 5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2
 dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j
 GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5
 cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L
 lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z
 aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25
 pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS
 vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo
 9R6JdrAj0Sc+FBa+cGzH
 =v6Of
 -----END PGP SIGNATURE-----

Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull misc DAX updates from Vishal Verma:
 "DAX error handling for 4.7

   - Until now, dax has been disabled if media errors were found on any
     device.  This enables the use of DAX in the presence of these
     errors by making all sector-aligned zeroing go through the driver.

   - The driver (already) has the ability to clear errors on writes that
     are sent through the block layer using 'DSMs' defined in ACPI 6.1.

  Other misc changes:

   - When mounting DAX filesystems, check to make sure the partition is
     page aligned.  This is a requirement for DAX, and previously, we
     allowed such unaligned mounts to succeed, but subsequent
     reads/writes would fail.

   - Misc/cleanup fixes from Jan that remove unused code from DAX
     related to zeroing, writeback, and some size checks"

* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: fix a comment in dax_zero_page_range and dax_truncate_page
  dax: for truncate/hole-punch, do zeroing through the driver if possible
  dax: export a low-level __dax_zero_page_range helper
  dax: use sb_issue_zerout instead of calling dax_clear_sectors
  dax: enable dax in the presence of known media errors (badblocks)
  dax: fallback from pmd to pte on error
  block: Update blkdev_dax_capable() for consistency
  xfs: Add alignment check for DAX mount
  ext2: Add alignment check for DAX mount
  ext4: Add alignment check for DAX mount
  block: Add bdev_dax_supported() for dax mount checks
  block: Add vfs_msg() interface
  dax: Remove redundant inode size checks
  dax: Remove pointless writeback from dax_do_io()
  dax: Remove zeroing from dax_io()
  dax: Remove dead zeroing code from fault handlers
  ext2: Avoid DAX zeroing to corrupt data
  ext2: Fix block zeroing in ext2_get_blocks() for DAX
  dax: Remove complete_unwritten argument
  DAX: move RADIX_DAX_ definitions to dax.c
2016-05-26 19:34:26 -07:00
Linus Torvalds
0e01df100b Fix a number of bugs, most notably a potential stale data exposure
after a crash and a potential BUG_ON crash if a file has the data
 journalling flag enabled while it has dirty delayed allocation blocks
 that haven't been written yet.  Also fix a potential crash in the new
 project quota code and a maliciously corrupted file system.
 
 In addition, fix some DAX-specific bugs, including when there is a
 transient ENOSPC situation and races between writes via direct I/O and
 an mmap'ed segment that could lead to lost I/O.
 
 Finally the usual set of miscellaneous cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXQ40fAAoJEPL5WVaVDYGjnwMH+wXHASgPfzZgtRInsTG8W/2L
 jsmAcMlyMAYIATWMppNtPIq0td49z1dYO0YkKhtPVMwfzu230IFWhGWp93WqP9ve
 XYHMmaBorFlMAzWgMKn1K0ExWZlV+ammmcTKgU0kU4qyZp0G/NnMtlXIkSNv2amI
 9Mn6R+v97c20gn8e9HWP/IVWkgPr+WBtEXaSGjC7dL6yI8hL+rJMqN82D76oU5ea
 vtwzrna/ISijy+etYmQzqHNYNaBKf40+B5HxQZw/Ta3FSHofBwXAyLaeEAr260Mf
 V3Eg2NDcKQxiZ3adBzIUvrRnrJV381OmHoguo8Frs8YHTTRiZ0T/s7FGr2Q0NYE=
 =7yIM
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Fix a number of bugs, most notably a potential stale data exposure
  after a crash and a potential BUG_ON crash if a file has the data
  journalling flag enabled while it has dirty delayed allocation blocks
  that haven't been written yet.  Also fix a potential crash in the new
  project quota code and a maliciously corrupted file system.

  In addition, fix some DAX-specific bugs, including when there is a
  transient ENOSPC situation and races between writes via direct I/O and
  an mmap'ed segment that could lead to lost I/O.

  Finally the usual set of miscellaneous cleanups"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
  ext4: pre-zero allocated blocks for DAX IO
  ext4: refactor direct IO code
  ext4: fix race in transient ENOSPC detection
  ext4: handle transient ENOSPC properly for DAX
  dax: call get_blocks() with create == 1 for write faults to unwritten extents
  ext4: remove unmeetable inconsisteny check from ext4_find_extent()
  jbd2: remove excess descriptions for handle_s
  ext4: remove unnecessary bio get/put
  ext4: silence UBSAN in ext4_mb_init()
  ext4: address UBSAN warning in mb_find_order_for_block()
  ext4: fix oops on corrupted filesystem
  ext4: fix check of dqget() return value in ext4_ioctl_setproject()
  ext4: clean up error handling when orphan list is corrupted
  ext4: fix hang when processing corrupted orphaned inode list
  ext4: remove trailing \n from ext4_warning/ext4_error calls
  ext4: fix races between changing inode journal mode and ext4_writepages
  ext4: handle unwritten or delalloc buffers before enabling data journaling
  ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
  ext4: do not ask jbd2 to write data for delalloc buffers
  jbd2: add support for avoiding data writes during transaction commits
  ...
2016-05-24 12:55:26 -07:00
Jan Kara
02fbd13975 dax: Remove complete_unwritten argument
Fault handlers currently take complete_unwritten argument to convert
unwritten extents after PTEs are updated. However no filesystem uses
this anymore as the code is racy. Remove the unused argument.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
2016-05-16 18:11:51 -06:00
Jan Kara
12735f8819 ext4: pre-zero allocated blocks for DAX IO
Currently ext4 treats DAX IO the same way as direct IO. I.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However this way DAX IO can race with page fault to
the same area:

ext4_ext_direct_IO()				dax_fault()
  dax_io()
    get_block() - allocates unwritten extent
    copy_from_iter_pmem()
						  get_block() - converts
						    unwritten block to
						    written and zeroes it
						    out
  ext4_convert_unwritten_extents()

So data written with DAX IO gets lost. Similarly dax_new_buf() called
from dax_io() can overwrite data that has been already written to the
block via mmap.

Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well however we would need to
lock-out faults for the whole range written to by DAX IO. And that is
not easy to do without locking-out faults for the whole file which seems
too aggressive.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-05-13 00:51:15 -04:00
Christoph Hellwig
e259221763 fs: simplify the generic_write_sync prototype
The kiocb already has the new position, so use that.  The only interesting
case is AIO, where we currently don't bother updating ki_pos.  We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.

While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Christoph Hellwig
dde0c2e798 fs: add IOCB_SYNC and IOCB_DSYNC
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.

XXX: Will need a few additional audits for O_DSYNC

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-01 19:58:39 -04:00
Jakub Wilk
8d2ae1cbe8 ext4: remove trailing \n from ext4_warning/ext4_error calls
Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these function add the newlines themselves.

Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
2016-04-27 01:11:21 -04:00
Linus Torvalds
93061f390f These changes contains a fix for overlayfs interacting with some
(badly behaved) dentry code in various file systems.  These have been
 reviewed by Al and the respective file system mtinainers and are going
 through the ext4 tree for convenience.
 
 This also has a few ext4 encryption bug fixes that were discovered in
 Android testing (yes, we will need to get these sync'ed up with the
 fs/crypto code; I'll take care of that).  It also has some bug fixes
 and a change to ignore the legacy quota options to allow for xfstests
 regression testing of ext4's internal quota feature and to be more
 consistent with how xfs handles this case.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJXBn4aAAoJEPL5WVaVDYGjHWgH/2wXnlQnC2ndJhblBWtPzprz
 OQW4dawdnhxqbTEGUqWe942tZivSb/liu/lF+urCGbWsbgz9jNOCmEAg7JPwlccY
 mjzwDvtVq5U4d2rP+JDWXLy/Gi8XgUclhbQDWFVIIIea6fS7IuFWqoVBR+HPMhra
 9tEygpiy5lNtJA/hqq3/z9x0AywAjwrYR491CuWreo2Uu1aeKg0YZsiDsuAcGioN
 Waa2TgbC/ZZyJuJcPBP8If+VOFAa0ea3F+C/o7Tb9bOqwuz0qSTcaMRgt6eQ2KUt
 P4b9Ecp1XLjJTC7IYOknUOScY3lCyREx/Xya9oGZfFNTSHzbOlLBoplCr3aUpYQ=
 =/HHR
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "These changes contains a fix for overlayfs interacting with some
  (badly behaved) dentry code in various file systems.  These have been
  reviewed by Al and the respective file system mtinainers and are going
  through the ext4 tree for convenience.

  This also has a few ext4 encryption bug fixes that were discovered in
  Android testing (yes, we will need to get these sync'ed up with the
  fs/crypto code; I'll take care of that).  It also has some bug fixes
  and a change to ignore the legacy quota options to allow for xfstests
  regression testing of ext4's internal quota feature and to be more
  consistent with how xfs handles this case"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: ignore quota mount options if the quota feature is enabled
  ext4 crypto: fix some error handling
  ext4: avoid calling dquot_get_next_id() if quota is not enabled
  ext4: retry block allocation for failed DIO and DAX writes
  ext4: add lockdep annotations for i_data_sem
  ext4: allow readdir()'s of large empty directories to be interrupted
  btrfs: fix crash/invalid memory access on fsync when using overlayfs
  ext4 crypto: use dget_parent() in ext4_d_revalidate()
  ext4: use file_dentry()
  ext4: use dget_parent() in ext4_file_open()
  nfs: use file_dentry()
  fs: add file_dentry()
  ext4 crypto: don't let data integrity writebacks fail with ENOMEM
  ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
2016-04-07 17:22:20 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Miklos Szeredi
c0a37d4878 ext4: use file_dentry()
EXT4 may be used as lower layer of overlayfs and accessing f_path.dentry
can lead to a crash.

Fix by replacing direct access of file->f_path.dentry with the
file_dentry() accessor, which will always return a native object.

Reported-by: Daniel Axtens <dja@axtens.net>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Fixes: ff978b09f9 ("ext4 crypto: move context consistency check to ext4_file_open()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> # v4.5
2016-03-26 16:14:42 -04:00
Miklos Szeredi
9dd78d8c9a ext4: use dget_parent() in ext4_file_open()
In f_op->open() lock on parent is not held, so there's no guarantee that
parent dentry won't go away at any time.

Even after this patch there's no guarantee that 'dir' will stay the parent
of 'inode', but at least it won't be freed while being used.

Fixes: ff978b09f9 ("ext4 crypto: move context consistency check to ext4_file_open()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org> # v4.5
2016-03-26 16:14:41 -04:00
Linus Torvalds
faeb20ecfa Performance improvements in SEEK_DATA and xattr scalability
improvements, plus a lot of clean ups and bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJW6c9mAAoJEPL5WVaVDYGjWsEIAJkWUvKB3GgGgP82sKDBP2P8
 IbWegO1ICMrSY78BqLI7mLCqggH5JClBgYU3O4VFv8Brj1L9mS5X+vflaDE1j9jj
 Ik1KZKtZl1opOwO1L3D4l/ipZAiENUp7NehTtpsFousmz6nMZ5vo6x4t3QSwbUIm
 YXpxUIxHEhBcW5i3EDkfYG8305V5oj8HsVf6T98OlWGpBO5VGNMAHvA7CQdQe7Rd
 chv70rij5V684bJAEoosEFXVAuOUrxcBqbFA3Nlb432YOPj0ISLx76kw0GIjUYtf
 yjoSClbRgwxGzh0jm+yaoYjjm83xbsYbHSsBmh3+/QLMbKTLXeCqR/BiqJavmcM=
 =bWpz
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Performance improvements in SEEK_DATA and xattr scalability
  improvements, plus a lot of clean ups and bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (38 commits)
  ext4: clean up error handling in the MMP support
  jbd2: do not fail journal because of frozen_buffer allocation failure
  ext4: use __GFP_NOFAIL in ext4_free_blocks()
  ext4: fix compile error while opening the macro DOUBLE_CHECK
  ext4: print ext4 mount option data_err=abort correctly
  ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
  ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()
  ext4: fix misspellings in comments.
  jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
  ext4: more efficient SEEK_DATA implementation
  ext4: cleanup handling of bh->b_state in DAX mmap
  ext4: return hole from ext4_map_blocks()
  ext4: factor out determining of hole size
  ext4: fix setting of referenced bit in ext4_es_lookup_extent()
  ext4: remove i_ioend_count
  ext4: simplify io_end handling for AIO DIO
  ext4: move trans handling and completion deferal out of _ext4_get_block
  ext4: rename and split get blocks functions
  ext4: use i_mutex to serialize unaligned AIO DIO
  ext4: pack ioend structure better
  ...
2016-03-17 16:31:18 -07:00
Jan Kara
2d90c160e5 ext4: more efficient SEEK_DATA implementation
Using SEEK_DATA in a huge sparse file can easily lead to sotflockups as
ext4_seek_data() iterates hole block-by-block. Fix the problem by using
returned hole size from ext4_map_blocks() and thus skip the hole in one
go.

Update also SEEK_HOLE implementation to follow the same pattern as
SEEK_DATA to make future maintenance easier.

Furthermore we add cond_resched() to both ext4_seek_data() and
ext4_seek_hole() to avoid softlockups in case evil user creates huge
fragmented file and we have to go through lots of extents.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-09 23:11:13 -05:00
Jan Kara
e142d05263 ext4: use i_mutex to serialize unaligned AIO DIO
Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
However the code cleanups that happened after 2011 when the lock was
introduced made aio_mutex acquired at almost the same places where we
already have exclusion using i_mutex. So just use i_mutex for the
exclusion of unaligned AIO DIO.

The change moves waiting for pending unwritten extent conversion under
i_mutex. That makes special handling of O_APPEND writes unnecessary and
also avoids possible livelocking of unaligned AIO DIO with aligned one
(nothing was preventing contiguous stream of aligned AIO DIOs to let
unaligned AIO DIO wait forever).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-03-08 22:44:50 -05:00
Ross Zwisler
1e9d180ba3 ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()
As it is currently written ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation so it doesn't create
a journal entry.  For a read that creates a zero page to cover a hole
followed by a write that actually allocates storage this is incorrect.  The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.

Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().

Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-27 14:01:16 -05:00
Theodore Ts'o
ff978b09f9 ext4 crypto: move context consistency check to ext4_file_open()
In the case where the per-file key for the directory is cached, but
root does not have access to the key needed to derive the per-file key
for the files in the directory, we allow the lookup to succeed, so
that lstat(2) and unlink(2) can suceed.  However, if a program tries
to open the file, it will get an ENOKEY error.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-08 00:54:26 -05:00