Commit graph

47558 commits

Author SHA1 Message Date
Andy Lutomirski
06deeec77a cifs: Fix smbencrypt() to stop pointing a scatterlist at the stack
smbencrypt() points a scatterlist to the stack, which is breaks if
CONFIG_VMAP_STACK=y.

Fix it by switching to crypto_cipher_encrypt_one().  The new code
should be considerably faster as an added benefit.

This code is nearly identical to some code that Eric Biggers
suggested.

Cc: stable@vger.kernel.org # 4.9 only
Reported-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2016-12-14 01:44:16 -06:00
Shaohua Li
7a62a52333 block_dev: don't update file access position for sync direct IO
For sync direct IO, generic_file_direct_write/generic_file_read_iter
will update file access position. Don't duplicate the update in
.direct_IO. This cause my raid array can't assemble.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-13 21:07:08 -07:00
NeilBrown
bcc7f5b4be block_dev: don't test bdev->bd_contains when it is not stable
bdev->bd_contains is not stable before calling __blkdev_get().
When __blkdev_get() is called on a parition with ->bd_openers == 0
it sets
  bdev->bd_contains = bdev;
which is not correct for a partition.
After a call to __blkdev_get() succeeds, ->bd_openers will be > 0
and then ->bd_contains is stable.

When FMODE_EXCL is used, blkdev_get() calls
   bd_start_claiming() ->  bd_prepare_to_claim() -> bd_may_claim()

This call happens before __blkdev_get() is called, so ->bd_contains
is not stable.  So bd_may_claim() cannot safely use ->bd_contains.
It currently tries to use it, and this can lead to a BUG_ON().

This happens when a whole device is already open with a bd_holder (in
use by dm in my particular example) and two threads race to open a
partition of that device for the first time, one opening with O_EXCL and
one without.

The thread that doesn't use O_EXCL gets through blkdev_get() to
__blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev;

Immediately thereafter the other thread, using FMODE_EXCL, calls
bd_start_claiming() from blkdev_get().  This should fail because the
whole device has a holder, but because bdev->bd_contains == bdev
bd_may_claim() incorrectly reports success.
This thread continues and blocks on bd_mutex.

The first thread then sets bdev->bd_contains correctly and drops the mutex.
The thread using FMODE_EXCL then continues and when it calls bd_may_claim()
again in:
			BUG_ON(!bd_may_claim(bdev, whole, holder));
The BUG_ON fires.

Fix this by removing the dependency on ->bd_contains in
bd_may_claim().  As bd_may_claim() has direct access to the whole
device, it can simply test if the target bdev is the whole device.

Fixes: 6b4517a791 ("block: implement bd_claiming and claiming block")
Cc: stable@vger.kernel.org (v2.6.35+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-13 21:07:08 -07:00
Linus Torvalds
aa3ecf388a xen: features and fixes for 4.10 rc0
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABAgAGBQJYT5HMAAoJELDendYovxMvhNQH/1g/3ahM4JKN8Z0SbjKBEdQm
 yj2xOj6cE3l6wMSUblKjZD2DLLhpmcHT/E97Xro/lZQEfQJoMXXWWDFowMU/P1LA
 mJxb7Fzq5Wr+6eGSAlIQB270MrpNi/luf+CWHMwVA3V7R3KRXwonOdGQSkISIzCd
 tgIydEA3a9r2+HgeIBpZFZ4GcSrJQU75krMyl2tjD1C+jeYVd+zdoj2OnDsZQDZQ
 hDWApMpNbpSBAn7JtSSdXWSTBsGH0lUECebeYPhPQ2sX2P6Y8+UCGwA7i6FFdbTa
 agXfVSdRz8dCe3k19VcKDAw6nK9BTTMnEeEHmkmygIh6wuHPP44CzigTXIbJoXI=
 =zjfm
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen updates from Juergen Gross:
 "Xen features and fixes for 4.10

  These are some fixes, a move of some arm related headers to share them
  between arm and arm64 and a series introducing a helper to make code
  more readable.

  The most notable change is David stepping down as maintainer of the
  Xen hypervisor interface. This results in me sending you the pull
  requests for Xen related code from now on"

* tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (29 commits)
  xen/balloon: Only mark a page as managed when it is released
  xenbus: fix deadlock on writes to /proc/xen/xenbus
  xen/scsifront: don't request a slot on the ring until request is ready
  xen/x86: Increase xen_e820_map to E820_X_MAX possible entries
  x86: Make E820_X_MAX unconditionally larger than E820MAX
  xen/pci: Bubble up error and fix description.
  xen: xenbus: set error code on failure
  xen: set error code on failures
  arm/xen: Use alloc_percpu rather than __alloc_percpu
  arm/arm64: xen: Move shared architecture headers to include/xen/arm
  xen/events: use xen_vcpu_id mapping for EVTCHNOP_status
  xen/gntdev: Use VM_MIXEDMAP instead of VM_IO to avoid NUMA balancing
  xen-scsifront: Add a missing call to kfree
  MAINTAINERS: update XEN HYPERVISOR INTERFACE
  xenfs: Use proc_create_mount_point() to create /proc/xen
  xen-platform: use builtin_pci_driver
  xen-netback: fix error handling output
  xen: make use of xenbus_read_unsigned() in xenbus
  xen: make use of xenbus_read_unsigned() in xen-pciback
  xen: make use of xenbus_read_unsigned() in xen-fbfront
  ...
2016-12-13 16:07:55 -08:00
Linus Torvalds
098c30557a Driver core patches for 4.10-rc1
Here's the new driver core patches for 4.10-rc1.
 
 Big thing here is the nice addition of "functional dependencies" to the
 driver core.  The idea has been talked about for a very long time, great
 job to Rafael for stepping up and implementing it. It's been tested for
 longer than the 4.9-rc1 date, we held off on merging it earlier in order
 to feel more comfortable about it.
 
 Other than that, it's just a handful of small other patches, some good
 cleanups to the mess that is the firmware class code, and we have a test
 driver for the deferred probe logic.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWFAvPQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ym3NgCgmhFeWEkp9SDt17YGGavmnzQUlBQAoJlUipJp
 PHeQkq15ZWw3wWC9FEvM
 =91M1
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here's the new driver core patches for 4.10-rc1.

  Big thing here is the nice addition of "functional dependencies" to
  the driver core. The idea has been talked about for a very long time,
  great job to Rafael for stepping up and implementing it. It's been
  tested for longer than the 4.9-rc1 date, we held off on merging it
  earlier in order to feel more comfortable about it.

  Other than that, it's just a handful of small other patches, some good
  cleanups to the mess that is the firmware class code, and we have a
  test driver for the deferred probe logic.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'driver-core-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (30 commits)
  firmware: Correct handling of fw_state_wait() return value
  driver core: Silence device links sphinx warning
  firmware: remove warning at documentation generation time
  drivers: base: dma-mapping: Fix typo in dmam_alloc_non_coherent comments
  driver core: test_async: fix up typo found by 0-day
  firmware: move fw_state_is_done() into UHM section
  firmware: do not use fw_lock for fw_state protection
  firmware: drop bit ops in favor of simple state machine
  firmware: refactor loading status
  firmware: fix usermode helper fallback loading
  driver core: firmware_class: convert to use class_groups
  driver core: devcoredump: convert to use class_groups
  driver core: class: add class_groups support
  kernfs: Declare two local data structures static
  driver-core: fix platform_no_drv_owner.cocci warnings
  drivers/base/memory.c: Remove unused 'first_page' variable
  driver core: add CLASS_ATTR_WO()
  drivers: base: cacheinfo: support DT overrides for cache properties
  drivers: base: cacheinfo: add pr_fmt logging
  drivers: base: cacheinfo: fix boot error message when acpi is enabled
  ...
2016-12-13 11:42:18 -08:00
Maxim Patlasov
2939e1a86f btrfs: limit async_work allocation and worker func duration
Problem statement: unprivileged user who has read-write access to more than
one btrfs subvolume may easily consume all kernel memory (eventually
triggering oom-killer).

Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

[root@kteam1 ~]# cat prep.sh

DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
	mkdir /mnt/$i
	btrfs subvolume create /mnt/SV_$i
	ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
	chmod a+rwx /mnt/$i
done

[root@kteam1 ~]# sh prep.sh

[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
kmalloc-128        10144  10144    128   32    1 : tunables    0    0    0 : slabdata    317    317      0
kmalloc-128       9992352 9992352    128   32    1 : tunables    0    0    0 : slabdata 312261 312261      0
kmalloc-128       24226752 24226752    128   32    1 : tunables    0    0    0 : slabdata 757086 757086      0
kmalloc-128       42754240 42754240    128   32    1 : tunables    0    0    0 : slabdata 1336070 1336070      0

The huge numbers above come from insane number of async_work-s allocated
and queued by btrfs_wq_run_delayed_node.

The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
worker func (btrfs_async_run_delayed_root) processes at least
BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery
works as expected while the list is almost empty. As soon as it is getting
bigger, worker func starts to process more than one item at a time, it takes
longer, and the chances to have async_works queued more than needed is getting
higher.

The problem above is worsened by another flaw of delayed-inode implementation:
if async_work was queued in a throttling branch (number of items >=
BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until
the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that
the func occupies CPU infinitely (up to 30sec in my experiments): while the
func is trying to drain the list, the user activity may add more and more
items to the list.

The patch fixes both problems in straightforward way: refuse queuing too
many works in btrfs_wq_run_delayed_node and bail out of worker func if
at least BTRFS_DELAYED_WRITEBACK items are processed.

Changed in v2: remove support of thresh == NO_THRESHOLD.

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v3.15+
2016-12-13 11:01:30 -08:00
Richard Weinberger
ec9160dacd ubifs: Use fscrypt ioctl() helpers
Commit db717d8e26 ("fscrypto: move ioctl processing more fully into
common code") moved ioctl() related functions into fscrypt and offers
us now a set of helper functions.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: David Gstir <david@sigma-star.at>
2016-12-13 19:54:52 +01:00
Linus Torvalds
36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Linus Torvalds
52281b38bc Improvements and fixes to pstore subsystem:
- Add additional checks for bad platform data
 
 - Remove bounce buffer in console writer
 
 - Protect read/unlink race with a mutex
 
 - Correctly give up during dump locking failures
 
 - Increase ftrace bandwidth by splitting ftrace buffers per CPU
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJYSJxYAAoJEIly9N/cbcAmYBsQAIAmHDgk3ootLQhyatZ9H2X0
 Nyl24xA7UCPaz13ddF1tUaItI4mYBWfY4gde+3fIVXDitgmFxZZqb8YV68CvFgUt
 Hb8tlTiM0F2z/muGBIgJ5TN5XiB4dO0WgvcKvnQdzyNGPVlAXvowHPkaM9X+iEA1
 y4U2Le7iK9+9fvkH7RM4O3hMiTmpKeUITYTWo1Y8n9LaZo3w5+pqhS+TPu75uyD0
 pLb53EOzZmg1nu9hcac5t4G5W1Lr4ji2EekDXemi/571HAzQnMXxJWc6ZVYLDNfP
 W4D0UGcHAERDzrYwWcGn8HIThYlpbnVw9atSTTodJTiIubtsRt4haycUH1hqMS5o
 4R2myhbAoM0A3zYBqrhwtQHg8apNes2hOR2WycAqgvylZZl1o6zaEs9zc7aafYuy
 N/M0x5tlya3fOgkvkJsmERT5jtqDVMhtBZ2xa8NYfJCHgULaUmjEx25eTr1kF3nW
 ERIX/3IayMvqHwYptP9dOzy2owLpXC8yZlM34AeM+ub93hHj1ELLfG7aN0bklD/+
 wfmIX8HpOA2XGWflOk5fiHLHro6pwRU9zOIIHFJ4Tf60PMoN+rjRfej1fjz+KOhO
 gxUYaCb+/4BlCqLqdFvF54qhQO2qmVuOAg/1BLu+hnZtXSyhVJxePthSs5shyoE8
 owL8rVXDGapjF1xO6WCR
 =UmFL
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:
 "Improvements and fixes to pstore subsystem:

   - add additional checks for bad platform data

   - remove bounce buffer in console writer

   - protect read/unlink race with a mutex

   - correctly give up during dump locking failures

   - increase ftrace bandwidth by splitting ftrace buffers per CPU"

* tag 'pstore-v4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  ramoops: add pdata NULL check to ramoops_probe
  pstore: Convert console write to use ->write_buf
  pstore: Protect unlink with read_mutex
  pstore: Use global ftrace filters for function trace filtering
  ftrace: Provide API to use global filtering for ftrace ops
  pstore: Clarify context field przs as dprzs
  pstore: improve error report for failed setup
  pstore: Merge per-CPU ftrace records into one
  pstore: Add ftrace timestamp counter
  ramoops: Split ftrace buffer space into per-CPU zones
  pstore: Make ramoops_init_przs generic for other prz arrays
  pstore: Allow prz to control need for locking
  pstore: Warn on PSTORE_TYPE_PMSG using deprecated function
  pstore: Make spinlock per zone instead of global
  pstore: Actually give up during locking failure
2016-12-13 09:16:11 -08:00
Chris Mason
5f52a2c512 Merge branch 'for-chris-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.10
Patches queued up by Filipe:

The most important change is still the fix for the extent tree
corruption that happens due to balance when qgroups are enabled (a
regression introduced in 4.7 by a fix for a regression from the last
qgroups rework). This has been hitting SLE and openSUSE users and QA
very badly, where transactions keep getting aborted when running
delayed references leaving the root filesystem in RO mode and nearly
unusable.  There are fixes here that allow us to run xfstests again
with the integrity checker enabled, which has been impossible since 4.8
(apparently I'm the only one running xfstests with the integrity
checker enabled, which is useful to validate dirtied leafs, like
checking if there are keys out of order, etc).  The rest are just some
trivial fixes, most of them tagged for stable, and two cleanups.

Signed-off-by: Chris Mason <clm@fb.com>
2016-12-13 09:14:42 -08:00
Richard Weinberger
3858866866 ubifs: Use FS_CFLG_OWN_PAGES
Commit bd7b829038 ("fscrypt: Cleanup page locking requirements for
fscrypt_{decrypt,encrypt}_page()") renamed the flag.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-13 16:18:16 +01:00
Jan Kara
5716863e0f fsnotify: Fix possible use-after-free in inode iteration on umount
fsnotify_unmount_inodes() plays complex tricks to pin next inode in the
sb->s_inodes list when iterating over all inodes. Furthermore the code has a
bug that if the current inode is the last on i_sb_list that does not have e.g.
I_FREEING set, then we leave next_i pointing to inode which may get removed
from the i_sb_list once we drop s_inode_list_lock thus resulting in
use-after-free issues (usually manifesting as infinite looping in
fsnotify_unmount_inodes()).

Fix the problem by keeping current inode pinned somewhat longer. Then we can
make the code much simpler and standard.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2016-12-13 12:57:52 +01:00
Linus Torvalds
e7aa8c2eb1 These are the documentation changes for 4.10.
It's another busy cycle for the docs tree, as the sphinx conversion
 continues.  Highlights include:
 
  - Further work on PDF output, which remains a bit of a pain but should be
    more solid now.
 
  - Five more DocBook template files converted to Sphinx.  Only 27 to go...
    Lots of plain-text files have also been converted and integrated.
 
  - Images in binary formats have been replaced with more source-friendly
    versions.
 
  - Various bits of organizational work, including the renaming of various
    files discussed at the kernel summit.
 
  - New documentation for the device_link mechanism.
 
 ...and, of course, lots of typo fixes and small updates.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYTbl7AAoJEI3ONVYwIuV63NIP/REwzThnGWFJMRSuq8Ieq2r9
 sFSQsaGTGlhyKiDoEooo+SO/Za3uTonjK+e7WZg8mhdiEdamta5aociU/71C1Yy/
 T9ur0FhcGblrvZ1NidSDvCLwuECZOMMei7mgLZ9a+KCpc4ANqqTVZSUm1blKcqhF
 XelhVXxBa0ar35l/pVzyCxkdNXRWXv+MJZE8hp5XAdTdr11DS7UY9zrZdH31axtf
 BZlbYJrvB8WPydU6myTjRpirA17Hu7uU64MsL3bNIEiRQ+nVghEzQC8uxeUCvfVx
 r0H5AgGGQeir+e8GEv2T20SPZ+dumXs+y/HehKNb3jS3gV0mo+pKPeUhwLIxr+Zh
 QY64gf+jYf5ISHwAJRnU0Ima72ehObzSbx9Dko10nhq2OvbR5f83gjz9t9jKYFU7
 RDowICA8lwqyRbHRoVfyoW8CpVhWFpMFu3yNeJMckeTish3m7ANqzaWslbsqIP5G
 zxgFMIrVVSbeae+sUeygtEJAnWI09aZ4tuaUXYtGWwu6ikC/3aV6DryP4bthG2LF
 A19uV4nMrLuuh8g2wiTHHjMfjYRwvSn+f9yaolwJhwyNDXQzRPy+ZJ3W/6olOkXC
 bAxTmVRCW5GA/fmSrfXmW1KbnxlWfP2C62hzZQ09UHxzTHdR97oFLDQdZhKo1uwf
 pmSJR0hVeRUmA4uw6+Su
 =A0EV
 -----END PGP SIGNATURE-----

Merge tag 'docs-4.10' of git://git.lwn.net/linux

Pull documentation update from Jonathan Corbet:
 "These are the documentation changes for 4.10.

  It's another busy cycle for the docs tree, as the sphinx conversion
  continues. Highlights include:

   - Further work on PDF output, which remains a bit of a pain but
     should be more solid now.

   - Five more DocBook template files converted to Sphinx. Only 27 to
     go... Lots of plain-text files have also been converted and
     integrated.

   - Images in binary formats have been replaced with more
     source-friendly versions.

   - Various bits of organizational work, including the renaming of
     various files discussed at the kernel summit.

   - New documentation for the device_link mechanism.

  ... and, of course, lots of typo fixes and small updates"

* tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
  dma-buf: Extract dma-buf.rst
  Update Documentation/00-INDEX
  docs: 00-INDEX: document directories/files with no docs
  docs: 00-INDEX: remove non-existing entries
  docs: 00-INDEX: add missing entries for documentation files/dirs
  docs: 00-INDEX: consolidate process/ and admin-guide/ description
  scripts: add a script to check if Documentation/00-INDEX is sane
  Docs: change sh -> awk in REPORTING-BUGS
  Documentation/core-api/device_link: Add initial documentation
  core-api: remove an unexpected unident
  ppc/idle: Add documentation for powersave=off
  Doc: Correct typo, "Introdution" => "Introduction"
  Documentation/atomic_ops.txt: convert to ReST markup
  Documentation/local_ops.txt: convert to ReST markup
  Documentation/assoc_array.txt: convert to ReST markup
  docs-rst: parse-headers.pl: cleanup the documentation
  docs-rst: fix media cleandocs target
  docs-rst: media/Makefile: reorganize the rules
  docs-rst: media: build SVG from graphviz files
  docs-rst: replace bayer.png by a SVG image
  ...
2016-12-12 21:58:13 -08:00
Linus Torvalds
e34bac726d Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - various misc bits

 - most of MM (quite a lot of MM material is awaiting the merge of
   linux-next dependencies)

 - kasan

 - printk updates

 - procfs updates

 - MAINTAINERS

 - /lib updates

 - checkpatch updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits)
  init: reduce rootwait polling interval time to 5ms
  binfmt_elf: use vmalloc() for allocation of vma_filesz
  checkpatch: don't emit unified-diff error for rename-only patches
  checkpatch: don't check c99 types like uint8_t under tools
  checkpatch: avoid multiple line dereferences
  checkpatch: don't check .pl files, improve absolute path commit log test
  scripts/checkpatch.pl: fix spelling
  checkpatch: don't try to get maintained status when --no-tree is given
  lib/ida: document locking requirements a bit better
  lib/rbtree.c: fix typo in comment of ____rb_erase_color
  lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
  MAINTAINERS: add drm and drm/i915 irc channels
  MAINTAINERS: add "C:" for URI for chat where developers hang out
  MAINTAINERS: add drm and drm/i915 bug filing info
  MAINTAINERS: add "B:" for URI where to file bugs
  get_maintainer: look for arbitrary letter prefixes in sections
  printk: add Kconfig option to set default console loglevel
  printk/sound: handle more message headers
  printk/btrfs: handle more message headers
  printk/kdb: handle more message headers
  ...
2016-12-12 20:50:02 -08:00
Linus Torvalds
9465d9cc31 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The time/timekeeping/timer folks deliver with this update:

   - Fix a reintroduced signed/unsigned issue and cleanup the whole
     signed/unsigned mess in the timekeeping core so this wont happen
     accidentaly again.

   - Add a new trace clock based on boot time

   - Prevent injection of random sleep times when PM tracing abuses the
     RTC for storage

   - Make posix timers configurable for real tiny systems

   - Add tracepoints for the alarm timer subsystem so timer based
     suspend wakeups can be instrumented

   - The usual pile of fixes and updates to core and drivers"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  timekeeping: Use mul_u64_u32_shr() instead of open coding it
  timekeeping: Get rid of pointless typecasts
  timekeeping: Make the conversion call chain consistently unsigned
  timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
  alarmtimer: Add tracepoints for alarm timers
  trace: Update documentation for mono, mono_raw and boot clock
  trace: Add an option for boot clock as trace clock
  timekeeping: Add a fast and NMI safe boot clock
  timekeeping/clocksource_cyc2ns: Document intended range limitation
  timekeeping: Ignore the bogus sleep time if pm_trace is enabled
  selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
  clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
  clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
  arm64: dts: rockchip: Arch counter doesn't tick in system suspend
  clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
  posix-timers: Make them configurable
  posix_cpu_timers: Move the add_device_randomness() call to a proper place
  timer: Move sys_alarm from timer.c to itimer.c
  ptp_clock: Allow for it to be optional
  Kconfig: Regenerate *.c_shipped files after previous changes
  ...
2016-12-12 19:56:15 -08:00
Linus Torvalds
e71c3978d6 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug updates from Thomas Gleixner:
 "This is the final round of converting the notifier mess to the state
  machine. The removal of the notifiers and the related infrastructure
  will happen around rc1, as there are conversions outstanding in other
  trees.

  The whole exercise removed about 2000 lines of code in total and in
  course of the conversion several dozen bugs got fixed. The new
  mechanism allows to test almost every hotplug step standalone, so
  usage sites can exercise all transitions extensively.

  There is more room for improvement, like integrating all the
  pointlessly different architecture mechanisms of synchronizing,
  setting cpus online etc into the core code"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  tracing/rb: Init the CPU mask on allocation
  soc/fsl/qbman: Convert to hotplug state machine
  soc/fsl/qbman: Convert to hotplug state machine
  zram: Convert to hotplug state machine
  KVM/PPC/Book3S HV: Convert to hotplug state machine
  arm64/cpuinfo: Convert to hotplug state machine
  arm64/cpuinfo: Make hotplug notifier symmetric
  mm/compaction: Convert to hotplug state machine
  iommu/vt-d: Convert to hotplug state machine
  mm/zswap: Convert pool to hotplug state machine
  mm/zswap: Convert dst-mem to hotplug state machine
  mm/zsmalloc: Convert to hotplug state machine
  mm/vmstat: Convert to hotplug state machine
  mm/vmstat: Avoid on each online CPU loops
  mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
  tracing/rb: Convert to hotplug state machine
  oprofile/nmi timer: Convert to hotplug state machine
  net/iucv: Use explicit clean up labels in iucv_init()
  x86/pci/amd-bus: Convert to hotplug state machine
  x86/oprofile/nmi: Convert to hotplug state machine
  ...
2016-12-12 19:25:04 -08:00
Jason Baron
30f74aa085 binfmt_elf: use vmalloc() for allocation of vma_filesz
We have observed page allocations failures of order 4 during core dump
while trying to allocate vma_filesz.  This results in a useless core
file of size 0.  To improve reliability use vmalloc().

Note that the vmalloc() allocation is bounded by sysctl_max_map_count,
which is 65,530 by default.  So with a 4k page size, and 8 bytes per
seg, this is a max of 128 pages or an order 7 allocation.  Other parts
of the core dump path, such as fill_files_note() are already using
vmalloc() for presumably similar reasons.

Link: http://lkml.kernel.org/r/1479745791-17611-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:10 -08:00
Petr Mladek
262c5e86fe printk/btrfs: handle more message headers
Commit 4bcc595ccd ("printk: reinstate KERN_CONT for printing
continuation lines") allows to define more message headers for a single
message.  The motivation is that continuous lines might get mixed.
Therefore it make sense to define the right log level for every piece of
a cont line.

The current btrfs_printk() macros do not support continuous lines at the
moment.  But better be prepared for a custom messages and avoid
potential "lvl" buffer overflow.

This patch iterates over the entire message header.  It is interested
only into the message level like the original code.

This patch also introduces PRINTK_MAX_SINGLE_HEADER_LEN.  Three bytes
are enough for the message level header at the moment.  But it used to
be three, see the commit 04d2c8c83d ("printk: convert the format for
KERN_<LEVEL> to a 2 byte pattern").

Also I fixed the default ratelimit level.  It looked very strange when it
was different from the default log level.

[pmladek@suse.com: Fix a check of the valid message level]
  Link: http://lkml.kernel.org/r/20161111183236.GD2145@dhcp128.suse.cz
Link: http://lkml.kernel.org/r/1478695291-12169-4-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Joe Perches <joe@perches.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
1270dd8d99 fs/proc: calculate /proc/* and /proc/*/task/* nlink at init time
Runtime nlink calculation works but meh.  I don't know how to do it at
compile time, but I know how to do it at init time.

Shift "2+" part into init time as a bonus.

Link: http://lkml.kernel.org/r/20161122195549.GB29812@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
bac5f5d56b fs/proc/base.c: save decrement during lookup/readdir in /proc/$PID
Comparison for "<" works equally well as comparison for "<=" but one
SUB/LEA is saved (no, it is not optimised away, at least here).

Link: http://lkml.kernel.org/r/20161122195143.GA29812@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Rasmus Villemoes
209b14dc03 fs/proc/array.c: slightly improve render_sigset_t
format_decode and vsnprintf occasionally show up in perf top, so I went
looking for places that might not need the full printf power.  With the
help of kprobes, I gathered some statistics on which format strings we
mostly pass to vsnprintf.  On a trivial desktop workload, I hit "%x" 25%
of the time, so something apparently reads /proc/pid/status (which does
5*16 printf("%x") calls) a lot.

With this patch, reading /proc/pid/status is 30% faster according to
this microbenchmark:

	char buf[4096];
	int i, fd;
	for (i = 0; i < 10000; ++i) {
		fd = open("/proc/self/status", O_RDONLY);
		read(fd, buf, sizeof(buf));
		close(fd);
	}

Link: http://lkml.kernel.org/r/1474410485-1305-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
492b2da605 proc: tweak comments about 2 stage open and everything
Some comments were obsoleted since commit 05c0ae21c0 ("try a saner
locking for pde_opener...").

Some new comments added.

Some confusing comments replaced with equally confusing ones.

Link: http://lkml.kernel.org/r/20161029160231.GD1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
39a10ac23c proc: kmalloc struct pde_opener
kzalloc is too much, half of the fields will be reinitialized anyway.

If proc file doesn't have ->release hook (some still do not), clearing
is unnecessary because it will be freed immediately.

Link: http://lkml.kernel.org/r/20161029155747.GC1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
f5887c71cf proc: fix type of struct pde_opener::closing field
struct pde_opener::closing is boolean.

Link: http://lkml.kernel.org/r/20161029155439.GB1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
06a0c4175d proc: just list_del() struct pde_opener
list_del_init() is too much, structure will be freed in three lines
anyway.

Link: http://lkml.kernel.org/r/20161029155313.GA1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
9a87fe0d7c proc: make struct struct map_files_info::len unsigned int
Linux doesn't support 4GB+ filenames in /proc, so unsigned long is too
much.

MOV r64, r/m64 is larger than MOV r32, r/m32.

Link: http://lkml.kernel.org/r/20161029161123.GG1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Alexey Dobriyan
623f594e7d proc: make struct pid_entry::len unsigned
"unsigned int" is better on x86_64 because it most of the time it
autoexpands to 64-bit value while "int" requires MOVSX instruction.

Link: http://lkml.kernel.org/r/20161029160810.GF1246@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Kees Cook
af884cd4a5 proc: report no_new_privs state
Similar to being able to examine if a process has been correctly
confined with seccomp, the state of no_new_privs is equally interesting,
so this adds it to /proc/$pid/status.

Link: http://lkml.kernel.org/r/20161103214041.GA58566@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jann Horn <jann@thejh.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rodrigo Freire <rfreire@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Robert Ho <robert.hu@intel.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Richard W.M. Jones" <rjones@redhat.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Hugh Dickins
a66c0410b9 mm: add cond_resched() in gather_pte_stats()
The other pagetable walks in task_mmu.c have a cond_resched() after
walking their ptes: add a cond_resched() in gather_pte_stats() too, for
reading /proc/<id>/numa_maps.  Only pagemap_pmd_range() has a
cond_resched() in its (unusually expensive) pmd_trans_huge case: more
should probably be added, but leave them unchanged for now.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052157400.13021@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:09 -08:00
Johannes Weiner
4d693d0860 lib: radix-tree: update callback for changing leaf nodes
Support handing __radix_tree_replace() a callback that gets invoked for
all leaf nodes that change or get freed as a result of the slot
replacement, to assist users tracking nodes with node->private_list.

This prepares for putting page cache shadow entries into the radix tree
root again and drastically simplifying the shadow tracking.

Link: http://lkml.kernel.org/r/20161117193134.GD23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:08 -08:00
Johannes Weiner
6d75f366b9 lib: radix-tree: check accounting of existing slot replacement users
The bug in khugepaged fixed earlier in this series shows that radix tree
slot replacement is fragile; and it will become more so when not only
NULL<->!NULL transitions need to be caught but transitions from and to
exceptional entries as well.  We need checks.

Re-implement radix_tree_replace_slot() on top of the sanity-checked
__radix_tree_replace().  This requires existing callers to also pass the
radix tree root, but it'll warn us when somebody replaces slots with
contents that need proper accounting (transitions between NULL entries,
real entries, exceptional entries) and where a replacement through the
slot pointer would corrupt the radix tree node counts.

Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:08 -08:00
Johannes Weiner
f7942430e4 lib: radix-tree: native accounting of exceptional entries
The way the page cache is sneaking shadow entries of evicted pages into
the radix tree past the node entry accounting and tracking them manually
in the upper bits of node->count is fraught with problems.

These shadow entries are marked in the tree as exceptional entries,
which are a native concept to the radix tree.  Maintain an explicit
counter of exceptional entries in the radix tree node.  Subsequent
patches will switch shadow entry tracking over to that counter.

DAX and shmem are the other users of exceptional entries.  Since slot
replacements that change the entry type from regular to exceptional must
now be accounted, introduce a __radix_tree_replace() function that does
replacement and accounting, and switch DAX and shmem over.

The increase in radix tree node size is temporary.  A followup patch
switches the shadow tracking to this new scheme and we'll no longer need
the upper bits in node->count and shrink that back to one byte.

Link: http://lkml.kernel.org/r/20161117192945.GA23430@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:08 -08:00
Tahsin Erdogan
bace924818 fs/fs-writeback.c: remove redundant if check
b_more_io non-empty check is already preceded by an opposite check.

Link: http://lkml.kernel.org/r/1478591249-30641-1-git-send-email-tahsin@google.com
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:08 -08:00
Deepa Dinamani
c62c38f6b9 ocfs2: replace CURRENT_TIME macro
CURRENT_TIME is not y2038 safe.

Use y2038 safe ktime_get_real_seconds() here for timestamps.  struct
heartbeat_block's hb_seq and deletetion time are already 64 bits wide
and accommodate times beyond y2038.

Also use y2038 safe ktime_get_real_ts64() for on disk inode timestamps.
These are also wide enough to accommodate time64_t.

Link: http://lkml.kernel.org/r/1475365298-29236-1-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Deepa Dinamani
395627b071 ocfs2: use time64_t to represent orphan scan times
struct timespec is not y2038 safe.  Use time64_t which is y2038 safe to
represent orphan scan times.  time64_t is sufficient here as only the
seconds delta times are relevant.

Also use appropriate time functions that return time in time64_t format.
Time functions now return monotonic time instead of real time as only
delta scan times are relevant and these values are not persistent across
reboots.

The format string for the debug print is still using long as this is
only the time elapsed since the last scan and long is sufficient to
represent this value.

Link: http://lkml.kernel.org/r/1475365138-20567-1-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Ashish Samant
4131d53810 ocfs2: fix double put of recount tree in ocfs2_lock_refcount_tree()
In ocfs2_lock_refcount_tree, if ocfs2_read_refcount_block() returns an
error, we do ocfs2_refcount_tree_put twice (once in
ocfs2_unlock_refcount_tree and once outside it), thereby reducing the
refcount of the refcount tree twice, but we dont delete the tree in this
case.  This will make refcnt of the tree = 0 and the
ocfs2_refcount_tree_put will eventually call ocfs2_mark_lockres_freeing,
setting OCFS2_LOCK_FREEING for the refcount_tree->rf_lockres.

The error returned by ocfs2_read_refcount_block is propagated all the
way back and for next iteration of write, ocfs2_lock_refcount_tree gets
the same tree back from ocfs2_get_refcount_tree because we havent
deleted the tree.  Now we have the same tree, but OCFS2_LOCK_FREEING is
set for rf_lockres and eventually, when _ocfs2_lock_refcount_tree is
called in this iteration, BUG_ON( __ocfs2_cluster_lock:1395 ERROR:
Cluster lock called on freeing lockres T00000000000000000386019775b08d!
flags 0x81) is triggerred.

Call stack:

  (loop16,11155,0):ocfs2_lock_refcount_tree:482 ERROR: status = -5
  (loop16,11155,0):ocfs2_refcount_cow_hunk:3497 ERROR: status = -5
  (loop16,11155,0):ocfs2_refcount_cow:3560 ERROR: status = -5
  (loop16,11155,0):ocfs2_prepare_inode_for_refcount:2111 ERROR: status = -5
  (loop16,11155,0):ocfs2_prepare_inode_for_write:2190 ERROR: status = -5
  (loop16,11155,0):ocfs2_file_write_iter:2331 ERROR: status = -5
  (loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: bug expression:
  lockres->l_flags & OCFS2_LOCK_FREEING

  (loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: Cluster lock called on
  freeing lockres T00000000000000000386019775b08d! flags 0x81

  kernel BUG at fs/ocfs2/dlmglue.c:1395!

  invalid opcode: 0000 [#1] SMP  CPU 0
  Modules linked in: tun ocfs2 jbd2 xen_blkback xen_netback xen_gntdev .. sd_mod crc_t10dif ext3 jbd mbcache
  RIP: __ocfs2_cluster_lock+0x31c/0x740 [ocfs2]
  RSP: e02b:ffff88017c0138a0  EFLAGS: 00010086
  Process loop16 (pid: 11155, threadinfo ffff88017c010000, task ffff8801b5374300)
  Call Trace:
     ocfs2_refcount_lock+0xae/0x130 [ocfs2]
     __ocfs2_lock_refcount_tree+0x29/0xe0 [ocfs2]
     ocfs2_lock_refcount_tree+0xdd/0x320 [ocfs2]
     ocfs2_refcount_cow_hunk+0x1cb/0x440 [ocfs2]
     ocfs2_refcount_cow+0xa9/0x1d0 [ocfs2]
     ocfs2_prepare_inode_for_refcount+0x115/0x200 [ocfs2]
     ocfs2_prepare_inode_for_write+0x33b/0x470 [ocfs2]
     ocfs2_file_write_iter+0x220/0x8c0 [ocfs2]
     aio_write_iter+0x2e/0x30

Fix this by avoiding the second call to ocfs2_refcount_tree_put()

Link: http://lkml.kernel.org/r/1473984404-32011-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Eric Ren <zren@suse.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
piaojun
07f38d971c ocfs2: clean up unused 'page' parameter in ocfs2_write_end_nolock()
'page' parameter in ocfs2_write_end_nolock() is never used.

Link: http://lkml.kernel.org/r/582FD91A.5000902@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
piaojun
28bb5ef485 ocfs2/dlm: clean up deadcode in dlm_master_request_handler()
When 'dispatch_assert' is set, 'response' must be DLM_MASTER_RESP_YES,
and 'res' won't be null, so execution can't reach these two branch.

Link: http://lkml.kernel.org/r/58174C91.3040004@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Guozhonghua
aa7b58597f ocfs2: delete redundant code and set the node bit into maybe_map directly
The variable `set_maybe' is redundant when the mle has been found in the
map.  So it is ok to set the node_idx into mle's maybe_map directly.

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D490DD@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
piaojun
46832b2de5 ocfs2/dlm: clean up useless BUG_ON default case in dlm_finalize_reco_handler()
The value of 'stage' must be between 1 and 2, so the switch can't reach
the default case.

Link: http://lkml.kernel.org/r/57FB5EB2.7050002@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12 18:55:06 -08:00
Theodore Ts'o
a551d7c8de Merge branch 'fscrypt' into dev 2016-12-12 21:50:28 -05:00
Jan Kara
0cb80b4847 dax: Fix sleep in atomic contex in grab_mapping_entry()
Commit 642261ac99: "dax: add struct iomap based DAX PMD support" has
introduced unmapping of page tables if huge page needs to be split in
grab_mapping_entry(). However the unmapping happens after
radix_tree_preload() call which disables preemption and thus
unmap_mapping_range() tries to acquire i_mmap_lock in atomic context
which is a bug. Fix the problem by moving unmapping before
radix_tree_preload() call.

Fixes: 642261ac99
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-12 21:34:12 -05:00
Yan, Zheng
dc24de82d6 ceph: properly set issue_seq for cap release
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
1e4ef0c633 ceph: add flags parameter to send_cap_msg
Add a flags parameter to send_cap_msg, so we can request expedited
service from the MDS when we know we'll be waiting on the result.

Set that flag in the case of try_flush_caps. The callers of that
function generally wait synchronously on the result, so it's beneficial
to ask the server to expedite it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
43b2967330 ceph: update cap message struct version to 10
The userland ceph has MClientCaps at struct version 10. This brings the
kernel up the same version.

For now, all of the the new stuff is set to default values including
the flags field, which will be conditionally set in a later patch.

Note that we don't need to set the change_attr and btime to anything
since we aren't currently setting the feature flag. The MDS should
ignore those values.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
0ff8bfb394 ceph: define new argument structure for send_cap_msg
When we get to this many arguments, it's hard to work with positional
parameters. send_cap_msg is already at 25 arguments, with more needed.

Define a new args structure and pass a pointer to it to send_cap_msg.
Eventually it might make sense to embed one of these inside
ceph_cap_snap instead of tracking individual fields.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
9670079f5f ceph: move xattr initialzation before the encoding past the ceph_mds_caps
Just for clarity. This part is inside the header, so it makes sense to
group it with the rest of the stuff in the header.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:28 +01:00
Jeff Layton
4945a08479 ceph: fix minor typo in unsafe_request_wait
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Yan, Zheng
5f743e4566 ceph: record truncate size/seq for snap data writeback
Dirty snapshot data needs to be flushed unconditionally. If they
were created before truncation, writeback should use old truncate
size/seq.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Yan, Zheng
e9e427f0a1 ceph: check availability of mds cluster on mount
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Yan, Zheng
7ce469a53e ceph: fix splice read for no Fc capability case
When iov_iter type is ITER_PIPE, copy_page_to_iter() increases
the page's reference and add the page to a pipe_buffer. It also
set the pipe_buffer's ops to page_cache_pipe_buf_ops. The comfirm
callback in page_cache_pipe_buf_ops expects the page is from page
cache and uptodate, otherwise it return error.

For ceph_sync_read() case, pages are not from page cache. So we
can't call copy_page_to_iter() when iov_iter type is ITER_PIPE.
The fix is using iov_iter_get_pages_alloc() to allocate pages
for the pipe. (the code is similar to default_file_splice_read)

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Yan, Zheng
2b1ac852eb ceph: try getting buffer capability for readahead/fadvise
For readahead/fadvise cases, caller of ceph_readpages does not
hold buffer capability. Pages can be added to page cache while
there is no buffer capability. This can cause data integrity
issue.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Nikolay Borisov
5c341ee328 ceph: fix scheduler warning due to nested blocking
try_get_cap_refs can be used as a condition in a wait_event* calls.
This is all fine until it has to call __ceph_do_pending_vmtruncate,
which in turn acquires the i_truncate_mutex. This leads to a situation
in which a task's state is !TASK_RUNNING and at the same time it's
trying to acquire a sleeping primitive. In essence a nested sleeping
primitives are being used. This causes the following warning:

WARNING: CPU: 22 PID: 11064 at kernel/sched/core.c:7631 __might_sleep+0x9f/0xb0()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109447d>] prepare_to_wait_event+0x5d/0x110
 ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6
CPU: 22 PID: 11064 Comm: fs_checker.pl Tainted: G           O    4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
 0000000000000000 ffff8838b416fa88 ffffffff812f4409 ffff8838b416fad0
 ffffffff81a034f2 ffff8838b416fac0 ffffffff81052b46 ffffffff81a0432c
 0000000000000061 0000000000000000 0000000000000000 ffff88167bda54a0
Call Trace:
 [<ffffffff812f4409>] dump_stack+0x67/0x9e
 [<ffffffff81052b46>] warn_slowpath_common+0x86/0xc0
 [<ffffffff81052bcc>] warn_slowpath_fmt+0x4c/0x50
 [<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
 [<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
 [<ffffffff8107767f>] __might_sleep+0x9f/0xb0
 [<ffffffff81612d30>] mutex_lock+0x20/0x40
 [<ffffffffa04eea14>] __ceph_do_pending_vmtruncate+0x44/0x1a0 [ceph]
 [<ffffffffa04fa692>] try_get_cap_refs+0xa2/0x320 [ceph]
 [<ffffffffa04fd6f5>] ceph_get_caps+0x255/0x2b0 [ceph]
 [<ffffffff81094370>] ? wait_woken+0xb0/0xb0
 [<ffffffffa04f2c11>] ceph_write_iter+0x2b1/0xde0 [ceph]
 [<ffffffff81613f22>] ? schedule_timeout+0x202/0x260
 [<ffffffff8117f01a>] ? kmem_cache_free+0x1ea/0x200
 [<ffffffff811b46ce>] ? iput+0x9e/0x230
 [<ffffffff81077632>] ? __might_sleep+0x52/0xb0
 [<ffffffff81156147>] ? __might_fault+0x37/0x40
 [<ffffffff8119e123>] ? cp_new_stat+0x153/0x170
 [<ffffffff81198cfa>] __vfs_write+0xaa/0xe0
 [<ffffffff81199369>] vfs_write+0xa9/0x190
 [<ffffffff811b6d01>] ? set_close_on_exec+0x31/0x70
 [<ffffffff8119a056>] SyS_write+0x46/0xa0

This happens since wait_event_interruptible can interfere with the
mutex locking code, since they both fiddle with the task state.

Fix the issue by using the newly-added nested blocking infrastructure
in 61ada528de ("sched/wait: Provide infrastructure to deal with
nested blocking")

Link: https://lwn.net/Articles/628628/
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-12-12 23:54:27 +01:00
Zhi Zhang
a380a031cb ceph: fix printing wrong return variable in ceph_direct_read_write()
Fix printing wrong return variable for invalidate_inode_pages2_range in
ceph_direct_read_write().

Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-12-12 23:54:27 +01:00
Ilya Dryomov
0dde584882 libceph: drop len argument of *verify_authorizer_reply()
The length of the reply is protocol-dependent - for cephx it's
ceph_x_authorize_reply.  Nothing sensible can be passed from the
messenger layer anyway.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2016-12-12 23:09:21 +01:00
Richard Weinberger
fc4b891bbe ubifs: Raise write version to 5
Starting with version 5 the following properties change:
 - UBIFS_FLG_DOUBLE_HASH is mandatory
 - UBIFS_FLG_ENCRYPTION is optional but depdens on UBIFS_FLG_DOUBLE_HASH
 - Filesystems with unknown super block flags will be rejected, this
   allows us in future to add new features without raising the UBIFS
   write version.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
e021986ee4 ubifs: Implement UBIFS_FLG_ENCRYPTION
This feature flag indicates that the filesystem contains encrypted
files.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
d63d61c169 ubifs: Implement UBIFS_FLG_DOUBLE_HASH
This feature flag indicates that all directory entry nodes have a 32bit
cookie set and therefore UBIFS is allowed to perform lookups by hash.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
cc41a53652 ubifs: Use a random number for cookies
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
528e3d178f ubifs: Add full hash lookup support
UBIFS stores a 32bit hash of every file, for traditional lookups by name
this scheme is fine since UBIFS can first try to find the file by the
hash of the filename and upon collisions it can walk through all entries
with the same hash and do a string compare.
When filesnames are encrypted fscrypto will ask the filesystem for a
unique cookie, based on this cookie the filesystem has to be able to
locate the target file again. With 32bit hashes this is impossible
because the chance for collisions is very high. Do deal with that we
store a 32bit cookie directly in the UBIFS directory entry node such
that we get a 64bit cookie (32bit from filename hash and the dent
cookie). For a lookup by hash UBIFS finds the entry by the first 32bit
and then compares the dent cookie. If it does not match, it has to do a
linear search of the whole directory and compares all dent cookies until
the correct entry is found.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
b91dc9816e ubifs: Rename tnc_read_node_nm
tnc_read_hashed_node() is a better name since we read a node
by a given hash, not a name.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
ca7f85be8d ubifs: Add support for encrypted symlinks
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
f4f61d2cc6 ubifs: Implement encrypted filenames
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
b9bc8c7bdb ubifs: Make r5 hash binary string aware
As of now all filenames known by UBIFS are strings with a NUL
terminator. With encrypted filenames a filename can be any binary
string and the r5 function cannot search for the NUL terminator.
UBIFS always knows how long a filename is, therefore we can change
the hash function to iterate over the filename length to work
correctly with binary strings.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
304790c038 ubifs: Relax checks in ubifs_validate_entry()
With encrypted filenames we store raw binary data, doing
string tests is no longer possible.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
7799953b34 ubifs: Implement encrypt/decrypt for all IO
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
1ee77870c9 ubifs: Constify struct inode pointer in ubifs_crypt_is_encrypted()
...and provide a non const variant for fscrypto

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
f1f52d6b02 ubifs: Introduce new data node field, compr_size
When data of a data node is compressed and encrypted
we need to store the size of the compressed data because
before encryption we may have to add padding bytes.

For the new field we consume the last two padding bytes
in struct ubifs_data_node. Two bytes are fine because
the data length is at most 4096.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
959c2de2b3 ubifs: Enforce crypto policy in mmap
We need this extra check in mmap because a process could
gain an already opened fd.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
700eada82a ubifs: Massage assert in ubifs_xattr_set() wrt. fscrypto
When we're creating a new inode in UBIFS the inode is not
yet exposed and fscrypto calls ubifs_xattr_set() without
holding the inode mutex. This is okay but ubifs_xattr_set()
has to know about this.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
9270b2f4cd ubifs: Preload crypto context in ->lookup()
...and mark the dentry as encrypted.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
ac7e47a9ed ubifs: Enforce crypto policy in ->link and ->rename
When a file is moved or linked into another directory
its current crypto policy has to be compatible with the
target policy.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
a79bff21c1 ubifs: Implement file open operation
We need ->open() for files to load the crypto key.
If the no key is present and the file is encrypted,
refuse to open.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
ba40e6a3c4 ubifs: Implement directory open operation
We need the ->open() hook to load the crypto context
which is needed for all crypto operations within that
directory.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
43b113fea2 ubifs: Massage ubifs_listxattr() for encryption context
We have to make sure that we don't expose our internal
crypto context to userspace.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
d475a50745 ubifs: Add skeleton for fscrypto
This is the first building block to provide file level
encryption on UBIFS.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
6a5e98ab7d ubifs: Define UBIFS crypto context xattr
Like ext4 UBIFS will store the crypto context in a xattr
attribute.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
ade46c3a60 ubifs: Export xattr get and set functions
For fscrypto we need this function outside of xattr.c.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Richard Weinberger
f6337d8426 ubifs: Export ubifs_check_dir_empty()
fscrypto will need this function too. Also get struct ubifs_info
from the provided inode. Not all callers will have a reference to
struct ubifs_info.

Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:07:38 +01:00
Christophe Jaillet
d40a796217 ubifs: Remove some dead code
'ubifs_fast_find_freeable()' can not return an error pointer, so this test
can be removed.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:06:28 +01:00
Rafał Miłecki
1b7fc2c006 ubifs: Use dirty_writeback_interval value for wbuf timer
Right now wbuf timer has hardcoded timeouts and there is no place for
manual adjustments. Some projects / cases many need that though. Few
file systems allow doing that by respecting dirty_writeback_interval
that can be set using sysctl (dirty_writeback_centisecs).

Lowering dirty_writeback_interval could be some way of dealing with user
space apps lacking proper fsyncs. This is definitely *not* a perfect
solution but we don't have ideal (user space) world. There were already
advanced discussions on this matter, mostly when ext4 was introduced and
it wasn't behaving as ext3. Anyway, the final decision was to add some
hacks to the ext4, as trying to fix whole user space or adding new API
was pointless.

We can't (and shouldn't?) just follow ext4. We can't e.g. sync on close
as this would cause too many commits and flash wearing. On the other
hand we still should allow some trade-off between -o sync and default
wbuf timeout. Respecting dirty_writeback_interval should allow some sane
cutomizations if used warily.

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:06:24 +01:00
Rafał Miłecki
854826c9d5 ubifs: Drop softlimit and delta fields from struct ubifs_wbuf
Values of these fields are set during init and never modified. They are
used (read) in a single function only. There isn't really any reason to
keep them in a struct. It only makes struct just a bit bigger without
any visible gain.

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2016-12-12 23:06:11 +01:00
Yunlei He
c0ed4405a9 f2fs: fix a missing size change in f2fs_setattr
This patch fix a missing size change in f2fs_setattr

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-12-12 11:09:05 -08:00
Christophe JAILLET
04102c76a7 orangefs: Axe some dead code
The "perf_counter_reset" case has already been handled above.
Moreover "ORANGEFS_PARAM_REQUEST_OP_READAHEAD_COUNT_SIZE" is not a really
consistent.
It is likely that this (dead) code is a cut and paste left over.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-12-12 12:49:20 -05:00
Colin Ian King
4defb5f912 orangefs: fix memory leak of string 'new' on exit path
allocates string 'new' is not free'd on the exit path when
cdm_element_count <= 0. Fix this by kfree'ing it.

Fixes CoverityScan CID#1375923 "Resource Leak"

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2016-12-12 11:43:25 -05:00
Chris Mason
7c4c71ac8a Revert "Btrfs: adjust len of writes if following a preallocated extent"
This is exposing an existing deadlock between fsync and AIO.  Until we
have the deadlock fixed, I'm pulling this one out.

This reverts commit a23eaa875f.

Signed-off-by: Chris Mason <clm@fb.com>
2016-12-11 15:27:15 -08:00
David Gstir
6a34e4d2be fscrypt: Rename FS_WRITE_PATH_FL to FS_CTX_HAS_BOUNCE_BUFFER_FL
... to better explain its purpose after introducing in-place encryption
without bounce buffer.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-11 16:33:18 -05:00
David Gstir
f32d7ac20a fscrypt: Delay bounce page pool allocation until needed
Since fscrypt users can now indicated if fscrypt_encrypt_page() should
use a bounce page, we can delay the bounce page pool initialization util
it is really needed. That is until fscrypt_operations has no
FS_CFLG_OWN_PAGES flag set.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-11 16:33:11 -05:00
David Gstir
bd7b829038 fscrypt: Cleanup page locking requirements for fscrypt_{decrypt,encrypt}_page()
Rename the FS_CFLG_INPLACE_ENCRYPTION flag to FS_CFLG_OWN_PAGES which,
when set, indicates that the fs uses pages under its own control as
opposed to writeback pages which require locking and a bounce buffer for
encryption.

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-11 16:26:12 -05:00
David Gstir
1400451f04 fscrypt: Cleanup fscrypt_{decrypt,encrypt}_page()
- Improve documentation
- Add BUG_ON(len == 0) to avoid accidental switch of offs and len
parameters
- Improve variable names for readability

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-11 16:26:12 -05:00
David Gstir
9e532772b4 fscrypt: Never allocate fscrypt_ctx on in-place encryption
In case of in-place encryption fscrypt_ctx was allocated but never
released. Since we don't need it for in-place encryption, we skip
allocating it.

Fixes: 1c7dcf69ee ("fscrypt: Add in-place encryption mode")

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-11 16:26:11 -05:00
David Gstir
e550c16c8a fscrypt: Use correct index in decrypt path.
Actually use the fs-provided index instead of always using page->index
which is only set for page-cache pages.

Fixes: 9c4bb8a3a9 ("fscrypt: Let fs select encryption index/tweak")

Signed-off-by: David Gstir <david@sigma-star.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-11 16:26:10 -05:00
Theodore Ts'o
cc4e0df038 fscrypt: move non-public structures and constants to fscrypt_private.h
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
2016-12-11 16:26:09 -05:00
Theodore Ts'o
b98701df34 fscrypt: unexport fscrypt_initialize()
The fscrypt_initalize() function isn't used outside fs/crypto, so
there's no point making it be an exported symbol.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
2016-12-11 16:26:08 -05:00
Theodore Ts'o
3325bea5b2 fscrypt: rename get_crypt_info() to fscrypt_get_crypt_info()
To avoid namespace collisions, rename get_crypt_info() to
fscrypt_get_crypt_info().  The function is only used inside the
fs/crypto directory, so declare it in the new header file,
fscrypt_private.h.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
2016-12-11 16:26:08 -05:00
Eric Biggers
db717d8e26 fscrypto: move ioctl processing more fully into common code
Multiple bugs were recently fixed in the "set encryption policy" ioctl.
To make it clear that fscrypt_process_policy() and fscrypt_get_policy()
implement ioctls and therefore their implementations must take standard
security and correctness precautions, rename them to
fscrypt_ioctl_set_policy() and fscrypt_ioctl_get_policy().  Make the
latter take in a struct file * to make it consistent with the former.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-11 16:26:07 -05:00
Eric Biggers
8048123576 fscrypto: remove unneeded Kconfig dependencies
SHA256 and ENCRYPTED_KEYS are not needed.  CTR shouldn't be needed
either, but I left it for now because it was intentionally added by
commit 71dea01ea2 ("ext4 crypto: require CONFIG_CRYPTO_CTR if ext4
encryption is enabled").  So it sounds like there may be a dependency
problem elsewhere, which I have not been able to identify specifically,
that must be solved before CTR can be removed.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-11 16:26:07 -05:00
Sergey Karamov
73b92a2a5e ext4: do not perform data journaling when data is encrypted
Currently data journalling is incompatible with encryption: enabling both
at the same time has never been supported by design, and would result in
unpredictable behavior. However, users are not precluded from turning on
both features simultaneously. This change programmatically replaces data
journaling for encrypted regular files with ordered data journaling mode.

Background:
Journaling encrypted data has not been supported because it operates on
buffer heads of the page in the page cache. Namely, when the commit
happens, which could be up to five seconds after caching, the commit
thread uses the buffer heads attached to the page to copy the contents of
the page to the journal. With encryption, it would have been required to
keep the bounce buffer with ciphertext for up to the aforementioned five
seconds, since the page cache can only hold plaintext and could not be
used for journaling. Alternatively, it would be required to setup the
journal to initiate a callback at the commit time to perform deferred
encryption - in this case, not only would the data have to be written
twice, but it would also have to be encrypted twice. This level of
complexity was not justified for a mode that in practice is very rarely
used because of the overhead from the data journalling.

Solution:
If data=journaled has been set as a mount option for a filesystem, or if
journaling is enabled on a regular file, do not perform journaling if the
file is also encrypted, instead fall back to the data=ordered mode for the
file.

Rationale:
The intent is to allow seamless and proper filesystem operation when
journaling and encryption have both been enabled, and have these two
conflicting features gracefully resolved by the filesystem.

Fixes: 4461471107
Signed-off-by: Sergey Karamov <skaramov@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-12-10 17:54:58 -05:00
David S. Miller
821781a9f4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-12-10 16:21:55 -05:00
Darrick J. Wong
29ac8e856c ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
Connect the new VFS clone_range, copy_range, and dedupe_range features
to the existing reflink capability of ocfs2.  Compared to the existing
ocfs2 reflink ioctl We have to do things a little differently to support
the VFS semantics (we can clone subranges of a file but we don't clone
xattrs), but the VFS ioctls are more broadly supported.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Convert inline data files to extents files before reflinking,
and fix i_blocks so that stat(2) output is correct.
v3: Make zero-length dedupe consistent with btrfs behavior.
v4: Use VFS double-inode lock routines and remove MAX_DEDUPE_LEN.
2016-12-10 12:39:45 -08:00
Darrick J. Wong
86e59436d4 ocfs2: charge quota for reflinked blocks
When ocfs2 shares blocks from one file to another, it's necessary to
charge that many blocks to the quota because ocfs2 tallies block charges
according to the number of blocks mapped, not the number of physical
blocks used.

Without this patch, reflinking X blocks and then CoWing all of them
causes quota usage to *decrease* by X as seen in generic/305.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-10 12:39:45 -08:00
Darrick J. Wong
aef73a61c0 ocfs2: fix bad pointer cast
generic/188 triggered a dmesg stack trace because the dio completion
was casting a buffer head to an on-disk inode, which is whacky.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-10 12:39:45 -08:00
Darrick J. Wong
dbf896fc28 ocfs2: always unlock when completing dio writes
Always unlock the inode when completing dio writes, even if an error
has occurrred.  The caller already checks the inode and unlocks it
if needed, so we might as well reduce contention.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-10 12:39:45 -08:00
Darrick J. Wong
085549553d ocfs2: don't eat io errors during _dio_end_io_write
ocfs2_dio_end_io_write eats whatever errors may happen,
which means that write errors do not propagate to userspace.
Fix that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-10 12:39:45 -08:00
Darrick J. Wong
3e10b793fc ocfs2: budget for extent tree splits when adding refcount flag
When we're adding the refcount flag to an extent, we have to budget
enough space to handle a full extent btree split in addition to
whatever modifications have to be made to the refcount btree.  We
don't currently do this, with the result that generic/186 crashes
when we need an extent split but not a refcount split because meta_ac
never gets allocated.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-10 12:39:45 -08:00
Darrick J. Wong
06a7030581 ocfs2: prohibit refcounted swapfiles
The swapfile mechanism calls bmap once to find all the swap file
mappings, which means that we cannot properly support CoW remapping.
Therefore, error out if the swap code tries to call bmap on a
refcounted file.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-10 12:39:45 -08:00
Darrick J. Wong
86544fbd85 ocfs2: add newlines to some error messages
These two error messages are missing the trailing newline.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-10 12:39:45 -08:00
Darrick J. Wong
84e40080bd ocfs2: convert inode refcount test to a helper
Replace the open-coded inode refcount flag test with a helper function
to reduce the potential for bugs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-10 12:39:45 -08:00
Al Viro
04fff6416c simple_write_end(): don't zero in short copy into uptodate
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-10 14:25:19 -05:00
Al Viro
92e50d2d42 exofs: don't mess with simple_write_{begin,end}
... and don't zero anything on short copy; just unlock
and return 0 if that has happened on non-uptodate page.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-10 14:25:19 -05:00
Al Viro
77469c3f57 9p: saner ->write_end() on failing copy into non-uptodate page
If we had a short copy into an uptodate page, there's no reason
whatsoever to zero anything; OTOH, if that page had _not_ been
uptodate, we must have been trying to overwrite it completely
and got a short copy.  In that case, overwriting the end with
zeroes, marking uptodate and sending to server is just plain
wrong.  Just unlock, keep it non-uptodate and return 0.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-10 14:25:18 -05:00
Al Viro
43388b21e7 fix gfs2_stuffed_write_end() on short copies
a) the page is uptodate - ->write_begin() would either fail (in which
case we don't reach ->write_end()), or unstuff the inode, or find the
page already uptodate, or do a successful call of stuffed_readpage(),
which would've made it uptodate

b) zeroing the tail in pagecache is wrong.  kill -9 at the right time
while writing unmodified file contents to the same file should _not_
leave us in a situation when read() from the file will be reporting
it full of zeroes.  Especially since that effect will be transient -
at some later point the page will be evicted and then we'll be back
to the real file contents.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-10 14:25:18 -05:00
Al Viro
b9de313cf0 fix ceph_write_end()
don't zero on short copies; if the page was uptodate it's just plain
wrong, and if it wasn't we'll be better off just returning 0 and
buggering off.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-10 14:24:45 -05:00
Dan Carpenter
578620f451 ext4: return -ENOMEM instead of success
We should set the error code if kzalloc() fails.

Fixes: 67cf5b09a4 ("ext4: add the basic function for inline data support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2016-12-10 09:56:01 -05:00
Darrick J. Wong
7e6e1ef48f ext4: reject inodes with negative size
Don't load an inode with a negative size; this causes integer overflow
problems in the VFS.

[ Added EXT4_ERROR_INODE() to mark file system as corrupted. -TYT]

Fixes: a48380f769 (ext4: rename i_dir_acl to i_size_high)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2016-12-10 09:55:01 -05:00
Andreas Gruenbacher
dff25ddb48 nfs: add support for the umask attribute
Clients can set the umask attribute when creating files to cause the
server to apply it always except when inheriting permissions from the
parent directory.  That way, the new files will end up with the same
permissions as files created locally.

See https://tools.ietf.org/html/draft-ietf-nfsv4-umask-02 for more details.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-09 23:47:10 -05:00
Al Viro
c0cf3ef5e0 nfs_write_end(): fix handling of short copies
What matters when deciding if we should make a page uptodate is
not how much we _wanted_ to copy, but how much we actually have
copied.  As it is, on architectures that do not zero tail on
short copy we can leave uninitialized data in page marked uptodate.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-09 22:41:47 -05:00
Trond Myklebust
d9152114f7 pNFS/flexfiles: Ensure we have enough buffer for layoutreturn
The flexfiles client can piggyback both layout errors and layoutstats
as part of the layoutreturn. Both these payloads can get large, with
20 layout error entries taking up about 1.2K, and 4 layoutstats entries
taking up another 1K.
This patch allows a maximum payload of 4k by allocating a full page.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-09 20:26:59 -05:00
Trond Myklebust
5ba6a09e92 pNFS/flexfiles: Remove a redundant parameter in ff_layout_encode_ioerr()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-09 20:26:58 -05:00
Darrick J. Wong
876bec6f9b vfs: refactor clone/dedupe_file_range common functions
Hoist both the XFS reflink inode state and preparation code and the XFS
file blocks compare functions into the VFS so that ocfs2 can take
advantage of it for reflink and dedupe.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-09 16:18:30 -08:00
Christoph Hellwig
a76b5b0437 fs: try to clone files first in vfs_copy_file_range
A clone is a perfectly fine implementation of a file copy, so most
file systems just implement the copy that way.  Instead of duplicating
this logic move it to the VFS.  Currently btrfs and XFS implement copies
the same way as clones and there is no behavior change for them, cifs
only implements clones and grow support for copy_file_range with this
patch.  NFS implements both, so this will allow copy_file_range to work
on servers that only implement CLONE and be lot more efficient on servers
that implements CLONE and COPY.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-09 16:17:19 -08:00
Linus Torvalds
af9468db44 A fix for an issue with ->d_revalidate() in ceph, causing frequent
kernel crashes.  Marked for stable - it goes back to 4.6, but started
 popping up only in 4.8.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJYSo2yAAoJEEp/3jgCEfOLa1UH/j/nhhmy6bkvNTNQT9PuWmlu
 1cGG6JkJq/USEaSO+VXGSAjBCjCngNTYYXBo0IBCnkf11tuwagvz/9LSbvy9P+vu
 1IKwcJBFpgcEMEZWsYjVui9uFiDcLYiTPt4pux4tQ4vyj6HEFgioTg/430ApUEOS
 ywO1pjRz8RH0FlKhhcTRGOwVcwUzI/aRw7aLeflSwz3mDnh6ajp/8pjvxWf7AN+V
 Ih9LygjYNb4IdUcgN2G05z2qKLPfNAoBA+kRdEkOzecX2J0Db8Bu1bfZBxgOK+ui
 kpdVlFPkpULbwjlLLpvmOgy7FKgmLfdxuEuQol8hCu0KQ+buP/kZnbjg6QBeCtk=
 =1nK/
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.9-rc9' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A fix for an issue with ->d_revalidate() in ceph, causing frequent
  kernel crashes.

  Marked for stable - it goes back to 4.6, but started popping up only
  in 4.8"

* tag 'ceph-for-4.9-rc9' of git://github.com/ceph/ceph-client:
  ceph: don't set req->r_locked_dir in ceph_d_revalidate
2016-12-09 11:02:40 -08:00
Miklos Szeredi
d16744ec8a vfs: make generic_readlink() static
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09 16:45:04 +01:00
Miklos Szeredi
dfeef68862 vfs: remove ".readlink = generic_readlink" assignments
If .readlink == NULL implies generic_readlink().

Generated by:

to_del="\.readlink.*=.*generic_readlink"
for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09 16:45:04 +01:00
Miklos Szeredi
76fca90e9f vfs: default to generic_readlink()
If i_op->readlink is NULL, but i_op->get_link is set then vfs_readlink()
defaults to calling generic_readlink().

The IOP_DEFAULT_READLINK flag indicates that the above conditions are met
and the default action can be taken.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09 16:45:04 +01:00
Miklos Szeredi
fd4a0edf2a vfs: replace calling i_op->readlink with vfs_readlink()
Also check d_is_symlink() in callers instead of inode->i_op->readlink
because following patches will allow NULL ->readlink for symlinks.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09 16:45:04 +01:00
Miklos Szeredi
2a07a1f5ab proc/self: use generic_readlink
The /proc/self and /proc/self-thread symlinks have separate but identical
functionality for reading and following.  This cleanup utilizes
generic_readlink to remove the duplication.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-12-09 16:45:03 +01:00
Miklos Szeredi
6c988f5759 ecryptfs: use vfs_get_link()
Here again we are copying form one buffer to another, while jumping through
hoops to make kernel memory look like userspace memory.

For no good reason, since vfs_get_link() provides exactly what is needed.

As a bonus, now the security hook for readlink is also called on the
underlying inode.

Note: this can be called from link-following context.  But this is okay:

 - not in RCU mode

 - commit e54ad7f1ee ("proc: prevent stacking filesystems on top")

 - ecryptfs is *reading* the underlying symlink not following it, so the
   right security hook is being called

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
2016-12-09 16:45:03 +01:00
Chris Mason
e5d6b12fe1 Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors
btrfs_transaction_abort() has a WARN() to help us nail down whatever
problem lead to the abort.  But most of the time, we're aborting for EIO,
and the warning just adds noise.

Signed-off-by: Chris Mason <clm@fb.com>
2016-12-09 06:00:28 -08:00
Miklos Szeredi
3f9ca75516 bad_inode: add missing i_op initializers
New inode operations were forgotten to be added to bad_inode.  Most of the
time the op is checked for NULL before being called but marking the inode
bad and the check can race (very unlikely).

However in case of ->get_link() only DCACHE_SYMLINK_TYPE is checked before
calling the op, so there's no race and will definitely oops when trying to
follow links on such a beast.

Also remove comments about extinct ops.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
2016-12-09 11:57:43 +01:00
Dave Chinner
9807b773da Merge branch 'xfs-4.10-misc-fixes-4' into for-next 2016-12-09 16:56:26 +11:00
Eric Sandeen
9875258ca7 xfs: nuke unused tracepoint definitions
This is all unused code, so remove it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-09 16:49:54 +11:00
Darrick J. Wong
b24a978c37 xfs: use GPF_NOFS when allocating btree cursors
Use NOFS for allocating btree cursors, since they can be called
under the ilock.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-09 16:49:54 +11:00
Eryu Guan
0c187dc508 xfs: use xfs_vn_setattr_size to check on new size
Commit 6552321831 ("xfs: remove i_iolock and use i_rwsem in the
VFS inode instead") introduced a regression that truncate(2) doesn't
check on new size, so it succeeds even if the new size exceeds the
current resource limit. Because xfs_setattr_size() was used instead
of xfs_vn_setattr_size(), and the latter calls xfs_vn_change_ok()
first to do sanity check on permission and new size.

This is found by truncate03 test from ltp, and the following is a
simplified reproducer:

  #!/bin/bash
  dev=/dev/sda5
  mnt=/mnt/xfs

  mkfs -t xfs -f $dev
  mount $dev $mnt

  # set max file size to 16k
  ulimit -f 16
  truncate -s $((16 * 1024 + 1)) /mnt/xfs/testfile
  [ $? -eq 0 ] && echo "FAIL: truncate exceeded max file size"
  ulimit -f unlimited
  umount $mnt

Signed-off-by: Eryu Guan <eguan@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-09 16:49:54 +11:00
Dave Chinner
4cf4573d89 xfs: deprecate barrier/nobarrier mount option
We always perform integrity operations now, so these mount options
don't do anything. Deprecate them and mark them for removal in
in a year.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-09 16:49:54 +11:00
Dave Chinner
2291dab2c9 xfs: Always flush caches when integrity is required
There is no reason anymore for not issuing device integrity
operations when teh filesystem requires ordering or data integrity
guarantees. We should always issue cache flushes and FUA writes
where necessary and let the underlying storage optimise them as
necessary for correct integrity operation.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-09 16:49:54 +11:00
Eric Sandeen
2e1d23370e xfs: ignore leaf attr ichdr.count in verifier during log replay
When we create a new attribute, we first create a shortform
attribute, and try to fit the new attribute into it.
If that fails, we copy the (empty) attribute into a leaf attribute,
and do the copy again.  Thus there can be a transient state where
we have an empty leaf attribute.

If we encounter this during log replay, the verifier will fail.
So add a test to ignore this part of the leaf attr verification
during log replay.

Thanks as usual to dchinner for spotting the problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-09 16:49:47 +11:00
Fred Isaman
65990d1afb pNFS/flexfiles: Fix a deadlock on LAYOUTGET
We encountered a deadlock where the SEQUENCE that accompanied the
LAYOUTGET triggered a session drain, while ff_layout_alloc_lseg
triggered a GETDEVICEINFO.  The GETDEVICEINFO hung waiting for the
session drain, while the LAYOUTGET held the slot waiting for
alloc_lseg to finish.
  Avoid this by moving the call to nfs4_find_get_deviceid out of
ff_layout_alloc_lseg and into nfs4_ff_layout_prepare_ds.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
[dros@primarydata.com: pNFS/flexfiles: fix races in ff_layout_mirror_valid]
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-08 21:49:57 -05:00
Jeff Layton
c3f4688a08 ceph: don't set req->r_locked_dir in ceph_d_revalidate
This function sets req->r_locked_dir which is supposed to indicate to
ceph_fill_trace that the parent's i_rwsem is locked for write.
Unfortunately, there is no guarantee that the dir will be locked when
d_revalidate is called, so we really don't want ceph_fill_trace to do
any dcache manipulation from this context. Clear req->r_locked_dir since
it's clearly not safe to do that.

What we really want to know with d_revalidate is whether the dentry
still points to the same inode. ceph_fill_trace installs a pointer to
the inode in req->r_target_inode, so we can just compare that to
d_inode(dentry) to see if it's the same one after the lookup.

Also, since we aren't generally interested in the parent here, we can
switch to using a GETATTR to hint that to the MDS, which also means that
we only need to reserve one cap.

Finally, just remove the d_unhashed check. That's really outside the
purview of a filesystem's d_revalidate. If the thing became unhashed
while we're checking it, then that's up to the VFS to handle anyway.

Fixes: 200fd27c8f ("ceph: use lookup request to revalidate dentry")
Link: http://tracker.ceph.com/issues/18041
Reported-by: Donatas Abraitis <donatas.abraitis@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-12-08 14:32:16 +01:00
Jaegeuk Kim
5eba8c5d1f f2fs: fix to access nullified flush_cmd_control pointer
f2fs_sync_file()             remount_ro
 - f2fs_readonly
                               - destroy_flush_cmd_control
 - f2fs_issue_flush
   - no fcc pointer!

So, this patch doesn't free fcc in this case, but just stop its kernel thread
which sends flush commands.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-12-07 18:56:50 -08:00
Li Wang
64d2ab32ef vfs: fix put_compat_statfs64() does not handle errors
put_compat_statfs64() does NOT return -1 and setting errno to EOVERFLOW
when some variables(like: f_bsize) overflowed in the returned struct.

The reason is that the ubuf->f_blocks is __u64 type, it couldn't be
4bits as the judgement in put_comat_statfs64(). Here correct the
__u32 variables(in struct compat_statfs64) for comparison.

reproducer:
step1. mount hugetlbfs with two different pagesize on ppc64 arch.

$ hugeadm --pool-pages-max 16M:0
$ hugeadm --create-mount
$ mount | grep -i hugetlbfs
none on /var/lib/hugetlbfs/pagesize-16MB type hugetlbfs (rw,relatime,seclabel,pagesize=16777216)
none on /var/lib/hugetlbfs/pagesize-16GB type hugetlbfs (rw,relatime,seclabel,pagesize=17179869184)

step2. compile & run this C program.

$ cat statfs64_test.c

 #define _LARGEFILE64_SOURCE
 #include <stdio.h>
 #include <sys/syscall.h>
 #include <sys/statfs.h>

 int main()
 {
	struct statfs64 sb;
	int err;

	err = syscall(SYS_statfs64, "/var/lib/hugetlbfs/pagesize-16GB", sizeof(sb), &sb);
	if (err)
		return -1;

	printf("sizeof f_bsize = %d, f_bsize=%ld\n", sizeof(sb.f_bsize), sb.f_bsize);

	return 0;
 }

$ gcc -m32 statfs64_test.c
$ ./a.out
sizeof f_bsize = 4, f_bsize=0

Signed-off-by: Li Wang <liwang@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-07 17:44:38 -05:00
Jaegeuk Kim
a2125ff7dd f2fs: free meta pages if sanity check for ckpt is failed
This fixes missing freeing meta pages in the error case.

Tested-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-12-07 14:38:16 -08:00
Jaegeuk Kim
2040fce83f f2fs: detect wrong layout
Previous mkfs.f2fs allows small partition inappropriately, so f2fs should detect
that as well.

Refer this in f2fs-tools.

mkfs.f2fs: detect small partition by overprovision ratio and # of segments

Reported-and-Tested-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-12-07 14:37:33 -08:00
Trond Myklebust
2f065ddb64 pNFS: Layoutreturn must free the layout after the layout-private data
The layout-private data may depend on the layout and/or the inode
still existing when it does post-processing and frees its data, so we
need to free them after calling lrp->ld_private.ops->free().

This fixes a mirror list corruption issue in the flexfiles driver.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-07 13:41:59 -05:00
Trond Myklebust
cb06793517 pNFS/flexfiles: Fix ff_layout_add_ds_error_locked()
When we're merging an old entry into our new entry, we want to ensure that
we add the list entry in the correct place.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-07 13:41:58 -05:00
NeilBrown
7a0566b38c NFSv4: Add missing nfs_put_lock_context()
Otherwise the lock context won't be freed when we're done with it.

From: NeilBrown <neilb@suse.com>
Fixes: 5bd3f817 ("NFSv4: change nfs4_select_rw_stateid to take a lock_context inplace of lock_owner")
Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-07 13:41:58 -05:00
Darrick J. Wong
b46dc03381 ext2: reject inodes with negative size
Don't load an inode with a negative size; this causes integer overflow
problems in the VFS.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-12-07 13:03:31 +01:00
Dave Chinner
a444d72e60 Merge branch 'xfs-4.10-misc-fixes-3' into for-next 2016-12-07 17:42:30 +11:00
Lucas Stach
6031e73a5b xfs: use rhashtable to track buffer cache
On filesystems with a lot of metadata and in metadata intensive workloads
xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
the CPU time is spent on CPU cache misses while traversing the rbtree.

As the buffer cache does not need any kind of ordering, but fast lookups
a hashtable is the natural data structure to use. The rhashtable
infrastructure provides a self-scaling hashtable implementation and
allows lookups to proceed while the table is going through a resize
operation.

This reduces the CPU-time spent for the lookups to 1/3 even for small
filesystems with a relatively small number of cached buffers, with
possibly much larger gains on higher loaded filesystems.

[dchinner: reduce minimum hash size to an acceptable size for large
	   filesystems with many AGs with no active use.]
[dchinner: remove stale rbtree asserts.]
[dchinner: use xfs_buf_map for compare function argument.]
[dchinner: make functions static.]
[dchinner: remove redundant comments.]

Signed-off-by: Lucas Stach <dev@lynxeye.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-07 17:36:36 +11:00
Miklos Szeredi
c01638f5d9 fuse: fix clearing suid, sgid for chown()
Basically, the pjdfstests set the ownership of a file to 06555, and then
chowns it (as root) to a new uid/gid. Prior to commit a09f99edde ("fuse:
fix killing s[ug]id in setattr"), fuse would send down a setattr with both
the uid/gid change and a new mode.  Now, it just sends down the uid/gid
change.

Technically this is NOTABUG, since POSIX doesn't _require_ that we clear
these bits for a privileged process, but Linux (wisely) has done that and I
think we don't want to change that behavior here.

This is caused by the use of should_remove_suid(), which will always return
0 when the process has CAP_FSETID.

In fact we really don't need to be calling should_remove_suid() at all,
since we've already been indicated that we should remove the suid, we just
don't want to use a (very) stale mode for that.

This patch should fix the above as well as simplify the logic.

Reported-by: Jeff Layton <jlayton@redhat.com> 
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: a09f99edde ("fuse: fix killing s[ug]id in setattr")
Cc: <stable@vger.kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2016-12-06 16:18:45 +01:00
David Sterba
34441361c4 btrfs: opencode chunk locking, remove helpers
The helpers are trivial and we don't use them consistently.

Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:07:00 +01:00
Jeff Mahoney
3a45bb207e btrfs: remove root parameter from transaction commit/end routines
Now we only use the root parameter to print the root objectid in
a tracepoint.  We can use the root parameter from the transaction
handle for that.  It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:07:00 +01:00
Jeff Mahoney
bf89d38feb btrfs: split btrfs_wait_marked_extents into normal and tree log functions
btrfs_write_and_wait_marked_extents and btrfs_sync_log both call
btrfs_wait_marked_extents, which provides a core loop and then handles
errors differently based on whether it's it's a log root or not.

This means that btrfs_write_and_wait_marked_extents needs to take a root
because btrfs_wait_marked_extents requires one, even though it's only
used to determine whether the root is a log root.  The log root code
won't ever call into the transaction commit code using a log root, so we
can factor out the core loop and provide the error handling appropriate
to each waiter in new routines.  This allows us to eventually remove
the root argument from btrfs_commit_transaction, and as a result,
btrfs_end_transaction.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:07:00 +01:00
Jeff Mahoney
2ff7e61e0d btrfs: take an fs_info directly when the root is not used otherwise
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer.  Let's convert those to
just accept an fs_info pointer directly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney
afdb571890 btrfs: simplify btrfs_wait_cache_io prototype
With the exception of the one case where btrfs_wait_cache_io is called
without a block group, it's called with the same arguments.  The root
argument is only used in the special case, so let's factor out the core
and simplify the call in the normal case to require a trans, block group,
and path.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney
71ff6437c2 btrfs: convert extent-tree tracepoints to use fs_info
The extent-tree tracepoints all operate on the extent root, regardless of
which root is passed in.  Let's just use the extent root objectid instead.
If it turns out that nobody is depending on the format of this tracepoint,
we can drop the root printing entirely.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney
ccdf9b305a btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
This results in btrfs_assert_delayed_root_empty and
btrfs_destroy_delayed_inode taking an fs_info instead of a root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney
0b246afa62 btrfs: root->fs_info cleanup, add fs_info convenience variables
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable.  This makes the code considerably
more readable.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney
6202df6921 btrfs: root->fs_info cleanup, update_block_group{,flags}
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
3796d33535 btrfs: root->fs_info cleanup, lock/unlock_chunks
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
27965b6c2c btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
da17066c40 btrfs: pull node/sector/stripe sizes out of root and into fs_info
We track the node sizes per-root, but they never vary from the values
in the superblock.  This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
f15376df0d btrfs: root->fs_info cleanup, io_ctl_init
The io_ctl->root member was only being used to access root->fs_info.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
fb456252d3 btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney
c28f158e5e btrfs: struct reada_control.root -> reada_control.fs_info
The root is never used.  We substitute extent_root in for the
reada_find_extent call, since it's only ever used to obtain the node
size.  This call site will be changed to use fs_info in a later patch.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:57 +01:00
Jeff Mahoney
de14379225 btrfs: struct btrfsic_state->root should be an fs_info
The root member is never used except for obtaining an fs_info pointer.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:57 +01:00
Jeff Mahoney
2b2e27eb92 btrfs: alloc_reserved_file_extent trace point should use extent_root
Even though a separate root is passed in, we're still operating on the
extent root.  Let's use that for the trace point.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:57 +01:00
Jeff Mahoney
5112febbc7 btrfs: btrfs_init_new_device should use fs_info->dev_root
btrfs_init_new_device only uses the root passed in via the ioctl to
start the transaction.  Nothing else that happens is related to whatever
root the user used to initiate the ioctl.  We can drop the root requirement
and just use fs_info->dev_root instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:57 +01:00
Jeff Mahoney
6bccf3ab1e btrfs: call functions that always use the same root with fs_info instead
There are many functions that are always called with the same root
argument.  Rather than passing the same root every time, we can
pass an fs_info pointer instead and have the function get the root
pointer itself.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:57 +01:00
Jeff Mahoney
5b4aacefb8 btrfs: call functions that overwrite their root parameter with fs_info
There are 11 functions that accept a root parameter and immediately
overwrite it.  We can pass those an fs_info pointer instead.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:57 +01:00
Trond Myklebust
362fb578a5 pNFS: Release NFS_LAYOUT_RETURN when invalidating the layout stateid
Ensure we release the NFS_LAYOUT_RETURN lock when we invalidate the
layout stateid, so that processes and RPC tasks that are waiting on
the layout return can continue.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-05 22:52:01 -05:00
Al Viro
8f64fb1cce namei: fold should_follow_link() with the step into not-followed link
All callers are followed by the same boilerplate - "if it has returned
0, update nd->path/inode/seq - we are not following a symlink here".
Pull it into the function itself, renaming it into step_into().
Rename WALK_GET to WALK_FOLLOW, while we are at it - more descriptive
name.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:11:58 -05:00
Al Viro
31d66bcd3f namei: pass both WALK_GET and WALK_MORE to should_follow_link()
... and pull put_link() logics into it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:11:58 -05:00
Al Viro
1c4ff1a87e namei: invert WALK_PUT logics
... turning the condition for put_link() in walk_component() into
"WALK_MORE not passed and depth is non-zero".  Again, makes for
simpler arguments.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:11:57 -05:00
Al Viro
7f49b47109 namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
Simplifies the arguments both for it and for walk_component()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:11:57 -05:00
Al Viro
ba8f46135a namei: saner calling conventions for mountpoint_last()
leave the result in nd->path, have caller do follow_mount() and
copy it to the final destination.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:11:57 -05:00
Al Viro
c1d4dd2767 namei.c: get rid of user_path_parent()
direct use of filename_parentat() is just as readable

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:11:57 -05:00
Al Viro
f0bb5aaf2c vfs: misc struct path constification
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:03:49 -05:00
Al Viro
ca71cf71ee namespace.c: constify struct path passed to a bunch of primitives
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:03:12 -05:00
Al Viro
8c54ca9c68 quota: constify struct path in quota_on
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:03:06 -05:00
Al Viro
a4141d7cf8 constify alloc_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:01:16 -05:00
Al Viro
92872094a1 constify btrfs_mksubvol()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:01:16 -05:00
Al Viro
5b5577e4eb autofs: constify find_autofs_mount() callback
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:01:16 -05:00
Al Viro
71215a75ce constify get_dcookie() and friends
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 19:01:16 -05:00
Al Viro
12c7f9dc0f constify fsnotify_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 18:58:32 -05:00
Al Viro
e637835ecc fsnotify(): constify 'data'
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 18:58:31 -05:00
Al Viro
3cd5eca8d7 fsnotify: constify 'data' passed to ->handle_event()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 18:58:31 -05:00
Mickaël Salaün
640eb7e7b5 fs: Constify path_is_under()'s arguments
The function path_is_under() doesn't modify the paths pointed by its
arguments but only browse them. Constifying this pointers make a cleaner
interface to be used by (future) code which may only have access to
const struct path pointers (e.g. LSM hooks).

Signed-off-by: Mickaël Salaün <mic@digikod.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 18:55:47 -05:00
Pavel Shilovsky
96a988ffeb CIFS: Fix a possible double locking of mutex during reconnect
With the current code it is possible to lock a mutex twice when
a subsequent reconnects are triggered. On the 1st reconnect we
reconnect sessions and tcons and then persistent file handles.
If the 2nd reconnect happens during the reconnecting of persistent
file handles then the following sequence of calls is observed:

cifs_reopen_file -> SMB2_open -> small_smb2_init -> smb2_reconnect
-> cifs_reopen_persistent_file_handles -> cifs_reopen_file (again!).

So, we are trying to acquire the same cfile->fh_mutex twice which
is wrong. Fix this by moving reconnecting of persistent handles to
the delayed work (smb2_reconnect_server) and submitting this work
every time we reconnect tcon in SMB2 commands handling codepath.

This can also lead to corruption of a temporary file list in
cifs_reopen_persistent_file_handles() because we can recursively
call this function twice.

Cc: Stable <stable@vger.kernel.org> # v4.9+
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-12-05 12:52:01 -08:00
Pavel Shilovsky
53e0e11efe CIFS: Fix a possible memory corruption during reconnect
We can not unlock/lock cifs_tcp_ses_lock while walking through ses
and tcon lists because it can corrupt list iterator pointers and
a tcon structure can be released if we don't hold an extra reference.
Fix it by moving a reconnect process to a separate delayed work
and acquiring a reference to every tcon that needs to be reconnected.
Also do not send an echo request on newly established connections.

CC: Stable <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-12-05 12:08:33 -08:00
Jaegeuk Kim
f455c8a5f0 f2fs: call sync_fs when f2fs is idle
The sync_fs in f2fs_balance_fs_bg must avoid interrupting current user requests.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-12-05 11:44:07 -08:00
Jaegeuk Kim
204706c7ac Revert "f2fs: use percpu_counter for # of dirty pages in inode"
This reverts commit 1beba1b3a9.

The perpcu_counter doesn't provide atomicity in single core and consume more
DRAM. That incurs fs_mark test failure due to ENOMEM.

Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2016-12-05 11:43:59 -08:00
Al Viro
cbbd26b8b1 [iov_iter] new primitives - copy_from_iter_full() and friends
copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.

Convert some obvious users.  *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case.  Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 14:33:36 -05:00
Pavel Shilovsky
e3d240e9d5 CIFS: Fix a possible memory corruption in push locks
If maxBuf is not 0 but less than a size of SMB2 lock structure
we can end up with a memory corruption.

Cc: Stable <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-12-05 11:08:55 -08:00
Pavel Shilovsky
4772c79599 CIFS: Fix missing nls unload in smb2_reconnect()
Cc: Stable <stable@vger.kernel.org>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-12-05 11:08:40 -08:00
Dave Chinner
cae028df53 xfs: optimise CRC updates
Nick Piggin reported that the CRC overhead in an fsync heavy
workload was higher than expected on a Power8 machine. Part of this
was to do with the fact that the power8 CRC implementation is not
efficient for CRC lengths of less than 512 bytes, and so the way we
split the CRCs over the CRC field means a lot of the CRCs are
reduced to being less than than optimal size.

To optimise this, change the CRC update mechanism to zero the CRC
field first, and then compute the CRC in one pass over the buffer
and write the result back into the buffer. We can do this safely
because anything writing a CRC has exclusive access to the buffer
the CRC is being calculated over.

We leave the CRC verify code the same - it still splits the CRC
calculation - because we do not want read-only operations modifying
the underlying buffer. This is because read-only operations may not
have an exclusive access to the buffer guaranteed, and so temporary
modifications could leak out to to other processes accessing the
buffer concurrently.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 14:40:32 +11:00
Dave Chinner
11ef38afe9 xfs: make xfs btree stats less huge
Embedding a switch statement in every btree stats inc/add adds a lot
of code overhead to the core btree infrastructure paths. Stats are
supposed to be small and lightweight, but the btree stats have
become big and bloated as we've added more btrees. It needs fixing
because the reflink code will just add more overhead again.

Convert the v2 btree stats to arrays instead of independent
variables, and instead use the type to index the specific btree
array via an enum. This allows us to use array based indexing
to update the stats, rather than having to derefence variables
specific to the btree type.

If we then wrap the xfsstats structure in a union and place uint32_t
array beside it, and calculate the correct btree stats array base
array index when creating a btree cursor,  we can easily access
entries in the stats structure without having to switch names based
on the btree type.

We then replace with the switch statement with a simple set of stats
wrapper macros, resulting in a significant simplification of the
btree stats code, and:

   text	   data	    bss	    dec	    hex	filename
  48905	    144	      8	  49057	   bfa1	fs/xfs/libxfs/xfs_btree.o.old
  36793	    144	      8	  36945	   9051	fs/xfs/libxfs/xfs_btree.o

it reduces the core btree infrastructure code size by close to 25%!

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 14:38:58 +11:00
Darrick J. Wong
1bb33a9870 xfs: don't cap maximum dedupe request length
After various discussions on linux-fsdevel, it has been decided that it
is not necessary to cap the length of a dedupe request, and that
correctly-written userspace client programs will be able to absorb the
change.  Therefore, remove the length clamping behavior.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:38:57 +11:00
Darrick J. Wong
ef388e2054 xfs: don't allow di_size with high bit set
The on-disk field di_size is used to set i_size, which is a signed
integer of loff_t.  If the high bit of di_size is set, we'll end up with
a negative i_size, which will cause all sorts of problems.  Since the
VFS won't let us create a file with such length, we should catch them
here in the verifier too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:38:38 +11:00
Darrick J. Wong
0f352f8ee8 xfs: error out if trying to add attrs and anextents > 0
We shouldn't assert if somehow we end up trying to add an attr fork to
an inode that apparently already has attr extents because this is an
indication of on-disk corruption.  Instead, return an error code to
userspace.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:38:11 +11:00
Darrick J. Wong
96a3aefb8f xfs: don't crash if reading a directory results in an unexpected hole
In xfs_dir3_data_read, we can encounter the situation where err == 0 and
*bpp == NULL if the given bno offset happens to be a hole; this leads to
a crash if we try to set the buffer type after the _da_read_buf call.
Holes can happen due to corrupt or malicious entries in the bmbt data,
so be a little more careful when we're handling buffers.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:37:47 +11:00
Darrick J. Wong
356a322522 xfs: complain if we don't get nextents bmap records
When reading into memory all extents of a btree-format inode fork,
complain if the number of extents we find is not the same as the number
of extents reported in the inode core.  This is needed to stop an IO
action from accessing the garbage areas of the in-core fork.

[dchinner: removed redundant assert]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:36:56 +11:00
Darrick J. Wong
bb3be7e7c1 xfs: check for bogus values in btree block headers
When we're reading a btree block, make sure that what we retrieved
matches the owner and level; and has a plausible number of records.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:33:54 +11:00
Darrick J. Wong
d2a047f31e xfs: forbid AG btrees with level == 0
There is no such thing as a zero-level AG btree since even a single-node
zero-records btree has one level.  Btree cursor constructors read
cur_nlevels straight from disk and then access things like
cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero!
Therefore, strengthen the verifiers to prevent this possibility.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:32:50 +11:00
Eric Sandeen
f7a136aee3 xfs: several xattr functions can be void
There are a handful of xattr functions which now return
nothing but zero.  They can be made void, chased through calling
functions, and error handling etc can be removed.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:32:14 +11:00
Eric Sandeen
c44a1f2262 xfs: handle cow fork in xfs_bmap_trace_exlist
By inspection, xfs_bmap_trace_exlist isn't handling cow forks,
and will trace the data fork instead.

Fix this by setting state appropriately if whichfork
== XFS_COW_FORK.

()___()
< @ @ >
 |   |
 {o_o}
  (|)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:32:00 +11:00
Eric Sandeen
7710517fc3 xfs: pass state not whichfork to trace_xfs_extlist
When xfs_bmap_trace_exlist called trace_xfs_extlist,
it sent in the "whichfork" var instead of the bmap "state"
as expected (even though state was already set up for this
purpose).

As a result, the xfs_bmap_class in tracing code used
"whichfork" not state in xfs_iext_state_to_fork(), and got
the wrong ifork pointer.  It all goes downhill from
there, including an ASSERT when ifp_bytes is empty
by the time it reaches xfs_iext_get_ext():

XFS: Assertion failed: idx < ifp->if_bytes / sizeof(xfs_bmbt_rec_t)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:31:50 +11:00
Eric Sandeen
200237d674 xfs: Move AGI buffer type setting to xfs_read_agi
We've missed properly setting the buffer type for
an AGI transaction in 3 spots now, so just move it
into xfs_read_agi() and set it if we are in a transaction
to avoid the problem in the future.

This is similar to how it is done in i.e. the dir3
and attr3 read functions.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:31:31 +11:00
Eric Sandeen
6b10b23ca9 xfs: set AGI buffer type in xlog_recover_clear_agi_bucket
xlog_recover_clear_agi_bucket didn't set the
type to XFS_BLFT_AGI_BUF, so we got a warning during log
replay (or an ASSERT on a debug build).

    XFS (md0): Unknown buffer type 0!
    XFS (md0): _xfs_buf_ioapply: no ops on block 0xaea8802/0x1

Fix this, as was done in f19b872b for 2 other locations
with the same problem.

cc: <stable@vger.kernel.org> # 3.10 to current
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-12-05 12:31:06 +11:00
Trond Myklebust
d94cbf6c73 NFSv4.1: Don't schedule lease recovery in nfs4_schedule_session_recovery()
If the session has an error, then we want to start by recovering the
session, as any SEQUENCE we send is going to fail with a session
error.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-04 19:34:38 -05:00
Trond Myklebust
2cf10cdd48 NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE
In the case where SEQUENCE receives a NFS4ERR_BADSESSION or
NFS4ERR_DEADSESSION error, we just want to report the session as needing
recovery, and then we want to retry the operation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-04 19:26:40 -05:00
Trond Myklebust
1cd9cb05f9 NFS: Only look at the change attribute cache state in nfs_check_verifier
When looking at whether or not our dcache is valid, we really don't care
about the general state of the directory attribute cache. Instead, we
we only care about the state of the change attribute.

This fixes a performance issue when the client is responsible for
changing the directory contents; a number of NFSv4 operations will
atomically update the directory change attribute, but may not return
all the other attributes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-04 18:34:34 -05:00
Al Viro
450630975d don't open-code file_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-04 18:29:28 -05:00
Trond Myklebust
9310b224f2 NFS: Fix incorrect size revalidation when holding a delegation
We should only care about checking the attributes if the page cache
is marked as dubious (using NFS_INO_REVAL_PAGECACHE) and the
NFS_INO_REVAL_FORCED flag is set.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-04 18:08:40 -05:00
Trond Myklebust
10727772b9 NFS: Fix incorrect mapping revalidation when holding a delegation
We should only care about checking the attributes if the page cache
is marked as dubious (using NFS_INO_REVAL_PAGECACHE) and the
NFS_INO_REVAL_FORCED flag is set.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-04 16:50:09 -05:00
Ian Kent
1c4344a50d autofs - dont hold spin lock over direct mount expire
Commit 7cbdb4a286 altered the autofs indirect mount expire to
not hold a spin lock during the expire check.

The direct mount expire needs the same treatment because to
make autofs expires namespace aware may_umount_tree() needs to
to use a similar method to may_umount() when checking if a mount
tree is in use.

This means may_umount_tree() will end up taking the namespace_sem
for the check so the autofs direct mount expire won't be allowed
to hold a spin lock over the check.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-03 20:51:50 -05:00
Ian Kent
455e8f1030 autofs - constify misc struct path instances
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-03 20:51:50 -05:00
Ian Kent
f74e7b33c3 vfs: remove unused have_submounts() function
Now that path_has_submounts() has been added have_submounts() is no
longer used so remove it.

Link: http://lkml.kernel.org/r/20161011053428.27645.12310.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-03 20:51:49 -05:00
Ian Kent
6035974147 autofs: use path_has_submounts() to fix unreliable have_submount() checks
If an automount mount is clone(2)ed into a file system that is propagation
private, when it later expires in the originating namespace, subsequent
calls to autofs ->d_automount() for that dentry in the original namespace
will return ELOOP until the mount is umounted in the cloned namespace.

Now that a struct path is available where needed use path_has_submounts()
instead of have_submounts() so we don't get false positives when checking
if a dentry is a mount point or contains mounts in the current namespace.

Link: http://lkml.kernel.org/r/20161011053423.27645.91233.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-03 20:51:49 -05:00
Ian Kent
cfaf86ab6c autofs: use path_is_mountpoint() to fix unreliable d_mountpoint() checks
If an automount mount is clone(2)ed into a file system that is propagation
private, when it later expires in the originating namespace, subsequent
calls to autofs ->d_automount() for that dentry in the original namespace
will return ELOOP until the mount is umounted in the cloned namespace.

Now that a struct path is available where needed use path_is_mountpoint()
instead of d_mountpoint() so we don't get false positives when checking if
a dentry is a mount point in the current namespace.

Link: http://lkml.kernel.org/r/20161011053418.27645.15241.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-03 20:51:48 -05:00
Ian Kent
dd36a882e7 autofs: change autofs4_wait() to take struct path
In order to use the functions path_is_mountpoint() and path_has_submounts()
autofs needs to pass a struct path in several places.

Now change autofs4_wait() to take a struct path instead of a struct
dentry.

Link: http://lkml.kernel.org/r/20161011053413.27645.84666.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-03 20:51:48 -05:00
Ian Kent
74f504cff5 autofs: change autofs4_expire_wait()/do_expire_wait() to take struct path
In order to use the functions path_is_mountpoint() and path_has_submounts()
autofs needs to pass a struct path in several places.

Start by changing autofs4_expire_wait() and do_expire_wait() to take
a struct path instead of a struct dentry.

Link: http://lkml.kernel.org/r/20161011053408.27645.40091.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-03 20:51:47 -05:00
Ian Kent
01619491a5 vfs: add path_has_submounts()
d_mountpoint() can only be used reliably to establish if a dentry is
not mounted in any namespace. It isn't aware of the possibility there
may be multiple mounts using the given dentry, possibly in a different
namespace.

Add function, path_has_submounts(), that checks is a struct path contains
mounts (or is a mountpoint itself) to handle this case.

Link: http://lkml.kernel.org/r/20161011053403.27645.55242.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-03 20:51:47 -05:00
Ian Kent
c6609c0a1c vfs: add path_is_mountpoint() helper
d_mountpoint() can only be used reliably to establish if a dentry is
not mounted in any namespace. It isn't aware of the possibility there
may be multiple mounts using a given dentry that may be in a different
namespace.

Add helper functions, path_is_mountpoint(), that checks if a struct path
is a mountpoint for this case.

Link: http://lkml.kernel.org/r/20161011053358.27645.9729.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-03 20:51:35 -05:00
Dan Carpenter
011c88e36c ext4: remove another test in ext4_alloc_file_blocks()
Before commit c3fe493ccd ('ext4: remove unneeded test in
ext4_alloc_file_blocks()') then it was possible for "depth" to be -1
but now, it's not possible that it is negative.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-12-03 16:46:58 -05:00
Jan Kara
ab04df7818 ext4: fix checks for data=ordered and journal_async_commit options
Combination of data=ordered mode and journal_async_commit mount option
is invalid. However the check in parse_options() fails to detect the
case where we simply end up defaulting to data=ordered mode and we
detect the problem only on remount which triggers hard to understand
failure to remount the filesystem.

Fix the checking of mount options to take into account also the default
mode by moving the check somewhat later in the mount sequence.

Reported-by: Wolfgang Walter <linux@stwm.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-03 16:20:53 -05:00
Eric Biggers
b649668c0b mbcache: document that "find" functions only return reusable entries
mb_cache_entry_find_first() and mb_cache_entry_find_next() only return
cache entries with the 'e_reusable' bit set.  This should be documented.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-12-03 15:55:01 -05:00
Eric Biggers
132d4e2d55 mbcache: use consistent type for entry count
mbcache used several different types to represent the number of entries
in the cache.  For consistency within mbcache and with the shrinker API,
always use unsigned long.

This does not change behavior for current mbcache users (ext2 and ext4)
since they limit the entry count to a value which easily fits in an int.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-12-03 15:43:48 -05:00
Eric Biggers
97c7b18a5d mbcache: remove unnecessary module_get/module_put
When mbcache is built as a module, any modules that use it (ext2 and/or
ext4) will depend on its symbols directly, incrementing its reference
count.  Therefore, there is no need to do module_get/module_put.

Also note that since the module_get/module_put were in the mbcache
module itself, executing those lines of code was already dependent on
another reference to the mbcache module being held.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-12-03 15:38:29 -05:00
Trond Myklebust
230bc962a6 pNFS/flexfiles: Support sending layoutstats in layoutreturn
Add the ability to send an array of layoutstats entries as part of
layoutreturn.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-03 15:37:46 -05:00
Trond Myklebust
422c93c881 pNFS/flexfiles: Minor refactoring before adding iostats to layoutreturn
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-03 15:37:45 -05:00
Trond Myklebust
2f8220c16e NFS: Fix up read of mirror stats
Need to lock while reading in order to ensure 64-bit reads are correct.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-03 15:37:44 -05:00
Trond Myklebust
08e2e5bc6c pNFS/flexfiles: Clean up layoutstats
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-03 15:37:44 -05:00
Trond Myklebust
5b9b3c855a pNFS/flexfiles: Refactor encoding of the layoutreturn payload
Add the layout error payload to the flexfiles layoutreturn private
data, and set up the encoding mechanisms. This is a refactoring in
preparation for adding the layout iostats payload.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-03 15:37:43 -05:00
Eric Biggers
21d0f4fa8e mbcache: don't BUG() if entry cache cannot be allocated
mbcache can be a module that is loaded long after startup, when someone
asks to mount an ext2 or ext4 filesystem.  Therefore it should not BUG()
if kmem_cache_create() fails, but rather just fail the module load.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-12-03 15:28:53 -05:00
Eric Biggers
918b7306ed mbcache: correctly handle 'e_referenced' bit
mbcache entries have an 'e_referenced' bit which users can set with
mb_cache_entry_touch() to indicate that an entry should be given another
pass through the LRU list before the shrinker can delete it.  However,
mb_cache_shrink() actually would, when seeing an e_referenced entry at
the front of the list (the least-recently used end), place it right at
the front of the list again.  The next iteration would then remove the
entry from the list and delete it.  Consequently, e_referenced had
essentially no effect, so ext2/ext4 xattr blocks would sometimes not be
reused as often as expected.

Fix this by making the shrinker move e_referenced entries to the back of
the list rather than the front.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2016-12-03 15:13:15 -05:00
Trond Myklebust
287bd3e954 pNFS: Add a layoutreturn callback to performa layout-private setup
Add a callback to allow the flexfiles layout driver to initialise the
layout private payload.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-03 13:12:16 -05:00
David S. Miller
2745529ac7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Couple conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the
   RX tail pointer properly overlapped with some changes
   to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
   overlapping with a reorganization of the driver to support
   ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the
   stmmac driver, meanwhile a lot of this code was cleaned up
   and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in
   'net-next' called __fl_delete() and this overlapped with
   Daniel Borkamann's bug fix to use RCU for object destruction
   in 'net'.  It also overlapped with Jiri's change to guard
   the rhashtable_remove_fast() call with a check against
   tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some
   unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head()
   bug fix in 'net' overlapped with a large reorganization of
   the same code in 'net-next'.  Since the 'net-next' code no
   longer had the bug in question, there was nothing to do
   other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 12:29:53 -05:00
Trond Myklebust
4d796d751c pNFS: Allow layout drivers to manage private data in struct nfs4_layoutreturn
Cleanup to allow layout drivers to attach private data to layoutreturn,
and manage the data.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-02 23:37:45 -05:00
Ian Kent
fb5f51c742 vfs: change d_manage() to take a struct path
For the autofs module to be able to reliably check if a dentry is a
mountpoint in a multiple namespace environment the ->d_manage() dentry
operation will need to take a path argument instead of a dentry.

Link: http://lkml.kernel.org/r/20161011053352.27645.83962.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-02 22:15:53 -05:00
Pavel Shilovsky
b0a752b5ce CIFS: Decrease verbosity of ioctl call
Cc: Stable <stable@vger.kernel.org> # v4.9+
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-12-02 16:04:33 -08:00
Theodore Ts'o
4db0d88e2e ext4: fix reading new encrypted symlinks on no-journal file systems
On a filesystem with no journal, a symlink longer than about 32
characters (exact length depending on padding for encryption) could not
be followed or read immediately after being created in an encrypted
directory.  This happened because when the symlink data went through the
delayed allocation path instead of the journaling path, the symlink was
incorrectly detected as a "fast" symlink rather than a "slow" symlink
until its data was written out.

To fix this, disable delayed allocation for symlinks, since there is
no benefit for delayed allocation anyway.

Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-02 12:12:53 -05:00
Trond Myklebust
06946c6a3d pNFS/flexfiles: Only send layoutstats updates for mirrors that were updated
If there have been no reads or writes to a given mirror since the last
layoutstats update, then don't resend the same data.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-02 11:42:58 -05:00
Trond Myklebust
46c98c6d1b pNFS/flexfiles: Don't attempt to send layoutstats if there are no entries
If the list of mirrors is empty, then don't send an RPC call.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-02 11:42:58 -05:00
Trond Myklebust
1bcf4c5c59 NFS: Allow getattr to also report readdirplus cache hits
If the use called stat() on an 'ls -l' workload, and the attribute
cache was successfully revalidate by READDIRPLUS, then we want to
report that back so that the readdir code continues to use
readdirplus.

Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-02 11:42:51 -05:00
Trond Myklebust
63519fbc67 NFS: Be more targeted about readdirplus use when doing lookup/revalidation
There is little point in setting NFS_INO_ADVISE_RDPLUS in nfs_lookup and
nfs_lookup_revalidate() unless a process is actually doing readdir on the
parent directory.
Furthermore, there is little point in using readdirplus if we're trying
to revalidate a negative dentry.

Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-02 11:42:51 -05:00
Trond Myklebust
79f687a3de NFS: Fix a performance regression in readdir
Ben Coddington reports that commit 311324ad17, by adding the function
nfs_dir_mapping_need_revalidate() that checks page cache validity on
each call to nfs_readdir() causes a performance regression when
the directory is being modified.

If the directory is changing while we're iterating through the directory,
POSIX does not require us to invalidate the page cache unless the user
calls rewinddir(). However, we still do want to ensure that we use
readdirplus in order to avoid a load of stat() calls when the user
is doing an 'ls -l' workload.

The fix should be to invalidate the page cache immediately when we're
setting the NFS_INO_ADVISE_RDPLUS bit.

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 311324ad17 ("NFS: Be more aggressive in using readdirplus...")
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-02 11:42:50 -05:00
Wei Yongjun
f36ab161be NFS: fix typo in parameter description
Fix typo in parameter description.

Fixes: 5405fc44c3 ("NFSv4.x: Add kernel parameter to control the
callback server")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 18:00:11 -05:00
NeilBrown
d51fdb87a6 NFS: discard nfs_lockowner structure.
It now has only one field and is only used in one structure.
So replaced it in that structure by the field it contains.

Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:58:13 -05:00
NeilBrown
8d42443166 NFSv4: enhance nfs4_copy_lock_stateid to use a flock stateid if there is one
A process can have two possible lock owner for a given open file:
a per-process Posix lock owner and a per-open-file flock owner
Use both of these when searching for a suitable stateid to use.

With this patch, READ/WRITE requests will use the correct stateid
if a flock lock is active.

Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:58:05 -05:00
NeilBrown
1739347549 NFSv4: change nfs4_select_rw_stateid to take a lock_context inplace of lock_owner
The only time that a lock_context is not immediately available is in
setattr, and now that it has an open_context, it can easily find one
with nfs_get_lock_context.
This removes the need for the on-stack nfs_lockowner.

This change is preparation for correctly support flock stateids.

Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:57:56 -05:00
NeilBrown
29b59f9416 NFSv4: change nfs4_do_setattr to take an open_context instead of a nfs4_state.
The open_context can always lead directly to the state, and is always easily
available, so this is a straightforward change.
Doing this makes more information available to _nfs4_do_setattr() for use
in the next patch.

Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:57:45 -05:00
NeilBrown
532d4def2f NFSv4: add flock_owner to open context
An open file description (struct file) in a given process can be
associated with two different lock owners.

It can have a Posix lock owner which will be different in each process
that has a fd on the file.
It can have a Flock owner which will be the same in all processes.

When searching for a lock stateid to use, we need to consider both of these
owners

So add a new "flock_owner" to the "nfs_open_context" (of which there
is one for each open file description).

This flock_owner does not need to be reference-counted as there is a
1-1 relation between 'struct file' and nfs open contexts,
and it will never be part of a list of contexts.  So there is no need
for a 'flock_context' - just the owner is enough.

The io_count included in the (Posix) lock_context provides no
guarantee that all read-aheads that could use the state have
completed, so not supporting it for flock locks in not a serious
problem.  Synchronization between flock and read-ahead can be added
later if needed.

When creating an open_context for a non-openning create call, we don't have
a 'struct file' to pass in, so the lock context gets initialized with
a NULL owner, but this will never be used.

The flock_owner is not used at all in this patch, that will come later.

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:57:27 -05:00
NeilBrown
b184b5c38e NFS: remove l_pid field from nfs_lockowner
this field is not used in any important way and probably should
have been removed by

Commit: 8003d3c4aa ("nfs4: treat lock owners as opaque values")

which removed the pid argument from nfs4_get_lock_state.

Except in unusual and uninteresting cases, two threads with the same
->tgid will have the same ->files pointer, so keeping them both
for comparison brings no benefit.

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:57:07 -05:00
Anna Schumaker
4d3b55d3c7 NFS: Remove unused argument from nfs_direct_write_complete()
This parameter hasn't been used since 2a009ec9 (Linux 3.13-rc3), so
let's remove it from this function.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:50:37 -05:00
Anna Schumaker
7d38de3ffa NFS: Remove unused authflavour parameter from nfs_get_client()
This parameter hasn't been used since f8407299 (Linux 3.11-rc2), so
let's remove it from this function and callers.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:46:32 -05:00
J. Bruce Fields
ced85a7568 nfs: fix false positives in nfs40_walk_client_list()
It's possible that two different servers can return the same (clientid,
verifier) pair purely by coincidence.  Both are 64-bit values, but
depending on the server implementation, they can be highly predictable
and collisions may be quite likely, especially when there are lots of
servers.

So, check for this case.  If the clientid and verifier both match, then
we actually know they *can't* be the same server, since a new
SETCLIENTID to an already-known server should have changed the verifier.

This helps fix a bug that could cause the client to mount a filesystem
from the wrong server.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Yongcheng Yang <yoyang@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:43:31 -05:00
Trond Myklebust
b85f562049 pNFS: Skip invalid stateids when doing a bulk destroy
If the layout stateid is already invalid, we have no work to do.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:51 -05:00
Trond Myklebust
29ade5db12 pNFS: Wait on outstanding layoutreturns to complete in pnfs_roc()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:50 -05:00
Trond Myklebust
abb3e1c877 pNFS: Don't mark the layout as freed if the last lseg is marked for return
Address another memory leak.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:50 -05:00
Trond Myklebust
4aab97327f pNFS: Sync the layout state bits in pnfs_cache_lseg_for_layoutreturn
Ensure that the layout state bits are synced when we cache a layout
segment for layoutreturn using an appropriate call to
pnfs_set_plh_return_info.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:49 -05:00
Trond Myklebust
24408f5282 pNFS: Fix bugs in _pnfs_return_layout
We need to honour the NFS_LAYOUT_RETURN_REQUESTED bit regardless of
whether or not there are layout segments pending.
Furthermore, we should ensure that we leave the plh_return_segs list
empty.

This patch fixes a memory leak of the layout segments on plh_return_segs.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:49 -05:00
Trond Myklebust
fe1cf9469d pNFS: Clear all layout segment state in pnfs_mark_layout_stateid_invalid
When the layout state is invalidated, then so is the layout segment
state, and hence we do need to clean up the state bits.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:48 -05:00
Trond Myklebust
53e6fc86ab pNFS: Prevent unnecessary layoutreturns after delegreturn
If we cannot grab the inode or superblock, then we cannot pin the
layout header, and so we cannot send a layoutreturn as part of an
async delegreturn call. In this case, we currently end up sending
an extra layoutreturn after the delegreturn. Since the layout was
implicitly returned by the delegreturn, that just gets a BAD_STATEID.

The fix is to simply complete the return-on-close immediately.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:48 -05:00
Trond Myklebust
1c5bd76d17 pNFS: Enable layoutreturn operation for return-on-close
Amend the pnfs return on close helper functions to enable sending the
layoutreturn op in CLOSE/DELEGRETURN. This closes a potential race between
CLOSE/DELEGRETURN and parallel OPEN calls to the same file, and allows the
client and the server to agree on whether or not there is an outstanding
layout.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:47 -05:00
Trond Myklebust
828ed9ec1b pNFS: Clean up - add a helper to initialise struct layoutreturn_args
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:47 -05:00
Trond Myklebust
586f1c39da NFSv4: Add encode/decode of the layoutreturn op in DELEGRETURN
Add XDR encoding for the layoutreturn op, and storage for the layoutreturn
arguments to the DELEGRETURN compound.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:46 -05:00
Trond Myklebust
cf80516579 NFSv4: Add encode/decode of the layoutreturn op in CLOSE
Add XDR encoding for the layoutreturn op, and storage for the layoutreturn
arguments to the CLOSE compound.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:46 -05:00
Trond Myklebust
d8434d4c54 NFSv4: Fix missing operation accounting in NFS4_dec_delegreturn_sz
We need to account for the reply to the PUTFH operation in the
DELEGRETURN compound.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:45 -05:00
Trond Myklebust
69820d22c5 pNFS: Don't mark layout segments invalid on layoutreturn in pnfs_roc
The layoutreturn call will take care of invalidating the layout segments
once the call is successful.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:45 -05:00
Trond Myklebust
94e5c571fc pNFS: Get rid of unnecessary layout parameter in encode_layoutreturn callback
The parameter is already present in the "args" structure.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:44 -05:00
Trond Myklebust
0cdc329ec9 pNFS: Skip checking for return-on-close if the layout is invalid
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:44 -05:00
Trond Myklebust
e685d237e6 pNFS: Remove spurious wake up in pnfs_layout_remove_lseg()
There is no change to the value of NFS_LAYOUT_RETURN, so we should
not be waking up the RPC call.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:43 -05:00
Trond Myklebust
2a974425e5 NFSv4: Ignore LAYOUTRETURN result if the layout doesn't match or is invalid
Fix a potential race with CB_LAYOUTRECALL in which the server recalls the
remaining layout segments while our LAYOUTRETURN is still in transit.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:43 -05:00
Trond Myklebust
68f744797e pNFS: Do not free layout segments that are marked for return
We may want to process and transmit layout stat information for the
layout segments that are being returned, so we should defer freeing
them until after the layoutreturn has completed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:42 -05:00
Trond Myklebust
7b410d9ce4 pNFS: Delay getting the layout header in CB_LAYOUTRECALL handlers
Instead of grabbing the layout, we want to get the inode so that we
can reduce races between layoutget and layoutrecall when the server
does not support call referring.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:42 -05:00
Trond Myklebust
17822b207f pNFS: consolidate the different range intersection tests
Both pnfs.c and the flexfiles code have their own versions of the
range intersection testing, and the "end_offset" helper.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:41 -05:00
Trond Myklebust
ee284e35d8 pNFS: Fix race in pnfs_wait_on_layoutreturn
We must put the task to sleep while holding the inode->i_lock in order
to ensure atomicity with the test for NFS_LAYOUT_RETURN.

Fixes: 500d701f33 ("NFS41: make close wait for layoutreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:41 -05:00
Trond Myklebust
6604b203fb pNFS: On error, do not send LAYOUTGET until the LAYOUTRETURN has completed
If there is an I/O error, we should not call LAYOUTGET until the
LAYOUTRETURN that reports the error is complete.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.8+
2016-12-01 17:21:40 -05:00
Trond Myklebust
9888d837f3 pNFS: Force a retry of LAYOUTGET if the stateid doesn't match our cache
If the server sends us a completely new stateid, and the client thinks
it already holds a layout, then force a retry of the LAYOUTGET after
invalidating the existing layout in order to avoid corruption due to
races.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:40 -05:00
Trond Myklebust
ae5a459d5f pNFS: Clear NFS_LAYOUT_RETURN_REQUESTED when invalidating the layout stateid
We must ensure that we don't schedule a layoutreturn if the layout stateid
has been marked as invalid.

Fixes: 2a59a04116 ("pNFS: Fix pnfs_set_layout_stateid() to clear...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.8+
2016-12-01 17:21:39 -05:00
Trond Myklebust
7b650994ab pNFS: Don't clear the layout stateid if a layout return is outstanding
If we no longer hold any layout segments, we're normally expected to
consider the layout stateid to be invalid. However we cannot assume this
if we're about to, or in the process of sending a layoutreturn.

Fixes: 334a8f3711 ("pNFS: Don't forget the layout stateid if...")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.8+
2016-12-01 17:21:39 -05:00
Trond Myklebust
54e4a0dfa2 pNFS: Fix a deadlock between read resends and layoutreturn
We must not call nfs_pageio_init_read() on a new nfs_pageio_descriptor
while holding a reference to a layout segment, as that can deadlock
pnfs_update_layout().

Fixes: d67ae825a5 ("pnfs/flexfiles: Add the FlexFile Layout Driver")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.0+
2016-12-01 17:21:38 -05:00
Fred Isaman
9a837856cf NFSv4.1: Fix regression in callback retry handling
When initializing a freshly created slot for the calllback channel,
the seq_nr needs to be 0, not 1.  Otherwise validate_seqid
and nfs4_slot_wait_on_seqid get confused and believe that the
mpty slot corresponds to a previously sent reply.

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:38 -05:00
Trond Myklebust
1ad13dbc85 NFSv4: Optimise away forced revalidation when we know the attributes are OK
The NFS_INO_REVAL_FORCED flag needs to be set if we just got a delegation,
and we see that there might still be some ambiguity as to whether or not
our attribute or data cache are valid.
In practice, this means that a call to nfs_check_inode_attributes() will
have noticed a discrepancy between cached attributes and measured ones,
so let's move the setting of NFS_INO_REVAL_FORCED to there.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:37 -05:00
Trond Myklebust
3ecefc9295 NFSv4: Don't request close-to-open attribute when holding a delegation
If holding a delegation, we do not need to ask the server to return
close-to-open cache consistency attributes as part of the CLOSE
compound.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:37 -05:00
Trond Myklebust
3947b74d0f NFSv4: Don't request a GETATTR on open_downgrade.
If we're not closing the file completely, there is no need to request
close-to-open attributes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:36 -05:00
Trond Myklebust
1cc1baf14b NFSv4: Don't ask for the change attribute when reclaiming state
We don't need to ask for the change attribute when returning a delegation
or recovering from a server reboot, and it could actually cause us to
obtain an incorrect value if we're using a pNFS flavour that requires
LAYOUTCOMMIT.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:36 -05:00
Trond Myklebust
536585ccf9 NFSv4: Don't check file access when reclaiming state
If we're reclaiming state after a reboot, or as part of returning a
delegation, we don't need to check access modes again.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-12-01 17:21:35 -05:00
Eryu Guan
3a4b77cd47 ext4: validate s_first_meta_bg at mount time
Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. And it turns out that kernel crashed when
calculating fs overhead (ext4_calculate_overhead()), this is because
the image has very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the memory in count_overhead() when
setting bitmap buffer, which is PAGE_SIZE.

ext4_calculate_overhead():
  buf = get_zeroed_page(GFP_NOFS);  <=== PAGE_SIZE buffer
  blks = count_overhead(sb, i, buf);

count_overhead():
  for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
          ext4_set_bit(EXT4_B2C(sbi, s++), buf);   <=== buffer overrun
          count++;
  }

This can be reproduced easily for me by this script:

  #!/bin/bash
  rm -f fs.img
  mkdir -p /mnt/ext4
  fallocate -l 16M fs.img
  mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
  debugfs -w -R "ssv first_meta_bg 842150400" fs.img
  mount -o loop fs.img /mnt/ext4

Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.

Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01 15:08:37 -05:00
Eric Biggers
d7614cc161 ext4: correctly detect when an xattr value has an invalid size
It was possible for an xattr value to have a very large size, which
would then pass validation on 32-bit architectures due to a pointer
wraparound.  Fix this by validating the size in a way which avoids
pointer wraparound.

It was also possible that a value's size would fit in the available
space but its padded size would not.  This would cause an out-of-bounds
memory write in ext4_xattr_set_entry when replacing the xattr value.
For example, if an xattr value of unpadded size 253 bytes went until the
very end of the inode or block, then using setxattr(2) to replace this
xattr's value with 256 bytes would cause a write to the 3 bytes past the
end of the inode or buffer, and the new xattr value would be incorrectly
truncated.  Fix this by requiring that the padded size fit in the
available space rather than the unpadded size.

This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 14:57:29 -05:00
Eric Biggers
290ab23001 ext4: don't read out of bounds when checking for in-inode xattrs
With i_extra_isize equal to or close to the available space, it was
possible for us to read past the end of the inode when trying to detect
or validate in-inode xattrs.  Fix this by checking for the needed extra
space first.

This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01 14:51:58 -05:00
Eric Biggers
2dc8d9e19b ext4: forbid i_extra_isize not divisible by 4
i_extra_isize not divisible by 4 is problematic for several reasons:

- It causes the in-inode xattr space to be misaligned, but the xattr
  header and entries are not declared __packed to express this
  possibility.  This may cause poor performance or incorrect code
  generation on some platforms.
- When validating the xattr entries we can read past the end of the
  inode if the size available for xattrs is not a multiple of 4.
- It allows the nonsensical i_extra_isize=1, which doesn't even leave
  enough room for i_extra_isize itself.

Therefore, update ext4_iget() to consider i_extra_isize not divisible by
4 to be an error, like the case where i_extra_isize is too large.

This also matches the rule recently added to e2fsck for determining
whether an inode has valid i_extra_isize.

This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems, since the size of ext4_inode
has always been a multiple of 4.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2016-12-01 14:43:33 -05:00
Linus Torvalds
2caceb3294 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fix from Miklos Szeredi:
 "This fixes a regression introduced in 4.8"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix d_real() for stacked fs
2016-12-01 10:31:53 -08:00
Eric Biggers
ba679017ef ext4: disable pwsalt ioctl when encryption disabled by config
On a CONFIG_EXT4_FS_ENCRYPTION=n kernel, the ioctls to get and set
encryption policies were disabled but EXT4_IOC_GET_ENCRYPTION_PWSALT was
not.  But there's no good reason to expose the pwsalt ioctl if the
kernel doesn't support encryption.  The pwsalt ioctl was also disabled
pre-4.8 (via ext4_sb_has_crypto() previously returning 0 when encryption
was disabled by config) and seems to have been enabled by mistake when
ext4 encryption was refactored to use fs/crypto/.  So let's disable it
again.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:55:51 -05:00
Eric Biggers
35997d1ce8 ext4: get rid of ext4_sb_has_crypto()
ext4_sb_has_crypto() just called through to ext4_has_feature_encrypt(),
and all callers except one were already using the latter.  So remove it
and switch its one caller to ext4_has_feature_encrypt().

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:54:18 -05:00
Daeho Jeong
05ac5aa18a ext4: fix inode checksum calculation problem if i_extra_size is small
We've fixed the race condition problem in calculating ext4 checksum
value in commit b47820edd1 ("ext4: avoid modifying checksum fields
directly during checksum veficationon"). However, by this change,
when calculating the checksum value of inode whose i_extra_size is
less than 4, we couldn't calculate the checksum value in a proper way.
This problem was found and reported by Nix, Thank you.

Reported-by: Nix <nix@esperi.org.uk>
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:49:12 -05:00
Jan Kara
6dcc693bc5 ext4: warn when page is dirtied without buffers
Warn when a page is dirtied without buffers (as that will likely lead to
a crash in ext4_writepages()) or when it gets newly dirtied without the
page being locked (as there is nothing that prevents buffers to get
stripped just before calling set_page_dirty() under memory pressure).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-12-01 11:46:40 -05:00
Rabin Vincent
af309226db block: protect iterate_bdevs() against concurrent close
If a block device is closed while iterate_bdevs() is handling it, the
following NULL pointer dereference occurs because bdev->b_disk is NULL
in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
turn called by the mapping_cap_writeback_dirty() call in
__filemap_fdatawrite_range()):

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000508
 IP: [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
 PGD 9e62067 PUD 9ee8067 PMD 0
 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 task: ffff880009f4d700 ti: ffff880009f5c000 task.ti: ffff880009f5c000
 RIP: 0010:[<ffffffff81314790>]  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
 RSP: 0018:ffff880009f5fe68  EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff88000ec17a38 RCX: ffffffff81a4e940
 RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff88000ec176c0
 RBP: ffff880009f5fe68 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000001 R11: 0000000000000000 R12: ffff88000ec17860
 R13: ffffffff811b25c0 R14: ffff88000ec178e0 R15: ffff88000ec17a38
 FS:  00007faee505d700(0000) GS:ffff88000fb00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000508 CR3: 0000000009e8a000 CR4: 00000000000006e0
 Stack:
  ffff880009f5feb8 ffffffff8112e7f5 0000000000000000 7fffffffffffffff
  0000000000000000 0000000000000000 7fffffffffffffff 0000000000000001
  ffff88000ec178e0 ffff88000ec17860 ffff880009f5fec8 ffffffff8112e81f
 Call Trace:
  [<ffffffff8112e7f5>] __filemap_fdatawrite_range+0x85/0x90
  [<ffffffff8112e81f>] filemap_fdatawrite+0x1f/0x30
  [<ffffffff811b25d6>] fdatawrite_one_bdev+0x16/0x20
  [<ffffffff811bc402>] iterate_bdevs+0xf2/0x130
  [<ffffffff811b2763>] sys_sync+0x63/0x90
  [<ffffffff815d4272>] entry_SYSCALL_64_fastpath+0x12/0x76
 Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d
 RIP  [<ffffffff81314790>] blk_get_backing_dev_info+0x10/0x20
  RSP <ffff880009f5fe68>
 CR2: 0000000000000508
 ---[ end trace 2487336ceb3de62d ]---

The crash is easily reproducible by running the following command, if an
msleep(100) is inserted before the call to func() in iterate_devs():

 while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done

Fix it by holding the bd_mutex across the func() call and only calling
func() if the bdev is opened.

Cc: stable@vger.kernel.org
Fixes: 5c0d6b60a0 ("vfs: Create function for iterating over block devices")
Reported-and-tested-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01 08:26:39 -07:00
Steve French
8b217fe7fc SMB3: parsing for new snapshot timestamp mount parm
New mount option "snapshot=<time>" to allow mounting an earlier
version of the remote volume (if such a snapshot exists on
the server).

Note that eventually specifying a snapshot time of 1 will allow
the user to mount the oldest snapshot. A subsequent patch
add the processing for that and another for actually specifying
the "time warp" create context on SMB2/SMB3 open.

Check to make sure SMB2 negotiated, and ensure that
we use a different tcon if mount same share twice
but with different snaphshot times

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2016-12-01 00:23:20 -06:00