linux-stable

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2024-08-21 08:19:28 +00:00

Author	SHA1	Message	Date
Yunlong Song	3436c4bdb3	f2fs: put allocate_segment after refresh_sit_entry SIT information should be updated before segment allocation, since SSR needs latest valid block information. Current code does not update the old_blkaddr info in sit_entry, so adjust the allocate_segment to its proper location. Commit `5e443818fa` ("f2fs: handle dirty segments inside refresh_sit_entry") puts it into wrong location. Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-24 09:37:30 -08:00
Linus Torvalds	f1ef09fde1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "There is a lot here. A lot of these changes result in subtle user visible differences in kernel behavior. I don't expect anything will care but I will revert/fix things immediately if any regressions show up. From Seth Forshee there is a continuation of the work to make the vfs ready for unpriviled mounts. We had thought the previous changes prevented the creation of files outside of s_user_ns of a filesystem, but it turns we missed the O_CREAT path. Ooops. Pavel Tikhomirov and Oleg Nesterov worked together to fix a long standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only children that are forked after the prctl are considered and not children forked before the prctl. The only known user of this prctl systemd forks all children after the prctl. So no userspace regressions will occur. Holding earlier forked children to the same rules as later forked children creates a semantic that is sane enough to allow checkpoing of processes that use this feature. There is a long delayed change by Nikolay Borisov to limit inotify instances inside a user namespace. Michael Kerrisk extends the API for files used to maniuplate namespaces with two new trivial ioctls to allow discovery of the hierachy and properties of namespaces. Konstantin Khlebnikov with the help of Al Viro adds code that when a network namespace exits purges it's sysctl entries from the dcache. As in some circumstances this could use a lot of memory. Vivek Goyal fixed a bug with stacked filesystems where the permissions on the wrong inode were being checked. I continue previous work on ptracing across exec. Allowing a file to be setuid across exec while being ptraced if the tracer has enough credentials in the user namespace, and if the process has CAP_SETUID in it's own namespace. Proc files for setuid or otherwise undumpable executables are now owned by the root in the user namespace of their mm. Allowing debugging of setuid applications in containers to work better. A bug I introduced with permission checking and automount is now fixed. The big change is to mark the mounts that the kernel initiates as a result of an automount. This allows the permission checks in sget to be safely suppressed for this kind of mount. As the permission check happened when the original filesystem was mounted. Finally a special case in the mount namespace is removed preventing unbounded chains in the mount hash table, and making the semantics simpler which benefits CRIU. The vfs fix along with related work in ima and evm I believe makes us ready to finish developing and merge fully unprivileged mounts of the fuse filesystem. The cleanups of the mount namespace makes discussing how to fix the worst case complexity of umount. The stacked filesystem fixes pave the way for adding multiple mappings for the filesystem uids so that efficient and safer containers can be implemented" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc/sysctl: Don't grab i_lock under sysctl_lock. vfs: Use upper filesystem inode in bprm_fill_uid() proc/sysctl: prune stale dentries during unregistering mnt: Tuck mounts under others instead of creating shadow/side mounts. prctl: propagate has_child_subreaper flag to every descendant introduce the walk_process_tree() helper nsfs: Add an ioctl() to return owner UID of a userns fs: Better permission checking for submounts exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction vfs: open() with O_CREAT should not create inodes with unknown ids nsfs: Add an ioctl() to return the namespace type proc: Better ownership of files for non-dumpable tasks in user namespaces exec: Remove LSM_UNSAFE_PTRACE_CAP exec: Test the ptracer's saved cred to see if the tracee can gain caps exec: Don't reset euid and egid when the tracee has CAP_SETUID inotify: Convert to using per-namespace limits	2017-02-23 20:33:51 -08:00
Filipe Manana	263d3995c9	Btrfs: try harder to migrate items to left sibling before splitting a leaf Before attempting to split a leaf we try to migrate items from the leaf to its right and left siblings. We start by trying to move items into the rigth sibling and, if the new item is meant to be inserted at the end of our leaf, we try to free from our leaf an amount of bytes equal to the number of bytes used by the new item, by setting the variable space_needed to the byte size of that new item. However if we fail to move enough items to the right sibling due to lack of space in that sibling, we then try to move items into the left sibling, and in that case we try to free an amount equal to the size of the new item from our leaf, when we need only to free an amount corresponding to the size of the new item minus the current free space of our leaf. So make sure that before we try to move items to the left sibling we do set the variable space_needed with a value corresponding to the new item's size minus the leaf's current free space. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2017-02-24 00:39:44 +00:00
Filipe Manana	76b42abbf7	Btrfs: fix data loss after truncate when using the no-holes feature If we have a file with an implicit hole (NO_HOLES feature enabled) that has an extent following the hole, delayed writes against regions of the file behind the hole happened before but were not yet flushed and then we truncate the file to a smaller size that lies inside the hole, we end up persisting a wrong disk_i_size value for our inode that leads to data loss after umounting and mounting again the filesystem or after the inode is evicted and loaded again. This happens because at inode.c:btrfs_truncate_inode_items() we end up setting last_size to the offset of the extent that we deleted and that followed the hole. We then pass that value to btrfs_ordered_update_i_size() which updates the inode's disk_i_size to a value smaller then the offset of the buffered (delayed) writes. Example reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x01 0K 32K" /mnt/foo $ xfs_io -d -c "pwrite -S 0x02 -b 32K 64K 32K" /mnt/foo $ xfs_io -c "truncate 60K" /mnt/foo --> inode's disk_i_size updated to 0 $ md5sum /mnt/foo 3c5ca3c3ab42f4b04d7e7eb0b0d4d806 /mnt/foo $ umount /dev/sdb $ mount /dev/sdb /mnt $ md5sum /mnt/foo d41d8cd98f00b204e9800998ecf8427e /mnt/foo --> Empty file, all data lost! Cc: <stable@vger.kernel.org> # 3.14+ Fixes: `16e7549f04` ("Btrfs: incompatible format change to remove hole extents") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2017-02-24 00:39:31 +00:00
Filipe Manana	82bfb2e7b6	Btrfs: incremental send, fix unnecessary hole writes for sparse files When using the NO_HOLES feature, during an incremental send we often issue write operations for holes when we should not, because that range is already a hole in the destination snapshot. While that does not change the contents of the file at the receiver, it avoids preservation of file holes, leading to wasted disk space and extra IO during send/receive. A couple examples where the holes are not preserved follows. $ mkfs.btrfs -O no-holes -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xaa 0 4K" /mnt/foo $ xfs_io -f -c "pwrite -S 0xaa 0 4K" -c "pwrite -S 0xbb 1028K 4K" /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap1 # Now add one new extent to our first test file, increasing its size and # leaving a 1Mb hole between the first extent and this new extent. $ xfs_io -c "pwrite -S 0xbb 1028K 4K" /mnt/foo # Now overwrite the last extent of our second test file. $ xfs_io -c "pwrite -S 0xcc 1028K 4K" /mnt/bar $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo /mnt/snap2/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 25088..25095 8 0x2000 1: [8..2055]: hole 2048 2: [2056..2063]: 24576..24583 8 0x2001 $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar /mnt/snap2/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 25096..25103 8 0x2000 1: [8..2055]: hole 2048 2: [2056..2063]: 24584..24591 8 0x2001 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt # It's not relevant to enable no-holes in the new filesystem. $ mkfs.btrfs -O no-holes -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive /mnt -f /tmp/1.snap $ btrfs receive /mnt -f /tmp/2.snap $ xfs_io -r -c "fiemap -v" /mnt/snap2/foo /mnt/snap2/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 24576..24583 8 0x2000 1: [8..2063]: 25624..27679 2056 0x1 $ xfs_io -r -c "fiemap -v" /mnt/snap2/bar /mnt/snap2/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 24584..24591 8 0x2000 1: [8..2063]: 27680..29735 2056 0x1 The holes do not exist in the second filesystem and they were replaced with extents filled with the byte 0x00, making each file take 1032Kb of space instead of 8Kb. So fix this by not issuing the write operations consisting of buffers filled with the byte 0x00 when the destination snapshot already has a hole for the respective range. A test case for fstests will follow soon. Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-02-24 00:39:21 +00:00
Filipe Manana	a9b9477db2	Btrfs: fix use-after-free due to wrong order of destroying work queues Before we destroy all work queues (and wait for their tasks to complete) we were destroying the work queues used for metadata I/O operations, which can result in a use-after-free problem because most tasks from all work queues do metadata I/O operations. For example, the tasks from the caching workers work queue (fs_info->caching_workers), which is destroyed only after the work queue used for metadata reads (fs_info->endio_meta_workers) is destroyed, do metadata reads, which result in attempts to queue tasks into the later work queue, triggering a use-after-free with a trace like the following: [23114.613543] general protection fault: 0000 [#1] PREEMPT SMP [23114.614442] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic acpi_cpufreq tpm_tis tpm_tis_core tpm ppdev parport_pc parport i2c_piix4 processor sg evdev i2c_core psmouse pcspkr serio_raw button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: scsi_debug] [23114.616932] CPU: 9 PID: 4537 Comm: kworker/u32:8 Not tainted 4.9.0-rc7-btrfs-next-36+ #1 [23114.616932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [23114.616932] Workqueue: btrfs-cache btrfs_cache_helper [btrfs] [23114.616932] task: ffff880221d45780 task.stack: ffffc9000bc50000 [23114.616932] RIP: 0010:[<ffffffffa037c1bf>] [<ffffffffa037c1bf>] btrfs_queue_work+0x2c/0x190 [btrfs] [23114.616932] RSP: 0018:ffff88023f443d60 EFLAGS: 00010246 [23114.616932] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000102 [23114.616932] RDX: ffffffffa0419000 RSI: ffff88011df534f0 RDI: ffff880101f01c00 [23114.616932] RBP: ffff88023f443d80 R08: 00000000000f7000 R09: 000000000000ffff [23114.616932] R10: ffff88023f443d48 R11: 0000000000001000 R12: ffff88011df534f0 [23114.616932] R13: ffff880135963868 R14: 0000000000001000 R15: 0000000000001000 [23114.616932] FS: 0000000000000000(0000) GS:ffff88023f440000(0000) knlGS:0000000000000000 [23114.616932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [23114.616932] CR2: 00007f0fb9f8e520 CR3: 0000000001a0b000 CR4: 00000000000006e0 [23114.616932] Stack: [23114.616932] ffff880101f01c00 ffff88011df534f0 ffff880135963868 0000000000001000 [23114.616932] ffff88023f443da0 ffffffffa03470af ffff880149b37200 ffff880135963868 [23114.616932] ffff88023f443db8 ffffffff8125293c ffff880149b37200 ffff88023f443de0 [23114.616932] Call Trace: [23114.616932] <IRQ> [23114.616932] [<ffffffffa03470af>] end_workqueue_bio+0xd5/0xda [btrfs] [23114.616932] [<ffffffff8125293c>] bio_endio+0x54/0x57 [23114.616932] [<ffffffffa0377929>] btrfs_end_bio+0xf7/0x106 [btrfs] [23114.616932] [<ffffffff8125293c>] bio_endio+0x54/0x57 [23114.616932] [<ffffffff8125955f>] blk_update_request+0x21a/0x30f [23114.616932] [<ffffffffa0022316>] scsi_end_request+0x31/0x182 [scsi_mod] [23114.616932] [<ffffffffa00235fc>] scsi_io_completion+0x1ce/0x4c8 [scsi_mod] [23114.616932] [<ffffffffa001ba9d>] scsi_finish_command+0x104/0x10d [scsi_mod] [23114.616932] [<ffffffffa002311f>] scsi_softirq_done+0x101/0x10a [scsi_mod] [23114.616932] [<ffffffff8125fbd9>] blk_done_softirq+0x82/0x8d [23114.616932] [<ffffffff814c8a4b>] __do_softirq+0x1ab/0x412 [23114.616932] [<ffffffff8105b01d>] irq_exit+0x49/0x99 [23114.616932] [<ffffffff81035135>] smp_call_function_single_interrupt+0x24/0x26 [23114.616932] [<ffffffff814c7ec9>] call_function_single_interrupt+0x89/0x90 [23114.616932] <EOI> [23114.616932] [<ffffffffa0023262>] ? scsi_request_fn+0x13a/0x2a1 [scsi_mod] [23114.616932] [<ffffffff814c5966>] ? _raw_spin_unlock_irq+0x2c/0x4a [23114.616932] [<ffffffff814c596c>] ? _raw_spin_unlock_irq+0x32/0x4a [23114.616932] [<ffffffff814c5966>] ? _raw_spin_unlock_irq+0x2c/0x4a [23114.616932] [<ffffffffa0023262>] scsi_request_fn+0x13a/0x2a1 [scsi_mod] [23114.616932] [<ffffffff8125590e>] __blk_run_queue_uncond+0x22/0x2b [23114.616932] [<ffffffff81255930>] __blk_run_queue+0x19/0x1b [23114.616932] [<ffffffff8125ab01>] blk_queue_bio+0x268/0x282 [23114.616932] [<ffffffff81258f44>] generic_make_request+0xbd/0x160 [23114.616932] [<ffffffff812590e7>] submit_bio+0x100/0x11d [23114.616932] [<ffffffff81298603>] ? __this_cpu_preempt_check+0x13/0x15 [23114.616932] [<ffffffff812a1805>] ? __percpu_counter_add+0x8e/0xa7 [23114.616932] [<ffffffffa03bfd47>] btrfsic_submit_bio+0x1a/0x1d [btrfs] [23114.616932] [<ffffffffa0377db2>] btrfs_map_bio+0x1f4/0x26d [btrfs] [23114.616932] [<ffffffffa0348a33>] btree_submit_bio_hook+0x74/0xbf [btrfs] [23114.616932] [<ffffffffa03489bf>] ? btrfs_wq_submit_bio+0x160/0x160 [btrfs] [23114.616932] [<ffffffffa03697a9>] submit_one_bio+0x6b/0x89 [btrfs] [23114.616932] [<ffffffffa036f5be>] read_extent_buffer_pages+0x170/0x1ec [btrfs] [23114.616932] [<ffffffffa03471fa>] ? free_root_pointers+0x64/0x64 [btrfs] [23114.616932] [<ffffffffa0348adf>] readahead_tree_block+0x3f/0x4c [btrfs] [23114.616932] [<ffffffffa032e115>] read_block_for_search.isra.20+0x1ce/0x23d [btrfs] [23114.616932] [<ffffffffa032fab8>] btrfs_search_slot+0x65f/0x774 [btrfs] [23114.616932] [<ffffffffa036eff1>] ? free_extent_buffer+0x73/0x7e [btrfs] [23114.616932] [<ffffffffa0331ba4>] btrfs_next_old_leaf+0xa1/0x33c [btrfs] [23114.616932] [<ffffffffa0331e4f>] btrfs_next_leaf+0x10/0x12 [btrfs] [23114.616932] [<ffffffffa0336aa6>] caching_thread+0x22d/0x416 [btrfs] [23114.616932] [<ffffffffa037bce9>] btrfs_scrubparity_helper+0x187/0x3b6 [btrfs] [23114.616932] [<ffffffffa037c036>] btrfs_cache_helper+0xe/0x10 [btrfs] [23114.616932] [<ffffffff8106cf96>] process_one_work+0x273/0x4e4 [23114.616932] [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca [23114.616932] [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6 [23114.616932] [<ffffffff81072a81>] kthread+0xd5/0xdd [23114.616932] [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a [23114.616932] [<ffffffff814c6257>] ret_from_fork+0x27/0x40 [23114.616932] Code: 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 53 49 89 f4 48 8b 46 70 a8 04 74 09 48 8b 5f 08 48 85 db 75 03 48 8b 1f 49 89 5c 24 68 <83> 7b 64 ff 74 04 f0 ff 43 58 49 83 7c 24 08 00 74 2c 4c 8d 6b [23114.616932] RIP [<ffffffffa037c1bf>] btrfs_queue_work+0x2c/0x190 [btrfs] [23114.616932] RSP <ffff88023f443d60> [23114.689493] ---[ end trace 6e48b6bc707ca34b ]--- [23114.690166] Kernel panic - not syncing: Fatal exception in interrupt [23114.691283] Kernel Offset: disabled [23114.691918] ---[ end Kernel panic - not syncing: Fatal exception in interrupt The following diagram shows the sequence of operations that lead to the use-after-free problem from the above trace: CPU 1 CPU 2 CPU 3 caching_thread() close_ctree() btrfs_stop_all_workers() btrfs_destroy_workqueue( fs_info->endio_meta_workers) btrfs_search_slot() read_block_for_search() readahead_tree_block() read_extent_buffer_pages() submit_one_bio() btree_submit_bio_hook() btrfs_bio_wq_end_io() --> sets the bio's bi_end_io callback to end_workqueue_bio() --> bio is submitted bio completes and its bi_end_io callback is invoked --> end_workqueue_bio() --> attempts to queue a task on fs_info->endio_meta_workers btrfs_destroy_workqueue( fs_info->caching_workers) So fix this by destroying the queues used for metadata I/O tasks only after destroying all the other queues. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2017-02-24 00:38:56 +00:00
Filipe Manana	5cdd7db6c5	Btrfs: fix assertion failure when freeing block groups at close_ctree() At close_ctree() we free the block groups and then only after we wait for any running worker kthreads to finish and shutdown the workqueues. This behaviour is racy and it triggers an assertion failure when freeing block groups because while we are doing it we can have for example a block group caching kthread running, and in that case the block group's reference count can still be greater than 1 by the time we assert its reference count is 1, leading to an assertion failure: [19041.198004] assertion failed: atomic_read(&block_group->count) == 1, file: fs/btrfs/extent-tree.c, line: 9799 [19041.200584] ------------[ cut here ]------------ [19041.201692] kernel BUG at fs/btrfs/ctree.h:3418! [19041.202830] invalid opcode: 0000 [#1] PREEMPT SMP [19041.203929] Modules linked in: btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic ppdev sg psmouse acpi_cpufreq pcspkr parport_pc evdev tpm_tis parport tpm_tis_core i2c_piix4 i2c_core tpm serio_raw processor button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs] [19041.208082] CPU: 6 PID: 29051 Comm: umount Not tainted 4.9.0-rc7-btrfs-next-36+ #1 [19041.208082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [19041.208082] task: ffff88015f028980 task.stack: ffffc9000ad34000 [19041.208082] RIP: 0010:[<ffffffffa03e319e>] [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs] [19041.208082] RSP: 0018:ffffc9000ad37d60 EFLAGS: 00010286 [19041.208082] RAX: 0000000000000061 RBX: ffff88015ecb4000 RCX: 0000000000000001 [19041.208082] RDX: ffff88023f392fb8 RSI: ffffffff817ef7ba RDI: 00000000ffffffff [19041.208082] RBP: ffffc9000ad37d60 R08: 0000000000000001 R09: 0000000000000000 [19041.208082] R10: ffffc9000ad37cb0 R11: ffffffff82f2b66d R12: ffff88023431d170 [19041.208082] R13: ffff88015ecb40c0 R14: ffff88023431d000 R15: ffff88015ecb4100 [19041.208082] FS: 00007f44f3d42840(0000) GS:ffff88023f380000(0000) knlGS:0000000000000000 [19041.208082] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [19041.208082] CR2: 00007f65d623b000 CR3: 00000002166f2000 CR4: 00000000000006e0 [19041.208082] Stack: [19041.208082] ffffc9000ad37d98 ffffffffa035989f ffff88015ecb4000 ffff88015ecb5630 [19041.208082] ffff88014f6be000 0000000000000000 00007ffcf0ba6a10 ffffc9000ad37df8 [19041.208082] ffffffffa0368cd4 ffff88014e9658e0 ffffc9000ad37e08 ffffffff811a634d [19041.208082] Call Trace: [19041.208082] [<ffffffffa035989f>] btrfs_free_block_groups+0x17f/0x392 [btrfs] [19041.208082] [<ffffffffa0368cd4>] close_ctree+0x1c5/0x2e1 [btrfs] [19041.208082] [<ffffffff811a634d>] ? evict_inodes+0x132/0x141 [19041.208082] [<ffffffffa034356d>] btrfs_put_super+0x15/0x17 [btrfs] [19041.208082] [<ffffffff8118fc32>] generic_shutdown_super+0x6a/0xeb [19041.208082] [<ffffffff8119004f>] kill_anon_super+0x12/0x1c [19041.208082] [<ffffffffa0343370>] btrfs_kill_super+0x16/0x21 [btrfs] [19041.208082] [<ffffffff8118fad1>] deactivate_locked_super+0x3b/0x68 [19041.208082] [<ffffffff8118fb34>] deactivate_super+0x36/0x39 [19041.208082] [<ffffffff811a9946>] cleanup_mnt+0x58/0x76 [19041.208082] [<ffffffff811a99a2>] __cleanup_mnt+0x12/0x14 [19041.208082] [<ffffffff81071573>] task_work_run+0x6f/0x95 [19041.208082] [<ffffffff81001897>] prepare_exit_to_usermode+0xa3/0xc1 [19041.208082] [<ffffffff81001a23>] syscall_return_slowpath+0x16e/0x1d2 [19041.208082] [<ffffffff814c607d>] entry_SYSCALL_64_fastpath+0xab/0xad [19041.208082] Code: c7 ae a0 3e a0 48 89 e5 e8 4e 74 d4 e0 0f 0b 55 89 f1 48 c7 c2 0b a4 3e a0 48 89 fe 48 c7 c7 a4 a6 3e a0 48 89 e5 e8 30 74 d4 e0 <0f> 0b 55 31 d2 48 89 e5 e8 d5 b9 f7 ff 5d c3 48 63 f6 55 31 c9 [19041.208082] RIP [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs] [19041.208082] RSP <ffffc9000ad37d60> [19041.279264] ---[ end trace 23330586f16f064d ]--- This started happening as of kernel 4.8, since commit `f3bca8028b` ("Btrfs: add ASSERT for block group's memory leak") introduced these assertions. So fix this by freeing the block groups only after waiting for all worker kthreads to complete and shutdown the workqueues. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2017-02-24 00:38:27 +00:00
Filipe Manana	3168021cf9	Btrfs: do not create explicit holes when replaying log tree if NO_HOLES enabled We log holes explicitly by using file extent items, however when replaying a log tree, if a logged file extent item corresponds to a hole and the NO_HOLES feature is enabled we do not need to copy the file extent item into the fs/subvolume tree, as the absence of such file extent items is the purpose of the NO_HOLES feature. So skip the copying of file extent items representing holes when the NO_HOLES feature is enabled. Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-02-24 00:38:10 +00:00
Robbie Ko	91e1f56a8b	Btrfs: fix leak of subvolume writers counter When falling back from a nocow write to a regular cow write, we were leaking the subvolume writers counter in 2 situations, preventing snapshot creation from ever completing in the future, as it waits for that counter to go down to zero before the snapshot creation starts. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [Improved changelog and subject] Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-02-24 00:38:01 +00:00
Filipe Manana	6f546216e9	Btrfs: bulk delete checksum items in the same leaf Very often we have the checksums for an extent spread in multiple items in the checksums tree, and currently the algorithm to delete them starts by looking for them one by one and then deleting them one by one, which is not optimal since each deletion involves shifting all the other items in the leaf and when the leaf reaches some low threshold, to move items off the leaf into its left and right neighbor leafs. Also, after each item deletion we release our search path and start a new search for other checksums items. So optimize this by deleting in bulk all the items in the same leaf that contain checksums for the extent being freed. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2017-02-24 00:36:55 +00:00
Robbie Ko	0191410158	Btrfs: incremental send, do not issue invalid rmdir operations When both the parent and send snapshots have a directory inode with the same number but different generations (therefore they are different inodes) and both have an entry with the same name, an incremental send stream will contain an invalid rmdir operation that refers to the orphanized name of the inode from the parent snapshot. The following example scenario shows how this happens. Parent snapshot: . \|---- d259_old/ (ino 259, gen 9) \| \|---- d1/ (ino 258, gen 9) \| \|---- f (ino 257, gen 9) Send snapshot: . \|---- d258/ (ino 258, gen 7) \|---- d259/ (ino 259, gen 7) \|---- d1/ (ino 257, gen 7) When the kernel is processing inode 258 it notices that in both snapshots there is an inode numbered 259 that is a parent of an inode 258. However it ignores the fact that the inodes numbered 259 have different generations in both snapshots, which means they are effectively different inodes. Then it checks that both inodes 259 have a dentry named "d1" and because of that it issues a rmdir operation with orphanized name of the inode 258 from the parent snapshot. This happens at send.c:process_record_refs(), which calls send.c:did_overwrite_first_ref() that returns true and because of that later on at process_recorded_refs() such rmdir operation is issued because the inode being currently processed (258) is a directory and it was deleted in the send snapshot (and replaced with another inode that has the same number and is a directory too). Fix this issue by comparing the generations of parent directory inodes that have the same number and make send.c:did_overwrite_first_ref() when the generations are different. The following steps reproduce the problem. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ touch /mnt/f $ mkdir /mnt/d1 $ mkdir /mnt/d259_old $ mv /mnt/d1 /mnt/d259_old/d1 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/d1 $ mkdir /mnt/dir258 $ mkdir /mnt/dir259 $ mv /mnt/d1 /mnt/dir259/d1 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs receive /mnt/ -f /tmp/1.snap # Take note that once the filesystem is created, its current # generation has value 7 so the inodes from the second snapshot all have # a generation value of 7. And after receiving the first snapshot # the filesystem is at a generation value of 10, because the call to # create the second snapshot bumps the generation to 8 (the snapshot # creation ioctl does a transaction commit), the receive command calls # the snapshot creation ioctl to create the first snapshot, which bumps # the filesystem's generation to 9, and finally when the receive # operation finishes it calls an ioctl to transition the first snapshot # (snap1) from RW mode to RO mode, which does another transaction commit # and bumps the filesystem's generation to 10. This means all the inodes # in the first snapshot (snap1) have a generation value of 9. $ rm -f /tmp/1.snap $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ btrfs receive /mnt -f /tmp/1.snap $ btrfs receive -vv /mnt -f /tmp/2.snap receiving snapshot mysnap2 uuid=9c03962f-f620-0047-9f98-32e5a87116d9, ctransid=7 parent_uuid=d17a6e3f-14e5-df4f-be39-a7951a5399aa, parent_ctransid=9 utimes unlink f mkdir o257-7-0 mkdir o259-7-0 rename o257-7-0 -> o259-7-0/d1 chown o259-7-0/d1 - uid=0, gid=0 chmod o259-7-0/d1 - mode=0755 utimes o259-7-0/d1 rmdir o258-9-0 ERROR: rmdir o258-9-0 failed: No such file or directory Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-02-24 00:36:45 +00:00
Filipe Manana	fe9c798dbf	Btrfs: incremental send, do not delay rename when parent inode is new When we are checking if we need to delay the rename operation for an inode we not checking if a parent inode that exists in the send and parent snapshots is really the same inode or not, that is, we are not comparing the generation number of the parent inode in the send and parent snapshots. Not only this results in unnecessarily delaying a rename operation but also can later on make us generate an incorrect name for a new inode in the send snapshot that has the same number as another inode in the parent snapshot but a different generation. Here follows an example where this happens. Parent snapshot: . (ino 256, gen 3) \|--- dir258/ (ino 258, gen 7) \| \|--- dir257/ (ino 257, gen 7) \| \|--- dir259/ (ino 259, gen 7) Send snapshot: . (ino 256, gen 3) \|--- file258 (ino 258, gen 10) \| \|--- new_dir259/ (ino 259, gen 10) \|--- dir257/ (ino 257, gen 7) The following steps happen when computing the incremental send stream: 1) When processing inode 257, its new parent is created using its orphan name (o257-21-0), and the rename operation for inode 257 is delayed because its new parent (inode 259) was not yet processed - this decision to delay the rename operation does not make much sense because the inode 259 in the send snapshot is a new inode, it's not the same as inode 259 in the parent snapshot. 2) When processing inode 258 we end up delaying its rmdir operation, because inode 257 was not yet renamed (moved away from the directory inode 258 represents). We also create the new inode 258 using its orphan name "o258-10-0", then rename it to its final name of "file258" and then issue a truncate operation for it. However this truncate operation contains an incorrect name, which corresponds to the orphan name and not to the final name, which makes the receiver fail. This happens because when we attempt to compute the inode's current name we verify that there's another inode with the same number (258) that has its rmdir operation pending and because of that we generate an orphan name for the new inode 258 (we do this in the function get_cur_path()). Fix this by not delayed the rename operation of an inode if it has parents with the same number but different generations in both snapshots. The following steps reproduce this example scenario. $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir257 $ mkdir /mnt/dir258 $ mkdir /mnt/dir259 $ mv /mnt/dir257 /mnt/dir258/dir257 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ mv /mnt/dir258/dir257 /mnt/dir257 $ rmdir /mnt/dir258 $ rmdir /mnt/dir259 # Remount the filesystem so that the next created inodes will have the # numbers 258 and 259. This is because when a filesystem is mounted, # btrfs sets the subvolume's inode counter to a value corresponding to # the highest inode number in the subvolume plus 1. This inode counter # is used to assign a unique number to each new inode and it's # incremented by 1 after very inode creation. # Note: we unmount and then mount instead of doing a mount with # "-o remount" because otherwise the inode counter remains at value 260. $ umount /mnt $ mount /dev/sdb /mnt $ touch /mnt/file258 $ mkdir /mnt/new_dir259 $ mv /mnt/dir257 /mnt/new_dir259/dir257 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive /mnt -f /tmo/1.snap $ btrfs receive /mnt -f /tmo/2.snap -vv receiving snapshot mysnap2 uuid=e059b6d1-7f55-f140-8d7c-9a3039d23c97, ctransid=10 parent_uuid=77e98cb6-8762-814f-9e05-e8ba877fc0b0, parent_ctransid=7 utimes mkdir o259-10-0 rename dir258 -> o258-7-0 utimes mkfile o258-10-0 rename o258-10-0 -> file258 utimes truncate o258-10-0 size=0 ERROR: truncate o258-10-0 failed: No such file or directory Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-02-24 00:36:33 +00:00
Robbie Ko	4dd9920d99	Btrfs: send, fix failure to rename top level inode due to name collision Under certain situations, an incremental send operation can fail due to a premature attempt to create a new top level inode (a direct child of the subvolume/snapshot root) whose name collides with another inode that was removed from the send snapshot. Consider the following example scenario. Parent snapshot: . (ino 256, gen 8) \|---- a1/ (ino 257, gen 9) \|---- a2/ (ino 258, gen 9) Send snapshot: . (ino 256, gen 3) \|---- a2/ (ino 257, gen 7) In this scenario, when receiving the incremental send stream, the btrfs receive command fails like this (ran in verbose mode, -vv argument): rmdir a1 mkfile o257-7-0 rename o257-7-0 -> a2 ERROR: rename o257-7-0 -> a2 failed: Is a directory What happens when computing the incremental send stream is: 1) An operation to remove the directory with inode number 257 and generation 9 is issued. 2) An operation to create the inode with number 257 and generation 7 is issued. This creates the inode with an orphanized name of "o257-7-0". 3) An operation rename the new inode 257 to its final name, "a2", is issued. This is incorrect because inode 258, which has the same name and it's a child of the same parent (root inode 256), was not yet processed and therefore no rmdir operation for it was yet issued. The rename operation is issued because we fail to detect that the name of the new inode 257 collides with inode 258, because their parent, a subvolume/snapshot root (inode 256) has a different generation in both snapshots. So fix this by ignoring the generation value of a parent directory that matches a root inode (number 256) when we are checking if the name of the inode currently being processed collides with the name of some other inode that was not yet processed. We can achieve this scenario of different inodes with the same number but different generation values either by mounting a filesystem with the inode cache option (-o inode_cache) or by creating and sending snapshots across different filesystems, like in the following example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a1 $ mkdir /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/1.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ touch /mnt/a2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs receive /mnt -f /tmp/1.snap # Take note that once the filesystem is created, its current # generation has value 7 so the inode from the second snapshot has # a generation value of 7. And after receiving the first snapshot # the filesystem is at a generation value of 10, because the call to # create the second snapshot bumps the generation to 8 (the snapshot # creation ioctl does a transaction commit), the receive command calls # the snapshot creation ioctl to create the first snapshot, which bumps # the filesystem's generation to 9, and finally when the receive # operation finishes it calls an ioctl to transition the first snapshot # (snap1) from RW mode to RO mode, which does another transaction commit # and bumps the filesystem's generation to 10. $ rm -f /tmp/1.snap $ btrfs send /mnt/snap1 -f /tmp/1.snap $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ btrfs receive /mnt /tmp/1.snap # Receive of snapshot snap2 used to fail. $ btrfs receive /mnt /tmp/2.snap Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> [Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana <fdmanana@suse.com>	2017-02-24 00:36:01 +00:00
Weston Andros Adamson	ed92d8c137	NFSv4: fix getacl ERANGE for some ACL buffer sizes We're not taking into account that the space needed for the (variable length) attr bitmap, with the result that we'd sometimes get a spurious ERANGE when the ACL data got close to the end of a page. Just add in an extra page to make sure. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-23 17:23:35 -05:00
J. Bruce Fields	6682c14bbe	NFSv4: fix getacl head length estimation Bitmap and attrlen follow immediately after the op reply header. This was an oversight from commit `bf118a342f`. Consequences of this are just minor efficiency (extra calls to xdr_shrink_bufhead). Fixes: `bf118a342f` "NFSv4: include bitmap in nfsv4 get acl data" Reviewed-by: Kinglong Mee <kinglongmee@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-23 17:23:32 -05:00
Dan Carpenter	f107548039	ceph: tidy some white space in get_nonsnap_parent() The white space here seems slightly messed up. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-23 22:22:02 +01:00
Hou Pengyang	e93b986525	f2fs: add ovp valid_blocks check for bg gc victim to fg_gc For foreground gc, greedy algorithm should be adapted, which makes this formula work well: (2 * (100 / config.overprovision + 1) + 6) But currently, we fg_gc have a prior to select bg_gc victim segments to gc first, these victims are selected by cost-benefit algorithm, we can't guarantee such segments have the small valid blocks, which may destroy the f2fs rule, on the worstest case, would consume all the free segments. This patch fix this by add a filter in check_bg_victims, if segment's has # of valid blocks over overprovision ratio, skip such segments. Cc: <stable@vger.kernel.org> Signed-off-by: Hou Pengyang <houpengyang@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 11:28:20 -08:00
Jaegeuk Kim	86d54795c9	f2fs: do not wait for writeback in write_begin Otherwise we can get livelock like below. [79880.428136] dbench D 0 18405 18404 0x00000000 [79880.428139] Call Trace: [79880.428142] __schedule+0x219/0x6b0 [79880.428144] schedule+0x36/0x80 [79880.428147] schedule_timeout+0x243/0x2e0 [79880.428152] ? update_sd_lb_stats+0x16b/0x5f0 [79880.428155] ? ktime_get+0x3c/0xb0 [79880.428157] io_schedule_timeout+0xa6/0x110 [79880.428161] __lock_page+0xf7/0x130 [79880.428164] ? unlock_page+0x30/0x30 [79880.428167] pagecache_get_page+0x16b/0x250 [79880.428171] grab_cache_page_write_begin+0x20/0x40 [79880.428182] f2fs_write_begin+0xa2/0xdb0 [f2fs] [79880.428192] ? f2fs_mark_inode_dirty_sync+0x16/0x30 [f2fs] [79880.428197] ? kmem_cache_free+0x79/0x200 [79880.428203] ? __mark_inode_dirty+0x17f/0x360 [79880.428206] generic_perform_write+0xbb/0x190 [79880.428213] ? file_update_time+0xa4/0xf0 [79880.428217] __generic_file_write_iter+0x19b/0x1e0 [79880.428226] f2fs_file_write_iter+0x9c/0x180 [f2fs] [79880.428231] __vfs_write+0xc5/0x140 [79880.428235] vfs_write+0xb2/0x1b0 [79880.428238] SyS_write+0x46/0xa0 [79880.428242] entry_SYSCALL_64_fastpath+0x1e/0xad Fixes: cae96a5c8ab6 ("f2fs: check io submission more precisely") Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 11:23:27 -08:00
Yunlei He	05eeb118a0	f2fs: replace __get_victim by dirty_segments in FG_GC In FG_GC process, it will search victim section twice. This will cause some dirty section with less valid blocks skip garbage collection. section # 26425 : valid blocks # 3 142.037567: get_victim_by_default: victim 26425 : valid blocks # 3 142.037585: f2fs_get_victim: dev = (259,30), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 26425 ofs_unit = 1, pre_victim_secno = 26425, prefree = 0, free = 244 142.039494: f2fs_get_victim: dev = (259,30), type = Hot DATA, policy = (Background GC, SSR-mode, Greedy), victim = 19022 ofs_unit = 1, pre_victim_secno = 26425, prefree = 0, free = 24 142.070247: new_curseg: Debug: alloc new segment 26746 142.244341: f2fs_get_victim: dev = (259,30), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 26054 ofs_unit = 1, pre_victim_secno = 26054, prefree = 0, free = 243 142.254475: do_garbage_collect: Debug: FG_GC, seg_freed = 1 142.293131: f2fs_get_victim: dev = (259,30), type = Warm DATA, policy = (Background GC, SSR-mode, Greedy), victim = 23466 ofs_unit = 1, pre_victim_secno = -1, prefree = 0, free = 244 142.319001: f2fs_get_victim: dev = (259,30), type = Warm DATA, policy = (Background GC, SSR-mode, Greedy), victim = 23467 ofs_unit = 1, pre_victim_secno = -1, prefree = 0, free = 244 142.368879: get_victim_by_default: victim 26425 : valid blocks # 3 142.368894: f2fs_get_victim: dev = (259,30), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 26425 ofs_unit = 1, pre_victim_secno = 26425, prefree = 0, free = 244 142.378127: f2fs_get_victim: dev = (259,30), type = Hot DATA, policy = (Background GC, SSR-mode, Greedy), victim = 19612 ofs_unit = 1, pre_victim_secno = 26425, prefree = 0, free = 24 142.416917: new_curseg: Debug: alloc new segment 26054 142.656794: f2fs_get_victim: dev = (259,30), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 25404 ofs_unit = 1, pre_victim_secno = 25404, prefree = 0, free = 243 142.662139: do_garbage_collect: Debug: FG_GC, seg_freed = 1 142.684159: new_curseg: Debug: alloc new segment 25197 142.685059: get_victim_by_default: victim 26425 : valid blocks # 3 142.685079: f2fs_get_victim: dev = (259,30), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 26425 ofs_unit = 1, pre_victim_secno = 26425, prefree = 0, free = 243 142.701427: f2fs_get_victim: dev = (259,30), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 26238 ofs_unit = 1, pre_victim_secno = 26238, prefree = 0, free = 243 142.707105: do_garbage_collect: Debug: FG_GC, seg_freed = 1 142.802444: f2fs_get_victim: dev = (259,30), type = Warm DATA, policy = (Background GC, SSR-mode, Greedy), victim = 23473 ofs_unit = 1, pre_victim_secno = -1, prefree = 0, free = 244 142.804422: get_victim_by_default: victim 26425 : valid blocks # 3 142.804443: f2fs_get_victim: dev = (259,30), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 26425 ofs_unit = 1, pre_victim_secno = 26425, prefree = 0, free = 244 142.851567: f2fs_get_victim: dev = (259,30), type = Hot DATA, policy = (Background GC, SSR-mode, Greedy), victim = 19092 ofs_unit = 1, pre_victim_secno = 26425, prefree = 0, free = 24 142.865014: new_curseg: Debug: alloc new segment 26238 143.082245: f2fs_get_victim: dev = (259,30), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 26307 ofs_unit = 1, pre_victim_secno = 26307, prefree = 0, free = 244 143.088252: do_garbage_collect: Debug: FG_GC, seg_freed = 1 143.128307: new_curseg: Debug: alloc new segment 25404 143.181846: get_victim_by_default: victim 26425 : valid blocks # 3 143.181872: f2fs_get_victim: dev = (259,30), type = No TYPE, policy = (Foreground GC, LFS-mode, Greedy), victim = 26425 ofs_unit = 1, pre_victim_secno = 26425, prefree = 0, free = 244 Signed-off-by: Yunlei He <heyunlei@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 11:23:26 -08:00
Jaegeuk Kim	88c5c13a50	f2fs: fix multiple f2fs_add_link() calls having same name It turns out a stakable filesystem like sdcardfs in AOSP can trigger multiple vfs_create() to lower filesystem. In that case, f2fs will add multiple dentries having same name which breaks filesystem consistency. Until upper layer fixes, let's work around by f2fs, which shows actually not much performance regression. Cc: <stable@vger.kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 11:23:25 -08:00
Jaegeuk Kim	d50aaeec90	f2fs: show actual device info in tracepoints This patch shows actual device information in the tracepoints. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 11:23:24 -08:00
Jaegeuk Kim	5b6c6be2d8	f2fs: use SSR for warm node as well We have had node chains, but haven't used it so far due to stale node blocks. Now, we have crc\|cp_ver in node footer and give random cp_ver at format time, we can start to use it again. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 11:23:22 -08:00
Chao Yu	39133a5015	f2fs: enable inline_xattr by default In android, since SElinux is enable, security policy will be appliedd for each file, it stores in inode as an xattr entry, so it will take one 4k size node block additionally for each file. Let's enable inline_xattr by default in order to save storage space. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:21:49 -08:00
Chao Yu	23cf7212a1	f2fs: introduce noinline_xattr mount option This patch introduces new mount option 'noinline_xattr', so we can disable inline xattr functionality which is already set as a default mount option. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:21:48 -08:00
Jaegeuk Kim	25cc5d3b9d	f2fs: avoid reading NAT page by get_node_info We've not seen this buggy case for a long time, so it's time to avoid this unnecessary get_node_info() call which reading NAT page to cache nat entry. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:21:47 -08:00
Jaegeuk Kim	9b064f7d0c	f2fs: remove build_free_nids() during checkpoint Let's avoid build_free_nids() in checkpoint path. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:10:53 -08:00
Chao Yu	d260081ccf	f2fs: change recovery policy of xattr node block Currently, if we call fsync after updating the xattr date belongs to the file, f2fs needs to trigger checkpoint to keep xattr data consistent. But, this policy cause low performance as checkpoint will block most foreground operations and cause unneeded and unrelated IOs around checkpoint. This patch will reuse regular file recovery policy for xattr node block, so, we change to write xattr node block tagged with fsync flag to warm area instead of cold area, and during recovery, we search warm node chain for fsynced xattr block, and do the recovery. So, for below application IO pattern, performance can be improved obviously: - touch file - create/update/delete xattr entry in file - fsync file Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:10:52 -08:00
Bhumika Goyal	2ad0ef846b	f2fs: super: constify fscrypt_operations structure Declare fscrypt_operations structure as const as it is only stored in the s_cop field of a super_block structure. This field is of type const, so fscrypt_operations structure having this property can be made const too. File size before: fs/f2fs/super.o text data bss dec hex filename 54131 31355 184 85670 14ea6 fs/f2fs/super.o File size after: fs/f2fs/super.o text data bss dec hex filename 54227 31259 184 85670 14ea6 fs/f2fs/super.o Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:10:51 -08:00
Jaegeuk Kim	1200abb26f	f2fs: show checkpoint version at mount time If we mounted f2fs successfully, let's show current checkpoint version. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:10:50 -08:00
Jaegeuk Kim	7f54f51f46	f2fs: remove preflush for nobarrier case This patch removes REQ_PREFLUSH in the nobarrier case. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:10:48 -08:00
Jaegeuk Kim	942fd3192f	f2fs: check last page index in cached bio to decide submission If the cached bio has the last page's index, then we need to submit it. Otherwise, we don't need to submit it and can wait for further IO merges. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:10:48 -08:00
Jaegeuk Kim	d68f735b3b	f2fs: check io submission more precisely This patch check IO submission more precisely than previous rough check. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:10:47 -08:00
Jaegeuk Kim	f566bae846	f2fs: call internal __write_data_page directly This patch introduces __write_data_page to call it by f2fs_write_cache_pages directly.. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:10:46 -08:00
Jaegeuk Kim	e7c75ab099	f2fs: avoid out-of-order execution of atomic writes We need to flush data writes before flushing last node block writes by using FUA with PREFLUSH. We don't need to guarantee precedent node writes since if those are not written, we can't reach to the last node block when scanning node block chain during roll-forward recovery. Afterwards f2fs_wait_on_page_writeback guarantees all the IO submission to disk, which builds a valid node block chain. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:10:35 -08:00
Jaegeuk Kim	faa24895ac	f2fs: move write_node_page above fsync_node_pages This patch just moves write_node_page and introduces an inner function. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:09:43 -08:00
Jaegeuk Kim	c1b221078b	f2fs: move flush tracepoint This patch moves the tracepoint location for flush command. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-23 10:08:43 -08:00
Linus Torvalds	15192b0295	This is an addendum for the 4.11 merge window. Andy Price wrote this patch to close a nasty race condition that allows access to glocks that are being destroyed. Without this patch, GFS2 is vulnerable to random corruption and kernel panic. -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJYrv8+AAoJENeLYdPf93o7T58H/i3K+awecX1yrCl9qvAvxte+ UJioZd9wnrjHsprFkMMzeVC2rFH5EIm5JKEyl8zGGwIq/oaGtgWlxQsBOvyOnSyx WRvu99XjZTzu3vov7u1kiWmOOvVturdcALPHH6mFdgkCw8d15AHqQdfDvljfWbRp aHFc+x1evptskRTj4D7I6EeWig8v3Sr9qosJ2N8uKtsrcc/xIlh4ItsonlQh3Cz0 Dg83HVN2opHI5CWjRAjTK6zjF6XoEMgsjIOR4HLRVC9XEXiWLd3w+JBnTbFYJt0f k8NMk8oGbmzTC/HteJvnzGuNfSlkk4RAwaCkYo7F9f6hcKsWPECzUdyHn3ubm7M= =uIIs -----END PGP SIGNATURE----- Merge tag 'gfs2-4.11.addendum' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 fix from Bob Peterson: "This is an addendum for the 4.11 merge window. Andy Price wrote this patch to close a nasty race condition that allows access to glocks that are being destroyed. Without this patch, GFS2 is vulnerable to random corruption and kernel panic" * tag 'gfs2-4.11.addendum' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Add missing rcu locking for glock lookup	2017-02-23 09:36:04 -08:00
Andrew Price	f38e5fb95a	gfs2: Add missing rcu locking for glock lookup We must hold the rcu read lock across looking up glocks and trying to bump their refcount to prevent the glocks from being freed in between. Cc: <stable@vger.kernel.org> # 4.3+ Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>	2017-02-23 10:06:00 -05:00
Jaegeuk Kim	a00861dbca	f2fs: show # of APPEND and UPDATE inodes This patch shows cached # of APPEND and UPDATE inode entries. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:54:53 -08:00
DongOh Shin	cac5a3d8f5	f2fs: fix 446 coding style warnings in f2fs.h 1) Nine coding style warnings below have been resolved: "Missing a blank line after declarations" 2) 435 coding style warnings below have been resolved: "function definition argument 'x' should also have an identifier name" 3) Two coding style warnings below have been resolved: "macros should not use a trailing semicolon" Signed-off-by: DongOh Shin <doscode.kr@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:55 -08:00
DongOh Shin	c64ab12e36	f2fs: fix 3 coding style errors in f2fs.h Two coding style errors below have been resolved: "Macros with complex values should be enclosed in parentheses" And a coding style error below has been resolved: "space prohibited before that ',' (ctx:WxW)" Signed-off-by: DongOh Shin <doscode.kr@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:55 -08:00
Jaegeuk Kim	8ed5974552	f2fs: declare missing static function We missed two functions declared as static functions. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:54 -08:00
Kaixu Xia	0cc0dec2b6	f2fs: show the fault injection mount option This patch shows the fault injection mount option in f2fs_show_options(). Signed-off-by: Kaixu Xia <xiakaixu@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:53 -08:00
Chao Yu	73545817c9	f2fs: fix null pointer dereference when issuing flush in ->fsync We only allocate flush merge control structure sbi::sm_info::fcc_info when flush_merge option is on, but in f2fs_issue_flush we still try to access member of the control structure without that option, it incurs panic as show below, fix it. Call Trace: __remove_ino_entry+0xa9/0xc0 [f2fs] f2fs_do_sync_file.isra.27+0x214/0x6d0 [f2fs] f2fs_sync_file+0x18/0x20 [f2fs] vfs_fsync_range+0x3d/0xb0 __do_page_fault+0x261/0x4d0 do_fsync+0x3d/0x70 SyS_fsync+0x10/0x20 do_syscall_64+0x6e/0x180 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7f18ce260de0 RSP: 002b:00007ffdd4589258 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f18ce260de0 RDX: 0000000000000006 RSI: 00000000016c0360 RDI: 0000000000000003 RBP: 00000000016c0360 R08: 000000000000ffff R09: 000000000000001f R10: 00007ffdd4589020 R11: 0000000000000246 R12: 00000000016c0100 R13: 0000000000000000 R14: 00000000016c1f00 R15: 00000000016c0100 Code: fb 81 e3 00 08 00 00 48 89 45 a0 0f 1f 44 00 00 31 c0 85 db 75 27 41 81 e7 00 04 00 00 74 0c 41 8b 45 20 85 c0 0f 85 81 00 00 00 <f0> 41 ff 45 20 4c 89 e7 e8 f8 e9 ff ff f0 41 ff 4d 20 48 83 c4 RIP: f2fs_issue_flush+0x5b/0x170 [f2fs] RSP: ffffc90003b5fd78 CR2: 0000000000000020 ---[ end trace a09314c24f037648 ]--- Reported-by: Shuoran Liu <liushuoran@huawei.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:52 -08:00
Chao Yu	dba79f38bc	f2fs: fix to avoid overflow when left shifting page offset We use following method to calculate size with current page index: size = index << PAGE_SHIFT If type of index has only 32-bits size, left shifting will incur overflow, which makes result incorrect. So let's cast index with 64-bits type to avoid such issue. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:51 -08:00
Chao Yu	ba38c27eb9	f2fs: enhance lookup xattr Previously, in getxattr we will load all entries both in inline xattr and xattr node block, and then do the lookup in all entries, but our lookup flow shows low efficiency, since if we can lookup and hit in inline xattr of inode page cache first, we don't need to load and lookup xattr node block, which can obviously save cpu time and IO latency. Signed-off-by: Chao Yu <yuchao0@huawei.com> [Jaegeuk Kim: initialize NULL to avoid warning] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:51 -08:00
Wei Fang	b86e33075e	f2fs: fix a dead loop in f2fs_fiemap() A dead loop can be triggered in f2fs_fiemap() using the test case as below: ... fd = open(); fallocate(fd, 0, 0, 4294967296); ioctl(fd, FS_IOC_FIEMAP, fiemap_buf); ... It's caused by an overflow in __get_data_block(): ... bh->b_size = map.m_len << inode->i_blkbits; ... map.m_len is an unsigned int, and bh->b_size is a size_t which is 64 bits on 64 bits archtecture, type conversion from an unsigned int to a size_t will result in an overflow. In the above-mentioned case, bh->b_size will be zero, and f2fs_fiemap() will call get_data_block() at block 0 again an again. Fix this by adding a force conversion before left shift. Signed-off-by: Wei Fang <fangwei1@huawei.com> Acked-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:49 -08:00
Jaegeuk Kim	dc91de78e5	f2fs: do not preallocate blocks which has wrong buffer Sheng Yong reports needless preallocation if write(small_buffer, large_size) is called. In that case, f2fs preallocates large_size, but vfs returns early due to small_buffer size. Let's detect it before preallocation phase in f2fs. Reported-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:48 -08:00
Jaegeuk Kim	dcc9165dbf	f2fs: show # of on-going flush and discard bios This patch adds stat information for flush and discard commands. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:47 -08:00
Jaegeuk Kim	1546996348	f2fs: add a kernel thread to issue discard commands asynchronously This patch adds a kernel thread to issue discard commands. It proposes three states, D_PREP, D_SUBMIT, and D_DONE to identify current bio status. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 20:24:45 -08:00
Linus Torvalds	bc49a7831b	Merge branch 'akpm' (patches from Andrew) Merge updates from Andrew Morton: "142 patches: - DAX updates - various misc bits - OCFS2 updates - most of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (142 commits) mm/z3fold.c: limit first_num to the actual range of possible buddy indexes mm: fix <linux/pagemap.h> stray kernel-doc notation zram: remove obsolete sysfs attrs mm/memblock.c: remove unnecessary log and clean up oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA mm: drop unused argument of zap_page_range() mm: drop zap_details::check_swap_entries mm: drop zap_details::ignore_dirty mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled mm: help __GFP_NOFAIL allocations which do not trigger OOM killer mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically mm: consolidate GFP_NOFAIL checks in the allocator slowpath lib/show_mem.c: teach show_mem to work with the given nodemask arch, mm: remove arch specific show_mem mm, page_alloc: warn_alloc print nodemask mm, page_alloc: do not report all nodes in show_mem Revert "mm: bail out in shrink_inactive_list()" mm, vmscan: consider eligible zones in get_scan_count mm, vmscan: cleanup lru size claculations mm, vmscan: do not count freed pages as PGDEACTIVATE ...	2017-02-22 19:29:24 -08:00
Jaegeuk Kim	0b54fb8458	f2fs: factor out discard command info into discard_cmd_control This patch adds discard_cmd_control with the existing discarding controls. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 18:48:53 -08:00
Jaegeuk Kim	d4adb30f25	f2fs: reorganize stat information This patch modifies stat information more clearly. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 18:48:52 -08:00
Jaegeuk Kim	b01a92019c	f2fs: clean up flush/discard command namings This patch simply cleans up the names for flush/discard commands. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 18:48:51 -08:00
Chao Yu	ae27d62e6b	f2fs: check in-memory sit version bitmap This patch adds a mirror for sit version bitmap, and use it to detect in-memory bitmap corruption which may be caused by bit-transition of cache or memory overflow. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 18:48:50 -08:00
Chao Yu	599a09b2c1	f2fs: check in-memory nat version bitmap This patch adds a mirror for nat version bitmap, and use it to detect in-memory bitmap corruption which may be caused by bit-transition of cache or memory overflow. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 18:48:49 -08:00
Chao Yu	355e78913c	f2fs: check in-memory block bitmap This patch adds a mirror for valid block bitmap, and use it to detect in-memory bitmap corruption which may be caused by bit-transition of cache or memory overflow. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 18:48:49 -08:00
Chao Yu	5fe457430e	f2fs: introduce FI_ATOMIC_COMMIT This patch introduces a new flag to indicate inode status of doing atomic write committing, so that, we can keep atomic write status for inode during atomic committing, then we can skip GCing pages of atomic write inode, that avoids random GCed datas being mixed with current transaction, so isolation of transaction can be kept. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 18:48:48 -08:00
Chao Yu	939afa943c	f2fs: clean up with list_{first, last}_entry Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 18:48:47 -08:00
Jaegeuk Kim	25290fa559	f2fs: return fs_trim if there is no candidate If there is no candidate to submit discard command during f2fs_trim_fs, let's return without checkpoint. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 18:48:40 -08:00
Linus Torvalds	a27fcb0cd1	Changes since last update: - Various cleanups - Livelock fixes for eofblocks scanning - Improved input verification for on-disk metadata - Fix races in the copy on write remap mechanism - Fix buffer io error timeout controls - Streamlining of directio copy on write - Asynchronous discard support - Fix asserts when splitting delalloc reservations - Don't bloat bmbt when right shifting extents - Inode alignment fixes for 32k block sizes -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABCgAGBQJYp85wAAoJEPh/dxk0SrTr5HgP/jcx/oI+ap/NaXMi1Q8K65mh C3gf27cgUxtdGnEO5KRUE1Jyscuu4ZpzugDdLQISwR55kesT5FU0xpgbsfiICc86 dxLAhg8auwpTfHV+96Do2hfpO3IhYoBC2w5jo32+C+SaQUqTdPixncZukX89tjyP HOFLrQnpc336hCO2rv1Q9hSkD6IUCkSAtk+Dh1xMvbsmKFLGdmkTdqUQfl1U4YnV 2S98k9QSRdiVyzj3lAGOy+IU9aTcPX/PptMEYaQZEaod5WWNjy91lQZNM6zRc4QW 8P199yiH6CQa2vESO2SV72cJ40WihM1KQXqnrlJjAMGQ7mMGTGJcTwxhuZYUbDYZ cuk6bAUaijt/PzfmydJKlcH8vFerX4aU4CGkxPU0nph0iTR5kxYlIAMmFw2cdRzf Iar3SBb8Pc9jiNnEZMFsQ0Fd9hNk9rNoUSpKqm4FtSRocU6JjmpAdPqNYdTVKc2l 2EY7JMo0xCaTVC1WT6sE2NsxsFvm0R7H6HHG2vMFIMNkhI24GRijIXH6dQlaGCQJ 5oTHrSM7503qPlEQNsxF7zI02LpJT+duf+2ODw/FSjA1z/TWwOUYYUrPUOyQNdzP NrRnMa6LWsEehkuvz2FFko8PKXD55lTuUP1KdjigjqKp8Jzkc/PP+uvuwF5vUFfd pWRvE5m/NePWBZetbL3Q =Ga1F -----END PGP SIGNATURE----- Merge tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs updates from Darrick Wong: "Here are the XFS changes for 4.11. We aren't introducing any major features in this release cycle except for this being the first merge window I've managed on my own. :) Changes since last update: - Various cleanups - Livelock fixes for eofblocks scanning - Improved input verification for on-disk metadata - Fix races in the copy on write remap mechanism - Fix buffer io error timeout controls - Streamlining of directio copy on write - Asynchronous discard support - Fix asserts when splitting delalloc reservations - Don't bloat bmbt when right shifting extents - Inode alignment fixes for 32k block sizes" * tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (39 commits) xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AG xfs: simplify xfs_rtallocate_extent xfs: tune down agno asserts in the bmap code xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment xfs: don't reserve blocks for right shift transactions xfs: fix len comparison in xfs_extent_busy_trim xfs: fix uninitialized variable in _reflink_convert_cow xfs: split indlen reservations fairly when under reserved xfs: handle indlen shortage on delalloc extent merge xfs: resurrect debug mode drop buffered writes mechanism xfs: clear delalloc and cache on buffered write failure xfs: don't block the log commit handler for discards xfs: improve busy extent sorting xfs: improve handling of busy extents in the low-level allocator xfs: don't fail xfs_extent_busy allocation xfs: correct null checks and error processing in xfs_initialize_perag xfs: update ctime and mtime on clone destinatation inodes xfs: allocate direct I/O COW blocks in iomap_begin xfs: go straight to real allocations for direct I/O COW writes xfs: return the converted extent in __xfs_reflink_convert_cow ...	2017-02-22 18:05:23 -08:00
Denys Vlasenko	16e72e9b30	powerpc: do not make the entire heap executable On 32-bit powerpc the ELF PLT sections of binaries (built with --bss-plt, or with a toolchain which defaults to it) look like this: [17] .sbss NOBITS 0002aff8 01aff8 000014 00 WA 0 0 4 [18] .plt NOBITS 0002b00c 01aff8 000084 00 WAX 0 0 4 [19] .bss NOBITS 0002b090 01aff8 0000a4 00 WA 0 0 4 Which results in an ELF load header: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000 This is all correct, the load region containing the PLT is marked as executable. Note that the PLT starts at 0002b00c but the file mapping ends at 0002aff8, so the PLT falls in the 0 fill section described by the load header, and after a page boundary. Unfortunately the generic ELF loader ignores the X bit in the load headers when it creates the 0 filled non-file backed mappings. It assumes all of these mappings are RW BSS sections, which is not the case for PPC. gcc/ld has an option (--secure-plt) to not do this, this is said to incur a small performance penalty. Currently, to support 32-bit binaries with PLT in BSS kernel maps entire brk area with executable rights for all binaries, even --secure-plt ones. Stop doing that. Teach the ELF loader to check the X bit in the relevant load header and create 0 filled anonymous mappings that are executable if the load header requests that. Test program showing the difference in /proc/$PID/maps: int main() { char buf[161024]; char p = malloc(123); /* make "[heap]" mapping appear */ int fd = open("/proc/self/maps", O_RDONLY); int len = read(fd, buf, sizeof(buf)); write(1, buf, len); printf("%p\n", p); return 0; } Compiled using: gcc -mbss-plt -m32 -Os test.c -otest Unpatched ppc64 kernel: 00100000-00120000 r-xp 00000000 00:00 0 [vdso] 0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so 10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test 10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test 10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test 10690000-106c0000 rwxp 00000000 00:00 0 [heap] f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack] 0x10690008 Patched ppc64 kernel: 00100000-00120000 r-xp 00000000 00:00 0 [vdso] 0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so 10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test 10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test 10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test 10180000-101b0000 rw-p 00000000 00:00 0 [heap] ^^^^ this has changed f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so ff860000-ff890000 rw-p 00000000 00:00 0 [stack] 0x10180008 The patch was originally posted in 2012 by Jason Gunthorpe and apparently ignored: https://lkml.org/lkml/2012/9/30/138 Lightly run-tested. Link: http://lkml.kernel.org/r/20161215131950.23054-1-dvlasenk@redhat.com Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:29 -08:00
Nicholas Piggin	d94e0c05eb	nfs: no PG_private waiters remain, remove waker Since commit `4f52b6bb8c` ("NFS: Don't call COMMIT in ->releasepage()"), no tasks wait on PagePrivate. Thus the wake introduced in commit `9590544694` ("NFS: avoid deadlocks with loop-back mounted NFS filesystems.") can be removed. Link: http://lkml.kernel.org/r/20170103182234.30141-2-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:29 -08:00
Mike Rapoport	cac673292b	userfaultfd: shmem: allow registration of shared memory ranges Expand the userfaultfd_register/unregister routines to allow shared memory VMAs. Currently, there is no UFFDIO_ZEROPAGE and write-protection support for shared memory VMAs, which is reflected in ioctl methods supported by uffdio_register. Link: http://lkml.kernel.org/r/20161216144821.5183-34-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Mike Rapoport	ba6907db6d	userfaultfd: introduce vma_can_userfault Check whether a VMA can be used with userfault in more compact way Link: http://lkml.kernel.org/r/20161216144821.5183-28-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Mike Kravetz	369cd2121b	userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges Add routine userfaultfd_huge_must_wait which has the same functionality as the existing userfaultfd_must_wait routine. Only difference is that new routine must handle page table structure for hugepmd vmas. Link: http://lkml.kernel.org/r/20161216144821.5183-24-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Mike Kravetz	cab350afcb	userfaultfd: hugetlbfs: allow registration of ranges containing huge pages Expand the userfaultfd_register/unregister routines to allow VM_HUGETLB vmas. huge page alignment checking is performed after a VM_HUGETLB vma is encountered. Also, since there is no UFFDIO_ZEROPAGE support for huge pages do not return that as a valid ioctl method for huge page ranges. Link: http://lkml.kernel.org/r/20161216144821.5183-22-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Andrea Arcangeli	09fa5296a4	userfaultfd: non-cooperative: wake userfaults after UFFDIO_UNREGISTER Userfaults may still happen after the userfaultfd monitor thread received a UFFD_EVENT_MADVDONTNEED until UFFDIO_UNREGISTER is run. Wake any pending userfault within UFFDIO_UNREGISTER protected by the mmap_sem for writing, so they will not be reported to userland leading to UFFDIO_COPY returning -EINVAL (as the range was already unregistered) and they will not hang permanently either. Link: http://lkml.kernel.org/r/20161216144821.5183-16-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Pavel Emelyanov	05ce77249d	userfaultfd: non-cooperative: add madvise() event for MADV_DONTNEED request If the page is punched out of the address space the uffd reader should know this and zeromap the respective area in case of the #PF event. Link: http://lkml.kernel.org/r/20161216144821.5183-14-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Andrea Arcangeli	90794bf19d	userfaultfd: non-cooperative: optimize mremap_userfaultfd_complete() Optimize the mremap_userfaultfd_complete() interface to pass only the vm_userfaultfd_ctx pointer through the stack as a microoptimization. Link: http://lkml.kernel.org/r/20161216144821.5183-13-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Pavel Emelyanov	72f87654c6	userfaultfd: non-cooperative: add mremap() event The event denotes that an area [start:end] moves to different location. Length change isn't reported as "new" addresses, if they appear on the uffd reader side they will not contain any data and the latter can just zeromap them. Waiting for the event ACK is also done outside of mmap sem, as for fork event. Link: http://lkml.kernel.org/r/20161216144821.5183-12-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Mike Rapoport	d3aadc8ed4	userfaultfd: non-cooperative: dup_userfaultfd: use mm_count instead of mm_users Since commit `d2005e3f41` ("userfaultfd: don't pin the user memory in userfaultfd_file_create()") userfaultfd uses mm_count rather than mm_users to pin mm_struct. Make dup_userfaultfd consistent with this behaviour Link: http://lkml.kernel.org/r/20161216144821.5183-11-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Pavel Emelyanov	893e26e61d	userfaultfd: non-cooperative: Add fork() event When the mm with uffd-ed vmas fork()-s the respective vmas notify their uffds with the event which contains a descriptor with new uffd. This new descriptor can then be used to get events from the child and populate its mm with data. Note, that there can be different uffd-s controlling different vmas within one mm, so first we should collect all those uffds (and ctx-s) in a list and then notify them all one by one but only once per fork(). The context is created at fork() time but the descriptor, file struct and anon inode object is created at event read time. So some trickery is added to the userfaultfd_ctx_read() to handle the ctx queues' locking vs file creation. Another thing worth noticing is that the task that fork()-s waits for the uffd event to get processed WITHOUT the mmap sem. [aarcange@redhat.com: build warning fix] Link: http://lkml.kernel.org/r/20161216144821.5183-10-aarcange@redhat.com Link: http://lkml.kernel.org/r/20161216144821.5183-9-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Andrea Arcangeli	656031445d	userfaultfd: non-cooperative: report all available features to userland This will allow userland to probe all features available in the kernel. It will however only enable the requested features in the open userfaultfd context. Link: http://lkml.kernel.org/r/20161216144821.5183-8-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Pavel Emelyanov	9cd75c3cd4	userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor The custom events are queued in ctx->event_wqh not to disturb the fast-path-ed PF queue-wait-wakeup functions. The events to be generated (other than PF-s) are requested in UFFD_API ioctl with the uffd_api.features bits. Those, known by the kernel, are then turned on and reported back to the user-space. Link: http://lkml.kernel.org/r/20161216144821.5183-7-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Pavel Emelyanov	6dcc27fd39	userfaultfd: non-cooperative: Split the find_userfault() routine I will need one to lookup for userfaultfd_wait_queue-s in different wait queue Link: http://lkml.kernel.org/r/20161216144821.5183-6-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Andrea Arcangeli	a94720bf82	userfaultfd: use vma_is_anonymous Cleanup the vma->vm_ops usage. Side note: it would be more robust if vma_is_anonymous() would also check that vm_flags hasn't VM_PFNMAP set. Link: http://lkml.kernel.org/r/20161216144821.5183-5-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Andrea Arcangeli	8474901a33	userfaultfd: convert BUG() to WARN_ON_ONCE() Avoid BUG_ON()s and only WARN instead. This is just a cleanup, it can't make any runtime difference. This BUG_ON has never triggered and cannot trigger. Link: http://lkml.kernel.org/r/20161216144821.5183-4-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Andrea Arcangeli	a4605a61d6	userfaultfd: correct comment about UFFD_FEATURE_PAGEFAULT_FLAG_WP Minor comment correction. Link: http://lkml.kernel.org/r/20161216144821.5183-3-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:28 -08:00
Cong Wang	b5c66bab72	9p: fix a potential acl leak posix_acl_update_mode() could possibly clear 'acl', if so we leak the memory pointed by 'acl'. Save this pointer before calling posix_acl_update_mode() and release the memory if 'acl' really gets cleared. Link: http://lkml.kernel.org/r/1486678332-2430-1-git-send-email-xiyou.wangcong@gmail.com Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reported-by: Mark Salyzyn <salyzyn@android.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kurz <groug@kaod.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:27 -08:00
Eric Ren	b891fa5024	ocfs2: fix deadlock issue when taking inode lock at vfs entry points Commit `743b5f1434` ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()") results in a deadlock, as the author "Tariq Saeed" realized shortly after the patch was merged. The discussion happened here https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html The reason why taking cluster inode lock at vfs entry points opens up a self deadlock window, is explained in the previous patch of this series. So far, we have seen two different code paths that have this issue. 1. do_sys_open may_open inode_permission ocfs2_permission ocfs2_inode_lock() <=== take PR generic_permission get_acl ocfs2_iop_get_acl ocfs2_inode_lock() <=== take PR 2. fchmod\|fchmodat chmod_common notify_change ocfs2_setattr <=== take EX posix_acl_chmod get_acl ocfs2_iop_get_acl <=== take PR ocfs2_iop_set_acl <=== take EX Fixes them by adding the tracking logic (in the previous patch) for these funcs above, ocfs2_permission(), ocfs2_iop_[set\|get]_acl(), ocfs2_setattr(). Link: http://lkml.kernel.org/r/20170117100948.11657-3-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:27 -08:00
Eric Ren	439a36b8ef	ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock We are in the situation that we have to avoid recursive cluster locking, but there is no way to check if a cluster lock has been taken by a precess already. Mostly, we can avoid recursive locking by writing code carefully. However, we found that it's very hard to handle the routines that are invoked directly by vfs code. For instance: const struct inode_operations ocfs2_file_iops = { .permission = ocfs2_permission, .get_acl = ocfs2_iop_get_acl, .set_acl = ocfs2_iop_set_acl, }; Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR): do_sys_open may_open inode_permission ocfs2_permission ocfs2_inode_lock() <=== first time generic_permission get_acl ocfs2_iop_get_acl ocfs2_inode_lock() <=== recursive one A deadlock will occur if a remote EX request comes in between two of ocfs2_inode_lock(). Briefly describe how the deadlock is formed: On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in BAST(ocfs2_generic_handle_bast) when downconvert is started on behalf of the remote EX lock request. Another hand, the recursive cluster lock (the second one) will be blocked in in __ocfs2_cluster_lock() because of OCFS2_LOCK_BLOCKED. But, the downconvert never complete, why? because there is no chance for the first cluster lock on this node to be unlocked - we block ourselves in the code path. The idea to fix this issue is mostly taken from gfs2 code. 1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track of the processes' pid who has taken the cluster lock of this lock resource; 2. introduce a new flag for ocfs2_inode_lock_full: OCFS2_META_LOCK_GETBH; it means just getting back disk inode bh for us if we've got cluster lock. 3. export a helper: ocfs2_is_locked_by_me() is used to check if we have got the cluster lock in the upper code path. The tracking logic should be used by some of the ocfs2 vfs's callbacks, to solve the recursive locking issue cuased by the fact that vfs routines can call into each other. The performance penalty of processing the holder list should only be seen at a few cases where the tracking logic is used, such as get/set acl. You may ask what if the first time we got a PR lock, and the second time we want a EX lock? fortunately, this case never happens in the real world, as far as I can see, including permission check, (get\|set)_(acl\|attr), and the gfs2 code also do so. [sfr@canb.auug.org.au remove some inlines] Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:27 -08:00
Dave Jiang	f42003917b	mm, dax: change pmd_fault() to take only vmf parameter pmd_fault() and related functions really only need the vmf parameter since the additional parameters are all included in the vmf struct. Remove the additional parameter and simplify pmd_fault() and friends. Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:26 -08:00
Dave Jiang	d8a849e1bc	mm, dax: make pmd_fault() and friends be the same as fault() Instead of passing in multiple parameters in the pmd_fault() handler, a vmf can be passed in just like a fault() handler. This will simplify code and remove the need for the actual pmd fault handlers to allocate a vmf. Related functions are also modified to do the same. [dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off] Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:26 -08:00
Ross Zwisler	27a7ffaccd	dax: add tracepoints to dax_pmd_insert_mapping() Add tracepoints to dax_pmd_insert_mapping(), following the same logging conventions as the tracepoints in dax_iomap_pmd_fault(). Here is an example PMD fault showing the new tracepoints: big-1504 [001] .... 326.960743: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 big-1504 [001] .... 326.960753: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE\|ALLOW_RETRY\|KILLABLE\|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 big-1504 [001] .... 326.960981: dax_pmd_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10505000 length 0x200000 pfn 0x100600 DEV\|MAP radix_entry 0xc000e big-1504 [001] .... 326.960986: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE\|ALLOW_RETRY\|KILLABLE\|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/1484085142-2297-6-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:26 -08:00
Ross Zwisler	653b2ea339	dax: add tracepoints to dax_pmd_load_hole() Add tracepoints to dax_pmd_load_hole(), following the same logging conventions as the tracepoints in dax_iomap_pmd_fault(). Here is an example PMD fault showing the new tracepoints: read_big-1478 [004] .... 238.242188: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 read_big-1478 [004] .... 238.242191: dax_pmd_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY\|KILLABLE\|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400 read_big-1478 [004] .... 238.242390: dax_pmd_load_hole: dev 259:0 ino 0x1003 shared address 0x10400000 zero_page ffffea0002c20000 radix_entry 0x1e read_big-1478 [004] .... 238.242392: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY\|KILLABLE\|USER address 0x10400000 vm_start 0x10200000 vm_end 0x10600000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/1484085142-2297-5-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:26 -08:00
Ross Zwisler	282a8e0391	dax: add tracepoint infrastructure, PMD tracing Tracepoints are the standard way to capture debugging and tracing information in many parts of the kernel, including the XFS and ext4 filesystems. Create a tracepoint header for FS DAX and add the first DAX tracepoints to the PMD fault handler. This allows the tracing for DAX to be done in the same way as the filesystem tracing so that developers can look at them together and get a coherent idea of what the system is doing. I added both an entry and exit tracepoint because future patches will add tracepoints to child functions of dax_iomap_pmd_fault() like dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to be wrapped by the parent function tracepoints so the code flow is more easily understood. Having entry and exit tracepoints for faults also allows us to easily see what filesystems functions were called during the fault. These filesystem functions get executed via iomap_begin() and iomap_end() calls, for example, and will have their own tracepoints. For PMD faults we primarily want to understand the type of mapping, the fault flags, the faulting address and whether it fell back to 4k faults. If it fell back to 4k faults the tracepoints should let us understand why. I named the new tracepoint header file "fs_dax.h" to allow for device DAX to have its own separate tracing header in the same directory at some point. Here is an example output for these events from a successful PMD fault: big-1441 [005] .... 32.582758: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003 big-1441 [005] .... 32.582776: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE\|ALLOW_RETRY\|KILLABLE\|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 big-1441 [005] .... 32.583292: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE\|ALLOW_RETRY\|KILLABLE\|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE Link: http://lkml.kernel.org/r/1484085142-2297-3-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:26 -08:00
Liu Bo	6288d6eabc	Btrfs: use the correct type when creating cow dio extent 'BTRFS_ORDERED_REGULAR' was introduced for the cow case in patch 'Btrfs: specify a new ordered extent type for create_io_em', but it missed the directIO cow case. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2017-02-22 15:55:03 -08:00
Filipe Manana	b1517622f2	Btrfs: fix deadlock between dedup on same file and starting writeback If we are deduping two ranges of the same file we need to make sure that we lock all pages in ascending order, that is, lock first the pages from the range with lower offset and then the pages from the other range, as otherwise we can deadlock with a concurrent task that is starting delalloc (writeback). Example trace: [74073.052218] INFO: task kworker/u32:10:17997 blocked for more than 120 seconds. [74073.053889] Tainted: G W 4.9.0-rc7-btrfs-next-36+ #1 [74073.055071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [74073.056696] kworker/u32:10 D 0 17997 2 0x00000000 [74073.058606] Workqueue: writeback wb_workfn (flush-btrfs-53176) [74073.061370] ffff880031e79858 ffff8802159d2580 ffff880237004580 ffff880031e79240 [74073.064784] ffff88023f4978c0 ffffc9000817b638 ffffffff814c15e1 0000000000000000 [74073.068386] ffff88023f4978d8 ffff88023f4978c0 000000000017b620 ffff880031e79240 [74073.071712] Call Trace: [74073.072884] [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4 [74073.075395] [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f [74073.077511] [<ffffffff814c18d2>] schedule+0x8c/0xa0 [74073.079440] [<ffffffff814c4b36>] schedule_timeout+0x43/0xff [74073.081637] [<ffffffff8110953e>] ? time_hardirqs_on+0x9/0x14 [74073.083809] [<ffffffff81095c67>] ? trace_hardirqs_on_caller+0x16/0x197 [74073.086314] [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32 [74073.100654] [<ffffffff810be048>] ? ktime_get+0x41/0x52 [74073.102619] [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102 [74073.104771] [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102 [74073.106969] [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39 [74073.108954] [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99 [74073.110981] [<ffffffff8112b692>] __lock_page+0x6b/0x6d [74073.112833] [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a [74073.115010] [<ffffffffa031178b>] lock_page+0x2f/0x32 [btrfs] [74073.116999] [<ffffffffa0311d9f>] lock_delalloc_pages+0xc7/0x1a0 [btrfs] [74073.119243] [<ffffffffa0313d15>] find_lock_delalloc_range+0xc3/0x1a4 [btrfs] [74073.121636] [<ffffffffa0313e81>] writepage_delalloc.isra.31+0x8b/0x134 [btrfs] [74073.124229] [<ffffffffa0315d69>] __extent_writepage+0x1c1/0x2bf [btrfs] [74073.126372] [<ffffffffa03160f2>] extent_write_cache_pages.isra.30.constprop.49+0x28b/0x36c [btrfs] [74073.129371] [<ffffffffa03165b9>] extent_writepages+0x4b/0x5c [btrfs] [74073.131440] [<ffffffffa02fcb59>] ? insert_reserved_file_extent.constprop.42+0x261/0x261 [btrfs] [74073.134303] [<ffffffff811b4ce4>] ? writeback_sb_inodes+0xe0/0x4a1 [74073.136298] [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs] [74073.138248] [<ffffffff81138200>] do_writepages+0x23/0x2c [74073.139910] [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2 [74073.142003] [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1 [74073.136298] [<ffffffffa02fab7f>] btrfs_writepages+0x28/0x2a [btrfs] [74073.138248] [<ffffffff81138200>] do_writepages+0x23/0x2c [74073.139910] [<ffffffff811b3cab>] __writeback_single_inode+0x105/0x6d2 [74073.142003] [<ffffffff811b4e96>] writeback_sb_inodes+0x292/0x4a1 [74073.143911] [<ffffffff811b511b>] __writeback_inodes_wb+0x76/0xae [74073.145787] [<ffffffff811b53ca>] wb_writeback+0x1cc/0x4d7 [74073.147452] [<ffffffff811b60cd>] wb_workfn+0x194/0x37d [74073.149084] [<ffffffff811b60cd>] ? wb_workfn+0x194/0x37d [74073.150726] [<ffffffff8106ce77>] ? process_one_work+0x154/0x4e4 [74073.152694] [<ffffffff8106cf96>] process_one_work+0x273/0x4e4 [74073.154452] [<ffffffff8106d6db>] worker_thread+0x1eb/0x2ca [74073.156138] [<ffffffff8106d4f0>] ? rescuer_thread+0x2b6/0x2b6 [74073.157837] [<ffffffff81072a81>] kthread+0xd5/0xdd [74073.159339] [<ffffffff810729ac>] ? __kthread_unpark+0x5a/0x5a [74073.161088] [<ffffffff814c6257>] ret_from_fork+0x27/0x40 [74073.162680] INFO: lockdep is turned off. [74073.163855] INFO: task do-dedup:30264 blocked for more than 120 seconds. [74073.181180] Tainted: G W 4.9.0-rc7-btrfs-next-36+ #1 [74073.181180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [74073.185296] fdm-stress D 0 30264 29974 0x00000000 [74073.186810] ffff880089595118 ffff880211b8eac0 ffff880237030380 ffff880089594b00 [74073.188998] ffff88023f2978c0 ffffc900063abb68 ffffffff814c15e1 0000000000000000 [74073.191070] ffff88023f2978d8 ffff88023f2978c0 00000000003abb50 ffff880089594b00 [74073.193286] Call Trace: [74073.193990] [<ffffffff814c15e1>] ? __schedule+0x48f/0x6f4 [74073.195418] [<ffffffff814c1c8b>] ? bit_wait+0x2f/0x2f [74073.196796] [<ffffffff814c18d2>] schedule+0x8c/0xa0 [74073.198163] [<ffffffff814c4b36>] schedule_timeout+0x43/0xff [74073.199621] [<ffffffff81095df5>] ? trace_hardirqs_on+0xd/0xf [74073.201100] [<ffffffff810bde98>] ? timekeeping_get_ns+0x1e/0x32 [74073.202686] [<ffffffff810be048>] ? ktime_get+0x41/0x52 [74073.204051] [<ffffffff814c10f0>] io_schedule_timeout+0xa0/0x102 [74073.205585] [<ffffffff814c10f0>] ? io_schedule_timeout+0xa0/0x102 [74073.207123] [<ffffffff814c1ca6>] bit_wait_io+0x1b/0x39 [74073.208238] [<ffffffff814c1fb8>] __wait_on_bit_lock+0x4f/0x99 [74073.208871] [<ffffffff8112b692>] __lock_page+0x6b/0x6d [74073.209430] [<ffffffff8108ceb4>] ? autoremove_wake_function+0x3a/0x3a [74073.210101] [<ffffffff8112b800>] lock_page+0x2f/0x32 [74073.210636] [<ffffffff8112c502>] pagecache_get_page+0x5e/0x153 [74073.211270] [<ffffffffa03257eb>] gather_extent_pages+0x4e/0x109 [btrfs] [74073.212166] [<ffffffffa032a04c>] btrfs_dedupe_file_range+0x1e1/0x4dd [btrfs] [74073.213257] [<ffffffff8118d9b5>] vfs_dedupe_file_range+0x1c1/0x221 [74073.214086] [<ffffffff8119e0c4>] do_vfs_ioctl+0x442/0x600 [74073.214767] [<ffffffff811a7874>] ? rcu_read_unlock+0x5b/0x5d [74073.215619] [<ffffffff811a7953>] ? __fget+0x6b/0x77 [74073.216338] [<ffffffff8119e2d9>] SyS_ioctl+0x57/0x79 [74073.217149] [<ffffffff814c5fea>] entry_SYSCALL_64_fastpath+0x18/0xad [74073.218102] [<ffffffff81109552>] ? time_hardirqs_off+0x9/0x14 [74073.218968] [<ffffffff810938ce>] ? trace_hardirqs_off_caller+0x1f/0xaa [74073.219938] INFO: lockdep is turned off. What happened was the following: CPU 1 CPU 2 btrfs_dedupe_file_range() --> using same inode as source and target --> src range is [768K, 1Mb[ --> dst range is [0, 256K[ btrfs_cmp_data_prepare() --> calls gather_extent_pages() for range [768K, 1Mb[ and locks all pages in that range do_writepages() btrfs_writepages() extent_writepages() extent_write_cache_pages() __extent_writepage() writepage_delalloc() find_lock_delalloc_range() --> finds range [0, 1Mb[ lock_delalloc_pages() --> locks all pages in the range [0, 768K[ --> tries to lock page at offset 768K --> deadlock --> calls gather_extent_pages() to lock pages in the range [0, 256K[ --> deadlock, task at CPU 1 already locked that range and it's trying to lock the range we locked previously So fix this by making sure that during a dedup we always lock first the pages from the range with lower offset. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2017-02-22 15:55:02 -08:00
Jaegeuk Kim	0333ad4e4f	f2fs: avoid needless checkpoint in f2fs_trim_fs The f2fs_trim_fs() doesn't need to do checkpoint if there are newly allocated data blocks only which didn't change the critical checkpoint data such as nat and sit entries. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2017-02-22 13:16:36 -08:00
Trond Myklebust	a5e14c9376	Revert "NFSv4.1: Handle NFS4ERR_BADSESSION/NFS4ERR_DEADSESSION replies to OP_SEQUENCE" This reverts commit `2cf10cdd48`. The patch has been seen to cause excessive looping. Reported-by: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # 4.10+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-22 15:17:14 -05:00
Linus Torvalds	b2064617c7	driver core patches for 4.11-rc1 Here is the "small" driver core patches for 4.11-rc1. Not much here, some firmware documentation and self-test updates, a debugfs code formatting issue, and a new feature for call_usermodehelper to make it more robust on systems that want to lock it down in a more secure way. All of these have been linux-next for a while now with no reported issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWK2jKg8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ymCEACgozYuqZZ/TUGW0P3xVNi7fbfUWCEAn3nYExrc XgevqeYOSKp2We6X/2JX =aZ+5 -----END PGP SIGNATURE----- Merge tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the "small" driver core patches for 4.11-rc1. Not much here, some firmware documentation and self-test updates, a debugfs code formatting issue, and a new feature for call_usermodehelper to make it more robust on systems that want to lock it down in a more secure way. All of these have been linux-next for a while now with no reported issues" * tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kernfs: handle null pointers while printing node name and path Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper() Make static usermode helper binaries constant kmod: make usermodehelper path a const string firmware: revamp firmware documentation selftests: firmware: send expected errors to /dev/null selftests: firmware: only modprobe if driver is missing platform: Print the resource range if device failed to claim kref: prefer atomic_inc_not_zero to atomic_add_unless debugfs: improve formatting of debugfs_real_fops()	2017-02-22 11:44:32 -08:00
Miklos Szeredi	9a87ad3da9	fuse: release: private_data cannot be NULL Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-02-22 20:08:25 +01:00
Miklos Szeredi	267d84449f	fuse: cleanup fuse_file refcounting struct fuse_file is stored in file->private_data. Make this always be a counting reference for consistency. This also allows fuse_sync_release() to call fuse_file_put() instead of partially duplicating its functionality. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-02-22 20:08:25 +01:00
Miklos Szeredi	2e38bea99a	fuse: add missing FR_FORCE fuse_file_put() was missing the "force" flag for the RELEASE request when sending synchronously (fuseblk). If this flag is not set, then a sync request may be interrupted before it is dequeued by the userspace filesystem. In this case the OPEN won't be balanced with a RELEASE. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `5a18ec176c` ("fuse: fix hang of single threaded fuseblk filesystem") Cc: <stable@vger.kernel.org> # v2.6.38+	2017-02-22 20:08:25 +01:00
Trond Myklebust	9d8cacbf56	NFSv4: Fix reboot recovery in copy offload Copy offload code needs to be hooked into the code for handling NFS4ERR_BAD_STATEID by ensuring that we set the "stateid" field in struct nfs4_exception. Reported-by: Olga Kornievskaia <aglo@umich.edu> Fixes: `2e72448b07` ("NFS: Add COPY nfs operation") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-22 13:49:11 -05:00
Linus Torvalds	3051bf36c2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Pull networking updates from David Miller: "Highlights: 1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini Varadhan. 2) Simplify classifier state on sk_buff in order to shrink it a bit. From Willem de Bruijn. 3) Introduce SIPHASH and it's usage for secure sequence numbers and syncookies. From Jason A. Donenfeld. 4) Reduce CPU usage for ICMP replies we are going to limit or suppress, from Jesper Dangaard Brouer. 5) Introduce Shared Memory Communications socket layer, from Ursula Braun. 6) Add RACK loss detection and allow it to actually trigger fast recovery instead of just assisting after other algorithms have triggered it. From Yuchung Cheng. 7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot. 8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert. 9) Export MPLS packet stats via netlink, from Robert Shearman. 10) Significantly improve inet port bind conflict handling, especially when an application is restarted and changes it's setting of reuseport. From Josef Bacik. 11) Implement TX batching in vhost_net, from Jason Wang. 12) Extend the dummy device so that VF (virtual function) features, such as configuration, can be more easily tested. From Phil Sutter. 13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric Dumazet. 14) Add new bpf MAP, implementing a longest prefix match trie. From Daniel Mack. 15) Packet sample offloading support in mlxsw driver, from Yotam Gigi. 16) Add new aquantia driver, from David VomLehn. 17) Add bpf tracepoints, from Daniel Borkmann. 18) Add support for port mirroring to b53 and bcm_sf2 drivers, from Florian Fainelli. 19) Remove custom busy polling in many drivers, it is done in the core networking since 4.5 times. From Eric Dumazet. 20) Support XDP adjust_head in virtio_net, from John Fastabend. 21) Fix several major holes in neighbour entry confirmation, from Julian Anastasov. 22) Add XDP support to bnxt_en driver, from Michael Chan. 23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan. 24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi. 25) Support GRO in IPSEC protocols, from Steffen Klassert" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits) Revert "ath10k: Search SMBIOS for OEM board file extension" net: socket: fix recvmmsg not returning error from sock_error bnxt_en: use eth_hw_addr_random() bpf: fix unlocking of jited image when module ronx not set arch: add ARCH_HAS_SET_MEMORY config net: napi_watchdog() can use napi_schedule_irqoff() tcp: Revert "tcp: tcp_probe: use spin_lock_bh()" net/hsr: use eth_hw_addr_random() net: mvpp2: enable building on 64-bit platforms net: mvpp2: switch to build_skb() in the RX path net: mvpp2: simplify MVPP2_PRS_RI_* definitions net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT net: mvpp2: remove unused register definitions net: mvpp2: simplify mvpp2_bm_bufs_add() net: mvpp2: drop useless fields in mvpp2_bm_pool and related code net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue' net: mvpp2: release reference to txq_cpu[] entry after unmapping net: mvpp2: handle too large value in mvpp2_rx_time_coal_set() net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set() net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set ...	2017-02-22 10:15:09 -08:00
Trond Myklebust	df3ab232e4	pNFS/flexfiles: If the layout is invalid, it must be updated before retrying If we see that our pNFS READ/WRITE/COMMIT operation failed, but we also see that our layout segment is no longer valid, then we need to get a new layout segment before retrying. Fixes: `90816d1dda` ("NFSv4.1/flexfiles: Don't mark the entire deviceid...") Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-22 10:49:37 -05:00
Linus Torvalds	6d1dd93ea0	Minor changes to pstore tree: - update MAINTAINERS with current git repo, add more files. - move prz allocation checks into the walker - initialize flags correctly (by accident spinlock was technically ok) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 Comment: Kees Cook <kees@outflux.net> iQIcBAABCgAGBQJYrJo2AAoJEIly9N/cbcAm9ooP/jQcPPA4PrvC+U8Kv4JsPCbA AKiYvLpDr5F6f+8yUCyemrNj6dPdIvl0FJTCrPD2EZC6GEBXvy87umRIxzqweSi8 E/MCD9PXJoXZZR4Auy9/O2tVd38BxLXqZAd9fVAouDEJUiNPWX2Zvmna9vEGK5Yh FAJ+GL1iHtH17smrN33DmXu4ro6gl+RVLzqYZc5MoQQcAar4c+IWlZyIuHLUNjF2 FNLTJETyAsiCZQ/zvKgJW8gEWVo/qpSHelFEw/xTjcw2oAnp+H1821bd0UmzSlH6 B4Veyv8DfB2Rw+7eAdFetxQDuvuKLZbrecqSr6h6zSftDC3oXyM+Bx+f3v6RL8mt 3ijv4tToV27odBVJzIfPlXL9HN87/2P3UxVJt2e7X96BEyd/AXRXSJFOilD3tud6 Zc6qanC9Agik+URZPRuQzfGK0csdlsqu6wMmOso5qo2fvQ6TsHATtZiwQZEh2i5r 8gOZn7s1EYt5N1XXcTS5damCORnE0i4TvCryfJmR84WKwHSDe43fVnHloU/FXTjt EZM2ZIwpBMzpWh+glLl4+ARkEZMq8gMmNk1vCa+w0F6B2g+ipM3G9GqpWUiUKeKJ /LRQj8RPzIpeRe7z6wxGkDet5XZY/PM/eXriUXOz0f0Mw7WQiJdhOLsUO0kz0s6S /MWo2b1BHU32uJ41xqca =lSPN -----END PGP SIGNATURE----- Merge tag 'pstore-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: "Minor changes to pstore tree: - update MAINTAINERS with current git repo, add more files. - move prz allocation checks into the walker - initialize flags correctly (by accident spinlock was technically ok)" * tag 'pstore-v4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: MAINTAINERS: Adjust pstore git repo URI, add files pstore: Check for prz allocation in walker pstore: Correctly initialize spinlock and flags	2017-02-21 17:51:37 -08:00
Trond Myklebust	686a816ab6	NFSv4: Clean up owner/group attribute decode Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-21 16:56:16 -05:00
Trond Myklebust	1bbe60ff49	NFSv4: Remove bogus "struct nfs_client" argument from decode_ace() We shouldn't need to force callers to carry an unused argument. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-21 16:56:16 -05:00
Trond Myklebust	5a1f6d9e9b	NFSv4: Fix the underestimation of delegation XDR space reservation Account for the "space_limit" field in struct open_write_delegation4. Fixes: `2cebf82883` ("NFSv4: Fix the underestimate of NFSv4 open request size") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-21 16:56:16 -05:00
Trond Myklebust	c065eeea3b	NFSv4: Replace callback string decode function with a generic Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-21 16:56:16 -05:00
Trond Myklebust	6da59ce2fd	NFSv4: Replace the open coded decode_opaque_inline() with the new generic Also ensure that we always check that the size of the decoded object matches the expectation that it must be smaller than NFS4_OPAQUE_LIMIT. This should be true for all the current users of decode_opaque_inline(), including decode_ace(), decode_pathname(), decode_attr_fs_locations() and decode_exchange_id(). Note that this allows us to get rid of a number of existing checks in decode_exchange_id(), Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-21 16:56:16 -05:00
Trond Myklebust	ab6e9aaf16	NFSv4: Replace ad-hoc xdr encode/decode helpers with xdr_stream_* generics Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2017-02-21 16:56:16 -05:00
Linus Torvalds	c9341ee0af	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security layer updates from James Morris: "Highlights: - major AppArmor update: policy namespaces & lots of fixes - add /sys/kernel/security/lsm node for easy detection of loaded LSMs - SELinux cgroupfs labeling support - SELinux context mounts on tmpfs, ramfs, devpts within user namespaces - improved TPM 2.0 support" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (117 commits) tpm: declare tpm2_get_pcr_allocation() as static tpm: Fix expected number of response bytes of TPM1.2 PCR Extend tpm xen: drop unneeded chip variable tpm: fix misspelled "facilitate" in module parameter description tpm_tis: fix the error handling of init_tis() KEYS: Use memzero_explicit() for secret data KEYS: Fix an error code in request_master_key() sign-file: fix build error in sign-file.c with libressl selinux: allow changing labels for cgroupfs selinux: fix off-by-one in setprocattr tpm: silence an array overflow warning tpm: fix the type of owned field in cap_t tpm: add securityfs support for TPM 2.0 firmware event log tpm: enhance read_log_of() to support Physical TPM event log tpm: enhance TPM 2.0 PCR extend to support multiple banks tpm: implement TPM 2.0 capability to get active PCR banks tpm: fix RC value check in tpm2_seal_trusted tpm_tis: fix iTPM probe via probe_itpm() function tpm: Begin the process to deprecate user_read_timer tpm: remove tpm_read_index and tpm_write_index from tpm.h ...	2017-02-21 12:49:56 -08:00
Tejun Heo	f83f3c5156	kernfs: fix locking around kernfs_ops->release() callback The release callback may be called from two places - file release operation and kernfs open file draining. kernfs_open_file->mutex is used to synchronize the two callsites. This unfortunately leads to possible circular locking because of->mutex is used to protect the usual kernfs operations which may use locking constructs which are held while removing and thus draining kernfs files. @of->mutex is for synchronizing concurrent kernfs access operations and all we need here is synchronization between the releaes and drain paths. As the drain path has to grab kernfs_open_file_mutex anyway, let's use the mutex to synchronize the release operation instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Tony Lindgren <tony@atomide.com> Fixes: `0e67db2f9f` ("kernfs: add kernfs_ops->open/release() callbacks") Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-02-21 15:49:25 -05:00
Jan Kara	cccd9fb9ec	block: Revalidate i_bdev reference in bd_aquire() When a device gets removed, block device inode unhashed so that it is not used anymore (bdget() will not find it anymore). Later when a new device gets created with the same device number, we create new block device inode. However there may be file system device inodes whose i_bdev still points to the original block device inode and thus we get two active block device inodes for the same device. They will share the same gendisk so the only visible differences will be that page caches will not be coherent and BDIs will be different (the old block device inode still points to unregistered BDI). Fix the problem by checking in bd_acquire() whether i_bdev still points to active block device inode and re-lookup the block device if not. That way any open of a block device happening after the old device has been removed will get correct block device inode. Tested-by: Lekshmi Pillai <lekshmicpillai@in.ibm.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-21 12:51:54 -07:00
Eric W. Biederman	ace0c791e6	proc/sysctl: Don't grab i_lock under sysctl_lock. Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > This patch has locking problem. I've got lockdep splat under LTP. > > [ 6633.115456] ====================================================== > [ 6633.115502] [ INFO: possible circular locking dependency detected ] > [ 6633.115553] 4.9.10-debug+ #9 Tainted: G L > [ 6633.115584] ------------------------------------------------------- > [ 6633.115627] ksm02/284980 is trying to acquire lock: > [ 6633.115659] (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80 > [ 6633.115834] but task is already holding lock: > [ 6633.115882] (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110 > [ 6633.116026] which lock already depends on the new lock. > [ 6633.116026] > [ 6633.116080] > [ 6633.116080] the existing dependency chain (in reverse order) is: > [ 6633.116117] > -> #2 (sysctl_lock){+.+...}: > -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}: > -> #0 (&sb->s_type->i_lock_key#4){+.+...}: > > d_lock nests inside i_lock > sysctl_lock nests inside d_lock in d_compare > > This patch adds i_lock nesting inside sysctl_lock. Al Viro <viro@ZenIV.linux.org.uk> replied: > Once ->unregistering is set, you can drop sysctl_lock just fine. So I'd > try something like this - use rcu_read_lock() in proc_sys_prune_dcache(), > drop sysctl_lock() before it and regain after. Make sure that no inodes > are added to the list ones ->unregistering has been set and use RCU list > primitives for modifying the inode list, with sysctl_lock still used to > serialize its modifications. > > Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing > igrab() is safe there. Since we don't drop inode reference until after we'd > passed beyond it in the list, list_for_each_entry_rcu() should be fine. I agree with Al Viro's analsysis of the situtation. Fixes: `d6cffbbe9a` ("proc/sysctl: prune stale dentries during unregistering") Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2017-02-22 08:34:53 +13:00
Linus Torvalds	772c8f6f3b	for-4.11/linus-merge-signed -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABCAAGBQJYqeb8AAoJEPfTWPspceCmB3UP/3UtcPrzEm8w2cxB9MaWhZN3 J+jiwlO4vaqhm2HVzQtoJqfaqRlud/iDx5cIXE2S7FnIM54ZKs3CANbKu8X+b1zm eJije3zMI8A8qyftigbz6a/Y2kWE4ZqFEc9WU5CWawfTl3ImCVUi8+F5X0wOLU/h r50zAQOEyURH4G5usNl9q0olF6FonJ82AcYm1iJ0QP2wYWZRJauC0rRn8IT93tyK bZPHnGKdkd7km8yi3zr2GNWOfuZZuA0HWAaF4qfrHPZQ883gITFAUIlFb1f+2TNl DkQzRrBB2wPWPnlbfb9KejMkvL94hflzsLb5rHt835DyVXFRyjxsgyAI8A+LPGSz vqZ3rsbWj6H4F9z2CkZ+T+AP/ZSWDNjwc0RXPm9HYdR5CDeTxIUVvnFQ44YNsmTv Xd5BKrUJ2oKegAxQG6zcuFx23p8JzhT70l+mNrMdtyeKnDD9FRdDvhKG9AHeTipn o/DnGivhS3UMQoQ7D68KOO+kuhLDeo7my5XGsnjzMO/iHqg++7IP2HyYYs/Ba4qZ cYaCtSDQW71Zt0vsqa6dvPuXBveu4h8Qh8R7uAGjSGS9IAFFb4Cab2tiUdISE6PE YnMWzY+G6pT8imlLVOL5/QFuo2Q4pUsaL0AHpXMCN9TZnQtbqXa8eqwnKnQ0m2KN 7ut0IYYEPaYUX5xFn1K6 =z7AL -----END PGP SIGNATURE----- Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block Pull block layer updates from Jens Axboe: - blk-mq scheduling framework from me and Omar, with a port of the deadline scheduler for this framework. A port of BFQ from Paolo is in the works, and should be ready for 4.12. - Various fixups and improvements to the above scheduling framework from Omar, Paolo, Bart, me, others. - Cleanup of the exported sysfs blk-mq data into debugfs, from Omar. This allows us to export more information that helps debug hangs or performance issues, without cluttering or abusing the sysfs API. - Fixes for the sbitmap code, the scalable bitmap code that was migrated from blk-mq, from Omar. - Removal of the BLOCK_PC support in struct request, and refactoring of carrying SCSI payloads in the block layer. This cleans up the code nicely, and enables us to kill the SCSI specific parts of struct request, shrinking it down nicely. From Christoph mainly, with help from Hannes. - Support for ranged discard requests and discard merging, also from Christoph. - Support for OPAL in the block layer, and for NVMe as well. Mainly from Scott Bauer, with fixes/updates from various others folks. - Error code fixup for gdrom from Christophe. - cciss pci irq allocation cleanup from Christoph. - Making the cdrom device operations read only, from Kees Cook. - Fixes for duplicate bdi registrations and bdi/queue life time problems from Jan and Dan. - Set of fixes and updates for lightnvm, from Matias and Javier. - A few fixes for nbd from Josef, using idr to name devices and a workqueue deadlock fix on receive. Also marks Josef as the current maintainer of nbd. - Fix from Josef, overwriting queue settings when the number of hardware queues is updated for a blk-mq device. - NVMe fix from Keith, ensuring that we don't repeatedly mark and IO aborted, if we didn't end up aborting it. - SG gap merging fix from Ming Lei for block. - Loop fix also from Ming, fixing a race and crash between setting loop status and IO. - Two block race fixes from Tahsin, fixing request list iteration and fixing a race between device registration and udev device add notifiations. - Double free fix from cgroup writeback, from Tejun. - Another double free fix in blkcg, from Hou Tao. - Partition overflow fix for EFI from Alden Tondettar. * tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits) nvme: Check for Security send/recv support before issuing commands. block/sed-opal: allocate struct opal_dev dynamically block/sed-opal: tone down not supported warnings block: don't defer flushes on blk-mq + scheduling blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers blk-mq: don't special case flush inserts for blk-mq-sched blk-mq-sched: don't add flushes to the head of requeue queue blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not block: do not allow updates through sysfs until registration completes lightnvm: set default lun range when no luns are specified lightnvm: fix off-by-one error on target initialization Maintainers: Modify SED list from nvme to block Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN uapi: sed-opal fix IOW for activate lsp to use correct struct cdrom: Make device operations read-only elevator: fix loading wrong elevator type for blk-mq devices cciss: switch to pci_irq_alloc_vectors block/loop: fix race between I/O and set_status blk-mq-sched: don't hold queue_lock when calling exit_icq block: set make_request_fn manually in blk_mq_update_nr_hw_queues ...	2017-02-21 10:57:33 -08:00
Linus Torvalds	9763dd6f81	We've got eight GFS2 patches for this merge window: 1. Andy Price submitted a patch to make gfs2_write_full_page a static function. 2. Dan Carpenter submitted a patch to fix a ERR_PTR thinko. I've also got a few patches, three of which fix bugs related to deleting very large files, which cause GFS2 to run out of journal space: 3. The first one prevents GFS2 delete operation from requesting too much journal space. 4. The second one fixes a problem whereby GFS2 can hang because it wasn't taking journal space demand into its calculations. 5. The third one wakes up IO waiters when a flush is done to restart processes stuck waiting for journal space to become available. The other three patches are a performance improvement related to spin_lock contention between multiple writers: 6. The "tr_touched" variable was switched to a flag to be more atomic and eliminate the possibility of some races. 7. Function meta_lo_add was moved inline with its only caller to make the code more readable and efficient. 8. Contention on the gfs2_log_lock spinlock was greatly reduced by avoiding the lock altogether in cases where we don't really need it: buffers that already appear in the appropriate metadata list for the journal. Many thanks to Steve Whitehouse for the ideas and principles behind these patches. -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJYrEEEAAoJENeLYdPf93o7bjoIAIqPG/EAzi+idgMWDPQa9Eit 53dPy16snkrbWwtaK6spSWlH6bGYuHeanXORYon9bvtVjKYaa4NQclGihN2IE6uB O8zT+MGwP45LDhNplVJpumaOALZ9ZDqQSe+3tHeNK3FhNirLyiIjSqrHt/7Yi1qi fPLlT4Jx0TBo5rhvEGa7Yg01WhWVtnmVSMqJXj/7ZtC50s1aPyDUikdNIDfDCN2X LxfKGDXuk6p63VQ6JKqYSBVATCR0/bbKfkuk/kBUTYLoHoapImxB8d0HgIdsh1Mv 9PlbZnnNW8k5oapuhVxjl0T5G0JsQgCkPb/wlte+ryOCjBoc2L2fCUV5qc0QxWc= =xQyl -----END PGP SIGNATURE----- Merge tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull GFS2 updates from Robert Peterson: "We've got eight GFS2 patches for this merge window: - Andy Price submitted a patch to make gfs2_write_full_page a static function. - Dan Carpenter submitted a patch to fix a ERR_PTR thinko. Three patches fix bugs related to deleting very large files, which cause GFS2 to run out of journal space: - The first one prevents GFS2 delete operation from requesting too much journal space. - The second one fixes a problem whereby GFS2 can hang because it wasn't taking journal space demand into its calculations. - The third one wakes up IO waiters when a flush is done to restart processes stuck waiting for journal space to become available. The final three patches are a performance improvement related to spin_lock contention between multiple writers: - The "tr_touched" variable was switched to a flag to be more atomic and eliminate the possibility of some races. - Function meta_lo_add was moved inline with its only caller to make the code more readable and efficient. - Contention on the gfs2_log_lock spinlock was greatly reduced by avoiding the lock altogether in cases where we don't really need it: buffers that already appear in the appropriate metadata list for the journal. Many thanks to Steve Whitehouse for the ideas and principles behind these patches" * tag 'gfs2-4.11.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Make gfs2_write_full_page static GFS2: Reduce contention on gfs2_log_lock GFS2: Inline function meta_lo_add GFS2: Switch tr_touched to flag in transaction GFS2: Wake up io waiters whenever a flush is done GFS2: Made logd daemon take into account log demand GFS2: Limit number of transaction blocks requested for truncates GFS2: Fix reference to ERR_PTR in gfs2_glock_iter_next	2017-02-21 07:46:34 -08:00
Linus Torvalds	70fcf5c339	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF fixes and cleanups from Jan Kara: "Several small UDF fixes and cleanups and a small cleanup of fanotify code" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: simplify the code of fanotify_merge udf: simplify udf_ioctl() udf: fix ioctl errors udf: allow implicit blocksize specification during mount udf: check partition reference in udf_read_inode() udf: atomically read inode size udf: merge module informations in super.c udf: remove next_epos from udf_update_extent_cache() udf: Factor out trimming of crtime udf: remove empty condition udf: remove unneeded line break udf: merge bh free udf: use pointer for kernel_long_ad argument udf: use __packed instead of __attribute__ ((packed)) udf: Make stat on symlink report symlink length as st_size fs/udf: make #ifdef UDF_PREALLOCATE unconditional fs: udf: Replace CURRENT_TIME with current_time()	2017-02-21 07:44:03 -08:00
Christoph Hellwig	783112f740	nfsd: special case truncates some more Both the NFS protocols and the Linux VFS use a setattr operation with a bitmap of attributes to set to set various file attributes including the file size and the uid/gid. The Linux syscalls never mix size updates with unrelated updates like the uid/gid, and some file systems like XFS and GFS2 rely on the fact that truncates don't update random other attributes, and many other file systems handle the case but do not update the other attributes in the same transaction. NFSD on the other hand passes the attributes it gets on the wire more or less directly through to the VFS, leading to updates the file systems don't expect. XFS at least has an assert on the allowed attributes, which caught an unusual NFS client setting the size and group at the same time. To handle this issue properly this splits the notify_change call in nfsd_setattr into two separate ones. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Tested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-02-21 10:13:37 -05:00
Linus Torvalds	2bfe01eff4	Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 Pull CIFS/SMB3 updates from Steve French: "Includes support for a critical SMB3 security feature: per-share encryption from Pavel, and a cleanup from Jean Delvare. Will have another cifs/smb3 merge next week" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Allow to switch on encryption with seal mount option CIFS: Add capability to decrypt big read responses CIFS: Decrypt and process small encrypted packets CIFS: Add copy into pages callback for a read operation CIFS: Add mid handle callback CIFS: Add transform header handling callbacks CIFS: Encrypt SMB3 requests before sending CIFS: Enable encryption during session setup phase CIFS: Add capability to transform requests before sending CIFS: Separate RFC1001 length processing for SMB2 read CIFS: Separate SMB2 sync header processing CIFS: Send RFC1001 length in a separate iov CIFS: Make send_cancel take rqst as argument CIFS: Make SendReceive2() takes resp iov CIFS: Separate SMB2 header structure CIFS: Fix splice read for non-cached files cifs: Add soft dependencies cifs: Only select the required crypto modules cifs: Simplify SMB2 and SMB311 dependencies	2017-02-20 18:38:47 -08:00
Linus Torvalds	cab7076a18	For this cycle we add support for the shutdown ioctl, which is primarily used for testing, but which can be useful on production systems when a scratch volume is being destroyed and the data on it doesn't need to be saved. This found (and we fixed) a number of bugs with ext4's recovery to corrupted file system --- the bugs increased the amount of data that could be potentially lost, and in the case of the inline data feature, could cause the kernel to BUG. Also included are a number of other bug fixes, including in ext4's fscrypt, DAX, inline data support. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlirXesACgkQ8vlZVpUN gaMOzQf8Ct6uPatV+m855oR4dAbZr2+lY4A4C+vHDzBtSMkPRyLX8cuo8XcwfTIm vPVyDnL6EPyhXPxxfItu+92wAq1m5mVpKo57d0Ft5lw0rHxNtJTgVSRzsQ7VDRjj 5qMHW2K7Bk7EjzTeW3SF8/3+hqpzkAvRtNCntcomk5h08+cWMC8JSnn1kqw+naIn EcbrC72GZb8JUELogVXC2vU58lp50SSBdr3l005jqKc5BvljMvdJ0Izn/3RVyU7u q7vtynhe2ScFcHe/UzL1QgmQOy32tJpbS0NHalW47aw3Ynmn4cSX0YhhT9FDjRNQ VOOfo1m1sAg166x0E+Nn7FeghTSSyA== =cPIf -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "For this cycle we add support for the shutdown ioctl, which is primarily used for testing, but which can be useful on production systems when a scratch volume is being destroyed and the data on it doesn't need to be saved. This found (and we fixed) a number of bugs with ext4's recovery to corrupted file system --- the bugs increased the amount of data that could be potentially lost, and in the case of the inline data feature, could cause the kernel to BUG. Also included are a number of other bug fixes, including in ext4's fscrypt, DAX, inline data support" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits) ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN ext4: fix fencepost in s_first_meta_bg validation ext4: don't BUG when truncating encrypted inodes on the orphan list ext4: do not use stripe_width if it is not set ext4: fix stripe-unaligned allocations dax: assert that i_rwsem is held exclusive for writes ext4: fix DAX write locking ext4: add EXT4_IOC_GOINGDOWN ioctl ext4: add shutdown bit and check for it ext4: rename s_resize_flags to s_ext4_flags ext4: return EROFS if device is r/o and journal replay is needed ext4: preserve the needs_recovery flag when the journal is aborted jbd2: don't leak modified metadata buffers on an aborted journal ext4: fix inline data error paths ext4: move halfmd4 into hash.c directly ext4: fix use-after-iput when fscrypt contexts are inconsistent jbd2: fix use after free in kjournald2() ext4: fix data corruption in data=journal mode ext4: trim allocation requests to group size ext4: replace BUG_ON with WARN_ON in mb_find_extent() ...	2017-02-20 18:24:39 -08:00
Linus Torvalds	6c24337f22	Various cleanups for the file system encryption feature. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlirP6wACgkQ8vlZVpUN gaMwpQgApR67CxzlstxYjZpWPAqC8McJ2FBDX+mCOle5Vkc1WQDklwr0oCfQThTj eDSFRhNfIvyPh0DJ589PxBCsWOqN5h6Si7hD5ZinomVNI+IL89OytaU5EV2OpWaW iKdJgO9Tm8U7LuY6FOIoVdX57kUXVdkWoj61rC056B1SNhnNiVeofi7lYDM8Ix4q IGSQ9W24iQKmCk4hCwgObhJBRK9RnlOH0GLUmpMaS+jnfnj/uePwdxWEFsPuCOob 8acAJ49lr55kjIw79E0BAyWxhEZ2aiArHk8PaWynT/DyNq3ftcapPlpftoeba8vo glBJRX70QxPvt0iHEp0ykfExkhWhFA== =Joki -----END PGP SIGNATURE----- Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt Pull fscrypt updates from Ted Ts'o: "Various cleanups for the file system encryption feature" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: constify struct fscrypt_operations fscrypt: properly declare on-stack completion fscrypt: split supp and notsupp declarations into their own headers fscrypt: remove redundant assignment of res fscrypt: make fscrypt_operations.key_prefix a string fscrypt: remove unused 'mode' member of fscrypt_ctx ext4: don't allow encrypted operations without keys fscrypt: make test_dummy_encryption require a keyring key fscrypt: factor out bio specific functions fscrypt: pass up error codes from ->get_context() fscrypt: remove user-triggerable warning messages fscrypt: use EEXIST when file already uses different policy fscrypt: use ENOTDIR when setting encryption policy on nondirectory fscrypt: use ENOKEY when file cannot be created w/o key	2017-02-20 18:22:31 -08:00
Christoph Hellwig	758e99fefe	nfsd: minor nfsd_setattr cleanup Simplify exit paths, size_change use. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-02-20 17:20:44 -05:00
J. Bruce Fields	60709c093e	nfsd: merge stable fix into main nfsd branch	2017-02-20 17:20:05 -05:00
Linus Torvalds	42e1b14b6e	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - Implement wraparound-safe refcount_t and kref_t types based on generic atomic primitives (Peter Zijlstra) - Improve and fix the ww_mutex code (Nicolai Hähnle) - Add self-tests to the ww_mutex code (Chris Wilson) - Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr Bueso) - Micro-optimize the current-task logic all around the core kernel (Davidlohr Bueso) - Tidy up after recent optimizations: remove stale code and APIs, clean up the code (Waiman Long) - ... plus misc fixes, updates and cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) fork: Fix task_struct alignment locking/spinlock/debug: Remove spinlock lockup detection code lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS lkdtm: Convert to refcount_t testing kref: Implement 'struct kref' using refcount_t refcount_t: Introduce a special purpose refcount type sched/wake_q: Clarify queue reinit comment sched/wait, rcuwait: Fix typo in comment locking/mutex: Fix lockdep_assert_held() fail locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock() locking/rwsem: Reinit wake_q after use locking/rwsem: Remove unnecessary atomic_long_t casts jump_labels: Move header guard #endif down where it belongs locking/atomic, kref: Implement kref_put_lock() locking/ww_mutex: Turn off __must_check for now locking/atomic, kref: Avoid more abuse locking/atomic, kref: Use kref_get_unless_zero() more locking/atomic, kref: Kill kref_sub() locking/atomic, kref: Add kref_read() locking/atomic, kref: Add KREF_INIT() ...	2017-02-20 13:23:30 -08:00
Linus Torvalds	828cad8ea0	Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this (fairly busy) cycle were: - There was a class of scheduler bugs related to forgetting to update the rq-clock timestamp which can cause weird and hard to debug problems, so there's a new debug facility for this: which uncovered a whole lot of bugs which convinced us that we want to keep the debug facility. (Peter Zijlstra, Matt Fleming) - Various cputime related updates: eliminate cputime and use u64 nanoseconds directly, simplify and improve the arch interfaces, implement delayed accounting more widely, etc. - (Frederic Weisbecker) - Move code around for better structure plus cleanups (Ingo Molnar) - Move IO schedule accounting deeper into the scheduler plus related changes to improve the situation (Tejun Heo) - ... plus a round of sched/rt and sched/deadline fixes, plus other fixes, updats and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits) sched/core: Remove unlikely() annotation from sched_move_task() sched/autogroup: Rename auto_group.[ch] to autogroup.[ch] sched/topology: Split out scheduler topology code from core.c into topology.c sched/core: Remove unnecessary #include headers sched/rq_clock: Consolidate the ordering of the rq_clock methods delayacct: Include <uapi/linux/taskstats.h> sched/core: Clean up comments sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds sched/clock: Add dummy clear_sched_clock_stable() stub function sched/cputime: Remove generic asm headers sched/cputime: Remove unused nsec_to_cputime() s390, sched/cputime: Remove unused cputime definitions powerpc, sched/cputime: Remove unused cputime definitions s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs ia64, sched/cputime: Remove unused cputime definitions ia64: Convert vtime to use nsec units directly ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it sched/cputime: Remove jiffies based cputime sched/cputime, vtime: Return nsecs instead of cputime_t to account sched/cputime: Complete nsec conversion of tick based accounting ...	2017-02-20 12:52:55 -08:00
Theodore Ts'o	e9be2ac7c0	ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN It's very likely the file system independent ioctl name will be FS_IOC_SHUTDOWN, so let's use the same name for the ext4 ioctl name. Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-02-20 15:34:59 -05:00
Linus Torvalds	20dcfe1b7d	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Nothing exciting, just the usual pile of fixes, updates and cleanups: - A bunch of clocksource driver updates - Removal of CONFIG_TIMER_STATS and the related /proc file - More posix timer slim down work - A scalability enhancement in the tick broadcast code - Math cleanups" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) hrtimer: Catch invalid clockids again math64, tile: Fix build failure clocksource/drivers/arm_arch_timer:: Mark cyclecounter __ro_after_init timerfd: Protect the might cancel mechanism proper timer_list: Remove useless cast when printing time: Remove CONFIG_TIMER_STATS clocksource/drivers/arm_arch_timer: Work around Hisilicon erratum 161010101 clocksource/drivers/arm_arch_timer: Introduce generic errata handling infrastructure clocksource/drivers/arm_arch_timer: Remove fsl-a008585 parameter clocksource/drivers/arm_arch_timer: Add dt binding for hisilicon-161010101 erratum clocksource/drivers/ostm: Add renesas-ostm timer driver clocksource/drivers/ostm: Document renesas-ostm timer DT bindings clocksource/drivers/tcb_clksrc: Use 32 bit tcb as sched_clock clocksource/drivers/gemini: Add driver for the Cortina Gemini clocksource: add DT bindings for Cortina Gemini clockevents: Add a clkevt-of mechanism like clksrc-of tick/broadcast: Reduce lock cacheline contention timers: Omit POSIX timer stuff from task_struct when disabled x86/timer: Make delay() work during early bootup delay: Add explanation of udelay() inaccuracy ...	2017-02-20 10:06:32 -08:00
Miklos Szeredi	0eb8af4916	vfs: use helper for calling f_op->fsync() Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-02-20 16:51:23 +01:00
Miklos Szeredi	f74ac01520	mm: use helper for calling f_op->mmap() Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-02-20 16:51:23 +01:00
Miklos Szeredi	bb7462b6fd	vfs: use helpers for calling f_op->{read,write}_iter() Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-02-20 16:51:23 +01:00
Miklos Szeredi	0f78d06ac1	vfs: pass type instead of fn to do_{loop,iter}_readv_writev() Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-02-20 16:51:23 +01:00
Miklos Szeredi	7687a7a443	vfs: extract common parts of {compat_,}do_readv_writev() Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2017-02-20 16:51:23 +01:00
Jeff Layton	df963ea8a0	ceph: remove req from unsafe list when unregistering it There's no reason a request should ever be on a s_unsafe list but not in the request tree. Cc: stable@vger.kernel.org Link: http://tracker.ceph.com/issues/18474 Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 13:06:03 +01:00
Jeff Layton	5eb9f6040f	ceph: do a LOOKUP in d_revalidate instead of GETATTR In commit `c3f4688a08` (ceph: don't set req->r_locked_dir in ceph_d_revalidate), we changed the code to do a GETATTR instead of a LOOKUP as the parent info isn't strictly necessary to revalidate the dentry. What we missed there though is that in order to update the lease on the dentry after revalidating it, we _do_ need parent info. Change ceph_d_revalidate back to doing a LOOKUP instead of a GETATTR so that we can get the parent info in order to update the lease from ceph_fill_trace. Note that we set req->r_parent here, but we cannot set the CEPH_MDS_R_PARENT_LOCKED flag as we can't guarantee that it is. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:10 +01:00
Jeff Layton	cdde7c4351	ceph: call update_dentry_lease even when r_locked dir is not set We don't really require that the parent be locked in order to update the lease on a dentry. Lease info is protected by the d_lock. In the event that the parent is not locked in ceph_fill_trace, and we have both parent and target info, go ahead and update the dentry lease. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:10 +01:00
Jeff Layton	f5d55f0397	ceph: vet the target and parent inodes before updating dentry lease In a later patch, we're going to need to allow ceph_fill_trace to update the dentry's lease when the parent is not locked. This is potentially racy though -- by the time we get around to processing the trace, the parent may have already changed. Change update_dentry_lease to take a ceph_vino pointer and use that to ensure that the dentry's parent still matches it before updating the lease. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:09 +01:00
Jeff Layton	80d025ffed	ceph: don't update_dentry_lease unless we actually got one This if block updates the dentry lease even in the case where the MDS didn't grant one. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:09 +01:00
Jeff Layton	3dd69aabce	ceph: add a new flag to indicate whether parent is locked struct ceph_mds_request has an r_locked_dir pointer, which is set to indicate the parent inode and that its i_rwsem is locked. In some critical places, we need to be able to indicate the parent inode to the request handling code, even when its i_rwsem may not be locked. Most of the code that operates on r_locked_dir doesn't require that the i_rwsem be locked. We only really need it to handle manipulation of the dcache. The rest (filling of the inode, updating dentry leases, etc.) already has its own locking. Add a new r_req_flags bit that indicates whether the parent is locked when doing the request, and rename the pointer to "r_parent". For now, all the places that set r_parent also set this flag, but that will change in a later patch. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:08 +01:00
Jeff Layton	bc2de10dc4	ceph: convert bools in ceph_mds_request to a new r_req_flags field Currently, we have a bunch of bool flags in struct ceph_mds_request. We need more flags though, but each bool takes (at least) a byte. Those add up over time. Merge all of the existing bools in this struct into a single unsigned long, and use the set/test/clear_bit macros to manipulate them. These are atomic operations, but that is required here to prevent load/modify/store races. The existing flags are protected by different locks, so we can't rely on them for that purpose. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:08 +01:00
Jeff Layton	f5a03b0804	ceph: drop session argument to ceph_fill_trace Just get it from r_session since that's what's always passed in. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:08 +01:00
Jeff Layton	6fffaef954	ceph: remove "Debugging hook" from ceph_fill_trace Keeping around commented out code is just asking for it to bitrot and makes viewing the code under cscope more confusing. If we really need this, then we can revert this patch and put it under a Kconfig option. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:08 +01:00
Yan, Zheng	c1944fedd8	ceph: avoid calling ceph_renew_caps() infinitely __ceph_caps_mds_wanted() ignores caps from stale session. So the return value of __ceph_caps_mds_wanted() can keep the same across ceph_renew_caps(). This causes try_get_cap_refs() to keep calling ceph_renew_caps(). The fix is ignore the session valid check for the try_get_cap_refs() case. If session is stale, just let the caps requester sleep. Signed-off-by: Yan, Zheng <zyan@redhat.com>	2017-02-20 12:16:07 +01:00
Yan, Zheng	00f06cba53	ceph: make sure flushing inode in proper session's cap_flushing list when flushing inode's auth cap changes, we need to move it into the new auth cap session's cap_flushing list Signed-off-by: Yan, Zheng <zyan@redhat.com>	2017-02-20 12:16:07 +01:00
Yan, Zheng	d641df819d	ceph: update readpages osd request according to size of pages add_to_page_cache_lru() can fails, so the actual pages to read can be smaller than the initial size of osd request. We need to update osd request size in that case. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com>	2017-02-20 12:16:07 +01:00
Jeff Layton	24c149ad69	ceph: fix bogus endianness change in ceph_ioctl_set_layout sparse says: fs/ceph/ioctl.c💯28: warning: cast to restricted __le64 preferred_osd is a __s64 so we don't need to do any conversion. Also, just remove the cast in ceph_ioctl_get_layout as it's not needed. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:07 +01:00
Yan, Zheng	eb65b919b9	ceph: avoid updating mds_wanted too frequently user space may open/close single file frequently. It's not good to send a clientcaps message to mds for each open/close syscall. Signed-off-by: Yan, Zheng <zyan@redhat.com>	2017-02-20 12:16:06 +01:00
Andreas Gerstmayr	7c94ba2790	ceph: set io_pages bdi hint This patch sets the io_pages bdi hint based on the rsize mount option. Without this patch large buffered reads (request size > max readahead) are processed sequentially in chunks of the readahead size (i.e. read requests are sent out up to the readahead size, then the do_generic_file_read() function waits until the first page is received). With this patch read requests are sent out at once up to the size specified in the rsize mount option (default: 64 MB). Signed-off-by: Andreas Gerstmayr <andreas.gerstmayr@catalysts.cc> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>	2017-02-20 12:16:05 +01:00
Colin Ian King	0fbc5360bf	ceph: fix spelling mistake: "enabing" -> "enabling" trivial fix to spelling mistake in debug message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>	2017-02-20 12:16:05 +01:00
Seraphime Kirkovski	52953d5591	ceph: cleanup ACCESS_ONCE -> READ_ONCE This removes the uses of ACCESS_ONCE in favor of READ_ONCE Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>	2017-02-20 12:16:05 +01:00
Jeff Layton	ca6c8ae0f7	ceph: pass parent inode info to ceph_encode_dentry_release if we have it If we have a parent inode reference already, then we don't need to go back up the directory tree to find one. Link: http://tracker.ceph.com/issues/18148 Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:05 +01:00
Jeff Layton	adf0d68701	ceph: fix unsafe dcache access in ceph_encode_dentry_release Accessing d_parent requires some sort of locking or it could vanish out from under us. Since we take the d_lock anyway, use that to fetch d_parent and take a reference to it, and then use that reference to call ceph_encode_inode_release. Link: http://tracker.ceph.com/issues/18148 Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:05 +01:00
Jeff Layton	fd36a71762	ceph: pass parent dir ino info to build_dentry_path In the event that we have a parent inode reference in the request, we can use that instead of mucking about in the dcache. Pass any parent inode info we have down to build_dentry_path so it can make use of it. Link: http://tracker.ceph.com/issues/18148 Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:05 +01:00
Jeff Layton	c6b0b656ca	ceph: clean up unsafe d_parent accesses in build_dentry_path While we hold a reference to the dentry when build_dentry_path is called, we could end up racing with a rename that changes d_parent. Handle that situation correctly, by using the rcu_read_lock to ensure that the parent dentry and inode stick around long enough to safely check ceph_snap and ceph_ino. Link: http://tracker.ceph.com/issues/18148 Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:05 +01:00
Jeff Layton	30c71233a1	ceph: clean up unsafe d_parent access in __choose_mds __choose_mds exists to pick an MDS to use when issuing a call. Doing that typically involves picking an inode and using the authoritative MDS for it. In most cases, that's pretty straightforward, as we are using an inode to which we hold a reference (usually represented by r_dentry or r_inode in the request). In the case of a snapshotted directory however, we need to fetch the non-snapped parent, which involves walking back up the parents in the tree. The dentries in the snapshot dir are effectively frozen but the overall parent is _not_, and could vanish if a concurrent rename were to occur. Clean this code up and take special care to ensure the validity of the entries we're working with. First, try to use the inode in r_locked_dir if one exists. If not and all we have is r_dentry, then we have to walk back up the tree. Use the rcu_read_lock for this so we can ensure that any d_parent we find won't go away, and take extra care to deal with the possibility that the dentries could go negative. Change get_nonsnap_parent to return an inode, and take a reference to that inode before returning (if any). Change all of the other places where we set "inode" in __choose_mds to also take a reference, and then call iput on that inode before exiting the function. Link: http://tracker.ceph.com/issues/18148 Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2017-02-20 12:16:04 +01:00
Dan Carpenter	eec11535ca	hfs: fix hfs_readdir() I was looking through static analysis warnings and there is a bug here that goes all the way back to the start of git. Basically we're copying the pointer and nearby garbage instead of the data the fd.key pointer is pointing to. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2017-02-18 22:11:15 -05:00
Christoph Hellwig	8d242e932f	xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AG XFS_ALLOCTYPE_ANY_AG was only used for the RT allocator and is unused now, and XFS_ALLOCTYPE_START_AG has been unused for a while. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-17 20:32:10 -08:00
Christoph Hellwig	089ec2f875	xfs: simplify xfs_rtallocate_extent We can deduce the allocation type from the bno argument, and do the return without prod much simpler internally. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix the macro for the non-rt build] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-17 16:52:52 -08:00
Kinglong Mee	7323f0d288	NFSD: Reserve adequate space for LOCKT operation After tightening the OP_LOCKT reply size estimate, we can get warnings like: [11512.783519] RPC request reserved 124 but used 152 [11512.813624] RPC request reserved 108 but used 136 Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-02-17 16:26:04 -05:00
Kinglong Mee	2282cd2c05	NFSD: Get response size before operation for all RPCs NFSD usess PAGE_SIZE as the reply size estimate for RPCs which don't support op_rsize_bop(), A PAGE_SIZE (4096) is larger than many real response sizes, eg, access (op_encode_hdr_size + 2), seek (op_encode_hdr_size + 3). This patch just adds op_rsize_bop() for all RPCs getting response size. An overestimate is generally safe but the tighter estimates are probably better. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-02-17 16:26:03 -05:00
Kinglong Mee	827433801c	nfsd/callback: Drop a useless data copy when comparing sessionid Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-02-17 16:26:02 -05:00
Kinglong Mee	e86a40bc73	nfsd/callback: skip the callback tag The callback tag is NULL, and hdr->nops is unused too right now, but. But if we were to ever test with a nonzero callback tag, nops will get a bad value. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-02-17 16:26:01 -05:00
Kinglong Mee	f7d1ddbe76	nfsd/callback: Cleanup callback cred on shutdown The rpccred gotten from rpc_lookup_machine_cred() should be put when state is shutdown. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-02-17 16:26:00 -05:00
Kinglong Mee	c3821b3497	nfsd/idmap: return nfserr_inval for 0-length names Tigran Mkrtchyan's new pynfs testcases for zero length principals fail: SATT16 st_setattr.testEmptyPrincipal : FAILURE Setting empty owner should return NFS4ERR_INVAL, instead got NFS4ERR_BADOWNER SATT17 st_setattr.testEmptyGroupPrincipal : FAILURE Setting empty owner_group should return NFS4ERR_INVAL, instead got NFS4ERR_BADOWNER This patch checks the principal and returns nfserr_inval directly. It could check after decoding in nfs4xdr.c, but it's simpler to do it in nfsd_map_xxxx. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>	2017-02-17 16:25:59 -05:00
Jens Axboe	818551e2b2	Merge branch 'for-4.11/next' into for-4.11/linus-merge Signed-off-by: Jens Axboe <axboe@fb.com>	2017-02-17 14:08:19 -07:00
Herbert Xu	98687f426b	gfs2: Use rhashtable walk interface in glock_hash_walk The function glock_hash_walk walks the rhashtable by hand. This is broken because if it catches the hash table in the middle of a rehash, then it will miss entries. This patch replaces the manual walk by using the rhashtable walk interface. Fixes: `88ffbf3e03` ("GFS2: Use resizable hash table for glocks") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-02-17 12:28:22 -05:00
Jeff Mahoney	71367b3fa7	btrfs: use btrfs_debug instead of pr_debug in transaction abort Commit `e5d6b12fe1` (Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors) added a pr_debug call to be printed when a transaction is aborted with -EIO instead of WARN. btrfs_debug prints which file system the message is associated with so let's use that instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:56 +01:00
Jeff Mahoney	21e75ffe3c	btrfs: btrfs_truncate_free_space_cache always allocates path btrfs_truncate_free_space_cache always allocates a btrfs_path structure but only uses it when the caller passes a block group. Let's move the allocation and free into the conditional. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:56 +01:00
Jeff Mahoney	77ab86bf1c	btrfs: free-space-cache, clean up unnecessary root arguments The free space cache APIs accept a root but always use the tree root. Also, btrfs_truncate_free_space_cache accepts a root AND an inode but the inode always points to the root anyway, so let's just pass the inode. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:56 +01:00
Jeff Mahoney	5e00f1939f	btrfs: convert btrfs_inc_block_group_ro to accept fs_info btrfs_inc_block_group_ro is either passed the extent root or the dev root, but it doesn't do anything with the dev tree. Let's convert to passing an fs_info and using the extent root. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:56 +01:00
Jeff Mahoney	0c9ab349c2	btrfs: flush_space always takes fs_info->fs_root We don't need to pass a root to flush_space since it always uses the fs_root. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:55 +01:00
Jeff Mahoney	87bde3cdfc	btrfs: pass fs_info to (more) routines that are only called with extent_root Outside of interactions with qgroups, the roots passed in extent-tree.c are usually passed to ensure that we don't do refcounts on log trees or to get the allocation profile for an allocation request. Otherwise, it operates on the extent root. This patch converts some more routines in extent-tree.c that are always called with the extent root to accept an fs_info instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:55 +01:00
Qu Wenruo	fb235dc06f	btrfs: qgroup: Move half of the qgroup accounting time out of commit trans Just as Filipe pointed out, the most time consuming parts of qgroup are btrfs_qgroup_account_extents() and btrfs_qgroup_prepare_account_extents(). Which both call btrfs_find_all_roots() to get old_roots and new_roots ulist. What makes things worse is, we're calling that expensive btrfs_find_all_roots() at transaction committing time with TRANS_STATE_COMMIT_DOING, which will blocks all incoming transaction. Such behavior is necessary for @new_roots search as current btrfs_find_all_roots() can't do it correctly so we do call it just before switch commit roots. However for @old_roots search, it's not necessary as such search is based on commit_root, so it will always be correct and we can move it out of transaction committing. This patch moves the @old_roots search part out of commit_transaction(), so in theory we can half the time qgroup time consumption at commit_transaction(). But please note that, this won't speedup qgroup overall, the total time consumption is still the same, just reduce the performance stall. Cc: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:55 +01:00
David Sterba	15b34517a6	btrfs: remove unused parameter from adjust_slots_upwards Never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:55 +01:00
David Sterba	0e8d931a82	btrfs: remove unused parameters from __btrfs_write_out_cache Both unused after the call to update_cache_item has been moved to __btrfs_wait_cache_io. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:55 +01:00
David Sterba	7bf1a15912	btrfs: remove unused parameter from cleanup_write_cache_enospc bitmap_list is unused since the io_ctl framework. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:55 +01:00
David Sterba	d75eefdf96	btrfs: remove unused parameter from __add_inode_ref Unused since the helper has been split, eb used in the caller. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:54 +01:00
David Sterba	4a0ab9d711	btrfs: remove unused parameter from clone_copy_inline_extent Never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:54 +01:00
David Sterba	1a287cfea1	btrfs: remove unused parameters from btrfs_cmp_data After the page locking has been reworked, we get all pages prepared via cmp_pages. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:54 +01:00
David Sterba	eeac44cb49	btrfs: remove unused parameter from __add_inline_refs Never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:54 +01:00
David Sterba	e5987e1319	btrfs: remove unused parameters from scrub_setup_wr_ctx Never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:54 +01:00
David Sterba	61d7e4cb11	btrfs: remove unused parameter from create_snapshot The name parameters have never been used, as the name is passed via the dentry. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:54 +01:00
David Sterba	e4a4dce72e	btrfs: remove unused parameter from init_first_rw_device The 'device' used to be added in that function, but now it's done by the caller. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:54 +01:00
David Sterba	72b468c8da	btrfs: remove unused parameter from __btrfs_alloc_chunk We grab fs_info from other parameters. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:53 +01:00
David Sterba	56e033a787	btrfs: remove unused parameter from btrfs_fill_super Never used for anything meaningful since we have our own superblock filler. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:53 +01:00
David Sterba	4242b64a4c	btrfs: remove unused parameter from extent_write_cache_pages The 'tree' was used to call locking hook that does not exist anymore. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:53 +01:00
David Sterba	df9f628e3d	btrfs: remove unused parameter from add_pending_csums Never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:53 +01:00
David Sterba	3d4b9496e8	btrfs: remove unused parameter from update_nr_written The logic has been updated in "Btrfs: make mapping->writeback_index point to the last written page" (`a91326679f`) and page is not needed anymore. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:53 +01:00
David Sterba	c2df8bb43f	btrfs: remove unused parameter from submit_extent_page This used to hold number of maximum pages to allocate, but this is now limited by BIO_MAX_PAGES. The local are now unused and removed as well. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:53 +01:00
David Sterba	f1e3026192	btrfs: remove unused parameter from tree_move_next_or_upnext Not needed. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:52 +01:00
David Sterba	ab6a43e122	btrfs: remove unused parameter from tree_move_down Never needed. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:52 +01:00
David Sterba	3d3a126a81	btrfs: remove unused parameter from btrfs_check_super_valid None of the checks need to know the ro/rw status as they're all not changing the superblock. Moreover, we can access the sb flags directly if we'd need to decide by the ro/rw status. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:52 +01:00
David Sterba	8b74c03e3c	btrfs: remove unused parameter from btrfs_prepare_extent_commit Added but never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:52 +01:00
David Sterba	7775c8184e	btrfs: remove unused parameter from btrfs_subvolume_release_metadata Unused since qgroup refactoring that split data and metadata accounting, the btrfs_qgroup_free helper. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:52 +01:00
David Sterba	66cb7ddbf2	btrfs: remove unused parameter from __push_leaf_left Unused since long ago. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:52 +01:00
David Sterba	1e47eef223	btrfs: remove unused parameter from __push_leaf_right Unused since long ago. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:52 +01:00
David Sterba	eece6a9cf6	btrfs: merge two superblock writing helpers write_all_supers and write_ctree_super are almost equal, the parameter 'trans' is unused so we can drop it and have just one helper. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:51 +01:00
David Sterba	b75f506243	btrfs: remove unused parameter from write_dev_supers The barriers are handled by the caller. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:51 +01:00
David Sterba	4961e2930f	btrfs: remove unused parameter from split_item Never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:51 +01:00
David Sterba	7c302b49dd	btrfs: remove unused parameter from clean_tree_block Added but never needed. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:51 +01:00
David Sterba	e27f62652b	btrfs: remove unused parameter from check_async_write Added but never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:51 +01:00
David Sterba	cda79c545e	btrfs: remove unused parameter from read_block_for_search Never used in that function. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:50 +01:00
David Sterba	6655bc3de1	btrfs: ulist: rename ulist_fini to ulist_release Change the name so it matches the naming we already use eg. for btrfs_path. Suggested-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:50 +01:00
David Sterba	4ae8553c2d	btrfs: remove pointless rcu protection from btrfs_qgroup_inherit There was never need for RCU protection around reading nodesize or other fairly constant filesystem data. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:50 +01:00
David Sterba	0b08e1f4f7	btrfs: qgroups: opencode qgroup_free helper The helper name is not too helpful and is just wrapping a simple call. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:50 +01:00
David Sterba	9ea6e2b548	btrfs: remove unnecessary mutex lock in qgroup_account_snapshot The quota status used to be tracked as a variable, so the mutex was needed (until "Btrfs: add a flags field to btrfs_fs_info" `afcdd129e0`). Since the status is a bit modified atomically and we don't hold the mutex beyond the check, we can drop it. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:50 +01:00
David Sterba	81353d50f5	btrfs: check quota status earlier and don't do unnecessary frees Status of quotas should be the first check in btrfs_qgroup_account_extent and we can return immediatelly, no need to do no-op ulist frees. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:50 +01:00
David Sterba	53d3235995	btrfs: embed extent_changeset::range_changed to the structure We can embed range_changed to the extent changeset to address following problems: - no need to allocate ulist dynamically, we also get rid of the GFP_NOFS for free - fix lack of allocation failure checking in btrfs_qgroup_reserve_data The stack consuption where extent_changeset is used slightly increases: before: 16 after: 16 - 8 (for pointer) + 32 (sizeof ulist) = 40 Which is bearable. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:49 +01:00
David Sterba	9d03793386	btrfs: ulist: make the finalization function public Make ulist_fini externally visible so the ulist API is complete. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:49 +01:00
David Sterba	025db916aa	btrfs: qgroups: make __del_qgroup_relation static Internal helper. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:49 +01:00
David Sterba	1d4805386e	btrfs: make space cache inode readahead failure nonfatal We do a readahead of the free space cache inode to speed things up but the failure is not fatal, like in other readahead cases. Proper reads would need to happen anyway and any errors would be caught there. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:49 +01:00
David Sterba	6602caf149	btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation Qgroup relations are added/deleted from ioctl, we hold the high level qgroup lock, no deadlocks or recursion from the allocation possible here. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:49 +01:00
David Sterba	52bf8e7aea	btrfs: use GFP_KERNEL in btrfs_quota_enable We don't need to use GFP_NOFS here as this is called from ioctls an the only lock held is the subvol_sem, which is of a high level and protects creation/renames/deletion and is never held in the writeout paths. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:49 +01:00
David Sterba	323b88f4ab	btrfs: use GFP_KERNEL in btrfs_read_qgroup_config The qgroup config is read during mount, we do not have to use NOFS. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:49 +01:00
David Sterba	23269bf5ea	btrfs: use GFP_KERNEL in create_snapshot We don't need to use GFP_NOFS here as this is called from ioctls an the only lock held is the subvol_sem, which is of a high level and protects creation/renames/deletion and is never held in the writeout paths. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:48 +01:00
Liu Bo	1af4a0aaa5	Btrfs: specify a new ordered extent type for create_io_em As 0 refers to an existing type BTRFS_ORDERED_IO_DONE, this specifies a new type 'REGULAR' for regular IO. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:48 +01:00
Liu Bo	6f9994dbab	Btrfs: create a helper to create em for IO We have similar codes to create and insert extent mapping around IO path, this merges them into a single helper. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:48 +01:00
Liu Bo	4136135b08	Btrfs: use helper to get used bytes of space_info This uses a helper instead of open code around used byte of space_info everywhere. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:48 +01:00
Liu Bo	0c9b36e0d7	Btrfs: try to avoid acquiring free space ctl's lock We don't need to take the lock if the block group has not been cached. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:48 +01:00
Qu Wenruo	6f6b643e44	btrfs: Better csum error message for data csum mismatch The original csum error message only outputs inode number, offset, check sum and expected check sum. However no root objectid is outputted, which sometimes makes debugging quite painful under multi-subvolume case (including relocation). Also the checksum output is decimal, which seldom makes sense for users/developers and is hard to read in most time. This patch will add root objectid, which will be %lld for rootid larger than LAST_FREE_OBJECTID, and hex csum output for better readability. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:48 +01:00
Takafumi Kubota	fe01aa6538	Btrfs: add another missing end_page_writeback on submit_extent_page failure If btrfs_bio_alloc fails in submit_extent_page, submit_extent_page returns without clearing the writeback bit of the failed page. __extent_writepage_io, that is a caller of submit_extent_page, does not clear the remaining writeback bit anywhere. As a result, this will cause the hang at filemap_fdatawait_range, because it waits the writeback bit to be cleared from the failed page. So, we have to call end_page_writeback to clear the writeback bit. For reproducing the hang, we inject a fault like if (should_failtest()) { // I define should_failtest() bio = NULL; } else { bio = btrfs_bio_alloc(...); } in submit_extent_page. We should also check whether page has the bit before end_page_writeback, to avoid the conflict against the other end_page_writeback in bio_endio. Thus, we add PageWriteback checks not only in __extent_writepage_io, but also in write_one_eb too, because it misses the check. Signed-off-by: Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:47 +01:00
David Sterba	66bbc1c0c0	btrfs: remove unused ulist members Commit "btrfs: ulist: Add ulist_del() function" (`d4b8040459`) removed some debugging code but left the structure defintions. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:47 +01:00
Liu Bo	76c0021db8	Btrfs: use helper to simplify lock/unlock pages Since we have a helper to set page bits, let lock_delalloc_pages and __unlock_for_delalloc use it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:47 +01:00
Liu Bo	da2c7009f6	btrfs: teach __process_pages_contig about PAGE_LOCK operation Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ changes to the helper separated from the following patch ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:35 +01:00
Christoph Hellwig	410d17f67e	xfs: tune down agno asserts in the bmap code In various places we currently assert that xfs_bmap_btalloc allocates from the same as the firstblock value passed in, unless it's either NULLAGNO or the dop_low flag is set. But the reflink code does not fully follow this convention as it passes in firstblock purely as a hint for the allocator without actually having previous allocations in the transaction, and without having a minleft check on the current AG, leading to the assert firing on a very full and heavily used file system. As even the reflink code only allocates from equal or higher AGs for now we can simply the check to always allow for equal or higher AGs. Note that we need to eventually split the two meanings of the firstblock value. At that point we can also allow the reflink code to allocate from any AG instead of limiting it in any way. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-16 17:20:39 -08:00
Chandan Rajendra	8ee9fdbebc	xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment On a ppc64 system, executing generic/256 test with 32k block size gives the following call trace, XFS: Assertion failed: args->maxlen > 0, file: /root/repos/linux/fs/xfs/libxfs/xfs_alloc.c, line: 2026 kernel BUG at /root/repos/linux/fs/xfs/xfs_message.c:113! Oops: Exception in kernel mode, sig: 5 [#1] SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries Modules linked in: CPU: 2 PID: 19361 Comm: mkdir Not tainted 4.10.0-rc5 #58 task: c000000102606d80 task.stack: c0000001026b8000 NIP: c0000000004ef798 LR: c0000000004ef798 CTR: c00000000082b290 REGS: c0000001026bb090 TRAP: 0700 Not tainted (4.10.0-rc5) MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28004428 XER: 00000000 CFAR: c0000000004ef180 SOFTE: 1 GPR00: c0000000004ef798 c0000001026bb310 c000000001157300 ffffffffffffffea GPR04: 000000000000000a c0000001026bb130 0000000000000000 ffffffffffffffc0 GPR08: 00000000000000d1 0000000000000021 00000000ffffffd1 c000000000dd4990 GPR12: 0000000022004444 c00000000fe00800 0000000020000000 0000000000000000 GPR16: 0000000000000000 0000000043a606fc 0000000043a76c08 0000000043a1b3d0 GPR20: 000001002a35cd60 c0000001026bbb80 0000000000000000 0000000000000001 GPR24: 0000000000000240 0000000000000004 c00000062dc55000 0000000000000000 GPR28: 0000000000000004 c00000062ecd9200 0000000000000000 c0000001026bb6c0 NIP [c0000000004ef798] .assfail+0x28/0x30 LR [c0000000004ef798] .assfail+0x28/0x30 Call Trace: [c0000001026bb310] [c0000000004ef798] .assfail+0x28/0x30 (unreliable) [c0000001026bb380] [c000000000455d74] .xfs_alloc_space_available+0x194/0x1b0 [c0000001026bb410] [c00000000045b914] .xfs_alloc_fix_freelist+0x144/0x480 [c0000001026bb580] [c00000000045c368] .xfs_alloc_vextent+0x698/0xa90 [c0000001026bb650] [c0000000004a6200] .xfs_ialloc_ag_alloc+0x170/0x820 [c0000001026bb7c0] [c0000000004a9098] .xfs_dialloc+0x158/0x320 [c0000001026bb8a0] [c0000000004e628c] .xfs_ialloc+0x7c/0x610 [c0000001026bb990] [c0000000004e8138] .xfs_dir_ialloc+0xa8/0x2f0 [c0000001026bbaa0] [c0000000004e8814] .xfs_create+0x494/0x790 [c0000001026bbbf0] [c0000000004e5ebc] .xfs_generic_create+0x2bc/0x410 [c0000001026bbce0] [c0000000002b4a34] .vfs_mkdir+0x154/0x230 [c0000001026bbd70] [c0000000002bc444] .SyS_mkdirat+0x94/0x120 [c0000001026bbe30] [c00000000000b760] system_call+0x38/0xfc Instruction dump: 4e800020 60000000 7c0802a6 7c862378 3c82ffca 7ca72b78 38841c18 7c651b78 38600000 f8010010 f821ff91 4bfff94d <0fe00000> 60000000 7c0802a6 7c892378 When block size is larger than inode cluster size, the call to XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size) returns 0. Also, mkfs.xfs would have set xfs_sb->sb_inoalignmt to 0. This causes xfs_ialloc_cluster_alignment() to return 0. Due to this args.minalignslop (in xfs_ialloc_ag_alloc()) gets the unsigned equivalent of -1 assigned to it. This later causes alloc_len in xfs_alloc_space_available() to have a value of 0. In such a scenario when args.total is also 0, the assert statement "ASSERT(args->maxlen > 0);" fails. This commit fixes the bug by replacing the call to XFS_B_TO_FSBT() in xfs_ialloc_cluster_alignment() with a call to xfs_icluster_size_fsb(). Suggested-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-16 17:20:38 -08:00
Brian Foster	48af96ab92	xfs: don't reserve blocks for right shift transactions The block reservation for the transaction allocated in xfs_shift_file_space() is an artifact of the original collapse range support. It exists to handle the case where a collapse range occurs, the initial extent is left shifted into a location that forms a contiguous boundary with the previous extent and thus the extents are merged. This code was subsequently refactored and reused for insert range (right shift) support. If an insert range occurs under low free space conditions, the extent at the starting offset is split before the first shift transaction is allocated. If the block reservation fails, this leaves separate, but contiguous extents around in the inode. While not a fatal problem, this is unexpected and will flag a warning on subsequent insert range operations on the inode. This problem has been reproduce intermittently by generic/270 running against a ramdisk device. Since right shift does not create new extent boundaries in the inode, a block reservation for extent merge is unnecessary. Update xfs_shift_file_space() to conditionally reserve fs blocks for left shift transactions only. This avoids the warning reproduced by generic/270. Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-16 17:20:13 -08:00
Arnd Bergmann	353fe445f5	xfs: fix len comparison in xfs_extent_busy_trim The length is now passed by reference, so the assertion has to be updated to match the other changes, as pointed out by this W=1 warning: fs/xfs/xfs_extent_busy.c: In function 'xfs_extent_busy_trim': fs/xfs/xfs_extent_busy.c:356:13: error: ordered comparison of pointer with integer zero [-Werror=extra] Fixes: `ebf5587261` ("xfs: improve handling of busy extents in the low-level allocator") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-16 17:20:12 -08:00
Darrick J. Wong	93aaead52a	xfs: fix uninitialized variable in _reflink_convert_cow Fix an uninitialize variable. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-16 17:20:12 -08:00
Brian Foster	75d65361cf	xfs: split indlen reservations fairly when under reserved Certain workoads that punch holes into speculative preallocation can cause delalloc indirect reservation splits when the delalloc extent is split in two. If further splits occur, an already short-handed extent can be split into two in a manner that leaves zero indirect blocks for one of the two new extents. This occurs because the shortage is large enough that the xfs_bmap_split_indlen() algorithm completely drains the requested indlen of one of the extents before it honors the existing reservation. This ultimately results in a warning from xfs_bmap_del_extent(). This has been observed during file copies of large, sparse files using 'cp --sparse=always.' To avoid this problem, update xfs_bmap_split_indlen() to explicitly apply the reservation shortage fairly between both extents. This smooths out the overall indlen shortage and defers the situation where we end up with a delalloc extent with zero indlen reservation to extreme circumstances. Reported-by: Patrick Dung <mpatdung@gmail.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-16 17:20:12 -08:00
Brian Foster	0e339ef855	xfs: handle indlen shortage on delalloc extent merge When a delalloc extent is created, it can be merged with pre-existing, contiguous, delalloc extents. When this occurs, xfs_bmap_add_extent_hole_delay() merges the extents along with the associated indirect block reservations. The expectation here is that the combined worst case indlen reservation is always less than or equal to the indlen reservation for the individual extents. This is not always the case, however, as existing extents can less than the expected indlen reservation if the extent was previously split due to a hole punch. If a new extent merges with such an extent, the total indlen requirement may be larger than the sum of the indlen reservations held by both extents. xfs_bmap_add_extent_hole_delay() assumes that the worst case indlen reservation is always available and assigns it to the merged extent without consideration for the indlen held by the pre-existing extent. As a result, the subsequent xfs_mod_fdblocks() call can attempt an unintentional allocation rather than a free (indicated by an ASSERT() failure). Further, if the allocation happens to fail in this context, the failure goes unhandled and creates a filesystem wide block accounting inconsistency. Fix xfs_bmap_add_extent_hole_delay() to function as designed. Cap the indlen reservation assigned to the merged extent to the sum of the indlen reservations held by each of the individual extents. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-16 17:20:12 -08:00
Brian Foster	9dbddd7b0c	xfs: resurrect debug mode drop buffered writes mechanism A debug mode write failure mechanism was introduced to XFS in commit `801cc4e17a` ("xfs: debug mode forced buffered write failure") to facilitate targeted testing of delalloc indirect reservation management from userspace. This code was subsequently rendered ineffective by the move to iomap based buffered writes in commit `68a9f5e700` ("xfs: implement iomap based buffered write path"). This likely went unnoticed because the associated userspace code had not made it into xfstests. Resurrect this mechanism to facilitate effective indlen reservation testing from xfstests. The move to iomap based buffered writes relocated the hook this mechanism needs to return write failure from XFS to generic code. The failure trigger must remain in XFS. Given that limitation, convert this from a write failure mechanism to one that simply drops writes without returning failure to userspace. Rename all "fail_writes" references to "drop_writes" to illustrate the point. This is more hacky than preferred, but still triggers the XFS error handling behavior required to drive the indlen tests. This is only available in DEBUG mode and for testing purposes only. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-16 17:19:15 -08:00
Brian Foster	fa7f138ac4	xfs: clear delalloc and cache on buffered write failure The buffered write failure handling code in xfs_file_iomap_end_delalloc() has a couple minor problems. First, if written == 0, start_fsb is not rounded down and it fails to kill off a delalloc block if the start offset is block unaligned. This results in a lingering delalloc block and broken delalloc block accounting detected at unmount time. Fix this by rounding down start_fsb in the unlikely event that written == 0. Second, it is possible for a failed overwrite of a delalloc extent to leave dirty pagecache around over a hole in the file. This is because is possible to hit ->iomap_end() on write failure before the iomap code has attempted to allocate pagecache, and thus has no need to clean it up. If the targeted delalloc extent was successfully written by a previous write, however, then it does still have dirty pages when ->iomap_end() punches out the underlying blocks. This ultimately results in writeback over a hole. To fix this problem, unconditionally punch out the pagecache from XFS before the associated delalloc range. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2017-02-16 17:19:12 -08:00
David S. Miller	3f64116a83	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-02-16 19:34:01 -05:00
Miklos Szeredi	5a81e6a171	vfs: fix uninitialized flags in splice_to_pipe() Flags (PIPE_BUF_FLAG_PACKET, PIPE_BUF_FLAG_GIFT) could remain on the unused part of the pipe ring buffer. Previously splice_to_pipe() left the flags value alone, which could result in incorrect behavior. Uninitialized flags appears to have been there from the introduction of the splice syscall. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # 2.6.17+ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-16 09:09:02 -08:00
Miklos Szeredi	84588a93d0	fuse: fix uninitialized flags in pipe_buffer Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `d82718e348` ("fuse_dev_splice_read(): switch to add_to_pipe()") Cc: <stable@vger.kernel.org> # 4.9+	2017-02-16 15:08:20 +01:00
Sahitya Tummala	6ba4d2722d	fuse: fix use after free issue in fuse_dev_do_read() There is a potential race between fuse_dev_do_write() and request_wait_answer() contexts as shown below: TASK 1: __fuse_request_send(): \|--spin_lock(&fiq->waitq.lock); \|--queue_request(); \|--spin_unlock(&fiq->waitq.lock); \|--request_wait_answer(): \|--if (test_bit(FR_SENT, &req->flags)) <gets pre-empted after it is validated true> TASK 2: fuse_dev_do_write(): \|--clears bit FR_SENT, \|--request_end(): \|--sets bit FR_FINISHED \|--spin_lock(&fiq->waitq.lock); \|--list_del_init(&req->intr_entry); \|--spin_unlock(&fiq->waitq.lock); \|--fuse_put_request(); \|--queue_interrupt(); <request gets queued to interrupts list> \|--wake_up_locked(&fiq->waitq); \|--wait_event_freezable(); <as FR_FINISHED is set, it returns and then the caller frees this request> Now, the next fuse_dev_do_read(), see interrupts list is not empty and then calls fuse_read_interrupt() which tries to access the request which is already free'd and gets the below crash: [11432.401266] Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b6b ... [11432.418518] Kernel BUG at ffffff80083720e0 [11432.456168] PC is at __list_del_entry+0x6c/0xc4 [11432.463573] LR is at fuse_dev_do_read+0x1ac/0x474 ... [11432.679999] [<ffffff80083720e0>] __list_del_entry+0x6c/0xc4 [11432.687794] [<ffffff80082c65e0>] fuse_dev_do_read+0x1ac/0x474 [11432.693180] [<ffffff80082c6b14>] fuse_dev_read+0x6c/0x78 [11432.699082] [<ffffff80081d5638>] __vfs_read+0xc0/0xe8 [11432.704459] [<ffffff80081d5efc>] vfs_read+0x90/0x108 [11432.709406] [<ffffff80081d67f0>] SyS_read+0x58/0x94 As FR_FINISHED bit is set before deleting the intr_entry with input queue lock in request completion path, do the testing of this flag and queueing atomically with the same lock in queue_interrupt(). Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: `fd22d62ed0` ("fuse: no fc->lock for iqueue parts") Cc: <stable@vger.kernel.org> # 4.2+	2017-02-15 10:28:24 +01:00
Theodore Ts'o	2ba3e6e8af	ext4: fix fencepost in s_first_meta_bg validation It is OK for s_first_meta_bg to be equal to the number of block group descriptor blocks. (It rarely happens, but it shouldn't cause any problems.) https://bugzilla.kernel.org/show_bug.cgi?id=194567 Fixes: `3a4b77cd47` Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org	2017-02-15 01:26:39 -05:00
Theodore Ts'o	0d06863f90	ext4: don't BUG when truncating encrypted inodes on the orphan list Fix a BUG when the kernel tries to mount a file system constructed as follows: echo foo > foo.txt mke2fs -Fq -t ext4 -O encrypt foo.img 100 debugfs -w foo.img << EOF write foo.txt a set_inode_field a i_flags 0x80800 set_super_value s_last_orphan 12 quit EOF root@kvm-xfstests:~# mount -o loop foo.img /mnt [ 160.238770] ------------[ cut here ]------------ [ 160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874! [ 160.240106] invalid opcode: 0000 [#1] SMP [ 160.240106] Modules linked in: [ 160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G W 4.10.0-rc3-00034-gcdd33b941b67 #227 [ 160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014 [ 160.240106] task: f4518000 task.stack: f47b6000 [ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 [ 160.240106] EFLAGS: 00010246 CPU: 0 [ 160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007 [ 160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac [ 160.240106] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0 [ 160.240106] Call Trace: [ 160.240106] ext4_truncate+0x1e9/0x3e5 [ 160.240106] ext4_fill_super+0x286f/0x2b1e [ 160.240106] ? set_blocksize+0x2e/0x7e [ 160.240106] mount_bdev+0x114/0x15f [ 160.240106] ext4_mount+0x15/0x17 [ 160.240106] ? ext4_calculate_overhead+0x39d/0x39d [ 160.240106] mount_fs+0x58/0x115 [ 160.240106] vfs_kern_mount+0x4b/0xae [ 160.240106] do_mount+0x671/0x8c3 [ 160.240106] ? _copy_from_user+0x70/0x83 [ 160.240106] ? strndup_user+0x31/0x46 [ 160.240106] SyS_mount+0x57/0x7b [ 160.240106] do_int80_syscall_32+0x4f/0x61 [ 160.240106] entry_INT80_32+0x2f/0x2f [ 160.240106] EIP: 0xb76b919e [ 160.240106] EFLAGS: 00000246 CPU: 0 [ 160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8 [ 160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660 [ 160.240106] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b [ 160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9 [ 160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac [ 160.317241] ---[ end trace d6a773a375c810a5 ]--- The problem is that when the kernel tries to truncate an inode in ext4_truncate(), it tries to clear any on-disk data beyond i_size. Without the encryption key, it can't do that, and so it triggers a BUG. E2fsck does not provide this service, and in practice most file systems have their orphan list processed by e2fsck, so to avoid crashing, this patch skips this step if we don't have access to the encryption key (which is the case when processing the orphan list; in all other cases, we will have the encryption key, or the kernel wouldn't have allowed the file to be opened). An open question is whether the fact that e2fsck isn't clearing the bytes beyond i_size causing problems --- and if we've lived with it not doing it for so long, can we drop this from the kernel replay of the orphan list in all cases (not just when we don't have the key for encrypted inodes). Addresses-Google-Bug: #35209576 Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2017-02-14 11:31:15 -05:00
Liu Bo	873695b301	Btrfs: create helper for processing bits on contiguous pages This introduces a new helper which can be used to process pages bits. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:51:00 +01:00
Liu Bo	e4c3b2dcd1	Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist run_delalloc_nocow has used trans in two places where they don't actually need @trans. For btrfs_lookup_file_extent, we search for file extents without COWing anything, and for btrfs_cross_ref_exist, the only place where we need @trans is deferencing it in order to get running_transaction which we could easily get from the global fs_info. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:51:00 +01:00
Liu Bo	f72ad18e99	Btrfs: pass delayed_refs directly to btrfs_find_delayed_ref_head All we need is @delayed_refs, all callers have get it ahead of calling btrfs_find_delayed_ref_head since lock needs to be acquired firstly, there is no reason to deference it again inside the function. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:59 +01:00
Liu Bo	d07b85284f	Btrfs: remove unused trans in read_block_for_search @trans is not used at all, this removes it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:59 +01:00
Liu Bo	bcf934894f	Btrfs: cleanup unused cached_state in __extent_writepage_io @cached_state is no more required in __extent_writepage_io, also remove the goto label. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:59 +01:00
Jeff Mahoney	003d7c59e8	btrfs: allow unlink to exceed subvolume quota Once a qgroup limit is exceeded, it's impossible to restore normal operation to the subvolume without modifying the limit or removing the subvolume. This is a surprising situation for many users used to the typical workflow with quotas on other file systems where it's possible to remove files until the used space is back under the limit. When we go to unlink a file and start the transaction, we'll hit the qgroup limit while trying to reserve space for the items we'll modify while removing the file. We discussed last month how best to handle this situation and agreed that there is no perfect solution. The best principle-of-least-surprise solution is to handle it similarly to how we already handle ENOSPC when unlinking, which is to allow the operation to succeed with the expectation that it will ultimately release space under most circumstances. This patch modifies the transaction start path to select whether to honor the qgroups limits. btrfs_start_transaction_fallback_global_rsv is the only caller that skips enforcement. The reservation and tracking still happens normally -- it just skips the enforcement step. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:59 +01:00
Liu Bo	9a9239acb4	Btrfs: fix wrong argument for btrfs_lookup_ordered_range Commit Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units" (`d0b7da88`) did this, but btrfs_lookup_ordered_range expects a 'length' rather than a 'page_end'. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:59 +01:00
Qu Wenruo	a7ceffbbbd	btrfs: raid56: Remove unused variable in lock_stripe_add Variable 'walk' in lock_stripe_add() is not used. Remove it. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:59 +01:00
Omar Sandoval	fc4badd9fe	Btrfs: refactor btrfs_extent_same() slightly This was originally a prep patch for changing the behavior on len=0, but we went another direction with that. This still makes the function slightly easier to follow. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:58 +01:00
Omar Sandoval	310712b2f7	Btrfs: constify struct btrfs_{,disk_}key wherever possible In a lot of places, it's unclear when it's safe to reuse a struct btrfs_key after it has been passed to a helper function. Constify these arguments wherever possible to make it obvious. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:58 +01:00
Liu Bo	4aaedfb0b6	Btrfs: fix another race between truncate and lockless dio write Dio writes can update i_size in btrfs_get_blocks_direct when it writes to offset beyond EOF so that endio can update disk_i_size correctly (because we don't udpate disk_i_size beyond i_size). However, when truncating down a file, we firstly update i_size and then wait for in-flight lockless dio reads/writes, according to the above, i_size may have been changed in dio writes, and file extents don't get truncated. For lockless dio writes are always overwrites, i_size is not supposed to be changed, so this adds a check to filter out this case. The race could be reproduced by fstests/generic/299 with patch "Btrfs: fix btrfs_ordered_update_i_size to update disk_i_size properly" applied. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:58 +01:00
Liu Bo	62c821a8e2	Btrfs: clean up btrfs_ordered_update_i_size Since we have a good helper entry_end, use it for ordered extent. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ whitespace reformatting ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:58 +01:00
Liu Bo	5416034f7a	Btrfs: fix comment in btrfs_page_mkwrite The comment about "page_mkwrite gets called every time the page is dirtied" in btrfs_page_mkwrite is not correct, it only gets called the first time the page gets dirtied after the page faults in. However, we don't need to touch the code because it works well, although the proper logic is to check if delalloc bits has been set and if so, go free reserved space, if not, set the delalloc bits for dirty page range. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:58 +01:00
Liu Bo	19fd2df5b1	Btrfs: fix btrfs_ordered_update_i_size to update disk_i_size properly btrfs_ordered_update_i_size can be called by truncate and endio, but only endio takes ordered_extent which contains the completed IO. while truncating down a file, if there are some in-flight IOs, btrfs_ordered_update_i_size in endio will set disk_i_size to @orig_offset that is zero. If truncating-down fails somehow, we try to recover in memory isize with this zero'd disk_i_size. Fix it by only updating disk_i_size with @orig_offset when btrfs_ordered_update_i_size is not called from endio while truncating down and waiting for in-flight IOs completing their work before recover in-memory size. Besides fixing the above issue, add an assertion for last_size to double check we truncate down to the desired size. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:57 +01:00
David Sterba	f85b7379cd	btrfs: fix over-80 lines introduced by previous cleanups This goes as a separate patch because fixing that inside the patches caused too many many conflicts. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:57 +01:00
Nikolay Borisov	f329e31971	btrfs: Make count_inode_refs take btrfs_inode Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:57 +01:00
Nikolay Borisov	3628365823	btrfs: Make count_inode_extrefs take btrfs_inode Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:57 +01:00

... 3 4 5 6 7 ...

48308 commits