Commit graph

506 commits

Author SHA1 Message Date
Filipe Manana
67c5e7d464 Btrfs: fix race between balance and unused block group deletion
We have a race between deleting an unused block group and balancing the
same block group that leads to an assertion failure/BUG(), producing the
following trace:

[181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
[181631.220591] ------------[ cut here ]------------
[181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
[181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
[181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
[181631.224566] RIP: 0010:[<ffffffffa03f19f6>]  [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
[181631.224566] RSP: 0018:ffff8800b5827ae8  EFLAGS: 00010246
[181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
[181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
[181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
[181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
[181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
[181631.224566] FS:  00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
[181631.224566] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
[181631.224566] Stack:
[181631.224566]  ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
[181631.224566]  ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
[181631.224566]  ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
[181631.224566] Call Trace:
[181631.224566]  [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
[181631.224566]  [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
[181631.224566]  [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
[181631.224566]  [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
[181631.224566]  [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
[181631.224566]  [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
[181631.224566]  [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
[181631.224566]  [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
[181631.224566]  [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
[181631.224566]  [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
[181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
[181631.224566]  [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
[181631.224566]  [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
(...)

The sequence of steps leading to this are:

           CPU 0                                         CPU 1

  btrfs_balance()
    btrfs_relocate_chunk()

      btrfs_relocate_block_group(bg X)
        btrfs_lookup_block_group(bg X)

                                               cleaner_kthread
                                                  locks fs_info->cleaner_mutex

                                                  btrfs_delete_unused_bgs()
                                                    finds bg X, which became
                                                    unused in the previous
                                                    transaction

                                                    checks bg X ->ro == 0,
                                                    so it proceeds
        sets bg X ->ro to 1
        (btrfs_set_block_group_ro(bg X))

        blocks on fs_info->cleaner_mutex
                                                    btrfs_remove_chunk(bg X)
                                                  unlocks fs_info->cleaner_mutex

        acquires fs_info->cleaner_mutex
        relocate_block_group()
          --> does nothing, no extents found in
              the extent tree from bg X
        unlocks fs_info->cleaner_mutex

      btrfs_relocate_block_group(bg X) returns

    btrfs_remove_chunk(bg X)
       extent map not found
          --> ASSERT(0)

Fix this by using a new mutex to make sure these 2 operations, block
group relocation and removal, are serialized.

This issue is reproducible by running fstests generic/038 (which stresses
chunk allocation and automatic removal of unused block groups) together
with the following balance loop:

    while true; do btrfs balance start -dusage=0 <mountpoint> ; done

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 14:36:46 -07:00
Zhao Lei
65f5333875 btrfs: cleanup noused initialization of dev in btrfs_end_bio()
It is introduced by:
 c404e0dc2c
 Btrfs: fix use-after-free in the finishing procedure of the device replace

But seems no relationship with that bug, this patch revirt these
code block for cleanup.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-30 13:20:02 -07:00
Chris Mason
c40b7b064f Merge branch 'sysfs-fsdevices-4.2-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into anand 2015-06-23 05:34:39 -07:00
Anand Jain
d2ff1b2008 Btrfs: sysfs: add support to show replacing target in the sysfs
This patch will add support to show the replacing target in sysfs
during the process of replacement.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-06-19 14:03:54 +02:00
Anand Jain
4fde46f0cc Btrfs: free the stale device
When btrfs on a device is overwritten with a new btrfs (mkfs),
the old btrfs instance in the kernel becomes stale. So with this
patch, if kernel finds device is overwritten then delete the stale
fsid/uuid.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
2015-06-19 14:03:13 +02:00
Filipe Manana
4617ea3a52 Btrfs: fix necessary chunk tree space calculation when allocating a chunk
When allocating a new chunk or removing one we need to update num_devs
device items and insert or remove a chunk item in the chunk tree, so
in the worst case the space needed in the chunk space_info is:

  btrfs_calc_trunc_metadata_size(chunk_root, num_devs) +
     btrfs_calc_trans_metadata_size(chunk_root, 1)

That is, in the worst case we need to cow num_devs paths and cow 1 other
path that can result in splitting every node and leaf, and each path
consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring
some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes,
which were unnecessary since updating the existing device items does
not result in splitting the nodes and leaf since after updating them
they remain with the same size.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10 07:02:46 -07:00
Filipe Manana
39c2d7facc Btrfs: fix -ENOSPC on block group removal
Unlike when attempting to allocate a new block group, where we check
that we have enough space in the system space_info to update the device
items and insert a new chunk item in the chunk tree, we were not checking
if the system space_info had enough space for updating the device items
and deleting the chunk item in the chunk tree. This often lead to -ENOSPC
error when attempting to allocate blocks for the chunk tree (during btree
node/leaf COW operations) while updating the device items or deleting the
chunk item, which resulted in the current transaction being aborted and
turning the filesystem into read-only mode.

While running fstests generic/038, which stresses allocation of block
groups and removal of unused block groups, with a large scratch device
(750Gb) this happened often, despite more than enough unallocated space,
and resulted in the following trace:

[68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[68663.600407] BTRFS: Transaction aborted (error -28)
(...)
[68663.730829] Call Trace:
[68663.732585]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[68663.734334]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[68663.739980]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[68663.757153]  [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.760925]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[68663.762854]  [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs]
[68663.764073]  [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[68663.765130]  [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs]
[68663.765998]  [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs]
[68663.767068]  [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs]
[68663.768227]  [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c
[68663.769081]  [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs]
[68663.799485]  [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs]
[68663.809208]  [<ffffffff8105f367>] kthread+0xef/0xf7
[68663.828795]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[68663.844942]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.846486]  [<ffffffff81435a88>] ret_from_fork+0x58/0x90
[68663.847760]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
[68663.849503] ---[ end trace 798477c6d6dbaad6 ]---
[68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left

So fix this by verifying that enough space exists in system space_info,
and reserving the space in the chunk block reserve, before attempting to
delete the block group and allocate a new system chunk if we don't have
enough space to perform the necessary updates and delete in the chunk
tree. Like for the block group creation case, we don't error our if we
fail to allocate a new system chunk, since we might end up not needing
it (no node/leaf splits happen during the COW operations and/or we end
up not needing to COW any btree nodes or leafs because they were already
COWed in the current transaction and their writeback didn't start yet).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:05 -07:00
Filipe Manana
c152b63efc Btrfs: fix chunk allocation regression leading to transaction abort
With commit 1b98450816 ("Btrfs: fix find_free_dev_extent() malfunction
in case device tree has hole") introduced in the kernel 4.1 merge window,
we end up using part of a device hole for which there are already pending
chunks or pinned chunks. Before that commit we didn't use the hole and
would just move on to the next hole in the device.

However when we adjust the start offset for the chunk allocation and we
have pinned chunks, we set it blindly to the end offset of the pinned
chunk we are currently processing, which is dangerous because we can
have a pending chunk that has a start offset that matches the end offset
of our pinned chunk - leading us to a case where we end up getting two
pending chunks that start at the same physical device offset, which makes
us later abort the current transaction with -EEXIST when finishing the
chunk allocation at btrfs_create_pending_block_groups():

[194737.659017] ------------[ cut here ]------------
[194737.660192] WARNING: CPU: 15 PID: 31111 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[194737.662209] BTRFS: Transaction aborted (error -17)
[194737.663175] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse
[194737.674015] CPU: 15 PID: 31111 Comm: xfs_io Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[194737.675986] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[194737.682999]  0000000000000009 ffff8800564c7a98 ffffffff8142fa46 ffffffff8108b6a2
[194737.684540]  ffff8800564c7ae8 ffff8800564c7ad8 ffffffff81045ea5 ffff8800564c7b78
[194737.686017]  ffffffffa0383aa7 00000000ffffffef ffff88000c7ba000 ffff8801a1f66f40
[194737.687509] Call Trace:
[194737.688068]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[194737.689027]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[194737.690095]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[194737.691198]  [<ffffffffa0383aa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[194737.693789]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[194737.695065]  [<ffffffffa0383aa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[194737.696806]  [<ffffffffa039a3bd>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[194737.698683]  [<ffffffffa03aa433>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[194737.700329]  [<ffffffffa03aa725>] btrfs_end_transaction+0x10/0x12 [btrfs]
[194737.701924]  [<ffffffffa0394b51>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[194737.703675]  [<ffffffffa03b8ba4>] __btrfs_buffered_write+0x16a/0x4c8 [btrfs]
[194737.705417]  [<ffffffffa03bb502>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
[194737.707058]  [<ffffffffa03bb511>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
[194737.708560]  [<ffffffffa03bb68d>] btrfs_file_write_iter+0x325/0x431 [btrfs]
[194737.710673]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
[194737.712076]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
[194737.713293]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
[194737.714443]  [<ffffffff81154424>] SyS_pwrite64+0x64/0x82
[194737.715646]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[194737.717175] ---[ end trace f2d5dc04e56d7e48 ]---
[194737.718170] BTRFS: error (device sdc) in btrfs_create_pending_block_groups:9524: errno=-17 Object already exists

The -EEXIST failure comes from btrfs_finish_chunk_alloc(), called by
btrfs_create_pending_block_groups(), when it attempts to insert a
duplicated device extent item via btrfs_alloc_dev_extent().

This issue was reproducible with fstests generic/038 running in a loop for
several hours (it's very hard to hit) and using MOUNT_OPTIONS="-o discard".
Applying Jeff's recent patch titled "btrfs: add missing discards when
unpinning extents with -o discard" makes the issue much easier to reproduce
(usually within 4 to 5 hours), since it pins chunks for longer periods of
time when an unused block group is deleted by the cleaner kthread.

Fix this by making sure that we never adjust the start offset to a lower
value than it currently has.

Fixes: 1b98450816 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:02:55 -07:00
Sasha Levin
2037a0933b btrfs: use after free when closing devices
__btrfs_close_devices() would call_rcu to free the device, which is racy with
list_for_each_entry() accessing the memory to retrieve the next device on the
list.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:36 -07:00
Anand Jain
33b97e4327 Btrfs: check error before reporting missing device and add uuid
Report missing device when add is successful,
otherwise it would exit as ENOMEM. And add uuid
to the report.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:35 -07:00
Anand Jain
816fcebe8f Btrfs: log when missing device is created
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
David Sterba
6d13f5497f btrfs: fix warnings after changes in btrfs_abort_transaction
fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’:
fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
   btrfs_abort_transaction(trans, tree_root,
   ^
  CC [M]  fs/btrfs/ioctl.o
fs/btrfs/ioctl.c: In function ‘create_subvol’:
fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=]
   btrfs_abort_transaction(trans, root, PTR_ERR(new_root));

PTR_ERR returns long, but we're really using 'int' for the error codes
everywhere so just set and use the local variable.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:34 -07:00
Filipe Manana
53e489bc8c Btrfs: check pending chunks when shrinking fs to avoid corruption
When we shrink the usable size of a device (its total_bytes), we go over
all the device extent items in the device tree and attempt to relocate
the chunk of any device extent that goes beyond the new usable size for
the device. We do that after setting the new usable size (total_bytes) in
the device object, so that all new allocations (and reallocations) don't
use areas of the device that go beyond the new (shorter) size. However we
were not considering that before setting the new size in the device,
pending chunks might have been created that use device extents that go
beyond the new size, and those device extents are not yet in the device
tree after we search the device tree - they are still attached to the
list of new block group for some ongoing transaction handle, and they are
only added to the device tree when the transaction handle is ended (via
btrfs_create_pending_block_groups()).

So check for pending chunks with device extents that go beyond the new
size and if any exists, commit the current transaction and repeat the
search in the device tree.

Not doing this it would mean we would return success to user space while
still having extents that go beyond the new size, and later user space
could override those locations on the device while the fs still references
them, causing all sorts of corruption and unexpected events.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02 19:34:33 -07:00
Anand Jain
2421a8cd5f Btrfs: sysfs: don't fail seeding for the sake of sysfs kobject issue
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:22 +02:00
Anand Jain
6c14a1641b Btrfs: sysfs btrfs_kobj_rm_device() pass fs_devices instead of fs_info
since btrfs_kobj_rm_device() does nothing with fs_info

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain
1ba43816af Btrfs: sysfs btrfs_kobj_add_device() pass fs_devices instead of fs_info
btrfs_kobj_add_device() does not need fs_info any more.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain
5a13f4308c Btrfs: sysfs: add pointer to access fs_info from fs_devices
adds fs_info pointer with struct btrfs_fs_devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:21 +02:00
Anand Jain
c73eccf75b Btrfs: introduce btrfs_get_fs_uuids to get fs_uuids
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Anand Jain
2e7910d6ca Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to btrfs_fs_devices
This patch will provide a framework and help to create attributes
from the structure btrfs_fs_devices which are available even before
fs_info is created. So by moving the parent kobject super_kobj from
fs_info to btrfs_fs_devices, it will help to create attributes
from the btrfs_fs_devices as well.

Patches on top of this patch now will be able to create the
sys/fs/btrfs/fsid kobject and attributes from btrfs_fs_devices
when devices are scanned and registered to the kernel.

Just to note, this does not change any of the existing btrfs sysfs
external kobject names and its attributes and not even the life
cycle of them. Changes are internal only. And to ensure the same,
this path has been tested with various device operations and,
checking and comparing the sysfs kobjects and attributes with
sysfs kobject and attributes with out this patch, and they remain
same.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-05-27 12:27:20 +02:00
Filipe Manana
a96295965b Btrfs: fix racy system chunk allocation when setting block group ro
If while setting a block group read-only we end up allocating a system
chunk, through check_system_chunk(), we were not doing it while holding
the chunk mutex which is a problem if a concurrent chunk allocation is
happening, through do_chunk_alloc(), as it means both block groups can
end up using the same logical addresses and physical regions in the
device(s). So make sure we hold the chunk mutex.

Cc: stable@vger.kernel.org  # 4.0+
Fixes: 2f0810880f ("btrfs: delete chunk allocation attemp when
                      setting block group ro")

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-19 18:04:17 -07:00
Forrest Liu
1b98450816 Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
If device tree has hole, find_free_dev_extent() cannot find available
address properly.

The problem can be reproduce by following script.

    mntpath=/btrfs
    loopdev=/dev/loop0
    filepath=/home/forrest/image

    umount $mntpath
    losetup -d $loopdev
    truncate --size 100g $filepath
    losetup $loopdev $filepath
    mkfs.btrfs -f $loopdev
    mount $loopdev $mntpath

    # make device tree with one big hole
    for i in `seq 1 1 100`; do
        fallocate -l 1g $mntpath/$i
    done
    sync
    for i in `seq 1 1 95`; do
        rm $mntpath/$i
    done
    sync

    # wait cleaner thread remove unused block group
    sleep 300

    fallocate -l 1g $mntpath/aaa

    # failed to allocate new chunk
    fallocate -l 1g $mntpath/bbb

Above script will make device tree with one big hole, and can only allocate
just one chunk in a transaction, so failed to allocate new chunk for $mntpath/bbb

    item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 106292051968 length 1073741824
    item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 103108575232 length 1073741824

Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 06:26:59 -07:00
Zhao Lei
f2ab76188e btrfs: Fix tail space processing in find_free_dev_extent()
It is another reason for NO_SPACE case.

When we found enough free space in loop and saved them to
max_hole_start/size before, and tail space contains pending extent,
origional innocent max_hole_start/size are reset in retry.

As a result, find_free_dev_extent() returns less space than it can,
and cause NO_SPACE in user program.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-13 07:27:18 -07:00
Chris Mason
fc4c3c872f Merge branch 'cleanups-post-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1
Signed-off-by: Chris Mason <clm@fb.com>

Conflicts:
	fs/btrfs/disk-io.c
2015-03-25 10:52:48 -07:00
David Sterba
258ece0212 btrfs: remove shadowing variables in __btrfs_map_block
1) We can safely use the function's 'i'. Fixes warning

fs/btrfs/volumes.c:5257:7: warning: declaration of 'i' shadows a previous local
fs/btrfs/volumes.c:4951:6: warning: shadowed declaration is here

2) A local variable duplicates name of an argument, we can use the value
directly. Fixes warning

fs/btrfs/volumes.c:5433:8: warning: declaration of 'length' shadows a parameter
fs/btrfs/volumes.c:4935:27: warning: shadowed declaration is here

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:35 +01:00
David Sterba
9d644a623e btrfs: cleanup, use correct type in div_u64_rem
div_u64_rem expects u32 for divisior and reminder.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:01 +01:00
David Sterba
47c5713f47 btrfs: replace remaining do_div calls with div_u64 variants
Switch to div_u64_rem that does type checking and has more obvious
semantics than do_div.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:01 +01:00
David Sterba
b8b93addde btrfs: cleanup 64bit/32bit divs, provably bounded values
The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type.
Get rid of a few more do_div instances.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:00 +01:00
David Sterba
31e818fe73 btrfs: cleanup, use kmalloc_array/kcalloc array helpers
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional
overflow checks, the zeroing variant is kcalloc.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:58 +01:00
David Sterba
853d8ec4b2 btrfs: need_resched not needed with cond_resched
Cleanup, no special reason to do

if (need_resched())
        cond_resched();

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:55 +01:00
Chris Mason
e57cf21e97 Btrfs: fix allocation size calculations in alloc_btrfs_bio
Since commit 8e5cfb55d3 (Btrfs: Make raid_map array be inlined in
btrfs_bio structure), the raid map array is allocated along with the
btrfs bio in alloc_btrfs_bio.  The calculation used to decide how much
we need to allocate was using the wrong parameter passed into the
allocation function.

The passed in real_stripes will be zero if a target replace operation
is not currently running.  We want to use total_stripes instead.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: David Sterba <dsterba@suse.cz>
Tested-by: David Sterba <dsterba@suse.cz>
2015-02-20 06:55:15 -08:00
Eric Sandeen
9eaed21ef9 btrfs: remove unused fs_info arg from btrfs_close_extra_devices()
The commit:
8dabb74 Btrfs: change core code of btrfs to support the
        device replace operations
added the fs_info argument, but never used it -
just remove it again.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:45 +01:00
Zhao Lei
08da757d3f btrfs: cleanup: use for() loop in btrfs_map_bio()
for() is obviously better in these code block, and remove noused
init-value to reduce about 6 bytes binary size.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Zhao Lei
a688a04aab btrfs: remove unused chunk_tree argument in several functions
There functions include unused chunk_tree argument from the begining,
it is time to remove them and clean up relative code to prepare value
of this argument in caller.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
David Sterba
e8c9f18603 btrfs: constify structs with op functions or static definitions
There are some op tables that can be easily made const, similarly the
sysfs feature and raid tables. This is motivated by PaX CONSTIFY plugin.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Zhao Lei
13212b54d1 btrfs: Fix out-of-space bug
Btrfs will report NO_SPACE when we create and remove files for several times,
and we can't write to filesystem until mount it again.

Steps to reproduce:
 1: Create a single-dev btrfs fs with default option
 2: Write a file into it to take up most fs space
 3: Delete above file
 4: Wait about 100s to let chunk removed
 5: goto 2

Script is like following:
 #!/bin/bash

 # Recommend 1.2G space, too large disk will make test slow
 DEV="/dev/sda16"
 MNT="/mnt/tmp"

 dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
 file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))

 echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"

 for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
 echo "mkfs $DEV"
 mkfs.btrfs -f "$DEV" >/dev/null || exit 2
 echo "mount $DEV $MNT"
 mount "$DEV" "$MNT" || exit 2

 for ((loop_i = 0; loop_i < 20; loop_i++)); do
     echo
     echo "loop $loop_i"

     echo "dd file..."
     cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
     "${cmd[@]}" 2>/dev/null || {
         # NO_SPACE error triggered
         echo "dd failed: ${cmd[*]}"
         exit 1
     }

     echo "rm file..."
     rm -f "$MNT"/file0 || exit 2

     for ((i = 0; i < 10; i++)); do
         df "$MNT" | tail -1
         sleep 10
     done
 done

Reason:
 It is triggered by commit: 47ab2a6c68
 which is used to remove empty block groups automatically, but the
 reason is not in that patch. Code before works well because btrfs
 don't need to create and delete chunks so many times with high
 complexity.
 Above bug is caused by many reason, any of them can trigger it.

Reason1:
 When we remove some continuous chunks but leave other chunks after,
 these disk space should be used by chunk-recreating, but in current
 code, only first create will successed.
 Fixed by Forrest Liu <forrestl@synology.com> in:
 Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

Reason2:
 contains_pending_extent() return wrong value in calculation.
 Fixed by Forrest Liu <forrestl@synology.com> in:
 Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

Reason3:
 btrfs_check_data_free_space() try to commit transaction and retry
 allocating chunk when the first allocating failed, but space_info->full
 is set in first allocating, and prevent second allocating in retry.
 Fixed in this patch by clear space_info->full in commit transaction.

 Tested for severial times by above script.

Changelog v3->v4:
 use light weight int instead of atomic_t to record have_remove_bgs in
 transaction, suggested by:
 Josef Bacik <jbacik@fb.com>

Changelog v2->v3:
 v2 fixed the bug by adding more commit-transaction, but we
 only need to reclaim space when we are really have no space for
 new chunk, noticed by:
 Filipe David Manana <fdmanana@gmail.com>

 Actually, our code already have this type of commit-and-retry,
 we only need to make it working with removed-bgs.
 v3 fixed the bug with above way.

Changelog v1->v2:
 v1 will introduce a new bug when delete and create chunk in same disk
 space in same transaction, noticed by:
 Filipe David Manana <fdmanana@gmail.com>
 V2 fix this bug by commit transaction after remove block grops.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Suggested-by: Filipe David Manana <fdmanana@gmail.com>
Suggested-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:19:14 -08:00
David Sterba
e3540eab29 btrfs: add more checks to btrfs_read_sys_array
Verify that the sys_array has enough bytes to read the next item.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:39 -08:00
David Sterba
1ffb22cf8c btrfs: cleanup, rename a few variables in btrfs_read_sys_array
There's a pointer to buffer, integer offset and offset passed as
pointer, try to find matching names for them.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:28 -08:00
Zhao Lei
ffe2d2034b Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply
So we can check raid56 with:
 (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
instead of long:
 (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
10f1190016 Btrfs: Include map_type in raid_bio
Corrent code use many kinds of "clever" way to determine operation
target's raid type, as:
  raid_map != NULL
  or
  raid_map[MAX_NR] == RAID[56]_Q_STRIPE

To make code easy to maintenance, this patch put raid type into
bbio, and we can always get raid type from bbio with a "stupid"
way.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
6e9606d2a2 Btrfs: add ref_count and free function for btrfs_bio
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
   to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
   forced data type checking in compile.

Changelog v1->v2:
 Rename following by David Sterba's suggestion:
 put_btrfs_bio() -> btrfs_put_bio()
 get_btrfs_bio() -> btrfs_get_bio()
 bbio->ref_count -> bbio->refs

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
8e5cfb55d3 Btrfs: Make raid_map array be inlined in btrfs_bio structure
It can make code more simple and clear, we need not care about
free bbio and raid_map together.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Zhao Lei
cc7539edea Btrfs: sort raid_map before adding tgtdev stripes
It can avoid complex calculation of real stripes in sort,
moreover, we can clean up code of sorting tgtdev_map because it
will be in order initially.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Chris Mason
57bbddd7fb Merge branch 'cleanup/blocksize-diet-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2015-01-21 17:49:35 -08:00
Al Viro
98af592f5b btrfs: filp_open() returns ERR_PTR() on failure, not NULL...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 06:43:56 -05:00
David Sterba
a83fffb75d btrfs: sink blocksize parameter to btrfs_find_create_tree_block
Finally it's clear that the requested blocksize is always equal to
nodesize, with one exception, the superblock.

Superblock has fixed size regardless of the metadata block size, but
uses the same helpers to initialize sys array/chunk tree and to work
with the chunk items. So it pretends to be an extent_buffer for a
moment, btrfs_read_sys_array is full of special cases, we're adding one
more.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:21 +01:00
Chris Mason
9627aeee3e Merge branch 'raid56-scrub-replace' of git://github.com/miaoxie/linux-btrfs into for-linus 2014-12-02 18:42:03 -08:00
Filipe Manana
495e64f4fe Btrfs: fix fs mapping extent map leak
On chunk allocation error (label "error_del_extent"), after adding the
extent map to the tree and to the pending chunks list, we would leave
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).

Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.

Detected by 'rmmod btrfs':

[20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
[20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-1+ #1
[20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20770.106130]  0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
[20770.106132]  ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
[20770.106134]  ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
[20770.106136] Call Trace:
[20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
[20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
[20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
[20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
[20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana
04216820fe Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Miao Xie
4245215d6a Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.

The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:47 +08:00
Miao Xie
2c8cdd6ee4 Btrfs, replace: write dirty pages into the replace target device
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
  take the target device stripes into account.
- When we do write operation, we read the data from the common devices
  and calculate the parity. Then write the dirty data and new parity
  out, at this time, we will find the relative replace target stripes
  and wirte the relative data into it.

Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:46 +08:00