Commit Graph

211617 Commits

Author SHA1 Message Date
Josef Bacik 7619585390 Btrfs: fix more ESTALE problems with NFS
When creating new inodes we don't setup inode->i_generation.  So if we generate
an fh with a newly created inode we save the generation of 0, but if we flush
the inode to disk and have to read it back when getting the inode on the server
we'll have the right i_generation, so gens wont match and we get ESTALE.  This
patch properly sets inode->i_generation when we create the new inode and now I'm
no longer getting ESTALE.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:08 -05:00
Josef Bacik 2ede0daf01 Btrfs: handle NFS lookups properly
People kept reporting NFS issues, specifically getting ESTALE alot.  I figured
out how to reproduce the problem

SERVER
mkfs.btrfs /dev/sda1
mount /dev/sda1 /mnt/btrfs-test
<add /mnt/btrfs-test to /etc/exports>
btrfs subvol create /mnt/btrfs-test/foo
service nfs start

CLIENT
mount server:/mnt/btrfs /mnt/test
cd /mnt/test/foo
ls

SERVER
echo 3 > /proc/sys/vm/drop_caches

CLIENT
ls			<-- get an ESTALE here

This is because the standard way to lookup a name in nfsd is to use readdir, and
what it does is do a readdir on the parent directory looking for the inode of
the child.  So in this case the parent being / and the child being foo.  Well
subvols all have the same inode number, so doing a readdir of / looking for
inode 256 will return '.', which obviously doesn't match foo.  So instead we
need to have our own .get_name so that we can find the right name.

Our .get_name will either lookup the inode backref or the root backref,
whichever we're looking for, and return the name we find.  Running the above
reproducer with this patch results in everything acting the way its supposed to.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:08 -05:00
Mariusz Kozlowski 0410c94aff btrfs: make 1-bit signed fileds unsigned
Fixes these sparse warnings:
fs/btrfs/ctree.h:811:17: error: dubious one-bit signed bitfield
fs/btrfs/ctree.h:812:20: error: dubious one-bit signed bitfield
fs/btrfs/ctree.h:813:19: error: dubious one-bit signed bitfield

Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:07 -05:00
Li Zefan f209561ad8 btrfs: Show device attr correctly for symlinks
Symlinks and files of other types show different device numbers, though
they are on the same partition:

 $ touch tmp; ln -s tmp tmp2; stat tmp tmp2
   File: `tmp'
   Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
 Device: 15h/21d	Inode: 984027      Links: 1
 --- snip ---
   File: `tmp2' -> `tmp'
   Size: 3         	Blocks: 0          IO Block: 4096   symbolic link
 Device: 13h/19d	Inode: 984028      Links: 1

Reported-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:07 -05:00
Li Zefan 5f3888ff6f btrfs: Set file size correctly in file clone
Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the
original filesize of the dest file 'file2' is 30K:

  # ls -l /mnt/file2
  -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2

Now clone file1 to file2, the dest file should be 40K, but it
still shows 30K:

  # ls -l /mnt/file2
  -rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:06 -05:00
Li Zefan 2a6b8daeda btrfs: Check if dest_offset is block-size aligned before cloning file
We've done the check for src_offset and src_length, and We should
also check dest_offset, otherwise we'll corrupt the destination
file:

  (After cloning file1 to file2 with unaligned dest_offset)
  # cat /mnt/file2
  cat: /mnt/file2: Input/output error

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:05 -05:00
Josef Bacik 0de90876c6 Btrfs: handle the space_cache option properly
When I added the clear_cache option I screwed up and took the break out of
the space_cache case statement, so whenever you mount with space_cache you also
get clear_cache, which does you no good if you say set space_cache in fstab so
it always gets set.  This patch adds the break back in properly.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:05 -05:00
Arne Jansen 6f33434850 btrfs: Fix early enospc because 'unused' calculated with wrong sign.
'unused' calculated with wrong sign in reserve_metadata_bytes().
This might have lead to unwanted over-reservations.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:04 -05:00
Miao Xie e65e153554 btrfs: fix panic caused by direct IO
btrfs paniced when we write >64KB data by direct IO at one time.

Reproduce steps:
 # mkfs.btrfs /dev/sda5 /dev/sda6
 # mount /dev/sda5 /mnt
 # dd if=/dev/zero of=/mnt/tmpfile bs=100K count=1 oflag=direct

Then btrfs paniced:
mapping failed logical 1103155200 bio len 69632 len 12288
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:3010!
[SNIP]
Pid: 1992, comm: btrfs-worker-0 Not tainted 2.6.37-rc1 #1 D2399/PRIMERGY
RIP: 0010:[<ffffffffa03d1462>]  [<ffffffffa03d1462>] btrfs_map_bio+0x202/0x210 [btrfs]
[SNIP]
Call Trace:
 [<ffffffffa03ab3eb>] __btrfs_submit_bio_done+0x1b/0x20 [btrfs]
 [<ffffffffa03a35ff>] run_one_async_done+0x9f/0xb0 [btrfs]
 [<ffffffffa03d3d20>] run_ordered_completions+0x80/0xc0 [btrfs]
 [<ffffffffa03d45a4>] worker_loop+0x154/0x5f0 [btrfs]
 [<ffffffffa03d4450>] ? worker_loop+0x0/0x5f0 [btrfs]
 [<ffffffffa03d4450>] ? worker_loop+0x0/0x5f0 [btrfs]
 [<ffffffff81083216>] kthread+0x96/0xa0
 [<ffffffff8100cec4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81083180>] ? kthread+0x0/0xa0
 [<ffffffff8100cec0>] ? kernel_thread_helper+0x0/0x10

We fix this problem by splitting bios when we submit bios.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:04 -05:00
Miao Xie 88f794ede7 btrfs: cleanup duplicate bio allocating functions
extent_bio_alloc() and compressed_bio_alloc() are similar, cleanup
similar source code.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:03 -05:00
Miao Xie 0c56fa9662 btrfs: fix free dip and dip->csums twice
bio_endio() will free dip and dip->csums, so dip and dip->csums twice will
be freed twice. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:02 -05:00
Chris Mason 784b4e29a2 Btrfs: add migrate page for metadata inode
Migrate page will directly call the btrfs btree writepage function,
which isn't actually allowed.

Our writepage assumes that you have locked the extent_buffer and
flagged the block as written.  Without doing these steps, we can
corrupt metadata blocks.

A later commit will remove the btree writepage function since
it is really only safely used internally by btrfs.  We
use writepages for everything else.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-11-21 22:26:02 -05:00
Chris Mason 6418c96107 Btrfs: deal with errors from updating the tree log
During unlink we remove any references to the inode from
the tree log.  It can return -ENOENT and other errors,
and this changes the unlink code to deal with it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-30 07:34:24 -04:00
Sage Weil 4260f7c751 Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2).  We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).
We additionally require that the user has write+exec permission on the
subvol root inode.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 21:42:10 -04:00
Sage Weil 531cb13f1e Btrfs: make SNAP_DESTROY async
There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.

If users _do_ want the deletion to be durable, they can call 'sync'.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 21:42:10 -04:00
Sage Weil 72fd032e94 Btrfs: add SNAP_CREATE_ASYNC ioctl
Create a snap without waiting for it to commit to disk.  The ioctl is
ordered such that subsequent operations will not be contained by the
created snapshot, and the commit is initiated, but the ioctl does not
wait for the snapshot to commit to disk.

We return the specific transid to userspace so that an application can wait
for this specific snapshot creation to commit via the WAIT_SYNC ioctl.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 21:41:57 -04:00
Sage Weil 462045928b Btrfs: add START_SYNC, WAIT_SYNC ioctls
START_SYNC will start a sync/commit, but not wait for it to
complete.  Any modification started after the ioctl returns is
guaranteed not to be included in the commit.  If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.

WAIT_SYNC will wait for any in-progress commit to complete.  If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately.  If the specified transaction doesn't exist, it
returns EINVAL.

If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.

These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.

Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.

Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result.  However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well.  Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:41:32 -04:00
Sage Weil bb9c12c945 Btrfs: async transaction commit
Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk.  This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.

The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:34 -04:00
Sage Weil 99d16cbcaf Btrfs: fix deadlock in btrfs_commit_transaction
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers > 1 or should_grow at the top of the loop.  Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true.  However, it's possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.

Fix this by deciding how long to wait when we wait.  Include a smp_mb()
before checking if the waitqueue is active to ensure the num_writers
is visible.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:34 -04:00
Sage Weil fccdae435c Btrfs: fix lockdep warning on clone ioctl
I'm no lockdep expert, but this appears to make the lockdep warning go
away for the i_mutex locking in the clone ioctl.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Sage Weil 050006a753 Btrfs: fix clone ioctl where range is adjacent to extent
We had an edge case issue where the requested range was just
following an existing extent. Instead of skipping to the next
extent, we used the previous one which lead to having zero
sized extents.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Sage Weil 9a019196ec Btrfs: fix delalloc checks in clone ioctl
The lookup_first_ordered_extent() was done on the wrong inode, and the
->delalloc_bytes test was wrong, as the following
btrfs_wait_ordered_range() would only invoke a range write and wouldn't
write the entire file data range. Also, a bad parameter was passed to
btrfs_wait_ordered_range().

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Chris Mason d8e39c457b Btrfs: drop unused variable in block_alloc_rsv
The alloc_target variable is not really used.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:17:41 -04:00
Andi Kleen 559af82114 Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.

Still needs more review.

Found by gcc 4.6's new warnings

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:37 -04:00
Andi Kleen 411fc6bcef Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
These are all the cases where a variable is set, but not
read which are really bugs.

- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things

Still needs more review.

Found by gcc 4.6's new warnings.

[akpm@linux-foundation.org: fix build.  Might have been bitrot]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:31 -04:00
Julia Lawall d0b678cb0a Btrfs: Use ERR_CAST helpers
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)).  The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
 ...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:23 -04:00
Julia Lawall 2354d08fe9 Btrfs: use memdup_user helpers
Use memdup_user when user data is immediately copied into the
allocated region.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@

-  to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+  to = memdup_user(from,size);
   if (
-      to==NULL
+      IS_ERR(to)
                 || ...) {
   <+... when != goto l1;
-  -ENOMEM
+  PTR_ERR(to)
   ...+>
   }
-  if (copy_from_user(to, from, size) != 0) {
-    <+... when != goto l2;
-    -EFAULT
-    ...+>
-  }
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:18 -04:00
Chris Mason 18e503d695 Btrfs: fix raid code for removing missing drives
When btrfs is mounted in degraded mode, it has some internal structures
to track the missing devices.  This missing device is setup as readonly,
but the mapping code can get upset when we try to write to it.

This changes the mapping code to return -EIO instead of oops when we try
to write to the readonly device.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:46 -04:00
Miao Xie 19fe0a8b78 Btrfs: Switch the extent buffer rbtree into a radix tree
This patch reduces the CPU time spent in the extent buffer search by using the
radix tree instead of the rbtree and using the rcu lock instead of the spin
lock.

I did a quick test by the benchmark tool[1] and found the patch improve the
file creation/deletion performance problem that I have reported[2].

Before applying this patch:
Create files:
	Total files: 50000
	Total time: 0.971531
	Average time: 0.000019
Delete files:
	Total files: 50000
	Total time: 1.366761
	Average time: 0.000027

After applying this patch:
Create files:
	Total files: 50000
	Total time: 0.927455
	Average time: 0.000019
Delete files:
	Total files: 50000
	Total time: 1.292280
	Average time: 0.000026

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
[2] http://marc.info/?l=linux-btrfs&m=128212635122920&w=2

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:45 -04:00
Miao Xie 897ca6e9b4 Btrfs: restructure try_release_extent_buffer()
restructure try_release_extent_buffer() and write a function to release the
extent buffer. It will be used later.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:45 -04:00
Chris Mason bf9022e06a Btrfs: use the flusher threads for delalloc throttling
We have a fairly complex set of loops around walking our list of
delalloc inodes when we find metadata delalloc space running low.
It doesn't work very well, can use large amounts of CPU and doesn't
do very efficient writeback.

This switches us to kick the bdi flusher threads instead.  All dirty
data in btrfs is accounted as delalloc data, so this is very similar
in terms of what it writes, but we're able to just kick off the IO
and wait for progress.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:36 -04:00
Chris Mason e5bc245829 Btrfs: tune the chunk allocation to 5% of the FS as metadata
An earlier commit tried to keep us from allocating too many
empty metadata chunks.  It was somewhat too restrictive and could
lead to ENOSPC errors on empty filesystems.

This increases the limits to about 5% of the FS size, allowing more
metadata chunks to be preallocated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:35 -04:00
Chris Mason 3259f8bed2 Add new functions for triggering inode writeback
When btrfs is running low on metadata space, it needs to force delayed
allocation pages to disk.  It currently does this with a suboptimal walk
of a private list of inodes with delayed allocation, and it would be
much better if we used the generic flusher threads.

writeback_inodes_sb_if_idle would be ideal, but it waits for the flusher
thread to start IO on all the dirty pages in the FS before it returns.
This adds variants of writeback_inodes_sb* that allow the caller to
control how many pages get sent down.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:29 -04:00
Chris Mason cb44921a09 Btrfs: don't loop forever on bad btree blocks
When btrfs discovers the generation number in a btree block is
incorrect, it can loop forever without forcing the RAID
code to try a valid mirror, and without returning EIO.

This changes things to properly kick out the EIO.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 09:31:30 -04:00
Chris Mason 6b5b817f10 Merge branch 'bug-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work
Conflicts:
	fs/btrfs/extent-tree.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 09:27:49 -04:00
Josef Bacik 8216ef866d Btrfs: let the user know space caching is enabled
If you mount -o space_cache, the option will be persistent across mounts, but to
make sure the user knows that they did this, emit a message telling them if they
didn't mount with -o space_cache but the feature is still used.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:37 -04:00
Josef Bacik 88c2ba3b06 Btrfs: Add a clear_cache mount option
If something goes wrong with the free space cache we need a way to make sure
it's not loaded on mount and that it's cleared for everybody.  When you pass the
clear_cache option it will make it so all block groups are setup to be cleared,
which keeps them from being loaded and then they will be truncated when the
transaction is committed.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:36 -04:00
Josef Bacik 67377734fd Btrfs: add support for mixed data+metadata block groups
There are just a few things that need to be fixed in the kernel to support mixed
data+metadata block groups.  Mostly we just need to make sure that if we are
using mixed block groups that we continue to allocate mixed block groups as we
need them.  Also we need to make sure __find_space_info will find our space info
if we search for DATA or METADATA only.  Tested this with xfstests and it works
nicely.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:36 -04:00
Josef Bacik dde5abee12 Btrfs: check cache->caching_ctl before returning if caching has started
With the free space disk caching we can mark the block group as started with the
caching, but we don't have a caching ctl.  This can race with anybody else who
tries to get the caching ctl before we cache (this is very hard to do btw).  So
instead check to see if cache->caching_ctl is set, and if not return NULL.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:35 -04:00
Josef Bacik 9d66e233c7 Btrfs: load free space cache if it exists
This patch actually loads the free space cache if it exists.  The only thing
that really changes here is that we need to cache the block group if we're going
to remove an extent from it.  Previously we did not do this since the caching
kthread would pick it up.  With the on disk cache we don't have this luxury so
we need to make sure we read the on disk cache in first, and then remove the
extent, that way when the extent is unpinned the free space is added to the
block group.  This has been tested with all sorts of things.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:35 -04:00
Josef Bacik 0cb59c9953 Btrfs: write out free space cache
This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups.  There are a bunch of changes
in inode.c in order to account for special cases.  Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions.  Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock.  This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:29 -04:00
Josef Bacik 0af3d00bad Btrfs: create special free space cache inode
In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group.  So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have.  We
truncate and pre-allocate space everytime to make sure it's uptodate.

This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way.  When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-28 15:59:09 -04:00
Josef Bacik e9bb7f10d3 Btrfs: remove warn_on from use_block_rsv
Because btrfs_dirty_inode does a btrfs_join_transaction, it doesn't actually
reserve space.  It does this so we can try and dirty the inode quickly without
having to deal with the ENOSPC problems.  But if it does get back ENOSPC it
handles it properly.  The problem is use_block_rsv does a WARN_ON whenever this
case happens, even tho btrfs_dirty_inode takes it into account and actually
expects to get -ENOSPC if things are particularly tight.  So instead just remove
the warning.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-26 12:55:03 -04:00
Josef Bacik 382279336f Btrfs: set trans to null in reserve_metadata_bytes if we commit the transaction
btrfs_commit_transaction will free our trans, but because we pass trans to
shrink_delalloc we could possibly have a use after free situation.  So instead
if we commit the transaction, set trans to null and set committed to true so we
don't keep trying to commit a transaction.  This fixes a panic I could reproduce
at will.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-26 12:52:53 -04:00
Josef Bacik 0e78340f3c Btrfs: fix error handling in btrfs_get_sb
If we failed to find the root subvol id, or the subvol=<name>, we would
deactivate the locked super and close the devices.  The problem is at this point
we have gotten the SB all setup, which includes setting super_operations, so
when we'd deactiveate the super, we'd do a close_ctree() which closes the
devices, so we'd end up closing the devices twice.  So if you do something like
this

mount /dev/sda1 /mnt/test1
mount /dev/sda1 /mnt/test2 -o subvol=xxx
umount /mnt/test1

it would blow up (if subvol xxx doesn't exist).  This patch fixes that problem.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:55:03 -04:00
Josef Bacik 8bb8ab2e93 Btrfs: rework how we reserve metadata bytes
With multi-threaded writes we were getting ENOSPC early because somebody would
come in, start flushing delalloc because they couldn't make their reservation,
and in the meantime other threads would come in and use the space that was
getting freed up, so when the original thread went to check to see if they had
space they didn't and they'd return ENOSPC.  So instead if we have some free
space but not enough for our reservation, take the reservation and then start
doing the flushing.  The only time we don't take reservations is when we've
already overcommitted our space, that way we don't have people who come late to
the party way overcommitting ourselves.  This also moves all of the retrying and
flushing code into reserve_metdata_bytes so it's all uniform.  This keeps my
fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
fill up the disk.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:55:01 -04:00
Josef Bacik 14ed0ca6e8 Btrfs: don't allocate chunks as aggressively
Because the ENOSPC code over reserves super aggressively we end up allocating
chunks way more often than we should.  For example with my fs_mark tests on a
2gb fs I can end up reserved 1gb just for metadata, when only 34mb of that is
being used.  So instead check to see if the amount of space actually used is
less than 30% of the total space, and if so don't allocate a chunk, but only if
we have at least 256mb of free space to make sure we don't put too much pressure
on free space.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:55:00 -04:00
Josef Bacik 0019f10db6 Btrfs: re-work delalloc flushing
Currently we try and flush delalloc, but we only do that in a sort of weak way,
which works fine in most cases but if we're under heavy pressure we need to be
able to wait for flushing to happen.  Also instead of checking the bytes
reserved in the block_rsv, check the space info since it is more accurate.  The
sync option will be used in a future patch.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:58 -04:00
Josef Bacik 6d48755d02 Btrfs: fix reservation code for mixed block groups
The global reservation stuff tries to add together DATA and METADATA used in
order to figure out how much to reserve for everything, but this doesn't work
right for mixed block groups.  Instead if we have mixed block groups just set
data used to 0.  Also with mixed block groups we will use bytes_may_use for
keeping track of delalloc bytes, so we need to take that into account in our
reservation calculations.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:56 -04:00
Josef Bacik 89a55897a2 Btrfs: fix df regression
The new ENOSPC stuff breaks out the raid types which breaks the way we were
reporting df to the system.  This fixes it back so that Available is the total
space available to data and used is the actual bytes used by the filesystem.
This means that Available is Total - data used - all of the metadata space.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:55 -04:00