Commit Graph

214 Commits

Author SHA1 Message Date
Josef Bacik de2b530bfb Btrfs: actually log directory we are fsync()'ing
If you just create a directory and then fsync that directory and then pull the
power plug you will come back up and the directory will not be there.  That is
because we won't actually create directories if we've logged files inside of
them since they will be created on replay, but in this check we will set our
logged_trans of our current directory if it happens to be a directory, making us
think it doesn't need to be logged.  Fix the logic to only do this to parent
directories.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:24 -04:00
Filipe David Borba Manana ada9af215c Btrfs: don't ignore errors from btrfs_run_delayed_items
tree-log.c was ignoring the return value from btrfs_run_delayed_items()
in several places.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:54 -04:00
Filipe David Borba Manana 2bac325ea8 Btrfs: fix inode leak on kmalloc failure in tree-log.c
In tree-log.c:replay_one_name(), if memory allocation for
the name fails, ensure we iput the dir inode we got before
before we return.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:53 -04:00
Josef Bacik f3b15ccdbb Btrfs: release both paths before logging dir/changed extents
The ceph guys tripped over this bug where we were still holding onto the
original path that we used to copy the inode with when logging.  This is based
on Chris's fix which was reported to fix the problem.  We need to drop the paths
in two cases anyway so just move the drop up so that we don't have duplicate
code.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:30:16 -04:00
Josef Bacik 8c2a1a3028 Btrfs: exclude logged extents before replying when we are mixed
With non-mixed block groups we replay the logs before we're allowed to do any
writes, so we get away with not pinning/removing the data extents until right
when we replay them.  However with mixed block groups we allocate out of the
same pool, so we could easily allocate a metadata block that was logged in our
tree log.  To deal with this we just need to notice that we have mixed block
groups and do the normal excluding/removal dance during the pin stage of the log
replay and that way we don't allocate metadata blocks from areas we have logged
data extents.  With this patch we now pass xfstests generic/311 with mixed
block groups turned on.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:17 -04:00
Miao Xie c6adc9cc08 Btrfs: merge pending IO for tree log write back
Before applying this patch, we flushed the log tree of the fs/file
tree firstly, and then flushed the log root tree. It is ineffective,
especially on the hard disk. This patch improved this problem by wrapping
the above two flushes by the same blk_plug.

By test, the performance of the sync write went up ~60%(2.9MB/s -> 4.6MB/s)
on my scsi disk whose disk buffer was enabled.

Test step:
 # mkfs.btrfs -f -m single <disk>
 # mount <disk> <mnt>
 # dd if=/dev/zero of=<mnt>/file0 bs=32K count=1024 oflag=sync

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:05 -04:00
Liu Bo 2da1c669f0 Btrfs: kill replicate code in replay_one_buffer
EXTREF is treated same as REF, so we can make the code tidy.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:01 -04:00
Miao Xie cb517eabba Btrfs: cleanup the similar code of the fs root read
There are several functions whose code is similar, such as
  btrfs_find_last_root()
  btrfs_read_fs_root_no_radix()

Besides that, some functions are invoked twice, it is unnecessary,
for example, we are sure that all roots which is found in
  btrfs_find_orphan_roots()
have their orphan items, so it is unnecessary to check the orphan
item again.

So cleanup it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:37 -04:00
Eric Sandeen 48a3b6366f btrfs: make static code static & remove dead code
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:23 -04:00
Josef Bacik 3650860b90 Btrfs: remove almost all of the BUG()'s from tree-log.c
There were a whole bunch and I was doing it for other things.  I haven't tested
these error paths but at the very least this is better than panicing.  I've only
left 2 BUG_ON()'s since they are logic errors and I want to replace them with a
ASSERT framework that we can compile out for production users.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:21 -04:00
Josef Bacik b50c6e250e Btrfs: deal with free space cache errors while replaying log
So everybody who got hit by my fsync bug will still continue to hit this
BUG_ON() in the free space cache, which is pretty heavy handed.  So I took a
file system that had this bug and fixed up all the BUG_ON()'s and leaks that
popped up when I tried to mount a broken file system like this.  With this patch
we just fail to mount instead of panicing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:20 -04:00
Josef Bacik abefa55ac1 Btrfs: check return value of commit when recovering log
We need to check the return value of the commit in case something goes wrong,
otherwise we could end up going down the line and doing more stuff (like orphan
cleanup) before we notice we should have errored out.  We need to do this before
we free up the log_tree_root since the caller will handle all of that.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:10 -04:00
Josef Bacik 5ec8dca761 Btrfs: don't try and free ebs twice in log replay
This work is done by btrfs_free_path() anyway so there's no need for this
duplicate work.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:04 -04:00
Tsutomu Itoh 4b90c68015 Btrfs: remove unused argument of btrfs_extend_item()
Argument 'trans' is not used in btrfs_extend_item().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:53 -04:00
Tsutomu Itoh afe5fea72b Btrfs: cleanup of function where fixup_low_keys() is called
If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:52 -04:00
Josef Bacik 09a2a8f96e Btrfs: fix bad extent logging
A user sent me a btrfs-image of a file system that was panicing on mount during
the log recovery.  I had originally thought these problems were from a bug in
the free space cache code, but that was just a symptom of the problem.  The
problem is if your application does something like this

[prealloc][prealloc][prealloc]

the internal extent maps will merge those all together into one extent map, even
though on disk they are 3 separate extents.  So if you go to write into one of
these ranges the extent map will be right since we use the physical extent when
doing the write, but when we log the extents they will use the wrong sizes for
the remainder prealloc space.  If this doesn't happen to trip up the free space
cache (which it won't in a lot of cases) then you will get bogus entries in your
extent tree which will screw stuff up later.  The data and such will still work,
but everything else is broken.  This patch fixes this by not allowing extents
that are on the modified list to be merged.  This has the side effect that we
are no longer adding everything to the modified list all the time, which means
we now have to call btrfs_drop_extents every time we log an extent into the
tree.  So this allows me to drop all this speciality code I was using to get
around calling btrfs_drop_extents.  With this patch the testcase I've created no
longer creates a bogus file system after replaying the log.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:34 -04:00
Josef Bacik cc95bef635 Btrfs: log ram bytes properly
When logging changed extents I was logging ram_bytes as the current length,
which isn't correct, it's supposed to be the ram bytes of the original extent.
This is for compression where even if we split the extent we need to know the
ram bytes so when we uncompress the extent we know how big it will be.  This was
still working out right with compression for some reason but I think we were
getting lucky.  It was definitely off for prealloc which is why I noticed it,
btrfsck was complaining about it.  With this patch btrfsck no longer complains
after a log replay.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:33 -04:00
Zhi Yong Wu 6841ebee6b btrfs: Cleanup some redundant codes in btrfs_log_inode()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:20 -04:00
Josef Bacik 4bc4bee459 Btrfs: make sure nbytes are right after log replay
While trying to track down a tree log replay bug I noticed that fsck was always
complaining about nbytes not being right for our fsynced file.  That is because
the new fsync stuff doesn't wait for ordered extents to complete, so the inodes
nbytes are not necessarily updated properly when we log it.  So to fix this we
need to set nbytes to whatever it is on the inode that is on disk, so when we
replay the extents we can just add the bytes that are being added as we replay
the extent.  This makes it work for the case that we have the wrong nbytes or
the case that we logged everything and nbytes is actually correct.  With this
I'm no longer getting nbytes errors out of btrfsck.

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-04-13 07:35:06 -04:00
Josef Bacik 9bf7a48905 Btrfs: use set_nlink if our i_nlink is 0
We need to inc the nlink of deleted entries when running replay so we can do the
unlink on the fs_root and get everything cleaned up and then have the orphan
cleanup do the right thing.  The problem is inc_nlink complains about this, even
thought it still does the right thing.  So use set_nlink() if our i_nlink is 0
to keep users from seeing the warnings during log replay.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-03-04 16:30:06 -05:00
Josef Bacik 124fe663f9 Btrfs: delete inline extents when we find them during logging
Apparently when we do inline extents we allow the data to overlap the last chunk
of the btrfs_file_extent_item, which means that we can possibly have a
btrfs_file_extent_item that isn't actually as large as a btrfs_file_extent_item.
This messes with us when we try to overwrite the extent when logging new extents
since we expect for it to be the right size.  To fix this just delete the item
and try to do the insert again which will give us the proper sized
btrfs_file_extent_item.  This fixes a panic where map_private_extent_buffer
would blow up because we're trying to write past the end of the leaf.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-03-01 11:47:21 -05:00
Liu Bo 3321719ed6 Btrfs: fix memory leak of log roots
When we abort a transaction while fsyncing, we'll skip freeing log roots
part of committing a transaction, which leads to memory leak.

This adds a 'free log roots' in putting super when no more users hold
references on log roots, so it's safe and clean.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-28 13:33:51 -05:00
Qu Wenruo fda2832feb btrfs: cleanup for open-coded alignment
Though most of the btrfs codes are using ALIGN macro for page alignment,
there are still some codes using open-coded alignment like the
following:
------
        u64 mask = ((u64)root->stripesize - 1);
        u64 ret = (val + mask) & ~mask;
------
Or even hidden one:
------
        num_bytes = (end - start + blocksize) & ~(blocksize - 1);
------

Sometimes these open-coded alignment is not so easy to understand for
newbie like me.

This commit changes the open-coded alignment to the ALIGN macro for a
better readability.

Also there is a previous patch from David Sterba with similar changes,
but the patch is for 3.2 kernel and seems not merged.
http://www.spinics.net/lists/linux-btrfs/msg12747.html

Cc: David Sterba <dave@jikos.cz>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-26 11:04:13 -05:00
Eric Sandeen de78b51a28 btrfs: remove cache only arguments from defrag path
The entry point at the defrag ioctl always sets "cache only" to 0;
the codepaths haven't run for a long time as far as I can
tell.  Chris says they're dead code, so remove them.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:36 -05:00
Liu Bo dcfac4156f Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay
Argument 'trans' is not used any more.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:14 -05:00
Josef Bacik 2ab28f322f Btrfs: wait on ordered extents at the last possible moment
Since we don't actually copy the extent information from the source tree in
the fast case we don't need to wait for ordered io to be completed in order
to fsync, we just need to wait for the io to be completed.  So when we're
logging our file just attach all of the ordered extents to the log, and then
when the log syncs just wait for IO_DONE on the ordered extents and then
write the super.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:04 -05:00
Liu Bo 192000dda2 Btrfs: use right range to find checksum for compressed extents
For compressed extents, the range of checksum is covered by disk length,
and the disk length is different with ram length, so we need to use disk
length instead to get us the right checksum.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:17 -05:00
Josef Bacik 201a903894 Btrfs: do not allow logged extents to be merged or removed
We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list which would keep us from
logging the extent or freeing our ref on it, so we need to make sure to not
clear LOGGING until after the extent is logged, and then we can merge it to
adjacent extents.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:49:48 -05:00
Josef Bacik 0b1c6ccade Btrfs: use tokens where we can in the tree log
If we are syncing over and over the overhead of doing all those maps in
fill_inode_item and log_changed_extents really starts to hurt, so use map
tokens so we can avoid all the extra mapping.  Since the token maps from our
offset to the end of the page make sure to set the first thing in the item
first so we really only do one map.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:26 -05:00
Josef Bacik 70c8a91ce2 Btrfs: log changed inodes based on the extent map tree
We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree.  So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:24 -05:00
Josef Bacik a95249b392 Btrfs: don't bother copying if we're only logging the inode
We don't copy inode items anwyay, we just copy them straight into the log
from the in memory inode.  So if we know we're only logging the inode, don't
bother dropping anything, just try to insert it and either if it succeeds or
we get EEXIST we can update the inode item in the log and carry on.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:22 -05:00
Josef Bacik e997615149 Btrfs: only log the inode item if we can get away with it
Currently we copy all the file information into the log, inode item, the
refs, xattrs etc.  Except most of this doesn't change from fsync to fsync,
just the inode item changes.  So set a flag if an xattr changes or a link is
added, and otherwise only log the inode item.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:21 -05:00
Miao Xie 5269b67e3d Btrfs: fix missing log when BTRFS_INODE_NEEDS_FULL_SYNC is set
If we set BTRFS_INODE_NEEDS_FULL_SYNC, we should log all the extent,
but now we forget to take it into account, and set a wrong max key,
if so, we will skip the file extent metadata when doing logging. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:22 -05:00
Miao Xie bbe1426764 Btrfs: fix unprotected extent map operation when logging file extents
We forget to protect the modified_extents list, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:22 -05:00
Miao Xie 315a9850da Btrfs: fix wrong file extent length
There are two types of the file extent - inline extent and regular extent,
When we log file extents, we didn't take inline extent into account, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:21 -05:00
Liu Bo 183f37fa35 Btrfs: do not log extents when we only log new names
When we log new names, we need to log just enough to recreate the inode
during log replay, and there is no need to log extents along with it.

This actually fixes a bug revealed by xfstests 241, where it shows
that we're logging some extents that have not updated metadata,
so we don't get proper EXTENT_DATA items to be copied to log tree.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:18 -05:00
Eric W. Biederman e9069f4708 btrfs: Fix compilation with user namespace support enabled
When compiling with user namespace support btrfs fails like:

fs/btrfs/tree-log.c: In function ‘fill_inode_item’:
fs/btrfs/tree-log.c:2955:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_uid’
fs/btrfs/ctree.h:2026:1: note: expected ‘u32’ but argument is of type ‘kuid_t’
fs/btrfs/tree-log.c:2956:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_gid’
fs/btrfs/ctree.h:2027:1: note: expected ‘u32’ but argument is of type ‘kgid_t’

Fix this by using i_uid_read and i_gid_read in

Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-10-12 15:01:42 -07:00
Chris Mason f46dbe3dee btrfs: init ref_index to zero in add_inode_ref
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-09 11:17:20 -04:00
Stefan Behrens 5af3e8cce8 Btrfs: make filesystem read-only when submitting barrier fails
So far the return code of barrier_all_devices() is ignored, which
means that errors are ignored. The result can be a corrupt
filesystem which is not consistent.
This commit adds code to evaluate the return code of
barrier_all_devices(). The normal btrfs_error() mechanism is used to
switch the filesystem into read-only mode when errors are detected.

In order to decide whether barrier_all_devices() should return
error or success, the number of disks that are allowed to fail the
barrier submission is calculated. This calculation accounts for the
worst RAID level of metadata, system and data. If single, dup or
RAID0 is in use, a single disk error is already considered to be
fatal. Otherwise a single disk error is tolerated.

The calculation of the number of disks that are tolerated to fail
the barrier operation is performed when the filesystem gets mounted,
when a balance operation is started and finished, and when devices
are added or removed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-10-09 09:20:19 -04:00
Josef Bacik 94edf4ae43 Btrfs: don't bother committing delayed inode updates when fsyncing
We can just copy the in memory inode into the tree log directly, no sense in
updating the fs tree so we can copy it into the tree log tree.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09 09:19:50 -04:00
Josef Bacik 18ec90d63f Btrfs: be smarter about dropping things from the tree log
When we truncate existing items in the tree log we've been searching for
each individual item and removing them.  This is unnecessary churn and
searching, just keep track of the slot we are on and how many items we need
to delete and delete them all at once.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09 09:15:41 -04:00
Josef Bacik 6f1fed7753 Btrfs: don't lookup csums for prealloc extents
The tree logging stuff was looking up csums to copy over for prealloc
extents which is just work we don't need to be doing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09 09:15:41 -04:00
Josef Bacik e6138876ad Btrfs: cache extent state when writing out dirty metadata pages
Everytime we write out dirty pages we search for an offset in the tree,
convert the bits in the state, and then when we wait we search for the
offset again and clear the bits.  So for every dirty range in the io tree we
are doing 4 rb searches, which is suboptimal.  With this patch we are only
doing 2 searches for every cycle (modulo weird things happening).  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-09 09:15:41 -04:00
Mark Fasheh f186373fef btrfs: extended inode refs
This patch adds basic support for extended inode refs. This includes support
for link and unlink of the refs, which basically gets us support for rename
as well.

Inode creation does not need changing - extended refs are only added after
the ref array is full.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-10-09 09:14:45 -04:00
Jan Schmidt 5a1d7843ca btrfs: improved readablity for add_inode_ref
Moved part of the code into a sub function and replaced most of the gotos
by ifs, hoping that it will be easier to read now.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-10-08 20:09:02 -04:00
Josef Bacik 0aa4a17d82 Btrfs: handle not finding the extent exactly when logging changed extents
I started hitting warnings when running xfstest 68 in a loop because there
were EM's that were not lined up properly with the physical extents.  This
is ok, if we do something like punch a hole or write to a preallocated space
or something like that we can have an EM that doesn't cover the entire
physical extent.  So fix the tree logging stuff to cope with this case so we
don't just commit the transaction.  With this patch I no longer see the
warnings from the tree logging code.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-08 20:09:02 -04:00
Josef Bacik ff44c6e36d Btrfs: do not hold the write_lock on the extent tree while logging
Dave Sterba pointed out a sleeping while atomic bug while doing fsync.  This
is because I'm an idiot and didn't realize that rwlock's were spin locks, so
we've been holding this thing while doing allocations and such which is not
good.  This patch fixes this by dropping the write lock before we do
anything heavy and re-acquire it when it is done.  We also need to take a
ref on the em's in case their corresponding pages are evicted and mark them
as being logged so that releasepage does not remove them and doesn't remove
them from our local list.  Thanks,

Reported-by: Dave Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-04 09:39:58 -04:00
Miao Xie 2ecb79239b Btrfs: fix unprotected ->log_batch
We forget to protect ->log_batch when syncing a file, this patch fix
this problem by atomic operation. And ->log_batch is used to check
if there are parallel sync operations or not, so it is unnecessary to
reset it to 0 after the sync operation of the current log tree complete.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-01 15:19:12 -04:00
Josef Bacik 2aaa665581 Btrfs: add hole punching
This patch adds hole punching via fallocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:07 -04:00
Josef Bacik 2671485d39 Btrfs: remove unused hint byte argument for btrfs_drop_extents
I audited all users of btrfs_drop_extents and found that nobody actually uses
the hint_byte argument.  I'm sure it was used for something at some point but
it's not used now, and the way the pinning works the disk bytenr would never be
immediately useful anyway so lets just remove it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:06 -04:00
Liu Bo d279440511 Btrfs: check if an inode has no checksum when logging it
This is based on Josef's "Btrfs: turbo charge fsync".

If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum
items any more.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:06 -04:00
Liu Bo 46d8bc3424 Btrfs: fix a bug in checking whether a inode is already in log
This is based on Josef's "Btrfs: turbo charge fsync".

The current btrfs checks if an inode is in log by comparing
root's last_log_commit to inode's last_sub_trans[2].

But the problem is that this root->last_log_commit is shared among
inodes.

Say we have N inodes to be logged, after the first inode,
root's last_log_commit is updated and the N-1 remained files will
be skipped.

This fixes the bug by keeping a local copy of root's last_log_commit
inside each inode and this local copy will be maintained itself.

[1]: we regard each log transaction as a subset of btrfs's transaction,
i.e. sub_trans

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:06 -04:00
Liu Bo 4e2f84e63d Btrfs: improve fsync by filtering extents that we want
This is based on Josef's "Btrfs: turbo charge fsync".

The above Josef's patch performs very good in random sync write test,
because we won't have too much extents to merge.

However, it does not performs good on the test:
dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync

The reason is when we do sequencial sync write, we need to merge the
current extent just with the previous one, so that we can get accumulated
extents to log:

A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...

So we'll have to flush more and more checksum into log tree, which is the
bottleneck according to my tests.

But we can avoid this by telling fsync the real extents that are needed
to be logged.

With this, I did the above dd sync write test (size=50m),

         w/o (orig)   w/ (josef's)   w/ (this)
SATA      104KB/s       109KB/s       121KB/s
ramdisk   1.5MB/s       1.5MB/s       10.7MB/s (613%)

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:05 -04:00
Liu Bo 06d3d22b45 Btrfs: cleanup extents after we finish logging inode
This is based on Josef's "Btrfs: turbo charge fsync".

We should cleanup those extents after we've finished logging inode,
otherwise we may do redundant work on them.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
2012-10-01 15:19:04 -04:00
Josef Bacik 0fa83cdb1d Btrfs: only warn if we hit an error when doing the tree logging
I hit this a couple times while working on my fsync patch (all my bugs, not
normal operation), but with my new stuff we could have new errors from cases
I have not encountered, so instead of BUG()'ing we should be WARN()'ing so
that we are notified there is a problem but the user doesn't lose their
data.  We can easily commit the transaction in the case that the tree
logging fails and still be fine, so let's try and be as nice to the user as
possible.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:03 -04:00
Josef Bacik 5dc562c541 Btrfs: turbo charge fsync
At least for the vm workload.  Currently on fsync we will

1) Truncate all items in the log tree for the given inode if they exist

and

2) Copy all items for a given inode into the log

The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worst yet you may have
only modified a few extents, not the entire thing.  This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them.  We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already.  Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write

		Original	Patched
SATA drive	82KB/s		140KB/s
Fusion drive	431KB/s		2532KB/s

So around 2-6 times faster depending on your hardware.  There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok.  This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get.  All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.

The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok.  I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-10-01 15:19:03 -04:00
Tsutomu Itoh b995929515 Btrfs: return error of btrfs_update_inode() to caller
We didn't check error of btrfs_update_inode(), but that error looks
easy to bubble back up.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:54 -04:00
Chris Mason b6305567e7 Btrfs: run delayed directory updates during log replay
While we are resolving directory modifications in the
tree log, we are triggering delayed metadata updates to
the filesystem btrees.

This commit forces the delayed updates to run so the
replay code can find any modifications done.  It stops
us from crashing because the directory deleltion replay
expects items to be removed immediately from the tree.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
cc: stable@kernel.org
2012-07-02 15:39:19 -04:00
Josef Bacik 5bdbeb2187 Btrfs: fix return code in drop_objectid_items
So dpkg fsync()'s the file and the directory containing the file whenever it
writes to a file which is really slow in btrfs.  This is partly because
fsync()'ing a directory _always_ committed the transaction instead of just
going to the tree log.  This is because drop_objectid_items() would return 1
since it does a btrfs_search_slot() which returns 1.  In tree-log jargon
this means that we have to commit the transaction to be safe.  So just check
if ret is greater than 0 and set it to 0 if it does.  With this patch we now
use the tree-log instead of committing the entire transaction, which is
twice as fast on my box.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:42 -04:00
Josef Bacik 22ee6985de Btrfs: check to see if the inode is in the log before fsyncing
We have this check down in the actual logging code, but this is after we
start a transaction and all that good stuff.  So move the helper
inode_in_log() out so we can call it in fsync() and avoid starting a
transaction altogether and just exit if we've already fsync()'ed this file
recently.  You would notice this issue if you fsync()'ed a file over and
over again until the transaction committed.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:42 -04:00
Tsutomu Itoh 018642a1f1 Btrfs: return value of btrfs_read_buffer is checked correctly
btrfs_read_buffer() has the possibility of returning the error.
Therefore, I add the code in which the return value of btrfs_read_buffer()
is checked.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-05-30 10:23:41 -04:00
Chris Mason b9fab919b7 Btrfs: avoid sleeping in verify_parent_transid while atomic
verify_parent_transid needs to lock the extent range to make
sure no IO is underway, and so it can safely clear the
uptodate bits if our checks fail.

But, a few callers are using it with spinlocks held.  Most
of the time, the generation numbers are going to match, and
we don't want to switch to a blocking lock just for the error
case.  This adds an atomic flag to verify_parent_transid,
and changes it to return EAGAIN if it needs to block to
properly verifiy things.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-06 07:23:47 -04:00
Jeff Mahoney 79787eaab4 btrfs: replace many BUG_ONs with proper error handling
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
 progress but aims to handle most errors other than internal logic
 errors and ENOMEM more gracefully.

 This iteration prevents most crashes but can run into lockups with
 the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 11:52:54 +01:00
Jeff Mahoney 143bede527 btrfs: return void in functions without error conditions
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:34 +01:00
Jan Kara 6dd70ce4eb btrfs: Fix busyloops in transaction waiting code
wait_log_commit() and wait_for_writer() were using slightly different
conditions for deciding whether they should call schedule() and whether they
should continue in the wait loop. Thus it could happen that we busylooped when
the first condition was not true while the second one was. That is burning CPU
cycles needlessly and is deadly on UP machines...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-26 15:01:11 -05:00
Chris Mason 9785dbdf26 Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into integration 2012-01-16 15:26:31 -05:00
Arne Jansen 66d7e7f09f Btrfs: mark delayed refs as for cow
Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
from every call site. The for_cow parameter will later on be used to
determine if a ref will change anything with respect to qgroups.

Delayed refs coming from relocation are always counted as for_cow, as they
don't change subvol quota.

Also pass in the fs_info for later use.

btrfs_find_all_roots() will use this as an optimization, as changes that are
for_cow will not change anything with respect to which root points to a
certain leaf. Thus, we don't need to add the current sequence number to
those delayed refs.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-12-22 16:22:27 +01:00
Linus Torvalds 6a6662ced4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
  Btrfs: check for a null fs root when writing to the backup root log
  Btrfs: fix race during transaction joins
  Btrfs: fix a potential btrfs_bio leak on scrub fixups
  Btrfs: rename btrfs_bio multi -> bbio for consistency
  Btrfs: stop leaking btrfs_bios on readahead
  Btrfs: stop the readahead threads on failed mount
  Btrfs: fix extent_buffer leak in the metadata IO error handling
  Btrfs: fix the new inspection ioctls for 32 bit compat
  Btrfs: fix delayed insertion reservation
  Btrfs: ClearPageError during writepage and clean_tree_block
  Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
  Btrfs: make a delayed_block_rsv for the delayed item insertion
  Btrfs: add a log of past tree roots
  btrfs: separate superblock items out of fs_info
  Btrfs: use the global reserve when truncating the free space cache inode
  Btrfs: release metadata from global reserve if we have to fallback for unlink
  Btrfs: make sure to flush queued bios if write_cache_pages waits
  Btrfs: fix extent pinning bugs in the tree log
  Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
  Btrfs: don't wait as long for more batches during SSD log commit
  ...
2011-11-06 20:03:41 -08:00
David Sterba 6c41761fc6 btrfs: separate superblock items out of fs_info
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-11-06 03:04:01 -05:00
Chris Mason e688b7252f Btrfs: fix extent pinning bugs in the tree log
The tree log had two important bugs that could cause corruptions after a
crash.  Sometimes we were allowing tree log blocks to be reused after
the tree log was committed but before the transaction commit was done.

This allowed a future metadata write to overwrite the tree log data.  It
is fixed by adding a new variant of freeing reserved extents that always
pins them.  Credit goes to Stefan Behrens and Arne Jansen for many many
hours spent tracking this bug down.

During tree log replay, we do a pass through the tree log and pin all
the extents we find.  This makes sure the replay code won't go in and
use any of those blocks for new allocations during replay.  The problem
is the free space cache isn't honoring these pinned extents.  So the
allocator can end up handing them out, leading to all kinds of problems
during replay.

The fix here is to force any free space cache to load while we pin the
extents, and then to make sure we remove the pinned extents from the
free space rbtree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
2011-11-06 03:03:48 -05:00
Chris Mason cd354ad613 Btrfs: don't wait as long for more batches during SSD log commit
When we're doing log commits, we try to wait for more writers to come in
and make the commit bigger.  This helps improve performance on rotating
disks, but on SSDs it adds latencies.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:47 -05:00
Miklos Szeredi bfe8684869 filesystems: add set_nlink()
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:43 +01:00
liubo 34f3e4f23c Btrfs: fix an oops of log replay
When btrfs recovers from a crash, it may hit the oops below:

------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:4580!
[...]
RIP: 0010:[<ffffffffa03df251>]  [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs]
[...]
Call Trace:
 [<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs]
 [<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs]
 [<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs]
 [<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs]
 [<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs]
 [<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs]
 [<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs]
 [<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs]
[...]

This comes from that while replaying an inode ref item, we forget to
check those old conflicting DIR_ITEM and DIR_INDEX items in fs/file tree,
then we will come to conflict corners which lead to BUG_ON().

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Tested-by: Andy Lutomirski <luto@mit.edu>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-16 21:09:15 -04:00
Chris Mason b43b31bdf2 Merge branch 'alloc_path' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/btrfs-error-handling into for-linus 2011-08-01 14:27:34 -04:00
Chris Mason bd681513fa Btrfs: switch the btrfs tree locks to reader/writer
The btrfs metadata btree is the source of significant
lock contention, especially in the root node.   This
commit changes our locking to use a reader/writer
lock.

The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking.  Atomics
count the number of blocking readers or writers at any
given time.

It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.

In read heavy workloads this is dramatically faster.  In write
heavy workloads we're still faster because of less contention
on the root node lock.

We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:46 -04:00
Mark Fasheh 1e5063d093 btrfs: Don't BUG_ON alloc_path errors in replay_one_buffer()
The two ->process_func call sites in tree-log.c which were ignoring a return
code have also been updated to gracefully exit as well.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 14:14:44 -07:00
David Sterba 3ed4498caf btrfs: fix dereference of ERR_PTR value
smatch reports:

btrfs_recover_log_trees error: 'wc.replay_dest' dereferencing
possible ERR_PTR()

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-17 14:54:17 -04:00
Chris Mason d6c0cb379c Merge branch 'cleanups_and_fixes' into inode_numbers
Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 14:37:47 -04:00
Tsutomu Itoh 37daa4f968 Btrfs: check return value of btrfs_inc_extent_ref()
If return value of btrfs_inc_extent_ref() is not 0, BUG() is called.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:40 -04:00
Tsutomu Itoh c00e9493f1 Btrfs: return error to caller if read_one_inode() fails
When read_one_inode() fails, error code is returned to caller instead
of BUG_ON().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:40 -04:00
Tsutomu Itoh 1cd307990d Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
Currently, btrfs_truncate_item and btrfs_extend_item returns only 0.
So, the check by BUG_ON in the caller is unnecessary.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:39 -04:00
Tsutomu Itoh 65a246c5ff Btrfs: return error code to caller when btrfs_del_item fails
The error code is returned instead of calling BUG_ON when
btrfs_del_item returns the error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:39 -04:00
liubo 8e531cdfeb Btrfs: do not flush csum items of unchanged file data during treelog
The current code relogs the entire inode every time during fsync log,
and it is much better suited to small files rather than large ones.

During my performance test, the fsync performace of large files sucks,
and we can ascribe this to the tremendous amount of csum infos of the
large ones, cause we have to flush all of these csum infos into log trees
even when there are only _one_ change in the whole file data.  Apparently,
to optimize fsync, we need to create a filter to skip the unnecessary csum
ones, that is, the corresponding file data remains unchanged before this fsync.

Here I have some test results to show, I use sysbench to do "random write + fsync".

===
sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags=  [prepare, run]
===

Sysbench args:
  - Number of threads: 1
  - Extra file open flags: 0
  - 2 files, 4Gb each
  - Block size 4Kb
  - Number of random requests for random IO: 10000
  - Read/Write ratio for combined random IO test: 1.50
  - Periodic FSYNC enabled, calling fsync() each 100 requests.
  - Calling fsync() at the end of test, Enabled.
  - Using synchronous I/O mode
  - Doing random write test

Sysbench results:
===
   Operations performed:  0 Read, 10000 Write, 200 Other = 10200 Total
   Read 0b  Written 39.062Mb  Total transferred 39.062Mb
===
a) without patch:  (*SPEED* : 451.01Kb/sec)
   112.75 Requests/sec executed

b) with patch:     (*SPEED* : 4.7533Mb/sec)
   1216.84 Requests/sec executed

PS: I've made a _sub transid_ stuff patch, but it does not perform as effectively as this patch,
and I'm wanderring where the problem is and trying to improve it more.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 10:13:16 -04:00
Chris Mason 712673339a Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/ctree.h
	fs/btrfs/volumes.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 06:30:52 -04:00
Chris Mason 945d8962ce Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers
Conflicts:
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/tree-log.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:33:42 -04:00
Chris Mason dcc6d07322 Merge branch 'delayed_inode' into inode_numbers
Conflicts:
	fs/btrfs/inode.c
	fs/btrfs/ioctl.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 07:07:01 -04:00
Miao Xie 16cdcec736 btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:30:56 -04:00
Chris Mason 0965537308 Merge branch 'ino-alloc' of git://repo.or.cz/linux-btrfs-devel into inode_numbers
Conflicts:
	fs/btrfs/free-space-cache.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:27:38 -04:00
Arne Jansen a2de733c78 btrfs: scrub
This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.

This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:45:20 +02:00
David Sterba b3b4aa74b5 btrfs: drop unused parameter from btrfs_release_path
parameter tree root it's not used since commit
5f39d397df ("Btrfs: Create extent_buffer
interface for large blocksizes")

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba c704005d88 btrfs: unify checking of IS_ERR and null
use IS_ERR_OR_NULL when possible, done by this coccinelle script:

@ match @
identifier id;
@@
(
- BUG_ON(IS_ERR(id) || !id);
+ BUG_ON(IS_ERR_OR_NULL(id));
|
- IS_ERR(id) || !id
+ IS_ERR_OR_NULL(id)
|
- !id || IS_ERR(id)
+ IS_ERR_OR_NULL(id)
)

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
Tsutomu Itoh a62f44a5f4 Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
It is necessary to unlock mutex_lock before it return an error when
btrfs_alloc_path() fails.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:51 -04:00
Li Zefan 33345d0152 Btrfs: Always use 64bit inode number
There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.

So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.

There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.

Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:09 +08:00
liubo c622ae6085 btrfs: make inode ref log recovery faster
When we recover from crash via write-ahead log tree and process
the inode refs, for each btrfs_inode_ref item, we will
1) check if we already have a perfect match in fs/file tree, if
   we have, then we're done.
2) search the corresponding back reference in fs/file tree, and
   check all the names in this back reference to see if they are
   also in the log to avoid conflict corners.
3) recover the logged inode refs to fs/file tree.

In current btrfs, however,
- for 2)'s check, once is enough, since the checked back reference
  will remain unchanged after processing all the inode refs belonged
  to the key.
- it has no need to do another 1) between 2) and 3).

I've made a small test to show how it improves,

$dd if=/dev/zero of=foobar bs=4K count=1
$sync
$make 100 hard links continuously, like ln foobar link_i
$fsync foobar
$echo b > /proc/sysrq-trigger
after reboot
$time mount DEV PATH

without patch:
real    0m0.285s
user    0m0.001s
sys     0m0.009s

with patch:
real    0m0.123s
user    0m0.000s
sys     0m0.010s

Changelog v1->v2:
- fix double free - pointed by David Sterba
Changelog v2->v3:
- adjust free order

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:48 -04:00
Tsutomu Itoh db5b493ac7 Btrfs: cleanup some BUG_ON()
This patch changes some BUG_ON() to the error return.
(but, most callers still use BUG_ON())

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:35 -04:00
Josef Bacik 22a94d44bd Btrfs: add checks to verify dir items are correct
We need to make sure the dir items we get are valid dir items.  So any time we
try and read one check it with verify_dir_item, which will do various sanity
checks to make sure it looks sane.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-17 14:21:41 -04:00
Tsutomu Itoh 98d5dc13e7 btrfs: fix return value check of btrfs_start_transaction()
The error check of btrfs_start_transaction() is added, and the mistake
of the error check on several places is corrected.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-01 07:17:27 -05:00
Tsutomu Itoh 5df6708348 btrfs: checking NULL or not in some functions
Because NULL is returned when the memory allocation fails,
it is checked whether it is NULL.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-01 07:16:37 -05:00
Chris Mason b31eabd86e Btrfs: catch errors from btrfs_sync_log
btrfs_sync_log returns -EAGAIN when we need full transaction commits
instead of small log commits, but sometimes we were dropping the return
value.

In practice, we check for this a few different ways, but this is still a
bug that can leave off full log commits when we really need them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-31 16:48:24 -05:00
liubo 2a29edc6b6 btrfs: fix several uncheck memory allocations
To make btrfs more stable, add several missing necessary memory allocation
checks, and when no memory, return proper errno.

We've checked that some of those -ENOMEM errors will be returned to
userspace, and some will be catched by BUG_ON() in the upper callers,
and none will be ignored silently.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-01-28 16:40:36 -05:00