If we attempt to do a RWF_NOWAIT write against a file range for which we
can only do NOCOW for a part of it, due to the existence of holes or
shared extents for example, we proceed with the write as if it were
possible to NOCOW the whole range.
Example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/sdj/bar
$ chattr +C /mnt/sdj/bar
$ xfs_io -d -c "pwrite -S 0xab -b 256K 0 256K" /mnt/bar
wrote 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0003 sec (694.444 MiB/sec and 2777.7778 ops/sec)
$ xfs_io -c "fpunch 64K 64K" /mnt/bar
$ sync
$ xfs_io -d -c "pwrite -N -V 1 -b 128K -S 0xfe 0 128K" /mnt/bar
wrote 131072/131072 bytes at offset 0
128 KiB, 1 ops; 0.0007 sec (160.051 MiB/sec and 1280.4097 ops/sec)
This last write should fail with -EAGAIN since the file range from 64K to
128K is a hole. On xfs it fails, as expected, but on ext4 it currently
succeeds because apparently it is expensive to check if there are extents
allocated for the whole range, but I'll check with the ext4 people.
Fix the issue by checking if check_can_nocow() returns a number of
NOCOW'able bytes smaller then the requested number of bytes, and if it
does return -EAGAIN.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we do a successful RWF_NOWAIT write we end up locking the snapshot lock
of the inode, through a call to check_can_nocow(), but we never unlock it.
This means the next attempt to create a snapshot on the subvolume will
hang forever.
Trivial reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ touch /mnt/foobar
$ chattr +C /mnt/foobar
$ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foobar
$ xfs_io -d -c "pwrite -N -V 1 -S 0xfe 0 64K" /mnt/foobar
$ btrfs subvolume snapshot -r /mnt /mnt/snap
--> hangs
Fix this by unlocking the snapshot lock if check_can_nocow() returned
success.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit a43a67a2d7.
This patch reverts the main part of switching direct io implementation
to iomap infrastructure. There's a problem in invalidate page that
couldn't be solved as regression in this development cycle.
The problem occurs when buffered and direct io are mixed, and the ranges
overlap. Although this is not recommended, filesystems implement
measures or fallbacks to make it somehow work. In this case, fallback to
buffered IO would be an option for btrfs (this already happens when
direct io is done on compressed data), but the change would be needed in
the iomap code, bringing new semantics to other filesystems.
Another problem arises when again the buffered and direct ios are mixed,
invalidation fails, then -EIO is set on the mapping and fsync will fail,
though there's no real error.
There have been discussions how to fix that, but revert seems to be the
least intrusive option.
Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit d8f3e73587.
The patch is a cleanup of direct IO port to iomap infrastructure,
which gets reverted.
Signed-off-by: David Sterba <dsterba@suse.com>
The read and write versions don't have anything in common except for the
call to iomap_dio_rw. So split this function, and merge each half into
its only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Switch from __blockdev_direct_IO() to iomap_dio_rw().
Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
as iomap_begin() for iomap direct I/O functions. This function
allocates and locks all the blocks required for the I/O.
btrfs_submit_direct() is used as the submit_io() hook for direct I/O
ops.
Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.
We don't need address_space.direct_IO() anymore so set it to noop.
Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.
BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
exclusive in case of writes. This guards against simultaneous truncates.
Use iomap->iomap_end() to check for failed or incomplete direct I/O:
- for writes, call __endio_write_update_ordered()
- for reads, unlock extents
btrfs_dio_data is now hooked in iomap->private and not
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data to call
__endio_write_update_ordered in case of an error.
This patch removes last use of struct buffer_head from btrfs.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode lookup starting at btrfs_iget takes the full location key,
while only the objectid is used to match the inode, because the lookup
happens inside the given root thus the inode number is unique.
The entire location key is properly set up in btrfs_init_locked_inode.
Simplify the helpers and pass only inode number, renaming it to 'ino'
instead of 'objectid'. This allows to remove temporary variables key,
saving some stack space.
Signed-off-by: David Sterba <dsterba@suse.com>
The main function to lookup a root by its id btrfs_get_fs_root takes the
whole key, while only using the objectid. The value of offset is preset
to (u64)-1 but not actually used until btrfs_find_root that does the
actual search.
Switch btrfs_get_fs_root to use only objectid and remove all local
variables that existed just for the lookup. The actual key for search is
set up in btrfs_get_fs_root, reusing another key variable.
Signed-off-by: David Sterba <dsterba@suse.com>
The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.
In fact, that bit can only be set to those trees:
- Subvolume roots
- Data reloc root
- Reloc roots for above roots
All other trees won't get this bit set. So just by the result, it is
obvious that, roots with this bit set can have tree blocks shared with
other trees. Either shared by snapshots, or by reloc roots (an special
snapshot created by relocation).
This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
make it easier to understand, and update all comment mentioning
"reference counted" to follow the rename.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a revert of commit 0a8068a3dd ("btrfs: make ranged full
fsyncs more efficient"), with updated comment in btrfs_sync_file.
Commit 0a8068a3dd ("btrfs: make ranged full fsyncs more efficient")
made full fsyncs operate on the given range only as it assumed it was safe
when using the NO_HOLES feature, since the hole detection was simplified
some time ago and no longer was a source for races with ordered extent
completion of adjacent file ranges.
However it's still not safe to have a full fsync only operate on the given
range, because extent maps for new extents might not be present in memory
due to inode eviction or extent cloning. Consider the following example:
1) We are currently at transaction N;
2) We write to the file range [0, 1MiB);
3) Writeback finishes for the whole range and ordered extents complete,
while we are still at transaction N;
4) The inode is evicted;
5) We open the file for writing, causing the inode to be loaded to
memory again, which sets the 'full sync' bit on its flags. At this
point the inode's list of modified extent maps is empty (figuring
out which extents were created in the current transaction and were
not yet logged by an fsync is expensive, that's why we set the
'full sync' bit when loading an inode);
6) We write to the file range [512KiB, 768KiB);
7) We do a ranged fsync (such as msync()) for file range [512KiB, 768KiB).
This correctly flushes this range and logs its extent into the log
tree. When the writeback started an extent map for range [512KiB, 768KiB)
was added to the inode's list of modified extents, and when the fsync()
finishes logging it removes that extent map from the list of modified
extent maps. This fsync also clears the 'full sync' bit;
8) We do a regular fsync() (full ranged). This fsync() ends up doing
nothing because the inode's list of modified extents is empty and
no other changes happened since the previous ranged fsync(), so
it just returns success (0) and we end up never logging extents for
the file ranges [0, 512KiB) and [768KiB, 1MiB).
Another scenario where this can happen is if we replace steps 2 to 4 with
cloning from another file into our test file, as that sets the 'full sync'
bit in our inode's flags and does not populate its list of modified extent
maps.
This was causing test case generic/457 to fail sporadically when using the
NO_HOLES feature, as it exercised this later case where the inode has the
'full sync' bit set and has no extent maps in memory to represent the new
extents due to extent cloning.
Fix this by reverting commit 0a8068a3dd ("btrfs: make ranged full fsyncs
more efficient") since there is no easy way to work around it.
Fixes: 0a8068a3dd ("btrfs: make ranged full fsyncs more efficient")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Ordered ops are started twice in sync file, once outside of inode mutex
and once inside, taking the dio semaphore. There was one error path
missing the semaphore unlock.
Fixes: aab15e8ec2 ("Btrfs: fix rare chances for data loss when doing a fast fsync")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ add changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have proper root ref counting everywhere we can kill the
subvol_srcu.
* removal of fs_info::subvol_srcu reduces size of fs_info by 1176 bytes
* the refcount_t used for the references checks for accidental 0->1
in cases where the root lifetime would not be properly protected
* there's a leak detector for roots to catch unfreed roots at umount
time
* SRCU served us well over the years but is was not a proper
synchronization mechanism for some cases
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 0c713cbab6 ("Btrfs: fix race between ranged fsync and writeback
of adjacent ranges") fixed a bug where we could end up with file extent
items in a log tree that represent file ranges that overlap due to a race
between the hole detection of a ranged full fsync and writeback for a
different file range.
The problem was solved by forcing any ranged full fsync to become a
non-ranged full fsync - setting the range start to 0 and the end offset to
LLONG_MAX. This was a simple solution because the code that detected and
marked holes was very complex, it used to be done at copy_items() and
implied several searches on the fs/subvolume tree. The drawback of that
solution was that we started to flush delalloc for the entire file and
wait for all the ordered extents to complete for ranged full fsyncs
(including ordered extents covering ranges completely outside the given
range). Fortunatelly ranged full fsyncs are not the most common case
(hopefully for most workloads).
However a later fix for detecting and marking holes was made by commit
0e56315ca1 ("Btrfs: fix missing hole after hole punching and fsync
when using NO_HOLES") and it simplified a lot the detection of holes,
and now copy_items() no longer does it and we do it in a much more simple
way at btrfs_log_holes().
This makes it now possible to simply make the code that detects holes to
operate only on the initial range and no longer need to operate on the
whole file, while also avoiding the need to flush delalloc for the entire
file and wait for ordered extents that cover ranges that don't overlap the
given range.
Another special care is that we must skip file extent items that fall
entirely outside the fsync range when copying inode items from the
fs/subvolume tree into the log tree - this is to avoid races with ordered
extent completion for extents falling outside the fsync range, which could
cause us to end up with file extent items in the log tree that have
overlapping ranges - for example if the fsync range is [1Mb, 2Mb], when
we copy inode items we could copy an extent item for the range [0, 512K],
then release the search path and before moving to the next leaf, an
ordered extent for a range of [256Kb, 512Kb] completes - this would
cause us to copy the new extent item for range [256Kb, 512Kb] into the
log tree after we have copied one for the range [0, 512Kb] - the extents
overlap, resulting in a corruption.
So this change just does these steps:
1) When the NO_HOLES feature is enabled it leaves the initial range
intact - no longer sets it to [0, LLONG_MAX] when the full sync bit
is set in the inode. If NO_HOLES is not enabled, always set the range
to a full, just like before this change, to avoid missing file extent
items representing holes after replaying the log (for both full and
fast fsyncs);
2) Make the hole detection code to operate only on the fsync range;
3) Make the code that copies items from the fs/subvolume tree to skip
copying file extent items that cover a range completely outside the
range of the fsync.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a fast fsync for a range that starts at an offset greater than
zero, we can end up with a log that when replayed causes the respective
inode miss a file extent item representing a hole if we are not using the
NO_HOLES feature. This is because for fast fsyncs we don't log any extents
that cover a range different from the one requested in the fsync.
Example scenario to trigger it:
$ mkfs.btrfs -O ^no-holes -f /dev/sdd
$ mount /dev/sdd /mnt
# Create a file with a single 256K and fsync it to clear to full sync
# bit in the inode - we want the msync below to trigger a fast fsync.
$ xfs_io -f -c "pwrite -S 0xab 0 256K" -c "fsync" /mnt/foo
# Force a transaction commit and wipe out the log tree.
$ sync
# Dirty 768K of data, increasing the file size to 1Mb, and flush only
# the range from 256K to 512K without updating the log tree
# (sync_file_range() does not trigger fsync, it only starts writeback
# and waits for it to finish).
$ xfs_io -c "pwrite -S 0xcd 256K 768K" /mnt/foo
$ xfs_io -c "sync_range -abw 256K 256K" /mnt/foo
# Now dirty the range from 768K to 1M again and sync that range.
$ xfs_io -c "mmap -w 768K 256K" \
-c "mwrite -S 0xef 768K 256K" \
-c "msync -s 768K 256K" \
-c "munmap" \
/mnt/foo
<power fail>
# Mount to replay the log.
$ mount /dev/sdd /mnt
$ umount /mnt
$ btrfs check /dev/sdd
Opening filesystem to check...
Checking filesystem on /dev/sdd
UUID: 482fb574-b288-478e-a190-a9c44a78fca6
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 5 inode 257 errors 100, file extent discount
Found file extent holes:
start: 262144, len: 524288
ERROR: errors found in fs roots
found 720896 bytes used, error(s) found
total csum bytes: 512
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123514
file data blocks allocated: 589824
referenced 589824
Fix this issue by setting the range to full (0 to LLONG_MAX) when the
NO_HOLES feature is not enabled. This results in extra work being done
but it gives the guarantee we don't end up with missing holes after
replaying the log.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reflink code is quite large and has been living in ioctl.c since ever.
It has grown over the years after many bug fixes and improvements, and
since I'm planning on making some further improvements on it, it's time
to get it better organized by moving into its own file, reflink.c
(similar to what xfs does for example).
This change only moves the code out of ioctl.c into the new file, it
doesn't do any other change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch removes all haphazard code implementing nocow writers
exclusion from pending snapshot creation and switches to using the drew
lock to ensure this invariant still holds.
'Readers' are snapshot creators from create_snapshot and 'writers' are
nocow writers from buffered write path or btrfs_setsize. This locking
scheme allows for multiple snapshots to happen while any nocow writers
are blocked, since writes to page cache in the nocow path will make
snapshots inconsistent.
So for performance reasons we'd like to have the ability to run multiple
concurrent snapshots and also favors readers in this case. And in case
there aren't pending snapshots (which will be the majority of the cases)
we rely on the percpu's writers counter to avoid cacheline contention.
The main gain from using the drew lock is it's now a lot easier to
reason about the guarantees of the locking scheme and whether there is
some silent breakage lurking.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree pointer can be safely read from the inode so we can drop the
redundant argument from btrfs_lock_and_flush_ordered_range.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are looking up an arbitrary inode, we need to hold a ref on the root
while we're doing this.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All this does is call btrfs_get_fs_root() with check_ref == true. Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we have a safe way to update the i_size, replace all uses of
btrfs_ordered_update_i_size with btrfs_inode_safe_disk_i_size_write.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to use this everywhere we modify the file extent items
permanently. These include:
1) Inserting new file extents for writes and prealloc extents.
2) Truncating inode items.
3) btrfs_cont_expand().
4) Insert inline extents.
5) Insert new extents from log replay.
6) Insert a new extent for clone, as it could be past i_size.
7) Hole punching
For hole punching in particular it might seem it's not necessary because
anybody extending would use btrfs_cont_expand, however there is a corner
that still can give us trouble. Start with an empty file and
fallocate KEEP_SIZE 1M-2M
We now have a 0 length file, and a hole file extent from 0-1M, and a
prealloc extent from 1M-2M. Now
punch 1M-1.5M
Because this is past i_size we have
[HOLE EXTENT][ NOTHING ][PREALLOC]
[0 1M][1M 1.5M][1.5M 2M]
with an i_size of 0. Now if we pwrite 0-1.5M we'll increas our i_size
to 1.5M, but our disk_i_size is still 0 until the ordered extent
completes.
However if we now immediately truncate 2M on the file we'll just call
btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M. If we
commit the transaction here and crash we'll expose the gap.
To fix this we need to clear the file extent mapping for the range that
we punched but didn't insert a corresponding file extent for. This will
mean the truncate will only get an disk_i_size set to 1M if we crash
before the finish ordered io happens.
I've written an xfstest to reproduce the problem and validate this fix.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only pass this as 1 from __extent_writepage_io(). The parameter
basically means "pretend I didn't pass in a page". This is silly since
we can simply not pass in the page. Get rid of the parameter from
btrfs_get_extent(), and since it's used as a get_extent_t callback,
remove it from get_extent_t and btree_get_extent(), neither of which
need it.
While we're here, let's document btrfs_get_extent().
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
ordered->start, ordered->len, and ordered->disk_len correspond to
fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively.
It's confusing to translate between the two naming schemes. Since a
btrfs_ordered_extent is basically a pending btrfs_file_extent_item,
let's make the former use the naming from the latter.
Note that I didn't touch the names in tracepoints just in case there are
scripts depending on the current naming.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When using the NO_HOLES feature if we clone a range that contains a hole
and a temporary ENOSPC happens while dropping extents from the target
inode's range, we can end up failing and aborting the transaction with
-EEXIST or with a corrupt file extent item, that has a length greater
than it should and overlaps with other extents. For example when cloning
the following range from inode A to inode B:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 4MB length ] [ ------------- ]
0 1MB 5MB 6MB
Range to clone: [1MB, 6MB)
Inode B:
extent B1 extent B2 extent B3 extent B4
[ ---------- ] [ --------- ] [ ---------- ] [ ---------- ]
0 1MB 1MB 2MB 2MB 5MB 5MB 6MB
Target range: [1MB, 6MB) (same as source, to make it easier to explain)
The following can happen:
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 2MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 2MB - 1MB =
1MB;
4) We then attempt to insert a file extent item at inode B with a file
offset of 5MB, which is the value of clone_info->file_offset. This
fails with error -EEXIST because there's already an extent at that
offset (extent B4);
5) We abort the current transaction with -EEXIST and return that error
to user space as well.
Another example, for extent corruption:
Inode A:
extent A1 extent A2
[ ----------- ] [ hole, implicit, 10MB length ] [ ------------- ]
0 1MB 11MB 12MB
Inode B:
extent B1 extent B2
[ ----------- ] [ --------- ] [ ----------------------------- ]
0 1MB 1MB 5MB 5MB 12MB
Target range: [1MB, 12MB) (same as source, to make it easier to explain)
1) btrfs_punch_hole_range() gets -ENOSPC from __btrfs_drop_extents();
2) At that point, 'cur_offset' is set to 1MB and __btrfs_drop_extents()
set 'drop_end' to 5MB, meaning it was able to drop only extent B2;
3) We then compute 'clone_len' as 'drop_end' - 'cur_offset' = 5MB - 1MB =
4MB;
4) We then insert a file extent item at inode B with a file offset of 11MB
which is the value of clone_info->file_offset, and a length of 4MB (the
value of 'clone_len'). So we get 2 extents items with ranges that
overlap and an extent length of 4MB, larger then the extent A2 from
inode A (1MB length);
5) After that we end the transaction, balance the btree dirty pages and
then start another or join the previous transaction. It might happen
that the transaction which inserted the incorrect extent was committed
by another task so we end up with extent corruption if a power failure
happens.
So fix this by making sure we attempt to insert the extent to clone at
the destination inode only if we are past dropping the sub-range that
corresponds to a hole.
Fixes: 690a5dbfc5 ("Btrfs: fix ENOSPC errors, leading to transaction aborts, when cloning extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev or pass it around inside the extent map.
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using an input pointer parameter as the return value and have
an int as the return type of find_desired_extent, rework the function to
directly return the found offset. Doing that the 'ret' variable in
btrfs_llseek_file can be removed. Additional (subjective) benefit is
that btrfs' llseek function now resemebles those of the other major
filesystems.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Handle SEEK_END/SEEK_CUR in a single 'default' case by directly
returning from generic_file_llseek. This makes the 'out' label
redundant. Finally return directly the vale from vfs_setpos. No
semantic changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Modifying the file position is done on a per-file basis. This renders
holding the inode lock for writing useless and makes the performance of
concurrent llseek's abysmal.
Fix this by holding the inode for read. This provides protection against
concurrent truncates and find_desired_extent already includes proper
extent locking for the range which ensures proper locking against
concurrent writes. SEEK_CUR and SEEK_END can be done lockessly.
The former is synchronized by file::f_lock spinlock. SEEK_END is not
synchronized but atomic, but that's OK since there is not guarantee that
SEEK_END will always be at the end of the file in the face of tail
modifications.
This change brings ~82% performance improvement when doing a lot of
parallel fseeks. The workload essentially does:
for (d=0; d<num_seek_read; d++)
{
/* offset %= 16777216; */
fseek (f, 256 * d % 16777216, SEEK_SET);
fread (buffer, 64, 1, f);
}
Without patch:
num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
real 0m41.412s
user 0m28.777s
sys 2m16.510s
With patch:
num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
real 0m11.479s
user 0m27.629s
sys 0m21.040s
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a buffered write it's possible to leave the subv_writers
counter of the root, used for synchronization between buffered nocow
writers and snapshotting. This happens in an exceptional case like the
following:
1) We fail to allocate data space for the write, since there's not
enough available data space nor enough unallocated space for allocating
a new data block group;
2) Because of that failure, we try to go to NOCOW mode, which succeeds
and therefore we set the local variable 'only_release_metadata' to true
and set the root's sub_writers counter to 1 through the call to
btrfs_start_write_no_snapshotting() made by check_can_nocow();
3) The call to btrfs_copy_from_user() returns zero, which is very unlikely
to happen but not impossible;
4) No pages are copied because btrfs_copy_from_user() returned zero;
5) We call btrfs_end_write_no_snapshotting() which decrements the root's
subv_writers counter to 0;
6) We don't set 'only_release_metadata' back to 'false' because we do
it only if 'copied', the value returned by btrfs_copy_from_user(), is
greater than zero;
7) On the next iteration of the while loop, which processes the same
page range, we are now able to allocate data space for the write (we
got enough data space released in the meanwhile);
8) After this if we fail at btrfs_delalloc_reserve_metadata(), because
now there isn't enough free metadata space, or in some other place
further below (prepare_pages(), lock_and_cleanup_extent_if_need(),
btrfs_dirty_pages()), we break out of the while loop with
'only_release_metadata' having a value of 'true';
9) Because 'only_release_metadata' is 'true' we end up decrementing the
root's subv_writers counter to -1 (through a call to
btrfs_end_write_no_snapshotting()), and we also end up not releasing the
data space previously reserved through btrfs_check_data_free_space().
As a consequence the mechanism for synchronizing NOCOW buffered writes
with snapshotting gets broken.
Fix this by always setting 'only_release_metadata' to false at the start
of each iteration.
Fixes: 8257b2dc3c ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume")
Fixes: 7ee9e4405f ("Btrfs: check if we can nocow if we don't have data space")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is similar to 942491c9e6 ("xfs: fix AIM7 regression"). Apparently
our current rwsem code doesn't like doing the trylock, then lock for
real scheme. This causes extra contention on the lock and can be
measured eg. by AIM7 benchmark. So change our read/write methods to
just do the trylock for the RWF_NOWAIT case.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
We were checking for the full fsync flag in the inode before locking the
inode, which is racy, since at that that time it might not be set but
after we acquire the inode lock some other task set it. One case where
this can happen is on a system low on memory and some concurrent task
failed to allocate an extent map and therefore set the full sync flag on
the inode, to force the next fsync to work in full mode.
A consequence of missing the full fsync flag set is hitting the problems
fixed by commit 0c713cbab6 ("Btrfs: fix race between ranged fsync and
writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
of weird inconsistencies after replaying a log due to file extents items
representing ranges that overlap.
So just move the check such that it's done after locking the inode and
before starting writeback again.
Fixes: 0c713cbab6 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.
[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */
/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */
/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)
This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().
The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e3 ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug
Since commit f5fef45936 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.
[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have a buffered write that starts at an offset greater than or
equals to the file's size happening concurrently with a full ranged
fiemap, we can end up leaking an extent state structure.
Suppose we have a file with a size of 1Mb, and before the buffered write
and fiemap are performed, it has a single extent state in its io tree
representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
The following sequence diagram shows how the memory leak happens if a
fiemap a buffered write, starting at offset 1Mb and with a length of
4Kb, are performed concurrently.
CPU 1 CPU 2
extent_fiemap()
--> it's a full ranged fiemap
range from 0 to LLONG_MAX - 1
(9223372036854775807)
--> locks range in the inode's
io tree
--> after this we have 2 extent
states in the io tree:
--> 1 for range [0, 1Mb[ with
the bits EXTENT_LOCKED and
EXTENT_DELALLOC_BITS set
--> 1 for the range
[1Mb, LLONG_MAX[ with
the EXTENT_LOCKED bit set
--> start buffered write at offset
1Mb with a length of 4Kb
btrfs_file_write_iter()
btrfs_buffered_write()
--> cached_state is NULL
lock_and_cleanup_extent_if_need()
--> returns 0 and does not lock
range because it starts
at current i_size / eof
--> cached_state remains NULL
btrfs_dirty_pages()
btrfs_set_extent_delalloc()
(...)
__set_extent_bit()
--> splits extent state for range
[1Mb, LLONG_MAX[ and now we
have 2 extent states:
--> one for the range
[1Mb, 1Mb + 4Kb[ with
EXTENT_LOCKED set
--> another one for the range
[1Mb + 4Kb, LLONG_MAX[ with
EXTENT_LOCKED set as well
--> sets EXTENT_DELALLOC on the
extent state for the range
[1Mb, 1Mb + 4Kb[
--> caches extent state
[1Mb, 1Mb + 4Kb[ into
@cached_state because it has
the bit EXTENT_LOCKED set
--> btrfs_buffered_write() ends up
with a non-NULL cached_state and
never calls anything to release its
reference on it, resulting in a
memory leak
Fix this by calling free_extent_state() on cached_state if the range was
not locked by lock_and_cleanup_extent_if_need().
The same issue can happen if anything else other than fiemap locks a range
that covers eof and beyond.
This could be triggered, sporadically, by test case generic/561 from the
fstests suite, which makes duperemove run concurrently with fsstress, and
duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
the leak is reported in dmesg/syslog when removing the btrfs module with
a message like the following:
[77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
CC: stable@vger.kernel.org # 4.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit fee187d9d9 ("Btrfs: do not set EXTENT_DIRTY along with
EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
can simplify and stop trying to clear it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The VFS indicates a synchronous write to ->write_iter() via
iocb->ki_flags. The IOCB_{,D}SYNC flags may be set based on the file
(see iocb_flags()) or the RWF_* flags passed to a syscall like
pwritev2() (see kiocb_set_rw_flags()).
However, in btrfs_file_write_iter(), we're checking if a write is
synchronous based only on the file; we use this to decide when to bump
the sync_writers counter and thus do CRCs synchronously. Make sure we do
this for all synchronous writes as determined by the VFS.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add const ]
Signed-off-by: David Sterba <dsterba@suse.com>
generic_write_checks() may modify iov_iter_count(), so we must get the
count after the call, not before. Using the wrong one has a couple of
consequences:
1. We check a longer range in check_can_nocow() for nowait than we're
actually writing.
2. We create extra hole extent maps in btrfs_cont_expand(). As far as I
can tell, this is harmless, but I might be missing something.
These issues are pretty minor, but let's fix it before something more
important trips on it.
Fixes: edf064e7c6 ("btrfs: nowait aio support")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that
it doesn't take into account any splitting at the levels, because
truncate will never split nodes. However truncate _and_ changing will
never split nodes, so rename btrfs_calc_trunc_metadata_size to
btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely
for inserting items, so rename this to btrfs_calc_insert_metadata_size.
Making these clearer will help when I start using them differently in
upcoming patches.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's unlikely in-band dedupe is going to land so just remove any
leftovers - dedupe.h header as well as the 'dedupe' parameter to
btrfs_set_extent_delalloc.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When cloning extents (or deduplicating) we create a transaction with a
space reservation that considers we will drop or update a single file
extent item of the destination inode (that we modify a single leaf). That
is fine for the vast majority of scenarios, however it might happen that
we need to drop many file extent items, and adjust at most two file extent
items, in the destination root, which can span multiple leafs. This will
lead to either the call to btrfs_drop_extents() to fail with ENOSPC or
the subsequent calls to btrfs_insert_empty_item() or btrfs_update_inode()
(called through clone_finish_inode_update()) to fail with ENOSPC. Such
failure results in a transaction abort, leaving the filesystem in a
read-only mode.
In order to fix this we need to follow the same approach as the hole
punching code, where we create a local reservation with 1 unit and keep
ending and starting transactions, after balancing the btree inode,
when __btrfs_drop_extents() returns ENOSPC. So fix this by making the
extent cloning call calls the recently added btrfs_punch_hole_range()
helper, which is what does the mentioned work for hole punching, and
make sure whenever we drop extent items in a transaction, we also add a
replacing file extent item, to avoid corruption (a hole) if after ending
a transaction and before starting a new one, the old transaction gets
committed and a power failure happens before we finish cloning.
A test case for fstests follows soon.
Reported-by: David Goodwin <david@codepoets.co.uk>
Link: https://lore.kernel.org/linux-btrfs/a4a4cf31-9cf4-e52c-1f86-c62d336c9cd1@codepoets.co.uk/
Reported-by: Sam Tygier <sam@tygier.co.uk>
Link: https://lore.kernel.org/linux-btrfs/82aace9f-a1e3-1f0b-055f-3ea75f7a41a0@tygier.co.uk/
Fixes: b6f3409b21 ("Btrfs: reserve sufficient space for ioctl clone")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the code that is responsible for dropping extents in a range out of
btrfs_punch_hole() into a new helper function, btrfs_punch_hole_range(),
so that later it can be used by the reflinking (extent cloning and dedup)
code to fix a ENOSPC bug.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have code for data and metadata reservations for delalloc. There's
quite a bit of code here, and it's used in a lot of places so I've
separated it out to it's own file. inode.c and file.c are already
pretty large, and this code is complicated enough to live in its own
space.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A few more instances whre we don't need to specify the values as long as
they are the same that enum assigns automatically. All of the enums are
in-memory only and nothing relies on the exact values.
Signed-off-by: David Sterba <dsterba@suse.com>
If the range for which we are punching a hole covers only part of a page,
we end up updating the inode item but we skip the update of the inode's
iversion, mtime and ctime. Fix that by ensuring we update those properties
of the inode.
A patch for fstests test case generic/059 that tests this as been sent
along with this fix.
Fixes: 2aaa665581 ("Btrfs: add hole punching")
Fixes: e8c1c76e80 ("Btrfs: add missing inode update when punching hole")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The first thing code does in check_can_nocow is trying to block
concurrent snapshots. If this fails (due to snpashot already being in
progress) the function returns ENOSPC which makes no sense. Instead
return EAGAIN. Despite this return value not being propagated to callers
it's good practice to return the closest in terms of semantics error
code. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There several functions which open code
btrfs_lock_and_flush_ordered_range, just replace them with a call to the
function. No functional changes.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
inode) that happens to be ranged, which happens during a msync() or writes
for files opened with O_SYNC for example, we can end up with a corrupt log,
due to different file extent items representing ranges that overlap with
each other, or hit some assertion failures.
When doing a ranged fsync we only flush delalloc and wait for ordered
exents within that range. If while we are logging items from our inode
ordered extents for adjacent ranges complete, we end up in a race that can
make us insert the file extent items that overlap with others we logged
previously and the assertion failures.
For example, if tree-log.c:copy_items() receives a leaf that has the
following file extents items, all with a length of 4K and therefore there
is an implicit hole in the range 68K to 72K - 1:
(257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
It copies them to the log tree. However due to the need to detect implicit
holes, it may release the path, in order to look at the previous leaf to
detect an implicit hole, and then later it will search again in the tree
for the first file extent item key, with the goal of locking again the
leaf (which might have changed due to concurrent changes to other inodes).
However when it locks again the leaf containing the first key, the key
corresponding to the extent at offset 72K may not be there anymore since
there is an ordered extent for that range that is finishing (that is,
somewhere in the middle of btrfs_finish_ordered_io()), and it just
removed the file extent item but has not yet replaced it with a new file
extent item, so the part of copy_items() that does hole detection will
decide that there is a hole in the range starting from 68K to 76K - 1,
and therefore insert a file extent item to represent that hole, having
a key offset of 68K. After that we now have a log tree with 2 different
extent items that have overlapping ranges:
1) The file extent item copied before copy_items() released the path,
which has a key offset of 72K and a length of 4K, representing the
file range 72K to 76K - 1.
2) And a file extent item representing a hole that has a key offset of
68K and a length of 8K, representing the range 68K to 76K - 1. This
item was inserted after releasing the path, and overlaps with the
extent item inserted before.
The overlapping extent items can cause all sorts of unpredictable and
incorrect behaviour, either when replayed or if a fast (non full) fsync
happens later, which can trigger a BUG_ON() when calling
btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
trace like the following:
[61666.783269] ------------[ cut here ]------------
[61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
[61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
(...)
[61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
[61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
[61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
[61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
[61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
[61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
[61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
[61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
[61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
[61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
[61666.786253] Call Trace:
[61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs]
[61666.786253] ? time_hardirqs_on+0x9/0x14
[61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
[61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs]
[61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs]
[61666.786253] ? lock_acquire+0x131/0x1c5
[61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
[61666.786253] ? arch_local_irq_save+0x9/0xc
[61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
[61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs]
[61666.786253] ? arch_local_irq_save+0x9/0xc
[61666.786253] ? lockref_get_not_zero+0x2c/0x34
[61666.786253] ? rcu_read_unlock+0x3e/0x5d
[61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs]
[61666.786253] btrfs_sync_file+0x317/0x42c [btrfs]
[61666.786253] vfs_fsync_range+0x8c/0x9e
[61666.786253] SyS_msync+0x13c/0x1c9
[61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad
A sample of a corrupt log tree leaf with overlapping extents I got from
running btrfs/072:
item 14 key (295 108 200704) itemoff 2599 itemsize 53
extent data disk bytenr 0 nr 0
extent data offset 0 nr 458752 ram 458752
item 15 key (295 108 659456) itemoff 2546 itemsize 53
extent data disk bytenr 4343541760 nr 770048
extent data offset 606208 nr 163840 ram 770048
item 16 key (295 108 663552) itemoff 2493 itemsize 53
extent data disk bytenr 4343541760 nr 770048
extent data offset 610304 nr 155648 ram 770048
item 17 key (295 108 819200) itemoff 2440 itemsize 53
extent data disk bytenr 4334788608 nr 4096
extent data offset 0 nr 4096 ram 4096
The file extent item at offset 659456 (item 15) ends at offset 823296
(659456 + 163840) while the next file extent item (item 16) starts at
offset 663552.
Another different problem that the race can trigger is a failure in the
assertions at tree-log.c:copy_items(), which expect that the first file
extent item key we found before releasing the path exists after we have
released path and that the last key we found before releasing the path
also exists after releasing the path:
$ cat -n fs/btrfs/tree-log.c
4080 if (need_find_last_extent) {
4081 /* btrfs_prev_leaf could return 1 without releasing the path */
4082 btrfs_release_path(src_path);
4083 ret = btrfs_search_slot(NULL, inode->root, &first_key,
4084 src_path, 0, 0);
4085 if (ret < 0)
4086 return ret;
4087 ASSERT(ret == 0);
(...)
4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) {
4104 ret = btrfs_next_leaf(inode->root, src_path);
4105 if (ret < 0)
4106 return ret;
4107 ASSERT(ret == 0);
4108 src = src_path->nodes[0];
4109 i = 0;
4110 need_find_last_extent = true;
4111 }
(...)
The second assertion implicitly expects that the last key before the path
release still exists, because the surrounding while loop only stops after
we have found that key. When this assertion fails it produces a stack like
this:
[139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
[139590.037406] ------------[ cut here ]------------
[139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
[139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1
(...)
[139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
(...)
[139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
[139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
[139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
[139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
[139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
[139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
[139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
[139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
[139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[139590.044250] Call Trace:
[139590.044631] copy_items+0xa3f/0x1000 [btrfs]
[139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
[139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs]
[139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
[139590.046143] ? do_raw_spin_unlock+0x49/0xc0
[139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs]
[139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
[139590.047592] __vfs_write+0x129/0x1c0
[139590.047932] vfs_write+0xc2/0x1b0
[139590.048270] ksys_write+0x55/0xc0
[139590.048608] do_syscall_64+0x60/0x1b0
[139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[139590.049287] RIP: 0033:0x7f2efc4be190
(...)
[139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
[139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
[139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
[139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
[139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
(...)
[139590.055128] ---[ end trace 193f35d0215cdeeb ]---
So fix this race between a full ranged fsync and writeback of adjacent
ranges by flushing all delalloc and waiting for all ordered extents to
complete before logging the inode. This is the simplest way to solve the
problem because currently the full fsync path does not deal with ranges
at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
to look at adjacent ranges for hole detection. For use cases of ranged
fsyncs this can make a few fsyncs slower but on the other hand it can
make some following fsyncs to other ranges do less work or no need to do
anything at all. A full fsync is rare anyway and happens only once after
loading/creating an inode and once after less common operations such as a
shrinking truncate.
This is an issue that exists for a long time, and was often triggered by
generic/127, because it does mmap'ed writes and msync (which triggers a
ranged fsync). Adding support for the tree checker to detect overlapping
extents (next patch in the series) and trigger a WARN() when such cases
are found, and then calling btrfs_check_leaf_full() at the end of
btrfs_insert_file_extent() made the issue much easier to detect. Running
btrfs/072 with that change to the tree checker and making fsstress open
files always with O_SYNC made it much easier to trigger the issue (as
triggering it with generic/127 is very rare).
CC: stable@vger.kernel.org # 3.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have an error writing out a delalloc range in
btrfs_punch_hole_lock_range we'll unlock the inode and then goto
out_only_mutex, where we will again unlock the inode. This is bad,
don't do this.
Fixes: f27451f229 ("Btrfs: add support for fallocate's zero range operation")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long
parameter list and the confusing @owner parameter.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the new btrfs_ref structure and replace parameter list to clean up
the usage of owner and level to distinguish the extent types.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing fallocate, we first add the range to the reserve_list and
then reserve the quota. If quota reservation fails, we'll release all
reserved parts of reserve_list.
However, cur_offset is not updated to indicate that this range is
already been inserted into the list. Therefore, the same range is freed
twice. Once at list_for_each_entry loop, and once at the end of the
function. This will result in WARN_ON on bytes_may_use when we free the
remaining space.
At the end, under the 'out' label we have a call to:
btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset);
The start offset, third argument, should be cur_offset.
Everything from alloc_start to cur_offset was freed by the
list_for_each_entry_safe_loop.
Fixes: 18513091af ("btrfs: update btrfs_space_info's bytes_may_use timely")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BUG_ON(1) leads to bogus warnings from clang when
CONFIG_PROFILE_ANNOTATED_BRANCHES is set:
fs/btrfs/volumes.c:5041:3: error: variable 'max_chunk_size' is used uninitialized whenever 'if' condition is false
[-Werror,-Wsometimes-uninitialized]
BUG_ON(1);
^~~~~~~~~
include/asm-generic/bug.h:61:36: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^~~~~~~~~~~~~~~~~~~
include/linux/compiler.h:48:23: note: expanded from macro 'unlikely'
# define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/volumes.c:5046:9: note: uninitialized use occurs here
max_chunk_size);
^~~~~~~~~~~~~~
include/linux/kernel.h:860:36: note: expanded from macro 'min'
#define min(x, y) __careful_cmp(x, y, <)
^
include/linux/kernel.h:853:17: note: expanded from macro '__careful_cmp'
__cmp_once(x, y, __UNIQUE_ID(__x), __UNIQUE_ID(__y), op))
^
include/linux/kernel.h:847:25: note: expanded from macro '__cmp_once'
typeof(y) unique_y = (y); \
^
fs/btrfs/volumes.c:5041:3: note: remove the 'if' if its condition is always true
BUG_ON(1);
^
include/asm-generic/bug.h:61:32: note: expanded from macro 'BUG_ON'
#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
^
fs/btrfs/volumes.c:4993:20: note: initialize the variable 'max_chunk_size' to silence this warning
u64 max_chunk_size;
^
= 0
Change it to BUG() so clang can see that this code path can never
continue.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit db2462a6ad ("btrfs: don't run delayed refs in the end transaction
logic") removed its last use, so now it does absolutely nothing, therefore
remove it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is a simple wrapper over btrfs_get_extent that returns
either:
a) A real extent in the passed range or
b) Adjusted extent based on whether delalloc bytes are found backing up
a hole.
To support these semantics it doesn't need the page/pg_offset/create
arguments which are passed to btrfs_get_extent in case an extent is to
be created. So simplify the function by removing the unused arguments.
No functional changes.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The typos accumulate over time so once in a while time they get fixed in
a large patch.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.
So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The io_err field of struct btrfs_log_ctx is no longer used after the
recent simplification of the fast fsync path, where we now wait for
ordered extents to complete before logging the inode. We did this in
commit b5e6c3e170 ("btrfs: always wait on ordered extents at fsync
time") and commit a2120a473a ("btrfs: clean up the left over
logged_list usage") removed its last use.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlv9qYgACgkQxWXV+ddt
WDupPQ/8DdeLZYQG1tlx2Q+X4/tqPVyAzUjguYzbIY7wvSs1zbEEedENsD8E97yC
So8ooGnP5B6/dqVidLFQBPwTXN59GybYbrDci8qh0DOJTl3+1r8byD9JC+iofrOF
tltJkZ+eCOQyyqHHzlzw15uNOg48Qzj1oXvTAcE0P6iN5UcvcfwRW/S39pjsn63C
63zc09XJ1hmJMJTWZo5h3GoD2UvzrwGXPKXNdv/NWkw9sqQbWdjvZFdqKbvY1VeM
Oa6FPAPErJqEEEePhpDYbyRcnzjJRMs0deLGpGGChGldQxgMO8ILzBwh/KalfzK7
h7LIuv1EclUqlyv0mXPqg2E/C3n2UMPqQYFsK9Lt+4Y/PkrWA2jx0lSg0fBl3k8c
7PyiTqPNPNF8LU48tPEnOzJuNPkquOycgdyQOUpHnS43OF5OLIb6tVyjK4eJHRWw
xtP65M72qM8T65+gsxYcdm0lvIDLidIwFS+2g4ibKU7EwlYkTC9AHFIAyFKTgxeP
MpkIH90mKhSxOpbq8RICgr2jWcJZYoFQ4soi1oE+bgyjv75PyhJ0eXOprCh/4KZp
nkXlPy2skkO9gGecyvr51x/opDEjEkObyOjQm2LhhWYvgcnHgW8Zp1jhQKxabHvz
iZdVIs/agOerpk1d9ZBHhIXOeS2UcE5klqVRAdf961Wobh+HNis=
=cCvI
-----END PGP SIGNATURE-----
Merge tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Some of these bugs are being hit during testing so we'd like to get
them merged, otherwise there are usual stability fixes for stable
trees"
* tag 'for-4.20-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: relocation: set trans to be NULL after ending transaction
Btrfs: fix race between enabling quotas and subvolume creation
Btrfs: send, fix infinite loop due to directory rename dependencies
Btrfs: ensure path name is null terminated at btrfs_control_ioctl
Btrfs: fix rare chances for data loss when doing a fast fsync
btrfs: Always try all copies when reading extent buffers
After the simplification of the fast fsync patch done recently by commit
b5e6c3e170 ("btrfs: always wait on ordered extents at fsync time") and
commit e7175a6927 ("btrfs: remove the wait ordered logic in the
log_one_extent path"), we got a very short time window where we can get
extents logged without writeback completing first or extents logged
without logging the respective data checksums. Both issues can only happen
when doing a non-full (fast) fsync.
As soon as we enter btrfs_sync_file() we trigger writeback, then lock the
inode and then wait for the writeback to complete before starting to log
the inode. However before we acquire the inode's lock and after we started
writeback, it's possible that more writes happened and dirtied more pages.
If that happened and those pages get writeback triggered while we are
logging the inode (for example, the VM subsystem triggering it due to
memory pressure, or another concurrent fsync), we end up seeing the
respective extent maps in the inode's list of modified extents and will
log matching file extent items without waiting for the respective
ordered extents to complete, meaning that either of the following will
happen:
1) We log an extent after its writeback finishes but before its checksums
are added to the csum tree, leading to -EIO errors when attempting to
read the extent after a log replay.
2) We log an extent before its writeback finishes.
Therefore after the log replay we will have a file extent item pointing
to an unwritten extent (and without the respective data checksums as
well).
This could not happen before the fast fsync patch simplification, because
for any extent we found in the list of modified extents, we would wait for
its respective ordered extent to finish writeback or collect its checksums
for logging if it did not complete yet.
Fix this by triggering writeback again after acquiring the inode's lock
and before waiting for ordered extents to complete.
Fixes: e7175a6927 ("btrfs: remove the wait ordered logic in the log_one_extent path")
Fixes: b5e6c3e170 ("btrfs: always wait on ordered extents at fsync time")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rework the vfs_clone_file_range and vfs_dedupe_file_range infrastructure to use
a common .remap_file_range method and supply generic bounds and sanity checking
functions that are shared with the data write path. The current VFS
infrastructure has problems with rlimit, LFS file sizes, file time stamps,
maximum filesystem file sizes, stripping setuid bits, etc and so they are
addressed in these commits.
We also introduce the ability for the ->remap_file_range methods to return short
clones so that clones for vfs_copy_file_range() don't get rejected if the entire
range can't be cloned. It also allows filesystems to sliently skip deduplication
of partial EOF blocks if they are not capable of doing so without requiring
errors to be thrown to userspace.
All existing filesystems are converted to user the new .remap_file_range method,
and both XFS and ocfs2 are modified to make use of the new generic checking
infrastructure.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJb29gEAAoJEK3oKUf0dfodpOAQAL2VbHjvKXEwNMDTKscSRMmZ
Z0xXo3gamFKQ+VGOqy2g2lmAYQs9SAnTuCGTJ7zIAp7u+q8gzUy5FzKAwLS4Id6L
8siaY6nzlicfO04d0MdXnWz0f3xykChgzfdQfVUlUi7WrDioBUECLPmx4a+USsp1
DQGjLOZfoOAmn2rijdnH9RTEaHqg+8mcTaLN9TRav4gGqrWxldFKXw2y6ouFC7uo
/hxTRNXR9VI+EdbDelwBNXl9nU9gQA0WLOvRKwgUrtv6bSJohTPsmXt7EbBtNcVR
cl3zDNc1sLD1bLaRLEUAszI/33wXaaQgom1iB51obIcHHef+JxRNG/j6rUMfzxZI
VaauGv5EIvtaKN0LTAqVVLQ8t2MQFYfOr8TykmO+1UFog204aKRANdVMHDSjxD/0
dTGKJGcq+HnKQ+JHDbTdvuXEL8sUUl1FiLjOQbZPw63XmuddLKFUA2TOjXn6htbU
1h1MG5d9KjGLpabp2BQheczD08NuSmcrOBNt7IoeI3+nxr3HpMwprfB9TyaERy9X
iEgyVXmjjc9bLLRW7A2wm77aW64NvPs51wKMnvuNgNwnCewrGS6cB8WVj2zbQjH1
h3f3nku44s9ctNPSBzb/sJLnpqmZQ5t0oSmrMSN+5+En6rNTacoJCzxHRJBA7z/h
Z+C6y1GTZw0euY6Zjiwu
=CE/A
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull vfs dedup fixes from Dave Chinner:
"This reworks the vfs data cloning infrastructure.
We discovered many issues with these interfaces late in the 4.19 cycle
- the worst of them (data corruption, setuid stripping) were fixed for
XFS in 4.19-rc8, but a larger rework of the infrastructure fixing all
the problems was needed. That rework is the contents of this pull
request.
Rework the vfs_clone_file_range and vfs_dedupe_file_range
infrastructure to use a common .remap_file_range method and supply
generic bounds and sanity checking functions that are shared with the
data write path. The current VFS infrastructure has problems with
rlimit, LFS file sizes, file time stamps, maximum filesystem file
sizes, stripping setuid bits, etc and so they are addressed in these
commits.
We also introduce the ability for the ->remap_file_range methods to
return short clones so that clones for vfs_copy_file_range() don't get
rejected if the entire range can't be cloned. It also allows
filesystems to sliently skip deduplication of partial EOF blocks if
they are not capable of doing so without requiring errors to be thrown
to userspace.
Existing filesystems are converted to user the new remap_file_range
method, and both XFS and ocfs2 are modified to make use of the new
generic checking infrastructure"
* tag 'xfs-4.20-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
xfs: remove [cm]time update from reflink calls
xfs: remove xfs_reflink_remap_range
xfs: remove redundant remap partial EOF block checks
xfs: support returning partial reflink results
xfs: clean up xfs_reflink_remap_blocks call site
xfs: fix pagecache truncation prior to reflink
ocfs2: remove ocfs2_reflink_remap_range
ocfs2: support partial clone range and dedupe range
ocfs2: fix pagecache truncation prior to reflink
ocfs2: truncate page cache for clone destination file before remapping
vfs: clean up generic_remap_file_range_prep return value
vfs: hide file range comparison function
vfs: enable remap callers that can handle short operations
vfs: plumb remap flags through the vfs dedupe functions
vfs: plumb remap flags through the vfs clone functions
vfs: make remap_file_range functions take and return bytes completed
vfs: remap helper should update destination inode metadata
vfs: pass remap flags to generic_remap_checks
vfs: pass remap flags to generic_remap_file_range_prep
vfs: combine the clone and dedupe into a single remap_file_range
...
Combine the clone_file_range and dedupe_file_range operations into a
single remap_file_range file operation dispatch since they're
fundamentally the same operation. The differences between the two can
be made in the prep functions.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
During buffered writes, we follow this basic series of steps:
again:
lock all the pages
wait for writeback on all the pages
Take the extent range lock
wait for ordered extents on the whole range
clean all the pages
if (copy_from_user_in_atomic() hits a fault) {
drop our locks
goto again;
}
dirty all the pages
release all the locks
The extra waiting, cleaning and locking are there to make sure we don't
modify pages in flight to the drive, after they've been crc'd.
If some of the pages in the range were already dirty when the write
began, and we need to goto again, we create a window where a dirty page
has been cleaned and unlocked. It may be reclaimed before we're able to
lock it again, which means we'll read the old contents off the drive and
lose any modifications that had been pending writeback.
We don't actually need to clean the pages. All of the other locking in
place makes sure we don't start IO on the pages, so we can just leave
them dirty for the duration of the write.
Fixes: 73d59314e6 (the original btrfs merge)
CC: stable@vger.kernel.org # v4.4+
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using true and false here is closer to the expected semantic than using
0 and 1. No functional change.
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Functions that get btrfs inode can simply reach the fs_info by
dereferencing the root and this looks a bit more straightforward
compared to the btrfs_sb(...) indirection.
If the transaction handle is available and not NULL it's used instead.
Signed-off-by: David Sterba <dsterba@suse.com>
struct kiocb carries the ki_pos, so there is no need to pass it as
a separate function parameter.
generic_file_direct_write() increments ki_pos, so we now assign pos
after the function.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
[ rename to btrfs_buffered_write ]
Signed-off-by: David Sterba <dsterba@suse.com>
We used to call btrfs_file_extent_inline_len() to get the uncompressed
data size of an inlined extent.
However this function is hiding evil, for compressed extent, it has no
choice but to directly read out ram_bytes from btrfs_file_extent_item.
While for uncompressed extent, it uses item size to calculate the real
data size, and ignoring ram_bytes completely.
In fact, for corrupted ram_bytes, due to above behavior kernel
btrfs_print_leaf() can't even print correct ram_bytes to expose the bug.
Since we have the tree-checker to verify all EXTENT_DATA, such mismatch
can be detected pretty easily, thus we can trust ram_bytes without the
evil btrfs_file_extent_inline_len().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The logic to check if the inode is already in the log can now be
simplified since we always wait for the ordered extents to complete
before deciding whether the inode needs to be logged. The big comment
about it can go away too.
CC: Filipe Manana <fdmanana@suse.com>
Suggested-by: Filipe Manana <fdmanana@suse.com>
[ code and changelog copied from mail discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
There's a priority inversion that exists currently with btrfs fsync. In
some cases we will collect outstanding ordered extents onto a list and
only wait on them at the very last second. However this "very last
second" falls inside of a transaction handle, so if we are in a lower
priority cgroup we can end up holding the transaction open for longer
than needed, so if a high priority cgroup is also trying to fsync()
it'll see latency.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type
for delalloc") merged into mainline is not the latest version submitted
to mail list in Dec 2017.
It has a fatal wrong @qgroup_free parameter, which results increasing
qgroup metadata pertrans reserved space, and causing a lot of early EDQUOT.
Fix it by applying the correct diff on top of current branch.
Fixes: 43b18595d6 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.
Signed-off-by: David Sterba <dsterba@suse.com>
Before this patch, btrfs qgroup is mixing per-transcation meta rsv with
preallocated meta rsv, making it quite easy to underflow qgroup meta
reservation.
Since we have the new qgroup meta rsv types, apply it to delalloc
reservation.
Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
rsv type.
And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
they will convert corresponding META_PREALLOC reservation to
META_PERTRANS.
This is mainly due to the fact that current qgroup numbers will only be
updated in btrfs_commit_transaction(), that's to say if we don't keep
such placeholder reservation, we can exceed qgroup limitation.
And for callers freeing outstanding extent in error handler, we will
just free META_PREALLOC bytes.
This behavior makes callers of btrfs_qgroup_release_meta() or
btrfs_qgroup_convert_meta() to be aware of which type they are.
So in this patch, btrfs_delalloc_release_metadata() and its callers get
an extra parameter to info qgroup to do correct meta convert/release.
The good news is, even we use the wrong type (convert or free), it won't
cause obvious bug, as prealloc type is always in good shape, and the
type only affects how per-trans meta is increased or not.
So the worst case will be at most metadata limitation can be sometimes
exceeded (no convert at all) or metadata limitation is reached too soon
(no free at all).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 3558d4f88e ("btrfs: Deprecate userspace transaction ioctls")
marked the beginning of the end of userspace transaction. This commit
finishes the job! There are no known users and ceph does not use the
ioctl anymore.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Sage Weil <sage@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that nothing uses the root arg of btrfs_log_dentry_safe it can be
safely removed. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.
Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:
- printf wrappers, error messages
- exit helpers
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpvikQACgkQxWXV+ddt
WDs6qA//ZE7eEH0sKpD4Z+3gUevk/MMXwE9prRijEdjXz/K/UXtvpq0sI7HMQskZ
Ls9Wmzof+3WEQoa08RQZFzwuclW1Udm09SqE2oHP2gXQB5rC0BtWdrlMaKUJy03y
NUwxHetbE6TsFLU5HIVmi05NexNx9SVV6oJTWt00RlXTePw9Aoc88ikoXXUE2vqH
wbH9/ccmM9EkDFxdG+YG5QX054kQV8/5RXdqBJnIiGVRX5ZsAY84AN9x9YoRCVUw
wq9TfPu6XmeA6Uq6knpeLlXDms5w+FE3n5CduROk7Q7YNgpoZekF20c8uK8HzT4T
KF8hc0QpQgRCVBJ8I4MbPSMRIDf3IWfZmWSDEDda/6/ep6Bl99b8PFvdDKDBMUct
8wsgGrwGbHuz2l2QUIXjpBL9Cv9Tbu8vjmg0h2hFrpiH1c8JaXjKtJXAMtigWsZ1
DdX+5Y0zqvV/YLpzKF4aMDWXIteN4qaznvjdmj3B7BxgcnITOV/cmPCyMplNrtUa
Cs2fzGV5tpxhBzxE490v+frMULmLq2W1e6WPfFCqPKBCqulcR75TozDQS9M2/h4k
uAZzVKoguHrUPP1ONVas9aVC05K473nbYPd28eecYwkZ4z32hK4SAdnQY1aPreTe
axoV7p7a+i1bkzT5LK6gIfpddVWth8w45nz4P0lwxp0Z6XlkbJE=
=Irul
-----END PGP SIGNATURE-----
Merge tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features or user visible changes:
- fallocate: implement zero range mode
- avoid losing data raid profile when deleting a device
- tree item checker: more checks for directory items and xattrs
Notable fixes:
- raid56 recovery: don't use cached stripes, that could be
potentially changed and a later RMW or recovery would lead to
corruptions or failures
- let raid56 try harder to rebuild damaged data, reading from all
stripes if necessary
- fix scrub to repair raid56 in a similar way as in the case above
Other:
- cleanups: device freeing, removed some call indirections, redundant
bio_put/_get, unused parameters, refactorings and renames
- RCU list traversal fixups
- simplify mount callchain, remove recursing back when mounting a
subvolume
- plug for fsync, may improve bio merging on multiple devices
- compression heurisic: replace heap sort with radix sort, gains some
performance
- add extent map selftests, buffered write vs dio"
* tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits)
btrfs: drop devid as device_list_add() arg
btrfs: get device pointer from device_list_add()
btrfs: set the total_devices in device_list_add()
btrfs: move pr_info into device_list_add
btrfs: make btrfs_free_stale_devices() to match the path
btrfs: rename btrfs_free_stale_devices() arg to skip_dev
btrfs: make btrfs_free_stale_devices() argument optional
btrfs: make btrfs_free_stale_device() to iterate all stales
btrfs: no need to check for btrfs_fs_devices::seeding
btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
Btrfs: noinline merge_extent_mapping
Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
Btrfs: extent map selftest: dio write vs dio read
Btrfs: extent map selftest: buffered write vs dio read
Btrfs: add extent map selftests
Btrfs: move extent map specific code to extent_map.c
Btrfs: add helper for em merge logic
Btrfs: fix unexpected EEXIST from btrfs_get_extent
Btrfs: fix incorrect block_len in merge_extent_mapping
btrfs: Remove unused readahead spinlock
...
Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.
We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.
Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
For a fallocate's zero range operation that targets a range with an end
that is not aligned to the sector size, we can end up not updating the
inode's i_size. This happens when the last page of the range maps to an
unwritten (prealloc) extent and before that last page we have either a
hole or a written extent. This is because in this scenario we relied
on a call to btrfs_prealloc_file_range() to update the inode's i_size,
however it can only update the i_size to the "down aligned" end of the
range.
Example:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ xfs_io -f -c "pwrite -S 0xff 0 428K" /mnt/foobar
$ xfs_io -c "falloc -k 428K 4K" /mnt/foobar
$ xfs_io -c "fzero 0 430K" /mnt/foobar
$ du --bytes /mnt/foobar
438272 /mnt/foobar
The inode's i_size was left as 428Kb (438272 bytes) when it should have
been updated to 430Kb (440320 bytes).
Fix this by always updating the inode's i_size explicitly after zeroing
the range.
Fixes: ba6d5887946ff86d93dc ("Btrfs: add support for fallocate's zero range operation")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a buffered IO write, we can have an extent state that we got when
we locked the range (if the range starts at an offset lower than eof), so
always pass it to btrfs_dirty_pages() so that setting the delalloc bit
in the range does not need to do a full search in the inode's io tree,
saving time and reducing the amount of time we hold the io tree's lock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This implements support the zero range operation of fallocate. For now
at least it's as simple as possible while reusing most of the existing
fallocate and hole punching infrastructure.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass either GFP_NOFS or GFP_KERNEL now, so we can sink the
parameter to the function, though we lose some of the slightly better
semantics of GFP_KERNEL in some places, it's worth cleaning up the
callchains.
Signed-off-by: David Sterba <dsterba@suse.com>
Setting plug can merge adjacent IOs before dispatching IOs to the disk
driver.
Without plug, it'd not be a problem for single disk usecases, but for
multiple disks using raid profile, a large IO can be split to several
IOs of stripe length, and plug can be helpful to bring them together
for each disk so that we can save several disk access.
Moreover, fsync issues synchronous writes, so plug can really take
effect.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers use GFP_NOFS, we don't have to pass it as an argument. The
built-in tests pass GFP_KERNEL, but they run only at module load time
and NOFS works there as well.
Signed-off-by: David Sterba <dsterba@suse.com>
Adding __init macro gives kernel a hint that this function is only used
during the initialization phase and its memory resources can be freed up
after.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right before we go into this loop locked_end is set to alloc_end - 1 and
is being used in nearby functions, no need to have exceptions. This just
makes the code consistent, no functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fallocating a file in btrfs goes through several stages. The one before
actually inserting the fallocated extents is to create a qgroup
reservation, covering the desired range. To this end there is a loop in
btrfs_fallocate which checks to see if there are holes in the fallocated
range or !PREALLOC extents past EOF and if so create qgroup reservations
for them. Unfortunately, the main condition of the loop is burried right
at the end of its body rather than in the actual while statement which
makes it non-obvious. Fix this by moving the condition in the while
statement where it belongs. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Xfstests btrfs/146 revealed this corruption,
[ 58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read
[ 58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (prev=ffffc9000189be88).
[ 58.153518] ------------[ cut here ]------------
[ 58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0
...
[ 58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
...
[ 58.161956] Call Trace:
[ 58.162264] btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
[ 58.163583] btrfs_log_dentry_safe+0x60/0x80 [btrfs]
[ 58.164003] btrfs_sync_file+0x4c2/0x6f0 [btrfs]
[ 58.164393] vfs_fsync_range+0x5f/0xd0
[ 58.164898] do_fsync+0x5a/0x90
[ 58.165170] SyS_fsync+0x10/0x20
[ 58.165395] entry_SYSCALL_64_fastpath+0x1f/0xbe
...
It turns out that we could record btrfs_log_ctx:io_err in
log_one_extents when IO fails, but make log_one_extents() return '0'
instead of -EIO, so the IO error is not acknowledged by the callers,
i.e. btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list
from list head 'root->log_ctxs'. Since btrfs_log_ctx is allocated
from stack memory, it'd get freed with a object alive on the
list. then a future list_add will throw the above warning.
This returns the correct error in the above case.
Jeff also reported this while testing against his fsync error
patch set[1].
[1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
"btrfs list corruption and soft lockups while testing writeback error handling"
Fixes: 8407f55326 ("Btrfs: fix data corruption after fast fsync and writeback error")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The patch from commit a7e3b975a0 ("Btrfs: fix reported number of inode
blocks") introduced a regression where if we do a buffered write starting
at position equal to or greater than the file's size and then stat(2) the
file before writeback is triggered, the number of used blocks does not
change (unless there's a prealloc/unwritten extent). Example:
$ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
$ du -h foobar
0 foobar
$ sync
$ du -h foobar
64K foobar
The first version of that patch didn't had this regression and the second
version, which was the one committed, was made only to address some
performance regression detected by the intel test robots using fs_mark.
This fixes the regression by setting the new delaloc bit in the range, and
doing it at btrfs_dirty_pages() while setting the regular dealloc bit as
well, so that this way we set both bits at once avoiding navigation of the
inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most
meaninful place, as we should set the new dellaloc bit when if we set the
delalloc bit, which happens only if we copied bytes into the pages at
__btrfs_buffered_write().
This was making some of LTP's du tests fail, which can be quickly run
using a command line like the following:
$ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt
Fixes: a7e3b975a0 ("Btrfs: fix reported number of inode blocks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the definition of the function btrfs_find_new_delalloc_bytes() closer
to the function btrfs_dirty_pages(), because in a future commit it will be
used exclusively by btrfs_dirty_pages(). This just moves the function's
definition, with no functional changes at all.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now we do a lot of weird hoops around outstanding_extents in order
to keep the extent count consistent. This is because we logically
transfer the outstanding_extent count from the initial reservation
through the set_delalloc_bits. This makes it pretty difficult to get a
handle on how and when we need to mess with outstanding_extents.
Fix this by revamping the rules of how we deal with outstanding_extents.
Now instead everybody that is holding on to a delalloc extent is
required to increase the outstanding extents count for itself. This
means we'll have something like this
btrfs_delalloc_reserve_metadata - outstanding_extents = 1
btrfs_set_extent_delalloc - outstanding_extents = 2
btrfs_release_delalloc_extents - outstanding_extents = 1
for an initial file write. Now take the append write where we extend an
existing delalloc range but still under the maximum extent size
btrfs_delalloc_reserve_metadata - outstanding_extents = 2
btrfs_set_extent_delalloc
btrfs_set_bit_hook - outstanding_extents = 3
btrfs_merge_extent_hook - outstanding_extents = 2
btrfs_delalloc_release_extents - outstanding_extnets = 1
In order to make the ordered extent transition we of course must now
make ordered extents carry their own outstanding_extent reservation, so
for cow_file_range we end up with
btrfs_add_ordered_extent - outstanding_extents = 2
clear_extent_bit - outstanding_extents = 1
btrfs_remove_ordered_extent - outstanding_extents = 0
This makes all manipulations of outstanding_extents much more explicit.
Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
combined with btrfs_release_delalloc_extents, even in the error case, as
that is the only function that actually modifies the
outstanding_extents counter.
The drawback to this is now we are much more likely to have transient
cases where outstanding_extents is much larger than it actually should
be. This could happen before as we manipulated the delalloc bits, but
now it happens basically at every write. This may put more pressure on
the ENOSPC flushing code, but I think making this code simpler is worth
the cost. I have another change coming to mitigate this side-effect
somewhat.
I also added trace points for the counter manipulation. These were used
by a bpf script I wrote to help track down leak issues.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Code cleanup for better understanding:
Variable needs_unlock to be called extent_locked to show state as
opposed to action. Changed the type to int, to reduce code in the
critical path.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need the actual root for the ref verifier tool to work, so change
these functions to pass the root around instead. This will be used in
a subsequent patch.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull nowait read support from Al Viro:
"Support IOCB_NOWAIT for buffered reads and block devices"
* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
block_dev: support RFW_NOWAIT on block device nodes
fs: support RWF_NOWAIT for buffered reads
fs: support IOCB_NOWAIT in generic_file_buffered_read
fs: pass iocb to do_generic_file_read
This is based on the old idea and code from Milosz Tanski. With the aio
nowait code it becomes mostly trivial now. Buffered writes continue to
return -EOPNOTSUPP if RWF_NOWAIT is passed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Readdir does dir_emit while under the btree lock. dir_emit can trigger
the page fault which means we can deadlock. Fix this by allocating a
buffer on opening a directory and copying the readdir into this buffer
and doing dir_emit from outside of the tree lock.
Thread A
readdir <holding tree lock>
dir_emit
<page fault>
down_read(mmap_sem)
Thread B
mmap write
down_write(mmap_sem)
page_mkwrite
wait_ordered_extents
Process C
finish_ordered_extent
insert_reserved_file_extent
try to lock leaf <hang>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copy the deadlock scenario to changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs fix from David Sterba:
"This fixes a user-visible bug introduced by the nowait-aio patches
merged in this cycle"
* 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: nowait aio: Correct assignment of pos
Assigning pos for usage early messes up in append mode, where the pos is
re-assigned in generic_write_checks(). Assign pos later to get the
correct position to write from iocb->ki_pos.
Since check_can_nocow also uses the value of pos, we shift
generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT
are present in generic_write_checks(), so checking for IOCB_NOWAIT is
enough.
Also, put locking sequence in the fast path.
This fixes a user visible bug, as reported:
"apparently breaks several shell related features on my system.
In zsh history stopped working, because no new entries are added
anymore.
I fist noticed the issue when I tried to build mplayer. It uses a shell
script to generate a help_mp.h file:
[...]
Here is a simple testcase:
% echo "foo" >> test
% echo "foo" >> test
% cat test
foo
%
"
Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: Jens Axboe <axboe@kernel.dk>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Link: https://lkml.kernel.org/r/20170704042306.GA274@x4
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZXhmCAAoJEAAOaEEZVoIVpRkP/1qlYn3pq6d5Kuz84pejOmlL
5jbkS/cOmeTxeUU4+B1xG8Lx7bAk8PfSXQOADbSJGiZd0ug95tJxplFYIGJzR/tG
aNMHeu/BVKKhUKORGuKR9rJKtwC839L/qao+yPBo5U3mU4L73rFWX8fxFuhSJ8HR
hvkgBu3Hx6GY59CzxJ8iJzj+B+uPSFrNweAk0+0UeWkBgTzEdiGqaXBX4cHIkq/5
hMoCG+xnmwHKbCBsQ5js+YJT+HedZ4lvfjOqGxgElUyjJ7Bkt/IFYOp8TUiu193T
tA4UinDjN8A7FImmIBIftrECmrAC9HIGhGZroYkMKbb8ReDR2ikE5FhKEpuAGU3a
BXBgX2mPQuArvZWM7qeJCkxV9QJ0u/8Ykbyzo30iPrICyrzbEvIubeB/mDA034+Z
Z0/z8C3v7826F3zP/NyaQEojUgRq30McMOIS8GMnx15HJwRsRKlzjfy9Wm4tWhl0
t3nH1jMqAZ7068s6rfh/oCwdgGOwr5o4hW/bnlITzxbjWQUOnZIe7KBxIezZJ2rv
OcIwd5qE8PNtpagGj5oUbnjGOTkERAgsMfvPk5tjUNt28/qUlVs2V0aeo47dlcsh
oYr8WMOIzw98Rl7Bo70mplLrqLD6nGl0LfXOyUlT4STgLWW4ksmLVuJjWIUxcO/0
yKWjj9wfYRQ0vSUqhsI5
=3Z93
-----END PGP SIGNATURE-----
Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull Writeback error handling updates from Jeff Layton:
"This pile represents the bulk of the writeback error handling fixes
that I have for this cycle. Some of the earlier patches in this pile
may look trivial but they are prerequisites for later patches in the
series.
The aim of this set is to improve how we track and report writeback
errors to userland. Most applications that care about data integrity
will periodically call fsync/fdatasync/msync to ensure that their
writes have made it to the backing store.
For a very long time, we have tracked writeback errors using two flags
in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
writeback error occurs (via mapping_set_error) and are cleared as a
side-effect of filemap_check_errors (as you noted yesterday). This
model really sucks for userland.
Only the first task to call fsync (or msync or fdatasync) will see the
error. Any subsequent task calling fsync on a file will get back 0
(unless another writeback error occurs in the interim). If I have
several tasks writing to a file and calling fsync to ensure that their
writes got stored, then I need to have them coordinate with one
another. That's difficult enough, but in a world of containerized
setups that coordination may even not be possible.
But wait...it gets worse!
The calls to filemap_check_errors can be buried pretty far down in the
call stack, and there are internal callers of filemap_write_and_wait
and the like that also end up clearing those errors. Many of those
callers ignore the error return from that function or return it to
userland at nonsensical times (e.g. truncate() or stat()). If I get
back -EIO on a truncate, there is no reason to think that it was
because some previous writeback failed, and a subsequent fsync() will
(incorrectly) return 0.
This pile aims to do three things:
1) ensure that when a writeback error occurs that that error will be
reported to userland on a subsequent fsync/fdatasync/msync call,
regardless of what internal callers are doing
2) report writeback errors on all file descriptions that were open at
the time that the error occurred. This is a user-visible change,
but I think most applications are written to assume this behavior
anyway. Those that aren't are unlikely to be hurt by it.
3) document what filesystems should do when there is a writeback
error. Today, there is very little consistency between them, and a
lot of cargo-cult copying. We need to make it very clear what
filesystems should do in this situation.
To achieve this, the set adds a new data type (errseq_t) and then
builds new writeback error tracking infrastructure around that. Once
all of that is in place, we change the filesystems to use the new
infrastructure for reporting wb errors to userland.
Note that this is just the initial foray into cleaning up this mess.
There is a lot of work remaining here:
1) convert the rest of the filesystems in a similar fashion. Once the
initial set is in, then I think most other fs' will be fairly
simple to convert. Hopefully most of those can in via individual
filesystem trees.
2) convert internal waiters on writeback to use errseq_t for
detecting errors instead of relying on the AS_* flags. I have some
draft patches for this for ext4, but they are not quite ready for
prime time yet.
This was a discussion topic this year at LSF/MM too. If you're
interested in the gory details, LWN has some good articles about this:
https://lwn.net/Articles/718734/https://lwn.net/Articles/724307/"
* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
btrfs: minimal conversion to errseq_t writeback error reporting on fsync
xfs: minimal conversion to errseq_t writeback error reporting
ext4: use errseq_t based error handling for reporting data writeback errors
fs: convert __generic_file_fsync to use errseq_t based reporting
block: convert to errseq_t based writeback error tracking
dax: set errors in mapping when writeback fails
Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
fs: new infrastructure for writeback error handling and reporting
lib: add errseq_t type and infrastructure for handling it
mm: don't TestClearPageError in __filemap_fdatawait_range
mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
jbd2: don't clear and reset errors after waiting on writeback
buffer: set errors in mapping at the time that the error occurs
fs: check for writeback errors after syncing out buffers in generic_file_fsync
buffer: use mapping_set_error instead of setting the flag
mm: fix mapping_set_error call in me_pagecache_dirty
Just check and advance the errseq_t in the file before returning, and
use an errseq_t based check for writeback errors.
Other internal callers of filemap_* functions are left as-is.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Pull btrfs updates from David Sterba:
"The core updates improve error handling (mostly related to bios), with
the usual incremental work on the GFP_NOFS (mis)use removal,
refactoring or cleanups. Except the two top patches, all have been in
for-next for an extensive amount of time.
User visible changes:
- statx support
- quota override tunable
- improved compression thresholds
- obsoleted mount option alloc_start
Core updates:
- bio-related updates:
- faster bio cloning
- no allocation failures
- preallocated flush bios
- more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates
- prep work for btree_inode removal
- dir-item validation
- qgoup fixes and updates
- cleanups:
- removed unused struct members, unused code, refactoring
- argument refactoring (fs_info/root, caller -> callee sink)
- SEARCH_TREE ioctl docs"
* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
btrfs: Remove false alert when fiemap range is smaller than on-disk extent
btrfs: Don't clear SGID when inheriting ACLs
btrfs: fix integer overflow in calc_reclaim_items_nr
btrfs: scrub: fix target device intialization while setting up scrub context
btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
btrfs: qgroup: Return actually freed bytes for qgroup release or free data
btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
btrfs: qgroup: Add quick exit for non-fs extents
Btrfs: rework delayed ref total_bytes_pinned accounting
Btrfs: return old and new total ref mods when adding delayed refs
Btrfs: always account pinned bytes when dropping a tree block ref
Btrfs: update total_bytes_pinned when pinning down extents
Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
btrfs: fix validation of XATTR_ITEM dir items
btrfs: Verify dir_item in iterate_object_props
btrfs: Check name_len before in btrfs_del_root_ref
btrfs: Check name_len before reading btrfs_get_name
...
[BUG]
For the following case, btrfs can underflow qgroup reserved space
at an error path:
(Page size 4K, function name without "btrfs_" prefix)
Task A | Task B
----------------------------------------------------------------------
Buffered_write [0, 2K) |
|- check_data_free_space() |
| |- qgroup_reserve_data() |
| Range aligned to page |
| range [0, 4K) <<< |
| 4K bytes reserved <<< |
|- copy pages to page cache |
| Buffered_write [2K, 4K)
| |- check_data_free_space()
| | |- qgroup_reserved_data()
| | Range alinged to page
| | range [0, 4K)
| | Already reserved by A <<<
| | 0 bytes reserved <<<
| |- delalloc_reserve_metadata()
| | And it *FAILED* (Maybe EQUOTA)
| |- free_reserved_data_space()
|- qgroup_free_data()
Range aligned to page range
[0, 4K)
Freeing 4K
(Special thanks to Chandan for the detailed report and analyse)
[CAUSE]
Above Task B is freeing reserved data range [0, 4K) which is actually
reserved by Task A.
And at writeback time, page dirty by Task A will go through writeback
routine, which will free 4K reserved data space at file extent insert
time, causing the qgroup underflow.
[FIX]
For btrfs_qgroup_free_data(), add @reserved parameter to only free
data ranges reserved by previous btrfs_qgroup_reserve_data().
So in above case, Task B will try to free 0 byte, so no underflow.
Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new parameter, struct extent_changeset for
btrfs_qgroup_reserved_data() and its callers.
Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
which range it reserved in current reserve, so it can free it in error
paths.
The reason we need to export it to callers is, at buffered write error
path, without knowing what exactly which range we reserved in current
allocation, we can free space which is not reserved by us.
This will lead to qgroup reserved space underflow.
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While punching a hole in a range that is not aligned with the sector size
(currently the same as the page size) we can end up leaving an extent map
in memory with a length that is smaller then the sector size or with a
start offset that is not aligned to the sector size. Both cases are not
expected and can lead to problems. This issue is easily detected
after the patch from commit a7e3b975a0 ("Btrfs: fix reported number of
inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
following for example:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
$ xfs_io -c "fpunch 60K 90K" /mnt/foo
$ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
$ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
$ umount /mnt
After the unmount operation we can see several warnings emmitted due to
underflows related to space reservation counters:
[ 2837.443299] ------------[ cut here ]------------
[ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs]
[ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button se
rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_gene
ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1
[ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.462379] Call Trace:
[ 2837.462379] dump_stack+0x68/0x92
[ 2837.462379] __warn+0xc2/0xdd
[ 2837.462379] warn_slowpath_null+0x1d/0x1f
[ 2837.462379] btrfs_destroy_inode+0xe8/0x27e [btrfs]
[ 2837.462379] destroy_inode+0x3d/0x55
[ 2837.462379] evict+0x177/0x17e
[ 2837.462379] dispose_list+0x50/0x71
[ 2837.462379] evict_inodes+0x132/0x141
[ 2837.462379] generic_shutdown_super+0x3f/0xeb
[ 2837.462379] kill_anon_super+0x12/0x1c
[ 2837.462379] btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.462379] deactivate_locked_super+0x30/0x68
[ 2837.462379] deactivate_super+0x36/0x39
[ 2837.462379] cleanup_mnt+0x58/0x76
[ 2837.462379] __cleanup_mnt+0x12/0x14
[ 2837.462379] task_work_run+0x77/0x9b
[ 2837.462379] prepare_exit_to_usermode+0x9d/0xc5
[ 2837.462379] syscall_return_slowpath+0x196/0x1b9
[ 2837.462379] entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.462379] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.462379] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.462379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.462379] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.462379] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.462379] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
[ 2837.596256] ------------[ cut here ]------------
[ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs]
[ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1
[ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.663359] Call Trace:
[ 2837.663359] dump_stack+0x68/0x92
[ 2837.663359] __warn+0xc2/0xdd
[ 2837.663359] warn_slowpath_null+0x1d/0x1f
[ 2837.663359] btrfs_free_block_groups+0x246/0x3eb [btrfs]
[ 2837.663359] close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.663359] ? evict_inodes+0x132/0x141
[ 2837.663359] btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.663359] generic_shutdown_super+0x6a/0xeb
[ 2837.663359] kill_anon_super+0x12/0x1c
[ 2837.663359] btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.663359] deactivate_locked_super+0x30/0x68
[ 2837.663359] deactivate_super+0x36/0x39
[ 2837.663359] cleanup_mnt+0x58/0x76
[ 2837.663359] __cleanup_mnt+0x12/0x14
[ 2837.663359] task_work_run+0x77/0x9b
[ 2837.663359] prepare_exit_to_usermode+0x9d/0xc5
[ 2837.663359] syscall_return_slowpath+0x196/0x1b9
[ 2837.663359] entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.663359] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.663359] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.663359] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.663359] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.663359] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.663359] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.663359] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.739445] ---[ end trace e79345fe24b30b8e ]---
[ 2837.745595] ------------[ cut here ]------------
[ 2837.746412] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5700 btrfs_free_block_groups+0x261/0x3eb [btrfs]
[ 2837.747955] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.755395] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1
[ 2837.756769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.758526] Call Trace:
[ 2837.758925] dump_stack+0x68/0x92
[ 2837.759383] __warn+0xc2/0xdd
[ 2837.759383] warn_slowpath_null+0x1d/0x1f
[ 2837.759383] btrfs_free_block_groups+0x261/0x3eb [btrfs]
[ 2837.759383] close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.759383] ? evict_inodes+0x132/0x141
[ 2837.759383] btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.759383] generic_shutdown_super+0x6a/0xeb
[ 2837.759383] kill_anon_super+0x12/0x1c
[ 2837.759383] btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.759383] deactivate_locked_super+0x30/0x68
[ 2837.759383] deactivate_super+0x36/0x39
[ 2837.759383] cleanup_mnt+0x58/0x76
[ 2837.759383] __cleanup_mnt+0x12/0x14
[ 2837.759383] task_work_run+0x77/0x9b
[ 2837.759383] prepare_exit_to_usermode+0x9d/0xc5
[ 2837.759383] syscall_return_slowpath+0x196/0x1b9
[ 2837.759383] entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.759383] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.759383] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.759383] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.759383] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.759383] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.759383] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.759383] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.777063] ---[ end trace e79345fe24b30b8f ]---
[ 2837.778235] ------------[ cut here ]------------
[ 2837.778856] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:9825 btrfs_free_block_groups+0x348/0x3eb [btrfs]
[ 2837.791385] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
[ 2837.797711] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1
[ 2837.798594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[ 2837.800118] Call Trace:
[ 2837.800515] dump_stack+0x68/0x92
[ 2837.801015] __warn+0xc2/0xdd
[ 2837.801471] warn_slowpath_null+0x1d/0x1f
[ 2837.801698] btrfs_free_block_groups+0x348/0x3eb [btrfs]
[ 2837.801698] close_ctree+0x1dd/0x2e1 [btrfs]
[ 2837.801698] ? evict_inodes+0x132/0x141
[ 2837.801698] btrfs_put_super+0x15/0x17 [btrfs]
[ 2837.801698] generic_shutdown_super+0x6a/0xeb
[ 2837.801698] kill_anon_super+0x12/0x1c
[ 2837.801698] btrfs_kill_super+0x16/0x21 [btrfs]
[ 2837.801698] deactivate_locked_super+0x30/0x68
[ 2837.801698] deactivate_super+0x36/0x39
[ 2837.801698] cleanup_mnt+0x58/0x76
[ 2837.801698] __cleanup_mnt+0x12/0x14
[ 2837.801698] task_work_run+0x77/0x9b
[ 2837.801698] prepare_exit_to_usermode+0x9d/0xc5
[ 2837.801698] syscall_return_slowpath+0x196/0x1b9
[ 2837.801698] entry_SYSCALL_64_fastpath+0xab/0xad
[ 2837.801698] RIP: 0033:0x7f3ef3e6b9a7
[ 2837.801698] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[ 2837.801698] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
[ 2837.801698] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
[ 2837.801698] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
[ 2837.801698] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
[ 2837.801698] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
[ 2837.818441] ---[ end trace e79345fe24b30b90 ]---
[ 2837.818991] BTRFS info (device sdc): space_info 1 has 7974912 free, is not full
[ 2837.819830] BTRFS info (device sdc): space_info total=8388608, used=417792, pinned=0, reserved=0, may_use=18446744073709547520, readonly=0
What happens in the above example is the following:
1) When punching the hole, at btrfs_punch_hole(), the variable tail_len
is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb).
This results in the creation of an extent map with a length of 2Kb
starting at file offset 148Kb, through find_first_non_hole() ->
btrfs_get_extent().
2) The second write (first write after the hole punch operation), sets
the range [50Kb, 152Kb[ to delalloc.
3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent
map covering the range [148Kb, 150Kb[ and ends up calling
set_extent_bit() for the same range, which results in splitting an
existing extent state record, covering the range [148Kb, 152Kb[ into
two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and
[150Kb, 152Kb[.
4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling
btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the
range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook()
callback being invoked against the two 2Kb extent state records that
cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against
the first 2Kb extent state, it calls btrfs_delalloc_release_metadata()
with a length argument of 2048 bytes. That function rounds up the length
to a sector size aligned length, so it ends up considering a length of
4096 bytes, and then calls calc_csum_metadata_size() which results in
decrementing the inode's csum_bytes counter by 4096 bytes, so after
it stays a value of 0 bytes. Then the same happens when
btrfs_clear_bit_hook() is called against the second extent state that
has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is
rounded up to 4096 and calc_csum_metadata_size() ends up being called
to decrement 4096 bytes from the inode's csum_bytes counter, which
at that time has a value of 0, leading to an underflow, which is
exactly what triggers the first warning, at btrfs_destroy_inode().
All the other warnings relate to several space accounting counters
that underflow as well due to similar reasons.
A similar case but where the hole punching operation creates an extent map
with a start offset not aligned to the sector size is the following:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "fpunch 695K 820K" $SCRATCH_MNT/bar
$ xfs_io -c "pwrite -S 0xaa 1008K 307K" $SCRATCH_MNT/bar
$ xfs_io -c "pwrite -S 0xbb -b 630K 1073K 630K" $SCRATCH_MNT/bar
$ xfs_io -c "pwrite -S 0xcc -b 459K 1068K 459K" $SCRATCH_MNT/bar
$ umount /mnt
During the unmount operation we get similar traces for the same reasons as
in the first example.
So fix the hole punching operation to make sure it never creates extent
maps with a length that is not aligned to the sector size nor with a start
offset that is not aligned to the sector size, as this breaks all
assumptions and it's a land mine.
Fixes: d77815461f ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
Cc: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Return EAGAIN if any of the following checks fail
+ i_rwsem is not lockable
+ NODATACOW or PREALLOC is not set
+ Cannot nocow at the desired location
+ Writing beyond end of file which is not allocated
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently when there are buffered writes that were not yet flushed and
they fall within allocated ranges of the file (that is, not in holes or
beyond eof assuming there are no prealloc extents beyond eof), btrfs
simply reports an incorrect number of used blocks through the stat(2)
system call (or any of its variants), regardless of mount options or
inode flags (compress, compress-force, nodatacow). This is because the
number of blocks used that is reported is based on the current number
of bytes in the vfs inode plus the number of dealloc bytes in the btrfs
inode. The later covers bytes that both fall within allocated regions
of the file and holes.
Example scenarios where the number of reported blocks is wrong while the
buffered writes are not flushed:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)
$ sync
$ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)
# The following should have reported 64K...
$ du -h /mnt/sdc/foo1
128K /mnt/sdc/foo1
$ sync
# After flushing the buffered write, it now reports the correct value.
$ du -h /mnt/sdc/foo1
64K /mnt/sdc/foo1
$ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
wrote 65536/65536 bytes at offset 0
64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)
$ sync
$ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
wrote 65536/65536 bytes at offset 65536
64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
# The following should have reported 128K...
$ du -h /mnt/sdc/foo2
192K /mnt/sdc/foo2
$ sync
# After flushing the buffered write, it now reports the correct value.
$ du -h /mnt/sdc/foo2
128K /mnt/sdc/foo2
So the number of used file blocks is simply incorrect, unlike in other
filesystems such as ext4 and xfs for example, but only while the buffered
writes are not flushed.
Fix this by tracking the number of delalloc bytes that fall within holes
and beyond eof of a file, and use instead this new counter when reporting
the number of used blocks for an inode.
Another different problem that exists is that the delalloc bytes counter
is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from
the respective range in the inode's iotree) and the vfs inode's bytes
counter is only incremented when writeback finishes (through
insert_reserved_file_extent()). Therefore while writeback is ongoing we
simply report a wrong number of blocks used by an inode if the write
operation covers a range previously unallocated. While this change does
not fix this problem, it does minimizes it a lot by shortening that time
window, as the new dealloc bytes counter (new_delalloc_bytes) is only
decremented when writeback finishes right before updating the vfs inode's
bytes counter. Fully fixing this second problem is not trivial and will
be addressed later by a different patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
extent map structure. The failure can happen either due to an -ENOMEM
condition or, when quotas are enabled, due to -EDQUOT for example.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
btrfs_get_extent() never returns NULL pointers, so this code introduces
a static checker warning.
The btrfs_get_extent() is a bit complex, but trust me that it doesn't
return NULLs and also if it did we would trigger the BUG_ON(!em) before
the last return statement.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[ updated subject ]
Signed-off-by: David Sterba <dsterba@suse.com>
Pull more btrfs updates from Chris Mason:
"Btrfs round two.
These are mostly a continuation of Dave Sterba's collection of
cleanups, but Filipe also has some bug fixes and performance
improvements"
* 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (69 commits)
btrfs: add dummy callback for readpage_io_failed and drop checks
btrfs: drop checks for mandatory extent_io_ops callbacks
btrfs: document existence of extent_io ops callbacks
btrfs: let writepage_end_io_hook return void
btrfs: do proper error handling in btrfs_insert_xattr_item
btrfs: handle allocation error in update_dev_stat_item
btrfs: remove BUG_ON from __tree_mod_log_insert
btrfs: derive maximum output size in the compression implementation
btrfs: use predefined limits for calculating maximum number of pages for compression
btrfs: export compression buffer limits in a header
btrfs: merge nr_pages input and output parameter in compress_pages
btrfs: merge length input and output parameter in compress_pages
btrfs: constify name of subvolume in creation helpers
btrfs: constify buffers used by compression helpers
btrfs: constify input buffer of btrfs_csum_data
btrfs: constify device path passed to relevant helpers
btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode
btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode
btrfs: Make btrfs_add_nondir take btrfs_inode
btrfs: Make btrfs_add_link take btrfs_inode
...
In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs
branch.
This patch also fixes multiple checkpatch warnings: WARNING: Prefer
'unsigned int' to bare use of 'unsigned'
Thanks to Andrew Morton for suggesting more appropriate function instead
of macro.
[geliangtang@gmail.com: truncate: use i_blocksize()]
Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently btrfs_ino takes a struct inode and this causes a lot of
internal btrfs functions which consume this ino to take a VFS inode,
rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
of VFS structs into the internals of btrfs first it's necessary to
eliminate all uses of struct inode for the purpose of inode. This patch
does that by using BTRFS_I to convert an inode to btrfs_inode. With
this problem eliminated subsequent patches will start eliminating the
passing of struct inode altogether, eventually resulting in a lot cleaner
code.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: David Sterba <dsterba@suse.com>
Pull more vfs updates from Al Viro:
"In this pile:
- autofs-namespace series
- dedupe stuff
- more struct path constification"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
ocfs2: charge quota for reflinked blocks
ocfs2: fix bad pointer cast
ocfs2: always unlock when completing dio writes
ocfs2: don't eat io errors during _dio_end_io_write
ocfs2: budget for extent tree splits when adding refcount flag
ocfs2: prohibit refcounted swapfiles
ocfs2: add newlines to some error messages
ocfs2: convert inode refcount test to a helper
simple_write_end(): don't zero in short copy into uptodate
exofs: don't mess with simple_write_{begin,end}
9p: saner ->write_end() on failing copy into non-uptodate page
fix gfs2_stuffed_write_end() on short copies
fix ceph_write_end()
nfs_write_end(): fix handling of short copies
vfs: refactor clone/dedupe_file_range common functions
fs: try to clone files first in vfs_copy_file_range
vfs: misc struct path constification
namespace.c: constify struct path passed to a bunch of primitives
quota: constify struct path in quota_on
...
Patches queued up by Filipe:
The most important change is still the fix for the extent tree
corruption that happens due to balance when qgroups are enabled (a
regression introduced in 4.7 by a fix for a regression from the last
qgroups rework). This has been hitting SLE and openSUSE users and QA
very badly, where transactions keep getting aborted when running
delayed references leaving the root filesystem in RO mode and nearly
unusable. There are fixes here that allow us to run xfstests again
with the integrity checker enabled, which has been impossible since 4.8
(apparently I'm the only one running xfstests with the integrity
checker enabled, which is useful to validate dirtied leafs, like
checking if there are keys out of order, etc). The rest are just some
trivial fixes, most of them tagged for stable, and two cleanups.
Signed-off-by: Chris Mason <clm@fb.com>
A clone is a perfectly fine implementation of a file copy, so most
file systems just implement the copy that way. Instead of duplicating
this logic move it to the VFS. Currently btrfs and XFS implement copies
the same way as clones and there is no behavior change for them, cifs
only implements clones and grow support for copy_file_range with this
patch. NFS implements both, so this will allow copy_file_range to work
on servers that only implement CLONE and be lot more efficient on servers
that implements CLONE and COPY.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Now we only use the root parameter to print the root objectid in
a tracepoint. We can use the root parameter from the transaction
handle for that. It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer. Let's convert those to
just accept an fs_info pointer directly.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable. This makes the code considerably
more readable.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We track the node sizes per-root, but they never vary from the values
in the superblock. This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The hole punching can result in adding new leafs (and as a consequence
new nodes) to the tree because when we find file extent items that span
beyond the hole range we may end up not deleting them (just adjusting
them, reducing their range by reducing their length or increasing their
offset field) and add new file extent items representing holes.
So after splitting a leaf (therefore creating a new one) to insert a new
file extent item representing a hole, a new node might be added to each
level of the tree in the worst case scenario (since there's a new key
and every parent node was full).
For example if a file has an extent item representing the range 0 to 64Mb
and we punch a hole in the range 1Mb to 20Mb, the existing extent item is
duplicated and one of the copies is adjusted to represent the range 0 to
1Mb, the other copy adjusted to represent the range 20Mb to 64Mb, and a
new file extent item representing a hole in the range 1Mb to 20Mb is
inserted.
Fix this by using btrfs_calc_trans_metadata_size() instead of
btrfs_calc_trunc_metadata_size(), so that enough metadata space is
reserved for the worst possible case.
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[Modified changelog for clarity and correctness]
At this point we will have dropped extent entries from the file, so if we fail
to insert the new hole entries then we are leaving the fs in a corrupt state
(albeit an easily fixed one). Abort the transaciton if this happens so we can
avoid corrupting the fs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to do hole punching we have a block reserve to hold the reservation we
need to drop the extents in our range. Since we could end up dropping a lot of
extents we set rsv->failfast so we can just loop around again and drop the
remaining of the range. Unfortunately we unconditionally fill the hole extents
in and start from the last extent we encountered, which we may or may not have
dropped. So this can result in overlapping file extent entries, which can be
tripped over in a variety of ways, either by hitting BUG_ON(!ret) in
fill_holes() after the search, or in btrfs_set_item_key_safe() in
btrfs_drop_extent() at a later time by an unrelated task. Fix this by only
setting drop_end to the last extent we did actually drop. This way our holes
are filled in properly for the range that we did drop, and the rest of the range
that remains to be dropped is actually dropped. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pull btrfs updates from Chris Mason:
"This is a big variety of fixes and cleanups.
Liu Bo continues to fixup fuzzer related problems, and some of Josef's
cleanups are prep for his bigger extent buffer changes (slated for
v4.10)"
* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (39 commits)
Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs"
Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf
Btrfs: don't BUG() during drop snapshot
btrfs: fix btrfs_no_printk stub helper
Btrfs: memset to avoid stale content in btree leaf
btrfs: parent_start initialization cleanup
btrfs: Remove already completed TODO comment
btrfs: Do not reassign count in btrfs_run_delayed_refs
btrfs: fix a possible umount deadlock
Btrfs: fix memory leak in do_walk_down
btrfs: btrfs_debug should consume fs_info when DEBUG is not defined
btrfs: convert send's verbose_printk to btrfs_debug
btrfs: convert pr_* to btrfs_* where possible
btrfs: convert printk(KERN_* to use pr_* calls
btrfs: unsplit printed strings
btrfs: clean the old superblocks before freeing the device
Btrfs: kill BUG_ON in run_delayed_tree_ref
Btrfs: don't leak reloc root nodes on error
btrfs: squash lines for simple wrapper functions
Btrfs: improve check_node to avoid reading corrupted nodes
...
Pull more vfs updates from Al Viro:
">rename2() work from Miklos + current_time() from Deepa"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Replace current_fs_time() with current_time()
fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
fs: Replace CURRENT_TIME with current_time() for inode timestamps
fs: proc: Delete inode time initializations in proc_alloc_inode()
vfs: Add current_time() api
vfs: add note about i_op->rename changes to porting
fs: rename "rename2" i_op to "rename"
vfs: remove unused i_op->rename
fs: make remaining filesystems use .rename2
libfs: support RENAME_NOREPLACE in simple_rename()
fs: support RENAME_NOREPLACE for local filesystems
ncpfs: fix unused variable warning
current_fs_time() uses struct super_block* as an argument.
As per Linus's suggestion, this is changed to take struct
inode* as a parameter instead. This is because the function
is primarily meant for vfs inode timestamps.
Also the function was renamed as per Arnd's suggestion.
Change all calls to current_fs_time() to use the new
current_time() function instead. current_fs_time() will be
deleted.
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>