Commit graph

873410 commits

Author SHA1 Message Date
David Sterba
8d6fac0087 btrfs: add support for 4-copy replication (raid1c4)
Add new block group profile to store 4 copies in a simliar way that
current RAID1 does.  The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.

The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:49 +01:00
David Sterba
47e6f7423b btrfs: add support for 3-copy replication (raid1c3)
Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.

The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:49 +01:00
David Sterba
fac07d2b09 btrfs: sink write flags to cow_file_range_async
In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
cow_file_range_async gained wbc as a parameter and this makes passing
write flags redundant. Set it inside the function and remove the
parameter.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:49 +01:00
David Sterba
57e5ffeb87 btrfs: sink write_flags to __extent_writepage_io
__extent_writepage reads write flags from wbc and passes both to
__extent_writepage_io. This makes write_flags redundant and we can
remove it.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:48 +01:00
Filipe Manana
fd0ddbe250 Btrfs: send, skip backreference walking for extents with many references
Backreference walking, which is used by send to figure if it can issue
clone operations instead of write operations, can be very slow and use
too much memory when extents have many references. This change simply
skips backreference walking when an extent has more than 64 references,
in which case we fallback to a write operation instead of a clone
operation. This limit is conservative and in practice I observed no
signicant slowdown with up to 100 references and still low memory usage
up to that limit.

This is a temporary workaround until there are speedups in the backref
walking code, and as such it does not attempt to add extra interfaces or
knobs to tweak the threshold.

Reported-by: Atemu <atemu.main@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:48 +01:00
Filipe Manana
11f2069c11 Btrfs: send, allow clone operations within the same file
For send we currently skip clone operations when the source and
destination files are the same. This is so because clone didn't support
this case in its early days, but support for it was added back in May
2013 by commit a96fbc7288 ("Btrfs: allow file data clone within a
file"). This change adds support for it.

Example:

  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdd /mnt/sdd

  $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
  $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar

  $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap

  $ mkfs.btrfs -f /dev/sde
  $ mount /dev/sde /mnt/sde

  $ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde

Without this change file foobar at the destination has a single 128Kb
extent:

  $ filefrag -v /mnt/sde/snap/foobar
  Filesystem type is: 9123683e
  File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..      31:          0..        31:     32:             last,unknown_loc,delalloc,eof
  /mnt/sde/snap/foobar: 1 extent found

With this we get a single 64Kb extent that is shared at file offsets 0
and 64K, just like in the source filesystem:

  $ filefrag -v /mnt/sde/snap/foobar
  Filesystem type is: 9123683e
  File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..      15:       3328..      3343:     16:             shared
     1:       16..      31:       3328..      3343:     16:       3344: last,shared,eof
  /mnt/sde/snap/foobar: 2 extents found

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:48 +01:00
Qu Wenruo
6b7faadd98 btrfs: Ensure we trim ranges across block group boundary
[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:

  btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%

type:		bbio->map_type, in above case, it's SINGLE DATA.
start:		Logical address of this trim
len:		Logical length of this trim
trimmed:	Physically trimmed bytes
ratio:		trimmed / len

Thus leaving some unused space not discarded.

[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:

btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
   |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
   |- btrfs_discard_extent()

However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.

When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:

      |<---    BG1    --->|<---      BG2     --->|
      |//////|<--   Range to discard   --->|/////|

To discard that range, we have the following calls:

  btrfs_discard_extent()
  |- btrfs_map_block()
  |  Returned bbio will end at BG1's end. As btrfs_map_block()
  |  never returns result across block group boundary.
  |- btrfs_issuse_discard()
     Issue discard for each stripe.

So we will only discard the range in BG1, not the remaining part in BG2.

Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.

[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
  btrfs_map_block() uses its @length paramter to notify the caller how
  many bytes are mapped in current call.
  With __btrfs_map_block_for_discard() also modifing the @length,
  btrfs_discard_extent() now understands when to do extra trim.

- Call btrfs_map_block() in a loop until we hit the range end Since we
  now know how many bytes are mapped each time, we can iterate through
  each block group boundary and issue correct trim for each range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:48 +01:00
Qu Wenruo
2d974619a7 btrfs: volumes: Use more straightforward way to calculate map length
The old code goes:

 	offset = logical - em->start;
	length = min_t(u64, em->len - offset, length);

Where @length calculation is dependent on offset, it can take reader
several more seconds to find it's just the same code as:

 	offset = logical - em->start;
	length = min_t(u64, em->start + em->len - logical, length);

Use above code to make the length calculate independent from other
variable, thus slightly increase the readability.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:48 +01:00
Qu Wenruo
153a6d2999 btrfs: tree-checker: Check item size before reading file extent type
In check_extent_data_item(), we read file extent type without verifying
if the item size is valid.

Add such check to ensure the file extent type we read is correct.

The check is not as accurate as we need to cover both inline and regular
extents, so it only checks if the item size is larger or equal to inline
header.
So the existing size checks on inline/regular extents are still needed.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:48 +01:00
Dan Carpenter
3ec17a67cc btrfs: clean up locking name in scrub_enumerate_chunks()
The "&fs_info->dev_replace.rwsem" and "&dev_replace->rwsem" refer to
the same lock but Smatch is not clever enough to figure that out so it
leads to static checker warnings.  It's better to use it consistently
anyway.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:47 +01:00
Nikolay Borisov
6ef108dd0c btrfs: Streamline btrfs_fs_info::backup_root_index semantics
The backup_root_index member stores the index at which the backup root
should be saved upon next transaction commit. However, there is a
small deviation from this behavior in the form of a check in
backup_super_roots which checks if current root generation equals to the
generation of the previous root. This can trigger in the following
scenario:

slot0: gen-2
slot1: gen-1
slot2: gen
slot3: unused

Now suppose slot3 (which is also the root specified in the super block)
is corrupted hence init_tree_roots chooses to use the backup root at
slot2, meaning read_backup_root will read slot2 and assign the
superblock generation to gen-1. Despite this backup_root_index will
point at slot3 because its init happens in init_backup_root_slot, long
before any parsing of the backup roots occur. Then on next transaction
start, gen-1 will be incremented by 1 making the root's generation
equal gen. Subsequently, on transaction commit the following check
triggers:

  if (btrfs_backup_tree_root_gen(root_backup) ==
           btrfs_header_generation(info->tree_root->node))

This causes the 'next_backup', which is the index at which the backup is
going to be written to, to set to last_backup, which will be slot2.

All of this is a very confusing way of expressing the following
invariant:

 Always write a backup root at the index following the last used backup
 root.

This commit streamlines this logic by setting backup_root_index to the
next index after the one used for mount.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:47 +01:00
Nikolay Borisov
4ac039ad75 btrfs: Rename find_oldest_super_backup to init_backup_root_slot
The old name name was an awful misnomer because it didn't really find
the oldest super backup per-se but rather its slot. For example if we
have:

slot0: gen - 2
slot1: gen - 1
slot2: gen
slot3: empty

init_backup_root_slot will return slot3 and not slot0.

The new name is more appropriate since the function doesn't care whether
there is a valid backup in the returned slot or not.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:47 +01:00
Nikolay Borisov
260eb11bd4 btrfs: Remove unused next_root_backup function
This function has been superseded by previous commits and is no longer
used so just remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:47 +01:00
Nikolay Borisov
336a0d8df1 btrfs: Don't use objectid_mutex during mount
Since the filesystem is not well formed and no trees are loaded it's
pointless holding the objectid_mutex. Just remove its usage.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:47 +01:00
Nikolay Borisov
b8522a1e5f btrfs: Factor out tree roots initialization during mount
The code responsible for reading and initializing tree roots is
scattered in open_ctree among 2 labels, emulating a loop. This is rather
confusing to reason about. Instead, factor the code to a new function,
init_tree_roots which implements the same logical flow.

There are a couple of notable differences, namely:

* Instead of using next_backup_root it's using the newly introduced
  read_backup_root.

* If read_backup_root returns an error init_tree_roots propagates the
  error and there is no special handling of that case e.g. the code jumps
  straight to 'fail_tree_roots' label. The old code, however, was
  (erroneously) jumping to 'fail_block_groups' label if next_backup_root
  did fail, this was unnecessary since the tree roots init logic doesn't
  modify the state of block groups.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:46 +01:00
Nikolay Borisov
bd2336b2ac btrfs: Add read_backup_root
This function will replace next_root_backup with a much saner/cleaner
interface.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:46 +01:00
Nikolay Borisov
fc2e4c5b35 btrfs: Remove newest_gen argument from find_oldest_super_backup
It's no longer needed following cleanups around find_newest_backup_root

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:46 +01:00
Nikolay Borisov
01f0f9da9d btrfs: Cleanup and simplify find_newest_super_backup
Backup roots are always written in a circular manner. By definition we
can only ever have 1 backup root whose generation equals to that of the
superblock. Hence, the 'if' in the for loop will trigger at most once.
This is sufficient to return the newest backup root.

Furthermore the newest_gen parameter is always set to the generation of
the superblock. This value can be obtained from the fs_info.

This patch removes the unnecessary code dealing with the wraparound
case and makes 'newest_gen' a local variable.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:46 +01:00
Filipe Manana
16ad3be175 Btrfs: remove unnecessary delalloc mutex for inodes
The inode delalloc mutex was added a long time ago by commit f248679e86
("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and
the reason for its introduction is not very clear from the change log. It
claims it solves bogus warnings from lockdep, however it lacks an example
report/warning from lockdep, or any explanation.

Since we have enough concurrentcy protection from the locks of the space
info and block reserve objects, and such lockdep warnings don't seem to
exist anymore (at least on a 5.3 kernel I couldn't get them with fstests,
ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing
the size of the btrfs_inode structure. With some quick fio tests doing
direct IO and mmap writes I couldn't observe any significant performance
increase either (direct IO writes that don't increase the file's size
don't hold the inode's lock for their entire duration and mmap writes
don't hold the inode's lock at all), which are the only type of writes
that could see any performance gain due to less serialization.

Review feedback from Josef:

The problem was taking the i_mutex in mmap, which is how I was
protecting delalloc reservations originally.  The delalloc mutex didn't
come with all of the other dependencies.  That's what the lockdep
messages were about, removing the lock isn't going to make them appear
again.

We _had_ to lock around this because we used to do tricks to keep from
over-reserving, and if we didn't serialize delalloc reservations we'd
end up with ugly accounting problems when we tried to clean things up.

However with my recentish changes this isn't the case anymore.  Every
operation is responsible for reserving its space, and then adding it to
the inode.  Then cleaning up is straightforward and can't be mucked up
by other users.  So we no longer need the delalloc mutex to safe us from
ourselves.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:46 +01:00
Filipe Manana
bf2df5aed1 Btrfs: remove wait queue from space_info structure
It is not used anymore since commit 957780eb27 ("Btrfs: introduce
ticketed enospc infrastructure"), so just remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:46 +01:00
Johannes Thumshirn
f5389f330d btrfs: remove cached space_info in btrfs_statfs()
In btrfs_statfs() we cache fs_info::space_info in a local variable only
to use it once in a list_for_each_rcu() statement.

Not only is the local variable unnecessary it even makes the code harder
to follow as it's not clear which list it is iterating.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:45 +01:00
David Sterba
b3470b5dbe btrfs: add dedicated members for start and length of a block group
The on-disk format of block group item makes use of the key that stores
the offset and length. This is further used in the code, although this
makes thing harder to understand. The key is also packed so the
offset/length is not properly aligned as u64.

Add start (key.objectid) and length (key.offset) members to block group
and remove the embedded key.  When the item is searched or written, a
local variable for key is used.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:45 +01:00
David Sterba
0222dfdd4a btrfs: rename extent buffer block group item accessors
Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and
manipulate the items there, there's no special prefix required. The
block group accessors had _disk_ because previously the names were
occupied by the on-stack accessors. As this has been addressed in the
previous patch, we can now unify the naming.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:45 +01:00
David Sterba
de0dc456fd btrfs: rename block_group_item on-stack accessors to follow naming
All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the
name, the block group ones were not following that scheme, so let's
switch them.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:45 +01:00
David Sterba
3d976388da btrfs: remove embedded block_group_cache::item
The members ::used and ::flags are now in the block group cache
structure, the last one is chunk_objectid, but that's set to a fixed
value and otherwise unused. The item is constructed from a local
variable before write, so we can remove the embedded one from block
group.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:45 +01:00
David Sterba
f93c63e547 btrfs: move block_group_item::flags to block group
The flags are read from the item that's embedded to block group struct,
but the item will be removed. Use the ::flags after read and before
write.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
David Sterba
bf38be65f3 btrfs: move block_group_item::used to block group
For unknown reasons, the member 'used' in the block group struct is
stored in the b-tree item and accessed everywhere using the special
accessor helper. Let's unify it and make it a regular member and only
update the item before writing it to the tree.

The item is still being used for flags and chunk_objectid, there's some
duplication until the item is removed in following patches.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
Qu Wenruo
34b127aecd btrfs: Remove btrfs_bio::flags member
The last user of btrfs_bio::flags was removed in commit 326e1dbb57
("block: remove management of bi_remaining when restoring original
bi_end_io"), remove it.

(Tagged for stable as the structure is heavily used and space savings
are desirable.)

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
David Sterba
352ae07b59 btrfs: add blake2b to checksumming algorithms
Add blake2b (with 256 bit digest) to the list of possible checksumming
algorithms used by BTRFS.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
David Sterba
b4e967be43 btrfs: add member for a specific checksum driver
Currently all the checksum algorithms generate a fixed size digest size
and we use it.  The on-disk format can hold up to BTRFS_CSUM_SIZE bytes
and BLAKE2b produces digest of 512 bits by default. We can't do that and
will use the blake2b-256, this needs to be passed to the crypto API.

Separate that from the base algorithm name and add a member to request
specific driver, in this case with the digest size.

The only place that uses the driver name is the crypto API setup.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:44 +01:00
Johannes Thumshirn
41e6d2a808 btrfs: sysfs: show used checksum driver per filesystem
Show the used driver for the checksum algorithm for the filesystem in
sysfs file /sys/fs/btrfs/UUID/features/checksum, eg.

  crc32c (crc32c-generic)

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:43 +01:00
David Sterba
f7cea56c0f btrfs: sysfs: export supported checksums
Export supported checksum algorithms via sysfs in the list of static
features:

  /sys/fs/btrfs/features/supported_checksums

Space spearated list of checksum algorithm names.

Co-developed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:43 +01:00
Johannes Thumshirn
3831bf0094 btrfs: add sha256 to checksumming algorithm
Add sha256 to the list of possible checksumming algorithms used by BTRFS.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:43 +01:00
Johannes Thumshirn
3951e7f050 btrfs: add xxhash64 to checksumming algorithms
Add xxhash64 to the list of possible checksumming algorithms used by
BTRFS.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 17:51:43 +01:00
David Sterba
8530c37a70 btrfs: get bdev from latest_dev for dio bh_result
To remove use of extent_map::bdev we need to find a replacement, and the
latest_bdev is the only one we can use here, because inode::i_bdev and
superblock::s_bdev are NULL.

The DIO code uses bdev in two places:

* to read blocksize to perform alignment checks in
  do_blockdev_direct_IO, but we do them in btrfs code before any call to
  DIO

* in the following call chain:

  do_direct_IO
    get_more_blocks
     sdio->get_block() <-- this is btrfs_get_blocks_direct

  subsequently the map_bh->b_dev member is used in clean_bdev_aliases
  and dio_new_bio to set the bio's bdev to that of the buffer_head.
  However, because we have provided a submit function dio_bio_submit
  calls our submission function and ignores the bdev.

So it's safe to pass any valid bdev that's used within the filesystem.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:01 +01:00
David Sterba
c3e14909d3 btrfs: assert extent_map bdevs and lookup_map and split
This is a preparatory patch for removing extent_map::bdev. There's some
history behind the code so this is only precaution to catch if things
break before the actual removal happens.

Logically, comparing a raw low-level block device (bdev) does not make
sense for extent maps (high-level objects). This had no effect in
practice but was quite confusing in the code.  The lookup_map is set iff
EXTENT_FLAG_FS_MAPPING is set.

The two pointers were stored in the same bytes and used potentially in
two meanings. Now they're split, so the asserts are in place to check
that the condition will not change.

The lookup map pointer misused bdev, this has been changed in commit
95617d6932 ("btrfs: cleanup, stop casting for extent_map->lookup
everywhere") to the explicit type. But the semantics hasn't changed and
bdev was not actually used to decide if maps are mergeable.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:01 +01:00
Johannes Thumshirn
32ab3d1b4d btrfs: remove pointless indentation in btrfs_read_sys_array()
Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and
then process it we could just bail out early if the read disk key wasn't
a BTRFS_CHUNK_ITEM_KEY.

This removes a level of indentation and makes the code nicer to read.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:01 +01:00
Johannes Thumshirn
5ae2169290 btrfs: reduce indentation in btrfs_may_alloc_data_chunk
In btrfs_may_alloc_data_chunk() we're checking if the chunk type is of
type BTRFS_BLOCK_GROUP_DATA and if it is we process it.

Instead of checking if the chunk type is a BTRFS_BLOCK_GROUP_DATA chunk
we can negate the check and bail out early if it isn't.

This makes the code a bit more readable.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:00 +01:00
Johannes Thumshirn
721860d578 btrfs: remove pointless local variable in lock_stripe_add()
In lock_stripe_add() we're caching the bucket for the stripe hash table
just for a single call to dereference the stripe hash.

If we just directly call rbio_bucket() we can safe the pointless local
variable.

Also move the dereferencing of the stripe hash outside of the variable
declaration block to not break over the 80 characters limit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:00 +01:00
Johannes Thumshirn
9d6cb1b0f9 btrfs: raid56: reduce indentation in lock_stripe_add
In lock_stripe_add() we're traversing the stripe hash list and check if
the current list element's raid_map equals is equal to the raid bio's
raid_map. If both are equal we continue processing.

If we'd check for inequality instead of equality we can reduce one level
of indentation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:00 +01:00
David Sterba
1d2e7c7c3e btrfs: tracepoints: constify all pointers
We don't modify the data passed to tracepoints, some of the declarations
are already const, add it to the rest.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:00 +01:00
David Sterba
94c3f6c6b8 btrfs: tracepoints: drop typecasts from printk
Remove typecasts from trace printk, adjust types and move typecast to
the assignment if necessary. When assigning, the types are more obvious
compared to matching the variables to the format strings.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:00 +01:00
Nikolay Borisov
bc80230e0e btrfs: Return offset from find_desired_extent
Instead of using an input pointer parameter as the return value and have
an int as the return type of find_desired_extent, rework the function to
directly return the found offset. Doing that the 'ret' variable in
btrfs_llseek_file can be removed. Additional (subjective) benefit is
that btrfs' llseek function now resemebles those of the other major
filesystems.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:47:00 +01:00
Nikolay Borisov
2034f3b470 btrfs: Simplify btrfs_file_llseek
Handle SEEK_END/SEEK_CUR in a single 'default' case by directly
returning from generic_file_llseek. This makes the 'out' label
redundant.  Finally return directly the vale from vfs_setpos. No
semantic changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
Nikolay Borisov
d79b7c26b1 btrfs: Speed up btrfs_file_llseek
Modifying the file position is done on a per-file basis. This renders
holding the inode lock for writing useless and makes the performance of
concurrent llseek's abysmal.

Fix this by holding the inode for read. This provides protection against
concurrent truncates and find_desired_extent already includes proper
extent locking for the range which ensures proper locking against
concurrent writes. SEEK_CUR and SEEK_END can be done lockessly.

The former is synchronized by file::f_lock spinlock. SEEK_END is not
synchronized but atomic, but that's OK since there is not guarantee that
SEEK_END will always be at the end of the file in the face of tail
modifications.

This change brings ~82% performance improvement when doing a lot of
parallel fseeks. The workload essentially does:

    for (d=0; d<num_seek_read; d++)
      {
	/* offset %= 16777216; */
	fseek (f, 256 * d % 16777216, SEEK_SET);
	fread (buffer, 64, 1, f);
      }

Without patch:

num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

real	0m41.412s
user	0m28.777s
sys	2m16.510s

With patch:

num workprocesses = 16
num fseek/fread = 8000000
step = 256
fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

real	0m11.479s
user	0m27.629s
sys	0m21.040s

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba
0cf2521313 btrfs: compression: remove ops pointer from workspace_manager
We can infer the ops from the type that is now passed to all functions
that would need it, this makes workspace_manager::ops redundant and can
be removed.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba
1e00235160 btrfs: compression: inline free_workspace
Replace indirect calls to free_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba
a3bbd2a9ee btrfs: compression: pass type to btrfs_put_workspace
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for free_workspace.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba
c778df1406 btrfs: compression: inline alloc_workspace
Replace indirect calls to alloc_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba
5907a9bb13 btrfs: compression: pass type to btrfs_get_workspace
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for alloc_workspace.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00