Commit graph

10410 commits

Author SHA1 Message Date
Johannes Thumshirn
37f00a6d2e btrfs: introduce btrfs_is_data_reloc_root
There are several places in our codebase where we check if a root is the
root of the data reloc tree and subsequent patches will introduce more.

Factor out the check into a small helper function instead of open coding
it multiple times.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Qu Wenruo
38d5e541dd btrfs: unexport repair_io_failure()
Function repair_io_failure() is no longer used out of extent_io.c since
commit 8b9b6f2554 ("btrfs: scrub: cleanup the remaining nodatasum
fixup code"), which removes the last external caller.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Filipe Manana
f6df27dd27 btrfs: do not commit delayed inode when logging a file in full sync mode
When logging a regular file in full sync mode, we are currently committing
its delayed inode item. This is to ensure that we never miss copying the
inode item, with its most up to date data, into the log tree.

However that is not necessary since commit e4545de5b0 ("Btrfs: fix fsync
data loss after append write"), because even if we don't find the leaf
with the inode item when looking for leaves that changed in the current
transaction, we end up logging the inode item later using the in-memory
content. In case we find the leaf containing the inode item, we already
end up using the in-memory inode for filling the inode item in the log
tree, and not the inode item that is in the fs/subvolume tree, as it
might be not up to date (copy_items() -> fill_inode_item()).

So don't commit the delayed inode item, which brings a couple of benefits:

1) Avoid writing the inode item to the fs/subvolume btree, saving time and
   reducing lock contention on the btree;

2) In case no other item for the inode was changed, added or deleted in
   the same leaf where the inode item is located, we ended up copying
   all the items in that leaf to the log tree - it's harmless from a
   functional point of view, but it wastes time and log tree space.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 10/10 and the following test results compare a branch with
the whole patch set applied versus a branch without any of the patches
applied.

The following script was used to test dbench with 8 and 16 jobs on a
machine with 12 cores, 64G of RAM, a NVME device and using a non-debug
kernel config (Debian's default):

  $ cat test.sh
  #!/bin/bash

  if [ $# -ne 1 ]; then
      echo "Use $0 NUM_JOBS"
      exit 1
  fi

  NUM_JOBS=$1

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 120 $NUM_JOBS

  umount $MNT

The results were the following:

8 jobs, before patchset:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4113896     0.009   238.665
 Close        3021699     0.001     0.590
 Rename        174215     0.082   238.733
 Unlink        830977     0.049   238.642
 Deltree           96     2.232     8.022
 Mkdir             48     0.003     0.005
 Qpathinfo    3729013     0.005     2.672
 Qfileinfo     653206     0.001     0.152
 Qfsinfo       683866     0.002     0.526
 Sfileinfo     335055     0.004     1.571
 Find         1441800     0.016     4.288
 WriteX       2049644     0.010     3.982
 ReadX        6449786     0.003     0.969
 LockX          13400     0.002     0.043
 UnlockX        13400     0.001     0.075
 Flush         288349     2.521   245.516

Throughput 1075.73 MB/sec  8 clients  8 procs  max_latency=245.520 ms

8 jobs, after patchset:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4154282     0.009   156.675
 Close        3051450     0.001     0.843
 Rename        175912     0.072     4.444
 Unlink        839067     0.048    66.050
 Deltree           96     2.131     5.979
 Mkdir             48     0.002     0.004
 Qpathinfo    3765575     0.005     3.079
 Qfileinfo     659582     0.001     0.099
 Qfsinfo       690474     0.002     0.155
 Sfileinfo     338366     0.004     1.419
 Find         1455816     0.016     3.423
 WriteX       2069538     0.010     4.328
 ReadX        6512429     0.003     0.840
 LockX          13530     0.002     0.078
 UnlockX        13530     0.001     0.051
 Flush         291158     2.500   163.468

Throughput 1105.45 MB/sec  8 clients  8 procs  max_latency=163.474 ms

+2.7% throughput, -40.1% max latency

16 jobs, before patchset:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5457602     0.033   337.098
 Close        4008979     0.002     2.018
 Rename        231051     0.323   254.054
 Unlink       1102209     0.202   337.243
 Deltree          160     6.521    31.720
 Mkdir             80     0.003     0.007
 Qpathinfo    4946147     0.014     6.988
 Qfileinfo     867440     0.001     1.642
 Qfsinfo       907081     0.003     1.821
 Sfileinfo     444433     0.005     2.053
 Find         1912506     0.067     7.854
 WriteX       2724852     0.018     7.428
 ReadX        8553883     0.003     2.059
 LockX          17770     0.003     0.350
 UnlockX        17770     0.002     0.627
 Flush         382533     2.810   353.691

Throughput 1413.09 MB/sec  16 clients  16 procs  max_latency=353.696 ms

16 jobs, after patchset:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5393156     0.034   303.181
 Close        3961986     0.002     1.502
 Rename        228359     0.320   253.379
 Unlink       1088920     0.206   303.409
 Deltree          160     6.419    30.088
 Mkdir             80     0.003     0.004
 Qpathinfo    4887967     0.015     7.722
 Qfileinfo     857408     0.001     1.651
 Qfsinfo       896343     0.002     2.147
 Sfileinfo     439317     0.005     4.298
 Find         1890018     0.073     8.347
 WriteX       2693356     0.018     6.373
 ReadX        8453485     0.003     3.836
 LockX          17562     0.003     0.486
 UnlockX        17562     0.002     0.635
 Flush         378023     2.802   315.904

Throughput 1454.46 MB/sec  16 clients  16 procs  max_latency=315.910 ms

+2.9% throughput, -11.3% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Filipe Manana
5328b2a7ff btrfs: avoid attempt to drop extents when logging inode for the first time
When logging an extent, in the fast fsync path, we always attempt do drop
or trim any existing extents with a range that match or overlap the range
of the extent we are about to log. We do that through a call to
btrfs_drop_extents().

However this is not needed when we are logging the inode for the first
time in the current transaction, since we have no inode items of the
inode in the log tree. Calling btrfs_drop_extents() does a deletion search
on the log tree, which is expensive when we have concurrent tasks
accessing the log tree because a deletion search always acquires a write
lock on the extent buffers at levels 2, 1 and 0, adding significant lock
contention, specially taking into account the height of a log tree rarely
(if ever) goes beyond 2 or 3, due to its short life.

So skip the call to btrfs_drop_extents() when the inode was not previously
logged in the current transaction.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 9/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Filipe Manana
a5c733a4b6 btrfs: avoid search for logged i_size when logging inode if possible
If we are logging that an inode exists and the inode was not logged
before, we can avoid searching in the log tree for the inode item since we
know it does not exists. That wastes time and adds more lock contention on
the extent buffers of the log tree when there are other tasks that are
logging other inodes.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 8/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Filipe Manana
4934a81502 btrfs: avoid expensive search when truncating inode items from the log
Whenever we are logging a file inode in full sync mode we call
btrfs_truncate_inode_items() to delete items of the inode we may have
previously logged.

That results in doing a btree search for deletion, which is expensive
because it always acquires write locks for extent buffers at levels 2, 1
and 0, and it balances any node that is less than half full. Acquiring
the write locks can block the task if the extent buffers are already
locked by another task or block other tasks attempting to lock them,
which is specially bad in case of log trees since they are small due to
their short life, with a root node at a level typically not greater than
level 2.

If we know that we are logging the inode for the first time in the current
transaction, we can skip the call to btrfs_truncate_inode_items(), avoiding
the deletion search. This change does that.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 7/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:01 +02:00
Filipe Manana
8a2b3da191 btrfs: add helper to truncate inode items when logging inode
Move the call to btrfs_truncate_inode_items(), and the surrounding retry
loop, into a local helper function. This avoids some repetition and avoids
making the next change a bit awkward due to a bit of too much indentation.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 6/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Filipe Manana
88e221cdac btrfs: avoid expensive search when dropping inode items from log
Whenever we are logging a directory inode, logging that an inode exists or
logging an inode that has changes in its references or xattrs, we attempt
to delete items of this inode we may have previously logged (through calls
to drop_objectid_items()).

That attempt does a btree search for deletion, which is expensive because
it always acquires write locks for extent buffers at levels 2, 1 and 0,
and it balances any node that is less than half full. Acquiring the write
locks can block the task if the extent buffers are already locked or block
other tasks attempting to lock them, which is specially bad in case of log
trees since they are small due to their short life, with a root node at a
level typically not greater than level 2.

If we know that we are logging the inode for the first time in the current
transaction, we can skip the search. This change does that.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 5/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Filipe Manana
130341be7f btrfs: always update the logged transaction when logging new names
When we are logging a new name for an inode, due to a link or rename
operation, if the inode has ancestor inodes that are new, created in the
current transaction, we need to log that these inodes exist. To ensure
that a subsequent explicit fsync on one of these ancestor inodes does
sync the log, we don't set the logged_trans field of these inodes.
This was done in commit 75b463d2b4 ("btrfs: do not commit logs and
transactions during link and rename operations"), to avoid syncing a
log after a rename or link operation.

In order to allow for future changes to do some optimizations, change
this behaviour to always update the logged_trans of any logged inode
and don't update the last_log_commit of the inode if we are logging
that it exists. This accomplishes that same objective with simpler
logic, allowing for some optimizations in the next patches.

So just do that simplification.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 4/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Filipe Manana
c48792c6ee btrfs: do not log new dentries when logging that a new name exists
When logging a new name for an inode, due to a link or rename operation,
we don't need to log all new dentries of the parent directories and their
subdirectories. We only want to log the names of the inode and that any
new parent directories exist. So in this case don't trigger logging of
the new dentries, that is only need when doing an explicit fsync on a
directory or on a file which requires logging its parent directories.

This avoids unnecessary work and reduces contention on the extent buffers
of a log tree.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 3/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Filipe Manana
289cffcb03 btrfs: remove no longer needed checks for NULL log context
Since commit 75b463d2b4 ("btrfs: do not commit logs and transactions
during link and rename operations"), we always pass a non-NULL log context
to btrfs_log_inode_parent() and therefore to all the functions that it
calls. So remove the checks we have all over the place that test for a
NULL log context, making the code shorter and easier to read, as well as
reducing the size of the generated code.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 2/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Filipe Manana
1e0860f3b3 btrfs: check if a log tree exists at inode_logged()
In case an inode was never logged since it was loaded from disk and was
modified in the current transaction (its ->last_trans matches the ID of
the current transaction), inode_logged() returns true even if there's no
existing log tree. In this case we can simply check if a log tree exists
and return false if it does not. This avoids a caller of inode_logged()
doing some unnecessary, but harmless, work.

For btrfs_log_new_name() it avoids it logging an inode in case it was
never logged since it was loaded from disk and there is currently no log
tree for the inode's root. For the remaining callers of inode_logged(),
btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log(), it has
no effect since they already check if a log tree exists through their
calls to join_running_log_trans().

So just add a check to inode_logged() to verify if a log tree exists, and
return false if it does not.

This patch is part of a patch set comprised of the following patches:

  btrfs: check if a log tree exists at inode_logged()
  btrfs: remove no longer needed checks for NULL log context
  btrfs: do not log new dentries when logging that a new name exists
  btrfs: always update the logged transaction when logging new names
  btrfs: avoid expensive search when dropping inode items from log
  btrfs: add helper to truncate inode items when logging inode
  btrfs: avoid expensive search when truncating inode items from the log
  btrfs: avoid search for logged i_size when logging inode if possible
  btrfs: avoid attempt to drop extents when logging inode for the first time
  btrfs: do not commit delayed inode when logging a file in full sync mode

This is patch 1/10 and test results are listed in the change log of the
last patch in the set.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Anand Jain
cdccc03a8a btrfs: remove stale comment about the btrfs_show_devname
There were few lockdep warnings because btrfs_show_devname() was using
device_list_mutex as recorded in the commits:

  0ccd05285e ("btrfs: fix a possible umount deadlock")
  779bf3fefa ("btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex")

And finally, commit 88c14590cd ("btrfs: use RCU in btrfs_show_devname
for device list traversal") removed the device_list_mutex from
btrfs_show_devname for performance reasons.

This patch removes a stale comment about the function
btrfs_show_devname and device_list_mutex.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Anand Jain
b7cb29e666 btrfs: update latest_dev when we create a sprout device
When we add a device to the seed filesystem (sprouting) it is a new
filesystem (and fsid) on the device added. Update the latest_dev so
that /proc/self/mounts shows the correct device.

Example:

  $ btrfstune -S1 /dev/vg/seed
  $ mount /dev/vg/seed /btrfs
  mount: /btrfs: WARNING: device write-protected, mounted read-only.

  $ cat /proc/self/mounts | grep btrfs
  /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0

  $ btrfs dev add -f /dev/vg/new /btrfs

Before:

  $ cat /proc/self/mounts | grep btrfs
  /dev/mapper/vg-seed /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0

After:

  $ cat /proc/self/mounts | grep btrfs
  /dev/mapper/vg-new /btrfs btrfs ro,relatime,space_cache,subvolid=5,subvol=/ 0 0

Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Anand Jain
6605fd2f39 btrfs: use latest_dev in btrfs_show_devname
The test case btrfs/238 reports the warning below:

 WARNING: CPU: 3 PID: 481 at fs/btrfs/super.c:2509 btrfs_show_devname+0x104/0x1e8 [btrfs]
 CPU: 2 PID: 1 Comm: systemd Tainted: G        W  O 5.14.0-rc1-custom #72
 Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
 Call trace:
   btrfs_show_devname+0x108/0x1b4 [btrfs]
   show_mountinfo+0x234/0x2c4
   m_show+0x28/0x34
   seq_read_iter+0x12c/0x3c4
   vfs_read+0x29c/0x2c8
   ksys_read+0x80/0xec
   __arm64_sys_read+0x28/0x34
   invoke_syscall+0x50/0xf8
   do_el0_svc+0x88/0x138
   el0_svc+0x2c/0x8c
   el0t_64_sync_handler+0x84/0xe4
   el0t_64_sync+0x198/0x19c

Reason:
While btrfs_prepare_sprout() moves the fs_devices::devices into
fs_devices::seed_list, the btrfs_show_devname() searches for the devices
and found none, leading to the warning as in above.

Fix:
latest_dev is updated according to the changes to the device list.
That means we could use the latest_dev->name to show the device name in
/proc/self/mounts, the pointer will be always valid as it's assigned
before the device is deleted from the list in remove or replace.
The RCU protection is sufficient as the device structure is freed after
synchronization.

Reported-by: Su Yue <l@damenly.su>
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Anand Jain
d24fa5c1da btrfs: convert latest_bdev type to btrfs_device and rename
In preparation to fix a bug in btrfs_show_devname().

Convert fs_devices::latest_bdev type from struct block_device to struct
btrfs_device and, rename the member to fs_devices::latest_dev.
So that btrfs_show_devname() can use fs_devices::latest_dev::name.

Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Naohiro Aota
7ae9bd1803 btrfs: zoned: finish relocating block group
We will no longer write to a relocating block group. So, we can finish it
now.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Naohiro Aota
be1a1d7a5d btrfs: zoned: finish fully written block group
If we have written to the zone capacity, the device automatically
deactivates the zone. Sync up block group side (the active BG list and
zone_is_active flag) with it.

We need to do it both on data BGs and metadata BGs. On data side, we add a
hook to btrfs_finish_ordered_io(). On metadata side, we use
end_extent_buffer_writeback().

To reduce excess lookup of a block group, we mark the last extent buffer in
a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for
data (ordered_extent), because the address may change due to
REQ_OP_ZONE_APPEND.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:00 +02:00
Naohiro Aota
a85f05e59b btrfs: zoned: avoid chunk allocation if active block group has enough space
The current extent allocator tries to allocate a new block group when the
existing block groups do not have enough space. On a ZNS device, a new
block group means a new active zone. If the number of active zones has
already reached the max_active_zones, activating a new zone needs to finish
an existing zone, leading to wasting the free space there.

So, instead, it should reuse the existing active block groups as much as
possible when we can't activate any other zones without sacrificing an
already activated block group.

While at it, I converted find_free_extent_update_loop() to check the
found_extent() case early and made the other conditions simpler.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
a12b0dc0aa btrfs: move ffe_ctl one level up
We are passing too many variables as it is from btrfs_reserve_extent() to
find_free_extent(). The next commit will add min_alloc_size to ffe_ctl, and
that means another pass-through argument. Take this opportunity to move
ffe_ctl one level up and drop the redundant arguments.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
eb66a010d5 btrfs: zoned: activate new block group
Activate new block group at btrfs_make_block_group(). We do not check the
return value. If failed, we can try again later at the actual extent
allocation phase.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
2e654e4bb9 btrfs: zoned: activate block group on allocation
Activate a block group when trying to allocate an extent from it. We check
read-only case and no space left case before trying to activate a block
group not to consume the number of active zones uselessly.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
68a384b5ab btrfs: zoned: load active zone info for block group
Load activeness of underlying zones of a block group. When underlying zones
are active, we add the block group to the fs_info->zone_active_bgs list.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
afba2bc036 btrfs: zoned: implement active zone tracking
Add zone_is_active flag to btrfs_block_group. This flag indicates the
underlying zones are all active. Such zone active block groups are tracked
by fs_info->active_bg_list.

btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying
device part. They set/clear the bitmap to indicate zone activeness and
count the number of zones we can activate left.

btrfs_zone_{activate,finish}() take responsibility for the logical part and
the list management. In addition, btrfs_zone_finish() wait for any writes
on it and send REQ_OP_ZONE_FINISH to the zone.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
dafc340dbd btrfs: zoned: introduce physical_map to btrfs_block_group
We will use a block group's physical location to track active zones and
finish fully written zones in the following commits. Since the zone
activation is done in the extent allocation context which already holding
the tree locks, we can't query the chunk tree for the physical locations.
So, copy the location info into a block group and use it for activation.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
ea6f8ddcde btrfs: zoned: load active zone information from devices
The ZNS specification defines a limit on the number of zones that can be in
the implicit open, explicit open or closed conditions. Any zone with such
condition is defined as an active zone and correspond to any zone that is
being written or that has been only partially written. If the maximum
number of active zones is reached, we must either reset or finish some
active zones before being able to chose other zones for storing data.

Load queue_max_active_zones() and track the number of active zones left on
the device.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
8376d9e1ed btrfs: zoned: finish superblock zone once no space left for new SB
If there is no more space left for a new superblock in a superblock zone,
then it is better to ZONE_FINISH the zone and frees up the active zone
count.

Since btrfs_advance_sb_log() can now issue REQ_OP_ZONE_FINISH, we also need
to convert it to return int for the error case.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
9658b72ef3 btrfs: zoned: locate superblock position using zone capacity
sb_write_pointer() returns the write position of next superblock. For READ,
we need a previous location. When the pointer is at the head, the previous
one is the last one of the other zone. Calculate the last one's position
from zone capacity.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
5daaf552d1 btrfs: zoned: consider zone as full when no more SB can be written
We cannot write beyond zone capacity. So, we should consider a zone as
"full" when the write pointer goes beyond capacity - the size of super
info.

Also, take this opportunity to replace a subtle duplicated code with a loop
and fix a typo in comment.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
d8da0e8567 btrfs: zoned: tweak reclaim threshold for zone capacity
With the introduction of zone capacity, the range [capacity, length] is
always zone unusable. Counting this region as a reclaim target will
cause reclaiming too early. Reclaim block groups based on bytes that can
be usable after resetting.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:59 +02:00
Naohiro Aota
98173255bd btrfs: zoned: calculate free space from zone capacity
Now that we introduced capacity in a block group, we need to calculate free
space using the capacity instead of the length. Thus, bytes we account
capacity - alloc_pointer as free, and account bytes [capacity, length] as
zone unusable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Naohiro Aota
c46c4247ab btrfs: zoned: move btrfs_free_excluded_extents out of btrfs_calc_zone_unusable
btrfs_free_excluded_extents() is not neccessary for
btrfs_calc_zone_unusable() and it makes btrfs_calc_zone_unusable()
difficult to reuse. Move it out and call btrfs_free_excluded_extents()
in proper context.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Naohiro Aota
8eae532be7 btrfs: zoned: load zone capacity information from devices
The ZNS specification introduces the concept of a Zone Capacity.  A zone
capacity is an additional per-zone attribute that indicates the number of
usable logical blocks within each zone, starting from the first logical
block of each zone. It is always smaller or equal to the zone size.

With the SINGLE profile, we can set a block group's "capacity" as the same
as the underlying zone's Zone Capacity. We will limit the allocation not
to exceed in a following commit.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Qu Wenruo
c22a3572cb btrfs: defrag: enable defrag for subpage case
With the new infrastructure which has taken subpage into consideration,
now we should be safe to allow defrag to work for subpage case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Qu Wenruo
c635757365 btrfs: defrag: remove the old infrastructure
Now the old infrastructure can all be removed, defrag

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Qu Wenruo
7b508037d4 btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()
The function defrag_one_cluster() is able to defrag one range well
enough, we only need to do preparation for it, including:

- Clamp and align the defrag range
- Exclude invalid cases
- Proper inode locking

The old infrastructures will not be removed in this patch, as it would
be too noisy to review.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:58 +02:00
Qu Wenruo
b18c3ab234 btrfs: defrag: introduce helper to defrag one cluster
This new helper, defrag_one_cluster(), will defrag one cluster (at most
256K):

- Collect all initial targets

- Kick in readahead when possible

- Call defrag_one_range() on each initial target
  With some extra range clamping.

- Update @sectors_defragged parameter

This involves one behavior change, the defragged sectors accounting is
no longer as accurate as old behavior, as the initial targets are not
consistent.

We can have new holes punched inside the initial target, and we will
skip such holes later.
But the defragged sectors accounting doesn't need to be that accurate
anyway, thus I don't want to pass those extra accounting burden into
defrag_one_range().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:57 +02:00
Qu Wenruo
e9eec72151 btrfs: defrag: introduce helper to defrag a range
A new helper, defrag_one_range(), is introduced to defrag one range.

This function will mostly prepare the needed pages and extent status for
defrag_one_locked_target().

As we can only have a consistent view of extent map with page and extent
bits locked, we need to re-check the range passed in to get a real
target list for defrag_one_locked_target().

Since defrag_collect_targets() will call defrag_lookup_extent() and lock
extent range, we also need to teach those two functions to skip extent
lock.  Thus new parameter, @locked, is introduced to skip extent lock if
the caller has already locked the range.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:57 +02:00
Qu Wenruo
22b398eeee btrfs: defrag: introduce helper to defrag a contiguous prepared range
A new helper, defrag_one_locked_target(), introduced to do the real part
of defrag.

The caller needs to ensure both page and extents bits are locked, and no
ordered extent exists for the range, and all writeback is finished.

The core defrag part is pretty straight-forward:

- Reserve space
- Set extent bits to defrag
- Update involved pages to be dirty

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:07:13 +02:00
Qu Wenruo
eb793cf857 btrfs: defrag: introduce helper to collect target file extents
Introduce a helper, defrag_collect_targets(), to collect all possible
targets to be defragged.

This function will not consider things like max_sectors_to_defrag, thus
caller should be responsible to ensure we don't exceed the limit.

This function will be the first stage of later defrag rework.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:06:53 +02:00
Qu Wenruo
5767b50c00 btrfs: defrag: factor out page preparation into a helper
In cluster_pages_for_defrag(), we have complex code block inside one
for() loop.

The code block is to prepare one page for defrag, this will ensure:

- The page is locked and set up properly.
- No ordered extent exists in the page range.
- The page is uptodate.

This behavior is pretty common and will be reused by later defrag
rework.

So factor out the code into its own helper, defrag_prepare_one_page(),
for later usage, and cleanup the code by a little.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:06:34 +02:00
Qu Wenruo
76068cae63 btrfs: defrag: replace hard coded PAGE_SIZE with sectorsize
When testing subpage defrag support, I always find some strange inode
nbytes error, after a lot of debugging, it turns out that
defrag_lookup_extent() is using PAGE_SIZE as size for
lookup_extent_mapping().

Since lookup_extent_mapping() is calling __lookup_extent_mapping() with
@strict == 1, this means any extent map smaller than one page will be
ignored, prevent subpage defrag to grab a correct extent map.

There are quite some PAGE_SIZE usage in ioctl.c, but most of them are
correct usages, and can be one of the following cases:

- ioctl structure size check
  We want ioctl structure to be contained inside one page.

- real page operations

The remaining cases in defrag_lookup_extent() and
check_defrag_in_cache() will be addressed in this patch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:06:15 +02:00
Qu Wenruo
cae7968680 btrfs: defrag: also check PagePrivate for subpage cases in cluster_pages_for_defrag()
In function cluster_pages_for_defrag() we have a window where we unlock
page, either start the ordered range or read the content from disk.

When we re-lock the page, we need to make sure it still has the correct
page->private for subpage.

Thus add the extra PagePrivate check here to handle subpage cases
properly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:05:18 +02:00
Qu Wenruo
1ccc2e8a86 btrfs: defrag: pass file_ra_state instead of file to btrfs_defrag_file()
Currently btrfs_defrag_file() accepts both "struct inode" and "struct
file" as parameter.  We can easily grab "struct inode" from "struct
file" using file_inode() helper.

The reason why we need "struct file" is just to re-use its f_ra.

Change this to pass "struct file_ra_state" parameter, so that it's more
clear what we really want.  Since we're here, also add some comments on
the function btrfs_defrag_file().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:04:39 +02:00
Anand Jain
a09f23c355 btrfs: rename and switch to bool btrfs_chunk_readonly
btrfs_chunk_readonly() checks if the given chunk is writeable. It
returns 1 for readonly, and 0 for writeable. So the return argument type
bool shall suffice instead of the current type int.

Also, rename btrfs_chunk_readonly() to btrfs_chunk_writeable() as we
check if the bg is writeable, and helps to keep the logic at the parent
function simpler to understand.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:03:57 +02:00
Sidong Yang
44bee215f7 btrfs: reflink: initialize return value to 0 in btrfs_extent_same()
Fix a warning reported by smatch that ret could be returned without
initialized.  The dedupe operations are supposed to to return 0 for a 0
length range but the caller does not pass olen == 0. To keep this
behaviour and also fix the warning initialize ret to 0.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:03:57 +02:00
Qu Wenruo
72a69cd030 btrfs: subpage: pack all subpage bitmaps into a larger bitmap
Currently we use u16 bitmap to make 4k sectorsize work for 64K page
size.

But this u16 bitmap is not large enough to contain larger page size like
128K, nor is space efficient for 16K page size.

To handle both cases, here we pack all subpage bitmaps into a larger
bitmap, now btrfs_subpage::bitmaps[] will be the ultimate bitmap for
subpage usage.

Each sub-bitmap will has its start bit number recorded in
btrfs_subpage_info::*_start, and its bitmap length will be recorded in
btrfs_subpage_info::bitmap_nr_bits.

All subpage bitmap operations will be converted from using direct u16
operations to bitmap operations, with above *_start calculated.

For 64K page size with 4K sectorsize, this should not cause much
difference.

While for 16K page size, we will only need 1 unsigned long (u32) to
store all the bitmaps, which saves quite some space.

Furthermore, this allows us to support larger page size like 128K and
258K.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:03:55 +02:00
Qu Wenruo
8481dd80ab btrfs: subpage: introduce btrfs_subpage_bitmap_info
Currently we use fixed size u16 bitmap for subpage bitmap.  This is fine
for 4K sectorsize with 64K page size.

But for 4K sectorsize and larger page size, the bitmap is too small,
while for smaller page size like 16K, u16 bitmaps waste too much space.

Here we introduce a new helper structure, btrfs_subpage_bitmap_info, to
record the proper bitmap size, and where each bitmap should start at.

By this, we can later compact all subpage bitmaps into one u32 bitmap.
This patch is the first step.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Qu Wenruo
651fb41927 btrfs: subpage: make btrfs_alloc_subpage() return btrfs_subpage directly
The existing calling convention of btrfs_alloc_subpage() is pretty
awful.  Change it to a more common pattern by returning struct
btrfs_subpage directly and let the caller to determine if the call
succeeded.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Qu Wenruo
fdf250db89 btrfs: subpage: only call btrfs_alloc_subpage() when sectorsize is smaller than PAGE_SIZE
There are two call sites of btrfs_alloc_subpage():

- btrfs_attach_subpage()
  We have ensured sectorsize is smaller than PAGE_SIZE

- alloc_extent_buffer()
  We call btrfs_alloc_subpage() unconditionally.

The alloc_extent_buffer() forces us to check the sectorsize size against
page size inside btrfs_alloc_subpage().

Since the function name, btrfs_alloc_subpage(), already indicates it
should only get called for subpage cases, do the check in
alloc_extent_buffer() and add an ASSERT() in btrfs_alloc_subpage().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Su Yue
9675ea8c9d btrfs: update comment for fs_devices::seed_list in btrfs_rm_device
Update it since commit 944d3f9fac ("btrfs: switch seed device to
list api") did conversion from fs_devices::seed to fs_devices::seed_list.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Anand Jain
991a3daeda btrfs: drop unnecessary ret in ioctl_quota_rescan_status
There is no need for the variable ret after d66105cfa873 ("btrfs:
allocate btrfs_ioctl_quota_rescan_args on stack"), remove it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Marcos Paulo de Souza
0e3dd5bce8 btrfs: send: simplify send_create_inode_if_needed
The out label is being overused, we can simply return if the condition
permits.

No functional changes.

Reviewed-by: Su Yue <l@damenly.su>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Nikolay Borisov
f6f39f7a0a btrfs: rename btrfs_alloc_chunk to btrfs_create_chunk
The user facing function used to allocate new chunks is
btrfs_chunk_alloc, unfortunately there is yet another similar sounding
function - btrfs_alloc_chunk. This creates confusion, especially since
the latter function can be considered "private" in the sense that it
implements the first stage of chunk creation and as such is called by
btrfs_chunk_alloc.

To avoid the awkwardness that comes with having similarly named but
distinctly different in their purpose function rename btrfs_alloc_chunk
to btrfs_create_chunk, given that the main purpose of this function is
to orchestrate the whole process of allocating a chunk - reserving space
into devices, deciding on characteristics of the stripe size and
creating the in-memory structures.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-25 21:17:16 +02:00
Andreas Gruenbacher
4fdccaa0d1 iomap: Add done_before argument to iomap_dio_rw
Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred.  When the request succeeds, we
report that done_before additional bytes were tranferred.  This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted.  The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-10-24 15:26:05 +02:00
Christoph Hellwig
1226dfff57 btrfs: use sync_blockdev
Use sync_blockdev instead of opencoding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20211019062530.2174626-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig
cda00eba02 btrfs: use bdev_nr_bytes instead of open coding it
Use the proper helper to read the block device size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:43:22 -06:00
Kees Cook
a2c5062f39 btrfs: Use memset_startat() to clear end of struct
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.

Use memset_startat() so memset() doesn't get confused about writing
beyond the destination member that is intended to be the starting point
of zeroing through the end of the struct.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2021-10-18 12:28:52 -07:00
Andreas Gruenbacher
a6294593e8 iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in, similar to copy_to_user, instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted in
as well as new users that are happy if any pages can be faulted in.

Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure this change doesn't silently break things.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-18 16:35:06 +02:00
Andreas Gruenbacher
bb523b406c gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in, similar to copy_to_user, instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in.  This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in.

Rename the functions to fault_in_{readable,writeable} to make sure
this change doesn't silently break things.

Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-18 16:33:03 +02:00
Christoph Hellwig
3e08773c38 block: switch polling to be bio based
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

 - the cookie construction can made entirely private in blk-mq.c
 - the caller does not need to remember the request_queue and cookie
   separately and thus sidesteps their lifetime issues
 - keeping the device and the cookie inside the bio allows to trivially
   support polling BIOs remapping by stacking drivers
 - a lot of code to propagate the cookie back up the submission path can
   be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
e41d12f539 mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
break the include chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
348332e000 mm: don't include <linux/blk-cgroup.h> in <linux/writeback.h>
blk-cgroup.h pulls in blkdev.h and thus pretty much all the block
headers.  Break this dependency chain by turning wbc_blkcg_css into a
macro and dropping the blk-cgroup.h include.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Linus Torvalds
1986c10acc for-5.15-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmFkq/oACgkQxWXV+ddt
 WDs10g//Qx27foBu0U3ovvsla0t8GgcqgzUyOx3zxed0MbOEQCtK6kqRHQ/I+9ap
 1Ec5y4qQqBwfp1NKlYdU/EiKBQIYbJO/nYhVIrFI/EZL/7qJTwyjYjrOjG9zIMvy
 2ekxuF/XVnM6p3hyRcuWMCxsossuK4XIkb0bSZrwk/nFA6nt+gbXR1oE94JitM8p
 0pwjvSVqpdTmOAIU5+oQldqL/By7un/rv+o6OTD9sJqTdQ1UMlHVDaa9mD8aCsYk
 XIiCYfkyo9rlbSAB5wmWuiAhske2xh7IXSr4l9mKxGOA0egbQAgmS1Zw3+Km7vFM
 t+ji/4rTFPFd2yv/sLCEnMinuwvBr3mnEh6pDHR76RNrI4CoK/GHmZSf7XyqzV8W
 QOftznNA9/nJInTULdhCDvNxbKhKKb+xeSP1L4uytnWc5am+WKOPLNkfczJUh3sq
 WUORpaUxByDol6BMsdQJqPVJ7CH5YI8lQzuQFoUTXDCgeQUBE2wE1s3q+5Ma+dNZ
 mamkfQim2R42nPk7RSQlFBeIyDBVBXWfSNvXNovrPFJyRmZqRWzh0nb3PS9VNnUy
 6oCOCIT7XlM4Jwh4ZR21OT66RNQQ/2sLUOU/4838TOOdn00UVBrFObHQ+ll8rq74
 Va9j0atj6iIn9c8lDQkqTek0pMDcmVGzb2MV6JA4BCbCL/lcGk8=
 =u3qV
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more error handling fixes, stemming from code inspection, error
  injection or fuzzing"

* tag 'for-5.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix abort logic in btrfs_replace_file_extents
  btrfs: check for error when looking up inode during dir entry replay
  btrfs: unify lookup return value when dir entry is missing
  btrfs: deal with errors when adding inode reference during log replay
  btrfs: deal with errors when replaying dir entry during log replay
  btrfs: deal with errors when checking if a dir entry exists during log replay
  btrfs: update refs for any root except tree log roots
  btrfs: unlock newly allocated extent buffer after error
2021-10-11 16:48:19 -07:00
Josef Bacik
4afb912f43 btrfs: fix abort logic in btrfs_replace_file_extents
Error injection testing uncovered a case where we'd end up with a
corrupt file system with a missing extent in the middle of a file.  This
occurs because the if statement to decide if we should abort is wrong.

The only way we would abort in this case is if we got a ret !=
-EOPNOTSUPP and we called from the file clone code.  However the
prealloc code uses this path too.  Instead we need to abort if there is
an error, and the only error we _don't_ abort on is -EOPNOTSUPP and only
if we came from the clone file code.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:08:06 +02:00
Filipe Manana
cfd312695b btrfs: check for error when looking up inode during dir entry replay
At replay_one_name(), we are treating any error from btrfs_lookup_inode()
as if the inode does not exists. Fix this by checking for an error and
returning it to the caller.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:06:34 +02:00
Filipe Manana
8dcbc26194 btrfs: unify lookup return value when dir entry is missing
btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() lookup for dir
entries and both are used during log replay or when updating a log tree
during an unlink.

However when the dir item does not exists, btrfs_lookup_dir_item() returns
NULL while btrfs_lookup_dir_index_item() returns PTR_ERR(-ENOENT), and if
the dir item exists but there is no matching entry for a given name or
index, both return NULL. This makes the call sites during log replay to
be more verbose than necessary and it makes it easy to miss this slight
difference. Since we don't need to distinguish between those two cases,
make btrfs_lookup_dir_index_item() always return NULL when there is no
matching directory entry - either because there isn't any dir entry or
because there is one but it does not match the given name and index.

Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to
'index' since it is supposed to match an index number, and the name
'objectid' is not very good because it can easily be confused with an
inode number (like the inode number a dir entry points to).

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:06:32 +02:00
Filipe Manana
52db77791f btrfs: deal with errors when adding inode reference during log replay
At __inode_add_ref(), we treating any error returned from
btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning
that there is no existing directory entry in the fs/subvolume tree.
This is not correct since we can get errors such as, for example, -EIO
when reading extent buffers while searching the fs/subvolume's btree.

So fix that and return the error to the caller when it is not -ENOENT.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:06:30 +02:00
Filipe Manana
e15ac64137 btrfs: deal with errors when replaying dir entry during log replay
At replay_one_one(), we are treating any error returned from
btrfs_lookup_dir_item() or from btrfs_lookup_dir_index_item() as meaning
that there is no existing directory entry in the fs/subvolume tree.
This is not correct since we can get errors such as, for example, -EIO
when reading extent buffers while searching the fs/subvolume's btree.

So fix that and return the error to the caller when it is not -ENOENT.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:06:23 +02:00
Filipe Manana
77a5b9e3d1 btrfs: deal with errors when checking if a dir entry exists during log replay
Currently inode_in_dir() ignores errors returned from
btrfs_lookup_dir_index_item() and from btrfs_lookup_dir_item(), treating
any errors as if the directory entry does not exists in the fs/subvolume
tree, which is obviously not correct, as we can get errors such as -EIO
when reading extent buffers while searching the fs/subvolume's tree.

Fix that by making inode_in_dir() return the errors and making its only
caller, add_inode_ref(), deal with returned errors as well.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:06:22 +02:00
Josef Bacik
d175209be0 btrfs: update refs for any root except tree log roots
I hit a stuck relocation on btrfs/061 during my overnight testing.  This
turned out to be because we had left over extent entries in our extent
root for a data reloc inode that no longer existed.  This happened
because in btrfs_drop_extents() we only update refs if we have SHAREABLE
set or we are the tree_root.  This regression was introduced by
aeb935a455 ("btrfs: don't set SHAREABLE flag for data reloc tree")
where we stopped setting SHAREABLE for the data reloc tree.

The problem here is we actually do want to update extent references for
data extents in the data reloc tree, in fact we only don't want to
update extent references if the file extents are in the log tree.
Update this check to only skip updating references in the case of the
log tree.

This is relatively rare, because you have to be running scrub at the
same time, which is what btrfs/061 does.  The data reloc inode has its
extents pre-allocated, and then we copy the extent into the
pre-allocated chunks.  We theoretically should never be calling
btrfs_drop_extents() on a data reloc inode.  The exception of course is
with scrub, if our pre-allocated extent falls inside of the block group
we are scrubbing, then the block group will be marked read only and we
will be forced to cow that extent.  This means we will call
btrfs_drop_extents() on that range when we COW that file extent.

This isn't really problematic if we do this, the data reloc inode
requires that our extent lengths match exactly with the extent we are
copying, thankfully we validate the extent is correct with
get_new_location(), so if we happen to COW only part of the extent we
won't link it in when we do the relocation, so we are safe from any
other shenanigans that arise because of this interaction with scrub.

Fixes: aeb935a455 ("btrfs: don't set SHAREABLE flag for data reloc tree")
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:04:36 +02:00
Qu Wenruo
19ea40dddf btrfs: unlock newly allocated extent buffer after error
[BUG]
There is a bug report that injected ENOMEM error could leave a tree
block locked while we return to user-space:

  BTRFS info (device loop0): enabling ssd optimizations
  FAULT_INJECTION: forcing a failure.
  name failslab, interval 1, probability 0, space 0, times 0
  CPU: 0 PID: 7579 Comm: syz-executor Not tainted 5.15.0-rc1 #16
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
  rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  Call Trace:
   __dump_stack lib/dump_stack.c:88 [inline]
   dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
   fail_dump lib/fault-inject.c:52 [inline]
   should_fail+0x13c/0x160 lib/fault-inject.c:146
   should_failslab+0x5/0x10 mm/slab_common.c:1328
   slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494
   slab_alloc_node mm/slub.c:3120 [inline]
   slab_alloc mm/slub.c:3214 [inline]
   kmem_cache_alloc+0x44/0x280 mm/slub.c:3219
   btrfs_alloc_delayed_extent_op fs/btrfs/delayed-ref.h:299 [inline]
   btrfs_alloc_tree_block+0x38c/0x670 fs/btrfs/extent-tree.c:4833
   __btrfs_cow_block+0x16f/0x7d0 fs/btrfs/ctree.c:415
   btrfs_cow_block+0x12a/0x300 fs/btrfs/ctree.c:570
   btrfs_search_slot+0x6b0/0xee0 fs/btrfs/ctree.c:1768
   btrfs_insert_empty_items+0x80/0xf0 fs/btrfs/ctree.c:3905
   btrfs_new_inode+0x311/0xa60 fs/btrfs/inode.c:6530
   btrfs_create+0x12b/0x270 fs/btrfs/inode.c:6783
   lookup_open+0x660/0x780 fs/namei.c:3282
   open_last_lookups fs/namei.c:3352 [inline]
   path_openat+0x465/0xe20 fs/namei.c:3557
   do_filp_open+0xe3/0x170 fs/namei.c:3588
   do_sys_openat2+0x357/0x4a0 fs/open.c:1200
   do_sys_open+0x87/0xd0 fs/open.c:1216
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x34/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x46ae99
  Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48
  89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
  01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007f46711b9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
  RAX: ffffffffffffffda RBX: 000000000078c0a0 RCX: 000000000046ae99
  RDX: 0000000000000000 RSI: 00000000000000a1 RDI: 0000000020005800
  RBP: 00007f46711b9c80 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000017
  R13: 0000000000000000 R14: 000000000078c0a0 R15: 00007ffc129da6e0

  ================================================
  WARNING: lock held when returning to user space!
  5.15.0-rc1 #16 Not tainted
  ------------------------------------------------
  syz-executor/7579 is leaving the kernel with locks still held!
  1 lock held by syz-executor/7579:
   #0: ffff888104b73da8 (btrfs-tree-01/1){+.+.}-{3:3}, at:
  __btrfs_tree_lock+0x2e/0x1a0 fs/btrfs/locking.c:112

[CAUSE]
In btrfs_alloc_tree_block(), after btrfs_init_new_buffer(), the new
extent buffer @buf is locked, but if later operations like adding
delayed tree ref fail, we just free @buf without unlocking it,
resulting above warning.

[FIX]
Unlock @buf in out_free_buf: label.

Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsZ9O6Zr0KK1yGn=1rQi6Crh1yeCRdTSBxx9R99L4xdn-Q@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-07 22:04:20 +02:00
Linus Torvalds
f9e36107ec for-5.15-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmFM79wACgkQxWXV+ddt
 WDtdZQ/+K7NNEutg4JEH7n2KiXxwj8P23NwVK66a+XwH6/ZBe9xz5TQpnTJQ+D13
 +3mhthTJG7Wbcv+FlUVbfTSp5q8m2IH7CKox43o4JCZEEGtFfRPHrBGHLlGKMk3P
 ap2TZ3rvo0Sb21rx978HCQY824wdJvhv0SmWSScmvWzlTQEKaJHz1OFJgFxhUsMp
 Cy9y7mtIy+Ei4qJglU88iFNXhNL6YvwqXxDFY5LwN9rlCaV+rLk476aPfIBvvyf8
 4f34FHJOe1w9Jlk3KfydIwWefRBbq2dm0zNqNrMHNjl8zXbvfn8+ETOvf54HbjIz
 GGgKiZlBgNh2Na+p0SLoloEvBUdD5lSUCXis8099oUWZ+MporIwsyy4jAvtAeWR/
 QxBkZyxvTNFlXLamSo6oS58K9BNuxFYO7nLGSXQFEoYvb8/fu18rRt/A/rmNS8TU
 2vxpYacNKbggoULiGDzB74JY7MHdHRcMcAhmfDeG1bvNESPHfyLnpHfWamBVoUO6
 0eQOr78f1UpBlqJAGAGtfBefN1kMDnORyX0npGkGLFrKYiZbMgsxdjkNhiHnsufl
 9gsNVJ6baCeB1d5qS2vpZXeOLw0ln5iYZa5Yqz0eh/yc/9Wlj/YCsKRuAbaPMR1i
 i2ppHo3/na4K6L0EgSi6SU3xaUT+4LLzEEcBlJJuWZEwUTeYwiM=
 =VJFC
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - regression fix for leak of transaction handle after verity rollback
   failure

 - properly reset device last error between mounts

 - improve one error handling case when checksumming bios

 - fixup confusing displayed size of space info free space

* tag 'for-5.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: prevent __btrfs_dump_space_info() to underflow its free space
  btrfs: fix mount failure due to past and transient device flush error
  btrfs: fix transaction handle leak after verity rollback failure
  btrfs: replace BUG_ON() in btrfs_csum_one_bio() with proper error handling
2021-09-23 14:39:41 -07:00
Qu Wenruo
0619b79014 btrfs: prevent __btrfs_dump_space_info() to underflow its free space
It's not uncommon where __btrfs_dump_space_info() gets called
under over-commit situations.

In that case free space would underflow as total allocated space is not
enough to handle all the over-committed space.

Such underflow values can sometimes cause confusion for users enabled
enospc_debug mount option, and takes some seconds for developers to
convert the underflow value to signed result.

Just output the free space as s64 to avoid such problem.

Reported-by: Eli V <eliventer@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAJtFHUSy4zgyhf-4d9T+KdJp9w=UgzC2A0V=VtmaeEpcGgm1-Q@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-17 19:29:54 +02:00
Filipe Manana
6b225baaba btrfs: fix mount failure due to past and transient device flush error
When we get an error flushing one device, during a super block commit, we
record the error in the device structure, in the field 'last_flush_error'.
This is used to later check if we should error out the super block commit,
depending on whether the number of flush errors is greater than or equals
to the maximum tolerated device failures for a raid profile.

However if we get a transient device flush error, unmount the filesystem
and later try to mount it, we can fail the mount because we treat that
past error as critical and consider the device is missing. Even if it's
very likely that the error will happen again, as it's probably due to a
hardware related problem, there may be cases where the error might not
happen again. One example is during testing, and a test case like the
new generic/648 from fstests always triggers this. The test cases
generic/019 and generic/475 also trigger this scenario, but very
sporadically.

When this happens we get an error like this:

  $ mount /dev/sdc /mnt
  mount: /mnt wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error.

  $ dmesg
  (...)
  [12918.886926] BTRFS warning (device sdc): chunk 13631488 missing 1 devices, max tolerance is 0 for writable mount
  [12918.888293] BTRFS warning (device sdc): writable mount is not allowed due to too many missing devices
  [12918.890853] BTRFS error (device sdc): open_ctree failed

The failure happens because when btrfs_check_rw_degradable() is called at
mount time, or at remount from RO to RW time, is sees a non zero value in
a device's ->last_flush_error attribute, and therefore considers that the
device is 'missing'.

Fix this by setting a device's ->last_flush_error to zero when we close a
device, making sure the error is not seen on the next mount attempt. We
only need to track flush errors during the current mount, so that we never
commit a super block if such errors happened.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-17 19:29:45 +02:00
Filipe Manana
acbee9aff8 btrfs: fix transaction handle leak after verity rollback failure
During a verity rollback, if we fail to update the inode or delete the
orphan, we abort the transaction and return without releasing our
transaction handle. Fix that by releasing the handle.

Fixes: 146054090b ("btrfs: initial fsverity support")
Fixes: 705242538f ("btrfs: verity metadata orphan items")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-17 19:29:41 +02:00
Qu Wenruo
bbc9a6eb5e btrfs: replace BUG_ON() in btrfs_csum_one_bio() with proper error handling
There is a BUG_ON() in btrfs_csum_one_bio() to catch code logic error.
It has indeed caught several bugs during subpage development.
But the BUG_ON() itself will bring down the whole system which is
an overkill.

Replace it with a WARN() and exit gracefully, so that it won't crash the
whole system while we can still catch the code logic error.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-17 19:29:38 +02:00
Linus Torvalds
8dde20867c for-5.15-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmE6BpMACgkQxWXV+ddt
 WDvXhA//aaeKQIiVbiCnmMKFVX08wU8/pUXf65TJIaaTd0KE0QMu/aW6ibOpC6QI
 auf7wTiErHoJM7A22nL+Eoic7shlPueBktt3YcmdyQ/3ZFR6Wr7Td/cby0FvTOJU
 m0bjLMLp3rWSpnbMMUlOt8VSOcA892jnp7MHVtRYGfmfslwE5iTRtnPjmVobinm7
 dfKxCXUgMG9NWINIJobn70GQsZCXipa1A+MdbkdIPyjbM+tgR0EXZBrSaEcgMVpV
 dWnwTphx0io/tsgt4ZVQzGaCWtesBAe4yhaJJK92eFMTOKlYB/8y5P31N9wBL9Uj
 AOn0ke2Uc/weah50W7AhxeU3nCSGUAl9DbGrovKEfP/p0T9NJC/l3P1gwqpeGuld
 IbrBNFGVm3Noo2ZSoZU55P17gnTFHBMnXyVsbaoEldcsBv39D8K+tZ9F2vFaAV3e
 VayZiUuw/PhEcucYCQKdUCwFqFjJJfNnYpNtMSY3aCHeTOjphrP2sWBxKNAkWChB
 n4O5IFBm5e8YjBVNItZrlXE9KtE+JuwGSbNihhQQ/wy/M1sxB76DpaKnCLgjQmF6
 peBZktTRr8X7aRs1BGQKrU7Yzq7oR1psYadUhGIrrWp/qS4UCXkvYnkMQ0FInyQH
 pYQNHTDE4PSECzEhQAj9syeVE3lnGMGIWylmniamiuDsQcvaydM=
 =RQSe
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix max_inline mount option limit on 64k page system

 - lockdep fixes:
     - update bdev time in a safer way
     - move bdev put outside of sb write section when removing device
     - fix possible deadlock when mounting seed/sprout filesystem

 - zoned mode: fix split extent accounting

 - minor include fixup

* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix double counting of split ordered extent
  btrfs: fix lockdep warning while mounting sprout fs
  btrfs: delay blkdev_put until after the device remove
  btrfs: update the bdev time directly when closing
  btrfs: use correct header for div_u64 in misc.h
  btrfs: fix upper limit for max_inline for page size 64K
2021-09-09 16:09:56 -07:00
Naohiro Aota
f79645df80 btrfs: zoned: fix double counting of split ordered extent
btrfs_add_ordered_extent_*() add num_bytes to fs_info->ordered_bytes.
Then, splitting an ordered extent will call btrfs_add_ordered_extent_*()
again for split extents, leading to double counting of the region of
a split extent. These leaked bytes are finally reported at unmount time
as follow:

  BTRFS info (device dm-1): at unmount dio bytes count 364544

Fix the double counting by subtracting split extent's size from
fs_info->ordered_bytes.

Fixes: d22002fd37 ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 14:30:41 +02:00
Anand Jain
c124706900 btrfs: fix lockdep warning while mounting sprout fs
Following test case reproduces lockdep warning.

  Test case:

  $ mkfs.btrfs -f <dev1>
  $ btrfstune -S 1 <dev1>
  $ mount <dev1> <mnt>
  $ btrfs device add <dev2> <mnt> -f
  $ umount <mnt>
  $ mount <dev2> <mnt>
  $ umount <mnt>

The warning claims a possible ABBA deadlock between the threads
initiated by [#1] btrfs device add and [#0] the mount.

  [ 540.743122] WARNING: possible circular locking dependency detected
  [ 540.743129] 5.11.0-rc7+ #5 Not tainted
  [ 540.743135] ------------------------------------------------------
  [ 540.743142] mount/2515 is trying to acquire lock:
  [ 540.743149] ffffa0c5544c2ce0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: clone_fs_devices+0x6d/0x210 [btrfs]
  [ 540.743458] but task is already holding lock:
  [ 540.743461] ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
  [ 540.743541] which lock already depends on the new lock.
  [ 540.743543] the existing dependency chain (in reverse order) is:

  [ 540.743546] -> #1 (btrfs-chunk-00){++++}-{4:4}:
  [ 540.743566] down_read_nested+0x48/0x2b0
  [ 540.743585] __btrfs_tree_read_lock+0x32/0x200 [btrfs]
  [ 540.743650] btrfs_read_lock_root_node+0x70/0x200 [btrfs]
  [ 540.743733] btrfs_search_slot+0x6c6/0xe00 [btrfs]
  [ 540.743785] btrfs_update_device+0x83/0x260 [btrfs]
  [ 540.743849] btrfs_finish_chunk_alloc+0x13f/0x660 [btrfs] <--- device_list_mutex
  [ 540.743911] btrfs_create_pending_block_groups+0x18d/0x3f0 [btrfs]
  [ 540.743982] btrfs_commit_transaction+0x86/0x1260 [btrfs]
  [ 540.744037] btrfs_init_new_device+0x1600/0x1dd0 [btrfs]
  [ 540.744101] btrfs_ioctl+0x1c77/0x24c0 [btrfs]
  [ 540.744166] __x64_sys_ioctl+0xe4/0x140
  [ 540.744170] do_syscall_64+0x4b/0x80
  [ 540.744174] entry_SYSCALL_64_after_hwframe+0x44/0xa9

  [ 540.744180] -> #0 (&fs_devs->device_list_mutex){+.+.}-{4:4}:
  [ 540.744184] __lock_acquire+0x155f/0x2360
  [ 540.744188] lock_acquire+0x10b/0x5c0
  [ 540.744190] __mutex_lock+0xb1/0xf80
  [ 540.744193] mutex_lock_nested+0x27/0x30
  [ 540.744196] clone_fs_devices+0x6d/0x210 [btrfs]
  [ 540.744270] btrfs_read_chunk_tree+0x3c7/0xbb0 [btrfs]
  [ 540.744336] open_ctree+0xf6e/0x2074 [btrfs]
  [ 540.744406] btrfs_mount_root.cold.72+0x16/0x127 [btrfs]
  [ 540.744472] legacy_get_tree+0x38/0x90
  [ 540.744475] vfs_get_tree+0x30/0x140
  [ 540.744478] fc_mount+0x16/0x60
  [ 540.744482] vfs_kern_mount+0x91/0x100
  [ 540.744484] btrfs_mount+0x1e6/0x670 [btrfs]
  [ 540.744536] legacy_get_tree+0x38/0x90
  [ 540.744537] vfs_get_tree+0x30/0x140
  [ 540.744539] path_mount+0x8d8/0x1070
  [ 540.744541] do_mount+0x8d/0xc0
  [ 540.744543] __x64_sys_mount+0x125/0x160
  [ 540.744545] do_syscall_64+0x4b/0x80
  [ 540.744547] entry_SYSCALL_64_after_hwframe+0x44/0xa9

  [ 540.744551] other info that might help us debug this:
  [ 540.744552] Possible unsafe locking scenario:

  [ 540.744553] CPU0 				CPU1
  [ 540.744554] ---- 				----
  [ 540.744555] lock(btrfs-chunk-00);
  [ 540.744557] 					lock(&fs_devs->device_list_mutex);
  [ 540.744560] 					lock(btrfs-chunk-00);
  [ 540.744562] lock(&fs_devs->device_list_mutex);
  [ 540.744564]
   *** DEADLOCK ***

  [ 540.744565] 3 locks held by mount/2515:
  [ 540.744567] #0: ffffa0c56bf7a0e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.isra.16+0xdf/0x450
  [ 540.744574] #1: ffffffffc05a9628 (uuid_mutex){+.+.}-{4:4}, at: btrfs_read_chunk_tree+0x63/0xbb0 [btrfs]
  [ 540.744640] #2: ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
  [ 540.744708]
   stack backtrace:
  [ 540.744712] CPU: 2 PID: 2515 Comm: mount Not tainted 5.11.0-rc7+ #5

But the device_list_mutex in clone_fs_devices() is redundant, as
explained below.  Two threads [1]  and [2] (below) could lead to
clone_fs_device().

  [1]
  open_ctree <== mount sprout fs
   btrfs_read_chunk_tree()
    mutex_lock(&uuid_mutex) <== global lock
    read_one_dev()
     open_seed_devices()
      clone_fs_devices() <== seed fs_devices
       mutex_lock(&orig->device_list_mutex) <== seed fs_devices

  [2]
  btrfs_init_new_device() <== sprouting
   mutex_lock(&uuid_mutex); <== global lock
   btrfs_prepare_sprout()
     lockdep_assert_held(&uuid_mutex)
     clone_fs_devices(seed_fs_device) <== seed fs_devices

Both of these threads hold uuid_mutex which is sufficient to protect
getting the seed device(s) freed while we are trying to clone it for
sprouting [2] or mounting a sprout [1] (as above). A mounted seed device
can not free/write/replace because it is read-only. An unmounted seed
device can be freed by btrfs_free_stale_devices(), but it needs
uuid_mutex.  So this patch removes the unnecessary device_list_mutex in
clone_fs_devices().  And adds a lockdep_assert_held(&uuid_mutex) in
clone_fs_devices().

Reported-by: Su Yue <l@damenly.su>
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 14:30:05 +02:00
Josef Bacik
3fa421dedb btrfs: delay blkdev_put until after the device remove
When removing the device we call blkdev_put() on the device once we've
removed it, and because we have an EXCL open we need to take the
->open_mutex on the block device to clean it up.  Unfortunately during
device remove we are holding the sb writers lock, which results in the
following lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #407 Not tainted
------------------------------------------------------
losetup/11595 is trying to acquire lock:
ffff973ac35dd138 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0

but task is already holding lock:
ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7d/0x750
       lo_open+0x28/0x60 [loop]
       blkdev_get_whole+0x25/0xf0
       blkdev_get_by_dev.part.0+0x168/0x3c0
       blkdev_open+0xd2/0xe0
       do_dentry_open+0x161/0x390
       path_openat+0x3cc/0xa20
       do_filp_open+0x96/0x120
       do_sys_openat2+0x7b/0x130
       __x64_sys_openat+0x46/0x70
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&disk->open_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7d/0x750
       blkdev_put+0x3a/0x220
       btrfs_rm_device.cold+0x62/0xe5
       btrfs_ioctl+0x2a31/0x2e70
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#12){.+.+}-{0:0}:
       lo_write_bvec+0xc2/0x240 [loop]
       loop_process_work+0x238/0xd00 [loop]
       process_one_work+0x26b/0x560
       worker_thread+0x55/0x3c0
       kthread+0x140/0x160
       ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
       process_one_work+0x245/0x560
       worker_thread+0x55/0x3c0
       kthread+0x140/0x160
       ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       flush_workqueue+0x91/0x5e0
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
  (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&lo->lo_mutex);
                               lock(&disk->open_mutex);
                               lock(&lo->lo_mutex);
  lock((wq_completion)loop0);

 *** DEADLOCK ***

1 lock held by losetup/11595:
 #0: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

stack backtrace:
CPU: 0 PID: 11595 Comm: losetup Not tainted 5.14.0-rc2+ #407
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
 dump_stack_lvl+0x57/0x72
 check_noncircular+0xcf/0xf0
 ? stack_trace_save+0x3b/0x50
 __lock_acquire+0x10ea/0x1d90
 lock_acquire+0xb5/0x2b0
 ? flush_workqueue+0x67/0x5e0
 ? lockdep_init_map_type+0x47/0x220
 flush_workqueue+0x91/0x5e0
 ? flush_workqueue+0x67/0x5e0
 ? verify_cpu+0xf0/0x100
 drain_workqueue+0xa0/0x110
 destroy_workqueue+0x36/0x250
 __loop_clr_fd+0x9a/0x660 [loop]
 ? blkdev_ioctl+0x8d/0x2a0
 block_ioctl+0x3f/0x50
 __x64_sys_ioctl+0x80/0xb0
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fc21255d4cb

So instead save the bdev and do the put once we've dropped the sb
writers lock in order to avoid the lockdep recursion.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 14:29:59 +02:00
Josef Bacik
8f96a5bfa1 btrfs: update the bdev time directly when closing
We update the ctime/mtime of a block device when we remove it so that
blkid knows the device changed.  However we do this by re-opening the
block device and calling filp_update_time.  This is more correct because
it'll call the inode->i_op->update_time if it exists, but the block dev
inodes do not do this.  Instead call generic_update_time() on the
bd_inode in order to avoid the blkdev_open path and get rid of the
following lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.14.0-rc2+ #406 Not tainted
------------------------------------------------------
losetup/11596 is trying to acquire lock:
ffff939640d2f538 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0

but task is already holding lock:
ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #4 (&lo->lo_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7d/0x750
       lo_open+0x28/0x60 [loop]
       blkdev_get_whole+0x25/0xf0
       blkdev_get_by_dev.part.0+0x168/0x3c0
       blkdev_open+0xd2/0xe0
       do_dentry_open+0x161/0x390
       path_openat+0x3cc/0xa20
       do_filp_open+0x96/0x120
       do_sys_openat2+0x7b/0x130
       __x64_sys_openat+0x46/0x70
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #3 (&disk->open_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7d/0x750
       blkdev_get_by_dev.part.0+0x56/0x3c0
       blkdev_open+0xd2/0xe0
       do_dentry_open+0x161/0x390
       path_openat+0x3cc/0xa20
       do_filp_open+0x96/0x120
       file_open_name+0xc7/0x170
       filp_open+0x2c/0x50
       btrfs_scratch_superblocks.part.0+0x10f/0x170
       btrfs_rm_device.cold+0xe8/0xed
       btrfs_ioctl+0x2a31/0x2e70
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

-> #2 (sb_writers#12){.+.+}-{0:0}:
       lo_write_bvec+0xc2/0x240 [loop]
       loop_process_work+0x238/0xd00 [loop]
       process_one_work+0x26b/0x560
       worker_thread+0x55/0x3c0
       kthread+0x140/0x160
       ret_from_fork+0x1f/0x30

-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
       process_one_work+0x245/0x560
       worker_thread+0x55/0x3c0
       kthread+0x140/0x160
       ret_from_fork+0x1f/0x30

-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       flush_workqueue+0x91/0x5e0
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

other info that might help us debug this:

Chain exists of:
  (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&lo->lo_mutex);
                               lock(&disk->open_mutex);
                               lock(&lo->lo_mutex);
  lock((wq_completion)loop0);

 *** DEADLOCK ***

1 lock held by losetup/11596:
 #0: ffff939655510c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]

stack backtrace:
CPU: 1 PID: 11596 Comm: losetup Not tainted 5.14.0-rc2+ #406
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
 dump_stack_lvl+0x57/0x72
 check_noncircular+0xcf/0xf0
 ? stack_trace_save+0x3b/0x50
 __lock_acquire+0x10ea/0x1d90
 lock_acquire+0xb5/0x2b0
 ? flush_workqueue+0x67/0x5e0
 ? lockdep_init_map_type+0x47/0x220
 flush_workqueue+0x91/0x5e0
 ? flush_workqueue+0x67/0x5e0
 ? verify_cpu+0xf0/0x100
 drain_workqueue+0xa0/0x110
 destroy_workqueue+0x36/0x250
 __loop_clr_fd+0x9a/0x660 [loop]
 ? blkdev_ioctl+0x8d/0x2a0
 block_ioctl+0x3f/0x50
 __x64_sys_ioctl+0x80/0xb0
 do_syscall_64+0x38/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 14:29:55 +02:00
Kari Argillander
cde7417ce4 btrfs: use correct header for div_u64 in misc.h
asm/do_div.h is for div_u64, but it is found in math64.h. This change
will make compiler job easier and prevent compiler errors in situation
where compiler will not find math64.h from another paths.

Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 14:29:50 +02:00
Anand Jain
6f93e834fa btrfs: fix upper limit for max_inline for page size 64K
The mount option max_inline ranges from 0 to the sectorsize (which is
now equal to page size). But we parse the mount options too early and
before the actual sectorsize is read from the superblock. So the upper
limit of max_inline is unaware of the actual sectorsize and is limited
by the temporary sectorsize 4096, even on a system where the default
sectorsize is 64K.

Fix this by reading the superblock sectorsize before the mount option
parse.

Reported-by: Alexander Tsvetkov <alexander.tsvetkov@oracle.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07 14:28:46 +02:00
Linus Torvalds
815409a12c overlayfs update for 5.15
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYTDKKAAKCRDh3BK/laaZ
 PG9PAQCUF0fdBlCKudwSEt5PV5xemycL9OCAlYCd7d4XbBIe9wEA6sVJL9J+OwV2
 aF0NomiXtJccE+S9+byjVCyqSzQJGQQ=
 =6L2Y
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs update from Miklos Szeredi:

 - Copy up immutable/append/sync/noatime attributes (Amir Goldstein)

 - Improve performance by enabling RCU lookup.

 - Misc fixes and improvements

The reason this touches so many files is that the ->get_acl() method now
gets a "bool rcu" argument.  The ->get_acl() API was updated based on
comments from Al and Linus:

Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguQxpd6Wgc0Jd3ks77zcsAv_bn0q17L3VNnnmPKu11t8A@mail.gmail.com/

* tag 'ovl-update-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: enable RCU'd ->get_acl()
  vfs: add rcu argument to ->get_acl() callback
  ovl: fix BUG_ON() in may_delete() when called from ovl_cleanup()
  ovl: use kvalloc in xattr copy-up
  ovl: update ctime when changing fileattr
  ovl: skip checking lower file's i_writecount on truncate
  ovl: relax lookup error on mismatch origin ftype
  ovl: do not set overlay.opaque for new directories
  ovl: add ovl_allow_offline_changes() helper
  ovl: disable decoding null uuid with redirect_dir
  ovl: consistent behavior for immutable/append-only inodes
  ovl: copy up sync/noatime fileattr flags
  ovl: pass ovl_fs to ovl_check_setxattr()
  fs: add generic helper for filling statx attribute flags
2021-09-02 09:21:27 -07:00
Linus Torvalds
0ee7c3e25d New code for 5.15:
- Simplify the bio_end_page usage in the buffered IO code.
  - Support reading inline data at nonzero offsets for erofs.
  - Fix some typos and bad grammar.
  - Convert kmap_atomic usage in the inline data read path.
  - Add some extra inline data input checking.
  - Fix a memory corruption bug stemming from iomap_swapfile_activate
    trying to activate more pages than mm was expecting.
  - Pass errnos through the page writeback code so that writeback errors
    are reported correctly instead of being munged to EIO.
  - Replace iomap_apply with a open-coded iterator loops to reduce the
    number of indirect calls by a third to a half.
  - Refactor the fsdax code to use iomap iterators instead of the
    open-coded iomap_apply code that it had before.
  - Format file range iomap tracepoint data in hexadecimal and
    standardize the names used in the pretty-print string.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmEnwC0ACgkQ+H93GTRK
 tOtVOQ//Zu9ul2ZmPARMV8xyAfopnLpmggREOthFbPkDZ3z3ZgRpPxlbAvWEEKnj
 VDNLFNj204rDojuxP/YSdxgiawLod7dYfXIwwft8R8oI7MdgVQhpvimUi5bkz/Od
 X5pmFDe84INfFvEztOgC+sPk1RI/ToQLgrcIffWMWfF2iyVkNVMCD5MMe6LoH1la
 9GbVCfPx6Y2Nffaa8EuAEgaCo7FMPc81bvQG4qpeqXyX8qql/r5n4YENhkn3n4qa
 zI4F2lgqwbelFkamZOYNDjtLt13lb7Ze0PoFOpmTZUqlyybqhRxDvJ+OxZn8W6zH
 20pxWx/RCXhCp/sS6DRcYyf7WKoIfdGDkxed7aSuhJ+VKKtBtsjMoy7dh5IY5RJa
 8L1DMat6xtea8Glx04SF7Vib0n/An9oHOTzLEWxsUlRaPhW68uVpKgXuGLTAf+dc
 ztJhlQ9pLX0D2NmgGlkXN8d4F1XEH2BgyIrtF6UNtMbyIlCREHM9HELJs6JzKl6U
 a4ivJXyaq8o/hlXr8IMWUOTVubS0i+hgvvQjnVJmcSTJxhH10mPPJLnNsGX6heD9
 SlnnXRbD03iqsbMJP/R431VKooryOSKBc86IEECkuMz3RUfw75DGAnLtETnT1rsA
 71rSVf5NaCGZ2hV4du6jv53TS7yrPpqkxJHyDWD1WP4xGPbO1XA=
 =iVns
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "The most notable externally visible change for this cycle is the
  addition of support for reads to inline tail fragments of files, which
  was requested by the erofs developers; and a correction for a kernel
  memory corruption bug if the sysadmin tries to activate a swapfile
  with more pages than the swapfile header suggests.

  We also now report writeback completion errors to the file mapping
  correctly, instead of munging all errors into EIO.

  Internally, the bulk of the changes are Christoph's patchset to reduce
  the indirect function call count by a third to a half by converting
  iomap iteration from a loop pattern to a generator/consumer pattern.
  As an added bonus, fsdax no longer open-codes iomap apply loops.

  Summary:

   - Simplify the bio_end_page usage in the buffered IO code.

   - Support reading inline data at nonzero offsets for erofs.

   - Fix some typos and bad grammar.

   - Convert kmap_atomic usage in the inline data read path.

   - Add some extra inline data input checking.

   - Fix a memory corruption bug stemming from iomap_swapfile_activate
     trying to activate more pages than mm was expecting.

   - Pass errnos through the page writeback code so that writeback
     errors are reported correctly instead of being munged to EIO.

   - Replace iomap_apply with a open-coded iterator loops to reduce the
     number of indirect calls by a third to a half.

   - Refactor the fsdax code to use iomap iterators instead of the
     open-coded iomap_apply code that it had before.

   - Format file range iomap tracepoint data in hexadecimal and
     standardize the names used in the pretty-print string"

* tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (41 commits)
  iomap: standardize tracepoint formatting and storage
  mm/swap: consider max pages in iomap_swapfile_add_extent
  iomap: move loop control code to iter.c
  iomap: constify iomap_iter_srcmap
  fsdax: switch the fault handlers to use iomap_iter
  fsdax: factor out a dax_fault_actor() helper
  fsdax: factor out helpers to simplify the dax fault code
  iomap: rework unshare flag
  iomap: pass an iomap_iter to various buffered I/O helpers
  iomap: remove iomap_apply
  fsdax: switch dax_iomap_rw to use iomap_iter
  iomap: switch iomap_swapfile_activate to use iomap_iter
  iomap: switch iomap_seek_data to use iomap_iter
  iomap: switch iomap_seek_hole to use iomap_iter
  iomap: switch iomap_bmap to use iomap_iter
  iomap: switch iomap_fiemap to use iomap_iter
  iomap: switch __iomap_dio_rw to use iomap_iter
  iomap: switch iomap_page_mkwrite to use iomap_iter
  iomap: switch iomap_zero_range to use iomap_iter
  iomap: switch iomap_file_unshare to use iomap_iter
  ...
2021-08-31 11:13:35 -07:00
Linus Torvalds
87045e6546 for-5.15-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEs2NIACgkQxWXV+ddt
 WDsJMQ/+PJ/yXfI85mAeAzTJLWQ0zD6YO3iBhf3wOeyychWC4on435pj+zW8zR/U
 /bix25ygoWF4MvGF6p0uyv4Z5mnvkZXE5lapUcJu6wXG7se1QRPH0broTh05IBXK
 SnT93Eb9RexaiNFk7DVma9XkviqZ/ZISPtkJ9wYrfIba7j/U/wa+PtEFS7wk58hP
 rFQXgV64xm/pcP28YYHfOkCjdyUMdJrnBUvfKOlX6d94lmYbP5lyiTL+XJEXExzN
 wPakD0UsnXPr4TRvf+YRTPeFHPPUgyORII7otVUOKmGywWtcJrELX8rXFoW+6GwB
 dzZIcSYXHUxU5UrtMbZgiztVBJ+bQY5juYMIrj13eYOMYkijxAqPP84iDO15+TSV
 zNqyAVjUglHCGUGjhSpAxnAmtp+IJTZfVAWcvIKq3VqvJtb8tssQsk9bqFjH1xlH
 qNJLE57CYe3tjw05K9y0keMh2iJWRWkXZYkgI/zjwo5nreemobpN+3fO4yneVLh7
 ecdBmSl/JVSzAB1NamLOCZNGZLUqiiuTvZlJtI6ZsekrN1+4A6QzVcU/MGjSYL1v
 C7W0hK0LF+e3xIBkxTKVq8noolsgbmlWacxJq8fZq9HwZy5IVJOVm9STDlCuLaIo
 gPr0V0itkclcsMU0CHTyCjMsfuHYUwJZXwg93wKfJf5UCzS4OWU=
 =ALO9
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "The highlights of this round are integrations with fs-verity and
  idmapped mounts, the rest is usual mix of minor improvements, speedups
  and cleanups.

  There are some patches outside of btrfs, namely updating some VFS
  interfaces, all straightforward and acked.

  Features:

   - fs-verity support, using standard ioctls, backward compatible with
     read-only limitation on inodes with previously enabled fs-verity

   - idmapped mount support

   - make mount with rescue=ibadroots more tolerant to partially damaged
     trees

   - allow raid0 on a single device and raid10 on two devices,
     degenerate cases but might be useful as an intermediate step during
     conversion to other profiles

   - zoned mode block group auto reclaim can be disabled via sysfs knob

  Performance improvements:

   - continue readahead of node siblings even if target node is in
     memory, could speed up full send (on sample test +11%)

   - batching of delayed items can speed up creating many files

   - fsync/tree-log speedups
       - avoid unnecessary work (gains +2% throughput, -2% run time on
         sample load)
       - reduced lock contention on renames (on dbench +4% throughput,
         up to -30% latency)

  Fixes:

   - various zoned mode fixes

   - preemptive flushing threshold tuning, avoid excessive work on
     almost full filesystems

  Core:

   - continued subpage support, preparation for implementing remaining
     features like compression and defragmentation; with some
     limitations, write is now enabled on 64K page systems with 4K
     sectors, still considered experimental
       - no readahead on compressed reads
       - inline extents disabled
       - disabled raid56 profile conversion and mount

   - improved flushing logic, fixing early ENOSPC on some workloads

   - inode flags have been internally split to read-only and read-write
     incompat bit parts, used by fs-verity

   - new tree items for fs-verity
       - descriptor item
       - Merkle tree item

   - inode operations extended to be namespace-aware

   - cleanups and refactoring

  Generic code changes:

   - fs: new export filemap_fdatawrite_wbc

   - fs: removed sync_inode

   - block: bio_trim argument type fixups

   - vfs: add namespace-aware lookup"

* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
  btrfs: reset replace target device to allocation state on close
  btrfs: zoned: fix ordered extent boundary calculation
  btrfs: do not do preemptive flushing if the majority is global rsv
  btrfs: reduce the preemptive flushing threshold to 90%
  btrfs: tree-log: check btrfs_lookup_data_extent return value
  btrfs: avoid unnecessarily logging directories that had no changes
  btrfs: allow idmapped mount
  btrfs: handle ACLs on idmapped mounts
  btrfs: allow idmapped INO_LOOKUP_USER ioctl
  btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
  btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
  btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
  btrfs: allow idmapped SNAP_DESTROY ioctls
  btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
  btrfs: check whether fsgid/fsuid are mapped during subvolume creation
  btrfs: allow idmapped permission inode op
  btrfs: allow idmapped setattr inode op
  btrfs: allow idmapped tmpfile inode op
  btrfs: allow idmapped symlink inode op
  btrfs: allow idmapped mkdir inode op
  ...
2021-08-31 09:41:22 -07:00
Linus Torvalds
9b49ceb854 for-5.14-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEmQy4ACgkQxWXV+ddt
 WDuCRRAAmuO+6Zsl5MSq0hBnpec/VBN6lTi9VPt184BjW1IWsqwR1Ax8dVQEKgCm
 gzkGYEuVq2L5p+/ugWKKftAbmUU85Jf3AIsv81SCJQosRkxVXAdbrZOv00yUZy6/
 5YOdO+9u61otvtO6LcZz9l+0LcpSmrBwEszluyIS+nArgQyZwX2aZTjcScDJvB9+
 1y7Eo6eIbqbcJOf4mLDIJh0bHaiA7HB6jYJkbsnz51wBU2ETATzNzAoyP5ReTPGc
 1s0uxrpY37kHcUUTd6q8VLDTM6Ei4vF2zQm0jWcrw0K3hM6yPuH+GiEADoV/xsls
 6pbtss1E81rHEQjcK8brf6CxbOak8/WXV0gRia/3avkFteVlax+NJxRdVhksuJln
 siGlQqASX3vYdNL0nG+U0ml1Y9C1ZXTXu4lGjS6rtT9oeV+YSccG2UjIT9LEtuON
 W/zE4bUMqCddcZFEPH5jNK+ChGS8mmfs+UFFR+W/JzMIO8Uji5/K44FZDFBo0Oc/
 3JEgk7ZV4D+8SblBPMxJx0fZqbE8ggKM+IN5CAyscINOWOxrmRiaFFRygRX0TLDB
 2uts9owItW6zvaTRY6RclVeCvJ6ARQli4pv7YxZmH85hhtCbn515imWvLWw4+tSg
 QwrtDnPVMSJTdzFHvsmeE9lM6Vaw0ur70Ysyd29k/XJu3WwRdkM=
 =jN7s
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One more fix that I think qualifies for a late merge. It's a revert of
  a one-liner fix that meanwhile got backported to stable kernels and we
  got reports from users.

  The broken fix prevents creating compressed inline extents, which
  could be noticeable on space consumption.

  Technically it's a regression as the patch was merged in 5.14-rc1 but
  got propagated to several stable kernels and has higher exposure than
  a 'typical' development cycle bug"

* tag 'for-5.14-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Revert "btrfs: compression: don't try to compress if we don't have enough pages"
2021-08-26 11:05:11 -07:00
Qu Wenruo
4e9655763b Revert "btrfs: compression: don't try to compress if we don't have enough pages"
This reverts commit f216562731.

[BUG]
It's no longer possible to create compressed inline extent after commit
f216562731 ("btrfs: compression: don't try to compress if we don't
have enough pages").

[CAUSE]
For compression code, there are several possible reasons we have a range
that needs to be compressed while it's no more than one page.

- Compressed inline write
  The data is always smaller than one sector and the test lacks the
  condition to properly recognize a non-inline extent.

- Compressed subpage write
  For the incoming subpage compressed write support, we require page
  alignment of the delalloc range.
  And for 64K page size, we can compress just one page into smaller
  sectors.

For those reasons, the requirement for the data to be more than one page
is not correct, and is already causing regression for compressed inline
data writeback.  The idea of skipping one page to avoid wasting CPU time
could be revisited in the future.

[FIX]
Fix it by reverting the offending commit.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net
Fixes: f216562731 ("btrfs: compression: don't try to compress if we don't have enough pages")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-25 15:08:19 +02:00
Desmond Cheong Zhi Xi
0d977e0eba btrfs: reset replace target device to allocation state on close
This crash was observed with a failed assertion on device close:

  BTRFS: Transaction aborted (error -28)
  WARNING: CPU: 1 PID: 3902 at fs/btrfs/extent-tree.c:2150 btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
  Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress raid6_pq loop
  CPU: 1 PID: 3902 Comm: kworker/u8:4 Not tainted 5.14.0-rc5-default+ #1532
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
  RIP: 0010:btrfs_run_delayed_refs+0x1d2/0x1e0 [btrfs]
  RSP: 0018:ffffb7a5452d7d80 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
  RBP: ffff97834176a378 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff97835195d388
  R13: 0000000005b08000 R14: ffff978385484000 R15: 000000000000016c
  FS:  0000000000000000(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000056190d003fe8 CR3: 000000002a81e005 CR4: 0000000000170ea0
  Call Trace:
   flush_space+0x197/0x2f0 [btrfs]
   btrfs_async_reclaim_metadata_space+0x139/0x300 [btrfs]
   process_one_work+0x262/0x5e0
   worker_thread+0x4c/0x320
   ? process_one_work+0x5e0/0x5e0
   kthread+0x144/0x170
   ? set_kthread_struct+0x40/0x40
   ret_from_fork+0x1f/0x30
  irq event stamp: 19334989
  hardirqs last  enabled at (19334997): [<ffffffffab0e0c87>] console_unlock+0x2b7/0x400
  hardirqs last disabled at (19335006): [<ffffffffab0e0d0d>] console_unlock+0x33d/0x400
  softirqs last  enabled at (19334900): [<ffffffffaba0030d>] __do_softirq+0x30d/0x574
  softirqs last disabled at (19334893): [<ffffffffab0721ec>] irq_exit_rcu+0x12c/0x140
  ---[ end trace 45939e308e0dd3c7 ]---
  BTRFS: error (device vdd) in btrfs_run_delayed_refs:2150: errno=-28 No space left
  BTRFS info (device vdd): forced readonly
  BTRFS warning (device vdd): failed setting block group ro: -30
  BTRFS info (device vdd): suspending dev_replace for unmount
  assertion failed: !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state), in fs/btrfs/volumes.c:1150
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.h:3431!
  invalid opcode: 0000 [#1] PREEMPT SMP
  CPU: 1 PID: 3982 Comm: umount Tainted: G        W         5.14.0-rc5-default+ #1532
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
  RSP: 0018:ffffb7a5454c7db8 EFLAGS: 00010246
  RAX: 0000000000000068 RBX: ffff978364b91c00 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffffffffabee13c4 RDI: 00000000ffffffff
  RBP: ffff9783523a4c00 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000000000000 R11: 0000000000000001 R12: ffff9783523a4d18
  R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000003
  FS:  00007f61c8f42800(0000) GS:ffff9783bd800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000056190cffa810 CR3: 0000000030b96002 CR4: 0000000000170ea0
  Call Trace:
   btrfs_close_one_device.cold+0x11/0x55 [btrfs]
   close_fs_devices+0x44/0xb0 [btrfs]
   btrfs_close_devices+0x48/0x160 [btrfs]
   generic_shutdown_super+0x69/0x100
   kill_anon_super+0x14/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x2c/0xa0
   cleanup_mnt+0x144/0x1b0
   task_work_run+0x59/0xa0
   exit_to_user_mode_loop+0xe7/0xf0
   exit_to_user_mode_prepare+0xaf/0xf0
   syscall_exit_to_user_mode+0x19/0x50
   do_syscall_64+0x4a/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

This happens when close_ctree is called while a dev_replace hasn't
completed. In close_ctree, we suspend the dev_replace, but keep the
replace target around so that we can resume the dev_replace procedure
when we mount the root again. This is the call trace:

  close_ctree():
    btrfs_dev_replace_suspend_for_unmount();
    btrfs_close_devices():
      btrfs_close_fs_devices():
        btrfs_close_one_device():
          ASSERT(!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
                 &device->dev_state));

However, since the replace target sticks around, there is a device
with BTRFS_DEV_STATE_REPLACE_TGT set on close, and we fail the
assertion in btrfs_close_one_device.

To fix this, if we come across the replace target device when
closing, we should properly reset it back to allocation state. This
fix also ensures that if a non-target device has a corrupted state and
has the BTRFS_DEV_STATE_REPLACE_TGT bit set, the assertion will still
catch the error.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: b2a6166768 ("btrfs: fix rw device counting in __btrfs_free_extra_devids")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:57:18 +02:00
Naohiro Aota
939c7feb19 btrfs: zoned: fix ordered extent boundary calculation
btrfs_lookup_ordered_extent() is supposed to query the offset in a file
instead of the logical address. Pass the file offset from
submit_extent_page() to calc_bio_boundaries().

Also, calc_bio_boundaries() relies on the bio's operation flag, so move
the call site after setting it.

Fixes: 390ed29b81 ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:16 +02:00
Josef Bacik
1146239794 btrfs: do not do preemptive flushing if the majority is global rsv
A common characteristic of the bug report where preemptive flushing was
going full tilt was the fact that the vast majority of the free metadata
space was used up by the global reserve.  The hard 90% threshold would
cover the majority of these cases, but to be even smarter we should take
into account how much of the outstanding reservations are covered by the
global block reserve.  If the global block reserve accounts for the vast
majority of outstanding reservations, skip preemptive flushing, as it
will likely just cause churn and pain.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:16 +02:00
Josef Bacik
93c60b17f2 btrfs: reduce the preemptive flushing threshold to 90%
The preemptive flushing code was added in order to avoid needing to
synchronously wait for ENOSPC flushing to recover space.  Once we're
almost full however we can essentially flush constantly.  We were using
98% as a threshold to determine if we were simply full, however in
practice this is a really high bar to hit.  For example reports of
systems running into this problem had around 94% usage and thus
continued to flush.  Fix this by lowering the threshold to 90%, which is
a more sane value, especially for smaller file systems.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
CC: stable@vger.kernel.org # 5.12+
Fixes: 576fa34830 ("btrfs: improve preemptive background space flushing")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:15 +02:00
Marcos Paulo de Souza
3736127a3a btrfs: tree-log: check btrfs_lookup_data_extent return value
Function btrfs_lookup_data_extent calls btrfs_search_slot to verify if
the EXTENT_ITEM exists in the extent tree. btrfs_search_slot can return
values bellow zero if an error happened.

Function replay_one_extent currently checks if the search found
something (0 returned) and increments the reference, and if not, it
seems to evaluate as 'not found'.

Fix the condition by checking if the value was bellow zero and return
early.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:15 +02:00
Filipe Manana
8be2ba2e0e btrfs: avoid unnecessarily logging directories that had no changes
There are several cases where when logging an inode we need to log its
parent directories or logging subdirectories when logging a directory.

There are cases however where we end up logging a directory even if it was
not changed in the current transaction, no dentries added or removed since
the last transaction. While this is harmless from a functional point of
view, it is a waste time as it brings no advantage.

One example where this is triggered is the following:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  $ mkdir /mnt/A
  $ mkdir /mnt/B
  $ mkdir /mnt/C

  $ touch /mnt/A/foo
  $ ln /mnt/A/foo /mnt/B/bar
  $ ln /mnt/A/foo /mnt/C/baz

  $ sync

  $ rm -f /mnt/A/foo
  $ xfs_io -c "fsync" /mnt/B/bar

This last fsync ends up logging directories A, B and C, however we only
need to log directory A, as B and C were not changed since the last
transaction commit.

So fix this by changing need_log_inode(), to return false in case the
given inode is a directory and has a ->last_trans value smaller than the
current transaction's ID.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:15 +02:00
Christian Brauner
5b9b26f5d0 btrfs: allow idmapped mount
Now that we converted btrfs internally to account for idmapped mounts
allow the creation of idmapped mounts on by setting the FS_ALLOW_IDMAP
flag.  We only need to raise this flag on the btrfs_root_fs_type
filesystem since btrfs_mount_root() is ultimately responsible for
allocating the superblock and is called into from btrfs_mount()
associated with btrfs_fs_type.

The conversion of the btrfs inode operations was straightforward.
Regarding btrfs specific ioctls that perform checks based on inode
permissions only those have been allowed that are not filesystem wide
operations and hence can be reasonably charged against a specific mount.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:15 +02:00
Christian Brauner
4a8b34afa9 btrfs: handle ACLs on idmapped mounts
Make the ACL code idmapped mount aware. The POSIX default and POSIX
access ACLs are the only ACLs other than some specific xattrs that take
DAC permissions into account. On an idmapped mount they need to be
translated according to the mount's userns. The main change is done to
__btrfs_set_acl() which is responsible for translating POSIX ACLs to
their final on-disk representation.

The btrfs_init_acl() helper does not need to take the idmapped mount
into account since it is called in the context of file creation
operations (mknod, create, mkdir, symlink, tmpfile) and is used for
btrfs_init_inode_security() to copy POSIX default and POSIX access
permissions from the parent directory. These ACLs need to be inherited
unmodified from the parent directory. This is identical to what we do
for ext4 and xfs.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:15 +02:00
Christian Brauner
6623d9a0b0 btrfs: allow idmapped INO_LOOKUP_USER ioctl
The INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl
and has the following restrictions. The main difference between the two
is that INO_LOOKUP is filesystem wide operation wheres INO_LOOKUP_USER
is scoped beneath the file descriptor passed with the ioctl.
Specifically, INO_LOOKUP_USER must adhere to the following restrictions:

- The caller must be privileged over each inode of each path component
  for the path they are trying to lookup.

- The path for the subvolume the caller is trying to lookup must be reachable
  from the inode associated with the file descriptor passed with the ioctl.

The second condition makes it possible to scope the lookup of the path
to the mount identified by the file descriptor passed with the ioctl.
This allows us to enable this ioctl on idmapped mounts.

Specifically, this is possible because all child subvolumes of a parent
subvolume are reachable when the parent subvolume is mounted. So if the
user had access to open the parent subvolume or has been given the fd
then they can lookup the path if they had access to it provided they
were privileged over each path component.

Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name
of a subvolume even though they would otherwise be restricted from doing
so via regular VFS-based lookup.

So think about a parent subvolume with multiple child subvolumes.
Someone could mount he parent subvolume and restrict access to the child
subvolumes by overmounting them with empty directories. At this point
the user can't traverse the child subvolumes and they can't open files
in the child subvolumes.  However, they can still learn the path of
child subvolumes as long as they have access to the parent subvolume by
using the INO_LOOKUP_USER ioctl.

The underlying assumption here is that it's ok that the lookup ioctls
can't really take mounts into account other than the original mount the
fd belongs to during lookup. Since this assumption is baked into the
original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:15 +02:00
Christian Brauner
39e1674ff0 btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
Setting flags on subvolumes or snapshots are core features of btrfs. The
SUBVOL_SETFLAGS ioctl is especially important as it allows to make
subvolumes and snapshots read-only or read-write. Allow setting flags on
btrfs subvolumes and snapshots on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:14 +02:00
Christian Brauner
e4fed17a32 btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
The SET_RECEIVED_SUBVOL ioctls are used to set information about
a received subvolume. Make it possible to set information about a
received subvolume on idmapped mounts. This is a fairly straightforward
operation since all the permission checking helpers are already capable
of handling idmapped mounts. So we just need to pass down the mount's
userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:14 +02:00
Christian Brauner
aabb34e7a3 btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
So far we prevented the deletion of subvolumes and snapshots using
subvolume ids possible with the BTRFS_SUBVOL_SPEC_BY_ID flag.

This restriction is necessary on idmapped mounts as this allows
filesystem wide subvolume and snapshot deletions and thus can escape the
scope of what's exposed under the mount identified by the fd passed with
the ioctl.

Deletion by subvolume id works by looking for an alias of the parent of
the subvolume or snapshot to be deleted. The parent alias can be
anywhere in the filesystem. However, as long as the alias of the parent
that is found is the same as the one identified by the file descriptor
passed through the ioctl we can allow the deletion.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:14 +02:00
Christian Brauner
c4ed533bdc btrfs: allow idmapped SNAP_DESTROY ioctls
Destroying subvolumes and snapshots are important features of btrfs.
Both operations are available to unprivileged users if the filesystem
has been mounted with the "user_subvol_rm_allowed" mount option. Allow
subvolume and snapshot deletion on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.

Subvolumes and snapshots can either be deleted by specifying their name
or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.

This feature is blocked on idmapped mounts as this allows filesystem
wide subvolume deletions and thus can escape the scope of what's exposed
under the mount identified by the fd passed with the ioctl.

This means that even the root or CAP_SYS_ADMIN capable user can't delete
a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.

The root user is currently already subject to permission checks in
btrfs_may_delete() including whether the inode's i_uid/i_gid of the
directory the subvolume is located in have a mapping in the caller's
idmapping. For this to fail isn't currently possible since a btrfs
filesystem can't be mounted with a non-initial idmapping but it shows
that even the root user would fail to delete a subvolume if the relevant
inode isn't mapped in their idmapping. The idmapped mount case is the
same in principle.

This isn't a huge problem a root user wanting to delete arbitrary
subvolumes can just always create another (even detached) mount without
an idmapping attached.

In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the
subvolume to delete is directly located under inode referenced by the fd
passed for the ioctl() in a follow-up commit.

Here is an example where a btrfs subvolume is deleted through a
subvolume mount that does not expose the subvolume to be delete but it
can still be deleted by using the subvolume id:

  /* Compile the following program as "delete_by_spec". */

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <inttypes.h>
  #include <linux/btrfs.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <sys/stat.h>
  #include <sys/types.h>
  #include <unistd.h>

  static int rm_subvolume_by_id(int fd, uint64_t subvolid)
  {
	 struct btrfs_ioctl_vol_args_v2 args = {};
	 int ret;

	 args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
	 args.subvolid = subvolid;

	 ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
	 if (ret < 0)
		 return -1;

	 return 0;
  }

  int main(int argc, char *argv[])
  {
	 int subvolid = 0;

	 if (argc < 3)
		 exit(1);

	 fprintf(stderr, "Opening %s\n", argv[1]);
	 int fd = open(argv[1], O_CLOEXEC | O_DIRECTORY);
	 if (fd < 0)
		 exit(2);

	 subvolid = atoi(argv[2]);

	 fprintf(stderr, "Deleting subvolume with subvolid %d\n", subvolid);
	 int ret = rm_subvolume_by_id(fd, subvolid);
	 if (ret < 0)
		 exit(3);

	 exit(0);
  }
  #include <stdio.h>"
  #include <stdlib.h>"
  #include <linux/btrfs.h"

  truncate -s 10G btrfs.img
  mkfs.btrfs btrfs.img
  export LOOPDEV=$(sudo losetup -f --show btrfs.img)
  mount ${LOOPDEV} /mnt
  sudo chown $(id -u):$(id -g) /mnt
  btrfs subvolume create /mnt/A
  btrfs subvolume create /mnt/B/C
  # Get subvolume id via:
  sudo btrfs subvolume show /mnt/A
  # Save subvolid
  SUBVOLID=<nr>
  sudo umount /mnt
  sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
  ./delete_by_spec /mnt ${SUBVOLID}

With idmapped mounts this can potentially be used by users to delete
subvolumes/snapshots they would otherwise not have access to as the
idmapping would be applied to an inode that is not exposed in the mount
of the subvolume.

The fact that this is a filesystem wide operation suggests it might be a
good idea to expose this under a separate ioctl that clearly indicates
this. In essence, the file descriptor passed with the ioctl is merely
used to identify the filesystem on which to operate when
BTRFS_SUBVOL_SPEC_BY_ID is used.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:14 +02:00
Christian Brauner
4d4340c912 btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
Creating subvolumes and snapshots is one of the core features of btrfs
and is even available to unprivileged users. Make it possible to use
subvolume and snapshot creation on idmapped mounts. This is a fairly
straightforward operation since all the permission checking helpers are
already capable of handling idmapped mounts. So we just need to pass
down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:14 +02:00
Christian Brauner
5474bf400f btrfs: check whether fsgid/fsuid are mapped during subvolume creation
When a new subvolume is created btrfs currently doesn't check whether
the fsgid/fsuid of the caller actually have a mapping in the user
namespace attached to the filesystem. The VFS always checks this to make
sure that the caller's fsgid/fsuid can be represented on-disk. This is
most relevant for filesystems that can be mounted inside user namespaces
but it is in general a good hardening measure to prevent unrepresentable
gid/uid from being written to disk.

Since we want to support idmapped mounts for btrfs ioctls to create
subvolumes in follow-up patches this becomes important since we want to
make sure the fsgid/fsuid of the caller as mapped according to the
idmapped mount can be represented on-disk. Simply add the missing
fsuidgid_has_mapping() line from the VFS may_create() version to
btrfs_may_create().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:14 +02:00
Christian Brauner
3bc71ba02c btrfs: allow idmapped permission inode op
Enable btrfs_permission() to handle idmapped mounts. This is just a
matter of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:13 +02:00
Christian Brauner
d4d0946461 btrfs: allow idmapped setattr inode op
Enable btrfs_setattr() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:13 +02:00
Christian Brauner
98b6ab5fc0 btrfs: allow idmapped tmpfile inode op
Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:13 +02:00
Christian Brauner
5a0521086e btrfs: allow idmapped symlink inode op
Enable btrfs_symlink() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:13 +02:00
Christian Brauner
b0b3e44d34 btrfs: allow idmapped mkdir inode op
Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of
passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:13 +02:00
Christian Brauner
e93ca491d0 btrfs: allow idmapped create inode op
Enable btrfs_create() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:13 +02:00
Christian Brauner
72105277dc btrfs: allow idmapped mknod inode op
Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of
passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:12 +02:00
Christian Brauner
c020d2eaf1 btrfs: allow idmapped getattr inode op
Enable btrfs_getattr() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:12 +02:00
Christian Brauner
ca07274c3d btrfs: allow idmapped rename inode op
Enable btrfs_rename() to handle idmapped mounts. This is just a matter
of passing down the mount's userns.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:12 +02:00
Christian Brauner
b3b6f5b922 btrfs: handle idmaps in btrfs_new_inode()
Extend btrfs_new_inode() to take the idmapped mount into account when
initializing a new inode. This is just a matter of passing down the
mount's userns. The rest is taken care of in inode_init_owner(). This is
a preliminary patch to make the individual btrfs inode operations
idmapped mount aware.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:12 +02:00
Anand Jain
e7849e33cf btrfs: sysfs: document structures and their associated files
Sysfs file has grown big. It takes some time to locate the correct
struct attribute to add new files. Create a table and map the struct
attribute to its sysfs path.

Also, fix the comment about the debug sysfs path.  And add the comments
to the attributes instead of attribute group, where sysfs file names are
defined.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:12 +02:00
Qu Wenruo
e4571b8c5e btrfs: fix NULL pointer dereference when deleting device by invalid id
[BUG]
It's easy to trigger NULL pointer dereference, just by removing a
non-existing device id:

 # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \
				     /dev/test/scratch2
 # mount /dev/test/scratch1 /mnt/btrfs
 # btrfs device remove 3 /mnt/btrfs

Then we have the following kernel NULL pointer dereference:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs]
  btrfs_ioctl+0x18bb/0x3190 [btrfs]
  ? lock_is_held_type+0xa5/0x120
  ? find_held_lock.constprop.0+0x2b/0x80
  ? do_user_addr_fault+0x201/0x6a0
  ? lock_release+0xd2/0x2d0
  ? __x64_sys_ioctl+0x83/0xb0
  __x64_sys_ioctl+0x83/0xb0
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae

[CAUSE]
Commit a27a94c2b0 ("btrfs: Make btrfs_find_device_by_devspec return
btrfs_device directly") moves the "missing" device path check into
btrfs_rm_device().

But btrfs_rm_device() itself can have case where it only receives
@devid, with NULL as @device_path.

In that case, calling strcmp() on NULL will trigger the NULL pointer
dereference.

Before that commit, we handle the "missing" case inside
btrfs_find_device_by_devspec(), which will not check @device_path at all
if @devid is provided, thus no way to trigger the bug.

[FIX]
Before calling strcmp(), also make sure @device_path is not NULL.

Fixes: a27a94c2b0 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly")
CC: stable@vger.kernel.org # 5.4+
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:11 +02:00
Naohiro Aota
63fb5879db btrfs: zoned: add asserts on splitting extent_map
We call split_zoned_em() on an extent_map on submitting a bio for it. Thus,
we can assume the extent_map is PINNED, not LOGGING, and in the modified
list. Add ASSERT()s to ensure the extent_maps after the split also has the
proper flags set and are in the modified list.

Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:11 +02:00
Naohiro Aota
0ae79c6fe7 btrfs: zoned: fix block group alloc_offset calculation
alloc_offset is offset from the start of a block group and @offset is
actually an address in logical space. Thus, we need to consider
block_group->start when calculating them.

Fixes: 011b41bffa ("btrfs: zoned: advance allocation pointer after tree log node")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:11 +02:00
Naohiro Aota
ba86dd9fe6 btrfs: zoned: suppress reclaim error message on EAGAIN
btrfs_relocate_chunk() can fail with -EAGAIN when e.g. send operations are
running. The message can fail btrfs/187 and it's unnecessary because we
anyway add it back to the reclaim list.

btrfs_reclaim_bgs_work()
`-> btrfs_relocate_chunk()
    `-> btrfs_relocate_block_group()
        `-> reloc_chunk_start()
            `-> if (fs_info->send_in_progress)
                `-> return -EAGAIN

CC: stable@vger.kernel.org # 5.13+
Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:11 +02:00
Johannes Thumshirn
77233c2d2e btrfs: zoned: allow disabling of zone auto reclaim
Automatically reclaiming dirty zones might not always be desired for all
workloads, especially as there are currently still some rough edges with
the relocation code on zoned filesystems.

Allow disabling zone auto reclaim on a per filesystem basis by writing 0
as the threshold value.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:11 +02:00
Filipe Manana
1f29537302 btrfs: update comment at log_conflicting_inodes()
A comment at log_conflicting_inodes() mentions that we check the inode's
logged_trans field instead of using btrfs_inode_in_log() because the field
last_log_commit is not updated when we log that an inode exists and the
inode has the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) set. The part
about the full sync flag is not true anymore since commit 9acc8103ab
("btrfs: fix unpersisted i_size on fsync after expanding truncate"), so
update the comment to not mention that part anymore.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:11 +02:00
Filipe Manana
d135a53396 btrfs: remove no longer needed full sync flag check at inode_logged()
Now that we are checking if the inode's logged_trans is 0 to detect the
possibility of the inode having been evicted and reloaded, the test for
the full sync flag (BTRFS_INODE_NEEDS_FULL_SYNC) is no longer needed at
tree-log.c:inode_logged(). Its purpose was to detect the possibility
of a previous eviction as well, since when an inode is loaded the full
sync flag is always set on it (and only cleared after the inode is
logged).

So just remove the check and update the comment. The check for the inode's
logged_trans being 0 was added recently by the patch with the subject
"btrfs: eliminate some false positives when checking if inode was logged".

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:10 +02:00
Filipe Manana
1c167b87f4 btrfs: remove unnecessary NULL check for the new inode during rename exchange
At the very end of btrfs_rename_exchange(), in case an error happened, we
are checking if 'new_inode' is NULL, but that is not needed since during a
rename exchange, unlike regular renames, 'new_inode' can never be NULL,
and if it were, we would have a crashed much earlier when we dereference it
multiple times.

So remove the check because it is not necessary and because it is causing
static checkers to emit a warning. I probably introduced the check by
copy-pasting similar code from btrfs_rename(), where 'new_inode' can be
NULL, in commit 86e8aa0e77 ("Btrfs: unpin logs if rename exchange
operation fails").

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues
dce2815039 btrfs: allocate backref_ctx on stack in find_extent_clone
Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx
on stack. The size is reasonably small.

sizeof(backref_ctx) = 48

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues
c853a5783e btrfs: allocate btrfs_ioctl_defrag_range_args on stack
Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args,
allocate btrfs_ioctl_defrag_range_args on stack, the size is reasonably
small and ioctls are called in process context.

sizeof(btrfs_ioctl_defrag_range_args) = 48

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues
0afb603afc btrfs: allocate btrfs_ioctl_quota_rescan_args on stack
Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args,
allocate btrfs_ioctl_quota_rescan_args on stack, the size is reasonably
small and ioctls are called in process context.

sizeof(btrfs_ioctl_quota_rescan_args) = 64

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:10 +02:00
Goldwyn Rodrigues
98caf9531e btrfs: allocate file_ra_state on stack in readahead_cache
Instead of allocating file_ra_state using kmalloc, allocate on stack.
sizeof(struct readahead) = 32 bytes.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:10 +02:00
Marcos Paulo de Souza
0ff40a910f btrfs: introduce btrfs_search_backwards function
It's a common practice to start a search using offset (u64)-1, which is
the u64 maximum value, meaning that we want the search_slot function to
be set in the last item with the same objectid and type.

Once we are in this position, it's a matter to start a search backwards
by calling btrfs_previous_item, which will check if we'll need to go to
a previous leaf and other necessary checks, only to be sure that we are
in last offset of the same object and type.

The new btrfs_search_backwards function does the all these steps when
necessary, and can be used to avoid code duplication.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
David Sterba
ea3dc7d2d1 btrfs: print if fsverity support is built in when loading module
As fsverity support depends on a config option, print that at module
load time like we do for similar features.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Boris Burkov
705242538f btrfs: verity metadata orphan items
Writing out the verity data is too large of an operation to do in a
single transaction. If we are interrupted before we finish creating
fsverity metadata for a file, or fail to clean up already created
metadata after a failure, we could leak the verity items that we already
committed.

To address this issue, we use the orphan mechanism. When we start
enabling verity on a file, we also add an orphan item for that inode.
When we are finished, we delete the orphan. However, if we are
interrupted midway, the orphan will be present at mount and we can
cleanup the half-formed verity state.

There is a possible race with a normal unlink operation: if unlink and
verity run on the same file in parallel, it is possible for verity to
succeed and delete the still legitimate orphan added by unlink. Then, if
we are interrupted and mount in that state, we will never clean up the
inode properly. This is also possible for a file created with O_TMPFILE.
Check nlink==0 before deleting to avoid this race.

A final thing to note is that this is a resurrection of using orphans to
signal an operation besides "delete this inode". The old case was to
signal the need to do a truncate. That case still technically applies
for mounting very old file systems, so we need to take some care to not
clobber it. To that end, we just have to be careful that verity orphan
cleanup is a no-op for non-verity files.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Boris Burkov
146054090b btrfs: initial fsverity support
Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and btrfs
verity item and the other stores the Merkle tree data itself.

Verity checking is done in end_page_read just before a page is marked
uptodate. This naturally handles a variety of edge cases like holes,
preallocated extents, and inline extents. Some care needs to be taken to
not try to verity pages past the end of the file, which are accessed by
the generic buffered file reading code under some circumstances like
reading to the end of the last page and trying to read again. Direct IO
on a verity file falls back to buffered reads.

Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.

Use the new inode ro_flags to store verity on the inode item, so that we
can enable verity on a file, then rollback to an older kernel and still
mount the file system and read the file. Since we can't safely write the
file anymore without ruining the invariants of the Merkle tree, we mark
a ro_compat flag on the file system when a file has verity enabled.

Acked-by: Eric Biggers <ebiggers@google.com>
Co-developed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Boris Burkov
77eea05e78 btrfs: add ro compat flags to inodes
Currently, inode flags are fully backwards incompatible in btrfs. If we
introduce a new inode flag, then tree-checker will detect it and fail.
This can even cause us to fail to mount entirely. To make it possible to
introduce new flags which can be read-only compatible, like VERITY, we
add new ro flags to btrfs without treating them quite so harshly in
tree-checker. A read-only file system can survive an unexpected flag,
and can be mounted.

As for the implementation, it unfortunately gets a little complicated.

The on-disk representation of the inode, btrfs_inode_item, has an __le64
for flags but the in-memory representation, btrfs_inode, uses a u32.
David Sterba had the nice idea that we could reclaim those wasted 32 bits
on disk and use them for the new ro_compat flags.

It turns out that the tree-checker code which checks for unknown flags
is broken, and ignores the upper 32 bits we are hoping to use. The issue
is that the flags use the literal 1 rather than 1ULL, so the flags are
signed ints, and one of them is specifically (1 << 31). As a result, the
mask which ORs the flags is a negative integer on machines where int is
32 bit twos complement. When tree-checker evaluates the expression:

  btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)

The mask is something like 0x80000abc, which gets promoted to u64 with
sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
all the upper bits zeroed, and we can't detect unexpected flags.

This suggests that we can't use those bits after all. Luckily, we have
good reason to believe that they are zero anyway. Inode flags are
metadata, which is always checksummed, so any bit flips that would
introduce 1s would cause a checksum failure anyway (excluding the
improbable case of the checksum getting corrupted exactly badly).

Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
inode flag should preserve its value and not add leading zeroes
(at least for twos complement). The only place that flag
(BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
the root item, and indeed for that inode we see 0xffffffff80000000 as
the flags on disk. However, that inode is never seen by tree checker,
nor is it used in a context where verity might be meaningful.
Theoretically, a future ro flag might cause trouble on that inode, so we
should proactively clean up that mess before it does.

With the introduction of the new ro flags, keep two separate unsigned
masks and check them against the appropriate u32. Since we no longer run
afoul of sign extension, this also stops writing out 0xffffffff80000000
in root_item inodes going forward.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Anand Jain
efc222f8d7 btrfs: simplify return values in btrfs_check_raid_min_devices
Function btrfs_check_raid_min_devices() returns error code from the enum
btrfs_err_code and it starts from 1. So there is no need to check if ret
is > 0. So drop this check and also drop the local variable ret.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Qu Wenruo
7361b4ae03 btrfs: remove the dead comment in writepage_delalloc()
When btrfs_run_delalloc_range() failed, we will error out.

But there is a strange comment mentioning that
btrfs_run_delalloc_range() could have returned value >0 to indicate the
IO has already started.

Commit 40f765805f ("Btrfs: split up __extent_writepage to lower stack
usage") introduced the comment, but unfortunately at that time, we were
already using @page_started to indicate that case, and still return 0.

Furthermore, even if that comment was right (which is not), we would
return -EIO if the IO had already started.

By all means the comment is incorrect, just remove the comment along
with the dead check.

Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
make sure we either return 0 or error, no positive return value.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
David Sterba
b2f78e8805 btrfs: allow degenerate raid0/raid10
The data on raid0 and raid10 are supposed to be spread over multiple
devices, so the minimum constraints are set to 2 and 4 respectively.
This is an artificial limit and there's some interest to remove it.

Change this to allow raid0 on one device and raid10 on two devices. This
works as expected eg. when converting or removing devices.

The only difference is when raid0 on two devices gets one device
removed. Unpatched would silently create a single profile, while newly
it would be raid0.

The motivation is to allow to preserve the profile type as long as it
possible for some intermediate state (device removal, conversion), or
when there are disks of different size, with raid0 the otherwise
unusable space of the last device will be used too. Similarly for
raid10, though the two largest devices would need to be the same.

Unpatched kernel will mount and use the degenerate profiles just fine
but won't allow any operation that would not satisfy the stricter device
number constraints, eg. not allowing to go from 3 to 2 devices for
raid10 or various profile conversions.

Example output:

  # btrfs fi us -T .
  Overall:
      Device size:                  10.00GiB
      Device allocated:              1.01GiB
      Device unallocated:            8.99GiB
      Device missing:                  0.00B
      Used:                        200.61MiB
      Free (estimated):              9.79GiB      (min: 9.79GiB)
      Free (statfs, df):             9.79GiB
      Data ratio:                       1.00
      Metadata ratio:                   1.00
      Global reserve:                3.25MiB      (used: 0.00B)
      Multiple profiles:                  no

		Data      Metadata  System
  Id Path       RAID0     single    single   Unallocated
  -- ---------- --------- --------- -------- -----------
   1 /dev/sda10   1.00GiB   8.00MiB  1.00MiB     8.99GiB
  -- ---------- --------- --------- -------- -----------
     Total        1.00GiB   8.00MiB  1.00MiB     8.99GiB
     Used       200.25MiB 352.00KiB 16.00KiB

  # btrfs dev us .
  /dev/sda10, ID: 1
     Device size:            10.00GiB
     Device slack:              0.00B
     Data,RAID0/1:            1.00GiB
     Metadata,single:         8.00MiB
     System,single:           1.00MiB
     Unallocated:             8.99GiB

Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
profile is printed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
Filipe Manana
bd54f381a1 btrfs: do not pin logs too early during renames
During renames we pin the logs of the roots a bit too early, before the
calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
since those will not change anything in a log tree.

In a scenario where we have multiple and diverse filesystem operations
running in parallel, those calls can take a significant amount of time,
due to lock contention on extent buffers, and delay log commits from other
tasks for longer than necessary.

So just pin logs after calls to btrfs_insert_inode_ref() and right before
the first operation that can update a log tree.

The following script that uses dbench was used for testing:

  $ cat dbench-test.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 120 16

  umount $MNT

The tests were run on a machine with 12 cores, 64G of RAN, a NVMe device
and using a non-debug kernel config (Debian's default config).

The results compare a branch without this patch and without the previous
patch in the series, that has the subject:

 "btrfs: eliminate some false positives when checking if inode was logged"

Versus the same branch with these two patches applied.

dbench with 8 clients, results before:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4391359     0.009   249.745
 Close        3225882     0.001     3.243
 Rename        185953     0.065   240.643
 Unlink        886669     0.049   249.906
 Deltree          112     2.455   217.433
 Mkdir             56     0.002     0.004
 Qpathinfo    3980281     0.004     3.109
 Qfileinfo     697579     0.001     0.187
 Qfsinfo       729780     0.002     2.424
 Sfileinfo     357764     0.004     1.415
 Find         1538861     0.016     4.863
 WriteX       2189666     0.010     3.327
 ReadX        6883443     0.002     0.729
 LockX          14298     0.002     0.073
 UnlockX        14298     0.001     0.042
 Flush         307777     2.447   303.663

Throughput 1149.6 MB/sec  8 clients  8 procs  max_latency=303.666 ms

dbench with 8 clients, results after:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4269920     0.009   213.532
 Close        3136653     0.001     0.690
 Rename        180805     0.082   213.858
 Unlink        862189     0.050   172.893
 Deltree          112     2.998   218.328
 Mkdir             56     0.002     0.003
 Qpathinfo    3870158     0.004     5.072
 Qfileinfo     678375     0.001     0.194
 Qfsinfo       709604     0.002     0.485
 Sfileinfo     347850     0.004     1.304
 Find         1496310     0.017     5.504
 WriteX       2129613     0.010     2.882
 ReadX        6693066     0.002     1.517
 LockX          13902     0.002     0.075
 UnlockX        13902     0.001     0.055
 Flush         299276     2.511   220.189

Throughput 1187.33 MB/sec  8 clients  8 procs  max_latency=220.194 ms

+3.2% throughput, -31.8% max latency

dbench with 16 clients, results before:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5978334     0.028   156.507
 Close        4391598     0.001     1.345
 Rename        253136     0.241   155.057
 Unlink       1207220     0.182   257.344
 Deltree          160     6.123    36.277
 Mkdir             80     0.003     0.005
 Qpathinfo    5418817     0.012     6.867
 Qfileinfo     949929     0.001     0.941
 Qfsinfo       993560     0.002     1.386
 Sfileinfo     486904     0.004     2.829
 Find         2095088     0.059     8.164
 WriteX       2982319     0.017     9.029
 ReadX        9371484     0.002     4.052
 LockX          19470     0.002     0.461
 UnlockX        19470     0.001     0.990
 Flush         418936     2.740   347.902

Throughput 1495.31 MB/sec  16 clients  16 procs  max_latency=347.909 ms

dbench with 16 clients, results after:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5711833     0.029   131.240
 Close        4195897     0.001     1.732
 Rename        241849     0.204   147.831
 Unlink       1153341     0.184   231.322
 Deltree          160     6.086    30.198
 Mkdir             80     0.003     0.021
 Qpathinfo    5177011     0.012     7.150
 Qfileinfo     907768     0.001     0.793
 Qfsinfo       949205     0.002     1.431
 Sfileinfo     465317     0.004     2.454
 Find         2001541     0.058     7.819
 WriteX       2850661     0.017     9.110
 ReadX        8952289     0.002     3.991
 LockX          18596     0.002     0.655
 UnlockX        18596     0.001     0.179
 Flush         400342     2.879   293.607

Throughput 1565.73 MB/sec  16 clients  16 procs  max_latency=293.611 ms

+4.6% throughput, -16.9% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
Filipe Manana
6e8e777deb btrfs: eliminate some false positives when checking if inode was logged
When checking if an inode was previously logged in the current transaction
through the helper inode_logged(), we can return some false positives that
can be easily eliminated. These correspond to the cases where an inode has
a ->logged_trans value that is not zero and its value is smaller then the
ID of the current transaction. This means we know exactly that the inode
was never logged before in the current transaction, so we can return false
and avoid the callers to do extra work:

1) Having btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log()
   unnecessarily join a log transaction and do deletion searches in a log
   tree that will not find anything. This just adds unnecessary contention
   on extent buffer locks;

2) Having btrfs_log_new_name() unnecessarily log an inode when it is not
   needed. If the inode was not logged before, we don't need to log it in
   LOG_INODE_EXISTS mode.

So just make sure that any false positive only happens when ->logged_trans
has a value of 0.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
Naohiro Aota
42b5d73b5d btrfs: drop unnecessary ASSERT from btrfs_submit_direct()
When on SINGLE block group, btrfs_get_io_geometry() will return "the
size of the block group - the offset of the logical address within the
block group" as geom.len. Since we allow up to 8 GiB zone size on zoned
filesystem, we can have up to 8 GiB block group, so can have up to 8 GiB
geom.len as well. With this setup, we easily hit the "ASSERT(geom.len <=
INT_MAX);".

The ASSERT looks like to guard btrfs_bio_clone_partial() and bio_trim()
which both take "int" (now u64 due to the previous patch). So to be
precise the ASSERT should check if clone_len <= UINT_MAX. But actually,
clone_len is already capped by bio.bi_iter.bi_size which is unsigned
int. So the ASSERT is not necessary.

Drop the ASSERT and properly compare submit_len and geom.len in u64.
Then, let the implicit casting to convert it to u64.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
Chaitanya Kulkarni
21dda654d4 btrfs: fix argument type of btrfs_bio_clone_partial()
The offset and can never be negative use unsigned int instead of int
type for them.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
Josef Bacik
b377630527 btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking
sync_inode() has some holes that can cause problems if we're under heavy
ENOSPC pressure.  If there's writeback running on a separate thread
sync_inode() will skip writing the inode altogether.  What we really
want is to make sure writeback has been started on all the pages to make
sure we can see the ordered extents and wait on them if appropriate.
Switch to this new helper which will allow us to accomplish this and
avoid ENOSPC'ing early.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:07 +02:00
Josef Bacik
e16460707e btrfs: wait on async extents when flushing delalloc
I've been debugging an early ENOSPC problem in production and finally
root caused it to this problem.  When we switched to the per-inode in
38d715f494 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") I pulled out the async extent handling, because we
were doing the correct thing by calling filemap_flush() if we had async
extents set.  This would properly wait on any async extents by locking
the page in the second flush, thus making sure our ordered extents were
properly set up.

However when I switched us back to page based flushing, I used
sync_inode(), which allows us to pass in our own wbc.  The problem here
is that sync_inode() is smarter than the filemap_* helpers, it tries to
avoid calling writepages at all.  This means that our second call could
skip calling do_writepages altogether, and thus not wait on the pagelock
for the async helpers.  This means we could come back before any ordered
extents were created and then simply continue on in our flushing
mechanisms and ENOSPC out when we have plenty of space to use.

Fix this by putting back the async pages logic in shrink_delalloc.  This
allows us to bulk write out everything that we need to, and then we can
wait in one place for the async helpers to catch up, and then wait on
any ordered extents that are created.

Fixes: e076ab2a2c ("btrfs: shrink delalloc pages instead of full inodes")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:07 +02:00
Josef Bacik
03fe78cc29 btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc
We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be.  With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_used.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.

Consider the example of a 8 way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations.  Now assume we have 128MiB of delalloc
outstanding.  With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.

Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.

Fix this two ways.  First, change the calculations to be a fraction of
the total delalloc bytes on the system.  Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.

Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once.  This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.

I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again.  This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:07 +02:00
Josef Bacik
fcdef39c03 btrfs: enable a tracepoint when we fail tickets
When debugging early enospc problems it was useful to have a tracepoint
where we failed all tickets so I could check the state of the enospc
counters at failure time to validate my fixes.  This adds the tracpoint
so you can easily get that information.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Josef Bacik
ac98141d14 btrfs: wake up async_delalloc_pages waiters after submit
We use the async_delalloc_pages mechanism to make sure that we've
completed our async work before trying to continue our delalloc
flushing.  The reason for this is we need to see any ordered extents
that were created by our delalloc flushing.  However we're waking up
before we do the submit work, which is before we create the ordered
extents.  This is a pretty wide race window where we could potentially
think there are no ordered extents and thus exit shrink_delalloc
prematurely.  Fix this by waking us up after we've done the work to
create ordered extents.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Qu Wenruo
963e4db83e btrfs: unify regular and subpage error paths in __extent_writepage()
[BUG]
When running btrfs/160 in a loop for subpage with experimental
compression support, it has a high chance to crash (~20%):

 BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ordered-data.c:238!
 Internal error: Oops - BUG: 0 [#1] SMP
 pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
 lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
 Call trace:
  __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
  btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
  run_delalloc_nocow+0x81c/0x8fc [btrfs]
  btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
  writepage_delalloc+0xc0/0x1ac [btrfs]
  __extent_writepage+0xf4/0x370 [btrfs]
  extent_write_cache_pages+0x288/0x4f4 [btrfs]
  extent_writepages+0x58/0xe0 [btrfs]
  btrfs_writepages+0x1c/0x30 [btrfs]
  do_writepages+0x60/0x110
  __filemap_fdatawrite_range+0x108/0x170
  filemap_fdatawrite_range+0x20/0x30
  btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
  __btrfs_write_out_cache+0x34c/0x480 [btrfs]
  btrfs_write_out_cache+0x144/0x220 [btrfs]
  btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
  btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
  btrfs_sync_fs+0x64/0x1cc [btrfs]
  sync_fs_one_sb+0x3c/0x50
  iterate_supers+0xcc/0x1d4
  ksys_sync+0x6c/0xd0
  __arm64_sys_sync+0x1c/0x30
  invoke_syscall+0x50/0x120
  el0_svc_common.constprop.0+0x4c/0xd4
  do_el0_svc+0x30/0x9c
  el0_svc+0x2c/0x54
  el0_sync_handler+0x1a8/0x1b0
  el0_sync+0x198/0x1c0
 ---[ end trace 336f67369ae6e0af ]---

[CAUSE]
For subpage case, we can have multiple sectors inside a page, this makes
it possible for __extent_writepage() to have part of its page submitted
before returning.

In btrfs/160, we are using dm-dust to emulate write error, this means
for certain pages, we could have everything running fine, but at the end
of __extent_writepage(), one of the submitted bios fails due to dm-dust.

Then the page is marked Error, and we change @ret from 0 to -EIO.

This makes the caller extent_write_cache_pages() to error out, without
submitting the remaining pages.

Furthermore, since we're erroring out for free space cache, it doesn't
really care about the error and will update the inode and retry the
writeback.

Then we re-run the delalloc range, and will try to insert the same
delalloc range while previous delalloc range is still hanging there,
triggering the above error.

[FIX]
The proper fix is to handle errors from __extent_writepage() properly,
by ending the remaining ordered extent.

But that fix needs the following changes:

- Know at exactly which sector the error happened
  Currently __extent_writepage_io() works for the full page, can't
  return at which sector we hit the error.

- Grab the ordered extent covering the failed sector

As a hotfix for subpage case, here we unify the error paths in
__extent_writepage().

In fact, the "if (PageError(page))" branch never get executed if @ret is
still 0 for non-subpage cases.

As for non-subpage case, we never submit current page in
__extent_writepage(), but only add current page into bio.
The bio can only get submitted in next page.

Thus we never get PageError() set due to IO failure, thus when we hit
the branch, @ret is never 0.

By simply removing that @ret assignment, we let subpage case ignore the
IO failure, thus only error out for fatal errors just like regular
sectorsize.

So that IO error won't be treated as fatal error not trigger the hanging
OE problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Qu Wenruo
95ea0486b2 btrfs: allow read-write for 4K sectorsize on 64K page size systems
Since now we support data and metadata read-write for subpage, remove
the RO requirement for subpage mount.

There are some extra limitations though:

- For now, subpage RW mount is still considered experimental
  Thus that mount warning will still be there.

- No compression support
  There are still quite some PAGE_SIZE hard coded and quite some call
  sites use extent_clear_unlock_delalloc() to unlock locked_page.
  This will screw up subpage helpers.

  Now for subpage RW mount, no matter what mount option or inode attr is
  set, all writes will not be compressed.  Although reading compressed
  data has no problem.

- No defrag for subpage case
  The defrag support for subpage case will come in later patches, which
  will also rework the defrag workflow.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger more write than the range specified.
  In fallocate calls, this behavior can make us to writeback which can
  be inlined, before we enlarge the i_size.

  This is a very special corner case, and even current btrfs check won't
  report error on such inline extent + regular extent.
  But considering how much effort has been put to prevent such inline +
  regular, I'd prefer to cut off inline extent completely until we have
  a good solution.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Qu Wenruo
9d9ea1e68a btrfs: subpage: fix relocation potentially overwriting last page data
[BUG]
When using the following script, btrfs will report data corruption after
one data balance with subpage support:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
  sync
  btrfs balance start -d $mnt
  btrfs scrub start -B $mnt

Similar problem can be easily observed in btrfs/028 test case, there
will be tons of balance failure with -EIO.

[CAUSE]
Above fsstress will result the following data extents layout in extent
tree:
  item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
    refs 2 gen 7 flags DATA
    extent data backref root FS_TREE objectid 259 offset 1339392 count 1
    extent data backref root FS_TREE objectid 259 offset 647168 count 1
  item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
    block group used 102400 chunk_objectid 256 flags DATA
  item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
    refs 1 gen 7 flags DATA
    extent data backref root FS_TREE objectid 259 offset 729088 count 1

Then when creating the data reloc inode, the data reloc inode will look
like this:

	0	32K	64K	96K 100K	104K
	|<------ Extent A ----->|   |<- Ext B ->|

Then when we first try to relocate extent A, we setup the data reloc
inode with i_size 96K, then read both page [0, 64K) and page [64K, 128K).

For page 64K, since the i_size is just 96K, we fill range [96K, 128K)
with 0 and set it uptodate.

Then when we come to extent B, we update i_size to 104K, then try to read
page [64K, 128K).
Then we find the page is already uptodate, so we skip the read.
But range [96K, 128K) is filled with 0, not the real data.

Then we writeback the data reloc inode to disk, with 0 filling range
[96K, 128K), corrupting the content of extent B.

The behavior is caused by the fact that we still do full page read for
subpage case.

The bug won't really happen for regular sectorsize, as one page only
contains one sector.

[FIX]
This patch will fix the problem by invalidating range [i_size, PAGE_END]
in prealloc_file_extent_cluster().

So that if above example happens, when we preallocate the file extent
for extent B, we will clear the uptodate bits for range [96K, 128K),
allowing later relocate_one_page() to re-read the needed range.

There is a special note for the invalidating part.

Since we're not calling real btrfs_invalidatepage(), but just clearing
the subpage and page uptodate bits, we can leave a page half dirty and
half out of date.

Reading such page can cause a deadlock, as we normally expect a dirty
page to be fully uptodate.

Thus here we flush and wait the data reloc inode before doing the hacked
invalidating.  This won't cause extra overhead, as we're going to
writeback the data later anyway.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Qu Wenruo
e3c62324e4 btrfs: subpage: fix false alert when relocating partial preallocated data extents
[BUG]
When relocating partial preallocated data extents (part of the
preallocated extent is written) for subpage, it can cause the following
false alert and make the relocation to fail:

  BTRFS info (device dm-3): balance: start -d
  BTRFS info (device dm-3): relocating block group 13631488 flags data
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-3): balance: ended with status: -5

The minimal script to reproduce looks like this:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  xfs_io -f -c "falloc 0 8k" $mnt/file
  xfs_io -f -c "pwrite 0 4k" $mnt/file
  btrfs balance start -d $mnt

[CAUSE]
Function btrfs_verify_data_csum() checks if the full range has
EXTENT_NODATASUM bit for data reloc inode, if *all* bytes of the range
have EXTENT_NODATASUM bit, then it skip the range.

This works pretty well for regular sectorsize, as in that case
btrfs_verify_data_csum() is called for each sector, thus no problem at
all.

But for subpage case, btrfs_verify_data_csum() is called on each bvec,
which can contain several sectors, and since it checks *all* bytes for
EXTENT_NODATASUM bit, if we have some range with csum, then we will
continue checking all the sectors.

For the preallocated sectors, it doesn't have any csum, thus obviously
the csum won't match and cause the false alert.

[FIX]
Move the EXTENT_NODATASUM check into the main loop, so that we can check
each sector for EXTENT_NODATASUM bit for subpage case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00
Qu Wenruo
7c11d0ae43 btrfs: subpage: fix a potential use-after-free in writeback helper
[BUG]
There is a possible use-after-free bug when running generic/095.

 BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
 Faulting instruction address: 0xc000000000283654
 c000000000283078 do_raw_spin_unlock+0x88/0x230
 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
 c0000000009e0458 end_bio_extent_writepage+0x158/0x270
 c000000000b6fd14 bio_endio+0x254/0x270
 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
 c000000000b6fd14 bio_endio+0x254/0x270
 c000000000b781fc blk_update_request+0x46c/0x670
 c000000000b8b394 blk_mq_end_request+0x34/0x1d0
 c000000000d82d1c lo_complete_rq+0x11c/0x140
 c000000000b880a4 blk_complete_reqs+0x84/0xb0
 c0000000012b2ca4 __do_softirq+0x334/0x680
 c0000000001dd878 irq_exit+0x148/0x1d0
 c000000000016f4c do_IRQ+0x20c/0x240
 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

[CAUSE]
There is very small race window like the following in generic/095.

	Thread 1		|		Thread 2
--------------------------------+------------------------------------
  end_bio_extent_writepage()	| btrfs_releasepage()
  |- spin_lock_irqsave()	| |
  |- end_page_writeback()	| |
  |				| |- if (PageWriteback() ||...)
  |				| |- clear_page_extent_mapped()
  |				|    |- kfree(subpage);
  |- spin_unlock_irqrestore().

The race can also happen between writeback and btrfs_invalidatepage(),
although that would be much harder as btrfs_invalidatepage() has much
more work to do before the clear_page_extent_mapped() call.

[FIX]
Here we "wait" for the subapge spinlock to be released before we detach
subpage structure.
So this patch will introduce a new function, wait_subpage_spinlock(), to
do the "wait" by acquiring the spinlock and release it.

Since the caller has ensured the page is not dirty nor writeback, and
page is already locked, the only way to hold the subpage spinlock is
from endio function.
Thus we only need to acquire the spinlock to wait for any existing
holder.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00
Qu Wenruo
e046786619 btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage()
[BUG]
When running generic/095, there is a high chance to crash with subpage
data RW support:

 assertion failed: PagePrivate(page) && page->private
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.h:3403!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
 Hardware name: Khadas VIM3 (DT)
 Call trace:
  assertfail.constprop.0+0x28/0x2c [btrfs]
  btrfs_subpage_assert+0x80/0xa0 [btrfs]
  btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
  btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
  btrfs_dirty_pages+0x160/0x270 [btrfs]
  btrfs_buffered_write+0x444/0x630 [btrfs]
  btrfs_direct_write+0x1cc/0x2d0 [btrfs]
  btrfs_file_write_iter+0xc0/0x160 [btrfs]
  new_sync_write+0xe8/0x180
  vfs_write+0x1b4/0x210
  ksys_pwrite64+0x7c/0xc0
  __arm64_sys_pwrite64+0x24/0x30
  el0_svc_common.constprop.0+0x70/0x140
  do_el0_svc+0x28/0x90
  el0_svc+0x2c/0x54
  el0_sync_handler+0x1a8/0x1ac
  el0_sync+0x170/0x180
 Code: f0000160 913be042 913c4000 955444bc (d4210000)
 ---[ end trace 3fdd39f4cccedd68 ]---

[CAUSE]
Although prepare_pages() calls find_or_create_page(), which returns the
page locked, but in later prepare_uptodate_page() calls, we may call
btrfs_readpage() which will unlock the page before it returns.

This leaves a window where btrfs_releasepage() can sneak in and release
the page, clearing page->private and causing above ASSERT().

[FIX]
In prepare_uptodate_page(), we should not only check page->mapping, but
also PagePrivate() to ensure we are still holding the correct page which
has proper fs context setup.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00