Commit Graph

1404 Commits

Author SHA1 Message Date
Linus Torvalds cac405a3bf for-6.6-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmURvloACgkQxWXV+ddt
 WDt+CQ/+NgBtQn7eyABsdHzXWPxpFyGZrdw5ldKnly3G+WDW2GKMaZ6CpDuEZGNQ
 vMAkSGX5LIHXvO79pDnGG0i+bRINWrc5HZVZ/p5Da6wplBTgIPlbLmxaZX9MJLbx
 j7Oz37GXiQJY8BxnVCnsb+bhhTrTbO9HFUQr/nxefIvu22OBdL1WXYcfuBOeEsFG
 qr/aeC52YqCVgXvt+8a5DqAKE0NWc4PFMFUMo4vlf1xuL652fvff7xiup1CAIgBh
 qsCa17E7q+qjri2phAhbFNadfpH5wGfyjTWScOlaFuXjRhW2v2oqz3WU5IQj4dmu
 PI+k++PLUzIxT0IcjD1YbZzRFaEI6fR2W0GA4LK08fjVehh2ao5jOjtRgLl8HlqG
 qC5fslAPzUxRmwMmCjSGfXF14sgtyLy8eVWf69xn06/1cbEmfHDrWNXP1QHuq6eT
 Jqy8Ywia3jRzzfZ1utABJPLBW4hFQKkyobtyd67fxslUFmtuLvLqGTiOdmVFiD9K
 o+BF2xjEz2n8O1+aRZk5SFNC9zcaASaRg/wQrhvSI9qxM18fh4TXgKQOniLzAK7v
 lZc+JkegFW4CVquCUpmbsdZAOpVNRXfPOJIt/w6G+oRbaiTvPUnrH+uyq8IGREbw
 E7d8XIP0qlF0DQBGK4Mw/riZz/e5MmEKNjza6M+fj2uglpfWTv4=
 =6WEW
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - delayed refs fixes:
     - fix race when refilling delayed refs block reserve
     - prevent transaction block reserve underflow when starting
       transaction
     - error message and value adjustments

 - fix build warnings with CONFIG_CC_OPTIMIZE_FOR_SIZE and
   -Wmaybe-uninitialized

 - fix for smatch report where uninitialized data from invalid extent
   buffer range could be returned to the caller

 - fix numeric overflow in statfs when calculating lower threshold
   for a full filesystem

* tag 'for-6.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: initialize start_slot in btrfs_log_prealloc_extents
  btrfs: make sure to initialize start and len in find_free_dev_extent
  btrfs: reset destination buffer when read_extent_buffer() gets invalid range
  btrfs: properly report 0 avail for very full file systems
  btrfs: log message if extent item not found when running delayed extent op
  btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref()
  btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1
  btrfs: prevent transaction block reserve underflow when starting transaction
  btrfs: fix race when refilling delayed refs block reserve
2023-09-26 09:44:08 -07:00
Josef Bacik 20218dfbaa btrfs: make sure to initialize start and len in find_free_dev_extent
Jens reported a compiler error when using CONFIG_CC_OPTIMIZE_FOR_SIZE=y
that looks like this

  In function ‘gather_device_info’,
      inlined from ‘btrfs_create_chunk’ at fs/btrfs/volumes.c:5507:8:
  fs/btrfs/volumes.c:5245:48: warning: ‘dev_offset’ may be used uninitialized [-Wmaybe-uninitialized]
   5245 |                 devices_info[ndevs].dev_offset = dev_offset;
	|                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
  fs/btrfs/volumes.c: In function ‘btrfs_create_chunk’:
  fs/btrfs/volumes.c:5196:13: note: ‘dev_offset’ was declared here
   5196 |         u64 dev_offset;

This occurs because find_free_dev_extent is responsible for setting
dev_offset, however if we get an -ENOMEM at the top of the function
we'll return without setting the value.

This isn't actually a problem because we will see the -ENOMEM in
gather_device_info() and return and not use the uninitialized value,
however we also just don't want the compiler warning so rework the code
slightly in find_free_dev_extent() to make sure it's always setting
*start and *len to avoid the compiler warning.

Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-21 18:52:20 +02:00
Linus Torvalds 547635c6ac for-6.6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTskOwACgkQxWXV+ddt
 WDsNJw/8CCi41Z7e3LdJsQd2iy3/+oJZUvIGuT5YvshYxTLCbV7AL+diBPnSQs4Q
 /KFMGL7RZBgJzwVoSQtXnESXXgX8VOVfN1zY//k5g6z7BscCEQd73H/M0B8ciZy/
 aBygm9tJ7EtWbGZWNR8yad8YtOgl6xoClrPnJK/DCLwMGPy2o+fnKP3Y9FOKY5KM
 1Sl0Y4FlJ9dTJpxIwYbx4xmuyHrh2OivjU/KnS9SzQlHu0nl6zsIAE45eKem2/EG
 1figY5aFBYPpPYfopbLDalEBR3bQGiViZVJuNEop3AimdcMOXw9jBF3EZYUb5Tgn
 MleMDgmmjLGOE/txGhvTxKj9kci2aGX+fJn3jXbcIMksAA0OQFLPqzGvEQcrs6Ok
 HA0RsmAkS5fWNDCuuo4ZPXEyUPvluTQizkwyoulOfnK+UPJCWaRqbEBMTsvm6M6X
 wFT2czwLpaEU/W6loIZkISUhfbRqVoA3DfHy398QXNzRhSrg8fQJjma1f7mrHvTi
 CzU+OD5YSC2nXktVOnklyTr0XT+7HF69cumlDbr8TS8u1qu8n1keU/7M3MBB4xZk
 BZFJDz8pnsAqpwVA4T434E/w45MDnYlwBw5r+U8Xjyso8xlau+sYXKcim85vT2Q0
 yx/L91P6tdekR1y97p4aDdxw/PgTzdkNGMnsTBMVzgtCj+5pMmE=
 =N7Yn
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "No new features, the bulk of the changes are fixes, refactoring and
  cleanups. The notable fix is the scrub performance restoration after
  rewrite in 6.4, though still only partial.

  Fixes:

   - scrub performance drop due to rewrite in 6.4 partially restored:
      - do IO grouping by blg_plug/blk_unplug again
      - avoid unnecessary tree searches when processing stripes, in
        extent and checksum trees
      - the drop is noticeable on fast PCIe devices, -66% and restored
        to -33% of the original
      - backports to 6.4 planned

   - handle more corner cases of transaction commit during orphan
     cleanup or delayed ref processing

   - use correct fsid/metadata_uuid when validating super block

   - copy directory permissions and time when creating a stub subvolume

  Core:

   - debugging feature integrity checker deprecated, to be removed in
     6.7

   - in zoned mode, zones are activated just before the write, making
     error handling easier, now the overcommit mechanism can be enabled
     again which improves performance by avoiding more frequent flushing

   - v0 extent handling completely removed, deprecated long time ago

   - error handling improvements

   - tests:
      - extent buffer bitmap tests
      - pinned extent splitting tests

   - cleanups and refactoring:
      - compression writeback
      - extent buffer bitmap
      - space flushing, ENOSPC handling"

* tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits)
  btrfs: zoned: skip splitting and logical rewriting on pre-alloc write
  btrfs: tests: test invalid splitting when skipping pinned drop extent_map
  btrfs: tests: add a test for btrfs_add_extent_mapping
  btrfs: tests: add extent_map tests for dropping with odd layouts
  btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker()
  btrfs: scrub: don't go ordered workqueue for dev-replace
  btrfs: scrub: fix grouping of read IO
  btrfs: scrub: avoid unnecessary csum tree search preparing stripes
  btrfs: scrub: avoid unnecessary extent tree search preparing stripes
  btrfs: copy dir permission and time when creating a stub subvolume
  btrfs: remove pointless empty list check when reading delayed dir indexes
  btrfs: drop redundant check to use fs_devices::metadata_uuid
  btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super
  btrfs: use the correct superblock to compare fsid in btrfs_validate_super
  btrfs: simplify memcpy either of metadata_uuid or fsid
  btrfs: add a helper to read the superblock metadata_uuid
  btrfs: remove v0 extent handling
  btrfs: output extra debug info if we failed to find an inline backref
  btrfs: move the !zoned assert into run_delalloc_cow
  btrfs: consolidate the error handling in run_delalloc_nocow
  ...
2023-08-28 12:26:57 -07:00
Linus Torvalds 615e95831e v6.6-vfs.ctime
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc
 oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2
 XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0=
 =eJz5
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
 "This adds VFS support for multi-grain timestamps and converts tmpfs,
  xfs, ext4, and btrfs to use them. This carries acks from all relevant
  filesystems.

  The VFS always uses coarse-grained timestamps when updating the ctime
  and mtime after a change. This has the benefit of allowing filesystems
  to optimize away a lot of metadata updates, down to around 1 per
  jiffy, even when a file is under heavy writes.

  Unfortunately, this has always been an issue when we're exporting via
  NFSv3, which relies on timestamps to validate caches. A lot of changes
  can happen in a jiffy, so timestamps aren't sufficient to help the
  client decide to invalidate the cache.

  Even with NFSv4, a lot of exported filesystems don't properly support
  a change attribute and are subject to the same problems with timestamp
  granularity. Other applications have similar issues with timestamps
  (e.g., backup applications).

  If we were to always use fine-grained timestamps, that would improve
  the situation, but that becomes rather expensive, as the underlying
  filesystem would have to log a lot more metadata updates.

  This introduces fine-grained timestamps that are used when they are
  actively queried.

  This uses the 31st bit of the ctime tv_nsec field to indicate that
  something has queried the inode for the mtime or ctime. When this flag
  is set, on the next mtime or ctime update, the kernel will fetch a
  fine-grained timestamp instead of the usual coarse-grained one.

  As POSIX generally mandates that when the mtime changes, the ctime
  must also change the kernel always stores normalized ctime values, so
  only the first 30 bits of the tv_nsec field are ever used.

  Filesytems can opt into this behavior by setting the FS_MGTIME flag in
  the fstype. Filesystems that don't set this flag will continue to use
  coarse-grained timestamps.

  Various preparatory changes, fixes and cleanups are included:

   - Fixup all relevant places where POSIX requires updating ctime
     together with mtime. This is a wide-range of places and all
     maintainers provided necessary Acks.

   - Add new accessors for inode->i_ctime directly and change all
     callers to rely on them. Plain accesses to inode->i_ctime are now
     gone and it is accordingly rename to inode->__i_ctime and commented
     as requiring accessors.

   - Extend generic_fillattr() to pass in a request mask mirroring in a
     sense the statx() uapi. This allows callers to pass in a request
     mask to only get a subset of attributes filled in.

   - Rework timestamp updates so it's possible to drop the @now
     parameter the update_time() inode operation and associated helpers.

   - Add inode_update_timestamps() and convert all filesystems to it
     removing a bunch of open-coding"

* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
  btrfs: convert to multigrain timestamps
  ext4: switch to multigrain timestamps
  xfs: switch to multigrain timestamps
  tmpfs: add support for multigrain timestamps
  fs: add infrastructure for multigrain timestamps
  fs: drop the timespec64 argument from update_time
  xfs: have xfs_vn_update_time gets its own timestamp
  fat: make fat_update_time get its own timestamp
  fat: remove i_version handling from fat_update_time
  ubifs: have ubifs_update_time use inode_update_timestamps
  btrfs: have it use inode_update_timestamps
  fs: drop the timespec64 arg from generic_update_time
  fs: pass the request_mask to generic_fillattr
  fs: remove silly warning from current_time
  gfs2: fix timestamp handling on quota inodes
  fs: rename i_ctime field to __i_ctime
  selinux: convert to ctime accessor functions
  security: convert to ctime accessor functions
  apparmor: convert to ctime accessor functions
  sunrpc: convert to ctime accessor functions
  ...
2023-08-28 09:31:32 -07:00
Anand Jain 319baafcef btrfs: simplify memcpy either of metadata_uuid or fsid
There is a helper which provides either metadata_uuid or fsid as per
METADATA_UUID flag. So use it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:48 +02:00
Anand Jain 4844c3664a btrfs: add a helper to read the superblock metadata_uuid
In some cases, we need to read the FSID from the superblock when the
metadata_uuid is not set, and otherwise, read the metadata_uuid. So,
add a helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:54:48 +02:00
Filipe Manana ed8947bc73 btrfs: merge find_free_dev_extent() and find_free_dev_extent_start()
There is no point in having find_free_dev_extent() because it's just a
simple wrapper around find_free_dev_extent_start() which always passes a
value of 0 for the search_start argument. Since there are no other callers
of find_free_dev_extent_start(), remove find_free_dev_extent() and rename
find_free_dev_extent_start() to find_free_dev_extent(), removing its
search_start argument because it's always 0.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:18 +02:00
Filipe Manana 883647f4b5 btrfs: make find_free_dev_extent() static
The function find_free_dev_extent() is only used within volumes.c, so make
it static and remove its prototype from volumes.h.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:18 +02:00
Anand Jain 7f9879eb60 btrfs: print name and pid when device scanning processes race
There is a race between systemd and mount, as both of them try to register
the device in the kernel. When systemd loses the race, it prints the
following message:

  BTRFS error: device /dev/sdb7 belongs to fsid 1b3bacbf-14db-49c9-a3ef-547998aacc4e, and the fs is already mounted.

The 'btrfs dev scan' registers one device at a time, so there is no way
for the mount thread to wait in the kernel for all the devices to have
registered as it won't know if all the devices are discovered.

For now, improve the error log by printing the command name and process
ID along with the error message.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:16 +02:00
Christoph Hellwig b2cc440058 btrfs: simplify the no-bioc fast path condition in btrfs_map_block
nr_alloc_stripes can't be one if we are writing to a replacement device,
as it is incremented for that case right above.  Remove the duplicate
checks.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:13 +02:00
Filipe Manana e5860f8207 btrfs: make find_first_extent_bit() return a boolean
Currently find_first_extent_bit() returns a 0 if it found a range in the
given io tree and 1 if it didn't find any. There's no need to return any
errors, so make the return value a boolean and invert the logic to make
more sense: return true if it found a range and false if it didn't find
any range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:12 +02:00
Qu Wenruo ed3764f726 btrfs: add comments for btrfs_map_block()
The function btrfs_map_block() is a critical part of the btrfs storage
layer, which handles mapping of logical ranges to physical ranges.

Thus it's better to have some basic explanation, especially on the
following points:

- Segment split by various boundaries
  As a continuous logical range may be split into different segments,
  due to various factors like zones and RAID0/5/6/10 boundaries.

- The meaning of @mirror_num

- The possible single stripe optimization

- One deprecated parameter @need_raid_map
  Just explicitly mark it deprecated so we're aware of the problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21 14:52:12 +02:00
Linus Torvalds 12e6ccedb3 for-6.5-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTgyQQACgkQxWXV+ddt
 WDvqSQ/+PFg0GwssGuiqWTGbfHV2bJCJWeuXUJNuKFo8PtEnpN0zf28ihsaRXAHF
 ZDFKrRjEmb62n+EWJFDpC7wmnz6UJEoEtQteN2VBnLSIUQAKFI+g5flXrR85rk1D
 d52JSXtaXSZeCtZH/wdYWdfkL19SJQqJrFDY1WmRLCylOsLHuG0a67fXNeL+5WM/
 NgGUMk0bO/j2CKjiCwJT4EpsSP4tFj49TciuDESyXnS8aDbPLbAQkGpYlE+99HSj
 D3vjZeqdVfmVhSjdIrK2eTlndzCl+HU+J1DXHzRE6I5XkXhzofJFtrlsvl++C9pv
 UZL9bFyMFzybKME33RWvzXBhiRguZ4hfGBoh5FQbJl4yErU4I5RVZcd3/S/2V6n+
 AzWemwkOdLEiiPD+aLV28EYdKpnd4GFweVTxeXjdXrJrSx/e4Vn/kPNq1aZJi6Qi
 ex3hZWr0oN7JG/StN6i3ix09fEB8cyDzn/jaEwk5zb6uHVN8fw7whkVwZOvFkXx5
 VcPxZOyxBFxwmN+L6JlxkIGEpu8UQC2RHa1JJzDTXJPqpz6W68d2wJ8jlDFJYUaf
 fahDd8FoG/e/EYh8sPsOnp3gMY53UxxWLF8fuZXVScq9+g5zA3jfftF+a3TaA5bh
 e119g0ml+KIGtTB7Q8nLob4PA12NNhNtHbKfdSPDhOfvz8heg9A=
 =eFDQ
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix infinite loop in readdir(), could happen in a big directory when
   files get renamed during enumeration

 - fix extent map handling of skipped pinned ranges

 - fix a corner case when handling ordered extent length

 - fix a potential crash when balance cancel races with pause

 - verify correct uuid when starting scrub or device replace

* tag 'for-6.5-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix incorrect splitting in btrfs_drop_extent_map_range
  btrfs: fix BUG_ON condition in btrfs_cancel_balance
  btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
  btrfs: fix replace/scrub failure with metadata_uuid
  btrfs: fix infinite directory reads
2023-08-19 17:57:07 +02:00
xiaoshoukui 29eefa6d0d btrfs: fix BUG_ON condition in btrfs_cancel_balance
Pausing and canceling balance can race to interrupt balance lead to BUG_ON
panic in btrfs_cancel_balance. The BUG_ON condition in btrfs_cancel_balance
does not take this race scenario into account.

However, the race condition has no other side effects. We can fix that.

Reproducing it with panic trace like this:

  kernel BUG at fs/btrfs/volumes.c:4618!
  RIP: 0010:btrfs_cancel_balance+0x5cf/0x6a0
  Call Trace:
   <TASK>
   ? do_nanosleep+0x60/0x120
   ? hrtimer_nanosleep+0xb7/0x1a0
   ? sched_core_clone_cookie+0x70/0x70
   btrfs_ioctl_balance_ctl+0x55/0x70
   btrfs_ioctl+0xa46/0xd20
   __x64_sys_ioctl+0x7d/0xa0
   do_syscall_64+0x38/0x80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

  Race scenario as follows:
  > mutex_unlock(&fs_info->balance_mutex);
  > --------------------
  > .......issue pause and cancel req in another thread
  > --------------------
  > ret = __btrfs_balance(fs_info);
  >
  > mutex_lock(&fs_info->balance_mutex);
  > if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req)) {
  >         btrfs_info(fs_info, "balance: paused");
  >         btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED);
  > }

CC: stable@vger.kernel.org # 4.19+
Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-17 15:27:45 +02:00
Jeff Layton 913e99287b
fs: drop the timespec64 argument from update_time
Now that all of the update_time operations are prepared for it, we can
drop the timespec64 argument from the update_time operation. Do that and
remove it from some associated functions like inode_update_time and
inode_needs_update_time.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11 09:04:57 +02:00
Linus Torvalds 4667025951 for-6.5-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmS4TjYACgkQxWXV+ddt
 WDt8yw//b6E+Ab0QBBS6Dl9R3euv+aqQUa35Lg8z2L89oLjNlX/yZSj4ZIZ/467A
 omDtCat1uwYSlzPgM7RtfvXew7ujULmK9UP5IdwEQgQ5qf+pIE50h78x64zCLG1R
 g2C9OMGvNxU8kOR2sRFaBJ+Lf8Mth8ipOzDl8XPE4m7t87tLYdpmVoHJKzNj/AcF
 UOruS8HilFYlc8Oqb/yxDPYeMmdUQaqahhq4tHkoIB/09aePaJwAiyhQ53v3TaAX
 oBvBBrV65WdHu3epfB9WcPQRyd09ov+e14UXAX7CepkZ0RdIpU1CcszPGHTtOP94
 fRaup2zFygolHSTMPFU3drtnI3CRIGntdfHFqDEH+M1ysvRxG4VUnAjWL6w9KZZC
 3z2G6cYoVxpcbCch5wpn/97cr6PowRABK0AoirrreU0VA2Vkdi45N5eoT/2DTEVT
 s5VirJsfC5IXCrvQf/h67jHlqk+23e+/uP8in+GWTgPEmddTSWNKCi9ZcHKQ/skf
 5U9Y3vkFFl68cQEwcB6vYMlbCiCxCuimz8AvGFrxkU9qmah6gcgTnXwKqneAbr3f
 jFwZvgh+E5ueK76/KLLWIKyfDx5cU8N/D5TlmZ+n/TAoWUFVs5V2ctB4EOsIotev
 EN7sriJGUkfGp+EAs4Xux4OKGx6IinyogdZmQN9HmNIlGks02qk=
 =yO9Y
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Stable fixes:

   - fix race between balance and cancel/pause

   - various iput() fixes

   - fix use-after-free of new block group that became unused

   - fix warning when putting transaction with qgroups enabled after
     abort

   - fix crash in subpage mode when page could be released between map
     and map read

   - when scrubbing raid56 verify the P/Q stripes unconditionally

   - fix minor memory leak in zoned mode when a block group with an
     unexpected superblock is found

  Regression fixes:

   - fix ordered extent split error handling when submitting direct IO

   - user irq-safe locking when adding delayed iputs"

* tag 'for-6.5-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix warning when putting transaction with qgroups enabled after abort
  btrfs: fix ordered extent split error handling in btrfs_dio_submit_io
  btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expand
  btrfs: raid56: always verify the P/Q contents for scrub
  btrfs: use irq safe locking when running and adding delayed iputs
  btrfs: fix iput() on error pointer after error during orphan cleanup
  btrfs: fix double iput() on inode after an error during orphan cleanup
  btrfs: zoned: fix memory leak after finding block group with super blocks
  btrfs: fix use-after-free of new block group that became unused
  btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block
  btrfs: fix race between balance and cancel/pause
2023-07-20 08:11:30 -07:00
Christoph Hellwig 4e7de35eb7 btrfs: be a bit more careful when setting mirror_num_ret in btrfs_map_block
The mirror_num_ret is allowed to be NULL, although it has to be set when
smap is set.  Unfortunately that is not a well enough specifiable
invariant for static type checkers, so add a NULL check to make sure they
are fine.

Fixes: 03793cbbc8 ("btrfs: add fast path for single device io in __btrfs_map_block")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-11 17:32:14 +02:00
Josef Bacik b19c98f237 btrfs: fix race between balance and cancel/pause
Syzbot reported a panic that looks like this:

  assertion failed: fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE_PAUSED, in fs/btrfs/ioctl.c:465
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/messages.c:259!
  RIP: 0010:btrfs_assertfail+0x2c/0x30 fs/btrfs/messages.c:259
  Call Trace:
   <TASK>
   btrfs_exclop_balance fs/btrfs/ioctl.c:465 [inline]
   btrfs_ioctl_balance fs/btrfs/ioctl.c:3564 [inline]
   btrfs_ioctl+0x531e/0x5b30 fs/btrfs/ioctl.c:4632
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:870 [inline]
   __se_sys_ioctl fs/ioctl.c:856 [inline]
   __x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

The reproducer is running a balance and a cancel or pause in parallel.
The way balance finishes is a bit wonky, if we were paused we need to
save the balance_ctl in the fs_info, but clear it otherwise and cleanup.
However we rely on the return values being specific errors, or having a
cancel request or no pause request.  If balance completes and returns 0,
but we have a pause or cancel request we won't do the appropriate
cleanup, and then the next time we try to start a balance we'll trip
this ASSERT.

The error handling is just wrong here, we always want to clean up,
unless we got -ECANCELLED and we set the appropriate pause flag in the
exclusive op.  With this patch the reproducer ran for an hour without
tripping, previously it would trip in less than a few minutes.

Reported-by: syzbot+c0f3acf145cb465426d5@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-11 17:31:58 +02:00
Linus Torvalds a0433f8cae for-6.5/block-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
 zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
 U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
 39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
 apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
 H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
 20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
 Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
 49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
 h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
 n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
 1ADPteUOTA==
 =zP4Y
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - Various cleanups all around (Irvin, Chaitanya, Christophe)
      - Better struct packing (Christophe JAILLET)
      - Reduce controller error logs for optional commands (Keith)
      - Support for >=64KiB block sizes (Daniel Gomez)
      - Fabrics fixes and code organization (Max, Chaitanya, Daniel
        Wagner)

 - bcache updates via Coly:
      - Fix a race at init time (Mingzhe Zou)
      - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)

 - use page pinning in the block layer for dio (David)

 - convert old block dio code to page pinning (David, Christoph)

 - cleanups for pktcdvd (Andy)

 - cleanups for rnbd (Guoqing)

 - use the unchecked __bio_add_page() for the initial single page
   additions (Johannes)

 - fix overflows in the Amiga partition handling code (Michael)

 - improve mq-deadline zoned device support (Bart)

 - keep passthrough requests out of the IO schedulers (Christoph, Ming)

 - improve support for flush requests, making them less special to deal
   with (Christoph)

 - add bdev holder ops and shutdown methods (Christoph)

 - fix the name_to_dev_t() situation and use cases (Christoph)

 - decouple the block open flags from fmode_t (Christoph)

 - ublk updates and cleanups, including adding user copy support (Ming)

 - BFQ sanity checking (Bart)

 - convert brd from radix to xarray (Pankaj)

 - constify various structures (Thomas, Ivan)

 - more fine grained persistent reservation ioctl capability checks
   (Jingbo)

 - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
   Jordy, Li, Min, Yu, Zhong, Waiman)

* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
  scsi/sg: don't grab scsi host module reference
  ext4: Fix warning in blkdev_put()
  block: don't return -EINVAL for not found names in devt_from_devname
  cdrom: Fix spectre-v1 gadget
  block: Improve kernel-doc headers
  blk-mq: don't insert passthrough request into sw queue
  bsg: make bsg_class a static const structure
  ublk: make ublk_chr_class a static const structure
  aoe: make aoe_class a static const structure
  block/rnbd: make all 'class' structures const
  block: fix the exclusive open mask in disk_scan_partitions
  block: add overflow checks for Amiga partition support
  block: change all __u32 annotations to __be32 in affs_hardblocks.h
  block: fix signed int overflow in Amiga partition support
  block: add capacity validation in bdev_add_partition()
  block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
  block: disallow Persistent Reservation on partitions
  reiserfs: fix blkdev_put() warning from release_journal_dev()
  block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
  block: document the holder argument to blkdev_get_by_path
  ...
2023-06-26 12:47:20 -07:00
Linus Torvalds cc423f6337 for-6.5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmSZ0YUACgkQxWXV+ddt
 WDuF9g/+OsEZflGYIZj11trG0l5HWApnINKqhZ524J0TNy9KxY0KOTqPCOg0O41E
 Vt7uJCMG06+ifvVEby8srjzUZxUutIuMeGIP91VyXUbt+CleAWunxnL8aKcTPS3T
 QxyGx0VmukO2UHYCXhXQLRXo/+zPcBxnk6UgVzcBCIOecOMTB1KCeblBmqd3q86f
 NVFse+BTkmtm86u/1rzEDqgY6lMHQ+jNoHhpRVGpzmSnSX8GELyOW3QnixS0LCo2
 no0vvW0QXkRJ+S68V5zBWlqa3xr21jOYcmON2Ra2G8Etsjx7W5XKS3I9k/4uonxb
 LbITmBwEZWt/aTzmLFT16S5M9BlRqH5Ffmsw7Ls+NDmdvH/f1zM8XeNAb7kpFTrn
 T3aALjkcd65/JFmgyVmzdt4BSmrUkYm0EEmLirQec86HJ4NlQJpJ2B7cfMWKPyal
 +VgaT4S+fLTc/HJD3nObMXTCxZrMf0sBUhU4/QXqL7TTjqqosSn26mlGNUocw7Ty
 HaESk7j2L9TMPt640r1G98j9ND7sWmyBmiYsah8F3MKZCIS892qhtFs0m5g2tA1F
 sjPv9u6M5Pi6ie5Eo8xs+SqKa7TPPVsbZ9XcMRBuzDc5AtUPAm6ii9QVwref8wTq
 qO379jDepgPj4HZkXMzQKxd6rw6wrF854304XhjHZefk+ChhIc4=
 =nN4X
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Mainly core changes, refactoring and optimizations.

  Performance is improved in some areas, overall there may be a
  cumulative improvement due to refactoring that removed lookups in the
  IO path or simplified IO submission tracking.

  Core:

   - submit IO synchronously for fast checksums (crc32c and xxhash),
     remove high priority worker kthread

   - read extent buffer in one go, simplify IO tracking, bio submission
     and locking

   - remove additional tracking of redirtied extent buffers, originally
     added for zoned mode but actually not needed

   - track ordered extent pointer in bio to avoid rbtree lookups during
     IO

   - scrub, use recovered data stripes as cache to avoid unnecessary
     read

   - in zoned mode, optimize logical to physical mappings of extents

   - remove PageError handling, not set by VFS nor writeback

   - cleanups, refactoring, better structure packing

   - lots of error handling improvements

   - more assertions, lockdep annotations

   - print assertion failure with the exact line where it happens

   - tracepoint updates

   - more debugging prints

  Performance:

   - speedup in fsync(), better tracking of inode logged status can
     avoid transaction commit

   - IO path structures track logical offsets in data structures and
     does not need to look it up

  User visible changes:

   - don't commit transaction for every created subvolume, this can
     reduce time when many subvolumes are created in a batch

   - print affected files when relocation fails

   - trigger orphan file cleanup during START_SYNC ioctl

  Notable fixes:

   - fix crash when disabling quota and relocation

   - fix crashes when removing roots from drity list

   - fix transacion abort during relocation when converting from newer
     profiles not covered by fallback

   - in zoned mode, stop reclaiming block groups if filesystem becomes
     read-only

   - fix rare race condition in tree mod log rewind that can miss some
     btree node slots

   - with enabled fsverity, drop up-to-date page bit in case the
     verification fails"

* tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (194 commits)
  btrfs: fix race between quota disable and relocation
  btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_roots
  btrfs: fix race when deleting free space root from the dirty cow roots list
  btrfs: fix race when deleting quota root from the dirty cow roots list
  btrfs: tracepoints: also show actual number of the outstanding extents
  btrfs: update i_version in update_dev_time
  btrfs: make btrfs_compressed_bioset static
  btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile
  btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers
  btrfs: scrub: remove scrub_ctx::csum_list member
  btrfs: do not BUG_ON after failure to migrate space during truncation
  btrfs: do not BUG_ON on failure to get dir index for new snapshot
  btrfs: send: do not BUG_ON() on unexpected symlink data extent
  btrfs: do not BUG_ON() when dropping inode items from log root
  btrfs: replace BUG_ON() at split_item() with proper error handling
  btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()
  btrfs: do not BUG_ON() on tree mod log failures at insert_ptr()
  btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()
  btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert()
  btrfs: abort transaction at update_ref_for_cow() when ref count is zero
  ...
2023-06-26 11:41:38 -07:00
Qu Wenruo cb091225a5 btrfs: fix remaining u32 overflows when left shifting stripe_nr
There was regression caused by a97699d1d6 ("btrfs: replace
map_lookup->stripe_len by BTRFS_STRIPE_LEN") and supposedly fixed by
a7299a18a1 ("btrfs: fix u32 overflows when left shifting stripe_nr").
To avoid code churn the fix was open coding the type casts but
unfortunately missed one which was still possible to hit [1].

The missing place was assignment of bioc->full_stripe_logical inside
btrfs_map_block().

Fix it by adding a helper that does the safe calculation of the offset
and use it everywhere even though it may not be strictly necessary due
to already using u64 types.  This replaces all remaining
"<< BTRFS_STRIPE_LEN_SHIFT" calls.

[1] https://lore.kernel.org/linux-btrfs/20230622065438.86402-1-wqu@suse.com/

Fixes: a7299a18a1 ("btrfs: fix u32 overflows when left shifting stripe_nr")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-22 17:03:55 +02:00
Qu Wenruo a7299a18a1 btrfs: fix u32 overflows when left shifting stripe_nr
[BUG]
David reported an ASSERT() get triggered during fio load on 8 devices
with data/raid6 and metadata/raid1c3:

  fio --rw=randrw --randrepeat=1 --size=3000m \
	  --bsrange=512b-64k --bs_unaligned \
	  --ioengine=libaio --fsync=1024 \
	  --name=job0 --name=job1 \

The ASSERT() is from rbio_add_bio() of raid56.c:

	ASSERT(orig_logical >= full_stripe_start &&
	       orig_logical + orig_len <= full_stripe_start +
	       rbio->nr_data * BTRFS_STRIPE_LEN);

Which is checking if the target rbio is crossing the full stripe
boundary.

  [100.789] assertion failed: orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN, in fs/btrfs/raid56.c:1622
  [100.795] ------------[ cut here ]------------
  [100.796] kernel BUG at fs/btrfs/raid56.c:1622!
  [100.797] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
  [100.798] CPU: 1 PID: 100 Comm: kworker/u8:4 Not tainted 6.4.0-rc6-default+ #124
  [100.799] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
  [100.802] Workqueue: writeback wb_workfn (flush-btrfs-1)
  [100.803] RIP: 0010:rbio_add_bio+0x204/0x210 [btrfs]
  [100.806] RSP: 0018:ffff888104a8f300 EFLAGS: 00010246
  [100.808] RAX: 00000000000000a1 RBX: ffff8881075907e0 RCX: ffffed1020951e01
  [100.809] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000001
  [100.811] RBP: 0000000141d20000 R08: 0000000000000001 R09: ffff888104a8f04f
  [100.813] R10: ffffed1020951e09 R11: 0000000000000003 R12: ffff88810e87f400
  [100.815] R13: 0000000041d20000 R14: 0000000144529000 R15: ffff888101524000
  [100.817] FS:  0000000000000000(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
  [100.821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [100.822] CR2: 000055d54e44c270 CR3: 000000010a9a1006 CR4: 00000000003706a0
  [100.824] Call Trace:
  [100.825]  <TASK>
  [100.825]  ? die+0x32/0x80
  [100.826]  ? do_trap+0x12d/0x160
  [100.827]  ? rbio_add_bio+0x204/0x210 [btrfs]
  [100.827]  ? rbio_add_bio+0x204/0x210 [btrfs]
  [100.829]  ? do_error_trap+0x90/0x130
  [100.830]  ? rbio_add_bio+0x204/0x210 [btrfs]
  [100.831]  ? handle_invalid_op+0x2c/0x30
  [100.833]  ? rbio_add_bio+0x204/0x210 [btrfs]
  [100.835]  ? exc_invalid_op+0x29/0x40
  [100.836]  ? asm_exc_invalid_op+0x16/0x20
  [100.837]  ? rbio_add_bio+0x204/0x210 [btrfs]
  [100.837]  raid56_parity_write+0x64/0x270 [btrfs]
  [100.838]  btrfs_submit_chunk+0x26e/0x800 [btrfs]
  [100.840]  ? btrfs_bio_init+0x80/0x80 [btrfs]
  [100.841]  ? release_pages+0x503/0x6d0
  [100.842]  ? folio_unlock+0x2f/0x60
  [100.844]  ? __folio_put+0x60/0x60
  [100.845]  ? btrfs_do_readpage+0xae0/0xae0 [btrfs]
  [100.847]  btrfs_submit_bio+0x21/0x60 [btrfs]
  [100.847]  submit_one_bio+0x6a/0xb0 [btrfs]
  [100.849]  extent_write_cache_pages+0x395/0x680 [btrfs]
  [100.850]  ? __extent_writepage+0x520/0x520 [btrfs]
  [100.851]  ? mark_usage+0x190/0x190
  [100.852]  extent_writepages+0xdb/0x130 [btrfs]
  [100.853]  ? extent_write_locked_range+0x480/0x480 [btrfs]
  [100.854]  ? mark_usage+0x190/0x190
  [100.854]  ? attach_extent_buffer_page+0x220/0x220 [btrfs]
  [100.855]  ? reacquire_held_locks+0x178/0x280
  [100.856]  ? writeback_sb_inodes+0x245/0x7f0
  [100.857]  do_writepages+0x102/0x2e0
  [100.858]  ? page_writeback_cpu_online+0x10/0x10
  [100.859]  ? __lock_release.isra.0+0x14a/0x4d0
  [100.860]  ? reacquire_held_locks+0x280/0x280
  [100.861]  ? __lock_acquired+0x1e9/0x3d0
  [100.862]  ? do_raw_spin_lock+0x1b0/0x1b0
  [100.863]  __writeback_single_inode+0x94/0x450
  [100.864]  writeback_sb_inodes+0x372/0x7f0
  [100.864]  ? lock_sync+0xd0/0xd0
  [100.865]  ? do_raw_spin_unlock+0x93/0xf0
  [100.866]  ? sync_inode_metadata+0xc0/0xc0
  [100.867]  ? rwsem_optimistic_spin+0x340/0x340
  [100.868]  __writeback_inodes_wb+0x70/0x130
  [100.869]  wb_writeback+0x2d1/0x530
  [100.869]  ? __writeback_inodes_wb+0x130/0x130
  [100.870]  ? lockdep_hardirqs_on_prepare.part.0+0xf1/0x1c0
  [100.870]  wb_do_writeback+0x3eb/0x480
  [100.871]  ? wb_writeback+0x530/0x530
  [100.871]  ? mark_lock_irq+0xcd0/0xcd0
  [100.872]  wb_workfn+0xe0/0x3f0<

[CAUSE]
Commit a97699d1d6 ("btrfs: replace map_lookup->stripe_len by
BTRFS_STRIPE_LEN") changes how we calculate the map length, to reduce
u64 division.

Function btrfs_max_io_len() is to get the length to the stripe boundary.

It calculates the full stripe start offset (inside the chunk) by the
following code:

		*full_stripe_start =
			rounddown(*stripe_nr, nr_data_stripes(map)) <<
			BTRFS_STRIPE_LEN_SHIFT;

The calculation itself is fine, but the value returned by rounddown() is
dependent on both @stripe_nr (which is u32) and nr_data_stripes() (which
returned int).

Thus the result is also u32, then we do the left shift, which can
overflow u32.

If such overflow happens, @full_stripe_start will be a value way smaller
than @offset, causing later "full_stripe_len - (offset -
*full_stripe_start)" to underflow, thus make later length calculation to
have no stripe boundary limit, resulting a write bio to exceed stripe
boundary.

There are some other locations like this, with a u32 @stripe_nr got left
shift, which can lead to a similar overflow.

[FIX]
Fix all @stripe_nr with left shift with a type cast to u64 before the
left shift.

Those involved @stripe_nr or similar variables are recording the stripe
number inside the chunk, which is small enough to be contained by u32,
but their offset inside the chunk can not fit into u32.

Thus for those specific left shifts, a type cast to u64 is necessary so
this patch does not touch them and the code will be cleaned up in the
future to keep the fix minimal.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: a97699d1d6 ("btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN")
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-20 19:10:31 +02:00
Jeff Layton c9e561c475 btrfs: update i_version in update_dev_time
When updating the ctime, we also want to update i_version.

This is just something I noticed by inspection. There is probably no way
to test this today unless you can somehow get to this inode via nfsd.
Still, I think it's the right thing to do for consistency's sake.

David Sterba's comment: I don't see anything wrong with setting the
iversion bit, however I also don't see where this would be useful.
Agreed with the consistency, otherwise the time is updated when device
super block is wiped or a device initialized, both are big events so
missing that due to lack of iversion update seems unlikely. I'll add it
to the queue, thanks.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 17:38:50 +02:00
Christoph Hellwig 8680e58761 btrfs: open code need_full_stripe conditions
need_full_stripe is just a somewhat complicated way to say
"op != BTRFS_MAP_READ".  Just spell that explicit check out, which makes
a lot of the code currently using the helper easier to understand.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig 723b8bb17e btrfs: open code btrfs_map_sblock
btrfs_map_sblock just hard codes three arguments and calls
btrfs_map_sblock.  Remove it as it doesn't provide any real value, but
makes following the btrfs_map_block call chains harder.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig cd4efd210e btrfs: rename __btrfs_map_block to btrfs_map_block
Now that the old btrfs_map_block is gone, drop the leading underscores
from __btrfs_map_block.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig d69d7ffc26 btrfs: remove unused btrfs_map_block
There are no users of btrfs_map_block left, so remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:34 +02:00
Christoph Hellwig 3965a4c793 btrfs: remove unused BTRFS_MAP_DISCARD
BTRFS_MAP_DISCARD is never set, as REQ_OP_DISCARD is never passed to
btrfs_op() only only checked in two ASSERTS.

Remove it and let the catchall WARN_ON in btrfs_op() deal with accidental
REQ_OP_DISCARDs leaked into btrfs_op(). Last use was in a4012f06f1
("btrfs: split discard handling out of btrfs_map_block").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:33 +02:00
Anand Jain a3c54b0be1 btrfs: simplify how changed fsid and metadata_uuid is checked
We often check if the metadata_uuid is not the same as fsid, and then we
check if the given fsid matches the metadata_uuid. This patch refactors
this logic into function match_fsid_changed and utilize it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain 1a89834500 btrfs: simplify fsid and metadata_uuid comparisons
Refactor the functions find_fsid() and find_fsid_with_metadata_uuid(),
as they currently share a common set of code to compare the fsid and
metadata_uuid. Create a common helper function, match_fsid_fs_devices().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain c6930d7d11 btrfs: merge calls to alloc_fs_devices in device_list_add
Simplify has_metadata_uuid checks - by localizing the has_metadata_uuid
checked within alloc_fs_devices()'s second argument, it improves the
code readability.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:29 +02:00
Anand Jain 19c4c49ca9 btrfs: streamline fsid checks in alloc_fs_devices
We currently have redundant checks for the non-null value of fsid
simplify it.

And, no one is using alloc_fs_devices() with a NULL metadata_uuid
while fsid is not NULL, add an assert() to verify this condition.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:28 +02:00
Filipe Manana f2db4d5cb4 btrfs: make btrfs_free_device() static
The function btrfs_free_device() is never used outside of volumes.c, so
make it static and remove its prototype declaration at volumes.h.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19 13:59:22 +02:00
Christoph Hellwig 05bdb99653 block: replace fmode_t with a block-specific type for block open flags
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE.  Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:05 -06:00
Christoph Hellwig 2736e8eeb0 block: use the holder as indication for exclusive opens
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder.  Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.

For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>		[btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig 2ef789288a btrfs: don't pass a holder for non-exclusive blkdev_get_by_path
Passing a holder to blkdev_get_by_path when FMODE_EXCL isn't set doesn't
make sense, so pass NULL instead and remove the holder argument from the
call chains the only end up in non-FMODE_EXCL blkdev_get_by_path calls.

Exclusive mode for device scanning is not used since commit 50d281fc43
("btrfs: scan device in non-exclusive mode")".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20230608110258.189493-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig 0718afd47f block: introduce holder ops
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims.  It will be used to
allow the block layer to call back into the user of the block device for
thing like notification of a removed device or a device resize.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Filipe Manana 611ccc58e1 btrfs: fix leak of source device allocation state after device replace
When a device replace finishes, the source device is freed by calling
btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the
allocation state, tracked in the device's alloc_state io tree, is never
freed.

This is a regression recently introduced by commit f0bb5474cf ("btrfs:
remove redundant release of btrfs_device::alloc_state"), which removed a
call to extent_io_tree_release() from btrfs_free_device(), with the
rationale that btrfs_close_one_device() already releases the allocation
state from a device and btrfs_close_one_device() is always called before
a device is freed with btrfs_free_device(). However that is not true for
the device replace case, as btrfs_free_device() is called without any
previous call to btrfs_close_one_device().

The issue is trivial to reproduce, for example, by running test btrfs/027
from fstests:

  $ ./check btrfs/027
  $ rmmod btrfs
  $ dmesg
  (...)
  [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started
  [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished
  [84519.552251] BTRFS info (device sdc): scrub: started on devid 1
  [84519.552277] BTRFS info (device sdc): scrub: started on devid 2
  [84519.552332] BTRFS info (device sdc): scrub: started on devid 3
  [84519.552705] BTRFS info (device sdc): scrub: started on devid 4
  [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
  [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0
  [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0
  [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0
  [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
  [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1
  [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1

So fix this by adding back the call to extent_io_tree_release() at
btrfs_free_device().

Fixes: f0bb5474cf ("btrfs: remove redundant release of btrfs_device::alloc_state")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-28 16:36:31 +02:00
Genjian Zhang 8ba7d5f5ba btrfs: fix uninitialized variable warnings
There are some warnings on older compilers (gcc 10, 7) or non-x86_64
architectures (aarch64).  As btrfs wants to enable -Wmaybe-uninitialized
by default, fix the warnings even though it's not necessary on recent
compilers (gcc 12+).

../fs/btrfs/volumes.c: In function ‘btrfs_init_new_device’:
../fs/btrfs/volumes.c:2703:3: error: ‘seed_devices’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 2703 |   btrfs_setup_sprout(fs_info, seed_devices);
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

../fs/btrfs/send.c: In function ‘get_cur_inode_state’:
../include/linux/compiler.h:70:32: error: ‘right_gen’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
   70 |   (__if_trace.miss_hit[1]++,1) :  \
      |                                ^
../fs/btrfs/send.c:1878:6: note: ‘right_gen’ was declared here
 1878 |  u64 right_gen;
      |      ^~~~~~~~~

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 19:52:19 +02:00
Qu Wenruo 4886ff7b50 btrfs: introduce a new helper to submit write bio for repair
Both scrub and read-repair are utilizing a special repair writes that:

- Only writes back to a single device
  Even for read-repair on RAID56, we only update the corrupted data
  stripe itself, not triggering the full RMW path.

- Requires a valid @mirror_num
  For RAID56 case, only @mirror_num == 1 is valid.
  For non-RAID56 cases, we need @mirror_num to locate our stripe.

- No data csum generation needed

These two call sites still have some differences though:

- Read-repair goes plain bio
  It doesn't need a full btrfs_bio, and goes submit_bio_wait().

- New scrub repair would go btrfs_bio
  To simplify both read and write path.

So here this patch would:

- Introduce a common helper, btrfs_map_repair_block()
  Due to the single device nature, we can use an on-stack
  btrfs_io_stripe to pass device and its physical bytenr.

- Introduce a new interface, btrfs_submit_repair_bio(), for later scrub
  code
  This is for the incoming scrub code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Anand Jain f0bb5474cf btrfs: remove redundant release of btrfs_device::alloc_state
Commit 321f69f86a ("btrfs: reset device back to allocation state when
removing") included adding extent_io_tree_release(&device->alloc_state)
to btrfs_close_one_device(), which had already been called in
btrfs_free_device().

The alloc_state tree (IO_TREE_DEVICE_ALLOC_STATE), is created in
btrfs_alloc_device() and released in btrfs_close_one_device(). Therefore,
the additional call to extent_io_tree_release(&device->alloc_state) in
btrfs_free_device() is unnecessary and can be removed.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Anand Jain 1f16033c99 btrfs: warn for any missed cleanup at btrfs_close_one_device
During my recent search for the root cause of a reported bug, I realized
that it's a good idea to issue a warning for missed cleanup instead of
using debug-only assertions. Since most installations run with debug off,
missed cleanups and premature calls to close could go unnoticed. However,
these issues are serious enough to warrant reporting and fixing.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:23 +02:00
Filipe Manana 5758d1bd2d btrfs: remove bytes_used argument from btrfs_make_block_group()
The only caller of btrfs_make_block_group() always passes 0 as the value
for the bytes_used argument, so remove it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:19 +02:00
Qu Wenruo 5f50fa918f btrfs: do not use replace target device as an extra mirror
[BUG]
Currently btrfs can use dev-replace device as an extra mirror for
read-repair.  But it can lead to NODATASUM corruption in the following
case:

 There is a RAID1 data chunk, and dev-replace is running from
 dev2 to dev0.

 |//| = Replaced data
          X       X+1MB     X+2MB
  Dev 2:  |       |         |           <- Source dev
  Dev 0:  |///////|         |           <- Target dev

Then a read on dev 2 X+2MB happens.
And something wrong happened inside devid 2, causing an -EIO.

In that case, read-repair would try the next mirror, and since we can
use target device as an extra mirror, we will use that mirror instead.

But unfortunately since the read is beyond the current replace cursor,
we should not trust it at all, what we get would be just uninitialized
garbage.

But if this read is for NODATASUM range, then we just trust them and
cause data corruption.

[CAUSE]
We used to have some checks to make sure we only return such extra
mirror when the range is before our left cursor.

The first commit introducing this behavior is ad6d620e2a ("Btrfs:
allow repair code to include target disk when searching mirrors").

But later a fix, 22ab04e814 ("Btrfs: fix race between device replace
and chunk allocation") changed the behavior, to always let
btrfs_map_block() include the extra mirror to address a race in
dev-replace which can cause missing writes to target device.

This means, we lose the tracking of cursor for the extra mirror, thus
can lead to above corruption.

[FIX]
The extra mirror is never a reliable one, at the beginning of
dev-replace, the reliability is zero, while only at the end of the
replace it's a fully reliable mirror.

We either do the complex tracking, or never trust it.

IMHO it's much easier to maintain if we don't trust it at all, and the
extra mirror can only benefit for a limited period of time (during
replace).

Thus this patch would completely remove the ability to use target device
as an extra mirror.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:16 +02:00
Qu Wenruo 18d758a2d8 btrfs: replace btrfs_io_context::raid_map with a fixed u64 value
In btrfs_io_context structure, we have a pointer raid_map, which
indicates the logical bytenr for each stripe.

But considering we always call sort_parity_stripes(), the result
raid_map[] is always sorted, thus raid_map[0] is always the logical
bytenr of the full stripe.

So why we waste the space and time (for sorting) for raid_map?

This patch will replace btrfs_io_context::raid_map with a single u64
number, full_stripe_start, by:

- Replace btrfs_io_context::raid_map with full_stripe_start

- Replace call sites using raid_map[0] to use full_stripe_start

- Replace call sites using raid_map[i] to compare with nr_data_stripes.

The benefits are:

- Less memory wasted on raid_map
  It's sizeof(u64) * num_stripes vs sizeof(u64).
  It'll always save at least one u64, and the benefit grows larger with
  num_stripes.

- No more weird alloc_btrfs_io_context() behavior
  As there is only one fixed size + one variable length array.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:14 +02:00
Qu Wenruo 1faf388506 btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.

For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).

But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.

Currently we use a complex way using tgtdev_map[] array, e.g:

 num_tgtdevs = 1
 tgtdev_map[0] = 0    <- Means stripes[0] is not involved in replace.
 tgtdev_map[1] = 3    <- Means stripes[1] is involved in replace,
			 and it's duplicated to stripes[3].
 tgtdev_map[2] = 0    <- Means stripes[2] is not involved in replace.

But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.

Thus we can change it to a fixed array to represent the mapping:

 replace_nr_stripes = 1
 replace_stripe_src = 1    <- Means stripes[1] is involved in replace.
			      thus the extra stripe is a copy of
			      stripes[1]

By this we can save some space for bioc on RAID56 chunks with many
devices.  And we get rid of one variable sized array from bioc.

Thus the patch involves the following changes:

- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
  and @replace_stripe_src.

  @num_tgtdevs is just renamed to @replace_nr_stripes.
  While the mapping is completely changed.

- Add extra ASSERT()s for RAID56 code

- Only add two more extra stripes for dev-replace cases.
  As we have an upper limit on how many dev-replace stripes we can have.

- Unify the behavior of handle_ops_on_dev_replace()
  Previously handle_ops_on_dev_replace() go two different paths for
  WRITE and GET_READ_MIRRORS.
  Now unify them by always going the WRITE path first (with at most 2
  replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
  extra stripes, just drop one stripe.

- Remove the @real_stripes argument from alloc_btrfs_io_context()
  As we don't need the old variable length array any more.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:14 +02:00
Qu Wenruo 4ced85f81a btrfs: reduce type width of btrfs_io_contexts
That structure is our ultimate object for all __btrfs_map_block()
related functions.  We have some hard to understand members, like
tgtdev_map, but without any comments.

This patch will improve the situation:

- Add extra comments for num_stripes, mirror_num, num_tgtdevs and
  tgtdev_map[]
  Especially for the last two members, add a dedicated (thus very long)
  comments for them, with example to explain it.

- Shrink those int members to u16.
  In fact our on-disk format is only using u16 for num_stripes, thus
  no need to use int at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:14 +02:00
Qu Wenruo be5c7edbfd btrfs: simplify the bioc argument for handle_ops_on_dev_replace()
There is no memory re-allocation for handle_ops_on_dev_replace(), thus
we don't need to pass a btrfs_io_context pointer.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:14 +02:00
Qu Wenruo 6ded22c1bf btrfs: reduce div64 calls by limiting the number of stripes of a chunk to u32
There are quite some div64 calls inside btrfs_map_block() and its
variants.

Such calls are for @stripe_nr, where @stripe_nr is the number of
stripes before our logical bytenr inside a chunk.

However we can eliminate such div64 calls by just reducing the width of
@stripe_nr from 64 to 32.

This can be done because our chunk size limit is already 10G, with fixed
stripe length 64K.
Thus a U32 is definitely enough to contain the number of stripes.

With such width reduction, we can get rid of slower div64, and extra
warning for certain 32bit arch.

This patch would do:

- Add a new tree-checker chunk validation on chunk length
  Make sure no chunk can reach 256G, which can also act as a bitflip
  checker.

- Reduce the width from u64 to u32 for @stripe_nr variables

- Replace unnecessary div64 calls with regular modulo and division
  32bit division and modulo are much faster than 64bit operations, and
  we are finally free of the div64 fear at least in those involved
  functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:14 +02:00
Qu Wenruo a97699d1d6 btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN
Currently btrfs doesn't support stripe lengths other than 64KiB.
This is already set in the tree-checker.

There is really no meaning to record that fixed value in map_lookup for
now, and can all be replaced with BTRFS_STRIPE_LEN.

Furthermore we can use the fix stripe length to do the following
optimization:

- Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division
  Now we only need to do a right shift.

  And the value of BTRFS_STRIPE_LEN itself is already too large for bit
  shift, thus if we accidentally use BTRFS_STRIPE_LEN to do bit shift,
  a compiler warning would be triggered.

  Thus this bit shift optimization would be safe.

- Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17 18:01:14 +02:00
Filipe Manana 2d82a40aa7 btrfs: fix deadlock when aborting transaction during relocation with scrub
Before relocating a block group we pause scrub, then do the relocation and
then unpause scrub. The relocation process requires starting and committing
a transaction, and if we have a failure in the critical section of the
transaction commit path (transaction state >= TRANS_STATE_COMMIT_START),
we will deadlock if there is a paused scrub.

That results in stack traces like the following:

  [42.479] BTRFS info (device sdc): relocating block group 53876686848 flags metadata|raid6
  [42.936] BTRFS warning (device sdc): Skipping commit of aborted transaction.
  [42.936] ------------[ cut here ]------------
  [42.936] BTRFS: Transaction aborted (error -28)
  [42.936] WARNING: CPU: 11 PID: 346822 at fs/btrfs/transaction.c:1977 btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
  [42.936] Modules linked in: dm_flakey dm_mod loop btrfs (...)
  [42.936] CPU: 11 PID: 346822 Comm: btrfs Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [42.936] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [42.936] RIP: 0010:btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
  [42.936] Code: ff ff 45 8b (...)
  [42.936] RSP: 0018:ffffb58649633b48 EFLAGS: 00010282
  [42.936] RAX: 0000000000000000 RBX: ffff8be6ef4d5bd8 RCX: 0000000000000000
  [42.936] RDX: 0000000000000002 RSI: ffffffffb35e7782 RDI: 00000000ffffffff
  [42.936] RBP: ffff8be6ef4d5c98 R08: 0000000000000000 R09: ffffb586496339e8
  [42.936] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8be6d38c7c00
  [42.936] R13: 00000000ffffffe4 R14: ffff8be6c268c000 R15: ffff8be6ef4d5cf0
  [42.936] FS:  00007f381a82b340(0000) GS:ffff8beddfcc0000(0000) knlGS:0000000000000000
  [42.936] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [42.936] CR2: 00007f1e35fb7638 CR3: 0000000117680006 CR4: 0000000000370ee0
  [42.936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [42.936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [42.936] Call Trace:
  [42.936]  <TASK>
  [42.936]  ? start_transaction+0xcb/0x610 [btrfs]
  [42.936]  prepare_to_relocate+0x111/0x1a0 [btrfs]
  [42.936]  relocate_block_group+0x57/0x5d0 [btrfs]
  [42.936]  ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
  [42.936]  btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
  [42.936]  ? __pfx_autoremove_wake_function+0x10/0x10
  [42.936]  btrfs_relocate_chunk+0x3b/0x150 [btrfs]
  [42.936]  btrfs_balance+0x8ff/0x11d0 [btrfs]
  [42.936]  ? __kmem_cache_alloc_node+0x14a/0x410
  [42.936]  btrfs_ioctl+0x2334/0x32c0 [btrfs]
  [42.937]  ? mod_objcg_state+0xd2/0x360
  [42.937]  ? refill_obj_stock+0xb0/0x160
  [42.937]  ? seq_release+0x25/0x30
  [42.937]  ? __rseq_handle_notify_resume+0x3b5/0x4b0
  [42.937]  ? percpu_counter_add_batch+0x2e/0xa0
  [42.937]  ? __x64_sys_ioctl+0x88/0xc0
  [42.937]  __x64_sys_ioctl+0x88/0xc0
  [42.937]  do_syscall_64+0x38/0x90
  [42.937]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [42.937] RIP: 0033:0x7f381a6ffe9b
  [42.937] Code: 00 48 89 44 24 (...)
  [42.937] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [42.937] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
  [42.937] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
  [42.937] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
  [42.937] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
  [42.937] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
  [42.937]  </TASK>
  [42.937] ---[ end trace 0000000000000000 ]---
  [42.937] BTRFS: error (device sdc: state A) in cleanup_transaction:1977: errno=-28 No space left
  [59.196] INFO: task btrfs:346772 blocked for more than 120 seconds.
  [59.196]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.196] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.196] task:btrfs           state:D stack:0     pid:346772 ppid:1      flags:0x00004002
  [59.196] Call Trace:
  [59.196]  <TASK>
  [59.196]  __schedule+0x392/0xa70
  [59.196]  ? __pv_queued_spin_lock_slowpath+0x165/0x370
  [59.196]  schedule+0x5d/0xd0
  [59.196]  __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
  [59.197]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.197]  scrub_pause_off+0x21/0x50 [btrfs]
  [59.197]  scrub_simple_mirror+0x1c7/0x950 [btrfs]
  [59.197]  ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
  [59.198]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.198]  scrub_stripe+0x20d/0x740 [btrfs]
  [59.198]  scrub_chunk+0xc4/0x130 [btrfs]
  [59.198]  scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
  [59.198]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.198]  btrfs_scrub_dev+0x236/0x6a0 [btrfs]
  [59.199]  ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
  [59.199]  ? _copy_from_user+0x7b/0x80
  [59.199]  btrfs_ioctl+0xde1/0x32c0 [btrfs]
  [59.199]  ? refill_stock+0x33/0x50
  [59.199]  ? should_failslab+0xa/0x20
  [59.199]  ? kmem_cache_alloc_node+0x151/0x460
  [59.199]  ? alloc_io_context+0x1b/0x80
  [59.199]  ? preempt_count_add+0x70/0xa0
  [59.199]  ? __x64_sys_ioctl+0x88/0xc0
  [59.199]  __x64_sys_ioctl+0x88/0xc0
  [59.199]  do_syscall_64+0x38/0x90
  [59.199]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.199] RIP: 0033:0x7f82ffaffe9b
  [59.199] RSP: 002b:00007f82ff9fcc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.199] RAX: ffffffffffffffda RBX: 000055b191e36310 RCX: 00007f82ffaffe9b
  [59.199] RDX: 000055b191e36310 RSI: 00000000c400941b RDI: 0000000000000003
  [59.199] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
  [59.199] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff9fd640
  [59.199] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
  [59.199]  </TASK>
  [59.199] INFO: task btrfs:346773 blocked for more than 120 seconds.
  [59.200]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.201] task:btrfs           state:D stack:0     pid:346773 ppid:1      flags:0x00004002
  [59.201] Call Trace:
  [59.201]  <TASK>
  [59.201]  __schedule+0x392/0xa70
  [59.201]  ? __pv_queued_spin_lock_slowpath+0x165/0x370
  [59.201]  schedule+0x5d/0xd0
  [59.201]  __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
  [59.201]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.201]  scrub_pause_off+0x21/0x50 [btrfs]
  [59.202]  scrub_simple_mirror+0x1c7/0x950 [btrfs]
  [59.202]  ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
  [59.202]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.202]  scrub_stripe+0x20d/0x740 [btrfs]
  [59.202]  scrub_chunk+0xc4/0x130 [btrfs]
  [59.203]  scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
  [59.203]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.203]  btrfs_scrub_dev+0x236/0x6a0 [btrfs]
  [59.203]  ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
  [59.203]  ? _copy_from_user+0x7b/0x80
  [59.203]  btrfs_ioctl+0xde1/0x32c0 [btrfs]
  [59.204]  ? should_failslab+0xa/0x20
  [59.204]  ? kmem_cache_alloc_node+0x151/0x460
  [59.204]  ? alloc_io_context+0x1b/0x80
  [59.204]  ? preempt_count_add+0x70/0xa0
  [59.204]  ? __x64_sys_ioctl+0x88/0xc0
  [59.204]  __x64_sys_ioctl+0x88/0xc0
  [59.204]  do_syscall_64+0x38/0x90
  [59.204]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.204] RIP: 0033:0x7f82ffaffe9b
  [59.204] RSP: 002b:00007f82ff1fbc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.204] RAX: ffffffffffffffda RBX: 000055b191e36790 RCX: 00007f82ffaffe9b
  [59.204] RDX: 000055b191e36790 RSI: 00000000c400941b RDI: 0000000000000003
  [59.204] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
  [59.204] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff1fc640
  [59.204] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
  [59.204]  </TASK>
  [59.204] INFO: task btrfs:346774 blocked for more than 120 seconds.
  [59.205]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.206] task:btrfs           state:D stack:0     pid:346774 ppid:1      flags:0x00004002
  [59.206] Call Trace:
  [59.206]  <TASK>
  [59.206]  __schedule+0x392/0xa70
  [59.206]  schedule+0x5d/0xd0
  [59.206]  __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
  [59.206]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.206]  scrub_pause_off+0x21/0x50 [btrfs]
  [59.207]  scrub_simple_mirror+0x1c7/0x950 [btrfs]
  [59.207]  ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
  [59.207]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.207]  scrub_stripe+0x20d/0x740 [btrfs]
  [59.208]  scrub_chunk+0xc4/0x130 [btrfs]
  [59.208]  scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
  [59.208]  ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
  [59.208]  btrfs_scrub_dev+0x236/0x6a0 [btrfs]
  [59.208]  ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
  [59.209]  ? _copy_from_user+0x7b/0x80
  [59.209]  btrfs_ioctl+0xde1/0x32c0 [btrfs]
  [59.209]  ? should_failslab+0xa/0x20
  [59.209]  ? kmem_cache_alloc_node+0x151/0x460
  [59.209]  ? alloc_io_context+0x1b/0x80
  [59.209]  ? preempt_count_add+0x70/0xa0
  [59.209]  ? __x64_sys_ioctl+0x88/0xc0
  [59.209]  __x64_sys_ioctl+0x88/0xc0
  [59.209]  do_syscall_64+0x38/0x90
  [59.209]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.209] RIP: 0033:0x7f82ffaffe9b
  [59.209] RSP: 002b:00007f82fe9fac50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.209] RAX: ffffffffffffffda RBX: 000055b191e36c10 RCX: 00007f82ffaffe9b
  [59.209] RDX: 000055b191e36c10 RSI: 00000000c400941b RDI: 0000000000000003
  [59.209] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
  [59.209] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe9fb640
  [59.209] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
  [59.209]  </TASK>
  [59.209] INFO: task btrfs:346775 blocked for more than 120 seconds.
  [59.210]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.211] task:btrfs           state:D stack:0     pid:346775 ppid:1      flags:0x00004002
  [59.211] Call Trace:
  [59.211]  <TASK>
  [59.211]  __schedule+0x392/0xa70
  [59.211]  schedule+0x5d/0xd0
  [59.211]  __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
  [59.211]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.211]  scrub_pause_off+0x21/0x50 [btrfs]
  [59.212]  scrub_simple_mirror+0x1c7/0x950 [btrfs]
  [59.212]  ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
  [59.212]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.212]  scrub_stripe+0x20d/0x740 [btrfs]
  [59.213]  scrub_chunk+0xc4/0x130 [btrfs]
  [59.213]  scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
  [59.213]  ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
  [59.213]  btrfs_scrub_dev+0x236/0x6a0 [btrfs]
  [59.213]  ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
  [59.214]  ? _copy_from_user+0x7b/0x80
  [59.214]  btrfs_ioctl+0xde1/0x32c0 [btrfs]
  [59.214]  ? should_failslab+0xa/0x20
  [59.214]  ? kmem_cache_alloc_node+0x151/0x460
  [59.214]  ? alloc_io_context+0x1b/0x80
  [59.214]  ? preempt_count_add+0x70/0xa0
  [59.214]  ? __x64_sys_ioctl+0x88/0xc0
  [59.214]  __x64_sys_ioctl+0x88/0xc0
  [59.214]  do_syscall_64+0x38/0x90
  [59.214]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.214] RIP: 0033:0x7f82ffaffe9b
  [59.214] RSP: 002b:00007f82fe1f9c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.214] RAX: ffffffffffffffda RBX: 000055b191e37090 RCX: 00007f82ffaffe9b
  [59.214] RDX: 000055b191e37090 RSI: 00000000c400941b RDI: 0000000000000003
  [59.214] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
  [59.214] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe1fa640
  [59.214] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
  [59.214]  </TASK>
  [59.214] INFO: task btrfs:346776 blocked for more than 120 seconds.
  [59.215]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.217] task:btrfs           state:D stack:0     pid:346776 ppid:1      flags:0x00004002
  [59.217] Call Trace:
  [59.217]  <TASK>
  [59.217]  __schedule+0x392/0xa70
  [59.217]  ? __pv_queued_spin_lock_slowpath+0x165/0x370
  [59.217]  schedule+0x5d/0xd0
  [59.217]  __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
  [59.217]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.217]  scrub_pause_off+0x21/0x50 [btrfs]
  [59.217]  scrub_simple_mirror+0x1c7/0x950 [btrfs]
  [59.217]  ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
  [59.218]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.218]  scrub_stripe+0x20d/0x740 [btrfs]
  [59.218]  scrub_chunk+0xc4/0x130 [btrfs]
  [59.218]  scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
  [59.219]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.219]  btrfs_scrub_dev+0x236/0x6a0 [btrfs]
  [59.219]  ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
  [59.219]  ? _copy_from_user+0x7b/0x80
  [59.219]  btrfs_ioctl+0xde1/0x32c0 [btrfs]
  [59.219]  ? should_failslab+0xa/0x20
  [59.219]  ? kmem_cache_alloc_node+0x151/0x460
  [59.219]  ? alloc_io_context+0x1b/0x80
  [59.219]  ? preempt_count_add+0x70/0xa0
  [59.219]  ? __x64_sys_ioctl+0x88/0xc0
  [59.219]  __x64_sys_ioctl+0x88/0xc0
  [59.219]  do_syscall_64+0x38/0x90
  [59.219]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.219] RIP: 0033:0x7f82ffaffe9b
  [59.219] RSP: 002b:00007f82fd9f8c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.219] RAX: ffffffffffffffda RBX: 000055b191e37510 RCX: 00007f82ffaffe9b
  [59.219] RDX: 000055b191e37510 RSI: 00000000c400941b RDI: 0000000000000003
  [59.219] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
  [59.219] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fd9f9640
  [59.219] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
  [59.219]  </TASK>
  [59.219] INFO: task btrfs:346822 blocked for more than 120 seconds.
  [59.220]       Tainted: G        W          6.3.0-rc2-btrfs-next-127+ #1
  [59.221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [59.222] task:btrfs           state:D stack:0     pid:346822 ppid:1      flags:0x00004002
  [59.222] Call Trace:
  [59.222]  <TASK>
  [59.222]  __schedule+0x392/0xa70
  [59.222]  schedule+0x5d/0xd0
  [59.222]  btrfs_scrub_cancel+0x91/0x100 [btrfs]
  [59.222]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.222]  btrfs_commit_transaction+0x572/0xeb0 [btrfs]
  [59.223]  ? start_transaction+0xcb/0x610 [btrfs]
  [59.223]  prepare_to_relocate+0x111/0x1a0 [btrfs]
  [59.223]  relocate_block_group+0x57/0x5d0 [btrfs]
  [59.223]  ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
  [59.223]  btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
  [59.224]  ? __pfx_autoremove_wake_function+0x10/0x10
  [59.224]  btrfs_relocate_chunk+0x3b/0x150 [btrfs]
  [59.224]  btrfs_balance+0x8ff/0x11d0 [btrfs]
  [59.224]  ? __kmem_cache_alloc_node+0x14a/0x410
  [59.224]  btrfs_ioctl+0x2334/0x32c0 [btrfs]
  [59.225]  ? mod_objcg_state+0xd2/0x360
  [59.225]  ? refill_obj_stock+0xb0/0x160
  [59.225]  ? seq_release+0x25/0x30
  [59.225]  ? __rseq_handle_notify_resume+0x3b5/0x4b0
  [59.225]  ? percpu_counter_add_batch+0x2e/0xa0
  [59.225]  ? __x64_sys_ioctl+0x88/0xc0
  [59.225]  __x64_sys_ioctl+0x88/0xc0
  [59.225]  do_syscall_64+0x38/0x90
  [59.225]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
  [59.225] RIP: 0033:0x7f381a6ffe9b
  [59.225] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [59.225] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
  [59.225] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
  [59.225] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
  [59.225] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
  [59.225] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
  [59.225]  </TASK>

What happens is the following:

1) A scrub is running, so fs_info->scrubs_running is 1;

2) Task A starts block group relocation, and at btrfs_relocate_chunk() it
   pauses scrub by calling btrfs_scrub_pause(). That increments
   fs_info->scrub_pause_req from 0 to 1 and waits for the scrub task to
   pause (for fs_info->scrubs_paused to be == to fs_info->scrubs_running);

3) The scrub task pauses at scrub_pause_off(), waiting for
   fs_info->scrub_pause_req to decrease to 0;

4) Task A then enters btrfs_relocate_block_group(), and down that call
   chain we start a transaction and then attempt to commit it;

5) When task A calls btrfs_commit_transaction(), it either will do the
   commit itself or wait for some other task that already started the
   commit of the transaction - it doesn't matter which case;

6) The transaction commit enters state TRANS_STATE_COMMIT_START;

7) An error happens during the transaction commit, like -ENOSPC when
   running delayed refs or delayed items for example;

8) This results in calling transaction.c:cleanup_transaction(), where
   we call btrfs_scrub_cancel(), incrementing fs_info->scrub_cancel_req
   from 0 to 1, and blocking this task waiting for fs_info->scrubs_running
   to decrease to 0;

9) From this point on, both the transaction commit and the scrub task
   hang forever:

   1) The transaction commit is waiting for fs_info->scrubs_running to
      be decreased to 0;

   2) The scrub task is at scrub_pause_off() waiting for
      fs_info->scrub_pause_req to decrease to 0 - so it can not proceed
      to stop the scrub and decrement fs_info->scrubs_running from 0 to 1.

   Therefore resulting in a deadlock.

Fix this by having cleanup_transaction(), called if a transaction commit
fails, not call btrfs_scrub_cancel() if relocation is in progress, and
having btrfs_relocate_block_group() call btrfs_scrub_cancel() instead if
the relocation failed and a transaction abort happened.

This was triggered with btrfs/061 from fstests.

Fixes: 55e3a601c8 ("btrfs: Fix data checksum error cause by replace with io-load.")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-28 00:47:00 +02:00
Anand Jain 50d281fc43 btrfs: scan device in non-exclusive mode
This fixes mkfs/mount/check failures due to race with systemd-udevd
scan.

During the device scan initiated by systemd-udevd, other user space
EXCL operations such as mkfs, mount, or check may get blocked and result
in a "Device or resource busy" error. This is because the device
scan process opens the device with the EXCL flag in the kernel.

Two reports were received:

 - btrfs/179 test case, where the fsck command failed with the -EBUSY
   error

 - LTP pwritev03 test case, where mkfs.vfs failed with
   the -EBUSY error, when mkfs.vfs tried to overwrite old btrfs filesystem
   on the device.

In both cases, fsck and mkfs (respectively) were racing with a
systemd-udevd device scan, and systemd-udevd won, resulting in the
-EBUSY error for fsck and mkfs.

Reproducing the problem has been difficult because there is a very
small window during which these userspace threads can race to
acquire the exclusive device open. Even on the system where the problem
was observed, the problem occurrences were anywhere between 10 to 400
iterations and chances of reproducing decreases with debug printk()s.

However, an exclusive device open is unnecessary for the scan process,
as there are no write operations on the device during scan. Furthermore,
during the mount process, the superblock is re-read in the below
function call chain:

  btrfs_mount_root
   btrfs_open_devices
    open_fs_devices
     btrfs_open_one_device
       btrfs_get_bdev_and_sb

So, to fix this issue, removes the FMODE_EXCL flag from the scan
operation, and add a comment.

The case where mkfs may still write to the device and a scan is running,
the btrfs signature is not written at that time so scan will not
recognize such device.

Reported-by: Sherry Yang <sherry.yang@oracle.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202303170839.fdf23068-oliver.sang@intel.com
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-28 00:46:56 +02:00
Qu Wenruo 1c3ab6dfa0 btrfs: handle missing chunk mapping more gracefully
[BUG]
During my scrub rework, I did a stupid thing like this:

        bio->bi_iter.bi_sector = stripe->logical;
        btrfs_submit_bio(fs_info, bio, stripe->mirror_num);

Above bi_sector assignment is using logical address directly, which
lacks ">> SECTOR_SHIFT".

This results a read on a range which has no chunk mapping.

This results the following crash:

  BTRFS critical (device dm-1): unable to find logical 11274289152 length 65536
  assertion failed: !IS_ERR(em), in fs/btrfs/volumes.c:6387

Sure this is all my fault, but this shows a possible problem in real
world, that some bit flip in file extents/tree block can point to
unmapped ranges, and trigger above ASSERT(), or if CONFIG_BTRFS_ASSERT
is not configured, cause invalid pointer access.

[PROBLEMS]
In the above call chain, we just don't handle the possible error from
btrfs_get_chunk_map() inside __btrfs_map_block().

[FIX]
The fix is straightforward, replace the ASSERT() with proper error
handling (callers handle errors already).

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-15 20:51:05 +01:00
Christoph Hellwig f8a02dc6fd btrfs: remove struct btrfs_io_geometry
Now that btrfs_get_io_geometry has a single caller, we can massage it
into a form that is more suitable for that caller and remove the
marshalling into and out of struct btrfs_io_geometry.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:52 +01:00
Colin Ian King 67da05b3f2 btrfs: fix spelling mistakes found using codespell
There quite a few spelling mistakes as found using codespell. Fix them.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15 19:38:50 +01:00
Anand Jain 5f58d783fd btrfs: free device in btrfs_close_devices for a single device filesystem
We have this check to make sure we don't accidentally add older devices
that may have disappeared and re-appeared with an older generation from
being added to an fs_devices (such as a replace source device). This
makes sense, we don't want stale disks in our file system. However for
single disks this doesn't really make sense.

I've seen this in testing, but I was provided a reproducer from a
project that builds btrfs images on loopback devices. The loopback
device gets cached with the new generation, and then if it is re-used to
generate a new file system we'll fail to mount it because the new fs is
"older" than what we have in cache.

Fix this by freeing the cache when closing the device for a single device
filesystem. This will ensure that the mount command passed device path is
scanned successfully during the next mount.

CC: stable@vger.kernel.org # 5.10+
Reported-by: Daan De Meyer <daandemeyer@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-09 17:02:51 +01:00
Josef Bacik 3c538de0f2 btrfs: limit device extents to the device size
There was a recent regression in btrfs/177 that started happening with
the size class patches ("btrfs: introduce size class to block group
allocator").  This however isn't a regression introduced by those
patches, but rather the bug was uncovered by a change in behavior in
these patches.  The patches triggered more chunk allocations in the
^free-space-tree case, which uncovered a race with device shrink.

The problem is we will set the device total size to the new size, and
use this to find a hole for a device extent.  However during shrink we
may have device extents allocated past this range, so we could
potentially find a hole in a range past our new shrink size.  We don't
actually limit our found extent to the device size anywhere, we assume
that we will not find a hole past our device size.  This isn't true with
shrink as we're relocating block groups and thus creating holes past the
device size.

Fix this by making sure we do not search past the new device size, and
if we wander into any device extents that start after our device size
simply break from the loop and use whatever hole we've already found.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-25 20:11:08 +01:00
Christoph Hellwig 26ecf243e4 btrfs: stop using write_one_page in btrfs_scratch_superblock
write_one_page is an awkward interface that expects the page locked and
->writepage to be implemented.  Replace that by zeroing the signature
bytes and synchronize the block device page using the proper bdev
helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-16 19:46:19 +01:00
Christoph Hellwig 0e0078f72b btrfs: factor out scratching of one regular super block
btrfs_scratch_superblocks open codes scratching super block of a
non-zoned super block.  Split the code to read, zero and write the
superblock for regular devices into a separate helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-16 19:46:16 +01:00
Qu Wenruo ed02363fbb btrfs: add extra error messages to cover non-ENOMEM errors from device_add_list()
[BUG]
When test case btrfs/219 (aka, mount a registered device but with a lower
generation) failed, there is not any useful information for the end user
to find out what's going wrong.

The mount failure just looks like this:

  #  mount -o loop /tmp/219.img2 /mnt/btrfs/
  mount: /mnt/btrfs: mount(2) system call failed: File exists.
         dmesg(1) may have more information after failed mount system call.

While the dmesg contains nothing but the loop device change:

  loop1: detected capacity change from 0 to 524288

[CAUSE]
In device_list_add() we have a lot of extra checks to reject invalid
cases.

That function also contains the regular device scan result like the
following prompt:

  BTRFS: device fsid 6222333e-f9f1-47e6-b306-55ddd4dcaef4 devid 1 transid 8 /dev/loop0 scanned by systemd-udevd (3027)

But unfortunately not all errors have their own error messages, thus if
we hit something wrong in device_add_list(), there may be no error
messages at all.

[FIX]
Add errors message for all non-ENOMEM errors.

For ENOMEM, I'd say we're in a much worse situation, and there should be
some OOM messages way before our call sites.

CC: stable@vger.kernel.org # 6.0+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11 20:04:12 +01:00
void0red 1742e1c90c btrfs: fix extent map use-after-free when handling missing device in read_one_chunk
Store the error code before freeing the extent_map. Though it's
reference counted structure, in that function it's the first and last
allocation so this would lead to a potential use-after-free.

The error can happen eg. when chunk is stored on a missing device and
the degraded mount option is missing.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216721
Reported-by: eriri <1527030098@qq.com>
Fixes: adfb69af7d ("btrfs: add_missing_dev() should return the actual error")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: void0red <void0red@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:59 +01:00
Christoph Hellwig 103c19723c btrfs: split the bio submission path into a separate file
The code used by btrfs_submit_bio only interacts with the rest of
volumes.c through __btrfs_map_block (which itself is a more generic
version of two exported helpers) and does not really have anything
to do with volumes.c.  Create a new bio.c file and a bio.h header
going along with it for the btrfs_bio-based storage layer, which
will grow even more going forward.

Also update the file with my copyright notice given that a large
part of the moved code was written or rewritten by me.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:57 +01:00
Li zeming 9f0eac070d btrfs: allocate btrfs_io_context without GFP_NOFAIL
The __GFP_NOFAIL flag could loop indefinitely when allocation memory in
alloc_btrfs_io_context. The callers starting from __btrfs_map_block
already handle errors so it's safe to drop the flag.

Signed-off-by: Li zeming <zeming@nfschina.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:57 +01:00
Qu Wenruo cb3e217bdb btrfs: use btrfs_dev_name() helper to handle missing devices better
[BUG]
If dev-replace failed to re-construct its data/metadata, the kernel
message would be incorrect for the missing device:

 BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
 BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault)

Note the above "dev (efault)" of the second line.
While the first line is properly reporting "<missing disk>".

[CAUSE]
Although dev-replace is using btrfs_dev_name(), the heavy lifting work
is still done by scrub (scrub is reused by both dev-replace and regular
scrub).

Unfortunately scrub code never uses btrfs_dev_name() helper, as it's
only declared locally inside dev-replace.c.

[FIX]
Fix the output by:

- Move the btrfs_dev_name() helper to volumes.h

- Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls
  Only zoned code is not touched, as I'm not familiar with degraded
  zoned code.

- Constify return value and parameter

Now the output looks pretty sane:

 BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started
 BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk>

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:57 +01:00
Anand Jain bb21e30260 btrfs: move device->name RCU allocation and assign to btrfs_alloc_device()
There is a repeating code section in the parent function after calling
btrfs_alloc_device(), as below:

      name = rcu_string_strdup(path, GFP_...);
      if (!name) {
              btrfs_free_device(device);
              return ERR_PTR(-ENOMEM);
      }
      rcu_assign_pointer(device->name, name);

Except in add_missing_dev() for obvious reasons.

This patch consolidates that repeating code into the btrfs_alloc_device()
itself so that the parent function doesn't have to duplicate code.
This consolidation also helps to review issues regarding RCU lock
violation with device->name.

Parent function device_list_add() and add_missing_dev() use GFP_NOFS for
the allocation, whereas the rest of the parent functions use GFP_KERNEL,
so bring the NOFS allocation context using memalloc_nofs_save() in the
function device_list_add() and add_missing_dev() is already doing it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:55 +01:00
David Sterba 35da5a7ede btrfs: drop private_data parameter from extent_io_tree_init
All callers except one pass NULL, so the parameter can be dropped and
the inode::io_tree initialization can be open coded.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:54 +01:00
David Sterba 428c8e0310 btrfs: simplify percent calculation helpers, rename div_factor
The div_factor* helpers calculate fraction or percentage fraction. The
name is a bit confusing, we use it only for percentage calculations and
there are two helpers.

There's a helper mult_frac that's for general fractions, that tries to
be accurate but we multiply and divide by small numbers so we can use
the div_u64 helper.

Rename the div_factor* helpers and use 1..100 percentage range, also drop
the case checking for percentage == 100, it's never hit.

The conversions:

* div_factor calculates tenths and the numbers need to be adjusted
* div_factor_fine is direct replacement

Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:48 +01:00
Josef Bacik 7f0add250f btrfs: move super_block specific helpers into super.h
This will make syncing fs.h to user space a little easier if we can pull
the super block specific helpers out of fs.h and put them in super.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik 2fc6822c99 btrfs: move scrub prototypes into scrub.h
Move these out of ctree.h into scrub.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik 677074792a btrfs: move relocation prototypes into relocation.h
Move these out of ctree.h into relocation.h to cut down on code in
ctree.h

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:47 +01:00
Josef Bacik 7572dec8f5 btrfs: move ioctl prototypes into ioctl.h
Move these out of ctree.h into ioctl.h to cut down on code in ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
Josef Bacik c7a03b524d btrfs: move uuid tree prototypes to uuid-tree.h
Move these out of ctree.h into uuid-tree.h to cut down on the code in
ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:46 +01:00
David Sterba 43dd529abe btrfs: update function comments
Update, reformat or reword function comments. This also removes the kdoc
marker so we don't get reports when the function name is missing.

Changes made:

- remove kdoc markers
- reformat the brief description to be a proper sentence
- reword to imperative voice
- align parameter list
- fix typos

Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:45 +01:00
Josef Bacik 07e81dc944 btrfs: move accessor helpers into accessors.h
This is a large patch, but because they're all macros it's impossible to
split up.  Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:42 +01:00
Josef Bacik c7f13d428e btrfs: move fs wide helpers out of ctree.h
We have several fs wide related helpers in ctree.h.  The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code.  Move these into a fs.h header, which will
serve as the location for file system wide related helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:41 +01:00
David Sterba 63a7cb1307 btrfs: auto enable discard=async when possible
There's a request to automatically enable async discard for capable
devices. We can do that, the async mode is designed to wait for larger
freed extents and is not intrusive, with limits to iops, kbps or latency.

The status and tunables will be exported in /sys/fs/btrfs/FSID/discard .

The automatic selection is done if there's at least one discard capable
device in the filesystem (not capable devices are skipped). Mounting
with any other discard option will honor that option, notably mounting
with nodiscard will keep it disabled.

Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05 18:00:41 +01:00
Johannes Thumshirn a8d1b1647b btrfs: zoned: initialize device's zone info for seeding
When performing seeding on a zoned filesystem it is necessary to
initialize each zoned device's btrfs_zoned_device_info structure,
otherwise mounting the filesystem will cause a NULL pointer dereference.

This was uncovered by fstests' testcase btrfs/163.

CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-07 14:35:24 +01:00
Johannes Thumshirn 21e61ec6d0 btrfs: zoned: clone zoned device info when cloning a device
When cloning a btrfs_device, we're not cloning the associated
btrfs_zoned_device_info structure of the device in case of a zoned
filesystem.

Later on this leads to a NULL pointer dereference when accessing the
device's zone_info for instance when setting a zone as active.

This was uncovered by fstests' testcase btrfs/161.

CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-07 14:35:21 +01:00
Liu Shixin 0fca385d6e btrfs: fix match incorrectly in dev_args_match_device
syzkaller found a failed assertion:

  assertion failed: (args->devid != (u64)-1) || args->missing, in fs/btrfs/volumes.c:6921

This can be triggered when we set devid to (u64)-1 by ioctl. In this
case, the match of devid will be skipped and the match of device may
succeed incorrectly.

Patch 562d7b1512 introduced this function which is used to match device.
This function contains two matching scenarios, we can distinguish them by
checking the value of args->missing rather than check whether args->devid
and args->uuid is default value.

Reported-by: syzbot+031687116258450f9853@syzkaller.appspotmail.com
Fixes: 562d7b1512 ("btrfs: handle device lookup with btrfs_dev_lookup_args")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-07 14:30:45 +01:00
Qu Wenruo 76a66ba101 btrfs: don't use btrfs_chunk::sub_stripes from disk
[BUG]
There are two reports (the earliest one from LKP, a more recent one from
kernel bugzilla) that we can have some chunks with 0 as sub_stripes.

This will cause divide-by-zero errors at btrfs_rmap_block, which is
introduced by a recent kernel patch ac0677348f ("btrfs: merge
calculations for simple striped profiles in btrfs_rmap_block"):

		if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
				 BTRFS_BLOCK_GROUP_RAID10)) {
			stripe_nr = stripe_nr * map->num_stripes + i;
			stripe_nr = div_u64(stripe_nr, map->sub_stripes); <<<
		}

[CAUSE]
From the more recent report, it has been proven that we have some chunks
with 0 as sub_stripes, mostly caused by older mkfs.

It turns out that the mkfs.btrfs fix is only introduced in 6718ab4d33aa
("btrfs-progs: Initialize sub_stripes to 1 in btrfs_alloc_data_chunk")
which is included in v5.4 btrfs-progs release.

So there would be quite some old filesystems with such 0 sub_stripes.

[FIX]
Just don't trust the sub_stripes values from disk.

We have a trusted btrfs_raid_array[] to fetch the correct sub_stripes
numbers for each profile and that are fixed.

By this, we can keep the compatibility with older filesystems while
still avoid divide-by-zero bugs.

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Viktor Kuzmin <kvaster@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216559
Fixes: ac0677348f ("btrfs: merge calculations for simple striped profiles in btrfs_rmap_block")
CC: stable@vger.kernel.org # 6.0
Reviewed-by: Su Yue <glass@fydeos.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-10-25 10:17:33 +02:00
Qu Wenruo a05d3c9153 btrfs: check superblock to ensure the fs was not modified at thaw time
[BACKGROUND]
There is an incident report that, one user hibernated the system, with
one btrfs on removable device still mounted.

Then by some incident, the btrfs got mounted and modified by another
system/OS, then back to the hibernated system.

After resuming from the hibernation, new write happened into the victim btrfs.

Now the fs is completely broken, since the underlying btrfs is no longer
the same one before the hibernation, and the user lost their data due to
various transid mismatch.

[REPRODUCER]
We can emulate the situation using the following small script:

  truncate -s 1G $dev
  mkfs.btrfs -f $dev
  mount $dev $mnt
  fsstress -w -d $mnt -n 500
  sync
  xfs_freeze -f $mnt
  cp $dev $dev.backup

  # There is no way to mount the same cloned fs on the same system,
  # as the conflicting fsid will be rejected by btrfs.
  # Thus here we have to wipe the fs using a different btrfs.
  mkfs.btrfs -f $dev.backup

  dd if=$dev.backup of=$dev bs=1M
  xfs_freeze -u $mnt
  fsstress -w -d $mnt -n 20
  umount $mnt
  btrfs check $dev

The final fsck will fail due to some tree blocks has incorrect fsid.

This is enough to emulate the problem hit by the unfortunate user.

[ENHANCEMENT]
Although such case should not be that common, it can still happen from
time to time.

From the view of btrfs, we can detect any unexpected super block change,
and if there is any unexpected change, we just mark the fs read-only,
and thaw the fs.

By this we can limit the damage to minimal, and I hope no one would lose
their data by this anymore.

Suggested-by: Goffredo Baroncelli <kreijack@libero.it>
Link: https://lore.kernel.org/linux-btrfs/83bf3b4b-7f4c-387a-b286-9251e3991e34@bluemole.com/
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Christoph Hellwig 928ff3beb8 btrfs: stop allocation a btrfs_io_context for simple I/O
The I/O context structure is only used to pass the btrfs_device to
the end I/O handler for I/Os that go to a single device.

Stop allocating the I/O context for these cases by passing the optional
btrfs_io_stripe argument to __btrfs_map_block to query the mapping
information and then using a fast path submission and I/O completion
handler.  As the old btrfs_io_context based I/O submission path is
only used for mirrored writes, rename the functions to make that
clear and stop setting the btrfs_bio device and mirror_num field
that is only used for reads.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Christoph Hellwig 03793cbbc8 btrfs: add fast path for single device io in __btrfs_map_block
There is no need for most of the btrfs_io_context when doing I/O to a
single device.  To support such I/O without the extra btrfs_io_context
allocation, turn the mirror_num argument into a pointer so that it can
be used to output the selected mirror number, and add an optional
argument that points to a btrfs_io_stripe structure, which will be
filled with a single extent if provided by the caller.

In that case the btrfs_io_context allocation can be skipped as all
information for the single device I/O is provided in the mirror_num
argument and the on-stack btrfs_io_stripe.  A caller that makes use of
this new argument will be added in the next commit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Christoph Hellwig 28793b194e btrfs: decide bio cloning inside submit_stripe_bio
Remove the orig_bio argument as it can be derived from the bioc, and
the clone argument as it can be calculated from bioc and dev_nr.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Christoph Hellwig 32747c4455 btrfs: factor out low-level bio setup from submit_stripe_bio
Split out a low-level btrfs_submit_dev_bio helper that just submits
the bio without any cloning decisions or setting up the end I/O handler
for future reuse by a different caller.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Christoph Hellwig 917f32a235 btrfs: give struct btrfs_bio a real end_io handler
Currently btrfs_bio end I/O handling is a bit of a mess.  The bi_end_io
handler and bi_private pointer of the embedded struct bio are both used
to handle the completion of the high-level btrfs_bio and for the I/O
completion for the low-level device that the embedded bio ends up being
sent to.

To support this bi_end_io and bi_private are saved into the
btrfs_io_context structure and then restored after the bio sent to the
underlying device has completed the actual I/O.

Untangle this by adding an end I/O handler and private data to struct
btrfs_bio for the high-level btrfs_bio based completions, and leave the
actual bio bi_end_io handler and bi_private pointer entirely to the
low-level device I/O.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Christoph Hellwig f1c2937976 btrfs: properly abstract the parity raid bio handling
The parity raid write/recover functionality is currently not very well
abstracted from the bio submission and completion handling in volumes.c:

 - the raid56 code directly completes the original btrfs_bio fed into
   btrfs_submit_bio instead of dispatching back to volumes.c
 - the raid56 code consumes the bioc and bio_counter references taken
   by volumes.c, which also leads to special casing of the calls from
   the scrub code into the raid56 code

To fix this up supply a bi_end_io handler that calls back into the
volumes.c machinery, which then puts the bioc, decrements the bio_counter
and completes the original bio, and updates the scrub code to also
take ownership of the bioc and bio_counter in all cases.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Christoph Hellwig c3a62baf21 btrfs: use chained bios when cloning
The stripes_pending in the btrfs_io_context counts number of inflight
low-level bios for an upper btrfs_bio.  For reads this is generally
one as reads are never cloned, while for writes we can trivially use
the bio remaining mechanisms that is used for chained bios.

To be able to make use of that mechanism, split out a separate trivial
end_io handler for the cloned bios that does a minimal amount of error
tracking and which then calls bio_endio on the original bio to transfer
control to that, with the remaining counter making sure it is completed
last.  This then allows to merge btrfs_end_bioc into the original bio
bi_end_io handler.

To make this all work all error handling needs to happen through the
bi_end_io handler, which requires a small amount of reshuffling in
submit_stripe_bio so that the bio is cloned already by the time the
suitability of the device is checked.

This reduces the size of the btrfs_io_context and prepares splitting
the btrfs_bio at the stripe boundary.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:59 +02:00
Christoph Hellwig 2bbc72f14f btrfs: don't take a bio_counter reference for cloned bios
Stop grabbing an extra bio_counter reference for each clone bio in a
mirrored write and instead just release the one original reference in
btrfs_end_bioc once all the bios for a single btrfs_bio have completed
instead of at the end of btrfs_submit_bio once all bios have been
submitted.

This means the reference is now carried by the "upper" btrfs_bio only
instead of each lower bio.

Also remove the now unused btrfs_bio_counter_inc_noblocked helper.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:58 +02:00
Christoph Hellwig 6b42f5e343 btrfs: pass the operation to btrfs_bio_alloc
Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset
set does.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:58 +02:00
Christoph Hellwig d45cfb883b btrfs: move btrfs_bio allocation to volumes.c
volumes.c is the place that implements the storage layer using the
btrfs_bio structure, so move the bio_set and allocation helpers there
as well.

To make up for the new initialization boilerplate, merge the two
init/exit helpers in extent_io.c into a single one.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:58 +02:00
Josef Bacik 588a486835 btrfs: remove lock protection for BLOCK_GROUP_FLAG_RELOCATING_REPAIR
Before when this was modifying the bit field we had to protect it with
the bg->lock, however now we're using bit helpers so we can stop
using the bg->lock.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:54 +02:00
Josef Bacik 9283b9e09a btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY
We use this during device replace for zoned devices, we were simply
taking the lock because it was in a bit field and we needed the lock to
be safe with other modifications in the bitfield.  With the bit helpers
we no longer require that locking.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:54 +02:00
Josef Bacik 3349b57fd4 btrfs: convert block group bit field to use bit helpers
We use a bit field in the btrfs_block_group for different flags, however
this is awkward because we have to hold the block_group->lock for any
modification of any of these fields, and makes the code clunky for a few
of these flags.  Convert these to a properly flags setup so we can
utilize the bit helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:27:54 +02:00
Qu Wenruo 5da431b71d btrfs: fix the max chunk size and stripe length calculation
[BEHAVIOR CHANGE]
Since commit f6fca3917b ("btrfs: store chunk size in space-info
struct"), btrfs no longer can create larger data chunks than 1G:

  mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4
  mount $dev1 $mnt

  btrfs balance start --full $mnt
  btrfs balance start --full $mnt
  umount $mnt

  btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2

Before that offending commit, what we got is a 4G data chunk:

	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
		length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 4 sub_stripes 1

Now what we got is only 1G data chunk:

	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176
		length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 4 sub_stripes 1

This will increase the number of data chunks by the number of devices,
not only increase system chunk usage, but also greatly increase mount
time.

Without a proper reason, we should not change the max chunk size.

[CAUSE]
Previously, we set max data chunk size to 10G, while max data stripe
length to 1G.

Commit f6fca3917b ("btrfs: store chunk size in space-info struct")
completely ignored the 10G limit, but use 1G max stripe limit instead,
causing above shrink in max data chunk size.

[FIX]
Fix the max data chunk size to 10G, and in decide_stripe_size_regular()
we limit stripe_size to 1G manually.

This should only affect data chunks, as for metadata chunks we always
set the max stripe size the same as max chunk size (256M or 1G
depending on fs size).

Now the same script result the same old result:

	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
		length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 4 sub_stripes 1

Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Fixes: f6fca3917b ("btrfs: store chunk size in space-info struct")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-06 17:49:58 +02:00
Zixuan Fu 9ea0106a7a btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if
the path is invalid. In this case, btrfs_get_dev_args_from_path()
returns directly without freeing args->uuid and args->fsid allocated
before, which causes memory leak.

To fix these possible leaks, when btrfs_get_bdev_and_sb() fails,
btrfs_put_dev_args_from_path() is called to clean up the memory.

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Fixes: faa775c41d ("btrfs: add a btrfs_get_dev_args_from_path helper")
CC: stable@vger.kernel.org # 5.16
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Zixuan Fu <r33s3n6@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-08-22 18:06:33 +02:00
Christoph Hellwig d28beb3e81 btrfs: merge btrfs_dev_stat_print_on_error with its only caller
Fold it into the only caller.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:42 +02:00
Christoph Hellwig b9af128d1e btrfs: raid56: transfer the bio counter reference to the raid submission helpers
Transfer the bio counter reference acquired by btrfs_submit_bio to
raid56_parity_write and raid56_parity_recovery together with the bio
that the reference was acquired for instead of acquiring another
reference in those helpers and dropping the original one in
btrfs_submit_bio.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:40 +02:00
Christoph Hellwig 6065fd95da btrfs: do not return errors from raid56_parity_recover
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it.  This matches what
the block layer submission does and avoids any confusion on who
needs to handle errors.

Also use the proper bool type for the generic_io argument.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:39 +02:00
Christoph Hellwig 31683f4aae btrfs: do not return errors from raid56_parity_write
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it.  This matches what
the block layer submission does and avoids any confusion on who
needs to handle errors.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:39 +02:00