If we extended the size of a swapfile after its header was created (by the
mkswap utility) and then try to activate it, we will map the entire file
when activating the swap file, instead of limiting to the max size defined
in the swap file's header.
Currently test case generic/643 from fstests fails because we do not
respect that size limit defined in the swap file's header.
So fix this by not mapping file ranges beyond the max size defined in the
swap header.
This is the same type of bug that iomap used to have, and was fixed in
commit 36ca7943ac ("mm/swap: consider max pages in
iomap_swapfile_add_extent").
Fixes: ed46ff3d42 ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we're going to want to use btrfs_truncate_inode_items
without looking up the associated inode. In order to accommodate this
add the inode to btrfs_truncate_control and handle the case where
control->inode is NULL appropriately. This is fairly straightforward,
we simply need to add a helper for the trace points, as the file extent
map update is controlled by a flag on btrfs_truncate_control.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the future we are going to want to truncate inode items without
needing to have an btrfs_inode to pass in, so add ino to the
btrfs_truncate_control and use that to look up the inode items to
truncate.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We only care about updating the file extent range when we are doing a
normal truncation. We skip this for tree logging currently, but we can
also skip this for eviction as well. Using a flag makes it more
explicit when we want to do this work.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently have a bunch of awkward checks to make sure we only update
the inode i_bytes if we're truncating the real inode. Instead keep
track of the number of bytes we need to sub in the
btrfs_truncate_control, and then do the appropriate adjustment in the
truncate paths that care.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently will update the i_size of the inode as we truncate it down,
however we skip this if we're calling btrfs_truncate_inode_items from
the tree log code. However we also don't care about this in the case of
evict. Instead keep track of this value in the btrfs_truncate_control
and then have btrfs_truncate() and the free space cache truncate path
both do the i_size update themselves.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I'm going to be adding more arguments and counters to
btrfs_truncate_inode_items, so add a control struct to handle all of the
extra arguments to make it easier to follow.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a special case in btrfs_truncate_inode_items() to call
btrfs_kill_delayed_inode_items() if min_type == 0, which is only called
during evict.
Instead move this out into evict proper, and add some comments because I
erroneously attempted to remove this code altogether without
understanding what we were doing.
Evict is updating the inode only because we only care about making sure
the i_nlink count has hit disk. If we had pending deletions we don't
want to process those via the delayed inode updates, we simply want to
drop all of them and reclaim the reserved metadata space. Then from
there the btrfs_truncate_inode_items() will do the work to remove all of
the items as appropriate.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we are locking the extent and dropping the extent cache for
any inodes we truncate, unless they're in the tree log. We call this
helper from:
- truncate
- evict
- tree log
- free space cache truncation
For evict we've already dropped all of the extent cache for this inode
once we've gotten here, and we're the only one accessing this inode, so
this step is unnecessary.
For the tree log code we already skip this part.
Pull this work into the truncate path and the free space cache
truncation path.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is an inode item related manipulation with a few vfs related
adjustments. I'm going to remove the vfs related code from this helper
and simplify it a lot, but I want those changes to be easily seen via
git blame, so move this function now and then the simplification work
can be done.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few helpers in inode-item.c, and I'm going to make a few
changes to how we do truncate in the future, so break out these
definitions into their own header file to trim down ctree.h some and
make it easier to do the work on truncate in the future.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have multiple csum roots in the future, so convert all
users of ->csum_root to btrfs_csum_root() and rename ->csum_root to
->_csum_root so we can easily find remaining users in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few places where we skip doing csums if we mounted with one of
the rescue options that ignores bad csum roots. In the future when
there are multiple csum roots it'll be costly to check and see if there
are any missing csum roots, so simply add a flag to indicate the fs
should skip loading csums in case of errors.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We used to need the root for btrfs_reserve_metadata_bytes to check the
orphan cleanup state, but we no longer need that, we simply need the
fs_info. Change btrfs_reserve_metadata_bytes() to use the fs_info, and
change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the
same as they simply call btrfs_reserve_metadata_bytes() and then
manipulate the block_rsv that is being used.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we don't care about the stage of the orphan_cleanup_state,
simply replace it with a bit on ->state to make sure we don't call the
orphan cleanup every time we wander into this root.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I forgot to convert this over when I introduced the global reserve
stealing code to the space flushing code. Evict was simply trying to
make its reservation and then if it failed it would steal from the
global rsv, which is racey because it's outside of the normal ticketing
code.
Fix this by setting ticket->steal if we are BTRFS_RESERVE_FLUSH_EVICT,
and then make the priority flushing path do the steal for us.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of getting the btrfs_item for this, simply pass in the slot of
the item and then use the btrfs_item_size_nr() helper inside of
btrfs_file_extent_inline_item_len().
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a direct IO write against a file range that either has
preallocated extents in that range or has regular extents and the file
has the NOCOW attribute set, the write fails with -ENOSPC when all of
the following conditions are met:
1) There are no data blocks groups with enough free space matching
the size of the write;
2) There's not enough unallocated space for allocating a new data block
group;
3) The extents in the target file range are not shared, neither through
snapshots nor through reflinks.
This is wrong because a NOCOW write can be done in such case, and in fact
it's possible to do it using a buffered IO write, since when failing to
allocate data space, the buffered IO path checks if a NOCOW write is
possible.
The failure in direct IO write path comes from the fact that early on,
at btrfs_dio_iomap_begin(), we try to allocate data space for the write
and if it that fails we return the error and stop - we never check if we
can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check
if we can do a NOCOW write into the range, or a subset of the range, and
then release the previously reserved data space.
Fix this by doing the data reservation only if needed, when we must COW,
at btrfs_get_blocks_direct_write() instead of doing it at
btrfs_dio_iomap_begin(). This also simplifies a bit the logic and removes
the inneficiency of doing unnecessary data reservations.
The following example test script reproduces the problem:
$ cat dio-nocow-enospc.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
# Use a small fixed size (1G) filesystem so that it's quick to fill
# it up.
# Make sure the mixed block groups feature is not enabled because we
# later want to not have more space available for allocating data
# extents but still have enough metadata space free for the file writes.
mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV
mount $DEV $MNT
# Create our test file with the NOCOW attribute set.
touch $MNT/foobar
chattr +C $MNT/foobar
# Now fill in all unallocated space with data for our test file.
# This will allocate a data block group that will be full and leave
# no (or a very small amount of) unallocated space in the device, so
# that it will not be possible to allocate a new block group later.
echo
echo "Creating test file with initial data..."
xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar
# Now try a direct IO write against file range [0, 10M[.
# This should succeed since this is a NOCOW file and an extent for the
# range was previously allocated.
echo
echo "Trying direct IO write over allocated space..."
xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar
umount $MNT
When running the test:
$ ./dio-nocow-enospc.sh
(...)
Creating test file with initial data...
wrote 943718400/943718400 bytes at offset 0
900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec)
Trying direct IO write over allocated space...
pwrite: No space left on device
A test case for fstests will follow, testing both this direct IO write
scenario as well as the buffered IO write scenario to make it less likely
to get future regressions on the buffered IO case.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmF/7PAACgkQxWXV+ddt
WDtp6A//SbVYeuHWpsXkhBiOpJt2PpS1K8VY5LIJc3brua5EZm8IarlR57X9IqYu
89ZlWnuANrw4d5RRiIO+NYhc+DR6+ydxHesJG+I2B+o5OnR0Ynb06gLhsP1tSK6y
lYZORQFJZP051ODU/uEc8A0KZN7DySIUmqezAibfyxepF6oPEap0nFp17/B80tWp
sKdMp2TBN5ymZwsdSK1nZ7ws1ZL57HgkFDPqp8m8CuPTkneG4CtNol6yUpuPExpL
QzvQsqTygmiFoy0uNTG7Rg7IlKqEuhbR7lwfkmcBZCV66JmhFco5QhxN13QIn42s
+YSug52SMWc8YVHIEj16xtBgHEqZXWYey8d2ewhc0tDSGDm0HmXCNjcn1vYr0NJr
5bW/7/3bpkHYejasy1wDEK5P8Uo2xsgpRyAvuEReGoRi8ze66EohahvP3o7YJi/Q
o0pROXdCT89JbM/T4MTvN/5MUlCSM7rnexXZ39ldGNacPgn9FAUCPw6KtzKKyVRe
DF19nPOUXSg6SLECbVkRQUwcOjxOTFP+T0Jx61Um8bomFskYJJnmr4SD3pqlzgp7
NxV5ad0+r7zU0x9MADkyqboObo0ROAfD4hthcZiRN+0UIK+Gq5nATTD5ur6/nwsT
0PJGOXDPz7cmfqUdmvpA0ctRxbFEqpaz6sDh7nq/iUSmaGITcUM=
=HvYu
-----END PGP SIGNATURE-----
Merge tag 'for-5.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The updates this time are more under the hood and enhancing existing
features (subpage with compression and zoned namespaces).
Performance related:
- misc small inode logging improvements (+3% throughput, -11% latency
on sample dbench workload)
- more efficient directory logging: bulk item insertion, less tree
searches and locking
- speed up bulk insertion of items into a b-tree, which is used when
logging directories, when running delayed items for directories
(fsync and transaction commits) and when running the slow path
(full sync) of an fsync (bulk creation run time -4%, deletion -12%)
Core:
- continued subpage support
- make defragmentation work
- make compression write work
- zoned mode
- support ZNS (zoned namespaces), zone capacity is number of
usable blocks in each zone
- add dedicated block group (zoned) for relocation, to prevent
out of order writes in some cases
- greedy block group reclaim, pick the ones with least usable
space first
- preparatory work for send protocol updates
- error handling improvements
- cleanups and refactoring
Fixes:
- lockdep warnings
- in show_devname callback, on seeding device
- device delete on loop device due to conversions to workqueues
- fix deadlock between chunk allocation and chunk btree modifications
- fix tracking of missing device count and status"
* tag 'for-5.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (140 commits)
btrfs: remove root argument from check_item_in_log()
btrfs: remove root argument from add_link()
btrfs: remove root argument from btrfs_unlink_inode()
btrfs: remove root argument from drop_one_dir_item()
btrfs: clear MISSING device status bit in btrfs_close_one_device
btrfs: call btrfs_check_rw_degradable only if there is a missing device
btrfs: send: prepare for v2 protocol
btrfs: fix comment about sector sizes supported in 64K systems
btrfs: update device path inode time instead of bd_inode
fs: export an inode_update_time helper
btrfs: fix deadlock when defragging transparent huge pages
btrfs: sysfs: convert scnprintf and snprintf to sysfs_emit
btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE
btrfs: update comments for chunk allocation -ENOSPC cases
btrfs: fix deadlock between chunk allocation and chunk btree modifications
btrfs: zoned: use greedy gc for auto reclaim
btrfs: check-integrity: stop storing the block device name in btrfsic_dev_state
btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls
btrfs: add a btrfs_get_dev_args_from_path helper
btrfs: handle device lookup with btrfs_dev_lookup_args
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KDgQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpmQ2D/wO0nH3U+3+OZChi3XUwYck9Dev3o6BANCF
ClATiK/kivZY0xY1r8J4ixirZo2gcjIMpWSC3JGYZ5LdspfmYGLUbMjfZsaeU23i
lAKaX1IqfArmHN76k3IU1bKCg7B0/LFwC0q9QTFWTSwNSs8RK/EZLJ61U1hEXUb3
OfIpaMmvPiMaU7yuPqhcZK14m1cg1srrLM4rFB/PqsWWStF07pHq32WeArGDAU0e
Fe0YSnYD7qqA5Qc37KwqjCTmmxKX5YZf7etIcA6p3DNmwcuQrVNzKoCH/ZEDijaD
E2bS/BWbN1x96+rtoEZfBYEaNIrkmJzmW6+fJ53OITbJF3KqP6V66erhqNcFYCzC
mhFlRe7voXb/8AP7zQqSIhK529BUBM36sQ6nF7EiQcDrfLc1z39mq6eblUxbknIA
DDPISD5Tseik9N9x0bc7vINseKyHI1E90VAU/XKADcuGbzLvehPx+2p+Iq5ch5Ah
oa1G3RdlWWQOZxphJHWJhu1qMfo5+FP9dFZj1aoo7b8Kbc/CedyoQe71cpIE5wNh
Jj/EpWJnuyKXwuTic2VYGC+6ezM9O5DSdqCfP3YuZky95VESyvRCKJYMMgBYRVdC
/LuxhnBXIY2G8An7ZTnX0kLCCvLbapIwa0NyA98/xeOngO843coJ6wn8ZmE9LJNH
kMmpCygUrA==
=QWC+
-----END PGP SIGNATURE-----
Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
- mq-deadline accounting improvements (Bart)
- blk-wbt timer fix (Andrea)
- Untangle the block layer includes (Christoph)
- Rework the poll support to be bio based, which will enable adding
support for polling for bio based drivers (Christoph)
- Block layer core support for multi-actuator drives (Damien)
- blk-crypto improvements (Eric)
- Batched tag allocation support (me)
- Request completion batching support (me)
- Plugging improvements (me)
- Shared tag set improvements (John)
- Concurrent queue quiesce support (Ming)
- Cache bdev in ->private_data for block devices (Pavel)
- bdev dio improvements (Pavel)
- Block device invalidation and block size improvements (Xie)
- Various cleanups, fixes, and improvements (Christoph, Jackie,
Masahira, Tejun, Yu, Pavel, Zheng, me)
* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
blk-mq-debugfs: Show active requests per queue for shared tags
block: improve readability of blk_mq_end_request_batch()
virtio-blk: Use blk_validate_block_size() to validate block size
loop: Use blk_validate_block_size() to validate block size
nbd: Use blk_validate_block_size() to validate block size
block: Add a helper to validate the block size
block: re-flow blk_mq_rq_ctx_init()
block: prefetch request to be initialized
block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
block: add rq_flags to struct blk_mq_alloc_data
block: add async version of bio_set_polled
block: kill DIO_MULTI_BIO
block: kill unused polling bits in __blkdev_direct_IO()
block: avoid extra iter advance with async iocb
block: Add independent access ranges support
blk-mq: don't issue request directly in case that current is to be blocked
sbitmap: silence data race warning
blk-cgroup: synchronize blkg creation against policy deactivation
block: refactor bio_iov_bvec_set()
block: add single bio async direct IO helper
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmF72q0ACgkQxWXV+ddt
WDvFOxAAkcryx2FP5aqaoMzBKfoCtMFHO3uAvm+rsMcglWe5kaXhBnHa2HPzoyEh
YqEx2TeXMTuA2I15bU8KV1RMhQzzRjC4NhdRqY6uaKAcKgON6sJlK5qsq2BnB+V3
nrue1jppM2Vv8wNzjMNeVETQNC7pmg29yQP/fvWaB36Yar2tyfyWDF11e42HR7cU
yLQUedg30WEayz3Mp6MTBF36h09WXQrZSs7Iwk1JMQbpxWcpn2CjXrO+vIZOMdvH
XZZsxBTNB8GJIaJlXssgsq3OP2wspK1lrVHNfi5PYtcZEaFrhkPaVB6enDfd41YV
zXwj1dnemCni9fh88gZprel9bLyB37dSVfIqq2Ly3hQbSAN4dmHIpxGwPSRIr+Hl
Bn3UfClHpAftbpd/Y77U7GgcYnkuRo3Bd4mGTF3ZuPDLVrf/QX5BlfGa2dmJYoml
NfBit7Ha4UrxLW6C8RC6fyEbLQxpNYFY55Ra0Tj0BBO/uhWiqtQGZwC/qbyPKfzN
YZFcPR6iTILoCHXNan3iZIuLeASMT0djgAtunXXf/BuFnxGfnOuqL3bKt2vojh3+
rsqpeIxSP/VklKv4JcP3axeLmUK6cA8/9dV2ES0M0Fc0o341jfh+AoVw0GleFeus
gXlDFPRJeE8yyXmjKyW4shctOczqoeMIq3umebXPP9R4jd/LU/g=
=YWGa
-----END PGP SIGNATURE-----
Merge tag 'for-5.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Last minute fixes for crash on 32bit architectures when compression is
in use. It's a regression introduced in 5.15-rc and I'd really like
not let this into the final release, fixes via stable trees would add
unnecessary delay.
The problem is on 32bit architectures with highmem enabled, the pages
for compression may need to be kmapped, while the patches removed that
as we don't use GFP_HIGHMEM allocations anymore. The pages that don't
come from local allocation still may be from highmem. Despite being on
32bit there's enough such ARM machines in use so it's not a marginal
issue.
I did full reverts of the patches one by one instead of a huge one.
There's one exception for the "lzo" revert as there was an
intermediate patch touching the same code to make it compatible with
subpage. I can't revert that one too, so the revert in lzo.c is
manual. Qu Wenruo has worked on that with me and verified the changes"
* tag 'for-5.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Revert "btrfs: compression: drop kmap/kunmap from lzo"
Revert "btrfs: compression: drop kmap/kunmap from zlib"
Revert "btrfs: compression: drop kmap/kunmap from zstd"
Revert "btrfs: compression: drop kmap/kunmap from generic helpers"
The root argument passed to btrfs_unlink_inode() and its callee,
__btrfs_unlink_inode(), always matches the root of the given directory and
the given inode. So remove the argument and make __btrfs_unlink_inode()
use the root of the directory.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member btrfs_bio::logical is only initialized by two call sites:
- btrfs_repair_one_sector()
No corresponding site to utilize it.
- btrfs_submit_direct()
The corresponding site to utilize it is btrfs_check_read_dio_bio().
However for btrfs_check_read_dio_bio(), we can grab the file_offset from
btrfs_dio_private::file_offset directly.
Thus it turns out we don't really need that btrfs_bio::logical member at
all.
For btrfs_bio, the logical bytenr can be fetched from its
bio->bi_iter.bi_sector directly.
So let's just remove the member to save 8 bytes for structure btrfs_bio.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The naming of "logical_offset" can be confused with logical bytenr of
the dio range.
In fact it's file offset, and the naming "file_offset" is already widely
used in all other sites.
Just do the rename to avoid confusion.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of checking whether qgroup processing for a dealyed ref has to
happen in the core of delayed ref, simply pull the check at init time of
respective delayed ref structures. This eliminates the final use of
real_root in delayed-ref core paving the way to making this member
optional.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to make 'real_root' used only in ref-verify it's required to
have the necessary context to perform the same checks that this member
is used for. So add 'mod_root' which will contain the root on behalf of
which a delayed ref was created and a 'skip_group' parameter which
will contain callsite-specific override of skip_qgroup.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few flags that are inconsistently used to describe the fs in
different states of failure. As of 5963ffcaf3 ("btrfs: always abort
the transaction if we abort a trans handle") we will always set
BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
and ERROR to see if things have gone wrong. Add a helper to check
BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
use the helper.
The TRANS_ABORTED bit check was added in af72273381 ("Btrfs: clean up
resources during umount after trans is aborted") but is not actually
specific.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we will abort the transaction if we get a random error (like
-EIO) while trying to remove the directory entries from the root log
during rename.
However since these are simply log tree related errors, we can mark the
trans as needing a full commit. Then if the error was truly
catastrophic we'll hit it during the normal commit and abort as
appropriate.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For compressed write, we use a mechanism called async COW, which unlike
regular run_delalloc_cow() or cow_file_range() will also unlock the
first page.
This mechanism allows us to continue handling next ranges, without
waiting for the time consuming compression.
But this has a problem for subpage case, as we could have the following
delalloc range for a page:
0 32K 64K
| |///////| |///////|
\- A \- B
In the above case, if we pass both ranges to cow_file_range_async(),
both range A and range B will try to unlock the full page [0, 64K).
And which one finishes later than the other one will try to do other
page operations like end_page_writeback() on a unlocked page, triggering
VM layer BUG_ON().
To make subpage compression work at least partially, here we add another
restriction for it, only allow compression if the delalloc range is
fully page aligned.
By that, async extent is always ensured to unlock the first page
exclusively, just like it used to be for regular sectorsize.
In theory, we only need to make sure the delalloc range fully covers its
first page, but the tail page will be locked anyway, blocking later
writeback until the compression finishes.
Thus here we choose to make sure the range is fully page aligned before
doing the compression.
In the future, we could optimize the situation by properly increasing
subpage::writers number for the locked page, but that also means we need
to change how we run delalloc range of page.
(Instead of running each delalloc range we hit, we need to find and lock
all delalloc ranges covering the page, then run each of them).
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With experimental subpage compression enabled, a simple fsstress can
lead to self deadlock on page 720896:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount $dev -o compress $mnt
$fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
[CAUSE]
If we have a file layout looks like below:
0 32K 64K 96K 128K
|//| |///////////////|
4K
Then we run delalloc range for the inode, it will:
- Call find_lock_delalloc_range() with @delalloc_start = 0
Then we got a delalloc range [0, 4K).
This range will be COWed.
- Call find_lock_delalloc_range() again with @delalloc_start = 4K
Since find_lock_delalloc_range() never cares whether the range
is still inside page range [0, 64K), it will return range [64K, 128K).
This range meets the condition for subpage compression, will go
through async COW path.
And async COW path will return @page_started.
But that @page_started is now for range [64K, 128K), not for range
[0, 64K).
- writepage_dellloc() returned 1 for page [0, 64K)
Thus page [0, 64K) will not be unlocked, nor its page dirty status
will be cleared.
Next time when we try to lock page [0, 64K) we will deadlock, as there
is no one to release page [0, 64K).
This problem will never happen for regular page size as one page only
contains one sector. After the first find_lock_delalloc_range() call,
the @delalloc_end will go beyond @page_end no matter if we found a
delalloc range or not
Thus this bug only happens for subpage, as now we need multiple runs to
exhaust the delalloc range of a page.
[FIX]
Fix the problem by ensuring the delalloc range we ran at least started
inside @locked_page.
So that we will never get incorrect @page_started.
And to prevent such problem from happening again:
- Make find_lock_delalloc_range() return false if the found range is
beyond @end value passed in.
Since @end will be utilized now, add an ASSERT() to ensure we pass
correct @end into find_lock_delalloc_range().
This also means, for selftests we needs to populate @end before calling
find_lock_delalloc_range().
- New ASSERT() in find_lock_delalloc_range()
Now we will make sure the @start/@end passed in at least covers part
of the page.
- New ASSERT() in run_delalloc_range()
To make sure the range at least starts inside @locked page.
- Use @delalloc_start as proper cursor, while @delalloc_end is always
reset to @page_end.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new helper, submit_uncompressed_range(), for async cow cases
where we fallback to COW.
There are some new updates introduced to the helper:
- Proper locked_page detection
It's possible that the async_extent range doesn't cover the locked
page. In that case we shouldn't unlock the locked page.
In the new helper, we will ensure that we only unlock the locked page
when:
* The locked page covers part of the async_extent range
* The locked page is not unlocked by cow_file_range() nor
extent_write_locked_range()
This also means extra comments are added focusing on the page locking.
- Add extra comment on some rare parameter used.
We use @unlock_page = 0 for cow_file_range(), where only two call
sites doing the same thing, including the new helper.
It's definitely worth some comments.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In function compress_file_range(), when the compression is finished, the
function just rounds up @total_in to PAGE_SIZE. This is fine for
regular sectorsize == PAGE_SIZE case, but not for subpage.
Just change the ALIGN(, PAGE_SIZE) to round_up(, sectorsize) so that
both regular sectorsize and subpage sectorsize will be happy.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several cleanups for extent_write_locked_range(), most of them
are pure cleanups, but with some preparation for future subpage support.
- Add a proper comment for which call sites are suitable
Unlike regular synchronized extent write back, if async COW or zoned
COW happens, we have all pages in the range still locked.
Thus for those (only) two call sites, we need this function to submit
page content into bios and submit them.
- Remove @mode parameter
All the existing two call sites pass WB_SYNC_ALL. No need for @mode
parameter.
- Better error handling
Currently if we hit an error during the page iteration loop, we
overwrite @ret, causing only the last error can be recorded.
Here we add @found_error and @first_error variable to record if we hit
any error, and the first error we hit.
So the first error won't get lost.
- Don't reuse @start as the cursor
We reuse the parameter @start as the cursor to iterate the range, not
a big problem, but since we're here, introduce a proper @cur as the
cursor.
- Remove impossible branch
Since all pages are still locked after the ordered extent is inserted,
there is no way that pages can get its dirty bit cleared.
Remove the branch where page is not dirty and replace it with an
ASSERT().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a big chunk of code inside a while() loop, with tons of strange
jumps for error handling. It's definitely not to the code standard of
today. Move the code into a new function, submit_one_async_extent().
Since we're here, also do the following changes:
- Comment style change
To follow the current scheme
- Don't fallback to non-compressed write then hitting ENOSPC
If we hit ENOSPC for compressed write, how could we reserve more space
for non-compressed write?
Thus we go error path directly.
This removes the retry: label.
- Add more comment for super long parameter list
Explain which parameter is for, so we don't need to check the
prototype.
- Move the error handling to submit_one_async_extent()
Thus no strange code like:
out_free:
...
goto again;
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As the last caller in compression.c has been removed, we don't need that
function anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although in btrfs we have very limited usage of PageChecked flag, it's
still some page flag not yet subpage compatible.
Fix it by introducing btrfs_subpage::checked_offset to do the convert.
For most call sites, especially for free-space cache, COW fixup and
btrfs_invalidatepage(), they all work in full page mode anyway.
For other call sites, they work as subpage compatible mode.
Some call sites need extra modification:
- btrfs_drop_pages()
Needs extra parameter to get the real range we need to clear checked
flag.
Also since btrfs_drop_pages() will accept pages beyond the dirtied
range, update btrfs_subpage_clamp_range() to handle such case
by setting @len to 0 if the page is beyond target range.
- btrfs_invalidatepage()
We need to call subpage helper before calling __btrfs_releasepage(),
or it will trigger ASSERT() as page->private will be cleared.
- btrfs_verify_data_csum()
In theory we don't need the io_bio->csum check anymore, but it's
won't hurt. Just change the comment.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since async_extent holds the compressed page, it would trigger the new
ASSERT() in btrfs_mark_ordered_io_finished() which checks that the range
is inside the page.
Now btrfs_writepage_endio_finish_ordered() can accept @page == NULL,
just pass NULL to btrfs_writepage_endio_finish_ordered().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For structure async_chunk, we use a very strange member layout to grab
structure async_cow who owns this async_chunk.
At initialization, it goes like this:
async_chunk[i].pending = &ctx->num_chunks;
Then at async_cow_free() we do a super weird freeing:
/*
* Since the pointer to 'pending' is at the beginning of the array of
* async_chunk's, freeing it ensures the whole array has been freed.
*/
if (atomic_dec_and_test(async_chunk->pending))
kvfree(async_chunk->pending);
This is absolutely an abuse of kvfree().
Replace async_chunk::pending with async_chunk::async_cow, so that we can
grab the async_cow structure directly, without this strange dancing.
And with this change, there is no requirement for any specific member
location.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting a batch of items into a btree, we end up looping over the
data sizes array 3 times:
1) Once in the caller of btrfs_insert_empty_items(), when it populates the
array with the data sizes for each item;
2) Once at btrfs_insert_empty_items() to sum the elements of the data
sizes array and compute the total data size;
3) And then once again at setup_items_for_insert(), where we do exactly
the same as what we do at btrfs_insert_empty_items(), to compute the
total data size.
That is not bad for small arrays, but when the arrays have hundreds of
elements, the time spent on looping is not negligible. For example when
doing batch inserts of delayed items for dir index items or when logging
a directory, it's common to have 200 to 260 dir index items in a single
batch when using a leaf size of 16K and using file names between 8 and 12
characters. For a 64K leaf size, multiply that by 4. Taking into account
that during directory logging or when flushing delayed dir index items we
can have many of those large batches, the time spent on the looping adds
up quickly.
It's also more important to avoid it at setup_items_for_insert(), since
we are holding a write lock on a leaf and, in some cases, on upper nodes
of the btree, which causes us to block other tasks that want to access
the leaf and nodes for longer than necessary.
So change the code so that setup_items_for_insert() and
btrfs_insert_empty_items() no longer compute the total data size, and
instead rely on the caller to supply it. This makes us loop over the
array only once, where we can both populate the data size array and
compute the total data size, taking advantage of spatial and temporal
locality. To make this more manageable, use a structure to contain
all the relevant details for a batch of items (keys array, data sizes
array, total data size, number of items), and use it as an argument
for btrfs_insert_empty_items() and setup_items_for_insert().
This patch is part of a small patchset that is comprised of the following
patches:
btrfs: loop only once over data sizes array when inserting an item batch
btrfs: unexport setup_items_for_insert()
btrfs: use single bulk copy operations when logging directories
This is patch 1/3 and performance results, and the specific tests, are
included in the changelog of patch 3/3.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously we had "struct btrfs_bio", which records IO context for
mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra
btrfs specific info for logical bytenr bio.
With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
"btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
The struct btrfs_bio changes meaning by this commit. There was a
suggested name like btrfs_logical_bio but it's a bit long and we'd
prefer to use a shorter name.
This could be a concern for backports to older kernels where the
different meaning could possibly cause confusion or bugs. Comparing the
new and old structures, there's no overlap among the struct members so a
build would break in case of incorrect backport.
We haven't had many backports to bio code anyway so this is more of a
theoretical cause of bugs and a matter of precaution but we'll need to
keep the semantic change in mind.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the first time we log a directory in the current transaction, for
each directory item in a changed leaf of the subvolume tree, we have to
check if we previously logged the item, in order to overwrite it in case
its data changed or skip it in case its data hasn't changed.
Checking if we have logged each item before not only wastes times, but it
also adds lock contention on the log tree. So in order to minimize the
number of times we do such checks, keep track of the offset of the last
key we logged for a directory and, on the next time we log the directory,
skip the checks for any new keys that have an offset greater than the
offset we have previously saved. This is specially effective for index
keys, because the offset for these keys comes from a monotonically
increasing counter.
This patch is part of a patchset comprised of the following 5 patches:
btrfs: remove root argument from btrfs_log_inode() and its callees
btrfs: remove redundant log root assignment from log_dir_items()
btrfs: factor out the copying loop of dir items from log_dir_items()
btrfs: insert items in batches when logging a directory when possible
btrfs: keep track of the last logged keys when logging a directory
This is patch 5/5.
The following test was used on a non-debug kernel to measure the impact
it has on a directory fsync:
$ cat test-dir-fsync.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_NEW_FILES=100000
NUM_FILE_DELETES=1000
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
# fsync the directory, this will log the new dir items and the inodes
# they point to, because these are new inodes.
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
# sync to force transaction commit and wipeout the log.
sync
del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
rm -f $MNT/testdir/file_$i
done
# fsync the directory, this will only log dir items, there are no
# dentries pointing to new inodes.
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
umount $MNT
Test results with NUM_NEW_FILES set to 100 000 and 1 000 000:
**** before patchset, 100 000 files, 1000 deletes ****
dir fsync took 848 ms after adding 100000 files
dir fsync took 175 ms after deleting 1000 files
**** after patchset, 100 000 files, 1000 deletes ****
dir fsync took 758 ms after adding 100000 files (-11.2%)
dir fsync took 63 ms after deleting 1000 files (-94.1%)
**** before patchset, 1 000 000 files, 1000 deletes ****
dir fsync took 9945 ms after adding 1000000 files
dir fsync took 473 ms after deleting 1000 files
**** after patchset, 1 000 000 files, 1000 deletes ****
dir fsync took 8677 ms after adding 1000000 files (-13.6%)
dir fsync took 146 ms after deleting 1000 files (-105.6%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Prepare for allowing preallocation for relocation inodes.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several places in our codebase where we check if a root is the
root of the data reloc tree and subsequent patches will introduce more.
Factor out the check into a small helper function instead of open coding
it multiple times.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to fix a bug in btrfs_show_devname().
Convert fs_devices::latest_bdev type from struct block_device to struct
btrfs_device and, rename the member to fs_devices::latest_dev.
So that btrfs_show_devname() can use fs_devices::latest_dev::name.
Tested-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have written to the zone capacity, the device automatically
deactivates the zone. Sync up block group side (the active BG list and
zone_is_active flag) with it.
We need to do it both on data BGs and metadata BGs. On data side, we add a
hook to btrfs_finish_ordered_io(). On metadata side, we use
end_extent_buffer_writeback().
To reduce excess lookup of a block group, we mark the last extent buffer in
a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for
data (ordered_extent), because the address may change due to
REQ_OP_ZONE_APPEND.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.
Polling for the bio itself leads to a few advantages:
- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-cgroup.h pulls in blkdev.h and thus pretty much all the block
headers. Break this dependency chain by turning wbc_blkcg_css into a
macro and dropping the blk-cgroup.h include.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Simplify the bio_end_page usage in the buffered IO code.
- Support reading inline data at nonzero offsets for erofs.
- Fix some typos and bad grammar.
- Convert kmap_atomic usage in the inline data read path.
- Add some extra inline data input checking.
- Fix a memory corruption bug stemming from iomap_swapfile_activate
trying to activate more pages than mm was expecting.
- Pass errnos through the page writeback code so that writeback errors
are reported correctly instead of being munged to EIO.
- Replace iomap_apply with a open-coded iterator loops to reduce the
number of indirect calls by a third to a half.
- Refactor the fsdax code to use iomap iterators instead of the
open-coded iomap_apply code that it had before.
- Format file range iomap tracepoint data in hexadecimal and
standardize the names used in the pretty-print string.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmEnwC0ACgkQ+H93GTRK
tOtVOQ//Zu9ul2ZmPARMV8xyAfopnLpmggREOthFbPkDZ3z3ZgRpPxlbAvWEEKnj
VDNLFNj204rDojuxP/YSdxgiawLod7dYfXIwwft8R8oI7MdgVQhpvimUi5bkz/Od
X5pmFDe84INfFvEztOgC+sPk1RI/ToQLgrcIffWMWfF2iyVkNVMCD5MMe6LoH1la
9GbVCfPx6Y2Nffaa8EuAEgaCo7FMPc81bvQG4qpeqXyX8qql/r5n4YENhkn3n4qa
zI4F2lgqwbelFkamZOYNDjtLt13lb7Ze0PoFOpmTZUqlyybqhRxDvJ+OxZn8W6zH
20pxWx/RCXhCp/sS6DRcYyf7WKoIfdGDkxed7aSuhJ+VKKtBtsjMoy7dh5IY5RJa
8L1DMat6xtea8Glx04SF7Vib0n/An9oHOTzLEWxsUlRaPhW68uVpKgXuGLTAf+dc
ztJhlQ9pLX0D2NmgGlkXN8d4F1XEH2BgyIrtF6UNtMbyIlCREHM9HELJs6JzKl6U
a4ivJXyaq8o/hlXr8IMWUOTVubS0i+hgvvQjnVJmcSTJxhH10mPPJLnNsGX6heD9
SlnnXRbD03iqsbMJP/R431VKooryOSKBc86IEECkuMz3RUfw75DGAnLtETnT1rsA
71rSVf5NaCGZ2hV4du6jv53TS7yrPpqkxJHyDWD1WP4xGPbO1XA=
=iVns
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"The most notable externally visible change for this cycle is the
addition of support for reads to inline tail fragments of files, which
was requested by the erofs developers; and a correction for a kernel
memory corruption bug if the sysadmin tries to activate a swapfile
with more pages than the swapfile header suggests.
We also now report writeback completion errors to the file mapping
correctly, instead of munging all errors into EIO.
Internally, the bulk of the changes are Christoph's patchset to reduce
the indirect function call count by a third to a half by converting
iomap iteration from a loop pattern to a generator/consumer pattern.
As an added bonus, fsdax no longer open-codes iomap apply loops.
Summary:
- Simplify the bio_end_page usage in the buffered IO code.
- Support reading inline data at nonzero offsets for erofs.
- Fix some typos and bad grammar.
- Convert kmap_atomic usage in the inline data read path.
- Add some extra inline data input checking.
- Fix a memory corruption bug stemming from iomap_swapfile_activate
trying to activate more pages than mm was expecting.
- Pass errnos through the page writeback code so that writeback
errors are reported correctly instead of being munged to EIO.
- Replace iomap_apply with a open-coded iterator loops to reduce the
number of indirect calls by a third to a half.
- Refactor the fsdax code to use iomap iterators instead of the
open-coded iomap_apply code that it had before.
- Format file range iomap tracepoint data in hexadecimal and
standardize the names used in the pretty-print string"
* tag 'iomap-5.15-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (41 commits)
iomap: standardize tracepoint formatting and storage
mm/swap: consider max pages in iomap_swapfile_add_extent
iomap: move loop control code to iter.c
iomap: constify iomap_iter_srcmap
fsdax: switch the fault handlers to use iomap_iter
fsdax: factor out a dax_fault_actor() helper
fsdax: factor out helpers to simplify the dax fault code
iomap: rework unshare flag
iomap: pass an iomap_iter to various buffered I/O helpers
iomap: remove iomap_apply
fsdax: switch dax_iomap_rw to use iomap_iter
iomap: switch iomap_swapfile_activate to use iomap_iter
iomap: switch iomap_seek_data to use iomap_iter
iomap: switch iomap_seek_hole to use iomap_iter
iomap: switch iomap_bmap to use iomap_iter
iomap: switch iomap_fiemap to use iomap_iter
iomap: switch __iomap_dio_rw to use iomap_iter
iomap: switch iomap_page_mkwrite to use iomap_iter
iomap: switch iomap_zero_range to use iomap_iter
iomap: switch iomap_file_unshare to use iomap_iter
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEs2NIACgkQxWXV+ddt
WDsJMQ/+PJ/yXfI85mAeAzTJLWQ0zD6YO3iBhf3wOeyychWC4on435pj+zW8zR/U
/bix25ygoWF4MvGF6p0uyv4Z5mnvkZXE5lapUcJu6wXG7se1QRPH0broTh05IBXK
SnT93Eb9RexaiNFk7DVma9XkviqZ/ZISPtkJ9wYrfIba7j/U/wa+PtEFS7wk58hP
rFQXgV64xm/pcP28YYHfOkCjdyUMdJrnBUvfKOlX6d94lmYbP5lyiTL+XJEXExzN
wPakD0UsnXPr4TRvf+YRTPeFHPPUgyORII7otVUOKmGywWtcJrELX8rXFoW+6GwB
dzZIcSYXHUxU5UrtMbZgiztVBJ+bQY5juYMIrj13eYOMYkijxAqPP84iDO15+TSV
zNqyAVjUglHCGUGjhSpAxnAmtp+IJTZfVAWcvIKq3VqvJtb8tssQsk9bqFjH1xlH
qNJLE57CYe3tjw05K9y0keMh2iJWRWkXZYkgI/zjwo5nreemobpN+3fO4yneVLh7
ecdBmSl/JVSzAB1NamLOCZNGZLUqiiuTvZlJtI6ZsekrN1+4A6QzVcU/MGjSYL1v
C7W0hK0LF+e3xIBkxTKVq8noolsgbmlWacxJq8fZq9HwZy5IVJOVm9STDlCuLaIo
gPr0V0itkclcsMU0CHTyCjMsfuHYUwJZXwg93wKfJf5UCzS4OWU=
=ALO9
-----END PGP SIGNATURE-----
Merge tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The highlights of this round are integrations with fs-verity and
idmapped mounts, the rest is usual mix of minor improvements, speedups
and cleanups.
There are some patches outside of btrfs, namely updating some VFS
interfaces, all straightforward and acked.
Features:
- fs-verity support, using standard ioctls, backward compatible with
read-only limitation on inodes with previously enabled fs-verity
- idmapped mount support
- make mount with rescue=ibadroots more tolerant to partially damaged
trees
- allow raid0 on a single device and raid10 on two devices,
degenerate cases but might be useful as an intermediate step during
conversion to other profiles
- zoned mode block group auto reclaim can be disabled via sysfs knob
Performance improvements:
- continue readahead of node siblings even if target node is in
memory, could speed up full send (on sample test +11%)
- batching of delayed items can speed up creating many files
- fsync/tree-log speedups
- avoid unnecessary work (gains +2% throughput, -2% run time on
sample load)
- reduced lock contention on renames (on dbench +4% throughput,
up to -30% latency)
Fixes:
- various zoned mode fixes
- preemptive flushing threshold tuning, avoid excessive work on
almost full filesystems
Core:
- continued subpage support, preparation for implementing remaining
features like compression and defragmentation; with some
limitations, write is now enabled on 64K page systems with 4K
sectors, still considered experimental
- no readahead on compressed reads
- inline extents disabled
- disabled raid56 profile conversion and mount
- improved flushing logic, fixing early ENOSPC on some workloads
- inode flags have been internally split to read-only and read-write
incompat bit parts, used by fs-verity
- new tree items for fs-verity
- descriptor item
- Merkle tree item
- inode operations extended to be namespace-aware
- cleanups and refactoring
Generic code changes:
- fs: new export filemap_fdatawrite_wbc
- fs: removed sync_inode
- block: bio_trim argument type fixups
- vfs: add namespace-aware lookup"
* tag 'for-5.15-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reset replace target device to allocation state on close
btrfs: zoned: fix ordered extent boundary calculation
btrfs: do not do preemptive flushing if the majority is global rsv
btrfs: reduce the preemptive flushing threshold to 90%
btrfs: tree-log: check btrfs_lookup_data_extent return value
btrfs: avoid unnecessarily logging directories that had no changes
btrfs: allow idmapped mount
btrfs: handle ACLs on idmapped mounts
btrfs: allow idmapped INO_LOOKUP_USER ioctl
btrfs: allow idmapped SUBVOL_SETFLAGS ioctl
btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls
btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids
btrfs: allow idmapped SNAP_DESTROY ioctls
btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls
btrfs: check whether fsgid/fsuid are mapped during subvolume creation
btrfs: allow idmapped permission inode op
btrfs: allow idmapped setattr inode op
btrfs: allow idmapped tmpfile inode op
btrfs: allow idmapped symlink inode op
btrfs: allow idmapped mkdir inode op
...