Commit graph

66358 commits

Author SHA1 Message Date
David Sterba
8bb1cf1ba6 btrfs: scrub: rename ratelimit state varaible to avoid shadowing
There's already defined _rs within ctree.h:btrfs_printk_ratelimited,
local variables should not use _ to avoid such name clashes with
macro-local variables.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
David Sterba
0af447d050 btrfs: remove unnecessarily shadowed variables
In btrfs_orphan_cleanup, there's another instance of fs_info, but it's
the same as the one we already have.

In btrfs_backref_finish_upper_links, rb_node is same type and used
as temporary cursor to the tree.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
David Sterba
cb4c919830 btrfs: compression: move declarations to header
The declarations of compression algorithm callbacks are defined in the
.c file as they're used from there. Compiler warns that there are no
declarations for public functions when compiling lzo.c/zlib.c/zstd.c.
Fix that by moving the declarations to the header as it's the common
place for all of them.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
David Sterba
9e6df7cedf btrfs: remove const from btrfs_feature_set_name
The function btrfs_feature_set_name returns a const char pointer, the
second const is not necessary and reported as a warning:

 In file included from fs/btrfs/space-info.c:6:
 fs/btrfs/sysfs.h:16:1: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
    16 | const char * const btrfs_feature_set_name(enum btrfs_feature_set set);
       | ^~~~~

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
Qu Wenruo
e21139c621 btrfs: cleanup calculation of lockend in lock_and_cleanup_extent_if_need()
We're just doing rounding up to sectorsize to calculate the lockend.
There is no need to do the unnecessary length calculation, just direct
round_up() is enough.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Josef Bacik
c4923027bd btrfs: fix possible infinite loop in data async reclaim
Dave reported an issue where generic/102 would sometimes hang.  This
turned out to be because we'd get into this spot where we were no longer
making progress on data reservations because our exit condition was not
met.  The log is basically

while (!space_info->full && !list_empty(&space_info->tickets))
	flush_space(space_info, flush_state);

where flush state is our various flush states, but doesn't include
ALLOC_CHUNK_FORCE.  This is because we actually lead with allocating
chunks, and so the assumption was that once you got to the actual
flushing states you could no longer allocate chunks.  This was a stupid
assumption, because you could have deleted block groups that would be
reclaimed by a transaction commit, thus unsetting space_info->full.
This is essentially what happens with generic/102, and so sometimes
you'd get stuck in the flushing loop because we weren't allocating
chunks, but flushing space wasn't giving us what we needed to make
progress.

Fix this by adding ALLOC_CHUNK_FORCE to the end of our flushing states,
that way we will eventually bail out because we did end up with
space_info->full if we free'd a chunk previously.  Otherwise, as is the
case for this test, we'll allocate our chunk and continue on our happy
merry way.

Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Josef Bacik
1a7a92c8dd btrfs: add a comment explaining the data flush steps
The data flushing steps are not obvious to people other than myself and
Chris.  Write a giant comment explaining the reasoning behind each flush
step for data as well as why it is in that particular order.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Josef Bacik
5705674081 btrfs: do async reclaim for data reservations
Now that we have the data ticketing stuff in place, move normal data
reservations to use an async reclaim helper to satisfy tickets.  Before
we could have multiple tasks race in and both allocate chunks, resulting
in more data chunks than we would necessarily need.  Serializing these
allocations and making a single thread responsible for flushing will
only allocate chunks as needed, as well as cut down on transaction
commits and other flush related activities.

Priority reservations will still work as they have before, simply
trying to allocate a chunk until they can make their reservation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Josef Bacik
cb3e393045 btrfs: flush delayed refs when trying to reserve data space
We can end up with freed extents in the delayed refs, and thus
may_commit_transaction() may not think we have enough pinned space to
commit the transaction and we'll ENOSPC early.  Handle this by running
the delayed refs in order to make sure pinned is uptodate before we try
to commit the transaction.

Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Josef Bacik
327feeeb2e btrfs: run delayed iputs before committing the transaction for data
Before we were waiting on iputs after we committed the transaction, but
this doesn't really make much sense.  We want to reclaim any space we
may have in order to be more likely to commit the transaction, due to
pinned space being added by running the delayed iputs.  Fix this by
making delayed iputs run before committing the transaction.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik
bb86bd3db8 btrfs: don't force commit if we are data
We used to unconditionally commit the transaction at least 2 times and
then on the 3rd try check against pinned space to make sure committing
the transaction was worth the effort.  This is overkill, we know nobody
is going to steal our reservation, and if we can't make our reservation
with the pinned amount simply bail out.

This also cleans up the passing of bytes_needed to
may_commit_transaction, as that was the thing we added into place in
order to accomplish this behavior.  We no longer need it so remove that
mess.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik
0282700135 btrfs: drop the commit_cycles stuff for data reservations
This was an old wart left over from how we previously did data
reservations.  Before we could have people race in and take a
reservation while we were flushing space, so we needed to make sure we
looped a few times before giving up.  Now that we're using the ticketing
infrastructure we don't have to worry about this and can drop the logic
altogether.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik
f3bda421c1 btrfs: use the same helper for data and metadata reservations
Now that data reservations follow the same pattern as metadata
reservations we can simply rename __reserve_metadata_bytes to
__reserve_bytes and use that helper for data reservations.

Things to keep in mind, btrfs_can_overcommit() returns 0 for data,
because we can never overcommit.  We also will never pass in FLUSH_ALL
for data, so we'll simply be added to the priority list and go straight
into handle_reserve_ticket.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik
0532a6f8b6 btrfs: serialize data reservations if we are flushing
Nikolay reported a problem where generic/371 would fail sometimes with a
slow drive.  The gist of the test is that we fallocate a file in
parallel with a pwrite of a different file.  These two files combined
are smaller than the file system, but sometimes the pwrite would ENOSPC.

A fair bit of investigation uncovered the fact that the fallocate
workload was racing in and grabbing the free space that the pwrite
workload was trying to free up so it could make its own reservation.
After a few loops of this eventually the pwrite workload would error out
with an ENOSPC.

We've had the same problem with metadata as well, and we serialized all
metadata allocations to satisfy this problem.  This wasn't usually a
problem with data because data reservations are more straightforward,
but obviously could still happen.

Fix this by not allowing reservations to occur if there are any pending
tickets waiting to be satisfied on the space info.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik
1004f6860f btrfs: use ticketing for data space reservations
Now that we have all the infrastructure in place, use the ticketing
infrastructure to make data allocations.  This still maintains the exact
same flushing behavior, but now we're using tickets to get our
reservations satisfied.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik
8698fc4eb7 btrfs: add btrfs_reserve_data_bytes and use it
Create a new function btrfs_reserve_data_bytes() in order to handle data
reservations.  This uses the new flush types and flush states to handle
making data reservations.

This patch specifically does not change any functionality, and is
purposefully not cleaned up in order to make bisection easier for the
future patches.  The new helper is identical to the old helper in how it
handles data reservations.  We first try to force a chunk allocation,
and then we run through the flush states all at once and in the same
order that they were done with the old helper.

Subsequent patches will clean this up and change the behavior of the
flushing, and it is important to keep those changes separate so we can
easily bisect down to the patch that caused the regression, rather than
the patch that made us start using the new infrastructure.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik
a1ed0a8216 btrfs: add the data transaction commit logic into may_commit_transaction
Data space flushing currently unconditionally commits the transaction
twice in a row, and the last time it checks if there's enough pinned
extents to satisfy its reservation before deciding to commit the
transaction for the 3rd and final time.

Encode this logic into may_commit_transaction().  In the next patch we
will pass in U64_MAX for bytes_needed the first two times, and the final
time we will pass in the actual bytes we need so the normal logic will
apply.

This patch exists solely to make the logical changes I will make to the
flushing state machine separate to make it easier to bisect any
performance related regressions.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik
058e6d1d26 btrfs: add flushing states for handling data reservations
Currently the way we do data reservations is by seeing if we have enough
space in our space_info.  If we do not and we're a normal inode we'll

1) Attempt to force a chunk allocation until we can't anymore.
2) If that fails we'll flush delalloc, then commit the transaction, then
   run the delayed iputs.

If we are a free space inode we're only allowed to force a chunk
allocation.  In order to use the normal flushing mechanism we need to
encode this into a flush state array for normal inodes.  Since both will
start with allocating chunks until the space info is full there is no
need to add this as a flush state, this will be handled specially.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik
448b966b49 btrfs: check tickets after waiting on ordered extents
Right now if the space is freed up after the ordered extents complete
(which is likely since the reservations are held until they complete),
we would do extra delalloc flushing before we'd notice that we didn't
have any more tickets.  Fix this by moving the tickets check after our
wait_ordered_extents check.

Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik
38d715f494 btrfs: use btrfs_start_delalloc_roots in shrink_delalloc
The original iteration of flushing had us flushing delalloc and then
checking to see if we could make our reservation, thus we were very
careful about how many pages we would flush at once.

But now that everything is async and we satisfy tickets as the space
becomes available we don't have to keep track of any of this, simply
try and flush the number of dirty inodes we may have in order to
reclaim space to make our reservation.  This cleans up our delalloc
flushing significantly.

The async_pages stuff is dropped because btrfs_start_delalloc_roots()
handles the case that we generate async extents for us, so we no longer
require this extra logic.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik
39753e4a3a btrfs: use the btrfs_space_info_free_bytes_may_use helper for delalloc
We are going to use the ticket infrastructure for data, so use the
btrfs_space_info_free_bytes_may_use() helper in
btrfs_free_reserved_data_space_noquota() so we get the
btrfs_try_granting_tickets call when we free our reservation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik
99ffb43e5d btrfs: call btrfs_try_granting_tickets when reserving space
If we have compression on we could free up more space than we reserved,
and thus be able to make a space reservation.  Add the call for this
scenario.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik
2732798c9b btrfs: call btrfs_try_granting_tickets when unpinning anything
When unpinning we were only calling btrfs_try_granting_tickets() if
global_rsv->space_info == space_info, which is problematic because we
use ticketing for SYSTEM chunks, and want to use it for DATA as well.
Fix this by moving this call outside of that if statement.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik
3308234a7e btrfs: call btrfs_try_granting_tickets when freeing reserved bytes
We were missing a call to btrfs_try_granting_tickets in
btrfs_free_reserved_bytes, so add it to handle the case where we're able
to satisfy an allocation because we've freed a pending reservation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik
c6c453032e btrfs: make ALLOC_CHUNK use the space info flags
We have traditionally used flush_space() to flush metadata space, so
we've been unconditionally using btrfs_metadata_alloc_profile() for our
profile to allocate a chunk. However if we're going to use this for
data we need to use btrfs_get_alloc_profile() on the space_info we pass
in.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik
920a9958c2 btrfs: make shrink_delalloc take space_info as an arg
Currently shrink_delalloc just looks up the metadata space info, but
this won't work if we're trying to reclaim space for data chunks.  We
get the right space_info we want passed into flush_space, so simply pass
that along to shrink_delalloc.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik
d7f81fac97 btrfs: handle U64_MAX for shrink_delalloc
Data allocations are going to want to pass in U64_MAX for flushing
space, adjust shrink_delalloc to handle this properly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik
288be2d997 btrfs: remove orig from shrink_delalloc
We don't use this anywhere inside of shrink_delalloc since 17024ad0a0
("Btrfs: fix early ENOSPC due to delalloc"), remove it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Josef Bacik
b49121393f btrfs: change nr to u64 in btrfs_start_delalloc_roots
We have btrfs_wait_ordered_roots() which takes a u64 for nr, but
btrfs_start_delalloc_roots() that takes an int for nr, which makes using
them in conjunction, especially for something like (u64)-1, annoying and
inconsistent.  Fix btrfs_start_delalloc_roots() to take a u64 for nr and
adjust start_delalloc_inodes() and it's callers appropriately.

This means we've adjusted start_delalloc_inodes() to take a pointer of
nr since we want to preserve the ability for start-delalloc_inodes() to
return an error, so simply make it do the nr adjusting as necessary.

Part of adjusting the callers to this means changing
btrfs_writeback_inodes_sb_nr() to take a u64 for items.  This may be
confusing because it seems unrelated, but the caller of
btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the
function variable that needs to be changed.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Nikolay Borisov
8e56008180 btrfs: remove fsid argument from btrfs_sysfs_update_sprout_fsid
It can be accessed from 'fs_devices' as it's identical to
fs_info->fs_devices. Also add a comment about why we are calling the
function. No semantic changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Nikolay Borisov
57297c1e8e btrfs: remove spurious BUG_ON in btrfs_get_extent
That BUG_ON cannot ever trigger because as the comment there states -
'err' is always set. Simply remove it as it brings no value.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Randy Dunlap
260db43cd2 btrfs: delete duplicated words + other fixes in comments
Delete repeated words in fs/btrfs/.
{to, the, a, and old}
and change "into 2 part" to "into 2 parts".

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Qu Wenruo
437490fed3 btrfs: tracepoints: output proper root owner for trace_find_free_extent()
The current trace event always output result like this:

 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
 find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)

T's saying we're allocating data extent for EXTENT tree, which is not
even possible.

It's because we always use EXTENT tree as the owner for
trace_find_free_extent() without using the @root from
btrfs_reserve_extent().

This patch will change the parameter to use proper @root for
trace_find_free_extent():

Now it looks much better:

 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=4096 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=7(CSUM_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=1(ROOT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)

Reported-by: Hans van Kranenburg <hans@knorrie.org>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:49 +02:00
Linus Torvalds
702bfc891d io_uring-5.9-2020-10-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl93Z48QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmp4EACwxi4UVnL0zhaOBmXfqxDuaXViwkfVZNxx
 d40y+DcCewnpZMk2G9cES8OKG+Tu2GFX2yl1m2XdrIWJ6jpnGFKJOkNQGfPDQrT3
 fI7qFrEDeSVeLUMMBxtvZLW8w2D0KcNCgla4h/ESXI9xtPTZdYXhYQY0zfuWalUC
 ZplUgAWlHx82qJari7ZmIfeVtpAoujTvkccRe+/RtPv5vO+UsvP7kqPSCYMGqhHS
 7z5gK3Nw+PNMWrzZVZ6Rw5nLeExx9PJGgiEkitEjn7mRJELXV9eWnTt9D0eVwaec
 WO7OSQmrJLmMFER4ZhkDNJkXZFvlYUCygnwJQmH70LflRqUEA00O6wX4J32O3NIg
 fIDWKMGGANFU5atL+RHqfQgUYq0GY1UsIvZxJnwRwv1QssmJoQq9fpT6VYqiQMik
 2JAeWyMqTGI4vRNmVJKTR/13SpRUYrvS3wHN53kCaBBhE5Y/vFksgOGgXZBG/TPk
 odpegeJOTa5xuS0YcKIK6yL/xHENct1Y1BtVjczrXKJz0E90n5ZdIR0lEg6Ij3B1
 jZUwKiS2sY09eBaJIQvtD4hIaw5VgqtwinKTyt7MBw/6pCqJpSZtaV0Uvgvjq/Se
 1ifUo4cWwQBccZLgWeWoEalio2fNIyb+J+sm7eu9Xygjl67U2M8oMfAN2JjkM7As
 btLazer4lg==
 =fo3Z
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - fix for async buffered reads if read-ahead is fully disabled (Hao)

 - double poll match fix

 - ->show_fdinfo() potential ABBA deadlock complaint fix

* tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block:
  io_uring: fix async buffered reads when readahead is disabled
  io_uring: fix potential ABBA deadlock in ->show_fdinfo()
  io_uring: always delete double poll wait entry on match
2020-10-02 14:38:10 -07:00
Linus Torvalds
d4fce2e20f Merge branch 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull epoll fixes from Al Viro:
 "Several race fixes in epoll"

* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ep_create_wakeup_source(): dentry name can change under you...
  epoll: EPOLL_CTL_ADD: close the race in decision to take fast path
  epoll: replace ->visited/visited_list with generation count
  epoll: do not insert into poll queues until all sanity checks are done
2020-10-02 10:37:08 -07:00
Linus Torvalds
4e3b9ce271 for-5.9-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl93REAACgkQxWXV+ddt
 WDv0/A//XYr1XLC/5sMILHqYZ4ogiFxC3Nfjeyt6vfBPX3J0d2eHnw5Rw+ZHHHdQ
 qtoKWom9ZwCxjybghwmvfxJuohy+6Sc764aEj+rYpUcCmmUZsAZZpmwpZqpYG+0H
 DEn9p45T0MO+r5lsF/GdNqqsdXZfUlZy7PweIhZucQxENM8cowklqKCo4AU2IEW4
 203THU3UxQayn0um6kaiesioh8TtT+R9UVAyyA3n6lGINHKG8AMy0ulS/M2Uzgq5
 eAzWne4Opy+wLxubBdeqruPiQrFQp+JV/YhTTEHGKRXykRYXwZnCDYdK27X4UKkt
 g3Ne0cEd/JuxZfb3Mzsd7+MF0xr9xKJPziFXv7YZt0LkiHE+B0b/DwA9FksR9sdO
 4BY2oe0gztstIMqQ5qnriJMDQxonyUt2G65YW8sCI9b32vRYaHLhCWZRYzbmftEO
 W4FJOnAI2It3Ib0CUkBjkPYkmH113Q6g59k015IpoYRGmExhnC59zhuijdmthxFJ
 S5PXFymVhxt9iMOKM0jE17Rp/j4hVg/bdFVHJryzlOsldjq63Vukqoo24SQhiqfY
 qYn/Ilkc/h1YD/pxehFAhZcbGfEdjD5oo8OkGoKIUXfv35r7JH/5F/x+4DxZNnYk
 n0oHJ7WBR01AlHAcuTvsN7z9O2ZX6wZufkkgKYLBvtGtyC71T3A=
 =MT2i
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two more fixes.

  One is for a lockdep warning/lockup (also caught by syzbot), that one
  has been seen in practice. Regarding the other syzbot reports
  mentioned last time, they don't seem to be urgent and reliably
  reproducible so they'll be fixed later.

  The second fix is for a potential corruption when device replace
  finishes and the in-memory state of trim is not copied to the new
  device"

* tag 'for-5.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix filesystem corruption after a device replace
  btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks
  btrfs: move btrfs_scratch_superblocks into btrfs_dev_replace_finishing
2020-10-02 10:09:40 -07:00
Linus Torvalds
472e5b056f pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock.  So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.

Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.

However, commit 0ddad21d3e ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.

It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:

 "I hit a hang with fstest btrfs/187, which does a btrfs send into
  /dev/null. This works by creating a pipe, the write side is given to
  the kernel to write into, and the read side is handed to a thread that
  splices into a file, in this case /dev/null.

  The box that was hung had the write side stuck here [pipe_write] and
  the read side stuck here [splice_from_pipe_next -> pipe_wait].

  [ more details about pipe_wait() scenario ]

  The problem is we're doing the prepare_to_wait, which sets our state
  each time, however we can be woken up either with reads or writes. In
  the case above we race with the WRITER waking us up, and re-set our
  state to INTERRUPTIBLE, and thus never break out of schedule"

Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't.  And the whole "wait for any pipe state change" model really
isn't very good to begin with.

So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.

Fixes: 0ddad21d3e ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-01 19:14:36 -07:00
Filipe Manana
4c8f353272 btrfs: fix filesystem corruption after a device replace
We use a device's allocation state tree to track ranges in a device used
for allocated chunks, and we set ranges in this tree when allocating a new
chunk. However after a device replace operation, we were not setting the
allocated ranges in the new device's allocation state tree, so that tree
is empty after a device replace.

This means that a fitrim operation after a device replace will trim the
device ranges that have allocated chunks and extents, as we trim every
range for which there is not a range marked in the device's allocation
state tree. It is also important during chunk allocation, since the
device's allocation state is used to determine if a range is already
allocated when allocating a new chunk.

This is trivial to reproduce and the following script triggers the bug:

  $ cat reproducer.sh
  #!/bin/bash

  DEV1="/dev/sdg"
  DEV2="/dev/sdh"
  DEV3="/dev/sdi"

  wipefs -a $DEV1 $DEV2 $DEV3 &> /dev/null

  # Create a raid1 test fs on 2 devices.
  mkfs.btrfs -f -m raid1 -d raid1 $DEV1 $DEV2 > /dev/null
  mount $DEV1 /mnt/btrfs

  xfs_io -f -c "pwrite -S 0xab 0 10M" /mnt/btrfs/foo

  echo "Starting to replace $DEV1 with $DEV3"
  btrfs replace start -B $DEV1 $DEV3 /mnt/btrfs
  echo

  echo "Running fstrim"
  fstrim /mnt/btrfs
  echo

  echo "Unmounting filesystem"
  umount /mnt/btrfs

  echo "Mounting filesystem in degraded mode using $DEV3 only"
  wipefs -a $DEV1 $DEV2 &> /dev/null
  mount -o degraded $DEV3 /mnt/btrfs
  if [ $? -ne 0 ]; then
          dmesg | tail
          echo
          echo "Failed to mount in degraded mode"
          exit 1
  fi

  echo
  echo "File foo data (expected all bytes = 0xab):"
  od -A d -t x1 /mnt/btrfs/foo

  umount /mnt/btrfs

When running the reproducer:

  $ ./replace-test.sh
  wrote 10485760/10485760 bytes at offset 0
  10 MiB, 2560 ops; 0.0901 sec (110.877 MiB/sec and 28384.5216 ops/sec)
  Starting to replace /dev/sdg with /dev/sdi

  Running fstrim

  Unmounting filesystem
  Mounting filesystem in degraded mode using /dev/sdi only
  mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/sdi, missing codepage or helper program, or other error.
  [19581.748641] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi started
  [19581.803842] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi finished
  [19582.208293] BTRFS info (device sdi): allowing degraded mounts
  [19582.208298] BTRFS info (device sdi): disk space caching is enabled
  [19582.208301] BTRFS info (device sdi): has skinny extents
  [19582.212853] BTRFS warning (device sdi): devid 2 uuid 1f731f47-e1bb-4f00-bfbb-9e5a0cb4ba9f is missing
  [19582.213904] btree_readpage_end_io_hook: 25839 callbacks suppressed
  [19582.213907] BTRFS error (device sdi): bad tree block start, want 30490624 have 0
  [19582.214780] BTRFS warning (device sdi): failed to read root (objectid=7): -5
  [19582.231576] BTRFS error (device sdi): open_ctree failed

  Failed to mount in degraded mode

So fix by setting all allocated ranges in the replace target device when
the replace operation is finishing, when we are holding the chunk mutex
and we can not race with new chunk allocations.

A test case for fstests follows soon.

Fixes: 1c11b63eff ("btrfs: replace pending/pinned chunks lists with io tree")
CC: stable@vger.kernel.org # 5.2+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-30 19:40:51 +02:00
Josef Bacik
a466c85edc btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks
When closing and freeing the source device we could end up doing our
final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
we want to be holding as few locks as possible, so move this call
outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
we're modifying the fs_devices we need to make sure we're holding the
uuid_mutex here, so take that as well.

There's a report from syzbot probably hitting one of the cases where
the bd_mutex and device_list_mutex are taken in the wrong order, however
it's not with device replace, like this patch fixes. As there's no
reproducer available so far, we can't verify the fix.

https://lore.kernel.org/lkml/000000000000fc04d105afcf86d7@google.com/
dashboard link: https://syzkaller.appspot.com/bug?extid=84a0634dc5d21d488419

  WARNING: possible circular locking dependency detected
  5.9.0-rc5-syzkaller #0 Not tainted
  ------------------------------------------------------
  syz-executor.0/6878 is trying to acquire lock:
  ffff88804c17d780 (&bdev->bd_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x520 fs/block_dev.c:1804

  but task is already holding lock:
  ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #4 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
	 __mutex_lock_common kernel/locking/mutex.c:956 [inline]
	 __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
	 btrfs_finish_chunk_alloc+0x281/0xf90 fs/btrfs/volumes.c:5255
	 btrfs_create_pending_block_groups+0x2f3/0x700 fs/btrfs/block-group.c:2109
	 __btrfs_end_transaction+0xf5/0x690 fs/btrfs/transaction.c:916
	 find_free_extent_update_loop fs/btrfs/extent-tree.c:3807 [inline]
	 find_free_extent+0x23b7/0x2e60 fs/btrfs/extent-tree.c:4127
	 btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
	 cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
	 btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
	 writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
	 __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
	 extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
	 extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
	 do_writepages+0xec/0x290 mm/page-writeback.c:2352
	 __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
	 writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
	 wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
	 wb_do_writeback fs/fs-writeback.c:2039 [inline]
	 wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
	 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
	 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
	 kthread+0x3b5/0x4a0 kernel/kthread.c:292
	 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

  -> #3 (sb_internal#2){.+.+}-{0:0}:
	 percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
	 __sb_start_write+0x234/0x470 fs/super.c:1672
	 sb_start_intwrite include/linux/fs.h:1690 [inline]
	 start_transaction+0xbe7/0x1170 fs/btrfs/transaction.c:624
	 find_free_extent_update_loop fs/btrfs/extent-tree.c:3789 [inline]
	 find_free_extent+0x25e1/0x2e60 fs/btrfs/extent-tree.c:4127
	 btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206
	 cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063
	 btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838
	 writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439
	 __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653
	 extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249
	 extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370
	 do_writepages+0xec/0x290 mm/page-writeback.c:2352
	 __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461
	 writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721
	 wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894
	 wb_do_writeback fs/fs-writeback.c:2039 [inline]
	 wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080
	 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
	 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
	 kthread+0x3b5/0x4a0 kernel/kthread.c:292
	 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

  -> #2 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}:
	 __flush_work+0x60e/0xac0 kernel/workqueue.c:3041
	 wb_shutdown+0x180/0x220 mm/backing-dev.c:355
	 bdi_unregister+0x174/0x590 mm/backing-dev.c:872
	 del_gendisk+0x820/0xa10 block/genhd.c:933
	 loop_remove drivers/block/loop.c:2192 [inline]
	 loop_control_ioctl drivers/block/loop.c:2291 [inline]
	 loop_control_ioctl+0x3b1/0x480 drivers/block/loop.c:2257
	 vfs_ioctl fs/ioctl.c:48 [inline]
	 __do_sys_ioctl fs/ioctl.c:753 [inline]
	 __se_sys_ioctl fs/ioctl.c:739 [inline]
	 __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
	 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (loop_ctl_mutex){+.+.}-{3:3}:
	 __mutex_lock_common kernel/locking/mutex.c:956 [inline]
	 __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
	 lo_open+0x19/0xd0 drivers/block/loop.c:1893
	 __blkdev_get+0x759/0x1aa0 fs/block_dev.c:1507
	 blkdev_get fs/block_dev.c:1639 [inline]
	 blkdev_open+0x227/0x300 fs/block_dev.c:1753
	 do_dentry_open+0x4b9/0x11b0 fs/open.c:817
	 do_open fs/namei.c:3251 [inline]
	 path_openat+0x1b9a/0x2730 fs/namei.c:3368
	 do_filp_open+0x17e/0x3c0 fs/namei.c:3395
	 do_sys_openat2+0x16d/0x420 fs/open.c:1168
	 do_sys_open fs/open.c:1184 [inline]
	 __do_sys_open fs/open.c:1192 [inline]
	 __se_sys_open fs/open.c:1188 [inline]
	 __x64_sys_open+0x119/0x1c0 fs/open.c:1188
	 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&bdev->bd_mutex){+.+.}-{3:3}:
	 check_prev_add kernel/locking/lockdep.c:2496 [inline]
	 check_prevs_add kernel/locking/lockdep.c:2601 [inline]
	 validate_chain kernel/locking/lockdep.c:3218 [inline]
	 __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
	 lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
	 __mutex_lock_common kernel/locking/mutex.c:956 [inline]
	 __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
	 blkdev_put+0x30/0x520 fs/block_dev.c:1804
	 btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
	 btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
	 btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
	 close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
	 close_fs_devices fs/btrfs/volumes.c:1193 [inline]
	 btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
	 close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
	 generic_shutdown_super+0x144/0x370 fs/super.c:464
	 kill_anon_super+0x36/0x60 fs/super.c:1108
	 btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
	 deactivate_locked_super+0x94/0x160 fs/super.c:335
	 deactivate_super+0xad/0xd0 fs/super.c:366
	 cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
	 task_work_run+0xdd/0x190 kernel/task_work.c:141
	 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
	 exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
	 exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
	 syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &bdev->bd_mutex --> sb_internal#2 --> &fs_devs->device_list_mutex

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_devs->device_list_mutex);
				 lock(sb_internal#2);
				 lock(&fs_devs->device_list_mutex);
    lock(&bdev->bd_mutex);

   *** DEADLOCK ***

  3 locks held by syz-executor.0/6878:
   #0: ffff88809070c0e0 (&type->s_umount_key#70){++++}-{3:3}, at: deactivate_super+0xa5/0xd0 fs/super.c:365
   #1: ffffffff8a5b37a8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_close_devices+0x23/0x1f0 fs/btrfs/volumes.c:1178
   #2: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159

  stack backtrace:
  CPU: 0 PID: 6878 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x198/0x1fd lib/dump_stack.c:118
   check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
   check_prev_add kernel/locking/lockdep.c:2496 [inline]
   check_prevs_add kernel/locking/lockdep.c:2601 [inline]
   validate_chain kernel/locking/lockdep.c:3218 [inline]
   __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426
   lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006
   __mutex_lock_common kernel/locking/mutex.c:956 [inline]
   __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
   blkdev_put+0x30/0x520 fs/block_dev.c:1804
   btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline]
   btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline]
   btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline]
   close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161
   close_fs_devices fs/btrfs/volumes.c:1193 [inline]
   btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179
   close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149
   generic_shutdown_super+0x144/0x370 fs/super.c:464
   kill_anon_super+0x36/0x60 fs/super.c:1108
   btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265
   deactivate_locked_super+0x94/0x160 fs/super.c:335
   deactivate_super+0xad/0xd0 fs/super.c:366
   cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118
   task_work_run+0xdd/0x190 kernel/task_work.c:141
   tracehook_notify_resume include/linux/tracehook.h:188 [inline]
   exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
   exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
   syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x460027
  RSP: 002b:00007fff59216328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 0000000000076035 RCX: 0000000000460027
  RDX: 0000000000403188 RSI: 0000000000000002 RDI: 00007fff592163d0
  RBP: 0000000000000333 R08: 0000000000000000 R09: 000000000000000b
  R10: 0000000000000005 R11: 0000000000000246 R12: 00007fff59217460
  R13: 0000000002df2a60 R14: 0000000000000000 R15: 00007fff59217460

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add syzbot reference ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-30 19:34:24 +02:00
Linus Torvalds
90fb702791 autofs: use __kernel_write() for the autofs pipe writing
autofs got broken in some configurations by commit 13c164b1a1
("autofs: switch to kernel_write") because there is now an extra LSM
permission check done by security_file_permission() in rw_verify_area().

autofs is one if the few places that really does want the much more
limited __kernel_write(), because the write is an internal kernel one
that shouldn't do any user permission checks (it also doesn't need the
file_start_write/file_end_write logic, since it's just a pipe).

There are a couple of other cases like that - accounting, core dumping,
and splice - but autofs stands out because it can be built as a module.

As a result, we need to export this internal __kernel_write() function
again.

We really don't want any other module to use this, but we don't have a
"EXPORT_SYMBOL_FOR_AUTOFS_ONLY()".  But we can mark it GPL-only to at
least approximate that "internal use only" for licensing.

While in this area, make autofs pass in NULL for the file position
pointer, since it's always a pipe, and we now use a NULL file pointer
for streaming file descriptors (see file_ppos() and commit 438ab720c6:
"vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files")

This effectively reverts commits 9db9775224 ("fs: unexport
__kernel_write") and 13c164b1a1 ("autofs: switch to kernel_write").

Fixes: 13c164b1a1 ("autofs: switch to kernel_write")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-29 17:18:34 -07:00
Hao Xu
c8d317aa18 io_uring: fix async buffered reads when readahead is disabled
The async buffered reads feature is not working when readahead is
turned off. There are two things to concern:

- when doing retry in io_read, not only the IOCB_WAITQ flag but also
  the IOCB_NOWAIT flag is still set, which makes it goes to would_block
  phase in generic_file_buffered_read() and then return -EAGAIN. After
  that, the io-wq thread work is queued, and later doing the async
  reads in the old way.

- even if we remove IOCB_NOWAIT when doing retry, the feature is still
  not running properly, since in generic_file_buffered_read() it goes to
  lock_page_killable() after calling mapping->a_ops->readpage() to do
  IO, and thus causing process to sleep.

Fixes: 1a0a7853b9 ("mm: support async buffered reads in generic_file_buffered_read()")
Fixes: 3b2a4439e0 ("io_uring: get rid of kiocb_wait_page_queue_init()")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-29 07:54:00 -06:00
Linus Torvalds
fb0155a09b NFS client bugfixes for Linux 5.9
Highlights include:
 
 Bugfixes:
 - NFSv4.2: copy_file_range needs to invalidate caches on success
 - NFSv4.2: Fix security label length not being reset
 - pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
 - pNFS/flexfiles: Fix signed/unsigned type issues with mirror indices
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl9yHBYACgkQZwvnipYK
 APLKCA//Sppmzm+kFDmZ6iWplwdoIq7rnIMG7eKKGD754dDvOtYNIw9D9yOIY5G6
 eVdvQ10m6vA8Dp8AxaWK9qacMXljmOX8szz+Bf1NcIe2F6X/waO3zMoud8Rd9Ja4
 PigAbAW6Gs0gohL3wg+jh5N5JlaDcZ0Dri3QWdqGaHjhrKV9MW9h0BpBCx9YCPkL
 FFgk+I+524rGQnkHvCWbclww4428e+MSYdeJE+c4wrIx/HCz3iJ60AFA0SIAw7FV
 6qMtxN4/kqfdIrA074xcreMdkucxe3lNl7ujT1T6dum2OwERq+WyzkwoirqNguJM
 X71CXU9IE8rw72ATWMoba961i4HITp05ZbVg7yXZrrRkAEljyHhr67R/1RRSlxQm
 ZrPOICrCoXKHRFTbNL7Sb+xeTGbuZQkbcwGXnUYdTIO3JQ6PRIEFb/y8yuuT+EPG
 KWk2vM+QM9036qfBWjbAZMOpwDB4oiVkBgzNM8FGcebiV1FANQ1by7oMaQsH1NLm
 WY0M0KFY2wdv3ovGT7oUOEbtoxD993HuuLdIWxTRHFjRPgg8WKTFnf4BIeZtMjY8
 oRvN83hEjWszuTEuuEukUdsqLTftv7rNhxrotoh9WfeSXvJDB6PF0y55UmZ6WuKE
 wRQQLxC9Om+E3HidxgOolqKxD6d4OOY3XJWzH3As7sJEgQyE/5o=
 =cNi/
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

   - NFSv4.2: copy_file_range needs to invalidate caches on success

   - NFSv4.2: Fix security label length not being reset

   - pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly
     on read

   - pNFS/flexfiles: Fix signed/unsigned type issues with mirror
     indices"

* tag 'nfs-for-5.9-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  pNFS/flexfiles: Be consistent about mirror index types
  pNFS/flexfiles: Ensure we initialise the mirror bsizes correctly on read
  NFSv4.2: fix client's attribute cache management for copy_file_range
  nfs: Fix security label length not being reset
2020-09-28 11:05:56 -07:00
Jens Axboe
fad8e0de44 io_uring: fix potential ABBA deadlock in ->show_fdinfo()
syzbot reports a potential lock deadlock between the normal IO path and
->show_fdinfo():

======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc6-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.2/19710 is trying to acquire lock:
ffff888098ddc450 (sb_writers#4){.+.+}-{0:0}, at: io_write+0x6b5/0xb30 fs/io_uring.c:3296

but task is already holding lock:
ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&ctx->uring_lock){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:956 [inline]
       __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
       __io_uring_show_fdinfo fs/io_uring.c:8417 [inline]
       io_uring_show_fdinfo+0x194/0xc70 fs/io_uring.c:8460
       seq_show+0x4a8/0x700 fs/proc/fd.c:65
       seq_read+0x432/0x1070 fs/seq_file.c:208
       do_loop_readv_writev fs/read_write.c:734 [inline]
       do_loop_readv_writev fs/read_write.c:721 [inline]
       do_iter_read+0x48e/0x6e0 fs/read_write.c:955
       vfs_readv+0xe5/0x150 fs/read_write.c:1073
       kernel_readv fs/splice.c:355 [inline]
       default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
       do_splice_to+0x137/0x170 fs/splice.c:871
       splice_direct_to_actor+0x307/0x980 fs/splice.c:950
       do_splice_direct+0x1b3/0x280 fs/splice.c:1059
       do_sendfile+0x55f/0xd40 fs/read_write.c:1540
       __do_sys_sendfile64 fs/read_write.c:1601 [inline]
       __se_sys_sendfile64 fs/read_write.c:1587 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #1 (&p->lock){+.+.}-{3:3}:
       __mutex_lock_common kernel/locking/mutex.c:956 [inline]
       __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103
       seq_read+0x61/0x1070 fs/seq_file.c:155
       pde_read fs/proc/inode.c:306 [inline]
       proc_reg_read+0x221/0x300 fs/proc/inode.c:318
       do_loop_readv_writev fs/read_write.c:734 [inline]
       do_loop_readv_writev fs/read_write.c:721 [inline]
       do_iter_read+0x48e/0x6e0 fs/read_write.c:955
       vfs_readv+0xe5/0x150 fs/read_write.c:1073
       kernel_readv fs/splice.c:355 [inline]
       default_file_splice_read.constprop.0+0x4e6/0x9e0 fs/splice.c:412
       do_splice_to+0x137/0x170 fs/splice.c:871
       splice_direct_to_actor+0x307/0x980 fs/splice.c:950
       do_splice_direct+0x1b3/0x280 fs/splice.c:1059
       do_sendfile+0x55f/0xd40 fs/read_write.c:1540
       __do_sys_sendfile64 fs/read_write.c:1601 [inline]
       __se_sys_sendfile64 fs/read_write.c:1587 [inline]
       __x64_sys_sendfile64+0x1cc/0x210 fs/read_write.c:1587
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (sb_writers#4){.+.+}-{0:0}:
       check_prev_add kernel/locking/lockdep.c:2496 [inline]
       check_prevs_add kernel/locking/lockdep.c:2601 [inline]
       validate_chain kernel/locking/lockdep.c:3218 [inline]
       __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
       lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
       percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
       __sb_start_write+0x228/0x450 fs/super.c:1672
       io_write+0x6b5/0xb30 fs/io_uring.c:3296
       io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
       __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
       io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
       io_submit_sqe fs/io_uring.c:6324 [inline]
       io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
       __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

other info that might help us debug this:

Chain exists of:
  sb_writers#4 --> &p->lock --> &ctx->uring_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ctx->uring_lock);
                               lock(&p->lock);
                               lock(&ctx->uring_lock);
  lock(sb_writers#4);

 *** DEADLOCK ***

1 lock held by syz-executor.2/19710:
 #0: ffff8880a11b8428 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0xe9a/0x1bd0 fs/io_uring.c:8348

stack backtrace:
CPU: 0 PID: 19710 Comm: syz-executor.2 Not tainted 5.9.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x198/0x1fd lib/dump_stack.c:118
 check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827
 check_prev_add kernel/locking/lockdep.c:2496 [inline]
 check_prevs_add kernel/locking/lockdep.c:2601 [inline]
 validate_chain kernel/locking/lockdep.c:3218 [inline]
 __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4441
 lock_acquire+0x1f3/0xaf0 kernel/locking/lockdep.c:5029
 percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
 __sb_start_write+0x228/0x450 fs/super.c:1672
 io_write+0x6b5/0xb30 fs/io_uring.c:3296
 io_issue_sqe+0x18f/0x5c50 fs/io_uring.c:5719
 __io_queue_sqe+0x280/0x1160 fs/io_uring.c:6175
 io_queue_sqe+0x692/0xfa0 fs/io_uring.c:6254
 io_submit_sqe fs/io_uring.c:6324 [inline]
 io_submit_sqes+0x1761/0x2400 fs/io_uring.c:6521
 __do_sys_io_uring_enter+0xeac/0x1bd0 fs/io_uring.c:8349
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45e179
Code: 3d b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 0b b2 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f1194e74c78 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
RAX: ffffffffffffffda RBX: 00000000000082c0 RCX: 000000000045e179
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000004
RBP: 000000000118cf98 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cf4c
R13: 00007ffd1aa5756f R14: 00007f1194e759c0 R15: 000000000118cf4c

Fix this by just not diving into details if we fail to trylock the
io_uring mutex. We know the ctx isn't going away during this operation,
but we cannot safely iterate buffers/files/personalities if we don't
hold the io_uring mutex.

Reported-by: syzbot+2f8fa4e860edc3066aba@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-28 09:06:08 -06:00
Jens Axboe
8706e04ed7 io_uring: always delete double poll wait entry on match
syzbot reports a crash with tty polling, which is using the double poll
handling:

general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f]
CPU: 0 PID: 6874 Comm: syz-executor749 Not tainted 5.9.0-rc6-next-20200924-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_poll_get_single fs/io_uring.c:4778 [inline]
RIP: 0010:io_poll_double_wake+0x51/0x510 fs/io_uring.c:4845
Code: fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 9e 03 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 5d 08 48 8d 7b 48 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 06 0f 8e 63 03 00 00 0f b6 6b 48 bf 06 00 00
RSP: 0018:ffffc90001c1fb70 EFLAGS: 00010006
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000009 RSI: ffffffff81d9b3ad RDI: 0000000000000048
RBP: dffffc0000000000 R08: ffff8880a3cac798 R09: ffffc90001c1fc60
R10: fffff52000383f73 R11: 0000000000000000 R12: 0000000000000004
R13: ffff8880a3cac798 R14: ffff8880a3cac7a0 R15: 0000000000000004
FS:  0000000001f98880(0000) GS:ffff8880ae400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f18886916c0 CR3: 0000000094c5a000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __wake_up_common+0x147/0x650 kernel/sched/wait.c:93
 __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:123
 tty_ldisc_hangup+0x1cf/0x680 drivers/tty/tty_ldisc.c:735
 __tty_hangup.part.0+0x403/0x870 drivers/tty/tty_io.c:625
 __tty_hangup drivers/tty/tty_io.c:575 [inline]
 tty_vhangup+0x1d/0x30 drivers/tty/tty_io.c:698
 pty_close+0x3f5/0x550 drivers/tty/pty.c:79
 tty_release+0x455/0xf60 drivers/tty/tty_io.c:1679
 __fput+0x285/0x920 fs/file_table.c:281
 task_work_run+0xdd/0x190 kernel/task_work.c:141
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:165 [inline]
 exit_to_user_mode_prepare+0x1e2/0x1f0 kernel/entry/common.c:192
 syscall_exit_to_user_mode+0x7a/0x2c0 kernel/entry/common.c:267
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x401210

which is due to a failure in removing the double poll wait entry if we
hit a wakeup match. This can cause multiple invocations of the wakeup,
which isn't safe.

Cc: stable@vger.kernel.org # v5.8
Reported-by: syzbot+81b3883093f772addf6d@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-28 08:38:54 -06:00
Linus Torvalds
692495baa3 io_uring-5.9-2020-09-25
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9upV4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuPrD/9K1UQLv38K2nPYclLymOi+GIsukpjgwzdY
 SM38GNXU5vYkFhylH/bXfckNQ/gja0/whNpcr/UVCgTWleMnss9UiZaCgysyuIOL
 vnBxT4yDZIxtkOwF/790NiwV2FrsmrLFdNZU4LkmfbmmrAlNtjOElKyJsOyNNMzJ
 UMzHH2Z1vvUwKz+Yq4fPyZCJbpNHN6ABwkSXY/Nz8oWsKfw728fZztLsP57gOtkl
 yYVFO2z1n7VaWp5ZzVFYG51DFuMCIDXgN6mMlaKfnQ6auQZxjR+jg69HSRKLjIx7
 ZEG1gl/DzwH1+751P7HnuI3U7BtBYolyErHW4j4a6Ko4XX8PPhG9ODKOmsEMPrEq
 gCUGcGgWUsEyz+pyullTEt7ea/oLGJ5N86qtNOdviXETZZTghm47QlzxFWr1/GWy
 wH++ctBf/Ekk0dbCBF6mJiqDl/PrVSDSClTVhsGJESEmk4BOoC9zd9zT/EfsiR9m
 vA8wLE2g1/5oU+irQ0Dlc/EENVWISiigOFFvTPJJjma9iGXAW3kV2/aYW6DKZSwM
 w/va7zTlzt89O+L0AT+Rg8btaiTiaZcs3op1AFa1z5Gut3b5YhWL/e95wlaOI6Nv
 Tudm4GX06BaN1QdUDV9g0Pr5iNOaCvuOArjNOU3j7ySusJxiJ8GdA3WFqJ/XUlIV
 pne8hC/+7A==
 =mBw7
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two fixes for regressions in this cycle, and one that goes to 5.8
  stable:

   - fix leak of getname() retrieved filename

   - remove plug->nowait assignment, fixing a regression with btrfs

   - fix for async buffered retry"

* tag 'io_uring-5.9-2020-09-25' of git://git.kernel.dk/linux-block:
  io_uring: ensure async buffered read-retry is setup properly
  io_uring: don't unconditionally set plug->nowait = true
  io_uring: ensure open/openat2 name is cleaned on cancelation
2020-09-26 11:13:51 -07:00
Jens Axboe
f38c7e3abf io_uring: ensure async buffered read-retry is setup properly
A previous commit for fixing up short reads botched the async retry
path, so we ended up going to worker threads more often than we should.
Fix this up, so retries work the way they originally were intended to.

Fixes: 227c0c9673 ("io_uring: internally retry short reads")
Reported-by: Hao_Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 15:39:13 -06:00
Jens Axboe
62c774ed48 io_uring: don't unconditionally set plug->nowait = true
This causes all the bios to be submitted with REQ_NOWAIT, which can be
problematic on either btrfs or on file systems that otherwise use a mix
of block devices where only some of them support it.

For now, just remove the setting of plug->nowait = true.

Reported-by: Dan Melnic <dmm@fb.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Fixes: b63534c41e ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 09:01:53 -06:00
Josef Bacik
313b085851 btrfs: move btrfs_scratch_superblocks into btrfs_dev_replace_finishing
We need to move the closing of the src_device out of all the device
replace locking, but we definitely want to zero out the superblock
before we commit the last time to make sure the device is properly
removed.  Handle this by pushing btrfs_scratch_superblocks into
btrfs_dev_replace_finishing, and then later on we'll move the src_device
closing and freeing stuff where we need it to be.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-09-25 16:40:22 +02:00
Jens Axboe
f3cd485050 io_uring: ensure open/openat2 name is cleaned on cancelation
If we cancel these requests, we'll leak the memory associated with the
filename. Add them to the table of ops that need cleaning, if
REQ_F_NEED_CLEANUP is set.

Cc: stable@vger.kernel.org
Fixes: e62753e4e2 ("io_uring: call statx directly")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 07:41:46 -06:00
Al Viro
3701cb59d8 ep_create_wakeup_source(): dentry name can change under you...
or get freed, for that matter, if it's a long (separately stored)
name.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-24 19:41:58 -04:00