Commit graph

121 commits

Author SHA1 Message Date
Naohiro Aota
b093151391 btrfs: zoned: activate metadata block group on flush_space
For metadata space on zoned filesystem, reaching ALLOC_CHUNK{,_FORCE}
means we don't have enough space left in the active_total_bytes. Before
allocating a new chunk, we can try to activate an existing block group
in this case.

Also, allocating a chunk is not enough to grant a ticket for metadata
space on zoned filesystem we need to activate the block group to
increase the active_total_bytes.

btrfs_zoned_activate_one_bg() implements the activation feature. It will
activate a block group by (maybe) finishing a block group. It will give up
activating a block group if it cannot finish any block group.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:42 +02:00
Naohiro Aota
79417d040f btrfs: zoned: disable metadata overcommit for zoned
The metadata overcommit makes the space reservation flexible but it is also
harmful to active zone tracking. Since we cannot finish a block group from
the metadata allocation context, we might not activate a new block group
and might not be able to actually write out the overcommit reservations.

So, disable metadata overcommit for zoned filesystems. We will ensure
the reservations are under active_total_bytes in the following patches.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:42 +02:00
Naohiro Aota
6a921de589 btrfs: zoned: introduce space_info->active_total_bytes
The active_total_bytes, like the total_bytes, accounts for the total bytes
of active block groups in the space_info.

With an introduction of active_total_bytes, we can check if the reserved
bytes can be written to the block groups without activating a new block
group. The check is necessary for metadata allocation on zoned
filesystem. We cannot finish a block group, which may require waiting
for the current transaction, from the metadata allocation context.
Instead, we need to ensure the ongoing allocation (reserved bytes) fits
in active block groups.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:42 +02:00
Stefan Roesch
f6fca3917b btrfs: store chunk size in space-info struct
The chunk size is stored in the btrfs_space_info structure.  It is
initialized at the start and is then used.

A new API is added to update the current chunk size.  This API is used
to be able to expose the chunk_size as a sysfs setting.

Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename and merge helpers, switch atomic type to u64, style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:45:32 +02:00
David Sterba
143823cf4d btrfs: fix typos in comments
Codespell has found a few typos.

Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
Josef Bacik
bb5a098d97 btrfs: make the bg_reclaim_threshold per-space info
For non-zoned file systems it's useful to have the auto reclaim feature,
however there are different use cases for non-zoned, for example we may
not want to reclaim metadata chunks ever, only data chunks.  Move this
sysfs flag to per-space_info.  This won't affect current users because
this tunable only ever did anything for zoned, and that is currently
hidden behind BTRFS_CONFIG_DEBUG.

Tested-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ jth restore global bg_reclaim_threshold ]
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Yu Zhe
0d031dc4aa btrfs: remove unnecessary type casts
Explicit type casts are not necessary when it's void* to another pointer
type.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:11 +02:00
Niels Dossche
bf7bd725b0 btrfs: add lockdep_assert_held to need_preemptive_reclaim
In a previous patch ("btrfs: extend locking to all space_info members
accesses") the locking for the space_info members was extended in
btrfs_preempt_reclaim_metadata_space because not all the member
accesses that needed locks were actually locked (bytes_pinned et al).

It was then suggested to also add a call to lockdep_assert_held to
need_preemptive_reclaim. This function also works with space_info
members. As of now, it has only two call sites which both hold the lock.

Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Niels Dossche
06bae87663 btrfs: extend locking to all space_info members accesses
bytes_pinned is always accessed under space_info->lock, except in
btrfs_preempt_reclaim_metadata_space, however the other members are
accessed under that lock. The reserved member of the rsv's are also
partially accessed under a lock and partially not. Move all these
accesses into the same lock to ensure consistency.

This could potentially race and lead to a flush instead of a commit but
it's not a big problem as it's only for preemptive flush.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Niels Dossche <niels.dossche@ugent.be>
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Yang Li
be8d1a2ab9 btrfs: fix argument list that the kdoc format and script verified
The warnings were found by running scripts/kernel-doc, which is
caused by using 'make W=1'.

fs/btrfs/extent_io.c:3210: warning: Function parameter or member
'bio_ctrl' not described in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio'
description in 'btrfs_bio_add_page'
fs/btrfs/extent_io.c:3210: warning: Excess function parameter
'prev_bio_flags' description in 'btrfs_bio_add_page'
fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root'
description in 'btrfs_reserve_metadata_bytes'
fs/btrfs/space-info.c:1602: warning: Function parameter or member
'fs_info' not described in 'btrfs_reserve_metadata_bytes'

Note: this is fixing only the warnings regarding parameter list, the
first line is not strictly conforming to the kdoc format as the btrfs
codebase does not stick to that and keeps the first line more free form
(because it's only for internal use).

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07 14:18:27 +01:00
Josef Bacik
ce5603d015 btrfs: don't use the extent_root in flush_space
We only need the root to start a transaction, and since it's a global
root we can pick anything, change to the tree_root as we'll have a lot
of extent roots in the future.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:48 +01:00
Josef Bacik
9270501c16 btrfs: change root to fs_info for btrfs_reserve_metadata_bytes
We used to need the root for btrfs_reserve_metadata_bytes to check the
orphan cleanup state, but we no longer need that, we simply need the
fs_info.  Change btrfs_reserve_metadata_bytes() to use the fs_info, and
change both btrfs_block_rsv_refill() and btrfs_block_rsv_add() to do the
same as they simply call btrfs_reserve_metadata_bytes() and then
manipulate the block_rsv that is being used.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:45 +01:00
Josef Bacik
6dbdd578cd btrfs: remove global rsv stealing logic for orphan cleanup
This is very old code before we were stealing from the global reserve
during evict.  We have proper ways to steal from the global reserve
while we're evicting, so rip out this code as it's no longer necessary.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:45 +01:00
Josef Bacik
ee6adbfd6a btrfs: make BTRFS_RESERVE_FLUSH_EVICT use the global rsv stealing code
I forgot to convert this over when I introduced the global reserve
stealing code to the space flushing code.  Evict was simply trying to
make its reservation and then if it failed it would steal from the
global rsv, which is racey because it's outside of the normal ticketing
code.

Fix this by setting ticket->steal if we are BTRFS_RESERVE_FLUSH_EVICT,
and then make the priority flushing path do the steal for us.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:45 +01:00
Josef Bacik
1b0309eaa4 btrfs: check ticket->steal in steal_from_global_block_rsv
We're going to use this helper in the priority flushing loop, move this
check into the helper to simplify the logic.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:45 +01:00
Josef Bacik
9cd8dcdc5e btrfs: check for priority ticket granting before flushing
Since we're dropping locks before we enter the priority flushing loops
we could have had our ticket granted before we got the space_info->lock.
So add this check to avoid doing some extra flushing in the priority
flushing cases.

The case in priority_reclaim_metadata_space is an optimization.  Think
we came in to reserve, we didn't have the space, we added our ticket to
the list.  But at the same time somebody was waiting on the space_info
lock to add space and do btrfs_try_granting_ticket(), so we drop the
lock, get satisfied, come in to do our loop, and we have been
satisfied.

This is the priority reclaim path, so to_reclaim could be !0 still
because we may have only satisfied the priority tickets and still left
non priority tickets on the list.  We would then have to_reclaim but
->bytes == 0.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add note about the optimization ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:45 +01:00
Josef Bacik
9f35f76d7d btrfs: handle priority ticket failures in their respective helpers
Currently the error case for the priority tickets is handled where we
deal with all of the tickets, priority and non-priority.  This is OK in
general, but it makes for some awkward locking.  We take and drop the
space_info->lock back to back because of these different types of
tickets.

Rework the code to handle priority ticket failures in their respective
helpers.  This allows us to be less wonky with our space_info->lock
usage, and means that the main handler simply has to check
ticket->error, as the ticket is guaranteed to be off any list and
completely handled by the time it exits one of the handlers.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03 15:09:45 +01:00
Josef Bacik
0e24f6d84b btrfs: do not infinite loop in data reclaim if we aborted
Error injection stressing uncovered a busy loop in our data reclaim
loop.  There are two cases here, one where we loop creating block groups
until space_info->full is set, or in the main loop we will skip erroring
out any tickets if space_info->full == 0.  Unfortunately if we aborted
the transaction then we will never allocate chunks or reclaim any space
and thus never get ->full, and you'll see stack traces like this:

  watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u4:4:139]
  CPU: 0 PID: 139 Comm: kworker/u4:4 Tainted: G        W         5.13.0-rc1+ #328
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Workqueue: events_unbound btrfs_async_reclaim_data_space
  RIP: 0010:btrfs_join_transaction+0x12/0x20
  RSP: 0018:ffffb2b780b77de0 EFLAGS: 00000246
  RAX: ffffb2b781863d58 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000801 RSI: ffff987952b57400 RDI: ffff987940aa3000
  RBP: ffff987954d55000 R08: 0000000000000001 R09: ffff98795539e8f0
  R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffffff
  R13: ffff987952b574c8 R14: ffff987952b57400 R15: 0000000000000008
  FS:  0000000000000000(0000) GS:ffff9879bbc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f0703da4000 CR3: 0000000113398004 CR4: 0000000000370ef0
  Call Trace:
   flush_space+0x4a8/0x660
   btrfs_async_reclaim_data_space+0x55/0x130
   process_one_work+0x1e9/0x380
   worker_thread+0x53/0x3e0
   ? process_one_work+0x380/0x380
   kthread+0x118/0x140
   ? __kthread_bind_mask+0x60/0x60
   ret_from_fork+0x1f/0x30

Fix this by checking to see if we have a btrfs fs error in either of the
reclaim loops, and if so fail the tickets and bail.  In addition to
this, fix maybe_fail_all_tickets() to not try to grant tickets if we've
aborted, simply fail everything.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26 19:08:05 +02:00
Qu Wenruo
0619b79014 btrfs: prevent __btrfs_dump_space_info() to underflow its free space
It's not uncommon where __btrfs_dump_space_info() gets called
under over-commit situations.

In that case free space would underflow as total allocated space is not
enough to handle all the over-committed space.

Such underflow values can sometimes cause confusion for users enabled
enospc_debug mount option, and takes some seconds for developers to
convert the underflow value to signed result.

Just output the free space as s64 to avoid such problem.

Reported-by: Eli V <eliventer@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAJtFHUSy4zgyhf-4d9T+KdJp9w=UgzC2A0V=VtmaeEpcGgm1-Q@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-17 19:29:54 +02:00
Josef Bacik
1146239794 btrfs: do not do preemptive flushing if the majority is global rsv
A common characteristic of the bug report where preemptive flushing was
going full tilt was the fact that the vast majority of the free metadata
space was used up by the global reserve.  The hard 90% threshold would
cover the majority of these cases, but to be even smarter we should take
into account how much of the outstanding reservations are covered by the
global block reserve.  If the global block reserve accounts for the vast
majority of outstanding reservations, skip preemptive flushing, as it
will likely just cause churn and pain.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:16 +02:00
Josef Bacik
93c60b17f2 btrfs: reduce the preemptive flushing threshold to 90%
The preemptive flushing code was added in order to avoid needing to
synchronously wait for ENOSPC flushing to recover space.  Once we're
almost full however we can essentially flush constantly.  We were using
98% as a threshold to determine if we were simply full, however in
practice this is a really high bar to hit.  For example reports of
systems running into this problem had around 94% usage and thus
continued to flush.  Fix this by lowering the threshold to 90%, which is
a more sane value, especially for smaller file systems.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212185
CC: stable@vger.kernel.org # 5.12+
Fixes: 576fa34830 ("btrfs: improve preemptive background space flushing")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:15 +02:00
Josef Bacik
e16460707e btrfs: wait on async extents when flushing delalloc
I've been debugging an early ENOSPC problem in production and finally
root caused it to this problem.  When we switched to the per-inode in
38d715f494 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") I pulled out the async extent handling, because we
were doing the correct thing by calling filemap_flush() if we had async
extents set.  This would properly wait on any async extents by locking
the page in the second flush, thus making sure our ordered extents were
properly set up.

However when I switched us back to page based flushing, I used
sync_inode(), which allows us to pass in our own wbc.  The problem here
is that sync_inode() is smarter than the filemap_* helpers, it tries to
avoid calling writepages at all.  This means that our second call could
skip calling do_writepages altogether, and thus not wait on the pagelock
for the async helpers.  This means we could come back before any ordered
extents were created and then simply continue on in our flushing
mechanisms and ENOSPC out when we have plenty of space to use.

Fix this by putting back the async pages logic in shrink_delalloc.  This
allows us to bulk write out everything that we need to, and then we can
wait in one place for the async helpers to catch up, and then wait on
any ordered extents that are created.

Fixes: e076ab2a2c ("btrfs: shrink delalloc pages instead of full inodes")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:07 +02:00
Josef Bacik
03fe78cc29 btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc
We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be.  With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_used.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.

Consider the example of a 8 way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations.  Now assume we have 128MiB of delalloc
outstanding.  With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.

Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.

Fix this two ways.  First, change the calculations to be a fraction of
the total delalloc bytes on the system.  Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.

Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once.  This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.

I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again.  This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:07 +02:00
Josef Bacik
fcdef39c03 btrfs: enable a tracepoint when we fail tickets
When debugging early enospc problems it was useful to have a tracepoint
where we failed all tickets so I could check the state of the enospc
counters at failure time to validate my fixes.  This adds the tracpoint
so you can easily get that information.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Josef Bacik
138a12d865 btrfs: rip out btrfs_space_info::total_bytes_pinned
We used this in may_commit_transaction() in order to determine if we
needed to commit the transaction.  However we no longer have that logic
and thus have no use of this counter anymore, so delete it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:55:25 +02:00
Josef Bacik
3ffad6961d btrfs: rip the first_ticket_bytes logic from fail_all_tickets
This was a trick implemented to handle the case where we had a giant
reservation in front of a bunch of little reservations in the ticket
queue.  If the giant reservation was too large for the transaction
commit to make a difference we'd ENOSPC everybody out instead of
committing the transaction.  This logic was put in to force us to go
back and re-try the transaction commit logic to see if we could make
progress.

Instead now we know we've committed the transaction, so any space that
would have been recovered is now available, and would be caught by the
btrfs_try_granting_tickets() in this loop, so we no longer need this
code and can simply delete it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:51:48 +02:00
Josef Bacik
0480855392 btrfs: remove FLUSH_DELAYED_REFS from data ENOSPC flushing
Since we unconditionally commit the transaction now we no longer need to
run the delayed refs to make sure our total_bytes_pinned value is
uptodate, we can simply commit the transaction.  Remove this stage from
the data flushing list.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:51:10 +02:00
Josef Bacik
c416a30cdd btrfs: rip out may_commit_transaction
may_commit_transaction was introduced before the ticketing
infrastructure existed.  There was a problem where we'd legitimately be
out of space, but every reservation would trigger a transaction commit
and then fail.  Thus if you had 1000 things trying to make a
reservation, they'd all do the flushing loop and thus commit the
transaction 1000 times before they'd get their ENOSPC.

This helper was introduced to short circuit this, if there wasn't space
that could be reclaimed by committing the transaction then simply ENOSPC
out.  This made true ENOSPC tests much faster as we didn't waste a bunch
of time.

However many of our bugs over the years have been from cases where we
didn't account for some space that would be reclaimed by committing a
transaction.  The delayed refs rsv space, delayed rsv, many pinned bytes
miscalculations, etc.  And in the meantime the original problem has been
solved with ticketing.  We no longer will commit the transaction 1000
times.  Instead we'll get 1000 waiters, we will go through the flushing
mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
out.  The ticketing infrastructure gives us a deterministic way to see
if we're making progress or not, thus we avoid a lot of extra work.

So simplify this step by simply unconditionally committing the
transaction.  This removes what is arguably our most common source of
early ENOSPC bugs and will allow us to drastically simplify many of the
things we track because we simply won't need them with this stuff gone.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:50:45 +02:00
David Sterba
1a9fd4172d btrfs: fix typos in comments
Fix typos that have snuck in since the last round. Found by codespell.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:11:57 +02:00
Josef Bacik
385f421f18 btrfs: handle preemptive delalloc flushing slightly differently
If we decide to flush delalloc from the preemptive flusher, we really do
not want to wait on ordered extents, as it gains us nothing.  However
there was logic to go ahead and wait on ordered extents if there was
more ordered bytes than delalloc bytes.  We do not want this behavior,
so pass through whether this flushing is for preemption, and do not wait
for ordered extents if that's the case.  Also break out of the shrink
loop after the first flushing, as we just want to one shot shrink
delalloc.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik
3e10156997 btrfs: only ignore delalloc if delalloc is much smaller than ordered
While testing heavy delalloc workloads I noticed that sometimes we'd
just stop preemptively flushing when we had loads of delalloc available
to flush.  This is because we skip preemptive flushing if delalloc <=
ordered.  However if we start with say 4gib of delalloc, and we flush
2gib of that, we'll stop flushing there, when we still have 2gib of
delalloc to flush.

Instead adjust the ordered bytes down by half, this way if 2/3 of our
outstanding delalloc reservations are tied up by ordered extents we
don't bother preemptive flushing, as we're getting close to the state
where we need to wait on ordered extents.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik
30acce4eb0 btrfs: don't include the global rsv size in the preemptive used amount
When deciding if we should preemptively flush space, we will add in the
amount of space used by all block rsvs.  However this also includes the
global block rsv, which isn't flushable so shouldn't be accounted for in
this calculation.  If we decide to use ->bytes_may_use in our used
calculation we need to subtract the global rsv size from this amount so
it most closely matches the flushable space.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik
1239e2da16 btrfs: use the global rsv size in the preemptive thresh calculation
We calculate the amount of "free" space available for normal
reservations by taking the total space and subtracting out the hard used
space, which is readonly, used, and reserved space.

However we weren't taking into account the global block rsv, which is
essentially hard used space.  Handle this by subtracting it from the
available free space, so that our threshold more closely mirrors
reality.

We need to do the check because it's possible that the global_rsv_size +
used is > total_bytes, sometimes the global reserve can end up being
calculated as larger than the available size (think small filesystems
where we only have the original 8MiB chunk of metadata).  It doesn't
usually happen, but that can get us into trouble so this is safer.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik
610a6ef44e btrfs: take into account global rsv in need_preemptive_reclaim
Global rsv can't be used for normal allocations, and for very full file
systems we can decide to try and async flush constantly even though
there's really not a lot of space to reclaim.  Deal with this by
including the global block rsv size in the "total used" calculation.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik
0aae4ca9e9 btrfs: only clamp the first time we have to start flushing
We were clamping the threshold for preemptive reclaim any time we added
a ticket to wait on, which if we have a lot of threads means we'd
essentially max out the clamp the first time we start to flush.

Instead of doing this, simply do it every time we have to start
flushing, this will make us ramp up gradually instead of going to max
clamping as soon as we start needing to do flushing.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik
ed738ba7f9 btrfs: check worker before need_preemptive_reclaim
need_preemptive_reclaim() does some calculations, which aren't heavy,
but if we're already running preemptive reclaim there's no reason to do
them at all, so re-order the checks so that we don't do the calculation
if we're already doing reclaim.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21 15:19:04 +02:00
Josef Bacik
2cdb3909c9 btrfs: use percpu_read_positive instead of sum_positive for need_preempt
Looking at perf data for a fio workload I noticed that we were spending
a pretty large chunk of time (around 5%) doing percpu_counter_sum() in
need_preemptive_reclaim.  This is silly, as we only want to know if we
have more ordered than delalloc to see if we should be counting the
delayed items in our threshold calculation.  Change this to
percpu_read_positive() to avoid the overhead.

I ran this through fsperf to validate the changes, obviously the latency
numbers in dbench and fio are quite jittery, so take them as you wish,
but overall the improvements on throughput, iops, and bw are all
positive.  Each test was run two times, the given value is the average
of both runs for their respective column.

  btrfs ssd normal test results

  bufferedrandwrite16g results
       metric         baseline   current          diff
  ==========================================================
  write_io_kbytes     16777216   16777216     0.00%
  read_clat_ns_p99           0          0     0.00%
  write_bw_bytes      1.04e+08   1.05e+08     1.12%
  read_iops                  0          0     0.00%
  write_clat_ns_p50      13888      11840   -14.75%
  read_io_kbytes             0          0     0.00%
  read_io_bytes              0          0     0.00%
  write_clat_ns_p99      35008      29312   -16.27%
  read_bw_bytes              0          0     0.00%
  elapsed                  170        167    -1.76%
  write_lat_ns_min     4221.50    3762.50   -10.87%
  sys_cpu                39.65      35.37   -10.79%
  write_lat_ns_max    2.67e+10   2.50e+10    -6.63%
  read_lat_ns_min            0          0     0.00%
  write_iops          25270.10   25553.43     1.12%
  read_lat_ns_max            0          0     0.00%
  read_clat_ns_p50           0          0     0.00%

  dbench60 results
    metric     baseline   current         diff
  ==================================================
  qpathinfo       11.12     12.73    14.52%
  throughput     416.09    445.66     7.11%
  flush         3485.63   1887.55   -45.85%
  qfileinfo        0.70      1.92   173.86%
  ntcreatex      992.60    695.76   -29.91%
  qfsinfo          2.43      3.71    52.48%
  close            1.67      3.14    88.09%
  sfileinfo       66.54    105.20    58.10%
  rename         809.23    619.59   -23.43%
  find            16.88     15.46    -8.41%
  unlink         820.54    670.86   -18.24%
  writex        3375.20   2637.91   -21.84%
  deltree        386.33    449.98    16.48%
  readx            3.43      3.41    -0.60%
  mkdir            0.05      0.03   -38.46%
  lockx            0.26      0.26    -0.76%
  unlockx          0.81      0.32   -60.33%

  dio4kbs16threads results
       metric          baseline       current           diff
  ================================================================
  write_io_kbytes         5249676       3357150   -36.05%
  read_clat_ns_p99              0             0     0.00%
  write_bw_bytes      89583501.50   57291192.50   -36.05%
  read_iops                     0             0     0.00%
  write_clat_ns_p50        242688        263680     8.65%
  read_io_kbytes                0             0     0.00%
  read_io_bytes                 0             0     0.00%
  write_clat_ns_p99      15826944      36732928   132.09%
  read_bw_bytes                 0             0     0.00%
  elapsed                      61            61     0.00%
  write_lat_ns_min          42704         42095    -1.43%
  sys_cpu                    5.27          3.45   -34.52%
  write_lat_ns_max       7.43e+08      9.27e+08    24.71%
  read_lat_ns_min               0             0     0.00%
  write_iops             21870.97      13987.11   -36.05%
  read_lat_ns_max               0             0     0.00%
  read_clat_ns_p50              0             0     0.00%

  randwrite2xram results
       metric          baseline       current           diff
  ================================================================
  write_io_kbytes        24831972      28876262    16.29%
  read_clat_ns_p99              0             0     0.00%
  write_bw_bytes      83745273.50   92182192.50    10.07%
  read_iops                     0             0     0.00%
  write_clat_ns_p50         13952         11648   -16.51%
  read_io_kbytes                0             0     0.00%
  read_io_bytes                 0             0     0.00%
  write_clat_ns_p99         50176         52992     5.61%
  read_bw_bytes                 0             0     0.00%
  elapsed                     314           332     5.73%
  write_lat_ns_min        5920.50          5127   -13.40%
  sys_cpu                    7.82          7.35    -6.07%
  write_lat_ns_max       5.27e+10      3.88e+10   -26.44%
  read_lat_ns_min               0             0     0.00%
  write_iops             20445.62      22505.42    10.07%
  read_lat_ns_max               0             0     0.00%
  read_clat_ns_p50              0             0     0.00%

  untarfirefox results
  metric    baseline   current        diff
  ==============================================
  elapsed      47.41     47.40   -0.03%

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-19 17:25:17 +02:00
Naohiro Aota
169e0da91a btrfs: zoned: track unusable bytes for zones
In a zoned filesystem a once written then freed region is not usable
until the underlying zone has been reset. So we need to distinguish such
unusable space from usable free space.

Therefore we need to introduce the "zone_unusable" field to the block
group structure, and "bytes_zone_unusable" to the space_info structure
to track the unusable space.

Pinned bytes are always reclaimed to the unusable space. But, when an
allocated region is returned before using e.g., the block group becomes
read-only between allocation time and reservation time, we can safely
return the region to the block group. For the situation, this commit
introduces "btrfs_add_free_space_unused". This behaves the same as
btrfs_add_free_space() on regular filesystem. On zoned filesystems, it
rewinds the allocation offset.

Because the read-only bytes tracks free but unusable bytes when the block
group is read-only, we need to migrate the zone_unusable bytes to
read-only bytes when a block group is marked read-only.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-09 02:46:03 +01:00
Josef Bacik
e5ad49e215 btrfs: add a trace class for dumping the current ENOSPC state
Often when I'm debugging ENOSPC related issues I have to resort to
printing the entire ENOSPC state with trace_printk() in different spots.
This gets pretty annoying, so add a trace state that does this for us.
Then add a trace point at the end of preemptive flushing so you can see
the state of the space_info when we decide to exit preemptive flushing.
This helped me figure out we weren't kicking in the preemptive flushing
soon enough.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik
4b02b00fe5 btrfs: adjust the flush trace point to include the source
Since we have normal ticketed flushing and preemptive flushing, adjust
the tracepoint so that we know the source of the flushing action to make
it easier to debug problems.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik
88a777a6e5 btrfs: implement space clamping for preemptive flushing
Starting preemptive flushing at 50% of available free space is a good
start, but some workloads are particularly abusive and can quickly
overwhelm the preemptive flushing code and drive us into using tickets.

Handle this by clamping down on our threshold for starting and
continuing to run preemptive flushing.  This is particularly important
for our overcommit case, as we can really drive the file system into
overages and then it's more difficult to pull it back as we start to
actually fill up the file system.

The clamping is essentially 2^CLAMP, but we start at 1 so whatever we
calculate for overcommit is the baseline.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik
2e294c6049 btrfs: simplify the logic in need_preemptive_flushing
A lot of this was added all in one go with no explanation, and is a bit
unwieldy and confusing.  Simplify the logic to start preemptive flushing
if we've reserved more than half of our available free space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:59 +01:00
Josef Bacik
9f42d37748 btrfs: rework btrfs_calc_reclaim_metadata_size
Currently btrfs_calc_reclaim_metadata_size does two things, it returns
the space currently required for flushing by the tickets, and if there
are no tickets it calculates a value for the preemptive flushing.

However for the normal ticketed flushing we really only care about the
space required for tickets.  We will accidentally come in and flush one
time, but as soon as we see there are no tickets we bail out of our
flushing.

Fix this by making btrfs_calc_reclaim_metadata_size really only tell us
what is required for flushing if we have people waiting on space.  Then
move the preemptive flushing logic into need_preemptive_reclaim().  We
ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim()
because if we are in this path then we made our reservation and there
are not pending tickets currently, so we do not need to check it, simply
do the fuzzy logic to check if we're getting low on space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik
f205edf773 btrfs: check reclaim_size in need_preemptive_reclaim
If we're flushing space for tickets then we have
space_info->reclaim_size set and we do not need to do background
reclaim.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik
ae7913ba52 btrfs: rename need_do_async_reclaim
All of our normal flushing is asynchronous reclaim, so this helper is
poorly named.  This is more checking if we need to preemptively flush
space, so rename it to need_preemptive_reclaim.

Also switch it to bool and make it plain static as followup patches will
move more code here.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik
576fa34830 btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.

However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.

In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.

When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing.  However this meant that we essentially
never preemptively flushed.  This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.

The behavior that happened pre that patch was btrfs_end_transaction()
would see that we were low on space, and it would commit the
transaction.  This was bad because in this particular case you could end
up with thousands and thousands of transactions being committed during
the 5 minute reproducer.  With the patch to remove this behavior we got
much more sane transaction commits, but we ended up slower because we
would write for a while, flush, write for a while, flush again.

To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketing flushing in that
it doesn't have tickets to base it's decisions on.  Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing.  Here we will attempt to be slightly
intelligent about the things that we flushing, attempting to balance
between whichever pool is taking up the most space.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik
f00c42dd4c btrfs: introduce a FORCE_COMMIT_TRANS flush operation
Solely for preemptive flushing, we want to be able to force the
transaction commit without any of the ambiguity of
may_commit_transaction().  This is because may_commit_transaction()
checks tickets and such, and in preemptive flushing we already know
it'll be helpful, so use this to keep the code nice and clean and
straightforward.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik
5deb17e18e btrfs: track ordered bytes instead of just dio ordered bytes
We track dio_bytes because the shrink delalloc code needs to know if we
have more DIO in flight than we have normal buffered IO.  The reason for
this is because we can't "flush" DIO, we have to just wait on the
ordered extents to finish.

However this is true of all ordered extents.  If we have more ordered
space outstanding than dirty pages we should be waiting on ordered
extents.  We already are ok on this front technically, because we always
do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in
the preemptive flushing code as well, so change this to count all
ordered bytes instead of just DIO ordered bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik
ac1ea10e75 btrfs: add a trace point for reserve tickets
While debugging a ENOSPC related performance problem I needed to see the
time difference between start and end of a reserve ticket, so add a
trace point to report when we handle a reserve ticket.

I opted to spit out start_ns itself without calculating the difference
because there could be a gap between enabling the tracepoint and setting
start_ns.  Doing it this way allows us to filter on 0 start_ns so we
don't get bogus entries, and we can easily calculate the time difference
with bpftrace or something else.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:58 +01:00
Josef Bacik
91e79a83ff btrfs: make flush_space take a enum btrfs_flush_state instead of int
I got a automated message from somebody who runs clang against our
kernels and it's because I used the wrong enum type for what I passed
into flush_space, caught by -Wenum-conversion.  Change the argument to
be explicitly the enum we're expecting to make everything consistent.
Maybe eventually gcc will catch errors like this.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:58:57 +01:00