linux-stable/include/trace/events
Boris Burkov 52bb7a2166 btrfs: introduce size class to block group allocator
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.

This patch categorizes extents into size classes:

- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"

and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.

Size classes are implemented in the following way:

- Mark each new block group with a size class of the first allocation
  that goes into it.

- Add two new passes to ffe: "unset size class" and "wrong size class".
  First, try only matching block groups, then try unset ones, then allow
  allocation of new ones, and finally allow mismatched block groups.

- Filtering is done just by skipping inappropriate ones, there is no
  special size class indexing.

Other solutions I considered were:

- A best fit allocator with an rb-tree. This worked well, as small
  writes didn't leak big holes from large freed extents, but led to
  regressions in ffe and write performance due to lock contention on
  the rb-tree with every allocation possibly updating it in parallel.
  Perhaps something clever could be done to do the updates in the
  background while being "right enough".

- A fixed size "working set". This prevents freeing an extent
  drastically changing where writes currently land, and seems like a
  good option too. Doesn't take advantage of size in any way.

- The same size class idea, but implemented with xarray marks. This
  turned out to be slower than looping the linked list and skipping
  wrong block groups, and is also less flexible since we must have only
  3 size classes (max #marks). With the current approach we can have as
  many as we like.

Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.

A brief summary of the testing results:

- Neutral results on existing tests. There are some minor regressions
  and improvements here and there, but nothing that truly stands out as
  notable.
- Improvement on new tests where size class and extent lifetime are
  correlated. Fragmentation in these cases is completely eliminated
  and write performance is generally a little better. There is also
  significant improvement where extent sizes are just a bit larger than
  the size class boundaries.
- Regression on one new tests: where the allocations are sized
  intentionally a hair under the borders of the size classes. Results
  are neutral on the test that intentionally attacks this new scheme by
  mixing extent size and lifetime.

The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)

Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:

bufferedappendvsfallocate results
         metric             baseline       current        stdev            diff
======================================================================================
avg_commit_ms                    31.13         29.20          2.67     -6.22%
bg_count                            14         15.60             0     11.43%
commits                          11.10         12.20          0.32      9.91%
elapsed                          27.30         26.40          2.98     -3.30%
end_state_mount_ns         11122551.90   10635118.90     851143.04     -4.38%
end_state_umount_ns           1.36e+09      1.35e+09   12248056.65     -1.07%
find_free_extent_calls       116244.30     114354.30        964.56     -1.63%
find_free_extent_ns_max      599507.20    1047168.20     103337.08     74.67%
find_free_extent_ns_mean       3607.19       3672.11        101.20      1.80%
find_free_extent_ns_min            500           512          6.67      2.40%
find_free_extent_ns_p50           2848          2876         37.65      0.98%
find_free_extent_ns_p95           4916          5000         75.45      1.71%
find_free_extent_ns_p99       20734.49      20920.48       1670.93      0.90%
frag_pct_max                     61.67             0          8.05   -100.00%
frag_pct_mean                    43.59             0          6.10   -100.00%
frag_pct_min                     25.91             0         16.60   -100.00%
frag_pct_p50                     42.53             0          7.25   -100.00%
frag_pct_p95                     61.67             0          8.05   -100.00%
frag_pct_p99                     61.67             0          8.05   -100.00%
fragmented_bg_count               6.10             0          1.45   -100.00%
max_commit_ms                    49.80            46          5.37     -7.63%
sys_cpu                           2.59          2.62          0.29      1.39%
write_bw_bytes                1.62e+08      1.68e+08   17975843.50      3.23%
write_clat_ns_mean            57426.39      54475.95       2292.72     -5.14%
write_clat_ns_p50             46950.40      42905.60       2101.35     -8.62%
write_clat_ns_p99            148070.40     143769.60       2115.17     -2.90%
write_io_kbytes                4194304       4194304             0      0.00%
write_iops                     2476.15       2556.10        274.29      3.23%
write_lat_ns_max            2101667.60    2251129.50     370556.59      7.11%
write_lat_ns_mean             59374.91      55682.00       2523.09     -6.22%
write_lat_ns_min              17353.10         16250       1646.08     -6.36%

There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.

On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.

Some considerations for future work:

- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:34 +01:00
..
9p.h 9p fid refcount: add a 9p_fid_ref tracepoint 2022-07-02 18:52:21 +09:00
afs.h afs: Fix access after dec in put functions 2022-08-02 18:21:29 +01:00
alarmtimer.h
asoc.h
avc.h
bcache.h
block.h block: introduce block_rq_error tracepoint 2022-02-11 10:00:16 -07:00
bpf_test_run.h
bridge.h
btrfs.h btrfs: introduce size class to block group allocator 2023-02-13 17:50:34 +01:00
cachefiles.h fscache,cachefiles: add prepare_ondemand_read() callback 2022-12-07 10:56:29 +08:00
cgroup.h cgroup: Trace event cgroup id fields should be u64 2021-12-01 07:23:35 -10:00
clk.h clk: Add trace events for rate requests 2022-12-07 13:54:09 -08:00
cma.h mm, tracing: unify PFN format strings 2021-06-29 10:53:52 -07:00
compaction.h tracing: incorrect gfp_t conversion 2022-05-13 07:20:18 -07:00
context_tracking.h
cpuhp.h
cxl.h cxl/pci: Add some type-safety to the AER trace points 2022-12-06 14:37:52 -08:00
damon.h mm/damon: hide kernel pointer from tracepoint event 2022-01-15 16:30:33 +02:00
devfreq.h
devlink.h tracing: devlink: Use static array for string in devlink_trap_report event 2022-07-14 15:05:57 -04:00
dlm.h fs: dlm: add dst nodeid for msg tracing 2022-11-21 09:45:49 -06:00
dma_fence.h treewide: Add missing semicolons to __assign_str uses 2021-06-30 09:19:14 -04:00
erofs.h erofs: clean up erofs_iget() 2022-09-27 17:27:45 +08:00
error_report.h panic: use error_report_end tracepoint on warnings 2022-01-20 08:52:55 +02:00
ext4.h ext4: disable fast-commit of encrypted dir operations 2022-12-08 21:49:24 -05:00
f2fs.h f2fs: add block_age-based extent cache 2022-12-12 14:53:56 -08:00
fib.h tracing/ipv4/ipv6: Use static array for name field in fib*_lookup_table event 2022-07-15 13:35:59 -04:00
fib6.h tracing/ipv4/ipv6: Use static array for name field in fib*_lookup_table event 2022-07-15 13:35:59 -04:00
filelock.h
filemap.h filemap: Convert tracing of page cache operations to folio 2022-01-04 13:15:33 -05:00
fs_dax.h
fscache.h fscache: Fix oops due to race with cookie_lru and use_cookie 2022-12-07 11:49:18 -08:00
fsi.h fsi: Add trace events in initialization path 2022-02-21 19:38:54 +10:30
fsi_master_aspeed.h fsi: Add trace events in initialization path 2022-02-21 19:38:54 +10:30
fsi_master_ast_cf.h
fsi_master_gpio.h
gpio.h
gpu_mem.h
habanalabs.h habanalabs: trace DMA allocations 2022-09-18 13:29:53 +03:00
host1x.h
huge_memory.h mm/khugepaged: add tracepoint to collapse_file() 2022-12-11 18:12:09 -08:00
hwmon.h
i2c.h
i2c_slave.h i2c: add tracepoints for I2C slave events 2022-03-20 00:11:05 +01:00
ib_mad.h IB/mad: Don't call to function that might sleep while in atomic context 2022-11-10 10:57:15 +02:00
ib_umad.h
initcall.h
intel-sst.h
intel_ifs.h trace: platform/x86/intel/ifs: Add trace point to track Intel IFS operations 2022-05-12 15:35:29 +02:00
intel_ish.h
io_uring.h io_uring: trace local task work run 2022-09-21 10:30:42 -06:00
iocost.h blk-iocost: Trace vtime_base_rate instead of vtime_rate 2022-12-01 07:44:12 -07:00
iommu.h iommu: Log iova range in map/unmap trace events 2021-12-06 11:59:31 +01:00
ipi.h
irq.h
irq_matrix.h
iscsi.h scsi: iscsi: tracing: Use the new __vstring() helper 2022-07-19 11:20:25 -04:00
jbd2.h jbd2: use the correct print format 2022-12-08 21:49:12 -05:00
kmem.h mm: convert mm's rss stats into percpu_counter 2022-11-30 15:58:40 -08:00
kvm.h KVM: x86/mmu: rename trace function name for asynchronous page fault 2022-08-10 15:08:26 -04:00
kyber.h kyber: avoid q->disk dereferences in trace points 2021-10-15 21:02:57 -06:00
libata.h ata: libata: add qc->flags in ata_qc_complete_template tracepoint 2022-06-17 16:30:03 +09:00
lock.h locking/mutex: Make contention tracepoints more consistent wrt adaptive spinning 2022-04-05 10:24:36 +02:00
maple_tree.h Maple Tree: add new data structure 2022-09-26 19:46:13 -07:00
mce.h
mctp.h mctp: Add SIOCMCTP{ALLOC,DROP}TAG ioctls for tag control 2022-02-09 12:00:11 +00:00
mdio.h
migrate.h mm/migration: add trace events for base page and HugeTLB migrations 2022-03-24 19:06:45 -07:00
mlxsw.h
mmap.h mm: start tracking VMAs with maple tree 2022-09-26 19:46:14 -07:00
mmap_lock.h mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN 2021-11-06 13:30:36 -07:00
mmc.h
mmflags.h mm: Add PG_arch_3 page flag 2022-11-29 09:26:07 +00:00
module.h
mptcp.h mptcp: dump infinite_map field in mptcp_dump_mpext 2022-04-23 11:51:05 +01:00
napi.h
nbd.h
neigh.h neighbor: tracing: Have neigh_create event use __string() 2022-07-15 13:35:59 -04:00
net.h net: Print hashed skb addresses for all net and qdisc events 2022-06-27 11:57:06 +01:00
net_probe_common.h
netfs.h netfs: Add a function to consolidate beginning a read 2022-03-18 09:29:05 +00:00
netlink.h
nilfs2.h fs/nilfs2: Use the enum req_op and blk_opf_t types 2022-07-14 12:14:33 -06:00
nmi.h
objagg.h
oom.h
osnoise.h tracing: Fix spelling in osnoise tracer "interferences" -> "interference" 2021-06-28 14:12:27 -04:00
page_isolation.h
page_pool.h mm, tracing: unify PFN format strings 2021-06-29 10:53:52 -07:00
page_ref.h mm: introduce PAGEFLAGS_MASK to replace ((1UL << NR_PAGEFLAGS) - 1) 2021-09-08 11:50:24 -07:00
pagemap.h mm/lru: Convert __pagevec_lru_add_fn to take a folio 2021-10-18 07:49:40 -04:00
percpu.h include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace" 2022-05-25 10:47:48 -07:00
power.h cpuidle: Add cpu_idle_miss trace event 2022-08-03 17:50:58 +02:00
power_cpu_migrate.h
preemptirq.h
printk.h
pwc.h
pwm.h pwm/tracing: Also record trace events for failed API calls 2022-12-06 12:46:23 +01:00
qdisc.h net: Print hashed skb addresses for all net and qdisc events 2022-06-27 11:57:06 +01:00
qla.h scsi: qla2xxx: tracing: Use the new __vstring() helper 2022-07-19 11:20:25 -04:00
qrtr.h
rcu.h rcu: Refactor rcu_barrier() empty-list handling 2022-02-08 10:12:28 -08:00
rdma_core.h
regulator.h
rpcgss.h trace: Relocate event helper files 2022-12-10 11:01:12 -05:00
rpcrdma.h trace: Relocate event helper files 2022-12-10 11:01:12 -05:00
rpm.h
rseq.h
rtc.h
rv.h rv/monitor: Add the wwnr monitor 2022-07-30 14:01:30 -04:00
rwmmio.h asm-generic/io: Add _RET_IP_ to MMIO trace for more accurate debug info 2022-11-21 22:02:10 +01:00
rxrpc.h rxrpc: Move client call connection to the I/O thread 2023-01-06 09:43:33 +00:00
sched.h sched/tracing: Append prev_state to tp args instead 2022-05-12 00:37:11 +02:00
scmi.h firmware: arm_scmi: Harmonize SCMI tracing message format 2022-08-23 12:21:37 +01:00
scsi.h scsi: trace: Print driver_tag and scheduler_tag in SCSI trace 2022-06-21 21:43:23 -04:00
sctp.h
signal.h
siox.h
skb.h net: use %pS for kfree_skb tracing event location 2022-11-24 15:27:49 +01:00
smbus.h
sock.h net: sock: tracing: Fix sock_exceed_buf_limit not to dereference stale pointer 2022-07-08 12:06:17 +01:00
sof.h ASoC: SOF: replace ipc4-loader dev_vdbg with tracepoints 2022-09-19 15:44:08 +01:00
sof_intel.h ASoC: SOF: Intel: replace dev_vdbg with tracepoints 2022-09-19 15:44:06 +01:00
spi.h spi: Enable tracing of the SPI setup CS selection 2021-05-26 21:22:13 +01:00
spmi.h spmi: trace: fix stack-out-of-bound access in SPMI tracing functions 2022-07-24 16:16:44 +02:00
sunrpc.h SUNRPC: Make the svc_authenticate tracepoint conditional 2022-12-10 11:01:13 -05:00
sunvnet.h
swiotlb.h swiotlb: make the swiotlb_init interface more useful 2022-04-18 07:21:11 +02:00
syscalls.h
target.h
task.h
tcp.h tcp: Add tracepoint for tcp_set_ca_state 2022-04-07 20:33:15 -07:00
tegra_apb_dma.h
thermal.h drivers/thermal/cpufreq_cooling : Refactor thermal_power_cpu_get_power tracing 2022-07-28 17:29:42 +02:00
thermal_power_allocator.h
thermal_pressure.h arch_topology: Trace the update thermal pressure 2022-05-06 09:57:38 +02:00
thp.h mm/migration: add trace events for THP migrations 2022-03-24 19:06:45 -07:00
timer.h tracing/timer: Add missing argument documentation of trace points 2022-04-14 16:14:49 +02:00
tlb.h
udp.h
ufs.h scsi: ufs: core: Enable power management for wlun 2021-05-10 22:28:20 -04:00
v4l2.h
vb2.h
vmalloc.h mm: vmalloc: add free_vmap_area_noflush trace event 2022-11-08 17:37:17 -08:00
vmscan.h tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate 2022-05-19 14:08:55 -07:00
vsock_virtio_transport_common.h virtio/vsock: update trace event for SEQPACKET 2021-06-11 13:32:47 -07:00
watchdog.h watchdog: Add tracing events for the most usual watchdog events 2022-10-12 09:47:02 +02:00
wbt.h
workqueue.h workqueue: Fix type of cpu in trace event 2022-06-07 07:09:47 -10:00
writeback.h remove congestion tracking framework 2022-03-22 15:57:01 -07:00
xdp.h xdp: Extend xdp_redirect_map with broadcast support 2021-05-26 09:46:16 +02:00
xen.h