linux-stable/net/core
Jakub Kicinski 6025b9135f net: dqs: add NIC stall detector based on BQL
softnet_data->time_squeeze is sometimes used as a proxy for
host overload or indication of scheduling problems. In practice
this statistic is very noisy and has hard to grasp units -
e.g. is 10 squeezes a second to be expected, or high?

Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.

We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.

This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).

The algorithm takes one parameter - max delay AKA stall
threshold and increments a counter whenever NAPI got delayed
for at least that amount of time. It also records the length
of the longest stall.

To be precise every time NAPI has not polled for at least
stall thrs we check if there were any Tx packets queued
between last NAPI run and now - stall_thrs/2.

Unlike the classic Tx watchdog this mechanism does not
ignore stalls caused by Tx being disabled, or loss of link.
I don't think the check is worth the complexity, and
stall is a stall, whether due to host overload, flow
control, link down... doesn't matter much to the application.

We have been running this detector in production at Meta
for 2 years, with the threshold of 8ms. It's the lowest
value where false positives become rare. There's still
a constant stream of reported stalls (especially without
the ksoftirqd deferral patches reverted), those who like
their stall metrics to be 0 may prefer higher value.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-08 10:23:26 +00:00
..
Makefile net: introduce struct net_hotdata 2024-03-07 21:12:41 -08:00
bpf_sk_storage.c net: Namespace-ify sysctl_optmem_max 2023-12-15 11:01:27 +00:00
datagram.c net: Fix from address in memcpy_to_iter_csum() 2024-02-02 12:21:02 +00:00
dev.c net: move rps_sock_flow_table to net_hotdata 2024-03-07 21:12:43 -08:00
dev.h net: move netdev_tstamp_prequeue into net_hotdata 2024-03-07 21:12:41 -08:00
dev_addr_lists.c
dev_addr_lists_test.c net: fill in MODULE_DESCRIPTION()s under net/core 2023-10-28 11:29:27 +01:00
dev_ioctl.c net: partial revert of the "Make timestamping selectable: series 2023-11-18 18:42:37 -08:00
drop_monitor.c genetlink: Use internal flags for multicast groups 2023-12-29 08:43:59 +00:00
dst.c net: dst: Make dst_destroy() static and return void. 2024-02-06 11:45:53 +01:00
dst_cache.c
failover.c net: failover: use IFF_NO_ADDRCONF flag to prevent ipv6 addrconf 2022-12-12 15:18:25 -08:00
fib_notifier.c
fib_rules.c fib: rules: remove repeated assignment in fib_nl2rule 2024-01-07 15:16:19 +00:00
filter.c bpf-next-for-netdev 2024-03-02 20:50:59 -08:00
flow_dissector.c net/core: Fix ETH_P_1588 flow dissector 2023-09-15 10:40:04 +01:00
flow_offload.c tc: flower: Enable offload support IPSEC SPI field. 2023-08-02 10:09:32 +01:00
gen_estimator.c treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
gen_stats.c net: Remove the obsolte u64_stats_fetch_*_irq() users (net). 2022-10-28 20:13:54 -07:00
gro.c net: introduce struct net_hotdata 2024-03-07 21:12:41 -08:00
gro_cells.c net: move netdev_max_backlog to net_hotdata 2024-03-07 21:12:42 -08:00
gso.c net: introduce struct net_hotdata 2024-03-07 21:12:41 -08:00
gso_test.c net: gso_test: support CONFIG_MAX_SKB_FRAGS up to 45 2023-11-13 11:04:38 +00:00
hotdata.c net: move dev_rx_weight to net_hotdata 2024-03-07 21:12:42 -08:00
hwbm.c
link_watch.c net: add netdev_set_operstate() helper 2024-02-14 11:20:13 +00:00
lwt_bpf.c lwt: Fix return values of BPF xmit ops 2023-08-18 16:05:26 +02:00
lwtunnel.c xfrm: lwtunnel: squelch kernel warning in case XFRM encap type is not available 2022-10-12 10:45:51 +02:00
neighbour.c neighbour: Don't let neigh_forced_gc() disable preemption for long 2023-12-08 10:37:43 +00:00
net-procfs.c net: move ptype_all into net_hotdata 2024-03-07 21:12:41 -08:00
net-sysfs.c net: dqs: add NIC stall detector based on BQL 2024-03-08 10:23:26 +00:00
net-sysfs.h
net-traces.c udp6: add a missing call into udp_fail_queue_rcv_skb tracepoint 2023-07-07 09:16:52 +01:00
net_namespace.c net: use synchronize_rcu_expedited in cleanup_net() 2024-02-12 12:17:03 +00:00
netclassid_cgroup.c cgroup, netclassid: on modifying netclassid in cgroup, only consider the main process. 2023-10-16 16:36:53 -07:00
netdev-genl-gen.c netdev: add per-queue statistics 2024-03-07 21:13:25 -08:00
netdev-genl-gen.h netdev: add per-queue statistics 2024-03-07 21:13:25 -08:00
netdev-genl.c netdev: add queue stat for alloc failures 2024-03-07 21:13:26 -08:00
netevent.c
netpoll.c netpoll: allocate netdev tracker right away 2023-06-15 08:21:11 +01:00
netprio_cgroup.c
of_net.c net: Explicitly include correct DT includes 2023-07-27 20:33:16 -07:00
page_pool.c net: page_pool: fix recycle stats for system page_pool allocator 2024-02-19 12:30:27 -08:00
page_pool_priv.h net: page_pool: report when page pool was destroyed 2023-11-28 15:48:39 +01:00
page_pool_user.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2024-03-07 10:29:36 -08:00
pktgen.c net: pktgen: Use wait_event_freezable_timeout() for freezable kthread 2023-12-27 14:34:52 +00:00
ptp_classifier.c
request_sock.c tcp: make sure init the accept_queue's spinlocks once 2024-01-19 21:13:25 -08:00
rtnetlink.c netlink: let core handle error cases in dump operations 2024-03-07 20:48:22 -08:00
scm.c af_unix: Try to run GC async. 2024-01-26 20:34:25 -08:00
secure_seq.c tcp: Fix data-races around sysctl knobs related to SYN option. 2022-07-20 10:14:49 +01:00
selftests.c net: fill in MODULE_DESCRIPTION()s under net/core 2023-10-28 11:29:27 +01:00
skbuff.c net: move skbuff_cache(s) to net_hotdata 2024-03-07 21:12:42 -08:00
skmsg.c bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready() 2024-02-21 17:15:23 +01:00
sock.c sock: Use unsafe_memcpy() for sock_copy() 2024-03-05 18:35:12 -08:00
sock_destructor.h
sock_diag.c sock_diag: remove sock_diag_mutex 2024-01-23 15:13:55 +01:00
sock_map.c bpf: syzkaller found null ptr deref in unix_bpf proto add 2023-12-13 16:32:28 -08:00
sock_reuseport.c soreuseport: Fix socket selection for SO_INCOMING_CPU. 2022-10-25 11:35:16 +02:00
stream.c net: Return error from sk_stream_wait_connect() if sk_wait_event() fails 2023-12-15 10:48:51 +00:00
sysctl_net_core.c net: move rps_sock_flow_table to net_hotdata 2024-03-07 21:12:43 -08:00
timestamping.c net: partial revert of the "Make timestamping selectable: series 2023-11-18 18:42:37 -08:00
tso.c net: tso: inline tso_count_descs() 2022-12-12 15:04:39 -08:00
utils.c net: core: inet[46]_pton strlen len types 2022-11-01 21:14:39 -07:00
xdp.c net: move skbuff_cache(s) to net_hotdata 2024-03-07 21:12:42 -08:00