linux-stable/net
Neal Cardwell 419b2c5766 tcp: fix delayed ACKs for MSS boundary condition
[ Upstream commit 4720852ed9 ]

This commit fixes poor delayed ACK behavior that can cause poor TCP
latency in a particular boundary condition: when an application makes
a TCP socket write that is an exact multiple of the MSS size.

The problem is that there is painful boundary discontinuity in the
current delayed ACK behavior. With the current delayed ACK behavior,
we have:

(1) If an app reads data when > 1*MSS is unacknowledged, then
    tcp_cleanup_rbuf() ACKs immediately because of:

     tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||

(2) If an app reads all received data, and the packets were < 1*MSS,
    and either (a) the app is not ping-pong or (b) we received two
    packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately beecause
    of:

     ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
       !inet_csk_in_pingpong_mode(sk))) &&

(3) *However*: if an app reads exactly 1*MSS of data,
    tcp_cleanup_rbuf() does not send an immediate ACK. This is true
    even if the app is not ping-pong and the 1*MSS of data had the PSH
    bit set, suggesting the sending application completed an
    application write.

Thus if the app is not ping-pong, we have this painful case where
>1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
write whose last skb is an exact multiple of 1*MSS can get a 40ms
delayed ACK. This means that any app that transfers data in one
direction and takes care to align write size or packet size with MSS
can suffer this problem. With receive zero copy making 4KB MSS values
more common, it is becoming more common to have application writes
naturally align with MSS, and more applications are likely to
encounter this delayed ACK problem.

The fix in this commit is to refine the delayed ACK heuristics with a
simple check: immediately ACK a received 1*MSS skb with PSH bit set if
the app reads all data. Why? If an skb has a len of exactly 1*MSS and
has the PSH bit set then it is likely the end of an application
write. So more data may not be arriving soon, and yet the data sender
may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
an ACK immediately if the app reads all of the data and is not
ping-pong. Note that this logic is also executed for the case where
len > MSS, but in that case this logic does not matter (and does not
hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
app reads data and there is more than an MSS of unACKed data.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Guo <guoxin0309@gmail.com>
Link: https://lore.kernel.org/r/20231001151239.1866845-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-10 22:00:43 +02:00
..
6lowpan
9p 9p: virtio: make sure 'offs' is initialized in zc_request 2023-09-13 09:42:21 +02:00
802
8021q vlan: fix a potential uninit-value in vlan_dev_hard_start_xmit() 2023-05-24 17:32:47 +01:00
appletalk
atm atm: hide unused procfs functions 2023-06-09 10:34:16 +02:00
ax25
batman-adv batman-adv: Hold rtnl lock during MTU update via netlink 2023-08-30 16:11:08 +02:00
bluetooth Bluetooth: ISO: Fix handling of listen for unicast 2023-10-10 22:00:40 +02:00
bpf
bpfilter
bridge neighbour: fix data-races around n->output 2023-10-10 22:00:42 +02:00
caif
can can: raw: add missing refcount for memory leak fix 2023-08-30 16:11:11 +02:00
ceph libceph: fix potential hang in ceph_osdc_notify() 2023-08-11 12:08:19 +02:00
core neighbour: fix data-races around n->output 2023-10-10 22:00:42 +02:00
dcb net: dcb: choose correct policy to parse DCB_ATTR_BCN 2023-08-11 12:08:17 +02:00
dccp dccp: fix dccp_v4_err()/dccp_v6_err() again 2023-10-06 14:56:39 +02:00
devlink devlink: remove reload failed checks in params get/set callbacks 2023-09-23 11:11:01 +02:00
dns_resolver
dsa net: dsa: sja1105: always enable the send_meta options 2023-07-19 16:22:06 +02:00
ethernet
ethtool ipv6: Remove in6addr_any alternatives. 2023-09-19 12:28:10 +02:00
hsr net: hsr: Add __packed to struct hsr_sup_tlv. 2023-10-06 14:56:57 +02:00
ieee802154
ife
ipv4 tcp: fix delayed ACKs for MSS boundary condition 2023-10-10 22:00:43 +02:00
ipv6 ipv6: tcp: add a missing nf_reset_ct() in 3WHS handling 2023-10-10 22:00:42 +02:00
iucv
kcm kcm: Fix error handling for SOCK_DGRAM in kcm_sendmsg(). 2023-09-19 12:28:10 +02:00
key net: af_key: fix sadb_x_filter validation 2023-08-23 17:52:32 +02:00
l2tp ipv4, ipv6: Fix handling of transhdrlen in __ip{,6}_append_data() 2023-10-10 22:00:42 +02:00
l3mdev
lapb
llc llc: Don't drop packet from non-root netns. 2023-07-27 08:50:45 +02:00
mac80211 wifi: mac80211: fix potential key use-after-free 2023-10-10 22:00:41 +02:00
mac802154
mctp
mpls
mptcp mptcp: userspace pm allow creating id 0 subflow 2023-10-10 22:00:38 +02:00
ncsi ncsi: Propagate carrier gain/loss events to the NCSI controller 2023-10-06 14:56:57 +02:00
netfilter netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure 2023-10-10 22:00:43 +02:00
netlabel netlabel: fix shift wrapping bug in netlbl_catmap_setlong() 2023-09-13 09:42:24 +02:00
netlink netlink: convert nlk->flags to atomic flags 2023-09-23 11:11:02 +02:00
netrom netrom: Deny concurrent connect(). 2023-09-13 09:42:35 +02:00
nfc net: nfc: llcp: Add lock when modifying device list 2023-10-10 22:00:42 +02:00
nsh net: nsh: Use correct mac_offset to unwind gso skb in nsh_gso_segment() 2023-05-24 17:32:45 +01:00
openvswitch net: openvswitch: reject negative ifindex 2023-08-23 17:52:35 +02:00
packet net/packet: annotate data-races around tp->status 2023-08-16 18:27:26 +02:00
phonet
psample
qrtr
rds net: replace calls to sock->ops->connect() with kernel_connect() 2023-10-10 22:00:38 +02:00
rfkill
rose
rxrpc
sched net/sched: Retire rsvp classifier 2023-09-23 11:11:13 +02:00
sctp sctp: annotate data-races around sk->sk_wmem_queued 2023-09-19 12:28:00 +02:00
smc net/smc: bugfix for smcr v2 server connect success statistic 2023-10-06 14:56:53 +02:00
strparser
sunrpc Revert "SUNRPC dont update timeout value on connection reset" 2023-10-06 14:57:03 +02:00
switchdev
tipc tipc: fix a potential deadlock on &tx->lock 2023-10-10 22:00:43 +02:00
tls net/tls: do not free tls_rec on async operation in bpf_exec_tx_verdict() 2023-09-19 12:28:09 +02:00
unix af_unix: Fix data-race around unix_tot_inflight. 2023-09-19 12:28:02 +02:00
vmw_vsock
wireless wifi: cfg80211: fix cqm_config access race 2023-10-10 22:00:40 +02:00
x25
xdp xsk: Fix xsk_diag use-after-free error during socket cleanup 2023-09-19 12:28:01 +02:00
xfrm xfrm: add forgotten nla_policy for XFRMA_MTIMER_THRESH 2023-08-23 17:52:32 +02:00
compat.c
devres.c
Kconfig
Kconfig.debug
Makefile devlink: move code to a dedicated directory 2023-08-30 16:11:00 +02:00
socket.c net: prevent rewrite of msg_name in sock_sendmsg() 2023-10-10 22:00:39 +02:00
sysctl_net.c