linux-stable/Documentation/networking
mfreemon@cloudflare.com 0796c53424 tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
[ Upstream commit b650d953cd ]

Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session.  A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time.  In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-19 23:08:54 +02:00
..
caif
device_drivers
devlink
dsa
mac80211_hwsim
6lowpan.rst
6pack.rst
af_xdp.rst xsk: Honor SO_BINDTODEVICE on bind 2023-07-19 16:22:05 +02:00
alias.rst
arcnet-hardware.rst
arcnet.rst
atm.rst
ax25.rst
bareudp.rst
batman-adv.rst
bonding.rst
bridge.rst
can.rst
can_ucan_protocol.rst
cdc_mbim.rst
checksum-offloads.rst
dccp.rst
dctcp.rst
dns_resolver.rst
driver.rst
eql.rst
ethtool-netlink.rst ethtool: add interface to interact with Ethernet Power Equipment 2022-10-03 17:33:57 -07:00
failover.rst
fib_trie.rst
filter.rst treewide: use get_random_u32() when possible 2022-10-11 17:42:58 -06:00
gen_stats.rst
generic-hdlc.rst
generic_netlink.rst Documentation: networking: Update generic_netlink_howto URL 2022-11-23 17:25:02 -08:00
gtp.rst
ieee802154.rst
ila.rst
index.rst
ioam6-sysctl.rst
ip-sysctl.rst tcp: enforce receive buffer memory limits by allowing the tcp window to shrink 2023-10-19 23:08:54 +02:00
ip_dynaddr.rst
ipddp.rst
ipsec.rst
ipv6.rst
ipvlan.rst
ipvs-sysctl.rst
j1939.rst
kapi.rst
kcm.rst
l2tp.rst
lapb-module.rst
mac80211-auth-assoc-deauth.txt
mac80211-injection.rst
mctp.rst
mpls-sysctl.rst
mptcp-sysctl.rst
msg_zerocopy.rst
multiqueue.rst
net_dim.rst
net_failover.rst
netconsole.rst
netdev-features.rst
netdevices.rst
netfilter-sysctl.rst
netif-msg.rst
nexthop-group-resilient.rst
nf_conntrack-sysctl.rst
nf_flowtable.rst
nfc.rst
openvswitch.rst
operstates.rst
packet_mmap.rst
page_pool.rst
phonet.rst
phy.rst docs: networking: phy: add missing space 2022-10-05 20:32:39 -07:00
pktgen.rst
plip.rst
ppp_generic.rst
proc_net_tcp.rst
radiotap-headers.rst
rds.rst
regulatory.rst
representors.rst
rxrpc.rst
scaling.rst
sctp.rst
secid.rst
seg6-sysctl.rst
segmentation-offloads.rst
sfp-phylink.rst
skbuff.rst
smc-sysctl.rst
snmp_counter.rst
statistics.rst
strparser.rst
switchdev.rst
sysfs-tagging.rst
tc-actions-env-rules.rst
tcp-thin.rst
team.rst
timestamping.rst
tipc.rst
tls-offload-layers.svg
tls-offload-reorder-bad.svg
tls-offload-reorder-good.svg
tls-offload.rst
tls.rst
tproxy.rst
tuntap.rst
udplite.rst
vrf.rst
vxlan.rst
x25-iface.rst
x25.rst
xfrm_device.rst
xfrm_proc.rst
xfrm_sync.rst
xfrm_sysctl.rst