linux-stable/Documentation/networking
mfreemon@cloudflare.com 0796c53424 tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
[ Upstream commit b650d953cd ]

Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session.  A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time.  In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-19 23:08:54 +02:00
..
caif tty: cumulate and document tty_struct::flow* members 2021-05-13 16:57:16 +02:00
device_drivers docs: networking: device drivers: flexcan: fix invalid email 2022-09-06 08:32:12 +02:00
devlink Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-09-01 12:58:02 -07:00
dsa docs: net: dsa: update information about multiple CPU ports 2022-09-20 10:32:36 +02:00
mac80211_hwsim
6lowpan.rst docs: networking: convert 6lowpan.txt to ReST 2020-02-28 14:52:36 +01:00
6pack.rst docs: networking: convert 6pack.txt to ReST 2020-04-28 14:38:38 -07:00
af_xdp.rst xsk: Honor SO_BINDTODEVICE on bind 2023-07-19 16:22:05 +02:00
alias.rst
arcnet-hardware.rst docs: networking: arcnet-hardware.rst: don't duplicate chapter names 2020-05-01 12:24:43 -07:00
arcnet.rst Documentation: networking: arcnet: drop doubled word 2020-07-04 17:46:21 -07:00
atm.rst docs: networking: convert atm.txt to ReST 2020-04-28 14:38:38 -07:00
ax25.rst Documentation: networking: ax25: drop doubled word 2020-07-04 17:46:21 -07:00
bareudp.rst Documentation: bareudp: Corrected description of bareudp module. 2020-07-28 17:53:03 -07:00
batman-adv.rst batman-adv: Move IRC channel to hackint.org 2021-08-08 20:05:46 +02:00
bonding.rst Documentation: bonding: clarify supported modes for tlb_dynamic_lb 2022-08-30 23:17:54 -07:00
bridge.rst
can.rst can: Break loopback loop on loopback documentation 2022-06-11 22:40:13 +02:00
can_ucan_protocol.rst Documentation: networking: can_ucan_protocol: drop doubled words 2020-07-04 17:46:21 -07:00
cdc_mbim.rst docs: networking: convert cdc_mbim.txt to ReST 2020-04-28 14:38:39 -07:00
checksum-offloads.rst docs: networking: convert netdev-features.txt to ReST 2020-04-30 12:56:36 -07:00
dccp.rst net: dccp: Add SIOCOUTQ IOCTL support (send buffer fill) 2020-07-22 17:00:37 -07:00
dctcp.rst docs: networking: convert dctcp.txt to ReST 2020-04-28 14:38:39 -07:00
dns_resolver.rst docs: networking: convert dns_resolver.txt to ReST 2020-04-28 14:39:46 -07:00
driver.rst Documentation: networking: correct possessive "its" 2022-08-31 12:36:08 -07:00
eql.rst docs: networking: convert eql.txt to ReST 2020-04-28 14:39:46 -07:00
ethtool-netlink.rst ethtool: add interface to interact with Ethernet Power Equipment 2022-10-03 17:33:57 -07:00
failover.rst
fib_trie.rst docs: networking: convert fib_trie.txt to ReST 2020-04-28 14:39:46 -07:00
filter.rst treewide: use get_random_u32() when possible 2022-10-11 17:42:58 -06:00
gen_stats.rst docs: networking: convert gen_stats.txt to ReST 2020-04-28 14:39:46 -07:00
generic-hdlc.rst docs: networking: convert generic-hdlc.txt to ReST 2020-04-28 14:39:46 -07:00
generic_netlink.rst Documentation: networking: Update generic_netlink_howto URL 2022-11-23 17:25:02 -08:00
gtp.rst docs: networking: convert gtp.txt to ReST 2020-04-28 14:39:46 -07:00
ieee802154.rst docs: net: ieee802154.rst: fix C expressions 2020-10-15 07:49:41 +02:00
ila.rst docs: networking: convert ila.txt to ReST 2020-04-28 14:39:47 -07:00
index.rst docs: net: add an explanation of VF (and other) Representors 2022-09-21 07:31:38 -07:00
ioam6-sysctl.rst ipv6: ioam: Documentation for new IOAM sysctls 2021-07-21 08:14:33 -07:00
ip-sysctl.rst tcp: enforce receive buffer memory limits by allowing the tcp window to shrink 2023-10-19 23:08:54 +02:00
ip_dynaddr.rst docs: networking: convert ip_dynaddr.txt to ReST 2020-04-28 14:39:47 -07:00
ipddp.rst docs: networking: convert ipddp.txt to ReST 2020-04-28 14:39:47 -07:00
ipsec.rst docs: networking: convert ipsec.txt to ReST 2020-04-28 14:39:47 -07:00
ipv6.rst docs: networking: convert ipv6.txt to ReST 2020-04-28 14:40:18 -07:00
ipvlan.rst Documentation: networking: correct possessive "its" 2022-08-31 12:36:08 -07:00
ipvs-sysctl.rst netfilter: ipvs: Fix reuse connection if RS weight is 0 2021-11-08 11:42:47 +01:00
j1939.rst can: j1939: add tables for the CAN identifier and its fields 2020-11-20 09:43:29 +01:00
kapi.rst wimax: move out to staging 2020-10-29 19:27:45 +01:00
kcm.rst docs: networking: convert kcm.txt to ReST 2020-04-28 14:40:19 -07:00
l2tp.rst Documentation: networking: correct possessive "its" 2022-08-31 12:36:08 -07:00
lapb-module.rst docs: networking: convert lapb-module.txt to ReST 2020-04-30 12:56:35 -07:00
mac80211-auth-assoc-deauth.txt
mac80211-injection.rst doc: networking: wireless: fix wiki website url 2020-06-08 10:05:53 +02:00
mctp.rst mctp: Add SIOCMCTP{ALLOC,DROP}TAG ioctls for tag control 2022-02-09 12:00:11 +00:00
mpls-sysctl.rst docs: networking: convert mpls-sysctl.txt to ReST 2020-04-30 12:56:36 -07:00
mptcp-sysctl.rst Documentation: mptcp: fix pm_type formatting 2022-09-13 10:18:44 +02:00
msg_zerocopy.rst docs: use the lore redirector everywhere 2021-10-12 13:58:19 -06:00
multiqueue.rst docs: networking: convert multiqueue.txt to ReST 2020-04-30 12:56:36 -07:00
net_dim.rst docs: networking: add full DIM API 2020-04-10 18:11:04 -07:00
net_failover.rst Documentation: networking: net_failover: Fix documentation 2021-11-17 13:59:49 +00:00
netconsole.rst docs: networking: convert netconsole.txt to ReST 2020-04-30 12:56:36 -07:00
netdev-features.rst net: hsr: add offloading support 2021-02-11 13:24:44 -08:00
netdevices.rst net: bonding: move ioctl handling to private ndo operation 2021-07-27 20:11:45 +01:00
netfilter-sysctl.rst docs: networking: convert netfilter-sysctl.txt to ReST 2020-04-30 12:56:36 -07:00
netif-msg.rst docs: networking: convert netif-msg.txt to ReST 2020-04-30 12:56:36 -07:00
nexthop-group-resilient.rst Documentation: net: Document resilient next-hop groups 2021-03-29 13:51:38 -07:00
nf_conntrack-sysctl.rst netfilter: conntrack: remove nf_conntrack_helper documentation 2022-09-20 23:50:03 +02:00
nf_flowtable.rst docs: nf_flowtable: fix compilation and warnings 2021-03-25 17:42:02 -07:00
nfc.rst
openvswitch.rst docs: networking: convert openvswitch.txt to ReST 2020-04-30 12:56:36 -07:00
operstates.rst docs: operstates: document IF_OPER_TESTING 2021-08-02 15:16:04 +01:00
packet_mmap.rst docs: networking: Replace strncpy() with strscpy() 2021-06-04 11:21:43 -06:00
page_pool.rst Documentation: update networking/page_pool.rst 2022-03-03 09:55:28 +00:00
phonet.rst docs: networking: convert phonet.txt to ReST 2020-04-30 12:56:37 -07:00
phy.rst docs: networking: phy: add missing space 2022-10-05 20:32:39 -07:00
pktgen.rst pktgen: document the latest pktgen usage options 2021-08-25 13:44:30 +01:00
plip.rst docs: networking: convert PLIP.txt to ReST 2020-04-30 12:56:37 -07:00
ppp_generic.rst docs: update ppp_generic.rst to document new ioctls 2020-12-10 13:57:36 -08:00
proc_net_tcp.rst docs: networking: convert proc_net_tcp.txt to ReST 2020-04-30 12:56:37 -07:00
radiotap-headers.rst docs: networking: convert radiotap-headers.txt to ReST 2020-04-30 12:56:37 -07:00
rds.rst Doc: networking: Fix the title's Sphinx overline in rds.rst 2021-11-29 15:18:21 -07:00
regulatory.rst doc: networking: wireless: fix wiki website url 2020-06-08 10:05:53 +02:00
representors.rst docs: net: add an explanation of VF (and other) Representors 2022-09-21 07:31:38 -07:00
rxrpc.rst rxrpc: Remove rxrpc_get_reply_time() which is no longer used 2022-09-01 11:44:13 +01:00
scaling.rst docs: networking: update XPS to account for netif_set_xps_queue 2020-10-13 16:21:54 -07:00
sctp.rst docs: networking: convert sctp.txt to ReST 2020-04-30 12:56:38 -07:00
secid.rst docs: networking: convert secid.txt to ReST 2020-04-30 12:56:38 -07:00
seg6-sysctl.rst doc: move seg6_flowlabel to seg6-sysctl.rst 2021-04-14 13:13:15 -07:00
segmentation-offloads.rst
sfp-phylink.rst doc: sfp-phylink: Fix a broken reference 2022-08-02 21:45:07 -07:00
skbuff.rst skbuff: render the checksum comment to documentation 2022-05-10 17:48:37 -07:00
smc-sysctl.rst net/smc: Unbind r/w buffer size from clcsock and make them tunable 2022-09-22 12:58:21 +02:00
snmp_counter.rst net-next: docs: Fix typos in snmp_counter.rst 2021-01-05 17:07:38 -08:00
statistics.rst docs: networking: extend the statistics documentation 2021-04-16 16:59:20 -07:00
strparser.rst docs: networking: convert strparser.txt to ReST 2020-04-30 12:56:38 -07:00
switchdev.rst docs: net: add an explanation of VF (and other) Representors 2022-09-21 07:31:38 -07:00
sysfs-tagging.rst Documentation: better locations for sysfs-pci, sysfs-tagging 2020-10-09 09:33:23 -06:00
tc-actions-env-rules.rst docs: networking: convert tc-actions-env-rules.txt to ReST 2020-04-30 12:56:38 -07:00
tcp-thin.rst docs: networking: convert tcp-thin.txt to ReST 2020-04-30 12:56:38 -07:00
team.rst docs: networking: convert team.txt to ReST 2020-04-30 12:56:38 -07:00
timestamping.rst docs: networking: Use netif_rx(). 2022-03-04 12:02:19 +00:00
tipc.rst Documentation: add more details in tipc.rst 2021-07-01 13:18:18 -07:00
tls-offload-layers.svg
tls-offload-reorder-bad.svg
tls-offload-reorder-good.svg
tls-offload.rst net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled 2021-01-19 15:58:05 -08:00
tls.rst tls: rx: add counter for NoPad violations 2022-07-11 19:48:33 -07:00
tproxy.rst docs: networking: convert tproxy.txt to ReST 2020-04-30 12:56:38 -07:00
tuntap.rst docs: networking: Replace strncpy() with strscpy() 2021-06-04 11:21:43 -06:00
udplite.rst docs: networking: convert udplite.txt to ReST 2020-05-01 12:24:40 -07:00
vrf.rst doc: Document unexpected tcp_l3mdev_accept=1 behavior 2021-08-23 11:53:24 +01:00
vxlan.rst docs: vxlan: add info about device features 2020-09-28 12:50:12 -07:00
x25-iface.rst net: x25: Queue received packets in the drivers instead of per-CPU queues 2021-04-05 11:42:12 -07:00
x25.rst net: x25: Remove unimplemented X.25-over-LLC code stubs 2020-12-12 17:15:33 -08:00
xfrm_device.rst docs: networking: Fix a typo 2021-03-20 19:02:42 -07:00
xfrm_proc.rst docs: networking: convert xfrm_proc.txt to ReST 2020-05-01 12:24:40 -07:00
xfrm_sync.rst docs: networking: convert xfrm_sync.txt to ReST 2020-05-01 12:24:41 -07:00
xfrm_sysctl.rst docs: networking: convert xfrm_sysctl.txt to ReST 2020-05-01 12:24:41 -07:00