linux-stable/Documentation
mfreemon@cloudflare.com 0796c53424 tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
[ Upstream commit b650d953cd ]

Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session.  A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time.  In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-19 23:08:54 +02:00
..
ABI platform/chrome: chromeos_acpi: print hex string for ACPI_TYPE_BUFFER 2023-09-13 09:43:03 +02:00
accounting sched/psi: Allow unprivileged polling of N*2s period 2023-07-27 08:50:38 +02:00
admin-guide mm, memcg: reconsider kmem.limit_in_bytes deprecation 2023-10-06 14:57:06 +02:00
arc
arm EFI updates for v6.1 2022-10-09 08:56:54 -07:00
arm64 arm64: errata: Add Cortex-A520 speculative unprivileged load workaround 2023-10-10 22:00:39 +02:00
block blk-crypto: don't use struct request_queue for public interfaces 2023-05-11 23:03:00 +09:00
bpf bpf, docs: Fix modulo zero, division by zero, overflow, and underflow 2023-03-10 09:33:50 +01:00
cdrom
core-api overflow: Fix kern-doc markup for functions 2022-10-25 14:57:42 -07:00
cpu-freq
crypto
dev-tools docs/scripts/gdb: add necessary make scripts_gdb step 2023-03-10 09:33:58 +01:00
devicetree dt-bindings: interrupt-controller: renesas,rzg2l-irqc: Update description for '#interrupt-cells' property 2023-10-19 23:08:50 +02:00
doc-guide Rust introduction for v6.1-rc1 2022-10-03 16:39:37 -07:00
driver-api spi: Update reference to struct spi_controller 2022-12-31 13:32:06 +01:00
fault-injection lkdtm: replace ll_rw_block with submit_bh 2023-07-19 16:21:53 +02:00
fb
features
filesystems fs: Lock moved directories 2023-07-19 16:22:12 +02:00
firmware-guide Merge branches 'acpi-misc', 'acpi-tools' and 'acpi-docs' 2022-10-03 20:03:49 +02:00
firmware_class
fpga
gpu drm/vmwgfx: Remove vmwgfx_hashtab 2023-01-18 11:58:28 +01:00
hid
hwmon hwmon: (ftsteutates) Fix scaling of measurements 2023-03-10 09:33:10 +01:00
i2c
ia64
iio
images
infiniband
input Merge branch 'next' into for-linus 2022-10-09 22:30:23 -07:00
isdn
kbuild Documentation: kbuild: Add description of git for reproducible builds 2022-10-28 00:16:29 +09:00
kernel-hacking Documentation: Fix spelling mistake in hacking.rst 2022-10-24 11:27:51 -06:00
leds
litmus-tests
livepatch
locking
loongarch docs/LoongArch: Add booting description 2022-12-08 14:59:15 +08:00
m68k
maintainer
mhi
mips
misc-devices
mm mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[] 2023-09-19 12:27:54 +02:00
netlabel
networking tcp: enforce receive buffer memory limits by allowing the tcp window to shrink 2023-10-19 23:08:54 +02:00
nios2
nvdimm
openrisc
parisc
PCI
pcmcia
peci
power
powerpc
process docs: Set minimal gtags / GNU GLOBAL version to 6.6.5 2023-07-05 18:27:38 +01:00
RCU There's not a huge amount of activity in the docs tree this time around, 2022-10-03 10:23:32 -07:00
riscv riscv: Move early dtb mapping into the fixmap region 2023-05-01 08:26:28 +09:00
rust
s390 vfio/mdev: embedd struct mdev_parent in the parent data structure 2022-10-04 12:06:58 -06:00
scheduler
scsi scsi: core: Fix the scsi_set_resid() documentation 2023-09-13 09:43:00 +02:00
security KEYS: encrypted: fix key instantiation with user-provided data 2022-12-21 17:48:11 +01:00
sh
sound ALSA: hda/sigmatel: add pin overrides for Intel DP45SG motherboard 2023-04-20 12:35:05 +02:00
sparc
sphinx docs: Fix the docs build with Sphinx 6.0 2023-01-18 11:58:10 +01:00
sphinx-static
spi
staging
target
timers
tools A handful of relatively simple documentation fixes, plus a set of patches 2022-10-13 10:58:32 -07:00
trace tracing/probes: Add symstr type for dynamic events 2023-08-03 10:23:54 +02:00
translations docs/zh_CN: Add LoongArch booting description's translation 2022-12-08 15:03:14 +08:00
usb
userspace-api media: uapi: HEVC: Add num_delta_pocs_of_ref_rps_idx field 2023-09-13 09:42:20 +02:00
virt KVM: s390: disable migration mode when dirty tracking is disabled 2023-03-10 09:34:05 +01:00
w1
watchdog
x86 x86/sev: Add SEV-SNP guest feature negotiation support 2023-02-01 08:34:50 +01:00
xtensa
.gitignore
arch.rst
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py There's not a huge amount of activity in the docs tree this time around, 2022-10-03 10:23:32 -07:00
docutils.conf
dontdiff
index.rst Rust introduction for v6.1-rc1 2022-10-03 16:39:37 -07:00
Kconfig
Makefile
memory-barriers.txt
SubmittingPatches
subsystem-apis.rst