linux-stable/net/dccp
Eric Dumazet 19757cebf0 tcp: switch orphan_count to bare per-cpu counters
Use of percpu_counter structure to track count of orphaned
sockets is causing problems on modern hosts with 256 cpus
or more.

Stefan Bach reported a serious spinlock contention in real workloads,
that I was able to reproduce with a netfilter rule dropping
incoming FIN packets.

    53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
                --53.51%--_raw_spin_lock_irqsave
                          |
                           --53.51%--__percpu_counter_sum
                                     tcp_check_oom
                                     |
                                     |--39.03%--__tcp_close
                                     |          tcp_close
                                     |          inet_release
                                     |          inet6_release
                                     |          sock_close
                                     |          __fput
                                     |          ____fput
                                     |          task_work_run
                                     |          exit_to_usermode_loop
                                     |          do_syscall_64
                                     |          entry_SYSCALL_64_after_hwframe
                                     |          __GI___libc_close
                                     |
                                      --14.48%--tcp_out_of_resources
                                                tcp_write_timeout
                                                tcp_retransmit_timer
                                                tcp_write_timer_handler
                                                tcp_write_timer
                                                call_timer_fn
                                                expire_timers
                                                __run_timers
                                                run_timer_softirq
                                                __softirqentry_text_start

As explained in commit cf86a086a1 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), default batch size is too big
for the default value of tcp_max_orphans (262144).

But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().

One solution is to use plain per-cpu counters, and have
a timer to periodically refresh this cache.

Updating this cache every 100ms seems about right, tcp pressure
state is not radically changing over shorter periods.

percpu_counter was nice 15 years ago while hosts had less
than 16 cpus, not anymore by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
    reported by kernel test robot <lkp@intel.com>
    Remove unused socket argument from tcp_too_many_orphans()

Fixes: dd24c00191 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15 11:28:34 +01:00
..
ccids dccp: tfrc: fix doc warnings in tfrc_equation.c 2021-06-10 14:08:49 -07:00
Kconfig dccp: Replace HTTP links with HTTPS ones 2020-07-13 11:54:07 -07:00
Makefile net: dccp: Remove dccpprobe module 2018-01-02 14:27:30 -05:00
ackvec.c net: dccp: Fix most of the kerneldoc warnings 2020-10-30 12:08:54 -07:00
ackvec.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
ccid.c net: dccp: Add __printf() markup to fix -Wsuggest-attribute=format 2020-10-30 11:31:46 -07:00
ccid.h net: dccp: Replace zero-length array with flexible-array member 2020-02-28 12:08:37 -08:00
dccp.h tcp: switch orphan_count to bare per-cpu counters 2021-10-15 11:28:34 +01:00
diag.c inet_diag: Move the INET_DIAG_REQ_BYTECODE nlattr to cb->data 2020-02-27 18:50:19 -08:00
feat.c dccp: Return the correct errno code 2021-02-06 11:15:28 -08:00
feat.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
input.c net: dccp: Convert to use the preferred fallthrough macro 2020-08-22 12:38:34 -07:00
ipv4.c net: sock: introduce sk_error_report 2021-06-29 11:28:21 -07:00
ipv6.c net: sock: introduce sk_error_report 2021-06-29 11:28:21 -07:00
ipv6.h treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 2019-06-19 17:09:55 +02:00
minisocks.c dccp: don't duplicate ccid when cloning dccp sock 2021-09-08 11:28:35 +01:00
options.c net: dccp: Convert to use the preferred fallthrough macro 2020-08-22 12:38:34 -07:00
output.c net: dccp: Fix most of the kerneldoc warnings 2020-10-30 12:08:54 -07:00
proto.c tcp: switch orphan_count to bare per-cpu counters 2021-10-15 11:28:34 +01:00
qpolicy.c net: dccp: Fix most of the kerneldoc warnings 2020-10-30 12:08:54 -07:00
sysctl.c proc/sysctl: add shared variables for range check 2019-07-18 17:08:07 -07:00
timer.c net: sock: introduce sk_error_report 2021-06-29 11:28:21 -07:00
trace.h net: dccp: Add DCCP sendmsg trace event 2018-01-02 14:27:30 -05:00