Commit graph

7421 commits

Author SHA1 Message Date
Eric Dumazet
9580bf2edb net: relax expensive skb_unclone() in iptunnel_handle_offloads()
Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone()

Reallocating skb->head is a lot of work.

Test should really check if a 'real clone' of the packet was done.

TCP does not care if the original gso_type is changed while the packet
travels in the stack.

This adds skb_header_unclone() which is a variant of skb_clone()
using skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 00:22:19 -04:00
Tom Herbert
182a352d2d gre: Create common functions for transmit
Create common functions for both IPv4 and IPv6 GRE in transmit. These
are put into gre.h.

Common functions are for:
  - GRE checksum calculation. Move gre_checksum to gre.h.
  - Building a GRE header. Move GRE build_header and rename
    gre_build_header.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:31 -04:00
Tom Herbert
95f5c64c3c gre: Move utility functions to common headers
Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
for IPv6 GRE implementation (that is they are protocol agnostic).

These include:
  - GRE flag handling functions are move to gre.h
  - GRE build_header is moved to gre.h and renamed gre_build_header
  - parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
  - iptunnel_pull_header is taken out of gre_parse_header. This is now
    done by caller. The header length is returned from gre_parse_header
    in an int* argument.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 19:23:31 -04:00
Eric Dumazet
d41a69f1d3 tcp: make tcp_sendmsg() aware of socket backlog
Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:26 -04:00
Eric Dumazet
e61da9e259 udp: prepare for non BH masking at backlog processing
UDP uses the generic socket backlog code, and this will soon
be changed to not disable BH when protocol is called back.

We need to use appropriate SNMP accessors.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Eric Dumazet
fb3477c0f4 tcp: do not block bh during prequeue processing
AFAIK, nothing in current TCP stack absolutely wants BH
being disabled once socket is owned by a thread running in
process context.

As mentioned in my prior patch ("tcp: give prequeue mode some care"),
processing a batch of packets might take time, better not block BH
at all.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Eric Dumazet
c10d9310ed tcp: do not assume TCP code is non preemptible
We want to to make TCP stack preemptible, as draining prequeue
and backlog queues can take lot of time.

Many SNMP updates were assuming that BH (and preemption) was disabled.

Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()

Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Eric Dumazet
0cef6a4c34 tcp: give prequeue mode some care
TCP prequeue goal is to defer processing of incoming packets
to user space thread currently blocked in a recvmsg() system call.

Intent is to spend less time processing these packets on behalf
of softirq handler, as softirq handler is unfair to normal process
scheduler decisions, as it might interrupt threads that do not
even use networking.

Current prequeue implementation has following issues :

1) It only checks size of the prequeue against sk_rcvbuf

   It was fine 15 years ago when sk_rcvbuf was in the 64KB vicinity.
   But we now have ~8MB values to cope with modern networking needs.
   We have to add sk_rmem_alloc in the equation, since out of order
   packets can definitely use up to sk_rcvbuf memory themselves.

2) Even with a fixed memory truesize check, prequeue can be filled
   by thousands of packets. When prequeue needs to be flushed, either
   from sofirq context (in tcp_prequeue() or timer code), or process
   context (in tcp_prequeue_process()), this adds a latency spike
   which is often not desirable.
   I added a fixed limit of 32 packets, as this translated to a max
   flush time of 60 us on my test hosts.

   Also note that all packets in prequeue are not accounted for tcp_mem,
   since they are not charged against sk_forward_alloc at this point.
   This is probably not a big deal.

Note that this might increase LINUX_MIB_TCPPREQUEUEDROPPED counts,
which is misnamed, as packets are not dropped at all, but rather pushed
to the stack (where they can be either consumed or dropped)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:14:35 -04:00
Martin KaFai Lau
a166140e81 tcp: Handle eor bit when fragmenting a skb
When fragmenting a skb, the next_skb should carry
the eor from prev_skb.  The eor of prev_skb should
also be reset.

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 sendto(4, ..., 15330, MSG_EOR, ..., ...) = 15330
0.200 sendto(4, ..., 730, 0, ..., ...) = 730

0.200 > .  1:7301(7300) ack 1
0.200 > . 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:14:19 -04:00
Martin KaFai Lau
a643b5d41c tcp: Handle eor bit when coalescing skb
This patch:
1. Prevent next_skb from coalescing to the prev_skb if
   TCP_SKB_CB(prev_skb)->eor is set
2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is
   allowed

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 write(4, ..., 11680) = 11680

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop>
0.300 > P. 1:731(730) ack 1
0.300 > P. 731:1461(730) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:14:19 -04:00
Martin KaFai Lau
c134ecb878 tcp: Make use of MSG_EOR in tcp_sendmsg
This patch adds an eor bit to the TCP_SKB_CB.  When MSG_EOR
is passed to tcp_sendmsg, the eor bit will be set at the skb
containing the last byte of the userland's msg.  The eor bit
will prevent data from appending to that skb in the future.

The change in do_tcp_sendpages is to honor the eor set
during the previous tcp_sendmsg(MSG_EOR) call.

This patch handles the tcp_sendmsg case.  The followup patches
will handle other skb coalescing and fragment cases.

One potential use case is to use MSG_EOR with
SOF_TIMESTAMPING_TX_ACK to get a more accurate
TCP ack timestamping on application protocol with
multiple outgoing response messages (e.g. HTTP2).

Packetdrill script for testing:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 14600) = 14600
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730
0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730

0.200 > .  1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257
0.300 > P. 14601:15331(730) ack 1
0.300 > P. 15331:16061(730) ack 1

0.400 < . 1:1(0) ack 16061 win 257
0.400 close(4) = 0
0.400 > F. 16061:16061(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 16062:16062(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:14:18 -04:00
Soheil Hassas Yeganeh
0a2cf20c3f tcp: remove SKBTX_ACK_TSTAMP since it is redundant
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when
the timestamp of the TCP acknowledgement should be reported on
error queue. Since accessing skb_shinfo is likely to incur a
cache-line miss at the time of receiving the ack, the
txstamp_ack bit was added in tcp_skb_cb, which is set iff
the SKBTX_ACK_TSTAMP flag is set for an skb. This makes
SKBTX_ACK_TSTAMP flag redundant.

Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit
everywhere.

Note that this frees one bit in shinfo->tx_flags.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:06:10 -04:00
Soheil Hassas Yeganeh
863c1fd981 tcp: remove an unnecessary check in tcp_tx_timestamp
Remove the redundant check for sk->sk_tsflags in tcp_tx_timestamp.

tcp_tx_timestamp() receives the tsflags as a parameter. As a
result the "sk->sk_tsflags || tsflags" is redundant, since
tsflags already includes sk->sk_tsflags plus overrides from
control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:06:10 -04:00
Eric Dumazet
02a1d6e7a6 net: rename NET_{ADD|INC}_STATS_BH()
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet
b15084ec7d net: rename IP_UPD_PO_STATS_BH()
Rename IP_UPD_PO_STATS_BH() to __IP_UPD_PO_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet
98f619957e net: rename IP_ADD_STATS_BH()
Rename IP_ADD_STATS_BH() to __IP_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet
b45386efa2 net: rename IP_INC_STATS_BH()
Rename IP_INC_STATS_BH() to __IP_INC_STATS(), to
better express this is used in non preemptible context.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet
214d3f1f87 net: icmp: rename ICMPMSGIN_INC_STATS_BH()
Remove misleading _BH suffix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet
90bbcc6083 net: tcp: rename TCP_INC_STATS_BH
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet
02c223470c net: udp: rename UDP_INC_STATS_BH()
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(),
and UDP6_INC_STATS_BH() to __UDP6_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:23 -04:00
Eric Dumazet
5d3848bc33 net: rename ICMP_INC_STATS_BH()
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:22 -04:00
Eric Dumazet
6aef70a851 net: snmp: kill various STATS_USER() helpers
In the old days (before linux-3.0), SNMP counters were duplicated,
one for user context, and one for BH context.

After commit 8f0ea0fe3a ("snmp: reduce percpu needs by 50%")
we have a single copy, and what really matters is preemption being
enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc()
respectively.

We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(),
NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(),
SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(),
UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER()

Following patches will rename __BH helpers to make clear their
usage is not tied to BH being disabled.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:22 -04:00
David S. Miller
c0cc53162a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor overlapping changes in the conflicts.

In the macsec case, the change of the default ID macro
name overlapped with the 64-bit netlink attribute alignment
fixes in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 15:43:10 -04:00
Nicolas Dichtel
6ed46d1247 sock_diag: align nlattr properly when needed
I also fix the value of INET_DIAG_MAX. It's wrong since commit 8f840e47f1
which is only in net-next right now, thus I didn't make a separate patch.

Fixes: 8f840e47f1 ("sctp: add the sctp_diag.c file")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-26 12:00:48 -04:00
Eric Dumazet
960a26282f net: better drop monitoring in ip{6}_recv_error()
We should call consume_skb(skb) when skb is properly consumed,
or kfree_skb(skb) when skb must be dropped in error case.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:48:10 -04:00
Eric Dumazet
0aea76d35c tcp: SYN packets are now simply consumed
We now have proper per-listener but also per network namespace counters
for SYN packets that might be dropped.

We replace the kfree_skb() by consume_skb() to be drop monitor [1]
friendly, and remove an obsolete comment.
FastOpen SYN packets can carry payload in them just fine.

[1] perf record -a -g -e skb:kfree_skb sleep 1; perf report

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 15:48:10 -04:00
Craig Gallek
d296ba60d8 soreuseport: Resolve merge conflict for v4/v6 ordering fix
d894ba18d4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
was merged as a bug fix to the net tree.  Two conflicting changes
were committed to net-next before the above fix was merged back to
net-next:
ca065d0cf8 ("udp: no longer use SLAB_DESTROY_BY_RCU")
3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")

These changes switched the datastructure used for TCP and UDP sockets
from hlist_nulls to hlist.  This patch applies the necessary parts
of the net tree fix to net-next which were not automatic as part of the
merge.

Fixes: 1602f49b58 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-25 13:27:54 -04:00
Paolo Abeni
391a20333b ipv4/fib: don't warn when primary address is missing if in_dev is dead
After commit fbd40ea018 ("ipv4: Don't do expensive useless work
during inetdev destroy.") when deleting an interface,
fib_del_ifaddr() can be executed without any primary address
present on the dead interface.

The above is safe, but triggers some "bug: prim == NULL" warnings.

This commit avoids warning if the in_dev is dead

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 23:26:29 -04:00
Eric Dumazet
10d3be5692 tcp-tso: do not split TSO packets at retransmit time
Linux TCP stack painfully segments all TSO/GSO packets before retransmits.

This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.

Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
 - Less memory overhead, because write queues have less skbs
 - Less cpu overhead at ACK processing.
 - Better SACK processing, as lot of studies mentioned how
   awful linux was at this ;)
 - Less cpu overhead to send the rtx packets
   (IP stack traversal, netfilter traversal, drivers...)
 - Better latencies in presence of losses.
 - Smaller spikes in fq like packet schedulers, as retransmits
   are not constrained by TCP Small Queues.

1 % packet losses are common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to 1-MSS train of packets, at the time hosts are already
under stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 14:43:59 -04:00
Martin KaFai Lau
2de8023e7b tcp: Merge txstamp_ack in tcp_skb_collapse_tstamp
When collapsing skbs, txstamp_ack also needs to be merged.

Retrans Collapse Test:
~~~~~~
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
0.200 write(4, ..., 11680) = 11680

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 13141 win 257

BPF Output Before:
~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~
<...>-2027  [007] d.s.    79.765921: : ee_data:1459

Sacks Collapse Test:
~~~~~
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 1460) = 1460
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 13140) = 13140
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

BPF Output Before:
~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~
<...>-2049  [007] d.s.    89.185538: : ee_data:14599

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 14:06:43 -04:00
Martin KaFai Lau
b51e13faf7 tcp: Carry txstamp_ack in tcp_fragment_tstamp
When a tcp skb is sliced into two smaller skbs (e.g. in
tcp_fragment() and tso_fragment()),  it does not carry
the txstamp_ack bit to the newly created skb if it is needed.
The end result is a timestamping event (SCM_TSTAMP_ACK) will
be missing from the sk->sk_error_queue.

This patch carries this bit to the new skb2
in tcp_fragment_tstamp().

BPF Output Before:
~~~~~~
<No output due to missing SCM_TSTAMP_ACK timestamp>

BPF Output After:
~~~~~~
<...>-2050  [000] d.s.   100.928763: : ee_data:14599

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 14600) = 14600
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0

0.200 > . 1:7301(7300) ack 1
0.200 > P. 7301:14601(7300) ack 1

0.300 < . 1:1(0) ack 14601 win 257

0.300 close(4) = 0
0.300 > F. 14601:14601(0) ack 1
0.400 < F. 1:1(0) ack 16062 win 257
0.400 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 14:06:43 -04:00
David S. Miller
11afbff861 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree, mostly from Florian Westphal to sort out the lack of sufficient
validation in x_tables and connlabel preparation patches to add
nf_tables support. They are:

1) Ensure we don't go over the ruleset blob boundaries in
   mark_source_chains().

2) Validate that target jumps land on an existing xt_entry. This extra
   sanitization comes with a performance penalty when loading the ruleset.

3) Introduce xt_check_entry_offsets() and use it from {arp,ip,ip6}tables.

4) Get rid of the smallish check_entry() functions in {arp,ip,ip6}tables.

5) Make sure the minimal possible target size in x_tables.

6) Similar to #3, add xt_compat_check_entry_offsets() for compat code.

7) Check that standard target size is valid.

8) More sanitization to ensure that the target_offset field is correct.

9) Add xt_check_entry_match() to validate that matches are well-formed.

10-12) Three patch to reduce the number of parameters in
    translate_compat_table() for {arp,ip,ip6}tables by using a container
    structure.

13) No need to return value from xt_compat_match_from_user(), so make
    it void.

14) Consolidate translate_table() so it can be used by compat code too.

15) Remove obsolete check for compat code, so we keep consistent with
    what was already removed in the native layout code (back in 2007).

16) Get rid of target jump validation from mark_source_chains(),
    obsoleted by #2.

17) Introduce xt_copy_counters_from_user() to consolidate counter
    copying, and use it from {arp,ip,ip6}tables.

18,22) Get rid of unnecessary explicit inlining in ctnetlink for dump
    functions.

19) Move nf_connlabel_match() to xt_connlabel.

20) Skip event notification if connlabel did not change.

21) Update of nf_connlabels_get() to make the upcoming nft connlabel
    support easier.

23) Remove spinlock to read protocol state field in conntrack.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 00:12:08 -04:00
Nicolas Dichtel
2175d87cc3 libnl: nla_put_msecs(): align on a 64-bit area
nla_data() is now aligned on a 64-bit area.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 20:13:24 -04:00
Nicolas Dichtel
b46f6ded90 libnl: nla_put_be64(): align on a 64-bit area
nla_data() is now aligned on a 64-bit area.

A temporary version (nla_put_be64_32bit()) is added for nla_put_net64().
This function is removed in the next patch.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 20:13:24 -04:00
David S. Miller
1602f49b58 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were two cases of simple overlapping changes,
nothing serious.

In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 18:51:33 -04:00
Martin KaFai Lau
cfea5a688e tcp: Merge tx_flags and tskey in tcp_shifted_skb
After receiving sacks, tcp_shifted_skb() will collapse
skbs if possible.  tx_flags and tskey also have to be
merged.

This patch reuses the tcp_skb_collapse_tstamp() to handle
them.

BPF Output Before:
~~~~~
<no-output-due-to-missing-tstamp-event>

BPF Output After:
~~~~~
<...>-2024  [007] d.s.    88.644374: : ee_data:14599

Packetdrill Script:
~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 1460) = 1460
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 13140) = 13140

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

0.400 close(4) = 0
0.400 > F. 14601:14601(0) ack 1
0.500 < F. 1:1(0) ack 14602 win 257
0.500 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 14:40:55 -04:00
Martin KaFai Lau
082ac2d51d tcp: Merge tx_flags and tskey in tcp_collapse_retrans
If two skbs are merged/collapsed during retransmission, the current
logic does not merge the tx_flags and tskey.  The end result is
the SCM_TSTAMP_ACK timestamp could be missing for a packet.

The patch:
1. Merge the tx_flags
2. Overwrite the prev_skb's tskey with the next_skb's tskey

BPF Output Before:
~~~~~~
<no-output-due-to-missing-tstamp-event>

BPF Output After:
~~~~~~
packetdrill-2092  [001] d.s.   453.998486: : ee_data:1459

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 730) = 730
+0 setsockopt(4, SOL_SOCKET, 37, [2176], 4) = 0
0.200 write(4, ..., 11680) = 11680
+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0

0.200 > P. 1:731(730) ack 1
0.200 > P. 731:1461(730) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:13141(4380) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 13141 win 257

0.400 close(4) = 0
0.400 > F. 13141:13141(0) ack 1
0.500 < F. 1:1(0) ack 13142 win 257
0.500 > . 13142:13142(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 14:40:55 -04:00
Nicolas Dichtel
a9a080422e ipmr: align RTA_MFC_STATS on 64-bit
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 14:22:13 -04:00
Xin Long
b7de529c79 net: use jiffies_to_msecs to replace EXPIRES_IN_MS in inet/sctp_diag
EXPIRES_IN_MS macro comes from net/ipv4/inet_diag.c and dates
back to before jiffies_to_msecs() has been introduced.

Now we can remove it and use jiffies_to_msecs().

Suggested-by: Jakub Sitnicki <jkbs@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 13:55:33 -04:00
Martin KaFai Lau
479f85c366 tcp: Fix SOF_TIMESTAMPING_TX_ACK when handling dup acks
Assuming SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received,
it could incorrectly think that a skb has already
been acked and queue a SCM_TSTAMP_ACK cmsg to the
sk->sk_error_queue.

In tcp_ack_tstamp(), it checks
'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'.
If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill
script, between() returns true but the tskey is actually not acked.
e.g. try between(3, 2, 1).

The fix is to replace between() with one before() and one !before().
By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be
removed.

A packetdrill script is used to reproduce the dup ack scenario.
Due to the lacking cmsg support in packetdrill (may be I
cannot find it),  a BPF prog is used to kprobe to
sock_queue_err_skb() and print out the value of
serr->ee.ee_data.

Both the packetdrill and the bcc BPF script is attached at the end of
this commit message.

BPF Output Before Fix:
~~~~~~
      <...>-2056  [001] d.s.   433.927987: : ee_data:1459  #incorrect
packetdrill-2056  [001] d.s.   433.929563: : ee_data:1459  #incorrect
packetdrill-2056  [001] d.s.   433.930765: : ee_data:1459  #incorrect
packetdrill-2056  [001] d.s.   434.028177: : ee_data:1459
packetdrill-2056  [001] d.s.   434.029686: : ee_data:14599

BPF Output After Fix:
~~~~~~
      <...>-2049  [000] d.s.   113.517039: : ee_data:1459
      <...>-2049  [000] d.s.   113.517253: : ee_data:14599

BCC BPF Script:
~~~~~~
#!/usr/bin/env python

from __future__ import print_function
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>
#include <linux/errqueue.h>

#ifdef memset
#undef memset
#endif

int trace_err_skb(struct pt_regs *ctx)
{
	struct sk_buff *skb = (struct sk_buff *)ctx->si;
	struct sock *sk = (struct sock *)ctx->di;
	struct sock_exterr_skb *serr;
	u32 ee_data = 0;

	if (!sk || !skb)
		return 0;

	serr = SKB_EXT_ERR(skb);
	bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data);
	bpf_trace_printk("ee_data:%u\\n", ee_data);

	return 0;
};
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb")
print("Attached to kprobe")
b.trace_print()

Packetdrill Script:
~~~~~~
+0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
+0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
+0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0

+0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
0.200 write(4, ..., 1460) = 1460
0.200 write(4, ..., 13140) = 13140

0.200 > P. 1:1461(1460) ack 1
0.200 > . 1461:8761(7300) ack 1
0.200 > P. 8761:14601(5840) ack 1

0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop>
0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop>
0.300 > P. 1:1461(1460) ack 1
0.400 < . 1:1(0) ack 14601 win 257

0.400 close(4) = 0
0.400 > F. 14601:14601(0) ack 1
0.500 < F. 1:1(0) ack 14602 win 257
0.500 > . 14602:14602(0) ack 2

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 13:45:43 -04:00
Dan Carpenter
110361f41c udp: fix if statement in SIOCINQ ioctl
We deleted a line of code and accidentally made the "return put_user()"
part of the if statement when it's supposed to be unconditional.

Fixes: 9f9a45beaa ('udp: do not expect udp headers on ioctl SIOCINQ')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-18 13:40:08 -04:00
Alexander Duyck
e0c20967c8 GRE: Add support for GRO/GSO of IPv6 GRE traffic
Since GRE doesn't really care about L3 protocol we can support IPv4 and
IPv6 using the same offloads.  With that being the case we can add a call
to register the offloads for IPv6 as a part of our GRE offload
initialization.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16 19:09:13 -04:00
Alexander Duyck
aed069df09 ip_tunnel_core: iptunnel_handle_offloads returns int and doesn't free skb
This patch updates the IP tunnel core function iptunnel_handle_offloads so
that we return an int and do not free the skb inside the function.  This
actually allows us to clean up several paths in several tunnels so that we
can free the skb at one point in the path without having to have a
secondary path if we are supporting tunnel offloads.

In addition it should resolve some double-free issues I have found in the
tunnels paths as I believe it is possible for us to end up triggering such
an event in the case of fou or gue.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-16 19:09:13 -04:00
Xin Long
cb2050a7b8 sctp: export some functions for sctp_diag in inet_diag
inet_diag_msg_common_fill is used to fill the diag msg common info,
we need to use it in sctp_diag as well, so export it.

inet_diag_msg_attrs_fill is used to fill some common attrs info between
sctp diag and tcp diag.

v2->v3:
- do not need to define and export inet_diag_get_handler any more.
  cause all the functions in it are in sctp_diag.ko, we just call
  them in sctp_diag.ko.

- add inet_diag_msg_attrs_fill to make codes clear.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 17:29:36 -04:00
Eric Dumazet
8804b2722d tcp: remove false sharing in tcp_rcv_state_process()
Last known hot point during SYNFLOOD attack is the clearing
of rx_opt.saw_tstamp in tcp_rcv_state_process()

It is not needed for a listener, so we move it where it matters.

Performance while a SYNFLOOD hits a single listener socket
went from 5 Mpps to 6 Mpps on my test server (24 cores, 8 NIC RX queues)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:45:44 -04:00
Eric Dumazet
b3d051477c tcp: do not mess with listener sk_wmem_alloc
When removing sk_refcnt manipulation on synflood, I missed that
using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
transitioned to 0.

We should hold sk_refcnt instead, but this is a big deal under attack.
(Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)

In this patch, I chose to not attach a socket to syncookies skb.

Performance is now 5 Mpps instead of 3.2 Mpps.

Following patch will remove last known false sharing in
tcp_rcv_state_process()

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:45:44 -04:00
Craig Gallek
d894ba18d4 soreuseport: fix ordering for mixed v4/v6 sockets
With the SO_REUSEPORT socket option, it is possible to create sockets
in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
the AF_INET6 sockets.

Prior to the commits referenced below, an incoming IPv4 packet would
always be routed to a socket of type AF_INET when this mixed-mode was used.
After those changes, the same packet would be routed to the most recently
bound socket (if this happened to be an AF_INET6 socket, it would
have an IPv4 mapped IPv6 address).

The change in behavior occurred because the recent SO_REUSEPORT optimizations
short-circuit the socket scoring logic as soon as they find a match.  They
did not take into account the scoring logic that favors AF_INET sockets
over AF_INET6 sockets in the event of a tie.

To fix this problem, this patch changes the insertion order of AF_INET
and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
the tail of the list.  This will force AF_INET sockets to always be
considered first.

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")

Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 21:14:03 -04:00
Alexander Duyck
802ab55adc GSO: Support partial segmentation offload
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header.  The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.

With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM

In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:41 -04:00
Alexander Duyck
1530545ed6 GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values
This patch does two things.

First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4.  In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.

The second thing this does is that it places limitations on the outer IPv4
ID header in the case of tunneled frames.  Specifically it forces the IP ID
to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
This way we can avoid creating overlapping series of IP IDs that could
possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:41 -04:00
Alexander Duyck
cbc53e08a7 GSO: Add GSO type for fixed IPv4 ID
This patch adds support for TSO using IPv4 headers with a fixed IP ID
field.  This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.

In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling.  Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was.  This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:40 -04:00