Commit Graph

802 Commits

Author SHA1 Message Date
Menglong Dong 8eba65fa5f net: tcp: use kfree_skb_reason() for tcp_v{4,6}_do_rcv()
Replace kfree_skb() used in tcp_v4_do_rcv() and tcp_v6_do_rcv() with
kfree_skb_reason().

Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-20 13:55:31 +00:00
Menglong Dong 7a26dc9e7b net: tcp: add skb drop reasons to tcp_add_backlog()
Pass the address of drop_reason to tcp_add_backlog() to store the
reasons for skb drops when fails. Following drop reasons are
introduced:

SKB_DROP_REASON_SOCKET_BACKLOG

Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-20 13:55:31 +00:00
Menglong Dong 643b622b51 net: tcp: add skb drop reasons to tcp_v{4,6}_inbound_md5_hash()
Pass the address of drop reason to tcp_v4_inbound_md5_hash() and
tcp_v6_inbound_md5_hash() to store the reasons for skb drops when this
function fails. Therefore, the drop reason can be passed to
kfree_skb_reason() when the skb needs to be freed.

Following drop reasons are added:

SKB_DROP_REASON_TCP_MD5NOTFOUND
SKB_DROP_REASON_TCP_MD5UNEXPECTED
SKB_DROP_REASON_TCP_MD5FAILURE

SKB_DROP_REASON_TCP_MD5* above correspond to LINUX_MIB_TCPMD5*

Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-20 13:55:31 +00:00
Menglong Dong c0e3154d9c net: tcp: use kfree_skb_reason() for tcp_v6_rcv()
Replace kfree_skb() used in tcp_v6_rcv() with kfree_skb_reason().

Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-20 13:55:31 +00:00
Eric Dumazet fbb8295248 tcp: allocate tcp_death_row outside of struct netns_ipv4
I forgot tcp had per netns tracking of timewait sockets,
and their sysctl to change the limit.

After 0dad4087a8 ("tcp/dccp: get rid of inet_twsk_purge()"),
whole struct net can be freed before last tw socket is freed.

We need to allocate a separate struct inet_timewait_death_row
object per netns.

tw_count becomes a refcount and gains associated debugging infrastructure.

BUG: KASAN: use-after-free in inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
Read of size 8 at addr ffff88807d5f9f40 by task kworker/1:7/3690

CPU: 1 PID: 3690 Comm: kworker/1:7 Not tainted 5.16.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events pwq_unbound_release_workfn
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
 __kasan_report mm/kasan/report.c:442 [inline]
 kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
 inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
 call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
 expire_timers kernel/time/timer.c:1466 [inline]
 __run_timers.part.0+0x67c/0xa30 kernel/time/timer.c:1734
 __run_timers kernel/time/timer.c:1715 [inline]
 run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747
 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
 invoke_softirq kernel/softirq.c:432 [inline]
 __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
RIP: 0010:lockdep_unregister_key+0x1c9/0x250 kernel/locking/lockdep.c:6328
Code: 00 00 00 48 89 ee e8 46 fd ff ff 4c 89 f7 e8 5e c9 ff ff e8 09 cc ff ff 9c 58 f6 c4 02 75 26 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c 41 5d 41 5e 41 5f e9 19 4a 08 00 0f 0b 5b 5d 41 5c 41 5d
RSP: 0018:ffffc90004077cb8 EFLAGS: 00000206
RAX: 0000000000000046 RBX: ffff88807b61b498 RCX: 0000000000000001
RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff888077027128 R08: 0000000000000001 R09: ffffffff8f1ea4fc
R10: fffffbfff1ff93ee R11: 000000000000af1e R12: 0000000000000246
R13: 0000000000000000 R14: ffffffff8ffc89b8 R15: ffffffff90157fb0
 wq_unregister_lockdep kernel/workqueue.c:3508 [inline]
 pwq_unbound_release_workfn+0x254/0x340 kernel/workqueue.c:3746
 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
 worker_thread+0x657/0x1110 kernel/workqueue.c:2454
 kthread+0x2e9/0x3a0 kernel/kthread.c:377
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
 </TASK>

Allocated by task 3635:
 kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
 kasan_set_track mm/kasan/common.c:46 [inline]
 set_alloc_info mm/kasan/common.c:437 [inline]
 __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:470
 kasan_slab_alloc include/linux/kasan.h:260 [inline]
 slab_post_alloc_hook mm/slab.h:732 [inline]
 slab_alloc_node mm/slub.c:3230 [inline]
 slab_alloc mm/slub.c:3238 [inline]
 kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243
 kmem_cache_zalloc include/linux/slab.h:705 [inline]
 net_alloc net/core/net_namespace.c:407 [inline]
 copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
 ksys_unshare+0x445/0x920 kernel/fork.c:3048
 __do_sys_unshare kernel/fork.c:3119 [inline]
 __se_sys_unshare kernel/fork.c:3117 [inline]
 __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff88807d5f9a80
 which belongs to the cache net_namespace of size 6528
The buggy address is located 1216 bytes inside of
 6528-byte region [ffff88807d5f9a80, ffff88807d5fb400)
The buggy address belongs to the page:
page:ffffea0001f57e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88807d5f9a80 pfn:0x7d5f8
head:ffffea0001f57e00 order:3 compound_mapcount:0 compound_pincount:0
memcg:ffff888070023001
flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000010200 ffff888010dd4f48 ffffea0001404e08 ffff8880118fd000
raw: ffff88807d5f9a80 0000000000040002 00000001ffffffff ffff888070023001
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3634, ts 119694798460, free_ts 119693556950
 prep_new_page mm/page_alloc.c:2434 [inline]
 get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
 alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
 alloc_slab_page mm/slub.c:1799 [inline]
 allocate_slab mm/slub.c:1944 [inline]
 new_slab+0x28a/0x3b0 mm/slub.c:2004
 ___slab_alloc+0x87c/0xe90 mm/slub.c:3018
 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
 slab_alloc_node mm/slub.c:3196 [inline]
 slab_alloc mm/slub.c:3238 [inline]
 kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3243
 kmem_cache_zalloc include/linux/slab.h:705 [inline]
 net_alloc net/core/net_namespace.c:407 [inline]
 copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
 ksys_unshare+0x445/0x920 kernel/fork.c:3048
 __do_sys_unshare kernel/fork.c:3119 [inline]
 __se_sys_unshare kernel/fork.c:3117 [inline]
 __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
page last free stack trace:
 reset_page_owner include/linux/page_owner.h:24 [inline]
 free_pages_prepare mm/page_alloc.c:1352 [inline]
 free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
 free_unref_page_prepare mm/page_alloc.c:3325 [inline]
 free_unref_page+0x19/0x690 mm/page_alloc.c:3404
 skb_free_head net/core/skbuff.c:655 [inline]
 skb_release_data+0x65d/0x790 net/core/skbuff.c:677
 skb_release_all net/core/skbuff.c:742 [inline]
 __kfree_skb net/core/skbuff.c:756 [inline]
 consume_skb net/core/skbuff.c:914 [inline]
 consume_skb+0xc2/0x160 net/core/skbuff.c:908
 skb_free_datagram+0x1b/0x1f0 net/core/datagram.c:325
 netlink_recvmsg+0x636/0xea0 net/netlink/af_netlink.c:1998
 sock_recvmsg_nosec net/socket.c:948 [inline]
 sock_recvmsg net/socket.c:966 [inline]
 sock_recvmsg net/socket.c:962 [inline]
 ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
 ___sys_recvmsg+0x127/0x200 net/socket.c:2674
 __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Memory state around the buggy address:
 ffff88807d5f9e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88807d5f9e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88807d5f9f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                           ^
 ffff88807d5f9f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88807d5fa000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 0dad4087a8 ("tcp/dccp: get rid of inet_twsk_purge()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220126180714.845362-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-26 19:00:31 -08:00
Eric Dumazet 0dad4087a8 tcp/dccp: get rid of inet_twsk_purge()
Prior patches in the series made sure tw_timer_handler()
can be fired after netns has been dismantled/freed.

We no longer have to scan a potentially big TCP ehash
table at netns dismantle.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-25 11:25:21 +00:00
Menglong Dong 91a760b269 net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:

	err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
	if (err) {
		inet->inet_saddr = inet->inet_rcv_saddr = 0;
		goto out_release_sock;
	}

Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.

To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().

Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
2022-01-06 17:08:35 -08:00
Jakub Kicinski 8b3f913322 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/net/sock.h
  commit 8f905c0e73 ("inet: fully convert sk->sk_rx_dst to RCU rules")
  commit 43f51df417 ("net: move early demux fields close to sk_refcnt")
  https://lore.kernel.org/all/20211222141641.0caa0ab3@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-23 16:09:58 -08:00
Eric Dumazet 8f905c0e73 inet: fully convert sk->sk_rx_dst to RCU rules
syzbot reported various issues around early demux,
one being included in this changelog [1]

sk->sk_rx_dst is using RCU protection without clearly
documenting it.

And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
are not following standard RCU rules.

[a]    dst_release(dst);
[b]    sk->sk_rx_dst = NULL;

They look wrong because a delete operation of RCU protected
pointer is supposed to clear the pointer before
the call_rcu()/synchronize_rcu() guarding actual memory freeing.

In some cases indeed, dst could be freed before [b] is done.

We could cheat by clearing sk_rx_dst before calling
dst_release(), but this seems the right time to stick
to standard RCU annotations and debugging facilities.

[1]
BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204

CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
 __kasan_report mm/kasan/report.c:433 [inline]
 kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
 dst_check include/net/dst.h:470 [inline]
 tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
 ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
 ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
 ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
 ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
 __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
 __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
 __netif_receive_skb_list net/core/dev.c:5608 [inline]
 netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
 gro_normal_list net/core/dev.c:5853 [inline]
 gro_normal_list net/core/dev.c:5849 [inline]
 napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
 virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
 virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
 __napi_poll+0xaf/0x440 net/core/dev.c:7023
 napi_poll net/core/dev.c:7090 [inline]
 net_rx_action+0x801/0xb40 net/core/dev.c:7177
 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
 invoke_softirq kernel/softirq.c:432 [inline]
 __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
 common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
 asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
RIP: 0033:0x7f5e972bfd57
Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
RSP: 002b:00007fff8a413210 EFLAGS: 00000283
RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
 </TASK>

Allocated by task 13:
 kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
 kasan_set_track mm/kasan/common.c:46 [inline]
 set_alloc_info mm/kasan/common.c:434 [inline]
 __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
 kasan_slab_alloc include/linux/kasan.h:259 [inline]
 slab_post_alloc_hook mm/slab.h:519 [inline]
 slab_alloc_node mm/slub.c:3234 [inline]
 slab_alloc mm/slub.c:3242 [inline]
 kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
 dst_alloc+0x146/0x1f0 net/core/dst.c:92
 rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
 ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
 ip_route_input_rcu net/ipv4/route.c:2470 [inline]
 ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
 ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
 ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
 ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
 ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
 __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
 __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
 __netif_receive_skb_list net/core/dev.c:5608 [inline]
 netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
 gro_normal_list net/core/dev.c:5853 [inline]
 gro_normal_list net/core/dev.c:5849 [inline]
 napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
 virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
 virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
 __napi_poll+0xaf/0x440 net/core/dev.c:7023
 napi_poll net/core/dev.c:7090 [inline]
 net_rx_action+0x801/0xb40 net/core/dev.c:7177
 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

Freed by task 13:
 kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
 kasan_set_track+0x21/0x30 mm/kasan/common.c:46
 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
 ____kasan_slab_free mm/kasan/common.c:366 [inline]
 ____kasan_slab_free mm/kasan/common.c:328 [inline]
 __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
 kasan_slab_free include/linux/kasan.h:235 [inline]
 slab_free_hook mm/slub.c:1723 [inline]
 slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
 slab_free mm/slub.c:3513 [inline]
 kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
 dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
 rcu_do_batch kernel/rcu/tree.c:2506 [inline]
 rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

Last potentially related work creation:
 kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
 __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
 __call_rcu kernel/rcu/tree.c:2985 [inline]
 call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
 dst_release net/core/dst.c:177 [inline]
 dst_release+0x79/0xe0 net/core/dst.c:167
 tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
 sk_backlog_rcv include/net/sock.h:1030 [inline]
 __release_sock+0x134/0x3b0 net/core/sock.c:2768
 release_sock+0x54/0x1b0 net/core/sock.c:3300
 tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
 inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
 sock_sendmsg_nosec net/socket.c:704 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:724
 sock_write_iter+0x289/0x3c0 net/socket.c:1057
 call_write_iter include/linux/fs.h:2162 [inline]
 new_sync_write+0x429/0x660 fs/read_write.c:503
 vfs_write+0x7cd/0xae0 fs/read_write.c:590
 ksys_write+0x1ee/0x250 fs/read_write.c:643
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae

The buggy address belongs to the object at ffff88807f1cb700
 which belongs to the cache ip_dst_cache of size 176
The buggy address is located 58 bytes inside of
 176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
The buggy address belongs to the page:
page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
 prep_new_page mm/page_alloc.c:2418 [inline]
 get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
 alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
 alloc_slab_page mm/slub.c:1793 [inline]
 allocate_slab mm/slub.c:1930 [inline]
 new_slab+0x32d/0x4a0 mm/slub.c:1993
 ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
 slab_alloc_node mm/slub.c:3200 [inline]
 slab_alloc mm/slub.c:3242 [inline]
 kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
 dst_alloc+0x146/0x1f0 net/core/dst.c:92
 rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
 __mkroute_output net/ipv4/route.c:2564 [inline]
 ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
 ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
 __ip_route_output_key include/net/route.h:126 [inline]
 ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
 ip_route_output_key include/net/route.h:142 [inline]
 geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
 geneve_xmit_skb drivers/net/geneve.c:899 [inline]
 geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
 __netdev_start_xmit include/linux/netdevice.h:4994 [inline]
 netdev_start_xmit include/linux/netdevice.h:5008 [inline]
 xmit_one net/core/dev.c:3590 [inline]
 dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
 __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
page last free stack trace:
 reset_page_owner include/linux/page_owner.h:24 [inline]
 free_pages_prepare mm/page_alloc.c:1338 [inline]
 free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
 free_unref_page_prepare mm/page_alloc.c:3309 [inline]
 free_unref_page+0x19/0x690 mm/page_alloc.c:3388
 qlink_free mm/kasan/quarantine.c:146 [inline]
 qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
 kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
 __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
 kasan_slab_alloc include/linux/kasan.h:259 [inline]
 slab_post_alloc_hook mm/slab.h:519 [inline]
 slab_alloc_node mm/slub.c:3234 [inline]
 kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
 __alloc_skb+0x215/0x340 net/core/skbuff.c:414
 alloc_skb include/linux/skbuff.h:1126 [inline]
 alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
 sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
 mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
 add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
 add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
 mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
 mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
 mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445

Memory state around the buggy address:
 ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
>ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                        ^
 ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
 ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 41063e9dd1 ("ipv4: Early TCP socket demux.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-20 18:46:43 -08:00
Eric Dumazet f35f821935 tcp: defer skb freeing after socket lock is released
tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.

A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.

Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.

This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.

This patch adds logic to defer these frees after socket
lock is released, or directly from BH handler if possible.

Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free assymetry,
when BH handler and user thread do not run on same cpu or
NUMA node.

One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)

Tested:
 100Gbit NIC
 Max throughput for one TCP_STREAM flow, over 10 runs

MTU : 1500
Before: 55 Gbit
After:  66 Gbit

MTU : 4096+(headers)
Before: 82 Gbit
After:  95 Gbit

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:35 +00:00
Eric Dumazet d2489c7b6d tcp: add RETPOLINE mitigation to sk_backlog_rcv
Use INDIRECT_CALL_INET() to avoid an indirect call
when/if CONFIG_RETPOLINE=y

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Eric Dumazet 3735440200 tcp: small optimization in tcp_v6_send_check()
For TCP flows, inet6_sk(sk)->saddr has the same value
than sk->sk_v6_rcv_saddr.

Using sk->sk_v6_rcv_saddr increases data locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-16 13:10:34 +00:00
Nghia Le 70bf363d7a ipv6: remove useless assignment to newinet in tcp_v6_syn_recv_sock()
The newinet value is initialized with inet_sk() in a block code to
handle sockets for the ETH_P_IP protocol. Along this code path,
newinet is never read. Thus, assignment to newinet is needless and
can be removed.

Signed-off-by: Nghia Le <nghialm78@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20211104143740.32446-1-nghialm78@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-05 19:49:40 -07:00
Eric Dumazet 4f2266748e tcp: do not clear skb->csum if already zero
Freshly allocated skbs have their csum field cleared already.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-28 12:44:39 +01:00
Eric Dumazet 12c8691de3 ipv6/tcp: small drop monitor changes
Two kfree_skb() calls must be replaced by consume_skb()
for skbs that are not technically dropped.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:14 -07:00
Eric Dumazet 790eb67374 ipv6: guard IPV6_MINHOPCOUNT with a static key
RFC 5082 IPV6_MINHOPCOUNT is rarely used on hosts.

Add a static key to remove from TCP fast path useless code,
and potential cache line miss to fetch tcp_inet6_sk(sk)->min_hopcount

Note that once ip6_min_hopcount static key has been enabled,
it stays enabled until next boot.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:13 -07:00
Eric Dumazet cc17c3c8e8 ipv6: annotate data races around np->min_hopcount
No report yet from KCSAN, yet worth documenting the races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:13 -07:00
Eric Dumazet ef57c1610d ipv6: move inet6_sk(sk)->rx_dst_cookie to sk->sk_rx_dst_cookie
Increase cache locality by moving rx_dst_coookie next to sk->sk_rx_dst

This removes one or two cache line misses in IPv6 early demux (TCP/UDP)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:12 -07:00
Eric Dumazet 0c0a5ef809 tcp: move inet->rx_dst_ifindex to sk->sk_rx_dst_ifindex
Increase cache locality by moving rx_dst_ifindex next to sk->sk_rx_dst

This is part of an effort to reduce cache line misses in TCP fast path.

This removes one cache line miss in early demux.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-25 18:02:12 -07:00
David S. Miller bdfa75ad70 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Lots of simnple overlapping additions.

With a build fix from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-22 11:41:16 +01:00
Leonard Crestez a76c2315be tcp: md5: Allow MD5SIG_FLAG_IFINDEX with ifindex=0
Multiple VRFs are generally meant to be "separate" but right now md5
keys for the default VRF also affect connections inside VRFs if the IP
addresses happen to overlap.

So far the combination of TCP_MD5SIG_FLAG_IFINDEX with tcpm_ifindex == 0
was an error, accept this to mean "key only applies to default VRF".
This is what applications using VRFs for traffic separation want.

Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-15 14:36:57 +01:00
Eric Dumazet d8b81175e4 tcp: remove sk_{tr}x_skb_cache
This reverts the following patches :

- commit 2e05fcae83 ("tcp: fix compile error if !CONFIG_SYSCTL")
- commit 4f661542a4 ("tcp: fix zerocopy and notsent_lowat issues")
- commit 472c2e07ee ("tcp: add one skb cache for tx")
- commit 8b27dae5a2 ("tcp: add one skb cache for rx")

Having a cache of one skb (in each direction) per TCP socket is fragile,
since it can cause a significant increase of memory needs,
and not good enough for high speed flows anyway where more than one skb
is needed.

We want instead to add a generic infrastructure, with more flexible
per-cpu caches, for alien NUMA nodes.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-23 12:50:26 +01:00
Alexander Ovechkin 43b90bfad3 net: send SYNACK packet with accepted fwmark
commit e05a90ec9e ("net: reflect mark on tcp syn ack packets")
fixed IPv4 only.

This part is for the IPv6 side.

Fixes: e05a90ec9e ("net: reflect mark on tcp syn ack packets")
Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru>
Acked-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09 11:24:02 -07:00
Eric Dumazet c7bb4b8903 ipv6: tcp: drop silly ICMPv6 packet too big messages
While TCP stack scales reasonably well, there is still one part that
can be used to DDOS it.

IPv6 Packet too big messages have to lookup/insert a new route,
and if abused by attackers, can easily put hosts under high stress,
with many cpus contending on a spinlock while one is stuck in fib6_run_gc()

ip6_protocol_deliver_rcu()
 icmpv6_rcv()
  icmpv6_notify()
   tcp_v6_err()
    tcp_v6_mtu_reduced()
     inet6_csk_update_pmtu()
      ip6_rt_update_pmtu()
       __ip6_rt_update_pmtu()
        ip6_rt_cache_alloc()
         ip6_dst_alloc()
          dst_alloc()
           ip6_dst_gc()
            fib6_run_gc()
             spin_lock_bh() ...

Some of our servers have been hit by malicious ICMPv6 packets
trying to _increase_ the MTU/MSS of TCP flows.

We believe these ICMPv6 packets are a result of a bug in one ISP stack,
since they were blindly sent back for _every_ (small) packet sent to them.

These packets are for one TCP flow:
09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240
09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240

TCP stack can filter some silly requests :

1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err()
2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS.

This tests happen before the IPv6 routing stack is entered, thus
removing the potential contention and route exhaustion.

Note that IPv6 stack was performing these checks, but too late
(ie : after the route has been added, and after the potential
garbage collect war)

v2: fix typo caught by Martin, thanks !
v3: exports tcp_mtu_to_mss(), caught by David, thanks !

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08 12:27:08 -07:00
Eric Dumazet 561022acb1 tcp: annotate data races around tp->mtu_info
While tp->mtu_info is read while socket is owned, the write
sides happen from err handlers (tcp_v[46]_mtu_reduced)
which only own the socket spinlock.

Fixes: 563d34d057 ("tcp: dont drop MTU reduction indications")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-02 13:31:48 -07:00
Alexander Aring e3ae2365ef net: sock: introduce sk_error_report
This patch introduces a function wrapper to call the sk_error_report
callback. That will prepare to add additional handling whenever
sk_error_report is called, for example to trace socket errors.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 11:28:21 -07:00
Kuniyuki Iwashima d4f2c86b2b tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
This patch also changes the code to call reuseport_migrate_sock() and
inet_reqsk_clone(), but unlike the other cases, we do not call
inet_reqsk_clone() right after reuseport_migrate_sock().

Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener
has three kinds of refcnt:

  (A) for listener itself
  (B) carried by reuqest_sock
  (C) sock_hold() in tcp_v[46]_rcv()

While processing the req, (A) may disappear by close(listener). Also, (B)
can disappear by accept(listener) once we put the req into the accept
queue. So, we have to hold another refcnt (C) for the listener to prevent
use-after-free.

For socket migration, we call reuseport_migrate_sock() to select a listener
with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
Thus we have to take another refcnt (B) for the newly cloned request_sock.

In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
try to put the new req into the accept queue. By migrating req after
winning the "own_req" race, we can avoid such a worst situation:

  CPU 1 looks up req1
  CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
  CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
  ...

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
2021-06-15 18:01:06 +02:00
Jakub Kicinski 709c031423 tcp: add tracepoint for checksum errors
Add a tracepoint for capturing TCP segments with
a bad checksum. This makes it easy to identify
sources of bad frames in the fleet (e.g. machines
with faulty NICs).

It should also help tools like IOvisor's tcpdrop.py
which are used today to get detailed information
about such packets.

We don't have a socket in many cases so we must
open code the address extraction based just on
the skb.

v2: add missing export for ipv6=m

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14 15:26:03 -07:00
Florian Westphal dc87efdb1a mptcp: add mptcp reset option support
The MPTCP reset option allows to carry a mptcp-specific error code that
provides more information on the nature of a connection reset.

Reset option data received gets stored in the subflow context so it can
be sent to userspace via the 'subflow closed' netlink event.

When a subflow is closed, the desired error code that should be sent to
the peer is also placed in the subflow context structure.

If a reset is sent before subflow establishment could complete, e.g. on
HMAC failure during an MP_JOIN operation, the mptcp skb extension is
used to store the reset information.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-02 14:21:50 -07:00
Cong Wang 8a59f9d1e3 sock: Introduce sk->sk_prot->psock_update_sk_prot()
Currently sockmap calls into each protocol to update the struct
proto and replace it. This certainly won't work when the protocol
is implemented as a module, for example, AF_UNIX.

Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each
protocol can implement its own way to replace the struct proto.
This also helps get rid of symbol dependencies on CONFIG_INET.

Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com
2021-04-01 10:56:14 -07:00
Jakub Kicinski dcc32f4f18 ipv6: weaken the v4mapped source check
This reverts commit 6af1799aaf.

Commit 6af1799aaf ("ipv6: drop incoming packets having a v4mapped
source address") introduced an input check against v4mapped addresses.
Use of such addresses on the wire is indeed questionable and not
allowed on public Internet. As the commit pointed out

  https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02

lists potential issues.

Unfortunately there are applications which use v4mapped addresses,
and breaking them is a clear regression. For example v4mapped
addresses (or any semi-valid addresses, really) may be used
for uni-direction event streams or packet export.

Since the issue which sparked the addition of the check was with
TCP and request_socks in particular push the check down to TCPv6
and DCCP. This restores the ability to receive UDPv6 packets with
v4mapped address as the source.

Keep using the IPSTATS_MIB_INHDRERRORS statistic to minimize the
user-visible changes.

Fixes: 6af1799aaf ("ipv6: drop incoming packets having a v4mapped source address")
Reported-by: Sunyi Shao <sunyishao@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 11:19:23 -07:00
David S. Miller b8af417e4d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-02-16

The following pull-request contains BPF updates for your *net-next* tree.

There's a small merge conflict between 7eeba1706e ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f81 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:

  [...]
                lock_sock(sk);
                err = tcp_zerocopy_receive(sk, &zc, &tss);
                err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
                                                          &zc, &len, err);
                release_sock(sk);
  [...]

We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).

The main changes are:

1) Adds support of pointers to types with known size among global function
   args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.

2) Add bpf_iter for task_vma which can be used to generate information similar
   to /proc/pid/maps, from Song Liu.

3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
   rewriting bind user ports from BPF side below the ip_unprivileged_port_start
   range, both from Stanislav Fomichev.

4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
   as well as per-cpu maps for the latter, from Alexei Starovoitov.

5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
   for sleepable programs, both from KP Singh.

6) Extend verifier to enable variable offset read/write access to the BPF
   program stack, from Andrei Matei.

7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
   query device MTU from programs, from Jesper Dangaard Brouer.

8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF
   tracing programs, from Florent Revest.

9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
   otherwise too many passes are required, from Gary Lin.

10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
    verification both related to zero-extension handling, from Ilya Leoshkevich.

11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.

12) Batch of AF_XDP selftest cleanups and small performance improvement
    for libbpf's xsk map redirect for newer kernels, from Björn Töpel.

13) Follow-up BPF doc and verifier improvements around atomics with
    BPF_FETCH, from Brendan Jackman.

14) Permit zero-sized data sections e.g. if ELF .rodata section contains
    read-only data from local variables, from Yonghong Song.

15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-16 13:14:06 -08:00
Brian Vazquez bbd807dfbf net: indirect call helpers for ipv4/ipv6 dst_check functions
This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check

Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-03 14:51:40 -08:00
Stanislav Fomichev 9cacf81f81 bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.

Without this patch:
     3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
            |
             --3.30%--__cgroup_bpf_run_filter_getsockopt
                       |
                        --0.81%--__kmalloc

With the patch applied:
     0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern

Note, exporting uapi/tcp.h requires removing netinet/tcp.h
from test_progs.h because those headers have confliciting
definitions.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
2021-01-20 14:23:00 -08:00
Linus Torvalds ca5b877b6c selinux/stable-5.11 PR 20201214
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl/YBtEUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNnwA/9Ek8DG/1t8CEoJxpoRvwovQxNo+bi
 0rCT9vqvx9PeCwoZi/0Vp6oKmpE1HADvbeB/+e00VrbLYnzE3oRY6VkpjoZRofKS
 vc0/MzHSFxFUR1OTHwCefcXlPLK+bfitQbX5jEMeVyQCXNXXIrN7CnJf1LmCeLTR
 kQBPlEN9lt7HyNVAi34FhOD/TQbWnFHgl2z5puffgri6cWnc+TALKMYytUZ+rYex
 NYndDJW5b3g5kTat2eErn0FruxfzloGs0xMIiWb+z2i9kl41D+dkKPdAN7idqCSC
 Jv0nJP/bDftzA0wOe9szmGaLQzu7YnCN5kiWcSspatZVnon42Cy/tp9tiuPGLRFU
 XtelDfpyX6o3CLN0tX7LQEO+GYxPzvM6iaR2OrsChWPozUIIR3TLQg7jJN4bvNKl
 TR6gCGZCoAeS5JLNGjzVKxT/oKQY+tCLLlYXQdQY6swNFi3EKmPr+K1D9lgm98fO
 f3d1QmWiZZNmtxxoVogT0qoQYjkfgpnm3dVx813Vt+lwHlVpHGMEPpO27iD3/RYb
 w2yWOJaGKwMD8iL0l+Cm6CPW0/nE5FFISQjWgC8b4Vgxlyan6+L9eViqGICkrUQ2
 Edo0i1YFFZ4utHYkDf1VYBbJ+36KyCtdktgLAcbgnePiPB3E1XBsXTIIStSUIbVQ
 iEbTkBlsCG4GIeU=
 =6Cqb
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:
 "While we have a small number of SELinux patches for v5.11, there are a
  few changes worth highlighting:

   - Change the LSM network hooks to pass flowi_common structs instead
     of the parent flowi struct as the LSMs do not currently need the
     full flowi struct and they do not have enough information to use it
     safely (missing information on the address family).

     This patch was discussed both with Herbert Xu (representing team
     netdev) and James Morris (representing team
     LSMs-other-than-SELinux).

   - Fix how we handle errors in inode_doinit_with_dentry() so that we
     attempt to properly label the inode on following lookups instead of
     continuing to treat it as unlabeled.

   - Tweak the kernel logic around allowx, auditallowx, and dontauditx
     SELinux policy statements such that the auditx/dontauditx are
     effective even without the allowx statement.

  Everything passes our test suite"

* tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  lsm,selinux: pass flowi_common instead of flowi to the LSM hooks
  selinux: Fix fall-through warnings for Clang
  selinux: drop super_block backpointer from superblock_security_struct
  selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
  selinux: allow dontauditx and auditallowx rules to take effect without allowx
  selinux: fix error initialization in inode_doinit_with_dentry()
2020-12-16 11:01:04 -08:00
Jakub Kicinski 46d5e62dd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().

strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.

Conflicts:
	net/netfilter/nf_tables_api.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11 22:29:38 -08:00
Wei Wang 8ef44b6fe4 tcp: Retain ECT bits for tos reflection
For DCTCP, we have to retain the ECT bits set by the congestion control
algorithm on the socket when reflecting syn TOS in syn-ack, in order to
make ECN work properly.

Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
Reported-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 16:08:23 -08:00
Florian Westphal 7ea851d19b tcp: merge 'init_req' and 'route_req' functions
The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
a TCP reset if the token in a MP_JOIN request is unknown.

At this time we don't do this, the 3whs completes and the 'new subflow'
is reset afterwards.  There are two ways to allow MPTCP to send the
reset.

1. override 'send_synack' callback and emit the rst from there.
   The drawback is that the request socket gets inserted into the
   listeners queue just to get removed again right away.

2. Send the reset from the 'route_req' function instead.
   This avoids the 'add&remove request socket', but route_req lacks the
   skb that is required to send the TCP reset.

Instead of just adding the skb to that function for MPTCP sake alone,
Paolo suggested to merge init_req and route_req functions.

This saves one indirection from syn processing path and provides the skb
to the merged function at the same time.

'send reset on unknown mptcp join token' is added in next patch.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 12:56:03 -08:00
Alexander Duyck 407c85c7dd tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN
When a BPF program is used to select between a type of TCP congestion
control algorithm that uses either ECN or not there is a case where the
synack for the frame was coming up without the ECT0 bit set. A bit of
research found that this was due to the final socket being configured to
dctcp while the listener socket was staying in cubic.

To reproduce it all that is needed is to monitor TCP traffic while running
the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
assuming tcp_dctcp module is loaded or compiled in and the traffic matches
the rules in the sample file, is that for all frames with the exception of
the synack the ECT0 bit is set.

To address that it is necessary to make one additional call to
tcp_bpf_ca_needs_ecn using the request socket and then use the output of
that to set the ECT0 bit for the tos/tclass of the packet.

Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 14:12:55 -08:00
Ricardo Dias 01770a1661 tcp: fix race condition when creating child sockets from syncookies
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.

Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 16:32:33 -08:00
Paul Moore 3df98d7921 lsm,selinux: pass flowi_common instead of flowi to the LSM hooks
As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely.  As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-11-23 18:36:21 -05:00
Alexander Duyck 861602b577 tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header
An issue was recently found where DCTCP SYN/ACK packets did not have the
ECT bit set in the L3 header. A bit of code review found that the recent
change referenced below had gone though and added a mask that prevented the
ECN bits from being populated in the L3 header.

This patch addresses that by rolling back the mask so that it is only
applied to the flags coming from the incoming TCP request instead of
applying it to the socket tos/tclass field. Doing this the ECT bits were
restored in the SYN/ACK packets in my testing.

One thing that is not addressed by this patch set is the fact that
tcp_reflect_tos appears to be incompatible with ECN based congestion
avoidance algorithms. At a minimum the feature should likely be documented
which it currently isn't.

Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 18:09:47 -08:00
Randy Dunlap 634a63e73f net: ipv6: delete duplicated words
Drop repeated words in net/ipv6/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:12:43 -07:00
Wei Wang ac8f1710c1 tcp: reflect tos value received in SYN to the socket
This commit adds a new TCP feature to reflect the tos value received in
SYN, and send it out on the SYN-ACK, and eventually set the tos value of
the established socket with this reflected tos value. This provides a
way to set the traffic class/QoS level for all traffic in the same
connection to be the same as the incoming SYN request. It could be
useful in data centers to provide equivalent QoS according to the
incoming request.
This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by
default turned off.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:15:40 -07:00
Wei Wang e92dd77e6f ipv6: add tos reflection in TCP reset and ack
Currently, ipv6 stack does not do any TOS reflection. To make the
behavior consistent with v4 stack, this commit adds TOS reflection in
tcp_v6_reqsk_send_ack() and tcp_v6_send_reset(). We clear the lower
2-bit ECN value of the received TOS in compliance with RFC 3168 6.1.5
robustness principles.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:20:55 -07:00
Martin KaFai Lau 331fca4315 bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt()
The bpf prog needs to parse the SYN header to learn what options have
been sent by the peer's bpf-prog before writing its options into SYNACK.
This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
This syn_skb will eventually be made available (as read-only) to the
bpf prog.  This will be the only SYN packet available to the bpf
prog during syncookie.  For other regular cases, the bpf prog can
also use the saved_syn.

When writing options, the bpf prog will first be called to tell the
kernel its required number of bytes.  It is done by the new
bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
When the bpf prog returns, the kernel will know how many bytes are needed
and then update the "*remaining" arg accordingly.  4 byte alignment will
be included in the "*remaining" before this function returns.  The 4 byte
aligned number of bytes will also be stored into the opts->bpf_opt_len.
"bpf_opt_len" is a newly added member to the struct tcp_out_options.

Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
header options.  The bpf prog is only called if it has reserved spaces
before (opts->bpf_opt_len > 0).

The bpf prog is the last one getting a chance to reserve header space
and writing the header option.

These two functions are half implemented to highlight the changes in
TCP stack.  The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Christoph Hellwig d4c19c4914 net/tcp: switch ->md5_parse to sockptr_t
Pass a sockptr_t to prepare for set_fs-less handling of the kernel
pointer from bpf-cgroup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-24 15:41:54 -07:00
Christoph Hellwig 3021ad5299 net/ipv6: remove compat_ipv6_{get,set}sockopt
Handle the few cases that need special treatment in-line using
in_compat_syscall().  This also removes all the now unused
compat_{get,set}sockopt methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-19 18:16:41 -07:00
Eric Dumazet dd2e0b86fc tcp: remove indirect calls for icsk->icsk_af_ops->send_check
Mitigate RETPOLINE costs in __tcp_transmit_skb()
by using INDIRECT_CALL_INET() wrapper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-20 17:47:53 -07:00
Vinay Kumar Yadav 6abde0b241 crypto/chtls: IPv6 support for inline TLS
Extends support to IPv6 for Inline TLS server.

Signed-off-by: Vinay Kumar Yadav <vinay.yadav@chelsio.com>

v1->v2:
- cc'd tcp folks.

v2->v3:
- changed EXPORT_SYMBOL() to EXPORT_SYMBOL_GPL()

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 15:51:25 -07:00
Eric Dumazet d29245692a tcp: ipv6: support RFC 6069 (TCP-LD)
Make tcp_ld_RTO_revert() helper available to IPv6, and
implement RFC 6069 :

Quoting this RFC :

3. Connectivity Disruption Indication

   For Internet Protocol version 6 (IPv6) [RFC2460], the counterpart of
   the ICMP destination unreachable message of code 0 (net unreachable)
   and of code 1 (host unreachable) is the ICMPv6 destination
   unreachable message of code 0 (no route to destination) [RFC4443].
   As with IPv4, a router should generate an ICMPv6 destination
   unreachable message of code 0 in response to a packet that cannot be
   delivered to its destination address because it lacks a matching
   entry in its routing table.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:02:46 -07:00
Eric Dumazet 45af29ca76 tcp: allow traceroute -Mtcp for unpriv users
Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.

$ traceroute -Mtcp 8.8.8.8
You do not have enough privileges to use this traceroute method.

$ traceroute -n -Mudp 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  192.168.86.1  3.631 ms  3.512 ms  3.405 ms
 2  10.1.10.1  4.183 ms  4.125 ms  4.072 ms
 3  96.120.88.125  20.621 ms  19.462 ms  20.553 ms
 4  96.110.177.65  24.271 ms  25.351 ms  25.250 ms
 5  69.139.199.197  44.492 ms  43.075 ms  44.346 ms
 6  68.86.143.93  27.969 ms  25.184 ms  25.092 ms
 7  96.112.146.18  25.323 ms 96.112.146.22  25.583 ms 96.112.146.26  24.502 ms
 8  72.14.239.204  24.405 ms 74.125.37.224  16.326 ms  17.194 ms
 9  209.85.251.9  18.154 ms 209.85.247.55  14.449 ms 209.85.251.9  26.296 ms^C

We can easily support traceroute over TCP, by queueing an error message
into socket error queue.

Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
enable this feature, and that the error message is only queued
while in SYN_SNT state.

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
        inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
        inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
        msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
        msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
                                   {cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
        msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144

Suggested-by: Maciej Żenczykowski <maze@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25 17:54:06 -07:00
Joe Perches a8eceea84a inet: Use fallthrough;
Convert the various uses of fallthrough comments to fallthrough;

Done via script
Link: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

And by hand:

net/ipv6/ip6_fib.c has a fallthrough comment outside of an #ifdef block
that causes gcc to emit a warning if converted in-place.

So move the new fallthrough; inside the containing #ifdef/#endif too.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:55:00 -07:00
Geert Uytterhoeven 31484d56ca mptcp: Fix undefined mptcp_handle_ipv6_mapped for modular IPV6
If CONFIG_MPTCP=y, CONFIG_MPTCP_IPV6=n, and CONFIG_IPV6=m:

    ERROR: "mptcp_handle_ipv6_mapped" [net/ipv6/ipv6.ko] undefined!

This does not happen if CONFIG_MPTCP_IPV6=y, as CONFIG_MPTCP_IPV6
selects CONFIG_IPV6, and thus forces CONFIG_IPV6 builtin.

As exporting a symbol for an empty function would be a bit wasteful, fix
this by providing a dummy version of mptcp_handle_ipv6_mapped() for the
CONFIG_MPTCP_IPV6=n case.

Rename mptcp_handle_ipv6_mapped() to mptcpv6_handle_mapped(), to make it
clear this is a pure-IPV6 function, just like mptcpv6_init().

Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-30 10:55:54 +01:00
Peter Krystad cec37a6e41 mptcp: Handle MP_CAPABLE options for outgoing connections
Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request,
to capture the MP_CAPABLE in the received SYN-ACK, to add MP_CAPABLE to
the final ACK of the three-way handshake.

Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
responding SYN-ACK is received and notify the MPTCP connection layer.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Mat Martineau f870fa0b57 mptcp: Add MPTCP socket stubs
Implements the infrastructure for MPTCP sockets.

MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
sockets are only managed by the MPTCP socket that owns them and are not
visible from userspace. This commit allows a userspace program to open
an MPTCP socket with:

  sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

The resulting socket is simply a wrapper around a single regular TCP
socket, without any of the MPTCP protocol implemented over the wire.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Mat Martineau 35b2c32116 tcp: Export TCP functions and ops struct
MPTCP will make use of tcp_send_mss() and tcp_push() when sending
data to specific TCP subflows.

tcp_request_sock_ipvX_ops and ipvX_specific will be referenced
during TCP subflow creation.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
David Ahern 6b102db50c net: Add device index to tcp_md5sig
Add support for userspace to specify a device index to limit the scope
of an entry via the TCP_MD5SIG_EXT setsockopt. The existing __tcpm_pad
is renamed to tcpm_ifindex and the new field is only checked if the new
TCP_MD5SIG_FLAG_IFINDEX is set in tcpm_flags. For now, the device index
must point to an L3 master device (e.g., VRF). The API and error
handling are setup to allow the constraint to be relaxed in the future
to any device index.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern dea53bb80e tcp: Add l3index to tcp_md5sig_key and md5 functions
Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.

With the key now based on an l3index, add the new parameter to the
lookup functions and consider the l3index when looking for a match.

The l3index comes from the skb when processing ingress packets leveraging
the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
v6 variants). When the sdif index is set it means the packet ingressed a
device that is part of an L3 domain and inet_iif points to the VRF device.
For egress, the L3 domain is determined from the socket binding and
sk_bound_dev_if.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern d14c77e0b2 ipv6/tcp: Pass dif and sdif to tcp_v6_inbound_md5_hash
The original ingress device index is saved to the cb space of the skb
and the cb is moved during tcp processing. Since tcp_v6_inbound_md5_hash
can be called before and after the cb move, pass dif and sdif to it so
the caller can save both prior to the cb move. Both are used by a later
patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
Sabrina Dubroca c4e85f73af net: ipv6: add net argument to ip6_dst_lookup_flow
This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow,
as some modules currently pass a net argument without a socket to
ip6_dst_lookup. This is equivalent to commit 343d60aada ("ipv6: change
ipv6_stub_impl.ipv6_dst_lookup to take net argument").

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-04 12:27:12 -08:00
Eric Dumazet 288efe8606 net: annotate lockless accesses to sk->sk_ack_backlog
sk->sk_ack_backlog can be read without any lock being held.
We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing
and/or potential KCSAN warnings.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 16:14:48 -08:00
Eric Dumazet 0f31746452 tcp: annotate tp->write_seq lockless reads
There are few places where we fetch tp->write_seq while
this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet 7db48e9839 tcp: annotate tp->copied_seq lockless reads
There are few places where we fetch tp->copied_seq while
this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet dba7d9b8c7 tcp: annotate tp->rcv_nxt lockless reads
There are few places where we fetch tp->rcv_nxt while
this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)

syzbot reported :

BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv

write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
 tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
 tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
 tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
 tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
 tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
 ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
 napi_skb_finish net/core/dev.c:5671 [inline]
 napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
 receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061

read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
 tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
 tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
 sock_poll+0xed/0x250 net/socket.c:1256
 vfs_poll include/linux/poll.h:90 [inline]
 ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
 ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
 ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
 ep_send_events fs/eventpoll.c:1793 [inline]
 ep_poll+0xe3/0x900 fs/eventpoll.c:1930
 do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
 __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
 __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
 __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
 do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet d983ea6f16 tcp: add rcu protection around tp->fastopen_rsk
Both tcp_v4_err() and tcp_v6_err() do the following operations
while they do not own the socket lock :

	fastopen = tp->fastopen_rsk;
 	snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;

The problem is that without appropriate barrier, the compiler
might reload tp->fastopen_rsk and trigger a NULL deref.

request sockets are protected by RCU, we can simply add
the missing annotations and barriers to solve the issue.

Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet f6c0f5d209 tcp: honor SO_PRIORITY in TIME_WAIT state
ctl packets sent on behalf of TIME_WAIT sockets currently
have a zero skb->priority, which can cause various problems.

In this patch we :

- add a tw_priority field in struct inet_timewait_sock.

- populate it from sk->sk_priority when a TIME_WAIT is created.

- For IPv4, change ip_send_unicast_reply() and its two
  callers to propagate tw_priority correctly.
  ip_send_unicast_reply() no longer changes sk->sk_priority.

- For IPv6, make sure TIME_WAIT sockets pass their tw_priority
  field to tcp_v6_send_response() and tcp_v6_send_ack().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:05:02 +02:00
Eric Dumazet e9a5dceee5 ipv6: tcp: provide sk->sk_priority to ctl packets
We can populate skb->priority for some ctl packets
instead of always using zero.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:05:02 +02:00
Eric Dumazet 4f6570d720 ipv6: add priority parameter to ip6_xmit()
Currently, ip6_xmit() sets skb->priority based on sk->sk_priority

This is not desirable for TCP since TCP shares the same ctl socket
for a given netns. We want to be able to send RST or ACK packets
with a non zero skb->priority.

This patch has no functional change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:05:02 +02:00
Petar Penkov 9349d600fb tcp: add skb-less helpers to retrieve SYN cookie
This patch allows generation of a SYN cookie before an SKB has been
allocated, as is the case at XDP.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30 21:03:05 -07:00
Eric Dumazet 052e0690f1 ipv6: tcp: fix flowlabels reflection for RST packets
In 323a53c412 ("ipv6: tcp: enable flowlabel reflection in some RST packets")
and 50a8accf10 ("ipv6: tcp: send consistent flowlabel in TIME_WAIT state")
we took care of IPv6 flowlabel reflections for two cases.

This patch takes care of the remaining case, when the RST packet
is sent on behalf of a 'full' socket.

In Marek use case, this was a socket in TCP_CLOSE state.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Marek Majkowski <marek@cloudflare.com>
Tested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-11 14:43:25 -07:00
Willem de Bruijn 59c820b231 ipv6: elide flowlabel check if no exclusive leases exist
Processes can request ipv6 flowlabels with cmsg IPV6_FLOWINFO.
If not set, by default an autogenerated flowlabel is selected.

Explicit flowlabels require a control operation per label plus a
datapath check on every connection (every datagram if unconnected).
This is particularly expensive on unconnected sockets multiplexing
many flows, such as QUIC.

In the common case, where no lease is exclusive, the check can be
safely elided, as both lease request and check trivially succeed.
Indeed, autoflowlabel does the same even with exclusive leases.

Elide the check if no process has requested an exclusive lease.

fl6_sock_lookup previously returns either a reference to a lease or
NULL to denote failure. Modify to return a real error and update
all callers. On return NULL, they can use the label and will elide
the atomic_dec in fl6_sock_release.

This is an optimization. Robust applications still have to revert to
requesting leases if the fast path fails due to an exclusive lease.

Changes RFC->v1:
  - use static_key_false_deferred to rate limit jump label operations
    - call static_key_deferred_flush to stop timers on exit
  - move decrement out of RCU context
  - defer optimization also if opt data is associated with a lease
  - updated all fp6_sock_lookup callers, not just udp

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 19:38:03 -07:00
Eric Dumazet a346abe051 ipv6: icmp: allow flowlabel reflection in echo replies
Extend flowlabel_reflect bitmask to allow conditional
reflection of incoming flowlabels in echo replies.

Note this has precedence against auto flowlabels.

Add flowlabel_reflect enum to replace hard coded
values.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-01 10:54:51 -07:00
Eric Dumazet d6fb396cfa ipv4: tcp: fix ACK/RST sent with a transmit delay
If we want to set a EDT time for the skb we want to send
via ip_send_unicast_reply(), we have to pass a new parameter
and initialize ipc.sockc.transmit_time with it.

This fixes the EDT time for ACK/RST packets sent on behalf of
a TIME_WAIT socket.

Fixes: a842fe1425 ("tcp: add optional per socket transmit delay")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:51:12 -07:00
Eric Dumazet a842fe1425 tcp: add optional per socket transmit delay
Adding delays to TCP flows is crucial for studying behavior
of TCP stacks, including congestion control modules.

Linux offers netem module, but it has unpractical constraints :
- Need root access to change qdisc
- Hard to setup on egress if combined with non trivial qdisc like FQ
- Single delay for all flows.

EDT (Earliest Departure Time) adoption in TCP stack allows us
to enable a per socket delay at a very small cost.

Networking tools can now establish thousands of flows, each of them
with a different delay, simulating real world conditions.

This requires FQ packet scheduler or a EDT-enabled NIC.

This patchs adds TCP_TX_DELAY socket option, to set a delay in
usec units.

  unsigned int tx_delay = 10000; /* 10 msec */

  setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));

Note that FQ packet scheduler limits might need some tweaking :

man tc-fq

PARAMETERS
   limit
       Hard  limit  on  the  real  queue  size. When this limit is
       reached, new packets are dropped. If the value is  lowered,
       packets  are  dropped so that the new limit is met. Default
       is 10000 packets.

   flow_limit
       Hard limit on the maximum  number  of  packets  queued  per
       flow.  Default value is 100.

Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc,
so packets would be dropped if any of the previous limit is hit.

Use of a jump label makes this support runtime-free, for hosts
never using the option.

Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch : we need to account that skbs artificially delayed
wont stop us providind more skbs to feed the pipe (netem uses
skb_orphan_partial() for this purpose, but FQ can not use this trick)

Because of that, using big delays might very well trigger
old bugs in TSO auto defer logic and/or sndbuf limited detection.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 13:05:43 -07:00
Eric Dumazet c67b85558f ipv6: tcp: send consistent autoflowlabel in TIME_WAIT state
In case autoflowlabel is in action, skb_get_hash_flowi6()
derives a non zero skb->hash to the flowlabel.

If skb->hash is zero, a flow dissection is performed.

Since all TCP skbs sent from ESTABLISH state inherit their
skb->hash from sk->sk_txhash, we better keep a copy
of sk->sk_txhash into the TIME_WAIT socket.

After this patch, ACK or RST packets sent on behalf of
a TIME_WAIT socket have the flowlabel that was previously
used by the flow.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 20:10:19 -07:00
Eric Dumazet 392096736a ipv6: tcp: fix potential NULL deref in tcp_v6_send_reset()
syzbot found a crash in tcp_v6_send_reset() caused by my latest
change.

Problem is that if an skb has been queued to socket prequeue,
skb_dst(skb)->dev can not anymore point to the device.

Fortunately in this case the socket pointer is not NULL.

A similar issue has been fixed in commit 0f85feae6b ("tcp: fix
more NULL deref after prequeue changes"), I should have known better.

Fixes: 323a53c412 ("ipv6: tcp: enable flowlabel reflection in some RST packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:42:23 -07:00
David S. Miller a6cdeeb16b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07 11:00:14 -07:00
Eric Dumazet 50a8accf10 ipv6: tcp: send consistent flowlabel in TIME_WAIT state
After commit 1d13a96c74 ("ipv6: tcp: fix flowlabel value in ACK
messages"), we stored in tw_flowlabel the flowlabel, in the
case ACK packets needed to be sent on behalf of a TIME_WAIT socket.

We can use the same field so that RST packets sent from
TIME_WAIT state also use a consistent flowlabel.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 19:05:01 -07:00
Eric Dumazet 323a53c412 ipv6: tcp: enable flowlabel reflection in some RST packets
When RST packets are sent because no socket could be found,
it makes sense to use flowlabel_reflect sysctl to decide
if a reflection of the flowlabel is requested.

This extends commit 22b6722bfa ("ipv6: Add sysctl for per
namespace flow label reflection"), for some TCP RST packets.

In order to provide full control of this new feature,
flowlabel_reflect becomes a bitmask.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 19:05:01 -07:00
Thomas Gleixner 2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Paolo Abeni 97ff7ffb11 net: use indirect calls helpers at early demux stage
So that we avoid another indirect call per RX packet, if
early demux is enabled.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 10:38:04 -07:00
Paolo Abeni 0e219ae48c net: use indirect calls helpers for L3 handler hooks
So that we avoid another indirect call per RX packet in the common
case.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 10:38:04 -07:00
Eric Dumazet f5d547676c tcp: fix tcp_inet6_sk() for 32bit kernels
It turns out that struct ipv6_pinfo is not located as we think.

inet6_sk_generic() and tcp_inet6_sk() disagree on 32bit kernels by 4-bytes,
because struct tcp_sock has 8-bytes alignment,
but ipv6_pinfo size is not a multiple of 8.

sizeof(struct ipv6_pinfo): 116 (not padded to 8)

I actually first coded tcp_inet6_sk() as this patch does, but thought
that "container_of(tcp_sk(sk), struct tcp6_sock, tcp)" was cleaner.

As Julian told me : Nobody should use tcp6_sock.inet6
directly, it should be accessed via tcp_inet6_sk() or inet6_sk().

This happened when we added the first u64 field in struct tcp_sock.

Fixes: 93a77c11ae ("tcp: add tcp_inet6_sk() helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 10:21:21 -07:00
David S. Miller 356d71e00d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-03-27 17:37:58 -07:00
Eric Dumazet 8b27dae5a2 tcp: add one skb cache for rx
Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.

This means the incoming skb had to be allocated on a cpu,
but freed on another.

This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.

A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.

More over performing the __kfree_skb() in the recvmsg() context
adds a latency for user applications, and increase probability
of trapping them in backlog processing, since the BH handler
might found the socket owned by the user.

This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.

(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
 instead of 8 Mpps)

This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :

 - CPU handling the NIC rx interrupts, feeding the receive queue,
  and (after this patch) freeing the skbs that were consumed.

 - CPU in recvmsg() system call, essentially 100 % busy copying out
  data to user space.

Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.

Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.

To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:57:38 -04:00
Eric Dumazet 93a77c11ae tcp: add tcp_inet6_sk() helper
TCP ipv6 fast path dereferences a pointer to get to the inet6
part of a tcp socket, but given the fixed memory placement,
we can do better and avoid a possible cache line miss.

This also reduces register pressure, since we let the compiler
know about this memory placement.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19 14:09:32 -07:00
Eric Dumazet 89e4130939 tcp: do not use ipv6 header for ipv4 flow
When a dual stack tcp listener accepts an ipv4 flow,
it should not attempt to use an ipv6 header or tcp_v6_iif() helper.

Fixes: 1397ed35f2 ("ipv6: add flowinfo for tcp6 pkt_options for all cases")
Fixes: df3687ffc6 ("ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19 14:00:27 -07:00
Yafang Shao 9946b3410b tcp: clean up SOCK_DEBUG()
Per discussion with Daniel[1] and Eric[2], these SOCK_DEBUG() calles in
TCP are not needed now.
We'd better clean up it.

[1] https://patchwork.ozlabs.org/patch/1035573/
[2] https://patchwork.ozlabs.org/patch/1040533/

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-25 09:41:14 -08:00
Wei Wang 31954cd8bb tcp: Refactor pingpong code
Instead of using pingpong as a single bit information, we refactor the
code to treat it as a counter. When interactive session is detected,
we set pingpong count to TCP_PINGPONG_THRESH. And when pingpong count
is >= TCP_PINGPONG_THRESH, we consider the session in pingpong mode.

This patch is a pure refactor and sets foundation for the next patch.
This patch itself does not change any pingpong logic.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-27 13:29:43 -08:00
David Ahern c2027d1e17 ipv6: Fix handling of LLA with VRF and sockets bound to VRF
A recent commit allows sockets bound to a VRF to receive ipv6 link local
packets. However, it only works for UDP and worse TCP connection attempts
to the LLA with the only listener bound to the VRF just hang where as
before the client gets a reset and connection refused. Fix by adjusting
ir_iif for LL addresses and packets received through a device enslaved
to a VRF.

Fixes: 6f12fa7755 ("vrf: mark skb for multicast or link-local as enslaved to VRF")
Reported-by: Donald Sharp <sharpd@cumulusnetworks.com>
Cc: Mike Manning <mmanning@vyatta.att-mail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-15 11:36:14 -08:00
Stefano Brivio 32bbd8793f net: Convert protocol error handlers from void to int
We'll need this to handle ICMP errors for tunnels without a sending socket
(i.e. FoU and GUE). There, we might have to look up different types of IP
tunnels, registered as network protocols, before we get a match, so we
want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
inet_protos and inet6_protos. These error codes will be used in the next
patch.

For consistency, return sensible error codes in protocol error handlers
whenever handlers can't handle errors because, even if valid, they don't
match a protocol or any of its states.

This has no effect on existing error handling paths.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-08 17:13:08 -08:00
David Ahern 24b711edfc net/ipv6: Fix linklocal to global address with VRF
Example setup:
    host: ip -6 addr add dev eth1 2001:db8:104::4
           where eth1 is enslaved to a VRF

    switch: ip -6 ro add 2001:db8:104::4/128 dev br1
            where br1 only has an LLA

           ping6 2001:db8:104::4
           ssh   2001:db8:104::4

(NOTE: UDP works fine if the PKTINFO has the address set to the global
address and ifindex is set to the index of eth1 with a destination an
LLA).

For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
L3 master. If it is then return the ifindex from rt6i_idev similar
to what is done for loopback.

For TCP, restore the original tcp_v6_iif definition which is needed in
most places and add a new tcp_v6_iif_l3_slave that considers the
l3_slave variability. This latter check is only needed for socket
lookups.

Fixes: 9ff7438460 ("net: vrf: Handle ipv6 multicast and link-local addresses")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-21 19:31:46 -07:00
Frank van der Linden 4fd44a98ff tcp: verify the checksum of the first data segment in a new connection
commit 079096f103 ("tcp/dccp: install syn_recv requests into ehash
table") introduced an optimization for the handling of child sockets
created for a new TCP connection.

But this optimization passes any data associated with the last ACK of the
connection handshake up the stack without verifying its checksum, because it
calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
directly.  These lower-level processing functions do not do any checksum
verification.

Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
fix this.

Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-14 17:04:41 -07:00
Linus Torvalds 1c8c5a9d38 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add Maglev hashing scheduler to IPVS, from Inju Song.

 2) Lots of new TC subsystem tests from Roman Mashak.

 3) Add TCP zero copy receive and fix delayed acks and autotuning with
    SO_RCVLOWAT, from Eric Dumazet.

 4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
    Brouer.

 5) Add ttl inherit support to vxlan, from Hangbin Liu.

 6) Properly separate ipv6 routes into their logically independant
    components. fib6_info for the routing table, and fib6_nh for sets of
    nexthops, which thus can be shared. From David Ahern.

 7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
    messages from XDP programs. From Nikita V. Shirokov.

 8) Lots of long overdue cleanups to the r8169 driver, from Heiner
    Kallweit.

 9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.

10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.

11) Plumb extack down into fib_rules, from Roopa Prabhu.

12) Add Flower classifier offload support to igb, from Vinicius Costa
    Gomes.

13) Add UDP GSO support, from Willem de Bruijn.

14) Add documentation for eBPF helpers, from Quentin Monnet.

15) Add TLS tx offload to mlx5, from Ilya Lesokhin.

16) Allow applications to be given the number of bytes available to read
    on a socket via a control message returned from recvmsg(), from
    Soheil Hassas Yeganeh.

17) Add x86_32 eBPF JIT compiler, from Wang YanQing.

18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
    From Björn Töpel.

19) Remove indirect load support from all of the BPF JITs and handle
    these operations in the verifier by translating them into native BPF
    instead. From Daniel Borkmann.

20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.

21) Allow XDP programs to do lookups in the main kernel routing tables
    for forwarding. From David Ahern.

22) Allow drivers to store hardware state into an ELF section of kernel
    dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.

23) Various RACK and loss detection improvements in TCP, from Yuchung
    Cheng.

24) Add TCP SACK compression, from Eric Dumazet.

25) Add User Mode Helper support and basic bpfilter infrastructure, from
    Alexei Starovoitov.

26) Support ports and protocol values in RTM_GETROUTE, from Roopa
    Prabhu.

27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
    Brouer.

28) Add lots of forwarding selftests, from Petr Machata.

29) Add generic network device failover driver, from Sridhar Samudrala.

* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
  strparser: Add __strp_unpause and use it in ktls.
  rxrpc: Fix terminal retransmission connection ID to include the channel
  net: hns3: Optimize PF CMDQ interrupt switching process
  net: hns3: Fix for VF mailbox receiving unknown message
  net: hns3: Fix for VF mailbox cannot receiving PF response
  bnx2x: use the right constant
  Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
  net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
  enic: fix UDP rss bits
  netdev-FAQ: clarify DaveM's position for stable backports
  rtnetlink: validate attributes in do_setlink()
  mlxsw: Add extack messages for port_{un, }split failures
  netdevsim: Add extack error message for devlink reload
  devlink: Add extack to reload and port_{un, }split operations
  net: metrics: add proper netlink validation
  ipmr: fix error path when ipmr_new_table fails
  ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
  net: hns3: remove unused hclgevf_cfg_func_mta_filter
  netfilter: provide udp*_lib_lookup for nf_tproxy
  qed*: Utilize FW 8.37.2.0
  ...
2018-06-06 18:39:49 -07:00
Yafang Shao 3d97d88e80 tcp: minor optimization around tcp_hdr() usage in receive path
This is additional to the
commit ea1627c20c ("tcp: minor optimizations around tcp_hdr() usage").
At this point, skb->data is same with tcp_hdr() as tcp header has not
been pulled yet. So use the less expensive one to get the tcp header.

Remove the third parameter of tcp_rcv_established() and put it into
the function body.

Furthermore, the local variables are listed as a reverse christmas tree :)

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-31 13:20:47 -04:00
Christoph Hellwig c350637227 proc: introduce proc_create_net{,_data}
Variants of proc_create{,_data} that directly take a struct seq_operations
and deal with network namespaces in ->open and ->release.  All callers of
proc_create + seq_open_net converted over, and seq_{open,release}_net are
removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:24:30 +02:00
Christoph Hellwig 37d849bb42 ipv{4,6}/tcp: simplify procfs registration
Avoid most of the afinfo indirections and just call the proc helpers
directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16 07:23:35 +02:00
Jon Maxwell 0048369055 tcp: Add mark for TIMEWAIT sockets
This version has some suggestions by Eric Dumazet:

- Use a local variable for the mark in IPv6 instead of ctl_sk to avoid SMP
races.
- Use the more elegant "IP4_REPLY_MARK(net, skb->mark) ?: sk->sk_mark"
statement.
- Factorize code as sk_fullsock() check is not necessary.

Aidan McGurn from Openwave Mobility systems reported the following bug:

"Marked routing is broken on customer deployment. Its effects are large
increase in Uplink retransmissions caused by the client never receiving
the final ACK to their FINACK - this ACK misses the mark and routes out
of the incorrect route."

Currently marks are added to sk_buffs for replies when the "fwmark_reflect"
sysctl is enabled. But not for TW sockets that had sk->sk_mark set via
setsockopt(SO_MARK..).

Fix this in IPv4/v6 by adding tw->tw_mark for TIME_WAIT sockets. Copy the the
original sk->sk_mark in __inet_twsk_hashdance() to the new tw->tw_mark location.
Then progate this so that the skb gets sent with the correct mark. Do the same
for resets. Give the "fwmark_reflect" sysctl precedence over sk->sk_mark so that
netfilter rules are still honored.

Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-10 17:44:52 -04:00
David S. Miller d4069fe6fc Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-03-31

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add raw BPF tracepoint API in order to have a BPF program type that
   can access kernel internal arguments of the tracepoints in their
   raw form similar to kprobes based BPF programs. This infrastructure
   also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which
   returns an anon-inode backed fd for the tracepoint object that allows
   for automatic detach of the BPF program resp. unregistering of the
   tracepoint probe on fd release, from Alexei.

2) Add new BPF cgroup hooks at bind() and connect() entry in order to
   allow BPF programs to reject, inspect or modify user space passed
   struct sockaddr, and as well a hook at post bind time once the port
   has been allocated. They are used in FB's container management engine
   for implementing policy, replacing fragile LD_PRELOAD wrapper
   intercepting bind() and connect() calls that only works in limited
   scenarios like glibc based apps but not for other runtimes in
   containerized applications, from Andrey.

3) BPF_F_INGRESS flag support has been added to sockmap programs for
   their redirect helper call bringing it in line with cls_bpf based
   programs. Support is added for both variants of sockmap programs,
   meaning for tx ULP hooks as well as recv skb hooks, from John.

4) Various improvements on BPF side for the nfp driver, besides others
   this work adds BPF map update and delete helper call support from
   the datapath, JITing of 32 and 64 bit XADD instructions as well as
   offload support of bpf_get_prandom_u32() call. Initial implementation
   of nfp packet cache has been tackled that optimizes memory access
   (see merge commit for further details), from Jakub and Jiong.

5) Removal of struct bpf_verifier_env argument from the print_bpf_insn()
   API has been done in order to prepare to use print_bpf_insn() soon
   out of perf tool directly. This makes the print_bpf_insn() API more
   generic and pushes the env into private data. bpftool is adjusted
   as well with the print_bpf_insn() argument removal, from Jiri.

6) Couple of cleanups and prep work for the upcoming BTF (BPF Type
   Format). The latter will reuse the current BPF verifier log as
   well, thus bpf_verifier_log() is further generalized, from Martin.

7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read
   and write support has been added in similar fashion to existing
   IPv6 IPV6_TCLASS socket option we already have, from Nikita.

8) Fixes in recent sockmap scatterlist API usage, which did not use
   sg_init_table() for initialization thus triggering a BUG_ON() in
   scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and
   uses a small helper sg_init_marker() to properly handle the affected
   cases, from Prashant.

9) Let the BPF core follow IDR code convention and therefore use the
   idr_preload() and idr_preload_end() helpers, which would also help
   idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory
   pressure, from Shaohua.

10) Last but not least, a spelling fix in an error message for the
    BPF cookie UID helper under BPF sample code, from Colin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-31 23:33:04 -04:00
Andrey Ignatov d74bad4e74 bpf: Hooks for sys_connect
== The problem ==

See description of the problem in the initial patch of this patch set.

== The solution ==

The patch provides much more reliable in-kernel solution for the 2nd
part of the problem: making outgoing connecttion from desired IP.

It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
`BPF_CGROUP_INET6_CONNECT` for program type
`BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
source and destination of a connection at connect(2) time.

Local end of connection can be bound to desired IP using newly
introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
and doesn't support binding to port, i.e. leverages
`IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
* looking for a free port is expensive and can affect performance
  significantly;
* there is no use-case for port.

As for remote end (`struct sockaddr *` passed by user), both parts of it
can be overridden, remote IP and remote port. It's useful if an
application inside cgroup wants to connect to another application inside
same cgroup or to itself, but knows nothing about IP assigned to the
cgroup.

Support is added for IPv4 and IPv6, for TCP and UDP.

IPv4 and IPv6 have separate attach types for same reason as sys_bind
hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
when user passes sockaddr_in since it'd be out-of-bound.

== Implementation notes ==

The patch introduces new field in `struct proto`: `pre_connect` that is
a pointer to a function with same signature as `connect` but is called
before it. The reason is in some cases BPF hooks should be called way
before control is passed to `sk->sk_prot->connect`. Specifically
`inet_dgram_connect` autobinds socket before calling
`sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
it'd cause double-bind. On the other hand `proto.pre_connect` provides a
flexible way to add BPF hooks for connect only for necessary `proto` and
call them at desired time before `connect`. Since `bpf_bind()` is
allowed to bind only to IP and autobind in `inet_dgram_connect` binds
only port there is no chance of double-bind.

bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
of value of `bind_address_no_port` socket field.

bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
and __inet6_bind() since all call-sites, where bpf_bind() is called,
already hold socket lock.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-31 02:15:54 +02:00
Kirill Tkhai 2f635ceeb2 net: Drop pernet_operations::async
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-27 13:18:09 -04:00
Kirill Tkhai fef65a2c6c net: Convert tcpv6_net_ops
These pernet_operations create and destroy net::ipv6.tcp_sk
socket, which is used in tcp_v6_send_response() only. It looks
like foreign pernet_operations don't want to set ipv6 connection
inside destroyed net, so this socket may be created in destroyed
in parallel with anything else. inet_twsk_purge() is also safe
for that, as described in patch for tcp_sk_ops. So, it's possible
to mark them as async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-19 14:19:10 -05:00
Eric Dumazet e0f9759f53 tcp: try to keep packet if SYN_RCV race is lost
배석진 reported that in some situations, packets for a given 5-tuple
end up being processed by different CPUS.

This involves RPS, and fragmentation.

배석진 is seeing packet drops when a SYN_RECV request socket is
moved into ESTABLISH state. Other states are protected by socket lock.

This is caused by a CPU losing the race, and simply not caring enough.

Since this seems to occur frequently, we can do better and perform
a second lookup.

Note that all needed memory barriers are already in the existing code,
thanks to the spin_lock()/spin_unlock() pair in inet_ehash_insert()
and reqsk_put(). The second lookup must find the new socket,
unless it has already been accepted and closed by another cpu.

Note that the fragmentation could be avoided in the first place by
use of a correct TCP MSS option in the SYN{ACK} packet, but this
does not mean we can not be more robust.

Many thanks to 배석진 for a very detailed analysis.

Reported-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-14 14:21:45 -05:00
Song Liu 5c487bb9ad tcp: tracepoint: only call trace_tcp_send_reset with full socket
tracepoint tcp_send_reset requires a full socket to work. However, it
may be called when in TCP_TIME_WAIT:

        case TCP_TW_RST:
                tcp_v6_send_reset(sk, skb);
                inet_twsk_deschedule_put(inet_twsk(sk));
                goto discard_it;

To avoid this problem, this patch checks the socket with sk_fullsock()
before calling trace_tcp_send_reset().

Fixes: c24b14c46b ("tcp: add tracepoint trace_tcp_send_reset")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-07 22:00:42 -05:00
Alexey Dobriyan 96890d6252 net: delete /proc THIS_MODULE references
/proc has been ignoring struct file_operations::owner field for 10 years.
Specifically, it started with commit 786d7e1612
("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
inode->i_fop is initialized with proxy struct file_operations for
regular files:

	-               if (de->proc_fops)
	-                       inode->i_fop = de->proc_fops;
	+               if (de->proc_fops) {
	+                       if (S_ISREG(inode->i_mode))
	+                               inode->i_fop = &proc_reg_file_ops;
	+                       else
	+                               inode->i_fop = de->proc_fops;
	+               }

VFS stopped pinning module at this point.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-16 15:01:33 -05:00
David Ahern 54dc3e3324 net: ipv6: Allow connect to linklocal address from socket bound to vrf
Allow a process bound to a VRF to connect to a linklocal address.
Currently, this fails because of a mismatch between the scope of the
linklocal address and the sk_bound_dev_if inherited by the VRF binding:
    $ ssh -6 fe80::70b8:cff:fedd:ead8%eth1
    ssh: connect to host fe80::70b8:cff:fedd:ead8%eth1 port 22: Invalid argument

Relax the scope check to allow the socket to be bound to the same L3
device as the scope id.

This makes ipv6 linklocal consistent with other relaxed checks enabled
by commits 1ff23beebd ("net: l3mdev: Allow send on enslaved interface")
and 7bb387c5ab ("net: Allow IP_MULTICAST_IF to set index to L3 slave").

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-08 14:11:18 -05:00
Yafang Shao 986ffdfd08 net: sock: replace sk_state_load with inet_sk_state_load and remove sk_state_store
sk_state_load is only used by AF_INET/AF_INET6, so rename it to
inet_sk_state_load and move it into inet_sock.h.

sk_state_store is removed as it is not used any more.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 14:00:25 -05:00
Christoph Paasch 30791ac419 tcp md5sig: Use skb's saddr when replying to an incoming segment
The MD5-key that belongs to a connection is identified by the peer's
IP-address. When we are in tcp_v4(6)_reqsk_send_ack(), we are replying
to an incoming segment from tcp_check_req() that failed the seq-number
checks.

Thus, to find the correct key, we need to use the skb's saddr and not
the daddr.

This bug seems to have been there since quite a while, but probably got
unnoticed because the consequences are not catastrophic. We will call
tcp_v4_reqsk_send_ack only to send a challenge-ACK back to the peer,
thus the connection doesn't really fail.

Fixes: 9501f97229 ("tcp md5sig: Let the caller pass appropriate key for tcp_v{4,6}_do_calc_md5_hash().")
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-12 11:15:42 -05:00
Eric Dumazet eeea10b83a tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb()
James Morris reported kernel stack corruption bug [1] while
running the SELinux testsuite, and bisected to a recent
commit bffa72cf7f ("net: sk_buff rbnode reorg")

We believe this commit is fine, but exposes an older bug.

SELinux code runs from tcp_filter() and might send an ICMP,
expecting IP options to be found in skb->cb[] using regular IPCB placement.

We need to defer TCP mangling of skb->cb[] after tcp_filter() calls.

This patch adds tcp_v4_fill_cb()/tcp_v4_restore_cb() in a very
similar way we added them for IPv6.

[1]
[  339.806024] SELinux: failure in selinux_parse_skb(), unable to parse packet
[  339.822505] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81745af5
[  339.822505]
[  339.852250] CPU: 4 PID: 3642 Comm: client Not tainted 4.15.0-rc1-test #15
[  339.868498] Hardware name: LENOVO 10FGS0VA1L/30BC, BIOS FWKT68A   01/19/2017
[  339.885060] Call Trace:
[  339.896875]  <IRQ>
[  339.908103]  dump_stack+0x63/0x87
[  339.920645]  panic+0xe8/0x248
[  339.932668]  ? ip_push_pending_frames+0x33/0x40
[  339.946328]  ? icmp_send+0x525/0x530
[  339.958861]  ? kfree_skbmem+0x60/0x70
[  339.971431]  __stack_chk_fail+0x1b/0x20
[  339.984049]  icmp_send+0x525/0x530
[  339.996205]  ? netlbl_skbuff_err+0x36/0x40
[  340.008997]  ? selinux_netlbl_err+0x11/0x20
[  340.021816]  ? selinux_socket_sock_rcv_skb+0x211/0x230
[  340.035529]  ? security_sock_rcv_skb+0x3b/0x50
[  340.048471]  ? sk_filter_trim_cap+0x44/0x1c0
[  340.061246]  ? tcp_v4_inbound_md5_hash+0x69/0x1b0
[  340.074562]  ? tcp_filter+0x2c/0x40
[  340.086400]  ? tcp_v4_rcv+0x820/0xa20
[  340.098329]  ? ip_local_deliver_finish+0x71/0x1a0
[  340.111279]  ? ip_local_deliver+0x6f/0xe0
[  340.123535]  ? ip_rcv_finish+0x3a0/0x3a0
[  340.135523]  ? ip_rcv_finish+0xdb/0x3a0
[  340.147442]  ? ip_rcv+0x27c/0x3c0
[  340.158668]  ? inet_del_offload+0x40/0x40
[  340.170580]  ? __netif_receive_skb_core+0x4ac/0x900
[  340.183285]  ? rcu_accelerate_cbs+0x5b/0x80
[  340.195282]  ? __netif_receive_skb+0x18/0x60
[  340.207288]  ? process_backlog+0x95/0x140
[  340.218948]  ? net_rx_action+0x26c/0x3b0
[  340.230416]  ? __do_softirq+0xc9/0x26a
[  340.241625]  ? do_softirq_own_stack+0x2a/0x40
[  340.253368]  </IRQ>
[  340.262673]  ? do_softirq+0x50/0x60
[  340.273450]  ? __local_bh_enable_ip+0x57/0x60
[  340.285045]  ? ip_finish_output2+0x175/0x350
[  340.296403]  ? ip_finish_output+0x127/0x1d0
[  340.307665]  ? nf_hook_slow+0x3c/0xb0
[  340.318230]  ? ip_output+0x72/0xe0
[  340.328524]  ? ip_fragment.constprop.54+0x80/0x80
[  340.340070]  ? ip_local_out+0x35/0x40
[  340.350497]  ? ip_queue_xmit+0x15c/0x3f0
[  340.361060]  ? __kmalloc_reserve.isra.40+0x31/0x90
[  340.372484]  ? __skb_clone+0x2e/0x130
[  340.382633]  ? tcp_transmit_skb+0x558/0xa10
[  340.393262]  ? tcp_connect+0x938/0xad0
[  340.403370]  ? ktime_get_with_offset+0x4c/0xb0
[  340.414206]  ? tcp_v4_connect+0x457/0x4e0
[  340.424471]  ? __inet_stream_connect+0xb3/0x300
[  340.435195]  ? inet_stream_connect+0x3b/0x60
[  340.445607]  ? SYSC_connect+0xd9/0x110
[  340.455455]  ? __audit_syscall_entry+0xaf/0x100
[  340.466112]  ? syscall_trace_enter+0x1d0/0x2b0
[  340.476636]  ? __audit_syscall_exit+0x209/0x290
[  340.487151]  ? SyS_connect+0xe/0x10
[  340.496453]  ? do_syscall_64+0x67/0x1b0
[  340.506078]  ? entry_SYSCALL64_slow_path+0x25/0x25

Fixes: 971f10eca1 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: James Morris <james.l.morris@oracle.com>
Tested-by: James Morris <james.l.morris@oracle.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-03 12:39:14 -05:00
Eric Dumazet 3016dad75b tcp: remove buggy call to tcp_v6_restore_cb()
tcp_v6_send_reset() expects to receive an skb with skb->cb[] layout as
used in TCP stack.
MD5 lookup uses tcp_v6_iif() and tcp_v6_sdif() and thus
TCP_SKB_CB(skb)->header.h6

This patch probably fixes RST packets sent on behalf of a timewait md5
ipv6 socket.

Before Florian patch, tcp_v6_restore_cb() was needed before jumping to
no_tcp_socket label.

Fixes: 271c3b9b7b ("tcp: honour SO_BINDTODEVICE for TW_RST case too")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-30 10:21:44 -05:00
Eric Dumazet 356d1833b6 tcp: Namespace-ify sysctl_tcp_rmem and sysctl_tcp_wmem
Note that when a new netns is created, it inherits its
sysctl_tcp_rmem and sysctl_tcp_wmem from initial netns.

This change is needed so that we can refine TCP rcvbuf autotuning,
to take RTT into consideration.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10 14:34:58 +09:00
Song Liu c24b14c46b tcp: add tracepoint trace_tcp_send_reset
New tracepoint trace_tcp_send_reset is added and called from
tcp_v4_send_reset(), tcp_v6_send_reset() and tcp_send_active_reset().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-24 01:21:25 +01:00
Gustavo A. R. Silva 275757e6ba ipv6: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Notice that in some cases I placed the "fall through" comment
on its own line, which is what GCC is expecting to find.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 14:13:08 +01:00
Eric Dumazet 1f3b359f10 tcp: fix a request socket leak
While the cited commit fixed a possible deadlock, it added a leak
of the request socket, since reqsk_put() must be called if the BPF
filter decided the ACK packet must be dropped.

Fixes: d624d276d1 ("tcp: fix possible deadlock in TCP stack vs BPF filter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-08 16:07:17 -07:00
David Ahern a8e3bb347d net: Add comment that early_demux can change via sysctl
Twice patches trying to constify inet{6}_protocol have been reverted:
39294c3df2 ("Revert "ipv6: constify inet6_protocol structures"") to
revert 3a3a4e3054 and then 03157937fe ("Revert "ipv4: make
net_protocol const"") to revert aa8db499ea.

Add a comment that the structures can not be const because the
early_demux field can change based on a sysctl.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28 15:17:29 -07:00
Mike Maloney 98aaa913b4 tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg
When SOF_TIMESTAMPING_RX_SOFTWARE is enabled for tcp sockets, return the
timestamp corresponding to the highest sequence number data returned.

Previously the skb->tstamp is overwritten when a TCP packet is placed
in the out of order queue.  While the packet is in the ooo queue, save the
timestamp in the TCB_SKB_CB.  This space is shared with the gso_*
options which are only used on the tx path, and a previously unused 4
byte hole.

When skbs are coalesced either in the sk_receive_queue or the
out_of_order_queue always choose the timestamp of the appended skb to
maintain the invariant of returning the timestamp of the last byte in
the recvmsg buffer.

Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23 20:30:47 -07:00
David S. Miller 463910e2df Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-15 20:23:23 -07:00
Eric Dumazet d624d276d1 tcp: fix possible deadlock in TCP stack vs BPF filter
Filtering the ACK packet was not put at the right place.

At this place, we already allocated a child and put it
into accept queue.

We absolutely need to call tcp_child_process() to release
its spinlock, or we will deadlock at accept() or close() time.

Found by syzkaller team (Thanks a lot !)

Fixes: 8fac365f63 ("tcp: Add a tcp_filter hook before handle ack packet")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-14 22:31:27 -07:00
David Ahern 4297a0ef08 net: ipv6: add second dif to inet6 socket lookups
Add a second device index, sdif, to inet6 socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after tcp_v4_rcv
tcp_v4_sdif is used.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07 11:39:22 -07:00
Julia Lawall 39294c3df2 Revert "ipv6: constify inet6_protocol structures"
This reverts commit 3a3a4e3054.

inet6_add_protocol and inet6_del_protocol include casts that remove the
effect of the const annotation on their parameter, leading to possible
runtime crashes.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 10:03:17 -07:00
Florian Westphal e7942d0633 tcp: remove prequeue support
prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.

This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.

In practice, this doesn't happen anymore that often because nowadays
servers tend to use an event driven (epoll) model.

Even normal client applications (web browsers) commonly use many tcp
connections in parallel.

This has measureable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%), however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:49 -07:00
Julia Lawall 3a3a4e3054 ipv6: constify inet6_protocol structures
The inet6_protocol structure is only passed as the first argument to
inet6_add_protocol or inet6_del_protocol, both of which are declared as
const.  Thus the inet6_protocol structure itself can be const.

Also drop __read_mostly where present on the newly const structures.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-28 18:52:56 -07:00
Matvejchikov Ilya e42e24c3cc tcp: remove redundant argument from tcp_rcv_established()
The last (4th) argument of tcp_rcv_established() is redundant as it
always equals to skb->len and the skb itself is always passed as 2th
agrument. There is no reason to have it.

Signed-off-by: Ilya V. Matveychikov <matvejchikov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 17:28:12 -07:00
Reshetova, Elena 41c6d650f6 net: convert sock.sk_refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena 633547973f net: convert sk_buff.users from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:07 -07:00
Chenbo Feng 8fac365f63 tcp: Add a tcp_filter hook before handle ack packet
Currently in both ipv4 and ipv6 code path, the ack packet received when
sk at TCP_NEW_SYN_RECV state is not filtered by socket filter or cgroup
filter since it is handled from tcp_child_process and never reaches the
tcp_filter inside tcp_v4_rcv or tcp_v6_rcv. Adding a tcp_filter hooks
here can make sure all the ingress tcp packet can be correctly filtered.

Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 11:13:56 -04:00
Ivan Delalande 8917a777be tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix
Replace first padding in the tcp_md5sig structure with a new flag field
and address prefix length so it can be specified when configuring a new
key for TCP MD5 signature. The tcpm_flags field will only be used if the
socket option is TCP_MD5SIG_EXT to avoid breaking existing programs, and
tcpm_prefixlen only when the TCP_MD5SIG_FLAG_PREFIX flag is set.

Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: Eric Mowat <mowat@arista.com>
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-19 13:51:34 -04:00
Ivan Delalande 6797318e62 tcp: md5: add an address prefix for key lookup
This allows the keys used for TCP MD5 signature to be used for whole
range of addresses, specified with a prefix length, instead of only one
address as it currently is.

Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: Eric Mowat <mowat@arista.com>
Signed-off-by: Ivan Delalande <colona@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-19 13:50:55 -04:00
Johannes Berg d58ff35122 networking: make skb_push & __skb_push return void pointers
It seems like a historic accident that these return unsigned char *,
and in many places that means casts are required, more often than not.

Make these functions return void * and remove all the casts across
the tree, adding a (u8 *) cast only where the unsigned char pointer
was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    @@
    expression SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - fn(SKB, LEN)[0]
    + *(u8 *)fn(SKB, LEN)

Note that the last part there converts from push(...)[0] to the
more idiomatic *(u8 *)push(...).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-16 11:48:40 -04:00
Chenbo Feng 384abed1fe bpf: Remove duplicate tcp_filter hook in ipv6
There are two tcp_filter hooks in tcp_ipv6 ingress path currently.
One is at tcp_v6_rcv and another is in tcp_v6_do_rcv. It seems the
tcp_filter() call inside tcp_v6_do_rcv is redundent and some packet
will be filtered twice in this situation. This will cause trouble
when using eBPF filters to account traffic data.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10 16:08:02 -04:00
Eric Dumazet 0604475119 tcp: add TCPMemoryPressuresChrono counter
DRAM supply shortage and poor memory pressure tracking in TCP
stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
limits) and tcp_mem[] quite hazardous.

TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
limits being hit, but only tracking number of transitions.

If TCP stack behavior under stress was perfect :
1) It would maintain memory usage close to the limit.
2) Memory pressure state would be entered for short times.

We certainly prefer 100 events lasting 10ms compared to one event
lasting 200 seconds.

This patch adds a new SNMP counter tracking cumulative duration of
memory pressure events, given in ms units.

$ cat /proc/sys/net/ipv4/tcp_mem
3088    4117    6176
$ grep TCP /proc/net/sockstat
TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
$ nstat -n ; sleep 10 ; nstat |grep Pressure
TcpExtTCPMemoryPressures        1700
TcpExtTCPMemoryPressuresChrono  5209

v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
instructed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 11:26:19 -04:00
Eric Dumazet 5d2ed0521a tcp: Namespaceify sysctl_tcp_timestamps
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-08 10:53:29 -04:00
Eric Dumazet 9a568de481 tcp: switch TCP TS option (RFC 7323) to 1ms clock
TCP Timestamps option is defined in RFC 7323

Traditionally on linux, it has been tied to the internal
'jiffies' variable, because it had been a cheap and good enough
generator.

For TCP flows on the Internet, 1 ms resolution would be much better
than 4ms or 10ms (HZ=250 or HZ=100 respectively)

For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1]

Receive size autotuning (DRS) is indeed more precise and converges
faster to optimal window size.

This patch converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.

This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.

[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 16:06:01 -04:00
Linus Torvalds a95cfad947 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Track alignment in BPF verifier so that legitimate programs won't be
    rejected on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.

 2) Make tail calls work properly in arm64 BPF JIT, from Deniel
    Borkmann.

 3) Make the configuration and semantics Generic XDP make more sense and
    don't allow both generic XDP and a driver specific instance to be
    active at the same time. Also from Daniel.

 4) Don't crash on resume in xen-netfront, from Vitaly Kuznetsov.

 5) Fix use-after-free in VRF driver, from Gao Feng.

 6) Use netdev_alloc_skb_ip_align() to avoid unaligned IP headers in
    qca_spi driver, from Stefan Wahren.

 7) Always run cleanup routines in BPF samples when we get SIGTERM, from
    Andy Gospodarek.

 8) The mdio phy code should bring PHYs out of reset using the shared
    GPIO lines before invoking bus->reset(). From Florian Fainelli.

 9) Some USB descriptor access endian fixes in various drivers from
    Johan Hovold.

10) Handle PAUSE advertisements properly in mlx5 driver, from Gal
    Pressman.

11) Fix reversed test in mlx5e_setup_tc(), from Saeed Mahameed.

12) Cure netdev leak in AF_PACKET when using timestamping via control
    messages. From Douglas Caetano dos Santos.

13) netcp doesn't support HWTSTAMP_FILTER_ALl, reject it. From Miroslav
    Lichvar.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  ldmvsw: stop the clean timer at beginning of remove
  ldmvsw: unregistering netdev before disable hardware
  net: netcp: fix check of requested timestamping filter
  ipv6: avoid dad-failures for addresses with NODAD
  qed: Fix uninitialized data in aRFS infrastructure
  mdio: mux: fix device_node_continue.cocci warnings
  net/packet: fix missing net_device reference release
  net/mlx4_core: Use min3 to select number of MSI-X vectors
  macvlan: Fix performance issues with vlan tagged packets
  net: stmmac: use correct pointer when printing normal descriptor ring
  net/mlx5: Use underlay QPN from the root name space
  net/mlx5e: IPoIB, Only support regular RQ for now
  net/mlx5e: Fix setup TC ndo
  net/mlx5e: Fix ethtool pause support and advertise reporting
  net/mlx5e: Use the correct pause values for ethtool advertising
  vmxnet3: ensure that adapter is in proper state during force_close
  sfc: revert changes to NIC revision numbers
  net: ch9200: add missing USB-descriptor endianness conversions
  net: irda: irda-usb: fix firmware name on big-endian hosts
  net: dsa: mv88e6xxx: add default case to switch
  ...
2017-05-15 15:50:49 -07:00
WANG Cong 83eaddab43 ipv6/dccp: do not inherit ipv6_mc_list from parent
Like commit 657831ffc3 ("dccp/tcp: do not inherit mc_list from parent")
we should clear ipv6_mc_list etc. for IPv6 sockets too.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 12:17:02 -04:00
Linus Torvalds de4d195308 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes are:

   - Debloat RCU headers

   - Parallelize SRCU callback handling (plus overlapping patches)

   - Improve the performance of Tree SRCU on a CPU-hotplug stress test

   - Documentation updates

   - Miscellaneous fixes"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
  rcu: Open-code the rcu_cblist_n_lazy_cbs() function
  rcu: Open-code the rcu_cblist_n_cbs() function
  rcu: Open-code the rcu_cblist_empty() function
  rcu: Separately compile large rcu_segcblist functions
  srcu: Debloat the <linux/rcu_segcblist.h> header
  srcu: Adjust default auto-expediting holdoff
  srcu: Specify auto-expedite holdoff time
  srcu: Expedite first synchronize_srcu() when idle
  srcu: Expedited grace periods with reduced memory contention
  srcu: Make rcutorture writer stalls print SRCU GP state
  srcu: Exact tracking of srcu_data structures containing callbacks
  srcu: Make SRCU be built by default
  srcu: Fix Kconfig botch when SRCU not selected
  rcu: Make non-preemptive schedule be Tasks RCU quiescent state
  srcu: Expedite srcu_schedule_cbs_snp() callback invocation
  srcu: Parallelize callback handling
  kvm: Move srcu_struct fields to end of struct kvm
  rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
  rcu: Use true/false in assignment to bool
  rcu: Use bool value directly
  ...
2017-05-10 10:30:46 -07:00
Eric Dumazet 84b114b984 tcp: randomize timestamps on syncookies
Whole point of randomization was to hide server uptime, but an attacker
can simply start a syn flood and TCP generates 'old style' timestamps,
directly revealing server jiffies value.

Also, TSval sent by the server to a particular remote address vary
depending on syncookies being sent or not, potentially triggering PAWS
drops for innocent clients.

Lets implement proper randomization, including for SYNcookies.

Also we do not need to export sysctl_tcp_timestamps, since it is not
used from a module.

In v2, I added Florian feedback and contribution, adding tsoff to
tcp_get_cookie_sock().

v3 removed one unused variable in tcp_v4_connect() as Florian spotted.

Fixes: 95a22caee3 ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-05 12:00:11 -04:00
Ingo Molnar 58d30c36d4 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Documentation updates.

 - Miscellaneous fixes.

 - Parallelize SRCU callback handling (plus overlapping patches).

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23 11:12:44 +02:00
Paul E. McKenney 5f0d5a3ae7 mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU
A group of Linux kernel hackers reported chasing a bug that resulted
from their assumption that SLAB_DESTROY_BY_RCU provided an existence
guarantee, that is, that no block from such a slab would be reallocated
during an RCU read-side critical section.  Of course, that is not the
case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
slab of blocks.

However, there is a phrase for this, namely "type safety".  This commit
therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
to avoid future instances of this sort of confusion.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
[ paulmck: Add comments mentioning the old name, as requested by Eric
  Dumazet, in order to help people familiar with the old name find
  the new one. ]
Acked-by: David Rientjes <rientjes@google.com>
2017-04-18 11:42:36 -07:00
Alexander Duyck e5907459ce tcp: Record Rx hash and NAPI ID in tcp_child_process
While working on some recent busy poll changes we found that child sockets
were being instantiated without NAPI ID being set.  In our first attempt to
fix it, it was suggested that we should just pull programming the NAPI ID
into the function itself since all callers will need to have it set.

In addition to the NAPI ID change I have dropped the code that was
populating the Rx hash since it was actually being populated in
tcp_get_cookie_sock.

Reported-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 20:49:30 -07:00
subashab@codeaurora.org dddb64bcb3 net: Add sysctl to toggle early demux for tcp and udp
Certain system process significant unconnected UDP workload.
It would be preferrable to disable UDP early demux for those systems
and enable it for TCP only.

By disabling UDP demux, we see these slight gains on an ARM64 system-
782 -> 788Mbps unconnected single stream UDPv4
633 -> 654Mbps unconnected UDPv4 different sources

The performance impact can change based on CPU architecure and cache
sizes. There will not much difference seen if entire UDP hash table
is in cache.

Both sysctls are enabled by default to preserve existing behavior.

v1->v2: Change function pointer instead of adding conditional as
suggested by Stephen.

v2->v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.

v3->v4: Store and use read once result instead of querying pointer
again incorrectly.

v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Tom Herbert <tom@herbertland.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 13:17:07 -07:00
Soheil Hassas Yeganeh 4396e46187 tcp: remove tcp_tw_recycle
The tcp_tw_recycle was already broken for connections
behind NAT, since the per-destination timestamp is not
monotonically increasing for multiple machines behind
a single destination address.

After the randomization of TCP timestamp offsets
in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
for each connection), the tcp_tw_recycle is broken for all
types of connections for the same reason: the timestamps
received from a single machine is not monotonically increasing,
anymore.

Remove tcp_tw_recycle, since it is not functional. Also, remove
the PAWSPassive SNMP counter since it is only used for
tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
since the strict argument is only set when tcp_tw_recycle is
enabled.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Lutz Vieweg <lvml@5t9.de>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:33:56 -07:00
Soheil Hassas Yeganeh d82bae12dc tcp: remove per-destination timestamp cache
Commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets for each connection)
randomizes TCP timestamps per connection. After this commit,
there is no guarantee that the timestamps received from the
same destination are monotonically increasing. As a result,
the per-destination timestamp cache in TCP metrics (i.e., tcpm_ts
in struct tcp_metrics_block) is broken and cannot be relied upon.

Remove the per-destination timestamp cache and all related code
paths.

Note that this cache was already broken for caching timestamps of
multiple machines behind a NAT sharing the same address.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Lutz Vieweg <lvml@5t9.de>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-16 20:33:56 -07:00
David S. Miller 101c431492 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/genet/bcmgenet.c
	net/core/sock.c

Conflicts were overlapping changes in bcmgenet and the
lockdep handling of sockets.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-15 11:59:10 -07:00
Jon Maxwell 45caeaa5ac dccp/tcp: fix routing redirect race
As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320610 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-13 21:55:47 -07:00
Alexey Kodanev a30aad50c2 tcp: rename *_sequence_number() to *_seq_and_tsoff()
The functions that are returning tcp sequence number also setup
TS offset value, so rename them to better describe their purpose.

No functional changes in this patch.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:25:34 -08:00
Alexey Kodanev 00355fa5bb tcp: setup timestamp offset when write_seq already set
Found that when randomized tcp offsets are enabled (by default)
TCP client can still start new connections without them. Later,
if server does active close and re-uses sockets in TIME-WAIT
state, new SYN from client can be rejected on PAWS check inside
tcp_timewait_state_process(), because either tw_ts_recent or
rcv_tsval doesn't really have an offset set.

Here is how to reproduce it with LTP netstress tool:
    netstress -R 1 &
    netstress -H 127.0.0.1 -lr 1000000 -a1

    [...]
    < S  seq 1956977072 win 43690 TS val 295618 ecr 459956970
    > .  ack 1956911535 win 342 TS val 459967184 ecr 1547117608
    < R  seq 1956911535 win 0 length 0
+1. < S  seq 1956977072 win 43690 TS val 296640 ecr 459956970
    > S. seq 657450664 ack 1956977073 win 43690 TS val 459968205 ecr 296640

Fixes: 95a22caee3 ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-22 16:35:32 -05:00
David S. Miller 3f64116a83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-16 19:34:01 -05:00
Jonathan T. Leighton 052d2369d1 ipv6: Handle IPv4-mapped src to in6addr_any dst.
This patch adds a check on the type of the source address for the case
where the destination address is in6addr_any. If the source is an
IPv4-mapped IPv6 source address, the destination is changed to
::ffff:127.0.0.1, and otherwise the destination is changed to ::1. This
is done in three locations to handle UDP calls to either connect() or
sendmsg() and TCP calls to connect(). Note that udpv6_sendmsg() delays
handling an in6addr_any destination until very late, so the patch only
needs to handle the case where the source is an IPv4-mapped IPv6
address.

Signed-off-by: Jonathan T. Leighton <jtleight@udel.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-14 12:13:51 -05:00