Commit Graph

6323 Commits

Author SHA1 Message Date
Alexis Bauvin da5095d052 udp_tunnel: add config option to bind to a device
UDP tunnel sockets are always opened unbound to a specific device. This
patch allow the socket to be bound on a custom device, which
incidentally makes UDP tunnels VRF-aware if binding to an l3mdev.

Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 14:15:26 -08:00
David S. Miller e561bb29b6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Trivial conflict in net/core/filter.c, a locally computed
'sdif' is now an argument to the function.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-28 22:10:54 -08:00
David S. Miller e9d8faf93d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Disable BH while holding list spinlock in nf_conncount, from
   Taehee Yoo.

2) List corruption in nf_conncount, also from Taehee.

3) Fix race that results in leaving around an empty list node in
   nf_conncount, from Taehee Yoo.

4) Proper chain handling for inactive chains from the commit path,
   from Florian Westphal. This includes a selftest for this.

5) Do duplicate rule handles when replacing rules, also from Florian.

6) Remove net_exit path in xt_RATEEST that results in splat, from Taehee.

7) Possible use-after-free in nft_compat when releasing extensions.
   From Florian.

8) Memory leak in xt_hashlimit, from Taehee.

9) Call ip_vs_dst_notifier after ipv6_dev_notf, from Xin Long.

10) Fix cttimeout with udplite and gre, from Florian.

11) Preserve oif for IPv6 link-local generated traffic from mangle
    table, from Alin Nastac.

12) Missing error handling in masquerade notifiers, from Taehee Yoo.

13) Use mutex to protect registration/unregistration of masquerade
    extensions in order to prevent a race, from Taehee.

14) Incorrect condition check in tree_nodes_free(), also from Taehee.

15) Fix chain counter leak in rule replacement path, from Taehee.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-28 11:02:45 -08:00
Taehee Yoo 095faf45e6 netfilter: nat: fix double register in masquerade modules
There is a reference counter to ensure that masquerade modules register
notifiers only once. However, the existing reference counter approach is
not safe, test commands are:

   while :
   do
   	   modprobe ip6t_MASQUERADE &
	   modprobe nft_masq_ipv6 &
	   modprobe -rv ip6t_MASQUERADE &
	   modprobe -rv nft_masq_ipv6 &
   done

numbers below represent the reference counter.
--------------------------------------------------------
CPU0        CPU1        CPU2        CPU3        CPU4
[insmod]    [insmod]    [rmmod]     [rmmod]     [insmod]
--------------------------------------------------------
0->1
register    1->2
            returns     2->1
			returns     1->0
                                                0->1
                                                register <--
                                    unregister
--------------------------------------------------------

The unregistation of CPU3 should be processed before the
registration of CPU4.

In order to fix this, use a mutex instead of reference counter.

splat looks like:
[  323.869557] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:1381]
[  323.869574] Modules linked in: nf_tables(+) nf_nat_ipv6(-) nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 n]
[  323.869574] irq event stamp: 194074
[  323.898930] hardirqs last  enabled at (194073): [<ffffffff90004a0d>] trace_hardirqs_on_thunk+0x1a/0x1c
[  323.898930] hardirqs last disabled at (194074): [<ffffffff90004a29>] trace_hardirqs_off_thunk+0x1a/0x1c
[  323.898930] softirqs last  enabled at (182132): [<ffffffff922006ec>] __do_softirq+0x6ec/0xa3b
[  323.898930] softirqs last disabled at (182109): [<ffffffff90193426>] irq_exit+0x1a6/0x1e0
[  323.898930] CPU: 0 PID: 1381 Comm: modprobe Not tainted 4.20.0-rc2+ #27
[  323.898930] RIP: 0010:raw_notifier_chain_register+0xea/0x240
[  323.898930] Code: 3c 03 0f 8e f2 00 00 00 44 3b 6b 10 7f 4d 49 bc 00 00 00 00 00 fc ff df eb 22 48 8d 7b 10 488
[  323.898930] RSP: 0018:ffff888101597218 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[  323.898930] RAX: 0000000000000000 RBX: ffffffffc04361c0 RCX: 0000000000000000
[  323.898930] RDX: 1ffffffff26132ae RSI: ffffffffc04aa3c0 RDI: ffffffffc04361d0
[  323.898930] RBP: ffffffffc04361c8 R08: 0000000000000000 R09: 0000000000000001
[  323.898930] R10: ffff8881015972b0 R11: fffffbfff26132c4 R12: dffffc0000000000
[  323.898930] R13: 0000000000000000 R14: 1ffff110202b2e44 R15: ffffffffc04aa3c0
[  323.898930] FS:  00007f813ed41540(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
[  323.898930] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  323.898930] CR2: 0000559bf2c9f120 CR3: 000000010bc80000 CR4: 00000000001006f0
[  323.898930] Call Trace:
[  323.898930]  ? atomic_notifier_chain_register+0x2d0/0x2d0
[  323.898930]  ? down_read+0x150/0x150
[  323.898930]  ? sched_clock_cpu+0x126/0x170
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  323.898930]  register_netdevice_notifier+0xbb/0x790
[  323.898930]  ? __dev_close_many+0x2d0/0x2d0
[  323.898930]  ? __mutex_unlock_slowpath+0x17f/0x740
[  323.898930]  ? wait_for_completion+0x710/0x710
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  323.898930]  ? up_write+0x6c/0x210
[  323.898930]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  324.127073]  ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables]
[  324.127073]  nft_chain_filter_init+0x1e/0xe8a [nf_tables]
[  324.127073]  nf_tables_module_init+0x37/0x92 [nf_tables]
[ ... ]

Fixes: 8dd33cc93e ("netfilter: nf_nat: generalize IPv4 masquerading support for nf_tables")
Fixes: be6b635cd6 ("netfilter: nf_nat: generalize IPv6 masquerading support for nf_tables")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-27 00:36:46 +01:00
Taehee Yoo 584eab291c netfilter: add missing error handling code for register functions
register_{netdevice/inetaddr/inet6addr}_notifier may return an error
value, this patch adds the code to handle these error paths.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-27 00:35:19 +01:00
Alin Nastac 508b09046c netfilter: ipv6: Preserve link scope traffic original oif
When ip6_route_me_harder is invoked, it resets outgoing interface of:
  - link-local scoped packets sent by neighbor discovery
  - multicast packets sent by MLD host
  - multicast packets send by MLD proxy daemon that sets outgoing
    interface through IPV6_PKTINFO ipi6_ifindex

Link-local and multicast packets must keep their original oif after
ip6_route_me_harder is called.

Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-27 00:12:20 +01:00
Willem de Bruijn aba36930a3 net: always initialize pagedlen
In ip packet generation, pagedlen is initialized for each skb at the
start of the loop in __ip(6)_append_data, before label alloc_new_skb.

Depending on compiler options, code can be generated that jumps to
this label, triggering use of an an uninitialized variable.

In practice, at -O2, the generated code moves the initialization below
the label. But the code should not rely on that for correctness.

Fixes: 15e36f5b8e ("udp: paged allocation with gso")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-24 17:42:57 -08:00
David S. Miller b1bf78bfb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-24 17:01:43 -08:00
Hangbin Liu 896585d48e net/ipv6: re-do dad when interface has IFF_NOARP flag change
When we add a new IPv6 address, we should also join corresponding solicited-node
multicast address, unless the interface has IFF_NOARP flag, as function
addrconf_join_solict() did. But if we remove IFF_NOARP flag later, we do
not do dad and add the mcast address. So we will drop corresponding neighbour
discovery message that came from other nodes.

A typical example is after creating a ipvlan with mode l3, setting up an ipv6
address and changing the mode to l2. Then we will not be able to ping this
address as the interface doesn't join related solicited-node mcast address.

Fix it by re-doing dad when interface changed IFF_NOARP flag. Then we will add
corresponding mcast group and check if there is a duplicate address on the
network.

Reported-by: Jianlin Shi <jishi@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23 11:17:56 -08:00
David S. Miller f2be6d710d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-19 10:55:00 -08:00
David Ahern 7ddacfa564 ipv6: Fix PMTU updates for UDP/raw sockets in presence of VRF
Preethi reported that PMTU discovery for UDP/raw applications is not
working in the presence of VRF when the socket is not bound to a device.
The problem is that ip6_sk_update_pmtu does not consider the L3 domain
of the skb device if the socket is not bound. Update the function to
set oif to the L3 master device if relevant.

Fixes: ca254490c8 ("net: Add VRF support to IPv6 stack")
Reported-by: Preethi Ramachandra <preethir@juniper.net>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-18 19:02:56 -08:00
Paolo Abeni 9c48060141 udp: fix jump label misuse
The commit 60fb9567bf ("udp: implement complete book-keeping for
encap_needed") introduced a severe misuse of jump label APIs, which
syzbot, as reported by Eric, was able to exploit.

When multiple sockets/process can concurrently request (and than
disable) the udp encap, we need to track the activation counter with
*_inc()/*_dec() jump label variants, or we can experience bad things
at disable time.

Fixes: 60fb9567bf ("udp: implement complete book-keeping for encap_needed")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16 23:01:56 -08:00
Xin Long 761f60261b ipv6: fix a dst leak when removing its exception
These is no need to hold dst before calling rt6_remove_exception_rt().
The call to dst_hold_safe() in ip6_link_failure() was for ip6_del_rt(),
which has been removed in Commit 93531c6743 ("net/ipv6: separate
handling of FIB entries from dst based routes"). Otherwise, it will
cause a dst leak.

This patch is to simply remove the dst_hold_safe() call before calling
rt6_remove_exception_rt() and also do the same in ip6_del_cached_rt().
It's safe, because the removal of the exception that holds its dst's
refcnt is protected by rt6_exception_lock.

Fixes: 93531c6743 ("net/ipv6: separate handling of FIB entries from dst based routes")
Fixes: 23fb93a4d3 ("net/ipv6: Cleanup exception and cache route handling")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16 19:44:15 -08:00
Paolo Abeni 029a374348 udp6: cleanup stats accounting in recvmsg()
In the udp6 code path, we needed multiple tests to select the correct
mib to be updated. Since we touch at least a counter at each iteration,
it's convenient to use the recently introduced __UDPX_MIB() helper once
and remove some code duplication.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-09 20:07:05 -08:00
Stefano Brivio b8a51b38e4 fou, fou6: ICMP error handlers for FoU and GUE
As the destination port in FoU and GUE receiving sockets doesn't
necessarily match the remote destination port, we can't associate errors
to the encapsulating tunnels with a socket lookup -- we need to blindly
try them instead. This means we don't even know if we are handling errors
for FoU or GUE without digging into the packets.

Hence, implement a single handler for both, one for IPv4 and one for IPv6,
that will check whether the packet that generated the ICMP error used a
direct IP encapsulation or if it had a GUE header, and send the error to
the matching protocol handler, if any.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-08 17:13:08 -08:00
Stefano Brivio e7cc082455 udp: Support for error handlers of tunnels with arbitrary destination port
ICMP error handling is currently not possible for UDP tunnels not
employing a receiving socket with local destination port matching the
remote one, because we have no way to look them up.

Add an err_handler tunnel encapsulation operation that can be exported by
tunnels in order to pass the error to the protocol implementing the
encapsulation. We can't easily use a lookup function as we did for VXLAN
and GENEVE, as protocol error handlers, which would be in turn called by
implementations of this new operation, handle the errors themselves,
together with the tunnel lookup.

Without a socket, we can't be sure which encapsulation error handler is
the appropriate one: encapsulation handlers (the ones for FoU and GUE
introduced in the next patch, e.g.) will need to check the new error codes
returned by protocol handlers to figure out if errors match the given
encapsulation, and, in turn, report this error back, so that we can try
all of them in __udp{4,6}_lib_err_encap_no_sk() until we have a match.

v2:
- Name all arguments in err_handler prototypes (David Miller)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-08 17:13:08 -08:00
Stefano Brivio 32bbd8793f net: Convert protocol error handlers from void to int
We'll need this to handle ICMP errors for tunnels without a sending socket
(i.e. FoU and GUE). There, we might have to look up different types of IP
tunnels, registered as network protocols, before we get a match, so we
want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
inet_protos and inet6_protos. These error codes will be used in the next
patch.

For consistency, return sensible error codes in protocol error handlers
whenever handlers can't handle errors because, even if valid, they don't
match a protocol or any of its states.

This has no effect on existing error handling paths.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-08 17:13:08 -08:00
Stefano Brivio a36e185e8c udp: Handle ICMP errors for tunnels with same destination port on both endpoints
For both IPv4 and IPv6, if we can't match errors to a socket, try
tunnels before ignoring them. Look up a socket with the original source
and destination ports as found in the UDP packet inside the ICMP payload,
this will work for tunnels that force the same destination port for both
endpoints, i.e. VXLAN and GENEVE.

Actually, lwtunnels could break this assumption if they are configured by
an external control plane to have different destination ports on the
endpoints: in this case, we won't be able to trace ICMP messages back to
them.

For IPv6 redirect messages, call ip6_redirect() directly with the output
interface argument set to the interface we received the packet from (as
it's the very interface we should build the exception on), otherwise the
new nexthop will be rejected. There's no such need for IPv4.

Tunnels can now export an encap_err_lookup() operation that indicates a
match. Pass the packet to the lookup function, and if the tunnel driver
reports a matching association, continue with regular ICMP error handling.

v2:
- Added newline between network and transport header sets in
  __udp{4,6}_lib_err_encap() (David Miller)
- Removed redundant skb_reset_network_header(skb); in
  __udp4_lib_err_encap()
- Removed redundant reassignment of iph in __udp4_lib_err_encap()
  (Sabrina Dubroca)
- Edited comment to __udp{4,6}_lib_err_encap() to reflect the fact this
  won't work with lwtunnels configured to use asymmetric ports. By the way,
  it's VXLAN, not VxLAN (Jiri Benc)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-08 17:13:08 -08:00
Li RongQing 1c51dc9ad6 net/ipv6: compute anycast address hash only if dev is null
avoid to compute the hash value if dev is not null, since
hash value is not used

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-08 17:04:43 -08:00
Paolo Abeni cf329aa42b udp: cope with UDP GRO packet misdirection
In some scenarios, the GRO engine can assemble an UDP GRO packet
that ultimately lands on a non GRO-enabled socket.
This patch tries to address the issue explicitly checking for the UDP
socket features before enqueuing the packet, and eventually segmenting
the unexpected GRO packet, as needed.

We must also cope with re-insertion requests: after segmentation the
UDP code calls the helper introduced by the previous patches, as needed.

Segmentation is performed by a common helper, which takes care of
updating socket and protocol stats is case of failure.

rfc v3 -> v1
 - fix compile issues with rxrpc
 - when gso_segment returns NULL, treat is as an error
 - added 'ipv4' argument to udp_rcv_segment()

rfc v2 -> rfc v3
 - moved udp_rcv_segment() into net/udp.h, account errors to socket
   and ns, always return NULL or segs list

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:23:05 -08:00
Paolo Abeni 80bde363f9 ipv6: factor out protocol delivery helper
So that we can re-use it at the UDP level in the next patch

rfc v3 -> v1:
 - add the helper declaration into the ipv6 header

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:23:05 -08:00
Paolo Abeni bcd1665e35 udp: add support for UDP_GRO cmsg
When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress
datagram size. User-space can use such info to compute the original
packets layout.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:23:04 -08:00
Paolo Abeni e20cf8d3f1 udp: implement GRO for plain UDP sockets.
This is the RX counterpart of commit bec1f6f697 ("udp: generate gso
with UDP_SEGMENT"). When UDP_GRO is enabled, such socket is also
eligible for GRO in the rx path: UDP segments directed to such socket
are assembled into a larger GSO_UDP_L4 packet.

The core UDP GRO support is enabled with setsockopt(UDP_GRO).

Initial benchmark numbers:

Before:
udp rx:   1079 MB/s   769065 calls/s

After:
udp rx:   1466 MB/s    24877 calls/s

This change introduces a side effect in respect to UDP tunnels:
after a UDP tunnel creation, now the kernel performs a lookup per ingress
UDP packet, while before such lookup happened only if the ingress packet
carried a valid internal header csum.

rfc v2 -> rfc v3:
 - fixed typos in macro name and comments
 - really enforce UDP_GRO_CNT_MAX, instead of UDP_GRO_CNT_MAX + 1
 - acquire socket lock in UDP_GRO setsockopt

rfc v1 -> rfc v2:
 - use a new option to enable UDP GRO
 - use static keys to protect the UDP GRO socket lookup

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:23:04 -08:00
Paolo Abeni 60fb9567bf udp: implement complete book-keeping for encap_needed
The *encap_needed static keys are enabled by UDP tunnels
and several UDP encapsulations type, but they are never
turned off. This can cause unneeded overall performance
degradation for systems where such features are used
transiently.

This patch introduces complete book-keeping for such keys,
decreasing the usage at socket destruction time, if needed,
and avoiding that the same socket could increase the key
usage multiple times.

rfc v3 -> v1:
 - add socket lock around udp_tunnel_encap_enable()

rfc v2 -> rfc v3:
 - use udp_tunnel_encap_enable() in setsockopt()

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:23:04 -08:00
Dewi Morgan 7bd2db404e ipv6: do not drop vrf udp multicast packets
For bound udp sockets in a vrf, also check the sdif to get the index
for ingress devices enslaved to an l3mdev.

Signed-off-by: Dewi Morgan <morgand@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:12:39 -08:00
Mike Manning 5226b6a920 ipv6: handling of multicast packets received in VRF
If the skb for multicast packets marked as enslaved to a VRF are
received, then the secondary device index should be used to obtain
the real device. And verify the multicast address against the
enslaved rather than the l3mdev device.

Signed-off-by: Dewi Morgan <morgand@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:12:39 -08:00
Mike Manning d839a0ebeb ipv6: allow ping to link-local address in VRF
If link-local packets are marked as enslaved to a VRF, then to allow
ping to the link-local from a vrf, the error handling for IPV6_PKTINFO
needs to be relaxed to also allow the pkt ipi6_ifindex to be that of a
slave device to the vrf.

Note that the real device also needs to be retrieved in icmp6_iif()
to set the ipv6 flow oif to this for icmp echo reply handling. The
recent commit 24b711edfc ("net/ipv6: Fix linklocal to global address
with VRF") takes care of this, so the sdif does not need checking here.

This fix makes ping to link-local consistent with that to global
addresses, in that this can now be done from within the same VRF that
the address is in.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:12:39 -08:00
Duncan Eastoe 7055420fb6 net: fix raw socket lookup device bind matching with VRFs
When there exist a pair of raw sockets one unbound and one bound
to a VRF but equal in all other respects, when a packet is received
in the VRF context, __raw_v4_lookup() matches on both sockets.

This results in the packet being delivered over both sockets,
instead of only the raw socket bound to the VRF. The bound device
checks in __raw_v4_lookup() are replaced with a call to
raw_sk_bound_dev_eq() which correctly handles whether the packet
should be delivered over the unbound socket in such cases.

In __raw_v6_lookup() the match on the device binding of the socket is
similarly updated to use raw_sk_bound_dev_eq() which matches the
handling in __raw_v4_lookup().

Importantly raw_sk_bound_dev_eq() takes the raw_l3mdev_accept sysctl
into account.

Signed-off-by: Duncan Eastoe <deastoe@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:12:39 -08:00
Mike Manning 6da5b0f027 net: ensure unbound datagram socket to be chosen when not in a VRF
Ensure an unbound datagram skt is chosen when not in a VRF. The check
for a device match in compute_score() for UDP must be performed when
there is no device match. For this, a failure is returned when there is
no device match. This ensures that bound sockets are never selected,
even if there is no unbound socket.

Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These
packets are currently blocked, as flowi6_oif was set to that of the
master vrf device, and the ipi6_ifindex is that of the slave device.
Allow these packets to be sent by checking the device with ipi6_ifindex
has the same L3 scope as that of the bound device of the skt, which is
the master vrf device. Note that this check always succeeds if the skt
is unbound.

Even though the right datagram skt is now selected by compute_score(),
a different skt is being returned that is bound to the wrong vrf. The
difference between these and stream sockets is the handling of the skt
option for SO_REUSEPORT. While the handling when adding a skt for reuse
correctly checks that the bound device of the skt is a match, the skts
in the hashslot are already incorrect. So for the same hash, a skt for
the wrong vrf may be selected for the required port. The root cause is
that the skt is immediately placed into a slot when it is created,
but when the skt is then bound using SO_BINDTODEVICE, it remains in the
same slot. The solution is to move the skt to the correct slot by
forcing a rehash.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:12:38 -08:00
Mike Manning e78190581a net: ensure unbound stream socket to be chosen when not in a VRF
The commit a04a480d43 ("net: Require exact match for TCP socket
lookups if dif is l3mdev") only ensures that the correct socket is
selected for packets in a VRF. However, there is no guarantee that
the unbound socket will be selected for packets when not in a VRF.
By checking for a device match in compute_score() also for the case
when there is no bound device and attaching a score to this, the
unbound socket is selected. And if a failure is returned when there
is no device match, this ensures that bound sockets are never selected,
even if there is no unbound socket.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07 16:12:38 -08:00
David Ahern d7e774f356 net: Add extack argument to ip_fib_metrics_init
Add extack argument to ip_fib_metrics_init and add messages for invalid
metrics.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-06 15:00:45 -08:00
Eric Dumazet 0b215b9798 ipv6: gro: do not use slow memcmp() in ipv6_gro_receive()
ipv6_gro_receive() compares 34 bytes using slow memcmp(),
while handcoding with a couple of ipv6_addr_equal() is much faster.

Before this patch, "perf top -e cycles:pp -C <cpu>" would
see memcmp() using ~10% of cpu cycles on a 40Gbit NIC
receiving IPv6 TCP traffic.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-06 14:59:27 -08:00
David S. Miller a422757e8c Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains the first batch of Netfilter fixes for
your net tree:

1) Fix splat with IPv6 defragmenting locally generated fragments,
   from Florian Westphal.

2) Fix Incorrect check for missing attribute in nft_osf.

3) Missing INT_MIN & INT_MAX definition for netfilter bridge uapi
   header, from Jiri Slaby.

4) Revert map lookup in nft_numgen, this is already possible with
   the existing infrastructure without this extension.

5) Fix wrong listing of set reference counter, make counter
   synchronous again, from Stefano Brivio.

6) Fix CIDR 0 in hash:net,port,net, from Eric Westbrook.

7) Fix allocation failure with large set, use kvcalloc().
   From Andrey Ryabinin.

8) No need to disable BH when fetch ip set comment, patch from
   Jozsef Kadlecsik.

9) Sanity check for valid sysfs entry in xt_IDLETIMER, from
   Taehee Yoo.

10) Fix suspicious rcu usage via ip_set() macro at netlink dump,
    from Jozsef Kadlecsik.

11) Fix setting default timeout via nfnetlink_cttimeout, this
    comes with preparation patch to add nf_{tcp,udp,...}_pernet()
    helper.

12) Allow ebtables table nat to be of filter type via nft_compat.
    From Florian Westphal.

13) Incorrect calculation of next bucket in early_drop, do no bump
    hash value, update bucket counter instead. From Vasily Khoruzhick.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-05 17:19:25 -08:00
Alexey Kodanev e22d0bfa09 ipv6: properly check return value in inet6_dump_all()
Make sure we call fib6_dump_end() if it happens that skb->len
is zero. rtnl_dump_all() can reset cb->args on the next loop
iteration there.

Fixes: 08e814c9e8 ("net/ipv6: Bail early if user only wants cloned entries")
Fixes: ae677bbb44 ("net: Don't return invalid table id error when dumping all families")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-05 17:04:54 -08:00
Jeff Barnhill 6915ed86cc net/ipv6: Move anycast init/cleanup functions out of CONFIG_PROC_FS
Move the anycast.c init and cleanup functions which were inadvertently
added inside the CONFIG_PROC_FS definition.

Fixes: 2384d02520 ("net/ipv6: Add anycast addresses to a global hashtable")
Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-05 13:36:27 -08:00
Jeff Barnhill 2384d02520 net/ipv6: Add anycast addresses to a global hashtable
icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device.  This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().

Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.

Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-02 23:54:56 -07:00
Mike Manning f64bf6b8ae net: allow traceroute with a specified interface in a vrf
Traceroute executed in a vrf succeeds if no device is given or if the
vrf is given as the device, but fails if the interface is given as the
device. This is for default UDP probes, it succeeds for TCP SYN or ICMP
ECHO probes. As the skb bound dev is the interface and the sk dev is
the vrf, sk lookup fails for ICMP_DEST_UNREACH and ICMP_TIME_EXCEEDED
messages. The solution is for the secondary dev to be passed so that
the interface is available for the device match to succeed, in the same
way as is already done for non-error cases.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-26 16:03:08 -07:00
Stefano Brivio ee1abcf689 ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called
Commit a61bbcf28a ("[NET]: Store skb->timestamp as offset to a base
timestamp") introduces a neighbour control buffer and zeroes it out in
ndisc_rcv(), as ndisc_recv_ns() uses it.

Commit f2776ff047 ("[IPV6]: Fix address/interface handling in UDP and
DCCP, according to the scoping architecture.") introduces the usage of the
IPv6 control buffer in protocol error handlers (e.g. inet6_iif() in
present-day __udp6_lib_err()).

Now, with commit b94f1c0904 ("ipv6: Use icmpv6_notify() to propagate
redirect, instead of rt6_redirect()."), we call protocol error handlers
from ndisc_redirect_rcv(), after the control buffer is already stolen and
some parts are already zeroed out. This implies that inet6_iif() on this
path will always return zero.

This gives unexpected results on UDP socket lookup in __udp6_lib_err(), as
we might actually need to match sockets for a given interface.

Instead of always claiming the control buffer in ndisc_rcv(), do that only
when needed.

Fixes: b94f1c0904 ("ipv6: Use icmpv6_notify() to propagate redirect, instead of rt6_redirect().")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-26 15:58:06 -07:00
Bjørn Mork bf4cc40e93 net/{ipv4,ipv6}: Do not put target net if input nsid is invalid
The cleanup path will put the target net when netnsid is set.  So we must
reset netnsid if the input is invalid.

Fixes: d7e38611b8 ("net/ipv4: Put target net when address dump fails due to bad attributes")
Fixes: 242afaa696 ("net/ipv6: Put target net when address dump fails due to bad attributes")
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-25 16:21:31 -07:00
Florian Westphal 61792b6774 netfilter: ipv6: fix oops when defragmenting locally generated fragments
Unlike ipv4 and normal ipv6 defrag, netfilter ipv6 defragmentation did
not save/restore skb->dst.

This causes oops when handling locally generated ipv6 fragments, as
output path needs a valid dst.

Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com>
Fixes: 84379c9afe ("netfilter: ipv6: nf_defrag: drop skb dst before queueing")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-10-25 10:18:26 +02:00
David Ahern 4ed591c8ab net/ipv6: Allow onlink routes to have a device mismatch if it is the default route
The intent of ip6_route_check_nh_onlink is to make sure the gateway
given for an onlink route is not actually on a connected route for
a different interface (e.g., 2001:db8:1::/64 is on dev eth1 and then
an onlink route has a via 2001:db8:1::1 dev eth2). If the gateway
lookup hits the default route then it most likely will be a different
interface than the onlink route which is ok.

Update ip6_route_check_nh_onlink to disregard the device mismatch
if the gateway lookup hits the default route. Turns out the existing
onlink tests are passing because there is no default route or it is
an unreachable default, so update the onlink tests to have a default
route other than unreachable.

Fixes: fc1e64e109 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-24 14:36:38 -07:00
Sean Tranchetti db4f1be3ca net: udp: fix handling of CHECKSUM_COMPLETE packets
Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
incorrect for any packet that has an incorrect checksum value.

udp4/6_csum_init() will both make a call to
__skb_checksum_validate_complete() to initialize/validate the csum
field when receiving a CHECKSUM_COMPLETE packet. When this packet
fails validation, skb->csum will be overwritten with the pseudoheader
checksum so the packet can be fully validated by software, but the
skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
the stack can later warn the user about their hardware spewing bad
checksums. Unfortunately, leaving the SKB in this state can cause
problems later on in the checksum calculation.

Since the the packet is still marked as CHECKSUM_COMPLETE,
udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
from skb->csum instead of adding it, leaving us with a garbage value
in that field. Once we try to copy the packet to userspace in the
udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
to checksum the packet data and add it in the garbage skb->csum value
to perform our final validation check.

Since the value we're validating is not the proper checksum, it's possible
that the folded value could come out to 0, causing us not to drop the
packet. Instead, we believe that the packet was checksummed incorrectly
by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
to warn the user with netdev_rx_csum_fault(skb->dev);

Unfortunately, since this is the UDP path, skb->dev has been overwritten
by skb->dev_scratch and is no longer a valid pointer, so we end up
reading invalid memory.

This patch addresses this problem in two ways:
	1) Do not use the dev pointer when calling netdev_rx_csum_fault()
	   from skb_copy_and_csum_datagram_msg(). Since this gets called
	   from the UDP path where skb->dev has been overwritten, we have
	   no way of knowing if the pointer is still valid. Also for the
	   sake of consistency with the other uses of
	   netdev_rx_csum_fault(), don't attempt to call it if the
	   packet was checksummed by software.

	2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
	   If we receive a packet that's CHECKSUM_COMPLETE that fails
	   verification (i.e. skb->csum_valid == 0), check who performed
	   the calculation. It's possible that the checksum was done in
	   software by the network stack earlier (such as Netfilter's
	   CONNTRACK module), and if that says the checksum is bad,
	   we can drop the packet immediately instead of waiting until
	   we try and copy it to userspace. Otherwise, we need to
	   mark the SKB as CHECKSUM_NONE, since the skb->csum field
	   no longer contains the full packet checksum after the
	   call to __skb_checksum_validate_complete().

Fixes: e6afc8ace6 ("udp: remove headers from UDP packets before queueing")
Fixes: c84d949057 ("udp: copy skb->truesize in the first cache line")
Cc: Sam Kumar <samanthakumar@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-24 14:18:16 -07:00
David Ahern ae677bbb44 net: Don't return invalid table id error when dumping all families
When doing a route dump across all address families, do not error out
if the table does not exist. This allows a route dump for AF_UNSPEC
with a table id that may only exist for some of the families.

Do return the table does not exist error if dumping routes for a
specific family and the table does not exist.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-24 14:06:25 -07:00
David Ahern 242afaa696 net/ipv6: Put target net when address dump fails due to bad attributes
If tgt_net is set based on IFA_TARGET_NETNSID attribute in the dump
request, make sure all error paths call put_net.

Fixes: 6371a71f3a ("net/ipv6: Add support for dumping addresses for a specific device")
Fixes: ed6eff1179 ("net/ipv6: Update inet6_dump_addr for strict data checking")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-24 14:06:25 -07:00
David Ahern 6371a71f3a net/ipv6: Add support for dumping addresses for a specific device
If an RTM_GETADDR dump request has ifa_index set in the ifaddrmsg
header, then return only the addresses for that device.

Since inet6_dump_addr is reused for multicast and anycast addresses,
this adds support for device specfic dumps of RTM_GETMULTICAST and
RTM_GETANYCAST as well.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-22 19:33:29 -07:00
David Ahern fe884c2be0 net/ipv6: Remove ip_idx arg to in6_dump_addrs
ip_idx is always 0 going into in6_dump_addrs; it is passed as a pointer
to save the last good index into cb. Since cb is already argument to
in6_dump_addrs, just save the value there.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-22 19:33:29 -07:00
David S. Miller 21ea1d36f6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
David Ahern's dump indexing bug fix in 'net' overlapped the
change of the function signature of inet6_fill_ifaddr() in
'net-next'.  Trivially resolved.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-21 11:54:28 -07:00
David Ahern 4ba4c566ba net/ipv6: Fix index counter for unicast addresses in in6_dump_addrs
The loop wants to skip previously dumped addresses, so loops until
current index >= saved index. If the message fills it wants to save
the index for the next address to dump - ie., the one that did not
fit in the current message.

Currently, it is incrementing the index counter before comparing to the
saved index, and then the saved index is off by 1 - it assumes the
current address is going to fit in the message.

Change the index handling to increment only after a succesful dump.

Fixes: 502a2ffd73 ("ipv6: convert idev_list to list macros")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-20 15:43:14 -07:00
David S. Miller 2e2d6f0342 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/sched/cls_api.c has overlapping changes to a call to
nlmsg_parse(), one (from 'net') added rtm_tca_policy instead of NULL
to the 5th argument, and another (from 'net-next') added cb->extack
instead of NULL to the 6th argument.

net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to
code which moved (to mr_table_dump)) in 'net-next'.  Thanks to David
Ahern for the heads up.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-19 11:03:06 -07:00
Stefano Brivio d4d576f5ab ip6_tunnel: Fix encapsulation layout
Commit 058214a4d1 ("ip6_tun: Add infrastructure for doing
encapsulation") added the ip6_tnl_encap() call in ip6_tnl_xmit(), before
the call to ipv6_push_frag_opts() to append the IPv6 Tunnel Encapsulation
Limit option (option 4, RFC 2473, par. 5.1) to the outer IPv6 header.

As long as the option didn't actually end up in generated packets, this
wasn't an issue. Then commit 89a23c8b52 ("ip6_tunnel: Fix missing tunnel
encapsulation limit option") fixed sending of this option, and the
resulting layout, e.g. for FoU, is:

.-------------------.------------.----------.-------------------.----- - -
| Outer IPv6 Header | UDP header | Option 4 | Inner IPv6 Header | Payload
'-------------------'------------'----------'-------------------'----- - -

Needless to say, FoU and GUE (at least) won't work over IPv6. The option
is appended by default, and I couldn't find a way to disable it with the
current iproute2.

Turn this into a more reasonable:

.-------------------.----------.------------.-------------------.----- - -
| Outer IPv6 Header | Option 4 | UDP header | Inner IPv6 Header | Payload
'-------------------'----------'------------'-------------------'----- - -

With this, and with 84dad55951 ("udp6: fix encap return code for
resubmitting"), FoU and GUE work again over IPv6.

Fixes: 058214a4d1 ("ip6_tun: Add infrastructure for doing encapsulation")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-18 16:54:40 -07:00