Commit graph

5696 commits

Author SHA1 Message Date
Ido Schimmel
398958ae48 ipv6: Add support for non-equal-cost multipath
The use of hash-threshold instead of modulo-N makes it trivial to add
support for non-equal-cost multipath.

Instead of dividing the multipath hash function's output space equally
between the nexthops, each nexthop is assigned a region size which is
proportional to its weight.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 15:14:44 -05:00
Ido Schimmel
3d709f69a3 ipv6: Use hash-threshold instead of modulo-N
Now that each nexthop stores its region boundary in the multipath hash
function's output space, we can use hash-threshold instead of modulo-N
in multipath selection.

This reduces the number of checks we need to perform during lookup, as
dead and linkdown nexthops are assigned a negative region boundary. In
addition, in contrast to modulo-N, only flows near region boundaries are
affected when a nexthop is added or removed.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 15:14:44 -05:00
Ido Schimmel
7696c06a18 ipv6: Use a 31-bit multipath hash
The hash thresholds assigned to IPv6 nexthops are in the range of
[-1, 2^31 - 1], where a negative value is assigned to nexthops that
should not be considered during multipath selection.

Therefore, in a similar fashion to IPv4, we need to use the upper
31-bits of the multipath hash for multipath selection.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 15:14:44 -05:00
Ido Schimmel
d7dedee184 ipv6: Calculate hash thresholds for IPv6 nexthops
Before we convert IPv6 to use hash-threshold instead of modulo-N, we
first need each nexthop to store its region boundary in the hash
function's output space.

The boundary is calculated by dividing the output space equally between
the different active nexthops. That is, nexthops that are not dead or
linkdown.

The boundaries are rebalanced whenever a nexthop is added or removed to
a multipath route and whenever a nexthop becomes active or inactive.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 15:14:44 -05:00
Colin Ian King
709af180ee ipv6: use ARRAY_SIZE for array sizing calculation on array seg6_action_table
Use the ARRAY_SIZE macro on array seg6_action_table to determine size of
the array. Improvement suggested by coccinelle.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 11:40:46 -05:00
David S. Miller
a0ce093180 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-01-09 10:37:00 -05:00
David S. Miller
9f0e896f35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your
net-next tree:

1) Free hooks via call_rcu to speed up netns release path, from
   Florian Westphal.

2) Reduce memory footprint of hook arrays, skip allocation if family is
   not present - useful in case decnet support is not compiled built-in.
   Patches from Florian Westphal.

3) Remove defensive check for malformed IPv4 - including ihl field - and
   IPv6 headers in x_tables and nf_tables.

4) Add generic flow table offload infrastructure for nf_tables, this
   includes the netlink control plane and support for IPv4, IPv6 and
   mixed IPv4/IPv6 dataplanes. This comes with NAT support too. This
   patchset adds the IPS_OFFLOAD conntrack status bit to indicate that
   this flow has been offloaded.

5) Add secpath matching support for nf_tables, from Florian.

6) Save some code bytes in the fast path for the nf_tables netdev,
   bridge and inet families.

7) Allow one single NAT hook per point and do not allow to register NAT
   hooks in nf_tables before the conntrack hook, patches from Florian.

8) Seven patches to remove the struct nf_af_info abstraction, instead
   we perform direct calls for IPv4 which is faster. IPv6 indirections
   are still needed to avoid dependencies with the 'ipv6' module, but
   these now reside in struct nf_ipv6_ops.

9) Seven patches to handle NFPROTO_INET from the Netfilter core,
   hence we can remove specific code in nf_tables to handle this
   pseudofamily.

10) No need for synchronize_net() call for nf_queue after conversion
    to hook arrays. Also from Florian.

11) Call cond_resched_rcu() when dumping large sets in ipset to avoid
    softlockup. Again from Florian.

12) Pass lockdep_nfnl_is_held() to rcu_dereference_protected(), patch
    from Florian Westphal.

13) Fix matching of counters in ipset, from Jozsef Kadlecsik.

14) Missing nfnl lock protection in the ip_set_net_exit path, also
    from Jozsef.

15) Move connlimit code that we can reuse from nf_tables into
    nf_conncount, from Florian Westhal.

And asorted cleanups:

16) Get rid of nft_dereference(), it only has one single caller.

17) Add nft_set_is_anonymous() helper function.

18) Remove NF_ARP_FORWARD leftover chain definition in nf_tables_arp.

19) Remove unnecessary comments in nf_conntrack_h323_asn1.c
    From Varsha Rao.

20) Remove useless parameters in frag_safe_skb_hp(), from Gao Feng.

21) Constify layer 4 conntrack protocol definitions, function
    parameters to register/unregister these protocol trackers, and
    timeouts. Patches from Florian Westphal.

22) Remove nlattr_size indirection, from Florian Westphal.

23) Add fall-through comments as -Wimplicit-fallthrough needs this,
    from Gustavo A. R. Silva.

24) Use swap() macro to exchange values in ipset, patch from
    Gustavo A. R. Silva.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-08 20:40:42 -05:00
David Ahern
54dc3e3324 net: ipv6: Allow connect to linklocal address from socket bound to vrf
Allow a process bound to a VRF to connect to a linklocal address.
Currently, this fails because of a mismatch between the scope of the
linklocal address and the sk_bound_dev_if inherited by the VRF binding:
    $ ssh -6 fe80::70b8:cff:fedd:ead8%eth1
    ssh: connect to host fe80::70b8:cff:fedd:ead8%eth1 port 22: Invalid argument

Relax the scope check to allow the socket to be bound to the same L3
device as the scope id.

This makes ipv6 linklocal consistent with other relaxed checks enabled
by commits 1ff23beebd ("net: l3mdev: Allow send on enslaved interface")
and 7bb387c5ab ("net: Allow IP_MULTICAST_IF to set index to L3 slave").

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-08 14:11:18 -05:00
Pablo Neira Ayuso
7c23b629a8 netfilter: flow table support for the mixed IPv4/IPv6 family
This patch adds the IPv6 flow table type, that implements the datapath
flow table to forward IPv6 traffic.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:11:09 +01:00
Pablo Neira Ayuso
0995210753 netfilter: flow table support for IPv6
This patch adds the IPv6 flow table type, that implements the datapath
flow table to forward IPv6 traffic.

This patch exports ip6_dst_mtu_forward() that is required to check for
mtu to pass up packets that need PMTUD handling to the classic
forwarding path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:11:08 +01:00
Pablo Neira Ayuso
a7f87b47e6 netfilter: remove defensive check on malformed packets from raw sockets
Users cannot forge malformed IPv4/IPv6 headers via raw sockets that they
can inject into the stack. Specifically, not for IPv4 since 55888dfb6b
("AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl
(v2)"). IPv6 raw sockets also ensure that packets have a well-formed
IPv6 header available in the skbuff.

At quick glance, br_netfilter also validates layer 3 headers and it
drops malformed both IPv4 and IPv6 packets.

Therefore, let's remove this defensive check all over the place.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:11:04 +01:00
Pablo Neira Ayuso
b3a61254d8 netfilter: remove struct nf_afinfo and its helper functions
This abstraction has no clients anymore, remove it.

This is what remains from previous authors, so correct copyright
statement after recent modifications and code removal.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:11:02 +01:00
Pablo Neira Ayuso
464356234f netfilter: remove route_key_size field in struct nf_afinfo
This is only needed by nf_queue, place this code where it belongs.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:11:01 +01:00
Pablo Neira Ayuso
ce388f452f netfilter: move reroute indirection to struct nf_ipv6_ops
We cannot make a direct call to nf_ip6_reroute() because that would result
in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define reroute indirection in nf_ipv6_ops where this really
belongs to.

For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:10:53 +01:00
Pablo Neira Ayuso
3f87c08c61 netfilter: move route indirection to struct nf_ipv6_ops
We cannot make a direct call to nf_ip6_route() because that would result
in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define route indirection in nf_ipv6_ops where this really
belongs to.

For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:26 +01:00
Pablo Neira Ayuso
7db9a51e0f netfilter: remove saveroute indirection in struct nf_afinfo
This is only used by nf_queue.c and this function comes with no symbol
dependencies with IPv6, it just refers to structure layouts. Therefore,
we can replace it by a direct function call from where it belongs.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:25 +01:00
Pablo Neira Ayuso
f7dcbe2f36 netfilter: move checksum_partial indirection to struct nf_ipv6_ops
We cannot make a direct call to nf_ip6_checksum_partial() because that
would result in autoloading the 'ipv6' module because of symbol
dependencies.  Therefore, define checksum_partial indirection in
nf_ipv6_ops where this really belongs to.

For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:24 +01:00
Pablo Neira Ayuso
ef71fe27ec netfilter: move checksum indirection to struct nf_ipv6_ops
We cannot make a direct call to nf_ip6_checksum() because that would
result in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define checksum indirection in nf_ipv6_ops where this really
belongs to.

For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:23 +01:00
Pablo Neira Ayuso
c2f9eafee9 netfilter: nf_tables: remove hooks from family definition
They don't belong to the family definition, move them to the filter
chain type definition instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:22 +01:00
Pablo Neira Ayuso
c974a3a364 netfilter: nf_tables: remove multihook chains and families
Since NFPROTO_INET is handled from the core, we don't need to maintain
extra infrastructure in nf_tables to handle the double hook
registration, one for IPv4 and another for IPv6.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:21 +01:00
Pablo Neira Ayuso
12355d3670 netfilter: nf_tables_inet: don't use multihook infrastructure anymore
Use new native NFPROTO_INET support in netfilter core, this gets rid of
ad-hoc code in the nf_tables API codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:20 +01:00
Pablo Neira Ayuso
7a4473a31a netfilter: nf_tables: explicit nft_set_pktinfo() call from hook path
Instead of calling this function from the family specific variant, this
reduces the code size in the fast path for the netdev, bridge and inet
families. After this change, we must call nft_set_pktinfo() upfront from
the chain hook indirection.

Before:

   text    data     bss     dec     hex filename
   2145     208       0    2353     931 net/netfilter/nf_tables_netdev.o

After:

   text    data     bss     dec     hex filename
   2125     208       0    2333     91d net/netfilter/nf_tables_netdev.o

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:15 +01:00
Florian Westphal
f92b40a8b2 netfilter: core: only allow one nat hook per hook point
The netfilter NAT core cannot deal with more than one NAT hook per hook
location (prerouting, input ...), because the NAT hooks install a NAT null
binding in case the iptables nat table (iptable_nat hooks) or the
corresponding nftables chain (nft nat hooks) doesn't specify a nat
transformation.

Null bindings are needed to detect port collsisions between NAT-ed and
non-NAT-ed connections.

This causes nftables NAT rules to not work when iptable_nat module is
loaded, and vice versa because nat binding has already been attached
when the second nat hook is consulted.

The netfilter core is not really the correct location to handle this
(hooks are just hooks, the core has no notion of what kinds of side
 effects a hook implements), but its the only place where we can check
for conflicts between both iptables hooks and nftables hooks without
adding dependencies.

So add nat annotation to hook_ops to describe those hooks that will
add NAT bindings and then make core reject if such a hook already exists.
The annotation fills a padding hole, in case further restrictions appar
we might change this to a 'u8 type' instead of bool.

iptables error if nft nat hook active:
iptables -t nat -A POSTROUTING -j MASQUERADE
iptables v1.4.21: can't initialize iptables table `nat': File exists
Perhaps iptables or your kernel needs to be upgraded.

nftables error if iptables nat table present:
nft -f /etc/nftables/ipv4-nat
/usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists
table nat {
^^

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:13 +01:00
Florian Westphal
03d13b6868 netfilter: xtables: add and use xt_request_find_table_lock
currently we always return -ENOENT to userspace if we can't find
a particular table, or if the table initialization fails.

Followup patch will make nat table init fail in case nftables already
registered a nat hook so this change makes xt_find_table_lock return
an ERR_PTR to return the errno value reported from the table init
function.

Add xt_request_find_table_lock as try_then_request_module replacement
and use it where needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:12 +01:00
Florian Westphal
2c9e8637ea netfilter: conntrack: timeouts can be const
Nowadays this is just the default template that is used when setting up
the net namespace, so nothing writes to these locations.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:01:02 +01:00
Florian Westphal
9dae47aba0 netfilter: conntrack: l4 protocol trackers can be const
previous patches removed all writes to these structs so we can
now mark them as const.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 18:00:54 +01:00
Florian Westphal
cd9ceafc0a netfilter: conntrack: constify list of builtin trackers
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08 16:47:14 +01:00
Ido Schimmel
1de178edc7 ipv6: Flush multipath routes when all siblings are dead
By default, IPv6 deletes nexthops from a multipath route when the
nexthop device is put administratively down. This differs from IPv4
where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A
multipath route is flushed when all of its nexthops become dead.

Align IPv6 with IPv4 and have it conform to the same guidelines.

In case the multipath route needs to be flushed, its siblings are
flushed one by one. Otherwise, the nexthops are marked with the
appropriate flags and the tree walker is instructed to skip all the
siblings.

As explained in previous patches, care is taken to update the sernum of
the affected tree nodes, so as to prevent the use of wrong dst entries.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:41 -05:00
Ido Schimmel
922c2ac82e ipv6: Take table lock outside of sernum update function
The next patch is going to allow dead routes to remain in the FIB tree
in certain situations.

When this happens we need to be sure to bump the sernum of the nodes
where these are stored so that potential copies cached in sockets are
invalidated.

The function that performs this update assumes the table lock is not
taken when it is invoked, but that will not be the case when it is
invoked by the tree walker.

Have the function assume the lock is taken and make the single caller
take the lock itself.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:41 -05:00
Ido Schimmel
4a8e56ee2c ipv6: Export sernum update function
We are going to allow dead routes to stay in the FIB tree (e.g., when
they are part of a multipath route, directly connected route with no
carrier) and revive them when their nexthop device gains carrier or when
it is put administratively up.

This is equivalent to the addition of the route to the FIB tree and we
should therefore take care of updating the sernum of all the parent
nodes of the node where the route is stored. Otherwise, we risk sockets
caching and using sub-optimal dst entries.

Export the function that performs the above, so that it could be invoked
from fib6_ifup() later on.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:40 -05:00
Ido Schimmel
b5cb5a755b ipv6: Teach tree walker to skip multipath routes
As explained in previous patch, fib6_ifdown() needs to consider the
state of all the sibling routes when a multipath route is traversed.

This is done by evaluating all the siblings when the first sibling in a
multipath route is traversed. If the multipath route does not need to be
flushed (e.g., not all siblings are dead), then we should just skip the
multipath route as our work is done.

Have the tree walker jump to the last sibling when it is determined that
the multipath route needs to be skipped.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:40 -05:00
Ido Schimmel
f9d882ea57 ipv6: Report dead flag during route dump
Up until now the RTNH_F_DEAD flag was only reported in route dump when
the 'ignore_routes_with_linkdown' sysctl was set. This is expected as
dead routes were flushed otherwise.

The reliance on this sysctl is going to be removed, so we need to report
the flag regardless of the sysctl's value.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:40 -05:00
Ido Schimmel
8067bb8c1d ipv6: Ignore dead routes during lookup
Currently, dead routes are only present in the routing tables in case
the 'ignore_routes_with_linkdown' sysctl is set. Otherwise, they are
flushed.

Subsequent patches are going to remove the reliance on this sysctl and
make IPv6 more consistent with IPv4.

Before this is done, we need to make sure dead routes are skipped during
route lookup, so as to not cause packet loss.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:40 -05:00
Ido Schimmel
44c9f2f206 ipv6: Check nexthop flags in route dump instead of carrier
Similar to previous patch, there is no need to check for the carrier of
the nexthop device when dumping the route and we can instead check for
the presence of the RTNH_F_LINKDOWN flag.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:40 -05:00
Ido Schimmel
14c5206c2d ipv6: Check nexthop flags during route lookup instead of carrier
Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the
need to dereference the nexthop device and check its carrier and instead
check for the presence of the flag.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:40 -05:00
Ido Schimmel
5609b80a37 ipv6: Set nexthop flags during route creation
It is valid to install routes with a nexthop device that does not have a
carrier, so we need to make sure they're marked accordingly.

As explained in the previous patch, host and anycast routes are never
marked with the 'linkdown' flag.

Note that reject routes are unaffected, as these use the loopback device
which always has a carrier.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:40 -05:00
Ido Schimmel
27c6fa73f9 ipv6: Set nexthop flags upon carrier change
Similar to IPv4, when the carrier of a netdev changes we should toggle
the 'linkdown' flag on all the nexthops using it as their nexthop
device.

This will later allow us to test for the presence of this flag during
route lookup and dump.

Up until commit 4832c30d54 ("net: ipv6: put host and anycast routes on
device with address") host and anycast routes used the loopback netdev
as their nexthop device and thus were not marked with the 'linkdown'
flag. The patch preserves this behavior and allows one to ping the local
address even when the nexthop device does not have a carrier and the
'ignore_routes_with_linkdown' sysctl is set.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:40 -05:00
Ido Schimmel
4c981e28d3 ipv6: Prepare to handle multiple netdev events
To make IPv6 more in line with IPv4 we need to be able to respond
differently to different netdev events. For example, when a netdev is
unregistered all the routes using it as their nexthop device should be
flushed, whereas when the netdev's carrier changes only the 'linkdown'
flag should be toggled.

Currently, this is not possible, as the function that traverses the
routing tables is not aware of the triggering event.

Propagate the triggering event down, so that it could be used in later
patches.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:40 -05:00
Ido Schimmel
2127d95aef ipv6: Clear nexthop flags upon netdev up
Previous patch marked nexthops with the 'dead' and 'linkdown' flags.
Clear these flags when the netdev comes back up.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:39 -05:00
Ido Schimmel
2b2413610e ipv6: Mark dead nexthops with appropriate flags
When a netdev is put administratively down or unregistered all the
nexthops using it as their nexthop device should be marked with the
'dead' and 'linkdown' flags.

Currently, when a route is dumped its nexthop device is tested and the
flags are set accordingly. A similar check is performed during route
lookup.

Instead, we can simply mark the nexthops based on netdev events and
avoid checking the netdev's state during route dump and lookup.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:39 -05:00
Ido Schimmel
9fcb0714dc ipv6: Remove redundant route flushing during namespace dismantle
By the time fib6_net_exit() is executed all the netdevs in the namespace
have been either unregistered or pushed back to the default namespace.
That is because pernet subsys operations are always ordered before
pernet device operations and therefore invoked after them during
namespace dismantle.

Thus, all the routing tables in the namespace are empty by the time
fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed.

This allows us to simplify the condition in fib6_ifdown() as it's only
ever called with an actual netdev.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07 21:29:39 -05:00
Wei Wang
7bbfe00e02 ipv6: fix general protection fault in fib6_add()
In fib6_add(), pn could be NULL if fib6_add_1() failed to return a fib6
node. Checking pn != fn before accessing pn->leaf makes sure pn is not
NULL.
This fixes the following GPF reported by syzkaller:
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 3201 Comm: syzkaller001778 Not tainted 4.15.0-rc5+ #151
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:fib6_add+0x736/0x15a0 net/ipv6/ip6_fib.c:1244
RSP: 0018:ffff8801c7626a70 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000020 RCX: ffffffff84794465
RDX: 0000000000000004 RSI: ffff8801d38935f0 RDI: 0000000000000282
RBP: ffff8801c7626da0 R08: 1ffff10038ec4c35 R09: 0000000000000000
R10: ffff8801c7626c68 R11: 0000000000000000 R12: 00000000fffffffe
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000009
FS:  0000000000000000(0000) GS:ffff8801db200000(0063) knlGS:0000000009b70840
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000020be1000 CR3: 00000001d585a006 CR4: 00000000001606f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 __ip6_ins_rt+0x6c/0x90 net/ipv6/route.c:1006
 ip6_route_multipath_add+0xd14/0x16c0 net/ipv6/route.c:3833
 inet6_rtm_newroute+0xdc/0x160 net/ipv6/route.c:3957
 rtnetlink_rcv_msg+0x733/0x1020 net/core/rtnetlink.c:4411
 netlink_rcv_skb+0x21e/0x460 net/netlink/af_netlink.c:2408
 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4423
 netlink_unicast_kernel net/netlink/af_netlink.c:1275 [inline]
 netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1301
 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1864
 sock_sendmsg_nosec net/socket.c:636 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:646
 sock_write_iter+0x31a/0x5d0 net/socket.c:915
 call_write_iter include/linux/fs.h:1772 [inline]
 do_iter_readv_writev+0x525/0x7f0 fs/read_write.c:653
 do_iter_write+0x154/0x540 fs/read_write.c:932
 compat_writev+0x225/0x420 fs/read_write.c:1246
 do_compat_writev+0x115/0x220 fs/read_write.c:1267
 C_SYSC_writev fs/read_write.c:1278 [inline]
 compat_SyS_writev+0x26/0x30 fs/read_write.c:1274
 do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
 do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
 entry_SYSENTER_compat+0x54/0x63 arch/x86/entry/entry_64_compat.S:125

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: 66f5d6ce53 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-04 14:29:20 -05:00
Xin Long
2fa771be95 ip6_tunnel: allow ip6gre dev mtu to be set below 1280
Commit 582442d6d5 ("ipv6: Allow the MTU of ipip6 tunnel to be set
below 1280") fixed a mtu setting issue. It works for ipip6 tunnel.

But ip6gre dev updates the mtu also with ip6_tnl_change_mtu. Since
the inner packet over ip6gre can be ipv4 and it's mtu should also
be allowed to set below 1280, the same issue also exists on ip6gre.

This patch is to fix it by simply changing to check if parms.proto
is IPPROTO_IPV6 in ip6_tnl_change_mtu instead, to make ip6gre to
go to 'else' branch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02 12:36:14 -05:00
Eli Cooper
23263ec86a ip6_tunnel: disable dst caching if tunnel is dual-stack
When an ip6_tunnel is in mode 'any', where the transport layer
protocol can be either 4 or 41, dst_cache must be disabled.

This is because xfrm policies might apply to only one of the two
protocols. Caching dst would cause xfrm policies for one protocol
incorrectly used for the other.

Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02 12:31:12 -05:00
David S. Miller
6bb8824732 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/ipv6/ip6_gre.c is a case of parallel adds.

include/trace/events/tcp.h is a little bit more tricky.  The removal
of in-trace-macro ifdefs in 'net' paralleled with moving
show_tcp_state_name and friends over to include/trace/events/sock.h
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-29 15:42:26 -05:00
David S. Miller
9f30e5c5c2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2017-12-22

1) Separate ESP handling from segmentation for GRO packets.
   This unifies the IPsec GSO and non GSO codepath.

2) Add asynchronous callbacks for xfrm on layer 2. This
   adds the necessary infrastructure to core networking.

3) Allow to use the layer2 IPsec GSO codepath for software
   crypto, all infrastructure is there now.

4) Also allow IPsec GSO with software crypto for local sockets.

5) Don't require synchronous crypto fallback on IPsec offloading,
   it is not needed anymore.

6) Check for xdo_dev_state_free and only call it if implemented.
   From Shannon Nelson.

7) Check for the required add and delete functions when a driver
   registers xdo_dev_ops. From Shannon Nelson.

8) Define xfrmdev_ops only with offload config.
   From Shannon Nelson.

9) Update the xfrm stats documentation.
   From Shannon Nelson.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27 11:15:14 -05:00
David S. Miller
65bbbf6c20 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2017-12-22

1) Check for valid id proto in validate_tmpl(), otherwise
   we may trigger a warning in xfrm_state_fini().
   From Cong Wang.

2) Fix a typo on XFRMA_OUTPUT_MARK policy attribute.
   From Michal Kubecek.

3) Verify the state is valid when encap_type < 0,
   otherwise we may crash on IPsec GRO .
   From Aviv Heller.

4) Fix stack-out-of-bounds read on socket policy lookup.
   We access the flowi of the wrong address family in the
   IPv4 mapped IPv6 case, fix this by catching address
   family missmatches before we do the lookup.

5) fix xfrm_do_migrate() with AEAD to copy the geniv
   field too. Otherwise the state is not fully initialized
   and migration fails. From Antony Antony.

6) Fix stack-out-of-bounds with misconfigured transport
   mode policies. Our policy template validation is not
   strict enough. It is possible to configure policies
   with transport mode template where the address family
   of the template does not match the selectors address
   family. Fix this by refusing such a configuration,
   address family can not change on transport mode.

7) Fix a policy reference leak when reusing pcpu xdst
   entry. From Florian Westphal.

8) Reinject transport-mode packets through tasklet,
   otherwise it is possible to reate a recursion
   loop. From Herbert Xu.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27 10:58:23 -05:00
William Tu
214bb1c78a net: erspan: remove md NULL check
The 'md' is allocated from 'tun_dst = ip_tun_rx_dst' and
since we've checked 'tun_dst', 'md' will never be NULL.
The patch removes it at both ipv4 and ipv6 erspan.

Fixes: afb4c97d90 ("ip6_gre: fix potential memory leak in ip6erspan_rcv")
Fixes: 50670b6ee9 ("ip_gre: fix potential memory leak in erspan_rcv")
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26 17:30:11 -05:00
Tobias Brunner
09ee9dba96 ipv6: Reinject IPv6 packets if IPsec policy matches after SNAT
If SNAT modifies the source address the resulting packet might match
an IPsec policy, reinject the packet if that's the case.

The exact same thing is already done for IPv4.

Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26 17:14:56 -05:00
Alexey Kodanev
e5a9336adb ip6_gre: fix device features for ioctl setup
When ip6gre is created using ioctl, its features, such as
scatter-gather, GSO and tx-checksumming will be turned off:

  # ip -f inet6 tunnel add gre6 mode ip6gre remote fd00::1
  # ethtool -k gre6 (truncated output)
    tx-checksumming: off
    scatter-gather: off
    tcp-segmentation-offload: off
    generic-segmentation-offload: off [requested on]

But when netlink is used, they will be enabled:
  # ip link add gre6 type ip6gre remote fd00::1
  # ethtool -k gre6 (truncated output)
    tx-checksumming: on
    scatter-gather: on
    tcp-segmentation-offload: on
    generic-segmentation-offload: on

This results in a loss of performance when gre6 is created via ioctl.
The issue was found with LTP/gre tests.

Fix it by moving the setup of device features to a separate function
and invoke it with ndo_init callback because both netlink and ioctl
will eventually call it via register_netdevice():

   register_netdevice()
       - ndo_init() callback -> ip6gre_tunnel_init() or ip6gre_tap_init()
           - ip6gre_tunnel_init_common()
                - ip6gre_tnl_init_features()

The moved code also contains two minor style fixes:
  * removed needless tab from GRE6_FEATURES on NETIF_F_HIGHDMA line.
  * fixed the issue reported by checkpatch: "Unnecessary parentheses around
    'nt->encap.type == TUNNEL_ENCAP_NONE'"

Fixes: ac4eb009e4 ("ip6gre: Add support for basic offloads offloads excluding GSO")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26 12:21:19 -05:00