Commit graph

308 commits

Author SHA1 Message Date
Jiri Benc
9dc2ad1008 vxlan: set needed headroom correctly
vxlan_setup is called when allocating the net_device, i.e. way before
vxlan_newlink (or vxlan_dev_configure) is called. This means
vxlan->default_dst is actually unset in vxlan_setup and the condition that
sets needed_headroom always takes the else branch.

Set the needed_headrom at the point when we have the information about
the address family available.

Fixes: e4c7ed4153 ("vxlan: add ipv6 support")
Fixes: 2853af6a2e ("vxlan: use dev->needed_headroom instead of dev->hard_header_len")
CC: Cong Wang <cwang@twopensource.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 22:32:15 -07:00
Geert Uytterhoeven
0f1b7354e0 vxlan: Refactor vxlan_udp_encap_recv() to kill compiler warning
drivers/net/vxlan.c: In function ‘vxlan_udp_encap_recv’:
drivers/net/vxlan.c:1226: warning: ‘info’ may be used uninitialized in this function

While this warning is a false positive, it can be killed easily by
getting rid of the pointer intermediary and referring directly to the
ip_tunnel_info structure.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-06 19:48:03 -07:00
Pravin B Shelar
4c22279848 ip-tunnel: Use API to access tunnel metadata options.
Currently tun-info options pointer is used in few cases to
pass options around. But tunnel options can be accessed using
ip_tunnel_info_opts() API without using the pointer. Following
patch removes the redundant pointer and consistently make use
of API.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-31 12:28:56 -07:00
Jiri Benc
a43a9ef6a2 vxlan: do not receive IPv4 packets on IPv6 socket
By default (subject to the sysctl settings), IPv6 sockets listen also for
IPv4 traffic. Vxlan is not prepared for that and expects IPv6 header in
packets received through an IPv6 socket.

In addition, it's currently not possible to have both IPv4 and IPv6 vxlan
tunnel on the same port (unless bindv6only sysctl is enabled), as it's not
possible to create and bind both IPv4 and IPv6 vxlan interfaces and there's
no way to specify both IPv4 and IPv6 remote/group IP addresses.

Set IPV6_V6ONLY on vxlan sockets to fix both of these issues. This is not
done globally in udp_tunnel, as l2tp and tipc seems to work okay when
receiving IPv4 packets on IPv6 socket and people may rely on this behavior.
The other tunnels (geneve and fou) do not support IPv6.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:07:54 -07:00
Jiri Benc
7f9562a1f4 ip_tunnels: record IP version in tunnel info
There's currently nothing preventing directing packets with IPv6
encapsulation data to IPv4 tunnels (and vice versa). If this happens,
IPv6 addresses are incorrectly interpreted as IPv4 ones.

Track whether the given ip_tunnel_key contains IPv4 or IPv6 data. Store this
in ip_tunnel_info. Reject packets at appropriate places if they are supposed
to be encapsulated into an incompatible protocol.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:07:54 -07:00
Jiri Benc
46fa062ad6 ip_tunnels: convert the mode field of ip_tunnel_info to flags
The mode field holds a single bit of information only (whether the
ip_tunnel_info struct is for rx or tx). Change the mode field to bit flags.
This allows more mode flags to be added.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-29 13:07:54 -07:00
David S. Miller
0d36938bb8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-08-27 21:45:31 -07:00
Pravin B Shelar
c29a70d2ca tunnel: introduce udp_tun_rx_dst()
Introduce function udp_tun_rx_dst() to initialize tunnel dst on
receive path.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Reviewed-by: Jesse Gross <jesse@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 15:42:47 -07:00
Marcelo Ricardo Leitner
bef0057b7b vxlan: re-ignore EADDRINUSE from igmp_join
Before 56ef9c909b40[1] it used to ignore all errors from igmp_join().
That commit enhanced that and made it error out whatever error happened
with igmp_join(), but that's not good because when using multicast
groups vxlan will try to join it multiple times if the socket is reused
and then the 2nd and further attempts will fail with EADDRINUSE.

As we don't track to which groups the socket is already subscribed, it's
okay to just ignore that error.

Fixes: 56ef9c909b ("vxlan: Move socket initialization to within rtnl scope")
Reported-by: John Nielsen <lists@jnielsen.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-25 16:24:35 -07:00
Tom Herbert
58ce31cca1 vxlan: GRO support at tunnel layer
Add calls to gro_cells infrastructure to do GRO when receiving on a tunnel.

Testing:

Ran 200 netperf TCP_STREAM instance

  - With fix (GRO enabled on VXLAN interface)

    Verify GRO is happening.

    9084 MBps tput
    3.44% CPU utilization

  - Without fix (GRO disabled on VXLAN interface)

    Verified no GRO is happening.

    9084 MBps tput
    5.54% CPU utilization

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 15:59:56 -07:00
Tom Herbert
b7fe10e5eb gro: Fix remcsum offload to deal with frags in GRO
The remote checksum offload GRO did not consider the case that frag0
might be in use. This patch fixes that by accessing headers using the
skb_gro functions and not saving offsets relative to skb->head.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 15:59:56 -07:00
Jiri Benc
a725e514db vxlan: metadata based tunneling for IPv6
Support metadata based (formerly flow based) tunneling also for IPv6.
This complements commit ee122c79d4 ("vxlan: Flow based tunneling").

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:37 -07:00
Jiri Benc
6f264e47d4 vxlan: do not shadow flags variable
The 'flags' variable is already defined in the outer scope.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:37 -07:00
Jiri Benc
705cc62f67 vxlan: provide access function for vxlan socket address family
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:37 -07:00
Jiri Benc
61adedf3e3 route: move lwtunnel state to dst_entry
Currently, the lwtunnel state resides in per-protocol data. This is
a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa).
The xmit function of the tunnel does not know whether the packet has been
routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the
lwtstate data to dst_entry makes such inter-protocol tunneling possible.

As a bonus, this brings a nice diffstat.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:36 -07:00
Jiri Benc
7c383fb225 ip_tunnels: use tos and ttl fields also for IPv6
Rename the ipv4_tos and ipv4_ttl fields to just 'tos' and 'ttl', as they'll
be used with IPv6 tunnels, too.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:36 -07:00
Jiri Benc
c1ea5d672a ip_tunnels: add IPv6 addresses to ip_tunnel_key
Add the IPv6 addresses as an union with IPv4 ones. When using IPv4, the
newly introduced padding after the IPv4 addresses needs to be zeroed out.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-20 15:42:36 -07:00
Phil Sutter
22dba393a3 net: vxlan: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:05 -07:00
Atzm Watanabe
07a51cd379 vxlan: fix fdb_dump index calculation
When too many remotes are bound to an FDB entry, index may not be increased.
This problem will be caused on the large scale environment that is based on
the unicast default destination, for instance.

Signed-off-by: Atzm Watanabe <atzm@iij.ad.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 21:15:18 -07:00
Alexei Starovoitov
da8b43c0e1 vxlan: combine VXLAN_FLOWBASED into VXLAN_COLLECT_METADATA
IFLA_VXLAN_FLOWBASED is useless without IFLA_VXLAN_COLLECT_METADATA,
so combine them into single IFLA_VXLAN_COLLECT_METADATA flag.
'flowbased' doesn't convey real meaning of the vxlan tunnel mode.
This mode can be used by routing, tc+bpf and ovs.
Only ovs is strictly flow based, so 'collect metadata' is a better
name for this tunnel mode.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07 11:46:34 -07:00
Alexei Starovoitov
f8a9b1bc1b vxlan: expose COLLECT_METADATA flag to user space
Two vxlan driver flags FLOWBASED and COLLECT_METADATA need to be set to
make use of its new flow mode. The former already exposed. Expose the latter.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:24:24 -07:00
Roopa Prabhu
343d60aada ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argument
This patch adds net argument to ipv6_stub_impl.ipv6_dst_lookup
for use cases where sk is not available (like mpls).
sk appears to be needed to get the namespace 'net' and is optional
otherwise. This patch series changes ipv6_stub_impl.ipv6_dst_lookup
to take net argument. sk remains optional.

All callers of ipv6_stub_impl.ipv6_dst_lookup have been modified
to pass net. I have modified them to use already available
'net' in the scope of the call. I can change them to
sock_net(sk) to avoid any unintended change in behaviour if sock
namespace is different. They dont seem to be from code inspection.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:21:30 -07:00
Thomas Graf
6b6948dda7 vxlan: Use proper endian type for vni in vxlan[6]_xmit_skb
Silences the following sparse warnings:
drivers/net/vxlan.c:1818:21: warning: incorrect type in assignment (different base types)
drivers/net/vxlan.c:1818:21:    expected restricted __be32 [usertype] vx_vni
drivers/net/vxlan.c:1818:21:    got unsigned int [unsigned] [usertype] vni
drivers/net/vxlan.c:2014:58: warning: incorrect type in argument 11 (different base types)
drivers/net/vxlan.c:2014:58:    expected unsigned int [unsigned] [usertype] vni
drivers/net/vxlan.c:2014:58:    got restricted __be32 [usertype] <noident>

Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 16:33:26 -07:00
Thomas Graf
614732eaa1 openvswitch: Use regular VXLAN net_device device
This gets rid of all OVS specific VXLAN code in the receive and
transmit path by using a VXLAN net_device to represent the vport.
Only a small shim layer remains which takes care of handling the
VXLAN specific OVS Netlink configuration.

Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
since they are no longer needed.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:07 -07:00
Thomas Graf
0dfbdf4102 vxlan: Factor out device configuration
This factors out the device configuration out of the RTNL newlink
API which allows for in-kernel creation of VXLAN net_devices.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Thomas Graf
e7030878fc fib: Add fib rule match on tunnel id
This add the ability to select a routing table based on the tunnel
id which allows to maintain separate routing tables for each virtual
tunnel network.

ip rule add from all tunnel-id 100 lookup 100
ip rule add from all tunnel-id 200 lookup 200

A new static key controls the collection of metadata at tunnel level
upon demand.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Thomas Graf
3093fbe7ff route: Per route IP tunnel metadata via lightweight tunnel
This introduces a new IP tunnel lightweight tunnel type which allows
to specify IP tunnel instructions per route. Only IPv4 is supported
at this point.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Thomas Graf
ee122c79d4 vxlan: Flow based tunneling
Allows putting a VXLAN device into a new flow-based mode in which
skbs with a ip_tunnel_info dst metadata attached will be encapsulated
according to the instructions stored in there with the VXLAN device
defaults taken into consideration.

Similar on the receive side, if the VXLAN_F_COLLECT_METADATA flag is
set, the packet processing will populate a ip_tunnel_info struct for
each packet received and attach it to the skb using the new metadata
dst.  The metadata structure will contain the outer header and tunnel
header fields which have been stripped off. Layers further up in the
stack such as routing, tc or netfitler can later match on these fields
and perform forwarding. It is the responsibility of upper layers to
ensure that the flag is set if the metadata is needed. The flag limits
the additional cost of metadata collecting based on demand.

This prepares the VXLAN device to be steered by the routing and other
subsystems which allows to support encapsulation for a large number
of tunnel endpoints and tunnel ids through a single net_device which
improves the scalability.

It also allows for OVS to leverage this mode which in turn allows for
the removal of the OVS specific VXLAN code.

Because the skb is currently scrubed in vxlan_rcv(), the attachment of
the new dst metadata is postponed until after scrubing which requires
the temporary addition of a new member to vxlan_metadata. This member
is removed again in a later commit after the indirect VXLAN receive API
has been removed.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Sowmini Varadhan
7177a3b037 net/vxlan: Fix kernel unaligned access in __vxlan_find_mac
__vxlan_find_mac invokes ether_addr_equal on the eth_addr field,
which triggers unaligned access messages, so rearrange vxlan_fdb
to avoid this in the most non-intrusive way.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:12:34 -07:00
Sorin Dumitru
14e1d0fa97 vxlan: release lock after each bucket in vxlan_cleanup
We're seeing some softlockups from this function when there
are a lot fdb entries on a vxlan device. Taking the lock for
each bucket instead of the whole table is enough to fix that.

Signed-off-by: Sorin Dumitru <sdumitru@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-27 13:33:21 -04:00
David S. Miller
36583eb54d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/cadence/macb.c
	drivers/net/phy/phy.c
	include/linux/skbuff.h
	net/ipv4/tcp.c
	net/switchdev/switchdev.c

Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.

phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.

tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.

macb.c involved the addition of two zyncq device entries.

skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-23 01:22:35 -04:00
John W. Linville
13c3ed6a92 vxlan: correct typo in call to unregister_netdevice_queue
By inspection, this appears to be a typo.  The gating comparison
involves vxlan->dev rather than dev.  In fact, dev is the iterator in
the preceding loop above but it is actually constant in the 2nd loop.

Use of dev seems to be a bad cut-n-paste from the prior call to
unregister_netdevice_queue.  Change dev to vxlan->dev, since that is
what is actually being checked.

Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-18 16:57:09 -04:00
Nicolas Dichtel
7a0877d4b4 netns: rename peernet2id() to peernet2id_alloc()
In a following commit, a new function will be introduced to only lookup for
a nsid (no allocation if the nsid doesn't exist). To avoid confusion, the
existing function is renamed.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 22:15:30 -04:00
Thomas Graf
239fb791d4 vxlan: Correctly set flow*i_mark and flow4i_proto in route lookups
VXLAN must provide the skb mark and specifiy IPPROTO_UDP when doing
the FIB lookup for the remote ip. Otherwise an invalid route might
be returned.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05 19:37:54 -04:00
Li RongQing
608404290e vxlan: remove the unnecessary codes
The return value of vxlan_fdb_replace always is greater than or equal to 0

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-22 18:45:49 -04:00
David S. Miller
87ffabb1f0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The dwmac-socfpga.c conflict was a case of a bug fix overlapping
changes in net-next to handle an error pointer differently.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-14 15:44:14 -04:00
Jesse Gross
b736a623bd udptunnels: Call handle_offloads after inserting vlan tag.
handle_offloads() calls skb_reset_inner_headers() to store
the layer pointers to the encapsulated packet. However, we
currently push the vlag tag (if there is one) onto the packet
afterwards. This changes the MAC header for the encapsulated
packet but it is not reflected in skb->inner_mac_header, which
breaks GSO and drivers which attempt to use this for encapsulation
offloads.

Fixes: 1eaa8178 ("vxlan: Add tx-vlan offload support.")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-09 14:56:32 -04:00
WANG Cong
f13b1689d7 vxlan: do not exit on error in vxlan_stop()
We need to clean up vxlan despite vxlan_igmp_leave() fails.

This fixes the following kernel warning:

 WARNING: CPU: 0 PID: 6 at lib/debugobjects.c:263 debug_print_object+0x7c/0x8d()
 ODEBUG: free active (active state 0) object type: timer_list hint: vxlan_cleanup+0x0/0xd0
 CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.0.0-rc7+ #953
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 Workqueue: netns cleanup_net
  0000000000000009 ffff88011955f948 ffffffff81a25f5a 00000000253f253e
  ffff88011955f998 ffff88011955f988 ffffffff8107608e 0000000000000000
  ffffffff814deba2 ffff8800d4e94000 ffffffff82254c30 ffffffff81fbe455
 Call Trace:
  [<ffffffff81a25f5a>] dump_stack+0x4c/0x65
  [<ffffffff8107608e>] warn_slowpath_common+0x9c/0xb6
  [<ffffffff814deba2>] ? debug_print_object+0x7c/0x8d
  [<ffffffff81076116>] warn_slowpath_fmt+0x46/0x48
  [<ffffffff814deba2>] debug_print_object+0x7c/0x8d
  [<ffffffff81666bf1>] ? vxlan_fdb_destroy+0x5b/0x5b
  [<ffffffff814dee02>] __debug_check_no_obj_freed+0xc3/0x15f
  [<ffffffff814df728>] debug_check_no_obj_freed+0x12/0x16
  [<ffffffff8117ae4e>] slab_free_hook+0x64/0x6c
  [<ffffffff8114deaa>] ? kvfree+0x31/0x33
  [<ffffffff8117dc66>] kfree+0x101/0x1ac
  [<ffffffff8114deaa>] kvfree+0x31/0x33
  [<ffffffff817d4137>] netdev_freemem+0x18/0x1a
  [<ffffffff817e8b52>] netdev_release+0x2e/0x32
  [<ffffffff815b4163>] device_release+0x5a/0x92
  [<ffffffff814bd4dd>] kobject_cleanup+0x49/0x5e
  [<ffffffff814bd3ff>] kobject_put+0x45/0x49
  [<ffffffff817d3fc1>] netdev_run_todo+0x26f/0x283
  [<ffffffff817d4873>] ? rollback_registered_many+0x20f/0x23b
  [<ffffffff817e0c80>] rtnl_unlock+0xe/0x10
  [<ffffffff817d4af0>] default_device_exit_batch+0x12a/0x139
  [<ffffffff810aadfa>] ? wait_woken+0x8f/0x8f
  [<ffffffff817c8e14>] ops_exit_list+0x2b/0x57
  [<ffffffff817c9b21>] cleanup_net+0x154/0x1e7
  [<ffffffff8108b05d>] process_one_work+0x255/0x4ad
  [<ffffffff8108af69>] ? process_one_work+0x161/0x4ad
  [<ffffffff8108b4b1>] worker_thread+0x1cd/0x2ab
  [<ffffffff8108b2e4>] ? process_scheduled_works+0x2f/0x2f
  [<ffffffff81090686>] kthread+0xd4/0xdc
  [<ffffffff8109eca3>] ? local_clock+0x19/0x22
  [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83
  [<ffffffff81a31c48>] ret_from_fork+0x58/0x90
  [<ffffffff810905b2>] ? __kthread_parkme+0x83/0x83

For the long-term, we should handle NETDEV_{UP,DOWN} event
from the lower device of a tunnel device.

Fixes: 56ef9c909b ("vxlan: Move socket initialization to within rtnl scope")
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 22:48:08 -04:00
WANG Cong
2790460e14 vxlan: fix a shadow local variable
Commit 79b16aadea
("udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb()")
introduce 'sk' but we already have one inner 'sk'.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-08 14:56:35 -04:00
David Miller
79b16aadea udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb().
That was we can make sure the output path of ipv4/ipv6 operate on
the UDP socket rather than whatever random thing happens to be in
skb->sk.

Based upon a patch by Jiri Pirko.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
2015-04-07 15:29:08 -04:00
Simon Horman
c4b495128c vxlan: correct spelling in comments
Fix some spelling / typos:
* droppped -> dropped
* asddress -> address
* compatbility -> compatibility

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-01 22:52:29 -04:00
Jiri Benc
67b61f6c13 netlink: implement nla_get_in_addr and nla_get_in6_addr
Those are counterparts to nla_put_in_addr and nla_put_in6_addr.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:58:35 -04:00
Jiri Benc
930345ea63 netlink: implement nla_put_in_addr and nla_put_in6_addr
IP addresses are often stored in netlink attributes. Add generic functions
to do that.

For nla_put_in_addr, it would be nicer to pass struct in_addr but this is
not used universally throughout the kernel, in way too many places __be32 is
used to store IPv4 address.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:58:35 -04:00
Jiri Benc
f0ef31264c vxlan: fix indentation
Some lines in vxlan code are indented by 7 spaces instead of a tab.

Fixes: e4c7ed4153 ("vxlan: add ipv6 support")
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:55:21 -04:00
Marcelo Ricardo Leitner
24c0e6838c vxlan: simplify if clause in dev_close
Dan Carpenter's static checker warned that in vxlan_stop we are checking
if 'vs' can be NULL while later we simply derreference it.

As after commit 56ef9c909b ("vxlan: Move socket initialization to
within rtnl scope") 'vs' just cannot be NULL in vxlan_stop() anymore, as
the interface won't go up if the socket initialization fails. So we are
good to just remove the check and make it consistent.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 17:01:58 -04:00
David S. Miller
0fa74a4be4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	net/core/sysctl_net_core.c
	net/ipv4/inet_diag.c

The be_main.c conflict resolution was really tricky.  The conflict
hunks generated by GIT were very unhelpful, to say the least.  It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().

So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged.  And this worked beautifully.

The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 18:51:09 -04:00
Marcelo Ricardo Leitner
149d7549c2 vxlan: fix possible use of uninitialized in vxlan_igmp_{join, leave}
Test robot noticed that we check the return of vxlan_igmp_join and leave
but inside them there was a path that it could be used initialized.

It's not really possible because those if() inside these igmp functions
would always match as we can't have sockets of other type in there, but
this way we keep the compiler happy.

Fixes: 56ef9c909b ("vxlan: Move socket initialization to within rtnl scope")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 13:31:24 -04:00
Marcelo Ricardo Leitner
56ef9c909b vxlan: Move socket initialization to within rtnl scope
Currently, if a multicast join operation fail, the vxlan interface will
be UP but not functional, without even a log message informing the user.

Now that we can grab socket lock after already having rntl, we don't
need to defer socket creation and multicast operations. By not deferring
we can do proper error reporting to the user through ip exit code.

This patch thus removes all deferred work that vxlan had and put it back
inline. Now the socket will only be created, bound and join multicast
group when one bring the interface up, and will undo all that as soon as
one put the interface down.

As vxlan_sock_hold() is not used after this patch, it was removed too.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:05:10 -04:00
Marcelo Ricardo Leitner
54ff9ef36b ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}
in favor of their inner __ ones, which doesn't grab rtnl.

As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.

So this patch:
- move rtnl handling to callers instead while already fixing some
  reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
  __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
  __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:05:09 -04:00
Alexey Kodanev
40fb70f3aa vxlan: fix wrong usage of VXLAN_VID_MASK
commit dfd8645ea1 wrongly assumes that VXLAN_VDI_MASK includes
eight lower order reserved bits of VNI field that are using for remote
checksum offload.

Right now, when VNI number greater then 0xffff, vxlan_udp_encap_recv()
will always return with 'bad_flag' error, reducing the usable vni range
from 0..16777215 to 0..65535. Also, it doesn't really check whether RCO
bits processed or not.

Fix it by adding new VNI mask which has all 32 bits of VNI field:
24 bits for id and 8 bits for other usage.

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-13 13:08:07 -04:00
Simon Horman
719a11cdbf vxlan: Don't set s_addr in vxlan_create_sock
In the case of AF_INET s_addr was set to INADDR_ANY (0) which which both
symmetric with the AF_INET6 case, where s_addr is not set, and unnecessary
as udp_conf is zeroed out earlier in the same function.

I suspect this change does not have any run-time effect due to compiler
optimisations. But it does make the code a little easier on the/my eyes.

Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 23:23:16 -04:00
Tom Herbert
0ace2ca89c vxlan: Use checksum partial with remote checksum offload
Change remote checksum handling to set checksum partial as default
behavior. Added an iflink parameter to configure not using
checksum partial (calling csum_partial to update checksum).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 15:12:13 -08:00
Tom Herbert
15e2396d4e net: Infrastructure for CHECKSUM_PARTIAL with remote checsum offload
This patch adds infrastructure so that remote checksum offload can
set CHECKSUM_PARTIAL instead of calling csum_partial and writing
the modfied checksum field.

Add skb_remcsum_adjust_partial function to set an skb for using
CHECKSUM_PARTIAL with remote checksum offload.  Changed
skb_remcsum_process and skb_gro_remcsum_process to take a boolean
argument to indicate if checksum partial can be set or the
checksum needs to be modified using the normal algorithm.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 15:12:12 -08:00
Tom Herbert
26c4f7da3e net: Fix remcsum in GRO path to not change packet
Remote checksum offload processing is currently the same for both
the GRO and non-GRO path. When the remote checksum offload option
is encountered, the checksum field referred to is modified in
the packet. So in the GRO case, the packet is modified in the
GRO path and then the operation is skipped when the packet goes
through the normal path based on skb->remcsum_offload. There is
a problem in that the packet may be modified in the GRO path, but
then forwarded off host still containing the remote checksum option.
A remote host will again perform RCO but now the checksum verification
will fail since GRO RCO already modified the checksum.

To fix this, we ensure that GRO restores a packet to it's original
state before returning. In this model, when GRO processes a remote
checksum option it still changes the checksum per the algorithm
but on return from lower layer processing the checksum is restored
to its original value.

In this patch we add define gro_remcsum structure which is passed
to skb_gro_remcsum_process to save offset and delta for the checksum
being changed. After lower layer processing, skb_gro_remcsum_cleanup
is called to restore the checksum before returning from GRO.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 15:12:09 -08:00
David S. Miller
2573beec56 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09 14:35:57 -08:00
Rasmus Villemoes
a4870f79c2 vxlan: Wrong type passed to %pIS
src_ip is a pointer to a union vxlan_addr, one member of which is a
struct sockaddr. Passing a pointer to src_ip is wrong; one should pass
the value of src_ip itself. Since %pIS formally expects something of
type struct sockaddr*, let's pass a pointer to the appropriate union
member, though this of course doesn't change the generated code.

Fixes: e4c7ed4153 ("vxlan: add ipv6 support")
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 17:14:40 -08:00
David S. Miller
6e03f896b5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/vxlan.c
	drivers/vhost/net.c
	include/linux/if_vlan.h
	net/core/dev.c

The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.

In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.

In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.

In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 14:33:28 -08:00
Thomas Graf
db79a62183 vxlan: Only set has-GBP bit in header if any other bits would be set
This allows for a VXLAN-GBP socket to talk to a Linux VXLAN socket by
not setting any of the bits.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 00:38:02 -08:00
Tom Herbert
dcdc899469 net: add skb functions to process remote checksum offload
This patch adds skb_remcsum_process and skb_gro_remcsum_process to
perform the appropriate adjustments to the skb when receiving
remote checksum offload.

Updated vxlan and gue to use these functions.

Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did
not see any change in performance.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 13:54:07 -08:00
Nicolas Dichtel
33564bbb2c vxlan: setup the right link netns in newlink hdlr
Rename the netns to src_net to avoid confusion with the netns where the
interface stands. The user may specify IFLA_NET_NS_[PID|FD] to create
a x-netns netndevice: IFLA_NET_NS_[PID|FD] points to the netns where the
netdevice stands and src_net to the link netns.

Note that before commit f01ec1c017 ("vxlan: add x-netns support"), it was
possible to create a x-netns vxlan netdevice, but the netdevice was not
operational.

Fixes: f01ec1c017 ("vxlan: add x-netns support")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-29 14:20:02 -08:00
Nicolas Dichtel
4967082b46 vxlan: advertise link netns in fdb messages
Previous commit is based on a wrong assumption, fdb messages are always sent
into the netns where the interface stands (see vxlan_fdb_notify()).

These fdb messages doesn't embed the rtnl attribute IFLA_LINK_NETNSID, thus we
need to add it (useful to interpret NDA_IFINDEX or NDA_DST for example).

Note also that vxlan_nlmsg_size() was not updated.

Fixes: 193523bf93 ("vxlan: advertise netns of vxlan dev in fdb msg")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27 17:11:07 -08:00
Tom Herbert
af33c1adae vxlan: Eliminate dependency on UDP socket in transmit path
In the vxlan transmit path there is no need to reference the socket
for a tunnel which is needed for the receive side. We do, however,
need the vxlan_dev flags. This patch eliminate references
to the socket in the transmit path, and changes VXLAN_F_UNSHAREABLE
to be VXLAN_F_RCV_FLAGS. This mask is used to store the flags
applicable to receive (GBP, CSUM6_RX, and REMCSUM_RX) in the
vxlan_sock flags.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-24 23:15:40 -08:00
Tom Herbert
d998f8efa4 udp: Do not require sock in udp_tunnel_xmit_skb
The UDP tunnel transmit functions udp_tunnel_xmit_skb and
udp_tunnel6_xmit_skb include a socket argument. The socket being
passed to the functions (from VXLAN) is a UDP created for receive
side. The only thing that the socket is used for in the transmit
functions is to get the setting for checksum (enabled or zero).
This patch removes the argument and and adds a nocheck argument
for checksum setting. This eliminates the unnecessary dependency
on a UDP socket for UDP tunnel transmit.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-24 23:15:40 -08:00
Nicolas Dichtel
193523bf93 vxlan: advertise netns of vxlan dev in fdb msg
Netlink FDB messages are sent in the link netns. The header of these messages
contains the ifindex (ndm_ifindex) of the netdevice, but this ifindex is
unusable in case of x-netns vxlan.
I named the new attribute NDA_NDM_IFINDEX_NETNSID, to avoid confusion with
NDA_IFINDEX.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-23 17:51:15 -08:00
Nicolas Dichtel
1728d4fabd tunnels: advertise link netns via netlink
Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-19 14:32:03 -05:00
Johannes Berg
053c095a82 netlink: make nlmsg_end() and genlmsg_end() void
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.

This makes the very common pattern of

  if (genlmsg_end(...) < 0) { ... }

be a whole bunch of dead code. Many places also simply do

  return nlmsg_end(...);

and the caller is expected to deal with it.

This also commonly (at least for me) causes errors, because it is very
common to write

  if (my_function(...))
    /* error condition */

and if my_function() does "return nlmsg_end()" this is of course wrong.

Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.

Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did

-	return nlmsg_end(...);
+	nlmsg_end(...);
+	return 0;

I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.

One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-18 01:03:45 -05:00
Thomas Graf
ac5132d1a0 vxlan: Only bind to sockets with compatible flags enabled
A VXLAN net_device looking for an appropriate socket may only consider
a socket which has a matching set of flags/extensions enabled. If
incompatible flags are enabled, return a conflict to have the caller
create a distinct socket with distinct port.

The OVS VXLAN port is kept unaware of extensions at this point.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15 01:11:41 -05:00
Thomas Graf
3511494ce2 vxlan: Group Policy extension
Implements supports for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata is mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
tc, etc.

The group membership is defined by the lower 16 bits of skb->mark, the
upper 16 bits are used for flags.

SELinux allows to manage label to secure local resources. However,
distributed applications require ACLs to implemented across hosts. This
is typically achieved by matching on L2-L4 fields to identify the
original sending host and process on the receiver. On top of that,
netlabel and specifically CIPSO [1] allow to map security contexts to
universal labels.  However, netlabel and CIPSO are relatively complex.
This patch provides a lightweight alternative for overlay network
environments with a trusted underlay. No additional control protocol
is required.

           Host 1:                       Host 2:

      Group A        Group B        Group B     Group A
      +-----+   +-------------+    +-------+   +-----+
      | lxc |   | SELinux CTX |    | httpd |   | VM  |
      +--+--+   +--+----------+    +---+---+   +--+--+
	  \---+---/                     \----+---/
	      |                              |
	  +---+---+                      +---+---+
	  | vxlan |                      | vxlan |
	  +---+---+                      +---+---+
	      +------------------------------+

Backwards compatibility:
A VXLAN-GBP socket can receive standard VXLAN frames and will assign
the default group 0x0000 to such frames. A Linux VXLAN socket will
drop VXLAN-GBP  frames. The extension is therefore disabled by default
and needs to be specifically enabled:

   ip link add [...] type vxlan [...] gbp

In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
must run on a separate port number.

Examples:
 iptables:
  host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
  host2# iptables -I INPUT -m mark --mark 0x200 -j DROP

 OVS:
  # ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
  # ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'

[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] http://lwn.net/Articles/204905/

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-15 01:11:41 -05:00
Tom Herbert
dfd8645ea1 vxlan: Remote checksum offload
Add support for remote checksum offload in VXLAN. This uses a
reserved bit to indicate that RCO is being done, and uses the low order
reserved eight bits of the VNI to hold the start and offset values in a
compressed manner.

Start is encoded in the low order seven bits of VNI. This is start >> 1
so that the checksum start offset is 0-254 using even values only.
Checksum offset (transport checksum field) is indicated in the high
order bit in the low order byte of the VNI. If the bit is set, the
checksum field is for UDP (so offset = start + 6), else checksum
field is for TCP (so offset = start + 16). Only TCP and UDP are
supported in this implementation.

Remote checksum offload for VXLAN is described in:

https://tools.ietf.org/html/draft-herbert-vxlan-rco-00

Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).

With UDP checksums and Remote Checksum Offload
  IPv4
      Client
        11.84% CPU utilization
      Server
        12.96% CPU utilization
      9197 Mbps
  IPv6
      Client
        12.46% CPU utilization
      Server
        14.48% CPU utilization
      8963 Mbps

With UDP checksums, no remote checksum offload
  IPv4
      Client
        15.67% CPU utilization
      Server
        14.83% CPU utilization
      9094 Mbps
  IPv6
      Client
        16.21% CPU utilization
      Server
        14.32% CPU utilization
      9058 Mbps

No UDP checksums
  IPv4
      Client
        15.03% CPU utilization
      Server
        23.09% CPU utilization
      9089 Mbps
  IPv6
      Client
        16.18% CPU utilization
      Server
        26.57% CPU utilization
       8954 Mbps

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-14 15:20:04 -05:00
Tom Herbert
a2b12f3c7a udp: pass udp_offload struct to UDP gro callbacks
This patch introduces udp_offload_callbacks which has the same
GRO functions (but not a GSO function) as offload_callbacks,
except there is an argument to a udp_offload struct passed to
gro_receive and gro_complete functions. This additional argument
can be used to retrieve the per port structure of the encapsulation
for use in gro processing (mostly by doing container_of on the
structure).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-14 15:20:04 -05:00
Jiri Pirko
df8a39defa net: rename vlan_tx_* helpers since "tx" is misleading there
The same macros are used for rx as well. So rename it.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13 17:51:08 -05:00
Tom Herbert
3bf3947526 vxlan: Improve support for header flags
This patch cleans up the header flags of VXLAN in anticipation of
defining some new ones:

- Move header related definitions from vxlan.c to vxlan.h
- Change VXLAN_FLAGS to be VXLAN_HF_VNI (only currently defined flag)
- Move check for unknown flags to after we find vxlan_sock, this
  assumes that some flags may be processed based on tunnel
  configuration
- Add a comment about why the stack treating unknown set flags as an
  error instead of ignoring them

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12 16:05:01 -05:00
Jesse Gross
9b174d88c2 net: Add Transparent Ethernet Bridging GRO support.
Currently the only tunnel protocol that supports GRO with encapsulated
Ethernet is VXLAN. This pulls out the Ethernet code into a proper layer
so that it can be used by other tunnel protocols such as GRE and Geneve.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-02 15:46:41 -05:00
Pravin B Shelar
74f47278cb vxlan: Fix double free of skb.
In case of error vxlan_xmit_one() can free already freed skb.
Also fixes memory leak of dst-entry.

Fixes: acbf74a763 ("vxlan: Refactor vxlan driver to make use
of the common UDP tunnel functions").

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:57:31 -05:00
Marcelo Leitner
00c83b01d5 Fix race condition between vxlan_sock_add and vxlan_sock_release
Currently, when trying to reuse a socket, vxlan_sock_add will grab
vn->sock_lock, locate a reusable socket, inc refcount and release
vn->sock_lock.

But vxlan_sock_release() will first decrement refcount, and then grab
that lock. refcnt operations are atomic but as currently we have
deferred works which hold vs->refcnt each, this might happen, leading to
a use after free (specially after vxlan_igmp_leave):

  CPU 1                            CPU 2

deferred work                    vxlan_sock_add
  ...                              ...
                                   spin_lock(&vn->sock_lock)
                                   vs = vxlan_find_sock();
  vxlan_sock_release
    dec vs->refcnt, reaches 0
    spin_lock(&vn->sock_lock)
                                   vxlan_sock_hold(vs), refcnt=1
                                   spin_unlock(&vn->sock_lock)
    hlist_del_rcu(&vs->hlist);
    vxlan_notify_del_rx_port(vs)
    spin_unlock(&vn->sock_lock)

So when we look for a reusable socket, we check if it wasn't freed
already before reusing it.

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Fixes: 7c47cedf43 ("vxlan: move IGMP join/leave to work queue")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 14:57:08 -05:00
Jiri Pirko
f6f6424ba7 net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del
Do the work of parsing NDA_VLAN directly in rtnetlink code, pass simple
u16 vid to drivers from there.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-02 20:01:18 -08:00
David S. Miller
60b7379dc5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-11-29 20:47:48 -08:00
Alexander Duyck
3dc2b6a8d3 vxlan: Fix boolean flip in VXLAN_F_UDP_ZERO_CSUM6_[TX|RX]
In "vxlan: Call udp_sock_create" there was a logic error that resulted in
the default for IPv6 VXLAN tunnels going from using checksums to not using
checksums.  Since there is currently no support in iproute2 for setting
these values it means that a kernel after the change cannot talk over a IPv6
VXLAN tunnel to a kernel prior the change.

Fixes: 3ee64f3 ("vxlan: Call udp_sock_create")

Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-25 14:12:12 -05:00
David S. Miller
1459143386 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ieee802154/fakehard.c

A bug fix went into 'net' for ieee802154/fakehard.c, which is removed
in 'net-next'.

Add build fix into the merge from Stephen Rothwell in openvswitch, the
logging macros take a new initial 'log' argument, a new call was added
in 'net' so when we merge that in here we have to explicitly add the
new 'log' arg to it else the build fails.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 22:28:24 -05:00
Jiri Pirko
5968250c86 vlan: introduce *vlan_hwaccel_push_inside helpers
Use them to push skb->vlan_tci into the payload and avoid code
duplication.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Jiri Pirko
62749e2cb3 vlan: rename __vlan_put_tag to vlan_insert_tag_set_proto
Name fits better. Plus there's going to be introduced
__vlan_insert_tag later on.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21 14:20:17 -05:00
Joe Stringer
11bf7828a5 vxlan: Inline vxlan_gso_check().
Suggested-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-18 15:38:44 -05:00
Joe Stringer
23e62de33d net: Add vxlan_gso_check() helper
Most NICs that report NETIF_F_GSO_UDP_TUNNEL support VXLAN, and not
other UDP-based encapsulation protocols where the format and size of the
header differs. This patch implements a generic ndo_gso_check() for
VXLAN which will only advertise GSO support when the skb looks like it
contains VXLAN (or no UDP tunnelling at all).

Implementation shamelessly stolen from Tom Herbert:
http://thread.gmane.org/gmane.linux.network/332428/focus=333111

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-14 17:12:48 -05:00
David S. Miller
076ce44825 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/chelsio/cxgb4vf/sge.c
	drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c

sge.c was overlapping two changes, one to use the new
__dev_alloc_page() in net-next, and one to use s->fl_pg_order in net.

ixgbe_phy.c was a set of overlapping whitespace changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-14 01:01:12 -05:00
Marcelo Leitner
19ca9fc144 vxlan: Do not reuse sockets for a different address family
Currently, we only match against local port number in order to reuse
socket. But if this new vxlan wants an IPv6 socket and a IPv4 one bound
to that port, vxlan will reuse an IPv4 socket as IPv6 and a panic will
follow. The following steps reproduce it:

   # ip link add vxlan6 type vxlan id 42 group 229.10.10.10 \
       srcport 5000 6000 dev eth0
   # ip link add vxlan7 type vxlan id 43 group ff0e::110 \
       srcport 5000 6000 dev eth0
   # ip link set vxlan6 up
   # ip link set vxlan7 up
   <panic>

[    4.187481] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
...
[    4.188076] Call Trace:
[    4.188085]  [<ffffffff81667c4a>] ? ipv6_sock_mc_join+0x3a/0x630
[    4.188098]  [<ffffffffa05a6ad6>] vxlan_igmp_join+0x66/0xd0 [vxlan]
[    4.188113]  [<ffffffff810a3430>] process_one_work+0x220/0x710
[    4.188125]  [<ffffffff810a33c4>] ? process_one_work+0x1b4/0x710
[    4.188138]  [<ffffffff810a3a3b>] worker_thread+0x11b/0x3a0
[    4.188149]  [<ffffffff810a3920>] ? process_one_work+0x710/0x710

So address family must also match in order to reuse a socket.

Reported-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-13 15:19:59 -05:00
Jesse Gross
cfdf1e1ba5 udptunnel: Add SKB_GSO_UDP_TUNNEL during gro_complete.
When doing GRO processing for UDP tunnels, we never add
SKB_GSO_UDP_TUNNEL to gso_type - only the type of the inner protocol
is added (such as SKB_GSO_TCPV4). The result is that if the packet is
later resegmented we will do GSO but not treat it as a tunnel. This
results in UDP fragmentation of the outer header instead of (i.e.) TCP
segmentation of the inner header as was originally on the wire.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-10 15:09:45 -05:00
Tom Herbert
5c91ae08e4 vxlan: Fix to enable UDP checksums on interface
Add definition to vxlan nla_policy for UDP checksum. This is necessary
to enable UDP checksums on VXLAN.

In some instances, enabling UDP checksums can improve performance on
receive for devices that return legacy checksum-unnecessary for UDP/IP.
Also, UDP checksum provides some protection against VNI corruption.

Testing:

Ran 200 instances of TCP_STREAM and TCP_RR on bnx2x.

TCP_STREAM
  IPv4, without UDP checksums
      14.41% TX CPU utilization
      25.71% RX CPU utilization
      9083.4 Mbps
  IPv4, with UDP checksums
      13.99% TX CPU utilization
      13.40% RX CPU utilization
      9095.65 Mbps

TCP_RR
  IPv4, without UDP checksums
      94.08% TX CPU utilization
      156/248/462 90/95/99% latencies
      1.12743e+06 tps
  IPv4, with UDP checksums
      94.43% TX CPU utilization
      158/250/462 90/95/99% latencies
      1.13345e+06 tps

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-06 21:59:55 -05:00
Li RongQing
7a9f526fc3 vxlan: fix a free after use
pskb_may_pull maybe change skb->data and make eth pointer oboslete,
so eth needs to reload

Fixes: 91269e390d ("vxlan: using pskb_may_pull as early as possible")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-17 16:23:58 -04:00
Li RongQing
91269e390d vxlan: using pskb_may_pull as early as possible
pskb_may_pull should be used to check if skb->data has enough space,
skb->len can not ensure that.

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-15 23:33:23 -04:00
Li RongQing
ce6502a8f9 vxlan: fix a use after free in vxlan_encap_bypass
when netif_rx() is done, the netif_rx handled skb maybe be freed,
and should not be used.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-15 23:30:28 -04:00
Eric Dumazet
0287587884 net: better IFF_XMIT_DST_RELEASE support
Testing xmit_more support with netperf and connected UDP sockets,
I found strange dst refcount false sharing.

Current handling of IFF_XMIT_DST_RELEASE is not optimal.

Dropping dst in validate_xmit_skb() is certainly too late in case
packet was queued by cpu X but dequeued by cpu Y

The logical point to take care of drop/force is in __dev_queue_xmit()
before even taking qdisc lock.

As Julian Anastasov pointed out, need for skb_dst() might come from some
packet schedulers or classifiers.

This patch adds new helper to cleanly express needs of various drivers
or qdiscs/classifiers.

Drivers that need skb_dst() in their ndo_start_xmit() should call
following helper in their setup instead of the prior :

	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
->
	netif_keep_dst(dev);

Instead of using a single bit, we use two bits, one being
eventually rebuilt in bonding/team drivers.

The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
rebuilt in bonding/team. Eventually, we could add something
smarter later.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-07 13:22:11 -04:00
Tom Herbert
996c9fd167 vxlan: Set inner protocol before transmit
Call skb_set_inner_protocol to set inner Ethernet protocol to
ETH_P_TEB before transmit. This is needed for GSO with UDP tunnels.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 21:35:52 -04:00
Andy Zhou
3c4d1daece vxlan: Fix bug introduced by commit acbf74a763
Commit acbf74a763 ("vxlan: Refactor vxlan driver to make use of the common UDP tunnel functions." introduced a bug in vxlan_xmit_one()
function, causing it to transmit Vxlan packets without proper
Vxlan header inserted. The change was not needed in the first
place. Revert it.

Reported-by: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23 15:32:10 -04:00
Andy Zhou
acbf74a763 vxlan: Refactor vxlan driver to make use of the common UDP tunnel functions.
Simplify vxlan implementation using common UDP tunnel APIs.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-19 15:57:15 -04:00
Tom Herbert
c60c308cbd vxlan: Enable checksum unnecessary conversions for vxlan/UDP sockets
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 21:36:28 -07:00
Tom Herbert
77cffe23c1 net: Clarification of CHECKSUM_UNNECESSARY
This patch:
 - Clarifies the specific requirements of devices returning
   CHECKSUM_UNNECESSARY (comments in skbuff.h).
 - Adds csum_level field to skbuff. This is used to express how
   many checksums are covered by CHECKSUM_UNNECESSARY (stores n - 1).
   This replaces the overloading of skb->encapsulation, that field is
   is now only used to indicate inner headers are valid.
 - Change __skb_checksum_validate_needed to "consume" each checksum
   as indicated by csum_level as layers of the the packet are parsed.
 - Remove skb_pop_rcv_encapsulation, no longer needed in the new
   csum_level model.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-29 20:41:11 -07:00
Gerhard Stenzel
a45e92a599 vxlan: fix incorrect initializer in union vxlan_addr
The first initializer in the following

        union vxlan_addr ipa = {
            .sin.sin_addr.s_addr = tip,
            .sa.sa_family = AF_INET,
        };

is optimised away by the compiler, due to the second initializer,
therefore initialising .sin.sin_addr.s_addr always to 0.
This results in netlink messages indicating a L3 miss never contain the
missed IP address. This was observed with GCC 4.8 and 4.9. I do not know about previous versions.
The problem affects user space programs relying on an IP address being
sent as part of a netlink message indicating a L3 miss.

Changing
            .sa.sa_family = AF_INET,
to
            .sin.sin_family = AF_INET,
fixes the problem.

Signed-off-by: Gerhard Stenzel <gerhard.stenzel@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-22 19:54:56 -07:00
David S. Miller
f139c74a8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30 13:25:49 -07:00
Jun Zhao
545469f7a5 neighbour : fix ndm_type type error issue
ndm_type means L3 address type, in neighbour proxy and vxlan, it's RTN_UNICAST.
NDA_DST is for netlink TLV type, hence it's not right value in this context.

Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-28 17:52:17 -07:00
Tom Herbert
3ee64f39be vxlan: Call udp_sock_create
In vxlan driver call common function udp_sock_create to create the
listener UDP port.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-14 16:12:15 -07:00