Commit Graph

139 Commits

Author SHA1 Message Date
Mahesh Bandewar 92ff426450 ipvlan: add L2 check for packets arriving via virtual devices
Packets that don't have dest mac as the mac of the master device should
not be entertained by the IPvlan rx-handler. This is mostly true as the
packet path mostly takes care of that, except when the master device is
a virtual device. As demonstrated in the following case -

  ip netns add ns1
  ip link add ve1 type veth peer name ve2
  ip link add link ve2 name iv1 type ipvlan mode l2
  ip link set dev iv1 netns ns1
  ip link set ve1 up
  ip link set ve2 up
  ip -n ns1 link set iv1 up
  ip addr add 192.168.10.1/24 dev ve1
  ip -n ns1 addr 192.168.10.2/24 dev iv1
  ping -c2 192.168.10.2
  <Works!>
  ip neigh show dev ve1
  ip neigh show 192.168.10.2 lladdr <random> dev ve1
  ping -c2 192.168.10.2
  <Still works! Wrong!!>

This patch adds that missing check in the IPvlan rx-handler.

Reported-by: Amit Sikka <amit.sikka@ericsson.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-11 11:14:23 -05:00
Gao Feng 24e5992a6b ipvlan: Eliminate duplicated codes with existing function
The recv flow of ipvlan l2 mode performs as same as l3 mode for
non-multicast packet, so use the existing func ipvlan_handle_mode_l3
instead of these duplicated statements in non-multicast case.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-06 15:15:29 -05:00
David S. Miller 7cda4cee13 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Small overlapping change conflict ('net' changed a line,
'net-next' added a line right afterwards) in flexcan.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 10:44:19 -05:00
Gao Feng 5e51fe6f44 ipvlan: Add new func ipvlan_is_valid_dev instead of duplicated codes
There are multiple duplicated condition checks in the current codes, so
I add the new func ipvlan_is_valid_dev instead of the duplicated codes to
check if the netdev is real ipvlan dev.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-03 09:46:42 -05:00
Gao Feng a98a4ebc8c ipvlan: Add the skb->mark as flow4's member to lookup route
Current codes don't use skb->mark to assign flowi4_mark, it would
make the policy route rule with fwmark doesn't work as expected.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-03 09:44:02 -05:00
Gao Feng 747a713502 ipvlan: Fix insufficient skb linear check for ipv6 icmp
In the function ipvlan_get_L3_hdr, current codes use pskb_may_pull to
make sure the skb header has enough linear room for ipv6 header. But it
would use the latter memory directly without linear check when it is icmp.
So it still may access the unepxected memory in ipvlan_addr_lookup.

Now invoke the pskb_may_pull again if it is ipv6 icmp.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 03:37:02 +09:00
Gao Feng 5fc9220a67 ipvlan: Fix insufficient skb linear check for arp
In the function ipvlan_get_L3_hdr, current codes use pskb_may_pull to
make sure the skb header has enough linear room for arp header. But it
would access the arp payload in func ipvlan_addr_lookup. So it still may
access the unepxected memory.

Now use arp_hdr_len(port->dev) instead of the arp header as the param.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 03:37:02 +09:00
Girish Moodalbail fe18da6050 ipvlan: NULL pointer dereference panic in ipvlan_port_destroy
When call to register_netdevice() (called from ipvlan_link_new()) fails,
we call ipvlan_uninit() (through ndo_uninit()) to destroy the ipvlan
port. After returning unsuccessfully from register_netdevice() we go
ahead and call ipvlan_port_destroy() again which causes NULL pointer
dereference panic. Fix the issue by making ipvlan_init() and
ipvlan_uninit() call symmetric.

The ipvlan port will now be created inside ipvlan_init() and will be
destroyed in ipvlan_uninit().

Fixes: 2ad7bf3638 (ipvlan: Initial check-in of the IPVLAN driver)
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-18 10:37:00 +09:00
Keefe Liu ca29fd7cce ipvlan: fix ipv6 outbound device
When process the outbound packet of ipv6, we should assign the master
device to output device other than input device.

Signed-off-by: Keefe Liu <liuqifa@huawei.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 19:27:05 +09:00
David S. Miller e1ea2f9856 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several conflicts here.

NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to
nfp_fl_output() needed some adjustments because the code block is in
an else block now.

Parallel additions to net/pkt_cls.h and net/sch_generic.h

A bug fix in __tcp_retransmit_skb() conflicted with some of
the rbtree changes in net-next.

The tc action RCU callback fixes in 'net' had some overlap with some
of the recent tcf_block reworking.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-30 21:09:24 +09:00
Mahesh Bandewar fe89aa6b25 ipvlan: implement VEPA mode
This is very similar to the Macvlan VEPA mode, however, there is some
difference. IPvlan uses the mac-address of the lower device, so the VEPA
mode has implications of ICMP-redirects for packets destined for its
immediate neighbors sharing same master since the packets will have same
source and dest mac. The external switch/router will send redirect msg.

Having said that, this will be useful tool in terms of debugging
since IPvlan will not switch packets within its slaves and rely completely
on the external entity as intended in 802.1Qbg.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 18:39:57 +09:00
Mahesh Bandewar a190d04db9 ipvlan: introduce 'private' attribute for all existing modes.
IPvlan has always operated in bridge mode. However there are scenarios
where each slave should be able to talk through the master device but
not necessarily across each other. Think of an environment where each
of a namespace is a private and independant customer. In this scenario
the machine which is hosting these namespaces neither want to tell who
their neighbor is nor the individual namespaces care to talk to neighbor
on short-circuited network path.

This patch implements the mode that is very similar to the 'private' mode
in macvlan where individual slaves can send and receive traffic through
the master device, just that they can not talk among slave devices.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 18:39:57 +09:00
Girish Moodalbail dea6e19f4e tap: reference to KVA of an unloaded module causes kernel panic
The commit 9a393b5d59 ("tap: tap as an independent module") created a
separate tap module that implements tap functionality and exports
interfaces that will be used by macvtap and ipvtap modules to create
create respective tap devices.

However, that patch introduced a regression wherein the modules macvtap
and ipvtap can be removed (through modprobe -r) while there are
applications using the respective /dev/tapX devices. These applications
cause kernel to hold reference to /dev/tapX through 'struct cdev
macvtap_cdev' and 'struct cdev ipvtap_dev' defined in macvtap and ipvtap
modules respectively. So,  when the application is later closed the
kernel panics because we are referencing KVA that is present in the
unloaded modules.

----------8<------- Example ----------8<----------
$ sudo ip li add name mv0 link enp7s0 type macvtap
$ sudo ip li show mv0 |grep mv0| awk -e '{print $1 $2}'
  14:mv0@enp7s0:
$ cat /dev/tap14 &
$ lsmod |egrep -i 'tap|vlan'
macvtap                16384  0
macvlan                24576  1 macvtap
tap                    24576  3 macvtap
$ sudo modprobe -r macvtap
$ fg
cat /dev/tap14
^C

<...system panics...>
BUG: unable to handle kernel paging request at ffffffffa038c500
IP: cdev_put+0xf/0x30
----------8<-----------------8<----------

The fix is to set cdev.owner to the module that creates the tap device
(either macvtap or ipvtap). With this set, the operations (in
fs/char_dev.c) on char device holds and releases the module through
cdev_get() and cdev_put() and will not allow the module to unload
prematurely.

Fixes: 9a393b5d59 (tap: tap as an independent module)
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-28 19:17:21 +09:00
David Ahern de95e04791 net: Add extack to validator_info structs used for address notifier
Add extack to in_validator_info and in6_validator_info. Update the one
user of each, ipvlan, to return an error message for failures.

Only manual configuration of an address is plumbed in the IPv6 code path.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:15:07 +01:00
David Ahern ff7883ea60 net: ipv6: Make inet6addr_validator a blocking notifier
inet6addr_validator chain was added by commit 3ad7d2468f ("Ipvlan
should return an error when an address is already in use") to allow
address validation before changes are committed and to be able to
fail the address change with an error back to the user. The address
validation is not done for addresses received from router
advertisements.

Handling RAs in softirq context is the only reason for the notifier
chain to be atomic versus blocking. Since the only current user, ipvlan,
of the validator chain ignores softirq context, the notifier can be made
blocking and simply not invoked for softirq path.

The blocking option is needed by spectrum for example to validate
resources for an adding an address to an interface.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 13:15:07 +01:00
Mahesh Bandewar 32c10bbfe9 ipvlan: always use the current L2 addr of the master
If the underlying master ever changes its L2 (e.g. bonding device),
then make sure that the IPvlan slaves always emit packets with the
current L2 of the master instead of the stale mac addr which was
copied during the device creation. The problem can be seen with
following script -

  #!/bin/bash
  # Create a vEth pair
  ip link add dev veth0 type veth peer name veth1
  ip link set veth0 up
  ip link set veth1 up
  ip link show veth0
  ip link show veth1
  # Create an IPvlan device on one end of this vEth pair.
  ip link add link veth0 dev ipvl0 type ipvlan mode l2
  ip link show ipvl0
  # Change the mac-address of the vEth master.
  ip link set veth0 address 02:11:22:33:44:55

Fixes: 2ad7bf3638 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-12 22:54:22 -07:00
David Ahern 42ab19ee90 net: Add extack to upper device linking
Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 21:39:33 -07:00
David S. Miller b63f6044d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. Basically, updates to the conntrack core, enhancements for
nf_tables, conversion of netfilter hooks from linked list to array to
improve memory locality and asorted improvements for the Netfilter
codebase. More specifically, they are:

1) Add expection to hashes after timer initialization to prevent
   access from another CPU that walks on the hashes and calls
   del_timer(), from Florian Westphal.

2) Don't update nf_tables chain counters from hot path, this is only
   used by the x_tables compatibility layer.

3) Get rid of nested rcu_read_lock() calls from netfilter hook path.
   Hooks are always guaranteed to run from rcu read side, so remove
   nested rcu_read_lock() where possible. Patch from Taehee Yoo.

4) nf_tables new ruleset generation notifications include PID and name
   of the process that has updated the ruleset, from Phil Sutter.

5) Use skb_header_pointer() from nft_fib, so we can reuse this code from
   the nf_family netdev family. Patch from Pablo M. Bermudo.

6) Add support for nft_fib in nf_tables netdev family, also from Pablo.

7) Use deferrable workqueue for conntrack garbage collection, to reduce
   power consumption, from Patch from Subash Abhinov Kasiviswanathan.

8) Add nf_ct_expect_iterate_net() helper and use it. From Florian
   Westphal.

9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian.

10) Drop references on conntrack removal path when skbuffs has escaped via
    nfqueue, from Florian.

11) Don't queue packets to nfqueue with dying conntrack, from Florian.

12) Constify nf_hook_ops structure, from Florian.

13) Remove neededlessly branch in nf_tables trace code, from Phil Sutter.

14) Add nla_strdup(), from Phil Sutter.

15) Rise nf_tables objects name size up to 255 chars, people want to use
    DNS names, so increase this according to what RFC 1035 specifies.
    Patch series from Phil Sutter.

16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook
    registration on demand, suggested by Eric Dumazet, patch from Florian.

17) Remove unused variables in compat_copy_entry_from_user both in
    ip_tables and arp_tables code. Patch from Taehee Yoo.

18) Constify struct nf_conntrack_l4proto, from Julia Lawall.

19) Constify nf_loginfo structure, also from Julia.

20) Use a single rb root in connlimit, from Taehee Yoo.

21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo.

22) Use audit_log() instead of open-coding it, from Geliang Tang.

23) Allow to mangle tcp options via nft_exthdr, from Florian.

24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes
    a fix for a miscalculation of the minimal length.

25) Simplify branch logic in h323 helper, from Nick Desaulniers.

26) Calculate netlink attribute size for conntrack tuple at compile
    time, from Florian.

27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure.
    From Florian.

28) Remove holes in nf_conntrack_l4proto structure, so it becomes
    smaller. From Florian.

29) Get rid of print_tuple() indirection for /proc conntrack listing.
    Place all the code in net/netfilter/nf_conntrack_standalone.c.
    Patch from Florian.

30) Do not built in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is
    off. From Florian.

31) Constify most nf_conntrack_{l3,l4}proto helper functions, from
    Florian.

32) Fix broken indentation in ebtables extensions, from Colin Ian King.

33) Fix several harmless sparse warning, from Florian.

34) Convert netfilter hook infrastructure to use array for better memory
    locality, joint work done by Florian and Aaron Conole. Moreover, add
    some instrumentation to debug this.

35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once
    per batch, from Florian.

36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian.

37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao.

38) Remove unused code in the generic protocol tracker, from Davide
    Caratti.

I think I will have material for a second Netfilter batch in my queue if
time allow to make it fit in this merge window.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-03 17:08:42 -07:00
David S. Miller 3118e6e19d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:28:45 -07:00
Florian Fainelli 87173cd6cf ipvlan: Fix 64-bit statistics seqcount initialization
On 32-bit hosts and with CONFIG_DEBUG_LOCK_ALLOC we should be seeing a
lockdep splat indicating this seqcount is not correctly initialized, fix
that by using the proper helper function: netdev_alloc_pcpu_stats().

Fixes: 2ad7bf3638 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 20:06:07 -07:00
Florian Westphal 591bb2789b netfilter: nf_hook_ops structs can be const
We no longer place these on a list so they can be const.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-07-31 19:10:44 +02:00
David S. Miller 182e0b6b58 ipvlan: Stop advertising NETIF_F_UFO support.
It is going away.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-17 09:52:57 -07:00
Matthias Schiffer a8b8a889e3 net: add netlink_ext_ack argument to rtnl_link_ops.validate
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:22 -04:00
Matthias Schiffer ad744b223c net: add netlink_ext_ack argument to rtnl_link_ops.changelink
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:22 -04:00
Matthias Schiffer 7a3f4a1851 net: add netlink_ext_ack argument to rtnl_link_ops.newlink
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:21 -04:00
David S. Miller 0ddead90b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-15 11:59:32 -04:00
Krister Johansen 3ad7d2468f Ipvlan should return an error when an address is already in use.
The ipvlan code already knows how to detect when a duplicate address is
about to be assigned to an ipvlan device.  However, that failure is not
propogated outward and leads to a silent failure.

Introduce a validation step at ip address creation time and allow device
drivers to register to validate the incoming ip addresses.  The ipvlan
code is the first consumer.  If it detects an address in use, we can
return an error to the user before beginning to commit the new ifa in
the networking code.

This can be especially useful if it is necessary to provision many
ipvlans in containers.  The provisioning software (or operator) can use
this to detect situations where an ip address is unexpectedly in use.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-09 12:26:07 -04:00
David S. Miller cf124db566 net: Fix inconsistent teardown and release of private netdev state.
Network devices can allocate reasources and private memory using
netdev_ops->ndo_init().  However, the release of these resources
can occur in one of two different places.

Either netdev_ops->ndo_uninit() or netdev->destructor().

The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.

netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.

netdev->destructor(), on the other hand, does not run until the
netdev references all go away.

Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().

This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.

If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
it is not able to invoke netdev->destructor().

This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.

However, this means that the resources that would normally be released
by netdev->destructor() will not be.

Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.

Many drivers do not try to deal with this, and instead we have leaks.

Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().

netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().

netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().

Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().

And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07 15:53:24 -04:00
Florian Westphal 3133822f5a ipvlan: use pernet operations and restrict l3s hooks to master netns
commit 4fbae7d83c ("ipvlan: Introduce l3s mode") added
registration of netfilter hooks via nf_register_hooks().

This API provides the illusion of 'global' netfilter hooks by placing the
hooks in all current and future network namespaces.

In case of ipvlan the hook appears to be only needed in the namespace
that contains the ipvlan master device (i.e., usually init_net), so
placing them in all namespaces is not needed.

This switches ipvlan driver to pernet operations, and then only registers
hooks in namespaces where a ipvlan master device is set to l3s mode.

Extra care has to be taken when the master device is moved to another
namespace, as we might have to 'move' the netfilter hooks too.

This is done by storing the namespace the ipvlan port was created in.
On REGISTER event, do (un)register operations in the old/new namespaces.

This will also allow removal of the nf_register_hooks() in a future patch.

Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 10:43:22 -04:00
Sainath Grandhi 235a9d89da ipvtap: IP-VLAN based tap driver
This patch adds a tap character device driver that is based on the
IP-VLAN network interface, called ipvtap. An ipvtap device can be created
in the same way as an ipvlan device, using 'type ipvtap', and then accessed
using the tap user space interface.

Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-11 20:59:41 -05:00
Mahesh Bandewar c3262d9dec ipvlan: use netdev_is_rx_handler_busy instead of checking specific type
IPvlan checks if the master device is already used by checking a
specific device (here it's macvlan device). This is technically not
sufficient and it should just ensure the rx_handler is busy or not.
This would be a super check that includes macvlan and any other that
has already registered rx-handler.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:22:26 -05:00
Mahesh Bandewar 019ec0032e ipvlan: fix dev_id creation corner case.
In the last patch da36e13cf6 ("ipvlan: improvise dev_id generation
logic in IPvlan") I missed some part of Dave's suggestion and because
of that the dev_id creation could fail in a corner case scenario. This
would happen when more or less 64k devices have been already created and
several have been deleted. If the devices that are still sticking around
are the last n bits from the bitmap. So in this scenario even if lower
bits are available, the dev_id search is so narrow that it always fails.

Fixes: da36e13cf6 ("ipvlan: improvise dev_id generation logic in IPvlan")
CC: David Miller <davem@davemloft.org>
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 14:04:43 -05:00
Mahesh Bandewar da36e13cf6 ipvlan: improvise dev_id generation logic in IPvlan
The patch 009146d117 ("ipvlan: assign unique dev-id for each slave
device.") used ida_simple_get() to generate dev_ids assigned to the
slave devices. However (Eric has pointed out that) there is a shortcoming
with that approach as it always uses the first available ID. This
becomes a problem when a slave gets deleted and a new slave gets added.
The ID gets reassigned causing the new slave to get the same link-local
address. This side-effect is undesirable.

This patch adds a per-port variable that keeps track of the IDs
assigned and used as the stat-base for the IDR api. This base will be
wrapped around when it reaches the MAX (0xFFFE) value possibly on a
busy system where slaves are added and deleted routinely.

Fixes: 009146d117 ("ipvlan: assign unique dev-id for each slave device.")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: David Miller <davem@davemloft.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-10 20:47:12 -05:00
stephen hemminger bc1f44709c net: make ndo_get_stats64 a void function
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 17:51:44 -05:00
Mahesh Bandewar 009146d117 ipvlan: assign unique dev-id for each slave device.
IPvlan setup uses one mac-address (of master). The IPv6 link-local
addresses are derived using the mac-address on the link. Lack of
dev-ids makes these link-local addresses same for all slaves including
that of master device. dev-ids are necessary to add differentiation
when L2 address is shared.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-04 13:30:42 -05:00
Gao Feng 3ea35d3406 driver: ipvlan: Remove unnecessary ipvlan NULL check in ipvlan_count_rx
There are three functions which would invoke the ipvlan_count_rx. They
are ipvlan_process_multicast, ipvlan_rcv_frame, and ipvlan_nf_input.
The former two functions already use the ipvlan directly before
ipvlan_count_rx, and ipvlan_nf_input gets the ipvlan from
ipvl_addr->master, it is not possible to be NULL too.
So the ipvlan pointer check is unnecessary in ipvlan_count_rx.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:23:22 -05:00
Gao Feng 8667398277 driver: ipvlan: Define common functions to decrease duplicated codes used to add or del IP address
There are some duplicated codes in ipvlan_add_addr6/4 and
ipvlan_del_addr6/4. Now define two common functions ipvlan_add_addr
and ipvlan_del_addr to decrease the duplicated codes.
It could be helful to maintain the codes.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-28 14:23:21 -05:00
Mahesh Bandewar e252536068 ipvlan: fix multicast processing
In an IPvlan setup when master is set in loopback mode e.g.

  ethtool -K eth0 set loopback on

  where eth0 is master device for IPvlan setup.

The failure is caused by the faulty logic that determines if the
packet is from TX-path vs. RX-path by just looking at the mac-
addresses on the packet while processing multicast packets.

In the loopback-mode where this crash was happening, the packets
that are sent out are reflected by the NIC and are processed on
the RX path, but mac-address check tricks into thinking this
packet is from TX path and falsely uses dev_forward_skb() to pass
packets to the slave (virtual) devices.

This patch records the path while queueing packets and eliminates
logic of looking at mac-addresses for the same decision.

------------[ cut here ]------------
kernel BUG at include/linux/skbuff.h:1737!
Call Trace:
 [<ffffffff921fbbc2>] dev_forward_skb+0x92/0xd0
 [<ffffffffc031ac65>] ipvlan_process_multicast+0x395/0x4c0 [ipvlan]
 [<ffffffffc031a9a7>] ? ipvlan_process_multicast+0xd7/0x4c0 [ipvlan]
 [<ffffffff91cdfea7>] ? process_one_work+0x147/0x660
 [<ffffffff91cdff09>] process_one_work+0x1a9/0x660
 [<ffffffff91cdfea7>] ? process_one_work+0x147/0x660
 [<ffffffff91ce086d>] worker_thread+0x11d/0x360
 [<ffffffff91ce0750>] ? rescuer_thread+0x350/0x350
 [<ffffffff91ce960b>] kthread+0xdb/0xe0
 [<ffffffff91c05c70>] ? _raw_spin_unlock_irq+0x30/0x50
 [<ffffffff91ce9530>] ? flush_kthread_worker+0xc0/0xc0
 [<ffffffff92348b7a>] ret_from_fork+0x9a/0xd0
 [<ffffffff91ce9530>] ? flush_kthread_worker+0xc0/0xc0

Fixes: ba35f8588f ("ipvlan: Defer multicast / broadcast processing to a work-queue")
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23 17:53:47 -05:00
Eric Dumazet b1227d019f ipvlan: fix various issues in ipvlan_process_multicast()
1) netif_rx() / dev_forward_skb() should not be called from process
context.

2) ipvlan_count_rx() should be called with preemption disabled.

3) We should check if ipvlan->dev is up before feeding packets
to netif_rx()

4) We need to prevent device from disappearing if some packets
are in the multicast backlog.

5) One kfree_skb() should be a consume_skb() eventually

Fixes: ba35f8588f ("ipvlan: Defer multicast / broadcast processing to
a work-queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23 17:53:47 -05:00
David S. Miller 821781a9f4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-12-10 16:21:55 -05:00
Gao Feng 1a31cc86ef driver: ipvlan: Unlink the upper dev when ipvlan_link_new failed
When netdev_upper_dev_unlink failed in ipvlan_link_new, need to
unlink the ipvlan dev with upper dev.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-08 14:30:07 -05:00
Gao Feng 48140a210b driver: ipvlan: Free ipvl_port directly with kfree instead of kfree_rcu
There are two functions which would free the ipvl_port now. The first
is ipvlan_port_create. It frees the ipvl_port in the error handler,
so it could kfree it directly. The second is ipvlan_port_destroy. It
invokes netdev_rx_handler_unregister which enforces one grace period
by synchronize_net firstly, so it also could kfree the ipvl_port
directly and safely.

So it is unnecessary to use kfree_rcu to free ipvl_port.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-07 13:21:21 -05:00
David S. Miller 2745529ac7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Couple conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the
   RX tail pointer properly overlapped with some changes
   to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
   overlapping with a reorganization of the driver to support
   ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the
   stmmac driver, meanwhile a lot of this code was cleaned up
   and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in
   'net-next' called __fl_delete() and this overlapped with
   Daniel Borkamann's bug fix to use RCU for object destruction
   in 'net'.  It also overlapped with Jiri's change to guard
   the rhashtable_remove_fast() call with a check against
   tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some
   unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head()
   bug fix in 'net' overlapped with a large reorganization of
   the same code in 'net-next'.  Since the 'net-next' code no
   longer had the bug in question, there was nothing to do
   other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 12:29:53 -05:00
Gao Feng 8f679ed88f driver: ipvlan: Remove useless member mtu_adj of struct ipvl_dev
The mtu_adj is initialized to zero when alloc mem, there is no any
assignment to mtu_adj. It is only used in ipvlan_adjust_mtu as one
right value.
So it is useless member of struct ipvl_dev, then remove it.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-30 15:01:32 -05:00
Gao Feng 147fd2874d driver: ipvlan: Fix one possible memleak in ipvlan_link_new
When ipvlan_link_new fails and creates one ipvlan port, it does not
destroy the ipvlan port created. It causes mem leak and the physical
device contains invalid ipvlan data.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 19:58:04 -05:00
Julia Lawall ab530f63e7 ipvlan: constify l3mdev_ops structure
This l3mdev_ops structure is only stored in the l3mdev_ops field of a
net_device structure.  This field is declared const, so the l3mdev_ops
structure can be declared as const also.  Additionally drop the
__read_mostly annotation.

The semantic patch that adds const is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@r disable optional_qualifier@
identifier i;
position p;
@@
static struct l3mdev_ops i@p = { ... };

@ok@
identifier r.i;
struct net_device *e;
position p;
@@
e->l3mdev_ops = &i@p;

@bad@
position p != {r.p,ok.p};
identifier r.i;
struct l3mdev_ops e;
@@
e@i@p

@depends on !bad disable optional_qualifier@
identifier r.i;
@@
static
+const
 struct l3mdev_ops i = { ... };
// </smpl>

The effect on the layout of the .o file is shown by the following output
of the size command, first before then after the transformation:

   text    data     bss     dec     hex filename
   7364     466      52    7882    1eca drivers/net/ipvlan/ipvlan_main.o
   7412     434      52    7898    1eda drivers/net/ipvlan/ipvlan_main.o

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-15 17:49:57 -04:00
Mahesh Bandewar 4fbae7d83c ipvlan: Introduce l3s mode
In a typical IPvlan L3 setup where master is in default-ns and
each slave is into different (slave) ns. In this setup egress
packet processing for traffic originating from slave-ns will
hit all NF_HOOKs in slave-ns as well as default-ns. However same
is not true for ingress processing. All these NF_HOOKs are
hit only in the slave-ns skipping them in the default-ns.
IPvlan in L3 mode is restrictive and if admins want to deploy
iptables rules in default-ns, this asymmetric data path makes it
impossible to do so.

This patch makes use of the l3_rcv() (added as part of l3mdev
enhancements) to perform input route lookup on RX packets without
changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN
to change the skb->dev just before handing over skb to L4.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-19 01:25:22 -04:00
Mahesh Bandewar b93dd49c1a ipvlan: Scrub skb before crossing the namespace boundry
The earlier patch c3aaa06d5a (ipvlan: scrub skb before routing
in L3 mode.) did this but only for TX path in L3 mode. This
patch extends it for both the modes for TX/RX path.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-25 21:47:26 -07:00
Eric Dumazet 0d7dd798fd net: ipvlan: call netdev_lockdep_set_classes()
In case a qdisc is used on a ipvlan device, we need to use different
lockdep classes to avoid false positives.

Use the new netdev_lockdep_set_classes() generic helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 13:28:37 -07:00
Mahesh Bandewar 494e8489db ipvlan: Fix failure path in dev registration during link creation
When newlink creation fails at device-registration, the port->count
is decremented twice. Francesco Ruggeri (fruggeri@arista.com) found
this issue in Macvlan and the same exists in IPvlan driver too.

While fixing this issue I noticed another issue of missing unregister
in case of failure, so adding it to the fix which is similar to the
macvlan fix by Francesco in commit 3083796075 ("macvlan: fix failure
during registration v3")

Reported-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 17:23:08 -04:00
Eric Dumazet f6773c5e95 vlan: propagate gso_max_segs
vlan drivers lack proper propagation of gso_max_segs from
lower device.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-17 21:05:01 -04:00
David Decotigny 314d10d73b net: ipvlan: use __ethtool_get_ksettings
Signed-off-by: David Decotigny <decot@googlers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 22:06:46 -05:00
Mahesh Bandewar ab5b7013db ipvlan: misc changes
1. scope correction for few functions that are used in single file.
2. Adjust variables that are used in fast-path to fit into single cacheline
3. Update rcv_frame() to skip shared check for frames coming over wire

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21 22:43:24 -05:00
Mahesh Bandewar e93fbc5a15 ipvlan: mode is u16
The mode argument was erronusly defined as u32 but it has always
been u16. Also use ipvlan_set_mode() helper to set the mode instead
of assigning directly. This should avoid future erronus assignments /
updates.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21 22:43:24 -05:00
Mahesh Bandewar c3aaa06d5a ipvlan: scrub skb before routing in L3 mode.
Scrub skb before hitting the iptable hooks to ensure packets hit
these hooks. Set the xnet param only when the packet is crossing the
ns boundry so if the IPvlan slave and master belong to the same ns,
the param will be set to false.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-21 22:43:24 -05:00
Mahesh Bandewar 296d485680 ipvlan: inherit MTU from master device
When we create IPvlan slave; we use ether_setup() and that
sets up default MTU to 1500 while the master device may have
lower / different MTU. Any subsequent changes to the masters'
MTU are reflected into the slaves' MTU setting. However if those
don't happen (most likely scenario), the slaves' MTU stays at
1500 which could be bad.

This change adds code to inherit MTU from the master device
instead of using the default value during the link initialization
phase.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tim Hockins <thockins@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-04 19:18:53 -05:00
Tom Herbert a188222b6e net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK
The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.

This patch:
  - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
    NETIF_F_ALL_CSUM is being used as a mask).
  - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
    use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:50:08 -05:00
Sabrina Dubroca a534dc5298 ipvlan: fix use after free of skb
ipvlan_handle_frame is a rx_handler, and when it returns a value other
than RX_HANDLER_CONSUMED (here, NET_RX_DROP aka RX_HANDLER_ANOTHER),
__netif_receive_skb_core expects that the skb still exists and will
process it further, but we just freed it.

Fixes: 2ad7bf3638 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17 14:39:29 -05:00
Sabrina Dubroca cf554ada0b ipvlan: fix leak in ipvlan_rcv_frame
Pass a **skb to ipvlan_rcv_frame so that if skb_share_check returns a
new skb, we actually use it during further processing.

It's safe to ignore the new skb in the ipvlan_xmit_* functions, because
they call ipvlan_rcv_frame with local == true, so that dev_forward_skb
is called and always takes ownership of the skb.

Fixes: 2ad7bf3638 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17 14:39:28 -05:00
Brenden Blanco 63b11e757d ipvlan: read direct ifindex instead of iflink
In the ipv4 outbound path of an ipvlan device in l3 mode, the ifindex is
being grabbed from dev_get_iflink. This works for the physical device
case, since as the documentation of that function notes: "Physical
interfaces have the same 'ifindex' and 'iflink' values.".  However, if
the master device is a veth, and the pairs are in separate net
namespaces, the route lookup will fail with -ENODEV due to outer veth
pair being in a separate namespace from the ipvlan master/routing
namespace.

  ns0    |   ns1    |   ns2
 veth0a--|--veth0b--|--ipvl0

In ipvlan_process_v4_outbound(), a packet sent from ipvl0 in the above
configuration will pass fl.flowi4_oif == veth0a to
ip_route_output_flow(), but *net == ns1.

Notice also that ipv6 processing is not using iflink. Since there is a
discrepancy in usage, fixup both v4 and v6 case to use local dev
variable.

Tested this with l3 ipvlan on top of veth, as well as with single
physical interface in the top namespace.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 06:39:08 -07:00
Eric W. Biederman 33224b16ff ipv4, ipv6: Pass net into ip_local_out and ip6_local_out
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:27:02 -07:00
Eric W. Biederman 57c4bf859c ipvlan: Cache net in ipvlan_process_v4_outbound and ipvlan_process_v6_outbound
Compute net once in ipvlan_process_v4_outbound and
ipvlan_process_v6_outbound and store it in a variable so that net does
not need to be recomputed next time it is used.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:27:01 -07:00
Eric W. Biederman 792883303c ipv6: Merge ip6_local_out and ip6_local_out_sk
Stop hidding the sk parameter with an inline helper function and make
all of the callers pass it, so that it is clear what the function is
doing.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:26:58 -07:00
Eric W. Biederman e2cb77db08 ipv4: Merge ip_local_out and ip_local_out_sk
It is confusing and silly hiding a parameter so modify all of
the callers to pass in the appropriate socket or skb->sk if
no socket is known.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-08 04:26:57 -07:00
Phil Sutter bf485bcf0d net: ipvlan: convert to using IFF_NO_QUEUE
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:06 -07:00
Konstantin Khlebnikov 23a5a49c83 ipvlan: ignore addresses from ipv6 autoconfiguration
Inet6addr notifier is atomic and runs in bh context without RTNL when
ipv6 receives router advertisement packet and performs autoconfiguration.

Proper fix still in discussion. Let's at least plug the bug.
v1: http://lkml.kernel.org/r/20150514135618.14062.1969.stgit@buzz
v2: http://lkml.kernel.org/r/20150703125840.24121.91556.stgit@buzz

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:33:40 -07:00
WANG Cong 0fba37a3af ipvlan: use rcu_deference_bh() in ipvlan_queue_xmit()
In tx path rcu_read_lock_bh() is held, so we need rcu_deference_bh().
This fixes the following warning:

 ===============================
 [ INFO: suspicious RCU usage. ]
 4.1.0-rc1+ #1007 Not tainted
 -------------------------------
 drivers/net/ipvlan/ipvlan.h:106 suspicious rcu_dereference_check() usage!

 other info that might help us debug this:

 rcu_scheduler_active = 1, debug_locks = 0
 1 lock held by dhclient/1076:
  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff817e8d84>] rcu_lock_acquire+0x0/0x26

 stack backtrace:
 CPU: 2 PID: 1076 Comm: dhclient Not tainted 4.1.0-rc1+ #1007
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  0000000000000001 ffff8800d381bac8 ffffffff81a4154f 000000003c1a3c19
  ffff8800d4d0a690 ffff8800d381baf8 ffffffff810b849f ffff880117d41148
  ffff880117d40000 ffff880117d40068 0000000000000156 ffff8800d381bb18
 Call Trace:
  [<ffffffff81a4154f>] dump_stack+0x4c/0x65
  [<ffffffff810b849f>] lockdep_rcu_suspicious+0x107/0x110
  [<ffffffff8165a522>] ipvlan_port_get_rcu+0x47/0x4e
  [<ffffffff8165ad14>] ipvlan_queue_xmit+0x35/0x450
  [<ffffffff817ea45d>] ? rcu_read_unlock+0x3e/0x5f
  [<ffffffff810a20bf>] ? local_clock+0x19/0x22
  [<ffffffff810b4781>] ? __lock_is_held+0x39/0x52
  [<ffffffff8165b64c>] ipvlan_start_xmit+0x1b/0x44
  [<ffffffff817edf7f>] dev_hard_start_xmit+0x2ae/0x467
  [<ffffffff817ee642>] __dev_queue_xmit+0x50a/0x60c
  [<ffffffff817ee7a7>] dev_queue_xmit_sk+0x13/0x15
  [<ffffffff81997596>] dev_queue_xmit+0x10/0x12
  [<ffffffff8199b41c>] packet_sendmsg+0xb6b/0xbdf
  [<ffffffff810b5ea7>] ? mark_lock+0x2e/0x226
  [<ffffffff810a1fcc>] ? sched_clock_cpu+0x9e/0xb7
  [<ffffffff817d56f9>] sock_sendmsg_nosec+0x12/0x1d
  [<ffffffff817d7257>] sock_sendmsg+0x29/0x2e
  [<ffffffff817d72cc>] sock_write_iter+0x70/0x91
  [<ffffffff81199563>] __vfs_write+0x7e/0xa7
  [<ffffffff811996bc>] vfs_write+0x92/0xe8
  [<ffffffff811997d7>] SyS_write+0x47/0x7e
  [<ffffffff81a4d517>] system_call_fastpath+0x12/0x6f

Fixes: 2ad7bf3638 ("ipvlan: Initial check-in of the IPVLAN driver.")
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:33:40 -07:00
Konstantin Khlebnikov 6640e673c6 ipvlan: unhash addresses without synchronize_rcu
All structures used in traffic forwarding are rcu-protected:
ipvl_addr, ipvl_dev and ipvl_port. Thus we can unhash addresses
without synchronization. We'll anyway hash it back into the same
bucket: in worst case lockless lookup will scan hash once again.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:33:39 -07:00
Konstantin Khlebnikov 6a72549731 ipvlan: plug memory leak in ipvlan_link_delete
Add missing kfree_rcu(addr, rcu);

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:33:39 -07:00
Konstantin Khlebnikov 515866f818 ipvlan: remove counters of ipv4 and ipv6 addresses
They are unused after commit f631c44bbe ("ipvlan: Always set broadcast bit in
multicast filter").

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:33:39 -07:00
Mahesh Bandewar f631c44bbe ipvlan: Always set broadcast bit in multicast filter
Earlier tricks of setting broadcast bit only when IPv4 address is added
onto interface are not good enough especially when autoconf comes in play.
Setting them on always is performance drag but now that multicast /
broadcast is not processed in fast-path; enabling broadcast will let
autoconf work correctly without affecting performance characteristics of
the device.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05 19:29:49 -04:00
Mahesh Bandewar ba35f8588f ipvlan: Defer multicast / broadcast processing to a work-queue
Processing multicast / broadcast in fast path is performance draining
and having more links means more cloning and bringing performance
down further.

Broadcast; in particular, need to be given to all the virtual links.
Earlier tricks of enabling broadcast bit for IPv4 only interfaces are not
really working since it fails autoconf. Which means enabling broadcast
for all the links if protocol specific hacks do not have to be added into
the driver.

This patch defers all (incoming as well as outgoing) multicast traffic to
a work-queue leaving only the unicast traffic in the fast-path. Now if we
need to apply any additional tricks to further reduce the impact of this
(multicast / broadcast) type of traffic, it can be implemented while
processing this work without affecting the fast-path.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-05 19:29:49 -04:00
David S. Miller 9f0d34bc34 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/asix_common.c
	drivers/net/usb/sr9800.c
	drivers/net/usb/usbnet.c
	include/linux/usb/usbnet.h
	net/ipv4/tcp_ipv4.c
	net/ipv6/tcp_ipv6.c

The TCP conflicts were overlapping changes.  In 'net' we added a
READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
Eric Dumazet touched the surrounding code dealing with how mini
sockets are handled.

With USB, it's a case of the same bug fix first going into net-next
and then I cherry picked it back into net.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 16:16:53 -04:00
Nicolas Dichtel 7c4116588b ipvlan: implement ndo_get_iflink
Don't use dev->iflink anymore.

CC: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 14:05:00 -04:00
Nicolas Dichtel a54acb3a6f dev: introduce dev_get_iflink()
The goal of this patch is to prepare the removal of the iflink field. It
introduces a new ndo function, which will be implemented by virtual interfaces.

There is no functional change into this patch. All readers of iflink field
now call dev_get_iflink().

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 14:04:59 -04:00
Jiri Benc e9997c2938 ipvlan: fix check for IP addresses in control path
When an ipvlan interface is down, its addresses are not on the hash list.
Fix checks for existence of addresses not to depend on the hash list, walk
through all interface addresses instead.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:28:33 -04:00
Jiri Benc 40891e8ad6 ipvlan: do not use rcu operations for address list
All accesses to ipvlan->addrs are under rtnl.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:28:33 -04:00
Jiri Benc 2afa650ce2 ipvlan: protect against concurrent link removal
Adding and removing to the 'ipvlans' list is already done using _rcu list
operations.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:28:33 -04:00
Jiri Benc 27705f7085 ipvlan: fix addr hash list corruption
When ipvlan interface with IP addresses attached is brought down and then
deleted, the assigned addresses are deleted twice from the address hash
list, first on the interface down and second on the link deletion.
Similarly, when an address is added while the interface is down, it is added
second time once the interface is brought up.

When the interface is down, the addresses should be kept off the hash list
for performance reasons. Ensure this is true, which also fixes the double add
problem. To fix the double free, check whether the address is hashed before
removing it.

Reported-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 13:28:33 -04:00
Eric W. Biederman d476059e77 net: Kill dev_rebuild_header
Now that there are no more users kill dev_rebuild_header and all of it's
implementations.

This is long overdue.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02 16:43:41 -05:00
Eric Dumazet 6aa6395ff3 ipvlan: add a missing __percpu pcpu_stats
Cosmetic patch to add __percpu qualifier to pcpu_stats

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 20:03:23 -08:00
Daniel Borkmann 207895fd38 net: mark some potential candidates __read_mostly
They are all either written once or extremly rarely (e.g. from init
code), so we can move them to the .data..read_mostly section.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30 17:58:39 -08:00
Mahesh Bandewar 2aab9525c3 ipvlan: fix incorrect usage of IS_ERR() macro in IPv6 code path.
The ip6_route_output() always returns a valid dst pointer unlike in IPv4
case. So the validation has to be different from the IPv4 path. Correcting
that error in this patch.

This was picked up by a static checker with a following warning -

   drivers/net/ipvlan/ipvlan_core.c:380 ipvlan_process_v6_outbound()
        warn: 'dst' isn't an ERR_PTR

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 00:24:19 -08:00
Mahesh Bandewar 5933fea7aa ipvlan: move the device check function into netdevice.h
Move the port check [ipvlan_dev_master()] and device check
[ipvlan_dev_slave()] functions to netdevice.h and rename them
netif_is_ipvlan_port() and netif_is_ipvlan() resp. to be
consistent with macvlan api naming.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 16:10:06 -05:00
Mahesh Bandewar 764e433b3c ipvlan: play well with macvlan device
If a device is already a macvlan port then refuse to use it as
an ipvlan port in the early stage of port creation.

	thost1:~# ip link add link eth0 mvl0 type macvlan
	thost1:~# echo $?
	0
	thost1:~# ip link add link eth0 ipvl0 type ipvlan
	RTNETLINK answers: Device or resource busy
	thost1:~# echo $?
	2
	thost1:~#

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 16:10:06 -05:00
Markus Elfring 04901cea21 net-ipvlan: Deletion of an unnecessary check before the function call "free_percpu"
The free_percpu() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-05 21:14:20 -08:00
Mahesh Bandewar 265de6d19c ipvlan: ipvlan depends on INET and IPV6
This driver uses ip_out_local() and ip6_route_output() which are
defined only if CONFIG_INET and CONFIG_IPV6 are enabled respectively.

Reported-by: Jim Davis <jim.epost@gmail.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-29 20:53:05 -08:00
Mahesh Bandewar 92c7b0de6a ipvlan: fix sparse warnings
Fix sparse warnings reported by kbuild robot

drivers/net/ipvlan/ipvlan_main.c:172:13: warning: symbol 'ipvlan_start_xmit' was not declared. Should it be static?
drivers/net/ipvlan/ipvlan_main.c:256:33: warning: incorrect type in initializer (different address spaces)
drivers/net/ipvlan/ipvlan_main.c:256:33:    expected void const [noderef] <asn:3>*__vpp_verify
drivers/net/ipvlan/ipvlan_main.c:256:33:    got struct ipvl_pcpu_stats *<noident>
drivers/net/ipvlan/ipvlan_main.c:544:5: warning: symbol 'ipvlan_link_register' was not declared. Should it be static

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-26 15:10:17 -05:00
Mahesh Bandewar 2ad7bf3638 ipvlan: Initial check-in of the IPVLAN driver.
This driver is very similar to the macvlan driver except that it
uses L3 on the frame to determine the logical interface while
functioning as packet dispatcher. It inherits L2 of the master
device hence the packets on wire will have the same L2 for all
the packets originating from all virtual devices off of the same
master device.

This driver was developed keeping the namespace use-case in
mind. Hence most of the examples given here take that as the
base setup where main-device belongs to the default-ns and
virtual devices are assigned to the additional namespaces.

The device operates in two different modes and the difference
in these two modes in primarily in the TX side.

(a) L2 mode : In this mode, the device behaves as a L2 device.
TX processing upto L2 happens on the stack of the virtual device
associated with (namespace). Packets are switched after that
into the main device (default-ns) and queued for xmit.

RX processing is simple and all multicast, broadcast (if
applicable), and unicast belonging to the address(es) are
delivered to the virtual devices.

(b) L3 mode : In this mode, the device behaves like a L3 device.
TX processing upto L3 happens on the stack of the virtual device
associated with (namespace). Packets are switched to the
main-device (default-ns) for the L2 processing. Hence the routing
table of the default-ns will be used in this mode.

RX processins is somewhat similar to the L2 mode except that in
this mode only Unicast packets are delivered to the virtual device
while main-dev will handle all other packets.

The devices can be added using the "ip" command from the iproute2
package -

	ip link add link <master> <virtual> type ipvlan mode [ l2 | l3 ]

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Tim Hockin <thockin@google.com>
Cc: Brandon Philips <brandon.philips@coreos.com>
Cc: Pavel Emelianov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-24 15:29:18 -05:00