Commit graph

41141 commits

Author SHA1 Message Date
Daniel Borkmann
14ca0751c9 bpf: support for access to tunnel options
After eBPF being able to programmatically access/manage tunnel key meta
data via commit d3aa45ce6b ("bpf: add helpers to access tunnel metadata")
and more recently also for IPv6 through c6c3345407 ("bpf: support ipv6
for bpf_skb_{set,get}_tunnel_key"), this work adds two complementary
helpers to generically access their auxiliary tunnel options.

Geneve and vxlan support this facility. For geneve, TLVs can be pushed,
and for the vxlan case its GBP extension. I.e. setting tunnel key for geneve
case only makes sense, if we can also read/write TLVs into it. In the GBP
case, it provides the flexibility to easily map the group policy ID in
combination with other helpers or maps.

I chose to model this as two separate helpers, bpf_skb_{set,get}_tunnel_opt(),
for a couple of reasons. bpf_skb_{set,get}_tunnel_key() is already rather
complex by itself, and there may be cases for tunnel key backends where
tunnel options are not always needed. If we would have integrated this
into bpf_skb_{set,get}_tunnel_key() nevertheless, we are very limited with
remaining helper arguments, so keeping compatibility on structs in case of
passing in a flat buffer gets more cumbersome. Separating both also allows
for more flexibility and future extensibility, f.e. options could be fed
directly from a map, etc.

Moreover, change geneve's xmit path to test only for info->options_len
instead of TUNNEL_GENEVE_OPT flag. This makes it more consistent with vxlan's
xmit path and allows for avoiding to specify a protocol flag in the API on
xmit, so it can be protocol agnostic. Having info->options_len is enough
information that is needed. Tested with vxlan and geneve.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 13:58:46 -05:00
Daniel Borkmann
2208087061 bpf: allow to propagate df in bpf_skb_set_tunnel_key
Added by 9a628224a6 ("ip_tunnel: Add dont fragment flag."), allow to
feed df flag into tunneling facilities (currently supported on TX by
vxlan, geneve and gre) as a hint from eBPF's bpf_skb_set_tunnel_key()
helper.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 13:58:46 -05:00
Daniel Borkmann
577c50aade bpf: make helper function protos static
They are only used here, so there's no reason they should not be static.
Only the vlan push/pop protos are used in the test_bpf suite.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 13:55:15 -05:00
Daniel Borkmann
8afd54c87a bpf: add flags to bpf_skb_store_bytes for clearing hash
When overwriting parts of the packet with bpf_skb_store_bytes() that
were fed previously into skb->hash calculation, we should clear the
current hash with skb_clear_hash(), so that a next skb_get_hash() call
can determine the correct hash related to this skb.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 13:55:15 -05:00
Daniel Borkmann
8050c0f027 bpf: allow bpf_csum_diff to feed bpf_l3_csum_replace as well
Commit 7d672345ed ("bpf: add generic bpf_csum_diff helper") added a
generic checksum diff helper that can feed bpf_l4_csum_replace() with
a target __wsum diff that is to be applied to the L4 checksum. This
facility is very flexible, can be cascaded, allows for adding, removing,
or diffing data, or for calculating the pseudo header checksum from
scratch, but it can also be reused for working with the IPv4 header
checksum.

Thus, analogous to bpf_l4_csum_replace(), add a case for header field
value of 0 to change the checksum at a given offset through a new helper
csum_replace_by_diff(). Also, in addition to that, this provides an
easy to use interface for feeding precalculated diffs f.e. coming from
a map. It nicely complements bpf_l3_csum_replace() that currently allows
only for csum updates of 2 and 4 byte diffs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 13:55:15 -05:00
David S. Miller
810813c47a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 12:34:12 -05:00
Linus Torvalds
e2857b8f11 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix ordering of WEXT netlink messages so we don't see a newlink
    after a dellink, from Johannes Berg.

 2) Out of bounds access in minstrel_ht_set_best_prob_rage, from
    Konstantin Khlebnikov.

 3) Paging buffer memory leak in iwlwifi, from Matti Gottlieb.

 4) Wrong units used to set initial TCP rto from cached metrics, also
    from Konstantin Khlebnikov.

 5) Fix stale IP options data in the SKB control block from leaking
    through layers of encapsulation, from Bernie Harris.

 6) Zero padding len miscalculated in bnxt_en, from Michael Chan.

 7) Only CHECKSUM_PARTIAL packets should be passed down through GSO, fix
    from Hannes Frederic Sowa.

 8) Fix suspend/resume with JME networking devices, from Diego Violat
    and Guo-Fu Tseng.

 9) Checksums not validated properly in bridge multicast support due to
    the placement of the SKB header pointers at the time of the check,
    fix from Álvaro Fernández Rojas.

10) Fix hang/tiemout with r8169 if a stats fetch is done while the
    device is runtime suspended.  From Chun-Hao Lin.

11) The forwarding database netlink dump facilities don't track the
    state of the dump properly, resulting in skipped/missed entries.
    From Minoura Makoto.

12) Fix regression from a recent 3c59x bug fix, from Neil Horman.

13) Fix list corruption in bna driver, from Ivan Vecera.

14) Big endian machines crash on vlan add in bnx2x, fix from Michal
    Schmidt.

15) Ethtool RSS configuration not propagated properly in mlx5 driver,
    from Tariq Toukan.

16) Fix regression in PHY probing in stmmac driver, from Gabriel
    Fernandez.

17) Fix SKB tailroom calculation in igmp/mld code, from Benjamin
    Poirier.

18) A past change to skip empty routing headers in ipv6 extention header
    parsing accidently caused fragment headers to not be matched any
    longer.  Fix from Florian Westphal.

19) eTSEC-106 erratum needs to be applied to more gianfar chips, from
    Atsushi Nemoto.

20) Fix netdev reference after free via workqueues in usb networking
    drivers, from Oliver Neukum and Bjørn Mork.

21) mdio->irq is now an array rather than a pointer to dynamic memory,
    but several drivers were still trying to free it :-/ Fixes from
    Colin Ian King.

22) act_ipt iptables action forgets to set the family field, thus LOG
    netfilter targets don't work with it.  Fix from Phil Sutter.

23) SKB leak in ibmveth when skb_linearize() fails, from Thomas Falcon.

24) pskb_may_pull() cannot be called with interrupts disabled, fix code
    that tries to do this in vmxnet3 driver, from Neil Horman.

25) be2net driver leaks iomap'd memory on removal, fix from Douglas
    Miller.

26) Forgotton RTNL mutex unlock in ppp_create_interface() error paths,
    from Guillaume Nault.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (97 commits)
  ppp: release rtnl mutex when interface creation fails
  cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind
  tcp: fix tcpi_segs_in after connection establishment
  net: hns: fix the bug about loopback
  jme: Fix device PM wakeup API usage
  jme: Do not enable NIC WoL functions on S0
  udp6: fix UDP/IPv6 encap resubmit path
  be2net: Don't leak iomapped memory on removal.
  vmxnet3: avoid calling pskb_may_pull with interrupts disabled
  net: ethernet: Add missing MFD_SYSCON dependency on HAS_IOMEM
  ibmveth: check return of skb_linearize in ibmveth_start_xmit
  cdc_ncm: toggle altsetting to force reset before setup
  usbnet: cleanup after bind() in probe()
  mlxsw: pci: Correctly determine if descriptor queue is full
  mlxsw: spectrum: Always decrement bridge's ref count
  tipc: fix nullptr crash during subscription cancel
  net: eth: altera: do not free array priv->mdio->irq
  net/ethoc: do not free array priv->mdio->irq
  net: sched: fix act_ipt for LOG target
  asix: do not free array priv->mdio->irq
  ...
2016-03-07 15:41:10 -08:00
Eric Dumazet
a9d99ce28e tcp: fix tcpi_segs_in after connection establishment
If final packet (ACK) of 3WHS is lost, it appears we do not properly
account the following incoming segment into tcpi_segs_in

While we are at it, starts segs_in with one, to count the SYN packet.

We do not yet count number of SYN we received for a request sock, we
might add this someday.

packetdrill script showing proper behavior after fix :

// Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.020 < P. 1:1001(1000) ack 1 win 32792

   +0 accept(3, ..., ...) = 4

+.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }%

Fixes: 2efd055c53 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 15:47:13 -05:00
Bill Sommerfeld
59dca1d8a6 udp6: fix UDP/IPv6 encap resubmit path
IPv4 interprets a negative return value from a protocol handler as a
request to redispatch to a new protocol.  In contrast, IPv6 interprets a
negative value as an error, and interprets a positive value as a request
for redispatch.

UDP for IPv6 was unaware of this difference.  Change __udp6_lib_rcv() to
return a positive value for redispatch.  Note that the socket's
encap_rcv hook still needs to return a negative value to request
dispatch, and in the case of IPv6 packets, adjust IP6CB(skb)->nhoff to
identify the byte containing the next protocol.

Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 15:23:12 -05:00
Richard Alpe
49cc66eaee tipc: move netlink policies to netlink.c
Make the c files less cluttered and enable netlink attributes to be
shared between files.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 14:56:41 -05:00
Zhang Shengju
8dfd329fbc arp: correct return value of arp_rcv
Currently, arp_rcv() always return zero on a packet delivery upcall.

To make its behavior more compliant with the way this API should be
used, this patch changes this to let it return NET_RX_SUCCESS when the
packet is proper handled, and NET_RX_DROP otherwise.

v1->v2:
If sanity check is failed, call kfree_skb() instead of consume_skb(), then
return the correct return value.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 14:55:04 -05:00
Wei Tang
8303394d81 netlabel: do not initialise statics to NULL
This patch fixes the checkpatch.pl error to netlabel_domainhash.c:

ERROR: do not initialise statics to NULL

Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 11:08:26 -05:00
Wei Tang
795f3512ca netlink: do not initialise statics to 0 or NULL
This patch fixes the checkpatch.pl error to netlabel_unlabeled.c:

ERROR: do not initialise statics to 0 or NULL

Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 11:08:26 -05:00
Jon Paul Maloy
e74a386d70 tipc: remove pre-allocated message header in link struct
Until now, we have kept a pre-allocated protocol message header
aggregated into struct tipc_link. Apart from adding unnecessary
footprint to the link instances, this requires extra code both to
initialize and re-initialize it.

We now remove this sub-optimization. This change also makes it
possible to clean up the function tipc_build_proto_msg() and remove
a couple of small functions that were accessing the mentioned header.
In particular, we can replace all occurrences of the local function
call link_own_addr(link) with the generic tipc_own_addr(net).

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 23:01:20 -05:00
Parthasarathy Bhuvaragan
4de13d7ed6 tipc: fix nullptr crash during subscription cancel
commit 4d5cfcba2f ('tipc: fix connection abort during subscription
cancel'), removes the check for a valid subscription before calling
tipc_nametbl_subscribe().

This will lead to a nullptr exception when we process a
subscription cancel request. For a cancel request, a null
subscription is passed to tipc_nametbl_subscribe() resulting
in exception.

In this commit, we call tipc_nametbl_subscribe() only for
a valid subscription.

Fixes: 4d5cfcba2f ('tipc: fix connection abort during subscription cancel')
Reported-by: Anders Widell <anders.widell@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 23:00:08 -05:00
Phil Sutter
44ef548f4f net: sched: fix act_ipt for LOG target
Before calling the destroy() or target() callbacks, the family parameter
field has to be initialized. Otherwise at least the LOG target will
refuse to work and upon removal oops the kernel.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 22:57:35 -05:00
Richard Alpe
34f65dbb6c tipc: make sure required IPv6 addresses are scoped
Make sure the user has provided a scope for multicast and link local
addresses used locally by a UDP bearer.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 22:54:57 -05:00
Richard Alpe
ddb3712552 tipc: safely copy UDP netlink data from user
The netlink policy for TIPC_NLA_UDP_LOCAL and TIPC_NLA_UDP_REMOTE
is of type binary with a defined length. This causes the policy
framework to threat the defined length as maximum length.

There is however no protection against a user sending a smaller
amount of data. Prior to this patch this wasn't handled which could
result in a partially incomplete sockaddr_storage struct containing
uninitialized data.

In this patch we use nla_memcpy() when copying the user data. This
ensures a potential gap at the end is cleared out properly.

This was found by Julia with Coccinelle tool.

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 22:54:57 -05:00
Richard Alpe
2837f39c7c tipc: don't check link reset on non existing link
Make sure we have a link before checking if it has been reset or not.

Prior to this patch tipc_link_is_reset() could be called with a non
existing link, resulting in a null pointer dereference.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 22:54:56 -05:00
Richard Alpe
9b3009604b tipc: add net device to skb before UDP xmit
Prior to this patch enabling a IPv4 UDP bearer caused a null pointer
dereference in iptunnel_xmit_stats(), when it tried to dereference the
net device from the skb. To resolve this we now point the skb device
to the net device resolved from the routing table.

Fixes: 039f50629b (ip_tunnel: Move stats update to iptunnel_xmit())
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 22:54:56 -05:00
WANG Cong
d1491fa54c act_ife: fix a typo in kmemdup() parameters
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 22:43:44 -05:00
Zhang Shengju
3ef523aeee wireless: use reset to set mac header
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-04 22:45:13 -05:00
Zhang Shengju
d57a544d71 mac80211: use reset to set header pointer
Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-04 22:45:13 -05:00
David Howells
a4373a489e rxrpc: Don't try to map ICMP to error as the lower layer already did that
In the ICMP message processing code, don't try to map ICMP codes to UNIX
error codes as the caller (IPv4/IPv6) already did that for us (ee_errno).

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 16:02:03 +00:00
David Howells
ab802ee0ab rxrpc: Clear the unused part of a sockaddr_rxrpc for memcmp() use
Clear the unused part of a sockaddr_rxrpc structs so that memcmp() can be
used to compare them.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:59:49 +00:00
David Howells
2b15ef15bc rxrpc: rxkad: Casts are needed when comparing be32 values
Forced casts are needed to avoid sparse warning when directly comparing
be32 values.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:59:13 +00:00
David Howells
098a20991d rxrpc: rxkad: The version number in the response should be net byte order
The version number rxkad places in the response should be network byte
order.

Whilst we're at it, rearrange the code to be more readable.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:59:00 +00:00
David Howells
ee72b9fddb rxrpc: Use ACCESS_ONCE() when accessing circular buffer pointers
Use ACCESS_ONCE() when accessing the other-end pointer into a circular
buffer as it's possible the other-end pointer might change whilst we're
doing this, and if we access it twice, we might get some weird things
happening.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:58:06 +00:00
David Howells
b4f1342f91 rxrpc: Adjust some whitespace and comments
Remove some excess whitespace, insert some missing spaces and adjust a
couple of comments.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:56:19 +00:00
David Howells
351c1e6486 rxrpc: Be more selective about the types of received packets we accept
Currently, received RxRPC packets outside the range 1-13 are rejected.
There are, however, holes in the range that should also be rejected - plus
at least one type we don't yet support - so reject these also.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:56:06 +00:00
David Howells
ee6fe085a9 rxrpc: Fix defined range for /proc/sys/net/rxrpc/rx_mtu
The upper bound of the defined range for rx_mtu is being set in the same
member as the lower bound (extra1) rather than the correct place (extra2).
I'm not entirely sure why this compiles.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:55:32 +00:00
David Howells
e33b3d97bc rxrpc: The protocol family should be set to PF_RXRPC not PF_UNIX
Fix the protocol family set in the proto_ops for rxrpc to be PF_RXRPC not
PF_UNIX.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:54:27 +00:00
David Howells
0d12f8a402 rxrpc: Keep the skb private record of the Rx header in host byte order
Currently, a copy of the Rx packet header is copied into the the sk_buff
private data so that we can advance the pointer into the buffer,
potentially discarding the original.  At the moment, this copy is held in
network byte order, but this means we're doing a lot of unnecessary
translations.

The reasons it was done this way are that we need the values in network
byte order occasionally and we can use the copy, slightly modified, as part
of an iov array when sending an ack or an abort packet.

However, it seems more reasonable on review that it would be better kept in
host byte order and that we make up a new header when we want to send
another packet.

To this end, rename the original header struct to rxrpc_wire_header (with
BE fields) and institute a variant called rxrpc_host_header that has host
order fields.  Change the struct in the sk_buff private data into an
rxrpc_host_header and translate the values when filling it in.

This further allows us to keep values kept in various structures in host
byte order rather than network byte order and allows removal of some fields
that are byteswapped duplicates.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:53:46 +00:00
David Howells
4c198ad17a rxrpc: Rename call events to begin RXRPC_CALL_EV_
Rename call event names to begin RXRPC_CALL_EV_ to distinguish them from the
flags.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:53:46 +00:00
David Howells
5b8848d149 rxrpc: Convert call flag and event numbers into enums
Convert call flag and event numbers into enums and move their definitions
outside of the struct.

Also move the call state enum outside of the struct and add an extra
element to count the number of states.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:53:46 +00:00
David Howells
e721498a63 rxrpc: Fix a case where a call event bit is being used as a flag bit
Fix a case where RXRPC_CALL_RELEASE (an event) is being used to specify a
flag bit.  RXRPC_CALL_RELEASED should be used instead.

Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-04 15:53:46 +00:00
Eric Dumazet
1f27cde313 net: sched: use pfifo_fast for non real queues
Some devices declare a high number of TX queues, then set a much
lower real_num_tx_queues

This cause setups using fq_codel, sfq or fq as the default qdisc to consume
more memory than really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 17:38:46 -05:00
David Ahern
799977d9aa net: ipv6: Fix refcnt on host routes
Andrew and Ying Huang's test robot both reported usage count problems that
trace back to the 'keep address on ifdown' patch.

>From Andrew:
We execute CRIU test on linux-next. On the current linux-next kernel
they hangs on creating a network namespace.

The kernel log contains many massages like this:
[ 1036.122108] unregister_netdevice: waiting for lo to become free.
Usage count = 2
[ 1046.165156] unregister_netdevice: waiting for lo to become free.
Usage count = 2
[ 1056.210287] unregister_netdevice: waiting for lo to become free.
Usage count = 2

I tried to revert this patch and the bug disappeared.

Here is a set of commands to reproduce this bug:

[root@linux-next-test linux-next]# uname -a
Linux linux-next-test 4.5.0-rc6-next-20160301+ #3 SMP Wed Mar 2
17:32:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[root@linux-next-test ~]# unshare -n
[root@linux-next-test ~]# ip link set up dev lo
[root@linux-next-test ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
[root@linux-next-test ~]# logout
[root@linux-next-test ~]# unshare -n

 -----

The problem is a change made to RTM_DELADDR case in __ipv6_ifa_notify that
was added in an early version of the offending patch and is no longer
needed.

Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Cc: Andrey Wagin <avagin@gmail.com>
Cc: Ying Huang <ying.huang@linux.intel.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: Jeremiah Mahler <jmmahler@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 17:26:18 -05:00
Arnd Bergmann
3d1cbe839a net: mellanox: add DEVLINK dependencies
The new NET_DEVLINK infrastructure can be a loadable module, but the drivers
using it might be built-in, which causes link errors like:

drivers/net/built-in.o: In function `mlx4_load_one':
:(.text+0x2fbfda): undefined reference to `devlink_port_register'
:(.text+0x2fc084): undefined reference to `devlink_port_unregister'
drivers/net/built-in.o: In function `mlxsw_sx_port_remove':
:(.text+0x33a03a): undefined reference to `devlink_port_type_clear'
:(.text+0x33a04e): undefined reference to `devlink_port_unregister'

There are multiple ways to avoid this:

a) add 'depends on NET_DEVLINK || !NET_DEVLINK' dependencies
   for each user
b) use 'select NET_DEVLINK' from each driver that uses it
   and hide the symbol in Kconfig.
c) make NET_DEVLINK a 'bool' option so we don't have to
   list it as a dependency, and rely on the APIs to be
   stubbed out when it is disabled
d) use IS_REACHABLE() rather than IS_ENABLED() to check for
   NET_DEVLINK in include/net/devlink.h

This implements a variation of approach a) by adding an
intermediate symbol that drivers can depend on, and changes
the three drivers using it.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 09d4d087cd ("mlx4: Implement devlink interface")
Fixes: c4745500e9 ("mlxsw: Implement devlink interface")
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 17:08:59 -05:00
Florian Westphal
5d150a9855 ipv6: re-enable fragment header matching in ipv6_find_hdr
When ipv6_find_hdr is used to find a fragment header
(caller specifies target NEXTHDR_FRAGMENT) we erronously return
-ENOENT for all fragments with nonzero offset.

Before commit 9195bb8e38, when target was specified, we did not
enter the exthdr walk loop as nexthdr == target so this used to work.

Now we do (so we can skip empty route headers). When we then stumble upon
a frag with nonzero frag_off we must return -ENOENT ("header not found")
only if the caller did not specifically request NEXTHDR_FRAGMENT.

This allows nfables exthdr expression to match ipv6 fragments, e.g. via

nft add rule ip6 filter input frag frag-off gt 0

Fixes: 9195bb8e38 ("ipv6: improve ipv6_find_hdr() to skip empty routing headers")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 16:35:20 -05:00
Parthasarathy Bhuvaragan
f214fc4029 tipc: Revert "tipc: use existing sk_write_queue for outgoing packet chain"
reverts commit 94153e36e7 ("tipc: use existing sk_write_queue for
outgoing packet chain")

In Commit 94153e36e7, we assume that we fill & empty the socket's
sk_write_queue within the same lock_sock() session.

This is not true if the link is congested. During congestion, the
socket lock is released while we wait for the congestion to cease.
This implementation causes a nullptr exception, if the user space
program has several threads accessing the same socket descriptor.

Consider two threads of the same program performing the following:
     Thread1                                  Thread2
--------------------                    ----------------------
Enter tipc_sendmsg()                    Enter tipc_sendmsg()
lock_sock()                             lock_sock()
Enter tipc_link_xmit(), ret=ELINKCONG   spin on socket lock..
sk_wait_event()                             :
release_sock()                          grab socket lock
    :                                   Enter tipc_link_xmit(), ret=0
    :                                   release_sock()
Wakeup after congestion
lock_sock()
skb = skb_peek(pktchain);
!! TIPC_SKB_CB(skb)->wakeup_pending = tsk->link_cong;

In this case, the second thread transmits the buffers belonging to
both thread1 and thread2 successfully. When the first thread wakeup
after the congestion it assumes that the pktchain is intact and
operates on the skb's in it, which leads to the following exception:

[2102.439969] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[2102.440074] IP: [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc]
[2102.440074] PGD 3fa3f067 PUD 3fa6b067 PMD 0
[2102.440074] Oops: 0000 [#1] SMP
[2102.440074] CPU: 2 PID: 244 Comm: sender Not tainted 3.12.28 #1
[2102.440074] RIP: 0010:[<ffffffffa005f330>]  [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc]
[...]
[2102.440074] Call Trace:
[2102.440074]  [<ffffffff8163f0b9>] ? schedule+0x29/0x70
[2102.440074]  [<ffffffffa006a756>] ? tipc_node_unlock+0x46/0x170 [tipc]
[2102.440074]  [<ffffffffa005f761>] tipc_link_xmit+0x51/0xf0 [tipc]
[2102.440074]  [<ffffffffa006d8ae>] tipc_send_stream+0x11e/0x4f0 [tipc]
[2102.440074]  [<ffffffff8106b150>] ? __wake_up_sync+0x20/0x20
[2102.440074]  [<ffffffffa006dc9c>] tipc_send_packet+0x1c/0x20 [tipc]
[2102.440074]  [<ffffffff81502478>] sock_sendmsg+0xa8/0xd0
[2102.440074]  [<ffffffff81507895>] ? release_sock+0x145/0x170
[2102.440074]  [<ffffffff815030d8>] ___sys_sendmsg+0x3d8/0x3e0
[2102.440074]  [<ffffffff816426ae>] ? _raw_spin_unlock+0xe/0x10
[2102.440074]  [<ffffffff81115c2a>] ? handle_mm_fault+0x6ca/0x9d0
[2102.440074]  [<ffffffff8107dd65>] ? set_next_entity+0x85/0xa0
[2102.440074]  [<ffffffff816426de>] ? _raw_spin_unlock_irq+0xe/0x20
[2102.440074]  [<ffffffff8107463c>] ? finish_task_switch+0x5c/0xc0
[2102.440074]  [<ffffffff8163ea8c>] ? __schedule+0x34c/0x950
[2102.440074]  [<ffffffff81504e12>] __sys_sendmsg+0x42/0x80
[2102.440074]  [<ffffffff81504e62>] SyS_sendmsg+0x12/0x20
[2102.440074]  [<ffffffff8164aed2>] system_call_fastpath+0x16/0x1b

In this commit, we maintain the skb list always in the stack.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 16:30:29 -05:00
David S. Miller
aefd3fb26e Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-03-01

Here's our main set of Bluetooth & 802.15.4 patches for the 4.6 kernel.

 - New Bluetooth HCI driver for Intel/AG6xx controllers
 - New Broadcom ACPI IDs
 - LED trigger support for indicating Bluetooth powered state
 - Various fixes in mac802154, 6lowpan and related drivers
 - New USB IDs for AR3012 Bluetooth controllers

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 16:27:23 -05:00
Benjamin Poirier
1837b2e2bc mld, igmp: Fix reserved tailroom calculation
The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data____________|__tlen___|__extra__]
^                                               ^
head                                            skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data____________|__extra__|__tlen___]
^                                               ^
head                                            skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

The min() in the third line can be expanded into:
if mtu < skb_tailroom - tlen:
	reserved_tailroom = skb_tailroom - mtu
else:
	reserved_tailroom = tlen

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 15:41:07 -05:00
Eric Engestrom
a9d562358b net/ipv4: remove left over dead code
8cc785f6f4 ("net: ipv4: make the ping
/proc code AF-independent") removed the code using it, but renamed this
variable instead of removing it.

Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 15:00:55 -05:00
Eric Engestrom
6353e1875d net/rtnetlink: remove dead code
3b766cd832 ("net/core: Add reading VF
statistics through the PF netdevice") added that variable but it's never
been used.

Signed-off-by: Eric Engestrom <eric.engestrom@imgtec.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 15:00:55 -05:00
Avinash Repaka
1659185fb4 RDS: IB: Support Fastreg MR (FRMR) memory registration mode
Fastreg MR(FRMR) is another method with which one can
register memory to HCA. Some of the newer HCAs supports only fastreg
mr mode, so we need to add support for it to have RDS functional
on them.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 14:13:19 -05:00
santosh.shilimkar@oracle.com
ad6832f950 RDS: IB: allocate extra space on queues for FRMR support
Fastreg MR(FRMR) memory registration and invalidation makes use
of work request and completion queues for its operation. Patch
allocates extra queue space towards these operation(s).

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 14:13:19 -05:00
santosh.shilimkar@oracle.com
2cb2912d65 RDS: IB: add Fastreg MR (FRMR) detection support
Discovere Fast Memmory Registration support using IB device
IB_DEVICE_MEM_MGT_EXTENSIONS. Certain HCA might support just FRMR
or FMR or both FMR and FRWR. In case both mr type are supported,
default FMR is used.

Default MR is still kept as FMR against what everyone else
is following. Default will be changed to FRMR once the
RDS performance with FRMR is comparable with FMR. The
work is in progress for the same.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 14:13:19 -05:00
santosh.shilimkar@oracle.com
db42753adb RDS: IB: add mr reused stats
Add MR reuse statistics to RDS IB transport.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 14:13:19 -05:00
santosh.shilimkar@oracle.com
37ea401e9c RDS: IB: handle the RDMA CM time wait event
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that
it can reconnect and resume.

While testing fastreg, this error happened in couple of tests but
was getting un-noticed.

Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02 14:13:19 -05:00