Commit Graph

433 Commits

Author SHA1 Message Date
Willem de Bruijn 008ba2a13f packet: hold bind lock when rebinding to fanout hook
Packet socket bind operations must hold the po->bind_lock. This keeps
po->running consistent with whether the socket is actually on a ptype
list to receive packets.

fanout_add unbinds a socket and its packet_rcv/tpacket_rcv call, then
binds the fanout object to receive through packet_rcv_fanout.

Make it hold the po->bind_lock when testing po->running and rebinding.
Else, it can race with other rebind operations, such as that in
packet_set_ring from packet_rcv to tpacket_rcv. Concurrent updates
can result in a socket being added to a fanout group twice, causing
use-after-free KASAN bug reports, among others.

Reported independently by both trinity and syzkaller.
Verified that the syzkaller reproducer passes after this patch.

Fixes: dc99f60069 ("packet: Add fanout support.")
Reported-by: nixioaming <nixiaoming@huawei.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-20 13:57:19 -07:00
David S. Miller 6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Benjamin Poirier edbd58be15 packet: Don't write vnet header beyond end of buffer
... which may happen with certain values of tp_reserve and maclen.

Fixes: 58d19b19cd ("packet: vnet_hdr support for tpacket_rcv")
Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29 15:09:13 -07:00
David S. Miller 3b2b69efec Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mainline had UFO fixes, but UFO is removed in net-next so we
take the HEAD hunks.

Minor context conflict in bcmsysport statistics bug fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 12:11:16 -07:00
Willem de Bruijn c27927e372 packet: fix tp_reserve race in packet_set_ring
Updates to tp_reserve can race with reads of the field in
packet_set_ring. Avoid this by holding the socket lock during
updates in setsockopt PACKET_RESERVE.

This bug was discovered by syzkaller.

Fixes: 8913336a7e ("packet: add PACKET_RESERVE sockopt")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-10 09:52:12 -07:00
David S. Miller 29fda25a2d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-01 10:07:50 -07:00
WANG Cong c800aaf8d8 packet: fix use-after-free in prb_retire_rx_blk_timer_expired()
There are multiple reports showing we have a use-after-free in
the timer prb_retire_rx_blk_timer_expired(), where we use struct
tpacket_kbdq_core::pkbdq, a pg_vec, after it gets freed by
free_pg_vec().

The interesting part is it is not freed via packet_release() but
via packet_setsockopt(), which means we are not closing the socket.
Looking into the big and fat function packet_set_ring(), this could
happen if we satisfy the following conditions:

1. closing == 0, not on packet_release() path
2. req->tp_block_nr == 0, we don't allocate a new pg_vec
3. rx_ring->pg_vec is already set as V3, which means we already called
   packet_set_ring() wtih req->tp_block_nr > 0 previously
4. req->tp_frame_nr == 0, pass sanity check
5. po->mapped == 0, never called mmap()

In this scenario we are clearing the old rx_ring->pg_vec, so we need
to free this pg_vec, but we don't stop the timer on this path because
of closing==0.

The timer has to be stopped as long as we need to free pg_vec, therefore
the check on closing!=0 is wrong, we should check pg_vec!=NULL instead.

Thanks to liujian for testing different fixes.

Reported-by: alexander.levin@verizon.com
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: liujian (CE) <liujian56@huawei.com>
Tested-by: liujian (CE) <liujian56@huawei.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 17:33:19 -07:00
David S. Miller 7a68ada6ec Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-07-21 03:38:43 +01:00
Rosen, Rami 8410095730 net/packet: remove unused PGV_FROM_VMALLOC definition.
This patch removes the definition of PGV_FROM_VMALLOC from af_packet.c.
The PGV_FROM_VMALLOC definition was already removed by
commit 441c793a56 ("net: cleanup unused macros in net directory"),
and its usage was removed even before by commit c56b4d9012
("af_packet: remove pgv.flags"); but it was added back by mistake later on,
in commit f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation").

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-19 16:27:48 -07:00
Iván Briano ccd4eb49f3 net/packet: Fix Tx queue selection for AF_PACKET
When PACKET_QDISC_BYPASS is not used, Tx queue selection will be done
before the packet is enqueued, taking into account any mappings set by
a queuing discipline such as mqprio without hardware offloading. This
selection may be affected by a previously saved queue_mapping, either on
the Rx path, or done before the packet reaches the device, as it's
currently the case for AF_PACKET.

In order for queue selection to work as expected when using traffic
control, there can't be another selection done before that point is
reached, so move the call to packet_pick_tx_queue to
packet_direct_xmit, leaving the default xmit path as it was before
PACKET_QDISC_BYPASS was introduced.

A forward declaration of packet_pick_tx_queue() is introduced to avoid
the need to reorder the functions within the file.

Fixes: d346a3fae3 ("packet: introduce PACKET_QDISC_BYPASS socket option")
Signed-off-by: Iván Briano <ivan.briano@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-14 08:20:28 -07:00
Reshetova, Elena fb5c2c17a5 net: convert packet_fanout.sk_ref from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:09 -07:00
Reshetova, Elena 41c6d650f6 net: convert sock.sk_refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Reshetova, Elena 14afee4b60 net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Rosen, Rami 664f46a290 net/packet: remove unneeded declaraion of tpacket_snd().
This patch removes unneeded forward declaration of tpacket_snd()
in net/packet/af_packet.c.

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10 16:15:25 -04:00
Rosen, Rami 878cd3ba37 net/packet: remove unused parameter in prb_curr_blk_in_use().
This patch removes unused parameter from prb_curr_blk_in_use() method
in net/packet/af_packet.c.

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 13:15:11 -04:00
Douglas Caetano dos Santos d19b183cdc net/packet: fix missing net_device reference release
When using a TX ring buffer, if an error occurs processing a control
message (e.g. invalid message), the net_device reference is not
released.

Fixes c14ac9451c ("sock: enable timestamping using control messages")
Signed-off-by: Douglas Caetano dos Santos <douglascs@taghos.com.br>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-15 14:22:12 -04:00
David S. Miller b1513c3531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 22:39:08 -04:00
Alexander Potapenko fd2c83b357 net/packet: check length in getsockopt() called with PACKET_HDRLEN
In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
|val| remains uninitialized and the syscall may behave differently
depending on its value, and even copy garbage to userspace on certain
architectures. To fix this we now return -EINVAL if optlen is too small.

This bug has been detected with KMSAN.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 14:05:52 -04:00
Mike Maloney 4a69a86420 packet: add PACKET_FANOUT_FLAG_UNIQUEID to assign new fanout group id.
Fanout uses a per net global namespace. A process that intends to create
a new fanout group can accidentally join an existing group. It is not
possible to detect this.

Add socket option PACKET_FANOUT_FLAG_UNIQUEID.  When specified the
supplied fanout group id must be set to 0, and the kernel chooses an id
that is not already in use.  This is an ephemeral flag so that
other sockets can be added to this group using setsockopt, but NOT
specifying this flag.  The current getsockopt(..., PACKET_FANOUT, ...)
can be used to retrieve the new group id.

We assume that there are not a lot of fanout groups and that this is not
a high frequency call.

The method assigns ids starting at zero and increases until it finds an
unused id.  It keeps track of the last assigned id, and uses it as a
starting point to find new ids.

Signed-off-by: Mike Maloney <maloney@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 12:46:00 -04:00
Andrey Konovalov bcc5364bdc net/packet: fix overflow in check for tp_reserve
When calculating po->tp_hdrlen + po->tp_reserve the result can overflow.

Fix by checking that tp_reserve <= INT_MAX on assign.

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 11:04:00 -07:00
Andrey Konovalov 8f8d28e4d6 net/packet: fix overflow in check for tp_frame_nr
When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow.

Add a check that tp_block_size * tp_block_nr <= UINT_MAX.

Since frames_per_block <= tp_block_size, the expression would
never overflow.

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 11:03:59 -07:00
Andrey Konovalov 2b6867c2ce net/packet: fix overflow in check for priv area size
Subtracting tp_sizeof_priv from tp_block_size and casting to int
to check whether one is less then the other doesn't always work
(both of them are unsigned ints).

Compare them as is instead.

Also cast tp_sizeof_priv to u64 before using BLK_PLUS_PRIV, as
it can overflow inside BLK_PLUS_PRIV otherwise.

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-30 11:03:59 -07:00
Alexander Potapenko 540e2894f7 net: don't call strlen() on the user buffer in packet_bind_spkt()
KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
uninitialized memory in packet_bind_spkt():
Acked-by: Eric Dumazet <edumazet@google.com>

==================================================================
BUG: KMSAN: use of unitialized memory
CPU: 0 PID: 1074 Comm: packet Not tainted 4.8.0-rc6+ #1891
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
 0000000000000000 ffff88006b6dfc08 ffffffff82559ae8 ffff88006b6dfb48
 ffffffff818a7c91 ffffffff85b9c870 0000000000000092 ffffffff85b9c550
 0000000000000000 0000000000000092 00000000ec400911 0000000000000002
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff82559ae8>] dump_stack+0x238/0x290 lib/dump_stack.c:51
 [<ffffffff818a6626>] kmsan_report+0x276/0x2e0 mm/kmsan/kmsan.c:1003
 [<ffffffff818a783b>] __msan_warning+0x5b/0xb0
mm/kmsan/kmsan_instr.c:424
 [<     inline     >] strlen lib/string.c:484
 [<ffffffff8259b58d>] strlcpy+0x9d/0x200 lib/string.c:144
 [<ffffffff84b2eca4>] packet_bind_spkt+0x144/0x230
net/packet/af_packet.c:3132
 [<ffffffff84242e4d>] SYSC_bind+0x40d/0x5f0 net/socket.c:1370
 [<ffffffff84242a22>] SyS_bind+0x82/0xa0 net/socket.c:1356
 [<ffffffff8515991b>] entry_SYSCALL_64_fastpath+0x13/0x8f
arch/x86/entry/entry_64.o:?
chained origin: 00000000eba00911
 [<ffffffff810bb787>] save_stack_trace+0x27/0x50
arch/x86/kernel/stacktrace.c:67
 [<     inline     >] kmsan_save_stack_with_flags mm/kmsan/kmsan.c:322
 [<     inline     >] kmsan_save_stack mm/kmsan/kmsan.c:334
 [<ffffffff818a59f8>] kmsan_internal_chain_origin+0x118/0x1e0
mm/kmsan/kmsan.c:527
 [<ffffffff818a7773>] __msan_set_alloca_origin4+0xc3/0x130
mm/kmsan/kmsan_instr.c:380
 [<ffffffff84242b69>] SYSC_bind+0x129/0x5f0 net/socket.c:1356
 [<ffffffff84242a22>] SyS_bind+0x82/0xa0 net/socket.c:1356
 [<ffffffff8515991b>] entry_SYSCALL_64_fastpath+0x13/0x8f
arch/x86/entry/entry_64.o:?
origin description: ----address@SYSC_bind (origin=00000000eb400911)
==================================================================
(the line numbers are relative to 4.8-rc6, but the bug persists
upstream)

, when I run the following program as root:

=====================================
 #include <string.h>
 #include <sys/socket.h>
 #include <netpacket/packet.h>
 #include <net/ethernet.h>

 int main() {
   struct sockaddr addr;
   memset(&addr, 0xff, sizeof(addr));
   addr.sa_family = AF_PACKET;
   int fd = socket(PF_PACKET, SOCK_PACKET, htons(ETH_P_ALL));
   bind(fd, &addr, sizeof(addr));
   return 0;
 }
=====================================

This happens because addr.sa_data copied from the userspace is not
zero-terminated, and copying it with strlcpy() in packet_bind_spkt()
results in calling strlen() on the kernel copy of that non-terminated
buffer.

Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 20:55:57 -08:00
David S. Miller f787d1debf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-19 11:18:46 -05:00
Anoob Soman 2bd624b461 packet: Do not call fanout_release from atomic contexts
Commit 6664498280 ("packet: call fanout_release, while UNREGISTERING a
netdev"), unfortunately, introduced the following issues.

1. calling mutex_lock(&fanout_mutex) (fanout_release()) from inside
rcu_read-side critical section. rcu_read_lock disables preemption, most often,
which prohibits calling sleeping functions.

[  ] include/linux/rcupdate.h:560 Illegal context switch in RCU read-side critical section!
[  ]
[  ] rcu_scheduler_active = 1, debug_locks = 0
[  ] 4 locks held by ovs-vswitchd/1969:
[  ]  #0:  (cb_lock){++++++}, at: [<ffffffff8158a6c9>] genl_rcv+0x19/0x40
[  ]  #1:  (ovs_mutex){+.+.+.}, at: [<ffffffffa04878ca>] ovs_vport_cmd_del+0x4a/0x100 [openvswitch]
[  ]  #2:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81564157>] rtnl_lock+0x17/0x20
[  ]  #3:  (rcu_read_lock){......}, at: [<ffffffff81614165>] packet_notifier+0x5/0x3f0
[  ]
[  ] Call Trace:
[  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
[  ]  [<ffffffff810c9077>] lockdep_rcu_suspicious+0x107/0x110
[  ]  [<ffffffff810a2da7>] ___might_sleep+0x57/0x210
[  ]  [<ffffffff810a2fd0>] __might_sleep+0x70/0x90
[  ]  [<ffffffff8162e80c>] mutex_lock_nested+0x3c/0x3a0
[  ]  [<ffffffff810de93f>] ? vprintk_default+0x1f/0x30
[  ]  [<ffffffff81186e88>] ? printk+0x4d/0x4f
[  ]  [<ffffffff816106dd>] fanout_release+0x1d/0xe0
[  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0

2. calling mutex_lock(&fanout_mutex) inside spin_lock(&po->bind_lock).
"sleeping function called from invalid context"

[  ] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[  ] in_atomic(): 1, irqs_disabled(): 0, pid: 1969, name: ovs-vswitchd
[  ] INFO: lockdep is turned off.
[  ] Call Trace:
[  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
[  ]  [<ffffffff810a2f52>] ___might_sleep+0x202/0x210
[  ]  [<ffffffff810a2fd0>] __might_sleep+0x70/0x90
[  ]  [<ffffffff8162e80c>] mutex_lock_nested+0x3c/0x3a0
[  ]  [<ffffffff816106dd>] fanout_release+0x1d/0xe0
[  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0

3. calling dev_remove_pack(&fanout->prot_hook), from inside
spin_lock(&po->bind_lock) or rcu_read-side critical-section. dev_remove_pack()
-> synchronize_net(), which might sleep.

[  ] BUG: scheduling while atomic: ovs-vswitchd/1969/0x00000002
[  ] INFO: lockdep is turned off.
[  ] Call Trace:
[  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
[  ]  [<ffffffff81186274>] __schedule_bug+0x64/0x73
[  ]  [<ffffffff8162b8cb>] __schedule+0x6b/0xd10
[  ]  [<ffffffff8162c5db>] schedule+0x6b/0x80
[  ]  [<ffffffff81630b1d>] schedule_timeout+0x38d/0x410
[  ]  [<ffffffff810ea3fd>] synchronize_sched_expedited+0x53d/0x810
[  ]  [<ffffffff810ea6de>] synchronize_rcu_expedited+0xe/0x10
[  ]  [<ffffffff8154eab5>] synchronize_net+0x35/0x50
[  ]  [<ffffffff8154eae3>] dev_remove_pack+0x13/0x20
[  ]  [<ffffffff8161077e>] fanout_release+0xbe/0xe0
[  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0

4. fanout_release() races with calls from different CPU.

To fix the above problems, remove the call to fanout_release() under
rcu_read_lock(). Instead, call __dev_remove_pack(&fanout->prot_hook) and
netdev_run_todo will be happy that &dev->ptype_specific list is empty. In order
to achieve this, I moved dev_{add,remove}_pack() out of fanout_{add,release} to
__fanout_{link,unlink}. So, call to {,__}unregister_prot_hook() will make sure
fanout->prot_hook is removed as well.

Fixes: 6664498280 ("packet: call fanout_release, while UNREGISTERING a netdev")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Anoob Soman <anoob.soman@citrix.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 11:11:35 -05:00
David S. Miller 3f64116a83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-16 19:34:01 -05:00
Eric Dumazet d199fab63c packet: fix races in fanout_add()
Multiple threads can call fanout_add() at the same time.

We need to grab fanout_mutex earlier to avoid races that could
lead to one thread freeing po->rollover that was set by another thread.

Do the same in fanout_release(), for peace of mind, and to help us
finding lockdep issues earlier.

Fixes: dc99f60069 ("packet: Add fanout support.")
Fixes: 0648ab70af ("packet: rollover prepare: per-socket state")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-14 15:05:12 -05:00
David S. Miller 35eeacf182 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-11 02:31:11 -05:00
Willem de Bruijn 57031eb794 packet: round up linear to header len
Link layer protocols may unconditionally pull headers, as Ethernet
does in eth_type_trans. Ensure that the entire link layer header
always lies in the skb linear segment. tpacket_snd has such a check.
Extend this to packet_snd.

Variable length link layer headers complicate the computation
somewhat. Here skb->len may be smaller than dev->hard_header_len.

Round up the linear length to be at least as long as the smallest of
the two.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-08 13:56:37 -05:00
David S. Miller 4e8f2fc1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two trivial overlapping changes conflicts in MPLS and mlx5.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28 10:33:06 -05:00
Jason Wang 6391a4481b virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receiving
Commit 501db51139 ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
xmit") in fact disables VIRTIO_HDR_F_DATA_VALID on receiving path too,
fixing this by adding a hint (has_data_valid) and set it only on the
receiving path.

Cc: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 11:01:17 -05:00
Daniel Borkmann 57ea884b0d packet: fix panic in __packet_set_timestamp on tpacket_v3 in tx mode
When TX timestamping is in use with TPACKET_V3's TX ring, then we'll
hit the BUG() in __packet_set_timestamp() when ring buffer slot is
returned to user space via tpacket_destruct_skb(). This is due to v3
being assumed as unreachable here, but since 7f953ab2ba ("af_packet:
TX_RING support for TPACKET_V3") it's not anymore. Fix it by filling
the timestamp back into the ring slot.

Fixes: 7f953ab2ba ("af_packet: TX_RING support for TPACKET_V3")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-04 23:55:42 -05:00
Sowmini Varadhan 7f953ab2ba af_packet: TX_RING support for TPACKET_V3
Although TPACKET_V3 Rx has some benefits over TPACKET_V2 Rx, *_v3
does not currently have TX_RING support. As a result an application
that wants the best perf for Tx and Rx (e.g. to handle request/response
transacations) ends up needing 2 sockets, one with *_v2 for Tx and
another with *_v3 for Rx.

This patch enables TPACKET_V2 compatible Tx features in TPACKET_V3
so that an application can use a single descriptor to get the benefits
of _v3 RX_RING and _v2 TX_RING. An application may do a block-send by
first filling up multiple frames in the Tx ring and then triggering a
transmit. This patch only support fixed size Tx frames for TPACKET_V3,
and requires that tp_next_offset must be zero.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 11:00:27 -05:00
Linus Torvalds 7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds 9a19a6db37 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:

 - more ->d_init() stuff (work.dcache)

 - pathname resolution cleanups (work.namei)

 - a few missing iov_iter primitives - copy_from_iter_full() and
   friends. Either copy the full requested amount, advance the iterator
   and return true, or fail, return false and do _not_ advance the
   iterator. Quite a few open-coded callers converted (and became more
   readable and harder to fuck up that way) (work.iov_iter)

 - several assorted patches, the big one being logfs removal

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  logfs: remove from tree
  vfs: fix put_compat_statfs64() does not handle errors
  namei: fold should_follow_link() with the step into not-followed link
  namei: pass both WALK_GET and WALK_MORE to should_follow_link()
  namei: invert WALK_PUT logics
  namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
  namei: saner calling conventions for mountpoint_last()
  namei.c: get rid of user_path_parent()
  switch getfrag callbacks to ..._full() primitives
  make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
  [iov_iter] new primitives - copy_from_iter_full() and friends
  don't open-code file_inode()
  ceph: switch to use of ->d_init()
  ceph: unify dentry_operations instances
  lustre: switch to use of ->d_init()
2016-12-16 10:24:44 -08:00
Al Viro cbbd26b8b1 [iov_iter] new primitives - copy_from_iter_full() and friends
copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.

Convert some obvious users.  *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case.  Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 14:33:36 -05:00
David S. Miller 2745529ac7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Couple conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the
   RX tail pointer properly overlapped with some changes
   to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
   overlapping with a reorganization of the driver to support
   ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the
   stmmac driver, meanwhile a lot of this code was cleaned up
   and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in
   'net-next' called __fl_delete() and this overlapped with
   Daniel Borkamann's bug fix to use RCU for object destruction
   in 'net'.  It also overlapped with Jiri's change to guard
   the rhashtable_remove_fast() call with a check against
   tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some
   unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head()
   bug fix in 'net' overlapped with a large reorganization of
   the same code in 'net-next'.  Since the 'net-next' code no
   longer had the bug in question, there was nothing to do
   other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 12:29:53 -05:00
Philip Pettersson 84ac726023 packet: fix race condition in packet_set_ring
When packet_set_ring creates a ring buffer it will initialize a
struct timer_list if the packet version is TPACKET_V3. This value
can then be raced by a different thread calling setsockopt to
set the version to TPACKET_V1 before packet_set_ring has finished.

This leads to a use-after-free on a function pointer in the
struct timer_list when the socket is closed as the previously
initialized timer will not be deleted.

The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
changing the packet version while also taking the lock at the start
of packet_set_ring.

Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 12:16:49 -05:00
Jarno Rajahalme 5a21388126 af_packet: Use virtio_net_hdr_from_skb() directly.
Remove static function __packet_rcv_vnet(), which only called
virtio_net_hdr_from_skb() and BUG()ged out if an error code was
returned.  Instead, call virtio_net_hdr_from_skb() from the former
call sites of __packet_rcv_vnet() and actually use the error handling
code that is already there.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 10:37:03 -05:00
Jarno Rajahalme db60eb5f65 af_packet: Use virtio_net_hdr_to_skb().
Use the common virtio_net_hdr_to_skb() instead of open coding it.
Other call sites were changed by commit fd2a0437dc, but this one was
missed, maybe because it is split in two parts of the source code.

Interim comparisons of 'vnet_hdr->gso_type' still work as both the
vnet_hdr and skb notion of gso_type is zero when there is no gso.

Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 10:37:03 -05:00
Jarno Rajahalme 9403cd7cbb virtio_net: Do not clear memory for struct virtio_net_hdr twice.
virtio_net_hdr_from_skb() clears the memory for the header, so there
is no point for the callers to do the same.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 10:37:03 -05:00
Willem de Bruijn 104ba78c98 packet: on direct_xmit, limit tso and csum to supported devices
When transmitting on a packet socket with PACKET_VNET_HDR and
PACKET_QDISC_BYPASS, validate device support for features requested
in vnet_hdr.

Drop TSO packets sent to devices that do not support TSO or have the
feature disabled. Note that the latter currently do process those
packets correctly, regardless of not advertising the feature.

Because of SKB_GSO_DODGY, it is not sufficient to test device features
with netif_needs_gso. Full validate_xmit_skb is needed.

Switch to software checksum for non-TSO packets that request checksum
offload if that device feature is unsupported or disabled. Note that
similar to the TSO case, device drivers may perform checksum offload
correctly even when not advertising it.

When switching to software checksum, packets hit skb_checksum_help,
which has two BUG_ON checksum not in linear segment. Packet sockets
always allocate at least up to csum_start + csum_off + 2 as linear.

Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c

  ethtool -K eth0 tso off tx on
  psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
  psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N

  ethtool -K eth0 tx off
  psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
  psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N

v2:
  - add EXPORT_SYMBOL_GPL(validate_xmit_skb_list)

Fixes: d346a3fae3 ("packet: introduce PACKET_QDISC_BYPASS socket option")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 15:02:15 -04:00
Anoob Soman 6664498280 packet: call fanout_release, while UNREGISTERING a netdev
If a socket has FANOUT sockopt set, a new proto_hook is registered
as part of fanout_add(). When processing a NETDEV_UNREGISTER event in
af_packet, __fanout_unlink is called for all sockets, but prot_hook which was
registered as part of fanout_add is not removed. Call fanout_release, on a
NETDEV_UNREGISTER, which removes prot_hook and removes fanout from the
fanout_list.

This fixes BUG_ON(!list_empty(&dev->ptype_specific)) in netdev_run_todo()

Signed-off-by: Anoob Soman <anoob.soman@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-06 20:50:18 -04:00
David S. Miller de0ba9a0d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just several instances of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-24 00:53:32 -04:00
Soheil Hassas Yeganeh f8e7718cc0 packet: propagate sock_cmsg_send() error
sock_cmsg_send() can return different error codes and not only
-EINVAL, and we should properly propagate them.

Fixes: c14ac9451c ("sock: enable timestamping using control messages")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-22 01:41:48 -04:00
Yoshihiro Shimoda edbe774623 packet: fix second argument of sock_tx_timestamp()
This patch fixes an issue that a syscall (e.g. sendto syscall) cannot
work correctly. Since the sendto syscall doesn't have msg_control buffer,
the sock_tx_timestamp() in packet_snd() cannot work correctly because
the socks.tsflags is set to 0.
So, this patch sets the socks.tsflags to sk->sk_tsflags as default.

Fixes: c14ac9451c ("sock: enable timestamping using control messages")
Reported-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Reported-by: Keita Kobayashi <keita.kobayashi.ym@renesas.com>
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19 21:00:50 -07:00
David S. Miller 30d0844bdc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mellanox/mlx5/core/en.h
	drivers/net/ethernet/mellanox/mlx5/core/en_main.c
	drivers/net/usb/r8152.c

All three conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06 10:35:22 -07:00
David S. Miller eb70db8756 packet: Use symmetric hash for PACKET_FANOUT_HASH.
People who use PACKET_FANOUT_HASH want a symmetric hash, meaning that
they want packets going in both directions on a flow to hash to the
same bucket.

The core kernel SKB hash became non-symmetric when the ipv6 flow label
and other entities were incorporated into the standard flow hash order
to increase entropy.

But there are no users of PACKET_FANOUT_HASH who want an assymetric
hash, they all want a symmetric one.

Therefore, use the flow dissector to compute a flat symmetric hash
over only the protocol, addresses and ports.  This hash does not get
installed into and override the normal skb hash, so this change has
no effect whatsoever on the rest of the stack.

Reported-by: Eric Leblond <eric@regit.org>
Tested-by: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:07:50 -04:00
Daniel Borkmann 113214be7f bpf: refactor bpf_prog_get and type check into helper
Since bpf_prog_get() and program type check is used in a couple of places,
refactor this into a small helper function that we can make use of. Since
the non RO prog->aux part is not used in performance critical paths and a
program destruction via RCU is rather very unlikley when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check, but actually not taking the ref at all (due to being in fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:00:47 -04:00
Mike Rapoport 1276f24eee packet: use common code for virtio_net_hdr and skb GSO conversion
Replace open coded conversion between virtio_net_hdr to skb GSO info with
virtio_net_hdr_from_skb

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 23:03:56 -07:00
Willem de Bruijn 719c44d340 packet: compat support for sock_fprog
Socket option PACKET_FANOUT_DATA takes a struct sock_fprog as argument
if PACKET_FANOUT has mode PACKET_FANOUT_CBPF. This structure contains
a pointer into user memory. If userland is 32-bit and kernel is 64-bit
the two disagree about the layout of struct sock_fprog.

Add compat setsockopt support to convert a 32-bit compat_sock_fprog to
a 64-bit sock_fprog. This is analogous to compat_sock_fprog support for
SO_REUSEPORT added in commit 1957598840 ("soreuseport: add compat
case for setsockopt SO_ATTACH_REUSEPORT_CBPF").

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-09 23:41:03 -07:00
David S. Miller 1602f49b58 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were two cases of simple overlapping changes,
nothing serious.

In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 18:51:33 -04:00
Weongyo Jeong da37845fdc packet: uses kfree_skb() for errors.
consume_skb() isn't for error cases that kfree_skb() is more proper
one.  At this patch, it fixed tpacket_rcv() and packet_rcv() to be
consistent for error or non-error cases letting perf trace its event
properly.

Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 17:50:44 -04:00
Mathias Krause 309cf37fe2 packet: fix heap info leak in PACKET_DIAG_MCLIST sock_diag interface
Because we miss to wipe the remainder of i->addr[] in packet_mc_add(),
pdiag_put_mclist() leaks uninitialized heap bytes via the
PACKET_DIAG_MCLIST netlink attribute.

Fix this by explicitly memset(0)ing the remaining bytes in i->addr[].

Fixes: eea68e2f1a ("packet: Report socket mclist info via diag module")
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 00:46:39 -04:00
David S. Miller ae95d71261 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-04-09 17:41:41 -04:00
Dave Jones 6ae81ced37 af_packet: tone down the Tx-ring unsupported spew.
Trinity and other fuzzers can hit this WARN on far too easily,
resulting in a tainted kernel that hinders automated fuzzing.

Replace it with a rate-limited printk.

Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-06 16:05:20 -04:00
Soheil Hassas Yeganeh c14ac9451c sock: enable timestamping using control messages
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04 15:50:30 -04:00
Willem de Bruijn 9ed988cd59 packet: validate variable length ll headers
Replace link layer header validation check ll_header_truncate with
more generic dev_validate_header.

Validation based on hard_header_len incorrectly drops valid packets
in variable length protocols, such as AX25. dev_validate_header
calls header_ops.validate for such protocols to ensure correctness
below hard_header_len.

See also http://comments.gmane.org/gmane.linux.network/401064

Fixes 9c7077622d ("packet: make packet_snd fail on len smaller than l2 header")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-09 22:13:01 -05:00
David Decotigny 7cad1bac96 net: core: use __ethtool_get_ksettings
Signed-off-by: David Decotigny <decot@googlers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 22:06:47 -05:00
Willem de Bruijn 1d036d25e5 packet: tpacket_snd gso and checksum offload
Support socket option PACKET_VNET_HDR together with PACKET_TX_RING.

When enabled, a struct virtio_net_hdr is expected to precede the data
in the ring. The vnet option must be set before the ring is created.

The implementation reuses the existing skb_copy_bits code that is used
when dev->hard_header_len is non-zero. Move this ll_header check to
before the skb alloc and combine it with a test for vnet_hdr->hdr_len.
Allocate and copy the max of the two.

Verified with test program at
github.com/wdebruij/kerneltools/blob/master/tests/psock_txring_vnet.c

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09 06:43:50 -05:00
Willem de Bruijn 8d39b4a6b8 packet: parse tpacket header before skb alloc
GSO packet headers must be stored in the linear skb segment.
Move tpacket header parsing before sock_alloc_send_skb. The GSO
follow-on patch will later increase the skb linear argument to
sock_alloc_send_skb if needed for large packets.

The header parsing code does not require an allocated skb, so is
safe to move. Later pass to tpacket_fill_skb the computed data
start and length.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09 06:43:50 -05:00
Willem de Bruijn 58d19b19cd packet: vnet_hdr support for tpacket_rcv
Support socket option PACKET_VNET_HDR together with PACKET_RX_RING.
When enabled, a struct virtio_net_hdr will precede the data in the
packet ring slots.

Verified with test program at
github.com/wdebruij/kerneltools/blob/master/tests/psock_rxring_vnet.c

  pkt: 1454269209.798420 len=5066
  vnet: gso_type=tcpv4 gso_size=1448 hlen=66 ecn=off
  csum: start=34 off=16
  eth: proto=0x800
  ip: src=<masked> dst=<masked> proto=6 len=5052

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09 06:43:50 -05:00
Willem de Bruijn 16cc140045 packet: move vnet_hdr code to helper functions
packet_snd and packet_rcv support virtio net headers for GSO.
Move this logic into helper functions to be able to reuse it in
tpacket_snd and tpacket_rcv.

This is a straighforward code move with one exception. Instead of
creating and passing a separate gso_type variable, reuse
vnet_hdr.gso_type after conversion from virtio to kernel gso type.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-09 06:43:50 -05:00
Martin Blumenstingl 880621c260 packet: Allow packets with only a header (but no payload)
Commit 9c7077622d ("packet: make packet_snd fail on len smaller
than l2 header") added validation for the packet size in packet_snd.
This change enforces that every packet needs a header (with at least
hard_header_len bytes) plus a payload with at least one byte. Before
this change the payload was optional.

This fixes PPPoE connections which do not have a "Service" or
"Host-Uniq" configured (which is violating the spec, but is still
widely used in real-world setups). Those are currently failing with the
following message: "pppd: packet size is too short (24 <= 24)"

Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-29 22:17:17 -05:00
Tobias Klauser 90836b67e2 packet: Use PAGE_ALIGNED macro
Use PAGE_ALIGNED(...) instead of open-coding it.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17 15:25:44 -05:00
Tobias Klauser 4194b4914a packet: Don't check frames_per_block against negative values
rb->frames_per_block is an unsigned int, thus can never be negative.

Also fix spacing in the calculation of frames_per_block.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17 15:25:44 -05:00
Daniel Borkmann 5cfb4c8d05 packet: fix tpacket_snd max frame len
Since it's introduction in commit 69e3c75f4d ("net: TX_RING and
packet mmap"), TX_RING could be used from SOCK_DGRAM and SOCK_RAW
side. When used with SOCK_DGRAM only, the size_max > dev->mtu +
reserve check should have reserve as 0, but currently, this is
unconditionally set (in it's original form as dev->hard_header_len).

I think this is not correct since tpacket_fill_skb() would then
take dev->mtu and dev->hard_header_len into account for SOCK_DGRAM,
the extra VLAN_HLEN could be possible in both cases. Presumably, the
reserve code was copied from packet_snd(), but later on missed the
check. Make it similar as we have it in packet_snd().

Fixes: 69e3c75f4d ("net: TX_RING and packet mmap")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15 18:00:35 -05:00
Daniel Borkmann c72219b75f packet: infer protocol from ethernet header if unset
In case no struct sockaddr_ll has been passed to packet
socket's sendmsg() when doing a TX_RING flush run, then
skb->protocol is set to po->num instead, which is the protocol
passed via socket(2)/bind(2).

Applications only xmitting can go the path of allocating the
socket as socket(PF_PACKET, <mode>, 0) and do a bind(2) on the
TX_RING with sll_protocol of 0. That way, register_prot_hook()
is neither called on creation nor on bind time, which saves
cycles when there's no interest in capturing anyway.

That leaves us however with po->num 0 instead and therefore
the TX_RING flush run sets skb->protocol to 0 as well. Eric
reported that this leads to problems when using tools like
trafgen over bonding device. I.e. the bonding's hash function
could invoke the kernel's flow dissector, which depends on
skb->protocol being properly set. In the current situation, all
the traffic is then directed to a single slave.

Fix it up by inferring skb->protocol from the Ethernet header
when not set and we have ARPHRD_ETHER device type. This is only
done in case of SOCK_RAW and where we have a dev->hard_header_len
length. In case of ARPHRD_ETHER devices, this is guaranteed to
cover ETH_HLEN, and therefore being accessed on the skb after
the skb_store_bits().

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15 18:00:35 -05:00
Daniel Borkmann 3c70c13248 packet: only allow extra vlan len on ethernet devices
Packet sockets can be used by various net devices and are not
really restricted to ARPHRD_ETHER device types. However, when
currently checking for the extra 4 bytes that can be transmitted
in VLAN case, our assumption is that we generally probe on
ARPHRD_ETHER devices. Therefore, before looking into Ethernet
header, check the device type first.

This also fixes the issue where non-ARPHRD_ETHER devices could
have no dev->hard_header_len in TX_RING SOCK_RAW case, and thus
the check would test unfilled linear part of the skb (instead
of non-linear).

Fixes: 57f89bfa21 ("network: Allow af_packet to transmit +4 bytes for VLAN packets.")
Fixes: 52f1454f62 ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15 18:00:35 -05:00
Daniel Borkmann 8fd6c80d9d packet: always probe for transport header
We concluded that the skb_probe_transport_header() should better be
called unconditionally. Avoiding the call into the flow dissector has
also not really much to do with the direct xmit mode.

While it seems that only virtio_net code makes use of GSO from non
RX/TX ring packet socket paths, we should probe for a transport header
nevertheless before they hit devices.

Reference: http://thread.gmane.org/gmane.linux.network/386173/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15 18:00:35 -05:00
Daniel Borkmann efdfa2f784 packet: do skb_probe_transport_header when we actually have data
In tpacket_fill_skb() commit c1aad275b0 ("packet: set transport
header before doing xmit") and later on 40893fd0fd ("net: switch
to use skb_probe_transport_header()") was probing for a transport
header on the skb from a ring buffer slot, but at a time, where
the skb has _not even_ been filled with data yet. So that call into
the flow dissector is pretty useless. Lets do it after we've set
up the skb frags.

Fixes: c1aad275b0 ("packet: set transport header before doing xmit")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15 18:00:35 -05:00
Francesco Ruggeri 30f7ea1c2b packet: race condition in packet_bind
There is a race conditions between packet_notifier and packet_bind{_spkt}.

It happens if packet_notifier(NETDEV_UNREGISTER) executes between the
time packet_bind{_spkt} takes a reference on the new netdevice and the
time packet_do_bind sets po->ifindex.
In this case the notification can be missed.
If this happens during a dev_change_net_namespace this can result in the
netdevice to be moved to the new namespace while the packet_sock in the
old namespace still holds a reference on it. When the netdevice is later
deleted in the new namespace the deletion hangs since the packet_sock
is not found in the new namespace' &net->packet.sklist.
It can be reproduced with the script below.

This patch makes packet_do_bind check again for the presence of the
netdevice in the packet_sock's namespace after the synchronize_net
in unregister_prot_hook.
More in general it also uses the rcu lock for the duration of the bind
to stop dev_change_net_namespace/rollback_registered_many from
going past the synchronize_net following unlist_netdevice, so that
no NETDEV_UNREGISTER notifications can happen on the new netdevice
while the bind is executing. In order to do this some code from
packet_bind{_spkt} is consolidated into packet_do_dev.

import socket, os, time, sys
proto=7
realDev='em1'
vlanId=400
if len(sys.argv) > 1:
   vlanId=int(sys.argv[1])
dev='vlan%d' % vlanId

os.system('taskset -p 0x10 %d' % os.getpid())

s = socket.socket(socket.PF_PACKET, socket.SOCK_RAW, proto)
os.system('ip link add link %s name %s type vlan id %d' %
          (realDev, dev, vlanId))
os.system('ip netns add dummy')

pid=os.fork()

if pid == 0:
   # dev should be moved while packet_do_bind is in synchronize net
   os.system('taskset -p 0x20000 %d' % os.getpid())
   os.system('ip link set %s netns dummy' % dev)
   os.system('ip netns exec dummy ip link del %s' % dev)
   s.close()
   sys.exit(0)

time.sleep(.004)
try:
   s.bind(('%s' % dev, proto+1))
except:
   print 'Could not bind socket'
   s.close()
   os.system('ip netns del dummy')
   sys.exit(0)

os.waitpid(pid, 0)
s.close()
os.system('ip netns del dummy')
sys.exit(0)

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-05 14:48:42 -05:00
Eric W. Biederman 19bcf9f203 ipv4: Pass struct net into ip_defrag and ip_check_defrag
The function ip_defrag is called on both the input and the output
paths of the networking stack.  In particular conntrack when it is
tracking outbound packets from the local machine calls ip_defrag.

So add a struct net parameter and stop making ip_defrag guess which
network namespace it needs to defragment packets in.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:44:16 -07:00
Eric Dumazet 161642e24f packet: fix match_fanout_group()
Recent TCP listener patches exposed a prior af_packet bug :
match_fanout_group() blindly assumes it is always safe
to cast sk to a packet socket to compare fanout with af_packet_priv

But SYNACK packets can be sent while attached to request_sock, which
are smaller than a "struct sock".

We can read non existent memory and crash.

Fixes: c0de08d042 ("af_packet: don't emit packet on orig fanout group")
Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Eric Leblond <eric@regit.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:42:38 -07:00
Edward Jee c7d39e3263 packet: support per-packet fwmark for af_packet sendmsg
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 19:25:22 -07:00
Alexei Starovoitov ff936a04e5 bpf: fix cb access in socket filter programs
eBPF socket filter programs may see junk in 'u32 cb[5]' area,
since it could have been used by protocol layers earlier.

For socket filter programs used in af_packet we need to clean
20 bytes of skb->cb area if it could be used by the program.
For programs attached to TCP/UDP sockets we need to save/restore
these 20 bytes, since it's used by protocol layers.

Remove SK_RUN_FILTER macro, since it's no longer used.

Long term we may move this bpf cb area to per-cpu scratch, but that
requires addition of new 'per-cpu load/store' instructions,
so not suitable as a short term fix.

Fixes: d691f9e8d4 ("bpf: allow programs to write to certain skb fields")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-11 04:40:05 -07:00
Daniel Borkmann bab1899187 bpf, seccomp: prepare for upcoming criu support
The current ongoing effort to dump existing cBPF seccomp filters back
to user space requires to hold the pre-transformed instructions like
we do in case of socket filters from sk_attach_filter() side, so they
can be reloaded in original form at a later point in time by utilities
such as criu.

To prepare for this, simply extend the bpf_prog_create_from_user()
API to hold a flag that tells whether we should store the original
or not. Also, fanout filters could make use of that in future for
things like diag. While fanout filters already use bpf_prog_destroy(),
move seccomp over to them as well to handle original programs when
present.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 06:47:05 -07:00
David Woodhouse d3869efe7a Fix AF_PACKET ABI breakage in 4.2
Commit 7d82410950 ("virtio: add explicit big-endian support to memory
accessors") accidentally changed the virtio_net header used by
AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.

Since virtio_legacy_is_little_endian() is a very long identifier,
define a vio_le macro and use that throughout the code instead of the
hard-coded 'false' for little-endian.

This restores the ABI to match 4.1 and earlier kernels, and makes my
test program work again.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:33:55 -07:00
Willem de Bruijn f2e520956a packet: add extended BPF fanout mode
Add fanout mode PACKET_FANOUT_EBPF that accepts an en extended BPF
program to select a socket.

Update the internal eBPF program by passing to socket option
SOL_PACKET/PACKET_FANOUT_DATA a file descriptor returned by bpf().

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 14:22:48 -07:00
Willem de Bruijn 47dceb8ecd packet: add classic BPF fanout mode
Add fanout mode PACKET_FANOUT_CBPF that accepts a classic BPF program
to select a socket.

This avoids having to keep adding special case fanout modes. One
example use case is application layer load balancing. The QUIC
protocol, for instance, encodes a connection ID in UDP payload.

Also add socket option SOL_PACKET/PACKET_FANOUT_DATA that updates data
associated with the socket group. Fanout mode PACKET_FANOUT_CBPF is the
only user so far.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 14:22:47 -07:00
David S. Miller 5510b3c2a1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/s390/net/bpf_jit_comp.c
	drivers/net/ethernet/ti/netcp_ethss.c
	net/bridge/br_multicast.c
	net/ipv4/ip_fragment.c

All four conflicts were cases of simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 23:52:20 -07:00
Tobias Klauser 73d0fcf2f4 packet: remove handling of tx_ring from prb_shutdown_retire_blk_timer()
Follow e8e85cc5eb ("packet: remove handling of tx_ring") and remove
the tx_ring parameter from prb_shutdown_retire_blk_timer() as it is only
called with tx_ring = 0.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 15:11:12 -07:00
Alexander Drozdov dbd46ab412 packet: tpacket_snd(): fix signed/unsigned comparison
tpacket_fill_skb() can return a negative value (-errno) which
is stored in tp_len variable. In that case the following
condition will be (but shouldn't be) true:

tp_len > dev->mtu + dev->hard_header_len

as dev->mtu and dev->hard_header_len are both unsigned.

That may lead to just returning an incorrect EMSGSIZE errno
to the user.

Fixes: 52f1454f62 ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 00:09:58 -07:00
Lars Westerhoff 158cd4af8d packet: missing dev_put() in packet_do_bind()
When binding a PF_PACKET socket, the use count of the bound interface is
always increased with dev_hold in dev_get_by_{index,name}.  However,
when rebound with the same protocol and device as in the previous bind
the use count of the interface was not decreased.  Ultimately, this
caused the deletion of the interface to fail with the following message:

unregister_netdevice: waiting for dummy0 to become free. Usage count = 1

This patch moves the dev_put out of the conditional part that was only
executed when either the protocol or device changed on a bind.

Fixes: 902fefb82e ('packet: improve socket create/bind latency in some cases')
Signed-off-by: Lars Westerhoff <lars.westerhoff@newtec.eu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 15:38:58 -07:00
David S. Miller 3a07bd6fea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mellanox/mlx4/main.c
	net/packet/af_packet.c

Both conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-24 02:58:51 -07:00
Maninder Singh e8e85cc5eb packet: remove handling of tx_ring
Remove handling of tx_ring in prb_setup_retire_blk_timer
for TPACKET_V3 because init_prb_bdqc is called only for zero tx_ring
and thus prb_setup_retire_blk_timer for zero tx_ring only.

And also in functon init_prb_bdqc there is no usage of tx_ring.
Thus removing tx_ring from init_prb_bdqc.

Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Suggested-by: Frans Klaver <fransklaver@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-23 06:53:29 -07:00
Willem de Bruijn 468479e604 packet: avoid out of bounds read in round robin fanout
PACKET_FANOUT_LB computes f->rr_cur such that it is modulo
f->num_members. It returns the old value unconditionally, but
f->num_members may have changed since the last store. Ensure
that the return value is always < num.

When modifying the logic, simplify it further by replacing the loop
with an unconditional atomic increment.

Fixes: dc99f60069 ("packet: Add fanout support.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-21 10:24:37 -07:00
Willem de Bruijn 59f211181b packet: free packet_rollover after synchronize_net
Destruction of the po->rollover must be delayed until there are no
more packets in flight that can access it. The field is destroyed in
packet_release, before synchronize_net. Delay using rcu.

Fixes: 0648ab70af ("packet: rollover prepare: per-socket state")

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-21 09:30:42 -07:00
Eric Dumazet f98f4514d0 packet: read num_members once in packet_rcv_fanout()
We need to tell compiler it must not read f->num_members multiple
times. Otherwise testing if num is not zero is flaky, and we could
attempt an invalid divide by 0 in fanout_demux_cpu()

Note bug was present in packet_rcv_fanout_hash() and
packet_rcv_fanout_lb() but final 3.1 had a simple location
after commit 95ec3eb417 ("packet: Add 'cpu' fanout policy.")

Fixes: dc99f60069 ("packet: Add fanout support.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-21 09:23:22 -07:00
Willem de Bruijn 4633c9e07b net-packet: fix null pointer exception in rollover mode
Rollover can be enabled as flag or mode. Allocate state in both cases.
This solves a NULL pointer exception in fanout_demux_rollover on
referencing po->rollover if using mode rollover.

Also make sure that in rollover mode each silo is tried (contrary
to rollover flag, where the main socket is excluded after an initial
try_self).

Tested:
  Passes tools/testing/net/psock_fanout.c, which tests both modes and
  flag. My previous tests were limited to bench_rollover, which only
  stresses the flag. The test now completes safely. it still gives an
  error for mode rollover, because it does not expect the new headroom
  (ROOM_NORMAL) requirement. I will send a separate patch to the test.

Fixes: 0648ab70af ("packet: rollover prepare: per-socket state")

Signed-off-by: Willem de Bruijn <willemb@google.com>

----

I should have run this test and caught this before submission, of
course. Apologies for the oversight.
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:41:38 -04:00
Willem de Bruijn 54d7c01d3e packet: fix warnings in rollover lock contention
Avoid two xchg calls whose return values were unused, causing a
warning on some architectures.

The relevant variable is a hint and read without mutual exclusion.
This fix makes all writers hold the receive_queue lock.

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 17:40:54 -04:00
Willem de Bruijn a9b6391814 packet: rollover statistics
Rollover indicates exceptional conditions. Export a counter to inform
socket owners of this state.

If no socket with sufficient room is found, rollover fails. Also count
these events.

Finally, also count when flows are rolled over early thanks to huge
flow detection, to validate its correctness.

Tested:
  Read counters in bench_rollover on all other tests in the patchset

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:43:00 -04:00
Willem de Bruijn 3b3a5b0aab packet: rollover huge flows before small flows
Migrate flows from a socket to another socket in the fanout group not
only when the socket is full. Start migrating huge flows early, to
divert possible 4-tuple attacks without affecting normal traffic.

Introduce fanout_flow_is_huge(). This detects huge flows, which are
defined as taking up more than half the load. It does so cheaply, by
storing the rxhashes of the N most recent packets. If over half of
these are the same rxhash as the current packet, then drop it. This
only protects against 4-tuple attacks. N is chosen to fit all data in
a single cache line.

Tested:
  Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

    lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s
    cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
      0         14         14          0          0          0          0
      1         20         20          0          0          0          0
      2         16         16          0          0          0          0
      3    6168824    6168824          0    4867721    4867721          0
      4    4867741    4867741          0          0          0          0
      5         12         12          0          0          0          0
      6         15         15          0          0          0          0
      7         17         17          0          0          0          0

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:43:00 -04:00
Willem de Bruijn 2ccdbaa6d5 packet: rollover lock contention avoidance
Rollover has to call packet_rcv_has_room on sockets in the fanout
group to find a socket to migrate to. This operation is expensive
especially if the packet sockets use rings, when a lock has to be
acquired.

Avoid pounding on the lock by all sockets by temporarily marking a
socket as "under memory pressure" when such pressure is detected.
While set, only the socket owner may call packet_rcv_has_room on the
socket. Once it detects normal conditions, it clears the flag. The
socket is not used as a victim by any other socket in the meantime.

Under reasonably balanced load, each socket writer frequently calls
packet_rcv_has_room and clears its own pressure field. As a backup
for when the socket is rarely written to, also clear the flag on
reading (packet_recvmsg, packet_poll) if this can be done cheaply
(i.e., without calling packet_rcv_has_room). This is only for
edge cases.

Tested:
  Ran bench_rollover: a process with 8 sockets in a single fanout
  group, each pinned to a single cpu that receives one nic recv
  interrupt. RPS and RFS are disabled. The benchmark uses packet
  rx_ring, which has to take a lock when determining whether a
  socket has room.

  Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
  uniformly across the packet sockets (and inserted an iptables
  rule to drop in PREROUTING to avoid protocol stack processing).

  Without this patch, all sockets try to migrate traffic to
  neighbors, causing lock contention when searching for a non-
  empty neighbor. The lock is the top 9 entries.

    perf record -a -g sleep 5

    -  17.82%   bench_rollover  [kernel.kallsyms]    [k] _raw_spin_lock
       - _raw_spin_lock
          - 99.00% spin_lock
    	 + 81.77% packet_rcv_has_room.isra.41
    	 + 18.23% tpacket_rcv
          + 0.84% packet_rcv_has_room.isra.41
    +   5.20%      ksoftirqd/6  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.15%      ksoftirqd/1  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.14%      ksoftirqd/2  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/7  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.12%      ksoftirqd/5  [kernel.kallsyms]    [k] _raw_spin_lock
    +   5.10%      ksoftirqd/4  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.66%      ksoftirqd/0  [kernel.kallsyms]    [k] _raw_spin_lock
    +   4.45%      ksoftirqd/3  [kernel.kallsyms]    [k] _raw_spin_lock
    +   1.55%   bench_rollover  [kernel.kallsyms]    [k] packet_rcv_has_room.isra.41

  On net-next with this patch, this lock contention is no longer a
  top entry. Most time is spent in the actual read function. Next up
  are other locks:

    +  15.52%  bench_rollover  bench_rollover     [.] reader
    +   4.68%         swapper  [kernel.kallsyms]  [k] memcpy_erms
    +   2.77%         swapper  [kernel.kallsyms]  [k] packet_lookup_frame.isra.51
    +   2.56%     ksoftirqd/1  [kernel.kallsyms]  [k] memcpy_erms
    +   2.16%         swapper  [kernel.kallsyms]  [k] tpacket_rcv
    +   1.93%         swapper  [kernel.kallsyms]  [k] mlx4_en_process_rx_cq

  Looking closer at the remaining _raw_spin_lock, the cost of probing
  in rollover is now comparable to the cost of taking the lock later
  in tpacket_rcv.

    -   1.51%         swapper  [kernel.kallsyms]  [k] _raw_spin_lock
       - _raw_spin_lock
          + 33.41% packet_rcv_has_room
          + 28.15% tpacket_rcv
          + 19.54% enqueue_to_backlog
          + 6.45% __free_pages_ok
          + 2.78% packet_rcv_fanout
          + 2.13% fanout_demux_rollover
          + 2.01% netif_receive_skb_internal

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:43:00 -04:00
Willem de Bruijn 9954729bc3 packet: rollover only to socket with headroom
Only migrate flows to sockets that have sufficient headroom, where
sufficient is defined as having at least 25% empty space.

The kernel has three different buffer types: a regular socket, a ring
with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
latter two do not expose a read pointer to the kernel, so headroom is
not computed easily. All three needs a different implementation to
estimate free space.

Tested:
  Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.

  bench_rollover has as many sockets as there are NIC receive queues
  in the system. Each socket is owned by a process that is pinned to
  one of the receive cpus. RFS is disabled. RPS is enabled with an
  identity mapping (cpu x -> cpu x), to count drops with softnettop.

    lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
    Press [Enter] to exit

    cpu         rx       rx.k     drop.k   rollover     r.huge   r.failed
      0         16         16          0          0          0          0
      1         21         21          0          0          0          0
      2    5227502    5227502          0          0          0          0
      3         18         18          0          0          0          0
      4    6083289    6083289          0    5227496          0          0
      5         22         22          0          0          0          0
      6         21         21          0          0          0          0
      7          9          9          0          0          0          0

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:42:59 -04:00
Willem de Bruijn 0648ab70af packet: rollover prepare: per-socket state
Replace rollover state per fanout group with state per socket. Future
patches will add fields to the new structure.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:42:59 -04:00
Willem de Bruijn ad377cab49 packet: rollover prepare: move code out of callsites
packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
logic into the callee to simplify these callsites, especially with
upcoming changes.

The main differences between the two callsites is that the FLAG
variant tests whether the socket previously selected by another
mode (RR, RND, HASH, ..) has room before migrating flows, whereas the
rollover mode has no original socket to test.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:42:59 -04:00
David S. Miller b04096ff33 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Four minor merge conflicts:

1) qca_spi.c renamed the local variable used for the SPI device
   from spi_device to spi, meanwhile the spi_set_drvdata() call
   got moved further up in the probe function.

2) Two changes were both adding new members to codel params
   structure, and thus we had overlapping changes to the
   initializer function.

3) 'net' was making a fix to sk_release_kernel() which is
   completely removed in 'net-next'.

4) In net_namespace.c, the rtnl_net_fill() call for GET operations
   had the command value fixed, meanwhile 'net-next' adjusted the
   argument signature a bit.

This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 14:31:43 -04:00
Eric W. Biederman 11aa9c28b4 net: Pass kern from net_proto_family.create to sk_alloc
In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 10:50:17 -04:00
Kretschmer, Mathias fbf33a2802 af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).
This patch fixes an issue where the send(MSG_DONTWAIT) call
on a TX_RING is not fully non-blocking in cases where the device's sndBuf is
full. We pass nonblock=true to sock_alloc_send_skb() and return any possibly
occuring error code (most likely EGAIN) to the caller. As the fast-path stays
as it is, we keep the unlikely() around skb == NULL.

Signed-off-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-10 19:40:08 -04:00