Commit Graph

203 Commits

Author SHA1 Message Date
Lorenzo Bianconi fdc13979f9 bpf, devmap: Move drop error path to devmap for XDP_REDIRECT
We want to change the current ndo_xdp_xmit drop semantics because it will
allow us to implement better queue overflow handling. This is working
towards the larger goal of a XDP TX queue-hook. Move XDP_REDIRECT error
path handling from each XDP ethernet driver to devmap code. According to
the new APIs, the driver running the ndo_xdp_xmit pointer, will break tx
loop whenever the hw reports a tx error and it will just return to devmap
caller the number of successfully transmitted frames. It will be devmap
responsibility to free dropped frames.

Move each XDP ndo_xdp_xmit capable driver to the new APIs:

- veth
- virtio-net
- mvneta
- mvpp2
- socionext
- amazon ena
- bnxt
- freescale (dpaa2, dpaa)
- xen-frontend
- qede
- ice
- igb
- ixgbe
- i40e
- mlx5
- ti (cpsw, cpsw-new)
- tun
- sfc

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Link: https://lore.kernel.org/bpf/ed670de24f951cfd77590decf0229a0ad7fd12f6.1615201152.git.lorenzo@kernel.org
2021-03-18 16:38:51 +01:00
Maciej Fijalkowski edbea92202 veth: Store queue_mapping independently of XDP prog presence
Currently, veth_xmit() would call the skb_record_rx_queue() only when
there is XDP program loaded on peer interface in native mode.

If peer has XDP prog in generic mode, then netif_receive_generic_xdp()
has a call to netif_get_rxqueue(skb), so for multi-queue veth it will
not be possible to grab a correct rxq.

To fix that, store queue_mapping independently of XDP prog presence on
peer interface.

Fixes: 638264dc90 ("veth: Support per queue XDP ring")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Link: https://lore.kernel.org/bpf/20210303152903.11172-1-maciej.fijalkowski@intel.com
2021-03-06 00:04:52 +01:00
Lorenzo Bianconi 65e6dcf733 net, veth: Alloc skb in bulk for ndo_xdp_xmit
Split ndo_xdp_xmit and ndo_start_xmit use cases in veth_xdp_rcv routine
in order to alloc skbs in bulk for XDP_PASS verdict.

Introduce xdp_alloc_skb_bulk utility routine to alloc skb bulk list.
The proposed approach has been tested in the following scenario:

eth (ixgbe) --> XDP_REDIRECT --> veth0 --> (remote-ns) veth1 --> XDP_PASS

XDP_REDIRECT: xdp_redirect_map bpf sample
XDP_PASS: xdp_rxq_info bpf sample

traffic generator: pkt_gen sending udp traffic on a remote device

bpf-next master: ~3.64Mpps
bpf-next + skb bulking allocation: ~3.79Mpps

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/a14a30d3c06fff24e13f836c733d80efc0bd6eb5.1611957532.git.lorenzo@kernel.org
2021-02-04 01:00:07 +01:00
Lorenzo Bianconi 89f479f0ec net, xdp: Introduce xdp_build_skb_from_frame utility routine
Introduce xdp_build_skb_from_frame utility routine to build the skb
from xdp_frame. Respect to __xdp_build_skb_from_frame,
xdp_build_skb_from_frame will allocate the skb object. Rely on
xdp_build_skb_from_frame in veth driver.
Introduce missing xdp metadata support in veth_xdp_rcv_one routine.
Add missing metadata support in veth_xdp_rcv_one().

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/94ade9e853162ae1947941965193190da97457bc.1610475660.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-20 14:10:35 -08:00
Lorenzo Bianconi be9df4aff6 net, xdp: Introduce xdp_prepare_buff utility routine
Introduce xdp_prepare_buff utility routine to initialize per-descriptor
xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in
all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Lorenzo Bianconi 43b5169d83 net, xdp: Introduce xdp_init_buff utility routine
Introduce xdp_init_buff utility routine to initialize xdp_buff fields
const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on
xdp_init_buff in all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Jakub Kicinski a1dd1d8697 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-12-03

The main changes are:

1) Support BTF in kernel modules, from Andrii.

2) Introduce preferred busy-polling, from Björn.

3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.

4) Memcg-based memory accounting for bpf objects, from Roman.

5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
  selftests/bpf: Fix invalid use of strncat in test_sockmap
  libbpf: Use memcpy instead of strncpy to please GCC
  selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
  selftests/bpf: Add tp_btf CO-RE reloc test for modules
  libbpf: Support attachment of BPF tracing programs to kernel modules
  libbpf: Factor out low-level BPF program loading helper
  bpf: Allow to specify kernel module BTFs when attaching BPF programs
  bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
  selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
  selftests/bpf: Add support for marking sub-tests as skipped
  selftests/bpf: Add bpf_testmod kernel module for testing
  libbpf: Add kernel module BTF support for CO-RE relocations
  libbpf: Refactor CO-RE relocs to not assume a single BTF object
  libbpf: Add internal helper to load BTF data by FD
  bpf: Keep module's btf_data_size intact after load
  bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
  selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
  bpf: Adds support for setting window clamp
  samples/bpf: Fix spelling mistake "recieving" -> "receiving"
  bpf: Fix cold build of test_progs-no_alu32
  ...
====================

Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 07:48:12 -08:00
Björn Töpel b02e5a0ebb xsk: Propagate napi_id to XDP socket Rx path
Add napi_id to the xdp_rxq_info structure, and make sure the XDP
socket pick up the napi_id in the Rx path. The napi_id is used to find
the corresponding NAPI structure for socket busy polling.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Francis Laniel 872f690341 treewide: rename nla_strlcpy to nla_strscpy.
Calls to nla_strlcpy are now replaced by calls to nla_strscpy which is the new
name of this function.

Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 08:08:54 -08:00
Daniel Borkmann 9aa1206e8f bpf: Add redirect_peer helper
Add an efficient ingress to ingress netns switch that can be used out of tc BPF
programs in order to redirect traffic from host ns ingress into a container
veth device ingress without having to go via CPU backlog queue [0]. For local
containers this can also be utilized and path via CPU backlog queue only needs
to be taken once, not twice. On a high level this borrows from ipvlan which does
similar switch in __netif_receive_skb_core() and then iterates via another_round.
This helps to reduce latency for mentioned use cases.

Pod to remote pod with redirect(), TCP_RR [1]:

  # percpu_netperf 10.217.1.33
          RT_LATENCY:         122.450         (per CPU:         122.666         122.401         122.333         122.401 )
        MEAN_LATENCY:         121.210         (per CPU:         121.100         121.260         121.320         121.160 )
      STDDEV_LATENCY:         120.040         (per CPU:         119.420         119.910         125.460         115.370 )
         MIN_LATENCY:          46.500         (per CPU:          47.000          47.000          47.000          45.000 )
         P50_LATENCY:         118.500         (per CPU:         118.000         119.000         118.000         119.000 )
         P90_LATENCY:         127.500         (per CPU:         127.000         128.000         127.000         128.000 )
         P99_LATENCY:         130.750         (per CPU:         131.000         131.000         129.000         132.000 )

    TRANSACTION_RATE:       32666.400         (per CPU:        8152.200        8169.842        8174.439        8169.897 )

Pod to remote pod with redirect_peer(), TCP_RR:

  # percpu_netperf 10.217.1.33
          RT_LATENCY:          44.449         (per CPU:          43.767          43.127          45.279          45.622 )
        MEAN_LATENCY:          45.065         (per CPU:          44.030          45.530          45.190          45.510 )
      STDDEV_LATENCY:          84.823         (per CPU:          66.770          97.290          84.380          90.850 )
         MIN_LATENCY:          33.500         (per CPU:          33.000          33.000          34.000          34.000 )
         P50_LATENCY:          43.250         (per CPU:          43.000          43.000          43.000          44.000 )
         P90_LATENCY:          46.750         (per CPU:          46.000          47.000          47.000          47.000 )
         P99_LATENCY:          52.750         (per CPU:          51.000          54.000          53.000          53.000 )

    TRANSACTION_RATE:       90039.500         (per CPU:       22848.186       23187.089       22085.077       21919.130 )

  [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
  [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
2020-10-11 10:21:04 -07:00
Jakub Kicinski 5198d545db net: remove napi_hash_del() from driver-facing API
We allow drivers to call napi_hash_del() before calling
netif_napi_del() to batch RCU grace periods. This makes
the API asymmetric and leaks internal implementation details.
Soon we will want the grace period to protect more than just
the NAPI hash table.

Restructure the API and have drivers call a new function -
__netif_napi_del() if they want to take care of RCU waits.

Note that only core was checking the return status from
napi_hash_del() so the new helper does not report if the
NAPI was actually deleted.

Some notes on driver oddness:
 - veth observed the grace period before calling netif_napi_del()
   but that should not matter
 - myri10ge observed normal RCU flavor
 - bnx2x and enic did not actually observe the grace period
   (unless they did so implicitly)
 - virtio_net and enic only unhashed Rx NAPIs

The last two points seem to indicate that the calls to
napi_hash_del() were a left over rather than an optimization.
Regardless, it's easy enough to correct them.

This patch may introduce extra synchronize_net() calls for
interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
free_netdev() to call netif_napi_del(). This seems inevitable
since we want to use RCU for netpoll dev->napi_list traversal,
and almost no drivers set IFF_DISABLE_NETPOLL.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:08:46 -07:00
Jakub Kicinski 44a8c4f33c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.

Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-04 21:28:59 -07:00
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Maciej Żenczykowski defcffeb51 net-veth: Add type safety to veth_xdp_to_ptr() and veth_ptr_to_xdp()
This reduces likelihood of incorrect use.

Test: builds

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200819020027.4072288-1-zenczykowski@gmail.com
2020-08-19 14:10:38 -07:00
Andrii Nakryiko e8407fdeb9 bpf, xdp: Remove XDP_QUERY_PROG and XDP_QUERY_PROG_HW XDP commands
Now that BPF program/link management is centralized in generic net_device
code, kernel code never queries program id from drivers, so
XDP_QUERY_PROG/XDP_QUERY_PROG_HW commands are unnecessary.

This patch removes all the implementations of those commands in kernel, along
the xdp_attachment_query().

This patch was compile-tested on allyesconfig.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-10-andriin@fb.com
2020-07-25 20:37:02 -07:00
Lorenzo Bianconi 1b698fa5d8 xdp: Rename convert_to_xdp_frame in xdp_convert_buff_to_frame
In order to use standard 'xdp' prefix, rename convert_to_xdp_frame
utility routine in xdp_convert_buff_to_frame and replace all the
occurrences

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/6344f739be0d1a08ab2b9607584c4d5478c8c083.1590698295.git.lorenzo@kernel.org
2020-06-01 15:02:53 -07:00
Lorenzo Bianconi fc37987265 xdp: Introduce xdp_convert_frame_to_buff utility routine
Introduce xdp_convert_frame_to_buff utility routine to initialize xdp_buff
fields from xdp_frames ones. Rely on xdp_convert_frame_to_buff in veth xdp
code.

Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/87acf133073c4b2d4cbb8097e8c2480c0a0fac32.1590698295.git.lorenzo@kernel.org
2020-06-01 15:02:53 -07:00
Jesper Dangaard Brouer 45a9e6d8a6 veth: Xdp using frame_sz in veth driver
The veth driver can run XDP in "native" mode in it's own NAPI
handler, and since commit 9fc8d518d9 ("veth: Handle xdp_frames in
xdp napi ring") packets can come in two forms either xdp_frame or
skb, calling respectively veth_xdp_rcv_one() or veth_xdp_rcv_skb().

For packets to arrive in xdp_frame format, they will have been
redirected from an XDP native driver. In case of XDP_PASS or no
XDP-prog attached, the veth driver will allocate and create an SKB.

The current code in veth_xdp_rcv_one() xdp_frame case, had to guess
the frame truesize of the incoming xdp_frame, when using
veth_build_skb(). With xdp_frame->frame_sz this is not longer
necessary.

Calculating the frame_sz in veth_xdp_rcv_skb() skb case, is done
similar to the XDP-generic handling code in net/core/dev.c.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Link: https://lore.kernel.org/bpf/158945338840.97035.935897116345700902.stgit@firesoul
2020-05-14 21:21:54 -07:00
Jesper Dangaard Brouer 5c8572251f veth: Adjust hard_start offset on redirect XDP frames
When native XDP redirect into a veth device, the frame arrives in the
xdp_frame structure. It is then processed in veth_xdp_rcv_one(),
which can run a new XDP bpf_prog on the packet. Doing so requires
converting xdp_frame to xdp_buff, but the tricky part is that
xdp_frame memory area is located in the top (data_hard_start) memory
area that xdp_buff will point into.

The current code tried to protect the xdp_frame area, by assigning
xdp_buff.data_hard_start past this memory. This results in 32 bytes
less headroom to expand into via BPF-helper bpf_xdp_adjust_head().

This protect step is actually not needed, because BPF-helper
bpf_xdp_adjust_head() already reserve this area, and don't allow
BPF-prog to expand into it. Thus, it is safe to point data_hard_start
directly at xdp_frame memory area.

Fixes: 9fc8d518d9 ("veth: Handle xdp_frames in xdp napi ring")
Reported-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945338331.97035.5923525383710752178.stgit@firesoul
2020-05-14 21:21:54 -07:00
Lorenzo Bianconi 5fe6e56776 veth: rely on peer veth_rq for ndo_xdp_xmit accounting
Rely on 'remote' veth_rq to account ndo_xdp_xmit ethtool counters.
Move XDP_TX accounting to veth_xdp_flush_bq routine.
Remove 'rx' prefix in rx xdp ethool counters

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:35:13 -07:00
Lorenzo Bianconi bd32aa1f5d veth: rely on veth_rq in veth_xdp_flush_bq signature
Substitute net_device point with veth_rq one in veth_xdp_flush_bq,
veth_xdp_flush and veth_xdp_tx signature. This is a preliminary patch
to account xdp_xmit counter on 'receiving' veth_rq

Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:35:13 -07:00
Lorenzo Bianconi d99a7c2f3d veth: remove atomic64_add from veth_xdp_xmit hotpath
Remove atomic64_add from veth_xdp_xmit hotpath and rely on
xdp_xmit_err/xdp_tx_err counters

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:24:59 -07:00
Lorenzo Bianconi 9152cff0dd veth: introduce more xdp counters
Introduce xdp_xmit counter in order to distinguish between XDP_TX and
ndo_xdp_xmit stats. Introduce the following ethtool counters:
- rx_xdp_tx
- rx_xdp_tx_errors
- tx_xdp_xmit
- tx_xdp_xmit_errors
- rx_xdp_redirect

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:24:58 -07:00
Lorenzo Bianconi 66fe4a0788 veth: distinguish between rx_drops and xdp_drops
Distinguish between rx_drops and xdp_drops since the latter is already
reported in rx_packets. Report xdp_drops in ethtool statistics

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:24:58 -07:00
Lorenzo Bianconi 1c5b82e55f veth: introduce more specialized counters in veth_stats
Introduce xdp_tx, xdp_redirect and rx_drops counters in veth_stats data
structure. Move stats accounting in veth_poll. Remove xdp_xmit variable
in veth_xdp_rcv_one/veth_xdp_rcv_skb and rely on veth_stats counters.
This is a preliminary patch to align veth xdp statistics to mlx, intel
and marvell xdp implementation

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:24:58 -07:00
Lorenzo Bianconi 65780c5627 veth: move xdp stats in a dedicated structure
Move xdp stats in veth_stats data structure. This is a preliminary patch
to align xdp statistics to mlx5, ixgbe and mvneta drivers

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:24:58 -07:00
Jiang Lidong e25d5dbcff veth: ignore peer tx_dropped when counting local rx_dropped
When local NET_RX backlog is full due to traffic overrun,
peer veth tx_dropped counter increases. At that time, list
local veth stats, rx_dropped has double value of peer
tx_dropped, even bigger than transmit packets by peer.

In NET_RX softirq process, if any packet drop case happens,
it increases dev's rx_dropped counter and returns NET_RX_DROP.

At veth tx side, it records any error returned from peer netif_rx
into local dev tx_dropped counter.

In veth get stats process, it puts local dev rx_dropped and
peer dev tx_dropped into together as local rx_drpped value.
So that it shows double value of real dropped packets number in
this case.

This patch ignores peer tx_dropped when counting local rx_dropped,
since peer tx_dropped is duplicated to local rx_dropped at most cases.

Signed-off-by: Jiang Lidong <jianglidong3@jd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-05 17:33:57 -08:00
John Fastabend b23bfa5633 bpf, xdp: Remove no longer required rcu_read_{un}lock()
Now that we depend on rcu_call() and synchronize_rcu() to also wait
for preempt_disabled region to complete the rcu read critical section
in __dev_map_flush() is no longer required. Except in a few special
cases in drivers that need it for other reasons.

These originally ensured the map reference was safe while a map was
also being free'd. And additionally that bpf program updates via
ndo_bpf did not happen while flush updates were in flight. But flush
by new rules can only be called from preempt-disabled NAPI context.
The synchronize_rcu from the map free path and the rcu_call from the
delete path will ensure the reference there is safe. So lets remove
the rcu_read_lock and rcu_read_unlock pair to avoid any confusion
around how this is being protected.

If the rcu_read_lock was required it would mean errors in the above
logic and the original patch would also be wrong.

Now that we have done above we put the rcu_read_lock in the driver
code where it is needed in a driver dependent way. I think this
helps readability of the code so we know where and why we are
taking read locks. Most drivers will not need rcu_read_locks here
and further XDP drivers already have rcu_read_locks in their code
paths for reading xdp programs on RX side so this makes it symmetric
where we don't have half of rcu critical sections define in driver
and the other half in devmap.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/1580084042-11598-4-git-send-email-john.fastabend@gmail.com
2020-01-27 11:16:25 +01:00
Toke Høiland-Jørgensen 1d233886dd xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths
Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
we can re-use the bulking for the non-map version of the bpf_redirect()
helper. This is a simple matter of having xdp_do_redirect_slow() queue the
frame on the bulk queue instead of sending it out with __bpf_tx_xdp().

Unfortunately we can't make the bpf_redirect() helper return an error if
the ifindex doesn't exit (as bpf_redirect_map() does), because we don't
have a reference to the network namespace of the ingress device at the time
the helper is called. So we have to leave it as-is and keep the device
lookup in xdp_do_redirect_slow().

Since this leaves less reason to have the non-map redirect code in a
separate function, so we get rid of the xdp_do_redirect_slow() function
entirely. This does lose us the tracepoint disambiguation, but fortunately
the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
entry structures. This means both can contain a map index, so we can just
amend the tracepoint definitions so we always emit the xdp_redirect(_err)
tracepoints, but with the map ID only populated if a map is present. This
means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
the definitions around in case someone is still listening for them.

With this change, the performance of the xdp_redirect sample program goes
from 5Mpps to 8.4Mpps (a 68% increase).

Since the flush functions are no longer map-specific, rename the flush()
functions to drop _map from their names. One of the renamed functions is
the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
keep from having to update all drivers, use a #define to keep the old name
working, and only update the virtual drivers in this patch.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
2020-01-16 20:03:34 -08:00
Eric Dumazet b4fba476dc veth: use standard dev_lstats_add() and dev_lstats_read()
This cleanup will ease u64_stats_t adoption in a single location.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 20:03:08 -08:00
Toshiaki Makita 9cda7807ee veth: Support bulk XDP_TX
XDP_TX is similar to XDP_REDIRECT as it essentially redirects packets to
the device itself. XDP_REDIRECT has bulk transmit mechanism to avoid the
heavy cost of indirect call but it also reduces lock acquisition on the
destination device that needs locks like veth and tun.

XDP_TX does not use indirect calls but drivers which require locks can
benefit from the bulk transmit for XDP_TX as well.

This patch introduces bulk transmit mechanism in veth using bulk queue
on stack, and improves XDP_TX performance by about 9%.

Here are single-core/single-flow XDP_TX test results. CPU consumptions
are taken from "perf report --no-child".

- Before:

  7.26 Mpps

  _raw_spin_lock  7.83%
  veth_xdp_xmit  12.23%

- After:

  7.94 Mpps

  _raw_spin_lock  1.08%
  veth_xdp_xmit   6.10%

v2:
- Use stack for bulk queue instead of a global variable.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-25 14:26:54 +02:00
Jesper Dangaard Brouer cbf3351067 veth: use xdp_release_frame for XDP_PASS
Like cpumap use xdp_release_frame() when an xdp_frame got
converted into an SKB and send towars the network stack.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Thomas Gleixner 09c434b8a0 treewide: Add SPDX license identifier for more missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have MODULE_LICENCE("GPL*") inside which was used in the initial
   scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Julian Wiedmann 056b21fbe6 net: veth: use generic helper to report timestamping info
For reporting the common set of SW timestamping capabilities, use
ethtool_op_get_ts_info() instead of re-implementing it.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 16:26:37 -07:00
Florian Fainelli abdf47aab4 veth: Fix -Wformat-truncation
Provide a precision hint to snprintf() in order to eliminate a
-Wformat-truncation warning provided below. A maximum of 11 characters
is allowed to reach a maximum of 32 - 1 characters given a possible
maximum value of queues using up to UINT_MAX which occupies 10
characters. Incidentally 11 is the number of characters for
"xdp_packets" which is the largest string we append.

drivers/net/veth.c: In function 'veth_get_strings':
drivers/net/veth.c:118:47: warning: '%s' directive output may be
truncated writing up to 31 bytes into a region of size between 12 and 21
[-Wformat-truncation=]
     snprintf(p, ETH_GSTRING_LEN, "rx_queue_%u_%s",
                                               ^~
drivers/net/veth.c:118:5: note: 'snprintf' output between 12 and 52
bytes into a destination of size 32
     snprintf(p, ETH_GSTRING_LEN, "rx_queue_%u_%s",
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       i, veth_rq_stats_desc[j].desc);
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-23 13:44:58 -08:00
Gustavo A. R. Silva a9b6d9ef09 veth: Mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enabling
-Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 11:51:52 -08:00
David Ahern d0522f1cd2 net: Add extack argument to rtnl_create_link
Add extack arg to rtnl_create_link and add messages for invalid
number of Tx or Rx queues.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-06 15:00:45 -08:00
Toshiaki Makita d397b9682c veth: Add ethtool statistics support for XDP
Expose per-queue stats for ethtool -S.
As there are only rx queues, and rx queues are used only when XDP is
used, per-queue counters are only rx XDP ones.

Example:

$ ethtool -S veth0
NIC statistics:
     peer_ifindex: 11
     rx_queue_0_xdp_packets: 28601434
     rx_queue_0_xdp_bytes: 1716086040
     rx_queue_0_xdp_drops: 28601434
     rx_queue_1_xdp_packets: 17873050
     rx_queue_1_xdp_bytes: 1072383000
     rx_queue_1_xdp_drops: 17873050

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 21:58:46 -07:00
Toshiaki Makita 4195e54aaf veth: Account for XDP packet statistics on rx side
On XDP path veth has napi handler so we can collect statistics on
per-queue basis for XDP.

By this change now we can collect XDP_DROP drop count as well as packets
and bytes coming through ndo_xdp_xmit. Packet counters shown by
"ip -s link", sysfs stats or /proc/net/dev is now correct for XDP.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 21:58:46 -07:00
Toshiaki Makita 2131479df6 veth: Account for packet drops in ndo_xdp_xmit
Use existing atomic drop counter. Since drop path is really an
exceptional case here, I'm thinking atomic ops would not hurt the
performance.
XDP packets and bytes are not counted in ndo_xdp_xmit, but will be
accounted on rx side by the following commit.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-15 21:58:46 -07:00
Li RongQing 14d7341679 veth: rename pcpu_vstats as pcpu_lstats
struct pcpu_vstats and pcpu_lstats have same members and
usage, and pcpu_lstats is used in many files, so rename
pcpu_vstats as pcpu_lstats to reduce duplicate definition

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 19:52:23 -07:00
David S. Miller e366fa4350 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
Two new tls tests added in parallel in both net and net-next.

Used Stephen Rothwell's linux-next resolution.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 09:33:27 -07:00
Toshiaki Makita 4bf9ffa0fb veth: Orphan skb before GRO
GRO expects skbs not to be owned by sockets, but when XDP is enabled veth
passed skbs owned by sockets. It caused corrupted sk_wmem_alloc.

Paolo Abeni reported the following splat:

[  362.098904] refcount_t overflow at skb_set_owner_w+0x5e/0xa0 in iperf3[1644], uid/euid: 0/0
[  362.108239] WARNING: CPU: 0 PID: 1644 at kernel/panic.c:648 refcount_error_report+0xa0/0xa4
[  362.117547] Modules linked in: tcp_diag inet_diag veth intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf ipmi_ssif iTCO_wdt sg ipmi_si iTCO_vendor_support ipmi_devintf mxm_wmi ipmi_msghandler pcspkr dcdbas mei_me wmi mei lpc_ich acpi_power_meter pcc_cpufreq xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ixgbe igb ttm ahci mdio libahci ptp crc32c_intel drm pps_core libata i2c_algo_bit dca dm_mirror dm_region_hash dm_log dm_mod
[  362.176622] CPU: 0 PID: 1644 Comm: iperf3 Not tainted 4.19.0-rc2.vanilla+ #2025
[  362.184777] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
[  362.193124] RIP: 0010:refcount_error_report+0xa0/0xa4
[  362.198758] Code: 08 00 00 48 8b 95 80 00 00 00 49 8d 8c 24 80 0a 00 00 41 89 c1 44 89 2c 24 48 89 de 48 c7 c7 18 4d e7 9d 31 c0 e8 30 fa ff ff <0f> 0b eb 88 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 54 49 89 fc
[  362.219711] RSP: 0018:ffff9ee6ff603c20 EFLAGS: 00010282
[  362.225538] RAX: 0000000000000000 RBX: ffffffff9de83e10 RCX: 0000000000000000
[  362.233497] RDX: 0000000000000001 RSI: ffff9ee6ff6167d8 RDI: ffff9ee6ff6167d8
[  362.241457] RBP: ffff9ee6ff603d78 R08: 0000000000000490 R09: 0000000000000004
[  362.249416] R10: 0000000000000000 R11: ffff9ee6ff603990 R12: ffff9ee664b94500
[  362.257377] R13: 0000000000000000 R14: 0000000000000004 R15: ffffffff9de615f9
[  362.265337] FS:  00007f1d22d28740(0000) GS:ffff9ee6ff600000(0000) knlGS:0000000000000000
[  362.274363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  362.280773] CR2: 00007f1d222f35d0 CR3: 0000001fddfec003 CR4: 00000000001606f0
[  362.288733] Call Trace:
[  362.291459]  <IRQ>
[  362.293702]  ex_handler_refcount+0x4e/0x80
[  362.298269]  fixup_exception+0x35/0x40
[  362.302451]  do_trap+0x109/0x150
[  362.306048]  do_error_trap+0xd5/0x130
[  362.315766]  invalid_op+0x14/0x20
[  362.319460] RIP: 0010:skb_set_owner_w+0x5e/0xa0
[  362.324512] Code: ef ff ff 74 49 48 c7 43 60 20 7b 4a 9d 8b 85 f4 01 00 00 85 c0 75 16 8b 83 e0 00 00 00 f0 01 85 44 01 00 00 0f 88 d8 23 16 00 <5b> 5d c3 80 8b 91 00 00 00 01 8b 85 f4 01 00 00 89 83 a4 00 00 00
[  362.345465] RSP: 0018:ffff9ee6ff603e20 EFLAGS: 00010a86
[  362.351291] RAX: 0000000000001100 RBX: ffff9ee65deec700 RCX: ffff9ee65e829244
[  362.359250] RDX: 0000000000000100 RSI: ffff9ee65e829100 RDI: ffff9ee65deec700
[  362.367210] RBP: ffff9ee65e829100 R08: 000000000002a380 R09: 0000000000000000
[  362.375169] R10: 0000000000000002 R11: fffff1a4bf77bb00 R12: ffffc0754661d000
[  362.383130] R13: ffff9ee65deec200 R14: ffff9ee65f597000 R15: 00000000000000aa
[  362.391092]  veth_xdp_rcv+0x4e4/0x890 [veth]
[  362.399357]  veth_poll+0x4d/0x17a [veth]
[  362.403731]  net_rx_action+0x2af/0x3f0
[  362.407912]  __do_softirq+0xdd/0x29e
[  362.411897]  do_softirq_own_stack+0x2a/0x40
[  362.416561]  </IRQ>
[  362.418899]  do_softirq+0x4b/0x70
[  362.422594]  __local_bh_enable_ip+0x50/0x60
[  362.427258]  ip_finish_output2+0x16a/0x390
[  362.431824]  ip_output+0x71/0xe0
[  362.440670]  __tcp_transmit_skb+0x583/0xab0
[  362.445333]  tcp_write_xmit+0x247/0xfb0
[  362.449609]  __tcp_push_pending_frames+0x2d/0xd0
[  362.454760]  tcp_sendmsg_locked+0x857/0xd30
[  362.459424]  tcp_sendmsg+0x27/0x40
[  362.463216]  sock_sendmsg+0x36/0x50
[  362.467104]  sock_write_iter+0x87/0x100
[  362.471382]  __vfs_write+0x112/0x1a0
[  362.475369]  vfs_write+0xad/0x1a0
[  362.479062]  ksys_write+0x52/0xc0
[  362.482759]  do_syscall_64+0x5b/0x180
[  362.486841]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  362.492473] RIP: 0033:0x7f1d22293238
[  362.496458] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 c5 54 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
[  362.517409] RSP: 002b:00007ffebaef8008 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  362.525855] RAX: ffffffffffffffda RBX: 0000000000002800 RCX: 00007f1d22293238
[  362.533816] RDX: 0000000000002800 RSI: 00007f1d22d36000 RDI: 0000000000000005
[  362.541775] RBP: 00007f1d22d36000 R08: 00000002db777a30 R09: 0000562b70712b20
[  362.549734] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
[  362.557693] R13: 0000000000002800 R14: 00007ffebaef8060 R15: 0000562b70712260

In order to avoid this, orphan the skb before entering GRO.

Fixes: 948d4f214f ("veth: Add driver XDP")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16 15:33:50 -07:00
Michael Walle aa4e689ed1 veth: add software timestamping
Provide a software TX timestamp as well as the ethtool query interface
and report the software timestamp capabilities.

Tested with "ethtool -T" and two linuxptp instances each bound to a
tunnel endpoint.

Signed-off-by: Michael Walle <michael@walle.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 22:56:46 -07:00
Toshiaki Makita 7797b93b75 veth: Free queues on link delete
David Ahern reported memory leak in veth.

=======================================================================
$ cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8800354d5c00 (size 1024):
  comm "ip", pid 836, jiffies 4294722952 (age 25.904s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<(____ptrval____)>] kmemleak_alloc+0x70/0x94
    [<(____ptrval____)>] slab_post_alloc_hook+0x42/0x52
    [<(____ptrval____)>] __kmalloc+0x101/0x142
    [<(____ptrval____)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
    [<(____ptrval____)>] veth_newlink+0x147/0x3ac [veth]
    ...
unreferenced object 0xffff88002e009c00 (size 1024):
  comm "ip", pid 836, jiffies 4294722958 (age 25.898s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<(____ptrval____)>] kmemleak_alloc+0x70/0x94
    [<(____ptrval____)>] slab_post_alloc_hook+0x42/0x52
    [<(____ptrval____)>] __kmalloc+0x101/0x142
    [<(____ptrval____)>] kmalloc_array.constprop.20+0x1e/0x26 [veth]
    [<(____ptrval____)>] veth_newlink+0x219/0x3ac [veth]
=======================================================================

veth_rq allocated in veth_newlink() was not freed on dellink.

We need to free up them after veth_close() so that any packets will not
reference the queues afterwards. Thus free them in veth_dev_free() in
the same way as freeing stats structure (vstats).

Also move queues allocation to veth_dev_init() to be in line with stats
allocation.

Fixes: 638264dc90 ("veth: Support per queue XDP ring")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16 12:22:31 -07:00
Toshiaki Makita 638264dc90 veth: Support per queue XDP ring
Move XDP and napi related fields from veth_priv to newly created veth_rq
structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
selected by current cpu.

When skbs are enqueued from the peer device, rxq is one to one mapping
of its peer txq. This way we have a restriction that the number of rxqs
must not less than the number of peer txqs, but leave the possibility to
achieve bulk skb xmit in the future because txq lock would make it
possible to remove rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:12:21 +02:00
Toshiaki Makita d1396004dd veth: Add XDP TX and REDIRECT
This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)          (XDP)         (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
NAPI context.

v8:
- Don't use xdp_frame pointer address for data_hard_start of xdp_buff.

v4:
- Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
  xdp_mem_info.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:12:21 +02:00
Toshiaki Makita af87a3aa1b veth: Add ndo_xdp_xmit
This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought as calling another XDP program by XDP program using
REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.

v4:
- Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
  Add comments about it and check only MTU.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:12:21 +02:00
Toshiaki Makita 9fc8d518d9 veth: Handle xdp_frames in xdp napi ring
This is preparation for XDP TX and ndo_xdp_xmit.
This allows napi handler to handle xdp_frames through xdp ring as well
as sk_buff.

v8:
- Don't use xdp_frame pointer address to calculate skb->head and
  headroom.

v7:
- Use xdp_scrub_frame() instead of memset().

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame feilds in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:12:21 +02:00
Toshiaki Makita dc2248220a veth: Avoid drops by oversized packets when XDP is enabled
Oversized packets including GSO packets can be dropped if XDP is
enabled on receiver side, so don't send such packets from peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.

v4:
- Don't auto-adjust MTU but cap max MTU.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:12:20 +02:00
Toshiaki Makita 948d4f214f veth: Add driver XDP
This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This itself is not so useful, but a starting point to implement other
useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP is now heavily
relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
enqueues packets to the ring and peer NAPI handler drains the ring.

Currently only one ring is allocated for each veth device, so it does
not scale on multiqueue env. This can be resolved by allocating rings
on the per-queue basis later.

Note that NAPI is not used but netif_rx is used when XDP is not loaded,
so this does not change the default behaviour.

v6:
- Check skb->len only when allocation is needed.
- Add __GFP_NOWARN to alloc_page() as it can be triggered by external
  events.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:12:20 +02:00
Stephen Hemminger 72d24955b4 veth: set peer GSO values
When new veth is created, and GSO values have been configured
on one device, clone those values to the peer.

For example:
   # ip link add dev vm1 gso_max_size 65530 type veth peer name vm2

This should create vm1 <--> vm2 with both having GSO maximum
size set to 65530.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-08 14:22:59 -05:00
David S. Miller b079115937 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
A set of overlapping changes in macvlan and the rocker
driver, nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-30 12:43:08 -04:00
Matthias Schiffer a8b8a889e3 net: add netlink_ext_ack argument to rtnl_link_ops.validate
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:22 -04:00
Matthias Schiffer 7a3f4a1851 net: add netlink_ext_ack argument to rtnl_link_ops.newlink
Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-26 23:13:21 -04:00
Serhey Popovych 191cdb3822 veth: Be more robust on network device creation when no attributes
There are number of problems with configuration peer
network device in absence of IFLA_VETH_PEER attributes
where attributes for main network device shared with
peer.

First it is not feasible to configure both network
devices with same MAC address since this makes
communication in such configuration problematic.

This case can be reproduced with following sequence:

  # ip link add address 02:11:22:33:44:55 type veth
  # ip li sh
  ...
  26: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc \
  noop state DOWN mode DEFAULT qlen 1000
      link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
  27: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc \
  noop state DOWN mode DEFAULT qlen 1000
      link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff

Second it is not possible to register both main and
peer network devices with same name, that happens
when name for main interface is given with IFLA_IFNAME
and same attribute reused for peer.

This case can be reproduced with following sequence:

  # ip link add dev veth1a type veth
  RTNETLINK answers: File exists

To fix both of the cases check if corresponding netlink
attributes are taken from peer_tb when valid or
name based on rtnl ops kind and random address is used.

Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-22 11:15:15 -04:00
David S. Miller cf124db566 net: Fix inconsistent teardown and release of private netdev state.
Network devices can allocate reasources and private memory using
netdev_ops->ndo_init().  However, the release of these resources
can occur in one of two different places.

Either netdev_ops->ndo_uninit() or netdev->destructor().

The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.

netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.

netdev->destructor(), on the other hand, does not run until the
netdev references all go away.

Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().

This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.

If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
it is not able to invoke netdev->destructor().

This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.

However, this means that the resources that would normally be released
by netdev->destructor() will not be.

Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.

Many drivers do not try to deal with this, and instead we have leaks.

Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().

netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().

netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().

Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().

And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-07 15:53:24 -04:00
Johannes Berg fceb6435e8 netlink: pass extended ACK struct to parsing functions
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
Philippe Reynes 56607b987b net: veth: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 14:07:22 -07:00
stephen hemminger bc1f44709c net: make ndo_get_stats64 a void function
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-08 17:51:44 -05:00
Jarod Wilson 91572088e3 net: use core MTU range checking in core net infra
geneve:
- Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
- This one isn't quite as straight-forward as others, could use some
  closer inspection and testing

macvlan:
- set min/max_mtu

tun:
- set min/max_mtu, remove tun_net_change_mtu

vxlan:
- Merge __vxlan_change_mtu back into vxlan_change_mtu
- Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
  change_mtu function
- This one is also not as straight-forward and could use closer inspection
  and testing from vxlan folks

bridge:
- set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
  change_mtu function

openvswitch:
- set min/max_mtu, remove internal_dev_change_mtu
- note: max_mtu wasn't checked previously, it's been set to 65535, which
  is the largest possible size supported

sch_teql:
- set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)

macsec:
- min_mtu = 0, max_mtu = 65535

macvlan:
- min_mtu = 0, max_mtu = 65535

ntb_netdev:
- min_mtu = 0, max_mtu = 65535

veth:
- min_mtu = 68, max_mtu = 65535

8021q:
- min_mtu = 0, max_mtu = 65535

CC: netdev@vger.kernel.org
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Tom Herbert <tom@herbertland.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Jiri Benc <jbenc@redhat.com>
CC: WANG Cong <xiyou.wangcong@gmail.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Pravin B Shelar <pshelar@ovn.org>
CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Patrick McHardy <kaber@trash.net>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Pravin Shelar <pshelar@nicira.com>
CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20 14:51:09 -04:00
David Ahern 607fca9acf net: veth: Set features for MPLS
veth does not really transmit packets only moves the skb from one
netdev to another so gso and checksum is not really needed. Add
the features to mpls_features to get the same benefit and performance
with MPLS as without it.

Reported-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 22:27:18 -07:00
Xin Long c80fafbbb5 veth: sctp: add NETIF_F_SCTP_CRC to device features
Commit b17c706987 ("loopback: sctp: add NETIF_F_SCTP_CSUM to device
features") added NETIF_F_SCTP_CRC to device features for lo device to
improve the performance of sctp over lo.

This patch is to add NETIF_F_SCTP_CRC to device features for veth to
improve the performance of sctp over veth.

Before this patch:
  ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K
  Recv   Send    Send
  Socket Socket  Message  Elapsed
  Size   Size    Size     Time     Throughput
  bytes  bytes   bytes    secs.    10^6bits/sec

  212992 212992  10240    10.00    1117.16

After this patch:
  ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K
  Recv   Send    Send
  Socket Socket  Message  Elapsed
  Size   Size    Size     Time     Throughput
  bytes  bytes   bytes    secs.    10^6bits/sec

  212992 212992  10240    10.20    1415.22

Tested-by: Li Shuang <tjlishuang@yeah.net>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 11:43:03 -07:00
Alexander Duyck 732912d727 veth: Update features to include all tunnel GSO types
This patch adds support for the checksum enabled versions of UDP and GRE
tunnels.  With this change we should be able to send and receive GSO frames
of these types over the veth pair without needing to segment the packets.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 14:14:59 -04:00
Paolo Abeni 163e529200 veth: implement ndo_set_rx_headroom
The rx headroom for veth dev is the peer device needed_headroom.
Avoid ping-pong updates setting the private flag IFF_PHONY_HEADROOM.

This avoids skb head reallocation when forwarding from a veth dev
towards a device adding some kind of encapsulation.

When transmitting frames below the MTU size towards a vxlan device,
this gives about 10% performance speed-up when OVS is used to connect
the veth and the vxlan device and a little more when using a
plain Linux bridge.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-01 15:54:31 -05:00
Vijay Pandurangan ce8c839b74 veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.
Packets that arrive from real hardware devices have ip_summed ==
CHECKSUM_UNNECESSARY if the hardware verified the checksums, or
CHECKSUM_NONE if the packet is bad or it was unable to verify it. The
current version of veth will replace CHECKSUM_NONE with
CHECKSUM_UNNECESSARY, which causes corrupt packets routed from hardware to
a veth device to be delivered to the application. This caused applications
at Twitter to receive corrupt data when network hardware was corrupting
packets.

We believe this was added as an optimization to skip computing and
verifying checksums for communication between containers. However, locally
generated packets have ip_summed == CHECKSUM_PARTIAL, so the code as
written does nothing for them. As far as we can tell, after removing this
code, these packets are transmitted from one stack to another unmodified
(tcpdump shows invalid checksums on both sides, as expected), and they are
delivered correctly to applications. We didn’t test every possible network
configuration, but we tried a few common ones such as bridging containers,
using NAT between the host and a container, and routing from hardware
devices to containers. We have effectively deployed this in production at
Twitter (by disabling RX checksum offloading on veth devices).

This code dates back to the first version of the driver, commit
<e314dbdc1c0dc6a548ecf> ("[NET]: Virtual ethernet device driver"), so I
suspect this bug occurred mostly because the driver API has evolved
significantly since then. Commit <0b7967503dc97864f283a> ("net/veth: Fix
packet checksumming") (in December 2010) fixed this for packets that get
created locally and sent to hardware devices, by not changing
CHECKSUM_PARTIAL. However, the same issue still occurs for packets coming
in from hardware devices.

Co-authored-by: Evan Jones <ej@evanjones.ca>
Signed-off-by: Evan Jones <ej@evanjones.ca>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Phil Sutter <phil@nwl.cc>
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Vijay Pandurangan <vijayp@vijayp.ca>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22 15:15:34 -05:00
Phil Sutter 02f01ec1c5 net: veth: enable noqueue operation by default
Signed-off-by: Phil Sutter <phil@nwl.cc>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:04 -07:00
Toshiaki Makita 1a04a82156 veth: Don't segment multiple tagged packets on veth device
Veth devices don't need to segment multiple tagged packets.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 14:24:50 -07:00
Nicolas Dichtel a45253bf32 veth: set iflink to the peer veth
Now that the peer netns is advertised in rtnl messages, we can set this property
so that IFLA_LINK will advertise the peer ifindex. It allows the userland to get
the full veth configuration.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 14:05:01 -04:00
Nicolas Dichtel e5f4e7b9ff veth: advertise link netns via netlink
Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-23 17:51:15 -08:00
Tom Gundersen 5517750f05 net: rtnetlink - make create_link take name_assign_type
This passes down NET_NAME_USER (or NET_NAME_ENUM) to alloc_netdev(),
for any device created over rtnetlink.

v9: restore reverse-christmas-tree order of local variables

Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 16:13:07 -07:00
WANG Cong bb446c19fe veth: add netpoll support
It is trivial to add netpoll support to veth, since
it is not a stacked device, we don't need to setup and
clean up netpoll.

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-25 16:35:37 -07:00
David S. Miller 64c27237a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/marvell/mvneta.c

The mvneta.c conflict is a case of overlapping changes,
a conversion to devm_ioremap_resource() vs. a conversion
to netdev_alloc_pcpu_stats.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-29 18:48:54 -04:00
Vlad Yasevich 3f8c707b9a veth: Turn off vlan rx acceleration in vlan_features
For completeness, turn off vlan rx acceleration in vlan_features so
that it doesn't show up on q-in-q setups.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 17:16:51 -04:00
Eric W. Biederman 57a7744e09 net: Replace u64_stats_fetch_begin_bh to u64_stats_fetch_begin_irq
Replace the bh safe variant with the hard irq safe variant.

We need a hard irq safe variant to deal with netpoll transmitting
packets from hard irq context, and we need it in most if not all of
the places using the bh safe variant.

Except on 32bit uni-processor the code is exactly the same so don't
bother with a bh variant, just have a hard irq safe variant that
everyone can use.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-14 22:41:36 -04:00
David S. Miller 67ddc87f16 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/ath/ath9k/recv.c
	drivers/net/wireless/mwifiex/pcie.c
	net/ipv6/sit.c

The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.

The two wireless conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05 20:32:02 -05:00
Toshiaki Makita 8d0d21f405 veth: Fix vlan_features so as to be able to use stacked vlan interfaces
Even if we create a stacked vlan interface such as veth0.10.20, it sends
single tagged frames (tagged with only vid 10).
Because vlan_features of a veth interface has the
NETIF_F_HW_VLAN_[CTAG/STAG]_TX bits, veth0.10 also has that feature, so
dev_hard_start_xmit(veth0.10) doesn't call __vlan_put_tag() and
vlan_dev_hard_start_xmit(veth0.10) overwrites vlan_tci.
This prevents us from using a combination of 802.1ad and 802.1Q
in containers, etc.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-20 02:15:38 -05:00
Jiri Pirko f7b12606b5 rtnl: make ifla_policy static
The only place this is used outside rtnetlink.c is veth. So provide
wrapper function for this usage.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-18 18:15:42 -05:00
WANG Cong 1c213bd24a net: introduce netdev_alloc_pcpu_stats() for drivers
There are many drivers calling alloc_percpu() to allocate pcpu stats
and then initializing ->syncp. So just introduce a helper function for them.

Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-14 15:49:55 -05:00
Linus Torvalds 5e30025a31 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking changes from Ingo Molnar:
 "The biggest changes:

   - add lockdep support for seqcount/seqlocks structures, this
     unearthed both bugs and required extra annotation.

   - move the various kernel locking primitives to the new
     kernel/locking/ directory"

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
  block: Use u64_stats_init() to initialize seqcounts
  locking/lockdep: Mark __lockdep_count_forward_deps() as static
  lockdep/proc: Fix lock-time avg computation
  locking/doc: Update references to kernel/mutex.c
  ipv6: Fix possible ipv6 seqlock deadlock
  cpuset: Fix potential deadlock w/ set_mems_allowed
  seqcount: Add lockdep functionality to seqcount/seqlock structures
  net: Explicitly initialize u64_stats_sync structures for lockdep
  locking: Move the percpu-rwsem code to kernel/locking/
  locking: Move the lglocks code to kernel/locking/
  locking: Move the rwsem code to kernel/locking/
  locking: Move the rtmutex code to kernel/locking/
  locking: Move the semaphore core to kernel/locking/
  locking: Move the spinlock code to kernel/locking/
  locking: Move the lockdep code to kernel/locking/
  locking: Move the mutex code to kernel/locking/
  hung_task debugging: Add tracepoint to report the hang
  x86/locking/kconfig: Update paravirt spinlock Kconfig description
  lockstat: Report avg wait and hold times
  lockdep, x86/alternatives: Drop ancient lockdep fixup message
  ...
2013-11-14 16:30:30 +09:00
John Stultz 827da44c61 net: Explicitly initialize u64_stats_sync structures for lockdep
In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.

The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.

This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.

Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.

Feedback would be appreciated!

Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-11-06 12:40:25 +01:00
Eric Dumazet 82d8189826 veth: extend features to support tunneling
While investigating on a recent vxlan regression, I found veth
was using a zero features set for vxlan tunnels.

We have to segment GSO frames, copy the payload, and do the checksum.

This patch brings a ~200% performance increase

We probably have to add hw_enc_features support
on other virtual devices.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-28 00:57:29 -04:00
Gao feng 5c70ef85a2 veth: allow to setup multicast address for veth device
We can only setup multicast address for network device when
net_device_ops->ndo_set_rx_mode is not null.

Some configurations need to add multicast address for net
device, such as netfilter cluster match module.

Add a fake ndo_set_rx_mode function to allow this operation.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-10 00:08:07 -04:00
David S. Miller b343ca84b4 Revert "veth: Showing peer of veth type dev in ip link (kernel side)"
This reverts commit 612c337306.

As per Stephen Hemminger, the layout of the netlink attribute
is not implemented correctly so revert this for now.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 21:52:03 -04:00
Masatake YAMATO 612c337306 veth: Showing peer of veth type dev in ip link (kernel side)
ip link has ability to show extra information of net work device if
kernel provides sunh information. With this patch veth driver can
provide its peer ifindex information to ip command via netlink
interface.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-08 15:23:25 -04:00
Flavio Leitner b69bbddfa1 veth: add vlan features
The veth device doesn't provide the vlan features,
so TSO for example is disabled and that causes
performance issues when using tagged traffic.

The test topology looks like this:

    br0                     br1
  /   \                  /     \
vnet  veth0.10 ----- veth1.10   vnet
VM                               VM

The netperf results with current veth driver:
MIGRATED TCP STREAM TEST from 192.168.1.1 ()
port 0 AF_INET to 192.168.1.2 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.01    2210.22

Now after applying the proposed patch:
MIGRATED TCP STREAM TEST from 192.168.1.1 ()
port 0 AF_INET to 192.168.1.2 () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    13067.47

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-19 17:36:03 -07:00
Hong zhi guo 3f8b96379a veth: remove redundant call of dev_alloc_name
it's called in the following register_netdevice. No need to call it
here.
Tested with "ip link add type veth" and "ip link add xxx%d type veth".

Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 01:21:20 -07:00
Patrick McHardy 28d2b136ca net: vlan: announce STAG offload capability in some drivers
- macvlan: propagate STAG filtering capabilities from underlying device
- ifb: announce STAG tagging support in addition to CTAG tagging support
- veth: announce STAG tagging/stripping support in addition to CTAG support

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 14:46:06 -04:00
Patrick McHardy f646968f8f net: vlan: rename NETIF_F_HW_VLAN_* feature flags to NETIF_F_HW_VLAN_CTAG_*
Rename the hardware VLAN acceleration features to include "CTAG" to indicate
that they only support CTAGs. Follow up patches will introduce 802.1ad
server provider tagging (STAGs) and require the distinction for hardware not
supporting acclerating both.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 14:45:26 -04:00
Eric Dumazet f45a5c267d veth: fix NULL dereference in veth_dellink()
commit d0e2c55e7c (veth: avoid a NULL deref in veth_stats_one)
added another NULL deref in veth_dellink().

# ip link add name veth1 type veth peer name veth0
# rmmod veth

We crash because veth_dellink() is called twice, so we must
take care of NULL peer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-10 20:41:43 -05:00
Eric Dumazet 2efd32ee1b veth: fix a NULL deref in netif_carrier_off
In commit d0e2c55e7c (veth: avoid a NULL deref in veth_stats_one)
we now clear the peer pointers in veth_dellink()

veth_close() must therefore make sure the peer pointer is set.

Reported-by: Tom Parkin <tom.parkin@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-10 14:11:46 -08:00
Eric Dumazet d0e2c55e7c veth: avoid a NULL deref in veth_stats_one
commit 2681128f0c (veth: extend device features) added a NULL deref
in veth_stats_one(), as veth_get_stats64() was not testing if the peer
device was setup or not.

At init time, we call dev_get_stats() before veth pair is fully setup.

[  178.854758]  [<ffffffffa00f5677>] veth_get_stats64+0x47/0x70 [veth]
[  178.861013]  [<ffffffff814f0a2d>] dev_get_stats+0x6d/0x130
[  178.866486]  [<ffffffff81504efc>] rtnl_fill_ifinfo+0x47c/0x930
[  178.872299]  [<ffffffff81505b93>] rtmsg_ifinfo+0x83/0x100
[  178.877678]  [<ffffffff81505cc6>] rtnl_configure_link+0x76/0xa0
[  178.883580]  [<ffffffffa00f52fa>] veth_newlink+0x16a/0x350 [veth]
[  178.889654]  [<ffffffff815061cc>] rtnl_newlink+0x4dc/0x5e0
[  178.895128]  [<ffffffff81505e1e>] ? rtnl_newlink+0x12e/0x5e0
[  178.900769]  [<ffffffff8150587d>] rtnetlink_rcv_msg+0x11d/0x310
[  178.906669]  [<ffffffff81505760>] ? __rtnl_unlock+0x20/0x20
[  178.912225]  [<ffffffff81521f89>] netlink_rcv_skb+0xa9/0xd0
[  178.917779]  [<ffffffff81502d55>] rtnetlink_rcv+0x25/0x40
[  178.923159]  [<ffffffff815218d1>] netlink_unicast+0x1b1/0x230
[  178.928887]  [<ffffffff81521c4e>] netlink_sendmsg+0x2fe/0x3b0
[  178.934615]  [<ffffffff814dbe22>] sock_sendmsg+0xd2/0xf0

So we must check if peer was setup in veth_get_stats64()

As pointed out by Ben Hutchings, priv->peer is missing proper
synchronization. Adding RCU protection is a safe and well documented
way to make sure we don't access about to be freed or already
freed data.

Reported-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-07 19:42:50 -08:00
Eric Dumazet 8093315a91 veth: extend device features
veth is lacking most modern facilities, like SG, checksums, TSO.

It makes sense to extend dev->features to get them, or GRO aggregation
is defeated by a forced segmentation.

Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-30 02:31:59 -08:00
Eric Dumazet 2681128f0c veth: reduce stat overhead
veth stats are a bit bloated. There is no need to account transmit
and receive stats, since they are absolutely symmetric.

Also use a per device atomic64_t for the dropped counter, as it
should never be used in fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-30 02:31:58 -08:00
Rami Rosen c07135633b rtnelink: remove unused parameter from rtnl_create_link().
This patch removes an unused parameter (src_net) from rtnl_create_link()
method and from the method single invocation, in veth.
This parameter was used in the past when calling
ops->get_tx_queues(src_net, tb) in rtnl_create_link().
The get_tx_queues() member of rtnl_link_ops was replaced by two methods,
get_num_tx_queues() and get_num_rx_queues(), which do not get any
parameter. This was done in commit d40156aa5e by
Jiri Pirko ("rtnl: allow to specify different num for rx and tx queue count").

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-30 12:24:40 -05:00
Hannes Frederic Sowa 23ea5a9637 veth: allow changing the mac address while interface is up
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-01 12:26:10 -04:00
Pavel Emelyanov e6f8f1a739 veth: Allow to create peer link with given ifindex
The ifinfomsg is in there (thanks kaber@ for foreseeing this long time ago),
so take the given ifidex and register netdev with it.

Ben noticed, that this code path previously ignored ifmp->ifi_index and
userland could be passing in garbage. Thus it may now fail occasionally
because the value clashes with an existing interface.

To address this it's assumed that if the caller specifies the ifindex for
the veth master device, then it's aware of this possibility and should
explicitly specify (or set to 0 for auto-assignment) the peer's ifindex as
well. With this the compatibility with old tools not setting ifindex is
preserved.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-09 16:18:07 -07:00
David S. Miller 32efe08d77 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_stats.c

Small minor conflict in bnx2x, wherein one commit changed how
statistics were stored in software, and another commit
fixed endianness bugs wrt. reading the values provided by
the chip in memory.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-19 16:03:15 -05:00
Danny Kukawka f2cedb63df net: replace random_ether_addr() with eth_hw_addr_random()
Replace usage of random_ether_addr() with eth_hw_addr_random()
to set addr_assign_type correctly to NET_ADDR_RANDOM.

Change the trivial cases.

v2: adapt to renamed eth_hw_addr_random()

Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-15 15:34:16 -05:00
Thomas Graf 237114384a veth: Enforce minimum size of VETH_INFO_PEER
VETH_INFO_PEER carries struct ifinfomsg plus optional IFLA
attributes. A minimal size of sizeof(struct ifinfomsg) must be
enforced or we may risk accessing that struct beyond the limits
of the netlink message.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-15 14:59:20 -05:00