Commit Graph

1479 Commits

Author SHA1 Message Date
Eric Dumazet 7c68d1a6b4 net: qdisc_pkt_len_init() should be more robust
Without proper validation of DODGY packets, we might very well
feed qdisc_pkt_len_init() with invalid GSO packets.

tcp_hdrlen() might access out-of-bound data, so let's use
skb_header_pointer() and proper checks.

Whole story is described in commit d0c081b491 ("flow_dissector:
properly cap thoff field")

We have the goal of validating DODGY packets earlier in the stack,
so we might very well revert this fix in the future.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jason Wang <jasowang@redhat.com>
Reported-by: syzbot+9da69ebac7dddd804552@syzkaller.appspotmail.com
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-22 16:00:05 -05:00
David S. Miller 55a5ec9b77 Revert "net: core: dev_get_valid_name is now the same as dev_alloc_name_ns"
This reverts commit 87c320e515.

Changing the error return code in some situations turns out to
be harmful in practice.  In particular Michael Ellerman reports
that DHCP fails on his powerpc machines, and this revert gets
things working again.

Johannes Berg agrees that this revert is the best course of
action for now.

Fixes: 029b6d1405 ("Revert "net: core: maybe return -EEXIST in __dev_alloc_name"")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02 11:50:12 -05:00
Song Liu 2d17d8d79e xdp: linearize skb in netif_receive_generic_xdp()
In netif_receive_generic_xdp(), it is necessary to linearize all
nonlinear skb. However, in current implementation, skb with
troom <= 0 are not linearized. This patch fixes this by calling
skb_linearize() for all nonlinear skb.

Fixes: de8f3a83b0 ("bpf: add meta pointer for direct access")
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-15 14:34:36 +01:00
Johannes Berg 029b6d1405 Revert "net: core: maybe return -EEXIST in __dev_alloc_name"
This reverts commit d6f295e9def0; some userspace (in the case
we noticed it's wpa_supplicant), is relying on the current
error code to determine that a fixed name interface already
exists.

Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 11:29:48 -05:00
David S. Miller e4be7baba8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2017-11-23

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Several BPF offloading fixes, from Jakub. Among others:

    - Limit offload to cls_bpf and XDP program types only.
    - Move device validation into the driver and don't make
      any assumptions about the device in the classifier due
      to shared blocks semantics.
    - Don't pass offloaded XDP program into the driver when
      it should be run in native XDP instead. Offloaded ones
      are not JITed for the host in such cases.
    - Don't destroy device offload state when moved to
      another namespace.
    - Revert dumping offload info into user space for now,
      since ifindex alone is not sufficient. This will be
      redone properly for bpf-next tree.

2) Fix test_verifier to avoid using bpf_probe_write_user()
   helper in test cases, since it's dumping a warning into
   kernel log which may confuse users when only running tests.
   Switch to use bpf_trace_printk() instead, from Yonghong.

3) Several fixes for correcting ARG_CONST_SIZE_OR_ZERO semantics
   before it becomes uabi, from Gianluca. More specifically:

    - Add a type ARG_PTR_TO_MEM_OR_NULL that is used only
      by bpf_csum_diff(), where the argument is either a
      valid pointer or NULL. The subsequent ARG_CONST_SIZE_OR_ZERO
      then enforces a valid pointer in case of non-0 size
      or a valid pointer or NULL in case of size 0. Given
      that, the semantics for ARG_PTR_TO_MEM in combination
      with ARG_CONST_SIZE_OR_ZERO are now such that in case
      of size 0, the pointer must always be valid and cannot
      be NULL. This fix in semantics allows for bpf_probe_read()
      to drop the recently added size == 0 check in the helper
      that would become part of uabi otherwise once released.
      At the same time we can then fix bpf_probe_read_str() and
      bpf_perf_event_output() to use ARG_CONST_SIZE_OR_ZERO
      instead of ARG_CONST_SIZE in order to fix recently
      reported issues by Arnaldo et al, where LLVM optimizes
      two boundary checks into a single one for unknown
      variables where the verifier looses track of the variable
      bounds and thus rejects valid programs otherwise.

4) A fix for the verifier for the case when it detects
   comparison of two constants where the branch is guaranteed
   to not be taken at runtime. Verifier will rightfully prune
   the exploration of such paths, but we still pass the program
   to JITs, where they would complain about using reserved
   fields, etc. Track such dead instructions and sanitize
   them with mov r0,r0. Rejection is not possible since LLVM
   may generate them for valid C code and doesn't do as much
   data flow analysis as verifier. For bpf-next we might
   implement removal of such dead code and adjust branches
   instead. Fix from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 02:33:01 +09:00
Willem de Bruijn 0c19f846d5 net: accept UFO datagrams from tuntap and packet
Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.

Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.

Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.

It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.

To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216f ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643
("net: avoid skb_warn_bad_offload false positives on UFO").

(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.

Tested
  Booted a v4.13 guest kernel with QEMU. On a host kernel before this
  patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
  enabled, same as on a v4.13 host kernel.

  A UFO packet sent from the guest appears on the tap device:
    host:
      nc -l -p -u 8000 &
      tcpdump -n -i tap0

    guest:
      dd if=/dev/zero of=payload.txt bs=1 count=2000
      nc -u 192.16.1.1 8000 < payload.txt

  Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
  packets arriving fragmented:

    ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
    (from https://github.com/wdebruij/kerneltools/tree/master/tests)

Changes
  v1 -> v2
    - simplified set_offload change (review comment)
    - documented test procedure

Link: http://lkml.kernel.org/r/<CAF=yD-LuUeDuL9YWPJD9ykOZ0QCjNeznPDr6whqZ9NGMNF12Mw@mail.gmail.com>
Fixes: fb652fdfe8 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-24 01:37:35 +09:00
Jakub Kicinski 441a33031f net: xdp: don't allow device-bound programs in driver mode
Currently device-bound programs are not able to run on the host
to save resources (host JIT is not invoked).  Don't allow XDP
programs to be attached without the HW_MODE flag.  In theory
if program is already translated for device offload the driver
should choose to offload it instead of loading it in the driver.
However, offloading translated program may still fail resulting
in device-bound program being run on the host.

Prevent this by refusing to attach device bound programs if
XDP_FLAGS_HW_MODE is not set.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-11-21 00:37:35 +01:00
Jakub Kicinski 288b3de55a bpf: offload: move offload device validation out to the drivers
With TC shared block changes we can't depend on correct netdev
pointer being available in cls_bpf.  Move the device validation
to the driver.  Core will only make sure that offloaded programs
are always attached in the driver (or in HW by the driver).  We
trust that drivers which implement offload callbacks will perform
necessary checks.

Moving the checks to the driver is generally a useful thing,
in practice the check should be against a switchdev instance,
not a netdev, given that most ASICs will probably allow using
the same program on many ports.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-11-21 00:37:35 +01:00
Linus Torvalds 5bbcc0f595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Maintain the TCP retransmit queue using an rbtree, with 1GB
      windows at 100Gb this really has become necessary. From Eric
      Dumazet.

   2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

   3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
      Lunn.

   4) Add meter action support to openvswitch, from Andy Zhou.

   5) Add a data meta pointer for BPF accessible packets, from Daniel
      Borkmann.

   6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

   7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

   8) More work to move the RTNL mutex down, from Florian Westphal.

   9) Add 'bpftool' utility, to help with bpf program introspection.
      From Jakub Kicinski.

  10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
      Dangaard Brouer.

  11) Support 'blocks' of transformations in the packet scheduler which
      can span multiple network devices, from Jiri Pirko.

  12) TC flower offload support in cxgb4, from Kumar Sanghvi.

  13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
      Leitner.

  14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

  15) Add RED qdisc offloadability, and use it in mlxsw driver. From
      Nogah Frankel.

  16) eBPF based device controller for cgroup v2, from Roman Gushchin.

  17) Add some fundamental tracepoints for TCP, from Song Liu.

  18) Remove garbage collection from ipv6 route layer, this is a
      significant accomplishment. From Wei Wang.

  19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
2017-11-15 11:56:19 -08:00
Rasmus Villemoes 87c320e515 net: core: dev_get_valid_name is now the same as dev_alloc_name_ns
If name contains a %, it's easy to see that this patch doesn't change
anything (other than eliminate the duplicate dev_valid_name
call). Otherwise, we'll now just spend a little time in snprintf()
copying name to the stack buffer allocated in dev_alloc_name_ns, and do
the __dev_get_by_name using that buffer rather than name.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes d6f295e9de net: core: maybe return -EEXIST in __dev_alloc_name
If we're given format string with no %d, -EEXIST is a saner error code.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes 93809105cf net: core: check dev_valid_name in __dev_alloc_name
We currently only exclude non-sysfs-friendly names via
dev_get_valid_name; there doesn't seem to be a reason to allow such
names when we're called via dev_alloc_name.

This does duplicate the dev_valid_name check in the dev_get_valid_name()
case; we'll fix that shortly.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes 6224abda0d net: core: drop pointless check in __dev_alloc_name
The only caller passes a stack buffer as buf, so it won't equal the
passed-in name. Moreover, we're already using buf as a scratch buffer
inside the if (p) {} block, so if buf and name were the same, that
snprintf() call would be overwriting its own format string.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes c46d7642e9 net: core: eliminate dev_alloc_name{,_ns} code duplication
dev_alloc_name contained a BUG_ON(), which I moved to dev_alloc_name_ns;
the only other caller of that already has the same BUG_ON.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes 2c88b85598 net: core: move dev_alloc_name_ns a little higher
No functional change.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:46 +09:00
Rasmus Villemoes 51f299dd94 net: core: improve sanity checking in __dev_alloc_name
__dev_alloc_name is called from the public (and exported)
dev_alloc_name(), so we don't have a guarantee that strlen(name) is at
most IFNAMSIZ. If somebody manages to get __dev_alloc_name called with a
% char beyond the 31st character, we'd be making a snprintf() call that
will very easily crash the kernel (using an appropriate %p extension,
we'll likely dereference some completely bogus pointer).

In the normal case where strlen() is sane, we don't even save anything
by limiting to IFNAMSIZ, so just use strchr().

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 16:38:45 +09:00
Vasily Averin ee21b18b6b netdev: exit_net cleanup check added
Be sure that dev_base_head list initialized in net_init hook was return
to initial state

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-14 15:45:53 +09:00
Jakub Kicinski 248f346ffe xdp: allow attaching programs loaded for specific device
Pass the netdev pointer to bpf_prog_get_type().  This way
BPF code can decide whether the device matches what the
code was loaded/translated for.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:19 +09:00
Jakub Kicinski f4e63525ee net: bpf: rename ndo_xdp to ndo_bpf
ndo_xdp is a control path callback for setting up XDP in the
driver.  We can reuse it for other forms of communication
between the eBPF stack and the drivers.  Rename the callback
and associated structures and definitions.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-05 22:26:18 +09:00
Jiri Pirko 46209401f8 net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath
In sch_handle_egress and sch_handle_ingress tp->q is used only in order
to update stats. So stats and filter list are the only things that are
needed in clsact qdisc fastpath processing. Introduce new mini_Qdisc
struct to hold those items. Also, introduce a helper to swap the
mini_Qdisc structures in case filter list head changes.

This removes need for tp->q usage without added overhead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 21:57:24 +09:00
Mark Rutland 6aa7de0591 locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-25 11:01:08 +02:00
David S. Miller f8ddadc4db Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
There were quite a few overlapping sets of changes here.

Daniel's bug fix for off-by-ones in the new BPF branch instructions,
along with the added allowances for "data_end > ptr + x" forms
collided with the metadata additions.

Along with those three changes came veritifer test cases, which in
their final form I tried to group together properly.  If I had just
trimmed GIT's conflict tags as-is, this would have split up the
meta tests unnecessarily.

In the socketmap code, a set of preemption disabling changes
overlapped with the rename of bpf_compute_data_end() to
bpf_compute_data_pointers().

Changes were made to the mv88e6060.c driver set addr method
which got removed in net-next.

The hyperv transport socket layer had a locking change in 'net'
which overlapped with a change of socket state macro usage
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 13:39:14 +01:00
Jesper Dangaard Brouer 1c601d829a bpf: cpumap xdp_buff to skb conversion and allocation
This patch makes cpumap functional, by adding SKB allocation and
invoking the network stack on the dequeuing CPU.

For constructing the SKB on the remote CPU, the xdp_buff in converted
into a struct xdp_pkt, and it mapped into the top headroom of the
packet, to avoid allocating separate mem.  For now, struct xdp_pkt is
just a cpumap internal data structure, with info carried between
enqueue to dequeue.

If a driver doesn't have enough headroom it is simply dropped, with
return code -EOVERFLOW.  This will be picked up the xdp tracepoint
infrastructure, to allow users to catch this.

V2: take into account xdp->data_meta

V4:
 - Drop busypoll tricks, keeping it more simple.
 - Skip RPS and Generic-XDP-recursive-reinjection, suggested by Alexei

V5: correct RCU read protection around __netif_receive_skb_core.

V6: Setting TASK_RUNNING vs TASK_INTERRUPTIBLE based on talk with Rik van Riel

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-18 12:12:18 +01:00
Henrik Austad 8a5f2166a6 net: export netdev_txq_to_tc to allow sch_mqprio to compile as module
In commit 32302902ff ("mqprio: Reserve last 32 classid values for HW
traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
to find the correct tc instead of dev->tc_to_txq[]

However, when mqprio is compiled as a module, it cannot resolve the
symbol, leading to this error:

     ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!

This adds an EXPORT_SYMBOL() since the other user in the kernel
(netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
sysfs-callback.

Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Henrik Austad <haustad@cisco.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-17 17:00:08 +01:00
Cong Wang 0ad646c81b tun: call dev_get_valid_name() before register_netdevice()
register_netdevice() could fail early when we have an invalid
dev name, in which case ->ndo_uninit() is not called. For tun
device, this is a problem because a timer etc. are already
initialized and it expects ->ndo_uninit() to clean them up.

We could move these initializations into a ->ndo_init() so
that register_netdevice() knows better, however this is still
complicated due to the logic in tun_detach().

Therefore, I choose to just call dev_get_valid_name() before
register_netdevice(), which is quicker and much easier to audit.
And for this specific case, it is already enough.

Fixes: 96442e4242 ("tuntap: choose the txq based on rxq")
Reported-by: Dmitry Alexeev <avekceeb@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 21:02:54 +01:00
David Ahern 42ab19ee90 net: Add extack to upper device linking
Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 21:39:33 -07:00
David Ahern 51d0c04795 net: Add extack to netdev_notifier_info
Add netlink_ext_ack to netdev_notifier_info to allow notifier
handlers to return errors to userspace.

Clean up the initialization in dev.c such that extack is easily
added in subsequent patches where relevant. Specifically, remove
the init call in call_netdevice_notifiers_info and have callers
initalize on stack when info is declared.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 21:39:33 -07:00
Nicolas Dichtel 6621dd29eb dev: advertise the new nsid when the netns iface changes
x-netns interfaces are bound to two netns: the link netns and the upper
netns. Usually, this kind of interfaces is created in the link netns and
then moved to the upper netns. At the end, the interface is visible only
in the upper netns. The link nsid is advertised via netlink in the upper
netns, thus the user always knows where is the link part.

There is no such mechanism in the link netns. When the interface is moved
to another netns, the user cannot "follow" it.
This patch adds a new netlink attribute which helps to follow an interface
which moves to another netns. When the interface is unregistered, the new
nsid is advertised. If the interface is a x-netns interface (ie
rtnl_link_ops->get_link_net is defined), the nsid is allocated if needed.

CC: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 18:04:41 -07:00
Florian Westphal 20e883204f net: core: fix kerneldoc comment
net/core/dev.c:1306: warning: No description found for parameter 'name'
net/core/dev.c:1306: warning: Excess function parameter 'alias' description in 'dev_get_alias'

Fixes: 6c5570016b ("net: core: decouple ifalias get/set from rtnl lock")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-04 10:28:22 -07:00
Florian Westphal 6c5570016b net: core: decouple ifalias get/set from rtnl lock
Device alias can be set by either rtnetlink (rtnl is held) or sysfs.

rtnetlink hold the rtnl mutex, sysfs acquires it for this purpose.
Add an extra mutex for it and use rcu to protect concurrent accesses.

This allows the sysfs path to not take rtnl and would later allow
to not hold it when dumping ifalias.

Based on suggestion from Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-03 15:56:01 -07:00
Daniel Borkmann de8f3a83b0 bpf: add meta pointer for direct access
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from it's
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-26 13:36:44 -07:00
David S. Miller 1f8d31d189 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-09-23 10:16:53 -07:00
Alexey Dobriyan 52a59bd509 net: use 32-bit arithmetic while allocating net device
Private part of allocation is never big enough to warrant size_t.

Space savings:

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
	function                                     old     new   delta
	alloc_netdev_mqs                            1120    1110     -10

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-22 21:25:52 -07:00
Willem de Bruijn 581fe0ea61 net: orphan frags on stand-alone ptype in dev_queue_xmit_nit
Zerocopy skbs frags are copied when the skb is looped to a local sock.
Commit 1080e512d4 ("net: orphan frags on receive") introduced calls
to skb_orphan_frags to deliver_skb and __netif_receive_skb for this.

With msg_zerocopy, these skbs can also exist in the tx path and thus
loop from dev_queue_xmit_nit. This already calls deliver_skb in its
loop. But it does not orphan before a separate pt_prev->func().

Add the missing skb_orphan_frags_rx.

Changes
  v1->v2: handle skb_orphan_frags_rx failure

Fixes: 1f8b977ab3 ("sock: enable MSG_ZEROCOPY")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-22 20:31:29 -07:00
Edward Cree 92dd5452c1 net: change skb->mac_header when Generic XDP calls adjust_head
Since XDP's view of the packet includes the MAC header, moving the start-
 of-packet with bpf_xdp_adjust_head needs to also update the offset of the
 MAC header (which is relative to skb->head, not to the skb->data that was
 changed).
Without this, tcpdump sees packets starting from the old MAC header rather
 than the new one, at least in my tests on the loopback device.

Fixes: b5cdae3291 ("net: Generic XDP")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-20 15:44:22 -07:00
John Fastabend bbbe211c29 net: rcu lock and preempt disable missing around generic xdp
do_xdp_generic must be called inside rcu critical section with preempt
disabled to ensure BPF programs are valid and per-cpu variables used
for redirect operations are consistent. This patch ensures this is true
and fixes the splat below.

The netif_receive_skb_internal() code path is now broken into two rcu
critical sections. I decided it was better to limit the preempt_enable/disable
block to just the xdp static key portion and the fallout is more
rcu_read_lock/unlock calls. Seems like the best option to me.

[  607.596901] =============================
[  607.596906] WARNING: suspicious RCU usage
[  607.596912] 4.13.0-rc4+ #570 Not tainted
[  607.596917] -----------------------------
[  607.596923] net/core/dev.c:3948 suspicious rcu_dereference_check() usage!
[  607.596927]
[  607.596927] other info that might help us debug this:
[  607.596927]
[  607.596933]
[  607.596933] rcu_scheduler_active = 2, debug_locks = 1
[  607.596938] 2 locks held by pool/14624:
[  607.596943]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff95445ffd>] ip_finish_output2+0x14d/0x890
[  607.596973]  #1:  (rcu_read_lock_bh){......}, at: [<ffffffff953c8e3a>] __dev_queue_xmit+0x14a/0xfd0
[  607.597000]
[  607.597000] stack backtrace:
[  607.597006] CPU: 5 PID: 14624 Comm: pool Not tainted 4.13.0-rc4+ #570
[  607.597011] Hardware name: Dell Inc. Precision Tower 5810/0HHV7N, BIOS A17 03/01/2017
[  607.597016] Call Trace:
[  607.597027]  dump_stack+0x67/0x92
[  607.597040]  lockdep_rcu_suspicious+0xdd/0x110
[  607.597054]  do_xdp_generic+0x313/0xa50
[  607.597068]  ? time_hardirqs_on+0x5b/0x150
[  607.597076]  ? mark_held_locks+0x6b/0xc0
[  607.597088]  ? netdev_pick_tx+0x150/0x150
[  607.597117]  netif_rx_internal+0x205/0x3f0
[  607.597127]  ? do_xdp_generic+0xa50/0xa50
[  607.597144]  ? lock_downgrade+0x2b0/0x2b0
[  607.597158]  ? __lock_is_held+0x93/0x100
[  607.597187]  netif_rx+0x119/0x190
[  607.597202]  loopback_xmit+0xfd/0x1b0
[  607.597214]  dev_hard_start_xmit+0x127/0x4e0

Fixes: d445516966 ("net: xdp: support xdp generic on virtual devices")
Fixes: b5cdae3291 ("net: Generic XDP")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-08 21:11:00 -07:00
David S. Miller 6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Ido Schimmel 25cc72a338 mlxsw: spectrum: Forbid linking to devices that have uppers
The mlxsw driver relies on NETDEV_CHANGEUPPER events to configure the
device in case a port is enslaved to a master netdev such as bridge or
bond.

Since the driver ignores events unrelated to its ports and their
uppers, it's possible to engineer situations in which the device's data
path differs from the kernel's.

One example to such a situation is when a port is enslaved to a bond
that is already enslaved to a bridge. When the bond was enslaved the
driver ignored the event - as the bond wasn't one of its uppers - and
therefore a bridge port instance isn't created in the device.

Until such configurations are supported forbid them by checking that the
upper device doesn't have uppers of its own.

Fixes: 0d65fc1304 ("mlxsw: spectrum: Implement LAG port join/leave")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Nogah Frankel <nogahf@mellanox.com>
Tested-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 09:59:41 -07:00
Jesper Dangaard Brouer 1e22391e8f net: missing call of trace_napi_poll in busy_poll_stop
Noticed that busy_poll_stop() also invoke the drivers napi->poll()
function pointer, but didn't have an associated call to trace_napi_poll()
like all other call sites.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-28 11:22:21 -07:00
Jesper Dangaard Brouer 2facaad600 xdp: make generic xdp redirect use tracepoint trace_xdp_redirect
If the xdp_do_generic_redirect() call fails, it trigger the
trace_xdp_exception tracepoint.  It seems better to use the same
tracepoint trace_xdp_redirect, as the native xdp_do_redirect{,_map} does.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 11:59:36 -07:00
Jason Wang 7c4974786f net: export some generic xdp helpers
This patch tries to export some generic xdp helpers to drivers. This
can let driver to do XDP for a specific skb. This is useful for the
case when the packet is hard to be processed at page level directly
(e.g jumbo/GSO frame).

With this patch, there's no need for driver to forbid the XDP set when
configuration is not suitable. Instead, it can defer the XDP for
packets that is hard to be processed directly after skb is created.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-13 19:56:07 -07:00
Tonghao Zhang 939912216f net: skb_needs_check() removes CHECKSUM_UNNECESSARY check for tx.
Because we remove the UFO support, we will also remove the
CHECKSUM_UNNECESSARY check in skb_needs_check().

Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-11 14:45:52 -07:00
David S. Miller 3118e6e19d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:28:45 -07:00
Willem de Bruijn 8d63bee643 net: avoid skb_warn_bad_offload false positives on UFO
skb_warn_bad_offload triggers a warning when an skb enters the GSO
stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
checksum offload set.

Commit b2504a5dbe ("net: reduce skb_warn_bad_offload() noise")
observed that SKB_GSO_DODGY producers can trigger the check and
that passing those packets through the GSO handlers will fix it
up. But, the software UFO handler will set ip_summed to
CHECKSUM_NONE.

When __skb_gso_segment is called from the receive path, this
triggers the warning again.

Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
Tx these two are equivalent. On Rx, this better matches the
skb state (checksum computed), as CHECKSUM_NONE here means no
checksum computed.

See also this thread for context:
http://patchwork.ozlabs.org/patch/799015/

Fixes: b2504a5dbe ("net: reduce skb_warn_bad_offload() noise")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-08 21:39:01 -07:00
Willem de Bruijn 1f8b977ab3 sock: enable MSG_ZEROCOPY
Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.

Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy copy buffers, ref-counted or not, in this path.

The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
the SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.

(2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Sabrina Dubroca ae847f40b6 net: call udp_tunnel_get_rx_info when NETIF_F_RX_UDP_TUNNEL_PORT is toggled
NETIF_F_RX_UDP_TUNNEL_PORT is special, in that we need to do more than
just flip the bit in dev->features. When disabling we must also clear
currently offloaded ports from the device, and when enabling we must
tell the device to offload the ports it can.

Because vxlan stores its sockets in a hashtable and they are inserted at
the head of per-bucket lists, switching the feature off and then on can
result in a different set of ports being offloaded (depending on the
HW's limits).

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 13:52:59 -07:00
Sabrina Dubroca d764a122cc net: add new netdevice feature for offload of RX port for UDP tunnels
This adds a new netdevice feature, so that the offloading of RX port for
UDP tunnels can be disabled by the administrator on some netdevices,
using the "rx-udp_tunnel-port-offload" feature in ethtool.

This feature is set for all devices that provide ndo_udp_tunnel_add.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-24 13:52:59 -07:00
David S. Miller 7a68ada6ec Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-07-21 03:38:43 +01:00
stephen hemminger 7051b88a35 net: make dev_close and related functions void
There is no useful return value from dev_close. All paths return 0.
Change dev_close and helper functions to void.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-19 16:45:10 -07:00
David S. Miller d4c023f4f3 net: Remove references to NETIF_F_UFO in netdev_fix_features().
It is going away.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-17 09:52:58 -07:00