Commit graph

692794 commits

Author SHA1 Message Date
Yuchung Cheng
4faf783998 tcp: fix cwnd undo in Reno and HTCP congestion controls
Using ssthresh to revert cwnd is less reliable when ssthresh is
bounded to 2 packets. This patch uses an existing variable in TCP
"prior_cwnd" that snapshots the cwnd right before entering fast
recovery and RTO recovery in Reno.  This fixes the issue discussed
in netdev thread: "A buggy behavior for Linux TCP Reno and HTCP"
https://www.spinics.net/lists/netdev/msg444955.html

Suggested-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Wei Sun <unlcsewsun@gmail.com>
Signed-off-by: Yuchung Cheng <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 21:25:10 -07:00
kiki good
10377ba767 net: systemport: Support 64bit statistics
When using Broadcom Systemport device in 32bit Platform, ifconfig can
only report up to 4G tx,rx status, which will be wrapped to 0 when the
number of incoming or outgoing packets exceeds 4G, only taking
around 2 hours in busy network environment (such as streaming).
Therefore, it makes hard for network diagnostic tool to get reliable
statistical result, so the patch is used to add 64bit support for
Broadcom Systemport device in 32bit Platform.

This patch provides 64bit statistics capability on both ethtool and ifconfig.

Signed-off-by: Jianming.qiao <kiki-good@hotmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 21:18:27 -07:00
Intiyaz Basha
2470f3a294 liquidio: moved console_bitmask module param to lio_main.c
Moving PF module param console_bitmask to lio_main.c for consistency.

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 21:17:57 -07:00
Intiyaz Basha
9060e6bae6 liquidio: add missing strings in oct_dev_state_str array
There's supposed to be a one-to-one correspondence between the 18 macros
that #define the OCT_DEV states (in octeon_device.h) and the strings in the
oct_dev_state_str array, but there are only 14 strings in the array.

Add the missing strings (so they become 18 in total), and also revise some
incorrect/outdated text of existing strings.

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 21:17:07 -07:00
David S. Miller
234709336b Merge branch 'phylink-and-sfp-support'
Russell King says:

====================
phylink and sfp support

This patch series introduces generic support for SFP sockets found on
various Marvell based platforms.  The idea here is to provide common
SFP socket support which can be re-used by network drivers as
appropriate, rather than each network driver having to re-implement
SFP socket support.

SFP sockets typically use other system resources, eg, I2C buses to read
identifying information, and GPIOs to monitor socket state and control
the socket.  Meanwhile, some network drivers drive multiple ethernet
ports from one instantiation of the driver.

It is not desirable to block the initialisation of a network driver
(thus denying other ports from being operational) if the resources
for the SFP socket are not yet available.  This means that an element
of independence between the SFP support code and the driver is
required.

More than that, SFP modules effectively bring hotplug PHYs to
networking - SFP copper modules normally contain a standard PHY
accessed over the I2C bus, and it is desirable to read their state
so network drivers can be appropriately configured.

To add to the complexity, SFP modules can be connected in at least
two places:

1. Directly to the serdes output of a MAC with no intervening PHY.
   For example:

     mvneta ----> SFP socket

2. To a PHY, for example:

     mvpp2 ---> PHY ---> copper
                 |
                 `-----> SFP socket

This code supports both setups, although it's not fully implemented
with scenario (2).

Moreover, the link presented by the SFP module can be one of the
10Gbase-R family (for SFP+ sockets), SGMII or 1000base-X (for SFP
sockets) depending on the module, and network drivers need to
reconfigure themselves accordingly for the link to come up.

For example, if the MAC is configured for SGMII and a fibre module
is plugged in, the link won't come up until the MAC is reconfigured
for 1000base-X mode.

The SFP code manages the SFP socket - detecting the module, reading
the identifying information, and managing the control and status
signals.  Importantly, it disables the SFP module transmitter when
the MAC is down, so that the laser is turned off (but that is not
a guarantee.)

phylink provides the mechanisms necessary to manage the link modes,
based on the SFP module type, and supports hot-plugging of the PHY
without needing the MAC driver to be brought up and down on
transitions.  phylink also supports the classical static PHY and
fixed-link modes.

I currently (but not included in this series) have code to convert
mvneta to use phylink, and the out of tree mvpp2x driver.  I have
nothing for the mvpp2 driver at present as that driver is only
recently becoming functional on 10G hardware, and is missing a lot
of features that are necessary to make things work correctly.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:29 -07:00
Russell King
7397005545 sfp: add SFP module support
Add support for SFP hotpluggable modules via sfp-bus and phylink.
This supports both copper and optical SFP modules, which require
different Serdes modes in order to properly negotiate the link.

Optical SFP modules typically require the Serdes link to be talking
1000BaseX mode - this is the gigabit ethernet mode defined by the
802.3 standard.

Copper SFP modules typically integrate a PHY in the module to convert
from Serdes to copper, and the PHY will be configured by the vendor
to either present a 1000BaseX Serdes link (for fixed 1000BaseT) or a
SGMII Serdes link.  However, this is vendor defined, so we instead
detect the PHY, switch the link to SGMII mode, and use traditional
PHY based negotiation.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:29 -07:00
Russell King
da7c1862f0 phylink: add in-band autonegotiation support for 10GBase-KR mode.
Add in-band autonegotation support for 10GBase-KR mode.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:29 -07:00
Russell King
ecbd87b843 phylink: add support for MII ioctl access to Clause 45 PHYs
Add support for reading and writing the clause 45 MII registers.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:29 -07:00
Russell King
770a1ad557 phylink: add module EEPROM support
Add support for reading module EEPROMs through phylink.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:29 -07:00
Russell King
ce0aa27ff3 sfp: add sfp-bus to bridge between network devices and sfp cages
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:29 -07:00
Russell King
9525ae8395 phylink: add phylink infrastructure
The link between the ethernet MAC and its PHY has become more complex
as the interface evolves.  This is especially true with serdes links,
where the part of the PHY is effectively integrated into the MAC.

Serdes links can be connected to a variety of devices, including SFF
modules soldered down onto the board with the MAC, a SFP cage with
a hotpluggable SFP module which may contain a PHY or directly modulate
the serdes signals onto optical media with or without a PHY, or even
a classical PHY connection.

Moreover, the negotiation information on serdes links comes in two
varieties - SGMII mode, where the PHY provides its speed/duplex/flow
control information to the MAC, and 1000base-X mode where both ends
exchange their abilities and each resolve the link capabilities.

This means we need a more flexible means to support these arrangements,
particularly with the hotpluggable nature of SFP, where the PHY can
be attached or detached after the network device has been brought up.

Ethtool information can come from multiple sources:
- we may have a PHY operating in either SGMII or 1000base-X mode, in
  which case we take ethtool/mii data directly from the PHY.
- we may have a optical SFP module without a PHY, with the MAC
  operating in 1000base-X mode - the ethtool/mii data needs to come
  from the MAC.
- we may have a copper SFP module with a PHY whic can't be accessed,
  which means we need to take ethtool/mii data from the MAC.

Phylink aims to solve this by providing an intermediary between the
MAC and PHY, providing a safe way for PHYs to be hotplugged, and
allowing a SFP driver to reconfigure the serdes connection.

Phylink also takes over support of fixed link connections, where the
speed/duplex/flow control are fixed, but link status may be controlled
by a GPIO signal.  By avoiding the fixed-phy implementation, phylink
can provide a faster response to link events: fixed-phy has to wait for
phylib to operate its state machine, which can take several seconds.
In comparison, phylink takes milliseconds.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

- remove sync status
- rework supported and advertisment handling
- add 1000base-x speed for fixed links
- use functionality exported from phy-core, reworking
  __phylink_ethtool_ksettings_set for it
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:29 -07:00
Russell King
453d00defb net: phy: add I2C mdio bus
Add an I2C MDIO bus bridge library, to allow phylib to access PHYs which
are connected to an I2C bus instead of the more conventional MDIO bus.
Such PHYs can be found in SFP adapters and SFF modules.

Since PHYs appear at I2C bus address 0x40..0x5f, and 0x50/0x51 are
reserved for SFP EEPROMs/diagnostics, we must not allow the MDIO bus
to access these I2C addresses.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:28 -07:00
Russell King
5e5758d9d8 net: phy: export phy_start_machine() for phylink
phylink will need phy_start_machine exported, so lets export it as a
GPL symbol.  Documentation/networking/phy.txt indicates that this
should be a PHY API function.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:28 -07:00
Russell King
a81497bee7 net: phy: provide a hook for link up/link down events
Sometimes, we need to do additional work between the PHY coming up and
marking the carrier present - for example, we may need to wait for the
PHY to MAC link to finish negotiation.  This changes phylib to provide
a notification function pointer which avoids the built-in
netif_carrier_on() and netif_carrier_off() functions.

Standard ->adjust_link functionality is provided by hooking a helper
into the new ->phy_link_change method.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:28 -07:00
Russell King
1f3645bb41 net: phy: add 1000Base-X to phy settings table
Add the missing 1000Base-X entry to the phy settings table.  This was
not included because the original code could not cope with more than
32 bits of link mode mask.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:28 -07:00
Russell King
0ccb4fc65d net: phy: move phy_lookup_setting() and guts of phy_supported_speeds() to phy-core
phy_lookup_setting() provides useful functionality in ethtool code
outside phylib.  Move it to phy-core and allow it to be re-used (eg,
in phylink) rather than duplicated elsewhere.  Note that this supports
the larger linkmode space.

As we move the phy settings table, we also need to move the guts of
phy_supported_speeds() as well.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:28 -07:00
Russell King
da4625ac26 net: phy: split out PHY speed and duplex string generation
Other code would like to make use of this, so make the speed and duplex
string generation visible, and place it in a separate file.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:28 -07:00
Russell King
c3ecbe757c net: phy: allow settings table to support more than 32 link modes
Allow the phy settings table to support more than 32 link modes by
switching to the ethtool link mode bit number representation, rather
than storing the mask.  This will allow phylink and other ethtool
code to share the settings table to look up settings.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:55:28 -07:00
David S. Miller
21e27f2d86 Merge branch 'IP-cleanup-LSRR-option-processing'
Paolo Abeni says:

====================
IP: cleanup LSRR option processing

The __ip_options_echo() function expect a valid dst entry in skb->dst;
as result we sometimes need to preserve the dst entry for the whole IP
RX path.

The current usage of skb->dst looks more a relic from ancient past that
a real functional constraint. This patchset tries to remove such usage,
and than drops some hacks currently in place in the IP code to keep
skb->dst around.

__ip_options_echo() uses of skb->dst for two different purposes: retrieving
the netns assicated with the skb, and modify the ingress packet LSRR address
list.

The first patch removes the code modifying the ingress packet, and the second
one provides an explicit netns argument to __ip_options_echo(). The following
patches cleanup the current code keeping arund skb->dst for __ip_options_echo's
sake.

Updating the __ip_options_echo() function has been previously discussed here:

http://marc.info/?l=linux-netdev&m=150064533516348&w=2
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:51:12 -07:00
Paolo Abeni
3bdefdf9d9 udp: no need to preserve skb->dst
__ip_options_echo() does not need anymore skb->dst, so we can
avoid explicitly preserving it for its own sake.

This is almost a revert of commit 0ddf3fb2c4 ("udp: preserve
skb->dst if required for IP options processing") plus some
lifting to fit later changes.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:51:12 -07:00
Paolo Abeni
61a1030bad Revert "ipv4: keep skb->dst around in presence of IP options"
ip_options_echo() does not use anymore the skb->dst and don't
need to keep the dst around for options's sake only.
This reverts commit 34b2cef20f.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:51:12 -07:00
Paolo Abeni
91ed1e666a ip/options: explicitly provide net ns to __ip_options_echo()
__ip_options_echo() uses the current network namespace, and
currently retrives it via skb->dst->dev.

This commit adds an explicit 'net' argument to __ip_options_echo()
and update all the call sites to provide it, usually via a simpler
sock_net().

After this change, __ip_options_echo() no more needs to access
skb->dst and we can drop a couple of hack to preserve such
info in the rx path.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:51:12 -07:00
Paolo Abeni
a1e155ece1 IP: do not modify ingress packet IP option in ip_options_echo()
While computing the response option set for LSRR, ip_options_echo()
also changes the ingress packet LSRR addresses list, setting
the last one to the dst specific address for the ingress packet
- via memset(start[ ...
The only visible effect of such change - beyond possibly damaging
shared/cloned skbs - is modifying the data carried by ICMP replies
changing the header information for reported the ingress packet,
which violates RFC1122 3.2.2.6.
All the others call sites just ignore the ingress packet IP options
after calling ip_options_echo()
Note that the last element in the LSRR option address list for the
reply packet will be properly set later in the ip output path
via ip_options_build().
This buggy memset() predates git history and apparently was present
into the initial ip_options_echo() implementation in linux 1.3.30 but
still looks wrong.

The removal of the fib_compute_spec_dst() call will help
completely dropping the skb->dst usage by __ip_options_echo() with a
later patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-06 20:51:11 -07:00
Pavel Belous
a54df682e5 aquantia: Switch to use napi_gro_receive
Add support for GRO (generic receive offload) for aQuantia Atlantic driver.
This results in a perfomance improvement when GRO is enabled.

Signed-off-by: Pavel Belous <pavel.belous@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 20:57:13 -07:00
John Fastabend
56ce097c1c net: comment fixes against BPF devmap helper calls
Update BPF comments to accurately reflect XDP usage.

Fixes: 97f91a7cf0 ("bpf: add bpf_redirect_map helper routine")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:29:03 -07:00
David S. Miller
8f75222446 Merge branch 'net-sched-summer-cleanup-part-1-mainly-in-exts-area'
Jiri Pirko says:

====================
net: sched: summer cleanup part 1, mainly in exts area

This patchset is one of the couple cleanup patchsets I have in queue.
The motivation aside the obvious need to "make things nicer" is also
to prepare for shared filter blocks introduction. That requires tp->q
removal, and therefore removal of all tp->q users.

Patch 1 is just some small thing I spotted on the way
Patch 2 removes one user of tp->q, namely tcf_em_tree_change
Patches 3-8 do preparations for exts->nr_actions removal
Patches 9-10 do simple renames of functions in cls*
Patches 11-19 remove unnecessary calls of tcf_exts_change helper
The last patch changes tcf_exts_change to don't take lock

Tested by tools/testing/selftests/tc-testing

v1->v2:
- removed conversion of action array to list as noted by Cong
- added the past patch instead
- small rebases of patches 11-19
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:25 -07:00
Jiri Pirko
9b0d4446b5 net: sched: avoid atomic swap in tcf_exts_change
tcf_exts_change is always called on newly created exts, which are not used
on fastpath. Therefore, simple struct copy is enough.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
705c709126 net: sched: cls_u32: no need to call tcf_exts_change for newly allocated struct
As the n struct was allocated right before u32_set_parms call,
no need to use tcf_exts_change to do atomic change, and we can just
fill-up the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
8c98d571bb net: sched: cls_route: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before route4_set_parms call,
no need to use tcf_exts_change to do atomic change, and we can just
fill-up the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
c09fc2e11e net: sched: cls_flow: no need to call tcf_exts_change for newly allocated struct
As the fnew struct just was allocated, so no need to use tcf_exts_change
to do atomic change, and we can just fill-up the unused exts struct
directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
8cc6251381 net: sched: cls_cgroup: no need to call tcf_exts_change for newly allocated struct
As the new struct just was allocated, so no need to use tcf_exts_change
to do atomic change, and we can just fill-up the unused exts struct
directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
6839da326d net: sched: cls_bpf: no need to call tcf_exts_change for newly allocated struct
As the prog struct was allocated right before cls_bpf_set_parms call,
no need to use tcf_exts_change to do atomic change, and we can just
fill-up the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
ff1f8ca080 net: sched: cls_basic: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before basic_set_parms call, no need
to use tcf_exts_change to do atomic change, and we can just fill-up
the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
a74cb36980 net: sched: cls_matchall: no need to call tcf_exts_change for newly allocated struct
As the head struct was allocated right before mall_set_parms call,
no need to use tcf_exts_change to do atomic change, and we can just
fill-up the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
94611bff6e net: sched: cls_fw: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before fw_set_parms call, no need
to use tcf_exts_change to do atomic change, and we can just fill-up
the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
455075292b net: sched: cls_flower: no need to call tcf_exts_change for newly allocated struct
As the f struct was allocated right before fl_set_parms call, no need
to use tcf_exts_change to do atomic change, and we can just fill-up
the unused exts struct directly by tcf_exts_validate.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
1e5003af37 net: sched: cls_fw: rename fw_change_attrs function
Since the function name is misleading since it is not changing
anything, name it similarly to other cls.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
6a725c481d net: sched: cls_bpf: rename cls_bpf_modify_existing function
The name cls_bpf_modify_existing is highly misleading, as it indeed does
not modify anything existing. It does not modify at all.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
978dfd8d14 net: sched: use tcf_exts_has_actions instead of exts->nr_actions
For check in tcf_exts_dump use tcf_exts_has_actions helper instead
of exts->nr_actions for checking if there are any actions present.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:24 -07:00
Jiri Pirko
ec1a9cca0e net: sched: remove check for number of actions in tcf_exts_exec
Leave it to tcf_action_exec to return TC_ACT_OK in case there is no
action present.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
af089e701a net: sched: fix return value of tcf_exts_exec
Return the defined TC_ACT_OK instead of 0.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
6fc6d06e53 net: sched: remove redundant helpers tcf_exts_is_predicative and tcf_exts_is_available
These two helpers are doing the same as tcf_exts_has_actions, so remove
them and use tcf_exts_has_actions instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
af69afc551 net: sched: use tcf_exts_has_actions in tcf_exts_exec
Use the tcf_exts_has_actions helper instead or directly testing
exts->nr_actions in tcf_exts_exec.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
3bcc0cec81 net: sched: change names of action number helpers to be aligned with the rest
The rest of the helpers are named tcf_exts_*, so change the name of
the action number helpers to be aligned. While at it, change to inline
functions.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
4ebc1e3cfc net: sched: remove unneeded tcf_em_tree_change
Since tcf_em_tree_validate could be always called on a newly created
filter, there is no need for this change function.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Jiri Pirko
f7ebdff757 net: sched: sch_atm: use Qdisc_class_common structure
Even if it is only for classid now, use this common struct a be aligned
with the rest of the classful qdiscs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:21:23 -07:00
Lin Yun Sheng
967b2e2a76 net: hns: Fix for __udivdi3 compiler error
This patch fixes the __udivdi3 undefined error reported by
test robot.

Fixes: b8c17f7088 ("net: hns: Add self-adaptive interrupt coalesce support in hns driver")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 11:06:59 -07:00
Dan Carpenter
5987feb38a net: phy: marvell: logical vs bitwise OR typo
This was supposed to be a bitwise OR but there is a || vs | typo.

Fixes: 864dc729d5 ("net: phy: marvell: Refactor m88e1121 RGMII delay configuration")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-04 10:55:54 -07:00
David S. Miller
35615994c1 Merge branch 'socket-sendmsg-zerocopy'
Willem de Bruijn says:

====================
socket sendmsg MSG_ZEROCOPY

Introduce zerocopy socket send flag MSG_ZEROCOPY. This extends the
shared page support (SKBTX_SHARED_FRAG) from sendpage to sendmsg.
Implement the feature for TCP initially, as large writes benefit
most.

On a send call with MSG_ZEROCOPY, the kernel pins user pages and
links these directly into the skbuff frags[] array.

Each send call with MSG_ZEROCOPY that transmits data will eventually
queue a completion notification on the error queue: a per-socket u32
incremented on each such call. A request may have to revert to copy
to succeed, for instance when a device cannot support scatter-gather
IO. In that case a flag is passed along to notify that the operation
succeeded without zerocopy optimization.

The implementation extends the existing zerocopy infra for tuntap,
vhost and xen with features needed for TCP, notably reference
counting to handle cloning on retransmit and GSO.

For more details, see also the netdev 2.1 paper and presentation at
https://netdevconf.org/2.1/session.html?debruijn

Changelog:

  v3 -> v4:
    - dropped UDP, RAW and PF_PACKET for now
        Without loopback support, datagrams are usually smaller than
        the ~8KB size threshold needed to benefit from zerocopy.
    - style: a few reverse chrismas tree
    - minor: SO_ZEROCOPY returns ENOTSUPP on unsupported protocols
    - minor: squashed SO_EE_CODE_ZEROCOPY_COPIED patch
    - minor: rebased on top of net-next with kmap_atomic fix

  v2 -> v3:
    - fix rebase conflict: SO_ZEROCOPY 59 -> 60

  v1 -> v2:
    - fix (kbuild-bot): do not remove uarg until patch 5
    - fix (kbuild-bot): move zerocopy_sg_from_iter doc with function
    - fix: remove unused extern in header file

  RFCv2 -> v1:
    - patch 2
        - review comment: in skb_copy_ubufs, always allocate order-0
            page, also when replacing compound source pages.
    - patch 3
        - fix: always queue completion notification on MSG_ZEROCOPY,
	    also if revert to copy.
	- fix: on syscall abort, correctly revert notification state
	- minor: skip queue notification on SOCK_DEAD
	- minor: replace BUG_ON with WARN_ON in recoverable error
    - patch 4
        - new: add socket option SOCK_ZEROCOPY.
	    only honor MSG_ZEROCOPY if set, ignore for legacy apps.
    - patch 5
        - fix: clear zerocopy state on skb_linearize
    - patch 6
        - fix: only coalesce if prev errqueue elem is zerocopy
	- minor: try coalescing with list tail instead of head
        - minor: merge bytelen limit patch
    - patch 7
        - new: signal when data had to be copied
    - patch 8 (tcp)
        - optimize: avoid setting PSH bit when exceeding max frags.
	    that limits GRO on the client. do not goto new_segment.
	- fix: fail on MSG_ZEROCOPY | MSG_FASTOPEN
	- minor: do not wait for memory: does not work for optmem
	- minor: simplify alloc
    - patch 9 (udp)
        - new: add PF_INET6
        - fix: attach zerocopy notification even if revert to copy
	- minor: simplify alloc size arithmetic
    - patch 10 (raw hdrinc)
        - new: add PF_INET6
    - patch 11 (pf_packet)
        - minor: simplify slightly
    - patch 12
        - new msg_zerocopy regression test: use veth pair to test
	    all protocols: ipv4/ipv6/packet, tcp/udp/raw, cork
	    all relevant ethtool settings: rx off, sg off
	    all relevant packet lengths: 0, <MAX_HEADER, max size

  RFC -> RFCv2:
    - review comment: do not loop skb with zerocopy frags onto rx:
          add skb_orphan_frags_rx to orphan even refcounted frags
	  call this in __netif_receive_skb_core, deliver_skb and tun:
	  same as commit 1080e512d4 ("net: orphan frags on receive")
    - fix: hold an explicit sk reference on each notification skb.
          previously relied on the reference (or wmem) held by the
	  data skb that would trigger notification, but this breaks
	  on skb_orphan.
    - fix: when aborting a send, do not inc the zerocopy counter
          this caused gaps in the notification chain
    - fix: in packet with SOCK_DGRAM, pull ll headers before calling
          zerocopy_sg_from_iter
    - fix: if sock_zerocopy_realloc does not allow coalescing,
          do not fail, just allocate a new ubuf
    - fix: in tcp, check return value of second allocation attempt
    - chg: allocate notification skbs from optmem
          to avoid affecting tcp write queue accounting (TSQ)
    - chg: limit #locked pages (ulimit) per user instead of per process
    - chg: grow notification ids from 16 to 32 bit
      - pass range [lo, hi] through 32 bit fields ee_info and ee_data
    - chg: rebased to davem-net-next on top of v4.10-rc7
    - add: limit notification coalescing
          sharing ubufs limits overhead, but delays notification until
	  the last packet is released, possibly unbounded. Add a cap.
    - tests: add snd_zerocopy_lo pf_packet test
    - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)

Limitations / Known Issues:
    - TCP may build slightly smaller than max TSO packets due to
      exceeding MAX_SKB_FRAGS frags when zerocopy pages are unaligned.
    - All SKBTX_SHARED_FRAG may require additional __skb_linearize or
      skb_copy_ubufs calls in u32, skb_find_text, similar to
      skb_checksum_help.

Notification skbuffs are allocated from optmem. For sockets that
cannot effectively coalesce notifications, the optmem max may need
to be increased to avoid hitting -ENOBUFS:

  sysctl -w net.core.optmem_max=1048576

In application load, copy avoidance shows a roughly 5% systemwide
reduction in cycles when streaming large flows and a 4-8% reduction in
wall clock time on early tensorflow test workloads.

For the single-machine veth tests to succeed, loopback support has to
be temporarily enabled by making skb_orphan_frags_rx map to
skb_orphan_frags.

* Performance

The below table shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The second three columns
show time spent systemwide (-a -C A,B) on the two cpus that run the
process and interrupt handler. Reported is the median of at least 3
runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
are disabled and the kernel is booted with idle=halt.

NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size

perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF

        --process cycles--      ----cpu cycles----
           std      zc   %      std         zc   %
4K      27,609  11,217  41      49,217  39,175  79
16K     21,370   3,823  18      43,540  29,213  67
64K     20,557   2,312  11      42,189  26,910  64
256K    21,110   2,134  10      43,006  27,104  63
1M      20,987   1,610   8      42,759  25,931  61

Perf record indicates the main source of these differences. Process
cycles only at 1M writes (perf record; perf report -n):

std:
Samples: 42K of event 'cycles', Event count (approx.): 21258597313
 79.41%         33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
  3.27%          1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
  1.66%           694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
  0.79%           325  netperf  [kernel.kallsyms]  [k] tcp_ack
  0.43%           188  netperf  [kernel.kallsyms]  [k] __alloc_skb

zc:
Samples: 1K of event 'cycles', Event count (approx.): 1439509124
 30.36%           584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
 14.63%           284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
  8.03%           159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
  4.84%            96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
  3.10%            60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node

* Safety

The number of pages that can be pinned on behalf of a user with
MSG_ZEROCOPY is bound by the locked memory ulimit.

While the kernel holds process memory pinned, a process cannot safely
reuse those pages for other purposes. Packets looped onto the receive
stack and queued to a socket can be held indefinitely. Avoid unbounded
notification latency by restricting user pages to egress paths only.
skb_orphan_frags_rx() will create a private copy of pages even for
refcounted packets when these are looped, as did skb_orphan_frags for
the original tun zerocopy implementation.

Pages are not remapped read-only. Processes can modify packet contents
while packets are in flight in the kernel path. Bytes on which kernel
control flow depends (headers) are copied to avoid TOCTTOU attacks.
Datapath integrity does not otherwise depend on payload, with three
exceptions: checksums, optional sk_filter/tc u32/.. and device +
driver logic. The effect of wrong checksums is limited to the
misbehaving process. TC filters that access contents may have to be
excluded by adding an skb_orphan_frags_rx.

Processes can also safely avoid OOM conditions by bounding the number
of bytes passed with MSG_ZEROCOPY and by removing shared pages after
transmission from their own memory map.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00
Willem de Bruijn
07b65c5b31 test: add msg_zerocopy test
Introduce regression test for msg_zerocopy feature. Send traffic from
one process to another with and without zerocopy.

Evaluate tcp, udp, raw and packet sockets, including variants
- udp: corking and corking with mixed copy/zerocopy calls
- raw: with and without hdrincl
- packet: at both raw and dgram level

Test on both ipv4 and ipv6, optionally with ethtool changes to
disable scatter-gather, tx checksum or tso offload. All of these
can affect zerocopy behavior.

The regression test can be run on a single machine if over a veth
pair. Then skb_orphan_frags_rx must be modified to be identical to
skb_orphan_frags to allow forwarding zerocopy locally.

The msg_zerocopy.sh script will setup the veth pair in network
namespaces and run all tests.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 21:37:30 -07:00