Commit graph

725040 commits

Author SHA1 Message Date
Ido Schimmel
398958ae48 ipv6: Add support for non-equal-cost multipath
The use of hash-threshold instead of modulo-N makes it trivial to add
support for non-equal-cost multipath.

Instead of dividing the multipath hash function's output space equally
between the nexthops, each nexthop is assigned a region size which is
proportional to its weight.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 15:14:44 -05:00
Ido Schimmel
3d709f69a3 ipv6: Use hash-threshold instead of modulo-N
Now that each nexthop stores its region boundary in the multipath hash
function's output space, we can use hash-threshold instead of modulo-N
in multipath selection.

This reduces the number of checks we need to perform during lookup, as
dead and linkdown nexthops are assigned a negative region boundary. In
addition, in contrast to modulo-N, only flows near region boundaries are
affected when a nexthop is added or removed.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 15:14:44 -05:00
Ido Schimmel
7696c06a18 ipv6: Use a 31-bit multipath hash
The hash thresholds assigned to IPv6 nexthops are in the range of
[-1, 2^31 - 1], where a negative value is assigned to nexthops that
should not be considered during multipath selection.

Therefore, in a similar fashion to IPv4, we need to use the upper
31-bits of the multipath hash for multipath selection.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 15:14:44 -05:00
Ido Schimmel
d7dedee184 ipv6: Calculate hash thresholds for IPv6 nexthops
Before we convert IPv6 to use hash-threshold instead of modulo-N, we
first need each nexthop to store its region boundary in the hash
function's output space.

The boundary is calculated by dividing the output space equally between
the different active nexthops. That is, nexthops that are not dead or
linkdown.

The boundaries are rebalanced whenever a nexthop is added or removed to
a multipath route and whenever a nexthop becomes active or inactive.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 15:14:44 -05:00
Jason Wang
e2b3b35eb9 vhost_net: batch used ring update in rx
This patch tries to batched used ring update during RX. This is pretty
fit for the case when guest is much faster (e.g dpdk based
backend). In this case, used ring is almost empty:

- we may get serious cache line misses/contending on both used ring
  and used idx.
- at most 1 packet could be dequeued at one time, batching in guest
  does not make much effect.

Update used ring in a batch can help since guest won't access the used
ring until used idx was advanced for several descriptors and since we
advance used ring for every N packets, guest will only need to access
used idx for every N packet since it can cache the used idx. To have a
better interaction for both batch dequeuing and dpdk batching,
VHOST_RX_BATCH was used as the maximum number of descriptors that
could be batched.

Test were done between two machines with 2.40GHz Intel(R) Xeon(R) CPU
E5-2630 connected back to back through ixgbe. Traffic were generated
on one remote ixgbe through MoonGen and measure the RX pps through
testpmd in guest when do xdp_redirect_map from local ixgbe to
tap. RX pps were increased from 3.05 Mpps to 4.00 Mpps (about 31%
improvement).

One possible concern for this is the implications for TCP (especially
latency sensitive workload). Result[1] does not show obvious changes
for most of the netperf test (RR, TX, and RX). And we do get some
improvements for RX on some specific size.

Guest RX:

size/sessions/+thu%/+normalize%
   64/     1/   +2%/   +2%
   64/     2/   +2%/   -1%
   64/     4/   +1%/   +1%
   64/     8/    0%/    0%
  256/     1/   +6%/   -3%
  256/     2/   -3%/   +2%
  256/     4/  +11%/  +11%
  256/     8/    0%/    0%
  512/     1/   +4%/    0%
  512/     2/   +2%/   +2%
  512/     4/    0%/   -1%
  512/     8/   -8%/   -8%
 1024/     1/   -7%/  -17%
 1024/     2/   -8%/   -7%
 1024/     4/   +1%/    0%
 1024/     8/    0%/    0%
 2048/     1/  +30%/  +14%
 2048/     2/  +46%/  +40%
 2048/     4/    0%/    0%
 2048/     8/    0%/    0%
 4096/     1/  +23%/  +22%
 4096/     2/  +26%/  +23%
 4096/     4/    0%/   +1%
 4096/     8/    0%/    0%
16384/     1/   -2%/   -3%
16384/     2/   +1%/   -4%
16384/     4/   -1%/   -3%
16384/     8/    0%/   -1%
65535/     1/  +15%/   +7%
65535/     2/   +4%/   +7%
65535/     4/    0%/   +1%
65535/     8/    0%/    0%

TCP_RR:

size/sessions/+thu%/+normalize%
    1/     1/    0%/   +1%
    1/    25/   +2%/   +1%
    1/    50/   +4%/   +1%
   64/     1/    0%/   -4%
   64/    25/   +2%/   +1%
   64/    50/    0%/   -1%
  256/     1/    0%/    0%
  256/    25/    0%/    0%
  256/    50/   +4%/   +2%

Guest TX:

size/sessions/+thu%/+normalize%
   64/     1/   +4%/   -2%
   64/     2/   -6%/   -5%
   64/     4/   +3%/   +6%
   64/     8/    0%/   +3%
  256/     1/  +15%/  +16%
  256/     2/  +11%/  +12%
  256/     4/   +1%/    0%
  256/     8/   +5%/   +5%
  512/     1/   -1%/   -6%
  512/     2/    0%/   -8%
  512/     4/   -2%/   +4%
  512/     8/   +6%/   +9%
 1024/     1/   +3%/   +1%
 1024/     2/   +3%/   +9%
 1024/     4/    0%/   +7%
 1024/     8/    0%/   +7%
 2048/     1/   +8%/   +2%
 2048/     2/   +3%/   -1%
 2048/     4/   -1%/  +11%
 2048/     8/   +3%/   +9%
 4096/     1/   +8%/   +8%
 4096/     2/    0%/   -7%
 4096/     4/   +4%/   +4%
 4096/     8/   +2%/   +5%
16384/     1/   -3%/   +1%
16384/     2/   -1%/  -12%
16384/     4/   -1%/   +5%
16384/     8/    0%/   +1%
65535/     1/    0%/   -3%
65535/     2/   +5%/  +16%
65535/     4/   +1%/   +2%
65535/     8/   +1%/   -1%

Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 15:04:30 -05:00
David S. Miller
65d51f2682 mlx5-updates-2018-01-08
Four patches from Or that add Hairpin support to mlx5:
 ===========================================================
 From:  Or Gerlitz <ogerlitz@mellanox.com>
 
 We refer the ability of NIC HW to fwd packet received on one port to
 the other port (also from a port to itself) as hairpin. The application API
 is based
 on ingress tc/flower rules set on the NIC with the mirred redirect
 action. Other actions can apply to packets during the redirect.
 
 Hairpin allows to offload the data-path of various SW DDoS gateways,
 load-balancers, etc to HW. Packets go through all the required
 processing in HW (header re-write, encap/decap, push/pop vlan) and
 then forwarded, CPU stays at practically zero usage. HW Flow counters
 are used by the control plane for monitoring and accounting.
 
 Hairpin is implemented by pairing a receive queue (RQ) to send queue (SQ).
 All the flows that share <recv NIC, mirred NIC> are redirected through
 the same hairpin pair. Currently, only header-rewrite is supported as a
 packet modification action.
 
 I'd like to thanks Elijah Shakkour <elijahs@mellanox.com> for implementing this
 functionality
 on HW simulator, before it was avail in the FW so the driver code could be
 tested early.
 ===========================================================
 
 From Feras three patches that provide very small changes that allow IPoIB
 to support RX timestamping for child interfaces, simply by hooking the mlx5e
 timestamping PTP ioctl to IPoIB child interface netdev profile.
 
 One patch from Gal to fix a spilling mistake.
 
 Two patches from Eugenia adds drop counters to VF statistics
 to be reported as part of VF statistics in netlink (iproute2) and
 implemented them in mlx5 eswitch.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaVF5WAAoJEEg/ir3gV/o+fRkH/0PxjwJRA3REqhi/H8HOdH9f
 cBLrOzFdqTCYQWQFCLFbMQ/Zgoel3KglpJ0iQMjuVFfjMbybVXOe8FAEVdbWHnfL
 C+2HRMe8dplKrsq5UkxJhbyKhFKhl2XeMFYWonw9dSM7Nz5DyowQ1y1r5SgMlMAv
 t3mYAIa4kZHK18BjDoIsCoAXXwsHiztR2irMp5+DwataTGP7vC7AsrucDxLA/qFf
 I3E15DZk9s1f53PUuY7CYnUnJfMMP3VJdxpyx4k6xt9J2IMuilF4YyD6wpAKsVQU
 /LzRkWI9x/6QindffqlrACeeidimOeY4pC4txIhS5uXgFXulugDHq1/Ih1sgZS8=
 =g5vr
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2018-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

mlx5-updates-2018-01-08

Four patches from Or that add Hairpin support to mlx5:
===========================================================
From:  Or Gerlitz <ogerlitz@mellanox.com>

We refer the ability of NIC HW to fwd packet received on one port to
the other port (also from a port to itself) as hairpin. The application API
is based
on ingress tc/flower rules set on the NIC with the mirred redirect
action. Other actions can apply to packets during the redirect.

Hairpin allows to offload the data-path of various SW DDoS gateways,
load-balancers, etc to HW. Packets go through all the required
processing in HW (header re-write, encap/decap, push/pop vlan) and
then forwarded, CPU stays at practically zero usage. HW Flow counters
are used by the control plane for monitoring and accounting.

Hairpin is implemented by pairing a receive queue (RQ) to send queue (SQ).
All the flows that share <recv NIC, mirred NIC> are redirected through
the same hairpin pair. Currently, only header-rewrite is supported as a
packet modification action.

I'd like to thanks Elijah Shakkour <elijahs@mellanox.com> for implementing this
functionality
on HW simulator, before it was avail in the FW so the driver code could be
tested early.
===========================================================

From Feras three patches that provide very small changes that allow IPoIB
to support RX timestamping for child interfaces, simply by hooking the mlx5e
timestamping PTP ioctl to IPoIB child interface netdev profile.

One patch from Gal to fix a spilling mistake.

Two patches from Eugenia adds drop counters to VF statistics
to be reported as part of VF statistics in netlink (iproute2) and
implemented them in mlx5 eswitch.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 14:57:19 -05:00
David S. Miller
45f8982253 Merge branch 'hns3-next'
Peng Li says:

====================
code improvements in HNS3 driver

This patchset fixes 2 comments for community review.
[patch 1/2] reverts "net: hns3: Add packet statistics of netdev"
reported by Jakub Kicinski and David Miller.
[patch 2/2] reports the function type the same line with
hns3_nic_get_stats64, reported by Andrew Lunn.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 14:55:52 -05:00
Peng Li
6c88d9d7ee net: hns3: report the function type the same line with hns3_nic_get_stats64
The function type should be on the same line with the function
name, or it may cause display error if a patch edit the
function. There is am example following:
https://www.spinics.net/lists/netdev/msg476141.html

Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 14:55:51 -05:00
Peng Li
bf909456f6 Revert "net: hns3: Add packet statistics of netdev"
This reverts commit 8491000754.

It is duplicate to add statistics of netdev for ethtool -S.

Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 14:55:51 -05:00
David S. Miller
68d5c2653b Merge branch 'Socionext-Synquacer-NETSEC-driver'
Jassi Brar says:

====================
Socionext Synquacer NETSEC driver

Changes since v5
	# Removed helper macros
	# Removed 'inline' qualifier
	# Changed multiline empty comment to single line
	# Added 'clock-names' property in DT binding example
	# Ignore 'clock-names' property in driver until f/ws in the wild are
	  upgraded or we support instance that take in more than one clock.
	# Rebased the patchset onto net-next

Changes since v4
        # Fixed ucode indexing as a word, instead of byte
        # Removed redundant clocks, keep only phy rate reference clock
          and expect it to be 'phy_ref_clk'

Changes since v3
        # Discard 'socionext,snq-mdio', and simply use 'mdio' subnode.
        # Use ioremap on ucode region as well, instead of memremap.

Changes since v2
        # Use 'mdio' subnode in DT bindings.
        # Use phy_interface_mode_is_rgmii(), instead of open coding the check.
        # Use readl/b with eeprom_base pointer.
        # Unregister mdio bus upon failure in probe.

Changes since v1
        # Switched from using memremap to ioremap
        # Implemented ndo_do_ioctl callback
        # Defined optional 'dma-coherent' DT property
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 14:50:30 -05:00
Jassi Brar
919e66a2d3 MAINTAINERS: Add entry for Socionext ethernet driver
Add entry for the Socionext Netsec controller driver and DT bindings.

Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 14:50:29 -05:00
Jassi Brar
533dd11a12 net: socionext: Add Synquacer NetSec driver
This driver adds support for Socionext "netsec" IP Gigabit
Ethernet + PHY IP used in the Synquacer SC2A11 SoC.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 14:50:29 -05:00
Jassi Brar
f78f4107ea dt-bindings: net: Add DT bindings for Socionext Netsec
This patch adds documentation for Device-Tree bindings for the
Socionext NetSec Controller driver.

Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 14:50:29 -05:00
David S. Miller
c215dae430 Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
10GbE Intel Wired LAN Driver Updates 2018-01-09

This series contains updates to ixgbe and ixgbevf only.

Emil fixes an issue with "wake on LAN"(WoL) where we need to ensure we
enable the reception of multicast packets so that WoL works for IPv6
magic packets.  Cleaned up code no longer needed with the update to
adaptive ITR.

Paul update the driver to advertise the highest capable link speed
when a module gets inserted.  Also extended the displaying of firmware
version to include the iSCSI and OEM block in the EEPROM to better
identify firmware versions/images.

Tonghao Zhang cleans up a code comment that no longer applies since
InterruptThrottleRate has been removed from the driver.

Alex fixes SR-IOV and MACVLAN offload interaction, where the MACVLAN
offload was incorrectly configuring several filters with the wrong
pool value which resulted in MACLVAN interfaces not being able to
receive traffic that had to pass over the physical interface.  Fixed
transmit hangs and dropped receive frames when the number of VFs
changed.  Added support for RSS on MACVLAN pools for X550 devices.
Fixed up the MACVLAN limitations so we can now support 63 offloaded
devices.  Cleaned up MACVLAN code that is no longer needed with the
recent changes and fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10 14:38:06 -05:00
David S. Miller
61ad64080e Merge branch 'r8169-improve-runtime-pm'
Heiner Kallweit says:

====================
r8169: improve runtime pm

On my system with two network ports I found that runtime PM didn't
suspend the unused port. Therefore I checked runtime pm in this driver
in somewhat more detail and this series improves runtime pm in general
and solves the mentioned issue.

Tested on a system with RTL8168evl (MAC version 34).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:38:57 -05:00
Heiner Kallweit
a92a08499b r8169: improve runtime pm in general and suspend unused ports
So far rpm doesn't cover cases like unused ports which are never
brought up. If they are active at probe time they remain in this state.
Included in this patch:

- Let the idle notification check whether we can suspend and let it
  schedule the suspend. This way we don't need to have calls to
  pm_schedule_suspend in different places.

- At the end of rtl_open and rtl_init_one send an idle notification
  to allow suspending if the link is down. If a cable is plugged in
  aneg is finished before the suspend timer expires and the suspend
  request is cancelled.

- Change rtl8169_runtime_suspend to power down the chip if the
  interface is down.

Successfully tested on a RTL8168evl (mac version 34).

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:38:56 -05:00
Heiner Kallweit
ef4d5fcceb r8169: improve runtime pm in rtl8169_check_link_status
This patch partially reverts commit e4fbce740f "r8169: Fix runtime
power management" from 2010. At that time the suspend delay was 100ms
and therefore suspending happened during initial aneg. Currently
suspend delay is 5s, so suspend starts after aneg and the issue
doesn't exist any longer. On my system aneg takes almost 3s, to be on
the safe side let's increase the suspend delay to 10s.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:38:56 -05:00
Heiner Kallweit
b9aa1c75e6 r8169: remove unneeded rpm ops in rtl_shutdown
This patch reverts commit 2a15cd2ff4 "r8169: runtime resume before
shutdown" from 2012. Few months after this change the underlying issue
was solved in the PCI core with commit 3ff2de9ba1 "PCI/PM: Resume
device before shutdown".

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:38:56 -05:00
David S. Miller
fdb533c304 Merge branch 'tipc-improvements-to-group-messaging'
Jon Maloy says:

====================
tipc: improvements to group messaging

We make a number of simplifications and improvements to the group
messaging service. They aim at readability/maintainability of the code
as well as scalability.

The series is based on commit f9c935db80 ("tipc: fix problems with
multipoint-to-point flow control) which has been applied to 'net' but
not yet to 'net-next'.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:59 -05:00
Jon Maloy
eb929a91b2 tipc: improve poll() for group member socket
The current criteria for returning POLLOUT from a group member socket is
too simplistic. It basically returns POLLOUT as soon as the group has
external destinations, something obviously leading to a lot of spinning
during destination congestion situations. At the same time, the internal
congestion handling is unnecessarily complex.

We now change this as follows.

- We introduce an 'open' flag in  struct tipc_group. This flag is used
  only to help poll() get the setting of POLLOUT right, and *not* for
  congeston handling as such. This means that a user can choose to
  ignore an  EAGAIN for a destination and go on sending messages to
  other destinations in the group if he wants to.

- The flag is set to false every time we return EAGAIN on a send call.

- The flag is set to true every time any member, i.e., not necessarily
  the member that caused EAGAIN, is removed from the small_win list.

- We remove the group member 'usr_pending' flag. The size of the send
  window and presence in the 'small_win' list is sufficient criteria
  for recognizing congestion.

This solution seems to be a reasonable compromise between 'anycast',
which is normally not waiting for POLLOUT for a specific destination,
and the other three send modes, which are.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:58 -05:00
Jon Maloy
232d07b74a tipc: improve groupcast scope handling
When a member joins a group, it also indicates a binding scope. This
makes it possible to create both node local groups, invisible to other
nodes, as well as cluster global groups, visible everywhere.

In order to avoid that different members end up having permanently
differing views of group size and memberhip, we must inhibit locally
and globally bound members from joining the same group.

We do this by using the binding scope as an additional separator between
groups. I.e., a member must ignore all membership events from sockets
using a different scope than itself, and all lookups for message
destinations must require an exact match between the message's lookup
scope and the potential target's binding scope.

Apart from making it possible to create local groups using the same
identity on different nodes, a side effect of this is that it now also
becomes possible to create a cluster global group with the same identity
across the same nodes, without interfering with the local groups.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:58 -05:00
Jon Maloy
8348500f80 tipc: add option to suppress PUBLISH events for pre-existing publications
Currently, when a user is subscribing for binding table publications,
he will receive a PUBLISH event for all already existing matching items
in the binding table.

However, a group socket making a subscriptions doesn't need this initial
status update from the binding table, because it has already scanned it
during the join operation. Worse, the multiplicatory effect of issuing
mutual events for dozens or hundreds group members within a short time
frame put a heavy load on the topology server, with the end result that
scale out operations on a big group tend to take much longer than needed.

We now add a new filter option, TIPC_SUB_NO_STATUS, for topology server
subscriptions, so that this initial avalanche of events is suppressed.
This change, along with the previous commit, significantly improves the
range and speed of group scale out operations.

We keep the new option internal for the tipc driver, at least for now.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:58 -05:00
Jon Maloy
d12d2e12ce tipc: send out join messages as soon as new member is discovered
When a socket is joining a group, we look up in the binding table to
find if there are already other members of the group present. This is
used for being able to return EAGAIN instead of EHOSTUNREACH if the
user proceeds directly to a send attempt.

However, the information in the binding table can be used to directly
set the created member in state MBR_PUBLISHED and send a JOIN message
to the peer, instead of waiting for a topology PUBLISH event to do this.
When there are many members in a group, the propagation time for such
events can be significant, and we can save time during the join
operation if we use the initial lookup result fully.

In this commit, we eliminate the member state MBR_DISCOVERED which has
been the result of the initial lookup, and do instead go directly to
MBR_PUBLISHED, which initiates the setup.

After this change, the tipc_member FSM looks as follows:

     +-----------+
---->| PUBLISHED |-----------------------------------------------+
PUB- +-----------+                                 LEAVE/WITHRAW |
LISH       |JOIN                                                 |
           |     +-------------------------------------------+   |
           |     |                            LEAVE/WITHDRAW |   |
           |     |                +------------+             |   |
           |     |   +----------->|  PENDING   |---------+   |   |
           |     |   |msg/maxactv +-+---+------+  LEAVE/ |   |   |
           |     |   |              |   |       WITHDRAW |   |   |
           |     |   |   +----------+   |                |   |   |
           |     |   |   |revert/maxactv|                |   |   |
           |     |   |   V              V                V   V   V
           |   +----------+  msg  +------------+       +-----------+
           +-->|  JOINED  |------>|   ACTIVE   |------>|  LEAVING  |--->
           |   +----------+       +--- -+------+ LEAVE/+-----------+DOWN
           |        A   A               |      WITHDRAW A   A    A   EVT
           |        |   |               |RECLAIM        |   |    |
           |        |   |REMIT          V               |   |    |
           |        |   |== adv   +------------+        |   |    |
           |        |   +---------| RECLAIMING |--------+   |    |
           |        |             +-----+------+  LEAVE/    |    |
           |        |                   |REMIT   WITHDRAW   |    |
           |        |                   |< adv              |    |
           |        |msg/               V            LEAVE/ |    |
           |        |adv==ADV_IDLE+------------+   WITHDRAW |    |
           |        +-------------|  REMITTED  |------------+    |
           |                      +------------+                 |
           |PUBLISH                                              |
JOIN +-----------+                                LEAVE/WITHDRAW |
---->|  JOINING  |-----------------------------------------------+
     +-----------+

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:58 -05:00
Jon Maloy
c2b22bcf2e tipc: simplify group LEAVE sequence
After the changes in the previous commit the group LEAVE sequence
can be simplified.

We now let the arrival of a LEAVE message unconditionally issue a group
DOWN event to the user. When a topology WITHDRAW event is received, the
member, if it still there, is set to state LEAVING, but we only issue a
group DOWN event when the link to the peer node is gone, so that no
LEAVE message is to be expected.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:57 -05:00
Jon Maloy
7ad32bcb78 tipc: create group member event messages when they are needed
In the current implementation, a group socket receiving topology
events about other members just converts the topology event message
into a group event message and stores it until it reaches the right
state to issue it to the user. This complicates the code unnecessarily,
and becomes impractical when we in the coming commits will need to
create and issue membership events independently.

In this commit, we change this so that we just notice the type and
origin of the incoming topology event, and then drop the buffer. Only
when it is time to actually send a group event to the user do we
explicitly create a new message and send it upwards.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:57 -05:00
Jon Maloy
0233493a5f tipc: adjustment to group member FSM
Analysis reveals that the member state MBR_QURANTINED in reality is
unnecessary, and can be replaced by the state MBR_JOINING at all
occurrencs.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:57 -05:00
Jon Maloy
4ea5dab541 tipc: let group member stay in JOINED mode if unable to reclaim
We handle a corner case in the function tipc_group_update_rcv_win().
During extreme pessure it might happen that a message receiver has all
its active senders in RECLAIMING or REMITTED mode, meaning that there
is nobody to reclaim advertisements from if an additional sender tries
to go active.

Currently we just set the new sender to ACTIVE anyway, hence at least
theoretically opening up for a receiver queue overflow by exceeding the
MAX_ACTIVE limit. The correct solution to this is to instead add the
member to the pending queue, while letting the oldest member in that
queue revert to JOINED state.

In this commit we refactor the code for handling message arrival from
a JOINED member, both to make it more comprehensible and to cover the
case described above.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:57 -05:00
Jon Maloy
8d5dee21f6 tipc: a couple of cleanups
- We remove the 'reclaiming' member list in struct tipc_group, since
  it doesn't serve any purpose.

- We simplify the GRP_REMIT_MSG branch of tipc_group_protocol_rcv().

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 12:35:57 -05:00
David S. Miller
a67c01e209 Merge branch 'ethtool-ringparam-upper-bound'
Tariq Toukan says:

====================
ethtool ringparam upper bound

This patchset by Jenny adds sanity checks in ethtool ringparam
operation for input upper bounds, similarly to what's done in
ethtool_set_channels.

The checks are added in patch 1, using a call to get_ringparam
prior to calling set_ringparam NDO.

Patch 2 changes the function's behavior in mlx4_en, so that
it returns an error for out-of-range input, instead of rounding
it to closest valid, similar to mlx5e.

Patch 3 removes the upper bound checks in mlx5e_ethtool_set_ringparam
as it becomes redundant.

Series generated against net-next commit:
f66faae2f8 Merge branch 'ipv6-ipv4-nexthop-align'
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 11:54:50 -05:00
Eugenia Emantayev
bacc794331 net/mlx5e: Remove redundant checks in set_ringparam
Since the checks are done in upper layer ethtool code,
checks in driver are not needed any more.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 11:54:50 -05:00
Eugenia Emantayev
7589fd5c8c net/mlx4_en: Align behavior of set ring size flow via ethtool
In current implementation, any requested RX/TX ring size value
that is less than minimum is silently casted to nearest valid value.
Update this behavior to align with mlx5 behavior by printing warning
in dmesg and remaining the size unchanged.
Kernel is responsible for verifying against the maximum.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 11:54:49 -05:00
Eugenia Emantayev
37e2d99b59 ethtool: Ensure new ring parameters are within bounds during SRINGPARAM
Add a sanity check to ensure that all requested ring parameters
are within bounds, which should reduce errors in driver implementation.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 11:54:49 -05:00
Alexander Duyck
68ae742458 ixgbe: Drop l2_accel_priv data pointer from ring struct
The l2 acceleration private pointer isn't needed in the ring struct. It
isn't really used anywhere other than to test and see if we are supporting
an offloaded macvlan netdev, and it is much easier to test netdev for not
being ixgbe based to verify that.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:51:33 -08:00
Alexander Duyck
1489542b9c ixgbe: Use ring values to test for Tx pending
This patch simplifies the check for Tx pending traffic and makes it more
holistic as there being any difference between next_to_use and
next_to_clean is much more informative than if head and tail are equal, as
it is possible for us to either not update tail, or not be notified of
completed work in which case next_to_clean would not be equal to head.

In addition the simplification makes it so that we don't have to read
hardware which allows us to drop a number of variables that were previously
being used in the call.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:50:17 -08:00
Alexander Duyck
4e039c1675 ixgbe: Fix limitations on macvlan so we can support up to 63 offloaded devices
This change is a fix of the macvlan offload so that we correctly handle
macvlan offloaded devices. Specifically we were configuring our limits based
on the assumption that we were going to max out the RSS indices for every
mode. As a result when we went to 15 or more macvlan interfaces we were
forced into the 2 queue RSS mode on VFs even though they could have still
supported 4.

This change splits the logic up so that we limit either the total number of
macvlan instances if DCB is enabled, or limit the number of RSS queues used
per macvlan (instead of per pool) if SR-IOV is enabled. By doing this we
can make best use of the part.

In addition I have increased the maximum number of supported interfaces to
63 with one queue per offloaded interface as this more closely reflects the
actual values supported by the interface.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:49:04 -08:00
Alexander Duyck
ff815fb2cf ixgbe: There is no need to update num_rx_pools in L2 fwd offload
The num_rx_pools value is overwritten when we reinitialize the queue
configuration. In reality we shouldn't need to be updating the value since
it is redone every time we call into ixgbe_setup_tc so for now just drop
the spots where we were incrementing or decrementing the value.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:47:12 -08:00
Alexander Duyck
2af62c5614 ixgbe: Add support for macvlan offload RSS on X550 and clean-up pool handling
In order for RSS to work on the macvlan pools of the X550 we need to
populate the MRQC, RETA, and RSS key values for each pool. This patch makes
it so that we now take care of that.

In addition I have dropped the macvlan specific configuration of psrtype
since it is redundant with the code that already exists for configuring
this value.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:44:18 -08:00
Alexander Duyck
2097db7d19 ixgbe: Perform reinit any time number of VFs change
If the number of VFs are changed we need to reinitialize the part since the
offset for the device and the number of pools will be incorrect. Without
this change we can end up seeing Tx hangs and dropped Rx frames for
incoming traffic.

In addition we should drop the code that is arbitrarily changing the
default pool and queue configuration. Instead we should wait until the port
is reset and reconfigured via ixgbe_sriov_reinit.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:41:20 -08:00
Colin Ian King
709af180ee ipv6: use ARRAY_SIZE for array sizing calculation on array seg6_action_table
Use the ARRAY_SIZE macro on array seg6_action_table to determine size of
the array. Improvement suggested by coccinelle.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 11:40:46 -05:00
Colin Ian King
2b1eaa6635 be2net: use ARRAY_SIZE for array sizing calculation on array cmd_priv_map
Use the ARRAY_SIZE macro on array cmd_priv_map to determine size of the
array.  Improvement suggested by coccinelle.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 11:40:18 -05:00
Alexander Duyck
361b53436f ixgbe: Fix interaction between SR-IOV and macvlan offload
When SR-IOV was enabled the macvlan offload was configuring several filters
with the wrong pool value. This would result in the macvlan interfaces not
being able to receive traffic that had to pass over the physical interface.

To fix it wrap the pool argument in the VMDQ_P macro which will add the
necessary offset to get to the actual VMDq pool

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:40:13 -08:00
Emil Tantilov
1b953e843d ixgbevf: remove redundant setting of xcast_mode
Removed leftover assignment of xcast_mode.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:39:01 -08:00
Tonghao Zhang
63f721c282 ixgbe: Remove an obsolete comment about ITR
The InterruptThrottleRate has been removed from ixgbe. Then Update
the comment.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:38:03 -08:00
Jason Baron
faa9b39f0e virtio_net: propagate linkspeed/duplex settings from the hypervisor
The ability to set speed and duplex for virtio_net is useful in various
scenarios as described here:

16032be virtio_net: add ethtool support for set and get of settings

However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Note that VIRTIO_NET_F_SPEED_DUPLEX is defined as bit 63, the intention
is that device feature bits are to grow down from bit 63, since the
transports are starting from bit 24 and growing up.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtio-dev@lists.oasis-open.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 11:37:56 -05:00
Paul Greenwalt
73834aec71 ixgbe: extend firmware version support
Extend FW version reporting by displaying information from the iSCSI
or OEM block in the EEPROM.

This will allow us to more accurately identify the FW.

Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:36:34 -08:00
Felix Walter
ccfdec9089 macsec: Add support for GCM-AES-256 cipher suite
This adds support for the GCM-AES-256 cipher suite as specified in
IEEE 802.1AEbn-2011. The prepared cipher suite selection mechanism is used,
with GCM-AES-128 being the default cipher suite as defined in the standard.

Signed-off-by: Felix Walter <felix.walter@cloudandheat.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 11:34:18 -05:00
Paul Greenwalt
3ead7c2e86 ixgbe: advertise highest capable link speed
On module insert advertise highest capable link speed. If module is
capable of 10G, then advertise 10G, else advertise modules capable
link speeds.

Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:30:42 -08:00
Emil Tantilov
09099ddf60 ixgbe: remove unused enum latency_range
This enum is no longer needed after
commit: b4ded8327f ("ixgbe: Update adaptive ITR algorithm")

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:26:42 -08:00
Emil Tantilov
d9d11eb36f ixgbe: enable multicast on shutdown for WOL
Previously we only enabled the reception of multicast packets when
wake on multicast is set, but we also need this to allow waking with
IPv6 magic packets.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-01-09 08:26:42 -08:00
David S. Miller
e8b18af8c3 Merge branch 'XDP-transmission-for-tuntap'
Jason Wang says:

====================
XDP transmission for tuntap

This series tries to implement XDP transmission (ndo_xdp_xmit) for
tuntap. Pointer ring was used for queuing both XDP buffers and
sk_buff, this is done by encoding the type into lowest bit of the
pointer and storin XDP metadata in the headroom of XDP buff.

Tests gets 3.05 Mpps when doing xdp_redirect_map from ixgbe to VM
(testpmd + virtio-net in guest). This gives us ~20% improvments
compared to use skb during redirect.

Please review.

Changes from V1:

- slient warnings
- fix typos
- add skb mode number in the commit log
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-09 10:57:19 -05:00