This reverts commit 92ff426450.
Even though the check added is not that taxing, it's not really needed.
First of all this will be per packet cost and second thing is that the
eth_type_trans() already does this correctly. The excessive scrubbing
in IPvlan was changing the pkt-type skb metadata of the packet which
made it necessary to re-check the mac. The subsequent patch in this
series removes the faulty packet-scrub.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2017-12-15
1) Currently we can add or update socket policies, but
not clear them. Support clearing of socket policies
too. From Lorenzo Colitti.
2) Add documentation for the xfrm device offload api.
From Shannon Nelson.
3) Fix IPsec extended sequence numbers (ESN) for
IPsec offloading. From Yossef Efraim.
4) xfrm_dev_state_add function returns success even for
unsupported options, fix this to fail in such cases.
From Yossef Efraim.
5) Remove a redundant xfrm_state assignment.
From Aviv Heller.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the compatible string and Device Tree binding document for
7278B0.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Salil Mehta says:
====================
Hisilicon Network Subsystem 3 VF Ethernet Driver
This patch-set contains the support of the HNS3 (Hisilicon Network Subsystem 3)
Virtual Function Ethernet driver for hip08 family of SoCs. The Physical Function
driver is already part of the Linux mainline.
This VF driver has its Hardware Compatibility Layer and has commom/unified ENET
layer/client/ethtool code with the PF driver. It also has support of mailbox to
communicate with the HNS3 PF driver. The basic architecture of VF driver is
derivative of the PF driver. Just like PF driver, this driver is also PCI
Express based.
This driver is the ongoing development work and HNS3 VF Ethernet driver would be
incrementally enhanced with more new features.
High Level Architecture:
[ Ethtool ]
|
[ Ethernet Client ] ... [ RoCE Client ]
| |
[ HNAE Device ] |________
| | |
--------------------------------------------- |
|
[ HNAE3 Framework (Register/unregister) ] |
|
--------------------------------------------- |
| |
[ VF HCLGE Layer ] |
| | |
| | |
| | |
| [ VF Mailbox (To PF via IMP) ] |
| | |
[ IMP command Interface ] [ IMP command Interface ]
| |
| |
(A B O V E R U N S O N G U E S T S Y S T E M)
-------------------------------------------------------------
Q E M U / V F I O / K V M (on Host System)
-------------------------------------------------------------
HIP08 H A R D W A R E (limited to VF by SMMU)
[ IMP/Mgmt Processor (hardware common to system/cmd based) ]
Fig 1. HNS3 Virtual Function Driver
[ dcbnl ] [ Ethtool ]
| |
[ Ethernet Client ] [ ODP/UIO Client ] . . .[ RoCE Client ]
|_____________________| |
| _________|
[ HNAE Device ] | |
| | |
--------------------------------------------- |
|
[ HNAE3 Framework (Register/unregister) ] |
|
--------------------------------------------- |
| |
[ HCLGE Layer ] |
________________|_________________ |
| | | |
[ DCB ] | | |
| | | |
[ Scheduler/Shaper ] [ MDIO ] [ PF Mailbox ] |
| | | |
|________________|_________________| |
| |
[ IMP command Interface ] [ IMP command Interface ]
----------------------------------------------------------------
HIP08 H A R D W A R E
[ IMP/Mgmt Processor (hardware common to system/cmd based) ]
Fig 2. Existing HNS3 PF Driver (added with mailbox)
Change Log Summary:
Patch V4: Addressed SPDX related comment by Philippe Ombredanne
Patch V3: Addressed SPDX change requested by Philippe Ombredanne
Patch V2: 1. Addressed some comments by David Miller.
2. Addressed some internal comments on various patches
Patch V1: Initial Submit
====================
Acked-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All PF mailbox events are conveyed through a common interrupt
(vector 0). This interrupt vector is shared by reset and mailbox.
This patch adds the handling of mailbox interrupt event and its
deferred processing in context to a separate mailbox task.
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is required to support ring-vector binding and reset
of TQPs requested by the VF driver to the PF driver. Mailbox
handler is added with corresponding VF commands/messages to
handle the request.
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Command queue provides the provision of Mailbox command which
can be used for communication between PF and VF. PF handles
messages from various VFs for fetching various information like,
queue, vlan, link status related etc. It also handles the request
from various VFs to perform certain privileged operations.
This patch adds the support of a message handler for handling
such various command requests from VF.
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of the NAPI handling interface, skb buffer management,
management of the RX/TX descriptors, ethool interface etc.
has quite a bit of code which is common to VF and PF driver.
This patch makes the exisitng PF's HNS3 ENET driver as the
common ENET driver for both Virtual & Physical Function. This
will help in reduction of redundancy and better management of
code.
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces the new Makefiles and updates existing
Makefiles required to build the HNS3 Virtual Function driver.
This also updates the Kconfig for introduction of new menuconfig
entries related to VF driver.
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the support of hardware compatibiltiy layer to the
HNS3 VF Driver. This layer implements various {set|get} operations
over MAC address for a virtual port, RSS related configuration,
fetches the link status info from PF, does various VLAN related
configuration over the virtual port, queries the statistics from
the hardware etc.
This layer can directly interact with hardware through the
IMP(Integrated Mangement Processor) interface or can use mailbox
to interact with the PF driver.
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the support of the mailbox to the VF driver. The
mailbox shall be used as an interface to communicate with the
PF driver for various purposes like {set|get} MAC related
operations, reset, link status etc. The mailbox supports both
synchronous and asynchronous command send to PF driver.
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support of command interface for communication with
the IMP(Integrated Management Processor) for HNS3 Virtual Function
Driver.
Each VF has support of CQP(Command Queue Pair) ring interface.
Each CQP consis of send queue CSQ and receive queue CRQ.
There are various commands a VF may support, like to query frimware
version, TQP management, statistics, interrupt related, mailbox etc.
This also contains code to initialize the command queue, manage the
command queue descriptors and Rx/Tx protocol with the command processor
in the form of various commands/results and acknowledgements.
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: lipeng <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Wang says:
====================
add VLAN support to DSA MT7530
Changes sicne v2:
update to the latest code base from net-next and fix up all building
errors with -Werror.
Changes since v1:
- fix up the typo
- prefer ordering declarations longest to shortest
- update that vlan_prepare callback should not change any state
- use lower case letter for function naming
The patchset extends DSA MT7530 to VLAN support through filling required
callbacks in patch 1 and merging the special tag with VLAN tag in patch 2
for allowing that the hardware can handle these packets with VID from the
CPU port.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
I work for MediaTek and maintain SoC targeting to home gateway and
also will keep extending and testing the function from MediaTek
switch.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to let MT7530 switch can recognize well those egress packets
having both special tag and VLAN tag, the information about the special
tag should be carried on the existing VLAN tag. On the other hand, it's
unnecessary for extra handling for ingress packets when VLAN tag is
present since it is able to put the VLAN tag after the special tag and
then follow the existing way to parse.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
MT7530 can treat each port as either VLAN-unaware port or VLAN-aware port
through the implementation of port matrix mode or port security mode on
the ingress port, respectively. On one hand, Each port has been acting as
the VLAN-unaware one whenever the device is created in the initial or
certain port joins or leaves into/from the bridge at the runtime. On the
other hand, the patch just filling the required callbacks for VLAN
operations is achieved via extending the port to be into port security
mode when the port is configured as VLAN-aware port. Which mode can make
the port be able to recognize VID from incoming packets and look up VLAN
table to validate and judge which port it should be going to. And the
range for VID from 1 to 4094 is valid for the hardware.
Signed-off-by: Sean Wang <sean.wang@mediatek.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify lan9303_indirect_phy_wait_for_completion()
and lan9303_switch_wait_for_completion() by using a new function
lan9303_read_wait()
Changes v1 -> v2:
- param 'mask' type u32
- removed param 'value' (will probably never be used)
- add newline before return
Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Hemminger says:
====================
hv_netvsc: minor changes
This includes minor cleanup of code in send and receive path and
also a new statistic to check for allocation failures. This also
eliminates some of the extra RCU when not needed.
There is a theoritical bug where buffered data could be blocked
for longer than necessary if the ring buffer got full. This
has not been seen in the wild, found by inspection.
The reference count between net device and internal RNDIS
is not needed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If the transmit queue is known full, then don't keep aggregating
data. And the cp_partial flag which indicates that the current
aggregation buffer is full can be folded in to avoid more
conditionals.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is only ever a single instance of network device object
referencing the internal rndis object. Therefore the open_cnt atomic
is not necessary.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netvsc_receive_callback function was using RCU to find the
appropriate underlying netvsc_device. Since calling function already
had that pointer, this was unnecessary.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The caller (netvsc_receive) already has the net device pointer,
and should just pass that to functions rather than the hyperv device.
This eliminates several impossible error paths in the process.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When skb can not be allocated, update ethtool statisitics
rather than rx_dropped which is intended for netif_receive.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since only caller does not care about return value.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli says:
====================
PHYLINK preparatory patches for DSA
In preparation for having DSA migrate to PHYLINK, I had to come up with a
number of preparatory patches:
- we need to be able to pass phy_flags from an external component calling
phylink_of_phy_connect()
- DSA tries to connect through OF first, then fallsback using its own internal
MDIO bus, in that case we would both show an error, but also not know what
the correct phy_interface_t would be, instead use the PHY device/driver provided
one
- Finally bcm_sf2 makes use of all possible PHYs out there: internal, external,
fixed, and MoCA, the latter requires a bit of help to signal link notifications
through a MMIO interrupt, as well a report a correct PORT type
Changes in v2:
- rebased against latest net-next/master
- added kernel doc documentation
- dropped error message in phylink_of_phy_connect() as suggested by Russell
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly to what PHYLIB already does, make sure that
PHY_INTERFACE_MODE_MOCA is reported as PORT_BNC.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
phylink_get_fixed_state() currently consults an optional "link_gpio"
GPIO descriptor, expand this mechanism to allow specifying a custom
callback. This is necessary to support out of band link notifcation
(e.g: from an interrupt within a MMIO register).
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some subsystems like DSA may be trying to connect to a PHY through OF first,
and then attempt a connect using a local MDIO bus, remove the error message:
"unable to find PHY node" so we can let MAC drivers whether to print it or not.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
We may not always be able to resolve a correct phy_interface_t value before
actually connecting to the PHY device, when that happens, just have
phylink_connect_phy() utilize what the PHY device/driver provided.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to let subsystems like DSA fully utilize PHYLINK, we need to be able
to communicate phy_device::flags from of_phy_{connect,attach} even when using
PHYLINK APIs.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior to this patch, active Fast Open is paused on a specific
destination IP address if the previous connections to the
IP address have experienced recurring timeouts . But recent
experiments by Microsoft (https://goo.gl/cykmn7) and Mozilla
browsers indicate the isssue is often caused by broken middle-boxes
sitting close to the client. Therefore it is much better user
experience if Fast Open is disabled out-right globally to avoid
experiencing further timeouts on connections toward other
destinations.
This patch changes the destination-IP disablement to global
disablement if a connection experiencing recurring timeouts
or aborts due to timeout. Repeated incidents would still
exponentially increase the pause time, starting from an hour.
This is extremely conservative but an unfortunate compromise to
minimize bad experience due to broken middle-boxes.
Reported-by: Dragana Damjanovic <ddamjanovic@mozilla.com>
Reported-by: Patrick McManus <mcmanus@ducksong.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's not correct to return NULL when that is actually an error and
function returns errors in any other wrong case. In the same time,
the cpsw driver and davinci emac doesn't check error case while
creating channel and it can miss actual error. Also remove WARNs
replacing them on dev_err msgs.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds support for ethtool get_module_info() and get_module_eeprom()
callbacks that will dump necessary information for a SFP.
Signed-off-by: Arjun Vynipadath <arjun@chelsio.com>
Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_warn_bad_offload warns when packets enter the GSO stack that
require skb_checksum_help or vice versa. Do not warn on arbitrary
bad packets. Packet sockets can craft many. Syzkaller was able to
demonstrate another one with eth_type games.
In particular, suppress the warning when segmentation returns an
error, which is for reasons other than checksum offload.
See also commit 36c9247449 ("net: WARN if skb_checksum_help() is
called on skb requiring segmentation") for context on this warning.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 3a9b76fd0d ("tcp: allow drivers to tweak TSQ logic")
I gave a code sample to set sk->sk_pacing_shift that was not complete.
Better add a helper that can be used by drivers without worries,
and maybe amended in the future.
A wifi driver might use it from its ndo_start_xmit()
Following call would setup TCP to allow up to ~8ms of queued data per
flow.
sk_pacing_shift_update(skb->sk, 7);
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch the bridge used a fixed 256 element hash table which
was fine for small use cases (in my tests it starts to degrade
above 1000 entries), but it wasn't enough for medium or large
scale deployments. Modern setups have thousands of participants in a
single bridge, even only enabling vlans and adding a few thousand vlan
entries will cause a few thousand fdbs to be automatically inserted per
participating port. So we need to scale the fdb table considerably to
cope with modern workloads, and this patch converts it to use a
rhashtable for its operations thus improving the bridge scalability.
Tests show the following results (10 runs each), at up to 1000 entries
rhashtable is ~3% slower, at 2000 rhashtable is 30% faster, at 3000 it
is 2 times faster and at 30000 it is 50 times faster.
Obviously this happens because of the properties of the two constructs
and is expected, rhashtable keeps pretty much a constant time even with
10000000 entries (tested), while the fixed hash table struggles
considerably even above 10000.
As a side effect this also reduces the net_bridge struct size from 3248
bytes to 1344 bytes. Also note that the key struct is 8 bytes.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove XGMII as an option for the 88x3310 PHY driver, as the PHY doesn't
support XGMII's 32-bit data lanes. It supports USXGMII, which is not
XGMII, but a single-lane serdes interface - see
https://developer.cisco.com/site/usgmii-usxgmii/
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit says:
====================
r8169: extend PCI core and switch to device-managed functions in probe
Probe error path and remove callback can be significantly simplified
by using device-managed functions. To be able to do this in the r8169
driver we need a device-managed version of pci_set_mwi first.
v2:
Change patch 1 based on Björn's review comments and add his Acked-by.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_napi_del is called implicitely by free_netdev, therefore we
don't have to do it explicitely.
When the probe error path is reached, the net_device isn't
registered yet. Therefore reordering the call to netif_napi_del
shouldn't cause any issues.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simplify probe error path and remove callback by using device-managed
functions.
rtl_disable_msi isn't needed any longer because the release callback
of pcim_enable_device does this implicitely.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add pcim_set_mwi(), a device-managed version of pci_set_mwi().
First user is the Realtek r8169 driver.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
First, rename __inet_twsk_hashdance() to inet_twsk_hashdance()
Then, remove one inet_twsk_put() by setting tw_refcnt to 3 instead
of 4, but adding a fat warning that we do not have the right to access
tw anymore after inet_twsk_hashdance()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subash Abhinov Kasiviswanathan says:
====================
net: qualcomm: rmnet: Configuration options
This series adds support for configuring features on rmnet devices.
The rmnet specific features to be configured here are aggregation and
control commands.
Patch 1 is a cleanup of return codes in the transmit path.
Patch 2 removes some redundant ingress and egress macros.
Patch 3 restricts the creation of rmnet dev to one dev per mux id for a
given real dev.
Patch 4 adds ethernet data path support.
Patches 5-6 add support for configuring features on new and existing
rmnet devices.
v1->v2:
The memory leak fixed as part of patch 1 is merged seperately as
a896d94abd2c ("net: qualcomm: rmnet: Fix leak on transmit failure").
Fix a use after free in patch 4 if a packet with headroom lesser than ethernet
header length is received.
v2->v3:
Fix formatting problem in patch 5 in the return statement.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an option to configure the mux id, aggregation and commad feature
for existing rmnet devices. Implement the changelink netlink
operation for this.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an option to configure the rmnet aggregation and command features
on device creation. This is achieved by using the vlan flags option.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support to send and receive packets over ethernet.
An example of usage is testing the data path on UML. This can be
achieved by setting up two UML instances in multicast mode and
associating rmnet over the UML ethernet device.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Upon de-multiplexing data from one real dev, the packets can be sent
to an unique rmnet device for a given mux id.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multiplexing is always enabled when transmiting from a rmnet device,
so remove the redundant egress macros. De-multiplexing is always
enabled when receiving packets from a rmnet device, so remove those
ingress macros.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only the success and consumed entries were actually in use.
Use standard error codes instead.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch enables tail loss probe in cwnd reduction (CWR) state
to detect potential losses. Prior to this patch, since the sender
uses PRR to determine the cwnd in CWR state, the combination of
CWR+PRR plus tcp_tso_should_defer() could cause unnecessary stalls
upon losses: PRR makes cwnd so gentle that tcp_tso_should_defer()
defers sending wait for more ACKs. The ACKs may not come due to
packet losses.
Disallowing TLP when there is unused cwnd had the primary effect
of disallowing TLP when there is TSO deferral, Nagle deferral,
or we hit the rwin limit. Because basically every application
write() or incoming ACK will cause us to run tcp_write_xmit()
to see if we can send more, and then if we sent something we call
tcp_schedule_loss_probe() to see if we should schedule a TLP. At
that point, there are a few common reasons why some cwnd budget
could still be unused:
(a) rwin limit
(b) nagle check
(c) TSO deferral
(d) TSQ
For (d), after the next packet tx completion the TSQ mechanism
will allow us to send more packets, so we don't really need a
TLP (in practice it shouldn't matter whether we schedule one
or not). But for (a), (b), (c) the sender won't send any more
packets until it gets another ACK. But if the whole flight was
lost, or all the ACKs were lost, then we won't get any more ACKs,
and ideally we should schedule and send a TLP to get more feedback.
In particular for a long time we have wanted some kind of timer for
TSO deferral, and at least this would give us some kind of timer
Reported-by: Steve Ibanez <sibanez@stanford.edu>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Nandita Dukkipati <nanditad@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>